python - How to get the name of an element parsed by Beautiful Soup 4

Question

I have a simple HTML file that I want to convert. Depending on the type of tag, I need to modify the content:

<HTML>
<HEAD>
<TITLE>Eine einfache HTML-Datei</TITLE>
<meta name="description" content="A simple HTML page for BS4">
<meta name="author" content="Uwe Ziegenhagen">
<meta charset="UTF-8">
</HEAD>
<BODY>

<H1>Hallo Welt</H1>

<p>Ein kurzer Absatz mit ein wenig Text, der relativ nichtssagend ist.</p>

<H1>Nochmal Hallo Welt!</H1>

<p>Schon wieder ein kurzer Absatz mit ein wenig Text, der genauso nichtssagend ist wie der Absatz zuvor.</p>

</BODY>
</HTML>

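How can I walk the BS4 tree and make certain modifications depending on whether I have an "H1", a "p", or some other kind of tag? I think I need something like a switch statement to decide how each element should be handled.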

from bs4 import BeautifulSoup

with open("simple.html", "r") as htmlsource:
    html = htmlsource.read()

soup = BeautifulSoup(html)

for item in soup.body:
    print(item)

score 1 · Accepted Answer

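A BeautifulSoup Tag object has a name attribute that you can check. For example, here is a function that transforms the tree by appending the string "Done with this " plus the tag's name to every node in a post-order walk:

from bs4.element import Tag

def walk(node):
    # Only Tag objects (including the BeautifulSoup root) have children to visit.
    if isinstance(node, Tag):
        # Copy the child list first, because append() below changes the tree.
        for child in list(node.children):
            walk(child)
        node.append("Done with this " + node.name)

Note: NavigableString objects, which represent text content, and Comment objects, which represent comments, do not behave like tags (for example, they have no children to iterate over). If you walk the entire tree as above, you therefore need to check that you really have a tag. The function above does this by checking whether the node is an instance of bs4.element.Tag.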
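
Since the question asks for something like a switch statement, one possible way to build on the name attribute is a dictionary that maps tag names to handler functions. The following is only a minimal sketch: the handler functions, the HANDLERS table, the transform function, and the particular modifications they make are hypothetical and chosen purely for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


# Hypothetical handlers, one per tag name; the edits they make are arbitrary.
def handle_h1(tag):
    # Replace the heading text with an upper-case version.
    tag.string = tag.get_text().upper()


def handle_p(tag):
    # Mark paragraphs with an illustrative class attribute.
    tag["class"] = "body-text"


# Dispatch table playing the role of a switch statement.
HANDLERS = {"h1": handle_h1, "p": handle_p}


def transform(node):
    """Walk the tree and apply the handler registered for each tag name."""
    if not isinstance(node, Tag):
        return  # text nodes and comments are skipped
    handler = HANDLERS.get(node.name)
    if handler is not None:
        handler(node)
    # Iterate over a copy, since handlers may modify the tree.
    for child in list(node.children):
        transform(child)


html = "<body><H1>Hallo Welt</H1><p>Ein kurzer Absatz.</p></body>"
soup = BeautifulSoup(html, "html.parser")
transform(soup)
print(soup)
```

Tags without an entry in the table are left unchanged, so supporting another tag type only requires adding one more key. Note that html.parser lowercases tag names, which is why the keys are "h1" and "p" rather than "H1".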

score 0 · Accepted Answer

Try this code:

from bs4 import BeautifulSoup

with open("simple.html", "r") as htmlsource:
    html = htmlsource.read()

soup = BeautifulSoup(html, "html.parser")

for item in soup.body:
    print(item)

# Select every tag in the HTML page
elems = soup.find_all()
for item in elems:
    # Check whether the tag carries the specified class.
    # get() avoids a KeyError for tags that have no class attribute.
    if 'myClass' in item.get('class', []):
        print(item)
    # Otherwise, check whether the tag name equals the specified name
    elif item.name == 'p':
        print(item)
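
If only a few tag types matter, a flat loop may be enough instead of a recursive walk. The sketch below assumes that passing a list of names to find_all fits the use case, and then branches on each tag's name. The inline HTML, the rename to "h2", and the "intro" class are made-up modifications for illustration only.

```python
from bs4 import BeautifulSoup

html = """<body>
<H1>Hallo Welt</H1>
<p>Ein kurzer Absatz.</p>
<H1>Nochmal Hallo Welt!</H1>
</body>"""

soup = BeautifulSoup(html, "html.parser")

seen = []  # names of the tags visited, in document order
for tag in soup.find_all(["h1", "p"]):
    seen.append(tag.name)
    if tag.name == "h1":
        # Illustrative change: demote the heading by renaming the tag.
        tag.name = "h2"
    elif tag.name == "p":
        # Illustrative change: add a class attribute to the paragraph.
        tag["class"] = "intro"

print(seen)
print(soup)
```

find_all returns its matches as a list before the loop starts, so renaming a tag inside the loop does not disturb the iteration.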
