python - How to retrieve HTML tags based on a regular expression

Question

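I'm trying to extract every HTML tag that contains a match for a regular expression. For example, suppose I want to get every tag that includes the string "name", and I have an HTML document like this: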

<html>
  <head>
    <title>This tag includes 'name', so it should be retrieved</title>
  </head>
  <body>
    <h1 class="name">This is also a tag to be retrieved</h1>
    <h2>Generic h2 tag</h2>
  </body>
</html>

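I could probably use a regular expression to capture every match between an opening and a closing "<>", but I want to be able to traverse the parsed tree starting from those matches, so that I can get siblings, parents, or "nextElements". In the example above, that amounts to getting <head>*</head> or perhaps <h2>*</h2> once I know they are parents or siblings of the tags that contain the match.

I tried BeautifulSoup, but it seems to me that it is useful when you already know what kind of tag you are looking for, or when you search by content. In this case, I want to get a match first, take that match as a starting point, and then navigate the tree the way BeautifulSoup and other HTML parsers allow.

Suggestions?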

score 2 · Accepted Answer

Use lxml.html. It's a great parser, and it supports XPath, which can express this kind of query easily.

The example below uses this XPath expression:

//*[contains(text(),'name')]/parent::*/following-sibling::*[1]/*[@class='name']/text()

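In plain English, this means:

Find any tag that contains the word 'name' in its text, then get its parent, then the next sibling, and inside that find any tag with the class 'name', and finally return its text content.

The result of running the code is: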

['This is also a tag to be retrieved']

Here is the full code:

text = """
<html>
  <head>
    <title>This tag includes 'name', so it should be retrieved</title>
  </head>
  <body>
    <h1 class="name">This is also a tag to be retrieved</h1>
    <h2>Generic h2 tag</h2>
  </body>
</html>
"""

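import lxml.html

# Parse the HTML string into an element tree.
doc = lxml.html.fromstring(text)

# $stuff is an XPath variable; lxml fills it in from the keyword argument.
print(doc.xpath('//*[contains(text(), $stuff)]/parent::*/'
    'following-sibling::*[1]/*[@class=$stuff]/text()', stuff='name'))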
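
The XPath above uses contains(), which is a plain substring test. If an actual regular expression is needed, one possible approach is to do the matching in Python and use lxml only for parsing and navigation. The sketch below is an illustration under assumptions, not part of the original answer: the helper name find_matching and the constant HTML are hypothetical, and it only checks the text before an element's first child (el.text) plus its attribute values.

```python
import re

import lxml.html

HTML = """
<html>
  <head>
    <title>This tag includes 'name', so it should be retrieved</title>
  </head>
  <body>
    <h1 class="name">This is also a tag to be retrieved</h1>
    <h2>Generic h2 tag</h2>
  </body>
</html>
"""


def find_matching(doc, pattern):
    """Return elements whose leading text or any attribute value matches pattern.

    Hypothetical helper for illustration; it ignores text that follows child
    elements (the .tail of each child).
    """
    hits = []
    for el in doc.iter():
        # Comments and processing instructions have a non-string tag; skip them.
        if not isinstance(el.tag, str):
            continue
        own_text = el.text or ""
        if pattern.search(own_text) or any(
            pattern.search(value) for value in el.attrib.values()
        ):
            hits.append(el)
    return hits


doc = lxml.html.fromstring(HTML)
hits = find_matching(doc, re.compile(r"name"))

for el in hits:
    # Each hit is a normal lxml element, so the tree can be walked from it.
    parent = el.getparent()
    nxt = el.getnext()
    print(
        el.tag,
        parent.tag if parent is not None else None,
        nxt.tag if nxt is not None else None,
    )
```

Each match is an ordinary element, so getparent(), getnext(), and getprevious() give the parent and sibling navigation the question asks for.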

Obligatory reading, the "please don't parse HTML with regex" answer is here: https://stackoverflow.com/a/1732454/17160

score 1 · Accepted Answer

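Given the following conditions, either of which makes a tag count as a match:

the match occurs in the value of an attribute on the tag
the match occurs in a text node that is a direct child of the tag

you can use Beautiful Soup: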

from bs4 import BeautifulSoup
from bs4 import NavigableString
import re

html = '''<html>
  <head>
    <title>This tag includes 'name', so it should be retrieved</title>
  </head>
  <body>
    <h1 class="name">This is also a tag to be retrieved</h1>
    <h2>Generic h2 tag</h2>
  </body>
</html>'''

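# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(html, "html.parser")
p = re.compile("name")

def match(patt):
    """Build a filter that accepts tags whose direct text or attribute values match patt."""
    def closure(tag):
        # Check text nodes that are direct children of the tag.
        for c in tag.contents:
            if isinstance(c, NavigableString) and patt.search(str(c)):
                return True
        # Check attribute values; multi-valued attributes such as class are lists.
        for v in tag.attrs.values():
            values = v if isinstance(v, list) else [v]
            if any(patt.search(item) for item in values):
                return True
        return False
    return closure

for t in soup.find_all(match(p)):
    print(t)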

Output:

<title>This tag includes 'name', so it should be retrieved</title>
<h1 class="name">This is also a tag to be retrieved</h1>
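
Since the question asks about moving to parents and siblings once a match is found, here is a minimal sketch of that step with Beautiful Soup. It is an illustration rather than part of the original answer: it assumes a bs4 version that accepts the string= argument, uses the built-in html.parser, and the names HTML, text_hits, and class_hits are made up for the example.

```python
import re

from bs4 import BeautifulSoup

HTML = """<html>
  <head>
    <title>This tag includes 'name', so it should be retrieved</title>
  </head>
  <body>
    <h1 class="name">This is also a tag to be retrieved</h1>
    <h2>Generic h2 tag</h2>
  </body>
</html>"""

soup = BeautifulSoup(HTML, "html.parser")
pattern = re.compile(r"name")

# Text nodes matching the pattern; .parent is the tag that encloses each one.
text_hits = [node.parent for node in soup.find_all(string=pattern)]

# Tags whose class attribute matches the pattern.
class_hits = soup.find_all(class_=pattern)

for tag in text_hits + class_hits:
    # From any matched tag, walk to its parent and its next sibling tag.
    parent = tag.parent
    sibling = tag.find_next_sibling()
    print(
        tag.name,
        parent.name if parent is not None else None,
        sibling.name if sibling is not None else None,
    )
```

Both find_all calls accept a compiled regular expression, so the match itself serves as the starting point and the usual .parent, .find_next_sibling(), and .find_previous_sibling() navigation applies from there.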
