python - Using BeautifulSoup to find an HTML tag that contains certain text

Question

I'm trying to get the elements in an HTML document that contain the following text pattern: #\S{11}

<h2> this is cool #12345678901 </h2>

So the element above would be matched by using:

soup('h2',text=re.compile(r' #\S{11}'))

And the results would be something like:

[u'blahblah #223409823523', u'thisisinteresting #293845023984']

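I'm able to get all the text that matches (see the line above). But I want the parent element of the matched text, so I can use it as a starting point for traversing the document tree. In this case, I'd want all the h2 elements returned, not the text matches.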

Any ideas?

score 84 · Accepted Answer

# Python 2 with BeautifulSoup 3 (the legacy `BeautifulSoup` package)
from BeautifulSoup import BeautifulSoup
import re

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

soup = BeautifulSoup(html_text)


for elem in soup(text=re.compile(r' #\S{11}')):
    print elem.parent

Prints:

<h2>this is cool #12345678901</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
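For anyone on Python 3, here is a rough port of the same idea to bs4. It is only a sketch: it assumes the `bs4` package (4.4.0 or later, for the `string=` argument) and the stdlib `html.parser` backend, and the names `parents` and `h2_parents` are mine. Because the search is on text alone, the `<h1>` also matches the pattern, so the sketch filters on the parent's tag name afterwards.

```python
import re
from bs4 import BeautifulSoup

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

pattern = re.compile(r" #\S{11}")
soup = BeautifulSoup(html_text, "html.parser")

# With no tag name, find_all(string=...) yields the matching text nodes
# (NavigableString objects); .parent is the enclosing Tag.
parents = [node.parent for node in soup.find_all(string=pattern)]

# The regex alone also matches the <h1>, so keep only <h2> parents.
h2_parents = [tag for tag in parents if tag.name == "h2"]

for tag in h2_parents:
    print(tag)
```

Each item in `h2_parents` is a regular Tag, so it can serve as the starting point for further navigation.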

score 21 · Accepted Answer

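BeautifulSoup search operations deliver [a list of] BeautifulSoup.NavigableString objects when text= is used as a criterion, as opposed to BeautifulSoup.Tag objects in other cases. Check the object's __dict__ to see the attributes made available to you. Of these attributes, parent is favored over previous because of changes in BS4.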

from BeautifulSoup import BeautifulSoup
from pprint import pprint
import re

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

soup = BeautifulSoup(html_text)

# Even though the OP was not looking for 'cool', it's more understandable to work with item zero.
pattern = re.compile(r'cool')

pprint(soup.find(text=pattern).__dict__)
#>> {'next': u'\n',
#>>  'nextSibling': None,
#>>  'parent': <h2>this is cool #12345678901</h2>,
#>>  'previous': <h2>this is cool #12345678901</h2>,
#>>  'previousSibling': None}

print soup.find('h2')
#>> <h2>this is cool #12345678901</h2>
print soup.find('h2', text=pattern)
#>> this is cool #12345678901
print soup.find('h2', text=pattern).parent
#>> <h2>this is cool #12345678901</h2>
print soup.find('h2', text=pattern) == soup.find('h2')
#>> False
print soup.find('h2', text=pattern) == soup.find('h2').text
#>> True
print soup.find('h2', text=pattern).parent == soup.find('h2')
#>> True
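The same distinction can be checked under bs4 with `isinstance` instead of reading `__dict__`. This is a minimal sketch, assuming bs4 4.4.0+ and the `html.parser` backend; `text_node` and `tag` are names I picked. Note that in bs4 the result depends on whether a tag name is passed along with `string=`.

```python
import re
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

soup = BeautifulSoup("<h2>this is cool #12345678901</h2>", "html.parser")
pattern = re.compile(r"cool")

# No tag name: the search returns the text node itself.
text_node = soup.find(string=pattern)

# Tag name plus string=: the search returns the Tag whose .string matches.
tag = soup.find("h2", string=pattern)

print(isinstance(text_node, NavigableString))
print(isinstance(tag, Tag))
# The text node's parent is the very same <h2> object.
print(text_node.parent is tag)
```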

score 4 · Accepted Answer

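With bs4 (Beautiful Soup 4), the OP's attempt works exactly as expected:

import re
from bs4 import BeautifulSoup
# Naming the parser explicitly avoids the "no parser specified" warning.
soup = BeautifulSoup("<h2> this is cool #12345678901 </h2>", "html.parser")
soup('h2',text=re.compile(r' #\S{11}'))

returns [<h2> this is cool #12345678901 </h2>].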
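Since the bs4 call hands back Tag objects, they can be used directly as the starting point for traversal, which is what the question was after. A small sketch, assuming bs4 4.4.0+ (where `string=` is the newer spelling of `text=`); the `<p>` elements and the name `pairs` are made up for illustration. One caveat: when a tag name is given, `string=` is compared against the tag's `.string`, so an `<h2>` with nested child tags would not be matched this way.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: the <p> elements are added here for illustration only.
html_text = """
<h2>this is cool #12345678901</h2>
<p>details for the first heading</p>
<h2>this is nothing</h2>
<p>details for the second heading</p>
"""

soup = BeautifulSoup(html_text, "html.parser")
pattern = re.compile(r" #\S{11}")

pairs = []
for h2 in soup.find_all("h2", string=pattern):
    # Walk sideways from the matched heading to the next <p> sibling.
    paragraph = h2.find_next_sibling("p")
    pairs.append((h2.get_text(), paragraph.get_text()))

print(pairs)
```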
