You will need to use a hybrid approach, because the text= filter fails when an element contains child elements as well as text.
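import re
from bs4 import BeautifulSoup
bs = BeautifulSoup("<html><a>sometext</a></html>", "html.parser")
reg = re.compile(r'some')
# Filter on e.text (all descendant text) instead of passing text= to find_all.
elements = [e for e in bs.find_all('a') if reg.match(e.text)]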
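The same idea can be wrapped in a small helper. This is only a sketch: `find_by_text` is a hypothetical name, not part of BeautifulSoup, and it assumes you want the pattern to match anywhere in the tag's combined text, so it uses `search()` rather than `match()`.

```python
import re
from bs4 import BeautifulSoup


def find_by_text(soup, name, pattern):
    """Return every `name` tag whose combined text matches `pattern`.

    Hypothetical helper: it filters on get_text(), which joins the text of
    all descendants, so sibling child tags such as <img> do not matter.
    """
    return [tag for tag in soup.find_all(name) if pattern.search(tag.get_text())]


plain = BeautifulSoup("<html><a>sometext</a></html>", "html.parser")
mixed = BeautifulSoup("<html><a>sometext<img /></a></html>", "html.parser")
pattern = re.compile(r"some")

plain_hits = find_by_text(plain, "a", pattern)
mixed_hits = find_by_text(mixed, "a", pattern)
```

Both calls should return the single `<a>` element, including the one that also contains an `<img>`.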
Background
When BeautifulSoup is searching for an element and text is a callable, it eventually ends up calling:
self._matches(found.string, self.text)
In the two examples you gave, the .string property returns different things:
>>> bs1 = BeautifulSoup("<html><a>sometext</a></html>", "html.parser")
>>> bs1.find('a').string
'sometext'
>>> bs2 = BeautifulSoup("<html><a>sometext<img /></a></html>", "html.parser")
>>> bs2.find('a').string
>>> print(bs2.find('a').string)
None
The .string property looks like this:
@property
def string(self):
    """Convenience property to get the single string within this tag.

    :Return: If this tag has a single string child, return value
     is that string. If this tag has no children, or more than one
     child, return value is None. If this tag has one child tag,
     return value is the 'string' attribute of the child tag,
     recursively.
    """
    if len(self.contents) != 1:
        return None
    child = self.contents[0]
    if isinstance(child, NavigableString):
        return child
    return child.string
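To make the three branches of that property concrete, here is a small sketch. The HTML snippets and variable names are made up for illustration, and the explicit "html.parser" argument is my choice rather than anything the property requires.

```python
from bs4 import BeautifulSoup

nested = BeautifulSoup("<a><b>sometext</b></a>", "html.parser")
mixed = BeautifulSoup("<a>sometext<img /></a>", "html.parser")
empty = BeautifulSoup("<a></a>", "html.parser")

# One child that is itself a tag: .string recurses into <b>.
nested_string = nested.find("a").string

# Two children (a string and an <img> tag): len(contents) != 1, so None.
mixed_string = mixed.find("a").string

# No children at all: len(contents) != 1 again, so None.
empty_string = empty.find("a").string
```

Only the nested case yields a string, because each level has exactly one child.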
If we print out the contents, we can see why it returns None:
>>> print(bs1.find('a').contents)
['sometext']
>>> print(bs2.find('a').contents)
['sometext', <img/>]
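Putting this together, the built-in filter should find the first link but not the second. The sketch below uses the `string=` keyword, which is the current name for the older `text=` argument. The variable names are illustrative, and the empty result for the second document is what I would expect from the `.string` behaviour shown above rather than something guaranteed for every bs4 release.

```python
import re
from bs4 import BeautifulSoup

pattern = re.compile(r"some")

plain = BeautifulSoup("<html><a>sometext</a></html>", "html.parser")
mixed = BeautifulSoup("<html><a>sometext<img /></a></html>", "html.parser")

# The filter is compared against tag.string, which is 'sometext' here.
plain_filtered = plain.find_all("a", string=pattern)

# Here tag.string is None (two children), so the filter has nothing to match.
mixed_filtered = mixed.find_all("a", string=pattern)
```

This is why the filtering has to be done by hand on `e.text` when a tag can contain both text and child tags.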