Suppose I have the following HTML:

<p>
If everybody minded their own business, the world would go around a great deal faster than it does.
</p>

<p>
Who in the world am I? Ah, that's the great puzzle.
</p>

I want to find all the tags that contain every keyword I'm searching for. For example (examples 2 and 3 don't work):

>>> len(soup.find_all(text="world"))
2

>>> len(soup.find_all(text="world puzzle"))
1

>>> len(soup.find_all(text="world puzzle book"))
0

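I've been trying to come up with a regular expression that lets me search for all the keywords, but it seems that ANDing isn't possible (only ORing).

Thanks in advance!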

4 Answers

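The simplest way to do this kind of complex matching is to write a function that performs the match, and pass that function in as the value of the text argument.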

def must_contain_all(*strings):
    # Build a filter that accepts a string only if it contains every keyword.
    def must_contain(markup):
        return markup is not None and all(s in markup for s in strings)
    return must_contain

Now you can get the matching strings:

print(soup.find_all(text=must_contain_all("world", "puzzle")))
# ["\nWho in the world am I? Ah, that's the great puzzle.\n"]

To get the tags that contain the strings, use the .parent attribute:

print([text.parent for text in soup.find_all(text=must_contain_all("world", "puzzle"))])
# [<p>Who in the world am I? Ah, that's the great puzzle.</p>]
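
On the regex part of the question: ANDing can be expressed with lookaheads, one per keyword. The following is only a minimal sketch. The helper name all_keywords_pattern is hypothetical, and plain strings stand in for the text nodes so that it runs without any parser installed:

```python
import re

def all_keywords_pattern(*keywords):
    # Hypothetical helper: one lookahead per keyword, all anchored at the start,
    # so the pattern matches only if every keyword occurs somewhere in the string.
    lookaheads = "".join("(?=.*{})".format(re.escape(k)) for k in keywords)
    # DOTALL lets "." cross the newlines that surround the paragraph text.
    return re.compile("^" + lookaheads, re.DOTALL)

# Stand-ins for the two text nodes from the question's HTML.
texts = [
    "\nIf everybody minded their own business, the world would go around a great deal faster than it does.\n",
    "\nWho in the world am I? Ah, that's the great puzzle.\n",
]

pattern = all_keywords_pattern("world", "puzzle")
matches = [t for t in texts if pattern.search(t)]
print(len(matches))
```

Beautiful Soup applies a compiled regex filter with its search() method, so a pattern built this way should also work as soup.find_all(text=pattern).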
answered 2012-07-26T23:28:48.840

You might want to consider using lxml instead of BeautifulSoup. lxml lets you find elements via XPath:

With this boilerplate setup:

import lxml.html as LH
import re

html = """
<p>
If everybody minded their own business, the world would go around a great deal faster than it does.
</p>

<p>
Who in the world am I? Ah, that's the great puzzle.
</p>
"""

doc = LH.fromstring(html)

This finds the text in all <p> tags that contain the string world:

print(doc.xpath('//p[contains(text(),"world")]/text()'))
['\nIf everybody minded their own business, the world would go around a great deal faster than it does.\n', "\nWho in the world am I? Ah, that's the great puzzle.\n"]

This finds all the text in <p> tags that contain both world and puzzle:

print(doc.xpath('//p[contains(text(),"world") and contains(text(),"puzzle")]/text()'))
["\nWho in the world am I? Ah, that's the great puzzle.\n"]
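
For an arbitrary list of keywords, the XPath expression can be generated rather than written by hand. This is a rough sketch under a couple of assumptions: xpath_for_keywords is a hypothetical helper, and it assumes no keyword contains a double quote. Like the expressions above, contains(text(), ...) only inspects the first text node of each element.

```python
def xpath_for_keywords(keywords, tag="p"):
    # Hypothetical helper: AND together one contains() test per keyword.
    if not keywords:
        raise ValueError("at least one keyword is required")
    # Assumption: keywords contain no double quotes, so no quoting is attempted.
    conditions = " and ".join(
        'contains(text(),"{}")'.format(k) for k in keywords
    )
    return "//{}[{}]/text()".format(tag, conditions)

expr = xpath_for_keywords(["world", "puzzle"])
print(expr)
```

The resulting string can be passed to doc.xpath(expr) using the doc from the setup above.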
answered 2012-07-27T08:51:13.783

This may not be the most efficient way, but you could try set intersection:

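import re
# text="world" only matches strings equal to "world", so use regexes for substring matches
len(set(soup.find_all(text=re.compile("world")))
    & set(soup.find_all(text=re.compile("book")))
    & set(soup.find_all(text=re.compile("puzzle"))))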
answered 2012-07-26T21:41:20.107

A bit of a skeleton (I'm using lxml rather than BeautifulSoup, but you can adapt it using soup.findAll):

html = """
<p>
If everybody minded their own business, the world would go around a great deal faster than it does.
</p>

<p>
Who in the world am I? Ah, that's the great puzzle.
</p>
"""

import lxml.html
import re

fragment = lxml.html.fromstring(html)
d = dict(
    (node, set(re.findall(r'\S+', node.text_content())))
    for node in fragment.xpath('//p'))

for node, it in d.items():
    # then use set logic to go from here...
    pass
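
One way the set logic might continue, sketched with plain strings standing in for the lxml nodes so it runs on its own. The names word_sets and containing_all are hypothetical. Note that \S+ keeps trailing punctuation (the token would be "puzzle." rather than "puzzle"), so this sketch assumes \w+ and lowercasing are acceptable instead:

```python
import re

# Stand-ins for node.text_content() of the two <p> elements.
paragraphs = [
    "If everybody minded their own business, the world would go around a great deal faster than it does.",
    "Who in the world am I? Ah, that's the great puzzle.",
]

def word_sets(texts):
    # Map each text to the set of lowercase words it contains.
    return {text: set(re.findall(r"\w+", text.lower())) for text in texts}

def containing_all(index, keywords):
    # Keep the entries whose word set is a superset of the wanted keywords.
    wanted = {k.lower() for k in keywords}
    return [key for key, words in index.items() if wanted <= words]

index = word_sets(paragraphs)
print(len(containing_all(index, ["world"])))
print(len(containing_all(index, ["world", "puzzle"])))
print(len(containing_all(index, ["world", "puzzle", "book"])))
```

Unlike the substring checks in the other answers, this matches whole words only, so "world" would not match "worldwide".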
answered 2012-07-26T21:50:09.963