python - How can I simulate ":contains" using BeautifulSoup?

Question

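I'm working on a project that requires a bit of scraping. The project runs on Google App Engine, and we're currently using Python 2.5. Ideally we would use PyQuery, but because we are on App Engine and Python 2.5, that is not an option.

I've seen questions like this one about finding an HTML tag with certain text, but they don't quite hit the mark.

I have some HTML that looks like this: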

<div class="post">
    <div class="description">
        This post is about <a href="http://www.wikipedia.org">Wikipedia.org</a>
    </div>
</div>
<!-- More posts of similar format -->

In PyQuery, I could do something like this (as far as I know):

s = pq(html)
s(".post:contains('This post is about Wikipedia.org')")
# returns all posts containing that text

Naively, I thought I could do something like this in BeautifulSoup:

soup = BeautifulSoup(html)
soup.findAll(True, "post", text=("This post is about Google.com"))
# []

However, that yielded no results. I changed the query to use a regular expression and got a bit further, but still no luck:

soup.findAll(True, "post", text=re.compile(".*This post is about.*Google.com.*"))
# []

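It works if I leave out Google.com, but then I have to do all the filtering manually. Is there any way to simulate :contains using BeautifulSoup?

Alternatively, is there some PyQuery-like library that works on App Engine (on Python 2.5)?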

score 5 · Accepted Answer

From the BeautifulSoup documentation (emphasis mine):

"text is an argument that lets you search for NavigableString objects instead of Tags"

That is to say, your code:

soup.findAll(True, "post", text=re.compile(".*This post is about.*Google.com.*"))

is not the same as:

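regex = re.compile(r'This post is about.*Google\.com', re.DOTALL)
# search() rather than match(): post.text begins with whitespace, and DOTALL lets .* span newlines
[post for post in soup.findAll(True, 'post') if regex.search(post.text)]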

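The reason you had to remove Google.com is that the BeautifulSoup tree has one NavigableString object for "This post is about" and another for "Google.com", and they sit under different elements. The text argument is tested against each NavigableString on its own, so no single string ever contains the whole sentence.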
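To make the split visible, here is a minimal sketch that lists the text nodes under a post together with the tag each one belongs to. It assumes the newer bs4 package (with `find_all` and the `string` argument) rather than the BeautifulSoup 3 API used in the question, and it uses the question's sample HTML, where the link text is Wikipedia.org.

```python
from bs4 import BeautifulSoup  # assumption: bs4, not the BeautifulSoup 3 from the question

html = """
<div class="post">
    <div class="description">
        This post is about <a href="http://www.wikipedia.org">Wikipedia.org</a>
    </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
post = soup.find("div", class_="post")

# Every non-blank text node below the post, paired with its parent tag name.
nodes = [(s.strip(), s.parent.name) for s in post.find_all(string=True) if s.strip()]

text_nodes = [text for text, _ in nodes]
parents = [parent for _, parent in nodes]

for text, parent in nodes:
    print("%r lives under <%s>" % (text, parent))
```

The sentence is stored as two separate strings, one under the inner div and one under the link, which is why a text search for the full sentence finds nothing.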

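By the way, post.text exists but is not documented, so I wouldn't rely on it either; I wrote that code by accident! Use some other means of joining together all the text under post.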
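One possible way to join the text under each post and emulate :contains is sketched below. It again assumes bs4 rather than BeautifulSoup 3; `posts_containing` is a hypothetical helper name, and the second post in the sample HTML is invented for illustration.

```python
from bs4 import BeautifulSoup  # assumption: bs4, not the BeautifulSoup 3 from the question

html = """
<div class="post">
    <div class="description">
        This post is about <a href="http://www.wikipedia.org">Wikipedia.org</a>
    </div>
</div>
<div class="post">
    <div class="description">
        This post is about <a href="http://www.google.com">Google.com</a>
    </div>
</div>
"""


def posts_containing(soup, phrase):
    """Return tags with class "post" whose combined text contains phrase.

    Hypothetical helper, roughly mimicking .post:contains('...').
    """
    matches = []
    for post in soup.find_all(True, class_="post"):
        # Join every text node below the post into one string.
        combined = "".join(post.find_all(string=True))
        # Collapse runs of whitespace so indentation and newlines do not matter.
        normalized = " ".join(combined.split())
        if phrase in normalized:
            matches.append(post)
    return matches


soup = BeautifulSoup(html, "html.parser")
hits = posts_containing(soup, "This post is about Wikipedia.org")
all_hits = posts_containing(soup, "This post is about")
```

Here `hits` holds only the Wikipedia post, while `all_hits` holds both. In bs4 the documented `get_text()` method performs the same join, so `post.get_text()` could replace the `"".join(...)` line.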
