python - Parsing HTML with BeautifulSoup

Question

(The image is small, so here is a direct link: http://i.imgur.com/OJC0A.png)

I'm trying to extract the text of the review at the bottom. I tried this:

y = soup.find_all("div", style = "margin-left:0.5em;")
review = y[0].text

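The problem is that the collapsed div tags contain unwanted text, and stripping it out of the review content gets tedious. For the life of me, I can't figure this out. Can someone help?

Edit: the HTML is: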

<div style="margin-left:0.5em;">
    <div style="margin-bottom:0.5em;"> 9 of 35 people found the following review helpful </div>
    <div style="margin-bottom:0.5em;">
    <div style="margin-bottom:0.5em;">
    <div class="tiny" style="margin-bottom:0.5em;">
        <b>
    </div>
    That is true. I tried it myself this morning. There's a little note on the Audible site that says "a few titles will require two credits" or something like that. A Dance with Dragons is one of those few.

The div tag directly above the text looks like this:

<div class="tiny" style="margin-bottom:0.5em;">
    <b>
        <span class="h3color tiny">This review is from: </span>
        <a href="https://rads.stackoverflow.com/amzn/click/com/B005C7QVUE" rel="nofollow noreferrer">A Dance with Dragons: A Song of Ice and Fire: Book 5 (Audible Audio Edition)</a>
    </b>
</div>
That is true. I tried it myself this morning. There's a little note on the Audible site that says "a few titles will require two credits" or something like that. A Dance with Dragons is one of those few.

score 2 · Accepted Answer

To get the text in the tail of div.tiny, that is, the text node that immediately follows its closing tag:

review = soup.find("div", "tiny").find_next_sibling(string=True)

Full example:

#!/usr/bin/env python
from bs4 import BeautifulSoup

html = """<div style="margin-left:0.5em;">
<div style="margin-bottom:0.5em;">
   9 of 35 people found the following review helpful </div>
<div style="margin-bottom:0.5em;">
<div style="margin-bottom:0.5em;">
<div class="tiny" style="margin-bottom:0.5em;">
<b>
    <span class="h3color tiny">This review is from: </span>
    <a href="http://rads.stackoverflow.com/amzn/click/B005C7QVUE">
     A Dance with Dragons: A Song of Ice and Fire: Book 5 (Audible Audio Edition)</a>
</b>
</div>
That is true. I tried it myself this morning. There's a little note on the Audible site that says "a few titles will require two credits" or something like that. A Dance with Dragons is one of those few."""

soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly
review = soup.find("div", "tiny").find_next_sibling(string=True)
print(review)

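Output:

That is true. I tried it myself this morning. There's a little note on the Audible site that says "a few titles will require two credits" or something like that. A Dance with Dragons is one of those few.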
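If the page might lack a div.tiny, or the first text node after it might be whitespace only, a more defensive variant could look like the sketch below. The helper name tail_text and the sample markup are made up for illustration; only the bs4 calls come from the library.

```python
from bs4 import BeautifulSoup, Comment, NavigableString


def tail_text(html, css_class="tiny"):
    """Return the first non-blank text node that follows div.<css_class>.

    Hypothetical helper: returns None when the div or the text is missing.
    """
    soup = BeautifulSoup(html, "html.parser")
    marker = soup.find("div", class_=css_class)
    if marker is None:
        return None
    # Walk the siblings that come after the div, in document order.
    for sibling in marker.next_siblings:
        is_text = isinstance(sibling, NavigableString)
        if is_text and not isinstance(sibling, Comment) and sibling.strip():
            return sibling.strip()
    return None


# Made-up markup with the same shape as the page in the question.
sample = (
    '<div><div class="tiny"><b>This review is from: a book</b></div>\n'
    "  wanted text  <div>later block</div></div>"
)
print(tail_text(sample))
```

Unlike the one-liner, this returns None rather than raising AttributeError when no div.tiny is found, and it strips the surrounding whitespace from the result.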

Here is equivalent lxml code that produces the same output:

import lxml.html

doc = lxml.html.fromstring(html)
print(doc.find(".//div[@class='tiny']").tail)
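For anyone unfamiliar with .tail: in the ElementTree data model that lxml follows, an element's tail is the text between its closing tag and the next tag. The standard library's xml.etree.ElementTree exposes the same attribute, so the idea can be tried without installing lxml. This is only a sketch: it assumes well-formed XML, which real-world HTML often is not, and the function name and sample markup are invented for illustration.

```python
import xml.etree.ElementTree as ET


def tail_after(xml_text, path):
    """Return the stripped tail text of the first element matching path.

    Hypothetical helper: returns None if nothing matches or there is no tail.
    """
    root = ET.fromstring(xml_text)
    node = root.find(path)
    if node is None or node.tail is None:
        return None
    return node.tail.strip()


# Made-up, well-formed stand-in for the review markup.
snippet = (
    "<div>"
    '<div class="tiny"><b>This review is from: a book</b></div>'
    "Review text here."
    "</div>"
)
print(tail_after(snippet, ".//div[@class='tiny']"))
```

For actual pages scraped from the web, lxml.html as shown above is the safer choice, since it tolerates unclosed tags.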

score 2 · Accepted Answer

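http://www.crummy.com/software/BeautifulSoup/bs4/doc/#strings-and-stripped-strings suggests that the .strings generator is what you want: it yields every string inside the object. So if you turn that generator into a list and take the last item, you should get what you want. For example: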

$ python
>>> import bs4
>>> text = '<div style="mine"><div>unwanted</div>wanted</div>'
>>> soup = bs4.BeautifulSoup(text, "html.parser")
>>> soup.find_all("div", style="mine")[0].text
'unwantedwanted'
>>> list(soup.find_all("div", style="mine")[0].strings)[-1]
'wanted'
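On a real page the last string is often just a trailing newline, so .stripped_strings (described in the same section of the docs) may be the more practical choice, since it skips whitespace-only strings and strips the rest. A minimal sketch follows; the function name last_string and the sample markup are made up, and it only works when the wanted text really is the last string in the container.

```python
from bs4 import BeautifulSoup


def last_string(html, style):
    """Return the last non-blank string inside the first div with this style.

    Hypothetical helper: returns None if no such div exists or it is empty.
    """
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", style=style)
    if container is None:
        return None
    # stripped_strings drops whitespace-only strings and strips the others.
    strings = list(container.stripped_strings)
    return strings[-1] if strings else None


# Same shape as the example above, with stray whitespace added.
sample_markup = '<div style="mine"><div>unwanted</div>\n  wanted \n</div>'
print(last_string(sample_markup, "mine"))
```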
