python - Scraping text data with Python Beautiful Soup

Question

I am new to Python and I am trying to scrape some text reviews from a website using BeautifulSoup. Part of the HTML structure looks like this:

<div style="1st level">
    <div style="2nd level">Here is text 1</div>
    <div style="2nd level">Here is text 2</div>
    <div style="2nd level">Here is text 3</div>
    <div style="2nd level">Here is text 4</div>
    Here is text 5 and this is the part I want to get.
<div>

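Texts 1, 2, 3 and 4 are at the second level, and I don't need them. I only want text 5, which sits at the first level of the structure. Part of my code is as follows: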

reviews = soup.find('div', style="1st level")
reviews = reviews.text
print(reviews)

But then I get everything from text 1 through text 5. Is there a simple way to target the first level and get only text 5?

score 0 · Accepted Answer

I'm not sure these approaches are the best, but give them a try:

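from collections import deque

from bs4 import BeautifulSoup

# Sample markup copied from the question, including its trailing unclosed <div>
html = """<div style="1st level">
    <div style="2nd level">Here is text 1</div>
    <div style="2nd level">Here is text 2</div>
    <div style="2nd level">Here is text 3</div>
    <div style="2nd level">Here is text 4</div>
    Here is text 5 and this is the part I want to get.
<div>"""

soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly
reviews = soup.find('div', style="1st level")

# The parser treats the trailing <div> as an empty last child,
# so the wanted text node is the second-to-last item in .contents
print(reviews.contents[-2].strip())
# Alternatively, keep only the last string yielded by the .strings generator
print(deque(reviews.strings, maxlen=1).pop().strip())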

Both statements print:

Here is text 5 and this is the part I want to get.
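
Both lines above depend on where text 5 happens to sit in the markup. A less position-dependent option, offered here as a sketch rather than as part of the original answer, is to ask only for the text nodes that are direct children of the outer div. It assumes the outer div is properly closed with `</div>`, and the variable names are illustrative:

```python
from bs4 import BeautifulSoup

# Well-formed version of the snippet from the question (closing tag assumed).
html = """<div style="1st level">
    <div style="2nd level">Here is text 1</div>
    <div style="2nd level">Here is text 2</div>
    <div style="2nd level">Here is text 3</div>
    <div style="2nd level">Here is text 4</div>
    Here is text 5 and this is the part I want to get.
</div>"""

soup = BeautifulSoup(html, "html.parser")
outer = soup.find("div", style="1st level")

# string=True matches text nodes rather than tags;
# recursive=False restricts the search to direct children of `outer`.
direct_strings = outer.find_all(string=True, recursive=False)

# Drop the whitespace-only nodes that sit between the nested divs.
direct_text = [s.strip() for s in direct_strings if s.strip()]

print(direct_text)
```

If the first level held several loose pieces of text, they would all come back in document order.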

FYI, I used deque to get the last element from the strings generator.
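
The deque trick works for any iterable, not just `.strings`: a deque created with `maxlen=1` discards older items as new ones arrive, so only the last one remains once the generator is exhausted. A minimal sketch, where `last_item` is a hypothetical helper name:

```python
from collections import deque


def last_item(iterable):
    """Return the last item of an iterable, or None if it is empty.

    Hypothetical helper for illustration; it consumes the whole iterable.
    """
    tail = deque(iterable, maxlen=1)  # keeps only the most recent item
    return tail.pop() if tail else None


strings = (s for s in ["Here is text 1", "Here is text 4", "Here is text 5"])
print(last_item(strings))
```

This avoids building a full list just to read its final element, which matters little here but helps with long generators.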

Also, FYI, lxml with XPath can do this using text().
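
A rough sketch of the lxml route, assuming lxml is installed and the outer div is properly closed. In XPath, `text()` after a single `/` selects only the text nodes that are direct children of the matched element, so the nested divs are skipped:

```python
from lxml import etree

# Well-formed version of the snippet from the question (closing tag assumed).
html = """<div style="1st level">
    <div style="2nd level">Here is text 1</div>
    <div style="2nd level">Here is text 2</div>
    <div style="2nd level">Here is text 3</div>
    <div style="2nd level">Here is text 4</div>
    Here is text 5 and this is the part I want to get.
</div>"""

root = etree.HTML(html)  # parses the fragment and wraps it in <html><body>

# Direct text children of the first-level div only.
texts = root.xpath('//div[@style="1st level"]/text()')

# Drop the whitespace-only nodes that sit between the nested divs.
direct_text = [t.strip() for t in texts if t.strip()]

print(direct_text)
```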

Hope that helps.
