python - A div inside a div breaks extraction of the whole div - Python/BS4

Question

This is the HTML I'm working with:

<div id="post_message_64012736" class=" post">
<br>
Just testing something, please ignore this :D<br>
<br>
<br>
<br>
<br>
<div style="margin:20px; margin-top:5px; ">
    <div class="smallfont" style="margin-bottom:2px">

            Quote:

    </div>


    <table cellpadding="6" cellspacing="0" border="0" width="100%">
    <tbody><tr><td class="quotearea">
    <div style="font-style:italic">New browser based game that was directly inspired by Candy Box, but is quite different from it.<br>
<br>
A Dark Room -</div>
        </td>
    </tr>
    </tbody></table>
</div>
I have it running on a tab, pretty interesting. I still don't know how to get scales thought. You can only buy them or get them from the traps?<br>
<br>
Is there a Sentinel demo that doesn't require unity3d in the browser? Like a real windows demo?
</div>

This is the code I'm using, which is pretty simple:

soup = bs4.BeautifulSoup(r.text)  # no parser named, so bs4 picks one itself
for i in soup.findAll("div",class_=" post"):
    print(i.text)

But this is the only output I get:

Just testing something, please ignore this :D







            Quote:




New browser based game that was directly inspired by Candy Box, but is quite different from it.

A Dark Room -

If I just print i, I get this:

<div class=" post" id="post_message_64012736">

            INFO:pyindiegaf<br/>
<br/>
Just testing something, please ignore this :D<br/>
<br/>
<br/>
<br/>
<br/>
<div style="margin:20px; margin-top:5px; ">
<div class="smallfont" style="margin-bottom:2px">

            Quote:

    </div>
<table border="0" cellpadding="6" cellspacing="0" width="100%">
<td class="quotearea">
<div style="font-style:italic">New browser based game that was directly inspired by Candy Box, but is quite different from it.<br/>
<br/>
A Dark Room -</div>
</td>
</table></div></div>

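It looks like after hitting one of the inner closing tags, it simply assumes the main div has ended. As far as I can see, every opening tag has a matching closing tag, so it's not as if the HTML is malformed.

So... any guesses as to what might be happening here? I feel kind of stupid, like I'm missing something obvious.

Thanks!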

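Edit: Some clarification: I'm not actually using just that HTML, since plain HTML like the snippet above seems to work.

I'm using this URL: http://www.neogaf.com/forum/showthread.php?t=572913&page=12

It's a vBulletin forum, so every post has the class "post". I look for the posts with bs4, and if one contains a keyword I start processing it like this: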

import bs4
import requests

url = "http://www.neogaf.com/forum/showthread.php?t=572913&page=12"
r = requests.get(url)
print("Using url:", url)
soup = bs4.BeautifulSoup(r.text)  # no parser named, so bs4 picks one itself
for i in soup.findAll("div",class_=" post"):
    if "INFO:pyindiegaf" in i.text:
        print(i)

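With this approach I get the result shown above: bs4 stops before the end of the whole div block.

Sorry for the confusion; I was trying to simplify things.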

score 1 · Accepted Answer

The site appears to have some malformed HTML that interferes with the parsing. Install html5lib (pip install html5lib) and use it as your HTML parser:

import requests
from bs4 import BeautifulSoup

url = 'http://www.neogaf.com/forum/showthread.php?t=572913&page=12'
html = requests.get(url).content
soup = BeautifulSoup(html, 'html5lib')

for post in soup.find_all('div', class_='post'):
    text = post.get_text()

    if 'INFO:pyindiegaf' in text:
        print(text)

html5lib is the most lenient HTML parser you can get. Also note that class_='post' and class_=' post' produce different results.
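To make both remarks concrete, here is a minimal sketch. It does not reproduce the forum's actual markup: BROKEN and POSTS are made-up snippets, and tree_from and count_posts are hypothetical helper names. The first helper shows that different parsers can build different trees from the same invalid input; the second counts what each class_ value matches. What class_=' post' returns seems to depend on the bs4 version, so the sketch only prints that count.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Deliberately invalid markup (made up): a closing </p> with no opening tag.
BROKEN = "<a></p>"

# Two made-up posts: one with a leading space in its class attribute,
# one carrying a second class.
POSTS = '<div class=" post">first</div><div class="post sticky">second</div>'


def tree_from(parser_name, markup=BROKEN):
    """Serialize the tree a parser builds, or return None if it is not installed."""
    try:
        return str(BeautifulSoup(markup, parser_name))
    except FeatureNotFound:
        return None


def count_posts(class_value, markup=POSTS):
    """Count the <div> elements that find_all returns for a given class_ value."""
    soup = BeautifulSoup(markup, "html.parser")
    return len(soup.find_all("div", class_=class_value))


if __name__ == "__main__":
    # Same input, possibly different trees depending on the parser.
    for name in ("html.parser", "html5lib"):
        print(name, "->", tree_from(name))

    # 'post' matches any div whose class list contains "post";
    # the count for ' post' is printed rather than assumed.
    for value in ("post", " post"):
        print(repr(value), "->", count_posts(value), "match(es)")
```

If the two parsers print different trees for BROKEN, that is the same kind of disagreement that can cut a post short on a real page, which is why switching parsers is worth trying first.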

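Since you're scraping a forum, you may want to use Scrapy instead. It looks complicated, but a spider will be simpler and faster than your BeautifulSoup crawler (if you really are crawling the forum).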
