python - BeautifulSoup not extracting all html (automatically removes most of the page's html)

Question

I am trying to use BeautifulSoup to extract content from a website (http://brooklynexposed.com/events/). As an example of the problem, I can run the following code:

import urllib
import bs4 as BeautifulSoup

url = 'http://brooklynexposed.com/events/'
html = urllib.urlopen(url).read()

soup = BeautifulSoup.BeautifulSoup(html)
print soup.prettify().encode('utf-8')

The output appears to cut off the html, as shown below:

       <li class="event">
        9:00pm - 11:00pm
        <br/>
        <a href="http://brooklynexposed.com/events/entry/5432/2013-07-16">
         Comedy Sh
        </a>
       </li>
      </ul>
     </div>
    </div>
   </div>
  </div>
 </body>
</html>

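It cuts off the listing named Comedy Show and all the html after it, up to the final closing tags. Most of the html is removed automatically. I have noticed similar behavior on many websites: if the page is too long, BeautifulSoup fails to parse the whole page and simply cuts the text. Does anyone have a solution for this? If BeautifulSoup cannot handle pages like this, does anyone know of another library with a function similar to prettify()?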

score 4 · Accepted Answer

I ran into the same trouble: bs4 cut off the html on some machines but not on others. I could not reproduce it reliably...

I switched to this:

soup = bs4.BeautifulSoup(html, 'html5lib')

...and now it works.
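
One plausible reason for the machine-to-machine difference: when no parser is named, bs4 picks the best parser that happens to be installed, so different machines can end up building different trees from the same markup. The sketch below (Python 3, assuming bs4 is installed) parses one string with several parsers and counts the tags each one produced. The helper name `count_tags_per_parser` and the sample markup are made up for illustration, not taken from the question.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def count_tags_per_parser(html, parsers=("html.parser", "lxml", "html5lib")):
    """Parse the same markup with each named parser and count the resulting tags."""
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # The parser library behind this name is not installed here.
            counts[name] = None
            continue
        # find_all(True) matches every tag in the parsed tree.
        counts[name] = len(soup.find_all(True))
    return counts


# Hypothetical, deliberately sloppy markup: unclosed <li> tags and a stray </p>.
sample = "<ul><li>Comedy Show</p><li>Open Mic</ul>"
print(count_tags_per_parser(sample))
```

If the counts differ noticeably on the real page, naming the parser explicitly, as in the line above, at least makes the result consistent across machines.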

score 0 · Accepted Answer

It works fine for me, but I got an error when I called soup.prettify().encode('utf-8'):

>>> from BeautifulSoup import BeautifulSoup as bs
>>> 
>>> import urllib
>>> url = 'http://brooklynexposed.com/events/'
>>> html = urllib.urlopen(url).read()
>>> 
>>> 
>>> soup = bs(html)
>>> soup.prettify().encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 8788: ordinal not in range(128)
>>>
>>> soup.prettify()
'<!doctype html>\n<!--[if lt IE 7 ]&gt; 
&lt;html class="no-js ie6" lang="en"&gt; &lt;![endif]-->\n
<!--[if IE 7 ]&gt;
...
...
...
...
</body>\n</html>\n'

I think this might help you: BeautifulSoup, where are you putting my HTML?
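
A likely reading of the traceback above, assuming BeautifulSoup 3 under Python 2: prettify() there already returns a UTF-8 encoded byte string, and calling .encode('utf-8') on a byte string makes Python 2 first decode it implicitly as ASCII, which fails on the first non-ASCII byte (0xe2 starts many UTF-8 punctuation characters). The sketch below reproduces that sequence explicitly in Python 3. The sample text and the helper name `implicit_ascii_roundtrip` are hypothetical.

```python
# Hypothetical sample text containing a right single quotation mark (U+2019),
# whose UTF-8 encoding starts with the byte 0xe2.
raw = "Comedy Show\u2019s".encode("utf-8")


def implicit_ascii_roundtrip(data):
    """Mimic Python 2's str.encode('utf-8'): decode as ASCII, then re-encode."""
    return data.decode("ascii").encode("utf-8")


try:
    implicit_ascii_roundtrip(raw)
except UnicodeDecodeError as exc:
    # Same kind of error as in the session above.
    print(exc)

# The bytes are already valid UTF-8, so no further encoding step is needed.
print(raw.decode("utf-8"))
```

Under that assumption, the second call in the session succeeds simply because it skips the extra encode step.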
