python - Why can't BeautifulSoup correctly read/parse this RSS (XML) document?

Question

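YCombinator is nice enough to provide an RSS feed, as well as a big RSS feed containing the top items on HackerNews. I am trying to write a Python script that fetches the RSS feed document and then parses out certain pieces of information using BeautifulSoup. However, I get some strange behavior when BeautifulSoup tries to read the content of each item.

Here are a few sample lines of the RSS feed: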

<rss version="2.0">
<channel>
<title>Hacker News</title><link>http://news.ycombinator.com/</link><description>Links for the intellectually curious, ranked by readers.</description>
<item>
    <title>EFF Patent Project Gets Half-Million-Dollar Boost from Mark Cuban and &#39;Notch&#39;</title>
    <link>https://www.eff.org/press/releases/eff-patent-project-gets-half-million-dollar-boost-mark-cuban-and-notch</link>
    <comments>http://news.ycombinator.com/item?id=4944322</comments>
    <description><![CDATA[<a href="http://news.ycombinator.com/item?id=4944322">Comments</a>]]></description>
</item>
<item>
    <title>Two Billion Pixel Photo of Mount Everest (can you find the climbers?)</title>
    <link>https://s3.amazonaws.com/Gigapans/EBC_Pumori_050112_8bit_FLAT/EBC_Pumori_050112_8bit_FLAT.html</link>
    <comments>http://news.ycombinator.com/item?id=4943361</comments>
    <description><![CDATA[<a href="http://news.ycombinator.com/item?id=4943361">Comments</a>]]></description>
</item>
...
</channel>
</rss>

Here is the code I have written (in Python) to access this feed and print the title, link, and comments for each item:

import sys
import requests
from bs4 import BeautifulSoup

request = requests.get('http://news.ycombinator.com/rss')
soup = BeautifulSoup(request.text)
items = soup.find_all('item')
for item in items:
    title = item.find('title').text
    link = item.find('link').text
    comments = item.find('comments').text
    print title + ' - ' + link + ' - ' + comments

However, the output of this script looks like this:

EFF Patent Project Gets Half-Million-Dollar Boost from Mark Cuban and &#39;Notch&#39; -  - http://news.ycombinator.com/item?id=4944322
Two Billion Pixel Photo of Mount Everest (can you find the climbers?) -  - http://news.ycombinator.com/item?id=4943361
...

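As you can see, the middle field, link, is somehow being omitted. That is, the resulting value of link is an empty string. Why is that?

When I dug into what is in soup, I realized that it somehow chokes when it parses the XML. This can be seen by looking at the first element of items: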

>>> print items[0]
<item><title>EFF Patent Project Gets Half-Million-Dollar Boost from Mark Cuban and &#39;Notch&#39;</title></link>https://www.eff.org/press/releases/eff-patent-project-gets-half-million-dollar-boost-mark-cuban-and-notch<comments>http://news.ycombinator.com/item?id=4944322</comments><description>...</description></item>

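You'll notice that something odd is happening with the link tag only: the parsed item contains just the closing tag, followed by the text that should have been inside it. This is very strange behavior, especially compared with title and comments, which are parsed without a problem.

This seems to be a problem with BeautifulSoup, because the content actually read in by requests has nothing wrong with it. I don't think it is limited to BeautifulSoup, though, because I also tried the xml.etree.ElementTree API and the same problem arose (is BeautifulSoup built on this API?).

Does anyone know why this is happening, or how I can keep using BeautifulSoup without hitting this error?

Note: I was finally able to get what I wanted with xml.dom.minidom, but it doesn't seem to be a highly recommended library. I would like to continue using BeautifulSoup if possible.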
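For reference, the minidom route mentioned in the note might look roughly like the sketch below. This is not the asker's actual code: the helper names child_text and parse_items and the trimmed SAMPLE_RSS string are assumptions made for illustration, and the syntax targets Python 3.

```python
from xml.dom import minidom

# Trimmed copy of the feed from the question (one item only), for illustration.
SAMPLE_RSS = """<rss version="2.0">
<channel>
<title>Hacker News</title><link>http://news.ycombinator.com/</link>
<item>
    <title>Two Billion Pixel Photo of Mount Everest (can you find the climbers?)</title>
    <link>https://s3.amazonaws.com/Gigapans/EBC_Pumori_050112_8bit_FLAT/EBC_Pumori_050112_8bit_FLAT.html</link>
    <comments>http://news.ycombinator.com/item?id=4943361</comments>
</item>
</channel>
</rss>"""


def child_text(node, tag_name):
    """Return the text of the first <tag_name> element under node, or ''."""
    elements = node.getElementsByTagName(tag_name)
    if not elements or elements[0].firstChild is None:
        return ''
    return elements[0].firstChild.data


def parse_items(feed_text):
    """Return a (title, link, comments) tuple for every <item> in the feed."""
    dom = minidom.parseString(feed_text)
    return [
        (child_text(item, 'title'),
         child_text(item, 'link'),
         child_text(item, 'comments'))
        for item in dom.getElementsByTagName('item')
    ]


rows = parse_items(SAMPLE_RSS)
for title, link, comments in rows:
    print(title + ' - ' + link + ' - ' + comments)
```

Because minidom is a plain XML parser, it has no HTML-specific notion of what a link element is, so the link text stays inside its element.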

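Update: I am on a Mac running OS X 10.8, with Python 2.7.2 and BS4 4.1.3.

Update 2: I have lxml, installed with pip. It is version 3.0.2. As for libxml, I checked /usr/lib and the one that shows up is libxml2.2.dylib. I'm not sure when or how that was installed.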

score 7 · Accepted Answer

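Wow, great question. This strikes me as a bug in BeautifulSoup. The reason you can't access the link using soup.find_all('item').link is that when you first load the html into BeautifulSoup, it does something odd to the markup: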

>>> from bs4 import BeautifulSoup as BS
>>> BS(html)
<html><body><rss version="2.0">
<channel>
<title>Hacker News</title><link/>http://news.ycombinator.com/<description>Links
for the intellectually curious, ranked by readers.</description>
<item>
<title>EFF Patent Project Gets Half-Million-Dollar Boost from Mark Cuban and 'No
tch'</title>
<link/>https://www.eff.org/press/releases/eff-patent-project-gets-half-million-d
ollar-boost-mark-cuban-and-notch
    <comments>http://news.ycombinator.com/item?id=4944322</comments>
<description>Comments]]&gt;</description>
</item>
<item>
<title>Two Billion Pixel Photo of Mount Everest (can you find the climbers?)</ti
tle>
<link/>https://s3.amazonaws.com/Gigapans/EBC_Pumori_050112_8bit_FLAT/EBC_Pumori_
050112_8bit_FLAT.html
    <comments>http://news.ycombinator.com/item?id=4943361</comments>
<description>Comments]]&gt;</description>
</item>
...
</channel>
</rss></body></html>

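Look carefully: it has actually changed the first <link> tag to <link/> and then removed the </link> tag. I'm not sure why it would do this, but unless the problem in the BeautifulSoup.BeautifulSoup class initialization is fixed, you won't be able to use it for now.

Update:

I think your best (albeit hack-y) bet for now is to use the following for link: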

>>> soup.find('item').link.next_sibling
u'http://news.ycombinator.com/'

score 3 · Accepted Answer

@Yan Hudon is right. I solved the problem with soup = BeautifulSoup(request.text, 'xml')
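A minimal sketch of that fix, assuming bs4 and lxml are both installed (the 'xml' parser depends on lxml). The helper name item_links and the trimmed SAMPLE_RSS string are made up for illustration; the snippet uses an inline copy of the feed instead of a live request and is written for Python 3.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the feed from the question (one item only), for illustration.
SAMPLE_RSS = """<rss version="2.0">
<channel>
<title>Hacker News</title><link>http://news.ycombinator.com/</link>
<item>
    <title>Two Billion Pixel Photo of Mount Everest (can you find the climbers?)</title>
    <link>https://s3.amazonaws.com/Gigapans/EBC_Pumori_050112_8bit_FLAT/EBC_Pumori_050112_8bit_FLAT.html</link>
    <comments>http://news.ycombinator.com/item?id=4943361</comments>
</item>
</channel>
</rss>"""


def item_links(feed_text, parser):
    """Return the <link> text of every <item>, using the given bs4 parser name."""
    soup = BeautifulSoup(feed_text, parser)
    return [item.find('link').get_text() for item in soup.find_all('item')]


# 'xml' selects lxml's XML parser, which keeps the text inside <link>.
xml_links = item_links(SAMPLE_RSS, 'xml')
print(xml_links)

# An HTML parser is shown only for comparison; it is expected to reproduce
# the symptom from the question (link text missing from the element).
html_links = item_links(SAMPLE_RSS, 'html.parser')
print(html_links)
```

Naming the parser explicitly also makes the script independent of whichever parser BeautifulSoup would otherwise pick on a given machine.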

answered 2015-02-28T13:35:38.303

score 3 · Accepted Answer

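Actually, the problem seems to be related to the parser you are using. By default, an HTML parser is used. After installing the lxml module, try soup = BeautifulSoup(request.text, 'xml').

It will then use an XML parser instead of an HTML parser, and it should be fine.

For more information, see http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser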
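As background on why an HTML parser would treat <link> this way: in HTML, link is a void element, meaning it never has content or a closing tag, so an HTML tree builder closes it as soon as it is opened. The toy serializer below imitates that rule with Python 3's standard html.parser module. It is only an illustration, not BeautifulSoup's real code, and the names VOID_ELEMENTS, ToyTreeBuilder, and html_view are made up for this sketch.

```python
from html.parser import HTMLParser

# A subset of HTML's void elements (elements that can never have content).
VOID_ELEMENTS = {"br", "hr", "img", "input", "link", "meta"}


class ToyTreeBuilder(HTMLParser):
    """Re-serializes markup while applying HTML's void-element rule."""

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_ELEMENTS:
            # Closed immediately, so any following text ends up outside it.
            self.out.append("<%s/>" % tag)
        else:
            self.out.append("<%s>" % tag)

    def handle_endtag(self, tag):
        # A stray </link> is dropped because the element is already closed.
        if tag not in VOID_ELEMENTS:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)


def html_view(markup):
    """Return the markup as this toy HTML-style builder would see it."""
    builder = ToyTreeBuilder()
    builder.feed(markup)
    builder.close()
    return "".join(builder.out)


result = html_view("<item><link>http://example.com/</link><comments>c</comments></item>")
print(result)
```

The URL ends up as a sibling of an empty <link/> rather than as its content, which matches the shape of the output shown in the first answer. An XML parser has no such rule, because RSS gives link no special meaning at the syntax level.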

score 1 · Accepted Answer

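I don't think there is a bug in BeautifulSoup here.

I installed a clean copy of BS4 4.1.3 on Apple's stock Python 2.7.2 from OS X 10.8.2, and everything worked as expected. It doesn't mis-parse the <link> as </link>, and therefore it doesn't have the problem with item.find('link').

I also tried parsing the same thing with the stock xml.etree.ElementTree and xml.etree.cElementTree in 2.7.2, and with xml.etree.ElementTree in python.org 3.3.0, and it again worked fine. Here is the code: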

import xml.etree.ElementTree as ET

# x is assumed to hold the RSS document text shown in the question
rss = ET.fromstring(x)
for channel in rss.findall('channel'):
  for item in channel.findall('item'):
    title = item.find('title').text
    link = item.find('link').text
    comments = item.find('comments').text
    print(title)
    print(link)
    print(comments)

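Then I installed lxml 3.0.2 (I believe BS uses lxml if it is available), built against Apple's built-in /usr/lib/libxml2.2.dylib (which, according to xml2-config --version, is 2.7.8), and ran the same tests using its etree and using BS. Again, everything worked.

Besides messing up the <link>, jdotjdot's output shows that BS4 is also messing up the <description> in an odd way. The original is this: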

<description><![CDATA[<a href="http://news.ycombinator.com/item?id=4944322">Comments</a>]]></description>

His output is:

<description>Comments]]&gt;</description>

My output from running his exact same code is:

<description><![CDATA[<a href="http://news.ycombinator.com/item?id=4944322">Comments</a>]]></description>

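So it seems there is a much bigger problem here. The odd thing is that it happens to two different people, yet it does not happen on a clean install of the latest version of everything.

That implies either that this is a bug that has since been fixed and I simply have a newer version of whatever had the bug, or that there is something unusual about the way both of them installed something.

BS4 itself can be ruled out, since at least Treebranch has 4.1.3, the same as me. Although, without knowing how he installed it, it could still be a problem with the installation.

Python and its built-in etree can be ruled out, since at least Treebranch has the same stock Apple 2.7.2 from OS X 10.8 that I have.

It could very well be a bug in lxml or the underlying libxml, or in the way they were installed. I know jdotjdot has lxml 2.3.6, so this could be a bug that was fixed somewhere between 2.3.6 and 3.0.2. In fact, according to the lxml website and the change notes for every version after 2.3.5, there is no 2.3.6, so whatever he has may be some kind of buggy release from a branch that was canceled very early on, or something similar. I don't know his libxml version, how either library was installed, or what platform he is on, so it is hard to guess, but at least this is something that can be investigated.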
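One way to start that investigation is to print the versions each person actually has loaded. The sketch below is a suggestion only: the function name installed_versions is made up, and any library that is not installed is reported as None rather than raising an error.

```python
def installed_versions():
    """Collect version strings for bs4, lxml, and libxml2 (None if missing)."""
    versions = {'bs4': None, 'lxml': None, 'libxml2': None}
    try:
        import bs4
        versions['bs4'] = getattr(bs4, '__version__', None)
    except ImportError:
        pass
    try:
        from lxml import etree
        # Both attributes are tuples of integers, e.g. (3, 0, 2, 0).
        versions['lxml'] = '.'.join(str(n) for n in etree.LXML_VERSION)
        # LIBXML_VERSION is the libxml2 version lxml is running against.
        versions['libxml2'] = '.'.join(str(n) for n in etree.LIBXML_VERSION)
    except ImportError:
        pass
    return versions


versions = installed_versions()
for name in sorted(versions):
    print('%s: %s' % (name, versions[name]))
```

Comparing this output between a machine that shows the problem and one that does not would narrow down which component differs.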
