python - BeautifulSoup parses a page incorrectly and finds no links

Question

Here is a simple piece of code in Python 2.7.2 that fetches a site and collects all the links from it:

import urllib2
from bs4 import BeautifulSoup

def getAllLinks(url):
    response = urllib2.urlopen(url)
    content = response.read()
    soup = BeautifulSoup(content, "html5lib")
    return soup.find_all("a")

links1 = getAllLinks('http://www.stanford.edu')
links2 = getAllLinks('http://med.stanford.edu/')

print len(links1)
print len(links2)

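The problem is that it doesn't work in the second case. It prints 102 and 0, even though there are clearly links on the second site. BeautifulSoup raises no parsing errors and pretty-prints the markup fine. I suspect this is caused by the first line of the med.stanford.edu source, which says the document is XML (even though the content type is text/html):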

<?xml version="1.0" encoding="iso-8859-1"?>

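I can't figure out how to set up BeautifulSoup to ignore it, or how to work around it. I'm using html5lib as the parser because I had problems with the default one (incorrect markup).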

score 3 · Accepted Answer

When a document claims to be XML, I find that the lxml parser gives the best results. Trying your code with the lxml parser instead of html5lib finds 300 links.
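The parser swap suggested above can be sketched as follows. This is a minimal illustration rather than the answerer's exact code: the function name get_all_links and its parser argument are hypothetical, the function takes the already-downloaded page content instead of a URL, and it assumes the third-party lxml package is installed alongside bs4.

```python
from bs4 import BeautifulSoup

def get_all_links(content, parser="lxml"):
    # Hypothetical helper: parse the raw page content with the named
    # parser (lxml by default, instead of html5lib) and return all <a> tags.
    soup = BeautifulSoup(content, parser)
    return soup.find_all("a")
```

With the question's code, content would be the result of response.read(), and the parser name is the only thing that changes.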

score 2 · Accepted Answer

You're right that the problem is the <?xml... line. Ignoring it is simple: just skip the first line of the content, by replacing

    content = response.read()

with something like

    content = "".join(response.readlines()[1:])

After this change, len(links2) becomes 300.

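ETA: You will probably want to do this conditionally, so that you don't always skip the first line of the content. An example would be something like this: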

content = response.read()
if content.startswith("<?xml"):
    content = "\n".join(content.split("\n")[1:])
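The line-based skip above assumes the declaration sits alone on the first line. A slightly more defensive variant, sketched here under the assumption that content is a text string (as returned by response.read() in Python 2, or decoded first in Python 3), removes only the declaration itself. The helper name strip_xml_declaration is hypothetical and not part of BeautifulSoup.

```python
import re

# Matches an XML declaration at the very start of the document,
# tolerating leading whitespace and any whitespace that follows it.
_XML_DECL = re.compile(r'^\s*<\?xml[^>]*\?>\s*')

def strip_xml_declaration(content):
    # Hypothetical helper: drop only the leading <?xml ...?> declaration
    # and return the rest of the document unchanged.
    return _XML_DECL.sub('', content, count=1)
```

The result can then be passed to BeautifulSoup in place of the original content. Documents without a leading declaration come back unmodified.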
