python - BeautifulSoup is omitting the body of the page

Question

New to BeautifulSoup... need some help.

Here is the code sample...

from mechanize import Browser
from BeautifulSoup import BeautifulSoup

mec = Browser()
#url1 = "http://www.wines.com/catalog/index.php?cPath=21"
url2 = "http://www.wines.com/catalog/product_info.php?products_id=4866"
page = mec.open(url2)
html = page.read()
soup = BeautifulSoup(html)

print soup.prettify()

When I use url1, I get a nice dump of the page. When I use url2 (the one I need), I get output with no body.

<!doctype html public "-//W3C//DTD HTML 4.01 Transitional//EN">
<html dir="LTR" lang="en">
 <head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <title>
   2005 Jordan Cabernet Sauvignon Sonoma 2005
  </title>
 </head>
</html>

Any ideas?

score 2 · Accepted Answer

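Yes. The HTML is horrible.

Step 1a: print soup.prettify() and see where it stops indenting properly.

Step 1b (if 1a doesn't work): just run the original file through any HTML pretty-printer. I use BBEdit for things that confuse Beautiful Soup.

Look closely at the HTML. There will be some kind of horrible mistake. Misplaced " characters are common. Also, CSS background images given as a style often have the wrong quotes.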

<tag style="background-image:url("something")">

Note the "improper" quotes. You need to write a regex to find and fix these.
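To shortlist candidates before reading the HTML by hand, a rough heuristic is to flag any tag containing an odd number of double quotes. This is only a sketch: the names `TAG` and `tags_with_odd_quotes` and the sample markup are made up for illustration, and it assumes no `>` appears inside an attribute value. It will not catch a case like the background-image one, where the stray quotes happen to balance.

```python
import re

# Matches one tag, from "<" to the next ">".
# Assumption: no ">" appears inside an attribute value.
TAG = re.compile(r'<[^<>]+>')

def tags_with_odd_quotes(html):
    """Return every tag whose count of double quotes is odd."""
    return [m.group(0) for m in TAG.finditer(html)
            if m.group(0).count('"') % 2 == 1]

# Made-up sample: only the meta tag has an unbalanced quote.
sample = (
    '<p class="ok">text</p>'
    '<meta name="description" content="$49 "Deep red...">'
    '<img src="a.gif">'
)
suspects = tags_with_odd_quotes(sample)
for tag in suspects:
    print(tag)
```

Anything it prints is worth a closer look; an empty result does not mean the page is clean.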

Step 2. Write a "massage" regex and function to fix this. Read the section at http://www.crummy.com/software/BeautifulSoup/documentation.html#Sanitizing%20Bad%20Data%20with%20Regexps to see how to create a markup massage.

Here is one I use often:

import re

# Fix background-image:url("some URI")
# by replacing the inner quotes with &quot;
background_image = re.compile(r'background-image:url\("([^"]+)"\)')
def fix_background_image(match):
    return 'background-image:url(&quot;%s&quot;)' % match.group(1)

# Fix <img src="some URI name="someString"">  -- note the out-of-place quotes
bad_img = re.compile(r'src="([^ ]+) name="([^"]+)""')
def fix_bad_img(match):
    return 'src="%s" name="%s"' % (match.group(1), match.group(2))

# (pattern, replacement function) pairs, the form a markup massage takes
fix_style_quotes = [
    (background_image, fix_background_image),
    (bad_img, fix_bad_img),
]
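To see what such a list does to the markup, here is a standalone sketch that applies each (pattern, function) pair with `re.sub` before any parsing happens. The helper `apply_massage` and the sample string are hypothetical, and the replacement uses the standard `&quot;` entity. With BeautifulSoup 3 itself, the linked documentation describes passing the list through the `markupMassage` argument instead.

```python
import re

background_image = re.compile(r'background-image:url\("([^"]+)"\)')

def fix_background_image(match):
    # Replace the inner quotes with the standard &quot; entity.
    return 'background-image:url(&quot;%s&quot;)' % match.group(1)

bad_img = re.compile(r'src="([^ ]+) name="([^"]+)""')

def fix_bad_img(match):
    # Move the stray closing quote back to the end of the src value.
    return 'src="%s" name="%s"' % (match.group(1), match.group(2))

fix_style_quotes = [
    (background_image, fix_background_image),
    (bad_img, fix_bad_img),
]

def apply_massage(html, massage):
    """Hypothetical helper: run every (pattern, function) pair over the raw HTML."""
    for pattern, replacement in massage:
        html = pattern.sub(replacement, html)
    return html

# Made-up sample containing both kinds of damage.
sample = ('<td style="background-image:url("bg.png")">'
          '<img src="a.gif name="logo""></td>')
cleaned = apply_massage(sample, fix_style_quotes)
print(cleaned)
```

Running the cleanup on the raw string first makes it easy to diff the before and after text and confirm each regex only touches what it should.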

score 2 · Accepted Answer

It seems to be tripping over this bad tag:

<META NAME="description" CONTENT="$49 at Wines.com "Deep red. Red- and blackcurrant, cherry and menthol on the nose, with subtle vanilla, cola and tobacco notes adding complexity. Tightly wound red berry and bitter cherry flavors are framed by dusty...">

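Clearly they have failed to escape a quote inside the attribute value (uh-oh... the site might be vulnerable to cross-site scripting?), and that makes the parser think the rest of the page is all inside attribute values. (I think it would take another " or a > inside one of the real attribute values to make it realise the error.)

Unfortunately, this is quite a tricky error to fix. Perhaps you could try a slightly different parser? For example, if you are using Soup 3.1.x, try 3.0.x instead, or vice versa. Or try html5lib.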
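If switching parsers is not an option, a targeted repair of this one tag before parsing may be enough. The following is a minimal sketch, not a general fix: the names `meta_description` and `escape_inner_quotes` are made up, the sample is a shortened version of the tag above, and it assumes the description tag ends at the first `">` and that the content itself never contains that sequence.

```python
import re

# Capture the opening of the description tag, its content, and the closing '">'.
# Assumption: the content never contains a '"' immediately followed by '>'.
meta_description = re.compile(
    r'(<meta\s+name="description"\s+content=")(.*?)("\s*/?>)',
    re.IGNORECASE | re.DOTALL,
)

def escape_inner_quotes(match):
    # Escape only the quotes inside the attribute value.
    inner = match.group(2).replace('"', '&quot;')
    return match.group(1) + inner + match.group(3)

# Shortened, made-up version of the broken tag.
broken = ('<META NAME="description" CONTENT="$49 at Wines.com '
          '"Deep red. Tightly wound..."><body>')
fixed = meta_description.sub(escape_inner_quotes, broken)
print(fixed)
```

The repaired string can then be handed to the parser in place of the raw page.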

score 1 · Accepted Answer

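The HTML is indeed horrible :-) BeautifulSoup 3.0.7 is much better at handling malformed HTML than the current version. The project website warns: "Currently the 3.0.x series is better at parsing bad HTML than the 3.1 series." ... There is a whole page devoted to explaining why; it boils down to the fact that SGMLParser was removed in Python 3, and BS 3.1.x was written to be convertible to Py3k.

The good news is that you can still download 3.0.7a (the last version), which parsed the URL you mentioned perfectly on my machine: http://www.crummy.com/software/BeautifulSoup/download/3.x/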

score 0 · Accepted Answer

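Running a validator on the HTML in question shows 116 errors. I guess that is too many to track down which one BeautifulSoup proves unable to recover from :-(

html5lib seems to survive the ordeal of parsing this horror page and keeps a lot of it (as far as I can tell, when you use html5lib's parser to produce a BeautifulSoup object, the prettified output contains just about all of the original page). It is hard to say whether the resulting parse tree will do what you need, since we don't really know what that is ;-).

Note: I installed html5lib straight from the hg clone of the source (just python setup.py install from the html5lib/python directory), since the last official release is rather old.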
