python - Scrapy fails to parse some HTML files correctly

Question

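I have been using Scrapy for a few weeks, and recently I found that HtmlXPathSelector fails to parse some HTML files correctly.

On the page http://detail.zol.com.cn/series/268/10227_1.html there is exactly one tag of the form

`div id='param-more' class='mod_param  '`.

When I select the tag with the xpath "//div[@id='param-more']", it returns [].

I tried the scrapy shell and got the same result.

When I fetch the page with wget, I can also find the tag "div id='param-more' class='mod_param'" in the HTML source, so I do not think the tag only appears after some action is triggered.

Please give me some hints on how to solve this problem.

Below is the code snippet related to the problem. When processing the URL above, len(nodes_product) is always 0.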

def parse_series(self, response):
    hxs = HtmlXPathSelector(response)

    xpath_product = "//div[@id='param-normal']/table//td[@class='name']/a | "\
                    "//div[@id='param-more']/table//td[@class='name']/a"
    nodes_product = hxs.select(xpath_product)
    if len(nodes_product) == 0:
        # there's only the title, no other products in the series
        .......
    else:
        .......

score 3 · Accepted Answer

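This looks like a bug in the XPath selectors. I created a quick test spider and ran into the same problem. I believe it has to do with non-standard characters on the page.

I do not think the problem is that the "param-more" div is tied to any JavaScript event or hidden by CSS. I disabled JavaScript and changed my user agent (and location) to see whether that affected the data on the page. It did not.

However, I was able to parse the "param-more" div using BeautifulSoup: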

from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from bs4 import BeautifulSoup

class TestSpider(BaseSpider):
    name = "Test"

    start_urls = [
        "http://detail.zol.com.cn/series/268/10227_1.html"
                 ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)

        #data = hxs.select("//div[@id='param-more']").extract()

        data = response.body
        soup = BeautifulSoup(data)
        print soup.find(id='param-more')

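Someone else may know more about the XPath selector issue, but for the time being you can save the HTML found by BeautifulSoup into an item and pass it along to the pipeline.

Here is the link to the most recent BeautifulSoup version: http://www.crummy.com/software/BeautifulSoup/#Download

Update

I believe I found the specific problem. The page in question declares in a meta tag that it uses the GB 2312 character set. The conversion from GB 2312 to Unicode is problematic because some characters have no Unicode equivalent. That would not be an issue, except that UnicodeDammit, BeautifulSoup's encoding detection module, actually determines the encoding to be ISO 8859-2. The problem is that lxml determines a document's encoding by looking at the charset specified in the meta tag of the document head. So there is a mismatch between the encoding lxml assumes and the encoding Scrapy assumes.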
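To make this kind of mismatch concrete, here is a minimal sketch in modern Python 3 (not the Python 2 used in the spiders in this thread). It assumes, purely for illustration, that a page declares GB 2312 but contains a character that exists only in the GBK superset. The sample text and the helper name `try_decode` are my own choices, and I have not checked which bytes on the real page cause the trouble.

```python
# Illustrative only: the sample text is an assumption, not data from the real page.
sample = "龍 param-more"            # "龍" is encodable in GBK but not in GB 2312
raw = sample.encode("gbk")          # bytes as a server might send them


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid for `encoding`."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


strict_gb2312 = try_decode(raw, "gb2312")   # strict decoding with the declared charset
as_gbk = try_decode(raw, "gbk")             # decoding with the superset charset
lossy = raw.decode("gb2312", "replace")     # invalid bytes become U+FFFD

print(strict_gb2312)
print(as_gbk)
print(repr(lossy))
```

A parser that trusts the declared charset either fails on such bytes or silently replaces them, which is one plausible way for markup near the affected characters to get damaged.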

The following code demonstrates the problem and offers an alternative that does not rely on the BS4 library:

from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
import chardet

class TestSpider(BaseSpider):
    name = "Test"

    start_urls = [
        "http://detail.zol.com.cn/series/268/10227_1.html"
                 ]

    def parse(self, response):

        encoding = chardet.detect(response.body)['encoding']
        if encoding != 'utf-8':
            response.body = response.body.decode(encoding, 'replace').encode('utf-8')

        hxs = HtmlXPathSelector(response)
        data = hxs.select("//div[@id='param-more']").extract()
        #print encoding
        print data

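Here you can see that by forcing lxml to use UTF-8 encoding, it no longer attempts to map from what it perceives as GB 2312 to UTF-8.

In Scrapy, the encoding used by HtmlXPathSelector is set in the scrapy/selector/lxmlsel.py module. That module passes the response body to the lxml parser using the response.encoding attribute, which is ultimately set in the scrapy/http/response/text.py module.

The code that sets the response.encoding attribute is as follows: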
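If chardet is not available, a similar normalization can be sketched with the standard library alone. This is a different technique from the detection-based code above: instead of guessing, it decodes with a superset of the declared charset (GB 18030 covers GBK, which in turn covers GB 2312). The helper `to_utf8` and the `SUPERSETS` table are hypothetical names for this illustration, written in Python 3, and the table is an assumption rather than anything Scrapy ships.

```python
# Hypothetical helper; the superset table is an assumption for illustration.
SUPERSETS = {"gb2312": "gb18030", "gbk": "gb18030"}


def to_utf8(body, declared):
    """Re-encode `body` (bytes) as UTF-8, preferring a superset of the declared charset."""
    encoding = SUPERSETS.get(declared.lower(), declared)
    # 'replace' keeps the document decodable even if a few bytes are still invalid
    text = body.decode(encoding, "replace")
    return text.encode("utf-8")


# Sample bytes standing in for a page that declares GB2312 but uses a GBK-only character.
page = "<div id='param-more'>龍</div>".encode("gbk")
fixed = to_utf8(page, "GB2312")
print(fixed.decode("utf-8"))
```

The resulting UTF-8 bytes could then be handed to whichever parser is in use. Whether this is enough for the page in question is untested.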

@property
def encoding(self):
    return self._get_encoding(infer=True)

def _get_encoding(self, infer=False):
    enc = self._declared_encoding()
    if enc and not encoding_exists(enc):
        enc = None
    if not enc and infer:
        enc = self._body_inferred_encoding()
    if not enc:
        enc = self._DEFAULT_ENCODING
    return resolve_encoding(enc)

def _declared_encoding(self):
    return self._encoding or self._headers_encoding() \
        or self._body_declared_encoding()

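The important thing to note here is that both _headers_encoding and _encoding will ultimately reflect the encoding declared in the meta tag of the document head, rather than actually using something like UnicodeDammit or chardet to determine the document's encoding. As a result, there will be cases where a document contains characters that are invalid for the encoding it declares. I believe Scrapy ignores this, which ultimately leads to the problem we are seeing here.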
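The fallback order in that snippet can be mirrored in a small standalone sketch, which may make it easier to see why inference never runs once a valid declared encoding exists. This is a simplified Python 3 imitation, not Scrapy's code: `resolve`, its parameters, and the `DEFAULT_ENCODING` value are hypothetical stand-ins.

```python
import codecs

DEFAULT_ENCODING = "ascii"   # placeholder default, assumed for this sketch


def encoding_exists(name):
    """True if Python knows a codec with this name."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False


def resolve(explicit=None, header=None, meta=None, infer=None):
    """Pick an encoding: the first declared value wins; inference is only a fallback."""
    declared = explicit or header or meta
    if declared and not encoding_exists(declared):
        declared = None          # unknown codec names are discarded
    if not declared and infer is not None:
        declared = infer()       # only consulted when nothing valid was declared
    return declared or DEFAULT_ENCODING


chosen = resolve(meta="gb2312", infer=lambda: "gb18030")
unknown = resolve(meta="no-such-charset", infer=lambda: "utf-8")
nothing = resolve()
print(chosen, unknown, nothing)
```

In the first call the declared value is a valid codec name, so the inferred value is never used, even if the body's bytes do not actually fit the declared charset.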

score 0 · Accepted Answer

'mod_param ' != 'mod_param'

The class is not equal to "mod_param", but it does contain "mod_param". Note the trailing whitespace:

stav@maia:~$ scrapy shell http://detail.zol.com.cn/series/268/10227_1.html
2012-08-23 09:17:28-0500 [scrapy] INFO: Scrapy 0.15.1 started (bot: scrapybot)
Python 2.7.3 (default, Aug  1 2012, 05:14:39)
IPython 0.12.1 -- An enhanced Interactive Python.

In [1]: hxs.select("//div[@class='mod_param']")
Out[1]: []

In [2]: hxs.select("//div[contains(@class,'mod_param')]")
Out[2]: [<HtmlXPathSelector xpath="//div[contains(@class,'mod_param')]" data=u'<div id="param-more" class="mod_param  "'>]

In [3]: len(hxs.select("//div[contains(@class,'mod_param')]").extract())
Out[3]: 1

In [4]: len(hxs.select("//div[contains(@class,'mod_param')]").extract()[0])
Out[4]: 5372
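The same effect can be reproduced outside Scrapy. The sketch below uses Python 3's standard-library ElementTree on a made-up two-div snippet (the second div and the helper `has_class` are my own additions for illustration). Note that `contains(@class, 'mod_param')` would also match a class such as `mod_param_x`. In XPath 1.0 the usual way to match a whole class token is `contains(concat(' ', normalize-space(@class), ' '), ' mod_param ')`, and the token check below does the equivalent in plain Python.

```python
import xml.etree.ElementTree as ET

# Made-up snippet: the first div copies the attribute from the page,
# the second is a hypothetical near-miss class name.
snippet = (
    '<root>'
    '<div id="param-more" class="mod_param  "/>'
    '<div id="other" class="mod_param_x"/>'
    '</root>'
)
root = ET.fromstring(snippet)

# Exact string comparison fails because of the trailing spaces.
exact = root.findall(".//div[@class='mod_param']")


def has_class(element, name):
    """True if `name` is one of the whitespace-separated class tokens."""
    return name in element.get("class", "").split()


by_token = [el.get("id") for el in root.iter("div") if has_class(el, "mod_param")]

print(exact)
print(by_token)
```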
