python - Safe way to extract items in Scrapy

Question

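What is the best safe way to extract item information from a page? I mean, sometimes an item may be missing from the page, and you end up breaking the crawler.

Look at this example: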

    for cotacao in tabela_cotacoes:
        citem = CotacaoItem()
        citem['name'] = cotacao.select("td[4]/text()").extract()[0]
        citem['symbol'] = cotacao.select("td/a/b/text()").extract()[0]
        citem['current'] = cotacao.select("td[6]/text()").extract()[0]
        citem['last_neg'] = cotacao.select("td[7]/text()").extract()[0]
        citem['oscillation'] = cotacao.select("td[8]/text()").extract()[0]
        citem['openning'] = cotacao.select("td[9]/text()").extract()[0]
        citem['close'] = cotacao.select("td[10]/text()").extract()[0]
        citem['maximum'] = cotacao.select("td[11]/text()").extract()[0]
        citem['minimun'] = cotacao.select("td[12]/text()").extract()[0]
        citem['volume'] = cotacao.select("td[13]/text()").extract()[0]

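If some item is missing from the page, .extract() returns [] and indexing it with [0] raises an exception (IndexError: list index out of range).

So the question is: what is the best way to handle this?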
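The failure can be reproduced without Scrapy, because `.extract()` returns a plain Python list. In this minimal sketch, `extracted` is an assumed stand-in for the result of a selector that matched nothing:

```python
# Stand-in for cotacao.select("td[4]/text()").extract() when the cell is missing.
extracted = []

try:
    name = extracted[0]  # indexing an empty list raises IndexError
except IndexError as exc:
    name = None
    error_message = str(exc)
```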

score 2 · Accepted Answer

Write a small helper function:

    def extractor(xpathselector, selector):
        """
        Helper function that extracts info from an xpathselector object
        using the selector constraints.
        """
        val = xpathselector.select(selector).extract()
        return val[0] if val else None

And use it like this:

    citem['name'] = extractor(cotacao, "td[4]/text()")

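Return an appropriate value to indicate that a citem field was not found. My code returns None; change it if necessary (for example, return '' if that makes more sense).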
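One way to make the fallback value adjustable per field is an optional `default` argument. The sketch below is only an illustration: `FakeRow` and `FakeResult` are hypothetical stand-ins for Scrapy's selector objects so the example runs without Scrapy, and the `default` parameter is an addition that the helper above does not have.

```python
class FakeResult:
    """Hypothetical stand-in for the object returned by select()."""

    def __init__(self, values):
        self._values = values

    def extract(self):
        # Like Scrapy, return a list of matched strings (possibly empty).
        return list(self._values)


class FakeRow:
    """Hypothetical stand-in for a table-row selector such as cotacao."""

    def __init__(self, cells):
        self._cells = cells  # maps an XPath string to its matched texts

    def select(self, xpath):
        return FakeResult(self._cells.get(xpath, []))


def extractor(xpathselector, selector, default=None):
    """Return the first extracted value, or `default` if nothing matched."""
    val = xpathselector.select(selector).extract()
    return val[0] if val else default


row = FakeRow({"td[4]/text()": ["PETR4"]})
name = extractor(row, "td[4]/text()")                  # cell present
volume = extractor(row, "td[13]/text()", default="")   # cell missing, use ''
current = extractor(row, "td[6]/text()")               # cell missing, use None
```

Newer Scrapy releases also offer `.get(default=...)` and `.extract_first(default=...)` on selector results, which serve the same purpose as this helper.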
