python - Scrapy: creating multiple items for option selections on a scraped page

Question

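So, I have a spider that crawls a page and collects data for each item it encounters. If an item has no options, it just sends the item down the pipeline. If it has options, it assembles a list of option lists and sends a request for each unique combination of options (the response comes back as an HTML fragment, so I treat it as XML). For each option combination, it should extract the item's price and send the item down the pipeline. Only, it doesn't.

Here is some code: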

#spider code above here that does all the normal stuff, 
#plus gets and organize all options.  Then this:

for optLists in uberList:
  queryString = '?func=Options&currentOption=1&Modal=False&AddUniqueID=False&sku=' + sku + '&option1=' + optLists[0] + '&option2=' + optLists[1] + '&option3=' + optLists[2]
  reqURL = urljoin(baseAjaxURL, queryString)
  req = Request(url=reqURL,
                callback=self.parse_ajax,
                meta = {'item' : item},
               )
  self.log('simplified item: ' + reqURL, level=log.DEBUG)
  yield req
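
For readers following along, here is a minimal Python 3 sketch of building one request URL per option combination. It assumes the option lists are expanded with itertools.product and that the query is encoded with urlencode instead of string concatenation; the function name build_option_urls and the example.com URL are hypothetical and not part of the original spider.

```python
from itertools import product
from urllib.parse import urlencode, urljoin


def build_option_urls(base_ajax_url, sku, option_groups):
    """Return one AJAX URL per unique combination of option values.

    option_groups is a list of lists, one inner list per option
    (hypothetical shape; the original post does not show it).
    """
    urls = []
    for combo in product(*option_groups):
        # Fixed parameters copied from the query string in the question.
        params = [
            ("func", "Options"),
            ("currentOption", "1"),
            ("Modal", "False"),
            ("AddUniqueID", "False"),
            ("sku", sku),
        ]
        # option1, option2, ... in the order the groups were given.
        params += [("option%d" % i, value)
                   for i, value in enumerate(combo, start=1)]
        # urlencode escapes spaces and other unsafe characters.
        urls.append(urljoin(base_ajax_url, "?" + urlencode(params)))
    return urls


if __name__ == "__main__":
    for url in build_option_urls("http://example.com/ajax/options.aspx",
                                 "ABC 1", [["S", "M"], ["Red"], ["x"]]):
        print(url)
```

Encoding the parameters this way avoids broken URLs when a SKU or option value contains spaces or ampersands.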

And the callback:

def parse_ajax(self, response):
  print 'parsing ajax'
  xxs = XmlXPathSelector(response)
  item = response.meta['item']
  item['price'] = xxs.select("normalize-space(substring-before(substring-after(.//skuMainPrice/text(), 'ppPrice:'),'/span'))").extract()[0]
  print 'parse_ajax price: ', item['price']
  return item
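
To make the XPath expression in the callback easier to read, here is a rough plain-Python equivalent of its string handling: take what follows 'ppPrice:', cut it at '/span', and collapse whitespace. The helper name extract_price and the sample text are hypothetical; the real content of skuMainPrice is not shown in the post.

```python
def extract_price(text):
    """Mimic normalize-space(substring-before(substring-after(text,
    'ppPrice:'), '/span')) using plain string operations."""
    # substring-after: empty string when the marker is missing.
    after = text.partition("ppPrice:")[2]
    # substring-before: also empty when the marker is missing.
    before = after.partition("/span")[0] if "/span" in after else ""
    # normalize-space: trim and collapse runs of whitespace.
    return " ".join(before.split())


if __name__ == "__main__":
    # Made-up sample; prints $19.99
    print(extract_price("label ppPrice:   $19.99   /span trailing"))
```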

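The for loop in the first method fires correctly, once per set of options. The request raises an error if the callback points to a method that doesn't exist (which is good), but the print statements in the callback never fire, and the items never make it down the pipeline.

Any suggestions about what I'm doing wrong, or how to do it right, would be much appreciated.

Thanks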

score 0 · Accepted Answer

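It took some time and a bit of desperation, but I figured this out. I am using a CrawlSpider for this spider, and I had to add the AJAX URL to the "allow" rules. Without that, the URL was neither followed nor parsed.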
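
As a rough illustration of how an allow list works: the allow argument of a Scrapy link extractor takes regular expressions, and a URL is extracted only if it matches at least one of them (an empty list matches everything). The sketch below imitates that check with the standard re module. It is not Scrapy's own code, and the patterns and URLs are made up, since the post does not show the spider's actual rules.

```python
import re

# Hypothetical patterns; the second one is the kind of entry the
# answer describes adding so the AJAX URL gets through.
ALLOW_PATTERNS = [r"/products/", r"func=Options"]


def is_allowed(url, patterns=ALLOW_PATTERNS):
    """Return True if the URL matches any allow pattern.

    An empty pattern list allows every URL.
    """
    if not patterns:
        return True
    return any(re.search(pattern, url) for pattern in patterns)


if __name__ == "__main__":
    ajax_url = "http://example.com/ajax/options.aspx?func=Options&sku=1"
    print(is_allowed(ajax_url))                    # allowed by the new pattern
    print(is_allowed(ajax_url, [r"/products/"]))   # rejected without it
```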
