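I am scraping a website that contains many URLs from which I need to fetch data. I use XPath to extract all the hrefs (URLs) and save them in a list. I then loop over this list and yield a request for each URL. Below is my spider code: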
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request


class ExampledotcomSpider(BaseSpider):
    name = "exampledotcom"
    allowed_domains = ["www.example.com"]
    start_urls = ["http://www.example.com/movies/city.html"]

    def parse(self, response):
        # Collect the href of every cinema link on the listing page.
        hxs = HtmlXPathSelector(response)
        cinema_links = hxs.select('//div[@class="contentArea"]/div[@class="leftNav"]/div[@class="cinema"]/div[@class="rc"]/div[@class="il"]/span[@class="bt"]/a/@href').extract()
        # Request each cinema page and parse it in parse_cinema.
        for cinema_hall in cinema_links:
            yield Request(cinema_hall, callback=self.parse_cinema)

    def parse_cinema(self, response):
        hxs = HtmlXPathSelector(response)
        cinemahall_name = hxs.select('//div[@class="companyDetails"]/div[@itemscope=""]/span[@class="srchrslt"]/h1/span/text()').extract()
        ........
Here, for example, I have 60 URLs in the list, and about 37 of them are not downloaded. For these, the following error message appears:
2012-06-06 14:00:12+0530 [exampledotcom] ERROR: Error downloading <GET http://www.example.com/city/Cinema-Hall-70mm-%3Cnear%3E-place/040PXX40-XX40-000147377847-A6M3>: Error -3 while decompressing: invalid stored block lengths
2012-06-06 14:00:12+0530 [exampledotcom] ERROR: Error downloading <GET http://www.example.com/city/Cinema-Hall-35mm-%3Cnear%3E-place/040PXX40-XX40-000164969686-H9C5>: Error -3 while decompressing: invalid stored block lengths
Scrapy downloads only some of the URLs. For the rest, I don't understand what is happening or what is wrong with my code.
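As far as I can tell, the wording "Error -3 while decompressing: invalid stored block lengths" comes from Python's zlib module, so the failure may be happening while a compressed response body is decoded rather than in my XPath code. The sketch below is only an illustration of that error class outside Scrapy: the helper name `try_raw_deflate` and the five placeholder bytes are my own assumptions, not data from the site, and the exact message prefix can differ slightly between Python versions.

```python
import zlib


def try_raw_deflate(data):
    """Try to inflate `data` as a raw DEFLATE stream; return (ok, message)."""
    try:
        zlib.decompress(data, -15)  # negative wbits means: no zlib/gzip header
        return True, ""
    except zlib.error as exc:
        return False, str(exc)


# Placeholder bytes, not a real response body. The first byte (0x00) announces
# a "stored" block; the next four bytes should hold a length and its one's
# complement, but here they do not match, so inflation is rejected.
bogus = b"\x00\x01\x02\x03\x04"
ok, message = try_raw_deflate(bogus)
print(message)
```

If that reading is right, the URLs that fail may be returning bodies whose bytes do not match the compression format announced by the server, but I have not confirmed this.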
Can anyone suggest how I can get rid of these errors?