I've just started using scrapy-splash to retrieve the number of bookings from opentable.com. The following works fine in the shell:
$ scrapy shell 'http://localhost:8050/render.html?url=https://www.opentable.com/new-york-restaurant-listings&timeout=10&wait=0.5'
...
In [1]: response.css('div.booking::text').extract()
Out[1]:
['Booked 59 times today',
'Booked 20 times today',
'Booked 17 times today',
'Booked 29 times today',
'Booked 29 times today',
...
]
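However, this simple spider returns an empty list:
import scrapy
from scrapy_splash import SplashRequest


class TableSpider(scrapy.Spider):
    name = 'opentable'
    start_urls = ['https://www.opentable.com/new-york-restaurant-listings']

    def start_requests(self):
        # Route each start URL through Splash's render.html endpoint.
        for url in self.start_urls:
            yield SplashRequest(url=url,
                                callback=self.parse,
                                endpoint='render.html',
                                args={'wait': 1.5},
                                )

    def parse(self, response):
        # Same selector that works in the shell session above.
        yield {'bookings': response.css('div.booking::text').extract()}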
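For context, the spider relies on scrapy-splash being enabled in the project's settings.py. The question does not show that file, so the following is only a sketch of the configuration described in the scrapy-splash README, with SPLASH_URL assumed to match the localhost:8050 address used in the shell command:

```python
# settings.py (sketch; the actual project settings are not shown in the question)

# Assumption: Splash is reachable at the same address used in the shell test.
SPLASH_URL = 'http://localhost:8050'

# Downloader middlewares that send SplashRequest objects through Splash.
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Spider middleware that avoids storing duplicate Splash arguments.
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Splash-aware duplicate filter and HTTP cache storage.
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```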
When invoked:
$ scrapy crawl opentable
...
DEBUG: Scraped from <200 https://www.opentable.com/new-york-restaurant-listings>
{'bookings': []}
I have tried, without success, running Splash with
docker run -it -p 8050:8050 scrapinghub/splash --disable-private-mode
and increasing the wait time.
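One way to narrow this down might be to have the spider request exactly the render.html URL that works in the shell. The sketch below builds that URL with the standard library. The helper name splash_render_url and the SPLASH_BASE constant are hypothetical, and the defaults simply mirror the wait=0.5 and timeout=10 values from the shell command:

```python
from urllib.parse import urlencode

# Assumption: Splash listens on the address used in the shell command.
SPLASH_BASE = 'http://localhost:8050'


def splash_render_url(target, wait=0.5, timeout=10, base=SPLASH_BASE):
    """Hypothetical helper: build a Splash render.html URL for `target`."""
    # urlencode percent-escapes the target URL so it survives as a query value.
    query = urlencode({'url': target, 'timeout': timeout, 'wait': wait})
    return f'{base}/render.html?{query}'


url = splash_render_url('https://www.opentable.com/new-york-restaurant-listings')
print(url)
```

Yielding a plain scrapy.Request for this URL from start_requests would bypass the scrapy-splash middleware. If bookings come back that way but not through SplashRequest, the difference would presumably lie in the scrapy-splash configuration or in the request arguments rather than in the selector.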