python - scrapy-splash: selector for dynamic content works in the shell but not in the spider

Question

I have just started using scrapy-splash to retrieve the number of bookings from opentable.com. The following works fine in the shell:

$ scrapy shell 'http://localhost:8050/render.html?url=https://www.opentable.com/new-york-restaurant-listings&timeout=10&wait=0.5'    
...

In [1]: response.css('div.booking::text').extract()
Out[1]: 
['Booked 59 times today',
 'Booked 20 times today',
 'Booked 17 times today',
 'Booked 29 times today',
 'Booked 29 times today',
  ... 
]

However, this simple spider returns an empty list:

import scrapy
from scrapy_splash import SplashRequest


class TableSpider(scrapy.Spider):
    name = 'opentable'
    start_urls = ['https://www.opentable.com/new-york-restaurant-listings']

    def start_requests(self):
        # Ask Splash to render each start URL before it reaches parse()
        for url in self.start_urls:
            yield SplashRequest(url=url,
                                callback=self.parse,
                                endpoint='render.html',
                                args={'wait': 1.5},
                                )

    def parse(self, response):
        yield {'bookings': response.css('div.booking::text').extract()}

When run:

$ scrapy crawl opentable
...
DEBUG: Scraped from <200 https://www.opentable.com/new-york-restaurant-listings>
{'bookings': []}

I have already tried, without success, starting Splash with private mode disabled:

docker run -it -p 8050:8050 scrapinghub/splash --disable-private-mode

and increasing the wait time.

score 3 · Accepted Answer

I think your problem is in the middlewares. First, you need to add some settings:

# settings.py

# uncomment `DOWNLOADER_MIDDLEWARES` and add these entries to it
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# url of splash server
SPLASH_URL = 'http://localhost:8050'

# and some splash variables
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
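
Since an empty result gives no hint that one of these entries is missing, a small check can help. The following is only a sketch: `REQUIRED_MIDDLEWARES` and `missing_splash_settings` are hypothetical names, not part of scrapy-splash, and the function expects a plain dict that mirrors your settings.py.

```python
# Hypothetical helper, not part of scrapy-splash: reports which of the
# settings listed above are absent from a plain dict of settings.
REQUIRED_MIDDLEWARES = (
    'scrapy_splash.SplashCookiesMiddleware',
    'scrapy_splash.SplashMiddleware',
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
)


def missing_splash_settings(settings):
    """Return the names of the Splash-related entries that are not configured."""
    missing = []
    if not settings.get('SPLASH_URL'):
        missing.append('SPLASH_URL')
    if settings.get('DUPEFILTER_CLASS') != 'scrapy_splash.SplashAwareDupeFilter':
        missing.append('DUPEFILTER_CLASS')
    middlewares = settings.get('DOWNLOADER_MIDDLEWARES') or {}
    for path in REQUIRED_MIDDLEWARES:
        if path not in middlewares:
            missing.append(path)
    return missing


# A project that only sets SPLASH_URL is still missing everything else.
print(missing_splash_settings({'SPLASH_URL': 'http://localhost:8050'}))
```

An empty list means every entry from the snippet above is present; it does not verify that the Splash server itself is reachable.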

Now run Docker:

sudo docker run -it -p 8050:8050 scrapinghub/splash --disable-private-mode

After following all of these steps, I get:

scrapy crawl opentable

...

2018-06-23 11:23:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.opentable.com/new-york-restaurant-listings via http://localhost:8050/render.html> (referer: None)
2018-06-23 11:23:54 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.opentable.com/new-york-restaurant-listings>
{'bookings': [
    'Booked 44 times today',
    'Booked 24 times today',
    'and many others Booked values'
]}
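
The "via http://localhost:8050/render.html" part of the log line is the detail to look for: it indicates the request was routed through Splash, which is what the shell command in the question does by hand. As a rough illustration, the sketch below builds that same kind of URL; `splash_render_url` is a hypothetical helper name, and the default Splash address is taken from `SPLASH_URL` above.

```python
from urllib.parse import urlencode


def splash_render_url(target_url, splash_url='http://localhost:8050', **args):
    """Build a render.html URL like the one passed to `scrapy shell`."""
    # urlencode percent-encodes the target URL so it survives as a single
    # query parameter; extra keyword arguments become Splash arguments.
    query = urlencode({'url': target_url, **args})
    return f"{splash_url.rstrip('/')}/render.html?{query}"


print(splash_render_url(
    'https://www.opentable.com/new-york-restaurant-listings',
    timeout=10,
    wait=0.5,
))
```

If the log shows the site URL without the "via ..." suffix, the request most likely went straight to the site and the JavaScript-generated `div.booking` elements were never rendered.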

score 0 · Accepted Answer

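This does not work because this part of the page is generated by JavaScript.

There are several possible solutions:

1) Use Selenium.

2) Look at the API the page calls. If you request the URL <GET https://www.opentable.com/injector/stats/v1/restaurants/<restaurant_id>/reservations>, you get the current number of reservations for that particular restaurant (restaurant_id).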
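
For option 2, the per-restaurant URLs could be built as in the sketch below. The endpoint path is the one quoted in this answer and may have changed since; `reservations_url` is a hypothetical helper, the IDs are placeholders, and the response format is not shown here, so parsing is left out.

```python
from urllib.parse import quote

# Endpoint as quoted in the answer; treat it as an assumption that it still exists.
API_TEMPLATE = (
    'https://www.opentable.com/injector/stats/v1/restaurants/'
    '{restaurant_id}/reservations'
)


def reservations_url(restaurant_id):
    """Return the reservations-stats URL for one restaurant ID."""
    # quote() keeps an unexpected character in the ID from breaking the path.
    return API_TEMPLATE.format(restaurant_id=quote(str(restaurant_id), safe=''))


# Placeholder IDs; real ones would have to be collected from the listing page.
for rid in (1234, 5678):
    print(reservations_url(rid))
```

Each of these URLs can then be yielded as an ordinary `scrapy.Request`, with no Splash rendering, provided the endpoint responds to plain HTTP requests.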
