python - Scrapy does not follow the full link

Question

scrapy shell "https://www.winemag.com/wine-ratings/2/"
response

But I get:

2019-02-19 14:16:35 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-02-19 14:16:35 [scrapy.core.engine] INFO: Spider opened
2019-02-19 14:16:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.winemag.com/robots.txt> (referer: None)
2019-02-19 14:16:35 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.winemag.com/wine-ratings> from <GET https://www.winemag.com/wine-ratings/2/>
2019-02-19 14:16:35 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.winemag.com/wine-ratings> from <GET http://www.winemag.com/wine-ratings>
2019-02-19 14:16:35 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.winemag.com/wine-ratings/> from <GET https://www.winemag.com/wine-ratings>
2019-02-19 14:16:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.winemag.com/wine-ratings/> (referer: None)

<200 https://www.winemag.com/wine-ratings/>

I don't know why it doesn't fetch the full link. Could someone please give me some advice?

score 1 · Accepted Answer

It seems winemag redirects crawlers back to the first page of its wine ratings:

⇾ curl -I 'https://www.winemag.com/wine-ratings/2/'
HTTP/2 301
[...]
location: http://www.winemag.com/wine-ratings
[...]

So this looks like expected behavior from Scrapy: it is following the redirects returned by the website you're accessing.

score 0 · Accepted Answer

I found the answer. I had to specify USER_AGENT in the settings file.
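
For illustration, here is a minimal sketch of what that setting might look like in the project's settings.py. USER_AGENT is the standard Scrapy setting name; the browser-like string itself is only an assumed example, not the value used in this answer, and any value the site accepts should work.

```python
# settings.py (sketch)
# USER_AGENT is a standard Scrapy setting. The string below is an assumed,
# illustrative browser-like value, not the one the original poster used.
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36"
)
```

For a one-off shell session, the same setting can probably be passed on the command line instead, for example: scrapy shell -s USER_AGENT="Mozilla/5.0 ..." "https://www.winemag.com/wine-ratings/2/" (the "..." stands for whatever string you choose).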

Answered 2019-02-19T19:49:37.943
