I am new to the Scrapy framework and am currently using it to extract articles from multiple 'Health & Wellness' websites. For some requests, Scrapy is redirected to the homepage (this behavior is not observed in a browser). Below is an example:

Command:
scrapy shell "http://www.bornfitness.com/blog/page/10/"

Result:
2015-06-19 21:32:15+0530 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-06-19 21:32:15+0530 [default] INFO: Spider opened
2015-06-19 21:32:15+0530 [default] DEBUG: Redirecting (301) to <GET http://www.bornfitness.com/> from <GET http://www.bornfitness.com/blog/page/10/>
2015-06-19 21:32:16+0530 [default] DEBUG: Crawled (200) <GET http://www.bornfitness.com/> (referer: None)

Note that the page number in the URL (10) is a two-digit number. I don't see this issue for URLs with a single-digit page number (8, for example).

Result:
2015-06-19 21:43:15+0530 [default] INFO: Spider opened
2015-06-19 21:43:16+0530 [default] DEBUG: Crawled (200) <GET http://www.bornfitness.com/blog/page/8/> (referer: None)

1 Answer

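When you have trouble replicating browser behavior with Scrapy, you generally want to look at what differs between how your browser communicates with the website and how your spider does. Remember that a website is (almost always) designed to interact with web browsers, not to be friendly to web crawlers.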

In your case, if you look at the headers sent with your Scrapy request, you should see something like this:

In [1]: request.headers
Out[1]:
{'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
 'Accept-Encoding': 'gzip,deflate',
 'Accept-Language': 'en',
 'User-Agent': 'Scrapy/0.24.6 (+http://scrapy.org)'}

If you examine the headers your web browser sends when requesting the same page, you will probably see something like this:

**Request Headers**

GET /blog/page/10/ HTTP/1.1    
Host: www.bornfitness.com    
Connection: keep-alive    
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.124 Safari/537.36
DNT: 1    
Referer: http://www.bornfitness.com/blog/page/11/
Accept-Encoding: gzip, deflate, sdch    
Accept-Language: en-US,en;q=0.8
Cookie: fealty_segment_registeronce=1; ... ... ...
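
To make the comparison concrete, here is a minimal sketch that diffs the two header sets shown above using only the standard library. The helper name `diff_headers` and both dictionary names are hypothetical; the values are copied from the two dumps, and the Cookie header is left out because it is truncated in the browser output.

```python
# Headers Scrapy sent (copied from the request.headers dump above).
SCRAPY_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip,deflate",
    "Accept-Language": "en",
    "User-Agent": "Scrapy/0.24.6 (+http://scrapy.org)",
}

# Headers the browser sent (copied from the browser dump above; Cookie omitted
# because its value is truncated there).
BROWSER_HEADERS = {
    "Host": "www.bornfitness.com",
    "Connection": "keep-alive",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/43.0.2357.124 Safari/537.36"
    ),
    "DNT": "1",
    "Referer": "http://www.bornfitness.com/blog/page/11/",
    "Accept-Encoding": "gzip, deflate, sdch",
    "Accept-Language": "en-US,en;q=0.8",
}


def diff_headers(spider, browser):
    """Map each browser header that the spider lacks or sends differently
    to a (spider_value, browser_value) pair; spider_value is None if missing."""
    differences = {}
    for name, browser_value in browser.items():
        spider_value = spider.get(name)
        if spider_value != browser_value:
            differences[name] = (spider_value, browser_value)
    return differences


if __name__ == "__main__":
    found = diff_headers(SCRAPY_HEADERS, BROWSER_HEADERS)
    for name, (spider_value, browser_value) in sorted(found.items()):
        print(f"{name}:\n  spider:  {spider_value}\n  browser: {browser_value}")
```

Any header in that list is a candidate for what the site reacts to. The User-Agent is usually the first one worth changing, since it identifies the client as Scrapy outright.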

Try changing the User-Agent of your request. That should let you get past the redirect.
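
One way to do that, as a sketch that assumes the spider lives in a standard Scrapy project with a `settings.py`: `USER_AGENT` and `DEFAULT_REQUEST_HEADERS` are Scrapy settings, and the values below are simply copied from the browser request shown above. Whether this site keys on the User-Agent alone is not confirmed, so the extra headers are optional.

```python
# settings.py (project-level Scrapy settings)

# Identify as the browser from the request dump instead of "Scrapy/0.24.6".
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/43.0.2357.124 Safari/537.36"
)

# Optional: also mirror the browser's Accept headers. This is an assumption;
# the User-Agent change alone may already be enough.
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.8",
}
```

For a quick check without touching the project, the setting can also be overridden for a single shell session with the `-s` option, for example `scrapy shell -s USER_AGENT="Mozilla/5.0 ..." "http://www.bornfitness.com/blog/page/10/"`, with the full browser string in place of the shortened one.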

Answered 2015-06-19T19:25:57.607