python - How to fix CrawlSpider redirects?

Question

I am trying to write a CrawlSpider for this site: http://www.shams-stores.com/shop/index.php and this is my code:

import urlparse
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from project.items import Product
import re



class ShamsStoresSpider(CrawlSpider):
    name = "shamsstores2"
    domain_name = "shams-stores.com"
    CONCURRENT_REQUESTS = 1

    start_urls = ["http://www.shams-stores.com/shop/index.php"]

    rules = (
            #categories
            Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@id="categories_block_left"]/div/ul/li/a'), unique=False), callback='process', follow=True),
            )

    def process(self,response):
        print response

This is the output I get when I run scrapy crawl shamsstores2:

2013-11-05 22:56:36+0200 [scrapy] DEBUG: Web service listening on 0.0.0.0:6081
2013-11-05 22:56:41+0200 [shamsstores2] DEBUG: Crawled (200) <GET http://www.shams-stores.com/shop/index.php> (referer: None)
2013-11-05 22:56:42+0200 [shamsstores2] DEBUG: Redirecting (301) to <GET http://www.shams-stores.com/shop/index.php?id_category=14&controller=category&id_lang=1> from <GET http://www.shams-stores.com/shop/index.php?controller=category&id_category=14&id_lang=1>
2013-11-05 22:56:42+0200 [shamsstores2] DEBUG: Filtered duplicate request: <GET http://www.shams-stores.com/shop/index.php?id_category=14&controller=category&id_lang=1> - no more duplicates will be shown (see DUPEFILTER_CLASS)
2013-11-05 22:56:43+0200 [shamsstores2] DEBUG: Redirecting (301) to <GET http://www.shams-stores.com/shop/index.php?id_category=13&controller=category&id_lang=1> from <GET http://www.shams-stores.com/shop/index.php?controller=category&id_category=13&id_lang=1>
2013-11-05 22:56:43+0200 [shamsstores2] DEBUG: Redirecting (301) to <GET http://www.shams-stores.com/shop/index.php?id_category=12&controller=category&id_lang=1> from <GET http://www.shams-stores.com/shop/index.php?controller=category&id_category=12&id_lang=1>
2013-11-05 22:56:43+0200 [shamsstores2] DEBUG: Redirecting (301) to <GET http://www.shams-stores.com/shop/index.php?id_category=10&controller=category&id_lang=1> from <GET http://www.shams-stores.com/shop/index.php?controller=category&id_category=10&id_lang=1>
2013-11-05 22:56:43+0200 [shamsstores2] DEBUG: Redirecting (301) to <GET http://www.shams-stores.com/shop/index.php?id_category=9&controller=category&id_lang=1> from <GET http://www.shams-stores.com/shop/index.php?controller=category&id_category=9&id_lang=1>
2013-11-05 22:56:44+0200 [shamsstores2] DEBUG: Redirecting (301) to <GET http://www.shams-stores.com/shop/index.php?id_category=8&controller=category&id_lang=1> from <GET http://www.shams-stores.com/shop/index.php?controller=category&id_category=8&id_lang=1>
2013-11-05 22:56:44+0200 [shamsstores2] DEBUG: Redirecting (301) to <GET http://www.shams-stores.com/shop/index.php?id_category=7&controller=category&id_lang=1> from <GET http://www.shams-stores.com/shop/index.php?controller=category&id_category=7&id_lang=1>
2013-11-05 22:56:44+0200 [shamsstores2] DEBUG: Redirecting (301) to <GET http://www.shams-stores.com/shop/index.php?id_category=6&controller=category&id_lang=1> from <GET http://www.shams-stores.com/shop/index.php?controller=category&id_category=6&id_lang=1>
2013-11-05 22:56:44+0200 [shamsstores2] INFO: Closing spider (finished)

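It follows the links extracted by the rule, those links redirect to other URLs, and then it stops without ever executing the callback function, process. I can work around this by using BaseSpider, but can I fix it and still use CrawlSpider?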

score 1 · Accepted Answer

The problem is not the redirect. Scrapy goes to the alternate location the server suggests and fetches the page from there.

The problem is your "restrict_xpaths=('//div[@id="categories_block_left"]/div/ul/li/a')": on every page visited it extracts the same set of 8 URLs, and they are filtered out as duplicates.

P.S. The only thing I don't understand is why Scrapy shows the message for only one page. I will update if I find the reason.

Edit: see github.com/scrapy/scrapy/blob/master/scrapy/utils/request.py

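Basically, the original request is queued first and its fingerprint is stored. Next, the request for the redirect URL is generated, and when it is checked for duplication by comparing fingerprints, Scrapy finds the same fingerprint. It finds the same fingerprint because, as in the example in the file referenced above, Scrapy treats the redirect URL and the original URL as identical when their query strings differ only in parameter order.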
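The effect described above can be reproduced with a minimal sketch. This is not Scrapy's actual implementation: canonicalize and fingerprint are hypothetical helpers written with the Python 3 standard library only (the spider in the question is Python 2), and the sketch assumes the fingerprint depends only on the method and on the URL with its query parameters sorted.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def canonicalize(url):
    """Return the URL with its query parameters sorted (hypothetical helper)."""
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


def fingerprint(method, url):
    """Hash the method together with the canonical URL (assumed scheme)."""
    data = (method.upper() + " " + canonicalize(url)).encode("utf-8")
    return hashlib.sha1(data).hexdigest()


# The two URLs from the first redirect in the log above.
original = ("http://www.shams-stores.com/shop/index.php"
            "?controller=category&id_category=14&id_lang=1")
redirected = ("http://www.shams-stores.com/shop/index.php"
              "?id_category=14&controller=category&id_lang=1")

# Both URLs collapse to the same canonical form, so a duplicate filter
# keyed on this fingerprint drops the redirect target.
same = fingerprint("GET", original) == fingerprint("GET", redirected)
```

Under these assumptions same is true, which matches the "Filtered duplicate request" line in the log.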

An "exploit" workaround

    rules = (
        # categories
        Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@id="categories_block_left"]/div/ul/li/a')),
             callback='process', process_links='appendDummy', follow=True),
    )

    def process(self, response):
        print 'response is called'
        print response

    def appendDummy(self, links):
        # Append a throwaway marker to every extracted link so that the
        # original request no longer shares a fingerprint with the URL
        # the server redirects to.
        for link in links:
            link.url = link.url + "?dummy=true"
        return links

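Because the server drops the appended dummy from the URL it redirects to, we are in effect tricking the fingerprinting into treating the original request and the redirected request as different requests.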

Another solution is to reorder the query parameters yourself in the process_links callback (appendDummy in the example).
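A minimal sketch of that reordering, assuming the server's preferred order is id_category, controller, id_lang (read off the redirect targets in the log, not confirmed by the site). reorder_query and reorder_links are hypothetical names, and the code uses the Python 3 standard library, whereas the spider above is Python 2.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumption: parameter order observed in the redirect targets in the log.
SERVER_ORDER = ("id_category", "controller", "id_lang")


def reorder_query(url, order=SERVER_ORDER):
    """Rewrite the query string so known parameters follow `order`."""
    parts = urlsplit(url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    rank = {name: i for i, name in enumerate(order)}
    # Stable sort: unknown parameters keep their relative order at the end.
    params.sort(key=lambda kv: rank.get(kv[0], len(order)))
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(params), parts.fragment))


def reorder_links(links):
    """Shape of a process_links callback: takes and returns link objects."""
    for link in links:
        link.url = reorder_query(link.url)
    return links
```

If the extracted links already use the order the server expects, the server has no reason to issue the 301, so the duplicate filter never sees a second request for the same page.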

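Other solutions might be to override the request fingerprinting so that it distinguishes these kinds of URLs (I think that would be wrong in the general case, but may be fine here), or to use a simple fingerprint based on the URL alone (again, only appropriate for this case).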
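The URL-only variant can be sketched as follows. url_fingerprint and UrlSeenFilter are hypothetical, standard-library-only stand-ins written in Python 3. Wiring such a filter into Scrapy (for example through the DUPEFILTER_CLASS setting named in the log) is not shown, and the method name request_seen here takes a plain URL string for illustration.

```python
import hashlib


def url_fingerprint(url):
    """Hash the URL exactly as written, without reordering its parameters."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


class UrlSeenFilter(object):
    """Duplicate filter keyed on the raw URL string (illustrative only)."""

    def __init__(self):
        self.fingerprints = set()

    def request_seen(self, url):
        # Return True if this exact URL was seen before, else record it.
        fp = url_fingerprint(url)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        return False
```

With this scheme the original URL and its reordered redirect target count as different requests, so the redirect is followed. The cost is that pages reachable under several parameter orders would be fetched more than once.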

Let me know if the solution works for you.

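P.S. Scrapy's behavior of treating the reordered URL and the original URL as the same is correct. What I don't understand is why the server redirects to a reordered query string.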
