
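Hey, I'm currently using Scrapy, and while running a crawl I noticed that my deny rules are being completely ignored, which causes the same item to be scraped multiple times. Can anyone tell me why? Any help is appreciated.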

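# Import paths as of the Scrapy releases current in 2014.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class DIY_spider(CrawlSpider):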
    name = 'diy_cat'
    allowed_domains = ['diy.com']

    start_urls =[
        #"http://www.diy.com",
        "http://www.diy.com/nav/decor",
        "http://www.diy.com/nav/garden",
        "http://www.diy.com/nav/rooms",
        "http://www.diy.com/nav/fix",
        "http://www.diy.com/nav/build",

    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=(r'/nav/decor|garden|rooms|fix|build/([A-Za-z0-9-]*)$'),
                               deny=('//diy/jsp/',
                                     'pricerange',
                                     'productId',
                                     '-size%',
                                     'tab=rev')),follow=True),

        Rule(SgmlLinkExtractor(allow=(r'/nav/decor|garden|rooms|fix|build/(.*)[0-9]{8}$' ),)
             , follow=True, callback='parse_items'),
    )

EDIT

This is what is happening in the log:

2014-04-07 15:01:47+0100 [diy_cat] DEBUG: Scraped from <200 http://www.diy.com/nav/garden/garden-buildings/cabins-summerhouses/-constructiontype-Interlocking/-pricerangec-1200-1300/-size%3E3_29_x_2_39m/Shire-11x8-Berryfield-Log-Cabin-Home-Delivered-Only-13538712?height=411&mediaId=m8416757&productId=13538712&skuId=14009418&width=411>

{'currency_code': 'GBP',
 'supplier_name': 'www.diy.com',
 'supplier_part_description': u'19mm tongue and groove interlocking timber to side walls, 12mm tongue and groove timber to floor and roof, supplied with pressure treated floor joists and green roofing felt.',
 'supplier_part_name': u'Shire 11x8 Berryfield Log Cabin - Home Delivered Only',
 'supplier_part_number': u'5019804112289',
 'supplier_price_gross': 1249.98,
 'supplier_price_net': 1041.65,
 'supplier_price_tax_amount': 208.33,
 'supplier_url': 'http://www.diy.com/nav/garden/garden-buildings/cabins-summerhouses/-constructiontype-Interlocking/-pricerangec-1200-1300/-size%3E3_29_x_2_39m/Shire-11x8-Berryfield-Log-Cabin-Home-Delivered-Only-13538712'}

2014-04-07 15:01:47+0100 [diy_cat] DEBUG: Scraped from <200 http://www.diy.com/nav/garden/garden-buildings/cabins-summerhouses/-constructiontype-Interlocking/-pricerangec-1200-1300/-size%3E3_29_x_2_39m/Shire-11x8-Berryfield-Log-Cabin-Home-Delivered-Only-13538712?height=411&mediaId=m8416844&productId=13538712&skuId=14009418&width=411>

{'currency_code': 'GBP',
 'supplier_name': 'www.diy.com',
 'supplier_part_description': u'19mm tongue and groove interlocking timber to side walls, 12mm tongue and groove timber to floor and roof, supplied with pressure treated floor joists and green roofing felt.',
 'supplier_part_name': u'Shire 11x8 Berryfield Log Cabin - Home Delivered Only',
 'supplier_part_number': u'5019804112289',
 'supplier_price_gross': 1249.98,
 'supplier_price_net': 1041.65,
 'supplier_price_tax_amount': 208.33,
 'supplier_url': 'http://www.diy.com/nav/garden/garden-buildings/cabins-summerhouses/-constructiontype-Interlocking/-pricerangec-1200-1300/-size%3E3_29_x_2_39m/Shire-11x8-Berryfield-Log-Cabin-Home-Delivered-Only-13538712'}

2014-04-07 15:01:47+0100 [diy_cat] DEBUG: Scraped from <200 http://www.diy.com/nav/garden/garden-buildings/cabins-summerhouses/-constructiontype-Interlocking/-pricerangec-1200-1300/-size%3E3_29_x_2_39m/Shire-11x8-Berryfield-Log-Cabin-Home-Delivered-Only-13538712?height=411&mediaId=m8417696&productId=13538712&skuId=14009418&width=411>

{'currency_code': 'GBP',
 'supplier_name': 'www.diy.com',
 'supplier_part_description': u'19mm tongue and groove interlocking timber to side walls, 12mm tongue and groove timber to floor and roof, supplied with pressure treated floor joists and green roofing felt.',
 'supplier_part_name': u'Shire 11x8 Berryfield Log Cabin - Home Delivered Only',
 'supplier_part_number': u'5019804112289',
 'supplier_price_gross': 1249.98,
 'supplier_price_net': 1041.65,
 'supplier_price_tax_amount': 208.33,
 'supplier_url': 'http://www.diy.com/nav/garden/garden-buildings/cabins-summerhouses/-constructiontype-Interlocking/-pricerangec-1200-1300/-size%3E3_29_x_2_39m/Shire-11x8-Berryfield-Log-Cabin-Home-Delivered-Only-13538712'}

2014-04-07 15:01:47+0100 [diy_cat] DEBUG: Scraped from <200 http://www.diy.com/nav/garden/garden-buildings/cabins-summerhouses/-constructiontype-Interlocking/-pricerangec-1200-1300/-size%3E3_29_x_2_39m/Shire-11x8-Berryfield-Log-Cabin-Home-Delivered-Only-13538712?heroPopup=true&mediaId=m8417696&productId=13538712&skuId=14009418>

{'currency_code': 'GBP',
 'supplier_name': 'www.diy.com',
 'supplier_part_description': u'19mm tongue and groove interlocking timber to side walls, 12mm tongue and groove timber to floor and roof, supplied with pressure treated floor joists and green roofing felt.',
 'supplier_part_name': u'Shire 11x8 Berryfield Log Cabin - Home Delivered Only',
 'supplier_part_number': u'5019804112289',
 'supplier_price_gross': 1249.98,
 'supplier_price_net': 1041.65,
 'supplier_price_tax_amount': 208.33,
 'supplier_url': 'http://www.diy.com/nav/garden/garden-buildings/cabins-summerhouses/-constructiontype-Interlocking/-pricerangec-1200-1300/-size%3E3_29_x_2_39m/Shire-11x8-Berryfield-Log-Cabin-Home-Delivered-Only-13538712'}

1 Answer


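After testing your spider and rules, I believe your deny rules are not specified correctly (a comma or two is missing). More specifically, the deny rule for "/diy/jsp/" is not quite right.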

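I ran the spider for about 10 seconds with the following modified rules and could not find any instance of "diy/jsp" in the log, so I think this works. That said, it is worth adding mediaId to the deny list, because that URL parameter is the only major difference between the duplicate URLs posted above.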

rules = (
    Rule(SgmlLinkExtractor(allow=(r'/nav/decor|garden|rooms|fix|build/([A-Za-z0-9-]*)$'),
                           deny=(r'\.\./\.\./diy/jsp/',
                                 'pricerange',
                                 'productId',
                                 '-size%',
                                 'tab=rev'),),follow=True),

    Rule(SgmlLinkExtractor(allow=(r'/nav/decor|garden|rooms|fix|build/(.*)[0-9]{8}$' ),)
         , follow=True, callback='parse_items'),
    )
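
To check how these patterns behave against one of the duplicate URLs from the log, here is a rough sketch using only Python's `re` module. It assumes the link extractor keeps a link when any allow pattern is found in the URL and no deny pattern is; `would_extract` is a hypothetical helper that approximates that filter, not Scrapy's own code, and the `('mediaId',)` deny list on the second rule is an assumption made for illustration.

```python
import re

# One of the duplicate URLs from the log in the question.
DUPLICATE_URL = (
    "http://www.diy.com/nav/garden/garden-buildings/cabins-summerhouses/"
    "-constructiontype-Interlocking/-pricerangec-1200-1300/-size%3E3_29_x_2_39m/"
    "Shire-11x8-Berryfield-Log-Cabin-Home-Delivered-Only-13538712"
    "?height=411&mediaId=m8416757&productId=13538712&skuId=14009418&width=411"
)
# The same product page without the query string (the supplier_url in the log).
CLEAN_URL = DUPLICATE_URL.split("?")[0]

# Patterns copied from the two rules above.
CATEGORY_ALLOW = (r'/nav/decor|garden|rooms|fix|build/([A-Za-z0-9-]*)$',)
CATEGORY_DENY = (r'\.\./\.\./diy/jsp/', 'pricerange', 'productId',
                 '-size%', 'tab=rev')
ITEM_ALLOW = (r'/nav/decor|garden|rooms|fix|build/(.*)[0-9]{8}$',)


def would_extract(url, allow, deny=()):
    """Hypothetical helper: a link passes if any allow pattern is found in
    the URL and no deny pattern is. This only approximates the real filter."""
    allowed = any(re.search(pattern, url) for pattern in allow)
    denied = any(re.search(pattern, url) for pattern in deny)
    return allowed and not denied


# First rule: 'pricerange' and 'productId' both hit, so the link is dropped.
category_rule = would_extract(DUPLICATE_URL, CATEGORY_ALLOW, CATEGORY_DENY)
# Second rule: it has no deny list, and because the alternation is not
# parenthesised, the bare alternative 'garden' matches anywhere in the URL.
item_rule = would_extract(DUPLICATE_URL, ITEM_ALLOW)
# Second rule with an assumed deny list containing only 'mediaId'.
item_rule_with_deny = would_extract(DUPLICATE_URL, ITEM_ALLOW, ('mediaId',))
# The query-free product URL should still get through with that deny list.
clean_still_passes = would_extract(CLEAN_URL, ITEM_ALLOW, ('mediaId',))

print(category_rule, item_rule, item_rule_with_deny, clean_still_passes)
# prints: False True False True
```

If this approximation holds, the deny list on the first rule does nothing for links picked up by the second rule, since each Rule filters with its own extractor. Adding mediaId (or the whole deny tuple, if the product URLs you want do not contain those strings) to the second rule as well would probably be needed to stop the duplicates.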
answered 2014-04-07T16:36:35.143