
Scrapy returns two results for every item I want to scrape. I could drop the duplicates from the CSV file afterwards, but I have a feeling there is a more elegant solution I'm not seeing. I also suspect the duplicates slow down the crawl itself.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.loader import XPathItemLoader

from kickstarter.items import kickstarteritem  # adjust to your project's items module


class kickstarter(CrawlSpider):
    name = 'kickstarter_successful'
    allowed_domains = ['kickstarter.com']
    start_urls = ['http://www.kickstarter.com/discover/successful']

    rules = (
        # Follow the pagination links, but do not scrape them.
        Rule(
            SgmlLinkExtractor(allow=r'\?page=\d+'),
            follow=True
        ),
        # Scrape each individual project page.
        Rule(
            SgmlLinkExtractor(allow=r'/projects/'),
            callback='parse_item'
        )
    )

    COOKIES_ENABLED = False
    DOWNLOAD_DELAY = 2
    USER_AGENT = "ELinks (0.4pre5; Linux 2.6.10-ac7 i686; 80x33)"

    def parse_item(self, response):
        loader = XPathItemLoader(item=kickstarteritem(), response=response)

        loader.add_value('url', response.url)
        loader.add_xpath('name', '//div[@class="NS-project_-running_board"]/h2[@id="title"]/a/text()')
        loader.add_xpath('launched', '//li[@class="posted"]/text()')
        loader.add_xpath('ended', '//li[@class="ends"]/text()')
        loader.add_xpath('backers', '//span[@class="count"]/data[@data-format="number"]/@data-value')
        loader.add_xpath('pledge', '//div[@class="num"]/@data-pledged')
        loader.add_xpath('goal', '//div[@class="num"]/@data-goal')

        yield loader.load_item()

1 Answer


If you see in the log that two different URLs lead to the same page, you can normalize the URLs yourself with the link extractor's process_value argument:

Rule(
    SgmlLinkExtractor(
        allow=r'/projects/',
        # strip the '?ref=card' tracking suffix so both variants of a
        # project URL collapse to the same request
        process_value=lambda v: v.replace('?ref=card', '')),
    callback='parse_item')
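
If Kickstarter appends other tracking parameters besides ref=card, a more general fix is to drop the query string entirely. This is only a sketch under that assumption; the strip_query helper is not part of the original answer:

from urlparse import urlparse, urlunparse  # Python 2, matching Scrapy of that era

def strip_query(url):
    # Reduce every URL variant to scheme://host/path so Scrapy's built-in
    # duplicate request filter treats them as the same request.
    parts = urlparse(url)
    return urlunparse((parts.scheme, parts.netloc, parts.path, parts.params, '', ''))

Rule(
    SgmlLinkExtractor(allow=r'/projects/', process_value=strip_query),
    callback='parse_item')

Scrapy filters duplicate requests by URL fingerprint by default, so once the links are normalized the second copy of each project page is never downloaded, which also addresses the slowdown you were worried about.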
Answered 2013-03-18T16:21:48.520