python - Problem following links with Scrapy

Question

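I am trying to get my web crawler to follow links extracted from a web page. I am using Scrapy. My spider extracts data successfully, but I cannot get it to crawl. I believe the problem is in my rules section. I am new to Scrapy. Thanks in advance for your help.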

I am scraping this site:

http://ballotpedia.org/wiki/index.php/Category:2012_challenger

The links I am trying to follow look like this in the page source:

/wiki/index.php/A._Ghani

or

/wiki/index.php/A._Keith_Carreiro

Here is the code for my spider:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider,Rule

from ballot1.items import Ballot1Item

class Ballot1Spider(CrawlSpider):
    name = "stewie"
    allowed_domains = ["ballotpedia.org"]
    start_urls = [
        "http://ballotpedia.org/wiki/index.php/Category:2012_challenger"
    ]
    rules = (
        Rule(SgmlLinkExtractor(allow=r'w+'), follow=True),
        Rule(SgmlLinkExtractor(allow=r'\w{4}/\w+/\w+'), callback='parse')
    )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('*')
        items = []
        for site in sites:
            item = Ballot1Item()
            item['candidate'] = site.select('/html/head/title/text()').extract()
            item['position'] = site.select('//table[@class="infobox"]/tr/td/b/text()').extract()
            item['controversies'] = site.select('//h3/span[@id="Controversies"]/text()').extract()
            item['endorsements'] = site.select('//h3/span[@id="Endorsements"]/text()').extract()
            item['currentposition'] = site.select('//table[@class="infobox"]/tr/td[@style="text-align:center; background-color:red;color:white; font-size:100%; font-weight:bold;"]/text()').extract()
            items.append(item)
        return items

score 1 · Accepted Answer

The links you are after exist only inside this element:

<div lang="en" dir="ltr" class="mw-content-ltr">

So you have to restrict link extraction with an XPath to keep out unrelated links:

restrict_xpaths='//div[@id="mw-pages"]/div'

Finally, the only other links you want to follow are pagination links that look like /wiki/index.php?title=Category:2012_challenger&pagefrom=Alison+McCoy#mw-pages, so your final rules should look like this:

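rules = (
    # Follow pagination links; no callback, so nothing is scraped from them.
    Rule(
        SgmlLinkExtractor(
            allow=r'&pagefrom='
        ),
        follow=True
    ),
    # Scrape the candidate pages linked from the category listing.
    # callback is an argument of Rule, not of SgmlLinkExtractor, and it must
    # not be named 'parse' because CrawlSpider uses parse() internally.
    Rule(
        SgmlLinkExtractor(
            restrict_xpaths='//div[@id="mw-pages"]/div'
        ),
        callback='parse_items'
    )
)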
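
To illustrate what restricting extraction to //div[@id="mw-pages"]/div achieves, here is a minimal standard-library sketch. It is not Scrapy and does not evaluate a real XPath. The sample HTML, the class name RegionLinkCollector, and the div-depth counting are assumptions made for illustration only; the real page is far messier.

```python
from html.parser import HTMLParser

# Hypothetical, heavily simplified stand-in for the category page.
SAMPLE_HTML = """
<div id="sidebar"><a href="/wiki/index.php/Main_Page">Main page</a></div>
<div id="mw-pages">
  <a href="/wiki/index.php?title=Category:2012_challenger&amp;pagefrom=Alison+McCoy#mw-pages">next 200</a>
  <div lang="en" dir="ltr" class="mw-content-ltr">
    <a href="/wiki/index.php/A._Ghani">A. Ghani</a>
    <a href="/wiki/index.php/A._Keith_Carreiro">A. Keith Carreiro</a>
  </div>
</div>
"""


class RegionLinkCollector(HTMLParser):
    """Collect hrefs found inside a <div> nested in <div id=region_id>."""

    def __init__(self, region_id):
        super().__init__()
        self.region_id = region_id
        self.div_depth = 0  # 0 = outside the region, 1 = the region div itself
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.div_depth > 0:
                self.div_depth += 1      # a div nested inside the region
            elif attrs.get("id") == self.region_id:
                self.div_depth = 1       # entering the region
        elif tag == "a" and self.div_depth >= 2 and "href" in attrs:
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1


collector = RegionLinkCollector("mw-pages")
collector.feed(SAMPLE_HTML)
candidate_links = collector.links
```

In this sample the sidebar link is ignored, and the pagination link sits directly under mw-pages rather than in a nested div, so it is left to the separate rule that matches on &pagefrom=.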

score 1 · Accepted Answer

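You are using a CrawlSpider with a callback named parse, which the Scrapy documentation explicitly warns against because it stops the crawling from working.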

Rename it to something like parse_items and you should be fine.
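
The reason the name matters can be shown with a toy model. This is not Scrapy's real code: ToyCrawlSpider, its links table, and the string "items" are hypothetical stand-ins. It only assumes that the base class uses its own parse() method to follow links and dispatch to the rule callbacks.

```python
class ToyCrawlSpider:
    """Toy stand-in for a rule-driven spider; NOT Scrapy's implementation."""

    rules = ()   # names of callbacks to run on every linked page
    links = {}   # hypothetical link graph: page -> pages it links to

    def parse(self, page):
        # The base class needs parse() for itself: it follows the links
        # and hands each linked page to the rule callbacks.
        results = []
        for linked_page in self.links.get(page, []):
            for callback_name in self.rules:
                results.append(getattr(self, callback_name)(linked_page))
        return results


LINKS = {"Category:2012_challenger": ["A._Ghani", "A._Keith_Carreiro"]}


class BrokenSpider(ToyCrawlSpider):
    links = LINKS
    rules = ("parse",)

    def parse(self, page):
        # Overrides the link-following logic above, so no link is followed.
        return "item:" + page


class FixedSpider(ToyCrawlSpider):
    links = LINKS
    rules = ("parse_items",)

    def parse_items(self, page):
        # A differently named callback leaves the base parse() intact.
        return "item:" + page


broken_result = BrokenSpider().parse("Category:2012_challenger")
fixed_result = FixedSpider().parse("Category:2012_challenger")
```

In this model the broken spider only ever handles the start page, while the fixed one reaches both linked pages. That mirrors the symptom in the question: data is extracted, but nothing is crawled.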

score 0 · Accepted Answer

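r'w+' is wrong (I think you meant r'\w+'), and r'\w{4}/\w+/\w+' does not look right either, because it does not match your links (it is missing the leading /). Why not try r'/wiki/index.php/.+' instead? Do not forget that \w does not include . and other symbols that can be part of an article name.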
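
A quick way to check patterns like these is to run them against the example links with Python's re module. The sketch below assumes the allow patterns are applied with search semantics, meaning a match anywhere in the URL counts; the helper name matches is made up for illustration.

```python
import re

links = [
    "/wiki/index.php/A._Ghani",
    "/wiki/index.php/A._Keith_Carreiro",
]


def matches(pattern, url):
    """Return True if pattern is found anywhere in url (search semantics)."""
    return re.search(pattern, url) is not None


# r'w+' matches a literal "w", which happens to occur in "wiki".
literal_w = [matches(r"w+", url) for url in links]

# The second pattern from the question finds no match in either link.
second_rule = [matches(r"\w{4}/\w+/\w+", url) for url in links]

# The suggested pattern matches both links.
suggested = [matches(r"/wiki/index.php/.+", url) for url in links]

# \w covers letters, digits and "_", but not ".", so it cannot span "A._Ghani".
name_is_word_chars = re.fullmatch(r"\w+", "A._Ghani") is not None
```

Here literal_w and suggested are both all True, second_rule is all False, and name_is_word_chars is False. In these two links, the "." in index.php is one of the characters that \w+ cannot cross.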
