python - scrapy - parsing items that are paginated

Question

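I have a URL of the form:

example.com/foo/bar/page_1.html

There are 53 pages in total, each with about 20 rows.

I basically want to get all the rows from all the pages, i.e. ~53*20 items.

I have working code in my parse method that parses a single page and, for each item, also goes one page deeper to get more information about that item: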

  def parse(self, response):
    hxs = HtmlXPathSelector(response)

    restaurants = hxs.select('//*[@id="contenido-resbus"]/table/tr[position()>1]')

    for rest in restaurants:
      item = DegustaItem()
      item['name'] = rest.select('td[2]/a/b/text()').extract()[0]
      # some items don't have category associated with them
      try:
        item['category'] = rest.select('td[3]/a/text()').extract()[0]
      except:
        item['category'] = ''
      item['urbanization'] = rest.select('td[4]/a/text()').extract()[0]

      # get profile url
      rel_url = rest.select('td[2]/a/@href').extract()[0]
      # join with base url since profile url is relative
      base_url = get_base_url(response)
      follow = urljoin_rfc(base_url,rel_url)

      request = Request(follow, callback = parse_profile)
      request.meta['item'] = item
      return request


  def parse_profile(self, response):
    item = response.meta['item']
    # item['address'] = figure out xpath
    return item

The question is, how do I crawl each page?

example.com/foo/bar/page_1.html
example.com/foo/bar/page_2.html
example.com/foo/bar/page_3.html
...
...
...
example.com/foo/bar/page_53.html

score 48 · Accepted Answer

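You have two options to solve your problem. The general one is to use yield to generate new requests instead of return. That way you can issue more than one new request from a single callback. Check the second example at http://doc.scrapy.org/en/latest/topics/spiders.html#basespider-example.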
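To see why this matters, here is a minimal plain-Python sketch that needs no Scrapy. The function names, the dictionaries standing in for Request objects, and the hrefs are all made up for illustration. It only shows the control-flow difference: return leaves the loop on the first row, while yield hands back one result per row.

```python
def parse_with_return(rows):
    """Mimics the question's callback: return exits on the first row."""
    for row in rows:
        return {"follow": row}  # the loop never reaches the second row


def parse_with_yield(rows):
    """Generator version: produces one request-like dict per row."""
    for row in rows:
        yield {"follow": row}


# Hypothetical relative profile links taken from one listing page.
rows = ["profile_a.html", "profile_b.html", "profile_c.html"]

first_only = parse_with_return(rows)
all_requests = list(parse_with_yield(rows))
```

In a real spider the yielded objects would be Request instances (and items), and Scrapy would consume the generator for you.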

In your case there is probably a simpler solution: just generate the list of start URLs from a pattern like this:

class MySpider(BaseSpider):
    start_urls = ['http://example.com/foo/bar/page_%s.html' % page for page in xrange(1,54)]
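The snippet above is written for Python 2. On Python 3, xrange no longer exists, so a sketch of the same list would use range instead. The URL pattern is the one from the question, and the upper bound of 54 is exclusive, so pages 1 through 53 are covered.

```python
# Same URL pattern as in the answer; range(1, 54) covers pages 1 through 53.
start_urls = [
    "http://example.com/foo/bar/page_{}.html".format(page)
    for page in range(1, 54)
]
```

Assigned as a class attribute of the spider, this list gives Scrapy one initial request per page.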

score 12 · Accepted Answer

You can use CrawlSpider instead of BaseSpider and use SgmlLinkExtractor to extract the pages in the pagination.

For example:

start_urls = ["http://www.example.com/page1"]
rules = (
    Rule(SgmlLinkExtractor(restrict_xpaths=('//a[@class="next_page"]',)),
         follow=True),
    Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@class="foto_imovel"]',)),
         callback='parse_call'),
)

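The first rule tells scrapy to follow the links matched by the XPath expression. The second rule tells scrapy to call parse_call on the links matched by its XPath expression, in case you want to parse something on each page.

For more information, see the docs: http://doc.scrapy.org/en/latest/topics/spiders.html#crawlspider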

score 10 · Accepted Answer

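'scrapy - parsing items that are paginated' can have two use cases.

A) We just want to move across the table and fetch data. This is relatively straightforward.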

import scrapy


class TrainSpider(scrapy.Spider):
    name = "trip"
    start_urls = ['somewebsite']
    def parse(self, response):
        ''' do something with this parser '''
        next_page = response.xpath("//a[@class='next_page']/@href").extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

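Observe the last 4 lines. Here:

1. We get the next-page link from the "Next" pagination button, using the next_page XPath.
2. The if condition checks that we have not reached the end of the pagination.
3. We join this link (the one we got in step 1) with the main URL using urljoin.
4. We make a recursive call to the parse callback method.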
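The four steps can be mimicked without Scrapy. The sketch below is only an illustration: the NEXT_LINKS table is a made-up stand-in for the XPath lookup, and the standard library's urljoin is used in place of response.urljoin, which resolves a relative href against the page URL in much the same way.

```python
from urllib.parse import urljoin

# Hypothetical site: each page URL maps to the relative href of its
# "next" link, or None on the last page.
NEXT_LINKS = {
    "http://example.com/foo/bar/page_1.html": "page_2.html",
    "http://example.com/foo/bar/page_2.html": "page_3.html",
    "http://example.com/foo/bar/page_3.html": None,
}


def crawl(start_url):
    """Yield every page URL reached by following the "next" links."""
    url = start_url
    while True:
        yield url
        next_href = NEXT_LINKS[url]    # step 1: look up the "next" href
        if next_href is None:          # step 2: stop at the last page
            return
        url = urljoin(url, next_href)  # step 3: make the link absolute
        # step 4: the loop plays the role of the repeated parse callback


pages = list(crawl("http://example.com/foo/bar/page_1.html"))
```

In the spider there is no explicit loop: each yielded Request brings Scrapy back into parse with the next response.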

B) We not only want to move across pages, but also want to extract data from one or more links on each page.

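from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class StationDetailSpider(CrawlSpider):
    name = 'train'
    start_urls = ['someOtherWebsite']  # placeholder for the real start URL
    rules = (
        # Follow the next-page link and keep applying the rules there.
        Rule(LinkExtractor(restrict_xpaths="//a[@class='next_page']"), follow=True),
        # Hand every /trains/<digits> link to parse_trains.
        Rule(LinkExtractor(allow=r"/trains/\d+$"), callback='parse_trains'),
    )

    def parse_trains(self, response):
        '''do your parsing here'''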

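Here, note that:

We are using CrawlSpider, a subclass of the scrapy.Spider parent class.
We have set 'rules':

a) The first rule just checks whether a 'next_page' link is available and follows it.

b) The second rule requests all the links on a page that match a format such as /trains/12343, and then calls parse_trains to perform the parsing.
Important: note that we don't want to use the regular parse method here, because we are using the CrawlSpider subclass. This class also has a parse method of its own, so we don't want to override it. Remember to name your callback method something other than parse.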
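To check which links the second rule would pick up, the allow pattern can be tried on its own. This sketch assumes that the link extractor searches each absolute URL with the regular expression, which is how I understand allow to behave. The sample URLs are made up.

```python
import re

# Same pattern as the allow= argument in the second rule.
TRAIN_LINK = re.compile(r"/trains/\d+$")

# Hypothetical links that might appear on a listing page.
links = [
    "http://example.com/trains/12343",
    "http://example.com/trains/12343/reviews",
    "http://example.com/stations/5",
    "http://example.com/trains/",
]

# Keep only the URLs that end in /trains/<digits>.
matched = [url for url in links if TRAIN_LINK.search(url)]
```

Because of the trailing $, only URLs that end with the numeric id are kept, so sub-pages such as the reviews link are skipped.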
