python - How can Scrapy crawl more URLs?

Question

As we can see:

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//ul/li')
    items = []

    for site in sites:
        item = Website()
        item['name'] = site.select('a/text()').extract()
        item['url'] = site.select('//a[contains(@href, "http")]/@href').extract()
        item['description'] = site.select('text()').extract()
        items.append(item)

    return items

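Scrapy only fetches a single page response and finds the URLs within that response. I think that is just a surface crawl!

But I want more URLs, followed down to a defined depth.

What can I do to achieve that?

Thank you!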

score 1 · Accepted Answer

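I don't fully understand your question, but I noticed several problems in your code, some of which may be related to it (see the comments in the code):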

sites = hxs.select('//ul/li')
items = []

for site in sites:
    item = Website()
    # .extract() returns a list, so .extract()[0] is probably what you want here.
    item['name'] = site.select('a/text()').extract()
    # '//a[...]' may look like it selects the links within `site`, but it actually
    # selects the links from the entire page; use './/a[...]' instead.
    # Again, this returns a list, not a single URL.
    item['url'] = site.select('//a[contains(@href, "http")]/@href').extract()
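
The difference between a page-wide query and a query scoped to the current element can be seen without Scrapy at all. The following is only a rough sketch using the standard library's xml.etree.ElementTree as a stand-in for Scrapy selectors; the sample page and the helper names `first` and `links_under` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page used only for this illustration.
PAGE = """
<html><body><ul>
  <li><a href="http://example.com/a">Site A</a> first site</li>
  <li><a href="http://example.com/b">Site B</a> second site</li>
</ul></body></html>
"""

root = ET.fromstring(PAGE)


def first(values, default=None):
    """Return the first element of a list, or a default if the list is empty."""
    return values[0] if values else default


def links_under(node):
    """Collect href values containing 'http' from <a> elements below node."""
    return [a.get("href") for a in node.findall(".//a") if "http" in a.get("href", "")]


page_wide = []  # query anchored at the document root: same result for every item
scoped = []     # query anchored at the current <li>: one link per item
for li in root.findall(".//ul/li"):
    page_wide.append(links_under(root))
    scoped.append(first(links_under(li)))
```

Every entry of `page_wide` holds all links on the page, which is the effect of the leading `//` in the question's code, while `scoped` holds exactly one URL per list item.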

score 0 · Accepted Answer

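You can crawl more pages by using CrawlSpider, which can be imported from scrapy.contrib.spiders, and defining rules for the kinds of links you want the spider to follow.

Follow the instructions here on how to define the rules.

By the way, consider changing the function name, as the documentation advises: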

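Warning

When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.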

score 0 · Accepted Answer

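Take a look at the documentation on Requests and Responses.

When you scrape the first page, you collect some links, which you use to generate a second request; that request leads to a second callback function that scrapes the second level. In the abstract this sounds complicated, but you will see from the example code in the documentation that it is quite simple.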
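
To make the request/callback chain concrete, here is a framework-free sketch of the idea rather than Scrapy code: a callback turns each response into an item plus follow-up requests, and a small scheduler stops scheduling requests beyond a chosen depth. The in-memory PAGES dict and the names `LinkParser`, `parse_page` and `crawl` are hypothetical; in Scrapy itself the callback would yield Request objects pointing at the next callback.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "site": URL -> HTML body, standing in for real downloads.
PAGES = {
    "http://example.com/": '<a href="http://example.com/a">a</a> <a href="http://example.com/b">b</a>',
    "http://example.com/a": '<a href="http://example.com/a1">a1</a>',
    "http://example.com/b": '<a href="http://example.com/">home</a>',
    "http://example.com/a1": '<a href="http://example.com/a2">a2</a>',
    "http://example.com/a2": "",
}


class LinkParser(HTMLParser):
    """Collect the href values of <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def parse_page(url, body, depth):
    """Callback: return one item and the follow-up requests for one response."""
    parser = LinkParser()
    parser.feed(body)
    item = {"url": url, "depth": depth}
    requests = [(link, depth + 1) for link in parser.links]
    return item, requests


def crawl(start_url, max_depth):
    """Breadth-first crawl that never schedules a request deeper than max_depth."""
    queue = deque([(start_url, 0)])
    seen = {start_url}
    items = []
    while queue:
        url, depth = queue.popleft()
        item, requests = parse_page(url, PAGES.get(url, ""), depth)
        items.append(item)
        for link, link_depth in requests:
            if link_depth <= max_depth and link not in seen:
                seen.add(link)
                queue.append((link, link_depth))
    return items


items = crawl("http://example.com/", max_depth=2)
```

With max_depth=2 the crawl visits the start page, its two links and one second-level page, and it never requests http://example.com/a2, which sits at depth 3.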

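Also, the CrawlSpider example is more fleshed out and gives you template code that you may simply want to adapt to your situation.

Hope this gets you started.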
