How do I scrape multiple URLs with Scrapy?

Am I forced to write multiple spiders?

class TravelSpider(BaseSpider):
    name = "speedy"
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/category/top/page-%d/" % i for i in xrange(4),"http://example.com/superurl/top/page-%d/" % i for i in xrange(55)]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        item = TravelItem()
        item['url'] = hxs.select('//a[@class="out"]/@href').extract()
        out = "\n".join(str(e) for e in item['url']);
        print out

Python says:

NameError: name 'i' is not defined

But when I use a single URL pattern, it works fine!

start_urls = ["http://example.com/category/top/page-%d/" % i for i in xrange(4)]

3 Answers


Your Python syntax is incorrect. Try concatenating two list comprehensions:

start_urls = ["http://example.com/category/top/page-%d/" % i for i in xrange(4)] + \
    ["http://example.com/superurl/top/page-%d/" % i for i in xrange(55)]

If you need to write code that generates the start requests, you can define a start_requests() method instead of using start_urls.
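
As a rough sketch of the generating side, the URLs can be produced lazily by a plain generator. The helper name iter_start_urls and the URL_PATTERNS table are hypothetical, chosen only for illustration; the templates and page counts are the ones from the question. The sketch uses range so that it runs on Python 3, whereas the question's code uses Python 2's xrange.

```python
# Hypothetical table of (URL template, number of pages), taken from the question.
URL_PATTERNS = [
    ("http://example.com/category/top/page-%d/", 4),
    ("http://example.com/superurl/top/page-%d/", 55),
]


def iter_start_urls(patterns=URL_PATTERNS):
    """Yield every page URL for each (template, page_count) pair, lazily."""
    for template, page_count in patterns:
        for page in range(page_count):
            yield template % page


urls = list(iter_start_urls())
print(len(urls))   # 59
print(urls[0])     # http://example.com/category/top/page-0/
print(urls[-1])    # http://example.com/superurl/top/page-54/
```

Inside a spider, a start_requests() method could loop over such a generator and yield one Scrapy Request per URL with callback=self.parse, rather than building the full start_urls list up front.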

answered 2013-04-19T16:34:43.383

You can initialize start_urls in the __init__ method:

from scrapy.item import Item, Field
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider


class TravelItem(Item):
    url = Field()


class TravelSpider(BaseSpider):
    def __init__(self, name=None, **kwargs):
        self.start_urls = []
        self.start_urls.extend(["http://example.com/category/top/page-%d/" % i for i in xrange(4)])
        self.start_urls.extend(["http://example.com/superurl/top/page-%d/" % i for i in xrange(55)])

        super(TravelSpider, self).__init__(name, **kwargs)

    name = "speedy"
    allowed_domains = ["example.com"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        item = TravelItem()
        item['url'] = hxs.select('//a[@class="out"]/@href').extract()
        out = "\n".join(str(e) for e in item['url']);
        print out

Hope that helps.

answered 2013-04-19T12:15:48.223

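Python has only four scopes, LEGB (Local, Enclosing, Global, Built-in). The local scope of a class definition and the local scope of a list comprehension are not nested functions, so they do not form an enclosing scope. They are therefore two separate local scopes, and neither can access names defined in the other.

So do not combine a "for" comprehension with class variables.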
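
A small sketch of that scoping rule, assuming Python 3 semantics. In Python 2, which the question's code targets (xrange, print statement), list comprehensions do not get their own scope, so the failing line below would not raise there. The class Demo and its attribute names are hypothetical.

```python
class Demo(object):
    template = "http://example.com/page-%d/"   # class-level name
    pages = range(3)                            # class-level name
    error = None

    # Works: the outermost iterable (pages) is evaluated in the class scope.
    numbers = [i for i in pages]

    # Fails on Python 3: template is looked up from inside the
    # comprehension's own scope, which cannot see class-level names.
    try:
        urls = [template % i for i in pages]
    except NameError as exc:
        error = str(exc)


print(Demo.numbers)   # [0, 1, 2]
print(Demo.error)     # the NameError message that mentions 'template'
```

Building the list in __init__ or in a module-level function avoids the problem, because names are then resolved through ordinary function and global scopes.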

answered 2018-08-10T05:51:02.110