
I have a list of data objects, each containing a URL to be scraped. Some of the URLs are invalid, but I still want their data objects to pass through to the item pipeline.

Following @tomáš-linhart's reply, I understand that using a middleware won't work in this case, because Scrapy doesn't allow me to create the Request object in the first place.

An alternative approach is to yield an item instead of a request if the URL is invalid.

Here is my code:

def start_requests(self):
    rurls = json.load(open(self.data_file))
    for data in rurls[:100]:
        url = data['Website'] or ''
        rid = data['id']

        # skip creating requests for invalid urls
        if not (url and validators.url(url)):
            yield self.create_item(rid, url)
            continue

        # create request object
        request_object = scrapy.Request(url=url, callback=self.parse, errback=self.errback_httpbin)

        # populate request object
        request_object.meta['rid'] = rid

        self.logger.info('REQUEST QUEUED for RID: %s', rid)
        yield request_object

The code above throws the error shown below. Apart from the error itself, I'm not sure how to trace the root of the problem. :(

2017-09-22 12:44:38 [scrapy.utils.signal] ERROR: Error caught on signal handler: <bound method RefererMiddleware.request_scheduled of <scrapy.spidermiddlewares.referer.RefererMiddleware object at 0x10f603ef0>>
Traceback (most recent call last):
  File "/myhome/.virtualenvs/myproj/lib/python3.5/site-packages/scrapy/utils/signal.py", line 30, in send_catch_log
    *arguments, **named)
  File "/myhome/.virtualenvs/myproj/lib/python3.5/site-packages/pydispatch/robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "/myhome/.virtualenvs/myproj/lib/python3.5/site-packages/scrapy/spidermiddlewares/referer.py", line 343, in request_scheduled
    redirected_urls = request.meta.get('redirect_urls', [])
  File "/myhome/.virtualenvs/myproj/lib/python3.5/site-packages/scrapy/item.py", line 74, in __getattr__
    raise AttributeError(name)
AttributeError: meta
Unhandled Error
Traceback (most recent call last):
  File "/myhome/.virtualenvs/myproj/lib/python3.5/site-packages/scrapy/commands/crawl.py", line 58, in run
    self.crawler_process.start()
  File "/myhome/.virtualenvs/myproj/lib/python3.5/site-packages/scrapy/crawler.py", line 285, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/myhome/.virtualenvs/myproj/lib/python3.5/site-packages/twisted/internet/base.py", line 1243, in run
    self.mainLoop()
  File "/myhome/.virtualenvs/myproj/lib/python3.5/site-packages/twisted/internet/base.py", line 1252, in mainLoop
    self.runUntilCurrent()
--- <exception caught here> ---
  File "/myhome/.virtualenvs/myproj/lib/python3.5/site-packages/twisted/internet/base.py", line 878, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/myhome/.virtualenvs/myproj/lib/python3.5/site-packages/scrapy/utils/reactor.py", line 41, in __call__
    return self._func(*self._a, **self._kw)
  File "/myhome/.virtualenvs/myproj/lib/python3.5/site-packages/scrapy/core/engine.py", line 135, in _next_request
    self.crawl(request, spider)
  File "/myhome/.virtualenvs/myproj/lib/python3.5/site-packages/scrapy/core/engine.py", line 210, in crawl
    self.schedule(request, spider)
  File "/myhome/.virtualenvs/myproj/lib/python3.5/site-packages/scrapy/core/engine.py", line 216, in schedule
    if not self.slot.scheduler.enqueue_request(request):
  File "/myhome/.virtualenvs/myproj/lib/python3.5/site-packages/scrapy/core/scheduler.py", line 54, in enqueue_request
    if not request.dont_filter and self.df.request_seen(request):
  File "/myhome/.virtualenvs/myproj/lib/python3.5/site-packages/scrapy/item.py", line 74, in __getattr__
    raise AttributeError(name)
builtins.AttributeError: dont_filter


3 Answers


You can't achieve your goal with your current approach, because the error you're getting is raised in the constructor of a Request; see the code.

Anyway, I don't understand why you want to do it this way. Based on your requirement:

I have a list of data objects, each containing a URL to be scraped. Some of the URLs are invalid, but I still want their data objects to pass through to the item pipeline.

If I understand you correctly, you already have a complete item (a data object in your terminology) and you just want it to pass through the item pipeline. Then do the URL validation in the spider and, if the URL is invalid, simply yield the item instead of yielding a request for the URL it contains. No spider middleware is needed for this.
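
As a quick illustration of the validation step this relies on, here is a minimal sketch using the third-party validators package the question already imports (the sample URLs are made up). validators.url() returns True for a well-formed absolute URL and a falsy failure object otherwise, so it can be used directly in a boolean test:

import validators

# Decide per URL whether to build a request or to yield the item directly.
for url in ('https://example.org', 'not-a-url', ''):
    if url and validators.url(url):
        print(url, '-> build a scrapy.Request')
    else:
        print(repr(url), '-> yield the item straight to the pipeline')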

Answered 2017-09-22T06:18:15.547

You cannot yield Item objects from the start_requests method, only Request objects.

Answered 2017-09-22T17:35:22.207

This answer comes late, but here is how I did it:

import scrapy

# ImageItem and get_random_task are the answerer's own item class and task
# source; they are not defined in the answer.
class ImageSpider(scrapy.Spider):
    name = "image"
    allowed_domains = []

    def start_requests(self):
        # A dummy request whose only purpose is to reach a callback,
        # from which items may be yielded.
        yield scrapy.Request("https://www.example.org", callback=self.parse)

    def parse(self, response):
        while True:
            task = get_random_task()
            yield ImageItem(image_urls=task.pic_urls.split(","), mid=task.mid)

Just issue a dummy request, then yield the items from the parse callback.
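
Adapted to the original question, a sketch of this trick might look like the following. The DataItem class, the data_file path, and the bootstrap URL are hypothetical; the validation logic is taken from the question's start_requests:

import json

import scrapy
import validators


class DataItem(scrapy.Item):
    # Hypothetical item with the two fields the question populates.
    rid = scrapy.Field()
    url = scrapy.Field()


class UrlSpider(scrapy.Spider):
    name = 'urlspider'
    data_file = 'rurls.json'  # assumed location of the question's JSON file

    def start_requests(self):
        # A single dummy request; items can only be yielded from callbacks.
        yield scrapy.Request('https://www.example.org', callback=self.dispatch)

    def dispatch(self, response):
        rurls = json.load(open(self.data_file))
        for data in rurls:
            url = data['Website'] or ''
            rid = data['id']
            if not (url and validators.url(url)):
                # Invalid URL: send the item straight to the item pipelines.
                yield DataItem(rid=rid, url=url)
                continue
            # Valid URL: scrape it, carrying the id along in meta.
            yield scrapy.Request(url, callback=self.parse,
                                 meta={'rid': rid})

    def parse(self, response):
        yield DataItem(rid=response.meta['rid'], url=response.url)

This keeps all of the routing logic in a single callback, so no middleware is involved.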

Answered 2021-12-07T15:50:20.593