20

I want to enable an HTTP proxy for some spiders and disable it for others.

Can I do something like this?

# settings.py
proxy_spiders = ['a1', 'b2']

if spider in proxy_spiders:  # how do I get the spider name here???
    HTTP_PROXY = 'http://127.0.0.1:8123'
    DOWNLOADER_MIDDLEWARES = {
         'myproject.middlewares.RandomUserAgentMiddleware': 400,
         'myproject.middlewares.ProxyMiddleware': 410,
         'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None
    }
else:
    DOWNLOADER_MIDDLEWARES = {
         'myproject.middlewares.RandomUserAgentMiddleware': 400,
         'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None
    }

If the code above doesn't work, is there any other suggestion?


5 Answers

36

A bit late, but since version 1.0.0 there is a new feature in Scrapy that lets you override settings per spider, like this:

import scrapy

class MySpider(scrapy.Spider):
    name = "my_spider"
    custom_settings = {
        "HTTP_PROXY": 'http://127.0.0.1:8123',
        "DOWNLOADER_MIDDLEWARES": {
            'myproject.middlewares.RandomUserAgentMiddleware': 400,
            'myproject.middlewares.ProxyMiddleware': 410,
            'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
        },
    }

class MySpider2(scrapy.Spider):
    name = "my_spider2"
    custom_settings = {
        "DOWNLOADER_MIDDLEWARES": {
            'myproject.middlewares.RandomUserAgentMiddleware': 400,
            'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
        },
    }
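
The first snippet references myproject.middlewares.ProxyMiddleware without showing it. A minimal sketch of what such a middleware might look like, assuming it simply routes every request through the HTTP_PROXY value from the (per-spider) settings:

# middlewares.py -- hypothetical sketch, not part of the original answer
class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # Tag each request with the proxy configured in the spider's settings
        request.meta['proxy'] = spider.settings.get('HTTP_PROXY')
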
answered 2015-12-11T18:28:16.233
14

There is a new, easier way to do this.

class MySpider(scrapy.Spider):
    name = 'myspider'

    custom_settings = {
        'SOME_SETTING': 'some value',
    }

I'm using Scrapy 1.3.1.
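
To check that the override actually took effect, the value can be read back from inside the spider (a minimal sketch; SOME_SETTING is just the placeholder name from the answer):

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    custom_settings = {
        'SOME_SETTING': 'some value',
    }

    def start_requests(self):
        # self.settings already reflects the per-spider override here
        self.logger.info(self.settings.get('SOME_SETTING'))  # logs 'some value'
        return []
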

answered 2017-03-11T18:40:48.987
8

You can add settings.overrides in the spider.py file. An example that works:

from scrapy.conf import settings

settings.overrides['DOWNLOAD_TIMEOUT'] = 300 

For your case, something like this should also work:

from scrapy.conf import settings

settings.overrides['DOWNLOADER_MIDDLEWARES'] = {
     'myproject.middlewares.RandomUserAgentMiddleware': 400,
     'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None
}
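
Keep in mind that scrapy.conf and settings.overrides were deprecated in Scrapy 1.0 in favor of custom_settings, so this approach only applies to older releases. In context, the override sits at module level of the spider file; a sketch under that assumption:

# myspider.py -- pre-1.0 Scrapy only; scrapy.conf is deprecated since 1.0
from scrapy.conf import settings
from scrapy.spider import BaseSpider

# Applied when the module is imported, i.e. before the crawl starts
settings.overrides['DOWNLOADER_MIDDLEWARES'] = {
    'myproject.middlewares.RandomUserAgentMiddleware': 400,
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
}

class MySpider(BaseSpider):
    name = 'myspider'
    start_urls = ['http://example.com']

    def parse(self, response):
        pass
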
answered 2014-08-28T22:43:29.187
4

You can define your own proxy middleware, something like this:

from scrapy.contrib.downloadermiddleware.httpproxy import HttpProxyMiddleware

class ConditionalProxyMiddleware(HttpProxyMiddleware):
    def process_request(self, request, spider):
        # Only proxy requests for spiders that opt in via `use_proxy = True`
        if getattr(spider, 'use_proxy', None):
            return super(ConditionalProxyMiddleware, self).process_request(request, spider)

Then define the attribute use_proxy = True in the spiders where you want the proxy enabled. Don't forget to disable the default proxy middleware and enable your modified one.
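
Concretely, that last step could look like this in settings.py (a sketch; the module path is an assumption, and 750 is the priority the built-in HttpProxyMiddleware uses by default):

DOWNLOADER_MIDDLEWARES = {
    # Disable the built-in proxy middleware...
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': None,
    # ...and enable the conditional replacement at the same priority
    'myproject.middlewares.ConditionalProxyMiddleware': 750,
}
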

answered 2013-10-14T12:21:36.847
-2

Why not use two projects instead of just one?

Let's name the two projects proj1 and proj2. In proj1's settings.py, put the following settings:

HTTP_PROXY = 'http://127.0.0.1:8123'
DOWNLOADER_MIDDLEWARES = {
     'myproject.middlewares.RandomUserAgentMiddleware': 400,
     'myproject.middlewares.ProxyMiddleware': 410,
     'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None
}

In proj2's settings.py, put the following settings:

DOWNLOADER_MIDDLEWARES = {
     'myproject.middlewares.RandomUserAgentMiddleware': 400,
     'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None
}
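
Each spider is then run from its own project directory with scrapy crawl <spider_name>, so proj1's spiders always go through the proxy while proj2's never do.
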
answered 2013-10-12T03:51:35.600