I am using Scrapy on Scrapinghub to scrape a few thousand websites. When scraping a single website, request durations stay quite short (< 100 ms).

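But I also have a spider that is responsible for "validating" about 10k URLs (I am testing a bunch of different domains, with and without www.). All it does is fetch the home page and drop the URL if the status is neither 200 nor a redirect.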
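
For illustration, the list of home-page URLs to validate (with and without `www.`) could be generated from bare domains roughly as follows. This is only a sketch: `candidate_urls` is a hypothetical helper that does not appear in the spider below, and testing both `http` and `https` is an assumption.

```python
def candidate_urls(domain, schemes=("http", "https")):
    """Return the home-page URLs to test for one domain.

    Hypothetical helper: yields each scheme with and without "www.".
    """
    host = domain.strip().lower()
    # Normalise to the bare host so both variants are generated exactly once.
    if host.startswith("www."):
        host = host[4:]
    urls = []
    for scheme in schemes:
        for prefix in ("", "www."):
            urls.append(f"{scheme}://{prefix}{host}/")
    return urls


for url in candidate_urls("Example.com"):
    print(url)
```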

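I noticed that when I run this spider several times in a row, I get inconsistent results (the numbers of items and requests differ from run to run).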
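
One way to see which URLs account for the difference between two runs is to compare the sets of URLs each run produced. The sketch below assumes each run's item URLs have already been exported to a list; `compare_runs` and the sample data are hypothetical.

```python
def compare_runs(run_a, run_b):
    """Compare the URL sets of two runs and report what each one is missing."""
    a, b = set(run_a), set(run_b)
    return {
        "only_in_a": sorted(a - b),  # validated in run A but not in run B
        "only_in_b": sorted(b - a),  # validated in run B but not in run A
        "in_both": len(a & b),
    }


# Hypothetical sample data standing in for two exported runs.
first_run = ["http://a.example/", "http://b.example/", "http://c.example/"]
second_run = ["http://a.example/", "http://c.example/", "http://d.example/"]

print(compare_runs(first_run, second_run))
```

URLs that appear in only one of the runs are the ones worth checking against the error log for timeouts or DNS failures.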

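Looking at the request logs, I can see request durations getting progressively longer, then dropping back to lower values, then climbing again until user timeouts are triggered on some URLs.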

I am using a CONCURRENT_REQUESTS value that is usually > 100 (I tried 100, 200, 500, and 1000).

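Here is the duration log. There are no timeouts here because there are only 100 URLs, but I need to run this validation on 10k URLs, and this instability in durations is worrying: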

    {"time": 1535517660373, "duration": 26, "status": 400}
    {"time": 1535517661582, "duration": 26, "status": 400}
    {"time": 1535517663724, "duration": 26, "status": 400}
    {"time": 1535517663897, "duration": 26, "status": 400}
    {"time": 1535517665046, "duration": 46, "status": 200}
    {"time": 1535517657573, "duration": 50, "status": 200}
    {"time": 1535517657615, "duration": 83, "status": 200}
    {"time": 1535517657616, "duration": 85, "status": 200}
    {"time": 1535517657822, "duration": 112, "status": 200}
    {"time": 1535517657831, "duration": 112, "status": 200}
    {"time": 1535517657816, "duration": 120, "status": 200}
    {"time": 1535517657837, "duration": 121, "status": 200}
    {"time": 1535517658470, "duration": 130, "status": 200}
    {"time": 1535517663093, "duration": 135, "status": 302}
    {"time": 1535517658133, "duration": 149, "status": 200}
    {"time": 1535517657862, "duration": 153, "status": 200}
    {"time": 1535517657933, "duration": 228, "status": 200}
    {"time": 1535517658362, "duration": 230, "status": 200}
    {"time": 1535517657946, "duration": 258, "status": 200}
    {"time": 1535517657989, "duration": 269, "status": 200}
    {"time": 1535517657967, "duration": 271, "status": 200}
    {"time": 1535517658108, "duration": 389, "status": 200}
    {"time": 1535517665893, "duration": 433, "status": 404}
    {"time": 1535517658142, "duration": 467, "status": 200}
    {"time": 1535517658350, "duration": 467, "status": 200}
    {"time": 1535517668501, "duration": 526, "status": 200}
    {"time": 1535517658216, "duration": 543, "status": 200}
    {"time": 1535517658312, "duration": 670, "status": 200}
    {"time": 1535517658342, "duration": 678, "status": 200}
    {"time": 1535517658347, "duration": 679, "status": 200}
    {"time": 1535517658291, "duration": 682, "status": 200}
    {"time": 1535517658345, "duration": 684, "status": 200}
    {"time": 1535517658310, "duration": 688, "status": 200}
    {"time": 1535517658333, "duration": 688, "status": 200}
    {"time": 1535517658336, "duration": 689, "status": 200}
    {"time": 1535517658317, "duration": 690, "status": 200}
    {"time": 1535517658314, "duration": 694, "status": 200}
    {"time": 1535517658339, "duration": 696, "status": 200}
    {"time": 1535517658319, "duration": 697, "status": 200}
    {"time": 1535517658315, "duration": 701, "status": 200}
    {"time": 1535517658349, "duration": 701, "status": 200}
    {"time": 1535517658322, "duration": 703, "status": 200}
    {"time": 1535517658327, "duration": 703, "status": 200}
    {"time": 1535517658377, "duration": 704, "status": 200}
    {"time": 1535517658309, "duration": 708, "status": 200}
    {"time": 1535517658376, "duration": 710, "status": 200}
    {"time": 1535517658374, "duration": 711, "status": 200}
    {"time": 1535517658335, "duration": 717, "status": 200}
    {"time": 1535517658344, "duration": 720, "status": 200}
    {"time": 1535517658338, "duration": 728, "status": 200}
    {"time": 1535517658372, "duration": 728, "status": 200}
    {"time": 1535517658324, "duration": 732, "status": 200}
    {"time": 1535517658360, "duration": 748, "status": 200}
    {"time": 1535517658341, "duration": 753, "status": 200}
    {"time": 1535517658396, "duration": 797, "status": 200}
    {"time": 1535517658408, "duration": 801, "status": 200}
    {"time": 1535517658529, "duration": 938, "status": 200}
    {"time": 1535517658579, "duration": 994, "status": 200}
    {"time": 1535517658607, "duration": 996, "status": 200}
    {"time": 1535517658604, "duration": 1001, "status": 200}
    {"time": 1535517658611, "duration": 1006, "status": 200}
    {"time": 1535517658606, "duration": 1022, "status": 200}
    {"time": 1535517658707, "duration": 1104, "status": 200}
    {"time": 1535517658634, "duration": 1110, "status": 200}
    {"time": 1535517658772, "duration": 1166, "status": 200}
    {"time": 1535517658859, "duration": 1236, "status": 200}
    {"time": 1535517658956, "duration": 1348, "status": 200}
    {"time": 1535517659025, "duration": 1358, "status": 200}
    {"time": 1535517658958, "duration": 1368, "status": 200}
    {"time": 1535517658959, "duration": 1373, "status": 200}
    {"time": 1535517658985, "duration": 1408, "status": 200}
    {"time": 1535517658960, "duration": 1426, "status": 200}
    {"time": 1535517659349, "duration": 1445, "status": 200}
    {"time": 1535517659469, "duration": 1583, "status": 200}
    {"time": 1535517659283, "duration": 1694, "status": 200}
    {"time": 1535517659278, "duration": 1712, "status": 200}
    {"time": 1535517659620, "duration": 2033, "status": 200}
    {"time": 1535517660588, "duration": 2400, "status": 200}
    {"time": 1535517660353, "duration": 2819, "status": 200}
    {"time": 1535517660756, "duration": 3194, "status": 200}
    {"time": 1535517660752, "duration": 3214, "status": 200}
    {"time": 1535517661403, "duration": 3216, "status": 200}
    {"time": 1535517660889, "duration": 3316, "status": 200}
    {"time": 1535517661535, "duration": 3371, "status": 200}
    {"time": 1535517661407, "duration": 3848, "status": 200}
    {"time": 1535517661966, "duration": 4436, "status": 200}
    {"time": 1535517662355, "duration": 4463, "status": 200}
    {"time": 1535517662153, "duration": 4613, "status": 200}
    {"time": 1535517662336, "duration": 4814, "status": 200}
    {"time": 1535517664132, "duration": 6594, "status": 200}
    {"time": 1535517681367, "duration": 23480, "status": 200}
    {"time": 1535517683665, "duration": 26104, "status": 200}
    {"time": 1535517685281, "duration": 27744, "status": 200}
    {"time": 1535517691127, "duration": 33598, "status": 200}
    {"time": 1535517692933, "duration": 35454, "status": 200}
    {"time": 1535517693278, "duration": 35764, "status": 200}
    {"time": 1535517693337, "duration": 35812, "status": 200}
    {"time": 1535517693972, "duration": 36459, "status": 200}
    {"time": 1535517694212, "duration": 36701, "status": 200}
    {"time": 1535517694576, "duration": 37071, "status": 200}
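
A log like this can be summarized with a few lines of Python. The sketch below uses five lines copied from the log above and assumes that `time` is a millisecond epoch timestamp recorded when the response arrived, so that `time - duration` approximates when the request started. The log itself does not state this, so treat that part as an assumption; `summarize` is a hypothetical helper.

```python
import json
import statistics

# Five lines copied from the log above; in practice, read the whole log file.
LOG_LINES = """
{"time": 1535517657573, "duration": 50, "status": 200}
{"time": 1535517658312, "duration": 670, "status": 200}
{"time": 1535517659620, "duration": 2033, "status": 200}
{"time": 1535517681367, "duration": 23480, "status": 200}
{"time": 1535517694576, "duration": 37071, "status": 200}
""".strip().splitlines()


def summarize(lines):
    """Parse JSON log lines and summarize durations and estimated start times."""
    records = [json.loads(line) for line in lines if line.strip()]
    durations = sorted(r["duration"] for r in records)
    # Assumption: "time" marks the end of the request, so the request
    # started at roughly time - duration.
    starts = [r["time"] - r["duration"] for r in records]
    return {
        "count": len(records),
        "median_ms": statistics.median(durations),
        "max_ms": durations[-1],
        "start_spread_ms": max(starts) - min(starts),
    }


print(summarize(LOG_LINES))
```

For these five sample lines the estimated start times fall within 382 ms of each other, while the durations range from 50 ms to about 37 s. Under the stated assumption, that would be consistent with the requests being issued together and the slow ones waiting on something shared, rather than with individual slow servers.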

My spider:

from scrapy.spiders import Spider
from scrapy import Request
import pkgutil
from ...utils.parse import parse
from ...utils.errback_httpbin import errback_httpbin


class QuotesSpider(Spider):
    name = "validation_2"
    rotate_user_agent = True

    def start_requests(self):
        urls = pkgutil.get_data("qwarx_spiders", "resources/urls_100.txt").decode('utf-8').splitlines()
        for url in urls:
            yield Request(url=url, callback=self.parse, errback=self.errback_httpbin)

    def parse(self, response):
        return parse(self, response)

    def errback_httpbin(self, failure):
        return errback_httpbin(self, failure)

The parse method:

from ..items.broad import URL
from scrapy.exceptions import NotSupported


def getDomain(url):
    # Drop the scheme, then keep only the host (no path, query string, or port).
    host = url.split("://", 1)[-1].split("?")[0].split('/')[0].split(':')[0].lower()
    # Strip only a leading "www."; replace() would also hit "www." inside the host.
    return host[4:] if host.startswith('www.') else host


def parse(self, response):
    item = URL()
    ident = {}  # renamed from "id" to avoid shadowing the builtin

    ident['url'] = response.url
    ident['domain'] = getDomain(response.url)
    try:
        title = response.xpath("//title/text()").extract_first()
    except (AttributeError, NotSupported):
        # Non-text response (e.g. an image): nothing to extract, so stop here
        # instead of yielding None and failing on the next xpath() call.
        return
    ident['title'] = title.strip() if title is not None else None

    meta_names = response.xpath("//meta/@name").extract()
    meta_properties = response.xpath("//meta/@property").extract()
    meta = {}
    content = {}

    # Prefer the standard description, fall back to Open Graph, else empty.
    if 'description' in meta_names:
        meta['description'] = response.xpath("//meta[@name='description']/@content").extract_first()
    elif 'og:description' in meta_properties:
        meta['description'] = response.xpath("//meta[@property='og:description']/@content").extract_first()
    else:
        meta['description'] = ''

    # og:image is normally a property, but some pages declare it as a name.
    if 'og:image' in meta_names:
        meta['image'] = response.xpath("//meta[@name='og:image']/@content").extract_first()
    elif 'og:image' in meta_properties:
        meta['image'] = response.xpath("//meta[@property='og:image']/@content").extract_first()
    else:
        meta['image'] = ''

    # Keep the first four text paragraphs, each truncated to 150 characters.
    paragraphs = response.xpath('//p/text()').extract()
    content['p'] = [p.strip()[:150] for p in paragraphs[:4]] if paragraphs else None

    # Record the redirect chain whether or not the page has any <p> text.
    if 'redirect_urls' in response.meta:
        meta['redirect_urls'] = response.meta['redirect_urls']

    item['id'] = ident
    item['content'] = content
    item['meta'] = meta

    yield item

errback_httpbin:

from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError


def errback_httpbin(self, failure):
    # log all errback failures,
    # in case you want to do something special for some errors,
    # you may need the failure's type
    self.logger.error(repr(failure))

    # if isinstance(failure.value, HttpError):
    if failure.check(HttpError):
        # you can get the response
        response = failure.value.response
        self.logger.error('HttpError on %s', response.url)

    # elif isinstance(failure.value, DNSLookupError):
    elif failure.check(DNSLookupError):
        # this is the original request
        request = failure.request
        self.logger.error('DNSLookupError on %s', request.url)

    # elif isinstance(failure.value, TimeoutError):
    elif failure.check(TimeoutError):
        request = failure.request
        self.logger.error('TimeoutError on %s', request.url)

settings.py:

SPIDER_MODULES = ['qwarx_spiders.spiders.broad', 'qwarx_spiders.spiders.custom', 'qwarx_spiders.spiders.validation']
NEWSPIDER_MODULE = 'qwarx_spiders.spiders'

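SPIDER_MIDDLEWARES = {
    # Values are integer orders (or None to disable), not booleans; 500 is the
    # default order of OffsiteMiddleware in SPIDER_MIDDLEWARES_BASE.
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': 500,
}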

DOWNLOADER_MIDDLEWARES = {
    'qwarx_spiders.middlewares.FilterDomainbyLimitMiddleware': 200,
    'qwarx_spiders.middlewares.RotateUserAgentMiddleware': 110,
}

ITEM_PIPELINES = {
    'qwarx_spiders.pipelines.DuplicatesPipeline': 300,
}

EXTENSIONS = {
    'scrapy_dotpersistence.DotScrapyPersistence': 0
}

BOT_NAME = 'Qwarx'

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 ' \
             '(KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.3'

ROBOTSTXT_OBEY = False
LOG_LEVEL = 'INFO'

CONCURRENT_REQUESTS = 1000
REACTOR_THREADPOOL_MAXSIZE = 1000

DOWNLOAD_DELAY = 0

COOKIES_ENABLED = False
REDIRECT_ENABLED = True
AJAXCRAWL_ENABLED = True
AUTOTHROTTLE_ENABLED = False
RETRY_ENABLED = True
DOWNLOAD_TIMEOUT = 60
DNSCACHE_ENABLED=True
DNSCACHE_SIZE=100000

CRAWL_LIMIT_PER_DOMAIN = 100000

URLLENGTH_LIMIT = 180

USER_AGENT_CHOICES = [
    'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:23.0) Gecko/20100101 Firefox/23.0',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.62 Safari/537.36',
    'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; WOW64; Trident/6.0)',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.146 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.146 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20140205 Firefox/24.0 Iceweasel/24.3.0',
    'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:28.0) Gecko/20100101 Firefox/28.0',
    'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:28.0) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2',
]



1 Answer

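So I found a solution to my problem.
When scraping a large number of domains, I was getting a bunch of "false negatives": when running the validation crawl on the 10k URLs several times in a row, I would never get the same number of results.
But I have since set up a rotating proxy system (through Crawlera), and it is now completely stable.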
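
The answer does not show its configuration, and Crawlera handles rotation on its own side. To illustrate only the general idea of rotating proxies in Scrapy, here is a minimal sketch of a downloader middleware that assigns proxies round-robin through `request.meta['proxy']`, which is where Scrapy looks up the proxy for a request. `RotatingProxyMiddleware`, the `PROXY_LIST` setting, and the proxy URLs are all hypothetical, and this is not how Crawlera itself is wired in.

```python
import itertools
from types import SimpleNamespace


class RotatingProxyMiddleware:
    """Hypothetical downloader middleware that assigns proxies round-robin."""

    def __init__(self, proxies):
        self._proxies = itertools.cycle(proxies)

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST is a made-up setting name used only for this sketch.
        return cls(crawler.settings.getlist("PROXY_LIST"))

    def process_request(self, request, spider):
        # Leave requests alone if a proxy was already chosen for them.
        if "proxy" not in request.meta:
            request.meta["proxy"] = next(self._proxies)
        return None  # continue through the rest of the middleware chain


# Stand-in objects with a .meta dict, so the sketch runs without Scrapy.
middleware = RotatingProxyMiddleware(
    ["http://proxy-a.example:8010", "http://proxy-b.example:8010"]
)
fake_requests = [SimpleNamespace(meta={}) for _ in range(3)]
for fake_request in fake_requests:
    middleware.process_request(fake_request, spider=None)

print([r.meta["proxy"] for r in fake_requests])
```

In a real project such a class would be registered in `DOWNLOADER_MIDDLEWARES`; a managed service such as Crawlera replaces this kind of hand-rolled rotation.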

Answered on 2018-08-30T07:47:39.777