
This is strange. I wrote Scrapy code with a pipeline and used it to crawl a large amount of data, and it always ran fine. Today, when I re-ran the same code, it suddenly stopped working at all. Here are the details:

My spider - base_url_spider.py

import re
from bs4 import BeautifulSoup
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BaseURLSpider(CrawlSpider):
    '''
    This class is responsible for crawling globe and mail articles and their comments
    '''
    name = 'BaseURL'
    allowed_domains = ["www.theglobeandmail.com"]

    # seed urls
    url_path = r'../Sample_Resources/Online_Resources/sample_seed_urls.txt'
    start_urls = [line.strip() for line in open(url_path).readlines()]

    # Rules for including and excluding urls
    rules = (
        Rule(LinkExtractor(allow=r'\/opinion\/.*\/article\d+\/$'), callback="parse_articles"),
    )

    def __init__(self, **kwargs):
        '''
        :param kwargs:
        Read user arguments and initialize variables
        '''
        # Pass any spider arguments through to the CrawlSpider base class
        super().__init__(**kwargs)

        self.headers = {'User-Agent': 'Mozilla/5.0',
                        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
                        'X-Requested-With': 'XMLHttpRequest'}
        self.ids_seen = set()


    def parse_articles(self, response):
        # Match opinion article URLs and capture the numeric article id
        article_ptn = r"http://www.theglobeandmail.com/opinion/(.*?)/article(\d+)/"
        resp_url = response.url
        article_m = re.match(article_ptn, resp_url)
        if article_m is not None:
            article_id = article_m.group(2)
            if article_id not in self.ids_seen:
                self.ids_seen.add(article_id)

                soup = BeautifulSoup(response.text, 'html.parser')
                content = soup.find('div', {"class": "column-2 gridcol"})
                if content is not None:
                    text = content.findAll('p', {"class": ''})
                    if len(text) > 0:
                        print('*****In Spider, Article ID*****', article_id)
                        print('***In Spider, Article URL***', resp_url)

                        yield {article_id: {"article_url": resp_url}}
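
For reference, a quick sanity check (a standalone sketch, not part of the spider) that the allow pattern in the rule above matches one of the opinion URLs that appears in the logs further below:

import re

# Same allow pattern as in the LinkExtractor rule above
pattern = r'\/opinion\/.*\/article\d+\/$'
url = ('http://www.theglobeandmail.com/opinion/'
       'austerity-is-here-all-that-matters-is-the-math/article627532/')
print(re.search(pattern, url) is not None)  # prints True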

If I run only my spider code, via the command line scrapy runspider --logfile ../logs/log.txt ScrapeNews/spiders/article_base_url_spider.py, it can crawl the start_urls.

My pipeline - base_url_pipelines.py

import json


class BaseURLPipelines(object):

    def process_item(self, item, spider):
        article_id = list(item.keys())[0]
        print("****Pipeline***", article_id)
        f_name = r'../Sample_Resources/Online_Resources/sample_base_urls.txt'
        with open(f_name, 'a') as out:
            json.dump(item, out)
            out.write("\n")

        return item
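
For reference, this pipeline appends each scraped item to sample_base_urls.txt as one JSON object per line. Using one of the items shown in the working log further below, a written line would look like:

{"543479": {"article_url": "http://www.theglobeandmail.com/opinion/were-ripe-for-a-great-disruption-in-higher-education/article543479/"}}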

My settings - settings.py. These lines are not commented out:

BOT_NAME = 'ScrapeNews'
SPIDER_MODULES = ['ScrapeNews.spiders']
NEWSPIDER_MODULE = 'ScrapeNews.spiders'
ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 3
ITEM_PIPELINES = {
'ScrapeNews.article_comment_pipelines.ArticleCommentPipeline': 400,
}
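
For reference, ITEM_PIPELINES maps the dotted path of a pipeline class to a priority (lower values run earlier). Registering the BaseURLPipelines class shown above would look roughly like the sketch below; the module path ScrapeNews.base_url_pipelines is an assumption based on the file name, not something stated in the question:

# settings.py (sketch -- module path assumed from the file name base_url_pipelines.py)
ITEM_PIPELINES = {
    'ScrapeNews.base_url_pipelines.BaseURLPipelines': 400,
}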

My scrapy.cfg. This file should point to where the settings file is:

# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.org/en/latest/deploy.html

[settings]
default = ScrapeNews.settings

[deploy]
#url = http://localhost:6800/
project = ScrapeNews

All of these pieces used to work together just fine.

However, when I re-ran the code today, I got log output like this:

2017-04-24 14:14:15 [scrapy] INFO: Enabled item pipelines:
['ScrapeNews.article_comment_pipelines.ArticleCommentPipeline']
2017-04-24 14:14:15 [scrapy] INFO: Spider opened
2017-04-24 14:14:15 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-04-24 14:14:15 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-04-24 14:14:15 [scrapy] DEBUG: Crawled (200) <GET http://www.theglobeandmail.com/robots.txt> (referer: None)
2017-04-24 14:14:20 [scrapy] DEBUG: Crawled (200) <GET http://www.theglobeandmail.com/opinion/austerity-is-here-all-that-matters-is-the-math/article627532/> (referer: None)
2017-04-24 14:14:24 [scrapy] DEBUG: Crawled (200) <GET http://www.theglobeandmail.com/opinion/ontario-can-no-longer-hide-from-taxes-restraint/article546776/> (referer: None)
2017-04-24 14:14:24 [scrapy] DEBUG: Filtered duplicate request: <GET http://www.theglobeandmail.com/life/life-video/video-what-was-starbucks-thinking-with-their-new-unicorn-frappuccino/article34787773/> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2017-04-24 14:14:31 [scrapy] DEBUG: Crawled (200) <GET http://www.theglobeandmail.com/opinion/for-palestinians-the-other-enemy-is-their-own-leadership/article15019936/> (referer: None)
2017-04-24 14:14:32 [scrapy] DEBUG: Crawled (200) <GET http://www.theglobeandmail.com/opinion/would-quebecs-partitiongo-back-on-the-table/article17528694/> (referer: None)
2017-04-24 14:14:36 [scrapy] INFO: Received SIG_UNBLOCK, shutting down gracefully. Send again to force 
2017-04-24 14:14:36 [scrapy] INFO: Closing spider (shutdown)
2017-04-24 14:14:36 [scrapy] INFO: Received SIG_UNBLOCK twice, forcing unclean shutdown

In contrast to the abnormal log output above, if I run only my spider, the log is fine and looks like this:

2017-04-24 14:21:20 [scrapy] DEBUG: Scraped from <200 http://www.theglobeandmail.com/opinion/were-ripe-for-a-great-disruption-in-higher-education/article543479/>
{'543479': {'article_url': 'http://www.theglobeandmail.com/opinion/were-ripe-for-a-great-disruption-in-higher-education/article543479/'}}
2017-04-24 14:21:20 [scrapy] DEBUG: Scraped from <200 http://www.theglobeandmail.com/opinion/saint-making-the-blessed-politics-of-canonization/article624413/>
{'624413': {'article_url': 'http://www.theglobeandmail.com/opinion/saint-making-the-blessed-politics-of-canonization/article624413/'}}
2017-04-24 14:21:20 [scrapy] INFO: Closing spider (finished)
2017-04-24 14:21:20 [scrapy] INFO: Dumping Scrapy stats:

In the abnormal log output above, I noticed something robot-related:

2017-04-24 14:14:15 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-04-24 14:14:15 [scrapy] DEBUG: Crawled (200) <GET http://www.theglobeandmail.com/robots.txt> (referer: None)

GET http://www.theglobeandmail.com/robots.txt never appears anywhere in the normal log output. And when I open that URL in a browser, I don't really understand what it is. So I'm not sure whether this is because the site I'm scraping has added some robots rules?

Or does the problem come from "Received SIG_UNBLOCK, shutting down gracefully"? I haven't found any solution for that either.

The command line I use to run the code is scrapy runspider --logfile ../../Logs/log.txt base_url_spider.py

Do you know how to deal with this problem?


1 Answer


robots.txt is a file a website uses to tell web crawlers whether they are allowed to crawl the site. You set ROBOTSTXT_OBEY = True, which means Scrapy will obey the rules in that robots.txt file.

Change it to ROBOTSTXT_OBEY = False and it should work.
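
For example, in settings.py:

# settings.py -- stop Scrapy from fetching and enforcing robots.txt
ROBOTSTXT_OBEY = False

The same setting can also be scoped to a single spider through its custom_settings class attribute, e.g. custom_settings = {'ROBOTSTXT_OBEY': False}.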

Answered on 2017-04-25T01:33:25.453