I have a spider for LinkedIn. It runs fine on my local machine, but when I deploy it on Scrapinghub I get this error:

Error downloading <GET https://www.linkedin.com/>: Connection was refused by other side: 111: Connection refused.

The full log from Scrapinghub is:

0:  2018-08-30 12:58:34 INFO    Log opened.
1:  2018-08-30 12:58:34 INFO    [scrapy.log] Scrapy 1.0.5 started
2:  2018-08-30 12:58:34 INFO    [scrapy.utils.log] Scrapy 1.0.5 started (bot: facebook_stats)
3:  2018-08-30 12:58:34 INFO    [scrapy.utils.log] Optional features available: ssl, http11, boto
4:  2018-08-30 12:58:34 INFO    [scrapy.utils.log] Overridden settings: {'NEWSPIDER_MODULE': 'facebook_stats.spiders', 'STATS_CLASS': 'sh_scrapy.stats.HubStorageStatsCollector', 'LOG_LEVEL': 'INFO', 'SPIDER_MODULES': ['facebook_stats.spiders'], 'RETRY_TIMES': 10, 'RETRY_HTTP_CODES': [500, 503, 504, 400, 403, 404, 408], 'BOT_NAME': 'facebook_stats', 'MEMUSAGE_LIMIT_MB': 950, 'DOWNLOAD_DELAY': 1, 'TELNETCONSOLE_HOST': '0.0.0.0', 'LOG_FILE': 'scrapy.log', 'MEMUSAGE_ENABLED': True, 'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64; rv:7.0.1) Gecko/20100101 Firefox/7.7'}
5:  2018-08-30 12:58:34 INFO    [scrapy.log] HubStorage: writing items to https://storage.scrapinghub.com/items/341545/3/9
6:  2018-08-30 12:58:34 INFO    [scrapy.middleware] Enabled extensions: CoreStats, TelnetConsole, MemoryUsage, LogStats, StackTraceDump, CloseSpider, SpiderState, AutoThrottle, HubstorageExtension
7:  2018-08-30 12:58:35 INFO    [scrapy.middleware] Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
8:  2018-08-30 12:58:35 INFO    [scrapy.middleware] Enabled spider middlewares: HubstorageMiddleware, HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
9:  2018-08-30 12:58:35 INFO    [scrapy.middleware] Enabled item pipelines: CreditCardsPipeline
10: 2018-08-30 12:58:35 INFO    [scrapy.core.engine] Spider opened
11: 2018-08-30 12:58:36 INFO    [scrapy.extensions.logstats] Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
12: 2018-08-30 12:58:36 INFO    TelnetConsole starting on 6023
13: 2018-08-30 12:59:32 ERROR   [scrapy.core.scraper] Error downloading <GET https://www.linkedin.com/>: Connection was refused by other side: 111: Connection refused.
14: 2018-08-30 12:59:32 INFO    [scrapy.core.engine] Closing spider (finished)
15: 2018-08-30 12:59:33 INFO    [scrapy.statscollectors] Dumping Scrapy stats:
16: 2018-08-30 12:59:34 INFO    [scrapy.core.engine] Spider closed (finished)
17: 2018-08-30 12:59:34 INFO    Main loop terminated.

How can I fix this?

1 Answer

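LinkedIn prohibits scraping:

Prohibited Software and Extensions

LinkedIn is committed to keeping its members' data safe and its website free from fraud and abuse. In order to protect our members' data and our website, we don't permit the use of any third-party software, including "crawlers", bots, browser plug-ins, or browser extensions (also called "add-ons"), that scrapes, modifies the appearance of, or automates activity on LinkedIn's website. Such tools violate the User Agreement, including, but not limited to, many of the "Don'ts" listed in Section 8.2…

It stands to reason that they may be actively blocking connections from Scrapinghub and similar services. That would also explain why the same spider works from your local machine: the connection is refused at the TCP level (errno 111) before any HTTP request is sent, so your spider's settings, such as the user agent, never come into play.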
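If you want to confirm that the refusal happens at the network level rather than in your spider, a plain TCP probe run from both environments can help. The following is only a minimal sketch using Python's standard `socket` module; the `probe` helper name and its return strings are my own illustration, not part of Scrapy or Scrapinghub.

```python
import socket


def probe(host, port=443, timeout=5.0):
    """Attempt a plain TCP connection and classify the outcome.

    Hypothetical helper for illustration: no HTTP request is sent,
    so headers and user agents play no role in the result.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The errno 111 case shown in the log above.
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        # DNS failures, unreachable networks, and so on.
        return "error: %s" % exc


if __name__ == "__main__":
    print(probe("www.linkedin.com"))
```

If this reports "open" locally but "refused" from the hosted environment, the block is tied to the source address, and nothing in the spider's settings will change that.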

Answered 2018-08-30T13:29:50.743