python - Cannot scrape some websites with Scrapy

Question

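I have been using Scrapy for the past two years. Now I have run into a problem I cannot diagnose. I am scraping about 80 sites. Most of them are scraped without issue, but about 6 are not scraped at all. I am using a RandomProxy middleware, a RotateUserAgent middleware, and Splash.

Can you help me figure out what the problem is? Then I will search for a solution myself. One of the sites that cannot be crawled is: http://proza.ru/avtor/miliku

The error is: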

link:http://proza.ru/avtor/miliku; message: Traceback (most recent call last): Failure: twisted.web._newclient.ResponseNeverReceived: [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion.>]

score 0 · Accepted Answer

I am not sure whether the robots.txt policy is what causes your error, but you can try disabling the "Obey robots.txt" rule in settings.py:

ROBOTSTXT_OBEY = False

Disabling this may put you in violation of the website's policy, so be careful!
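
For context, here is a minimal sketch of how that line might sit in settings.py alongside a few other built-in Scrapy settings that are often adjusted when a server drops connections. The setting names are real Scrapy settings, but the values are illustrative assumptions, and nothing in the question confirms that any of them resolves the ResponseNeverReceived error for proza.ru.

```python
# settings.py (sketch; values are illustrative assumptions, not tuned for any site)

# Do not fetch or honor robots.txt. Check the site's terms before doing this.
ROBOTSTXT_OBEY = False

# Scrapy's built-in RetryMiddleware is enabled by default and already retries
# connection-level failures such as ResponseNeverReceived and ConnectionLost.
RETRY_ENABLED = True
RETRY_TIMES = 5  # assumed value; Scrapy's default is 2

# Assumed throttling values, in case the server closes connections under load.
DOWNLOAD_DELAY = 2
CONCURRENT_REQUESTS_PER_DOMAIN = 1
```

If the failures persist with these values, the cause is likely elsewhere, for example in the proxy or Splash layer mentioned in the question, so each change is best tested on its own.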
