web-crawler - crawler4j: website bans my IP address for a few minutes after 20-30 seconds of crawling

Question

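I am trying to crawl the website mystore411.com using the open-source crawler4j.

The crawler works normally for a short time (about 20-30 seconds), and then the website bans my IP address for a few minutes before I can crawl again. I cannot figure out a solution.

I went through its robots.txt, and this is what I found: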

User-agent: Mediapartners-Google 
Disallow:

User-agent: *
Disallow: /js/
Disallow: /css/
Disallow: /images/

User-agent: Slurp
Crawl-delay: 1

User-agent: Baiduspider
Crawl-delay: 1

User-agent: MaxPointCrawler
Disallow: /

User-agent: YandexBot
Disallow: /

Please suggest any alternatives.

score 1 · Accepted Answer

I cannot tell you the exact reason they banned you, but I can list some common reasons an IP address gets banned.

1) The politeness delay in your crawl controller code may be too low.

  * Explanation:- The politeness delay is the gap you set between two consecutive
                  requests. The lower the delay, the more requests are sent to the
                  server in a given time, which increases its load, so keep an
                  appropriate politeness delay. crawler4j's default is 200 ms; set a
                  larger value with, for example, config.setPolitenessDelay(1000);

2) Reduce the number of crawler threads.

 * Explanation:- Almost the same reason as above.

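3) Do not crawl through robots.txt

 * Explanation:- In crawler4j, robots.txt handling is controlled by RobotstxtConfig, for
                 example robotstxtConfig.setEnabled(false) to stop the domain's robots.txt
                 rules from blocking URLs. It is not controlled by
                 config.setResumableCrawling(false), which only determines whether an
                 interrupted crawl can be resumed. Keep in mind that ignoring robots.txt
                 can itself be a reason for a ban.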

4) Try using a good user agent:

  * Explanation:- See https://en.wikipedia.org/wiki/User_agent. In crawler4j the string
                  is set with config.setUserAgentString(...).
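
The robots.txt in the question lists Crawl-delay: 1 for Slurp and Baiduspider. That is only a hint, not a published limit for other crawlers, but it suggests the site may expect roughly one request per second. A minimal sketch of turning that hint into a politeness delay follows; the CrawlDelayHint class and its methods are hypothetical helpers written for illustration and are not part of crawler4j.

```java
import java.util.Locale;

/** Hypothetical helper (not part of crawler4j): reads Crawl-delay hints from robots.txt text. */
final class CrawlDelayHint {

    /** Returns the largest Crawl-delay (in seconds) listed for any user agent, or -1 if none. */
    static double maxCrawlDelaySeconds(String robotsTxt) {
        double max = -1;
        for (String rawLine : robotsTxt.split("\\R")) {
            // Drop trailing comments and surrounding whitespace.
            int hash = rawLine.indexOf('#');
            String line = (hash >= 0 ? rawLine.substring(0, hash) : rawLine).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (!field.equals("crawl-delay")) {
                continue;
            }
            try {
                max = Math.max(max, Double.parseDouble(value));
            } catch (NumberFormatException e) {
                // Ignore malformed values and keep scanning.
            }
        }
        return max;
    }

    /** Picks a delay in ms: the robots.txt hint when it is larger than the fallback, else the fallback. */
    static int suggestedPolitenessDelayMs(String robotsTxt, int fallbackMs) {
        double seconds = maxCrawlDelaySeconds(robotsTxt);
        if (seconds < 0) {
            return fallbackMs;
        }
        return Math.max(fallbackMs, (int) Math.round(seconds * 1000));
    }

    public static void main(String[] args) {
        // Excerpt of the robots.txt shown in the question.
        String robotsTxt = String.join("\n",
            "User-agent: *",
            "Disallow: /js/",
            "",
            "User-agent: Slurp",
            "Crawl-delay: 1");
        int delayMs = suggestedPolitenessDelayMs(robotsTxt, 200);
        System.out.println("Suggested politeness delay: " + delayMs + " ms");
        // With crawler4j, this value would then be passed to config.setPolitenessDelay(delayMs).
    }
}
```

Taking the largest Crawl-delay across all user agents is a deliberately conservative choice. If the ban still occurs at that rate, the delay probably needs to go higher, or the ban is triggered by something else, such as the user agent.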
