
I am running parsechecker against the URL https://www.nicobuyscars.com and get this output: Fetch failed with protocol status: exception(16), lastModified=0: Http code=403, url=https://www.nicobuyscars.com

What is the problem, and how can I resolve it? I tried changing the agent name, but that did not help.


1 Answer


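It looks like the server is blocking requests based on the User-Agent request header. The behavior can be reproduced with another HTTP client (wget): the request that sends a Nutch-style User-Agent is rejected with 403, while the same request with wget's default User-Agent succeeds.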

$> wget --header='User-Agent: mycrawler/Nutch-1.17' https://www.nicobuyscars.com
--2020-09-25 11:08:19--  https://www.nicobuyscars.com/
Resolving www.nicobuyscars.com (www.nicobuyscars.com)... 205.147.88.151
Connecting to www.nicobuyscars.com (www.nicobuyscars.com)|205.147.88.151|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2020-09-25 11:08:19 ERROR 403: Forbidden.

$> wget https://www.nicobuyscars.com
--2020-09-25 11:08:27--  https://www.nicobuyscars.com/
Resolving www.nicobuyscars.com (www.nicobuyscars.com)... 205.147.88.151
Connecting to www.nicobuyscars.com (www.nicobuyscars.com)|205.147.88.151|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘index.html’

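In any case, use polite settings for Nutch: a large fetcher.server.delay, keep respecting robots.txt, and so on. The server has most likely implemented additional heuristics to detect and block bots, so changing the agent name alone may not be enough.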
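As a rough illustration of polite settings, here is a minimal sketch of a conf/nutch-site.xml override. The property names (http.agent.name, http.agent.url, http.agent.email, fetcher.server.delay) are standard Nutch properties, but every value is an assumption chosen for illustration: the agent name is taken from the wget test in this answer, the URL and email are placeholders, and the 10-second delay is arbitrary rather than a recommended figure.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Crawler identity sent in the User-Agent header.
       "mycrawler" is the name used in the wget test; use your own. -->
  <property>
    <name>http.agent.name</name>
    <value>mycrawler</value>
  </property>
  <!-- Placeholder contact details so a site operator can reach you. -->
  <property>
    <name>http.agent.url</name>
    <value>https://example.org/crawler-info</value>
  </property>
  <property>
    <name>http.agent.email</name>
    <value>crawler-admin@example.org</value>
  </property>
  <!-- Seconds to wait between successive requests to the same server.
       10.0 is an illustrative value, not a tuned one. -->
  <property>
    <name>fetcher.server.delay</name>
    <value>10.0</value>
  </property>
</configuration>
```

None of this guarantees the 403 goes away. If the site rejects the crawler on purpose, contacting the operator is likely the more reliable route.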

answered 2020-09-25T09:13:45.040