
Possible duplicate:
Ethics of robots.txt

I am trying to use Mechanize to automate some work on a website. I managed to get around the above error by using br.set_handle_robots(False). How ethical is it to use this?
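For reference, this is roughly my setup (a minimal sketch; the URL is a placeholder):

```python
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)  # skip fetching/obeying robots.txt -- the line I am asking about

response = br.open('https://example.com/some/page')  # placeholder URL
print(response.read()[:200])
```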

If it isn't, then I would rather obey robots.txt, but the site I am trying to mechanize blocks me from even viewing its robots.txt. Does that mean robots are not allowed to access it? What should my next step be?

Thanks in advance.


1 Answer


For your first question, see Ethics of robots.txt

You need to keep in mind the purpose of robots.txt. Robots that crawl a site can wreak havoc on it and effectively mount a DoS attack. So if your "automation" is crawling at all, or is downloading more than just a few pages every day or so, and the site has a robots.txt file that excludes you, then you should honor it.
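If you do decide to honor it, you can check programmatically whether a given path is allowed before fetching it. A minimal sketch using Python's standard robots.txt parser (urllib.robotparser in Python 3; the module is just robotparser in Python 2, and the URLs and user-agent string here are placeholders):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('https://example.com/robots.txt')  # placeholder site
rp.read()  # fetch and parse robots.txt

# Ask whether our bot may fetch a particular page before opening it
if rp.can_fetch('my-script/0.1', 'https://example.com/some/page'):
    print('Allowed by robots.txt')
else:
    print('Disallowed by robots.txt')
```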

Personally, I find this to be a bit of a grey area. If my script works at the same pace as a human using a browser and only grabs a few pages, then, in the spirit of the robots exclusion standard, I have no problem scraping those pages as long as it doesn't hit the site more than once a day. Please read that last sentence carefully before judging me; I feel it is perfectly logical, though many people may disagree with me there.
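If you want your script to stay within that "human pace", a simple self-imposed delay between requests is usually enough. A sketch (the page list and the 10-second delay are arbitrary choices, not anything the site mandates):

```python
import time
import mechanize

br = mechanize.Browser()
pages = ['https://example.com/a', 'https://example.com/b']  # placeholder URLs

for url in pages:
    response = br.open(url)
    # ... process response.read() here ...
    time.sleep(10)  # pause so the load resembles a human clicking through pages
```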

For your second question, web servers can return a 403 based on the User-Agent attribute of the HTTP headers sent with your request. To have your script mimic a browser, you have to misrepresent yourself: change the User-Agent header to match one used by a mainstream web browser (e.g., Firefox, IE, Chrome). Right now it probably says something like 'Mechanize'.
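In Mechanize you can do that through the Browser object's addheaders attribute. A sketch, where the User-Agent string is just an example Firefox string and the URL is a placeholder:

```python
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)

# Present the script as a mainstream browser instead of the default
# library User-Agent string.
br.addheaders = [('User-Agent',
                  'Mozilla/5.0 (Windows NT 6.1; rv:15.0) Gecko/20100101 Firefox/15.0')]

response = br.open('https://example.com/')  # placeholder URL
print(response.code)
```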

Some sites are more sophisticated than that and have other methods for detecting non-human visitors. In that case, give up because they really don't want you accessing the site in that manner.

answered 2012-08-31T01:48:03.657