python - How can I programmatically retrieve edit history pages from MusicBrainz using Python?

Question

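I am trying to programmatically retrieve edit history pages from the MusicBrainz website. (musicbrainzngs is a library for the MB web service, and the edit history is not accessible through the web service.) To do this, I need to log in to the MB website with my username and password.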

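I tried the mechanize module: using the second form on the login page (the first is the search form), I submitted my username and password. Judging from the response, I seem to have logged in successfully; however, a further request for an edit history page raises an exception: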

mechanize._response.httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt

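I understand the exception and the reason for it. I take full responsibility for not abusing the site (after all, any usage will be tagged with my username); I just want to avoid manually opening a page, saving the HTML, and running a script on the saved HTML. Can I get past the 403 error?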

score 2 · Accepted Answer

The better solution is to respect the robots.txt file and simply download the edit data itself rather than screen-scrape MusicBrainz. You can download the complete edit history here:

ftp://ftp.musicbrainz.org/pub/musicbrainz/data/fullexport

Look for the file mbdump-edit.tar.bz2.
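Once the file is downloaded, it can be inspected from Python without unpacking it by hand. The following is only a minimal sketch using the standard-library tarfile module; the helper name list_dump_members is made up for illustration, and the names of the files inside the archive are not stated in this answer, so the code just lists whatever it finds.

```python
import tarfile


def list_dump_members(archive_path, limit=20):
    """Return up to `limit` (name, size) pairs for regular files in a .tar.bz2 archive.

    `archive_path` is assumed to point at a locally downloaded copy of the dump,
    e.g. mbdump-edit.tar.bz2. The archive layout is not assumed; entries are
    reported in the order they appear.
    """
    members = []
    # "r:bz2" opens a bzip2-compressed tar archive for reading.
    with tarfile.open(archive_path, mode="r:bz2") as archive:
        for member in archive:
            if member.isfile():
                members.append((member.name, member.size))
            if len(members) >= limit:
                break
    return members
```

Calling list_dump_members("mbdump-edit.tar.bz2") should show which files hold the edit data; an individual member can then be read with archive.extractfile(member) instead of extracting the whole dump to disk.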

And, as the leader of the MusicBrainz team, I would like to ask you to respect robots.txt and download the edit data. That's one of the reasons why we make the edit data downloadable.

Thanks!

score 1 · Accepted Answer

If you want to circumvent the site's robots.txt, you can do so by telling your mechanize.Browser to ignore the robots.txt file.

import mechanize
br = mechanize.Browser()
br.set_handle_robots(False)  # do not fetch or obey robots.txt

Additionally, you might want to alter your browser's user agent so you don't look like a robot:

br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

Please be aware that when doing this, you're actually tricking the website into thinking you're a valid client.
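For context, the check that produces the "disallowed by robots.txt" error is performed on the client side. Below is a rough sketch of that kind of check using the standard-library urllib.robotparser, not mechanize's own implementation. The rules and URLs are hypothetical and are not the real MusicBrainz robots.txt; is_allowed is an illustrative helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; NOT the real MusicBrainz robots.txt.
EXAMPLE_RULES = """\
User-agent: *
Disallow: /edit/
Allow: /
"""


def is_allowed(rules_text, user_agent, url):
    """Return True if `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    # parse() takes the robots.txt content as an iterable of lines.
    parser.parse(rules_text.splitlines())
    return parser.can_fetch(user_agent, url)
```

With these example rules, is_allowed(EXAMPLE_RULES, "my-script", "https://example.org/edit/123") is False, while a URL outside /edit/ is allowed. Running a check like this before each request is one way to keep a script within a site's stated rules.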
