
I'm getting a 406 error with Mechanize when trying to open a URL:

import mechanize
import urllib2

mech = mechanize.Browser()

# urls is a list of URL strings built elsewhere
for url in urls:
    # make sure the URL has a scheme before opening it
    if "http://" not in url:
        url = "http://" + url
    print url
    try:
        page = mech.open(url)
    except urllib2.HTTPError, e:
        print "there was an error opening the URL, logging it"
        print e.code
        logfile = open("log/urlopenlog.txt", "a")
        logfile.write(url + "," + "couldn't open this page" + "\n")
        continue
    else:
        print "opening this URL..."
        page = mech.open(url)

Any idea what could cause a 406 error? If I go to the offending URL, I can open it fine in my browser.


2 Answers


Try adding headers to your request based on what your browser sends; start with an Accept header (a 406 usually means the server doesn't like what you say you're willing to accept).

See "Adding headers" in the documentation.

import mechanize

req = mechanize.Request(url)
req.add_header('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8')
page = mechanize.urlopen(req)

The Accept header value there is based on what Chrome sends.
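
If you are opening pages through a mechanize.Browser instance, as the question's loop does with mech.open, the same headers can also be attached to the browser itself through its addheaders list. A minimal sketch, assuming mech is that Browser and the URL is just a placeholder:

import mechanize

mech = mechanize.Browser()
# addheaders is a list of (name, value) pairs sent with every request
mech.addheaders = [
    ('User-Agent', 'Mozilla/5.0'),
    ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'),
]

page = mech.open("http://example.com")  # placeholder URL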

Answered on 2012-12-22T21:53:43.813

If you want to know which headers your browser sends, this page will show you: https://www.whatismybrowser.com/detect/what-http-headers-is-my-browser-sending

The 'Accept' and 'User-Agent' headers should be enough. Here is what I did to get rid of the error:

import mechanize

# Headers that mimic a regular browser so web security systems
# don't block the request when the URL is opened
headers = {'User-Agent': 'Mozilla/5.0',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'}

# Loop through the list of URLs and open each one with the headers attached
for url in URLs:
    req = mechanize.Request(url, headers=headers)
    page = mechanize.urlopen(req)

You can also try importing the 'urllib2' or 'urllib' library to open these URLs; the syntax is the same.
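
For example, a minimal sketch of the urllib2 variant under Python 2, where the URL is just a placeholder; urllib2.Request accepts the same headers dict:

import urllib2

headers = {'User-Agent': 'Mozilla/5.0',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'}

# urllib2.Request also takes a headers dict, so the call looks the same
req = urllib2.Request("http://example.com", headers=headers)  # placeholder URL
page = urllib2.urlopen(req)
print page.read()[:200]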

Answered on 2015-10-26T00:43:03.257