python - Fetching a web page with urllib2

Question

I have a problem that is driving me crazy. I am using urllib2 to fetch many URLs. One of them sometimes returns the whole HTML page and sometimes does not. Here is my code:

import urllib2

def find_html(url):
    req = urllib2.Request(url)
    req.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.14) Gecko/20080404 Firefox/2.0.0.14')
    page_html = urllib2.urlopen(req).read()

    # str.find returns the index of the marker, or -1 if it is absent
    n = page_html.find("filter clearfix active")
    print "find element:", n

url = "http://it.hotels.com/ho113127/rome-cavalieri-waldorf-astoria-hotels-resorts-roma-italia/"
find_html(url)

Why does this happen? What am I doing wrong? (I don't want to use selenium for this URL, I want to use urllib2.)

score 4 · Accepted Answer

I get both 200 and 301 (Moved Permanently) responses from that URL, so the inconsistency comes from the server, not from your code.

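Since urllib2 follows redirects automatically, if you want to avoid processing the redirected page (which, if I understand correctly, does not contain the information you want), you have to check whether a redirect occurred: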

...
response = urllib2.urlopen(req)
if response.geturl() == url:
    # no redirect occurred
    print "no redirect"
else:
    # a redirect occurred because the url has changed
    print "redirected to:", response.geturl()

How you handle this depends on your exact setup and intent (for some URLs you may actually want to process the redirected page).
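For anyone on Python 3, where urllib2 was split into urllib.request and urllib.error, the same check can be wrapped in a small helper. This is only a sketch: the function name `fetch` and its `opener` parameter are my own inventions for illustration, and the exact string comparison assumes the server does not merely normalize the URL (for example by adding a trailing slash).

```python
# Python 3 sketch; in Python 2 the equivalent calls live in urllib2.
import urllib.request


def fetch(url, opener=urllib.request.urlopen):
    """Return (body, final_url, redirected) for url.

    `opener` is a hypothetical hook so the network call can be swapped out;
    by default it is the standard urllib.request.urlopen.
    """
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    response = opener(req)
    try:
        # geturl() reports the URL actually served after any redirects
        final_url = response.geturl()
        return response.read(), final_url, final_url != url
    finally:
        response.close()
```

You would then call `body, final_url, redirected = fetch(url)` and skip or retry the page whenever `redirected` is true, depending on what you want for that URL.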
