python - Preventing "hidden" redirects when using urlopen() in Python

Question

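I am doing some web scraping with BeautifulSoup and have run into a problem with one particular type of website when using urlopen. Every item on the site has its own page, and each item comes in several formats (e.g. 500 mL, 1L, 2L, ...).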

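When I open a product's URL (www.example.com/product1) in a web browser, I see a picture of the 500 mL format, information about it (price, quantity, flavor, etc.), and a list of all the other formats available for that item. If I click another format (e.g. 1L), the picture and the item information change, but the URL in the browser's address bar stays the same (www.example.com/product1). However, by inspecting the page's HTML I know that every format has its own unique URL (500 mL: www.example.com/product1/123; 1L: www.example.com/product1/456, ...). When I enter the 1L format's unique URL in my browser, I am automatically redirected to www.example.com/product1, but the picture and information shown on the page correspond to the 1L format. The HTML also contains the information I need about the 1L format.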

My problem appears when I open these unique URLs with urlopen.

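# Python 2 (urllib.urlopen and the print statement)
from bs4 import BeautifulSoup
from urllib import urlopen

# The URL needs an explicit scheme; without "http://" urlopen treats it as a local file path
webpage = urlopen('http://www.example.com/product1/456')
soup = BeautifulSoup(webpage, 'html.parser')
print soup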

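The information in soup does not match what my browser shows for the unique URL www.example.com/product1/456. Instead, it gives me the information for the format displayed by default on www.example.com/product1, which is always the 500 mL format.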

Is there a way to prevent this redirect so that I can use BeautifulSoup to capture the information contained in the HTML of the unique URL?

score 4 · Accepted Answer

import urllib2

class RedirectHandler(urllib2.HTTPRedirectHandler):
    # Instead of following the redirect, hand back the 3xx response itself.
    def http_error_302(self, req, fp, code, msg, headers):
        # HTTPError wraps the response, so its body can still be read with .read()
        result = urllib2.HTTPError(req.get_full_url(), code, msg, headers, fp)
        result.status = code
        return result
    # Treat the other redirect codes the same way
    http_error_301 = http_error_303 = http_error_307 = http_error_302

opener = urllib2.build_opener(RedirectHandler())
webpage = opener.open('http://www.example.com/product1/456')
...
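
The answer above targets Python 2's urllib2. For readers on Python 3, the same idea could be sketched roughly as follows with urllib.request. The names NoRedirectHandler and fetch_without_redirect are hypothetical, chosen only for illustration, and the sketch assumes the site really does answer the unique URL with an HTTP 3xx response rather than redirecting through JavaScript.

```python
import urllib.error
import urllib.request


class NoRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Hypothetical handler name: declines to follow any 3xx redirect."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None means "do not build a follow-up request";
        # urllib then falls through to its default error handling,
        # which raises HTTPError for the 3xx response.
        return None


def fetch_without_redirect(url):
    """Hypothetical helper: return (status, headers, body) without following redirects."""
    opener = urllib.request.build_opener(NoRedirectHandler())
    try:
        with opener.open(url) as response:
            return response.status, response.headers, response.read()
    except urllib.error.HTTPError as err:
        # The unfollowed redirect (or any 4xx/5xx) arrives here;
        # HTTPError is file-like, so the body is still readable.
        return err.code, err.headers, err.read()
```

Calling fetch_without_redirect('http://www.example.com/product1/456') would then return the redirect's own status code, its headers (including Location), and whatever body the server sent with it, which can be passed to BeautifulSoup. Whether that body actually contains the 1L details depends on the site.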