python - Python urlparse: small issue

Question

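I am building an application that parses HTML and extracts the images from it. Parsing is easy with Beautiful Soup, and downloading the HTML and the images works fine with urllib2.

I do have a problem with urlparse, though: I cannot build an absolute URL from a relative path. The problem is best explained with an example: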

>>> import urlparse
>>> urlparse.urljoin("http://www.example.com/", "../test.png")
'http://www.example.com/../test.png'

As you can see, urlparse does not remove the ../. This causes a problem when I try to download the image:

HTTPError: HTTP Error 400: Bad Request

Is there a way to fix this in urllib?

score 3 · Accepted Answer

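".." takes you up one directory ("." is the current directory), so combining it with a URL that consists only of the domain name does not make much sense. Maybe what you need is: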

>>> urlparse.urljoin("http://www.example.com","./test.png")
'http://www.example.com/test.png'
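
To illustrate why ".." only makes sense when the base URL has directories to climb out of, here is a small sketch. The base URL with the /a/b/ directories is a made-up example, not something from the question, and the import fallback is there only so the snippet runs on both Python 2 and Python 3.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2, as in the question

# Hypothetical page URL that sits two directories below the site root.
base = "http://www.example.com/a/b/page.html"

# ".." is resolved against the page's directory (/a/b/) and moves up to /a/.
up_one = urljoin(base, "../test.png")

# "." stays in the page's directory (/a/b/).
same_dir = urljoin(base, "./test.png")

print(up_one)
print(same_dir)
```

With a base URL like the one in the question there is no parent directory to move up to, which is why the "../" has nowhere to go.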

score 2 · Accepted Answer

I think the best you can do is to parse the original URL first and check its path component. A simple test is

if len(urlparse.urlparse(baseurl).path) > 1:

Then you can combine it with the slicing that demas suggested. For example:

# Skip the leading ".." only when the base URL has no path beyond "/".
start_offset = 2 if len(urlparse.urlparse(baseurl).path) <= 1 else 0
img_url = urlparse.urljoin("http://www.example.com/", "../test.png"[start_offset:])

That way, you will never try to go to the parent of the root URL.
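
The check above can be wrapped in a small helper. This is only a sketch: the function name join_from_root_safe is made up, and it assumes the relative path uses plain "../" prefixes. Unlike the fixed [2:] slice, it strips every leading "../", which is a slight variation on the idea rather than the exact code from this answer.

```python
try:
    from urllib.parse import urljoin, urlparse  # Python 3
except ImportError:
    from urlparse import urljoin, urlparse  # Python 2, as in the question


def join_from_root_safe(base_url, relative):
    """Hypothetical helper: join relative onto base_url, dropping any
    leading '../' segments when base_url has no path beyond '/'."""
    if len(urlparse(base_url).path) <= 1:
        # The base is the site root, so there is no parent to go up to.
        while relative.startswith("../"):
            relative = relative[3:]
    return urljoin(base_url, relative)


root_result = join_from_root_safe("http://www.example.com/", "../test.png")
deep_result = join_from_root_safe("http://www.example.com/a/b/page.html", "../test.png")
print(root_result)
print(deep_result)
```

When the base URL has a deeper path, the helper leaves the relative path untouched and lets urljoin resolve the ".." normally.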

score 1 · Accepted Answer

If you want /../test to mean the same as /test, as it would for a path in a file system, you can use normpath():

>>> import posixpath
>>> import urlparse
>>> url = urlparse.urljoin("http://example.com/", "../test")
>>> p = urlparse.urlparse(url)
>>> path = posixpath.normpath(p.path)
>>> urlparse.urlunparse((p.scheme, p.netloc, path, p.params, p.query, p.fragment))
'http://example.com/test'
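
The same idea as a reusable function, as a rough sketch. The name join_and_normalize is hypothetical, and the trailing-slash handling is an addition that is not part of the answer: posixpath.normpath() drops a trailing "/", which can matter for directory URLs, so the sketch puts it back. On recent Python 3 releases urljoin may already remove the excess ".." itself, in which case the normalization step simply changes nothing.

```python
import posixpath

try:
    from urllib.parse import urljoin, urlparse, urlunparse  # Python 3
except ImportError:
    from urlparse import urljoin, urlparse, urlunparse  # Python 2


def join_and_normalize(base_url, relative):
    """Hypothetical helper: join the URLs, then collapse '.' and '..'
    segments in the path the way a POSIX file system path would be."""
    parts = urlparse(urljoin(base_url, relative))
    # An empty path would be normalized to ".", so treat it as the root.
    path = posixpath.normpath(parts.path) if parts.path else "/"
    # normpath() strips a trailing slash; restore it for directory URLs.
    if parts.path.endswith("/") and not path.endswith("/"):
        path += "/"
    return urlunparse(parts._replace(path=path))


print(join_and_normalize("http://example.com/", "../test"))
print(join_and_normalize("http://example.com/a/b/", "../c/"))
```

Only the path component is rewritten; the query string and fragment pass through unchanged.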

score 0 · Accepted Answer

urlparse.urljoin("http://www.example.com/", "../test.png"[2:])

Is this what you need?

Answered 2010-11-06T17:31:30.767
