python - urlparse.urljoin() does not handle invalid parent directories

Question

When constructing an absolute URL from a relative URL, is there a way to resolve "invalid" parent directories, or should I just use .replace()?

>>> from urlparse import urljoin
>>> url = urljoin('http://www.example.com/path/', '../../../index.html')
>>> url
'http://www.example.com/../../index.html'
>>> url.replace('../', '')
'http://www.example.com/index.html'

Better yet, is there a cleaner way to sanitize URLs when scraping in Python?

score 0 · Accepted Answer

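As you say, it doesn't make sense: you can't go up from the root directory. So it is hard to normalize the second part without knowing what the author intended. Only you know how to sanitize it properly. :)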
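If the intent is simply to discard any '..' that would climb above the root, one possible approach is to resolve the path segments yourself instead of calling .replace() on the whole URL. The following is only a sketch under assumptions: it uses Python 3's urllib.parse (the question uses the Python 2 urlparse module), the helper name join_clamped is made up for illustration, and it also collapses empty segments such as '//' in the path, which may not be what every site expects.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def join_clamped(base, relative):
    """Join `relative` onto `base`, dropping any '..' that would go above the root.

    Hypothetical helper, not part of the standard library.
    """
    joined = urlsplit(urljoin(base, relative))

    kept = []
    for segment in joined.path.split('/'):
        if segment == '..':
            # Go up one level if possible; silently ignore attempts to leave the root.
            if kept:
                kept.pop()
        elif segment not in ('', '.'):
            # Skip empty and '.' segments, keep everything else.
            kept.append(segment)

    path = '/' + '/'.join(kept)
    # Preserve a trailing slash so directory-style URLs stay directory-style.
    if kept and joined.path.endswith('/'):
        path += '/'

    return urlunsplit((joined.scheme, joined.netloc, path,
                       joined.query, joined.fragment))


print(join_clamped('http://www.example.com/path/', '../../../index.html'))
```

Unlike url.replace('../', ''), this only touches the path component and resolves each '..' against the segment before it, so a URL such as 'a/b/../c/' becomes '.../a/c/' rather than '.../a/b/c/'. Whether clamping at the root matches what the page author meant is still a guess, as the answer points out.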
