python - Reconstructing absolute URLs from relative URLs on a page

Question

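Given the absolute URL of a page and a relative link found within that page, is there a way to a) definitively reconstruct or b) make a best-effort attempt to reconstruct the absolute URL for the relative link?

In my case, I'm using Beautiful Soup to read an HTML file from a given URL, pulling the source out of every img tag, and trying to build a list of absolute URLs for the page's images.

Here's my Python function so far: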

def get_image_url(page_url, image_src):

    from urlparse import urlparse  # Python 2 module; in Python 3 this is urllib.parse
    # parsed = urlparse('http://user:pass@NetLoc:80/path;parameters?query=argument#fragment')
    parsed = urlparse(page_url)
    url_base = parsed.netloc
    url_path = parsed.path

    if image_src.find('http') == 0:
        # It's an absolute URL, do nothing.
        pass
    elif image_src.find('/') == 0:
        # If it's a root-relative URL, append it to the base URL:
        image_src = 'http://' + url_base + image_src
    else:
        # If it's a relative URL, ?
        pass

    return image_src

Note: the answer doesn't have to be in Python; I only need the logic.

score 43 · Accepted Answer

Very simple (the import shown is for Python 2; in Python 3, urljoin lives in urllib.parse):

>>> from urlparse import urljoin
>>> urljoin('http://mysite.com/foo/bar/x.html', '../../images/img.png')
'http://mysite.com/images/img.png'
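
Applied to the function in the question, urljoin can replace all three branches at once. The following is a minimal sketch, assuming Python 3; the function name is taken from the question, and the example URLs (including other.example and cdn.example) are made up for illustration.

```python
from urllib.parse import urljoin


def get_image_url(page_url, image_src):
    """Resolve image_src against the URL of the page it was found on."""
    # urljoin returns image_src unchanged when it is already absolute,
    # resolves a leading "/" against the site root, and resolves
    # anything else relative to the directory part of page_url.
    return urljoin(page_url, image_src)


page = 'http://mysite.com/foo/bar/x.html'
for src in ('http://other.example/a.png',   # already absolute
            '/images/img.png',              # root-relative
            'img.png',                      # relative to the page's directory
            '../../images/img.png',         # relative with parent references
            '//cdn.example/x.png'):         # scheme-relative
    print(get_image_url(page, src))
```

The last case is one the hand-written check for a leading "/" would likely get wrong, since a src starting with "//" names a different host rather than a path on the same site.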

score 19 · Accepted Answer

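Use urllib.parse.urljoin to resolve a (possibly relative) URL against a base URL.

However, the base URL of a web page is not necessarily the same as the URL you fetched the document from, because HTML allows a page to specify its preferred base URL via the BASE element. The logic you need is as follows: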

import urllib.parse

# `document` is the page parsed into a DOM document;
# `page_url` is the URL the page was fetched from.
base_url = page_url
head = document.getElementsByTagName('head')[0]
for base in head.getElementsByTagName('base'):
    if base.hasAttribute('href'):
        base_url = urllib.parse.urljoin(base_url, base.getAttribute('href'))
        # HTML5 4.2.3 "if there are multiple base elements with href
        # attributes, all but the first are ignored."
        break

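(If you are parsing XHTML, then in theory you should take the rather complicated XML Base specification into account. But you can probably get away without worrying about it, since hardly anyone really uses XHTML.)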
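
To tie the BASE handling back to the original goal of listing a page's image URLs, here is one possible end-to-end sketch. It uses the standard library's html.parser instead of Beautiful Soup or a DOM, purely so the example is self-contained. ImageCollector, absolute_image_urls and sample_html are hypothetical names introduced for illustration, and as a simplification the sketch accepts a base element anywhere in the document rather than only inside head.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Hypothetical helper: records img src values and the first base href."""

    def __init__(self):
        super().__init__()
        self.base_href = None
        self.sources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'base' and self.base_href is None and attrs.get('href'):
            # Only the first base element that has an href counts.
            self.base_href = attrs['href']
        elif tag == 'img' and attrs.get('src'):
            self.sources.append(attrs['src'])


def absolute_image_urls(page_url, html):
    collector = ImageCollector()
    collector.feed(html)
    # The base href may itself be relative, so resolve it against page_url.
    # With no base element, joining an empty string leaves page_url as is.
    base_url = urljoin(page_url, collector.base_href or '')
    return [urljoin(base_url, src) for src in collector.sources]


# Made-up sample page for illustration.
sample_html = ('<html><head><base href="/static/"></head><body>'
               '<img src="a.png"><img src="/b.png">'
               '<img src="http://cdn.example/c.png"></body></html>')
print(absolute_image_urls('http://mysite.com/foo/x.html', sample_html))
```

With the base element present, a.png resolves under /static/ rather than under /foo/, which is the difference this answer is pointing out.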
