url - 规范化/规范化 URL？

Question

我正在寻找一个库函数来规范化 Python 中的 URL，即删除路径中的“./”或“../”部分，或添加默认端口或转义特殊字符等。结果应该是指向同一网页的两个 URL 唯一的字符串。例如http://google.comandhttp://google.com:80/a/../应该返回相同的结果。

我更喜欢 Python 3 并且已经浏览了该urllib模块。它提供了拆分 URL 的功能，但没有将它们规范化。Java 有URI.normalize()做类似事情的功能（尽管它不认为默认端口 80 等于没有给定端口），但是有没有类似的东西是 python？

score 4 · Accepted Answer

这个怎么样：

In [1]: from urllib.parse import urljoin

In [2]: urljoin('http://example.com/a/b/c/../', '.')
Out[2]: 'http://example.com/a/b/'

受到这个问题的答案的启发。它不会规范化端口，但创建一个功能应该很简单。

score 4 · Accepted Answer

这就是我使用的，到目前为止它一直有效。您可以从 pip 获取 urlnorm。

请注意，我对查询参数进行了排序。我发现这是必不可少的。

from urlparse import urlsplit, urlunsplit, parse_qsl
from urllib import urlencode
import urlnorm

def canonizeurl(url):
    split = urlsplit(urlnorm.norm(url))
    path = split[2].split(' ')[0]

    while path.startswith('/..'):
        path = path[3:]

    while path.endswith('%20'):
        path = path[:-3]

    qs = urlencode(sorted(parse_qsl(split.query)))
    return urlunsplit((split.scheme, split.netloc, path, qs, ''))

score 2 · Accepted Answer

旧（已弃用）答案

[不再维护] urltools模块标准化多个斜杠.和..组件，而不会弄乱http://.

一旦你这样做（这不再起作用，因为作者重命名了 repo），用法如下：~~pip install urltools~~

print urltools.normalize('http://example.com:80/a////b/../c')
>>> 'http://example.com/a/c'

虽然该模块不再是 pip 可安装的，但它是一个文件，因此您可以重复使用它的一部分。

更新了 Python3 的答案

对于 Python3，请考虑urljoin从urllib.urlparse模块中使用。

from urllib.parse import urljoin

urljoin('https://stackoverflow.com/questions/10584861/', '../dinsdale')
# Out[17]: 'https://stackoverflow.com/questions/dinsdale'

score 1 · Accepted Answer

现在有一个专门解决这个问题的库url-normalize

它不仅仅按照文档规范化路径：

URI 标准化功能：

照顾 IDN 域。

始终以小写字符提供 URI 方案。

始终以小写字符提供主机（如果有）。

仅在必要时执行百分比编码。

百分比编码时始终使用大写的 A 到 F 字符。

防止点段出现在非相对 URI 路径中。

对于定义默认权限的方案，如果需要默认权限，请使用空权限。

对于将空路径定义为等效于“/”路径的方案，请使用“/”。

对于定义端口的方案，如果需要默认端口，请使用空端口

URI 的所有部分必须是来自 Unicode 字符串的 utf-8 编码 NFC

这是一个例子：

from url_normalize import url_normalize

url = 'http://google.com:80/a/../'
print(url_normalize(url))

这使：

http://google.com/

score 0 · Accepted Answer

在良好的开端之后，我编写了一个适合大多数网络常见案例的方法。

def urlnorm(base, link=''):
  '''Normalizes an URL or a link relative to a base url. URLs that point to the same resource will return the same string.'''
  new = urlparse(urljoin(base, url).lower())
  return urlunsplit((
    new.scheme,
    (new.port == None) and (new.hostname + ":80") or new.netloc,
    new.path,
    new.query,
    ''))

score 0 · Accepted Answer

我在上面使用了@Antony 的答案并使用了url-normalize库，但它有一个当前未修复的错误：当在没有方案的情况下发送 URL 时，不小心将其设置为 HTTPS。我编写了一个函数，通过将其设置为 HTTP 来包装和修复它：

from url_normalize import url_normalize
from urllib.parse import urlparse


def parse_url(url):
    return_val = url_normalize(url)
    wrong_default_prefix = "https://"
    new_default_prefix = "http://"
    # If the URL came with no scheme and the normalize function mistakenly 
    # set it to the HTTPS protocol, then fix it and set it to HTTP
    if urlparse(url).scheme.strip() == '' and return_val.startswith(wrong_default_prefix):
        return_val = new_default_prefix + return_val[len(wrong_default_prefix):]
    return return_val

url - 规范化/规范化 URL？

6 回答 6

旧（已弃用）答案

更新了 Python3 的答案

Related

Reference