python - How to check whether a URL is valid using `urlparse`?

Question

Before opening a URL to read data from it, I want to check whether the URL is valid.

I am using the urlparse function from the urlparse package:

if not bool(urlparse.urlparse(url).netloc):
    # do something like: open and read using urllib2

However, I noticed that some valid URLs are treated as broken, for example:

url = "upload.wikimedia.org/math/8/8/d/88d27d47cea8c88adf93b1881eda318d.png"

This URL is valid (I can open it in my browser).

Is there a better way to check whether a URL is valid?

score 13 · Accepted Answer

You can check whether the URL has a scheme:

>>> url = "no.scheme.com/math/12345.png"
>>> parsed_url = urlparse.urlparse(url)
>>> bool(parsed_url.scheme)
False

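If it does not, you can set the scheme yourself. Note that the host was parsed as part of the path, so the netloc stays empty and the result contains three slashes: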

>>> parsed_url.geturl()
'no.scheme.com/math/12345.png'
>>> parsed_url = parsed_url._replace(scheme="http")
>>> parsed_url.geturl()
'http:///no.scheme.com/math/12345.png'
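The triple slash above appears because urlparse only treats text as a netloc when it follows "//". A minimal sketch of one way around this, assuming Python 3 and a hypothetical helper name with_default_scheme (not part of the original answer): prefix the bare string with "//" and pass the fallback through urlparse's scheme argument.

```python
from urllib.parse import urlparse  # on Python 2: from urlparse import urlparse


def with_default_scheme(url, default_scheme="http"):
    """Return url unchanged if it has a scheme, else add default_scheme.

    Hypothetical helper for illustration only.
    """
    if urlparse(url).scheme:
        return url
    # A leading "//" makes urlparse read the following text as the netloc
    # instead of the path; the scheme argument fills in the missing scheme.
    return urlparse("//" + url.lstrip("/"), scheme=default_scheme).geturl()


print(with_default_scheme("no.scheme.com/math/12345.png"))
```

This only repairs the missing scheme; it does not confirm that the host exists. Inputs such as host:port without a scheme may be parsed as having a scheme, so they would need separate handling.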

score 6 · Accepted Answer

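TL;DR: You can't, actually. Every answer given so far misses one or more cases.

1. The string is google.com (invalid because it has no scheme, even though browsers assume http by default). urlparse will report no scheme and no netloc. So all([result.scheme, result.netloc, result.path]) seems to work for this case.
2. The string is http://google (invalid because .com is missing). urlparse will only be missing the path. Again, all([result.scheme, result.netloc, result.path]) seems to catch this case.
3. The string is http://google.com/ (correct). urlparse will populate scheme, netloc and path. So for this case all([result.scheme, result.netloc, result.path]) works fine.
4. The string is http://google.com (correct). urlparse will only be missing the path. So for this case all([result.scheme, result.netloc, result.path]) seems to give a false negative.

So from the cases above you can see that the closest thing to a solution is all([result.scheme, result.netloc, result.path]). But this only works when the URL contains a path (even if that is just the / path).

Even if you try to force a path (i.e. urlparse(urljoin(your_url, "/"))), you will still get a false positive in case 2.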
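The four cases can be reproduced with a short script. This is a sketch assuming Python 3; has_all_parts is a hypothetical name for the all([...]) check discussed above.

```python
from urllib.parse import urlparse


def has_all_parts(url):
    """Apply the all([scheme, netloc, path]) check from the cases above."""
    result = urlparse(url)
    return all([result.scheme, result.netloc, result.path])


candidates = [
    "google.com",          # case 1: no scheme, no netloc
    "http://google",       # case 2: no path
    "http://google.com/",  # case 3: all three parts present
    "http://google.com",   # case 4: no path, although the URL is fine
]

for candidate in candidates:
    parts = urlparse(candidate)
    print(candidate, (parts.scheme, parts.netloc, parts.path), has_all_parts(candidate))
```

Only case 3 passes the check, which shows that case 4 is rejected purely because of the empty path.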

Maybe something more involved, like:

from urllib.parse import urljoin, urlparse  # Python 3

final_url = urlparse(urljoin(your_url, "/"))
is_correct = (all([final_url.scheme, final_url.netloc, final_url.path])
              and len(final_url.netloc.split(".")) > 1)

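Maybe you also want to skip the scheme check and assume http when there is no scheme. But even this only gets you so far. Although it covers the cases above, it does not fully cover URLs that contain an IP address instead of a hostname. For that case you would have to verify that the IP is a valid one. And there are many more scenarios. See https://en.wikipedia.org/wiki/URL to think of more cases.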
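One possible way to fold the IP case into the check above is the standard-library ipaddress module. This is a sketch under stated assumptions, not a complete validator: it requires Python 3, looks_like_valid_url is a hypothetical name, and treating an all-numeric host as "must be a valid IP" is a choice made here for illustration.

```python
import ipaddress
from urllib.parse import urljoin, urlparse


def looks_like_valid_url(url):
    """Heuristic check: scheme, netloc and path present, plus a plausible host."""
    final_url = urlparse(urljoin(url, "/"))  # force a "/" path as in the answer
    if not all([final_url.scheme, final_url.netloc, final_url.path]):
        return False

    host = final_url.hostname or ""  # hostname drops the port and IPv6 brackets
    try:
        ipaddress.ip_address(host)   # accepts valid IPv4 and IPv6 literals
        return True
    except ValueError:
        pass

    labels = host.split(".")
    if all(label.isdigit() for label in labels):
        # Purely numeric host that is not a valid IP, e.g. 999.1.1.1
        return False
    return len(labels) > 1           # require at least one dot, as in the answer


print(looks_like_valid_url("http://127.0.0.1:8000/status"))
```

Hosts without a dot, such as localhost, are rejected by this heuristic, so it still misses legitimate cases.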

score 5 · Accepted Answer

You can try the function below, which checks the scheme, netloc and path values obtained after parsing the URL. It supports both Python 2 and 3.

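try:
    # Python 3
    from urllib.parse import urlparse
except ImportError:
    # Python 2
    from urlparse import urlparse

def url_validator(url):
    """Return True if url has a scheme, a netloc and a non-empty path."""
    try:
        result = urlparse(url)
        return all([result.scheme, result.netloc, result.path])
    except Exception:
        # urlparse raises ValueError on malformed input such as an
        # unbalanced IPv6 bracket; treat any parsing failure as invalid.
        return False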

score 1 · Accepted Answer

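A URL without a scheme is actually invalid; your browser is simply smart enough to suggest http:// as its scheme. Checking whether the URL lacks a scheme (not re.match(r'^[a-zA-Z]+://', url)) and prepending http:// to it may be a good solution.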
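The check described above could be wrapped up as follows. This is a minimal sketch; ensure_scheme and its default parameter are hypothetical names, and the pattern is the one from the answer, so it only recognises schemes made of letters followed by "://".

```python
import re

# Pattern from the answer: one or more ASCII letters followed by "://".
SCHEME_PATTERN = re.compile(r'^[a-zA-Z]+://')


def ensure_scheme(url, default="http"):
    """Prepend default + "://" when url does not start with a scheme."""
    if not SCHEME_PATTERN.match(url):
        return default + "://" + url
    return url


print(ensure_scheme("upload.wikimedia.org/math/8/8/d/88d27d47cea8c88adf93b1881eda318d.png"))
```

Schemes that contain digits or punctuation, or that do not use "://" (for example mailto:), are not matched by this pattern and would get http:// prepended.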
