python - How do I validate a URL in Python? (Malformed or not)

Question

I get a url from the user, and I have to reply with the fetched HTML.

How can I check whether the URL is malformed?

For example:

url = 'google' # Malformed
url = 'google.com' # Malformed
url = 'http://google.com' # Valid
url = 'http://google' # Malformed

score 192 · Accepted Answer

Use the validators package:

>>> import validators
>>> validators.url("http://google.com")
True
>>> validators.url("http://google")
ValidationFailure(func=url, args={'value': 'http://google', 'require_tld': True})
>>> if not validators.url("http://google"):
...     print("not valid")
... 
not valid
>>>

Install it from PyPI with pip (pip install validators).

score 137 · Accepted Answer

Actually, I think this is the best way.

from django.core.validators import URLValidator
from django.core.exceptions import ValidationError

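# Note: verify_exists was removed from later Django releases;
# there, construct the validator with URLValidator() instead.
val = URLValidator(verify_exists=False)
try:
    val('http://www.google.com')
except ValidationError as e:
    print(e)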

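If you set verify_exists to True, it will actually verify that the URL exists; otherwise it will only check whether it is correctly formed.

Edit: ah yes, this question is a duplicate of this one: How can I check if a URL exists with Django's validators?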

score 122 · Accepted Answer

Django url validation regex (source):

import re
regex = re.compile(
        r'^(?:http|ftp)s?://' # http:// or https://
        r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|' #domain...
        r'localhost|' #localhost...
        r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})' # ...or ip
        r'(?::\d+)?' # optional port
        r'(?:/?|[/?]\S+)$', re.IGNORECASE)

print(re.match(regex, "http://www.example.com") is not None) # True
print(re.match(regex, "example.com") is not None)            # False
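
As a rough illustration, the same pattern can be wrapped in a small helper and run against the four examples from the question. The helper name is_valid_url is an assumption made for this sketch and is not part of Django.

```python
import re

# Same pattern as above, compiled once at import time.
URL_REGEX = re.compile(
    r'^(?:http|ftp)s?://'  # http://, https://, ftp:// or ftps://
    r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|'  # domain...
    r'localhost|'  # localhost...
    r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'  # ...or ip
    r'(?::\d+)?'  # optional port
    r'(?:/?|[/?]\S+)$', re.IGNORECASE)


def is_valid_url(url):
    """Hypothetical helper: True if the whole string matches the pattern."""
    return URL_REGEX.match(url) is not None


for candidate in ('google', 'google.com', 'http://google.com', 'http://google'):
    print(candidate, is_valid_url(candidate))
```

With this pattern only 'http://google.com' passes: the domain branch requires at least one dot-terminated label before the final label, so 'http://google' is rejected.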

score 100 · Accepted Answer

A True or False version, based on @DMfll's answer:

try:
    # python2
    from urlparse import urlparse
except ImportError:
    # python3
    from urllib.parse import urlparse

a = 'http://www.cwi.nl:80/%7Eguido/Python.html'
b = '/data/Python.html'
c = 532
d = u'dkakasdkjdjakdjadjfalskdjfalk'
e = 'https://stackoverflow.com'

def uri_validator(x):
    try:
        result = urlparse(x)
        return all([result.scheme, result.netloc])
    except:
        return False

print(uri_validator(a))
print(uri_validator(b))
print(uri_validator(c))
print(uri_validator(d))
print(uri_validator(e))

Gives:

True
False
False
False
True
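
It may help to see how this kind of check behaves on the four examples from the question. The sketch below is for Python 3 and uses two hypothetical helper names (has_scheme_and_netloc and has_dotted_host); the second is an assumed stricter variant, not something taken from the answer.

```python
from urllib.parse import urlparse


def has_scheme_and_netloc(url):
    """The same test as uri_validator above, for string input."""
    parts = urlparse(url)
    return all([parts.scheme, parts.netloc])


def has_dotted_host(url):
    """Assumed stricter variant: additionally require a dot inside the host name."""
    parts = urlparse(url)
    host = parts.hostname or ''
    return bool(parts.scheme) and '.' in host.strip('.')


for candidate in ('google', 'google.com', 'http://google.com', 'http://google'):
    print(candidate, has_scheme_and_netloc(candidate), has_dotted_host(candidate))
```

The scheme-and-netloc test accepts 'http://google', which the question treats as malformed, because urlparse only splits the string and does not judge the host name. The dotted-host variant rejects it, at the cost of also rejecting single-label hosts such as localhost.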

score 30 · Accepted Answer

Nowadays, I use the following, based on Padam's answer:

$ python --version
Python 3.6.5

This is how it looks:

from urllib.parse import urlparse

def is_url(url):
  try:
    result = urlparse(url)
    return all([result.scheme, result.netloc])
  except ValueError:
    return False

Just use is_url("http://www.asdf.com").

Hope it helps!
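
For context on the except ValueError branch: one input that is known to make urlparse raise ValueError is a URL with an unbalanced square bracket in the host part. A minimal sketch that repeats the function so it runs on its own (the sample inputs are chosen for illustration):

```python
from urllib.parse import urlparse


def is_url(url):
    try:
        result = urlparse(url)
        return all([result.scheme, result.netloc])
    except ValueError:
        return False


print(is_url("http://www.asdf.com"))  # scheme and netloc are both present
print(is_url("www.asdf.com"))         # no scheme, so the check fails
print(is_url("http://[::1"))          # unbalanced '[' makes urlparse raise ValueError
```

Only the first call should report True; the last one reaches the except branch rather than the all() check.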

score 18 · Accepted Answer

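I landed on this page trying to figure out a sane way to validate strings as "valid" urls. I share here my solution using python3. No extra libraries required.

If you are using python2, see https://docs.python.org/2/library/urlparse.html.

If you are using python3 like I am, see https://docs.python.org/3.0/library/urllib.parse.html.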

import urllib.parse
from pprint import pprint

invalid_url = 'dkakasdkjdjakdjadjfalskdjfalk'
valid_url = 'https://stackoverflow.com'
tokens = [urllib.parse.urlparse(url) for url in (invalid_url, valid_url)]

for token in tokens:
    pprint(token)

min_attributes = ('scheme', 'netloc')  # add attrs to your liking
for token in tokens:
    if not all([getattr(token, attr) for attr in min_attributes]):
        error = "'{url}' string has no scheme or netloc.".format(url=token.geturl())
        print(error)
    else:
        print("'{url}' is probably a valid url.".format(url=token.geturl()))

ParseResult(scheme='', netloc='', path='dkakasdkjdjakdjadjfalskdjfalk', params='', query='', fragment='')

ParseResult(scheme='https', netloc='stackoverflow.com', path='', params='', query='', fragment='')

'dkakasdkjdjakdjadjfalskdjfalk' string has no scheme or netloc.

'https://stackoverflow.com' is probably a valid url.

Here is a more concise function:

from urllib.parse import urlparse

min_attributes = ('scheme', 'netloc')


def is_valid(url, qualifying=min_attributes):
    tokens = urlparse(url)
    return all([getattr(tokens, qualifying_attr)
                for qualifying_attr in qualifying])

score 9 · Accepted Answer

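Note - lepl is no longer supported, sorry (you're welcome to use it, and I think the code below works, but it's not going to get updates).

rfc 3696 http://www.faqs.org/rfcs/rfc3696.html defines how to do this (for http urls and email). I implemented its recommendations in python using lepl (a parser library). See http://acooke.org/lepl/rfc3696.html

To use: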

> easy_install lepl
...
> python
...
>>> from lepl.apps.rfc3696 import HttpUrl
>>> validator = HttpUrl()
>>> validator('google')
False
>>> validator('http://google')
False
>>> validator('http://google.com')
True

score 5 · Accepted Answer

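EDIT

As pointed out by @Kwame, the code below validates the url even if the .com or .co etc. are not present.

As also pointed out by @Blaise, URLs like https://www.google are valid URLs, and you need to do a DNS check separately to see whether they resolve.

This is simple and works:

So min_attr contains the basic set of strings that need to be present to define the validity of a URL, i.e. the http:// part and the google.com part.

urlparse.scheme stores the scheme (the http of http://) and

urlparse.netloc stores the domain name google.com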

from urllib.parse import urlparse  # Python 3; on Python 2 use: from urlparse import urlparse
def url_check(url):

    min_attr = ('scheme' , 'netloc')
    try:
        result = urlparse(url)
        if all([result.scheme, result.netloc]):
            return True
        else:
            return False
    except:
        return False

all() returns true if all the values inside it are truthy. So if result.scheme and result.netloc are present, i.e. have some value, then the URL is valid and the function returns True.
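
The separate DNS check mentioned in the edit note could look roughly like the sketch below. It assumes Python 3 and uses a hypothetical helper name, host_resolves; it only asks the system resolver whether the host name maps to an address and says nothing about whether a server answers there.

```python
import socket
from urllib.parse import urlparse


def host_resolves(url):
    """Hypothetical helper: True if the URL's host name resolves to an address."""
    host = urlparse(url).hostname
    if not host:
        # No host at all (e.g. a bare word such as 'google').
        return False
    try:
        socket.getaddrinfo(host, None)
    except (socket.gaierror, UnicodeError):
        # gaierror: the resolver could not find the name.
        # UnicodeError: the name could not be encoded for lookup.
        return False
    return True


print(host_resolves("http://127.0.0.1/"))  # IP literal, no DNS query needed
print(host_resolves("google"))             # no scheme, so no host to look up
```

A lookup like this needs network access and can be slow, so it is probably best kept apart from the purely syntactic check.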

score 2 · Accepted Answer

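Validate URL with `urllib` and a Django-like regex

The Django URL validation regex was actually pretty good, but I needed to tweak it a little for my use case. Feel free to adapt it to yours!

Python 3.7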

import re
import urllib.parse

# Check https://regex101.com/r/A326u1/5 for reference
DOMAIN_FORMAT = re.compile(
    r"(?:^(\w{1,255}):(.{1,255})@|^)" # http basic authentication [optional]
    r"(?:(?:(?=\S{0,253}(?:$|:))" # check full domain length to be less than or equal to 253 (starting after http basic auth, stopping before port)
    r"((?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+" # check for at least one subdomain (maximum length per subdomain: 63 characters), dashes in between allowed
    r"(?:[a-z0-9]{1,63})))" # check for top level domain, no dashes allowed
    r"|localhost)" # accept also "localhost" only
    r"(:\d{1,5})?", # port [optional]
    re.IGNORECASE
)
SCHEME_FORMAT = re.compile(
    r"^(http|hxxp|ftp|fxp)s?$", # scheme: http(s) or ftp(s)
    re.IGNORECASE
)

def validate_url(url: str):
    url = url.strip()

    if not url:
        raise Exception("No URL specified")

    if len(url) > 2048:
        raise Exception("URL exceeds its maximum length of 2048 characters (given length={})".format(len(url)))

    result = urllib.parse.urlparse(url)
    scheme = result.scheme
    domain = result.netloc

    if not scheme:
        raise Exception("No URL scheme specified")

    if not re.fullmatch(SCHEME_FORMAT, scheme):
        raise Exception("URL scheme must either be http(s) or ftp(s) (given scheme={})".format(scheme))

    if not domain:
        raise Exception("No URL domain specified")

    if not re.fullmatch(DOMAIN_FORMAT, domain):
        raise Exception("URL domain malformed (domain={})".format(domain))

    return url

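Explanation

The code only validates the scheme and netloc parts of a given URL. (To do this properly, I split the URL with urllib.parse.urlparse() into the two corresponding parts, which are then matched against the corresponding regex terms.)

The netloc part stops before the first occurrence of a slash /, so port numbers are still part of the netloc, e.g.: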

https://www.google.com:80/search?q=python
^^^^^   ^^^^^^^^^^^^^^^^^
  |             |      
  |             +-- netloc (aka "domain" in my code)
  +-- scheme
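
The split shown in the diagram can be reproduced directly with the standard library. This short sketch prints the relevant attributes of the parse result for the same URL:

```python
from urllib.parse import urlparse

parts = urlparse("https://www.google.com:80/search?q=python")

print(parts.scheme)    # https
print(parts.netloc)    # www.google.com:80 (the port is still included)
print(parts.hostname)  # www.google.com
print(parts.port)      # 80
print(parts.path)      # /search
```

If only the host name is of interest, hostname and port give the two halves of netloc separately.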

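IPv4 addresses are also validated

IPv6 Support

If you want the URL validator to also work with IPv6 addresses, do the following:

Add is_valid_ipv6(ip) from Markus Jarderot's answer, which has a really good IPv6 validator regex
Add and not is_valid_ipv6(domain) to the last if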
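
If copying the regex is not an option, one possible substitute is the standard library's ipaddress module. Note that this swaps in a different technique from the regex in Markus Jarderot's answer, and the sketch assumes it is given the bare address (for example the hostname attribute of the parse result), not a bracketed netloc such as [::1]:80.

```python
import ipaddress


def is_valid_ipv6(ip):
    """Stand-in for the regex-based helper: True if ip parses as an IPv6 address."""
    try:
        ipaddress.IPv6Address(ip)
    except ValueError:
        return False
    return True


print(is_valid_ipv6("2001:db8::1"))     # True
print(is_valid_ipv6("192.168.0.1"))     # False, this is IPv4
print(is_valid_ipv6("www.google.com"))  # False
```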

Examples

Here are some examples of the regex for the netloc (aka domain) part in action:

IPv4 and alphanumeric: https://regex101.com/r/A326u1/5
IPv6: https://regex101.com/r/lKIIgq/1 (using the regex from Markus Jarderot's answer)

score 2 · Accepted Answer

All of the above solutions recognize a string like "http://www.google.com/path,www.yahoo.com/path" as valid. This solution always works as it should:

import re

# URL-link validation
ip_middle_octet = u"(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5]))"
ip_last_octet = u"(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))"

URL_PATTERN = re.compile(
                        u"^"
                        # protocol identifier
                        u"(?:(?:https?|ftp|rtsp|rtp|mmp)://)"
                        # user:pass authentication
                        u"(?:\S+(?::\S*)?@)?"
                        u"(?:"
                        u"(?P<private_ip>"
                        # IP address exclusion
                        # private & local networks
                        u"(?:localhost)|"
                        u"(?:(?:10|127)" + ip_middle_octet + u"{2}" + ip_last_octet + u")|"
                        u"(?:(?:169\.254|192\.168)" + ip_middle_octet + ip_last_octet + u")|"
                        u"(?:172\.(?:1[6-9]|2\d|3[0-1])" + ip_middle_octet + ip_last_octet + u"))"
                        u"|"
                        # IP address dotted notation octets
                        # excludes loopback network 0.0.0.0
                        # excludes reserved space >= 224.0.0.0
                        # excludes network & broadcast addresses
                        # (first & last IP address of each class)
                        u"(?P<public_ip>"
                        u"(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])"
                        u"" + ip_middle_octet + u"{2}"
                        u"" + ip_last_octet + u")"
                        u"|"
                        # host name
                        u"(?:(?:[a-z\u00a1-\uffff0-9_-]-?)*[a-z\u00a1-\uffff0-9_-]+)"
                        # domain name
                        u"(?:\.(?:[a-z\u00a1-\uffff0-9_-]-?)*[a-z\u00a1-\uffff0-9_-]+)*"
                        # TLD identifier
                        u"(?:\.(?:[a-z\u00a1-\uffff]{2,}))"
                        u")"
                        # port number
                        u"(?::\d{2,5})?"
                        # resource path
                        u"(?:/\S*)?"
                        # query string
                        u"(?:\?\S*)?"
                        u"$",
                        re.UNICODE | re.IGNORECASE
                       )
def url_validate(url):
    """URL string validation."""
    # URL_PATTERN is already compiled, so it can be used directly.
    return URL_PATTERN.match(url)

score 2 · Accepted Answer

Here's a regex solution, since the top-voted regex doesn't work for weird cases like top-level domains. Some test cases are below.

import re

regex = re.compile(
    r"(\w+://)?"                # protocol                      (optional)
    r"(\w+\.)?"                 # host                          (optional)
    r"((\w+)\.(\w+))"           # domain
    r"(\.\w+)*"                 # top-level domain              (optional, can have > 1)
    r"([\w\-\._\~/]*)*(?<!\.)"  # path, params, anchors, etc.   (optional)
)

cases = [
    "http://www.google.com",
    "https://www.google.com",
    "http://google.com",
    "https://google.com",
    "www.google.com",
    "google.com",
    "http://www.google.com/~as_db3.2123/134-1a",
    "https://www.google.com/~as_db3.2123/134-1a",
    "http://google.com/~as_db3.2123/134-1a",
    "https://google.com/~as_db3.2123/134-1a",
    "www.google.com/~as_db3.2123/134-1a",
    "google.com/~as_db3.2123/134-1a",
    # .co.uk top level
    "http://www.google.co.uk",
    "https://www.google.co.uk",
    "http://google.co.uk",
    "https://google.co.uk",
    "www.google.co.uk",
    "google.co.uk",
    "http://www.google.co.uk/~as_db3.2123/134-1a",
    "https://www.google.co.uk/~as_db3.2123/134-1a",
    "http://google.co.uk/~as_db3.2123/134-1a",
    "https://google.co.uk/~as_db3.2123/134-1a",
    "www.google.co.uk/~as_db3.2123/134-1a",
    "google.co.uk/~as_db3.2123/134-1a",
    "https://...",
    "https://..",
    "https://.",
    "https://.google.com",
    "https://..google.com",
    "https://...google.com",
    "https://.google..com",
    "https://.google...com",
    "https://...google..com",
    "https://...google...com",
    ".google.com",
    ".google.co.",
    "https://google.co.",
]
for c in cases:
    m = regex.match(c)
    # A case passes only if the match exists and spans the whole string.
    print(c, m is not None and m.end() - m.start() == len(c))

score 0 · Accepted Answer

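Not directly relevant, but often it's required to identify whether some token CAN be a url, not necessarily 100% correctly formed (i.e., with the https part omitted and so on). I've read this post and didn't find a solution, so I'm posting my own here for the sake of completeness.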

def get_domain_suffixes():
    import requests
    res=requests.get('https://publicsuffix.org/list/public_suffix_list.dat')
    lst=set()
    for line in res.text.split('\n'):
        if not line.startswith('//'):
            domains=line.split('.')
            cand=domains[-1]
            if cand:
                lst.add('.'+cand)
    return tuple(sorted(lst))

domain_suffixes=get_domain_suffixes()

def reminds_url(txt:str):
    """
    >>> reminds_url('yandex.ru.com/somepath')
    True
    
    """
    ltext=txt.lower().split('/')[0]
    return ltext.startswith(('http','www','ftp')) or ltext.endswith(domain_suffixes)

score 0 · Accepted Answer

A function based on Dominic Tarro's answer:

import re
def is_url(x):
    return bool(re.match(
        r"(https?|ftp)://" # protocol
        r"(\w+(\-\w+)*\.)?" # host (optional)
        r"((\w+(\-\w+)*)\.(\w+))" # domain
        r"(\.\w+)*" # top-level domain (optional, can have > 1)
        r"([\w\-\._\~/]*)*(?<!\.)" # path, params, anchors, etc. (optional)
    , x))
