python - URL parsing in Python: normalizing double slashes in paths

Question

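I'm working on an application that needs to parse URLs (mostly HTTP URLs) out of HTML pages. I have no control over the input and, as expected, some of it is a bit messy.

One problem I run into often is that urlparse is very strict (possibly even buggy?) when parsing and joining URLs that have a double slash in the path part, for example: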

import urlparse

testUrl = 'http://www.example.com//path?foo=bar'
urlparse.urljoin(testUrl,
                 urlparse.urlparse(testUrl).path)

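Instead of http://www.example.com//path, I end up with http://path.

Incidentally, the reason I run code like this is that it is the only way I have found so far to strip the query/fragment part from a URL. There may be a better way to do it, but I could not find one.

Can anyone recommend a way to avoid this, or should I just normalize the path myself with a (relatively simple, I know) regular expression?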

score 5 · Accepted Answer

If you only want the URL without the query part, I would skip the urlparse module and just do:

testUrl.rsplit('?')

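The URL will be at index 0 of the returned list and the query at index 1.

A URL normally contains only one "?", so this should work for almost all URLs. (RFC 3986 does allow further "?" characters inside the query; index 0 is still the part before the first one.)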
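For comparison, here is a minimal sketch of stripping both the query and the fragment with the standard library parser. It assumes Python 3, where the `urlparse` module lives at `urllib.parse`, and the helper name `strip_query_and_fragment` is made up for illustration. Because the URL is split and reassembled with its host intact, the double slash in the path is left alone.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query_and_fragment(url):
    """Return url without its query string and fragment (hypothetical helper)."""
    parts = urlsplit(url)
    # SplitResult is a namedtuple, so _replace returns a modified copy.
    return urlunsplit(parts._replace(query="", fragment=""))


stripped = strip_query_and_fragment("http://www.example.com//path?foo=bar#top")
print(stripped)  # http://www.example.com//path
```

Whether the extra slash should then be collapsed is a separate decision; this only removes the trailing components.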

score 5 · Accepted Answer

The path on its own (//path) is not valid, which confuses the function: it gets interpreted as a hostname.

https://www.rfc-editor.org/rfc/rfc3986.html#section-3.3

If a URI does not contain an authority component, then the path cannot begin with two slash characters ("//").
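A short sketch of how that rule plays out in the parser. It assumes Python 3 (`urllib.parse`), whereas the question uses the Python 2 `urlparse` module; the variable names are illustrative only.

```python
from urllib.parse import urljoin, urlsplit

test_url = "http://www.example.com//path?foo=bar"

# In the full URL the authority is present, so "//path" is parsed as the path.
full = urlsplit(test_url)
print(full.netloc, full.path)  # www.example.com //path

# On its own, "//path" is a network-path reference: "path" becomes the host
# and the path component is empty.
alone = urlsplit("//path")
print(repr(alone.netloc), repr(alone.path))  # 'path' ''

# urljoin therefore swaps the host of the base URL for "path".
joined = urljoin(test_url, "//path")
print(joined)  # http://path
```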

I don't particularly like either of these solutions, but they both work:

import re
import urlparse

testurl = 'http://www.example.com//path?foo=bar'

parsed = list(urlparse.urlparse(testurl))
parsed[2] = re.sub("/{2,}", "/", parsed[2]) # replace two or more / with one
cleaned = urlparse.urlunparse(parsed)

print cleaned
# http://www.example.com/path?foo=bar

print urlparse.urljoin(
    testurl, 
    urlparse.urlparse(cleaned).path)

# http://www.example.com/path

Depending on what you are doing, you could do the join manually:

import re
import urlparse

testurl = 'http://www.example.com//path?foo=bar'
parsed = list(urlparse.urlparse(testurl))

newurl = ["" for i in range(6)] # could urlparse another address instead

# Copy first 3 values from
# ['http', 'www.example.com', '//path', '', 'foo=bar', '']
for i in range(3):
    newurl[i] = parsed[i]
    
# Rest are blank
for i in range(3, 6):
    newurl[i] = ''

print urlparse.urlunparse(newurl)
# http://www.example.com//path
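The first approach above can be sketched for Python 3 as a single helper. This is an illustration under assumptions, not a drop-in replacement: `urllib.parse` stands in for the Python 2 `urlparse` module, and the name `collapse_path_slashes` is hypothetical. Only the path component is rewritten, so slashes inside the query string are untouched.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def collapse_path_slashes(url):
    """Collapse runs of '/' in the path component only (hypothetical helper)."""
    parts = urlsplit(url)
    path = re.sub(r"/{2,}", "/", parts.path)  # two or more slashes become one
    return urlunsplit(parts._replace(path=path))


cleaned = collapse_path_slashes("http://www.example.com//path?foo=bar")
print(cleaned)  # http://www.example.com/path?foo=bar
```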

score 2 · Accepted Answer

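The official urlparse documentation mentions:

If url is an absolute URL (that is, starting with // or scheme://), the url's host name and/or scheme will be present in the result. For example: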

>>> urljoin('http://www.cwi.nl/%7Eguido/Python.html',
...         '//www.python.org/%7Eguido')
'http://www.python.org/%7Eguido'

If you do not want that behavior, preprocess the url with urlsplit() and urlunsplit(), removing possible scheme and netloc parts.

So you could do this:

urlparse.urljoin(testUrl,
             urlparse.urlparse(testUrl).path.replace('//','/'))

Output: 'http://www.example.com/path'
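The preprocessing step the documentation describes might look roughly like this. It is a sketch assuming Python 3 (`urllib.parse`), and the helper name `join_ignoring_host` is invented for illustration.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def join_ignoring_host(base, ref):
    """Join ref onto base, discarding any scheme or host that ref carries."""
    parts = urlsplit(ref)
    # Keep only path, query and fragment so urljoin sees a plain relative reference.
    relative = urlunsplit(("", "", parts.path, parts.query, parts.fragment))
    return urljoin(base, relative)


result = join_ignoring_host("http://www.cwi.nl/%7Eguido/Python.html",
                            "//www.python.org/%7Eguido")
print(result)  # http://www.cwi.nl/%7Eguido
```

For the `//path` value from the question the parser takes `path` as the host and leaves an empty path, so this step alone has nothing left to join; the slashes still need to be collapsed first, as in the `replace` call above.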

score 2 · Accepted Answer

Try this:

def http_normalize_slashes(url):
    url = str(url)
    segments = url.split('/')
    correct_segments = []
    for segment in segments:
        if segment != '':
            correct_segments.append(segment)
    first_segment = str(correct_segments[0])
    if first_segment.find('http') == -1:
        correct_segments = ['http:'] + correct_segments
    correct_segments[0] = correct_segments[0] + '/'
    normalized_url = '/'.join(correct_segments)
    return normalized_url

Example URLs:

print(http_normalize_slashes('http://www.example.com//path?foo=bar'))
print(http_normalize_slashes('http:/www.example.com//path?foo=bar'))
print(http_normalize_slashes('www.example.com//x///c//v///path?foo=bar'))
print(http_normalize_slashes('http://////www.example.com//x///c//v///path?foo=bar'))

will return:

http://www.example.com/path?foo=bar
http://www.example.com/path?foo=bar
http://www.example.com/x/c/v/path?foo=bar
http://www.example.com/x/c/v/path?foo=bar

Hope this helps. :)

score 0 · Accepted Answer

Couldn't this be a solution?

urlparse.urlparse(testUrl).path.replace('//', '/')

answered 2012-01-19T12:54:38.293
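One caveat worth checking before relying on a single `replace` call: it handles exactly two slashes, but longer runs are only shortened. The sketch below uses a made-up path and compares it with the `/{2,}` regular expression from an earlier answer.

```python
import re

path = "/x///c//v////path"  # illustrative input, not from the question

# str.replace scans left to right over non-overlapping pairs, so a single
# pass turns "///" into "//" rather than "/".
single_pass = path.replace("//", "/")
print(single_pass)  # /x//c/v//path

# A regular expression collapses any run of slashes in one pass.
collapsed = re.sub(r"/{2,}", "/", path)
print(collapsed)  # /x/c/v/path
```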

score 0 · Accepted Answer

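In my case, where I wanted to fix double slashes in the path without touching the initial double slash in the http:// part, this answer seemed to give the best results.

Here is the code: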

from urlparse import urljoin
from functools import reduce


def slash_join(*args):
    return reduce(urljoin, args).rstrip("/")
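A usage sketch of the same function, ported to Python 3 imports (an assumption; the answer itself targets Python 2). The example URLs are made up. Note the last call: because `urljoin` follows relative-reference rules, a base without a trailing slash has its last segment replaced rather than extended.

```python
from functools import reduce
from urllib.parse import urljoin


def slash_join(*args):
    return reduce(urljoin, args).rstrip("/")


# A piece with a leading slash does not produce "//" after the host.
a = slash_join("http://www.example.com/", "/path/")
print(a)  # http://www.example.com/path

# Pieces ending in "/" are extended by the next piece.
b = slash_join("http://www.example.com/api/", "v1/", "users")
print(b)  # http://www.example.com/api/v1/users

# Without the trailing slash, "api" is replaced by "users".
c = slash_join("http://www.example.com/api", "users")
print(c)  # http://www.example.com/users
```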

score 0 · Accepted Answer

I adapted @yunhasnawa's answer to my needs. Here is part of it:

import urllib2
from urlparse import urlparse, urlunparse

def sanitize_url(url):
    url_parsed = urlparse(url)  
    return urlunparse((url_parsed.scheme, url_parsed.netloc, avoid_double_slash(url_parsed.path), '', '', ''))

def avoid_double_slash(path):
    parts = path.split('/')
    not_empties = [part for part in parts if part]
    return '/'.join(not_empties)


>>> sanitize_url('https://hostname.doma.in:8443/complex-path////next//')
'https://hostname.doma.in:8443/complex-path/next'

score 0 · Accepted Answer

This may not be completely safe, but you could use this regular expression:

import re


def sanitize_url(url: str) -> str:
    return re.sub(r"([^:]/)(/)+", r"\1", url)

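It replaces "[non-colon] followed by two or more slashes" with "[non-colon] followed by a single slash". The [non-colon] is there to preserve the // in http:// or https://.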
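A few sample calls illustrate both the intended behavior and one way the "not completely safe" warning can bite. The inputs are made up for illustration; the `file:///` case shows the pattern eating one of the three slashes that follow the scheme.

```python
import re


def sanitize_url(url: str) -> str:
    return re.sub(r"([^:]/)(/)+", r"\1", url)


# Runs of slashes in the path are collapsed; the "//" after "http:" survives.
ok_simple = sanitize_url("http://www.example.com//path?foo=bar")
print(ok_simple)  # http://www.example.com/path?foo=bar

ok_long = sanitize_url("https://host///a////b")
print(ok_long)  # https://host/a/b

# Caveat: three slashes after the scheme are reduced to two, which changes
# the meaning of a file URL.
broken = sanitize_url("file:///etc/hosts")
print(broken)  # file://etc/hosts
```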
