python - Unescaping/unquoting binary strings in (extended) URL encoding in Python

Question

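For analysis purposes, I have to unescape URL-encoded binary strings (most likely containing non-printable characters). Unfortunately, the strings come in the extended URL-encoding form, e.g. "%u616f". I want to store them in a file that then contains the raw binary values, e.g. 0x61 0x6f here.

How do I convert this into binary data in Python? (urllib.unquote only handles the "%HH" form.)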

score 3 · Accepted Answer

"Unfortunately, the strings come in the extended URL-encoding form, e.g. "%u616f""

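Incidentally, that has nothing to do with URL encoding. It is an arbitrary made-up format produced by the JavaScript escape() function and pretty much nothing else. If you can, the best thing to do is change the JavaScript to use the encodeURIComponent function instead. That will give you a proper, standard URL-encoded UTF-8 string.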
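If the JavaScript side can be switched to encodeURIComponent, decoding becomes a standard-library call. The following is a minimal sketch, assuming Python 3 (where urllib.unquote lives in urllib.parse); the encoded string is a hypothetical example of what encodeURIComponent would produce for "hello é 慯":

```python
from urllib.parse import unquote, unquote_to_bytes

# Hypothetical output of JavaScript encodeURIComponent("hello é 慯")
encoded = "hello%20%C3%A9%20%E6%85%AF"

text = unquote(encoded)          # percent-escapes decoded as UTF-8 (the default)
raw = unquote_to_bytes(encoded)  # same escapes, kept as raw bytes

print(text)
print(raw)
```

unquote_to_bytes is the one to reach for when the payload is binary rather than text, since it does not try to interpret the bytes as characters.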

"e.g. "%u616f". I want to store them in a file that then contains the raw binary values, e.g. 0x61 0x6f here."

Are you sure 0x61 0x6f (the letters "ao") is the byte stream you want to store? That would imply UTF-16BE encoding; are you treating all your strings that way?
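To make the encoding point concrete, here is a small sketch (Python 3 assumed) showing the bytes that the single character U+616F turns into under three encodings. Only UTF-16BE yields 0x61 0x6f:

```python
ch = chr(0x616F)  # the one character that "%u616f" stands for

print(ch.encode("utf-16-be"))  # b'ao'
print(ch.encode("utf-16-le"))  # b'oa'
print(ch.encode("utf-8"))      # b'\xe6\x85\xaf'
```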

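Usually you'd want to turn the input into Unicode and then write it out using an appropriate encoding, such as UTF-8 or UTF-16LE. Here's a quick way to do it, relying on the trick of having Python read '%u1234' as the string-escape format u'\u1234':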

>>> ex= 'hello %e9 %u616f'
>>> ex.replace('%u', r'\u').replace('%', r'\x').decode('unicode-escape')
u'hello \xe9 \u616f'

>>> print _
hello é 慯

>>> _.encode('utf-8')
'hello \xc3\xa9 \xe6\x85\xaf'
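The session above is Python 2; in Python 3, str has no .decode method, so the replace trick does not carry over directly. A rough equivalent, assuming Python 3 and using a regular expression instead of the unicode-escape codec, might look like this (js_unescape is a hypothetical helper name, not a library function):

```python
import re

# %uHHHH (one UTF-16 code unit) or %HH (one code point below 0x100)
_ESCAPE = re.compile(r"%u([0-9a-fA-F]{4})|%([0-9a-fA-F]{2})")

def js_unescape(s):
    """Decode %uHHHH and %HH sequences as produced by JavaScript escape()."""
    def repl(m):
        # Exactly one of the two groups is set for any match.
        return chr(int(m.group(1) or m.group(2), 16))
    return _ESCAPE.sub(repl, s)

text = js_unescape("hello %e9 %u616f")
print(text)
print(text.encode("utf-8"))
```

Note that escape() works on UTF-16 code units, so a character outside the Basic Multilingual Plane arrives as two %u surrogate halves; this sketch does not recombine them.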

score 1 · Accepted Answer

I guess you'll have to write the decoder function yourself. Here is an implementation to get you started:

def decode(file):
    while True:
        c = file.read(1)
        if c == "":
            # End of file
            break
        if c != "%":
            # Not an escape sequence
            yield c
            continue
        c = file.read(1)
        if c != "u":
            # One hex-byte
            yield chr(int(c + file.read(1), 16))
            continue
        # Two hex-bytes
        yield chr(int(file.read(2), 16))
        yield chr(int(file.read(2), 16))

Usage:

input = open("/path/to/input-file", "r")
output = open("/path/to/output-file", "wb")
output.writelines(decode(input))
output.close()
input.close()
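The generator above relies on Python 2, where chr returns a one-byte string. A Python 3 adaptation has to yield bytes objects explicitly. The following sketch assumes Python 3 and that every unescaped character in the input fits in a single byte; it uses in-memory streams in place of real files so that it is self-contained:

```python
import io

def decode(stream):
    """Yield bytes decoded from a text stream of %HH / %uHHHH escapes."""
    while True:
        c = stream.read(1)
        if c == "":
            break                      # end of input
        if c != "%":
            yield c.encode("latin-1")  # literal character, assumed to fit in one byte
            continue
        c = stream.read(1)
        if c != "u":
            yield bytes([int(c + stream.read(1), 16)])  # %HH -> one byte
            continue
        yield bytes.fromhex(stream.read(4))  # %uHHHH -> two bytes, high byte first

src = io.StringIO("A%00%u616f%ff")
out = io.BytesIO()
out.writelines(decode(src))
print(out.getvalue())
```

With real files, src would be opened in text mode and out in "wb" mode, as in the usage snippet above. Malformed escapes raise ValueError in this sketch rather than being passed through.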

score 0 · Accepted Answer

Here is a regex-based approach:

import re

# the replace function concatenates the two matches after
# converting them from hex to ascii
repfunc = lambda m: chr(int(m.group(1), 16)) + chr(int(m.group(2), 16))

# the last parameter is the text you want to convert
result = re.sub('%u(..)(..)', repfunc, '%u616f')
print result

This gives:

ao
