python - Python file name/path parsing goes wrong with Hebrew encoding (using the optparse library)

Question

I have a question about this code:

import optparse
parser = optparse.OptionParser(version=__version__,
    usage="%prog [options] file1 ... host[:dest]",
    description=main.__doc__)
parser.add_option("-c", "--config", help="Specify an alternate config "
    "file.  Default = '%s'" % config_file)
parser.add_option('-l', '--log-level', type="choice",
    choices=LOG_LEVELS.keys(),
    help="Override the default logging level. Choices=%s, Default=%s" %
        (",".join(LOG_LEVELS.keys()), LOG_LEVEL))
parser.add_option("-o", "--overwrite", action="store_true",
    help="If specified, overwrite existing files at destination.  If "
    "not specified, throw an exception if you try to overwrite a file")
parser.add_option('-s', "--speed", action="store_true",
    help="If specified, print the data transfer rate for each file "
        "that is uploaded (implies verbose option)")
parser.add_option('-v', '--verbose', action="store_true",
    help="If specified, print every file that is being uploaded and every "
        "directory that is being created")
parser.add_option("-u", "--user", help="The username to use for "
    "authentication.  Not needed if you have set up a config file.")
parser.add_option("-p", "--password", help="The password to use for "
    "authentication.  Not needed if you have set up a config file.")

parser.set_defaults(config=config_file, log_level=LOG_LEVEL)
options, args = parser.parse_args()
print (args)

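As you can see, when I print the args in a test we are running with a Hebrew-named file, the output is: ['/root/mezeo_sdk/1/\xfa\xe5\xeb\xf0\xe9\xfa \xf2\xe1\xe5\xe3\xe4.xlsx', 'hostname'] instead of /root/mezeo_sdk/1/"תוכנית עבודה.xlsx"

Also, once the script uploads the file to the server (passing the file name along as is), the end result is: http://i.imgur.com/pP6fA.png

The file name itself is fine on the Linux source, because if I SCP it to my own computer it looks OK, but not once I transfer it to the file server with the Python script.

I also don't believe the problem is on the file server side, because if I upload Hebrew-named files through the web interface they come out fine.

I think the problem lies in how the optparse library is being used.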

score 4 · Accepted Answer

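As always, I'll start with the recommended reading on Unicode: you really should read one or both of

Pragmatic Unicode (Ned Batchelder)
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) (Joel Spolsky)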

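In short (very short), a Unicode code point is an abstract "thing" that represents a character¹. Programmers like to work with these because we like to think of strings as one character at a time. Unfortunately, it was decreed long ago that a character must fit in one byte of memory, so there can be at most 256 different characters. That is fine for plain English, but it doesn't work for anything else. There is a global list of code points, many thousands of them, intended to hold every possible character, but clearly they don't fit in a byte.

The solution: there is a difference between the ordered list of code points that makes up a string and its encoding as a sequence of bytes. Whenever you work with a string, you have to be clear about which form it should be in. To convert between the forms, you .encode() a list of code points (a Unicode string) into a list of bytes, and .decode() bytes into a list of code points. To do that you need to know how to map code points to bytes and vice versa, and that mapping is the encoding.

¹ Sort of.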
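The encode/decode distinction can be sketched in a few lines. This is a minimal illustration written for Python 3 (where str holds code points and bytes holds an encoded form), not the Python 2 the question uses; the word is taken from the file name in the question and the variable names are made up for the example.

```python
# Python 3 sketch: a str is a sequence of code points, bytes is one encoding of it.
text = "\u05e2\u05d1\u05d5\u05d3\u05d4"  # second word of the file name: 5 code points

as_cp1255 = text.encode("cp1255")  # Windows-1255: one byte per Hebrew letter
as_utf8 = text.encode("utf-8")     # UTF-8: two bytes per Hebrew letter

print(len(text), len(as_cp1255), len(as_utf8))
print(as_cp1255)
print(as_utf8)

# Decoding with the matching encoding gets the same code points back.
round_trip = as_cp1255.decode("cp1255")
print(round_trip == text)
```

The same code points produce different byte sequences under different encodings, which is why the bytes alone are ambiguous until you know which encoding produced them.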

OK, now that that's out of the way, let's look at what you have. You have been given a (raw) string, that is, a sequence of bytes:

\xfa\xe5\xeb\xf0\xe9\xfa \xf2\xe1\xe5\xe3\xe4

which you want to be the encoding of

תוכנית עבודה

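A little googling tells me you are using the Windows-1255 encoding, an extension of ASCII that uses the high bytes to hold the Hebrew letters. You want the string to be in Unicode, because Unicode is the normal form for text data. So you should decode the byte sequence using the "Windows-1255" encoding: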

>>> s
'\xfa\xe5\xeb\xf0\xe9\xfa \xf2\xe1\xe5\xe3\xe4'
>>> s.decode("Windows-1255")
u'\u05ea\u05d5\u05db\u05e0\u05d9\u05ea \u05e2\u05d1\u05d5\u05d3\u05d4'

Now you have the right kind of data. Next, you need to send the data to the server, which means encoding it in a common encoding, namely "UTF-8":

>>> s.decode("Windows-1255").encode("utf-8")
'\xd7\xaa\xd7\x95\xd7\x9b\xd7\xa0\xd7\x99\xd7\xaa \xd7\xa2\xd7\x91\xd7\x95\xd7\x93\xd7\x94'
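For readers on Python 3, the same decode-then-encode step might look like the sketch below. The helper name transcode and its default arguments are hypothetical, chosen only for this illustration; the byte values are the ones shown in the session above.

```python
def transcode(raw: bytes, source: str = "cp1255", target: str = "utf-8") -> bytes:
    """Decode raw bytes from `source`, then re-encode the text as `target`."""
    return raw.decode(source).encode(target)


# The raw file name bytes from the question (Windows-1255 encoded).
raw_name = b"\xfa\xe5\xeb\xf0\xe9\xfa \xf2\xe1\xe5\xe3\xe4"

utf8_name = transcode(raw_name)
print(utf8_name)  # bytes repr, safe to print on any terminal
```

Going through Unicode in the middle keeps the two encodings independent: the function never needs a direct byte-to-byte mapping between Windows-1255 and UTF-8.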

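Finally, you may be wondering where the server went wrong. Well, if you don't specify an encoding for your data, people have to guess, and that is an enterprise doomed to failure. In your case, it looks like you sent the raw bytes to the server, which then decoded them as latin-1. That gives the odd accented letters you see, because latin-1 uses the upper half of the byte range (0x80 to 0xFF) not for Hebrew characters but for accented Latin letters.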
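The wrong guess is easy to reproduce. The following Python 3 sketch assumes, as the paragraph above does, that the server decoded the bytes as latin-1; that is an inference from the screenshot, not something confirmed about the server.

```python
# The raw Windows-1255 bytes of the file name from the question.
raw = b"\xfa\xe5\xeb\xf0\xe9\xfa \xf2\xe1\xe5\xe3\xe4"

wrong = raw.decode("latin-1")  # the assumed server-side guess: accented Latin letters
right = raw.decode("cp1255")   # the intended reading: Hebrew letters

# ascii() escapes non-ASCII characters, so this prints safely on any terminal.
print(ascii(wrong))
print(ascii(right))

# latin-1 maps every byte to exactly one code point, so the damage is reversible:
# re-encode as latin-1 to recover the original bytes, then decode them properly.
repaired = wrong.encode("latin-1").decode("cp1255")
print(repaired == right)
```

The first letter shows the effect: byte 0xFA is U+00FA (u with acute accent) in latin-1 but U+05EA (the Hebrew letter tav) in Windows-1255.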

Moral of the story: learn Unicode!

score 3 · Accepted Answer

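It is printing the repr() of the list; if you print the strings themselves, they should render correctly in a terminal emulator.

As for your imgur link, if that is what is displayed on a web page, you need to set the correct encoding in the HTML.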

>>> a=['/root/mezeo_sdk/1/\xfa\xe5\xeb\xf0\xe9\xfa \xf2\xe1\xe5\xe3\xe4.xlsx', 'hostname']
>>> print a[0].decode('windows-1255')
/root/mezeo_sdk/1/תוכנית עבודה.xlsx
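The difference between the list's repr() and the decoded element can also be sketched in Python 3, where the arguments are modeled as bytes objects. This is an assumption made for illustration only: in Python 3, optparse hands back str values, not bytes.

```python
# Arguments modeled as raw Windows-1255 bytes, as in the Python 2 session above.
args = [
    b"/root/mezeo_sdk/1/\xfa\xe5\xeb\xf0\xe9\xfa \xf2\xe1\xe5\xe3\xe4.xlsx",
    b"hostname",
]

as_list = repr(args)                # what print(args) displays: escaped bytes
as_text = args[0].decode("cp1255")  # the actual file name as text

print(as_list)
print(ascii(as_text))  # ascii() keeps the output printable on any terminal
```

The escapes in the printed list are only a display form of the bytes; nothing has been corrupted until those bytes are interpreted with the wrong encoding.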
