python - Read Unicode characters from command line arguments in Python 2.x on Windows

Question

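I want my Python script to be able to read Unicode command line arguments on Windows. But it appears that sys.argv is a string encoded in some local encoding rather than Unicode. How can I read the command line in full Unicode?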

Example code: argv.py

import sys

first_arg = sys.argv[1]
print first_arg
print type(first_arg)
print first_arg.encode("hex")
print open(first_arg)

On my PC, which is set up for the Japanese code page, I get:

C:\temp>argv.py "PC・ソフト申請書08.09.24.doc"
PC・ソフト申請書08.09.24.doc
<type 'str'>
50438145835c83748367905c90bf8f9130382e30392e32342e646f63
<open file 'PC・ソフト申請書08.09.24.doc', mode 'r' at 0x00917D90>

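I believe this is Shift-JIS encoded, and it "works" for that filename. But it breaks for filenames containing characters that are not in the Shift-JIS character set; the final "open" call fails: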

C:\temp>argv.py Jörgen.txt
Jorgen.txt
<type 'str'>
4a6f7267656e2e747874
Traceback (most recent call last):
  File "C:\temp\argv.py", line 7, in <module>
    print open(first_arg)
IOError: [Errno 2] No such file or directory: 'Jorgen.txt'

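Note: I'm talking about Python 2.x, not Python 3.0. I've found that Python 3.0 provides sys.argv as proper Unicode. But it's still a bit early to transition to Python 3.0 (due to the lack of third-party library support).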

Update:

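A few answers have said I should decode according to whatever encoding sys.argv is in. The problem is that this encoding is not full Unicode, so some characters cannot be represented.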
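As a rough illustration of why decoding cannot help, the hex dumps printed in the two runs above can be decoded offline. This is only a sketch: the codec name cp932 (Python's name for the Windows Japanese code page) and the helper decode_argv_hex are assumptions made for the example, not part of the original script.

```python
import binascii

# Hex dumps copied from the two runs shown in the question.
sjis_hex = "50438145835c83748367905c90bf8f9130382e30392e32342e646f63"
lossy_hex = "4a6f7267656e2e747874"


def decode_argv_hex(hex_dump, encoding="cp932"):
    """Turn a hex dump of a byte-string argument back into Unicode text.

    The default encoding is an assumption (Japanese Windows code page).
    """
    return binascii.unhexlify(hex_dump).decode(encoding)


# The first argument survives, because every character exists in the code page.
japanese_name = decode_argv_hex(sjis_hex)

# The second argument contains only ASCII bytes: the umlaut was already
# replaced by a plain 'o' before the script ran, so no decode can restore it.
lossy_name = decode_argv_hex(lossy_hex)
print(lossy_name)
```

In other words, the information is lost before Python 2.x ever sees the argument, which is why the fix has to read the command line through a wide-character API instead.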

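Here's the use case that gives me grief: I have enabled drag-and-drop of files onto .py files in Windows Explorer. I have file names with all sorts of characters, including some that are not in the system default code page. Whenever a character is not representable in the current code page encoding, my Python script does not get the right Unicode filename passed to it via sys.argv.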

Surely there is some Windows API to read the command line with full Unicode (and Python 3.0 does it). I assume the Python 2.x interpreter is not using it.

score 30 · Accepted Answer

Here is the solution I was looking for. It calls the Windows GetCommandLineW and CommandLineToArgvW functions:
Get sys.argv with Unicode characters under Windows (from ActiveState)

But I've made several changes to simplify its usage and to better handle certain uses. Here is what I use:

win32_unicode_argv.py

"""
win32_unicode_argv.py

Importing this will replace sys.argv with a full Unicode form.
Windows only.

From this site, with adaptations:
      http://code.activestate.com/recipes/572200/

Usage: simply import this module into a script. sys.argv is changed to
be a list of Unicode strings.
"""


import sys

def win32_unicode_argv():
    """Uses kernel32.GetCommandLineW and shell32.CommandLineToArgvW to get
    sys.argv as a list of Unicode strings.

    Versions 2.x of Python don't support Unicode in sys.argv on
    Windows, with the underlying Windows API instead replacing multi-byte
    characters with '?'.
    """

    from ctypes import POINTER, byref, cdll, c_int, windll
    from ctypes.wintypes import LPCWSTR, LPWSTR

    GetCommandLineW = cdll.kernel32.GetCommandLineW
    GetCommandLineW.argtypes = []
    GetCommandLineW.restype = LPCWSTR

    CommandLineToArgvW = windll.shell32.CommandLineToArgvW
    CommandLineToArgvW.argtypes = [LPCWSTR, POINTER(c_int)]
    CommandLineToArgvW.restype = POINTER(LPWSTR)

    cmd = GetCommandLineW()
    argc = c_int(0)
    argv = CommandLineToArgvW(cmd, byref(argc))
    if argc.value > 0:
        # Remove Python executable and commands if present
        start = argc.value - len(sys.argv)
        return [argv[i] for i in
                xrange(start, argc.value)]

sys.argv = win32_unicode_argv()

Now, the way I use it is simply:

import sys
import win32_unicode_argv

From then on, sys.argv is a list of Unicode strings. The Python optparse module seems happy to parse it, which is great.
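For illustration only, here is a small sketch of optparse handling a list of Unicode strings. The helper parse_unicode_args, the --output option, and the sample file names are all hypothetical; the argument list is passed in explicitly so the sketch also runs without the Windows-only module.

```python
from optparse import OptionParser


def parse_unicode_args(argv):
    """Parse a list of Unicode arguments (argv[0] is the script name)."""
    parser = OptionParser(usage="usage: %prog [options] FILE...")
    # Hypothetical option, present only to show that option values
    # keep their non-ASCII characters.
    parser.add_option("-o", "--output", dest="output",
                      help="name of the output file")
    options, positional = parser.parse_args(argv[1:])
    return options, positional


# Stand-in for the sys.argv produced by win32_unicode_argv.
sample_argv = [u"argv.py", u"--output", u"J\xf6rgen.out", u"J\xf6rgen.txt"]
options, files = parse_unicode_args(sample_argv)
```

In a real script you would pass sys.argv after importing win32_unicode_argv.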

score 12 · Answer

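Dealing with encodings is very confusing.

I believe that if you input data via the command line, it is encoded in whatever your system encoding is, not as Unicode. (Even copy/paste should do this.)

So it should be correct to decode it into unicode using the system encoding: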

import codecs
import sys

first_arg = sys.argv[1]
print first_arg
print type(first_arg)

first_arg_unicode = first_arg.decode(sys.getfilesystemencoding())
print first_arg_unicode
print type(first_arg_unicode)

f = codecs.open(first_arg_unicode, 'r', 'utf-8')
unicode_text = f.read()
print type(unicode_text)
print unicode_text.encode(sys.getfilesystemencoding())

Running the following command:

Prompt> python myargv.py "PC・ソフト申請書08.09.24.txt"

outputs:

PC・ソフト申請書08.09.24.txt
<type 'str'>
<type 'unicode'>
PC・ソフト申請書08.09.24.txt
<type 'unicode'>
?日本語

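Here the file "PC・ソフト申請書08.09.24.txt" contained the text "日本語". (I encoded the file as UTF-8 using Windows Notepad, and I'm a little stumped as to why there's a "?" at the beginning when printing. Something to do with how Notepad saves UTF-8?)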
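The leading "?" is most likely the UTF-8 byte order mark (U+FEFF) that Notepad writes at the start of files it saves as UTF-8; the system encoding cannot represent it, so it prints as "?". A minimal sketch, assuming that is the cause: Python's "utf-8-sig" codec strips the mark, while plain "utf-8" keeps it. The helper read_text and the temporary file are illustrative only.

```python
import codecs
import os
import tempfile


def read_text(path, encoding):
    """Read a whole file and decode it with the given codec."""
    with codecs.open(path, "r", encoding) as f:
        return f.read()


# Simulate a Notepad-style UTF-8 file: byte order mark followed by the text.
text = u"\u65e5\u672c\u8a9e"  # the three characters from the example file
fd, path = tempfile.mkstemp(suffix=".txt")
os.write(fd, codecs.BOM_UTF8 + text.encode("utf-8"))
os.close(fd)

with_bom = read_text(path, "utf-8")         # keeps U+FEFF as the first character
without_bom = read_text(path, "utf-8-sig")  # drops the mark if it is present
os.remove(path)
```

The "utf-8-sig" codec also reads files that have no byte order mark, so it is a reasonable default for files that may have come from Notepad.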

The string 'decode' method or the built-in unicode() can be used to convert an encoded string to unicode.

unicode_str = utf8_str.decode('utf8')
unicode_str = unicode(utf8_str, 'utf8')

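Also, if you're dealing with encoded files, you may want to use the codecs.open() function in place of the built-in open(). It lets you specify the encoding of the file, and will then use that encoding to transparently decode the content to unicode.

So when you call content = codecs.open("myfile.txt", "r", "utf8").read(), content will be unicode.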

codecs.open: http://docs.python.org/library/codecs.html?#codecs.open

If I'm misunderstanding something, please let me know.

If you haven't already, I recommend reading Joel's article on unicode and encodings: http://www.joelonsoftware.com/articles/Unicode.html

score 2 · Answer

Try this:

import sys
print repr(sys.argv[1].decode('UTF-8'))

You may have to substitute CP437 or CP1252 for UTF-8. You should be able to infer the proper encoding name from the registry key HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage\OEMCP
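One way to experiment with the substitution is to try a list of candidate encodings in order. This is only a sketch: the helper decode_argument and its default candidate list are assumptions, and it cannot recover characters that the code page never contained.

```python
def decode_argument(raw, candidates=("utf-8", "cp437", "cp1252")):
    """Decode a byte-string argument with the first codec that accepts it.

    Returns a (text, encoding) pair. UTF-8 is strict, so it fails fast on
    most non-UTF-8 input. Single-byte code pages such as cp437 accept any
    byte, so the first one listed wins and later ones are never reached.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings matched")


# 0x94 is o-umlaut in CP437 and is not valid UTF-8 in this position.
oem_result = decode_argument(b"J\x94rgen.txt")

# 0xF6 is o-umlaut in CP1252; pass the ANSI code page explicitly.
ansi_result = decode_argument(b"J\xf6rgen.txt", candidates=("utf-8", "cp1252"))
```

Because the single-byte codecs never fail, the order of candidates has to come from knowledge of the system (for example the registry value mentioned above), not from trial and error.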

score 0 · Answer

The command line might be in a Windows encoding. Try decoding the arguments into unicode objects:

args = [unicode(x, "iso-8859-9") for x in sys.argv]
