python - Using UTF-8 input with the cmd Python module

Question

While building a small CLI notebook application, I decided to use the cmd Python library (see also cmd on PyMOTW).

My shell is set to UTF-8.

→ echo $LANG
fr_FR.utf-8
→ echo $LC_ALL
fr_FR.utf-8

It works fine.

→ echo "東京"
東京

Starting my small application and trying UTF-8 input:

→ python nb.py 
log> foobar
2013-01-15 foobar
log> æ±äº¬
2013-01-15 æ±äº¬

Edited: when I type UTF-8 characters (accented or Japanese characters in this case), I get garbage, as shown above. The expected input/output is:

log> 東京
2013-01-15 東京
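The garbage shown earlier is consistent with the UTF-8 bytes of 東京 being read back one byte at a time as Latin-1. That reading is an assumption based only on the characters in the output, not something confirmed in this thread. The helper names below are hypothetical; a minimal sketch:

```python
# -*- coding: utf-8 -*-

def misread_as_latin1(text):
    """Encode text as UTF-8, then wrongly decode those bytes as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def repair(garbled):
    """Undo the misreading: back to bytes via Latin-1, then decode as UTF-8."""
    return garbled.encode("latin-1").decode("utf-8")


garbled = misread_as_latin1(u"東京")
# One of the six bytes maps to U+009D, an invisible control character,
# so it is dropped here to get the part a terminal would display.
visible = garbled.replace(u"\x9d", u"")
```

Here `visible` holds the same six-byte garbage that the session printed, and `repair(garbled)` gives back the original two characters, which suggests the bytes themselves arrive intact and only their interpretation is off.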

So once the program is running, the command loop seems to change the encoding of the input.

#!/usr/bin/env python2.7
# encoding: utf-8
import datetime
import os.path
import logging
import cmd

ROOT = "~/test/"
NOTENAME = "notes.md"

def todaynotepath(rootpath, notename):
    # Build <root>/YYYY/MM/DD/<notename> from today's ISO date.
    isodate = datetime.date.today().isoformat()
    return rootpath + isodate.replace("-", "/") + "/%s" % (notename)

def addcontent(content):
    logging.info(content)

class NoteBook(cmd.Cmd):
    """Simple cli notebook."""
    prompt = "log> "

    def precmd(self, line):
        # What is the date path NOW
        notepath = todaynotepath(ROOT, NOTENAME)
        # if the directory of the note doesn't exist, create it.
        notedir = os.path.dirname(notepath)
        if not os.path.exists(notedir):
            os.makedirs(notedir)
        # if the file for notes today doesn't exist, create it.
        logging.basicConfig(filename=notepath, level=logging.INFO, format='%(asctime)s - %(message)s')
        return cmd.Cmd.precmd(self, line)

    def default(self, line):
        if line:
            print datetime.date.today().isoformat(), line
            addcontent(line)

    def do_EOF(self, line):
        return True

    def postloop(self):
        print

if __name__ == "__main__":
    NoteBook().cmdloop()

So I guess there might be something to override in the original cmd class. I looked through the module, but no luck so far.

Edit 2: added LESSCHARSET, as suggested by @dda

LANG=fr_FR.utf-8
LANGUAGE=fr_FR.utf-8
LC_ALL=fr_FR.utf-8
LC_CTYPE=fr_FR.UTF-8
LESSCHARSET=utf-8

score 2 · Accepted Answer

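I think there is another, similar question on SO, but specifically about C shared-library modules; this answer might fit better there, but I can't find the link right now :)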

In short, my answer is: try locale.setlocale(locale.LC_ALL, '') before loading the module (I haven't used cmd myself). In more detail:

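I was trying to use the SWIG Python bindings for Subversion (SVN). These are basically Python interfaces that SWIG generates automatically, straight from the SVN C library code (libsvn1). When I run svn status MyWorkingCopy from the terminal, it hooks into the libsvn code, and it has never failed in years (for that repository). However, when I ran a Python example (performing svn status through the same libsvn) from the same terminal, the same operation failed.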

This means that Python somehow "influences" how the library behaves with respect to character sets. Yet my terminal keeps reporting:

$ locale
LANG=en_US.UTF-8
LANGUAGE=en_US:en
LC_CTYPE="en_US.UTF-8"
...
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

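So this has nothing to do with what the terminal/shell (bash in this case) thinks; it is about what the underlying C code (libsvn in this case) sees as the current setting. Python, I thought, would see the same thing as the shell: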

$ python -c 'import locale; print locale.getdefaultlocale()'
('en_US', 'UTF-8')

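So now it comes down to seeing what the C code sees when run from the terminal versus when run from Python (in the same terminal). Debugging libsvn further, it turned out that the problem actually comes from another library, libapr (Apache Portable Runtime), which SVN uses for memory allocation. What I ended up doing was reproducing, in a standalone C program, the string copy that libsvn performs through libapr, and then building it as a Python module via SWIG. This program, aprtest, takes a string as an argument, calls the libapr engine to copy it, and prints the result; its source is posted here: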

http://sdaaubckp.sourceforge.net/dbg/swig-py/aprtest/

See the script build-aprtest.sh for the library versions I used (Ubuntu 11.04); to build, run bash build-aprtest.sh.

Now, if you run the resulting executable in a terminal, you get:

$ locale
LANG=en_US.UTF-8
...
$ ./aprtest "test"
LC_CTYPE 0 CODESET 14
ANSI_X3.4-1968
apr_xlate_open: apr_err=0
apr_xlate_conv_buffer apr_err == 0 
(*dest)->data: test
$ ./aprtest "test東京"
LC_CTYPE 0 CODESET 14
ANSI_X3.4-1968
apr_xlate_open: apr_err=0
apr_xlate_conv_buffer apr_err == 22

So the libapr engine clearly fails on UTF-8 command-line input, even though the terminal reports UTF-8. When we run it through Python as a shared module (called aprtest_s):

$ python -c 'import aprtest_s; aprtest_s.pysmain("test")'
LC_CTYPE 0 CODESET 14
ANSI_X3.4-1968
apr_xlate_open: apr_err=0
apr_xlate_conv_buffer apr_err == 0 
(*dest)->data: test
$ python -c 'import aprtest_s; aprtest_s.pysmain("test東京")'
LC_CTYPE 0 CODESET 14
ANSI_X3.4-1968
apr_xlate_open: apr_err=0
apr_xlate_conv_buffer apr_err == 22

...the same thing happens (by the way, for the same SVN and APR problem, but in Perl, see "Is there a variable or function that returns the native platform encoding (APR_LOCALE_CHARSET)"). So we can conclude:

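Whether the C program is run directly from the terminal or through Python makes no difference: the C program simply sees a different locale/encoding setting than the calling program might see.
ASCII strings are fine; only UTF-8 strings are a problem.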
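The second conclusion can be mimicked in pure Python. This is only a rough analogy and not what libapr does internally: it decodes the same bytes under an ASCII codeset (ANSI_X3.4-1968 is the formal name of ASCII) and under UTF-8. The function name `convert` is hypothetical.

```python
# -*- coding: utf-8 -*-

def convert(raw, codeset):
    """Decode raw bytes using codeset; return None if the bytes do not fit."""
    try:
        return raw.decode(codeset)
    except UnicodeDecodeError:
        return None


ascii_only = u"test".encode("utf-8")
with_kanji = u"test東京".encode("utf-8")

# Under an ASCII codeset only the pure-ASCII input converts.
ascii_results = (convert(ascii_only, "ascii"), convert(with_kanji, "ascii"))
# Under UTF-8 both inputs convert.
utf8_results = (convert(ascii_only, "utf-8"), convert(with_kanji, "utf-8"))
```

The pattern matches the aprtest runs: the plain "test" input goes through under either codeset, while the input containing 東京 only converts once the codeset is UTF-8.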

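So how does the svn client work fine from the terminal, while ultimately using libapr, without crashing? Well, as the comments in the aprtest_s.c source show, it does so by setting the program's own locale with the C function setlocale(LC_CTYPE, ""), which applies the environment's locale to the process's LC_CTYPE category (passing LC_ALL instead would set every category). This issue was actually mentioned on the apr-dev mailing list, in "Re: Misbehavior of apr_os_locale_encoding on Windows":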

...choosing one out of 55 different current locales is something that can probably only be done correctly by the application, not by APR.

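So by calling setlocale() in the C application, we apparently choose the default locale explicitly, and libapr then picks it up. In the test case, this setlocale call has to happen before the call to apr_xlate_open.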

Now, the posted version of aprtest does not call setlocale, so with the Python build we can see what happens on the Python side, and what effect locale.setlocale() has:

$ PYTHONIOENCODING='utf-8' echo 'import sys;print sys.stdin.encoding' | python
None
$ echo 'import sys;print sys.stdin.encoding' | PYTHONIOENCODING='utf-8' python
utf-8
$ locale
LANG=en_US.UTF-8
LANGUAGE=en_US:en
...
$ python
Python 2.7.1+ (r271:86832, Sep 27 2012, 21:16:52) 
[GCC 4.5.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import aprtest_s
>>> aprtest_s.print_locale()
LC_CTYPE 0 CODESET 14
ANSI_X3.4-1968
>>> aprtest_s.pysmain("test")
LC_CTYPE 0 CODESET 14
ANSI_X3.4-1968
apr_xlate_open: apr_err=0
apr_xlate_conv_buffer apr_err == 0 
(*dest)->data: test
>>> aprtest_s.pysmain("test東京")
LC_CTYPE 0 CODESET 14
ANSI_X3.4-1968
apr_xlate_open: apr_err=0
apr_xlate_conv_buffer apr_err == 22 
>>> import locale
>>> print locale.getdefaultlocale()
('en_US', 'UTF-8')
>>> print locale.getlocale()
(None, None)
>>> import sys
>>> print sys.stdin.encoding
UTF-8
>>> locale.setlocale(locale.LC_ALL, '')
'en_US.UTF-8'
>>> print sys.stdin.encoding
UTF-8
>>> print locale.getlocale()
('en_US', 'UTF-8')
>>> aprtest_s.pysmain("test")
LC_CTYPE 0 CODESET 14
UTF-8
apr_xlate_open: apr_err=0
apr_xlate_conv_buffer apr_err == 0 
(*dest)->data: test
>>> aprtest_s.pysmain("test東京")
LC_CTYPE 0 CODESET 14
UTF-8
apr_xlate_open: apr_err=0
apr_xlate_conv_buffer apr_err == 0 
(*dest)->data: test東京
>>>
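The first two commands in the transcript differ only in which process receives PYTHONIOENCODING: the variable has to be in the environment of python itself, not of echo. The sketch below reproduces that from Python by launching a child interpreter with the variable set. It is written for Python 3, where a piped stdin reports an encoding even without the variable (unlike the None that Python 2.7 printed above); the helper names are hypothetical.

```python
import codecs
import os
import subprocess
import sys

# The child prints the encoding it attached to its own stdin.
CHILD_CODE = "import sys; sys.stdout.write(str(sys.stdin.encoding))"


def child_stdin_encoding(io_encoding):
    """Run a child interpreter with PYTHONIOENCODING set; return its stdin encoding."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        env=env,
    )
    out, _ = proc.communicate(b"")
    return out.decode("ascii").strip()


def same_codec(name_a, name_b):
    """Compare two encoding names after resolving aliases."""
    return codecs.lookup(name_a).name == codecs.lookup(name_b).name
```

For example, `same_codec(child_stdin_encoding("latin-1"), "latin-1")` should hold whatever the parent's locale is, because the variable is applied to the child process directly.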

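So, to find out what a C application sees from within Python, use locale.getlocale() (NOT locale.getdefaultlocale()). As I now understand it, getdefaultlocale returns some OS/user settings, stored somewhere, that are considered the default but that still have to be applied as the default when the application starts; getlocale returns the actual, currently applied locale settings. I guess that calling setlocale with an empty string makes the underlying code read the default settings (those given by getdefaultlocale) and then apply them as the current settings.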
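A minimal sketch of that before/after check, using only the standard locale module. The helper names are hypothetical, and locale.nl_langinfo is not available on every platform, so the codeset lookup is guarded. The script first forces the "C" locale to imitate the Python 2.7 starting point seen in the transcript; newer Python 3 releases, as far as I know, may already apply LC_CTYPE from the environment at startup.

```python
import locale


def current_ctype():
    """Return the LC_CTYPE locale currently applied to this process."""
    # With no second argument, setlocale only queries; it changes nothing.
    return locale.setlocale(locale.LC_CTYPE)


def current_codeset():
    """Return the codeset reported by nl_langinfo, or None where unavailable."""
    if hasattr(locale, "nl_langinfo"):
        return locale.nl_langinfo(locale.CODESET)
    return None


def apply_environment_locale():
    """Apply LANG/LC_* to the process; return the new locale, or None on failure."""
    try:
        return locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        return None


# Imitate the unconfigured starting state, then apply the environment.
locale.setlocale(locale.LC_CTYPE, "C")
before = (current_ctype(), current_codeset())
applied = apply_environment_locale()
after = (current_ctype(), current_codeset())
```

On a machine configured like the one in the transcript, `before` would be expected to carry an ASCII codeset and `after` a UTF-8 one, but the actual values depend on the environment the script runs in.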

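One last note: although it looks related, the encoding of stdin/stdout is (apparently) unrelated to the encoding of the current locale (at least as seen by a C program running in that environment).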
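Applied to a cmd-based program, the short answer from the top of this post could look like the sketch below. EchoShell is a hypothetical stand-in for the question's NoteBook class, the code targets Python 3, and whether the setlocale call cures the original symptom is not confirmed anywhere in this thread. The scripted run uses in-memory streams so the loop can be exercised without a terminal.

```python
# -*- coding: utf-8 -*-
import cmd
import io
import locale


def init_locale():
    """Adopt the environment's locale; fall back to "C" if it is unsupported."""
    try:
        return locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        return locale.setlocale(locale.LC_ALL, "C")


class EchoShell(cmd.Cmd):
    """Hypothetical stand-in for the question's NoteBook class."""
    prompt = "log> "

    def default(self, line):
        # Unrecognised input (including non-ASCII text) ends up here.
        self.stdout.write(line + "\n")

    def do_EOF(self, line):
        return True


def run_scripted(text):
    """Feed text to the shell through in-memory streams; return its output."""
    out = io.StringIO()
    shell = EchoShell(stdin=io.StringIO(text), stdout=out)
    shell.use_rawinput = False  # read from self.stdin instead of the terminal
    shell.cmdloop()
    return out.getvalue()


# Set the locale before the command loop is ever started.
init_locale()
```

In an interactive program the last step would be followed by `EchoShell().cmdloop()`; `run_scripted(u"東京\n")` is just a way to check that the text passes through the loop unchanged.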

Hope this helps someone,
Cheers!

score 1 · Accepted Answer

Your code works perfectly for me, Karl. See this:

dda$ ./nb.py 
log> tagada
2013-01-15 tagada
log> 香港
2013-01-15 香港
log>

The notes.md file contains the correct entries. So I don't think cmd is the problem; it is more likely something in your terminal settings. Try adding

export LESSCHARSET=utf-8

to your .profile.
