python - How to use unidecode in Python (3.3)

Question

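I'm trying to remove all non-ASCII characters from a text document. I found a package that should do just that: https://pypi.python.org/pypi/Unidecode

It should accept a string and convert all non-ASCII characters to the closest ASCII character available. I used this same module in Perl easily enough by just calling while (<input>) { $_ = unidecode($_); }, and this one is a direct port of the Perl module; the documentation indicates that it should work the same.

I'm sure this is something simple; I just don't understand enough about character and file encoding to know what the problem is. My origfile is encoded in UTF-8 (converted from UCS-2LE). The problem probably has more to do with my lack of encoding knowledge and with handling strings wrong than with the module, and hopefully someone can explain why. I've tried everything I know short of randomly inserting code, and searching for the errors I'm getting has turned up nothing so far.

Here is my Python: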

from unidecode import unidecode

def toascii():
    origfile = open(r'C:\log.convert', 'rb')
    convertfile = open(r'C:\log.toascii', 'wb')

    for line in origfile:
        line = unidecode(line)
        convertfile.write(line)

    origfile.close()
    convertfile.close()

toascii();

If I don't open the original file in byte mode (origfile = open('file.txt','r')), then I get the error UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 1563: character maps to <undefined> from the for line in origfile: line.

If I do open it in byte mode ('rb'), I get TypeError: ord() expected string length 1, but int found from the line = unidecode(line) line.

If I declare line as a string, line = unidecode(str(line)), then it writes to the file, but not correctly:

\r\n'b'\xef\xbb\xbf[ 2013.10.05 16:18:01 ] User_Name > .\xe2\x95\x90\xe2\x95\x90\xe2\x95\x90\xe2\x95\x90\

It is writing out the \n, \r, etc. and the unicode characters as escapes instead of converting them to anything.

If I convert the line to a string as above and open the convertfile in byte mode ('wb'), I get the error TypeError: 'str' does not support the buffer interface

If I open it in byte mode ('wb') without declaring the line a string, i.e. unidecode(line), then I get the TypeError: ord() expected string length 1, but int found error again.

score 11 · Accepted Answer

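The unidecode module accepts unicode string values and returns a unicode string in Python 3. You are giving it binary data instead. Either decode the data to unicode or open the input text file in text mode, and either encode the result to ASCII before writing it to a file or open the output text file in text mode.

Quoting from the module documentation:

The module exports a single function that takes an Unicode object (Python 2.x) or string (Python 3.x) and returns a string (that can be encoded to ASCII bytes in Python 3.x)

Emphasis mine.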
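To make the bytes-versus-string distinction concrete, here is a minimal standard-library sketch that reproduces the symptoms from the question without unidecode itself. The sample bytes are made up to resemble the log line (a UTF-8 BOM, some ASCII text, one box-drawing character). That unidecode calls ord() on each element of its input is an assumption based on the error message, not something checked against its source.

```python
# Made-up sample resembling the log line in the question: a UTF-8 BOM,
# some ASCII text, and one box-drawing character (U+2550).
raw = b'\xef\xbb\xbfUser > \xe2\x95\x90\r\n'

# Iterating over bytes yields integers, and ord() rejects an integer.
first = next(iter(raw))
try:
    ord(first)
    ord_failed = False
except TypeError:
    ord_failed = True

# str() on bytes does not decode; it returns the repr, "b'...'" included,
# which is why the escapes ended up in the output file.
as_repr = str(raw)

# decode() turns the bytes into real text that a str-based function can use.
as_text = raw.decode('utf-8')

print(type(first).__name__, ord_failed)
print(as_repr[:2])
print(ascii(as_text))
```

The first print shows that an element of a bytes object is an int, the second shows the leading b' that str() adds, and the third shows the decoded text with the BOM and the box-drawing character as single code points.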

This should work:

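from unidecode import unidecode

def toascii():
    # Read the input as UTF-8 text and write the output as ASCII text.
    with open(r'C:\log.convert', 'r', encoding='utf8') as origfile, open(r'C:\log.toascii', 'w', encoding='ascii') as convertfile:
        for line in origfile:
            # unidecode takes a str and returns a str
            line = unidecode(line)
            convertfile.write(line)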

This opens the input file in text mode (using the UTF-8 encoding, which judging by your sample line is correct) and writes in text mode (encoding to ASCII).
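The other route mentioned above, keeping both files in byte mode and decoding and encoding by hand, could look roughly like the sketch below. The names convert_line and placeholder are hypothetical. placeholder is only a stand-in so the example runs without the package installed, and it does not transliterate the way unidecode does.

```python
def convert_line(raw_line, transliterate):
    """Decode UTF-8 bytes, transliterate the text, and return ASCII bytes."""
    text = raw_line.decode('utf-8')      # bytes -> str
    ascii_text = transliterate(text)     # str -> str
    return ascii_text.encode('ascii')    # str -> bytes; raises if non-ASCII remains


# Hypothetical stand-in for unidecode, used only so the sketch is
# self-contained: it replaces every non-ASCII character with '?'.
def placeholder(text):
    return ''.join(ch if ord(ch) < 128 else '?' for ch in text)


result = convert_line(b'caf\xc3\xa9 > \xe2\x95\x90\r\n', placeholder)
print(result)
```

With the real package, passing unidecode as the second argument, convert_line(line, unidecode), should fit files opened with 'rb' and 'wb'.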

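You do need to specify the encoding of the files you open explicitly; if you omit the encoding, the current system locale is used (the result of a locale.getpreferredencoding(False) call), which usually won't be the correct codec if your code needs to be portable.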
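As a rough illustration of the encoding point, the sketch below prints the locale default and then reads the same bytes with two explicit codecs. The sample line in the question starts with \xef\xbb\xbf, which is a UTF-8 byte order mark. Python's 'utf-8-sig' codec strips it on read, whereas plain 'utf-8' keeps it as the character U+FEFF. The payload and the temporary file are made up for the example, and whether the BOM matters for your output is something to check on the real log.

```python
import locale
import os
import tempfile

# Encoding that open() falls back to when no encoding argument is given.
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)

# Write a small UTF-8 file that starts with a BOM, like the sample line.
payload = b'\xef\xbb\xbfUser > \xe2\x95\x90\n'
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    path = tmp.name

try:
    # Plain UTF-8 keeps the BOM as the first character of the text.
    with open(path, 'r', encoding='utf-8') as f:
        with_bom = f.read()
    # utf-8-sig drops a leading BOM if one is present.
    with open(path, 'r', encoding='utf-8-sig') as f:
        without_bom = f.read()
finally:
    os.remove(path)

print(ascii(with_bom))
print(ascii(without_bom))
```

On a Windows machine the first print often shows a code page such as cp1252, which would explain the 'charmap' error in the question.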
