python - UnicodeDecodeError: 'ascii' codec can't decode byte - Python

Question

This is related to the following question -

I have a Python application that performs the following tasks -

# -*- coding: utf-8 -*-

1. Read a Unicode text file (non-English) -

import codecs

def readfile(file, access, encoding):
    # codecs.open decodes the file with the given encoding and returns unicode text
    with codecs.open(file, access, encoding) as f:
        return f.read()

text = readfile('teststory.txt','r','utf-8-sig')

This returns the contents of the given text file as a string.
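As a minimal sketch of what this step yields: the 'utf-8-sig' codec strips a leading byte order mark and returns decoded text. This version swaps in io.open for codecs.open (it accepts an encoding argument on both Python 2.6+ and Python 3), and the temporary file and its contents are hypothetical stand-ins for teststory.txt.

```python
import codecs
import io
import os
import tempfile


def readfile(path, access, encoding):
    # io.open decodes while reading, so the result is unicode text, not bytes
    with io.open(path, access, encoding=encoding) as f:
        return f.read()


# Hypothetical sample file: a UTF-8 BOM followed by non-ASCII text
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(codecs.BOM_UTF8 + u"caf\u00e9".encode("utf-8"))

text = readfile(path, "r", "utf-8-sig")
os.remove(path)
```

Here text holds the decoded characters without the BOM, so any later string built from it is unicode as well.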

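2. Split the text into sentences.

3. Go through the words in each sentence and identify verbs, nouns, etc.

Reference - Searching for Unicode characters in Python and finding the words before and after in a Python list

4. Add them to separate variables, as follows

nouns = "car" | "bus" |

verbs = "drive" | "hit"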

5. Now I am trying to pass them to an NLTK context-free grammar, as follows -

grammar = nltk.parse_cfg('''
    S -> NP VP
    NP -> N
    VP -> V | NP V

    N -> '''+nouns+'''
    V -> '''+verbs+'''
    ''')

It gives me the following error -

line 40, in V -> '''+verbs+''' UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 114: ordinal not in range(128)
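One likely reading of this traceback (an assumption, not confirmed by the full code): a UTF-8 encoded byte string is being combined with unicode text, and Python 2 then decodes the bytes implicitly with the ascii codec. The sketch below reproduces the same message by decoding explicitly; the sample character is hypothetical, chosen only because its UTF-8 encoding starts with the byte 0xe0.

```python
# Hypothetical non-ASCII word; code points U+0800..U+0FFF encode to
# three UTF-8 bytes whose first byte is 0xe0
word = u"\u0d9a"
raw = word.encode("utf-8")

try:
    # Explicit version of the implicit decode that happens in Python 2
    # when a byte string is concatenated with unicode text
    raw.decode("ascii")
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)
```

Keeping every piece of the concatenation in the same type (all unicode or all bytes) avoids the implicit decode.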

How can I overcome this and pass the variables to the NLTK CFG?

Full code - https://dl.dropboxusercontent.com/u/4959382/new.zip

score 1 · Accepted Answer

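Overall, you have the following strategies:

Treat the input as a sequence of bytes; then both the input and the grammar are utf-8 encoded data (bytes).
Treat the input as a sequence of unicode code points; then both the input and the grammar are unicode.
Rename the unicode code points to ascii, i.e. use escape sequences.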
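To make the three options concrete, here is a small sketch in plain Python (no nltk call) of how one grammar string looks in each representation. The variable names are illustrative only.

```python
# The grammar as unicode text, with one non-ASCII terminal
grammar_text = u'S -> "\N{EURO SIGN}" | bar'

# Strategy 1: everything as UTF-8 encoded bytes
as_utf8_bytes = grammar_text.encode("utf-8")

# Strategy 2: everything as unicode code points (no encoding step)
as_unicode = grammar_text

# Strategy 3: pure ASCII, with non-ASCII code points written as escapes
as_ascii_escaped = grammar_text.encode("unicode_escape")
```

Whichever form is chosen, the input sentences have to be put into the same form as the grammar.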

The nltk installed with pip, 2.0.4 in my case, does not accept unicode directly, but it does accept quoted unicode constants, i.e. all of the following seem to work:

In [26]: nltk.parse_cfg(u'S -> "\N{EURO SIGN}" | bar')
Out[26]: <Grammar with 2 productions>

In [27]: nltk.parse_cfg(u'S -> "\N{EURO SIGN}" | bar'.encode("utf-8"))
Out[27]: <Grammar with 2 productions>

In [28]: nltk.parse_cfg(u'S -> "\N{EURO SIGN}" | bar'.encode("unicode_escape"))
Out[28]: <Grammar with 2 productions>

Note that I quoted the unicode text but not the plain text: "€" vs bar.
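Applied to the grammar in the question, a minimal sketch could build the N and V alternatives as quoted terminals and keep the whole grammar text as unicode. The helper quote_terminals and the sample words are hypothetical; the sketch stops before the nltk call.

```python
def quote_terminals(words):
    """Join words as quoted CFG terminals, e.g. "w1" | "w2".

    Assumes the words themselves contain no double quotes.
    """
    return u" | ".join(u'"%s"' % w for w in words)


# Hypothetical word lists; real ones would come from the tagging step
nouns = quote_terminals([u"car", u"\N{EURO SIGN}"])
verbs = quote_terminals([u"drive", u"hit"])

# Unicode template plus unicode pieces: no implicit ascii decode is involved
grammar_text = u"""
    S -> NP VP
    NP -> N
    VP -> V | NP V

    N -> %s
    V -> %s
""" % (nouns, verbs)
```

Going by the examples above, grammar_text could then be passed to nltk.parse_cfg as is, or after .encode("utf-8") or .encode("unicode_escape"); this was not re-checked against nltk here.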
