python - TypeError: must be unicode, not str in NLTK

Question

I am using Python 2.7, nltk 3.2.1, and python-crfsuite 0.8.4. I am following this page for the nltk.tag.crf module: http://www.nltk.org/api/nltk.tag.html?highlight=stanford#nltk.tag.stanford.NERTagger

First, I simply ran this:

from nltk.tag import CRFTagger
ct = CRFTagger()
train_data = [[('dfd','dfd')]]
ct.train(train_data,"abc")

I also tried this:

f = open("abc","wb")
ct.train(train_data,f)

But I get the following error:

  File "C:\Python27\lib\site-packages\nltk\tag\crf.py", line 129, in <genexpr>
    if all (unicodedata.category(x) in punc_cat for x in token):
TypeError: must be unicode, not str

score 15 · Accepted Answer

In Python 2, regular quotes '...' or "..." create byte strings. To get a Unicode string, put a u prefix before the string, as in u'dfd'.
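As a minimal sketch of that fix, the snippet below rebuilds the training data from the question with u-prefixed literals and repeats the unicodedata.category() check that the traceback points at. It does not import NLTK; the variable names categories and token are illustrative, and the assumption is that CRFTagger only needs the tokens to be Unicode, as the error message suggests.

```python
import unicodedata

# Same shape as the question's train_data, but every string is a
# Unicode literal (the u prefix matters on Python 2).
train_data = [[(u'dfd', u'dfd')]]

# The traceback shows unicodedata.category() being called on each
# character of a token; with a Unicode token the call succeeds.
token = train_data[0][0][0]
categories = [unicodedata.category(ch) for ch in token]
```

With a plain 'dfd' on Python 2, the same list comprehension raises the TypeError quoted in the question.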

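To read from a file, you need to specify an encoding. For the options, see "Backporting Python 3 open(encoding="utf-8") to Python 2"; the most straightforward one is to replace open() with io.open().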
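A rough illustration of the io.open() route follows. The file name tokens.txt and its tab-separated word/tag layout are assumptions made up for this example, not something from the question; the file is written first only so the snippet is self-contained.

```python
import io
import os
import tempfile

# Hypothetical input file, created here only for the demonstration.
path = os.path.join(tempfile.mkdtemp(), "tokens.txt")
with io.open(path, "w", encoding="utf-8") as handle:
    handle.write(u"dfd\tNN\n")

# io.open() with an encoding decodes while reading, so every line
# comes back as Unicode text on Python 2 as well as Python 3.
with io.open(path, "r", encoding="utf-8") as handle:
    lines = handle.read().splitlines()

# Build (word, tag) pairs; both fields are already Unicode.
pairs = [tuple(line.split(u"\t")) for line in lines]
```

Wrapping pairs in a list would then give one training sentence in the same shape as the train_data in the question.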

To convert an existing string, use the unicode() built-in; though usually you will want to call decode() and supply an encoding.
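The conversion can be sketched as below, assuming the bytes are UTF-8 encoded; the sample value is invented for illustration. The b prefix is used so the snippet behaves the same on Python 2.7, where it is an ordinary str, and on Python 3.

```python
# Five UTF-8 bytes: 'c', 'a', 'f', and a two-byte e-acute.
raw = b"caf\xc3\xa9"

# decode() with an explicit encoding turns the bytes into Unicode text.
text = raw.decode("utf-8")

# On Python 2 only, unicode(raw, "utf-8") gives the same result, while
# unicode(raw) without an encoding assumes ASCII and fails on these bytes.
byte_count = len(raw)
char_count = len(text)
```

The differing lengths (bytes versus characters) are a quick way to confirm the value really was decoded.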

For (much) more detail, Ned Batchelder's "Pragmatic Unicode" slides are recommended, if not outright mandatory, reading: http://nedbatchelder.com/text/unipain.html
