python - How to prevent an nltk corpus from reading extended ASCII as unicode

Question

I am using the following code to load a text file containing a plain-text version of the Afrikaans Wikipedia as an nltk corpus:

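from __future__ import division   # __future__ imports must be the first statement in a module
import nltk
from nltk.corpus import PlaintextCorpusReader
from nltk.corpus.util import LazyCorpusLoader
# Load every non-hidden .txt file in the 'afwikipedia' corpus directory.
afwikipedia = LazyCorpusLoader('afwikipedia', PlaintextCorpusReader, r'(?!\.).*\.txt')
af = nltk.Text(afwikipedia.words())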

I then use the following to look at the most frequent words:

from nltk.probability import FreqDist
fdist = FreqDist(af)
vocabulary = fdist.keys()
vocabulary[:250]   # 250 most frequently used words.

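Unfortunately, this approach has a couple of problems. "'n" is a very common word in Afrikaans, meaning the same as "a" in English. The method above splits it into two tokens, "'" and "n". In addition, all extended ASCII characters seem to be treated as unicode rather than ascii, so "verpleër" becomes "verple\xc3r".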

Does anyone know how I can fix this? The unicode handling of ASCII characters in particular is really annoying.

I have also done the following:

# Create a file called sitecustomize.py in c:\python24\Lib\site-packages.
import sys
sys.setdefaultencoding('iso-8859-1')   # ASCII latin.

score 0 · Accepted Answer

That is not unicode; it is ASCII mixed with 8-bit characters. PlaintextCorpusReader takes an encoding parameter, which you can use to solve your problem.
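To make the encoding suggestion concrete, here is a minimal sketch, written for Python 3 and a current NLTK rather than the Python 2.4 setup in the question. The temporary directory, the file name sample.txt, and the sample sentence are hypothetical stand-ins for the real corpus, and the sketch assumes the file is stored as UTF-8; if the file is actually Latin-1, the value would be 'latin-1' instead.

```python
import os
import tempfile

from nltk.corpus import PlaintextCorpusReader

# Hypothetical stand-in for the real corpus: one small UTF-8 text file.
corpus_root = tempfile.mkdtemp()
with open(os.path.join(corpus_root, "sample.txt"), "w", encoding="utf-8") as f:
    f.write("Die verpleër werk in die hospitaal.\n")

# Name the file's actual encoding so the reader decodes its bytes into text.
reader = PlaintextCorpusReader(corpus_root, r"(?!\.).*\.txt", encoding="utf-8")
words = list(reader.words())
print(words)
```

LazyCorpusLoader should forward extra keyword arguments to the reader class, so the same encoding keyword can probably be added to the LazyCorpusLoader call in the question as well.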

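As for splitting the ' off from the n, that is the tokenizer's doing. Find a tokenizer that you are happy with and tell your corpus reader to use it.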
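One possible way to do this, sketched for Python 3 and a current NLTK, is to build a RegexpTokenizer and hand it to the reader through its word_tokenizer parameter. The regular expression below is only an illustrative assumption, not something the answer prescribes, and the temporary directory and sample sentence are hypothetical.

```python
import os
import tempfile

from nltk.corpus import PlaintextCorpusReader
from nltk.tokenize import RegexpTokenizer

# Illustrative pattern (an assumption): try the article "'n" first,
# then runs of word characters, then runs of other punctuation.
tokenizer = RegexpTokenizer(r"'n\b|\w+|[^\w\s]+")
tokens = tokenizer.tokenize("Dit is 'n boek.")
print(tokens)

# Hypothetical one-file corpus to show the tokenizer plugged into the reader.
corpus_root = tempfile.mkdtemp()
with open(os.path.join(corpus_root, "sample.txt"), "w", encoding="utf-8") as f:
    f.write("Dit is 'n boek.\n")

reader = PlaintextCorpusReader(
    corpus_root,
    r".*\.txt",
    word_tokenizer=tokenizer,  # replaces the default word tokenizer
    encoding="utf-8",
)
words = list(reader.words())
print(words)
```

A pattern like this keeps "'n" as one token but still separates an apostrophe from longer words, so it may need adjusting for other Afrikaans contractions.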
