python - Cannot use NLTK on text extracted from the Silmarillion

Question

I'm trying to use Tolkien's Silmarillion as a practice text for learning some NLP with nltk.

I can't get started because I'm running into text encoding problems.

I'm using the TextBlob wrapper around NLTK (https://github.com/sloria/TextBlob) because it's easier to work with.

The sentence I can't parse is:

"But Húrin did not answer, and they sat beside the stone, and did not speak again".

I believe the special character in Húrin is what causes the problem.

My code:

from text.blob import TextBlob
b = TextBlob( 'But Húrin did not answer, and they sat beside the stone, and did not speak again' )
b.noun_phrases

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)

Since this is just a fun project, I only want to be able to use this text, extract some attributes, and do some basic processing.

How can I convert this text to ASCII when I don't know what the original encoding is? I tried decoding from UTF-8 and then re-encoding to ASCII:

>>> asc = unicode_text.decode('utf-8')
>>> asc = unicode_text.encode('ascii')

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 10: ordinal not in range(128)

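But even this doesn't work. Any suggestions are appreciated. I'm fine with losing the special characters, as long as it's done consistently across the whole document.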

I'm using Python 2.6.8 and have the required modules installed correctly.

score 2 · Accepted Answer

First, update TextBlob to the latest version (0.6.0 at the time of writing), since recent releases include some unicode fixes. This can be done with:

$ pip install -U textblob

Then, use a unicode literal, like so:

from text.blob import TextBlob
b = TextBlob( u'But Húrin did not answer, and they sat beside the stone, and did not speak again' )
noun_phrases = b.noun_phrases
print noun_phrases
# WordList([u'h\xfarin'])
print noun_phrases[0]
# húrin

This was verified on Python 2.7.5 with TextBlob 0.6.0, but it should also work on Python 2.6.8.
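
If the text comes from a file rather than a literal, the same idea applies: decode the raw bytes to unicode before handing them to TextBlob. Below is a minimal sketch, assuming the source is UTF-8 (the 0xc3 byte in the traceback is consistent with UTF-8, where ú is encoded as 0xc3 0xba, but that is a guess about your file). The helper names to_unicode and to_ascii are made up for illustration and are not part of TextBlob or NLTK. The second helper covers the case where you would rather drop the accents consistently and work with plain ASCII.

```python
# -*- coding: utf-8 -*-
import unicodedata


def to_unicode(raw, encoding='utf-8'):
    """Decode a byte string; pass already-decoded text through unchanged.

    The 'utf-8' default is an assumption about the source file.
    """
    if isinstance(raw, bytes):
        return raw.decode(encoding)
    return raw


def to_ascii(text):
    """Return an ASCII-only approximation of a unicode string."""
    # NFKD splits an accented letter into its base letter plus a combining mark.
    decomposed = unicodedata.normalize('NFKD', text)
    # Encoding with 'ignore' drops every code point outside ASCII,
    # which removes the combining marks and keeps the base letters.
    return decomposed.encode('ascii', 'ignore').decode('ascii')


# Bytes as they would come out of a UTF-8 encoded file.
raw = b'But H\xc3\xbarin did not answer'
text = to_unicode(raw)
ascii_text = to_ascii(text)
print(ascii_text)
```

Either text or ascii_text can then be passed to TextBlob. Note that characters with no ASCII base letter are dropped entirely by to_ascii, so check the result if the text contains anything beyond accented Latin letters.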
