python - Unable to follow the NLTK corpus structure

Question

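I am starting out with NLTK and Python, but I am really confused by the NLTK corpus structure. For example:

I cannot understand why we need to append words twice to the nltk.corpus module: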

wordlist=[w for w in nltk.corpus.words.words('en') if w.islower()]
Also, nltk.corpus.words and nltk.corpus.words.words have different types. Why is that?

type(nltk.corpus)
nltk.corpus
type(nltk.corpus.words)
nltk.corpus.words
type(nltk.corpus.words.words)
nltk.corpus.words.words
... C:\\Documents and Settings\\Administrator\\nltk_data\\corpora\\words'>>
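Thirdly, how is one supposed to know that words has to be appended twice to nltk.corpus to produce the word list? In other words, what is the difference between calling nltk.corpus.words and nltk.corpus.words.words?

Could someone please elaborate? Right now it is hard to continue with Chapter 3 of the NLTK book.

Many thanks.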

score 2 · Accepted Answer

It's really quite simple: words is the name of a class instance contained in nltk.corpus. The relevant code:

words = LazyCorpusLoader('words', WordListCorpusReader, r'(?!README|\.).*')

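All this says is that words is an instance of LazyCorpusLoader.

So you get nltk.corpus.words as a reference.

But wait!

If you look at the code above, LazyCorpusLoader is also constructed with WordListCorpusReader as its reader class.

WordListCorpusReader happens to have a method named words, which looks like this: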

def words(self, fileids=None):
    return line_tokenize(self.raw(fileids))
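To see why the name appears twice, here is a minimal sketch that needs neither NLTK nor any corpus data. TinyWordListReader and the corpus namespace are hypothetical stand-ins I made up for illustration; they are not NLTK's classes. The first words is an attribute holding a reader instance, and the second words is a method on that instance.

```python
import types


class TinyWordListReader:
    """Hypothetical stand-in for WordListCorpusReader (not NLTK code)."""

    def __init__(self, raw_text):
        self._raw_text = raw_text

    def words(self):
        # One word per non-empty line, similar in spirit to line_tokenize.
        return [line for line in self._raw_text.splitlines() if line.strip()]


# Stand-in for the nltk.corpus module: a namespace with an attribute named `words`.
corpus = types.SimpleNamespace(words=TinyWordListReader("Apple\nbanana\n\ncherry\n"))

reader = corpus.words          # first `words`: the reader instance
method = corpus.words.words    # second `words`: a bound method of that instance
wordlist = [w for w in corpus.words.words() if w.islower()]

print(type(reader).__name__)   # TinyWordListReader
print(callable(method))        # True
print(wordlist)                # ['banana', 'cherry']
```

Only the final pair of parentheses actually calls anything; everything before it is plain attribute lookup.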

And LazyCorpusLoader does this:

corpus = self.__reader_cls(root, *self.__args, **self.__kwargs)

Essentially, all this does is instantiate self.__reader_cls, which gives an instance of WordListCorpusReader (which has its own words method).

Then it does this:

self.__dict__ = corpus.__dict__ 
self.__class__ = corpus.__class__

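According to the Python docs, "__dict__ is the module's namespace as a dictionary object"; for an instance, it is likewise the dictionary that holds the instance's attributes. So the loader is replacing its own namespace with that of corpus. Similarly, the docs say of __class__ that "__class__ is the instance's class", so the loader changes its class as well. Hence nltk.corpus.words.words refers to the instance method words of the instance that is named words. Does that make sense? This code illustrates the same behavior: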

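class Bar(object):
    def foo(self):
        return "I am a method of Bar"

class Foo(object):
    def __init__(self, newcls):
        # Instantiate the class that was passed in, then take over
        # its class and its attribute dictionary.
        newcls = newcls()
        self.__class__ = newcls.__class__
        self.__dict__ = newcls.__dict__

foo = Foo(Bar)
# foo now behaves as a Bar instance, so Bar's method is available.
print(foo.foo())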
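The example above does the swap right away in __init__. The "lazy" part of LazyCorpusLoader is that the swap is put off until the object is first used. Here is a simplified sketch of that idea, assuming hypothetical LazyLoader and WordReader classes of my own; it is not NLTK's actual code, and it leaves out things like locating the corpus on disk.

```python
class WordReader(object):
    """Hypothetical reader; stands in for WordListCorpusReader."""

    def __init__(self, items):
        self.items = items

    def words(self):
        return list(self.items)


class LazyLoader(object):
    """Hypothetical, simplified stand-in for LazyCorpusLoader."""

    def __init__(self, reader_cls, *args):
        self._reader_cls = reader_cls
        self._args = args

    def _load(self):
        reader = self._reader_cls(*self._args)
        # Take over the reader's attributes and class.
        self.__dict__ = reader.__dict__
        self.__class__ = reader.__class__

    def __getattr__(self, name):
        # Python calls this only when normal lookup fails, which here
        # means the reader has not been loaded yet.
        self._load()
        return getattr(self, name)


words = LazyLoader(WordReader, ["alpha", "beta"])
before = type(words).__name__   # still the loader; nothing created yet
result = words.words()          # first attribute access triggers the load
after = type(words).__name__    # the object has turned into the reader

print(before, result, after)    # LazyLoader ['alpha', 'beta'] WordReader
```

This is one reason type() on the same object can report different classes depending on whether the corpus has already been touched.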

Here are links to the source code as well, so you can see for yourself:

http://nltk.googlecode.com/svn/trunk/doc/api/nltk.corpus-pysrc.html

http://nltk.googlecode.com/svn/trunk/doc/api/nltk.corpus.reader.wordlist-pysrc.html#WordListCorpusReader
