I want to get the length of the words in the WordNet corpus.
Code:
from nltk.corpus import wordnet as wn
len_wn = len([word.lower() for word in wn.words()])
print(len_wn)
The output I get is 147306.
My questions:
- Did I get the total length of the words of WordNet?
- Do tokens like zoom_in count as a word?
It depends on what your definition of "word" is. The wn.words() function iterates through all the lemma_names, see https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1701 and https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1191:
def words(self, lang="eng"):
    """return lemmas of the given language as list of words"""
    return self.all_lemma_names(lang=lang)

def all_lemma_names(self, pos=None, lang="eng"):
    """Return all lemma names for all synsets for the given
    part of speech tag and language or languages. If pos is
    not specified, all synsets for all parts of speech will
    be used."""
    if lang == "eng":
        if pos is None:
            return iter(self._lemma_pos_offset_map)
        else:
            return (
                lemma
                for lemma in self._lemma_pos_offset_map
                if pos in self._lemma_pos_offset_map[lemma]
            )
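From that source, wn.words() with the default language simply forwards to all_lemma_names(), so (as a quick sanity check of my own, not part of the original answer) the two calls should yield the same items:

from nltk.corpus import wordnet as wn

# wn.words() forwards to all_lemma_names(lang="eng"), so both calls
# should iterate over the same lemma-name keys in the same order.
same_items = list(wn.words()) == list(wn.all_lemma_names())
print(same_items)  # expected: True, given the source above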
So if the definition of "word" is every possible lemma name, then yes, this function gives you the total character length of the words (lemma names) in WordNet:
>>> sum(len(lemma_name) for lemma_name in wn.words())
1692291
>>> sum(len(lemma_name.lower()) for lemma_name in wn.words())
1692291
The lowercasing is not necessary, since the lemma names are already lowercased, even named entities, e.g.
>>> 'new_york' in wn.words()
True
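If you want to convince yourself of that, a one-liner like the following (my own check, not from the original answer) should report that no name changes under .lower():

from nltk.corpus import wordnet as wn

# Should print True if, as claimed above, every name yielded by
# wn.words() is already lowercase.
print(all(name == name.lower() for name in wn.words()))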
But note that the same concept can have several very similar lemma names:
>>> 'new_york' in wn.words()
True
>>> 'new_york_city' in wn.words()
True
That is because of how WordNet is structured. The API in NLTK organizes "meanings" into synsets; a synset links to multiple lemmas, and each lemma has at least one name:
>>> wn.synset('new_york.n.1')
Synset('new_york.n.01')
>>> wn.synset('new_york.n.1').lemmas()
[Lemma('new_york.n.01.New_York'), Lemma('new_york.n.01.New_York_City'), Lemma('new_york.n.01.Greater_New_York')]
>>> wn.synset('new_york.n.1').lemma_names()
['New_York', 'New_York_City', 'Greater_New_York']
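Each of those Lemma objects can be inspected individually; for instance (a small sketch using the standard Lemma accessors), name() gives the surface form and synset() points back to the containing synset:

from nltk.corpus import wordnet as wn

# Inspect one lemma of the 'new_york.n.01' synset.
lemma = wn.synset('new_york.n.01').lemmas()[0]
print(lemma.name())    # 'New_York' -- the capitalised form seen above
print(lemma.synset())  # Synset('new_york.n.01')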
And each "word" you query can in turn belong to multiple synsets (i.e. have multiple meanings), e.g.
>>> wn.synsets('new_york')
[Synset('new_york.n.01'), Synset('new_york.n.02'), Synset('new_york.n.03')]
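To see that these really are three distinct senses, you can print each synset's gloss with the standard definition() accessor (the exact wording depends on your WordNet version):

from nltk.corpus import wordnet as wn

# Print the gloss of every sense of 'new_york'.
for synset in wn.synsets('new_york'):
    print(synset.name(), '-', synset.definition())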
So it depends on what your definition of "word" is. As the examples above show, if you iterate over wn.words() you are iterating over lemma_names, and the new_york example shows that multi-word expressions appear in each synset's list of lemma names.
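If you only want to treat single tokens as "words", one option (a rough sketch of my own, not part of the original answer) is to filter out the multi-word lemma names, which WordNet joins with underscores:

from nltk.corpus import wordnet as wn

# Multi-word expressions such as 'new_york' use '_' as the joiner,
# so checking for that character separates them from single-token names.
lemma_names = list(wn.words())
single_tokens = [name for name in lemma_names if '_' not in name]
multi_word = [name for name in lemma_names if '_' in name]
print(len(lemma_names), len(single_tokens), len(multi_word))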