python - How do I find the frequency count of an English word using WordNet?

Question

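Is there a way to find how frequently an English word is used, using WordNet or NLTK from Python?

Note: I do not want the frequency count of a word in a given input file. I want the frequency of a word in general, based on present-day usage.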

score 19 · Accepted Answer

In WordNet, every lemma has a frequency count, which is returned by the method lemma.count() and stored in the file nltk_data/corpora/wordnet/cntlist.rev.

Code example:

from nltk.corpus import wordnet
syns = wordnet.synsets('stack')
for s in syns:
    for l in s.lemmas():
        print l.name + " " + str(l.count())

Output:

stack 2
batch 0
deal 1
flock 1
good_deal 13
great_deal 10
hatful 0
heap 2
lot 13
mass 14
mess 0
...

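However, many of the counts are zero, and neither the source file nor the documentation says which corpus was used to build this data. According to the book Speech and Language Processing by Daniel Jurafsky and James H. Martin, the sense frequencies come from the SemCor corpus, which is a subset of the already small and outdated Brown corpus.

So it is probably best to pick the corpus that best fits your application and build the data yourself, as Christopher suggests.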

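To make this compatible with Python 3.x (and NLTK 3, where name is a method rather than an attribute), just do the following:

Code example: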

from nltk.corpus import wordnet
syns = wordnet.synsets('stack')
for s in syns:
    for l in s.lemmas():
        print(l.name() + " " + str(l.count()))
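
Note that lemma.count() is a per-sense figure: the same word appears once in every synset it belongs to, each time with its own count. A minimal sketch of adding those up into a single number per word follows. Both total_lemma_count and wordnet_count are hypothetical helper names, not part of NLTK, and wordnet_count assumes NLTK 3.x with the WordNet data already downloaded.

```python
def total_lemma_count(word, synsets):
    """Sum lemma.count() over every lemma named `word` in the given synsets."""
    # WordNet joins multiword lemmas with underscores, e.g. "good_deal"
    target = word.lower().replace(" ", "_")
    return sum(
        lemma.count()
        for synset in synsets
        for lemma in synset.lemmas()
        if lemma.name().lower() == target
    )


def wordnet_count(word):
    """Hypothetical convenience wrapper; needs NLTK 3.x and the WordNet data."""
    # Imported lazily so total_lemma_count itself has no NLTK dependency
    from nltk.corpus import wordnet

    return total_lemma_count(word, wordnet.synsets(word))
```

Calling wordnet_count('stack') would add up the counts over every sense of "stack". The caveats about the underlying corpus apply to the sum just as much as to the individual counts.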

score 11 · Accepted Answer

You can sort of do this with the Brown corpus, though it is out of date (last revised in 1979), so it is missing a lot of current words.

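from nltk.corpus import brown
from nltk.probability import FreqDist

# Requires the Brown corpus data: nltk.download('brown')
words = FreqDist()

for sentence in brown.sents():
    for word in sentence:
        # FreqDist.inc() was removed in NLTK 3; increment the count directly
        words[word.lower()] += 1

print(words["and"])       # absolute count
print(words.freq("and"))  # relative frequency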

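You can then pickle the FreqDist to a file (cPickle on Python 2) so that it loads faster later.

A corpus is basically just a file full of sentences, one per line, and there are lots of other corpora out there, so you can probably find one that fits your purpose. A couple of other sources of more current corpora: Google, American National Corpus.

You can also supposedly get a current list of the top 60,000 words and their frequencies from the Corpus of Contemporary American English.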

score 3 · Accepted Answer

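Check out this site for word frequencies: http://corpus.byu.edu/coca/

Somebody compiled word lists taken from opensubtitles.org (movie scripts). A free, simple text file in the format below is available for download, in many different languages.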

you 6281002
i 5685306
the 4768490
to 3453407
a 3048287
it 2879962

http://invokeit.wordpress.com/frequency-word-lists/
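
A list in that "word count" layout is easy to load. The sketch below assumes one whitespace-separated word/count pair per line, as in the sample above. parse_frequency_list is a hypothetical helper name, and the sample lines are just the six quoted in this answer, not the full file.

```python
def parse_frequency_list(lines):
    """Parse 'word count' lines into a dict mapping word -> integer count."""
    counts = {}
    for line in lines:
        parts = line.split()
        # Skip blank or malformed lines rather than failing on them
        if len(parts) != 2 or not parts[1].isdigit():
            continue
        word, count = parts
        counts[word] = int(count)
    return counts


sample = """you 6281002
i 5685306
the 4768490
to 3453407
a 3048287
it 2879962""".splitlines()

counts = parse_frequency_list(sample)
total = sum(counts.values())  # total over the sample only, not the whole list
print(counts["the"])
print(counts["the"] / total)  # share of "the" among these six words
```

With a downloaded file, pass an open file handle instead of sample, since iterating over a text file yields its lines.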

score 2 · Accepted Answer

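Check out the Information Content section of the WordNet Similarity project at http://wn-similarity.sourceforge.net/. There you will find databases of word frequencies (or rather, of information content, which is derived from word frequency) for WordNet lemmas, calculated from several different corpora. The source code is in Perl, but the databases are provided independently and can easily be used with NLTK.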
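
For context, information content is commonly defined as the negative logarithm of an item's relative frequency, so rare items score high and common ones score low. The sketch below shows only that arithmetic with made-up counts. information_content is a hypothetical helper, not code from WordNet Similarity or NLTK.

```python
import math


def information_content(count, total):
    """Return -log(count / total); rarer items get a higher score."""
    if count <= 0 or total <= 0:
        raise ValueError("count and total must be positive")
    return -math.log(count / total)


# Made-up counts, purely for illustration
total = 1_000_000
print(information_content(50_000, total))  # frequent item, lower score
print(information_content(5, total))       # rare item, higher score
```

NLTK also has a wordnet_ic corpus reader for files in this format, if you would rather not work from the raw counts yourself.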

score 2 · Accepted Answer

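You can't really do this, because it depends so much on context. Not only that, but for less frequent words the frequency will depend heavily on the sample.

Your best bet is probably to find a large corpus of text in the given genre (for example, download a hundred books from Project Gutenberg) and count the words yourself.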
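
Counting the words yourself takes only a few lines. This sketch uses collections.Counter and a deliberately crude regex tokenizer (runs of letters and apostrophes). count_words is a hypothetical helper, and the two sample strings stand in for whatever texts you download.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[a-z']+")  # crude tokenizer: runs of letters and apostrophes


def count_words(texts):
    """Return a Counter of lowercased word tokens across an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(WORD_RE.findall(text.lower()))
    return counts


# Stand-in texts; in practice, read each downloaded book into a string
texts = ["The cat sat on the mat.", "The dog didn't sit."]
counts = count_words(texts)
total = sum(counts.values())
print(counts["the"])          # absolute count
print(counts["the"] / total)  # relative frequency within this sample
print(counts.most_common(3))
```

As the answer says, the numbers for rare words will shift a lot from one sample to another, so the choice and size of the texts matters more than the counting code.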

score 1 · Accepted Answer

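The Wiktionary project has a few frequency lists based on TV scripts and Project Gutenberg, but their format is not particularly nice for parsing.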

score 1 · Accepted Answer

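You can download the word vectors glove.6B.zip from https://github.com/stanfordnlp/GloVe, unzip them, and look at the file glove.6B.50d.txt.

There you will find 400,000 English words, one per line (plus 50 numbers per word on the same line), lowercased and sorted from most frequent (the) to least frequent. You can read the file either as raw text or with pandas.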
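
Since the file is ordered by frequency, a word's line number works as a rank. A minimal sketch, assuming each line starts with the word followed by space-separated numbers, as described above. load_ranks is a hypothetical helper, and the three sample lines carry shortened, made-up vectors.

```python
def load_ranks(lines, limit=None):
    """Map each word to its 1-based position in a frequency-ordered file."""
    ranks = {}
    for position, line in enumerate(lines, start=1):
        if limit is not None and position > limit:
            break
        # The first field is the word; the rest of the line is the vector
        word = line.rstrip("\n").split(" ", 1)[0]
        if not word:
            continue
        ranks.setdefault(word, position)
    return ranks


# Made-up, shortened lines in the same layout (real lines carry 50 numbers)
sample = ["the 0.1 0.2 0.3", ", 0.0 0.5 0.1", "of 0.7 0.1 0.4"]
ranks = load_ranks(sample)
print(ranks["of"])
```

With the real file, pass an open file handle for glove.6B.50d.txt instead of sample. Keep in mind that this gives an ordering only; the file does not contain the underlying counts.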

It is not perfect, but I have used it in the past. The same website provides other files with up to 2.2 million English words, cased.

score 0 · Accepted Answer

A Python 3 version of Christopher Pickslay's solution (including saving the frequencies to the temp dir):

from pathlib import Path
from pickle import dump, load
from tempfile import gettempdir

from nltk.probability import FreqDist


def get_word_frequencies() -> FreqDist:
  tmp_path = Path(gettempdir()) / "word_freq.pkl"
  if tmp_path.exists():
    with tmp_path.open(mode="rb") as f:
      word_frequencies = load(f)
  else:
    from nltk import download
    download('brown', quiet=True)
    from nltk.corpus import brown
    word_frequencies = FreqDist(word.lower() for sentence in brown.sents()
                                for word in sentence)
    with tmp_path.open(mode="wb") as f:
      dump(word_frequencies, f)

  return word_frequencies

Usage:

word_frequencies = get_word_frequencies()

print(word_frequencies["and"])
print(word_frequencies.freq("and"))

Output:

28853
0.02484774266443448
