I can read text corpora using NLTK with Python 2.6:
from nltk.corpus import gutenberg

for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid))    # characters in the raw text
    num_words = len(gutenberg.words(fileid))  # word tokens
    num_sents = len(gutenberg.sents(fileid))  # sentences
    # distinct word types, ignoring case
    num_vocab = len(set([w.lower() for w in gutenberg.words(fileid)]))
    # average word length, average sentence length, lexical diversity
    print int(num_chars/num_words), int(num_words/num_sents), int(num_words/num_vocab), fileid
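Now I would like to find the average number of occurrences of given letter sequences by word and by sentence, for example num_letters(whole_text, ['a', 'bb', 'ccc']). The expected output is:
a = n11/n12, bb = n21/n22, ccc = n31/n32
where n11 is the number of occurrences in words and n12 is the number of occurrences in sentences.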
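One possible reading of this, as a rough sketch only: count every occurrence of each letter sequence inside the words, then divide that total once by the number of words and once by the number of sentences. The signature below is an assumption; it takes a word list and a sentence list (such as those returned by gutenberg.words and gutenberg.sents) rather than a single whole_text argument. Matching is case-insensitive and uses str.count, so overlapping matches and sequences that span a word boundary are not counted.

```python
from __future__ import division, print_function


def num_letters(words, sents, patterns):
    """Return {pattern: (average per word, average per sentence)}.

    words:    sequence of word strings
    sents:    sequence of sentences (each a list of word strings)
    patterns: letter sequences to look for, e.g. ['a', 'bb', 'ccc']
    """
    lowered = [w.lower() for w in words]
    n_words = len(lowered)
    n_sents = len(list(sents))
    result = {}
    for p in patterns:
        # Total non-overlapping occurrences of p inside individual words.
        total = sum(w.count(p) for w in lowered)
        per_word = total / n_words if n_words else 0.0
        per_sent = total / n_sents if n_sents else 0.0
        result[p] = (per_word, per_sent)
    return result


# Hypothetical usage with the corpus from the question:
#   from nltk.corpus import gutenberg
#   for fileid in gutenberg.fileids():
#       stats = num_letters(gutenberg.words(fileid),
#                           gutenberg.sents(fileid),
#                           ['a', 'bb', 'ccc'])
#       print(fileid, stats)
```

If the averages should instead count how many words or sentences contain each sequence at least once, the inner sum would need to change accordingly.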