python - Tabulating and printing a frequency distribution with NLTK

Asked 2014-10-03T16:58:33.343

Viewed 1763 times

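I am trying to get NLTK to tabulate the trigrams across a corpus of 12,000 text files and then print the frequency distribution of those trigrams to a file, but I get the following error: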

Traceback (most recent call last):
  File "TPNngrams2.py", line 19, in <module>
    fdisttab = fdist.tabulate()
  File "/Library/Python/2.7/site-packages/nltk/probability.py", line 281, in tabulate
     print("%4s" % samples[i], end=' ')
TypeError: not all arguments converted during string formatting
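
For what it's worth, the TypeError in the traceback can be reproduced without NLTK. This is a minimal sketch, assuming each sample in the distribution is a 3-tuple of words (which is what a trigram is); the words themselves are made up for illustration:

```python
# A trigram is a 3-tuple; these words are made up for illustration.
trigram = ("we", "the", "people")

try:
    # With a tuple on the right-hand side, the % operator treats its
    # elements as three separate arguments for the single %4s placeholder.
    "%4s" % trigram
except TypeError as err:
    message = str(err)

# Wrapping the sample in a one-element tuple formats the tuple as a whole.
as_single_argument = "%4s" % (trigram,)

# Joining the words first turns the sample into a plain string.
as_joined_string = "%4s" % " ".join(trigram)
```

So the formatting line inside `tabulate` appears to fail only because the samples are tuples; string samples would pass through it unchanged.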

Here is the code:

import nltk
import re
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from nltk import FreqDist

# Load the text files in the folder into a corpus called speeches
corpus_root = '/Users/root'
speeches = PlaintextCorpusReader(corpus_root, r'.*\.txt')

print "Finished importing corpus"
fdist = nltk.FreqDist()  # Empty distribution

for filename in speeches.fileids():
    (str(trigram) for trigram in nltk.trigrams(speeches.words(filename)))
    fdist.update(nltk.trigrams(speeches.words(filename)))

fdisttab = fdist.tabulate()
print fdisttab
f = open('freqdists.txt', 'w+')
f.write(fdisttab)
f.close()

print "All done. Check file."

Thanks in advance for any help. I don't know where to start troubleshooting this.
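
One possible way around this is sketched below, with the caveat that it is a guess at the intent rather than a verified fix for this exact setup. Two things in the posted code stand out: the generator expression inside the loop is never consumed, so the trigrams still reach the distribution as tuples, and `tabulate()` prints to standard output and returns `None`, so `f.write(fdisttab)` would fail even if the tabulation succeeded. The sketch joins each trigram into a string before counting and writes the counts to the file directly. To keep it runnable without a corpus it uses `collections.Counter` (which `nltk.FreqDist` subclasses in NLTK 3) and a small `trigrams` helper standing in for `nltk.trigrams`; the helper names and the sample documents are hypothetical.

```python
import io
from collections import Counter


def trigrams(words):
    """Return consecutive 3-tuples of words (stand-in for nltk.trigrams)."""
    words = list(words)
    return zip(words, words[1:], words[2:])


def count_trigrams(documents):
    """Count trigrams across all documents, keyed by a space-joined string."""
    counts = Counter()  # nltk.FreqDist subclasses Counter in NLTK 3
    for words in documents:
        # Joining makes each sample a string instead of a tuple.
        counts.update(" ".join(t) for t in trigrams(words))
    return counts


def write_counts(counts, out):
    """Write one 'trigram<TAB>count' line per sample, most frequent first."""
    for sample, count in counts.most_common():
        out.write(u"%s\t%d\n" % (sample, count))


# Hypothetical stand-in for speeches.words(filename) for each file.
documents = [
    "we the people of the united states".split(),
    "we the people".split(),
]

counts = count_trigrams(documents)
buffer = io.StringIO()  # stand-in for the output file
write_counts(counts, buffer)
report = buffer.getvalue()
```

With the real corpus, the documents would presumably be `(speeches.words(name) for name in speeches.fileids())`, and the output could be a file opened with `io.open('freqdists.txt', 'w', encoding='utf-8')` in place of the `StringIO` buffer.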
