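I have a set of documents, and I want to return a list of tuples where each tuple holds the date of a given document and the number of times a given search term appears in that document. My code (below) works, but it is slow, and I'm a n00b. Are there any obvious ways to make it faster? Any help would be much appreciated, mostly so that I can learn to code better, but also so that I can finish this project sooner!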

import nltk
from nltk.corpus import PlaintextCorpusReader

def searchText(searchword):
    counts = []
    corpus_root = 'some_dir'
    wordlists = PlaintextCorpusReader(corpus_root, '.*')
    for id in wordlists.fileids():
        # The file ID carries the date as YYYYMMDD at characters 4-11
        date = id[4:12]
        month = date[-4:-2]
        day = date[-2:]
        year = date[:4]
        # Tokenize the whole document, then count the search word
        raw = wordlists.raw(id)
        tokens = nltk.word_tokenize(raw)
        text = nltk.Text(tokens)
        count = text.count(searchword)
        counts.append((month, day, year, count))

    return counts

1 Answer

If all you want is a frequency count of words, then you don't need to create nltk.Text objects, or even use nltk.PlaintextCorpusReader. Instead, go straight to nltk.FreqDist.

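import nltk

files = list_of_files  # placeholder: an iterable of file paths
fd = nltk.FreqDist()
for file in files:
    with open(file) as f:
        # Read the file contents first; a file object has no .lower()
        for sent in nltk.sent_tokenize(f.read().lower()):
            for word in nltk.word_tokenize(sent):
                fd[word] += 1  # FreqDist.inc() was removed in NLTK 3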

Or, if you don't need to do any analysis, just use a dict.

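import nltk

files = list_of_files  # placeholder: an iterable of file paths
fd = {}
for file in files:
    with open(file) as f:
        # Read the file contents first; a file object has no .lower()
        for sent in nltk.sent_tokenize(f.read().lower()):
            for word in nltk.word_tokenize(sent):
                try:
                    fd[word] = fd[word]+1
                except KeyError:
                    fd[word] = 1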

These could be made more efficient with generator expressions, but I've used for loops for readability.
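
For illustration, here is a rough sketch of what a generator-expression version might look like, applied to the per-document (month, day, year, count) tuples the question asks for. Several things here are assumptions rather than part of the original answer: the regex tokenizer is a dependency-free stand-in for nltk.word_tokenize (swap the real tokenizer in if you need NLTK's tokenization), the helper names iter_tokens, count_all and count_word_by_date are made up for this example, and the file IDs are assumed to carry a YYYYMMDD date at characters 4-11, as the slicing in the question implies.

```python
import re
from collections import Counter

# Assumption: a simple stand-in tokenizer, not nltk.word_tokenize
TOKEN_RE = re.compile(r"\w+")

def iter_tokens(text):
    """Lazily yield lowercased word tokens from a string."""
    return (m.group(0) for m in TOKEN_RE.finditer(text.lower()))

def count_all(texts):
    """Tally every token across an iterable of strings."""
    # One generator expression feeds Counter; no intermediate lists are built
    return Counter(tok for text in texts for tok in iter_tokens(text))

def count_word_by_date(docs, searchword):
    """docs is an iterable of (file_id, text) pairs.

    Returns a list of (month, day, year, count) tuples, one per document.
    """
    target = searchword.lower()
    results = []
    for file_id, text in docs:
        date = file_id[4:12]  # same slicing as the question: YYYYMMDD
        # Count matches lazily instead of building an nltk.Text object
        count = sum(1 for tok in iter_tokens(text) if tok == target)
        results.append((date[4:6], date[6:8], date[:4], count))
    return results
```

If you only ever need one search word, count_word_by_date avoids storing counts for every other word; if you expect to query many words, building one Counter per document with count_all and looking words up afterwards is likely the better trade.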

answered 2010-10-10T21:54:20.640