python - Some problems with NLTK

Question

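I'm very new to Python and NLTK, but I have a question. I'm writing something to extract only the words longer than 7 characters from a self-made corpus. But it turns out that it extracts every word... Does anyone know what I'm doing wrong?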

loc="C:\Users\Dell\Desktop\CORPUS"
Corpus= CategorizedPlaintextCorpusReader(loc,'(?!\.svn).*\.txt, cat_pattern=r '(Shakespeare|Milton)/.*)
def long_words(corpus)
    for cat in corpus.categories():
        fileids=corpus.fileids(categories=cat)
        words=corpus.words(fileids)
         long_tokens=[]
         words2=set(words)
         if len(words2) >=7:
             long_tokens.append(words2)


Print long_tokens

Thanks, everyone!

score 1 · Accepted Answer

Replace

if len(words2) >=7:
    long_tokens.append(words2)

with:

long_tokens += [w for w in words2 if len(w) >= 7]

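Explanation: your code appended all the words (tokens) produced by corpus.words(fileids) whenever the number of distinct words was at least 7 (so, I suppose, always for your corpus). What you really wanted was to filter the words shorter than 7 characters out of the token set and append the remaining long words to long_tokens.

Your function should also return the result: the tokens having 7 or more characters. I assume the way you create and handle the CategorizedPlaintextCorpusReader is fine: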
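To make the difference concrete, here is a minimal sketch that needs no corpus on disk. The token list is made up for illustration and simply stands in for whatever corpus.words(fileids) would return.

```python
# Hypothetical sample tokens standing in for corpus.words(fileids).
words = ["tempest", "a", "paradise", "lost", "the",
         "Shakespeare", "of", "and", "sonnet"]
words2 = set(words)

# Original check: len(words2) is the number of distinct tokens (9 here),
# so the condition is true and the whole set is appended as one element.
buggy = []
if len(words2) >= 7:
    buggy.append(words2)

# Fixed check: len(w) is the length of each individual token.
long_tokens = [w for w in words2 if len(w) >= 7]

print(buggy)                # one list element: the full set of tokens
print(sorted(long_tokens))  # only the tokens with 7 or more characters
```

The first print shows every token, which matches the symptom described in the question; the second shows only the long ones.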

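from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# Raw strings keep the backslashes in the Windows path and the regexes literal.
loc = r"C:\Users\Dell\Desktop\CORPUS"
Corpus = CategorizedPlaintextCorpusReader(loc, r'(?!\.svn).*\.txt', cat_pattern=r'(Shakespeare|Milton)/.*')

def long_words(corpus=Corpus):
    long_tokens = []
    for cat in corpus.categories():
        fileids = corpus.fileids(categories=cat)
        words = corpus.words(fileids)
        # Keep only the distinct tokens with 7 or more characters.
        long_tokens += [w for w in set(words) if len(w) >= 7]
    return set(long_tokens)

print("\n".join(long_words()))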

Here is the answer to the question you asked in the comments:

for loc in ['cat1', 'cat2']:
    corpus = CategorizedPlaintextCorpusReader(loc, r'(?!\.svn).*\.txt', cat_pattern=r'(Shakespeare|Milton)/.*')
    print(len(long_words(corpus=corpus)), 'words over 7 in', loc)
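If the counts are wanted per category inside a single corpus rather than per location, a variation along these lines might work. This is only a sketch: long_word_counts and FakeCorpus are hypothetical names introduced here, and FakeCorpus is a tiny in-memory stand-in that exposes just the three reader methods used (categories, fileids, words) so the example runs without NLTK or any files.

```python
def long_word_counts(corpus, min_len=7):
    """Map each category to its number of distinct tokens of at least min_len characters."""
    counts = {}
    for cat in corpus.categories():
        words = corpus.words(corpus.fileids(categories=cat))
        counts[cat] = len({w for w in words if len(w) >= min_len})
    return counts


class FakeCorpus:
    """Hypothetical stand-in for a categorized corpus reader (illustration only)."""

    _data = {
        "Milton": {"Milton/a.txt": ["paradise", "lost", "darkness", "paradise"]},
        "Shakespeare": {"Shakespeare/b.txt": ["tempest", "the", "sonnet"]},
    }

    def categories(self):
        return sorted(self._data)

    def fileids(self, categories=None):
        return sorted(self._data[categories])

    def words(self, fileids):
        # Concatenate the tokens of every requested file.
        return [w for f in fileids
                for files in self._data.values()
                for w in files.get(f, [])]


for cat, n in long_word_counts(FakeCorpus()).items():
    print(n, "words of 7 or more characters in", cat)
```

Passing a real CategorizedPlaintextCorpusReader instead of FakeCorpus should work the same way, assuming its category pattern matches the directory layout.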
