python - How do I find the most common words using spaCy?

Question

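I'm using spaCy with Python, and it works well for tagging each word, but I'd like to know whether it's possible to find the most common words in a string. Is it also possible to get the most common nouns, verbs, adverbs, and so on?

There is a count_by function included, but I can't seem to get it to run in any meaningful way.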

score 33 · Accepted Answer

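I recently had to count the frequency of all the tokens in a text file. You can filter the words by their pos_ attribute to keep only the part-of-speech (POS) tokens you want. Here is a simple example: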

import spacy
from collections import Counter
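# assumes the en_core_web_sm model is installed; the bare 'en' shortcut does not load in spaCy v3
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Your text here')
# all tokens that aren't stop words or punctuation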
words = [token.text
         for token in doc
         if not token.is_stop and not token.is_punct]

# noun tokens that aren't stop words or punctuation
nouns = [token.text
         for token in doc
         if (not token.is_stop and
             not token.is_punct and
             token.pos_ == "NOUN")]

# five most common tokens
word_freq = Counter(words)
common_words = word_freq.most_common(5)

# five most common noun tokens
noun_freq = Counter(nouns)
common_nouns = noun_freq.most_common(5)
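
The same filter extends to verbs, adverbs, or any other coarse tag. The following is a minimal sketch rather than part of the original answer: most_common_by_pos is a hypothetical helper name, and it only assumes that each token exposes the text, pos_, is_stop and is_punct attributes used above.

```python
from collections import Counter


def most_common_by_pos(tokens, pos=None, n=5):
    """Return the n most frequent token texts.

    tokens: any iterable of token-like objects (e.g. a spaCy Doc).
    pos:    coarse POS tag such as "NOUN", "VERB" or "ADV";
            None counts every non-stop, non-punctuation token.
    """
    texts = [
        token.text
        for token in tokens
        if not token.is_stop
        and not token.is_punct
        and (pos is None or token.pos_ == pos)
    ]
    return Counter(texts).most_common(n)
```

With the doc from the example above, most_common_by_pos(doc, "VERB") or most_common_by_pos(doc, "ADV") would give the most frequent verbs or adverbs.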

score 11 · Accepted Answer

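This should look basically the same as counting anything else in Python. spaCy lets you simply iterate over the document, and you get back a sequence of Token objects. These can be used to access the annotations.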

from __future__ import print_function, unicode_literals
import spacy
from collections import defaultdict, Counter

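# assumes the en_core_web_sm model is installed; the bare 'en' shortcut does not load in spaCy v3
nlp = spacy.load('en_core_web_sm')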

pos_counts = defaultdict(Counter)
doc = nlp(u'My text here.')

for token in doc:
    pos_counts[token.pos][token.orth] += 1

for pos_id, counts in sorted(pos_counts.items()):
    pos = doc.vocab.strings[pos_id]
    for orth_id, count in counts.most_common():
        print(pos, count, doc.vocab.strings[orth_id])

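Note that the .orth and .pos attributes are integers. You can get the strings they map to via the .orth_ and .pos_ attributes. The .orth attribute is the unnormalized view of the token; there are also .lower, .lemma and other string views. You might want to bind a .norm function to do your own string normalization. See the docs for details.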

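The integers are useful for counting because they let you make your counting program much more memory-efficient if you're counting over a large corpus. You could also store the frequent counts in a numpy array for additional speed and efficiency. If you don't want to bother with this, feel free to count with the .orth_ attribute directly, or use its alias .text.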
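
The numpy suggestion could look roughly like this. It is a minimal sketch, not something from the original answer: top_ids is a hypothetical helper, and it assumes the integer IDs have already been gathered into an array, for instance with doc.to_array(spacy.attrs.ORTH).

```python
import numpy as np


def top_ids(ids, n=5):
    """Return the n most frequent integer IDs as (id, count) pairs."""
    ids = np.asarray(ids).ravel()
    # np.unique returns the sorted distinct IDs and how often each occurs
    unique_ids, counts = np.unique(ids, return_counts=True)
    # stable sort by descending count; ties keep ascending-ID order
    order = np.argsort(-counts, kind="stable")[:n]
    return [(int(unique_ids[i]), int(counts[i])) for i in order]
```

Each returned ID can then be turned back into a string with doc.vocab.strings[id], as in the snippet above.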

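Note that the .pos attribute in the snippet above gives a coarse-grained set of part-of-speech tags. The richer treebank tags are available on the .tag attribute.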

score 4 · Accepted Answer

I'm quite late to this thread. However, there is in fact a built-in way to do this in spaCy, using the doc.count_by() function.

import spacy
import spacy.attrs
nlp = spacy.load("en_core_web_sm")
doc = nlp("It all happened between November 2007 and November 2008")

# Returns integers that map to parts of speech
counts_dict = doc.count_by(spacy.attrs.IDS['POS'])

# Print the human readable part of speech tags
for pos, count in counts_dict.items():
    human_readable_tag = doc.vocab[pos].text
    print(human_readable_tag, count)

The output is:

VERB 1
ADP 1
CCONJ 1
DET 1
NUM 2
PRON 1
PROPN 2
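
To get from these raw counts to a "most common" list, the dictionary can be sorted by count and its keys resolved to strings. This is a small sketch under stated assumptions, not part of the original answer: rank_counts is a hypothetical helper, and id_to_text stands for whatever lookup fits the attribute being counted.

```python
def rank_counts(counts, id_to_text, n=None):
    """Sort an {id: count} mapping by descending count and resolve the ids.

    counts:     mapping such as the one returned by doc.count_by(...)
    id_to_text: callable that maps an integer id to a readable string
    n:          keep only the top n entries; None keeps all of them
    """
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [(id_to_text(key), count) for key, count in ranked[:n]]
```

For the example above, rank_counts(counts_dict, lambda i: doc.vocab[i].text, n=3) would list the three most frequent POS tags. Passing doc.count_by(spacy.attrs.ORTH) instead would rank the word forms themselves.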
