python - Most common n words in a text

Question

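I am currently learning to work with NLP. One of the problems I am facing is finding the n words that most commonly occur together in a text. Consider the following:

text = ['Lion Monkey Elephant Weed', 'Tiger Elephant Lion Water Grass', 'Lion Weed Markov Elephant Monkey Fine', 'Guard Elephant Weed Fortune Wolf']

Suppose n = 2. I am not looking for the most common bigrams. I am searching for the 2 words that occur together most often in the texts, regardless of order or adjacency. For the above, the output should give:

'Lion' & 'Elephant': 3
'Elephant' & 'Weed': 3
'Lion' & 'Monkey': 2
'Elephant' & 'Monkey': 2

and so on.

Could anyone suggest a suitable approach to tackle this?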

score 1 · Accepted Answer

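This is tricky, but I solved it for you. I count the spaces to work out how many words elem contains :-) because an elem with 3 words must have 2 spaces, so an elem with exactly 1 space has 2 words :-) In this case only the elems with 2 words are printed.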

l = ["hello world", "good night world", "good morning sunshine", "wassap babe"]
for elem in l:
    if elem.count(" ") == 1:
        print(elem)

Output

hello world
wassap babe
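
Counting spaces gives the wrong answer when an element has leading, trailing, or doubled spaces. A minimal sketch of the same filter using `str.split()`, which splits on any run of whitespace; the helper name `two_word_items` is made up for illustration and is not part of the answer above.

```python
def two_word_items(items):
    """Return the elements made of exactly two whitespace-separated words."""
    # split() with no arguments ignores leading/trailing and repeated whitespace
    return [elem for elem in items if len(elem.split()) == 2]


l = ["hello world", "good night world", "good morning sunshine", "wassap  babe "]
print(two_word_items(l))
```

Note that this only selects two-word strings; it does not count which words occur together across texts.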

score 1 · Accepted Answer

I would suggest using Counter and combinations as follows.

from collections import Counter
from itertools import combinations, chain

text = ['Lion Monkey Elephant Weed', 'Tiger Elephant Lion Water Grass', 'Lion Weed Markov Elephant Monkey Fine', 'Guard Elephant Weed Fortune Wolf']


def count_combinations(text, n_words, n_most_common=None):
    count = []
    for t in text:
        words = t.split()
        # every group of n_words words that appear in the same text
        combos = combinations(words, n_words)
        # sort within each group so that word order does not matter
        count.append([" & ".join(sorted(c)) for c in combos])
    # flatten, count, and keep the n_most_common groups (all of them if None)
    return dict(Counter(sorted(chain(*count))).most_common(n_most_common))

print(count_combinations(text, 2))
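
One caveat with the snippet above: if a word appears twice in the same text, `combinations` counts its pairs more than once. A minimal sketch of a variant that counts each group at most once per text, assuming that is the behavior you want. The name `count_cooccurrences` is made up for illustration, and it returns a list of (words, count) tuples rather than a dict.

```python
from collections import Counter
from itertools import combinations


def count_cooccurrences(texts, n_words=2, n_most_common=None):
    """Count how many texts contain each group of n_words distinct words."""
    counts = Counter()
    for t in texts:
        # set() drops repeated words so a group is counted once per text;
        # sorted() makes every group come out in the same word order.
        unique_words = sorted(set(t.split()))
        counts.update(combinations(unique_words, n_words))
    return counts.most_common(n_most_common)


text = ['Lion Monkey Elephant Weed', 'Tiger Elephant Lion Water Grass',
        'Lion Weed Markov Elephant Monkey Fine', 'Guard Elephant Weed Fortune Wolf']

for words, freq in count_cooccurrences(text, 2, 6):
    print(" & ".join(words), freq)
```

For the sample texts this reports Elephant & Lion and Elephant & Weed with a count of 3, followed by the pairs that occur together in 2 texts.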
