python - Most common 2-grams using Python

Question

Given a string:

this is a test this is

How can I find the top n most common 2-grams? In the string above, all the 2-grams are:

{this is, is a, a test, test this, this is}

As you can see, the 2-gram "this is" appears 2 times. So the result should be:

{this is: 2}

I know I can use the Counter.most_common() method to find the most common elements, but how do I build a list of 2-grams from the string in the first place?

score 8 · Accepted Answer

You can use the method from this blog post to conveniently create n-grams in Python.

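from collections import Counter

text = "this is a test this is"
words = text.split()

# Pair each word with its successor to form the bigrams.
bigrams = zip(words, words[1:])
counts = Counter(bigrams)
print(counts.most_common())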

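Of course, this assumes the input is a list of words. If your input is a string like the one you provided (with no punctuation), you can simply do words = text.split(' ') to get the list of words. In general, though, you have to account for punctuation, whitespace, and other non-alphabetic characters. In that case you might do something like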

import re

words = re.findall(r'[A-Za-z]+', text)

Or you could use an external library, such as nltk.tokenize.
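The tokenizing and counting steps can be combined into one helper. This is only a minimal sketch: the name top_bigrams is hypothetical, and lowercasing the text is an added assumption so that "This is" and "this is" count as the same bigram.

```python
import re
from collections import Counter


def top_bigrams(text, n):
    """Return the n most common word bigrams as (bigram, count) pairs."""
    # Assumption: case does not matter, so normalize to lowercase first.
    words = re.findall(r"[A-Za-z]+", text.lower())
    # Pair each word with its successor, then count the pairs.
    return Counter(zip(words, words[1:])).most_common(n)


result = top_bigrams("This is a test. This is!", 1)
print(result)  # [(('this', 'is'), 2)]
```

Note that the [A-Za-z]+ pattern drops digits and splits contractions such as "don't" into two tokens, so a proper tokenizer may be a better fit for real text.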

Edit: If you need trigrams or any other n-grams in general, you can use the function from the blog post I linked to:

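def find_ngrams(input_list, n):
    # Zip n copies of the list, each shifted one position further, into n-tuples.
    return zip(*(input_list[i:] for i in range(n)))

trigrams = find_ngrams(words, 3)  # in Python 3 this is a lazy iterator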
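As a rough illustration of how the general function plugs into Counter, here is a sketch using a made-up sample sentence. In Python 3, zip returns an iterator that can only be consumed once, so it is passed straight to Counter rather than stored and reused.

```python
from collections import Counter


def find_ngrams(input_list, n):
    # Zip n shifted views of the list together to form n-tuples.
    return zip(*(input_list[i:] for i in range(n)))


# Sample input chosen for illustration only.
words = "this is a test this is a".split()
trigram_counts = Counter(find_ngrams(words, 3))
top = trigram_counts.most_common(1)
print(top)  # [(('this', 'is', 'a'), 2)]
```

If the list has fewer than n words, the shortest slice is empty and zip yields nothing, so the counter simply comes back empty.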

score 2 · Accepted Answer

Well, you can use

words = s.split() # s is the original string
pairs = [(words[i], words[i+1]) for i in range(len(words)-1)]

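(words[i], words[i+1]) is the pair of words at positions i and i+1. We iterate over all pairs from (0, 1) to (n-2, n-1), where n is the number of words in s (that is, len(words)), not the length of the string itself.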
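To check the index range, the following sketch wraps the comprehension in a helper (the name word_pairs is hypothetical) and tries it on short inputs. Since range() of zero or a negative number is empty, strings with fewer than two words yield no pairs instead of raising an IndexError.

```python
def word_pairs(s):
    words = s.split()
    # n words produce n - 1 adjacent pairs: (0, 1) through (n-2, n-1).
    return [(words[i], words[i + 1]) for i in range(len(words) - 1)]


pairs = word_pairs("this is a test this is")
print(len(pairs))            # 5 pairs from 6 words
print(word_pairs("single"))  # []
print(word_pairs(""))        # []
```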

score 1 · Accepted Answer

The simplest way is:

s = "this is a test this is"
words = s.split()
words_zip = zip(words, words[1:])
two_grams_list = list(words_zip)
print(two_grams_list)

The code above gives you a list of all the 2-grams:

[('this', 'is'), ('is', 'a'), ('a', 'test'), ('test', 'this'), ('this', 'is')]

Now we need to count the frequency of each 2-gram:

count_freq = {}
for item in two_grams_list:
    if item in count_freq:
        count_freq[item] +=1
    else:
        count_freq[item] = 1

Now we need to sort the results in descending order of frequency and print them:

sorted_two_grams = sorted(count_freq.items(), key=lambda item: item[1], reverse=True)
print(sorted_two_grams)

Output:

[(('this', 'is'), 2), (('is', 'a'), 1), (('a', 'test'), 1), (('test', 'this'), 1)]
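For comparison, the manual counting and sorting can be collapsed with collections.Counter, which the question already mentions. This sketch also joins each tuple back into a string, on the assumption that the asker wants keys shaped like "this is", as in the expected result.

```python
from collections import Counter

words = "this is a test this is".split()
counts = Counter(zip(words, words[1:]))

# most_common(n) returns the n highest counts in descending order.
top_n = {" ".join(bigram): count for bigram, count in counts.most_common(1)}
print(top_n)  # {'this is': 2}
```

Passing a larger n, or no argument at all, returns more of the ranking.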
