python - How to extract all n-grams from a text column of a pandas DataFrame, regardless of word order?

Question

Below is my input DataFrame.

id  description
1   **must watch avoid** **good acting**
2   average movie bad acting
3   good movie **acting good**
4   pathetic avoid
5   **avoid watch must**

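I want to extract n-grams (bigrams, trigrams, and 4-grams) from the words that occur frequently in the phrases. The idea is to tokenize each phrase into words and then find the n-grams even when the frequent words appear in a different order. For example, if the frequent words in one phrase are "good movie" and in a second phrase they are "movie good", the bigram should be extracted as "good movie". A sample of the output I expect is shown below: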

ngram              frequency
must watch            2
acting good           2
must watch avoid      2
average               1

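As we can see, the frequently used words in the first sentence are "must watch", while the last sentence has "watch must", i.e. the order of the frequent words has changed. The bigram "must watch" is therefore extracted with a frequency of 2.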

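I need to extract n-grams/bigrams from the frequently used words in the phrases.

How can I achieve this with a pandas DataFrame in Python? Any help is greatly appreciated.

Thanks!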

score 7 · Accepted Answer

import pandas as pd
from collections import Counter
from itertools import chain

data = [
    {"sentence": "Run with dogs, or shoes, or dogs and shoes"},
    {"sentence": "Run without dogs, or without shoes, or without dogs or shoes"},
    {"sentence": "Hold this while I finish writing the python script"},
    {"sentence": "Is this python script written yet, hey, hold this"},
    {"sentence": "Can dogs write python, or a python script?"},
]

def find_ngrams(input_list, n):
    return list(zip(*[input_list[i:] for i in range(n)]))

df = pd.DataFrame.from_records(data)
df['bigrams'] = df['sentence'].map(lambda x: find_ngrams(x.split(" "), 2))
df.head()
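
To make the `zip` trick in `find_ngrams` easier to follow, here is a small sketch that applies the same function to the first description from the question (with the bold markers removed, which is an assumption about the intended input). Each shifted slice `input_list[i:]` supplies the i-th word of every window, and `zip` stops at the shortest slice, so only complete windows are returned.

```python
def find_ngrams(input_list, n):
    # Build n shifted copies of the token list and zip them together;
    # zip stops at the shortest copy, so only full n-word windows remain.
    return list(zip(*[input_list[i:] for i in range(n)]))

# First description from the question, bold markers stripped (assumed input).
tokens = "must watch avoid good acting".split()

bigrams = find_ngrams(tokens, 2)    # 4 windows of 2 words
trigrams = find_ngrams(tokens, 3)   # 3 windows of 3 words
fourgrams = find_ngrams(tokens, 4)  # 2 windows of 4 words

print(bigrams)
print(trigrams)
print(fourgrams)
```

The same function therefore covers bigrams, trigrams and 4-grams just by changing `n`.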

Now on to the frequency counts

# Bigram Frequency Counts
bigrams = df['bigrams'].tolist()
bigrams = list(chain(*bigrams))
bigrams = [(x.lower(), y.lower()) for x,y in bigrams]

bigram_counts = Counter(bigrams)
bigram_counts.most_common(10)

 [(('dogs,', 'or'), 2),
 (('shoes,', 'or'), 2),
 (('or', 'without'), 2),
 (('hold', 'this'), 2),
 (('python', 'script'), 2),
 (('run', 'with'), 1),
 (('with', 'dogs,'), 1),
 (('or', 'shoes,'), 1),
 (('or', 'dogs'), 1),
 (('dogs', 'and'), 1)]
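
The counts above are order-sensitive, so ('hold', 'this') and ('this', 'hold') would be tallied separately. One possible way to get the order-insensitive counts the question asks for is to sort the words inside each n-gram before counting. The sketch below is not part of the original answer; it assumes the question's descriptions without the bold markers, considers only adjacent words, and uses a hypothetical helper named `unordered_ngrams`.

```python
from collections import Counter

import pandas as pd

# Input from the question, bold markers stripped (assumed).
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "description": [
        "must watch avoid good acting",
        "average movie bad acting",
        "good movie acting good",
        "pathetic avoid",
        "avoid watch must",
    ],
})

def unordered_ngrams(tokens, n):
    """Return contiguous n-grams with the words of each n-gram sorted.

    Sorting gives every permutation of the same words one canonical key,
    e.g. ('watch', 'must') and ('must', 'watch') both become ('must', 'watch').
    """
    windows = zip(*[tokens[i:] for i in range(n)])
    return [tuple(sorted(window)) for window in windows]

counts = Counter()
for text in df["description"]:
    tokens = text.lower().split()
    for n in (2, 3, 4):  # bigrams, trigrams and 4-grams
        counts.update(unordered_ngrams(tokens, n))

# Tabulate the counts, most frequent first.
result = pd.DataFrame(
    [(" ".join(ngram), freq) for ngram, freq in counts.most_common()],
    columns=["ngram", "frequency"],
)
print(result.head(10))
```

Because each key is stored in sorted order, the trigram from rows 1 and 5 is reported as "avoid must watch" rather than "must watch avoid". If the original word order matters for display, one option is to remember the first spelling seen for each sorted key.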
