python - Finding pairs that occur more than once in a CSV file

Question

I am trying to write a Python script that searches a CSV file and counts how many times two items appear next to each other.

For example, say the CSV looks like this:

red,green,blue,red,yellow,green,yellow,red,green,purple,blue,yellow,red,blue,blue,green,purple,red,blue,blue,red,green

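And I want to find the number of times "red,green" appear next to each other (but I would like a solution that is not specific to the words in this CSV).

So far, I figured that converting the CSV to a list might be a good start: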

import csv
with open('examplefile.csv', 'rb') as f:
    reader = csv.reader(f)
    your_list = list(reader)

print your_list

Which returns:

[['red', 'green', 'blue', 'red', 'yellow', 'green', 'yellow', 'red', 'green', 'purple', 'blue', 'yellow', 'red', 'blue', 'blue', 'green', 'purple', 'red', 'blue', 'blue', 'red', 'green ']]

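In this list, 'red', 'green' occurs three times. What method/module/loop structure could I use to find out whether any two items appear next to each other in the list more than once?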

score 5 · Accepted Answer

What you are looking for are called bigrams (pairs of two words). You usually see these in text-mining/NLP-type problems. Try this:

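from itertools import islice, izip
from collections import Counter
# csv.reader returns one list per row: take the first row and strip stray whitespace
words = [w.strip() for w in your_list[0]]
print Counter(izip(words, islice(words, 1, None)))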

Which returns:

Counter({('red', 'green'): 3, ('red', 'blue'): 2, ('yellow', 'red'): 2, ('green', 'purple'): 2, ('blue', 'blue'): 2, ('blue', 'red'): 2, ('purple', 'blue'): 1, ('red', 'yellow'): 1, ('green', 'blue'): 1, ('purple', 'red'): 1, ('blue', 'yellow'): 1, ('blue', 'green'): 1, ('yellow', 'green'): 1, ('green', 'yellow'): 1})
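
For readers on Python 3, where izip no longer exists and the built-in zip is already lazy, the same idea can be sketched roughly as follows. This is only an illustration under a few assumptions: the helper name count_adjacent_pairs is made up here, the data is read from an in-memory string standing in for examplefile.csv, and every field is stripped so that the trailing space in 'green ' does not create a separate key.

```python
import csv
import io
from collections import Counter


def count_adjacent_pairs(items):
    """Count each ordered pair of neighbouring items (bigrams)."""
    # zip(items, items[1:]) pairs every element with its successor.
    return Counter(zip(items, items[1:]))


# Stand-in for the file from the question; note the trailing space at the end.
sample = ("red,green,blue,red,yellow,green,yellow,red,green,purple,blue,"
          "yellow,red,blue,blue,green,purple,red,blue,blue,red,green \n")

# csv.reader yields one list per row, so flatten the rows and strip whitespace.
words = [field.strip() for row in csv.reader(io.StringIO(sample)) for field in row]

counts = count_adjacent_pairs(words)
print(counts[("red", "green")])  # 3
```

Slicing with items[1:] copies the list, which is fine for a file of this size; for very large inputs, islice(items, 1, None) as in the answer avoids the copy.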

If you only need the items that occur more than once, treat the Counter object like a Python dict.

words = [w.strip() for w in your_list[0]]  # first CSV row, whitespace stripped
counts = Counter(izip(words, islice(words, 1, None)))
print [k for k, v in counts.iteritems() if v > 1]

So you get only the relevant pairs:

[('red', 'blue'), ('red', 'green'), ('yellow', 'red'), ('green', 'purple'), ('blue', 'blue'), ('blue', 'red')]
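
A Python 3 version of this filtering step might look like the sketch below. The short words list is invented for illustration and is not the data from the question; .items() takes the place of Python 2's .iteritems().

```python
from collections import Counter

# Hypothetical sample data, shorter than the list in the question.
words = ["red", "green", "blue", "red", "green", "blue", "blue"]
counts = Counter(zip(words, words[1:]))

# Keep only the pairs seen more than once, together with their counts.
repeated = {pair: n for pair, n in counts.items() if n > 1}
print(repeated)
```

Keeping the counts in a dict, rather than only the keys, makes it easy to report how often each repeated pair occurred.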

See the post I borrowed some of the code from: Counting bigrams (pair of two words) in a file using python

score 1 · Accepted Answer

This checks for both the 'red','green' and the 'green','red' combinations in one pass:

l = your_list[0]  # the single CSV row as a flat list
pair = ('red', 'green')
positions = [i for i in xrange(len(l)-1) if ((l[i],l[i+1]) == pair or (l[i+1],l[i]) == pair)]
print positions
# prints [0, 7]; notice that your last entry was 'green ' not 'green'

The output lists each index i at which the pattern starts.

Testing with your example (with the final 'green ' corrected):

l = [['red', 'green', 'blue', 'red', 'yellow', 'green', 'yellow', 'red', 'green', 'purple', 'blue', 'yellow', 'red', 'blue', 'blue', 'green', 'purple', 'red', 'blue', 'blue', 'red', 'green']]
l = l[0]

# add another entry to test reversed matching
l.append('red')

pair = ('red', 'green')
positions = [i for i in xrange(len(l)-1) if ((l[i],l[i+1]) == pair or (l[i+1],l[i]) == pair)]

print positions
# prints [0, 7, 20, 21]

if len(positions) > 1:  # positions is a list, so test its length
    print 'do stuff'
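
The order-insensitive matching above can also be combined with counting. The following Python 3 sketch assumes a small made-up words list and normalises each pair by sorting it, so that ('green', 'red') and ('red', 'green') share one key; the set comparison for positions is one possible alternative to the two tuple comparisons in the answer, not the answer's own method.

```python
from collections import Counter

# Hypothetical sample data for illustration.
words = ["red", "green", "blue", "green", "red", "red"]

# Sorting each pair makes ('green', 'red') and ('red', 'green') the same key.
unordered = Counter(tuple(sorted(pair)) for pair in zip(words, words[1:]))

# Start indices of one specific pair, matched in either order.
target = {"red", "green"}
positions = [i for i in range(len(words) - 1)
             if {words[i], words[i + 1]} == target]

print(unordered[("green", "red")])  # 2
print(positions)  # [0, 3]
```

Because the keys are sorted tuples, lookups must use the sorted order too, for example ("green", "red") rather than ("red", "green").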
