I have a CSV containing a list of items, each with a series of attributes attached:

"5","coffee|peaty|sweet|cereal|cream|barley|malt|creosote|sherry|sherry|manuka|honey|peaty|peppercorn|chipotle|chilli|salt|caramel|coffee|demerara|sugar|molasses|spicy|peaty"
"6","oil|lemon|apple|butter|toffee|treacle|sweet|cola|oak|cereal|cinnamon|salt|toffee"

"5" and "6" are both item IDs and are unique within the file.

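Ultimately, I want to create a matrix showing how many times each attribute in the document is mentioned on the same row as every other attribute. For example: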

        peaty    sweet    cereal    cream    barley ...
coffee    1       2         2         1        1
oil       0       1         0         0        0 

Note that I'd prefer to reduce duplication: i.e. "peaty" should not be both a column and a row.

The original database is essentially a key-value store (a table with "itemId" and "value" columns), so I can reformat the data if that helps.

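Any idea how to do this with Python, PHP or Ruby (whichever is easiest)? I have a feeling Python would be the easiest for this, but I'm missing something fairly basic and/or crucial (I'm only just starting out with data analysis in Python).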

Thanks!

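Edit: In response to the (somewhat unhelpful) "what have you tried" comment, here's what I'm currently working with (don't laugh, my Python is terrible):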

#!/usr/bin/python
import csv

matrix = {}

with open("field.csv", "rb") as csvfile:
    csvreader = csv.reader(csvfile)
    for row in csvreader:
        attribs = row[1].split("|")
        for attrib in attribs:
            if attrib not in matrix:
                matrix[attrib] = {}
            for attrib2 in attribs:
                if attrib2 in matrix[attrib]:
                    matrix[attrib][attrib2] = matrix[attrib][attrib2] + 1 
                else:
                    matrix[attrib][attrib2] = 1
print matrix 

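The output is a large, unsorted dictionary of terms, probably with a lot of duplication between rows and columns. If I use pandas and replace the `print matrix` line with the following...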

from pandas import *
df = DataFrame(matrix).T.fillna(0)
print df

I get:

<class 'pandas.core.frame.DataFrame'>
Index: 195 entries, acacia to zesty
Columns: 195 entries, acacia to zesty
dtypes: float64(195)

...which makes me think I'm doing something wrong.

2 Answers

I'd do this with an undirected graph, where the frequency is the edge weight. Then you can generate the matrix quite easily by looping through each vertex, where each edge weight represents how many times each element occurred with another.

Graph docs: http://networkx.github.io/documentation/latest/reference/classes.graph.html

Starter code:

import csv
import itertools
import networkx as nx

G = nx.Graph()

reader = csv.reader(open('field.csv', "rb"))
for row in reader:
  row_elements = row[1].split("|")
  combinations = itertools.combinations(row_elements, 2)
  for (a, b) in combinations:
    if G.has_edge(a, b):
      G[a][b]['weight'] += 1
    else:
      G.add_edge(a, b, weight=1)

print(G.edges(data=True))

Edit: woah see if this does everything for ya http://networkx.github.io/documentation/latest/reference/linalg.html#module-networkx.linalg.graphmatrix
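
For what it's worth, the "loop through each vertex" step could look roughly like the sketch below. It is only an illustration under a few assumptions: it is written for Python 3, it uses a plain dict of dicts in place of the networkx graph so it runs with the standard library alone, it counts an attribute once per row even if it is repeated, and `rows`, `weights` and `to_matrix` are made-up names rather than anything from networkx.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample rows standing in for the second CSV column.
rows = [
    "coffee|peaty|sweet",
    "oil|sweet|peaty",
]

# Undirected weighted graph as a dict of dicts: weights[a][b] == weights[b][a].
weights = defaultdict(lambda: defaultdict(int))
for row in rows:
    # set() drops attributes repeated within a row, so no self-loops appear.
    for a, b in combinations(sorted(set(row.split("|"))), 2):
        weights[a][b] += 1
        weights[b][a] += 1


def to_matrix(weights):
    """Return (labels, matrix) where matrix[i][j] is the weight of edge i-j."""
    labels = sorted(weights)
    matrix = [[weights[a].get(b, 0) for b in labels] for a in labels]
    return labels, matrix


labels, matrix = to_matrix(weights)
for label, line in zip(labels, matrix):
    print(label.ljust(8), line)
```

The result is symmetric, so if you only want each pair once you can keep just the entries above the diagonal (column index greater than row index).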

answered 2013-05-28T16:43:59.867

I would use a Counter keyed by tuples of 2 strings. Of course you will end up with double combinations, but so far I have no idea how to avoid that:

import csv
from collections import Counter
from itertools import combinations

counter = Counter()
# Python 2: the csv module expects the file to be opened in binary mode
with open("field.csv", "rb") as csvfile:
    csvreader = csv.reader(csvfile)
    for row in csvreader:
        attribs = row[1].split("|")
        # count every pair of attributes that appears on the same row
        for cmb in combinations(attribs, 2):
            counter[cmb] += 1
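
One possible way around the double combinations, offered as a sketch rather than a definitive fix: sort the attributes of each row before pairing them, so every pair is always stored in the same order. The version below is written for Python 3 and makes two assumptions that are not in the question: an attribute repeated within a row counts once for that row, and the data is read from an in-memory `sample` (a made-up stand-in for field.csv).

```python
import csv
import io
from collections import Counter
from itertools import combinations

# Hypothetical two-row sample in the same format as field.csv.
sample = io.StringIO(
    '"5","coffee|peaty|sweet|peaty"\n'
    '"6","oil|sweet|peaty"\n'
)

counter = Counter()
for item_id, attribs in csv.reader(sample):
    # set() ignores repeats within one row; sorted() gives every pair a
    # single canonical order, so ("peaty", "sweet") never also shows up
    # as ("sweet", "peaty").
    unique = sorted(set(attribs.split("|")))
    counter.update(combinations(unique, 2))

for (a, b), n in sorted(counter.items()):
    print(a, b, n)
```

Because each key is stored as (smaller, larger), looking up a pair means sorting the two names first, e.g. `counter[tuple(sorted(("sweet", "peaty")))]`.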
answered 2013-05-29T08:16:46.297