I am trying to compute cosine similarity between vectors of food amounts for different students. I have a CSV file that contains:

Student   food      amount
John      apple       15
John      banana      20
John      orange      1
John      grape       3
Ben       apple       2
Ben       orange      4
Ben       strawberry  8
Andrew    apple       10
Andrew    watermelon  3

The following code:

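import csv
from collections import defaultdict

# Map each student to a {food: amount} dictionary.
data = defaultdict(dict)
with open('data.csv', 'r') as file:
    reader = csv.DictReader(file)
    for row in reader:
        # csv yields strings, so convert the amount to a number.
        data[row['Student']][row['food']] = int(row['amount'])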

gives me a structure like this:

{'John': {'apple': 15, 'banana': 20, 'orange': 1, 'grape': 3}, 
 'Ben': {'apple': 2, 'orange': 4, 'strawberry': 8}, #etc.
}

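I want to convert these dictionaries into vectors whose length is the number of unique foods, with foods a student did not eat defaulting to 0, so that: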

for John: [15,20,1,3,0,0] corresponds to [apple,banana,orange,grape,strawberry,watermelon]
for Ben: [2,0,4,0,8,0] corresponds to [apple,banana,orange,grape,strawberry,watermelon] #etc

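I would then output a cosine similarity matrix between every pair of students. Thank you for taking the time to read this. Any help would be greatly appreciated.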

1 Answer

>>> D = {'John': {'apple': 15, 'banana': 20, 'orange': 1, 'grape': 3}, 
...  'Ben': {'apple': 2, 'orange': 4, 'strawberry': 8}, #etc.
... }

First, build a list of all unique keys:

>>> all_keys = sorted({k for i in D for k in D[i]})
>>> all_keys
['apple', 'banana', 'grape', 'orange', 'strawberry']

Now you can loop over these keys for each person:

>>> {k:[D[k].get(i, 0) for i in all_keys] for k in D}
{'John': [15, 20, 3, 1, 0], 'Ben': [2, 0, 0, 4, 8]}
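
The vectors still need to be compared to get the similarity matrix the question asks for. What follows is a minimal sketch of that step using only the standard library. The helper names `cosine_similarity` and `similarity_matrix` are illustrative and not part of the original answer, and returning 0.0 for an all-zero vector is an assumption made here to avoid dividing by zero.

```python
import math

def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumption: an all-zero vector is treated as similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)

def similarity_matrix(vectors):
    """Build a nested dict: matrix[a][b] is the similarity of a and b."""
    names = list(vectors)
    return {a: {b: cosine_similarity(vectors[a], vectors[b]) for b in names}
            for a in names}

# Vectors as produced by the dictionary comprehension above.
vectors = {'John': [15, 20, 3, 1, 0], 'Ben': [2, 0, 0, 4, 8]}
matrix = similarity_matrix(vectors)

for name, row in matrix.items():
    print(name, {other: round(score, 4) for other, score in row.items()})
```

The result is symmetric, and each student has similarity 1 with themselves (up to floating-point rounding). For larger data sets, a numeric library would likely be a better fit than nested dictionaries.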
answered 2014-02-19T23:22:40.393