12

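I know this isn't strictly a coding question, but this is the most appropriate place to ask it, so please bear with me.

Suppose I have a dictionary like the one below, listing ten liked items for each person: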

likes={
    "rajat":{"music","x-men","programming","hindi","english","himesh","lil wayne","rap","travelling","coding"},
    "steve":{"travelling","pop","hanging out","friends","facebook","tv","skating","religion","english","chocolate"},
    "toby":{"programming","pop","rap","gardens","flowers","birthday","tv","summer","youtube","eminem"},
    "ravi":{"skating","opera","sony","apple","iphone","music","winter","mango shake","heart","microsoft"},
    "katy":{"music","pics","guitar","glamour","paris","fun","lip sticks","cute guys","rap","winter"},
    "paul":{"office","women","dress","casuals","action movies","fun","public speaking","microsoft","developer"},
    "sheila":{"heart","beach","summer","laptops","youtube","movies","hindi","english","cute guys","love"},
    "saif":{"women","beach","laptops","movies","himesh","world","earth","rap","fun","eminem"},
    "mark":{"pilgrimage","programming","house","world","books","country music","bob","tom hanks","beauty","tigers"},
    "stuart":{"rap","smart girls","music","wrestling","brock lesnar","country music","public speaking","women","coding","iphone"},
    "grover":{"skating","mountaineering","racing","athletics","sports","adidas","nike","women","apple","pop"},
    "anita":{"heart","sunidhi","hindi","love","love songs","cooking","adidas","beach","travelling","flowers"},
    "kelly":{"travelling","comedy","tv","facebook","youtube","cooking","horror","movies","dublin","animals"},
    "dino":{"women","games","xbox","x-men","assassin's creed","pop","rap","opera","need for speed","jeans"},
    "priya":{"heart","mountaineering","sky diving","sony","apple","pop","perfumes","luxury","eminem","lil wayne"},
    "brenda":{"cute guys","xbox","shower","beach","summer","english","french","country music","office","birds"}
}

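How can I determine which people have similar likes, or perhaps which two people are the most similar? Also, it would be helpful if you could point me to suitable examples or tutorials on user-based or item-based filtering.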

4

6 Answers

10

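(Disclaimer: I'm not well versed in this field and have only a passing knowledge of collaborative filtering. What follows is simply a collection of resources I have found useful.)

Chapter 2 of the book "Programming Collective Intelligence" covers the basics of this quite comprehensively. The example code is in Python, which is another plus.

You may also find this site useful: A Programmer's Guide to Data Mining. In particular, Chapters 2 and 3 discuss recommendation systems and item-based filtering.

In short, techniques such as the Pearson correlation coefficient, cosine similarity, and k-nearest neighbours can be used to determine the similarity between users based on the items they liked/bought/voted on.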
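As a rough illustration of the set-based case in the question, here is a minimal sketch of cosine similarity where each person's likes are treated as a 0/1 vector. The Jaccard index is included as an extra measure that this answer does not name; the function names are made up for the example, and the three entries are copied from the question's dictionary.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity of two sets treated as 0/1 vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))

def jaccard(a, b):
    """Items liked by both, divided by items liked by either."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Three entries copied from the question's dictionary
likes = {
    "rajat": {"music", "x-men", "programming", "hindi", "english",
              "himesh", "lil wayne", "rap", "travelling", "coding"},
    "stuart": {"rap", "smart girls", "music", "wrestling", "brock lesnar",
               "country music", "public speaking", "women", "coding", "iphone"},
    "grover": {"skating", "mountaineering", "racing", "athletics", "sports",
               "adidas", "nike", "women", "apple", "pop"},
}

for other in ("stuart", "grover"):
    print(other,
          cosine(likes["rajat"], likes[other]),
          jaccard(likes["rajat"], likes[other]))
```

Both measures are 0 when two people share nothing and 1 when their likes are identical, so either can be used to rank candidates against one person.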

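Note that various Python libraries have been written for this purpose, e.g. pysuggest, Crab, python-recsys, and SciPy.stats.stats.pearsonr.

For large datasets where the number of users exceeds the number of items, you can scale the solution better by inverting the data and computing correlations between items (i.e. item-based filtering), then using those to infer similar users. Naturally, you would not do this in real time; instead you would schedule periodic recomputation as a backend task. Some of these approaches can be parallelised/distributed to cut computation time considerably (assuming you have the resources to throw at it).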
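The "inverting the data" step can be sketched as follows. This is only an assumption about what such an inversion might look like for the question's dictionary: `invert` and `item_similarity` are hypothetical helper names, and Jaccard overlap stands in for whichever correlation measure you actually choose.

```python
from collections import defaultdict

def invert(likes):
    """Turn {user: {items}} into {item: {users}}."""
    by_item = defaultdict(set)
    for user, items in likes.items():
        for item in items:
            by_item[item].add(user)
    return dict(by_item)

def item_similarity(by_item, a, b):
    """Jaccard overlap between the sets of users who like items a and b."""
    users_a = by_item.get(a, set())
    users_b = by_item.get(b, set())
    union = users_a | users_b
    return len(users_a & users_b) / len(union) if union else 0.0

# Three entries copied from the question's dictionary
likes = {
    "rajat": {"music", "x-men", "programming", "hindi", "english",
              "himesh", "lil wayne", "rap", "travelling", "coding"},
    "toby": {"programming", "pop", "rap", "gardens", "flowers",
             "birthday", "tv", "summer", "youtube", "eminem"},
    "saif": {"women", "beach", "laptops", "movies", "himesh",
             "world", "earth", "rap", "fun", "eminem"},
}

by_item = invert(likes)
print(sorted(by_item["rap"]))
print(item_similarity(by_item, "rap", "eminem"))
```

The item-to-item scores can be precomputed in a batch job and looked up later, which is the part that makes this layout attractive when users far outnumber items.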

answered 2012-07-16T12:35:42.823
7

A solution using the python-recsys library [ http://ocelma.net/software/python-recsys/build/html/quickstart.html ]

from recsys.algorithm.factorize import SVD
from recsys.datamodel.data import Data

likes={
    "rajat":{"music","x-men","programming","hindi","english","himesh","lil wayne","rap","travelling","coding"},
    "steve":{"travelling","pop","hanging out","friends","facebook","tv","skating","religion","english","chocolate"},
    "toby":{"programming","pop","rap","gardens","flowers","birthday","tv","summer","youtube","eminem"},
    "ravi":{"skating","opera","sony","apple","iphone","music","winter","mango shake","heart","microsoft"},
    "katy":{"music","pics","guitar","glamour","paris","fun","lip sticks","cute guys","rap","winter"},
    "paul":{"office","women","dress","casuals","action movies","fun","public speaking","microsoft","developer"},
    "sheila":{"heart","beach","summer","laptops","youtube","movies","hindi","english","cute guys","love"},
    "saif":{"women","beach","laptops","movies","himesh","world","earth","rap","fun","eminem"},
    "mark":{"pilgrimage","programming","house","world","books","country music","bob","tom hanks","beauty","tigers"},
    "stuart":{"rap","smart girls","music","wrestling","brock lesnar","country music","public speaking","women","coding","iphone"},
    "grover":{"skating","mountaineering","racing","athletics","sports","adidas","nike","women","apple","pop"},
    "anita":{"heart","sunidhi","hindi","love","love songs","cooking","adidas","beach","travelling","flowers"},
    "kelly":{"travelling","comedy","tv","facebook","youtube","cooking","horror","movies","dublin","animals"},
    "dino":{"women","games","xbox","x-men","assassin's creed","pop","rap","opera","need for speed","jeans"},
    "priya":{"heart","mountaineering","sky diving","sony","apple","pop","perfumes","luxury","eminem","lil wayne"},
    "brenda":{"cute guys","xbox","shower","beach","summer","english","french","country music","office","birds"}
}

data = Data()
VALUE = 1.0
for username in likes:
    for user_likes in likes[username]:
        data.add_tuple((VALUE, username, user_likes)) # Tuple format is: <value, row, column>

svd = SVD()
svd.set_data(data)
k = 5 # Usually, in a real dataset, you should set a higher number, e.g. 100
svd.compute(k=k, min_values=3, pre_normalize=None, mean_center=False, post_normalize=True)

svd.similar('sheila')
svd.similar('rajat')

Result:

In [11]: svd.similar('sheila')
Out[11]: 
[('sheila', 0.99999999999999978),
 ('brenda', 0.94929845546505753),
 ('anita', 0.85943494201162518),
 ('kelly', 0.53385495931440263),
 ('saif', 0.39985366653259058),
 ('rajat', 0.30757664244952165),
 ('toby', 0.28541364367155014),
 ('priya', 0.26184289111194581),
 ('steve', 0.25043700194182622),
 ('katy', 0.21812807229358305)]

In [12]: svd.similar('rajat')
Out[12]: 
[('rajat', 1.0000000000000004),
 ('mark', 0.89164019482177692),
 ('katy', 0.65207273451425907),
 ('stuart', 0.61675507205285718),
 ('steve', 0.55730648750670264),
 ('anita', 0.49836982296014803),
 ('brenda', 0.42759524471725929),
 ('kelly', 0.40436047539358799),
 ('toby', 0.35972227835054826),
 ('ravi', 0.31113813325818901)]
answered 2013-05-14T17:26:16.520
3

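SequenceMatcher in difflib is useful for this kind of thing. If you use ratio(), it returns a value between 0 and 1 corresponding to the similarity between the two sequences. From the docs:

Return a measure of the sequences' similarity as a float in the range [0, 1]. Where T is the total number of elements in both sequences, and M is the number of matches, this is 2.0*M / T. Note that this is 1.0 if the sequences are identical, and 0.0 if they have nothing in common.

Using your example, comparing only 'rajat' against everyone else (with the dictionary corrected by switching the inner {} for []):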

import difflib
# likes[...] must hold lists here: SequenceMatcher needs indexable sequences
for key in likes:
    print('rajat', key, difflib.SequenceMatcher(None, likes['rajat'], likes[key]).ratio())
#Output:
rajat sheila 0.2
rajat katy 0.2
rajat brenda 0.1
rajat saif 0.2
rajat dino 0.2
rajat toby 0.2
rajat mark 0.1
rajat steve 0.1
rajat priya 0.1
rajat grover 0.0
rajat ravi 0.1
rajat rajat 1.0
rajat stuart 0.2
rajat kelly 0.1
rajat paul 0.0
rajat anita 0.2
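One thing worth checking before relying on these numbers: SequenceMatcher compares ordered sequences, so the same likes listed in a different order may score below 1.0. A minimal sketch with made-up lists, next to an order-independent set overlap for comparison:

```python
import difflib

a = ["music", "rap", "coding"]
b = ["coding", "rap", "music"]  # same items as a, in reverse order

# Order-sensitive: based on matching blocks found in sequence order
ordered = difflib.SequenceMatcher(None, a, b).ratio()

# Order-independent: shared items divided by all distinct items
unordered = len(set(a) & set(b)) / len(set(a) | set(b))

print(ordered, unordered)
```

If the order in which likes are listed carries no meaning, a set-based measure is probably the safer choice.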
answered 2012-07-16T10:46:31.880
1

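The most basic approach I can think of is to find the intersection between each pair of people's lists of likes; the two people whose likes match best will have the largest intersection.

Something like list(set(list1).intersection(list2)) can be used. This returns a list containing the items that make up the intersection.

Keep in mind that this approach does not scale well to a large number of entries, since it requires comparing every user's likes against every other user's; its complexity is roughly O(n^2), where n is the number of users.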
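The pairwise comparison described above can be sketched like this. `most_similar_pair` is a hypothetical helper name, and the three entries are copied from the question's dictionary; `itertools.combinations` visits each unordered pair once, which is where the O(n^2) cost comes from.

```python
from itertools import combinations

def most_similar_pair(likes):
    """Return (overlap_size, person_a, person_b) for the pair sharing the most likes.

    Ties are broken by tuple ordering, i.e. by name, which is arbitrary.
    """
    return max(
        (len(likes[a] & likes[b]), a, b)
        for a, b in combinations(sorted(likes), 2)
    )

# Three entries copied from the question's dictionary
likes = {
    "rajat": {"music", "x-men", "programming", "hindi", "english",
              "himesh", "lil wayne", "rap", "travelling", "coding"},
    "stuart": {"rap", "smart girls", "music", "wrestling", "brock lesnar",
               "country music", "public speaking", "women", "coding", "iphone"},
    "grover": {"skating", "mountaineering", "racing", "athletics", "sports",
               "adidas", "nike", "women", "apple", "pop"},
}

overlap, first, second = most_similar_pair(likes)
print(first, second, overlap, sorted(likes[first] & likes[second]))
```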

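In some of your comments you mention collaborative filtering, but that usually applies when different users rate the same items and you look for similarities between their ratings, so that for users who rated some items the same way you can infer how they would rate others (here you use the ratings of users who gave similar ratings on other items). I don't think this is quite the same problem.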

answered 2012-07-16T10:45:24.673
0
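# Python 3 syntax. The sample output below came from Python 2,
# where a set is displayed as set([...]) rather than {...}.
for k in likes:
    common = likes["rajat"] & likes[k]  # items both people like
    if common:
        print(k, common)
    else:
        print(k, "No Likes with rajat")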

Output

sheila set(['hindi', 'english'])
katy set(['music', 'rap'])
brenda set(['english'])
saif set(['himesh', 'rap'])
dino set(['x-men', 'rap'])
toby set(['programming', 'rap'])
mark set(['programming'])
steve set(['travelling', 'english'])
priya set(['lil wayne'])
grover No Likes with rajat
ravi set(['music'])
rajat set(['lil wayne', 'x-men', 'himesh', 'coding', 'programming', 'music', 'hindi',  'rap', 'english', 'travelling'])
stuart set(['music', 'coding', 'rap'])
kelly set(['travelling'])
paul No Likes with rajat
anita set(['travelling', 'hindi'])

This compares what "rajat" has in common with the other members of the dictionary. There must be a better way to do this.

answered 2012-07-16T11:01:19.203
0

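User-based filtering can also be done with scikit-learn.

To take a simpler example, if you have:

"stuart":{"rap","rock"}

and you want to check how similar his musical taste is to:

"toby":{"hip-hop","pop","rap"}

you can use sklearn's pairwise cosine similarity function: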

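from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# The two users' likes as lists of strings (values from the example above)
stuart_list = ["rap", "rock"]
toby_list = ["hip-hop", "pop", "rap"]

# Character-level counts, so similar spellings get similar vectors
vec = CountVectorizer(analyzer='char')
vec.fit(stuart_list)

# Rows are stuart's items, columns are toby's items
x = cosine_similarity(vec.transform(stuart_list),
                      vec.transform(toby_list))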

This gives you a cosine matrix like:

[[ 0.166  0.327  1]
 [ 0.123  0.267  0.230]]

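where the first row gives the cosine similarity of rap with each of toby's 3 choices. Note that 1 means the two are identical: in proper trigonometric terms, the angle between the 2 choices is 0º (i.e. they are the same), so the cosine is 1.

Likewise, the second row gives the cosine similarity of rock with each of toby's 3 choices.

I couldn't find a way in sklearn to get an overall similarity between two lists. However, given the cosine matrix, you can count the number of 1s in it and use that count as the similarity score. Or you can count the values of 0.9 and above to account for nearly identical words, e.g. "hip-hop" and "hip hop".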
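The counting idea can be sketched in a few lines. `count_matches` is a hypothetical helper name and the matrix simply reuses the values shown above; a threshold slightly below 1 also guards against floating-point results such as 0.9999999 that an exact comparison with 1 would miss.

```python
def count_matches(matrix, threshold=0.9):
    """Count the entries of a similarity matrix at or above the threshold."""
    return sum(1 for row in matrix for value in row if value >= threshold)

# Values taken from the cosine matrix shown above
cosine_matrix = [
    [0.166, 0.327, 1.0],
    [0.123, 0.267, 0.230],
]

print(count_matches(cosine_matrix))
```

The function accepts any iterable of rows, so the array returned by cosine_similarity can be passed in directly.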

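(sklearn also provides Euclidean distances, via euclidean_distances in sklearn.metrics.pairwise, which can be used as an alternative to cosine similarity.)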

answered 2017-12-11T17:55:45.647