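I'd like to build a content-based recommender system in Python that uses multiple attributes to decide whether two items are similar. In my case, the "items" are packages hosted by a C# package manager (example), and they have various attributes, such as name, description, and tags, that can help identify similar packages.
I have a prototype recommender system here that currently uses only one attribute, the description, to decide whether packages are similar. It computes TF-IDF rankings for the descriptions and prints out the top 10 recommendations based on that: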
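    # Code mostly stolen from http://blog.untrod.com/2016/06/simple-similar-products-recommendation-engine-in-python.html
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import linear_kernel

    def train(dataframe):
        # min_df=1 keeps every term (same effect as 0, which newer scikit-learn releases reject)
        tfidf = TfidfVectorizer(analyzer='word',
                                ngram_range=(1, 3),
                                min_df=1,
                                stop_words='english')
        tfidf_matrix = tfidf.fit_transform(dataframe['description'])
        # TF-IDF rows are L2-normalized, so the linear kernel equals cosine similarity
        cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)
        # Assumes the default 0..n-1 index, so idx is also the row position
        for idx, row in dataframe.iterrows():
            # Take the 11 best scores: normally the package itself plus 10 candidates
            similar_indices = cosine_similarities[idx].argsort()[:-12:-1]
            similar_items = [(dataframe['id'][i], cosine_similarities[idx][i])
                             for i in similar_indices]
            item_id = row['id']
            # Drop the package itself, then keep at most 10 recommendations
            similar_items = [it for it in similar_items if it[0] != item_id][:10]
            # This 'sum' turns a list of tuples into a single tuple:
            # [(1,2), (3,4)] -> (1,2,3,4)
            flattened = sum(similar_items, ())
            print("Top 10 recommendations for %s: %s" % (item_id, flattened))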
How can I combine cosine_similarities with other similarity measures (based on same author, similar names, shared tags, etc.) to give my recommendations more context?
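To make the question concrete, here is a minimal sketch of one possible way to combine them: build one similarity matrix per attribute and take a weighted average. This is only an illustration under stated assumptions, not a known-good answer. The `author` and `tags` column names, the `combined_similarity` helper, the sample data, and the weights are all hypothetical.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel


def combined_similarity(dataframe, weights):
    """Blend per-attribute similarity matrices into one (hypothetical helper)."""
    # Description: TF-IDF cosine similarity, as in the prototype.
    desc_tfidf = TfidfVectorizer(stop_words='english').fit_transform(
        dataframe['description'])
    desc_sim = linear_kernel(desc_tfidf, desc_tfidf)

    # Tags: cosine similarity over binary tag-presence vectors.
    # Assumes 'tags' is a whitespace-separated string (an assumption).
    tag_vectors = CountVectorizer(binary=True, token_pattern=r'\S+').fit_transform(
        dataframe['tags'])
    tag_sim = cosine_similarity(tag_vectors)

    # Author: 1.0 when two packages share an author, otherwise 0.0.
    authors = dataframe['author'].to_numpy()
    author_sim = (authors[:, None] == authors[None, :]).astype(float)

    # Weighted average, so the result stays in the range [0, 1].
    total = sum(weights.values())
    return (weights['description'] * desc_sim
            + weights['tags'] * tag_sim
            + weights['author'] * author_sim) / total


# Made-up sample data, for illustration only.
packages = pd.DataFrame({
    'id': ['A', 'B', 'C'],
    'description': ['JSON serializer for .NET objects',
                    'Fast JSON parser and serializer',
                    'Logging framework for .NET'],
    'tags': ['json serialization', 'json parsing', 'logging'],
    'author': ['alice', 'alice', 'bob'],
})
sim = combined_similarity(packages, {'description': 1.0, 'tags': 1.0, 'author': 1.0})
```

The resulting `sim` matrix could be dropped in wherever `cosine_similarities` is used in `train`. What I don't know is whether a fixed weighted average like this is reasonable, or how the weights should be chosen.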