machine-learning - Apache Mahout: training on sample data vs. running on actual data

Question

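The scenario is as follows:

I am trying to build a recommender with Apache Mahout. I have some sample preference data (user, item, preference value) that I use to generate the similarity matrix and determine item similarity. The actual preference data, however, is much larger than the sample. Every item ID in the actual data also appears in the sample data, but the sample contains far fewer user IDs than the actual data.

Now, when I try to run the recommender on the actual data, it keeps failing with an error saying that the user ID does not exist, because that user is not in the sample data. How can I inject new user IDs and their preferences into Mahout's recommender so that it can dynamically generate recommendations for any user based on item similarity? If there is any other way to generate recommendations for new users, please suggest it.

Thanks.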

score 0 · Accepted Answer

If you think your sample data is sufficient for computing the item-item similarities, why not precompute them and store them in a collection: Collection<GenericItemSimilarity.ItemItemSimilarity> correlationMatrix = new ArrayList<GenericItemSimilarity.ItemItemSimilarity>(); You can then create your ItemSimilarity from that collection: ItemSimilarity similarity = new GenericItemSimilarity(correlationMatrix);
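To illustrate why precomputed item-item similarities are enough to serve users who never appeared in the sample, here is a minimal plain-Java sketch. It does not use Mahout's classes; the class name PrecomputedItemSimilaritySketch and its methods put, get and estimate are hypothetical, and the similarity-weighted average is only one common way to estimate a preference, assumed here for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Mahout: a symmetric table of
// precomputed item-item similarities plus a simple estimator.
class PrecomputedItemSimilaritySketch {

    private final Map<String, Double> similarities = new HashMap<>();

    // Order-independent key so that (a, b) and (b, a) share one entry.
    private static String key(long a, long b) {
        return Math.min(a, b) + ":" + Math.max(a, b);
    }

    public void put(long item1, long item2, double value) {
        similarities.put(key(item1, item2), value);
    }

    // Unknown pairs are treated as having similarity 0.
    public double get(long item1, long item2) {
        return similarities.getOrDefault(key(item1, item2), 0.0);
    }

    // Estimates a preference for every candidate item the user has not
    // rated yet, as a similarity-weighted average of the user's known
    // preferences. Only the user's preferences are needed, not a user ID.
    public Map<Long, Double> estimate(Map<Long, Double> userPrefs, List<Long> candidates) {
        Map<Long, Double> result = new HashMap<>();
        for (long candidate : candidates) {
            if (userPrefs.containsKey(candidate)) {
                continue; // already rated by this user
            }
            double weighted = 0.0;
            double total = 0.0;
            for (Map.Entry<Long, Double> pref : userPrefs.entrySet()) {
                double sim = get(candidate, pref.getKey());
                weighted += sim * pref.getValue();
                total += Math.abs(sim);
            }
            if (total > 0.0) {
                result.put(candidate, weighted / total);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        PrecomputedItemSimilaritySketch table = new PrecomputedItemSimilaritySketch();
        table.put(1L, 3L, 0.5);
        table.put(2L, 3L, 0.5);

        // Preferences of a user that was never part of the sample data.
        Map<Long, Double> newUserPrefs = new HashMap<>();
        newUserPrefs.put(1L, 4.0);
        newUserPrefs.put(2L, 2.0);

        System.out.println(table.estimate(newUserPrefs, Arrays.asList(1L, 2L, 3L))); // prints {3=3.0}
    }
}
```

The point of the sketch is that the similarity table depends only on item IDs, so it can be built once from whichever data you trust and reused for any user whose preferences are available at recommendation time.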

I don't think it is a good idea to compute item-item similarities from only a sample of your data, because you might be missing a lot of useful information. If computing them on the fly is too slow, you can always precompute them, store them in a database, and load them when needed.

If you are still getting this error, then you are probably using your sample data model in the recommender class, or using UserSimilarity to compute the item similarities.

If you want to add new users, you can use Mahout's FileDataModel and update the file periodically to include them (I think you can create a new file with some suffix, but I am not sure). You can find more about this in the book Mahout in Action. Alternatively, although the in-memory DataModel implementations are immutable, you can extend them by implementing the methods setPreference() and removePreference().
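As a rough idea of what a mutable in-memory store could look like, here is a plain-Java sketch. It is not a Mahout DataModel and does not extend any Mahout class; the class name MutablePreferenceStoreSketch and the helpers hasUser and getPreferences are hypothetical. Only the method names setPreference and removePreference are taken from the answer, and a real implementation would have to provide the rest of the DataModel interface as well.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a mutable data model: user ID -> (item ID -> value).
class MutablePreferenceStoreSketch {

    private final Map<Long, Map<Long, Float>> prefsByUser = new HashMap<>();

    // Adds or overwrites a preference; unknown users are created on demand.
    public void setPreference(long userID, long itemID, float value) {
        prefsByUser.computeIfAbsent(userID, id -> new HashMap<>()).put(itemID, value);
    }

    // Removes one preference and forgets the user once no preferences remain.
    public void removePreference(long userID, long itemID) {
        Map<Long, Float> prefs = prefsByUser.get(userID);
        if (prefs == null) {
            return;
        }
        prefs.remove(itemID);
        if (prefs.isEmpty()) {
            prefsByUser.remove(userID);
        }
    }

    public boolean hasUser(long userID) {
        return prefsByUser.containsKey(userID);
    }

    // Read-only view of a user's preferences; empty for unknown users.
    public Map<Long, Float> getPreferences(long userID) {
        Map<Long, Float> prefs = prefsByUser.get(userID);
        if (prefs == null) {
            return Collections.emptyMap();
        }
        return Collections.unmodifiableMap(prefs);
    }

    public static void main(String[] args) {
        MutablePreferenceStoreSketch store = new MutablePreferenceStoreSketch();
        store.setPreference(42L, 7L, 3.5f);      // user 42 did not exist before this call
        System.out.println(store.hasUser(42L));  // prints true
        store.removePreference(42L, 7L);
        System.out.println(store.hasUser(42L));  // prints false
    }
}
```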

EDIT: I have an implementation for MutableDataModel that extends the AbstractDataModel. I can share it with you if you want.
