apache - How can I make a Mahout recommender work faster?

Question

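Hi, Mahout community on SO!

I have a few questions about speeding up recommendation computation. On my server I have Mahout installed without Hadoop. JRuby is used for the recommendation script. The database holds 3k users and 100k items (270k rows in the join table). When a user requests recommendations, a simple script starts working:

First, it establishes a database connection using PGPoolingDataSource, like this: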

  connection = org.postgresql.ds.PGPoolingDataSource.new()
  connection.setDataSourceName("db_name")
  connection.setServerName("localhost")
  connection.setPortNumber(5432)
  connection.setDatabaseName("db_name")
  connection.setUser("mahout")
  connection.setPassword("password")
  connection.setMaxConnections(100)
  connection

I get this warning:

WARNING: You are not using ConnectionPoolDataSource. Make sure your DataSource pools connections to the database itself, or database performance will be severely reduced.

Any ideas on how to fix this?

After that, I create the recommendations:

  model = PostgreSQLJDBCDataModel.new(
    connection,
    'stars',
    'user_id',
    'repo_id',
    'preference',
    'created_at'
  )

  similarity = TanimotoCoefficientSimilarity.new(model)
  neighborhood = NearestNUserNeighborhood.new(5, similarity, model)
  recommender = GenericBooleanPrefUserBasedRecommender.new(model, neighborhood, similarity)
  recommendations = recommender.recommend user_id, 30

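Currently it takes about 5-10 seconds to generate recommendations for one user. The question is: how can I produce recommendations faster (200 ms would be nice)?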

score 7 · Accepted Answer

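You can ignore the warning if you know you are using a pooling data source. It just means the implementation does not implement the interface commonly used for pooling implementations, ConnectionPoolDataSource.

You will never make this run fast if you run it directly off the database. There is too much data access. Wrap the JDBCDataModel in a ReloadFromJDBCDataModel so that the data is cached in memory; that should be about 100x faster.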
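
As a rough illustration of that wrapping, here is a minimal JRuby sketch. It assumes the Mahout and PostgreSQL JDBC jars are on the classpath and that `data_source` is the PGPoolingDataSource built in the question. The helper name `build_cached_recommender` is hypothetical; the table and column names are copied from the question.

```ruby
# JRuby sketch: needs the Mahout and PostgreSQL JDBC jars on the classpath.
# build_cached_recommender is a hypothetical helper name, not part of Mahout.
def build_cached_recommender(data_source)
  # JDBC-backed model, with the same table and columns as in the question.
  jdbc_model = org.apache.mahout.cf.taste.impl.model.jdbc.PostgreSQLJDBCDataModel.new(
    data_source, 'stars', 'user_id', 'repo_id', 'preference', 'created_at'
  )

  # Load the whole table into memory once; later lookups go to this copy
  # instead of issuing SQL queries.
  model = org.apache.mahout.cf.taste.impl.model.jdbc.ReloadFromJDBCDataModel.new(jdbc_model)

  # Same recommender pipeline as in the question, now on the in-memory model.
  similarity = org.apache.mahout.cf.taste.impl.similarity.TanimotoCoefficientSimilarity.new(model)
  neighborhood = org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood.new(
    5, similarity, model
  )
  org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefUserBasedRecommender.new(
    model, neighborhood, similarity
  )
end
```

The in-memory copy only pays off if the recommender is built once (for example at application startup) and reused across requests, so that each `recommender.recommend(user_id, 30)` call reads from memory. Rebuilding it on every request would reload the whole table each time. When the underlying table changes, `recommender.refresh(nil)` is the usual way to ask Mahout's refreshable components to reload; how often to do that depends on how fresh the recommendations need to be.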
