java - Suggestions for implementing "inverse document frequency" on GAE?

Question

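I need to implement "inverse document frequency" (IDF) on Google App Engine, and I'm looking for suggestions to make it more efficient. Right now my basic routine is as follows.

When parsing a web page, I save each (phrase, domain) pair to the datastore, e.g.,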

for(String phrase : phrase_collection){
  dataStore.put(phrase, domain);
}

Later, when computing the IDF, I fetch each phrase's occurrences from the datastore, e.g.,

for(String phrase : phrase_collection){
  long count = dataStore.get(phrase).size();
}
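For reference, the count fetched in the loop above is the document frequency that feeds the IDF formula. The question does not say which IDF variant is used, so the following is only a minimal sketch that assumes the plain form idf = ln(N / df), treats each domain as one "document", and uses hypothetical names (`IdfSketch`, `computeIdf`) that are not from the original post.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not from the original post: computes idf = ln(N / df).
class IdfSketch {

    // phraseToDomains maps each phrase to the set of domains it was seen on;
    // totalDomains is N, the number of distinct domains parsed so far.
    static Map<String, Double> computeIdf(Map<String, Set<String>> phraseToDomains,
                                          int totalDomains) {
        Map<String, Double> idf = new HashMap<>();
        for (Map.Entry<String, Set<String>> entry : phraseToDomains.entrySet()) {
            int documentFrequency = entry.getValue().size();
            if (documentFrequency == 0) {
                continue; // a phrase seen on no domain has no defined IDF
            }
            idf.put(entry.getKey(),
                    Math.log((double) totalDomains / documentFrequency));
        }
        return idf;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> seen = new HashMap<>();
        seen.computeIfAbsent("app engine", k -> new HashSet<>()).add("a.example");
        seen.computeIfAbsent("app engine", k -> new HashSet<>()).add("b.example");
        seen.computeIfAbsent("datastore", k -> new HashSet<>()).add("a.example");
        System.out.println(computeIdf(seen, 2));
    }
}
```

A phrase that appears on every domain gets an IDF of 0, and rarer phrases get larger values.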

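But the speed is unsatisfactory, and it often hits the 30-second timeout. I also have some additional challenges in this case:

- Multilingual input (web pages). The phrases are therefore in different languages too, which makes caching difficult.

- Parsing web pages and ranking phrases also takes a lot of time. The whole pipeline is charset_detect -> language_detect -> parse according to language -> rank.

Always On is enabled on GAE.

I'm looking forward to any suggestions! Thanks in advance!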

score 1 · Accepted Answer

You're doing an individual get (and put) for each phrase. This is naturally going to be very slow, as you're doing a great many roundtrips to the datastore. Instead, you should use the variants of put and get that accept an iterable of entities or keys, so that they all execute in a single batch call.
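To illustrate why the iterable variants help, here is a rough sketch that only counts round trips. `BatchingSketch` is a hypothetical in-memory stand-in, not the App Engine API; in the real Java SDK the corresponding batch calls are `DatastoreService.get(Iterable<Key>)` and `DatastoreService.put(Iterable<Entity>)`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for a remote store; it only counts round trips.
class BatchingSketch {
    private final Map<String, List<String>> data = new HashMap<>();
    int roundTrips = 0;

    // One round trip per phrase, as in the loops in the question.
    List<String> get(String phrase) {
        roundTrips++;
        return data.getOrDefault(phrase, new ArrayList<>());
    }

    // One round trip for the whole collection of phrases.
    Map<String, List<String>> getAll(Iterable<String> phrases) {
        roundTrips++;
        Map<String, List<String>> result = new HashMap<>();
        for (String phrase : phrases) {
            result.put(phrase, data.getOrDefault(phrase, new ArrayList<>()));
        }
        return result;
    }

    // One round trip that stores every (phrase, domain) pair of a page.
    void putAll(Iterable<String> phrases, String domain) {
        roundTrips++;
        for (String phrase : phrases) {
            data.computeIfAbsent(phrase, k -> new ArrayList<>()).add(domain);
        }
    }

    public static void main(String[] args) {
        BatchingSketch store = new BatchingSketch();
        List<String> phrases = Arrays.asList("app engine", "datastore", "memcache");
        store.putAll(phrases, "a.example");
        Map<String, List<String>> hits = store.getAll(phrases);
        System.out.println(hits.get("datastore").size() + " domain(s), "
                + store.roundTrips + " round trips");
    }
}
```

With per-phrase calls the number of round trips grows with the number of phrases on a page; with the batch calls it stays at one per operation.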

You should also do this work 'offline' - as Stefan suggests, using backends or task queues. Task queues would likely be a better match here.
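One way to make the work fit task queues is to split the phrase list into fixed-size chunks and hand each chunk to its own task. The sketch below shows only the splitting step; `ChunkingSketch` and `chunk` are hypothetical names, and the actual enqueueing through the Task Queue API is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: splits work into chunks, one chunk per queued task.
class ChunkingSketch {

    static <T> List<List<T>> chunk(List<T> items, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int start = 0; start < items.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, items.size());
            // Copy the view so each chunk is independent of the source list.
            chunks.add(new ArrayList<>(items.subList(start, end)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> phrases = Arrays.asList("p1", "p2", "p3", "p4", "p5");
        for (List<String> batch : chunk(phrases, 2)) {
            // In a real app, each batch would become the payload of one task.
            System.out.println(batch);
        }
    }
}
```

The chunk size is a tuning knob: it should be small enough that one task finishes well inside its time limit, and large enough that each task can still use batch datastore calls.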

score 0 · Accepted Answer

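You have a few options:

You can use the newly introduced backends to get up and running. That way you don't have to deal with timeouts or worry about parallelizing tasks.

You can use task queues (as an alternative to backends). But that depends on your ability to parallelize the work.

In any case, you should start using memcache. (If you use JDO, it can be enabled quite simply.) You could also consider switching to a "more native" persistence layer such as objectify or twig, which support asynchronous access and/or memcache out of the box.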
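As a rough illustration of the memcache suggestion, the sketch below shows the cache-aside pattern for phrase counts. A plain `HashMap` stands in for memcache and a function stands in for the datastore lookup; `CountCacheSketch` and its members are hypothetical names, not part of any App Engine API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical cache-aside wrapper; a HashMap stands in for memcache.
class CountCacheSketch {
    private final Map<String, Long> cache = new HashMap<>();
    private final Function<String, Long> slowLookup;
    int misses = 0;

    CountCacheSketch(Function<String, Long> slowLookup) {
        this.slowLookup = slowLookup;
    }

    long count(String phrase) {
        Long cached = cache.get(phrase);
        if (cached != null) {
            return cached; // cache hit: no datastore access
        }
        misses++;
        long fresh = slowLookup.apply(phrase); // cache miss: fall back to the store
        cache.put(phrase, fresh);
        return fresh;
    }

    public static void main(String[] args) {
        // The lambda is a placeholder for the real (slow) datastore count.
        CountCacheSketch counts = new CountCacheSketch(p -> (long) p.length());
        counts.count("datastore");
        counts.count("datastore");
        System.out.println("misses: " + counts.misses);
    }
}
```

Because the keys are ordinary strings, phrases in any language can be cached the same way. Cached counts go stale as new pages are parsed, so a real setup would need an expiry or an invalidation step.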
