java - Most efficient way/library to detect predefined keywords in billions of lines?

Question

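Suppose I have several billion lines of text and a few million "keywords". The task is to go through these lines and see which line contains which keywords. In other words, given a map of (K1 -> V1) and a map of (K2 -> V2), create a map of (K2 -> K1), where K1=lineID, V1=text, K2=keywordID and V2=keyword. Please also note:

All text/keywords are in English
The text (V1) may contain spelling mistakes.
Most keywords (V2) are single words, but some keywords may consist of more than one English word (e.g. "clean towel")

So far, my initial idea for solving this is as follows: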

1) Chop up all my keywords into single words and 
   create a large set of single words (K3)
2) Construct a BK-Tree out of these chopped up keywords,
   using Levenshtein distance
3) For each line of data (V1), 
    3.1) Chop up the text (V1) into words
    3.2) For each said word,
        3.2.1) Retrieve words (K3) from the BK-Tree that
               are close enough to said word
    3.3) Since at this point we still have false positives
         (e.g. we would have matched "clean" from "clean water" against
         keyword "clean towel"), we check all possible combinations
         using a trie of keywords (V2) to filter such false
         positives out. We construct this trie so that at the
         end of a successful match, the keywordID (K2) can be retrieved.
    3.4) Return the correct set of keywordIDs (K2) for this line (V1)!
4) Profit!
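
For readers unfamiliar with step 2, here is a minimal sketch of a BK-tree keyed by Levenshtein distance. The class name `BkTree` and its methods are made up for illustration and are not from any library; the code keeps everything in memory and ignores the scale concerns of the question.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative BK-tree over words, using Levenshtein distance (hypothetical class). */
class BkTree {
    private static final class Node {
        final String word;
        final Map<Integer, Node> children = new HashMap<>(); // edge label = distance to parent
        Node(String word) { this.word = word; }
    }

    private Node root;

    /** Two-row dynamic-programming Levenshtein distance. */
    public static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp; // reuse the two rows
        }
        return prev[b.length()];
    }

    /** Inserts a word; duplicates are ignored. */
    public void add(String word) {
        if (root == null) { root = new Node(word); return; }
        Node node = root;
        while (true) {
            int d = levenshtein(word, node.word);
            if (d == 0) return;
            Node child = node.children.get(d);
            if (child == null) { node.children.put(d, new Node(word)); return; }
            node = child;
        }
    }

    /** Returns all stored words within maxDistance edits of the query. */
    public List<String> search(String query, int maxDistance) {
        List<String> result = new ArrayList<>();
        if (root != null) collect(root, query, maxDistance, result);
        return result;
    }

    private void collect(Node node, String query, int maxDistance, List<String> out) {
        int d = levenshtein(query, node.word);
        if (d <= maxDistance) out.add(node.word);
        // Triangle inequality: only subtrees labelled d - max .. d + max can hold matches.
        for (int k = Math.max(1, d - maxDistance); k <= d + maxDistance; k++) {
            Node child = node.children.get(k);
            if (child != null) collect(child, query, maxDistance, out);
        }
    }

    public static void main(String[] args) {
        BkTree tree = new BkTree();
        for (String w : new String[] {"clean", "towel", "water", "clear"}) tree.add(w);
        System.out.println(tree.search("claen", 2)); // prints [clean]
    }
}
```

Note that plain Levenshtein counts a transposition such as "claen" as two edits, so the distance threshold has to be chosen with that in mind.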

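My questions

Is this a good approach? Efficiency is very important: is there a better way? Is there anything that could be improved?
Are there any libraries I can use? Preferably something that works well with Java.

Thanks in advance!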

score 0 · Accepted Answer

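I'm not sure, but what you are expecting here (K2->K1) is very similar to an inverted index (http://en.wikipedia.org/wiki/Inverted_index).

I believe Lucene/Solr uses the same algorithm when indexing data (it also analyzes/tokenizes the data beforehand). You may need to figure out a way to read the index that Lucene builds (start with the javadoc for Lucene's "IndexReader").

While indexing your data, treat each line as one document in the Lucene index and create two fields in the index: 1) the line ID and 2) the data. Once you have indexed all the documents (lines), K2->K1 has already been created for you; you just need to find a way to parse it.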
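
To make the idea concrete, here is a toy inverted index in plain Java. It is not Lucene and does not use Lucene's API; the class `TinyInvertedIndex`, its tokenizer (lowercase, split on non-letters), and the AND semantics for multi-word keywords are all assumptions made for illustration. It does no phrase matching and no typo handling.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy inverted index: term -> IDs of the lines containing it (hypothetical class). */
class TinyInvertedIndex {
    private final Map<String, Set<Long>> postings = new HashMap<>();

    /** Indexes one line, treated as a "document" with a line ID and a text field. */
    public void addLine(long lineId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(lineId);
            }
        }
    }

    /** Returns IDs of lines containing every word of the keyword, in any order. */
    public Set<Long> lookup(String keyword) {
        Set<Long> result = null;
        for (String word : keyword.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (word.isEmpty()) continue;
            Set<Long> lines = postings.getOrDefault(word, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(lines); // copy so the postings stay untouched
            } else {
                result.retainAll(lines);       // intersect with the next word's lines
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.addLine(1, "Bring a clean towel");
        index.addLine(2, "clean water only");
        System.out.println(index.lookup("clean"));       // prints [1, 2]
        System.out.println(index.lookup("clean towel")); // prints [1]
    }
}
```

Looking up each keyword against such an index yields the K2->K1 map directly; a real engine adds on-disk storage, analyzers, and phrase queries on top of the same structure.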

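I am not sure what your next step is after creating K2->K1. If it is for faster lookup, then you don't need to parse the index; you can just fire Lucene queries.

In Solr, you can also generate faceted search results on the index, if that helps.

EDIT: You can use the LUKE tool to analyze a Lucene index (https://code.google.com/p/luke/)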

score 0 · Accepted Answer

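There are optimized multi-pattern/2D search algorithms. Don't reinvent the wheel. You should also consider distributing your computation. Maybe Hadoop and map/reduce?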
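
The answer does not name a specific algorithm. As one example of multi-pattern search, here is a compact sketch of Aho-Corasick, which finds all keywords in a line in a single pass. The class `KeywordMatcher` is hypothetical, keyword IDs are simply list positions, and matching is exact and substring-based (so no typo tolerance and no word-boundary checks).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative Aho-Corasick automaton (hypothetical class, exact substring matching). */
class KeywordMatcher {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        final List<Integer> keywordIds = new ArrayList<>(); // keywords ending at this state
        Node fail;                                          // longest proper suffix state
    }

    private final Node root = new Node();

    /** Builds the automaton; keywords.get(i) is reported as keyword ID i. */
    public KeywordMatcher(List<String> keywords) {
        for (int id = 0; id < keywords.size(); id++) {
            Node node = root;
            for (char c : keywords.get(id).toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.keywordIds.add(id);
        }
        // Breadth-first pass: compute failure links level by level.
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.remove();
            for (Map.Entry<Character, Node> e : node.next.entrySet()) {
                Node child = e.getValue();
                Node f = node.fail;
                while (f != null && !f.next.containsKey(e.getKey())) f = f.fail;
                child.fail = (f == null) ? root : f.next.get(e.getKey());
                // Inherit keywords that end at the suffix state.
                child.keywordIds.addAll(child.fail.keywordIds);
                queue.add(child);
            }
        }
    }

    /** Returns the IDs of all keywords occurring anywhere in the line. */
    public Set<Integer> match(String line) {
        Set<Integer> found = new TreeSet<>();
        Node node = root;
        for (char c : line.toCharArray()) {
            while (node != root && !node.next.containsKey(c)) node = node.fail;
            node = node.next.getOrDefault(c, root);
            found.addAll(node.keywordIds);
        }
        return found;
    }

    public static void main(String[] args) {
        KeywordMatcher matcher = new KeywordMatcher(List.of("clean towel", "towel", "water"));
        System.out.println(matcher.match("one clean towel please")); // prints [0, 1]
    }
}
```

Because each line is matched independently, the work splits naturally across machines: every worker can hold the same automaton and process its own share of the lines, which fits the map/reduce suggestion.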
