javascript - Creating an array of unique combinations from an array of strings

Question

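I'm writing something that takes a block of text and breaks it down into possible database queries that can be used to find similar blocks of text (similar to the "similar questions" list that is generated as I type this). The basic process:

Remove stop words from the text
Remove special characters
Create an array of unique "stems" from the remaining text
Create an array of possible combinations of the stems array (this is where I'm stuck... sort of)

Here's what I have so far: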

    //baseList starts with an empty array
    //candList starts with the array of unique stems
    //target is where the arrays of unique combinations are stored

    function createUniqueCombos(baseList, candList, target) {

        for (var i = 0; i < candList.length; i++) {

            //copy the base List
            var newList = baseList.slice(0);

            //add the candidate list item to the base list copy
            newList.push(candList[i]);

            //add the new array to the target array
            target.push(newList);

            //re-call function using new array as baseList
            //and remaining candidates as candList
            var nextCandList = candList.slice(i + 1);
            createUniqueCombos(newList, nextCandList, target);
        }

    }

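This works, but on blocks of text larger than about 25 words it crashes my browser. I realize that mathematically there could be a huge number of possible combinations. What I'd like to know is:

Is there a more efficient way to do this?
How can I define a minimum/maximum combination array length?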

score 1 · Accepted Answer

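I found this previous question: Algorithm to find article with similar text

One of the answers there links to an article that suggests counting how many adjacent character pairs the two strings have in common. [ http://www.catalysoft.com/articles/StrikeAMatch.html ]

The example is in Java, but I'm sure it could easily be ported to JS: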

/** @return an array of adjacent letter pairs contained in the input string */
private static String[] letterPairs(String str) {
   int numPairs = Math.max(0, str.length()-1); // empty or one-character words have no pairs
   String[] pairs = new String[numPairs];
   for (int i=0; i<numPairs; i++) {
       pairs[i] = str.substring(i,i+2);
   }
   return pairs;
}

/** @return an ArrayList of 2-character Strings. */
private static ArrayList wordLetterPairs(String str) {
   ArrayList allPairs = new ArrayList();
   // Tokenize the string and put the tokens/words into an array
   String[] words = str.split("\\s");
   // For each word
   for (int w=0; w < words.length; w++) {
       // Find the pairs of characters
       String[] pairsInWord = letterPairs(words[w]);
       for (int p=0; p < pairsInWord.length; p++) {
           allPairs.add(pairsInWord[p]);
       }
   }
   return allPairs;
}

/** @return lexical similarity value in the range [0,1] */
public static double compareStrings(String str1, String str2) {
   ArrayList pairs1 = wordLetterPairs(str1.toUpperCase());
   ArrayList pairs2 = wordLetterPairs(str2.toUpperCase());
   int intersection = 0;
   int union = pairs1.size() + pairs2.size();
   for (int i=0; i<pairs1.size(); i++) {
       Object pair1=pairs1.get(i);
       for(int j=0; j<pairs2.size(); j++) {
           Object pair2=pairs2.get(j);
           if (pair1.equals(pair2)) {
               intersection++;
               pairs2.remove(j);
               break;
           }
       }
   }
   return (2.0*intersection)/union;
}
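
For what it's worth, a JS port could look roughly like the sketch below. It is my own translation, not code from the linked article. Two things differ from the Java and are assumptions on my part: it splits on runs of whitespace (`/\s+/`) so repeated spaces don't produce empty words, and it returns 0 when neither string contains any letter pairs instead of dividing by zero.

```javascript
// Returns the adjacent letter pairs of a single word, e.g. "CAT" -> ["CA", "AT"].
function letterPairs(word) {
  const pairs = [];
  for (let i = 0; i < word.length - 1; i++) {
    pairs.push(word.substring(i, i + 2));
  }
  return pairs;
}

// Returns the letter pairs of every whitespace-separated word in str.
function wordLetterPairs(str) {
  const allPairs = [];
  for (const word of str.split(/\s+/)) {
    allPairs.push(...letterPairs(word));
  }
  return allPairs;
}

// Returns a lexical similarity value in the range [0, 1].
function compareStrings(str1, str2) {
  const pairs1 = wordLetterPairs(str1.toUpperCase());
  const pairs2 = wordLetterPairs(str2.toUpperCase());
  const union = pairs1.length + pairs2.length;
  if (union === 0) {
    return 0; // assumption: no pairs at all counts as "not similar"
  }
  let intersection = 0;
  for (const pair of pairs1) {
    const j = pairs2.indexOf(pair);
    if (j !== -1) {
      intersection++;
      pairs2.splice(j, 1); // each pair in pairs2 may be matched only once
    }
  }
  return (2 * intersection) / union;
}

console.log(compareStrings("France", "French")); // 0.4
```

"France" and "French" share the pairs FR and NC out of 10 pairs in total, hence 2 * 2 / 10.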

score 1 · Accepted Answer

I think your logic is fundamentally flawed because of how many combinations it creates.
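
To put a number on it: the function in the question stores every non-empty subset of the stems exactly once, so n unique stems produce 2^n - 1 arrays. A quick sketch (`countCombos` is just a name I made up for illustration):

```javascript
// Number of non-empty combinations (subsets) of n distinct stems: 2^n - 1.
function countCombos(n) {
  return 2 ** n - 1;
}

for (const n of [5, 10, 25]) {
  console.log(n, countCombos(n));
}
// 5 31
// 10 1023
// 25 33554431
```

Over 33 million arrays for 25 stems would explain the browser falling over.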

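One approach I would take is:

Split the text into individual words (we'll call this variable split_words)
Remove special characters
Remove short/common words (and, or, I, a); do this either by length or, more intelligently, with a blacklist of words
Have a table (e.g. blocks) with the columns block_id and word

Have an SQL query such as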

SELECT block_id FROM blocks 
WHERE word IN (split_words) GROUP BY block_id 
ORDER BY COUNT(*) DESC

You will then get a list of block_ids, ordered by how many words each block has in common with the text.
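
On the JS side, the steps above might be sketched like this. Everything here is illustrative: `STOP_WORDS`, `splitWords` and `buildBlockQuery` are hypothetical names, the blacklist is a tiny placeholder, the regex assumes plain ASCII English text, and the `?` placeholder style is an assumption that depends on your database driver.

```javascript
// Hypothetical stop-word blacklist; a real one would be much longer.
const STOP_WORDS = new Set(["and", "or", "i", "a", "the"]);

// Lower-cases the text, strips special characters, drops stop words,
// and returns each remaining word once.
function splitWords(text) {
  const words = text
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, " ") // special characters become spaces
    .split(/\s+/)
    .filter((w) => w.length > 0 && !STOP_WORDS.has(w));
  return [...new Set(words)];
}

// Builds the query text with one placeholder per word, plus the values
// to bind. Returns null when there are no words, since "IN ()" is not valid SQL.
function buildBlockQuery(words) {
  if (words.length === 0) {
    return null;
  }
  const placeholders = words.map(() => "?").join(", ");
  const sql =
    "SELECT block_id FROM blocks " +
    "WHERE word IN (" + placeholders + ") GROUP BY block_id " +
    "ORDER BY COUNT(*) DESC";
  return { sql: sql, params: words };
}

const query = buildBlockQuery(splitWords("I like cats, and cats like me!"));
console.log(query.sql);
console.log(query.params); // [ 'like', 'cats', 'me' ]
```

Binding the words as parameters rather than concatenating them into the SQL string keeps user-entered text out of the query itself.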

score 0 · Accepted Answer

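My binomial coefficient class can easily solve your problem. Take a look at the code in my answer to a related question. I don't know whether porting the C# code to a SQL stored procedure is a good idea. It would probably be easier to port it to Java or JS and call your stored procedure from that code.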
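
The C# class itself isn't reproduced here, so the following is not a port of it, only a minimal JS sketch of the same idea under my own assumptions: `binomial` counts the k-item combinations of n items, and `combosBetween` (a hypothetical helper) generates only the combinations whose length lies between `minLen` and `maxLen`, which also covers the min/max part of the question. It assumes `minLen` is at least 1.

```javascript
// Binomial coefficient C(n, k): the number of k-item combinations of n items.
function binomial(n, k) {
  if (k < 0 || k > n) return 0;
  k = Math.min(k, n - k); // C(n, k) === C(n, n - k)
  let result = 1;
  for (let i = 1; i <= k; i++) {
    // After this step result equals C(n - k + i, i), so it stays an integer.
    result = (result * (n - k + i)) / i;
  }
  return result;
}

// Returns every combination of items with minLen <= length <= maxLen,
// keeping the items in their original order inside each combination.
function combosBetween(items, minLen, maxLen) {
  const target = [];
  function extend(base, start) {
    if (base.length >= minLen) target.push(base);
    if (base.length === maxLen) return; // don't grow past the maximum
    for (let i = start; i < items.length; i++) {
      extend([...base, items[i]], i + 1);
    }
  }
  extend([], 0);
  return target;
}

console.log(binomial(25, 2) + binomial(25, 3)); // 2600
console.log(combosBetween(["a", "b", "c", "d"], 2, 3).length); // 10
```

With 25 stems, limiting the combinations to 2 or 3 words gives 300 + 2300 = 2600 arrays instead of tens of millions.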
