algorithm - Find all permutations of a list of words in a very long text

Question

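Given a word list = { w1, w2, w3, w1, w2 }

Find all permutations of the above word list in a long text.

Long text = {This is long text w1 w2 w3 w4 and w1 w2 w1 w2 w3. This is yet another long text that has no permutation, because it does not contain all the words: w1,w2,w2,w2,w2 , but this is a space-separated permutation: w2 w2 w3 w1 w1 }

What is the most efficient algorithm to solve this problem?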

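My first idea is to assign each unique word in the list a tuple (unique #, unique prime #), e.g. {w1 = [101, 5], w2 = [103, 7], w3 = [205, 11]}, and to compute a sum over the whole list using the assigned tuples: w1 [101 * 5] + w2 [103 * 7] + w3 [205 * 11] + w1 [101 * 5] + w2 [103 * 7] = 4707

Here is the pseudocode: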

targetSum = 4707;
long sum = 0;
for (int i = 0; i < Text.size(); i++) {
    // look up the (unique #, unique prime #) tuple for the word at index i
    sum += (unique # * unique prime #);
    if (i >= list.size()) {
        // slide the window: look up the tuple for the word at index
        // (i - list.size()) and subtract its product from the sum
        sum -= (unique # * unique prime # at index (i - list.size()));
    }

    if (targetSum == sum) {
        // this is a possible match, so verify again with a hashMap lookup
        // that this region is an actual match
    }
}
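
For readers who want to try the idea, here is a minimal runnable sketch of the rolling sum in Python. It is only an illustration under assumptions: the function name `rolling_sum_matches`, the `weights` mapping, and the choice to let words outside the list contribute 0 are all hypothetical, and the verification step uses `collections.Counter` in place of the hash map lookup mentioned above.

```python
from collections import Counter

def rolling_sum_matches(text, pattern, weights):
    """Yield start indices of windows that are permutations of pattern.

    weights maps each pattern word to (unique # * unique prime #);
    words missing from weights are assumed to contribute 0.
    """
    n = len(pattern)
    target = sum(weights[w] for w in pattern)
    wanted = Counter(pattern)
    total = 0
    for i, word in enumerate(text):
        total += weights.get(word, 0)
        if i >= n:
            # Drop the word that just left the window.
            total -= weights.get(text[i - n], 0)
        if i >= n - 1 and total == target:
            start = i - n + 1
            # The sum is only a filter, so confirm by counting the words.
            if Counter(text[start:i + 1]) == wanted:
                yield start

weights = {"w1": 101 * 5, "w2": 103 * 7, "w3": 205 * 11}
pattern = ["w1", "w2", "w3", "w1", "w2"]
text = "a w1 w2 w3 w4 b w1 w2 w1 w2 w3 c".split()
matches = list(rolling_sum_matches(text, pattern, weights))
print(matches)
```

The explicit verification is what keeps the result correct, since two different windows can in principle share the same sum.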

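Is there any better logic or algorithm for this?

Update:

I have been reading further about pattern matching with the Z-Algorithm (Z-boxes), but I cannot see how Z-boxes or a Z-array would make this any better unless all permutations are known in advance. Is there a better way?

Thanks to everyone for sharing your knowledge.

Thanks,

Bhavesh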

score 1 · Accepted Answer

Your idea of using prime numbers to identify the pattern is good, but it is the product of primes that is unique, not their sum.
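
To make the difference concrete, here is a small sketch that looks for collisions among all two-element multisets drawn from the first four primes. The helper name `collisions` is made up for this illustration, and `math.prod` needs Python 3.8 or later.

```python
from itertools import combinations_with_replacement
from math import prod

primes = [2, 3, 5, 7]

def collisions(combine):
    """Group two-element multisets of primes by combine(pair); keep clashes."""
    seen = {}
    for pair in combinations_with_replacement(primes, 2):
        seen.setdefault(combine(pair), []).append(pair)
    return {key: pairs for key, pairs in seen.items() if len(pairs) > 1}

sum_clashes = collisions(sum)       # 3 + 7 and 5 + 5 both give 10
product_clashes = collisions(prod)  # no two multisets share a product
print(sum_clashes)
print(product_clashes)
```

The sums collide for (3, 7) and (5, 5), while the products 21 and 25 stay distinct, which is what unique prime factorization guarantees.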

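You can then use a moving-window approach. Calculate the product of your pattern and the product of the first five words. Then move the window by dividing the product by the value of the word leaving on the left and multiplying it by the value of the word entering on the right. All words that are not in your pattern have the neutral value 1.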

def permindex(text, pattern, start=0):
    """Index of first permutation of the pattern in text"""

    if len(text) - start < len(pattern):
        return -1

    # One prime per distinct pattern word; extend the list for more than 10
    primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]

    # goal is the product of the primes assigned to the pattern words
    value = {}
    goal = 1
    for p in pattern:
        if p not in value:
            value[p] = primes.pop(0)

        goal *= value[p]

    # Product of the first window; words outside the pattern count as 1
    prod = 1
    for t in text[start:start + len(pattern)]:
        prod *= value.get(t, 1)

    i = start

    for j in range(start + len(pattern), len(text)):

        if goal == prod:
            return i

        # Slide the window; integer division keeps prod an exact int
        prod //= value.get(text[i], 1)
        prod *= value.get(text[j], 1)

        i += 1

    if goal == prod:
        return len(text) - len(pattern)

    return -1

Applying this to your problem:

import re

patt = "w1 w2 w3 w1 w2".split()

# Split on runs of non-word characters (raw string avoids escape warnings)
text = re.split(r"\W+",
        """This is long text w1 w2 w3 w4 and w1 w2 w1 w2 w3. This 
        yet another long text which does not have permutation because 
        it does not contain all words w1,w2,w2,w2,w2 , but this is 
        permutation w2 w2 w3 w1 w1""")

# Keep searching from just past the previous hit until nothing is found
p = permindex(text, patt)
while p >= 0:
    print(p, text[p: p + len(patt)])
    p = permindex(text, patt, p + 1)

This yields:

9 ['w1', 'w2', 'w1', 'w2', 'w3']
40 ['w2', 'w2', 'w3', 'w1', 'w1']
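
If you need every match anyway, the same product window can report them all in a single pass instead of restarting the search after each hit. The following is a sketch of that variant, not a drop-in replacement: the name `permindices` and the on-demand prime generator are assumptions of this example, added so that the number of distinct pattern words is not capped by a fixed list.

```python
def permindices(text, pattern):
    """Yield every index where a permutation of pattern starts in text."""
    n = len(pattern)
    if n == 0 or len(text) < n:
        return

    def primes():
        # Simple trial division; fine for the handful of primes needed here.
        found = []
        candidate = 2
        while True:
            if all(candidate % p for p in found):
                found.append(candidate)
                yield candidate
            candidate += 1

    gen = primes()
    value = {}
    goal = 1
    for word in pattern:
        if word not in value:
            value[word] = next(gen)
        goal *= value[word]

    prod = 1
    for word in text[:n]:
        prod *= value.get(word, 1)
    if prod == goal:
        yield 0

    for j in range(n, len(text)):
        prod //= value.get(text[j - n], 1)  # exact: this factor is in prod
        prod *= value.get(text[j], 1)
        if prod == goal:
            yield j - n + 1

words = "x w1 w2 w1 w2 w3 y w2 w2 w3 w1 w1".split()
hits = list(permindices(words, "w1 w2 w3 w1 w2".split()))
print(hits)
```

Python integers have arbitrary precision, so the product cannot overflow, although it does grow with the length of the pattern.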

score 1 · Accepted Answer

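If the words must be consecutive, then first build a dictionary that contains the words you're looking for, along with their counts. For your example [w1, w2, w3, w1, w2], the dictionary would contain: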

{w1, 2}
{w2, 2}
{w3, 1}

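Call this the incoming dictionary.

Also create an empty dictionary of the same type (i.e. word, count). Call this the outgoing dictionary.

Then create a queue whose capacity equals the number of words you're looking for. The queue is initially empty.

Then start reading the text word by word, doing this: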

for each text_word in text
    if queue.count == number of words
        queue_word = remove word from queue
        if queue_word is in outgoing dictionary
            remove from outgoing
            add to incoming
        end if
    end if

    add text_word to queue
    if text_word is in incoming dictionary
        remove text_word from incoming dictionary
        add text_word to outgoing dictionary
        if incoming dictionary is empty
            you found a permutation
        end if
    end if

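The only complication here is that "add word to dictionary" and "remove word from dictionary" have to take into account that the same word can occur more than once. So your remove really looks like this: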

count = dictionary[word].Count - 1
if (count == 0)
    dictionary.Remove(word)
else
    dictionary[word].Count = count

Add is similar.
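
The counting window can be sketched in Python as follows. This is a variant rather than a literal transcription of the pseudocode: it keeps one signed count per word instead of two dictionaries, so a surplus occurrence of a pattern word (for example the window `w1 w2` inside `w1 w1 w2`) is tracked as a negative count. The names `find_permutations`, `need`, and `missing` are made up for this example.

```python
from collections import Counter, deque

def find_permutations(text, pattern):
    """Return start indices of windows whose words are a permutation of pattern."""
    n = len(pattern)
    if n == 0:
        return []
    need = Counter(pattern)  # required count minus count inside the window
    missing = n              # pattern words the window still lacks
    window = deque()
    hits = []
    for i, word in enumerate(text):
        if len(window) == n:
            old = window.popleft()
            if old in need:
                # Losing a word only creates a deficit if it was not surplus.
                if need[old] >= 0:
                    missing += 1
                need[old] += 1
        window.append(word)
        if word in need:
            # Gaining a word only fills a deficit if one was outstanding.
            if need[word] > 0:
                missing -= 1
            need[word] -= 1
        if missing == 0:
            hits.append(i - n + 1)
    return hits

words = "x w1 w2 w1 w2 w3 y w2 w2 w3 w1 w1".split()
found = find_permutations(words, ["w1", "w2", "w3", "w1", "w2"])
print(found)
```

Each text word is pushed and popped at most once, so the scan takes time linear in the length of the text.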
