
First off, this is homework. That said, it is very open-ended and we were given almost zero guidance on how to start thinking about this problem (or about parallel algorithms in general). I would like pointers in the right direction rather than a complete solution. Any reading that could help would also be great.

I am looking into an efficient way of matching the first occurrence of a pattern in a large body of text using a parallel algorithm. The pattern is simple character matching, with no regular expressions involved. I have managed to come up with a possible way of finding all of the matches, but that then requires me to look through all of the matches and find the first one.

So the question is: would I be more successful splitting the text up between processes and scanning it that way? Or would it be better to have some kind of process-synchronized search, where the j-th process searches for the j-th character of the pattern? Then, if all processes report a true match, the processes would shift their positions within the pattern and move up again, continuing until all characters have matched, and then return the index of the first match.

What I have so far is extremely basic, and most likely will not work. I will not be implementing this, but any pointers would be appreciated.

With p processors, a text of length t, a pattern of length L, and an upper bound of L processors in use:

for i = 0 to t-L:
    for j = 0 to p:
        processor j compares text[i+j] to pattern[j]
            on a false match:
                all processors abandon the current comparison, i++
            on a true match by all processors:
                advance p characters at a time until L characters have been compared
                if all L comparisons return true:
                    return i (position of the pattern)
                else:
                    i++
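
For concreteness, here is a rough sketch (Python, purely illustrative; the function and variable names are mine) of the "find all matches in parallel, then keep the earliest" approach described above:

    from concurrent.futures import ProcessPoolExecutor

    def find_all_in_range(args):
        """Return the absolute index of every match that starts in [start, end)."""
        text, pattern, start, end = args
        hits = []
        i = text.find(pattern, start)
        while i != -1 and i < end:
            hits.append(i)
            i = text.find(pattern, i + 1)
        return hits

    def first_match_via_all_matches(text, pattern, workers=4):
        """Find every occurrence in parallel, then keep the smallest index."""
        n = len(text)
        step = (n + workers - 1) // workers
        # Matches are attributed to the range in which they start, so nothing is
        # missed at a boundary and nothing is counted twice. Passing the whole
        # text to every worker is wasteful, but keeps the sketch short.
        jobs = [(text, pattern, s, min(s + step, n)) for s in range(0, n, step)]
        with ProcessPoolExecutor(max_workers=workers) as pool:
            hits = [h for part in pool.map(find_all_in_range, jobs) for h in part]
        return min(hits) if hits else -1

    if __name__ == "__main__":
        print(first_match_via_all_matches("abcabxabcabc", "abc"))  # -> 0

This is exactly the wasteful part: every occurrence is located even though only the earliest one is needed.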

2 Answers


Given a pattern of length L, and searching a string of length N on P processors, I would just split the string across the processors. Each processor takes a chunk of length N/P + L-1, where the last L-1 characters overlap the string belonging to the next processor. Each processor then runs Boyer-Moore (the two preprocessing tables would be shared). When each one finishes, it returns its result to the first processor, which maintains a table:

Process   Index
   1        -1
   2         2
   3        23

After all processes have responded (or with a bit of thought you can have an early escape), you return the first match. This should be on average O(N/(L*P) + P).
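
A minimal sketch of this layout, assuming a plain substring search (Python's str.find) can stand in for the Boyer-Moore step; the names below are illustrative, not part of the answer:

    from concurrent.futures import ProcessPoolExecutor

    def search_chunk(args):
        """Search one chunk; the chunk already carries the L-1 character overlap."""
        chunk, pattern, offset = args
        pos = chunk.find(pattern)          # stand-in for a shared Boyer-Moore search
        return offset + pos if pos != -1 else -1

    def parallel_first_match(text, pattern, procs=4):
        n, L = len(text), len(pattern)
        size = (n + procs - 1) // procs
        # Each chunk overlaps the next by L-1 characters, so a match straddling
        # a boundary is still found by the chunk in which it starts.
        jobs = [(text[s:s + size + L - 1], pattern, s) for s in range(0, n, size)]
        with ProcessPoolExecutor(max_workers=procs) as pool:
            table = list(pool.map(search_chunk, jobs))   # the per-process result table
        hits = [r for r in table if r != -1]
        return min(hits) if hits else -1

    if __name__ == "__main__":
        print(parallel_first_match("xxabxxxabcxxabc", "abc"))  # -> 7

With a real Boyer-Moore implementation, the two preprocessing tables would be built once and handed to every worker, as suggested above.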

The approach of having the i-th processor match the i-th character would require too much inter-process communication overhead.

EDIT: I realize you already have a solution and are trying to figure out a way that avoids finding all the matches. Well, I don't really think that is necessary. You can come up with some early-escape conditions, and they aren't that difficult, but I don't think they'll improve your performance much in general (unless you have some additional knowledge about the distribution of matches in your text).

Answered on 2010-02-22T22:18:21.327

I am afraid that simply breaking the string up, one chunk per processor, will not do on its own.

Generally speaking, early escaping is difficult, so you'd be better off breaking the text into many smaller chunks.

But first, let's ask Herb Sutter to explain searching with parallel algorithms, over at Dr. Dobb's. The idea is to use the non-uniformity of the distribution to get an early return. Of course Sutter is interested in any match, which is not the problem at hand, so let's adapt.

Here is my idea, let's say we have:

  • Text of length N
  • p Processors
  • heuristic: max is the maximum number of characters a chunk should contain, probably an order of magnitude greater than M, the length of the pattern.

Now, what you want is to split your text into k equal chunks, where k is minimal and size(chunk) is maximal yet no greater than max.
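
A small sketch of that chunking step (Python; max_chunk is my name for the max heuristic above):

    import math

    def make_chunks(n, max_chunk):
        """Split a text of length n into k roughly equal chunks,
        with k minimal and no chunk longer than max_chunk."""
        k = math.ceil(n / max_chunk)   # smallest k that keeps every chunk <= max_chunk
        size = math.ceil(n / k)        # resulting chunk size, as equal as possible
        return [(s, min(s + size, n)) for s in range(0, n, size)]

    # e.g. a 1,000,000-character text with max_chunk = 5,000 gives 200 chunks
    print(len(make_chunks(1_000_000, 5_000)))  # -> 200

In practice each chunk would also be extended by M-1 characters of overlap, so a match straddling a boundary is not missed.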

Then, we have a classical Producer-Consumer pattern: the p processes are fed the chunks of text, each process looking for the pattern in the chunk it receives.

The early escape is done with a flag. You can either set the index of the chunk in which you found the pattern (and its position), or you can just set a boolean and store the result in the processes themselves (in which case you'll have to go through all the processes once they have stopped). The point is that each time a chunk is requested, the producer checks the flag and stops feeding the processes if a match has been found (since the processes have been given the chunks in order).

Let's have an example, with 3 processors:

[ 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 ]
                      x       x

Chunks 6 and 8 both contain the pattern.

The producer will first feed 1, 2 and 3 to the processes, then each process advances at its own pace (this depends on how similar the searched text is to the pattern).

Let's say we find the pattern in 8 before we find it in 6. Then the process that was working on 7 finishes and asks for another chunk; the producer stops it, since any further chunk would be irrelevant. Then the process working on 6 finishes, with a result, and thus we know that the first occurrence was in 6, and we have its position.
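
A rough sketch of that producer/consumer early escape, using Python threads, a bounded queue as the feed and an Event as the flag; every name here is mine and purely illustrative:

    import queue, threading

    def parallel_first_match(text, pattern, max_chunk, workers=3):
        M = len(pattern)
        chunks = [(s, min(s + max_chunk, len(text)))
                  for s in range(0, len(text), max_chunk)]
        feed = queue.Queue(maxsize=workers)   # small buffer: the producer re-checks the flag often
        stop_feeding = threading.Event()      # the early-escape flag
        results, lock = [], threading.Lock()

        def producer():
            for s, e in chunks:                   # chunks are handed out front-to-back
                if stop_feeding.is_set():         # a match exists somewhere: stop feeding
                    break
                # each chunk gets M-1 extra characters so boundary matches are seen
                feed.put((s, text[s:e + M - 1]))
            for _ in range(workers):
                feed.put(None)                    # poison pills: let the workers exit

        def worker():
            while True:
                job = feed.get()
                if job is None:
                    break
                offset, chunk = job
                pos = chunk.find(pattern)
                if pos != -1:
                    with lock:
                        results.append(offset + pos)
                    stop_feeding.set()

        threads = [threading.Thread(target=producer)]
        threads += [threading.Thread(target=worker) for _ in range(workers)]
        for t in threads: t.start()
        for t in threads: t.join()
        # chunks were fed and searched front-to-back, so the smallest recorded
        # position is the first occurrence in the whole text
        return min(results) if results else -1

    print(parallel_first_match("x" * 50 + "needle" + "x" * 50, "needle", max_chunk=10))  # -> 50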

The key idea is that you don't want to look at the whole text! It's wasteful!

Answered on 2010-02-26T13:25:32.253