c - 如何快速对许多正则表达式键进行字符串匹配？

Question

我有一个元素数组，其键是regex。我想提出一个快速算法，给定一个字符串（不是regex）将在少于O(N)时间内返回基于键regex执行的匹配数组值是什么。

目前我对数组进行线性扫描，对于每个元素，我使用 posix regexec API 执行相应的正则表达式，但这意味着要找到匹配的元素，我必须在整个数组中进行搜索。

我知道如果 aray 仅由简单的字符串作为键组成，我可以保留它的排序器并使用bsearch样式 API，但是使用正则表达式看起来并不那么容易。

我在这里错过了什么吗？

示例如下

// this is mainly to be considered
// as pseudocode
typedef struct {
  regex_t  r;
  ... some other data
} value;

const char *key = "some/key";
value my_array[1024];
bool  my_matches[1024];
for(int i =0; i < 1024; ++i) {
  if(!regexec(&my_array[i].r, key, 0, 0, REG_EXTENDED))
    my_matches[i] = 1;
  else
    my_matches[i] = 0;
}

但是，如您所见，上述是线性的。谢谢

附录：

我已经整理了一个简单的可执行文件，它执行了上面的算法和下面的答案中提出的东西，其中形成了一个大的正则表达式，它构建了一个子正则表达式的二叉树并导航它以找到所有匹配项。
源代码在这里（GPLv3）： http: //qpsnr.youlink.org/data/regex_search.cpp
编译：g++ -O3 -o regex_search ./regex_search.cpp -lrt
并运行：（./regex_search "a/b"或使用--help标志作为选项）

有趣的是（我会说正如预期的那样）在树中搜索时，执行的正则表达式数量较少，但每次比较运行这些正则表达式要复杂得多，因此最终它所花费的时间与向量的线性扫描相平衡. 结果已打印出来，std::cerr因此您可以看到它们是相同的。

当使用长字符串和/或许多令牌运行时，请注意内存使用情况；准备好Ctrl-C阻止它阻止您的系统崩溃。

score 1 · Accepted Answer

This is possible but I think you would need to write your own regex library to achieve it.

Since you're using posix regexen, I'm going to assume that you intend to actually use regular expressions, as opposed to the random collection of computational features which modern regex libraries tend to implement. Regular expressions are closed under union (and many other operations), so you can construct a single regular expression from your array of regular expressions.

Every regular expression can be recognized by a DFA (deterministic finite-state automaton), and a DFA -- regardless of how complex -- recognizes (or fails to recognize) a string in time linear to the length of the string. Given a set of DFAs, you can construct a union DFA which recognizes the languages of all DFAs, and furthermore (with a slight modification of what it means for a DFA to accept a string), you can recover the information about which subset of the DFAs matched the string.

I'm going to try to use the same terminology as the Wikipedia article on DFAs. Let's suppose we have a set of DFAs M = {M₁...M_n} which share a single alphabet Σ. So we have:

M_i = (Q_i, Σ, δ_i, qⁱ₀, F_i) where Q_i = {qⁱ_j} for 0 ≤ j < |Q_i|, and Q_i ⊂ F_i.

We construct the union-DFA M_⋃ = (Q_⋃, Σ, δ_⋃, q^⋃₀) (yes, no F; I'll get to that) as follows:

q^⋃₀ = <q¹₀,...,qⁿ₀>

δ_⋃(<q¹_j₁,...,qⁿ_{j_n}>, α) = <δ¹(q¹_j₁, α),... , δⁿ(qⁿ_{j_n}, α)> for each α ∈ Σ

Q_⋃ consists of all states reachable through δ_⋃ starting from q^⋃₀.

We can compute this using a standard closure algorithm in time proportional to the product of the sizes of the δ_i transition functions.

Now to do a union match on a string α₁...α_m, we run the union DFA in the usual fashion, starting with its start symbol and applying its transition function to each α in turn. Once we've read the last symbol in the string, the DFA will be in some state <q¹_j₁,...,qⁿ_{j_n}>. From that state, we can extract the set of M_i which would have matched the string as: {M_i | qⁱ_{j_i} ∈ F_i}.

In order for this to work, we need the individual DFAs to be complete (i.e., they have a transition from every state on every symbol). Some DFA construction algorithms produce DFAs which are lacking transitions on some symbols (indicating that no string with that prefix is in the language); such DFAs must be augmented with a non-accepting "sink" state which has a transition to itself on every symbol.

I don't know of any regex library which exposes its DFAs sufficiently to implement the above algorithm, but it's not too much work to write a simple regex library which does not attempt to implement any non-regular features. You might also be able to find a DFA library.

Constructing a DFA from a regular expression is potentially exponential in the size of the expression, although such cases are rare. (The non-deterministic FA can be constructed in linear time, but in some cases, the powerset construction on the NFA will require exponential time and space. See the Wikipedia article.) Once the DFAs are constructed, however, the union FA can be constructed in time proportional to the product of the sizes of the DFAs.

So it should be easy enough to allow dynamic modification to the set of regular expressions, by compiling each regex to a DFA once, and maintaining the set of DFAs. When the set of regular expressions changes, it is only necessary to regenerate the union DFA.

Hope that all helps.

c - 如何快速对许多正则表达式键进行字符串匹配？

1 回答 1

Related

Reference