regex - 确定一个正则表达式是否是另一个正则表达式的子集

Question

我有大量的正则表达式在匹配时调用特定的 http 处理程序。一些较旧的正则表达式无法访问（例如a.c* ⊃ abc*），我想修剪它们。

有没有给出两个正则表达式的库会告诉我第二个是否是第一个的子集？

起初我不确定这是可以决定的（它闻起来像是一个不同名称的停止问题）。但事实证明这是可判定的。

score 14 · Accepted Answer

试图找出这个问题的复杂性导致我写了这篇论文。

问题的正式定义见于：这通常称为包含问题

R, 的包含问题是测试两个给定的表达式 r, r′ ∈ R，是否 r ⊆ r′。

那篇论文有一些很好的信息（总结：除了最简单的表达式之外的所有表达式都相当复杂），但是搜索有关包含问题的信息会直接回到StackOverflow。该答案已经链接到描述可通过多项式时间算法的论文，该算法应涵盖许多常见情况。

score 7 · Accepted Answer

I found a python regex library that provides set operations.

http://github.com/ferno/greenery

The proof says Sub ⊆ Sup ⇔ Sub ∩ ¬Sup is {}. I can implement this with the python library:

import sys
from greenery.lego import parse

subregex = parse(sys.argv[1])
supregex = parse(sys.argv[2])

s = subregex&(supregex.everythingbut())
if s.empty():
  print("%s is a subset of %s"%(subregex,supregex))
else:
  print("%s is not a subset of %s, it also matches %s"%(subregex,supregex,s)

examples:

subset.py abcd.* ab.*
abcd.* is a subset of ab.*

subset.py a[bcd]f* a[cde]f*
a[bcd]f* is not a subset of a[cde]f*, it also matches abf*

The library may not be robust because as mentioned in the other answers you need to use the minimal DFA in order for this to work. I'm not sure ferno's library makes (or can make) that guarantee.

As an aside: playing with the library to calculate inverse or simplify regexes is lots of fun.
a(b|.).* simplifies to a.+. Which is pretty minimal.
The inverse of abf* is ([^a]|a([^b]|bf*[^f])).*|a?. Try to come up with that on your own!

score 5 · Accepted Answer

如果正则表达式使用允许接受非正则语言的典型过程匹配器（如 Perl、Java、Python、Ruby 等中的匹配器）的“高级特性”，那么你就不走运了。这个问题通常是无法确定的。例如，一个下推自动机是否识别与另一个相同的上下文无关（CF）语言的问题是不可判定的。扩展的正则表达式可以描述 CF 语言。

另一方面，如果正则表达式在理论上是“真”的，则仅由连接、交替和 Kleene 星号组成，带有有限字母表的字符串，加上这些（字符类、+、?、等），然后有一个简单的多项式时间算法。

我不能给你图书馆，但是这个：

For each pair of regexes r and s for languages L(r) and L(s)
  Find the corresponding Deterministic Finite Automata M(r) and M(s)
    Compute the cross-product machine M(r x s) and assign accepting states
       so that it computes L(r) - L(s)
    Use a DFS or BFS of the the M(r x s) transition table to see if any
       accepting state can be reached from the start state
    If no, you can eliminate s because L(s) is a subset of L(r).
    Reassign accepting states so that M(r x s) computes L(s) - L(r)
    Repeat the steps above to see if it's possible to eliminate r

将正则表达式转换为 DFA 通常使用 Thompson 构造来获得非确定性自动机。使用子集构造将其转换为 DFA。叉积机是另一种标准算法。

这一切都是在 1960 年代完成的，现在是任何优秀的本科计算机科学理论课程的一部分。该主题的黄金标准是Hopcroft 和 Ullman，自动机理论。

score 4 · Accepted Answer

数学部分有答案：https ://math.stackexchange.com/questions/283838/is-one-regular-language-subset-of-another 。

基本思路：

计算两种语言的最小DFA。
计算自动 M1 和 M2 的叉积，这意味着每个状态由一对 [m1, m2] 组成，其中 m1 来自 M1，m2 来自 M2，用于所有可能的组合。
新的转换 F12 是：F12([m1, m2], x) => [F1(m1, x), F2(m2, x)]。这意味着如果在读取 x 时 M1 从状态 m1 到 m1' 有一个转换，在读取 x 时 M2 从状态 m2 到 m2' 有一个转换，那么 M12 中有一个从 [m1, m2] 到 [m1', m2' 的转换] 在阅读 x 时。
最后，您查看可达状态：
- 如果存在一对[接受，拒绝]，则 M2 不是 M1 的子集
- 如果存在一对 [rejecting, accapting] 则 M1 不是 M2 的子集

如果您只计算新的转换和结果状态，从一开始就省略所有不可到达的状态，那将是有益的。

regex - 确定一个正则表达式是否是另一个正则表达式的子集

4 回答 4

Related

Reference