我有一个关于 DFA 最小化的问题。因此,我使用了众所周知的技术将正则表达式转换为 NFA,然后使用 goto / 闭包算法从中构造 DFA。现在的问题是如何最小化它?我在这里看过有关它的演讲:http ://www.youtube.com/watch?v=T9Z66NF5YRk,但我仍然无法理解这一点。什么是 DFA 最小化?这只是合并相同的状态(在相同字符上进入相同状态的状态)还是不同的东西?
所以,我从以下语法开始:
%digit = '0'..'9'
%letter = 'a'..'z' | 'A'..'Z'
%exponent = ("e" | "E") ("+" | "-")? digit+
T_INT = digit+
T_FLOAT = T_INT exponent
T_IDENTIFIER = (letter | "$" | "_") (letter | "$" | "_" | digit)*
并最终得到以下 DFA(表示为 JSON):
{
"START": [{
"type": "range",
"from": 36,
"to": 36,
"shift": "1"
}, {
"type": "range",
"from": 48,
"to": 57,
"shift": "2"
}, {
"type": "range",
"from": 65,
"to": 90,
"shift": "1"
}, {
"type": "range",
"from": 95,
"to": 95,
"shift": "1"
}, {
"type": "range",
"from": 97,
"to": 122,
"shift": "1"
}],
"1": [{
"type": "range",
"from": 36,
"to": 36,
"shift": "1"
}, {
"type": "range",
"from": 48,
"to": 57,
"shift": "1"
}, {
"type": "range",
"from": 65,
"to": 90,
"shift": "1"
}, {
"type": "range",
"from": 95,
"to": 95,
"shift": "1"
}, {
"type": "range",
"from": 97,
"to": 122,
"shift": "1"
}, {
"shift": ["t_identifier"]
}],
"2": [{
"type": "range",
"from": 48,
"to": 57,
"shift": "2"
}, {
"type": "range",
"from": 69,
"to": 69,
"shift": "3"
}, {
"type": "range",
"from": 101,
"to": 101,
"shift": "3"
}, {
"shift": ["t_int"]
}],
"3": [{
"type": "range",
"from": 43,
"to": 43,
"shift": "5"
}, {
"type": "range",
"from": 45,
"to": 45,
"shift": "5"
}, {
"type": "range",
"from": 48,
"to": 57,
"shift": "4"
}],
"4": [{
"type": "range",
"from": 48,
"to": 57,
"shift": "4"
}, {
"shift": ["t_float"]
}],
"5": [{
"type": "range",
"from": 48,
"to": 57,
"shift": "4"
}]
}
那么如何最小化它呢?
更新:
好的,这是我的算法。给定以下 DFA:
{
0: [{
from: 97,
to: 97,
shift: 1
}],
1: [{
from: 97,
to: 97,
shift: 3
}, {
from: 98,
to: 98,
shift: 2
}],
2: [{
from: 98,
to: 98,
shift: 4
}],
3: [{
from: 97,
to: 97,
shift: 3
}, {
from: 98,
to: 98,
shift: 4
}],
4: [{
from: 98,
to: 98,
shift: 4
}]
}
这就是我为最小化它所做的事情:
对于每个状态(在我的示例中编号为 0、1、2、3、4),获取标识此类状态的唯一哈希(例如,对于 state0,这将是:from=97,to=97,shift=1,对于 state1,此将是:from=97,to=97,shift=3&from=98,to=98,shift=2 等等...)
比较获得的哈希值,如果我们找到两个相同的哈希值,则将其合并。在我的例子中,state2 hash 为:from=98&shift=4&to=98,state4 hash 为:from=98&shift=4&to=98,它们是相同的,所以我们可以将它们合并到新的 state5 中,之后 DFA 将看起来像这样:
{ 0: [{ from: 97, to: 97, shift: 1 }], 1: [{ from: 97, to: 97, shift: 3 }, { from: 98, to: 98, shift: 5 }], 3: [{ from: 97, to: 97, shift: 3 }, { from: 98, to: 98, shift: 5 }], 5: [{ from: 98, to: 98, shift: 5 }]
}
继续这个,直到我们没有任何变化,所以下一步(合并状态 1 和 3)会将 DFA 转换为以下形式:
{ 0: [{ from: 97, to: 97, shift: 6 }], 6: [{ from: 97, to: 97, shift: 6 }, { from: 98, to: 98, shift: 5 }], 5: [{ from: 98, to: 98, shift: 5 }]
}
没有更多相同的状态,这意味着我们完成了。
第二次更新:
好的,给定以下正则表达式: 'a' ('ce')* ('d' | 'xa' | 'AFe')+ | 'b' ('ce')* ('d' | 'xa' | 'AFe')+ 'ce'+ 我有以下 DFA (START -> start state, ["accept"] -> so to说转换到接受状态):
{
"START": [{
"type": "range",
"from": 98,
"to": 98,
"shift": "1.2"
}, {
"type": "range",
"from": 97,
"to": 97,
"shift": "17.18"
}],
"1.2": [{
"type": "range",
"from": 120,
"to": 120,
"shift": "10"
}, {
"type": "range",
"from": 100,
"to": 100,
"shift": "6.7"
}, {
"type": "range",
"from": 65,
"to": 65,
"shift": "8"
}, {
"type": "range",
"from": 99,
"to": 99,
"shift": "4"
}],
"10": [{
"type": "range",
"from": 97,
"to": 97,
"shift": "6.7"
}],
"6.7": [{
"type": "range",
"from": 99,
"to": 99,
"shift": "15"
}, {
"type": "range",
"from": 120,
"to": 120,
"shift": "13"
}, {
"type": "range",
"from": 100,
"to": 100,
"shift": "6.7"
}, {
"type": "range",
"from": 65,
"to": 65,
"shift": "11"
}],
"15": [{
"type": "range",
"from": 101,
"to": 101,
"shift": "14.accept"
}],
"14.accept": [{
"type": "range",
"from": 99,
"to": 99,
"shift": "16"
}, {
"shift": ["accept"]
}],
"16": [{
"type": "range",
"from": 101,
"to": 101,
"shift": "14.accept"
}],
"13": [{
"type": "range",
"from": 97,
"to": 97,
"shift": "6.7"
}],
"11": [{
"type": "range",
"from": 70,
"to": 70,
"shift": "12"
}],
"12": [{
"type": "range",
"from": 101,
"to": 101,
"shift": "6.7"
}],
"8": [{
"type": "range",
"from": 70,
"to": 70,
"shift": "9"
}],
"9": [{
"type": "range",
"from": 101,
"to": 101,
"shift": "6.7"
}],
"4": [{
"type": "range",
"from": 101,
"to": 101,
"shift": "2.3"
}],
"2.3": [{
"type": "range",
"from": 120,
"to": 120,
"shift": "10"
}, {
"type": "range",
"from": 100,
"to": 100,
"shift": "6.7"
}, {
"type": "range",
"from": 65,
"to": 65,
"shift": "8"
}, {
"type": "range",
"from": 99,
"to": 99,
"shift": "5"
}],
"5": [{
"type": "range",
"from": 101,
"to": 101,
"shift": "2.3"
}],
"17.18": [{
"type": "range",
"from": 120,
"to": 120,
"shift": "25"
}, {
"type": "range",
"from": 100,
"to": 100,
"shift": "22.accept"
}, {
"type": "range",
"from": 65,
"to": 65,
"shift": "23"
}, {
"type": "range",
"from": 99,
"to": 99,
"shift": "20"
}],
"25": [{
"type": "range",
"from": 97,
"to": 97,
"shift": "22.accept"
}],
"22.accept": [{
"type": "range",
"from": 120,
"to": 120,
"shift": "28"
}, {
"type": "range",
"from": 100,
"to": 100,
"shift": "22.accept"
}, {
"type": "range",
"from": 65,
"to": 65,
"shift": "26"
}, {
"shift": ["accept"]
}],
"28": [{
"type": "range",
"from": 97,
"to": 97,
"shift": "22.accept"
}],
"26": [{
"type": "range",
"from": 70,
"to": 70,
"shift": "27"
}],
"27": [{
"type": "range",
"from": 101,
"to": 101,
"shift": "22.accept"
}],
"23": [{
"type": "range",
"from": 70,
"to": 70,
"shift": "24"
}],
"24": [{
"type": "range",
"from": 101,
"to": 101,
"shift": "22.accept"
}],
"20": [{
"type": "range",
"from": 101,
"to": 101,
"shift": "18.19"
}],
"18.19": [{
"type": "range",
"from": 120,
"to": 120,
"shift": "25"
}, {
"type": "range",
"from": 100,
"to": 100,
"shift": "22.accept"
}, {
"type": "range",
"from": 65,
"to": 65,
"shift": "23"
}, {
"type": "range",
"from": 99,
"to": 99,
"shift": "21"
}],
"21": [{
"type": "range",
"from": 101,
"to": 101,
"shift": "18.19"
}]
}
故事是一样的,我如何最小化它?如果我遵循经典的 Hopcroft 算法来构建所有这些表,确定不可区分的状态,将它们合并在一起等等,那么我将获得包含 15 个状态的 DFA(使用此工具:http ://regexvisualizer.apphb.com/使用这个正则表达式 a(ce) (d|xa|AFe)+|b(ce) (d|xa|AFe)+ce+ 来检查)。以下是使用 Hopcroft 算法缩小后 DFA 的样子:
我想出的算法,在“重新思考” Hopcroft 的算法之后,构建的 DFA 比您在上面看到的要小(对不起图像质量,我不得不一步一步地重新绘制它以了解为什么它更小):
这就是它的工作原理,关于“状态等价”的决定是基于给定状态(例如“START”)的哈希函数的结果,如果我们从该状态开始,可以构建可以从 DFA 构造的短字符串. 给定上面的 DFA 和 START 状态,我们可以构造以下字符串:98->120, 98->100, 98->65, 98->99, 97->120, 97->100, 97->65 , 97->99 所以让它成为 START 状态的哈希函数的结果。如果我们为 DFA 中的每个状态运行此函数,我们将看到对于某些状态,此函数为我们提供相同的结果(“1.2”、“6.7”、“2.3”和“10”、“13”和“15” ,“16”和“11”,“8”,“26”,“23”和“12”,“9”,“4”,
我看到我在某个地方错了,但不明白我的算法产生的最小化 DFA 有什么问题?