我正在尝试实现一组函数,这些函数可以在线性时间内直接创建一个 DAWG,用于我正在为个人项目编码的某些搜索功能。我阅读了这篇论文,其中详细介绍了 DAWG 背后的想法,甚至还提供了线性时间构建的伪代码!
但是,遵循伪代码似乎会产生(在我看来)类似 trie 的结构。具体来说,似乎没有明确共享后缀(实际上由图中的边连接)。相反,它们由后缀指针表示,后缀指针实际上与图的实际遍历没有关系。
例如,看看这张 DAWG 的图片中的单词{tap, taps, top, tops}
(来自DAWG Wikipedia 页面):
现在,将其与上述论文中详述的这些步骤所获得的结构进行比较(使用这组单词手动完成所需的时间可以忽略不计):
Note: Edges are labeled by letters
Nodes are labeled by the concatenation of the labels of the primary edges
used to reach them
Suffix pointers are not visually represented on the graph
primary edges: solid edges used to traverse graph
secondary edges: dotted edges implying a suffix relationship between
the letter labeling the edge and the substring
represented by the target node
builddawg(S)
1. Create a node named source.
2. Let activenode be source.
3. For each word w of S do:
A. For each letter 'a' of w do:
Let activenode be update (activenode, a).
B. Let activenode be source.
4. Return source.
update (activenode, a)
1. If activenode has an outgoing edge labeled 'a', then
A. Let newactivenode be the node that this edge leads to.
B. If this edge is primary, return newactivenode.
C. Else, return split (activenode, newactivenode).
2. Else
A. Create a node named newactivenode.
B. Create a primary edge labeled 'a' from activenode to newactivenode.
C. Let currentnode be activenode.
D. Let suflxnode be undefined.
E. While currentnode isn’t source and sufixnode is undefined do:
i. Let currentnode be the node pointed to by the suffix
pointer of currentnode.
ii. If currentnode has a primary outgoing edge labeled 'a',
then let sufixnode be the node that this edge leads to.
iii. Else,if currentnode has a secondary outgoing edge labeled 'a' then
a. Let childnode be the node that this edge leads to.
b. Let suffixnode be split (currentnode, childnode).
iv. Else, create a secondary edge from currentnode to newactivenode
labeled 'a'.
F. If sufixnode is still undefined, let suffixnode be source.
G. Set the suffix pointer of newactivenode to point to sufixnode.
H. Return newactivenode.
split (parentnode, childnode)
1. Create a node called newchildnode.
2. Make the secondary edge from parentnode to childnode into
a primary edge from parentnode to newchildnode (with the same label).
3. For every primary and secondary outgoing edge of childnode,
create a secondary outgoing edge of newchildnode with the
same label and leading to the same node.
4. Set the suffix pointer of newchildnode equal to that of childnode.
5. Reset the suffix pointer of childnode to point to newchildnode.
6. Let currentnode be parentnode.
7. While currentnode isn’t source do:
A. Let currentnode be the node pointed to by the
suffix pointer of currentnode.
B. If currentnode has a secondary edge to childnode,
then make it a secondary edge to newchildnode (with the same label).
C. Else, break out of the while loop.
8. Return newchildnode.
我得到的结构不等同于上图。事实上,它看起来几乎与 trie 相同,除了将次要边转换为主要边所产生的额外节点。与上面的 DAWG 等效的 trie 是:
我只是应用了错误的算法,有几种类型的 DAWGS,还是我只是误解了 DAWG 应该是什么样子?
我看过的大多数详细介绍 DAWG 的论文都有似乎是由算法创建的结构,但我读过的大多数在线资料(以及我看过的图片)都有连接常见后缀的实际边缘。我不知道该相信什么,或者它们是否实际上是等价的。