java - 简单的 DAWG 创建算法？

Question

给定文件中的单词列表，我需要为我的拼字游戏播放器创建一个 DAWG ( http://en.wikipedia.org/wiki/Directed_acyclic_word_graph ) 结构。我正在使用 Java。我只需要执行一次，然后将其存储在一个或多个文件中。到目前为止，我已经看到了 2 种方法：1）构建一个 Trie 并将其简化为 DAWG 或 2）立即构建一个 DAWG。因为我只需要做一次，我想我只想要最简单的算法来实现它。速度和内存要求无关紧要。

另外我想知道我应该如何在运行时将结构存储在内存中，以及我应该如何将它保存在文件中？DAWG 基本上是一个图表，它建议使用我编写的一些非常简单的类的一些节点和边/指针，但我看到使用数组和偏移量（在这个数组中）的实现，这看起来很复杂且难以辨认。这次我关心内存大小（在运行时和保存的文件）和加载 DAWG/使用 DAWG 的速度。

score 2 · Accepted Answer

本文定义了最简单、最有效的 DAWG 构建算法，并且要求对 DAWG 要表示的词集进行排序。鉴于您计划从预先存在的单词列表构建 DAWG，该列表可能已经排序，或者可以用于此目的。

我已经以比论文中给出的更“程序员友好”的格式粗略地转录了算法的伪代码（免责声明：我可能犯了一些转录错误；你可能应该看看原文确定是否有）：

Given:
        startState is the state from which traversal of the DAWG is to start
        register is a map of representations (hint: hashes) OF graphs which extend 
        from states in the DAWG TO said states

While there is newWord in wordList
    Get newWord from wordList
    Determine longestPrefix of newWord, starting from startState, which already exists in DAWG
    Get longestPrefixEndState, the state which the sequence of transitions defined by longestPrefix leads to
    Get suffix of newWord, the substring of newWord after longestPrefix
    if longestPrefixEndState has children 
        replace_or_register(longestPrefixEndState)
    endIf
    Create the sequence of transitions and states defined by suffix, starting from longestPrefixEndState
endWhile
replace_or_register(startState)


function replace_or_register(argState)
    Get greatestCharEndState of argState, the state which the lexicographically-greatest-char-labelled-transition in the outgoing transition set of argState leads to
    if greatestCharEndState has children
        replace_or_register(greatestCharEndState)
    endIf
    if there exists state in DAWG that is in the register and is equivalent (has an identical graph extending from it) to greatestCharEndState
        Redefine the transition that extends from argState to greatestCharEndState, as one that extends from argState to state
        Delete greatestCharEndState
    endIf
    else 
        add greatestCharEndState to the register
    endElse

鉴于您使用的是 Java，您可以利用Serializable接口来处理您所有的序列化和反序列化需求。

如果您对实现上述算法的 Java 中现有的 DAWG 实现感兴趣，请查看MDAG，它还提供了其他实现不具备的几个漂亮功能（包括即时添加和删除字符串），并由我！

score 0 · Accepted Answer

我曾经为我的一个客户用 C 语言实现过这样的结构。最终结构是从描述字符集和 dawg 的 xml 文件加载的，另一个进程从单词列表创建 xml 文件。

第 1 步：构建第一个 dawg 序列化为 xml 文件的结构

我们用了：

typedef struct _s_build_node build_node_t;
struct _s_build_node {
  char_t letter;
  build_node_t* first_child;
  build_node_t* right_sibling;

  hash_t hash;
  size_t depth;
  size_t ID;
};
typedef struct _s_build_dawg {
  charset_t charset;
  node_t* all_nodes; // an array of all the created nodes
  node_t* root;
} build_dawg_t;

siblibgs 按升序排列，词尾特殊字符小于任何其他字符。该算法非常简单：

// create the build dawg
foreach word in wordlist
  insert(dawg, word)
// compact the dawg
compact(dawg)
// generate the xml file
xml_dump(dawg)

为了压缩 dawg，我们为每个节点计算了一个哈希值。可以分解具有相同散列的两个节点。这部分可能很棘手。只有深度最低的节点被保留，其他节点被删除，它们的父节点现在指向保留的节点。
压缩后，我们为每个节点分配一个唯一的 ID（通过 bfs，ID 介于 0 和 N-1 之间，N 是压缩后的 dawg 中的节点数）。xml 文件简单地描述了 trie ：

<dawg>
  <charset ....>
    ...
  </charset>

  <node ID="node_id" letter="letter" fist_child="first_child_ID" next_sibling="next_sibling_id" />
  <node .... />
  <node .... />
  <node .... />
</dawg>

第 2 步：最后的 dagw

结构稍微简单一点

typedef struct {
  char_t letter;
  size_t first_child;
  size_t next_sibling;
} node_t;

typedef struct {
  node_t nodes[];
  ... whatever you need ...
} dawg_t;

这里 root 是 dawg.nodes[0]，first_child/next_sibling 是节点数组中的索引。从 xml 文件创建这样的结构很容易。主要缺点是任何词表修改都会触发新 xml 文件的生成。

java - 简单的 DAWG 创建算法？

2 回答 2

第 1 步：构建第一个 dawg 序列化为 xml 文件的结构

第 2 步：最后的 dagw

Related

Reference