2

dawgdic是一个很棒的 DAWG 库,但它有一个明显的缺点,因为它是静态的(不可更新),并且必须以按字母顺序排序的字符串构造。如果构建 DAWG 的原始数据很大(几千兆字节),则 DAWG 的初始构建涉及对大量字符串进行排序可能需要太多资源。

是否有提供内存高效结构的库,dawgdic允许从未排序的字典进行构造?

4

3 回答 3

2

目前,我认为没有任何库允许从未排序的字典构建 DAWG。

但是,经过大量搜索,我找到了这篇论文,“Incremental Construction of Minimal Acyclic Finite-State Automata” ,我认为它正是你想要的。或许你可以在读完这篇文章后自己制作一个库,分享给大家!

编辑:你看过这个问题吗?

于 2013-10-04T11:36:41.703 回答
1

我发现了一些很棒的库,它们允许从未排序的数据进行在线构建,尽管它们不是基于 DAWG:

  1. cedar - 一个非常快速的双数组 trie
  2. marisa-trie - 一个非常节省空间的字符串匹配库
于 2013-10-04T11:53:36.953 回答
1

I currently know of no C++ implementations of a DAWG which support the construction from non-sorted data, but if you're open to creating your own solution which has such a feature, Incremental Construction of Minimal Acyclic Finite-State Automata (2000) is a paper which basically lays out the algorithm behind it.

Alternatively, if you're open to porting solutions from other languages, it may be worth your while to check out MDAG, a Java implementation of the data structure. It supports both on-the-fly string addition and string removal, which is exactly what you are looking for. The code is also easy to follow, and extensively commented, so porting it should be a fairly simple task.

Disclaimer: I am the author of MDAG :) .

于 2016-07-01T18:32:10.160 回答