
I've implemented a suffix tree in Python for full-text search, and it works really well. But there's a problem: the indexed text can be very large, so we can't keep the whole structure in RAM.

Image: the suffix tree for the word BANANAS (in my scenario, imagine a tree 100,000 times bigger).

So, after a bit of research, I found the pickle module, a great Python module for "loading" and "dumping" objects to/from files, and guess what? It works perfectly with my data structure.

So, long story short: what's the best strategy for storing and retrieving this kind of structure on/from disk? I mean, one solution could be to store each node in its own file and load it from disk when needed, but that's not the best idea (too many disk accesses).


Footnote: although I've tagged this question with a language, the programming language is not an important part of the question; the disk storage/retrieval strategy is really the point.


5 Answers


An effective way to manage a structure like this is to use a memory-mapped file. In the file, instead of storing references for the node pointers, you store offsets into the file. You can still use pickle to serialise the node data to a stream suitable for storing on disk, but you will want to avoid storing references since the pickle module will want to embed your entire tree all at once (as you've seen).

Using the mmap module, you can map the file into address space and read it just like a huge sequence of bytes. The OS takes care of actually reading from the file and managing file buffers and all the details.

You might store the first node at the start of the file, and have offsets that point to the next node(s). Reading the next node is just a matter of reading from the correct offset in the file.

The advantage of memory-mapped files is that they aren't loaded into memory all at once, but only read from disk when needed. I've done this (on a 64-bit OS) with a 30 GB file on a machine with only 4 GB of RAM installed, and it worked fine.
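To make that concrete, here is a minimal sketch of the idea (my own illustration, with a made-up record layout and a hypothetical pre-built file tree.bin, not the asker's actual tree): each node is a fixed-size record holding a character plus file offsets to its first child and next sibling, and traversal is just unpacking records at the right offsets.

import mmap
import struct

# Hypothetical fixed-size record: a 1-byte character, then two 8-byte file
# offsets (first child, next sibling); offset 0 means "none".
RECORD = struct.Struct('<cqq')

def read_node(mm, offset):
    # Unpack one node record straight out of the mapped file.
    return RECORD.unpack_from(mm, offset)

with open('tree.bin', 'rb') as f:
    # Map the whole file; the OS pages bytes in from disk only as they're touched.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    _, first_child, _ = read_node(mm, 0)    # assume the root sits at offset 0
    child = first_child
    while child:                            # walk the root's children
        char, _, sibling = read_node(mm, child)
        child = sibling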

answered 2011-11-28T18:49:13.527

If pickle is already working for you, you may want to take a look at ZODB which adds some functionality on top of pickle. Looking at the documentation, I saw this paragraph that looks to address the size concerns you're having:

The database moves objects freely between memory and storage. If an object has not been used in a while, it may be released and its contents loaded from storage the next time it is used.
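For illustration, here is a minimal sketch of the usual ZODB pattern (the Node class and file name are my own assumptions, not the asker's structure): nodes subclass persistent.Persistent so the database can load and evict each one individually.

import persistent
import transaction
from persistent.mapping import PersistentMapping
from ZODB import FileStorage, DB

class Node(persistent.Persistent):
    # Hypothetical node; ZODB tracks changes per Persistent instance.
    def __init__(self):
        self.children = PersistentMapping()   # char -> Node

db = DB(FileStorage.FileStorage('tree.fs'))   # made-up file name
conn = db.open()
root = conn.root()
root.setdefault('tree', Node()).children['b'] = Node()
transaction.commit()   # written to disk; idle nodes can be evicted from RAM
db.close()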

answered 2011-11-28T18:59:07.037

How about storing it in SQLite?

SQLite:

  • supports databases up to 2 TB,
  • supports SQL queries,
  • is a great fit for storing in-app data,
  • can handle roughly 100,000 hits per day (if used for an ordinary web application),

So it may be worth taking a closer look at this solution, as it has proven to be an efficient way to store data within applications. A rough sketch of what that could look like follows (the table layout and file name are my own invention): one pickled node per row, keyed by a node id.
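import pickle
import sqlite3

conn = sqlite3.connect('nodes.db')   # hypothetical database file
conn.execute('CREATE TABLE IF NOT EXISTS nodes (id TEXT PRIMARY KEY, data BLOB)')

def store_node(node_id, node):
    # One row per node; only the nodes you touch go through (un)pickling.
    conn.execute('INSERT OR REPLACE INTO nodes VALUES (?, ?)',
                 (node_id, pickle.dumps(node)))

def load_node(node_id):
    row = conn.execute('SELECT data FROM nodes WHERE id = ?',
                       (node_id,)).fetchone()
    return pickle.loads(row[0]) if row else None

store_node('root', {'children': ['n1', 'n2']})   # any picklable node works
conn.commit()
print(load_node('root'))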

answered 2011-11-28T18:42:32.823

Maybe you could combine cPickle with a bsddb database, which lets you store your pickled nodes in a dictionary-like object that lives on the filesystem; that way you can open the database later and fetch just the nodes you need, with very good performance.

In a very simple way:

import bsddb
import cPickle

# Open (or create) a B-tree-backed key/value store on disk.
db = bsddb.btopen('/tmp/nodes.db', 'c')

def store_node(node, key, db):
    # Pickle just this node and store it under its key.
    db[key] = cPickle.dumps(node)

def get_node(key, db):
    # Read and unpickle only the requested node.
    return cPickle.loads(db[key])
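Usage would then be as simple as this (the key and node value below are made up for illustration; the key scheme is up to your tree implementation):

store_node({'char': 'b', 'children': []}, 'root', db)
node = get_node('root', db)   # only this node is read from disk
db.sync()                     # flush pending writes to disk
db.close()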
answered 2011-11-28T18:48:22.790

Try a compressed suffix tree instead.

The main idea is that instead of having lots of one-character nodes, you can compact runs of them into a single multi-character node, saving all the extra nodes in between.

This link (http://www.cs.sunysb.edu/~algorith/implement/suds/implement.shtml) says you can turn a 160 MB suffix tree into a 33 MB compressed suffix tree. Quite a gain.

These compressed trees are used for genetic substring matching on huge strings. I used to run out of memory with a suffix tree, but after I compressed it, the out of memory error disappeared.
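A toy illustration of the compaction (my own sketch, not the paper's construction): each edge stores a (start, end) slice into the indexed text instead of one explicit character per node, so long runs cost nothing extra.

class Edge(object):
    def __init__(self, start, end, target):
        # The edge label is text[start:end]; no characters are copied.
        self.start, self.end, self.target = start, end, target

class Node(object):
    def __init__(self):
        self.edges = {}   # first character of the label -> Edge

text = 'BANANAS'
root = Node()
# The run "ANANAS" hangs off the root as one compressed edge, where a plain
# one-character-per-node tree would spend six separate nodes.
root.edges['A'] = Edge(1, 7, Node())
edge = root.edges['A']
assert text[edge.start:edge.end] == 'ANANAS'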

I wish I could find a freely available article that explains the implementation better. (http://dl.acm.org/citation.cfm?id=1768593)

answered 2011-12-16T03:10:58.617