python - Efficiently storing a set of integer sequences in python

Question

I have a large set of integer sequences, small sample below:

Simply storing them as a list of tuples gives me memory errors, so I am looking for a better datastructure that can hold them (or the equivalent information). The order among the sequences is not important (i.e. the sequences do not have an id) but the order within each sequence needs to be preserved.

Any idea on how to do this efficiently? I am considering a nested dictionary; for the example above would then look like this:

{1: {2: {-1: 1}, 
     3: {2: {-1: 2}, -1: 0}, 
    -1: 2}, 
 2: {-1: 1}, 
 4: {-1: 1}}

The "leaf" values (given by key -1) are the counts of the respective sequences. Of course, this does not take advantage of the fact that all keys are integers.

Roughly speaking, I am looking to manage about a billion sequences, of average length 3, with a large amount of redundancy. The number of distinct integers is around one million. Any good ideas/existing libraries?

EDIT :

It should be efficient to build "subsets" of the datastructure, as follows. Given an integer x, get only the sequences that contain x. For example, if x=2, then we would build from the initial datastructure the following subset:

{1: {2: {-1: 1}, 
     3: {2: {-1: 2}, -1: 0}, 
    -1: 0}, 
 2: {-1: 1}}

If possible, I would also like to be able to build a subset as follows. I specify a pair of integers (x,y), and the corresponding subset is the set of sequences where both x and y occur, and the first x occurs before the first y. For example, for (x,y)=(1,2) we would get

{1: {2: {-1: 1}, 
     3: {2: {-1: 2}, -1: 0}, 
    -1: 0}}

I don't have explicit O(log n) requirements, in the end it should just run as fast as possible :) Unfortunately I can not provide an actual data sample, as it is not mine to share.

score 2 · Accepted Answer

考虑到冗余量，您可以使用Trie获得良好的压缩效果，它会折叠所有常见的前缀。

看起来您的目标是使用嵌套字典结构朝这个方向发展，但哈希映射不一定是最节省空间的存储（您没有提到对查找性能的任何要求）。

例如，考虑将其存储为嵌套列表，使用 1 元组来区分叶值（尽管我们可能会想出更简洁的方法）：

trie = [ [1, (2,),
             [2, (1,)],
             [3, [2, (2,)]],
         [2, (1,)],
         [4, (1,)]
       ]

平均长度为 3，这应该很浅，希望使用列表而不是字典在每个级别上减少松弛。

请注意，如果某些分支又长又裸，您可以以代码复杂性为代价来压缩冗余深度。考虑子树

[1, [2, [3, ... [n, (1,)]...]]]

可以改为编码为

[1, 2, 3, ..., n, (1,)]

只要你区分平面序列和兄弟（这里，叶子总是一个 1 元组，子树总是一个列表，其他任何东西都是平面序列中的下一个元素）。

新的子集要求更复杂。我可以想到一些可能的方法：

天真/蛮力搜索：再次取决于您的访问模式，您可以返回一个生成器，而不是匹配子树（子尝试）的急切全深度副本
在每个子树的顶部存储一个布隆过滤器，指示在其中找到的值。您的搜索与 (1) 相同，但您可以快速跳过无法匹配的子树。
- 每个级别的过滤器只是其子级过滤器的逻辑或
如果您需要快速进行查找，则可以进行反向查找。例如，这可能是一组包含一个键的每个序列（因此您将拥有一百万个键）。在将包含x的序列和包含y的序列合并后，您需要遍历结果以查看哪些序列具有正确的x和y顺序。

该方案的困难在于表示每个序列。您要么需要指向 Trie 中叶节点的链接（并且能够以某种方式从叶节点迭代到根节点），要么需要每个序列的单独平面副本，这似乎很浪费

score 1 · Accepted Answer

对于像平均 3 个整数这样的短序列，我看不出有什么比将它们编码为具有引用计数的数组更有效的了。

[
 (2, [1]),
 (1, [2]),
 (1, [1, 2]),
 (2, [1, 3, 2]),
 (1, [4])
]

考虑一下；按照建议使用 Trie，对于每个新子序列，您至少需要一个新数组和一个元组，这意味着 100M 数组和元组，就像这个版本一样。数组存储也有最多100M 的数组/元组，而如果子序列相差多个数字，则 Trie 不会。Trie 以更长的序列大放异彩，消除了很多重复，但是加入一个额外的数组来平均保存 <2 个整数并不足以提供明显的优势。

您甚至可以通过将 refcount 添加为数组中的第一个元素来删除元组并节省更多内存。作为第二个好处，你能看到比迭代数组更简单的迭代序列的方法吗？

至于空间优化数组，您可以查看数组模块。

python - Efficiently storing a set of integer sequences in python

2 回答 2

Related

Reference