I have a large set of integer sequences, small sample below:
1
2
1
1 2
1 3 2
4
1 3 2
...
Simply storing them as a list of tuples gives me memory errors, so I am looking for a better datastructure that can hold them (or the equivalent information). The order among the sequences is not important (i.e. the sequences do not have an id) but the order within each sequence needs to be preserved.
Any idea on how to do this efficiently? I am considering a nested dictionary; for the example above would then look like this:
{1: {2: {-1: 1},
3: {2: {-1: 2}, -1: 0},
-1: 2},
2: {-1: 1},
4: {-1: 1}}
The "leaf" values (given by key -1) are the counts of the respective sequences. Of course, this does not take advantage of the fact that all keys are integers.
Roughly speaking, I am looking to manage about a billion sequences, of average length 3, with a large amount of redundancy. The number of distinct integers is around one million. Any good ideas/existing libraries?
EDIT :
It should be efficient to build "subsets" of the datastructure, as follows. Given an integer x
, get only the sequences that contain x
. For example, if x=2
, then we would build from the initial datastructure the following subset:
{1: {2: {-1: 1},
3: {2: {-1: 2}, -1: 0},
-1: 0},
2: {-1: 1}}
If possible, I would also like to be able to build a subset as follows. I specify a pair of integers (x,y)
, and the corresponding subset is the set of sequences where both x
and y
occur, and the first x
occurs before the first y
. For example, for (x,y)=(1,2)
we would get
{1: {2: {-1: 1},
3: {2: {-1: 2}, -1: 0},
-1: 0}}
I don't have explicit O(log n) requirements, in the end it should just run as fast as possible :) Unfortunately I can not provide an actual data sample, as it is not mine to share.