
I am a physics graduate student, and I am writing some code to sort several hundred gigabytes of data and return slices of that data when I ask for it. Here is the trick: I do not know of a good method for sorting and searching data of this kind.

My data essentially consists of a very large number of sets of numbers. These sets can contain anywhere from 1 to n numbers (though in 99.9% of the sets, n is less than 15), and there are approximately 1.5 to 2 billion of these sets (unfortunately, this size rules out a brute-force search).

I need to be able to specify a set of k elements, and have every set of k or more elements that contains those k elements returned to me.

A simple example:
Suppose my data consists of the following sets:
(1,2,3)
(1,2,3,4,5)
(4,5,6,7)
(1,3,8,9)
(5,8,11)

If I made the request (1,3), I would get back the sets: (1,2,3), (1,2,3,4,5), and (1,3,8,9).
The request (11) would return the set: (5,8,11).
The request (1,2,3) would return the sets: (1,2,3) and (1,2,3,4,5).
The request (50) would return no sets.

The pattern should be clear by now. The main differences between this example and my data are that my sets are larger, the numbers used for each element of a set run from 0 to 16383 (14 bits), and there are many, many, many more sets.

If it matters, I am writing this program in C++, though I also know Java, C, some assembly, some Fortran, and some Perl.

Does anyone have any clues on how to pull this off?

Edit:
To answer a few questions and add a few points:

1.) The data does not change. It was all taken in one long set of runs (each run broken into 2-gig files).

2.) As for storage space: the raw data takes up roughly 250 gigabytes. I estimate that after processing and stripping out a lot of the extraneous metadata I am not interested in, I could knock that down to anywhere from 36 to 48 gigabytes, depending on how much metadata I decide to keep (without indexes). Additionally, if during my initial processing I encounter enough identical sets, I might be able to compress the data further by adding a counter for repeated events rather than simply repeating the events over and over.

3.) Each number within a processed set actually contains at LEAST two numbers: 14 bits for the data itself (detected energy) and 7 bits of metadata (detector number). So I will need at LEAST three bytes per number. (A small packing sketch follows this list.)

4.) My "though in 99.9% of the sets, n is less than 15" comment was misleading. In a preliminary glance through some chunks of the data, I find sets that contain as many as 22 numbers, but the median is 5 numbers per set and the average is 6 numbers per set.

5.) While I like the idea of building an index of pointers into files, I am a bit leery because for requests involving more than one number I am left with the semi-slow task (at least I think it is slow) of finding the set of all pointers common to the lists, i.e., finding the greatest common subset for a given number of sets.

6.) In terms of resources available to me, I can muster approximately 300 gigs of space after I have the raw data on the system (the remainder of my quota on that system). The system is a dual-processor server with two quad-core AMD Opterons and 16 gigabytes of RAM.

7.) Yes, 0 can occur; when it does it is an artifact of the data acquisition system, but it can occur.
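To make point 3 concrete, here is a minimal sketch of one way the 14-bit energy and 7-bit detector number could be packed into 21 bits (three bytes on disk). The exact field layout is only an assumption for illustration, not the actual format.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical layout: low 14 bits hold the detected energy,
// the next 7 bits hold the detector number (21 bits total).
uint32_t pack(uint16_t energy, uint8_t detector) {
    assert(energy < (1u << 14) && detector < (1u << 7));
    return (static_cast<uint32_t>(detector) << 14) | energy;
}

uint16_t unpack_energy(uint32_t packed)   { return packed & 0x3FFF; }
uint8_t  unpack_detector(uint32_t packed) { return (packed >> 14) & 0x7F; }
```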


6 Answers


Your problem is the same as that faced by search engines. "I have a bajillion documents. I need the ones which contain this set of words." You just have (very conveniently) integers instead of words, and smallish documents. The solution is an inverted index. Introduction to Information Retrieval by Manning et al. is (at that link) available free online, is very readable, and will go into a lot of detail about how to do this.

You're going to have to pay a price in disk space, but it can be parallelized, and should be more than fast enough to meet your timing requirements, once the index is constructed.
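As a rough sketch of the inverted-index idea (my own illustration, not from the book): assume each set gets a 32-bit ID and the postings fit in memory, which for two billion sets they will not; a real build would spill the postings to disk and merge.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <unordered_map>
#include <vector>

// number -> sorted list of IDs of the sets containing it (the "postings list")
using InvertedIndex = std::unordered_map<uint16_t, std::vector<uint32_t>>;

InvertedIndex build(const std::vector<std::vector<uint16_t>>& sets) {
    InvertedIndex idx;
    for (uint32_t id = 0; id < sets.size(); ++id)
        for (uint16_t v : sets[id])
            idx[v].push_back(id);        // IDs arrive in order, so each list stays sorted
    return idx;
}

// IDs of every set containing all the numbers in the request.
std::vector<uint32_t> lookup(const InvertedIndex& idx,
                             const std::vector<uint16_t>& request) {
    std::vector<uint32_t> result;
    for (size_t i = 0; i < request.size(); ++i) {
        auto it = idx.find(request[i]);
        if (it == idx.end()) return {};  // a number that appears nowhere: empty result
        if (i == 0) { result = it->second; continue; }
        std::vector<uint32_t> merged;
        std::set_intersection(result.begin(), result.end(),
                              it->second.begin(), it->second.end(),
                              std::back_inserter(merged));
        result.swap(merged);
    }
    return result;
}
```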

answered 2009-01-30T05:55:28.790

Assuming a random distribution of 0-16383, with a consistent 15 elements per set, and two billion sets, each element would appear in approximately 1.8M sets. Have you considered (and do you have the capacity for) building a 16384x~1.8M (30B entries, 4 bytes each) lookup table? Given such a table, you could query which sets contain (1) and (17) and (5555) and then find the intersections of those three ~1.8M-element lists.
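The intersection of those per-number ID lists could be done with a plain two-pointer merge, starting from the smallest list so the intermediate result shrinks quickly. A sketch under those assumptions (32-bit set IDs, lists already sorted):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Two-pointer intersection of two sorted ID lists.
std::vector<uint32_t> intersect(const std::vector<uint32_t>& a,
                                const std::vector<uint32_t>& b) {
    std::vector<uint32_t> out;
    size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if      (a[i] < b[j]) ++i;
        else if (b[j] < a[i]) ++j;
        else { out.push_back(a[i]); ++i; ++j; }
    }
    return out;
}

// Intersect several per-number lists, smallest first, so intermediate
// results shrink as quickly as possible.
std::vector<uint32_t> intersect_all(std::vector<std::vector<uint32_t>> lists) {
    if (lists.empty()) return {};
    std::sort(lists.begin(), lists.end(),
              [](const auto& x, const auto& y) { return x.size() < y.size(); });
    std::vector<uint32_t> result = lists.front();
    for (size_t k = 1; k < lists.size() && !result.empty(); ++k)
        result = intersect(result, lists[k]);
    return result;
}
```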

answered 2009-01-30T04:58:12.803

My guess is as follows.

Assume that each set has a name or ID or address (a 4-byte number will do if there are only 2 billion of them).

Now walk through all the sets once, and create the following output files:

  • A file which contains the IDs of all the sets which contain '1'
  • A file which contains the IDs of all the sets which contain '2'
  • A file which contains the IDs of all the sets which contain '3'
  • ... etc ...

If there are 16 entries per set, then on average each of these 2^14 files will contain the IDs of about 2^21 sets; with each ID being 4 bytes, this would require roughly 2^37 bytes (128 GB) of storage.

You'll do the above once, before you process requests.

When you receive requests, use these files as follows:

  • Look at a couple of numbers in the request
  • Open up a couple of the corresponding index files
  • Get the list of all sets which exist in both these files (there are only a couple of million IDs in each file, so this shouldn't be difficult)
  • See which of these few sets satisfy the remainder of the request

My guess is that if you do the above, creating the indexes will be (very) slow and handling requests will be (very) quick.
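A sketch of the index-building pass described above (the names are illustrative; buffering all postings in RAM is only realistic for one 2-gig chunk at a time, so a real build would flush and merge periodically):

```cpp
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// One pass over all sets: collect, for each value 0..16383, the IDs of the
// sets containing it, then write one binary index file per value.
void build_index_files(const std::vector<std::vector<uint16_t>>& sets,
                       const std::string& dir) {
    std::vector<std::vector<uint32_t>> postings(16384);
    for (uint32_t id = 0; id < sets.size(); ++id)
        for (uint16_t v : sets[id])
            postings[v].push_back(id);       // IDs are appended in ascending order

    for (unsigned v = 0; v < 16384; ++v) {
        std::ofstream out(dir + "/" + std::to_string(v) + ".idx", std::ios::binary);
        out.write(reinterpret_cast<const char*>(postings[v].data()),
                  postings[v].size() * sizeof(uint32_t));
    }
}
```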

answered 2009-01-30T05:24:06.887

I have recently discovered methods that use space-filling curves to map multi-dimensional data down to a single dimension. One can then index the data on its 1D key. Range queries can be carried out by finding the segments of the curve that intersect the box representing the query, and then retrieving those segments.

I believe this method is far superior to building the enormous indexes suggested above, because on inspection such an index would be as large as the data I wish to store, which is hardly a good thing. A somewhat more detailed explanation of this can be found at:

http://www.ddj.com/184410998
and
http://www.dcs.bbk.ac.uk/~jkl/publications.html
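For a concrete starting point, a Morton (Z-order) key is the simplest space-filling curve to compute; the sketch below interleaves the bits of a small, fixed number of 14-bit coordinates into one 64-bit key. It is only meant to show the mapping idea; the articles above use Hilbert curves, which preserve locality better, and variable-length sets would need extra handling.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Interleave the bits of d 14-bit coordinates into a single Morton (Z-order)
// key: bit j of coordinate i becomes bit (j * d + i) of the key.
// The key fits in 64 bits only for d <= 4 coordinates.
uint64_t morton_key(const std::vector<uint16_t>& coords) {
    const size_t d = coords.size();
    assert(d >= 1 && d <= 4);
    uint64_t key = 0;
    for (unsigned bit = 0; bit < 14; ++bit)
        for (size_t i = 0; i < d; ++i)
            key |= static_cast<uint64_t>((coords[i] >> bit) & 1u) << (bit * d + i);
    return key;
}
```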

answered 2009-09-11T03:37:10.517

Make 16383 index files, one for each possible search value. For each value in your input set, write the file position of the start of the set into the corresponding index file. It is important that each of the index files contains the same number for the same set. Now each index file will consist of ascending indexes into the master file.

To search, start reading the index files corresponding to each search value. If you read an index that's lower than the index you read from another file, discard it and read another one. When you get the same index from all of the files, that's a match - obtain the set from the master file, and read a new index from each of the index files. Once you reach the end of any of the index files, you're done.

If your values are evenly distributed, each index file will contain 1/16383 of the input sets. If your average search set consists of 6 values, you will be doing a linear pass over 6/16383 of your original input. It's still an O(n) solution, but your n is a bit smaller now.
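A sketch of the search loop described above, assuming each index file holds a sorted sequence of 64-bit master-file offsets written in binary:

```cpp
#include <algorithm>
#include <cstdint>
#include <fstream>
#include <vector>

// Intersect the sorted position lists in the open index files by always
// advancing whichever stream is behind the current maximum. Returns the
// master-file positions present in every stream.
std::vector<uint64_t> intersect_index_files(std::vector<std::ifstream>& files) {
    std::vector<uint64_t> matches;
    if (files.empty()) return matches;
    const size_t k = files.size();
    std::vector<uint64_t> cur(k);

    // Prime one position from each file; an empty file means no matches.
    for (size_t i = 0; i < k; ++i)
        if (!files[i].read(reinterpret_cast<char*>(&cur[i]), sizeof cur[i]))
            return matches;

    for (;;) {
        uint64_t hi = cur[0];
        for (size_t i = 1; i < k; ++i) hi = std::max(hi, cur[i]);

        bool all_equal = true;
        for (size_t i = 0; i < k; ++i) {
            while (cur[i] < hi) {           // this stream is behind: discard and read on
                if (!files[i].read(reinterpret_cast<char*>(&cur[i]), sizeof cur[i]))
                    return matches;         // any stream exhausted -> done
                all_equal = false;
            }
            if (cur[i] != hi) all_equal = false;
        }
        if (all_equal) {
            matches.push_back(hi);          // same position in every file: a match
            for (size_t i = 0; i < k; ++i)
                if (!files[i].read(reinterpret_cast<char*>(&cur[i]), sizeof cur[i]))
                    return matches;
        }
    }
}
```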

P.S. Is zero an impossible result value, or do you really have 16384 possibilities?

answered 2009-01-30T16:44:02.443

Just playing devil's advocate for an approach which includes brute force + index lookup:

  1. Create an index with the min, max, and number of elements of each set.
  2. Then apply brute force, excluding sets where max < max(set being searched) or min > min(set being searched).
  3. In the brute-force pass, also exclude sets whose element count is less than that of the set being searched.

95% of your searches would then really be brute-forcing a much smaller subset. Just a thought.
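A sketch of the summary record and the skip test this describes (field widths chosen to match the 14-bit values and the observed maximum of 22 elements per set):

```cpp
#include <cstddef>
#include <cstdint>

// Per-set summary: smallest value, largest value, and element count.
// uint8_t is enough for the count, since the largest observed set has 22 elements.
struct SetSummary {
    uint16_t min;
    uint16_t max;
    uint8_t  count;
};

// True if the summary alone rules the set out: it cannot contain the
// requested numbers if it has fewer elements than the request, or if its
// value range does not span the request's range.
bool can_skip(const SetSummary& s, uint16_t req_min, uint16_t req_max,
              std::size_t req_count) {
    return s.count < req_count || s.max < req_max || s.min > req_min;
}
```

Only the sets that survive this test would then be scanned element by element.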

answered 2009-01-30T05:49:37.740