I have a data file with a large number of values (53,000,000+), and I want to pull out a random subset of n of them (say, 2,000,000). I implemented a Perl script that pulls the list into memory, shuffles the array using the Fisher-Yates method, and then prints out the first n values of the shuffled list. However, the shuffling step takes a very long time, even on much smaller test sets (50,000 values).
I'm looking for a more efficient, more scalable way to identify a random subset of a huge set of values and print it out. Any suggestions?
Update: Based on the answers and some further searching, it looks like the correct terminology is "random sampling".
Elaborating on aix's answer above: to choose k out of a stream of items, read the items one at a time. Keep the first k items in a set S.
Now, when reading the m-th item I (with m > k), keep it with probability k/m. If you do keep it, select an item U uniformly at random from S and replace U with I.
The proof that this yields every subset of size k with equal probability is by induction on m. Note that you don't need to know n (the total number of items) in advance, and that S at each step is itself a uniform sample of the items read so far. The algorithm is "streaming" - it doesn't require storing all the items or making a second pass.
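A minimal sketch of that reservoir scheme in Perl (a sketch, not the asker's script), assuming one value per line on STDIN and the 2,000,000 sample size from the question:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Reservoir sampling: keep the first $k lines, then keep the m-th line
    # with probability $k/m, replacing a uniformly chosen line already in
    # the reservoir. One pass, no need to know the total count in advance.
    my $k = 2_000_000;                          # desired sample size
    my @reservoir;

    while (my $line = <STDIN>) {
        my $m = $.;                             # 1-based index of this line
        if ($m <= $k) {
            push @reservoir, $line;             # keep the first k items
        }
        elsif (rand($m) < $k) {                 # keep with probability k/m
            $reservoir[ int rand $k ] = $line;  # replace a uniform victim
        }
    }

    print @reservoir;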
First, check your shuffle implementation. Implemented correctly, it should run in linear time. Also, modify the algorithm to stop once the desired number of elements has been shuffled: there is no need (practically or theoretically) to shuffle more numbers than you actually output.
If you ask for k numbers, this costs you k element operations. I doubt you can do much better than that.
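As a rough illustration (again a sketch, not the asker's script), a partial Fisher-Yates in Perl that shuffles only the first k positions of an in-memory array and returns them:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Partial Fisher-Yates: shuffle only the first $k positions, then stop.
    # Assumes the whole list still fits in memory.
    sub partial_shuffle_first_k {
        my ($values, $k) = @_;
        my $n = scalar @$values;
        for my $i (0 .. $k - 1) {
            my $j = $i + int(rand($n - $i));   # uniform index in [$i, $n-1]
            @{$values}[$i, $j] = @{$values}[$j, $i];
        }
        return @{$values}[0 .. $k - 1];        # a uniform random k-subset
    }

    my @values = (1 .. 50_000);                # example data
    my @sample = partial_shuffle_first_k(\@values, 2_000);
    print scalar(@sample), " values sampled\n";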
Don't shuffle - it's unnecessarily expensive.
Jon Bentley's "Programming Pearls" (which Bentley says he learned from Knuth's "Seminumerical Algorithms") discusses a simple linear algorithm for this. Use that instead.
There are some Perl implementations of it around:
These two snippets implement Algorithm S (3.4.2) and Algorithm R (3.4.2) from Knuth's The Art of Computer Programming. The first randomly selects N items from an array of elements and returns a reference to an array containing the selected elements. Note that it will not necessarily consider all of the elements in the list.
The second randomly selects N items from a file of indeterminate size and returns an array containing the selected elements. Records in the file are assumed to be one per line, and the lines are chomped while reading. This requires only one pass through the list. A slight modification can be made to use the snippet in situations where N records would exceed memory limitations; however, this requires slightly more than one pass (/msg if you need this explained).
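The referenced snippets aren't reproduced here; as a loose sketch of the selection-sampling idea (Knuth's Algorithm S) that the first one implements, over an in-memory array:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Selection sampling: pick $n items from an array of known size,
    # visiting each element at most once and stopping early when done.
    sub select_n {
        my ($items, $n) = @_;
        my @chosen;
        my $remaining = scalar @$items;          # items not yet examined
        my $needed    = $n;                      # items still to select
        for my $item (@$items) {
            last if $needed == 0;                # early exit once complete
            if (rand($remaining) < $needed) {    # probability needed/remaining
                push @chosen, $item;
                $needed--;
            }
            $remaining--;
        }
        return \@chosen;                         # ref to the selected items
    }

    my @data   = (1 .. 1000);                    # example data
    my $sample = select_n(\@data, 50);
    print scalar(@$sample), " items selected\n";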
Reading and shuffling the array would involve a lot of unnecessary data movement.
Here are a few ideas:
One: When you say you need a random subset, what exactly do you mean by "random" in this context? By which I mean, are the records in any particular order, or is the order relevant to whatever it is you are trying to randomize?
Because my first thought is that if the records are not in any relevant order, then you can get a random selection by simply calculating total size divided by sample size and then selecting every n-th record. For example, if you have 53 million records and you want a sample of 2 million, take 53 million / 2 million ~= 26, so read every 26th record.
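For instance, a tiny sketch of that systematic sampling in Perl, assuming one record per line and the step of 26 from the example:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Systematic sampling: keep every $step-th record of the input.
    my $step = 26;                              # ~ total records / sample size
    while (my $line = <STDIN>) {
        print $line if ($. - 1) % $step == 0;   # $. is the current line number
    }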
Two: If that's not adequate, a more rigorous solution would be to generate 2 million random numbers in the range of zero to 53 million, ensuring no duplicates.
Two-A: If your sample size is small compared to the total number of records, like if you were just picking out a few hundred or a few thousand, I'd generate an array of that many entries and, for each new entry, compare it to all previous entries to check for duplicates. If it's a duplicate, loop around and try again until you find a unique value.
Two-B: Assuming your numbers are not just examples but the actual values, your sample size is large compared to the total population. In that case, given the ample memory on modern computers, you should be able to do this efficiently by creating an array of 53 million booleans initialized to false, each of course representing one record. Then run through a loop 2 million times. For each iteration, generate a random number from 0 to 53 million. Check the corresponding boolean in the array: if it's false, set it to true; if it's true, generate another random number and try again.
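A rough sketch of the Two-B idea, using Perl's vec() as a compact bit vector in place of a literal array of 53 million booleans (which would be much larger as individual Perl scalars):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Flag $want distinct record indices out of $total, retrying on duplicates.
    # vec() packs the flags into a bit string (~6.6 MB for 53 million bits).
    my $total = 53_000_000;
    my $want  = 2_000_000;

    my $flags = '';                      # bit vector, all bits initially 0
    my $count = 0;
    while ($count < $want) {
        my $idx = int rand $total;       # random record index in [0, $total)
        next if vec($flags, $idx, 1);    # already picked: try again
        vec($flags, $idx, 1) = 1;        # mark this record as selected
        $count++;
    }

    # Second pass over the data file: print record $i when its bit is set.
    while (my $line = <STDIN>) {
        print $line if vec($flags, $. - 1, 1);
    }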
Three: Or wait, here's a better idea yet, given the relatively large percentage: Calculate the percentage of records you want to include. Then loop through a counter of all the records. For each, generate a random number from 0 to 1 and compare it to the desired percentage. If it's less, read that record and include it in the sample. If it's greater, skip the record.
If it's important to get the exact number of sample records, you can recalculate the percentage for each record. For example -- and to keep the example simple, let's pretend you want 10 out of 100 records:
You'd start with 10/100 = .1. So we generate a random number; say it comes up .04. Since .04 < .1, we include record #1.
Now we recalculate the percentage. We want 9 more records out of the 99 remaining, which gives 9/99 ~= .0909. Say our random number is .87. That's greater, so we skip record #2.
Recalculate again. We still need 9 records out of the 98 remaining, so the magic number is 9/98, whatever that comes to. Etc.
Once we've got as many records as we want, the probability for future records drops to zero, so we'll never go over. If we're near the end and haven't picked up enough records, the probability gets very close to 100%. For example, if we still need 8 records and there are only 8 records left, the probability is 8/8 = 100%, so we're guaranteed to take the next record.
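A compact sketch of that exact-count version, assuming the total record count is known up front (say from a prior wc -l) and records arrive one per line on STDIN; it's the same selection-sampling idea mentioned in an earlier answer:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Exact-count streaming sample, as walked through above: take each record
    # with probability (records still needed) / (records still remaining).
    my ($total, $want) = (53_000_000, 2_000_000);   # known counts (assumed)

    my $remaining = $total;
    my $needed    = $want;
    while (my $line = <STDIN>) {
        last if $needed == 0;                    # sample complete, stop early
        if (rand($remaining) < $needed) {        # probability needed/remaining
            print $line;
            $needed--;
        }
        $remaining--;
    }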