perl - How can I randomly sample the contents of a file?

Question

I have a file with the following contents:

abc
def
high
lmn
...
...

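The file has more than 2 million lines. I want to randomly sample lines from the file and output 50K of them. Any thoughts on how to approach this? I was thinking along the lines of Perl and its rand function (or a handy shell command would be neat).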

Related (possibly duplicate) questions:

score 12 · Accepted Answer

Assuming you basically want to output about 2.5% of all lines, this would do:

# Perl does not allow stacked statement modifiers ("print if ... while ..."), so use "and".
rand() < 0.025 and print while <$input>;
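As a fuller sketch of the same idea, the snippet below wraps the per-line coin flip in a small subroutine. The name `bernoulli_sample` and the in-memory demo data are assumptions made for illustration, not part of the original answer. Because each line is kept independently, the output size is only approximately the target: with 2,000,000 lines and a probability of 0.025 the expected count is 50,000, with a standard deviation of roughly 220 (the square root of 2,000,000 × 0.025 × 0.975).

```perl
use strict;
use warnings;

# bernoulli_sample is a hypothetical helper name for this sketch.
# It keeps each line independently with probability $p, so the
# number of lines returned varies from run to run.
sub bernoulli_sample {
    my ($fh, $p) = @_;
    my @kept;
    while (my $line = <$fh>) {
        push @kept, $line if rand() < $p;
    }
    return @kept;
}

# Demo on an in-memory "file" of 10,000 numbered lines.
my $data = join '', map { "line $_\n" } 1 .. 10_000;
open my $fh, '<', \$data or die "Can't open in-memory file: $!";
my @kept = bernoulli_sample($fh, 0.025);
close $fh;
printf "kept %d of 10000 lines\n", scalar @kept;
```

For a real file, open it with `open my $fh, '<', $path` and pass that handle instead. If exactly 50,000 lines are required, this approach is not enough on its own.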

score 5 · Accepted Answer

The shell way:

sort -R file | head -n 50000

answered 2009-06-23T20:05:35.387

score 3 · Accepted Answer

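From perlfaq5: "How do I select a random line from a file?"

Short of loading the file into a database or pre-indexing the lines in the file, there are a couple of things that you can do.

Here's a reservoir-sampling algorithm from the Camel Book: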

srand;
rand($.) < 1 && ($line = $_) while <>;

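This has a significant advantage in space over reading the whole file in. You can find a proof of this method in The Art of Computer Programming, Volume 2, Section 3.4.2, by Donald E. Knuth.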
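The snippet above keeps a single line, while the question asks for 50K. A minimal sketch of how the same idea generalizes to k lines follows. It uses the common "Algorithm R" form of reservoir sampling, which is a swapped-in variant and not the code from perlfaq5; the name `reservoir_sample` and the demo data are assumptions for illustration.

```perl
use strict;
use warnings;

# reservoir_sample is a hypothetical helper name, not a library function.
# It returns $k lines chosen uniformly at random from a filehandle of
# unknown length, reading the input once and storing only $k lines.
sub reservoir_sample {
    my ($fh, $k) = @_;
    my @sample;
    my $seen = 0;
    while (my $line = <$fh>) {
        $seen++;
        if (@sample < $k) {
            # Fill the reservoir with the first $k lines.
            push @sample, $line;
        }
        else {
            # Pick a slot in 0 .. $seen-1; the new line replaces an
            # existing one only if the slot falls inside the reservoir,
            # which happens with probability $k / $seen.
            my $j = int(rand($seen));
            $sample[$j] = $line if $j < $k;
        }
    }
    return @sample;
}

# Demo on an in-memory "file" of 1000 numbered lines.
my $data = join '', map { "line $_\n" } 1 .. 1000;
open my $fh, '<', \$data or die "Can't open in-memory file: $!";
my @picked = reservoir_sample($fh, 5);
close $fh;
print @picked;
```

If the input has fewer than k lines, all of them are returned. The sampled lines do not come back in file order, so sort them by line number first if order matters.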

You can use the File::Random module, which provides a function for that algorithm:

use File::Random qw/random_line/;
my $line = random_line($filename);

Another way is to use the Tie::File module, which treats the entire file as an array. Simply access a random array element.
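A minimal sketch of the Tie::File approach might look like this. The helper name `random_line_tied` and the temporary demo file are assumptions for illustration. The file is opened with `mode => O_RDONLY` so that a missing file is reported as an error instead of being created.

```perl
use strict;
use warnings;
use Fcntl qw(O_RDONLY);
use File::Temp qw(tempfile);
use Tie::File;

# random_line_tied is a hypothetical helper name for this sketch.
sub random_line_tied {
    my ($filename) = @_;
    # Open read-only so a missing file is an error, not a new empty file.
    tie my @lines, 'Tie::File', $filename, mode => O_RDONLY
        or die "Can't tie '$filename': $!";
    # Tie::File strips the line ending from each record by default.
    my $line = @lines ? $lines[ int(rand(@lines)) ] : undef;
    untie @lines;
    return $line;
}

# Demo: write a small temporary file, then pick one line from it.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
print {$tmp_fh} "abc\ndef\nghi\nlmn\n";
close $tmp_fh;
print random_line_tied($tmp_name), "\n";
```

Asking for the size of the tied array makes Tie::File scan the file to find the line boundaries, so on a file with millions of lines the first access is not free.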

score 2 · Accepted Answer

If you need to extract an exact number of lines:

use strict;
use warnings;

# Number of lines to pick and file to pick from
# Error checking omitted!
my ($pick, $file) = @ARGV;

open(my $fh, '<', $file)
    or die "Can't read file '$file' [$!]\n";

# count lines in file
my ($lines, $buffer);
while (sysread $fh, $buffer, 4096) {
    $lines += ($buffer =~ tr/\n//);
}

# limit number of lines to pick to number of lines in file
$pick = $lines if $pick > $lines;

# build list of N lines to pick, use a hash to prevent picking the
# same line multiple times
my %picked;
for (1 .. $pick) {
    my $n = int(rand($lines)) + 1;
    redo if $picked{$n}++
}

# loop over file extracting selected lines
seek($fh, 0, 0);
while (<$fh>) {
    print if $picked{$.};
}
close $fh;

score 2 · Accepted Answer

The Perl way:

Use CPAN. There is a module, File::RandomLine, that does exactly what you need.

answered 2009-06-23T20:44:05.747
