linux - Randomly select n lines from a file, with possible repeats

Question

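I have a text file with more than 1 million lines. The individual lines are not very long (roughly 200-270 characters each).

I am trying to randomly select a number of lines equal to 60% of the input, where any line may be repeated in the output. For the file above, my output would have 600,000 lines, but perhaps only 500,000 of them would be unique. I also need the lines that were never selected written to a different output file. No line should appear in both output files.

Each line of the input file holds one record, like this:

Record1
Record2
Record3
Record4
Record5
Record6
Record7

Suppose I select 5 random lines into output1.txt, where any line may be repeated. Say the following lines were selected and are now in output1.txt:

Record3
Record5
Record2
Record2
Record5

The remaining records should go to output2.txt:

Record1
Record4
Record6
Record7

The order of the records does not matter.

I think I could write Java code to do this, but I would like to know whether there is a command or script that would do it quickly. I tried using 'shuf' to select the lines, but how can I make sure that the lines already selected do not appear in the second output I want to produce?

I am on a Linux machine. Any suggestions or comments are welcome. Thanks.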

score 2 · Accepted Answer

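Here is a Perl solution.

I seem to have written this a lot lately, but indexing a very large text file is the best way to access it randomly without reading the whole file into memory.

The program uses the tell operator to record the offset of each record in the source file, the seek operator to return to a specific record, and vec to keep track of which records have been selected.

Note that the do { ... } while EXPR form executes the do-block before the condition is checked for the first time, and was chosen specifically for that reason.

The program expects the file to be scanned to be specified on the command line. Output goes to selected.txt for the selected 60% and to unselected.txt for the rest.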

use strict;
use warnings;

my $file = shift or die "No input file specified";

open my $infh, '<', $file or die qq(Unable to open "$file" for input: $!);
my @index;
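# Record the byte offset at which each line starts.
do { push @index, tell $infh } while <$infh>;
pop @index;  # the last offset pushed is end-of-file, not the start of a line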

my $used = "\0" x (@index / 8 + 1);

my $outfh;

open $outfh, '>', 'selected.txt' or die $!;
my $n = 0;
while ($n++ / @index < 0.6) {
  my $rec = int rand scalar @index;
  seek $infh, $index[$rec], 0;
  print $outfh scalar <$infh>;
  vec($used, $rec, 1) = 1;
}

open $outfh, '>', 'unselected.txt' or die $!;
for my $rec (0 .. $#index) {
  next if vec($used, $rec, 1);
  seek $infh, $index[$rec], 0;
  print $outfh scalar <$infh>;
}
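
For readers who do not use Perl, the same offset-index idea can be sketched in Python. This is a minimal illustration rather than a port of the program above: the function name `split_with_replacement` and its `percent` and `seed` parameters are made up for the example, a `bytearray` stands in for the `vec` bit string, and the number of picks is rounded down, so it can differ by one from the Perl loop.

```python
import random

def split_with_replacement(src, selected, unselected, percent=60, seed=None):
    """Sample lines with replacement using a byte-offset index (illustrative helper)."""
    rng = random.Random(seed)
    with open(src, "rb") as infh:
        # Record the byte offset at which each line starts.
        offsets = []
        while True:
            pos = infh.tell()
            if not infh.readline():
                break
            offsets.append(pos)

        used = bytearray(len(offsets))          # one flag per line, 0 = never picked
        picks = len(offsets) * percent // 100   # rounded down

        with open(selected, "wb") as out:
            for _ in range(picks):
                rec = rng.randrange(len(offsets))
                infh.seek(offsets[rec])         # jump back to the chosen line
                out.write(infh.readline())
                used[rec] = 1

        with open(unselected, "wb") as out:
            for rec, offset in enumerate(offsets):
                if not used[rec]:
                    infh.seek(offset)
                    out.write(infh.readline())
    return picks
```

Only the offsets and one flag byte per line are held in memory, which is the point of indexing the file instead of reading it whole.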

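Edit

I hesitated to use a module to replace so little code, but here is a version that uses Tie::File, as recommended by ikegami, in case anyone prefers that approach.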

use strict;
use warnings;

use Fcntl 'O_RDONLY';  # supplies the O_RDONLY constant used below
use Tie::File;

my $file = shift or die "No input file specified";

tie my @index, 'Tie::File', $file, mode => O_RDONLY
    or die qq(Unable to open "$file" for input: $!);

my $outfh;
my @used;

open $outfh, '>', 'selected.txt' or die $!;
my $n = 0;
while ($n++ / @index < 0.6) {
  my $rec = int rand scalar @index;
  print $outfh $index[$rec], "\n";
  $used[$rec]++;
}

open $outfh, '>', 'unselected.txt' or die $!;
for my $rec (0 .. $#index) {
  print $outfh $index[$rec], "\n" unless $used[$rec];
}

score 1 · Accepted Answer

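This randomly picks one of the file's N lines, over and over, until 0.6 × N lines (60%) have been picked. The rate of duplicates is not controlled.

To save memory, we keep the file positions of the lines in memory rather than the lines themselves. Tie::File does that for us.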

#!/usr/bin/env perl
use strict;
use warnings;

use Tie::File  qw( );

my ($input_qfn, $picked_qfn, $unpicked_qfn) = @ARGV;

tie(my @lines, 'Tie::File', $input_qfn, autochomp => 0)
   or die;

my $num_lines = @lines;
my @unpicked_indexes = 0..$num_lines-1;
my @picked_indexes;
for (1..$num_lines*.6) {
   my $rnd_idx = int(rand($num_lines));
   $unpicked_indexes[$rnd_idx] = undef;
   push @picked_indexes, $rnd_idx;
}

open(my $picked_fh, '>', $picked_qfn)
   or die $!;
print($picked_fh $lines[$_]) for @picked_indexes;

open(my $unpicked_fh, '>', $unpicked_qfn)
   or die $!;
print($unpicked_fh $lines[$_]) for grep defined, @unpicked_indexes;

score 0 · Accepted Answer

You can do this with a bash script using the following code:

Without repeated lines in the output:

#!/bin/bash

lines=$(wc -l inputfile.txt | awk '{print $1}')

echo $lines

# computation of percentage of random lines we
# want to pick e.g. 60%
let percentage=$((lines*60/100))

echo $percentage

# pick the random lines
random_lines=$(sort -R inputfile.txt | head -n $percentage)

# show the random lines
echo "$random_lines"

With repeated lines in the output:

#!/bin/bash

lines=$(wc -l inputfile.txt | awk '{print $1}')

echo $lines

# computation of percentage of random lines we
# want to pick e.g. 60%
let percentage=$((lines*60/100))

echo $percentage

# pick the random lines
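# (i <= percentage so that exactly $percentage lines are printed)
for ((i=1; i<=percentage; i++))
do
  # note: this re-sorts the whole file on every iteration
  sort -R inputfile.txt | head -n 1
done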

score 0 · Accepted Answer

Example: 10% of the lines go to STDOUT twice, 50% go to STDOUT once, and the remaining 40% go to STDERR.

awk 'BEGIN {srand()} !/^$/ { r = rand(); if (r <= .60) print $0; if (r <= 0.10) print $0; if (r > .60) print $0 > "/dev/stderr"; }'

Note: redirect STDOUT to one file with > file1 and STDERR to another file with 2> file2.
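
To make the routing rule in the one-liner easier to follow, here is a rough Python equivalent. It is only a sketch: the function name `route_lines` is invented for the example, and the two returned lists stand in for STDOUT and STDERR. Note that this scheme is not sampling with replacement in the strict sense: each line is decided independently, no line appears more than twice, and the 60/40 split is only approximate.

```python
import random

def route_lines(lines, rng=None):
    """Mimic the awk filter: return (stdout_lines, stderr_lines)."""
    if rng is None:
        rng = random.Random()
    out, err = [], []
    for line in lines:
        if line.rstrip("\n") == "":   # awk's !/^$/ skips empty lines
            continue
        r = rng.random()              # one draw per line, as in the awk script
        if r <= 0.60:
            out.append(line)          # selected at least once
        if r <= 0.10:
            out.append(line)          # selected a second time
        if r > 0.60:
            err.append(line)          # never selected
    return out, err
```

On average this writes about 0.7 × N lines to the first list (0.5 × N once plus 0.1 × N twice) and about 0.4 × N lines to the second.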

score 0 · Accepted Answer

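The mathematical terms for what you are looking for are these: you have a set of 1 million elements, and you want to select a sample of elements "with replacement"; in addition, you want to know which elements were not chosen.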

import random

universe = range(10**6)  # or whatever your elements are
numElementsToChoose = int(0.6*len(universe))

chosen = [random.choice(universe) for _ in range(numElementsToChoose)]
unchosen = set(universe) - set(chosen)

Demo:

>>> len(chosen), len(unchosen)
(600000, 548815)

(This code is inelegant because universe ought to be a set, but Python does not natively support picking a random element from a set, only from a sequence... ugh.)
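
The unchosen count in the demo is close to what a short calculation predicts. As a sketch, assume N lines and k independent uniform draws with replacement (N and k are symbols introduced here, with N = 10^6 and k = 0.6N as in the question):

```latex
\begin{align*}
\Pr[\text{a given line is never chosen}]
  &= \left(1 - \frac{1}{N}\right)^{k} \approx e^{-k/N}, \\
\mathbb{E}[\text{number of unchosen lines}]
  &= N\left(1 - \frac{1}{N}\right)^{k} \approx N e^{-k/N}
   = 10^{6}\, e^{-0.6} \approx 548{,}812 .
\end{align*}
```

So roughly 451,000 of the 600,000 selected lines are expected to be distinct, and the rest are repeats.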

score 0 · Accepted Answer

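If you have shuf, you probably have comm too. Its -3 option compares two sorted files and suppresses the lines common to both, so only lines found in just one of the files are output.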
