bash - 从bash中的大文件中获取随机行

Question

如何n从无法放入内存的非常大的文件中获取随机行。

如果我可以在随机化之前或之后添加过滤器，那就太好了。

更新 1

就我而言，规格是：

> 1 亿行
> 10GB 文件
通常随机批量大小 10000-30000
512RAM 托管 ubuntu 服务器 14.10

所以从文件中丢失几行不会是一个大问题，因为它们有万分之一的机会，但是性能和资源消耗将是一个问题

score 8 · Accepted Answer

在这样的限制因素下，下面的方法会更好。

寻找文件中的随机位置（例如，您将在某行“内部”）
从这个位置向后走并找到给定行的开始
前进并打印整行

为此，您需要一个可以在文件中查找的工具，例如perl.

use strict;
use warnings;
use Symbol;
use Fcntl qw( :seek O_RDONLY ) ;
my $seekdiff = 256; #e.g. from "rand_position-256" up to rand_positon+256

my($want, $filename) = @ARGV;

my $fd = gensym ;
sysopen($fd, $filename, O_RDONLY ) || die("Can't open $filename: $!");
binmode $fd;
my $endpos = sysseek( $fd, 0, SEEK_END ) or die("Can't seek: $!");

my $buffer;
my $cnt;
while($want > $cnt++) {
    my $randpos = int(rand($endpos));   #random file position
    my $seekpos = $randpos - $seekdiff; #start read here ($seekdiff chars before)
    $seekpos = 0 if( $seekpos < 0 );

    sysseek($fd, $seekpos, SEEK_SET);   #seek to position
    my $in_count = sysread($fd, $buffer, $seekdiff<<1); #read 2*seekdiff characters

    my $rand_in_buff = ($randpos - $seekpos)-1; #the random positon in the buffer

    my $linestart = rindex($buffer, "\n", $rand_in_buff) + 1; #find the begining of the line in the buffer
    my $lineend = index $buffer, "\n", $linestart;            #find the end of line in the buffer
    my $the_line = substr $buffer, $linestart, $lineend < 0 ? 0 : $lineend-$linestart;

    print "$the_line\n";
}

将上述内容保存到“randlines.pl”等文件中，并将其用作：

perl randlines.pl wanted_count_of_lines file_name

例如

perl randlines.pl 10000 ./BIGFILE

该脚本执行非常低级的 IO 操作，即它非常快。（在我的笔记本上，从 10M 中选择 30k 行需要半秒）。

score 4 · Accepted Answer

这是一个适合您的 bash 函数。正如您所说，它抓取“一批”行，在文件中具有随机起点。

randline() {
  local lines c r _

  # cache the number of lines in this file in a symlink in the temp dir
  lines="/tmp/${1//\//-}.lines"
  if [ -h "$lines" ] && [ "$lines" -nt "${1}" ]; then
    c=$(ls -l "$lines" | sed 's/.* //')
  else
    read c _ < <(wc -l $1)
    ln -sfn "$c" "$lines"
  fi

  # Pick a random number...
  r=$[ $c * ($RANDOM * 32768 + $RANDOM) / (32768 * 32768) ]
  echo "start=$r" >&2

  # And start displaying $2 lines before that number.
  head -n $r "$1" | tail -n ${2:-1}
}

根据需要编辑echo行。

该解决方案的优点是管道更少、资源密集型管道更少（即 no | sort ... |）、平台依赖性更少（即 nosort -R是 GNU-sort-specific）。

请注意，这依赖于 Bash 的$RANDOM变量，该变量实际上可能是随机的，也可能不是。此外，如果您的源文件包含超过 32768^2 行，它将丢失行，并且如果您指定的行数 (N) 大于 1 并且随机起点小于 N 行，则会出现故障边缘情况开始。解决这个问题留给读者作为练习。:)

更新＃1：

mklement0 在有关该方法的潜在性能问题的评论中提出了一个很好的问题head ... | tail ...。老实说，我不知道答案，但我希望两者head都tail得到充分优化，以便在显示输出之前不会缓冲所有输入。

如果我的希望没有实现，这里有一个替代方案。这是一个基于 awk 的“滑动窗口”尾巴。我会将它嵌入到我之前编写的函数中，以便您可以根据需要对其进行测试。

randline() {
  local lines c r _

  # Line count cache, per the first version of this function...
  lines="/tmp/${1//\//-}.lines"
  if [ -h "$lines" ] && [ "$lines" -nt "${1}" ]; then
    c=$(ls -l "$lines" | sed 's/.* //')
  else
    read c _ < <(wc -l $1)
    ln -sfn "$c" "$lines"
  fi

  r=$[ $c * ($RANDOM * 32768 + $RANDOM) / (32768 * 32768) ]

  echo "start=$r" >&2

  # This simply pipes the functionality of the `head | tail` combo above
  # through a single invocation of awk.
  # It should handle any size of input file with the same load/impact.
  awk -v lines=${2:-1} -v count=0 -v start=$r '
    NR < start { next; }
    { out[NR]=$0; count++; }
    count > lines { delete out[start++]; count--; }
    END {
      for(i=start;i<start+lines;i++) {
        print out[i];
      }
    }
  ' "$1"
}

嵌入的 awk 脚本替换了head ... | tail ...先前版本的函数中的管道。它的工作原理如下：

它会跳过行，直到由早期随机化确定的“开始”。
它将当前行记录到一个数组中。
如果数组大于我们要保留的行数，它将消除第一条记录。
在文件的末尾，它打印记录的数据。

结果是 awk 进程不应该增加它的内存占用，因为输出数组被修剪的速度与它构建的一样快。

注意：我没有用你的数据实际测试过。

更新#2：

Hrm，随着对您的问题的更新，您想要 N 随机行而不是从随机点开始的一行块，我们需要不同的策略。您施加的系统限制非常严格。以下可能是一个选项，也使用 awk，随机数仍然来自 Bash：

randlines() {
  local lines c r _

  # Line count cache...
  lines="/tmp/${1//\//-}.lines"
  if [ -h "$lines" ] && [ "$lines" -nt "${1}" ]; then
    c=$(ls -l "$lines" | sed 's/.* //')
  else
    read c _ < <(wc -l $1)
    ln -sfn "$c" "$lines"
  fi

  # Create a LIST of random numbers, from 1 to the size of the file ($c)
  for (( i=0; i<$2; i++ )); do
    echo $[ $c * ($RANDOM * 32768 + $RANDOM) / (32768 * 32768) + 1 ]
  done | awk '
    # And here inside awk, build an array of those random numbers, and
    NR==FNR { lines[$1]; next; }
    # display lines from the input file that match the numbers.
    FNR in lines
  ' - "$1"
}

这通过将随机行号列表作为“第一个”文件输入 awk 来工作，然后让 awk 打印来自“第二个”文件的行，其行号包含在“第一个”文件中。它用于wc确定要生成的随机数的上限。这意味着您将阅读此文件两次。如果您有文件中行数的其他来源（例如数据库），请在此处插入。:)

一个限制因素可能是第一个文件的大小，它必须加载到内存中。我相信 30000 个随机数应该只占用大约 170KB 的内存，但是数组在 RAM 中的表示方式取决于您使用的 awk 的实现。（虽然通常，awk 实现（包括 Ubuntu 中的 Gawk）非常擅长将内存浪费降至最低。）

这对你有用吗？

score 2 · Accepted Answer

简单（但缓慢）的解决方案

n=15 #number of random lines
filter_before | sort -R | head -$n | filter_after

#or, if you could have duplicate lines
filter_before | nl | sort -R | cut -f2- | head -$n | filter_after
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

或者，如果需要，将以下内容保存到randlines脚本中

#!/bin/bash
nl | sort -R | cut -f2 | head -"${1:-10}"

并将其用作：

filter_before | randlines 55 | filter_after   #for 55 lines

这个怎么运作：

通过为每一行sort -R计算的随机哈希对文件进行排序，因此您将获得随机的行顺序，因此前 N 行是随机行。

因为散列对同一行产生相同的散列，所以重复的行不会被视为不同。可以消除添加行号的重复行（带有nl），因此排序永远不会得到完全相同的重复。sort删除添加的行号后。

例子：

seq -f 'some line %g' 500 | nl | sort -R | cut -f2- | head -3

在后续运行中打印：

some line 65
some line 420
some line 290

some line 470
some line 226
some line 132

some line 433
some line 424
some line 196

带有重复行的演示：

yes 'one
two' | head -10 | nl | sort -R | cut -f2- | head -3

在随后的运行中打印：

one
two
two

one
two
one

one
one
two

最后，如果你想可以使用，而不是cut sed太：

sed -r 's/^\s*[0-9][0-9]*\t//'

score 0 · Accepted Answer

#!/bin/bash
#contents of bashScript.sh

file="$1";
lineCnt=$2;
filter="$3";
nfilter="$4";
echo "getting $lineCnt lines from $file matching '$filter' and not matching '$nfilter'" 1>&2;

totalLineCnt=$(cat "$file" | grep "$filter" | grep -v "$nfilter" | wc -l | grep -o '^[0-9]\+');
echo "filtered count : $totalLineCnt" 1>&2;

chances=$( echo "$lineCnt/$totalLineCnt" | bc -l );
echo "chances : $chances" 1>&2;

cat "$file" | awk 'BEGIN { srand() } rand() <= $chances { print; }' | grep "$filter" | grep -v "$nfilter" | head -"$lineCnt";

用法：

获取 1000 个随机样本

bashScript.sh /path/to/largefile.txt 1000

行有数字

bashScript.sh /path/to/largefile.txt 1000 "[0-9]"

没有迈克和简

bashScript.sh /path/to/largefile.txt 1000 "[0-9]" "mike|jane"

score 0 · Accepted Answer

我已经用于rl行随机化，发现它表现得很好。不确定它如何扩展到您的情况（您只需执行 eg rl FILE | head -n NUM）。你可以在这里得到它：http: //arthurdejong.org/rl/

bash - 从bash中的大文件中获取随机行

更新 1

5 回答 5

用法：

Related

Reference