linux - Shuffling a large text file without/with group order maintained

Question

Instead of writing a script, is there a one-liner to shuffle a large tab-separated text file based on the unique elements in the first column? That is, for each unique element in the first column, the output should contain the same number of rows, and that number is specified by the user.

There are two output possibilities, maintaining the row order or randomized row order.

Input :

chr1    3003204 3003454 *   37  +
chr1    3003235 3003485 *   37  +
chr1    3003148 3003152 *   37  -
chr1    3003461 3003711 *   37  +
chr11   71863609    71863647    *   37  +
chr11   71864025    71864275    *   37  +
chr11   71864058    71864308    *   37  -
chr11   71864534    71864784    *   37  +
chrY    90828920    90829170    *   23  -
chrY    90829096    90829346    *   23  +
chrY    90828924    90829174    *   23  -
chrY    90828925    90829175    *   23  -

Output (1 row per category; the number is defined by the user)

Output1 (randomized - row order will change):

chr1    3003235 3003485 *   37  +
chr11   71863609    71863647    *   37  +
chrY    90828925    90829175    *   23  -

Output2 (row order will be maintained):

chr1    3003204 3003454 *   37  +
chr11   71863609    71863647    *   37  +
chrY    90828920    90829170    *   23  -

I tried using sort -u with cut on the first column to fetch the unique elements, and then running a combination of grep and head for each element to generate the output file, which can be randomized using shuf. There might be a better solution, as the file can be huge (> 50 million lines).

Cheers

score 1 · Accepted Answer

Try using awk.

Maintaining row order

awk '!($1 in a) {a[$1]=$0} END { n=asort(a,b); for (i=1; i<=n; i++) print b[i] }' file

Output:

chr1    3003204 3003454 *   37  +
chr11   71863609    71863647    *   37  +
chrY    90828920    90829170    *   23  -

Randomized row order

For this, just pipe the output of shuf into the awk command above.

shuf file | awk '!($1 in a) {a[$1]=$0} END { n=asort(a,b); for (i=1; i<=n; i++) print b[i] }'

Output (differs on each run):

chr1    3003148 3003152 *   37  -
chr11   71864025    71864275    *   37  +
chrY    90829096    90829346    *   23  +
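
shuf needs to read all of its input before it can output a permutation, which may matter for a 50-million-line file. Per-group reservoir sampling is one single-pass alternative. Note that this is a different technique from the shuf | awk pipeline above, not a translation of it. A minimal Python sketch, assuming tab-separated input; the function name sample_per_group and its parameters k and rng are made up for illustration:

```python
import random
from collections import defaultdict


def sample_per_group(lines, k=1, rng=random):
    """Return {key: up to k uniformly sampled lines} in one pass (reservoir sampling)."""
    seen = defaultdict(int)      # number of rows seen so far for each key
    picked = defaultdict(list)   # current reservoir for each key
    for line in lines:
        key = line.split("\t", 1)[0]   # first column, assuming tab-separated input
        seen[key] += 1
        if len(picked[key]) < k:
            # Reservoir not full yet: always keep the row.
            picked[key].append(line)
        else:
            # Keep the new row with probability k / seen[key].
            j = rng.randrange(seen[key])
            if j < k:
                picked[key][j] = line
    return dict(picked)


# Small in-memory demo; the result differs on each run.
rows = [
    "chr1\t3003204\t3003454\n",
    "chr1\t3003235\t3003485\n",
    "chr11\t71863609\t71863647\n",
    "chr11\t71864025\t71864275\n",
]
for group in sample_per_group(rows, k=1).values():
    print("".join(group), end="")
```

Memory use grows with the number of distinct keys times k rather than with the file size. The rows kept for one key are not guaranteed to come out in input order.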

Variable number of rows

#!/bin/bash
numRow=3
awk -v max="$numRow" 'n[$1]<max {a[$1]=a[$1]"\n"$0; n[$1]++} END { cnt=asort(a,b); for (i=1; i<=cnt; i++) print b[i] }' file

Output:

chr1    3003204 3003454 *   37  +
chr1    3003235 3003485 *   37  +
chr1    3003148 3003152 *   37  -

chr11   71863609    71863647    *   37  +
chr11   71864025    71864275    *   37  +
chr11   71864058    71864308    *   37  -

chrY    90828920    90829170    *   23  -
chrY    90829096    90829346    *   23  +
chrY    90828924    90829174    *   23  -
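
The awk version buffers the selected rows until the END block. If the output only needs to follow input order, the same "first N rows per key" idea can also be streamed with nothing but a counter per key. A minimal Python sketch, assuming tab-separated input; head_per_group and its parameter n are hypothetical names, not part of the answer above:

```python
from collections import Counter


def head_per_group(lines, n=3):
    """Yield the first n lines seen for each first-column key, in input order."""
    seen = Counter()   # rows already emitted per key
    for line in lines:
        key = line.split("\t", 1)[0]   # first column, assuming tab-separated input
        if seen[key] < n:
            seen[key] += 1
            yield line


rows = [
    "chr1\t3003204\n",
    "chr1\t3003235\n",
    "chr1\t3003148\n",
    "chr11\t71863609\n",
    "chr11\t71864025\n",
]
for line in head_per_group(rows, n=2):
    print(line, end="")
```

Unlike the awk command, this does not sort or regroup anything: rows come out exactly in the order they were read, so keys that are interleaved in the input stay interleaved in the output.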

score 1 · Accepted Answer

Surely it would be easier to write a script?

perl -n -e 'BEGIN{ %c=qw(chr1 4 chr11 4 chrY 4); $c{$_}=int(rand($c{$_})) for keys %c;  $r="^(".join("|",keys %c).")\\s";} print if (/$r/o and !$c{$1}--);' filename.txt

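The BEGIN block is executed once when the script starts. The print if ... statement is run for every line of the file.

%c is an associative array holding the keys to look for and the number of rows for each key. In the BEGIN block each count is replaced by a random integer between 0 and count - 1.

$r is a regular expression that looks like ^(chr1|chr11|chrY)\s

If the regular expression matches, the captured key is used to look up its entry in the associative array. The line is printed when that entry is zero, and the entry is decremented on every match, so exactly one randomly chosen line is printed per key.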
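
For readers who do not use Perl, the countdown logic described above can be sketched in Python roughly as follows. This is an illustration rather than a drop-in replacement: pick_one_per_key, counts and rng are hypothetical names, the per-key row counts are assumed to be known up front and positive (as they are in %c), and the input is assumed to be tab-separated.

```python
import random


def pick_one_per_key(lines, counts, rng=random):
    """Yield one randomly chosen line per key, given the row count of each key."""
    # For each key, decide how many matching rows to skip before printing one.
    countdown = {key: rng.randrange(total) for key, total in counts.items()}
    for line in lines:
        key = line.split("\t", 1)[0]   # first column, assuming tab-separated input
        if key in countdown:
            if countdown[key] == 0:
                yield line
            # Decrement on every match; after passing zero the key never prints again.
            countdown[key] -= 1


rows = [
    "chr1\t3003204\n",
    "chr1\t3003235\n",
    "chrY\t90828920\n",
    "chrY\t90829096\n",
]
for line in pick_one_per_key(rows, {"chr1": 2, "chrY": 2}):
    print(line, end="")
```

Looking the key up in a dictionary avoids building the alternation regex, and keys missing from counts are simply skipped.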

score 1 · Accepted Answer

In case someone prefers to do this in Python with pandas, here is my answer:

#!/usr/bin/env python

import sys
import pandas as pd

column = 0
number = 1
method = pd.Series.head  # or pd.Series.sample

pd.read_table(sys.stdin, header=None) \
  .groupby(column) \
  .apply(method, n=number) \
  .to_csv(sys.stdout, sep="\t", index=False, header=False)

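pd.read_table reads a tabular file. It is the same as pd.read_csv(..., sep='\t'). header=None tells pandas not to use the first row as a header, which it would do by default.
.groupby groups the DataFrame by the given column.
.apply(method, n=number) calls method on each group with the keyword argument n=number.
.to_csv writes the DataFrame to stdout, in this case tab-separated and without the DataFrame's index and header.

Invoke it like this: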

%$ python myscript.py < ${input_tsv} > ${output_tsv}

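Pandas is a large package and takes time to load, so this script is much slower than the awk script. It may still be useful inside a larger Python program.

Benchmark:

A BED file with 49144 records.

Running the Awk scripts from @jkshah in Zsh: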

%$ awk '!($1 in a) {a[$1]=$0} END { asort(a,b); for (x in b) print b[x] }' ${bedfile} | sort >/dev/null
%$ shuf ${bedfile} | awk '!($1 in a) {a[$1]=$0} END { asort(a,b); for (x in b) print b[x] }' | sort >/dev/null

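The first takes about 21 ms of wall time (average of 70 runs). The second takes about 30 ms of wall time (average of 70 runs).

Running the Python statements in IPython using the %timeit magic: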

In [1]: %timeit pd.read_table("Every10cM.sort.bed", header=None).groupby(0).apply(pd.Series.head, n=1).to_csv(sep="\t", index=False, header=False)
In [2]: %timeit pd.read_table("Every10cM.sort.bed", header=None).groupby(0).apply(pd.Series.sample, n=1).to_csv(sep="\t", index=False, header=False)

Both take about 72 ms of wall time (average of 70 runs). So it is slower...
