perl - How to transpose a huge txt file with 1,743,680 columns and 2890 rows

Question

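I have a huge file of genetic markers for 2890 individuals, and I want to transpose it. My data is in the following format (only 6 markers are shown here):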

ID rs4477212 kgp15297216 rs3131972 kgp6703048 kgp15557302 kgp12112772 ..... 
BV04976 0 0 1 0 0 0 
BV76296 0 0 1 0 0 0 
BV02803 0 0 0 0 0 0 
BV09710 0 0 1 0 0 0 
BV17599 0 0 0 0 0 0 
BV29503 0 0 1 1 0 1 
BV52203 0 0 0 0 0 0 
BV61727 0 0 1 0 0 0 
BV05952 0 0 0 0 0 0

The real text file has 1,743,680 columns and 2890 rows. How can I transpose it? I would like the output to look like this:

ID BV04976 BV76296 BV02803 BV09710 BV17599 BV29503 BV52203 BV61727 BV05952  
rs4477212 0 0 0 0 0 0 0 0 0 
kgp15297216 0 0 0 0 0 0 0 0 0 
rs3131972 1 1 0 1 0 1 0 1 0 
kgp6703048 0 0 0 0 0 1 0 0 0 
kgp15557302 0 0 0 0 0 0 0 0 0 
kgp12112772 0 0 0 0 0 1 0 0 0

score 3 · Accepted Answer

I would make multiple passes over the file, perhaps 100, with each pass collecting 1743680/passes columns and writing them out (as rows) at the end of the pass.

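Assemble the data into strings in an array, not an array of arrays, so that memory use is lower and fewer passes are needed. Preallocating space for each string at the start of each pass (e.g. $new_row[13] = ' ' x 6000; $new_row[13] = '';) might or might not help.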
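The multi-pass idea can be sketched as follows. This is only an illustration under stated assumptions, not the answerer's code: it swaps Perl for awk driven by a small shell loop, the function name transpose_in_passes and its three arguments are made up for this example, and it assumes whitespace-separated fields and an awk that can handle lines with this many fields (gawk, for example). Each pass re-reads the whole input but keeps only one block of columns in memory, building one string per output row.

```shell
#!/bin/sh
# Hypothetical helper: transpose INFILE into OUTFILE, handling BLOCK columns per pass.
transpose_in_passes() {
    infile=$1
    outfile=$2
    block=$3

    # Take the column count from the header line.
    ncols=$(awk 'NR == 1 { print NF; exit }' "$infile")

    : > "$outfile"    # start with an empty output file
    start=1
    while [ "$start" -le "$ncols" ]; do
        end=$(( start + block - 1 ))
        if [ "$end" -gt "$ncols" ]; then
            end=$ncols
        fi

        # One pass: append field i of every input line to the string row[i],
        # for the columns of this block only, then write the strings as rows.
        awk -v s="$start" -v e="$end" '
            {
                for (i = s; i <= e; i++)
                    row[i] = (NR == 1) ? $i : (row[i] " " $i)
            }
            END {
                for (i = s; i <= e; i++)
                    print row[i]
            }
        ' "$infile" >> "$outfile"

        start=$(( end + 1 ))
    done
}
```

Calling transpose_in_passes input.txt out.txt 17437 would give about 100 passes for the file in the question. A larger block means fewer passes and more memory per pass, so the block size is the knob to tune.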

score 0 · Accepted Answer

(See: An efficient way to transpose a file in Bash)

Have you tried

awk -f tr.awk input.txt > out.txt

where tr.awk is

{ 
    for (i=1; i<=NF; i++) a[NR,i]=$i
}
END {
    for (i=1; i<=NF; i++) {
        for (j=1; j<=NR; j++) {
            printf "%s", a[j,i]
            if (j<NR) printf "%s", OFS
        }
        printf "%s",ORS
    }
}

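Your file is probably too big for this approach, because the script keeps every cell (roughly 5 billion of them) in memory until the END block runs. In that case you could try splitting the file first. For example: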

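#!/bin/bash
# Transpose input.txt one row at a time: split it into one file per row,
# turn each row into a column, and paste the columns together.
numrows=2890
echo "Splitting file.."
# One line per chunk, 4-digit numeric suffixes: x0000, x0001, ...
split -d -a4 -l1 input.txt
outfile="out.txt"
tempfile="temp.txt"
if [ -e "$outfile" ] ; then
    rm -i "$outfile"
fi
for (( i=0; i<numrows; i++ )) ; do
    echo "Processing file: $((i + 1))/$numrows"
    file=$(printf "x%04d" "$i")
    tfile="${file}.tr"
    # Put every field on its own line so the row becomes a column.
    tr -s ' ' '\n' < "$file" > "$tfile"
    rm "$file"
    if [ "$i" -gt 0 ] ; then
        # Append the new column to the columns collected so far.
        paste -d' ' "$outfile" "$tfile" > "$tempfile"
        rm "$outfile"
        mv "$tempfile" "$outfile"
        rm "$tfile"
    else
        mv "$tfile" "$outfile"
    fi
done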

Note that split will generate 2890 temporary files (!)
