bash - Combining and organizing multiple files based on column data in bash

Question

I am trying to combine 115 files from a single directory. Here is an example of what the files look like:

File 1

Phenotype Marker Value1 Value2 Value3
P1 1:54390 0.2948 0.4837 0.2198
P2 1:54390 0.3482 0.6583 0.1937
P3 1:54390 0.1983 0.1837 0.4177
P4 1:54390 0.9128 0.9930 0.0043
P5 1:54390 0.1938 0.0109 0.6573
P1 1:69402 0.2039 0.2340 0.2346
P2 1:69402 0.0239 0.3545 0.1987
P3 1:69402 0.8239 0.8677 0.4177
P4 1:69402 0.2498 0.3099 0.0765
P5 1:69402 0.0982 0.0198 0.0123

File 2

Phenotype Marker Value1 Value2 Value3
P1 9:21048 0.8568 0.1231 0.1654
P2 9:21048 0.1244 0.3213 0.1223
P3 9:21048 0.9869 0.1231 0.4776
P4 9:21048 0.3543 0.7657 0.0033
P5 9:21048 0.1231 0.3213 0.8578
P1 9:87758 0.1231 0.8768 0.4653
P2 9:87758 0.7657 0.5435 0.8845
P3 9:87758 0.9879 0.8437 0.7464
P4 9:87758 0.1231 0.9879 0.5523
P5 9:87758 0.9879 0.9868 0.0006

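So basically each file has a unique set of markers, and each marker is paired with all 5 phenotypes (P1, P2, P3, P4, P5).

A few things:

A. I would like a single file that looks like this (below), with the data organized by phenotype

Phenotype Marker Value1 Value2 Value3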
P1 1:54390 0.2948 0.4837 0.2198
P1 1:69402 0.2039 0.2340 0.2346
P1 9:21048 0.8568 0.1231 0.1654
P1 9:87758 0.1231 0.8768 0.4653
P2 1:54390 0.3482 0.6583 0.1937
P2 1:69402 0.0239 0.3545 0.1987
P2 9:21048 0.1244 0.3213 0.1223
P2 9:87758 0.7657 0.5435 0.8845
P3 1:54390 0.1983 0.1837 0.4177
P3 1:69402 0.8239 0.8677 0.4177
P3 9:21048 0.9869 0.1231 0.4776
P3 9:87758 0.9879 0.8437 0.7464
P4 1:54390 0.9128 0.9930 0.0043
P4 1:69402 0.2498 0.3099 0.0765
P4 9:21048 0.3543 0.7657 0.0033
P4 9:87758 0.1231 0.9879 0.5523
P5 1:54390 0.1938 0.0109 0.6573
P5 1:69402 0.0982 0.0198 0.0123
P5 9:21048 0.1231 0.3213 0.8578
P5 9:87758 0.9879 0.9868 0.0006

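I would like to do this in bash. Can anyone give me some insight? I am very new to this language!

B. Once I have this huge file, I would also like to save a separate file per phenotype (I plan to do some quality-control steps in between), so I would end up with 5 files, one each for P1, P2, P3, P4, and P5, each with its data from the other columns.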

score 2 · Accepted Answer

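#!/usr/bin/gawk -f
# Requires GNU awk: PROCINFO["sorted_in"] is a gawk extension.
# Adjust the interpreter path above if gawk lives elsewhere.

# Keep one copy of the header line.
/Phenotype/ { hd = $0; next }
# Store every data row as an array key.
{ rw[$0] }
END {
  print hd
  # Walk the keys in ascending string order: by phenotype, then by marker.
  PROCINFO["sorted_in"] = "@ind_str_asc"
  for (each in rw) print each
}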
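
For reference, a minimal sketch of roughly the same merge using only standard tools, in case gawk is not available. It assumes every input file starts with the same one-line header and that the columns are separated by whitespace; the function name merge_sorted and the file names in the usage comment are made up for illustration.

```shell
# merge_sorted FILE...: hypothetical helper, not part of the original answer.
# Prints the header of the first file once, then the data rows of every
# file, sorted by phenotype (column 1) and then by marker (column 2).
merge_sorted() {
    head -n 1 "$1"
    # FNR restarts at 1 for each file, so FNR > 1 drops every header line.
    awk 'FNR > 1' "$@" | LC_ALL=C sort -k1,1 -k2,2
}

# Example (file names are assumptions): merge_sorted file1 file2 > merged.txt
```

The ordering here is plain string order, so a marker on chromosome 10 would sort before one on chromosome 9.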

score 2 · Accepted Answer

To solve A, you can use the approach proposed by spiehr. To solve B:

# Name of your big merged file
BIG_FILE='...'

TYPES='P1 P2 P3 P4 P5'
for T in $TYPES; do
    # Keep only the lines starting with $T (one of P1, P2, etc.)
    # and write them to a file named accordingly
    grep "^$T" "$BIG_FILE" > "file_$T"
done
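
One thing to watch with a prefix match like ^P1 is that it would also pick up labels such as P10. A hedged sketch of a variant that compares the first column exactly and carries the header line into each output file; the function name split_by_phenotype is hypothetical, and it assumes the merged file has a single header line and whitespace-separated columns.

```shell
# split_by_phenotype FILE: hypothetical helper, not part of the original answer.
# Writes one file_<label> per distinct value in column 1 of FILE,
# each starting with a copy of FILE's header line.
split_by_phenotype() {
    big_file=$1
    header=$(head -n 1 "$big_file")
    # Collect the distinct labels in column 1, skipping the header line.
    for t in $(awk 'NR > 1 { print $1 }' "$big_file" | sort -u); do
        {
            printf '%s\n' "$header"
            # Exact comparison on the first field, so P1 does not match P10.
            awk -v t="$t" 'NR > 1 && $1 == t' "$big_file"
        } > "file_$t"
    done
}
```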

score 0 · Accepted Answer

To get the header line with the column titles:

# file1 is a placeholder for any one of the input files
head -n 1 file1 > tmpfile

The data can then be appended like this:

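for file in *; do
    # tmpfile lives in the same directory, so skip it
    [ "$file" = tmpfile ] && continue
    tail -n +2 "$file" >> tmpfile2
done
sort tmpfile2 >> tmpfile
rm tmpfile2

tmpfile will then be the file containing all the data. Instead of looping over *, you can use a narrower glob or any other command that lists only the relevant files.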

To get only the entries with "P3" in the first column, you can use grep:

grep '^P3' tmpfile | cut -f1 --complement

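The cut command removes the first column, which you probably no longer need. Note that --complement is a GNU extension and that cut splits on tabs by default, so space-separated data would need -d ' '.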
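
If the delimiter is not known in advance, or GNU cut is not available, something along these lines may be more forgiving. This is only a sketch: the function name rows_without_first_column is made up, and it assumes the columns are separated by spaces or tabs.

```shell
# rows_without_first_column LABEL FILE: hypothetical helper.
# Prints the rows of FILE whose first column equals LABEL, minus that column.
rows_without_first_column() {
    awk -v p="$1" '$1 == p {
        # Delete the first field and the whitespace after it,
        # leaving the rest of the line untouched.
        sub(/^[ \t]*[^ \t]+[ \t]+/, "")
        print
    }' "$2"
}

# Example: rows_without_first_column P3 tmpfile
```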

score 0 · Accepted Answer

I would write the first step as

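{
    sed 1q file1
    # Assumes the inputs are named file1, file2, ...; the glob must not match file_all.
    # FNR > 1 drops the header of every file (sed 1d * would drop only the very first line).
    awk 'FNR > 1' file[0-9]* | sort
} > file_all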

Then

awk '
    FNR == 1 {head = $0; next}
    !seen[$1]++ {print head > $1}
    {print > $1}
' file_all

This produces files named "P1", "P2", and so on.
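
Since a quality-control step is planned anyway, a quick sanity check after splitting might look like the sketch below. It only compares row counts, the function name check_split is hypothetical, and it assumes each file (merged and per-phenotype) starts with exactly one header line.

```shell
# check_split MERGED PART...: hypothetical helper.
# Succeeds if the data rows (everything except each file's header line)
# in the PART files add up to the data rows in MERGED.
check_split() {
    merged=$1
    shift
    total=$(awk 'FNR > 1' "$merged" | wc -l)
    parts=$(awk 'FNR > 1' "$@" | wc -l)
    [ "$total" -eq "$parts" ]
}

# Example: check_split file_all P1 P2 P3 P4 P5 && echo "row counts match"
```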
