linux - Combine rows in linux

Question

If I have the input file below, is there any command or way in Linux to convert it into my desired file, as follows?

Input file:

Column_1     Column_2  
scaffold_A   SNP_marker1
scaffold_A   SNP_marker2
scaffold_A   SNP_marker3
scaffold_A   SNP_marker4
scaffold_B   SNP_marker5
scaffold_B   SNP_marker6
scaffold_B   SNP_marker7
scaffold_C   SNP_marker8
scaffold_A   SNP_marker9
scaffold_A   SNP_marker10

Desired Output file:

Column_1     Column_2  
scaffold_A   SNP_marker1;SNP_marker2;SNP_marker3;SNP_marker4
scaffold_B   SNP_marker5;SNP_marker6;SNP_marker7
scaffold_C   SNP_marker8
scaffold_A   SNP_marker9;SNP_marker10

I was thinking of using grep, uniq, etc, but still couldn't figure out how to get this done.

score 2 · Accepted Answer

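Python solution (assuming the input and output file names are passed on the command line):

from __future__ import print_function  # not needed with Python 3
import sys

# sys.argv[1] is the input file, sys.argv[2] is the output file
with open(sys.argv[1]) as infile, open(sys.argv[2], 'w') as outfile: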
    outfile.write(infile.readline()) # transfer the header
    col_one, col_two = infile.readline().split()
    col_two = [col_two] # make it a list
    for line in infile:
        data = line.split()
        if col_one != data[0]:
            print("{}\t{}".format(col_one, ';'.join(col_two)), file=outfile)
            col_one = data[0]
            col_two = [data[1]]
        else:
            col_two.append(data[1])
    print("{}\t{}".format(col_one, ';'.join(col_two)), file=outfile)
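The loop above follows a "flush when the key changes, then flush once more at the end" pattern. As a minimal sketch of the same idea, the generator below works on any iterable of lines and also tolerates blank lines or input with no data rows. The name merge_runs and the sample rows are illustrative assumptions, not part of the original answer.

```python
def merge_runs(lines):
    """Yield (key, values) for each run of consecutive lines sharing column 1."""
    current_key = None
    values = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        if fields[0] != current_key:
            if values:
                yield current_key, values  # flush the run that just ended
            current_key, values = fields[0], []
        values.append(fields[1])
    if values:
        yield current_key, values  # flush the final run


# Hypothetical sample rows in the same shape as the question's input
rows = ["scaffold_A SNP_marker1", "scaffold_A SNP_marker2",
        "scaffold_B SNP_marker5", "scaffold_A SNP_marker9"]
for key, values in merge_runs(rows):
    print("{}\t{}".format(key, ";".join(values)))
```

Because only adjacent rows are merged, the second scaffold_A run stays separate from the first, which matches the desired output in the question.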

score 2 · Accepted Answer

Perl solution:

perl -lane 'sub output {
                print "$last\t", join ";", @buff;
            }
            $last //= $F[0];
            if ($F[0] ne $last) {
               output();
               undef @buff;
               $last = $F[0];
            }
            push @buff, $F[1];
            }{ output();'

score 0 · Accepted Answer

An awk solution in a bash script:

#!/bin/bash 

awk '
BEGIN{
    str = ""
}
{
    if ( str != $1 ) {
        if ( NR != 1 ){
            printf("\n")
        }
        str = $1
        printf("%s\t%s",$1,$2)
    } else if ( str == $1 ) {
        printf(";%s",$2)
    }
}
END{
        printf("\n")
}' your_file.txt

score 0 · Accepted Answer

You can also try the following solution in bash:

cat input.txt | while read L; do y=`echo $L | cut -f1 -d' '`; { test "$x" = "$y" && echo -n ";`echo $L | cut -f2 -d' '`"; } || { x="$y";echo -en "\n$L"; }; done

Or, in a more human-readable form for review:

cat input.txt | while read L;
do
  y=`echo $L | cut -f1 -d' '`;
  {
    test "$x" = "$y" && echo -n ";`echo $L | cut -f2 -d' '`";
  } || 
  {
    x="$y";echo -en "\n$L"; 
  };
done

Note that the clean formatting of the script's output relies on the behavior of bash's built-in echo command (the -n and -e options).

score 0 · Accepted Answer

If you don't mind using Python, it has itertools.groupby, which serves this purpose:

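# file: combine.py
import itertools

with open('data.txt') as f:
    # split each non-blank line into its whitespace-separated columns
    data = [row.split() for row in f if row.strip()]

# groupby yields one group per run of consecutive rows with the same first column
for column1, rows_group in itertools.groupby(data, key=lambda row: row[0]):
    # format() keeps the output identical under Python 2 and Python 3
    print('{} {}'.format(column1, ';'.join(row[1] for row in rows_group)))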

Save this script as combine.py. Assuming your input file is data.txt, run it to get your desired output:

python combine.py

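Discussion

The result of the with open(...) block is data, a list of rows, where each row is itself a list of columns.
The itertools.groupby function takes an iterable, in this case a list. You tell it how to group rows together by giving it a key (column1).
rows_group is an iterator over the consecutive rows that share the same column1. Because groupby only merges adjacent rows, the two separate runs of scaffold_A stay separate, as in the desired output.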
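To make the grouping behavior concrete, here is a small sketch that runs itertools.groupby on an in-memory sample instead of a file. The helper name combine_rows and the shortened sample are assumptions made for illustration; the sample is only an excerpt in the format of the question's input.

```python
import itertools

# Shortened, hypothetical excerpt in the format of the question's input
sample = """\
Column_1     Column_2
scaffold_A   SNP_marker1
scaffold_A   SNP_marker2
scaffold_B   SNP_marker5
scaffold_A   SNP_marker9
"""


def combine_rows(lines):
    """Yield (column1, joined_column2) for each run of consecutive rows."""
    rows = (line.split() for line in lines if line.strip())
    for key, group in itertools.groupby(rows, key=lambda row: row[0]):
        yield key, ';'.join(row[1] for row in group)


for key, joined in combine_rows(sample.splitlines()):
    print('{}\t{}'.format(key, joined))
```

The header line needs no special handling here: its first column differs from the rows around it, so it forms a group of its own and passes through unchanged. The two scaffold_A runs are not adjacent, so they come out as two separate lines.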
