
I have a large tab-delimited file with about 80 columns, like this:

184     
2       
P   2853263 4998463
SS      
AG0001-C        
T/T      C/C      A/A
AG0002-C        
T/T      C/C      A/T   
AG0003-C        
T/T      C/C      A/A   
AG0004-C         
T/T      C/C      T/A

I want to replace the slash character ("/") with a newline, so that the contents of each column are split across two rows, like this:

184     
2       
P   2853263 4998463
SS      
AG0001-C        
T        C         A
T        C         A
AG0002-C        
T        C         A
T        C         T
AG0003-C         
T        C         A
T        C         A
AG0004-C        
T        C         T
T        C         A

4 Answers


For input like this (no leading tab before the first column):

184
2
P   2853263 4998463
SS
AG0001-C
T/T C/C A/A
AG0002-C
T/T C/C A/T
AG0003-C
T/T C/C A/A
AG0004-C
T/T C/C T/A

This script should work with mawk:

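#!/usr/bin/awk -f

# Print the four header lines and the odd-numbered sample-name lines unchanged.
NR <= 4 || NR % 2 { print; next; }
{
    # Split every field on "/" and store piece j of field i under key "i|j".
    rows = 0
    for (i = 1; i <= NF; ++i) {
        count = split($i, b, /\//)
        if (count > rows) {
            rows = count
        }
        for (j = 1; j <= count; ++j) {
            key = i "|" j
            a[key] = b[j]
        }
    }
    # Print one row per piece; fields with fewer pieces leave an empty cell.
    for (i = 1; i <= rows; ++i) {
        key = 1 "|" i
        printf("%s", a[key])
        for (j = 2; j <= NF; ++j) {
            key = j "|" i
            printf("\t%s", a[key])
        }
        print ""
    }
    # Clear the array before the next record.
    for (i in a) {
        delete a[i]
    }
}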

Output:

184
2
P   2853263 4998463
SS
AG0001-C
T   C   A
T   C   A
AG0002-C
T   C   A
T   C   T
AG0003-C
T   C   A
T   C   A
AG0004-C
T   C   T
T   C   A

It should even work for input where fields contain different numbers of slash-separated values, like this:

184
2
P   2853263 4998463
SS
AG0001-C
A/A/C/X/Y/Z T/T C/C A/A A/A/C/X A/A/B   A/A/C/X/Y
AG0002-C
T/T C/C A/T
AG0003-C
T/T C/C A/A
AG0004-C
T/T C/C T/A

Output:

184
2
P   2853263 4998463
SS
AG0001-C
A   T   C   A   A   A   A
A   T   C   A   A   A   A
C               C   B   C
X               X       X
Y                       Y
Z                       
AG0002-C
T   C   A
T   C   T
AG0003-C
T   C   A
T   C   A
AG0004-C
T   C   T
T   C   A

For input with a leading tab on every line:

    184
    2
    P   2853263 4998463
    SS
    AG0001-C
    T/T C/C A/A
    AG0002-C
    T/T C/C A/T
    AG0003-C
    T/T C/C A/A
    AG0004-C
    T/T C/C T/A

this variant, which prints a tab before every field, including the first:

#!/usr/bin/awk -f

NR <= 4 || NR % 2 { print; next; }
{
    rows = 0
    for (i = 1; i <= NF; ++i) {
        count = split($i, b, /\//)
        if (count > rows) {
            rows = count
        }
        for (j = 1; j <= count; ++j) {
            key = i "|" j
            a[key] = b[j]
        }
    }
    for (i = 1; i <= rows; ++i) {
        for (j = 1; j <= NF; ++j) {
            key = j "|" i
            printf("\t%s", a[key])
        }
        print ""
    }
    for (i in a) {
        delete a[i]
    }
}

produces this output:

    184
    2
    P   2853263 4998463
    SS
    AG0001-C
    T   C   A
    T   C   A
    AG0002-C
    T   C   A
    T   C   T
    AG0003-C
    T   C   A
    T   C   A
    AG0004-C
    T   C   T
    T   C   A
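
The two scripts above differ only in whether a tab is printed before the first field. A minimal sketch that handles both layouts in one pass, by copying the leading whitespace of each genotype line onto its output rows, could look like this. The function name `split_genotypes` and the array names `cell` and `part` are illustrative, and the four-line header followed by alternating name/genotype lines is assumed from the sample input:

```shell
# split_genotypes: hypothetical helper; reads the files given as arguments, or stdin.
split_genotypes() {
    awk '
    # Header lines (1-4) and odd-numbered sample-name lines pass through unchanged.
    NR <= 4 || NR % 2 { print; next }
    {
        # Keep whatever whitespace precedes the first field (possibly none).
        indent = $0
        sub(/[^ \t].*$/, "", indent)

        # cell[i, j] holds piece j of field i after splitting on "/".
        rows = 0
        for (i = 1; i <= NF; ++i) {
            n = split($i, part, "/")
            if (n > rows) rows = n
            for (j = 1; j <= n; ++j) cell[i, j] = part[j]
        }

        # One output row per piece; shorter fields leave an empty cell.
        for (j = 1; j <= rows; ++j) {
            line = indent
            for (i = 1; i <= NF; ++i)
                line = line (i > 1 ? "\t" : "") cell[i, j]
            print line
        }

        split("", cell)   # clear the array before the next record
    }' "$@"
}
```

It can be called as `split_genotypes file` or at the end of a pipeline. Fields are still split on any run of whitespace, as in the scripts above, so empty columns are not preserved.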
answered 2013-08-29T10:28:50.950

A GNU awk solution:

$ awk '/[/]/{print $1,$3,$5;print $2,$4,$6;next}1' FS='/|[ \t]+' OFS='\t' file
184
2
P   2853263 4998463
SS
AG0001-C
T       C       A
T       C       A
AG0002-C
T       C       A
T       C       T
AG0003-C
T       C       A
T       C       A
AG0004-C
T       C       T
T       C       A
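
The one-liner above hard-codes three columns, while the file in the question has about 80. A sketch of the same idea for any number of columns, assuming every genotype is exactly two slash-separated values and lines carry no leading whitespace (the function name `split_pairs` is illustrative):

```shell
# split_pairs: hypothetical helper; reads the files given as arguments, or stdin.
split_pairs() {
    awk -v FS='/|[ \t]+' -v OFS='\t' '
    /\// {
        # With "/" as a field separator, odd-numbered fields are the first
        # value of each pair and even-numbered fields are the second.
        first = second = ""
        for (i = 1; i < NF; i += 2) {
            first  = first  (i > 1 ? OFS : "") $i
            second = second (i > 1 ? OFS : "") $(i + 1)
        }
        print first
        print second
        next
    }
    { print }   # lines without a slash pass through unchanged
    ' "$@"
}
```

The loop condition `i < NF` also skips the empty trailing field that appears when a line ends in whitespace.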
answered 2013-08-29T09:37:04.433

Using sed (GNU sed, three tab-separated columns):

$ sed -E 's|^([^/]+)/([^\t]+)\t([^/]+)/([^\t]+)\t([^/]+)/([^\t ]+)[ \t]*$|\1\t\3\t\5\n\2\t\4\t\6|' inputfile
184
2
P   2853263 4998463
SS
AG0001-C
T   C   A
T   C   A
AG0002-C
T   C   A
T   C   T
AG0003-C
T   C   A
T   C   A
AG0004-C
T   C   T
T   C   A
answered 2013-08-29T10:13:22.767

This might work for you (GNU sed):

sed '/\//!b;h;s|/.||g;G;s|./||g' file

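For lines that contain a /, copy the line to the hold space. Delete each / and the character after it. Then append the copied line and delete each / together with the character before it. This assumes that each value on either side of a / is a single character.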
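
The same hold-space technique can be spelled out one command per line. The sketch below also widens the two substitutions so that values longer than one character survive. It assumes exactly two values per slash-separated pair, and the function name `split_with_sed` is illustrative:

```shell
# split_with_sed: hypothetical helper; reads the files given as arguments, or stdin.
split_with_sed() {
    sed '
        # Lines without a slash are printed unchanged.
        /\//!b
        # Save a copy of the line in the hold space.
        h
        # First row: drop each slash and the value that follows it.
        s|/[^/[:space:]]*||g
        # Append a newline plus the saved copy.
        G
        # Second row: drop each slash and the value that precedes it.
        s|[^/[:space:]]*/||g
    ' "$@"
}
```

Because `[:space:]` includes the newline added by `G`, the second substitution cannot reach back into the first row, which no longer contains any slashes.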

answered 2013-08-29T14:53:47.520