I have a large tab-separated, two-column text file that looks like this:

...
"001R_FRG3G"    "81941549; 47060116; 49237298"
"002L_FRG3G"    "49237299; 47060117; 81941548"
"002R_IIV3" "106073503; 123808694; 109287880"
...

As you can see, the second column does not contain atomic values. That is why I want to "normalize" the file so that it looks like this:

...
"001R_FRG3G"    "81941549"
"001R_FRG3G"    "47060116"
"001R_FRG3G"    "49237298"
"002L_FRG3G"    "49237299"
"002L_FRG3G"    "47060117"
"002L_FRG3G"    "81941548"
"002R_IIV3" "106073503"
"002R_IIV3" "123808694"
"002R_IIV3" "109287880"
...

Does anyone know how to do this efficiently?

3 Answers

Perl:

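perl -lne '
s/[";]//g;                      # strip all quotes and semicolons
($a, @b) = split;               # first field is the key, the rest are the IDs
print qq("$a"\t"$_") for @b;    # one tab-separated key/ID pair per line
' FILE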
answered 2012-04-17T07:05:28.867

awk -v OFS='\t' '{for (i=2; i<=NF; i++) {gsub(/[";]/, "", $i); printf "%s%s\"%s\"\n", $1, OFS, $i}}' inputfile

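For each field after $1, strip the quotes and semicolons, then print $1 followed by the field's contents wrapped in quotes. This is done for every line of the input file.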
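
For readability, the same loop can be written out as a commented shell function. This is only a sketch: the name normalize_pairs is made up for illustration, OFS is set to a tab on the assumption that the output should stay tab-separated, and it assumes the key in the first column contains no whitespace.

```shell
# normalize_pairs is a hypothetical helper name, not part of the answer above.
normalize_pairs() {
    awk -v OFS='\t' '
    {
        key = $1                       # first column, quotes left intact
        for (i = 2; i <= NF; i++) {    # every whitespace-separated ID after the key
            id = $i
            gsub(/[";]/, "", id)       # drop quotes and the trailing semicolon
            print key, "\"" id "\""    # one key/ID pair per output line
        }
    }' "$@"
}

# Demo on one sample row from the question.
printf '"001R_FRG3G"\t"81941549; 47060116; 49237298"\n' | normalize_pairs
```

Working on a copy of the field (id) rather than on $i itself avoids rebuilding $0 on every iteration, which may matter a little on large files.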

answered 2012-04-16T21:08:22.527

This might work for you (GNU awk):

awk '{while(/;/) $0=gensub(/^((.*[ \t]").*);[ \t]*/,"\\1\"\n\\2",1)};1' file
"001R_FRG3G"    "81941549"
"001R_FRG3G"    "47060116"
"001R_FRG3G"    "49237298"
"002L_FRG3G"    "49237299"
"002L_FRG3G"    "47060117"
"002L_FRG3G"    "81941548"
"002R_IIV3" "106073503"
"002R_IIV3" "123808694"
"002R_IIV3" "109287880"

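Or this, which is not awk but solves the problem elegantly (GNU sed; note that -i edits the file in place, so drop it to print the result instead):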

sed -i ':a;s/\(\(.*\s"\).*\);\s*/\1"\n\2/;ta' file
"001R_FRG3G"    "81941549"
"001R_FRG3G"    "47060116"
"001R_FRG3G"    "49237298"
"002L_FRG3G"    "49237299"
"002L_FRG3G"    "47060117"
"002L_FRG3G"    "81941548"
"002R_IIV3" "106073503"
"002R_IIV3" "123808694"
"002R_IIV3" "109287880"
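
Note that gensub() is a GNU awk extension, and the sed command relies on GNU sed features (\s, \n in the replacement, -i). Where those are unavailable, a rough POSIX awk sketch could split on the tab first and then on the semicolons. The function name split_ids is hypothetical, and the sketch assumes exactly two tab-separated columns with the second one wrapped in double quotes.

```shell
# split_ids is a hypothetical helper name used only for this sketch.
split_ids() {
    awk -F'\t' -v OFS='\t' '
    {
        list = $2
        gsub(/"/, "", list)                # strip the surrounding quotes
        n = split(list, ids, /;[ \t]*/)    # ids[1..n] hold the individual values
        for (i = 1; i <= n; i++)
            print $1, "\"" ids[i] "\""     # re-quote each value next to its key
    }' "$@"
}

# Demo on one sample row from the question.
printf '"001R_FRG3G"\t"81941549; 47060116; 49237298"\n' | split_ids
```

Because the record is split on tabs only, a key that happens to contain spaces would still come through unchanged.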
answered 2012-04-17T06:53:04.367