
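I would like to know how to concatenate the lines of a file (multiple columns and rows, >100 MB) for rows that share the same entry in column 1, and how to deconcatenate a file that was concatenated this way.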

Example:

From file.txt:

a 3494 3929 asd 12 fdfdf
b 2323 2390 kjk 32 kjkjk
b 1323 2390 kjk 32 kjkjk
c 2399 9009 dfd 90 sasd
c 9090 1212 jkk 01 kjkk
c 0900 2311 gfg 09 dkjs
d 0909 2322 kjk 98 dskk
d 0909 0903 kjk 98 dskk
d 0909 2422 fdd 98 cvcv

to concatenatedfile.txt:

a 3494 3929 asd 12 fdfdf
b 2323 2390 kjk 32 kjkjk b 1323 2390 kjk 32 kjkjk
c 2399 9009 dfd 90 sasd c 9090 1212 jkk 01 kjkk c 0900 2311 gfg 09 dkjs
d 0909 2322 kjk 98 dskk d 0909 0903 kjk 98 dskk d 0909 2422 fdd 98 cvcv

And vice versa, i.e. from concatenatedfile.txt back to file.txt.


3 Answers


If the values in field 1 are grouped contiguously, here is an alternative that does not use arrays:

awk 'p!=$1{if(NR>1)print RS; p=$1; s=""} {printf "%s%s", s, $0; s=OFS} END{print RS}' ORS= file

For the reverse, try something like:

awk '{for(i=2; i<=NF; i+=1) if( $i==$1 ) $i=RS $i}1' file

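However, this fails if one of the other fields can have the same value as the first field of the record being rebuilt; in that case you will need additional checks.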
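
One possible form of such an extra check, as a sketch only: if every original row is assumed to have the same number of fields (6 in the example), a field starts a new record only when it equals the first field and also sits on a row boundary. The function name `split_rows` and the variable `n` are illustrative and not part of the answer.

```shell
# Hypothetical helper: split concatenated rows back into n-field rows.
# Assumption: every original row has exactly n fields (6 in the example).
split_rows() {
    awk -v n=6 '{
        out = $1
        for (i = 2; i <= NF; i++) {
            # break only where the key repeats AND the position is a row boundary
            if ($i == $1 && (i - 1) % n == 0)
                out = out "\n" $i
            else
                out = out " " $i
        }
        print out
    }' "$@"
}
```

Usage: `split_rows concatenatedfile.txt > file.txt`. A field in the middle of a row that happens to match the key is left alone, because its position is not a multiple of `n`.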

answered 2013-02-19T21:20:28.640

Deconcatenating is fairly simple in awk:

awk 'NF % 6 != 0 { print "Garbage: ", $0 }
     NF % 6 == 0 { for (i = 1; i < NF; i += 6)
                   {
                       pad = ""
                       for (j = i; j < i+6; j++)
                       {
                           printf "%s%s", pad, $j
                           pad = " "
                       }
                       print ""
                   }
                 }'

And here's my concatenation solution. It assumes that the values in column 1 are grouped together. It won't actually care if they aren't grouped; it will just generate extra lines that it would not have done had the data been grouped.

awk 'NF % 6 != 0 { printf "\nGarbage: %s\n", $0 }
     NF % 6 == 0 { if ($1 != old && NR > 1) print ""
                   if ($1 != old) printf "%s", $0
                   else           printf " %s", $0
                   old = $1
                 }
     END         { print "" }'

You can leave out the garbage handling if you like, silently ignoring data that doesn't match.
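
As a rough sketch of what that could look like, here are both directions with the garbage reporting dropped, so that blank lines and lines whose field count is not a multiple of 6 are skipped silently. The function names `concat_rows` and `deconcat_rows` are illustrative; they still assume 6 fields per row and grouped keys.

```shell
# Illustrative wrappers; the names are not from the original answer.
# Blank lines and lines whose field count is not a multiple of 6 are skipped.

# Join consecutive rows that share the same first field.
concat_rows() {
    awk 'NF == 0 || NF % 6 { next }
         { if (!started)        printf "%s", $0      # very first row
           else if ($1 != old)  printf "\n%s", $0    # new key: start a new line
           else                 printf " %s", $0     # same key: append
           old = $1; started = 1 }
         END { if (started) print "" }' "$@"
}

# Split every line back into rows of 6 fields.
deconcat_rows() {
    awk 'NF == 0 || NF % 6 { next }
         { for (i = 1; i <= NF; i++)
               printf "%s%s", $i, (i % 6 == 0 ? "\n" : " ") }' "$@"
}
```

Usage: `concat_rows file.txt > concatenatedfile.txt`, then `deconcat_rows concatenatedfile.txt` should reproduce the well-formed rows of file.txt.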

answered 2013-02-19T22:17:58.017

Try this (it needs more than 100 MB of RAM):

awk '
    {for (i=2; i<=NF; i++) arr[$1]=arr[$1]" "$i}
    END{for (a in arr) print a, arr[a]}
' file.txt

Output:

a  3494 3929 asd 12 fdfdf
b  2323 2390 kjk 32 kjkjk 1323 2390 kjk 32 kjkjk
c  2399 9009 dfd 90 sasd 9090 1212 jkk 01 kjkk 0900 2311 gfg 09 dkjs
d  0909 2322 kjk 98 dskk 0909 0903 kjk 98 dskk 0909 2422 fdd 98 cvcv
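
Note that awk does not define the iteration order of `for (a in arr)`, so the lines are not guaranteed to come out in input order. A minimal sketch that records the first-seen order in a second array follows; the names `group_by_key`, `order` and `n` are assumptions for illustration, and all groups are still held in memory.

```shell
# Illustrative variant of the array approach: output follows the order in
# which each key first appears. Names here are hypothetical.
group_by_key() {
    awk 'NF {
        if (!($1 in arr)) {          # first time this key is seen
            order[++n] = $1
            arr[$1] = ""
        }
        for (i = 2; i <= NF; i++) arr[$1] = arr[$1] " " $i
    }
    END {
        # arr[key] starts with a space, so plain concatenation is enough
        for (k = 1; k <= n; k++) print order[k] arr[order[k]]
    }' "$@"
}
```

Usage: `group_by_key file.txt`. Like the command above, it prints each key once followed by the remaining fields of all its rows, and it does not require the keys to be grouped in the input.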
answered 2013-02-19T18:40:04.943