3

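I have a file containing entries sorted by timestamp, but it contains multiple instances of the same timestamp, each with a separate tag. I want to join all entries that share a timestamp onto one line. The timestamp is column 1.

The input file might read: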

Time,Tag,Value  
1,ABC,3  
2,ABC,2.7  
2,DEF,3.4  
3,ABC,2.8  
3,DEF,3.6  
3,GHI,2.99  
3,JKL,3.01  
4,ABC,3.42  
4,DEF,3.62  
4,JKL,3.82  

The desired output would be something like (option 1):

Time,Tag,Value  
1,ABC,3  
2,ABC,2.7,DEF,3.4  
3,ABC,2.8,DEF,3.6,GHI,2.99,JKL,3.01  
4,ABC,3.42,DEF,3.62,JKL,3.82  

Even better would be (option 2):

1,ABC,3  
2,ABC|DEF,2.7|3.4  
3,ABC|DEF|GHI|JKL,2.8|3.6|2.99|3.01  
4,ABC|DEF|JKL,3.42|3.62|3.82  

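I think I could get option 1 by writing a script with a loop. That would first require me to get a unique list of all values of "Tag", to determine how many loop iterations I need.

But I also assume that:

1) even in bash, this could be expensive for long files, and
2) there is most likely a more elegant way to do this.

Newb question. All help appreciated.

Thanks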

4

12 Answers

1

Assuming your data is ordered by time, you can use this awk solution:

parse.awk

# Use comma as input and output field separators
BEGIN { FS = OFS = "," }

# Print header and skip to next line
NR == 1 { print; next }

# If previous timestamp is the same as current append tag and value
pt == $1 {
  tag = tag "|" $2
  val = val "|" $3
}

# If not the first data line and timestamps are not equal then print
NR != 2 && pt != $1 { print pt, tag, val }

# Save previous timestamp and reset accumulator variables    
pt != $1 {
  pt  = $1
  tag = $2
  val = $3
}

END { print pt, tag, val }

Run it like this:

awk -f parse.awk infile

Output:

Time,Tag,Value
1,ABC,3
2,ABC|DEF,2.7|3.4
3,ABC|DEF|GHI|JKL,2.8|3.6|2.99|3.01
4,ABC|DEF|JKL,3.42|3.62|3.82

Or as a one-liner:

<infile awk 'BEGIN {FS=OFS=","} NR==1{print;next} pt==$1 {tag=tag"|"$2;val=val"|"$3} NR!=2&&pt!=$1 {print pt,tag,val} pt!=$1 {pt=$1;tag=$2;val=$3} END {print pt,tag,val}'
answered 2013-02-18T15:43:41.940
1

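New answer:

I realize my previous answer may be hard to read and understand, especially for a beginner. It does, however, make good use of gawk's array-sorting features, which would be very useful for handling the unique values of "Tag" that you mention in your question. After reading some of the comments, though, I believe I may have misunderstood your question, perhaps only slightly. Here's a method that doesn't care about the uniqueness of the tags and their values. It simply joins them all together. It should also be very readable and extensible. Run like: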

awk -f script.awk file

Contents of script.awk:

BEGIN {
    FS=OFS=","
}

NR==1 {
    print
    next
}

{
    tag[$1]=(tag[$1] ? tag[$1] "|" : "") $2
    val[$1]=(val[$1] ? val[$1] "|" : "") $3
}

END {
    for (i in tag) {
        print i, tag[i], val[i] | "sort -n"
    }
}

Results:

Time,Tag,Value
1,ABC,3
2,ABC|DEF,2.7|3.4
3,ABC|DEF|GHI|JKL,2.8|3.6|2.99|3.01
4,ABC|DEF|JKL,3.42|3.62|3.82

Alternatively, here's the one-liner:

awk -F, 'NR==1 { print; next } { tag[$1]=(tag[$1] ? tag[$1] "|" : "") $2; val[$1]=(val[$1] ? val[$1] "|" : "") $3 } END { for (i in tag) print i, tag[i], val[i] | "sort -n" }' OFS=, file
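
For comparison, the same accumulate-then-sort idea can be sketched in Python, the language two other answers on this page use. This is only an illustrative sketch, not part of the original answer: the function name merge_by_timestamp and the sample data are made up here, and it assumes every timestamp parses as a number (mirroring the `sort -n` pipe above).

```python
import io

def merge_by_timestamp(lines):
    """Merge rows sharing a timestamp; tags and values are joined with '|'."""
    rows = iter(lines)
    header = next(rows).strip()          # pass the header through unchanged
    tags, vals = {}, {}
    for line in rows:
        line = line.strip()
        if not line:
            continue                     # ignore blank lines
        ts, tag, val = line.split(",")
        tags.setdefault(ts, []).append(tag)
        vals.setdefault(ts, []).append(val)
    out = [header]
    # Assumption: timestamps are numeric, so sort them as numbers.
    for ts in sorted(tags, key=float):
        out.append(",".join([ts, "|".join(tags[ts]), "|".join(vals[ts])]))
    return out

# Hypothetical sample, deliberately out of order to show that the
# input does not need to be grouped for this approach.
sample = """Time,Tag,Value
1,ABC,3
3,ABC,2.8
2,ABC,2.7
3,DEF,3.6
2,DEF,3.4
"""

for row in merge_by_timestamp(io.StringIO(sample)):
    print(row)
```

Like the awk script, this keeps every group in memory until the end, so it trades memory for not needing sorted input.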

Previous answer:

Here's one way using GNU awk. Run like:

awk -f script.awk file

Contents of script.awk:

BEGIN {
    FS=OFS=","
}

NR==1 {
    print
    next
}

{
    a[$1][$2]=$3
}

END {

    for (i in a) {
        b[x++] = i
    }

    n = asort(b)

    for (j=1;j<=n;j++) {

        m = asorti(a[b[j]],c)

        for (k=1;k<=m;k++) {

            s = (s ? s "|" : "") c[k]
            r = (r ? r "|" : "") a[b[j]][c[k]]
        }

        print b[j], s, r
        s = r = ""
    }
}

Results:

Time,Tag,Value
1,ABC,3
2,ABC|DEF,2.7|3.4
3,ABC|DEF|GHI|JKL,2.8|3.6|2.99|3.01
4,ABC|DEF|JKL,3.42|3.62|3.82

Alternatively, here's the one-liner:

awk -F, 'NR==1 { print; next } { a[$1][$2]=$3 } END { for (i in a) b[x++] = i; n = asort(b); for (j=1;j<=n;j++) { m = asorti(a[b[j]],c); for (k=1;k<=m;k++) { s = (s ? s "|" : "") c[k]; r = (r ? r "|" : "") a[b[j]][c[k]] } print b[j], s, r; s = r = "" } }' OFS=, file
answered 2013-02-18T13:50:47.923
1

This would work:

awk -F, '{if($1 in a){ split(a[$1],t,","); a[$1]=t[1]"|"$2","t[2]"|"$3
}else a[$1]=$2","$3;}END{for(x in a)print x","a[x]}' file|sort -n

With your example:

kent$  awk -F, '{if($1 in a){split(a[$1],t,","); a[$1]=t[1]"|"$2","t[2]"|"$3
}else a[$1]=$2","$3;}END{for(x in a)print x","a[x]}' file|sort -n
1,ABC,3
2,ABC|DEF,2.7|3.4
3,ABC|DEF|GHI|JKL,2.8|3.6|2.99|3.01
4,ABC|DEF|JKL,3.42|3.62|3.82
answered 2013-02-18T13:32:13.917
0

If this is an operation that you would like to repeat often, I would opt for a utility script written in a more 'complete' scripting language. You can then call the script within your own bash script or use it in the command line as and when needed.

Here's a Python example:

#!/usr/bin/env python3
# --- merge_groups.py ----
import fileinput, operator, itertools
# Strip whitespace, skip blank lines, then split each row on commas
lines = (line.strip() for line in fileinput.input())
data = (line.split(",") for line in lines if line)
# groupby only merges consecutive rows that share the first column
for key, group in itertools.groupby(data, operator.itemgetter(0)):
  _, label, value = zip(*group)
  print("%s,%s,%s" % (key, "|".join(label), "|".join(value)))

Note that the script assumes that the entries with the same timestamp are already grouped together.
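
To illustrate that caveat, here is a small sketch of how itertools.groupby behaves when equal timestamps are not adjacent, and how sorting first restores the expected result. The sample rows and the helper name merge are made up for this illustration, and the sort key assumes the timestamps are numeric.

```python
import itertools
import operator

# Hypothetical rows in which timestamp "2" appears in two separate runs.
rows = [("2", "ABC", "2.7"), ("3", "ABC", "2.8"), ("2", "DEF", "3.4")]
key = operator.itemgetter(0)

def merge(rows):
    """Join the tags and values of each consecutive run of equal timestamps."""
    merged = []
    for ts, group in itertools.groupby(rows, key):
        _, tags, vals = zip(*group)
        merged.append("%s,%s,%s" % (ts, "|".join(tags), "|".join(vals)))
    return merged

# groupby starts a new group whenever the key changes, so "2" shows up twice.
unsorted_result = merge(rows)
# sorted() is stable, so rows with equal timestamps keep their original order.
sorted_result = merge(sorted(rows, key=lambda r: float(r[0])))

print(unsorted_result)  # ['2,ABC,2.7', '3,ABC,2.8', '2,DEF,3.4']
print(sorted_result)    # ['2,ABC|DEF,2.7|3.4', '3,ABC,2.8']
```

Sorting costs extra time and memory, so it is only worth doing if the input is not already grouped.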

You can use the script to process existing data files or pipe data directly to it, e.g.:

[me@home]$ ./merge_groups.py data.txt  # parse existing data file
Time,Tag,Value
1,ABC,3
2,ABC|DEF,2.7|3.4
3,ABC|DEF|GHI|JKL,2.8|3.6|2.99|3.01
4,ABC|DEF|JKL,3.42|3.62|3.82

[me@home]$ cat data.txt | ./merge_groups.py  # post-process command output
Time,Tag,Value
1,ABC,3
2,ABC|DEF,2.7|3.4
3,ABC|DEF|GHI|JKL,2.8|3.6|2.99|3.01
4,ABC|DEF|JKL,3.42|3.62|3.82
answered 2013-02-18T14:01:27.300
0

Well, you did say "all help", so would that include a Ruby solution?

require 'csv'

puts(CSV.read('f.csv').group_by(&:first).map do |k, v|
  t = v.transpose
  [k, t[1].join('|'), t[2].join('|')].join(',')
end.drop(1))
answered 2013-02-18T14:21:35.050
0

Perl isn't represented yet.

use strict;
my $skip_header = <>;
my %d;
while(<>) {
    s/\s+$//;
    my ($no, $k, $v )  = split ",";
    push @{$d{int($no)}}, [ $k,  $v ];
}
END {
    foreach my $no (sort { $a <=> $b } keys %d  )  {
        print $no, ",";
        print join("|", map { $_->[0] } @{$d{$no}});
        print ",";
        print join("|", map { $_->[1] } @{$d{$no}});
        print "\n";
    }
}

Gives:

1,ABC,3
2,ABC|DEF,2.7|3.4
3,ABC|DEF|GHI|JKL,2.8|3.6|2.99|3.01
4,ABC|DEF|JKL,3.42|3.62|3.82
answered 2013-02-18T21:56:00.633
0

In short, the sed way:

sed -ne ':a;$!N;/^\([0-9]\+\),.*\n\1,/s/\n[0-9]*//;ta;P;D'
Time,Tag,Value
1,ABC,3
2,ABC,2.7,DEF,3.4
3,ABC,2.8,DEF,3.6,GHI,2.99,JKL,3.01
4,ABC,3.42,DEF,3.62,JKL,3.82
answered 2013-02-19T08:16:53.333
0

For the first option, you could try:

awk -F, 'p x!=$1{if(p x)print s; p=s=$1} {sub($1,x); s=s $0} END{print s}' file
answered 2013-02-18T22:04:32.893
0

The first:

> awk -F, '{a[$1]=a[$1]","$2","$3}END{for(i in a)print i","substr(a[i],2)}' temp | sort
1,ABC,3
2,ABC,2.7,DEF,3.4
3,ABC,2.8,DEF,3.6,GHI,2.99,JKL,3.01
4,ABC,3.42,DEF,3.62,JKL,3.82

The second:

> awk -F, '{a[$1]=a[$1]"|"$2;b[$1]=b[$1]"|"$3}END{for(i in a)print i","substr(a[i],2)","substr(b[i],2)}' temp | sort
1,ABC,3
2,ABC|DEF,2.7|3.4
3,ABC|DEF|GHI|JKL,2.8|3.6|2.99|3.01
4,ABC|DEF|JKL,3.42|3.62|3.82
answered 2013-02-19T07:02:25.560
0

bash is probably the wrong tool. Try Python:

import fileinput
import sys

oldTime = None
for line in fileinput.input():
    line = line.strip()
    pos = line.find(',')
    time = line[0:pos]
    if oldTime == time:
        sys.stdout.write(',')
        sys.stdout.write(line[pos+1:])
    else:
        if oldTime is not None:
            sys.stdout.write('\n')
        sys.stdout.write(line)

    oldTime = time

sys.stdout.write('\n')
answered 2013-02-18T13:35:39.260
0

@AaronDigulla and @Kent have some nice solutions, but if you have to (or want to) use bash, here's one:

for ts in `cat inputfile | cut --delimiter="," --fields=1 | uniq`
do
  p1="";
  p2="";
  for line in `grep "^${ts}," inputfile | cut --delimiter="," --fields=2-`
  do
    f1=`echo ${line} | cut --delimiter="," --fields=1`;
    f2=`echo ${line} | cut --delimiter="," --fields=2`;
    p1=${p1}"|"$f1;
    p2=${p2}"|"$f2;
  done
  echo ${ts}","${p1#?}","${p2#?};
done
answered 2013-02-18T14:10:34.170
0

Just in case you wanted yet another awk solution!

function read() {
  split($0, buf, ",")
}

function write() {
  for (i = 1; i < length(buf); i++) {
    printf "%s,", buf[i]
  }
  print buf[length(buf)]
}

BEGIN {
  FS = ","
}

NR == 1 {
  print
  next
}

NR == 2 {
  time = $1  # remember the first timestamp so later rows can merge into it
  read()
  next
}

{
  if ($1 != time) { # new time
    time = $1
    write()
    read()
  } else { # repeated time
    for (i = 2; i <= NF; i++) {
      buf[i] = buf[i] "|" $i
    }
  }
}

END {
  write()
}

I'm not very good at awk, so I had to put the emphasis on readability!

answered 2013-02-18T14:10:45.523