linux - Linux - Join multiple CSV files into one

Question

I'm trying to join together multiple results stored in CSV files, with the first column as the index.

The problem is that I have a various number of files, and the index results can vary from file to file, with some having less results.

In Linux, what would be the easiest way to script the process to join all the files in one, with blank fields for empty results? I have experienced Linux join, paste and column with no luck. Mybe I should try with another tool, or maybe Python or Perl?

The data files look like this:

file1:

          header1   header2 header3
result_A  10         11      12
result_B  13         14      15
result_C  16         17      18
result_D  19         20      21
result_E  22         23      24
result_F  25         26      27

file2:

          header1   header2  header3
result_B  40         41       42
result_F  43         44       45

file3:

          header1  header2  header3
result_C  60       61       62
result_D  63       64       65
result_F  66       67       68

And hopefully the results should look like this:

          file1    file1      file1    file2    file2      file2     file3    file3      file3
          header1  header2  header3    header1  header2  header3     header1  header2  header3
result_A  10       11       12                                                           
result_B  13       14       15         40       41       42                             
result_C  16       17       18                                       60       61       62
result_D  19       20       21                                       63       64       65
result_E  22       23       24                                                           
result_F  25       26       27         43       44       45          66       67       68

score 2 · Accepted Answer

UNIX join should get you a long way:

join -a 1 -e '0' "-t  " -j 1 
    <(sort <(join -a 1 -e '0' "-t    " -j 1 <(sort file1) <(sort file2))) 
    <(sort file3)

(all on one line). Note that "-t " has the TAB character within quotes. Enter it using ^V<Tab>.

If you know the input is sorted, it would be better to use

join -a 1 -e '0' "-t  " -j 1 
    <(join -a 1 -e '0' "-t    " -j 1 file1 file2) 
    file3

(all on one line) prints:

id      header1 header2 header3 header1 header2 header3 header1 header2 header3
result_A        10      11      12
result_B        13      14      15      40      41      42
result_C        16      17      18      60      61      62
result_D        19      20      21      63      64      65
result_E        22      23      24
result_F        25      26      27      43      44      45      66      67      68

Now, as you can see, on my Cygwin system -e '0' apparently doesn't work as advertised. I'd suggest trying this on a different system though, as I don't imagine having uncovered such an essential bug in a standard UNIX utility.

score 1 · Accepted Answer

1

csvjoin from csvkit should also work.

于 2017-07-18T04:25:31.590 回答

linux - Linux - Join multiple CSV files into one

2 回答 2

Related

Reference