bash - bash: clean outer join of three files, preserving file membership

Question

Consider the following three files, each with its header on the first line:

File 1:

id name in1
1 jon 1
2 sue 1

File 2:

id name in2
2 sue 1
3 bob 1

File 3:

id name in3
2 sue 1
3 adam 1

I would like to merge these files to get the following output, merged_files:

id name in1 in2 in3
1 jon 1 0 0
2 sue 1 1 1
3 bob 0 1 0
3 adam 0 0 1

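This request has several special features that I have not found implemented in a convenient way in grep/sed/awk/join etc. Edit: for simplicity, you may assume that the three files are already sorted.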

score 3 · Accepted Answer

Code for GNU awk:

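{
# Header line: remember the indicator column name (in1, in2, ...) of this file
if ($1=="id") { v[i++]=$3; next }
b[$1,$2]=$1" "$2        # every distinct "id name" pair seen so far
c[i-1,$1" "$2]=$3       # indicator value of this pair in the current file
}

END {
printf ("id name")
# Loop by index, not "for (x in v)", so the columns keep their input order
for (x=0; x<i; x++) printf (" %s", v[x]); printf ("\n")
for (y in b)  {         # row order is unspecified in awk
    printf ("%s", b[y])
    for (z=0; z<i; z++) if (c[z,b[y]]==0) {printf (" 0")} else printf (" %s", c[z,b[y]])
    printf ("\n")
    }
}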

$ cat file?
id name in1
1 jon 1
2 sue 1
id name in2
2 sue 1
3 bob 1
id name in3
2 sue 1
3 adam 1

$ awk -f prog.awk file?
id name in1 in2 in3
3 bob 0 1 0
3 adam 0 0 1
1 jon 1 0 0
2 sue 1 1 1
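
The row order in the sample run above comes from awk's `for (... in ...)` iteration, which is unspecified. As a minimal sketch (not the answer's own code), the variant below records each "id name" pair the first time it is seen and prints rows in that order. The function name `merge_membership`, the array names, and the temporary files are all assumptions made for illustration.

```shell
# Sketch: merge membership flags, keeping rows in first-seen order.
merge_membership() {
    awk '
    FNR == 1 { ncol++; col[ncol] = $3; next }   # first line of each file is its header
    {
        key = $1 " " $2
        if (!(key in seen)) { seen[key] = 1; order[++n] = key }
        flag[key, ncol] = $3                    # flag of this pair in the current file
    }
    END {
        printf "id name"
        for (j = 1; j <= ncol; j++) printf " %s", col[j]
        printf "\n"
        for (k = 1; k <= n; k++) {
            key = order[k]
            printf "%s", key
            for (j = 1; j <= ncol; j++) {
                val = ((key, j) in flag) ? flag[key, j] : 0   # 0 when the pair is absent
                printf " %s", val
            }
            printf "\n"
        }
    }' "$@"
}

# Recreate the three sample files from the question in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' 'id name in1' '1 jon 1' '2 sue 1'  > "$tmp/file1"
printf '%s\n' 'id name in2' '2 sue 1' '3 bob 1'  > "$tmp/file2"
printf '%s\n' 'id name in3' '2 sue 1' '3 adam 1' > "$tmp/file3"

result=$(merge_membership "$tmp/file1" "$tmp/file2" "$tmp/file3")
printf '%s\n' "$result"
rm -rf "$tmp"
```

Because it relies on `FNR == 1` to spot headers, this sketch expects the files as separate arguments rather than concatenated on standard input.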

score 3 · Accepted Answer

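This is very similar to the problem solved in "Bash script to find matching rows from multiple CSV files". It is not identical, but it is very close. (So close that I only had to remove three sort commands, change the three sed commands slightly, change the file names, change the 'missing' value from no to 0, and change the replacement in the final sed from a comma to a space.)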

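The join command together with sed (and usually sort too, but the data is already sufficiently sorted) are the primary tools needed. Assume that : does not appear in the original data. To record the presence of a row in a file, we need a 1 field in the file (it is nearly there); we will have join supply the 0 when there is no match. The 1 at the end of each non-heading line needs to become :1, and the last field in the heading also needs to be preceded by a :. Then, using bash's process substitution, we can write: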

$ sed 's/[ ]\([^ ]*\)$/:\1/' file1 |
> join -t: -a 1 -a 2 -e 0 -o 0,1.2,2.2     - <(sed 's/[ ]\([^ ]*\)$/:\1/' file2) |
> join -t: -a 1 -a 2 -e 0 -o 0,1.2,1.3,2.2 - <(sed 's/[ ]\([^ ]*\)$/:\1/' file3) |
> sed 's/:/ /g'
id name in1 in2 in3
1 jon 1 0 0
2 sue 1 1 1
3 adam 0 0 1
3 bob 0 1 0
$

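The sed command (used three times) adds a : before the last field on each line of the files. The joins are very nearly symmetric. -t: specifies that the field separator is the colon; -a 1 and -a 2 mean that when a line has no match in the other file, it is still included in the output; -e 0 means that a 0 is generated in the output for a field that is missing because there was no match; and the -o option specifies the output columns. For the first join, -o 0,1.2,2.2 makes the output the join column (0), then the second column (the 1) from each of the two files. The second join has 3 columns in its first input, so it specifies -o 0,1.2,1.3,2.2. The argument - on its own means "read standard input". The <(...) notation is "process substitution", where a file name (usually /dev/fd/NN) is provided to the join command, and that file contains the output of the command inside the parentheses. The output is then filtered through sed once more to replace the colons with spaces, yielding the desired output.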
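
To see what the sed and join steps each contribute, here is a minimal sketch of the first stage only, run on headerless sample rows. The scratch directory and the file names (f1, f2 and their .colon versions) are assumptions for illustration, and temporary files stand in for process substitution so that the sketch also runs under a plain POSIX sh.

```shell
tmp=$(mktemp -d)

# Headerless sample rows, mirroring files 1 and 2 from the question.
printf '%s\n' '1 jon 1' '2 sue 1' > "$tmp/f1"
printf '%s\n' '2 sue 1' '3 bob 1' > "$tmp/f2"

# Step 1: replace the last space with a colon, so that "id name"
# becomes a single join key and the flag becomes field 2.
sed 's/[ ]\([^ ]*\)$/:\1/' "$tmp/f1" > "$tmp/f1.colon"
sed 's/[ ]\([^ ]*\)$/:\1/' "$tmp/f2" > "$tmp/f2.colon"
cat "$tmp/f1.colon"

# Step 2: full outer join on the key. -a 1 -a 2 keeps unpaired lines
# from both inputs, and -e 0 fills the flag that is missing for them.
joined=$(LC_ALL=C join -t: -a 1 -a 2 -e 0 -o 0,1.2,2.2 "$tmp/f1.colon" "$tmp/f2.colon")
printf '%s\n' "$joined"

rm -rf "$tmp"
```

The cat shows the colon-delimited form of the first file (1 jon:1 and 2 sue:1), and the join then yields 1 jon:1:0, 2 sue:1:1 and 3 bob:0:1, one flag column per input file.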

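The only difference from the desired output is that 3 bob comes after 3 adam; it is not clear on what basis you ordered them the other way round in the desired output. If it matters, a way can be devised to resolve the order differently (for example sort -k1,1 -k3,5, except that this sorts the label line after the data; there are workarounds for that if necessary).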
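
One possible workaround for the header problem, sketched under assumptions: read the label line off first, print it unchanged, and sort only the remaining rows. The function name `sort_keep_header` is hypothetical, and the key -k3,5r (the flag columns in reverse) is my guess at an ordering that reproduces the question's desired output; it is not something the answer specifies.

```shell
# Sketch: keep the first line in place and sort only the data rows.
sort_keep_header() {
    IFS= read -r header            # consume just the label line
    printf '%s\n' "$header"
    # Numeric id first, then the flag columns in reverse order.
    LC_ALL=C sort -k1,1n -k3,5r
}

# Output of the join pipeline, as shown in the answer.
merged='id name in1 in2 in3
1 jon 1 0 0
2 sue 1 1 1
3 adam 0 0 1
3 bob 0 1 0'

sorted=$(printf '%s\n' "$merged" | sort_keep_header)
printf '%s\n' "$sorted"
```

With this key, 3 bob 0 1 0 sorts ahead of 3 adam 0 0 1, and the header stays on the first line because sort never sees it.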

score 2 · Accepted Answer

This awk script will do what you want:

$1=="id"&&$2=="name"{
    ins[$3]= 1;
    lastin = $3;
}
$1!="id"||$2!="name" {
    ids[$1] = 1;
    names[$2] = 1;
    a[$1,$2,lastin]= $3
    used[$1,$2] = 1;
}
END {
    printf "id name"
    for (i in ins) {
        printf " %s", i
    }
    printf "\n"
    for (id in ids) {
        for (name in names) {
            if (used[id,name]) {
                printf "%s %s", id, name
                for (i in ins) {
                    printf " %d", a[id,name,i]
                }
                printf "\n"
            }
        }
    }
}

Assuming your files are called list1, list2, etc., and the awk file is script.awk, you can run it like this:

$ cat list* | awk -f script.awk
id name in1 in2 in3
1 jon 1 0 0
2 sue 1 1 1
3 bob 0 1 0
3 adam 0 0 1

I'm sure there is a shorter and simpler way to do it, but this is all I could come up with at 1:30 in the morning :)

score 0 · Accepted Answer

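Here is something I wrote a while ago. I posted it online, and am posting it here too, so that I can find it the next time I look for it. It is a bit of a hack, but it supports outer, left, exclusive, etc. joins, duplicate handling (drop or multiply), and so on.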

https://code.google.com/p/ea-utils/source/browse/trunk/clipper/xjoin

TODO: better handling of headers, and handle streaming input.

Usage: xjoin [options] [:]<operator> <f1> <f2> [...*]

Joins file 1 and file 2 by the first column, suitable
for arbitrarily large files (disk-based sort).

Operator is one of:

# Pasted ops, combines rows:

  in[ner]   return rows in common
  le[ft]    return rows in common, left joined
  ri[ght]   return rows in common, right joined
  ou[ter]   return all rows, outer joined

# Exclusive (not pasted) ops, only return rows from 1 file:

  ex[clude] return only those rows with nothing in common (see -f)
  xl[eft]   return left file rows that are not in right file
  xr[ight]  return right file rows that are not in left file

Common options:

  -1,-2=N     per file, column number to join on (def 1)
  -k=N        set the key column to N (for both files)
  -d    STR   column delimiter (def tab)
  -q    STR   quote char (def none)
  -h    [N]   files have headers (optionally, N is the file number)
  -u    [N]   files may contain duplicate entries, only output first match
  -s    [N]   files are already sorted, don't sort first
  -n          numeric sort key columns
  -p          prefix headers with filename/
  -f          prefix rows with the input file name (op:ex only)
