linux - How do I extract one column from multiple files and paste those columns into a single file?

Question

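I want to extract the fifth column from multiple files, which are named in numerical order, and then paste those columns side by side into one output file.

The file names look like this: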

sample_problem1_part1.txt
sample_problem1_part2.txt

sample_problem2_part1.txt
sample_problem2_part2.txt

sample_problem3_part1.txt
sample_problem3_part2.txt
......

Each problem (1, 2, 3, ...) has two parts (part1, part2). Every file has the same number of lines. The contents look like this:

sample_problem1_part1.txt
1 1 20 20 1
1 7 21 21 2
3 1 22 22 3
1 5 23 23 4
6 1 24 24 5
2 9 25 25 6
1 0 26 26 7

sample_problem1_part2.txt
1 1 88 88 8
1 1 89 89 9
2 1 90 90 10
1 3 91 91 11
1 1 92 92 12
7 1 93 93 13
1 5 94 94 14

sample_problem2_part1.txt
1 4 330 30 a
3 4 331 31 b
1 4 332 32 c
2 4 333 33 d
1 4 334 34 e
1 4 335 35 f
9 4 336 36 g

The output should look like this, with the columns ordered problem1_part1, problem1_part2, problem2_part1, problem2_part2, problem3_part1, problem3_part2, and so on:

1 8 a ...
2 9 b ...
3 10 c ...
4 11 d ...
5 12 e ...
6 13 f ...
7 14 g ...

I am currently using:

 paste sample_problem1_part1.txt sample_problem1_part2.txt > \
     sample_problem1_partall.txt
 paste sample_problem2_part1.txt sample_problem2_part2.txt > \
     sample_problem2_partall.txt
 paste sample_problem3_part1.txt sample_problem3_part2.txt > \
     sample_problem3_partall.txt

Then:

for i in `find . -name "sample_problem*_partall.txt"`
do
    l=`echo $i | sed 's/sample/extracted_col_/'`
    `awk '{print $5, $10}'  $i > $l`
done

And finally:

paste extracted_col_problem1_partall.txt \
      extracted_col_problem2_partall.txt \
      extracted_col_problem3_partall.txt > \
    extracted_col_problemall_partall.txt

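This works for a handful of files, but it is a crazy approach when the number of files is large (more than 4000). Can anyone help me with a simpler solution that can handle many files? Thanks!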

score 8 · Accepted Answer

Here's one way, using awk and a sorted list of files:

awk '{ a[FNR] = (a[FNR] ? a[FNR] FS : "") $5 } END { for(i=1;i<=FNR;i++) print a[i] }' $(ls -1v *)

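Result (when run in a directory containing only the three sample files shown in the question):

1 8 a
2 9 b
3 10 c
4 11 d
5 12 e
6 13 f
7 14 g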

Explanation:

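For each line of each input file:
- Append the value of column 5 to the array element indexed by the line's number within its file (FNR).
- (a[FNR] ? a[FNR] FS : "") is a ternary expression that builds each array value up into an output record. It asks whether the array already holds a value for this line number. If it does, the existing value followed by the default field separator (FS, a single space) is placed before the fifth column. Otherwise nothing is prepended and the element is simply set to the fifth column.
At the end of the script:
- Use a C-style loop to iterate over the array, printing each element. The loop runs up to FNR, the line count of the last file read, which is fine here because every file has the same number of lines.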

score 1 · Answer

For only ~4000 files, you should be able to simply run:

 find . -name 'sample_problem*_part*.txt' | xargs paste

If find returns the names in the wrong order, pipe its output through sort:

 find . -name 'sample_problem*_part*.txt' | sort ... | xargs paste
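
Since paste joins whole lines, the wanted values end up in fields 5, 10, 15 and so on of the pasted output. The following is a minimal sketch of one way to pick them out, generalising the asker's awk '{print $5, $10}'. The helper name every_nth_field is hypothetical, and sort -V (GNU version sort) is only one possible choice for the sort options left open above. It assumes every file has exactly five whitespace-separated fields and that xargs fits all names into a single paste invocation. The demo data are the first two lines of the question's sample files.

```shell
# Hypothetical helper: keep every Nth field of each input line
# (fields N, 2N, 3N, ...).
every_nth_field() {
    awk -v n="$1" '{
        out = ""
        for (i = n; i <= NF; i += n)
            out = out (i > n ? OFS : "") $i
        print out
    }'
}

# Demo with two 5-column files in a temporary directory.
demo=$(mktemp -d)
printf '1 1 20 20 1\n1 7 21 21 2\n' > "$demo/sample_problem1_part1.txt"
printf '1 1 88 88 8\n1 1 89 89 9\n' > "$demo/sample_problem1_part2.txt"

find "$demo" -name 'sample_problem*_part*.txt' | sort -V |
    xargs paste -d ' ' | every_nth_field 5
# prints:
# 1 8
# 2 9
```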

score 1 · Answer

# print filenames in sorted order
find -name sample\*.txt | sort |
# extract 5-th column from each file and print it on a single line
xargs -n1 -I{} sh -c '{ cut -s -d " " -f 5 $0 | tr "\n" " "; echo; }' {} |
# transpose
python3 transpose.py '?'

where transpose.py is:

#!/usr/bin/env python3
"""Write lines from stdin as columns to stdout."""
import sys
from itertools import zip_longest

# Fill value for short columns: first command-line argument, or '-' by default.
missing_value = sys.argv[1] if len(sys.argv) > 1 else '-'
for row in zip_longest(*[column.split() for column in sys.stdin],
                       fillvalue=missing_value):
    print(" ".join(row))

Output:

1 8 a
2 9 b
3 10 c
4 11 d
5 ? e
6 ? f
? ? g

This assumes the first and second files have fewer lines than the third (missing values are replaced with '?').
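
If Python is not at hand, the transpose step can also be sketched in awk. This is an alternative written for illustration, not the answer's script. The function name transpose and its optional fill-value argument are assumptions that mirror transpose.py. It holds all input in memory, which should be acceptable for a few thousand short lines.

```shell
# Hypothetical helper: write input lines as columns, padding short lines
# with a fill value ($1, default "-").
transpose() {
    awk -v fill="${1:--}" '
        {
            if (NF > maxnf) maxnf = NF               # widest input line so far
            for (i = 1; i <= NF; i++) cell[NR, i] = $i
        }
        END {
            for (i = 1; i <= maxnf; i++) {           # one output row per input field
                line = ""
                for (r = 1; r <= NR; r++)
                    line = line (r > 1 ? OFS : "") (((r, i) in cell) ? cell[r, i] : fill)
                print line
            }
        }'
}

printf '1 2 3\n8 9\na b c d\n' | transpose '?'
# prints:
# 1 8 a
# 2 9 b
# 3 ? c
# ? ? d
```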

score 1 · Answer

Try this. My script assumes that every file has the same number of lines.

# get number of lines
lines=$(wc -l sample_problem1_part1.txt | cut -d' ' -f1)

for ((i=1; i<=$lines; i++)); do
  for file in sample_problem*; do
    # get line number $i and delete everything except the last column
    # and then print it
    # echo -n means that no newline is appended
    echo -n $(sed -n ${i}'s%.*\ %%p' $file)" "
  done
  echo
done

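This works. For 4800 files of 7 lines each, it takes 2 min 57.865 s on an AMD Athlon(tm) X2 Dual Core Processor BE-2400.

PS: My script's running time grows linearly with the number of lines. Merging files with 1000 lines each would take a very long time. You should consider learning awk and using steve's script. I tested it: for 4800 files with 1000 lines each, it took only 65 seconds!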
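
As a rough back-of-the-envelope check of that scaling, the nested loop starts one sed process per (line, file) pair, so the measured time can be divided by files × lines and extrapolated. The sketch below does that arithmetic. The function name estimate_runtime is hypothetical, and the extrapolation assumes the cost per sed call stays constant, which is probably optimistic because each call would also have to read a longer file.

```shell
# Hypothetical helper: extrapolate the nested-loop runtime.
# $1 = measured seconds, $2 = number of files,
# $3 = lines per file in the measurement, $4 = lines per file to estimate for
estimate_runtime() {
    awk -v t="$1" -v f="$2" -v l="$3" -v target="$4" 'BEGIN {
        per_call = t / (f * l)                  # seconds per sed invocation
        printf "%.1f ms per sed call, about %.1f hours for %d lines\n",
               per_call * 1000, per_call * f * target / 3600, target
    }'
}

# 2 min 57.865 s = 177.865 s for 4800 files of 7 lines each
estimate_runtime 177.865 4800 7 1000
# prints: 5.3 ms per sed call, about 7.1 hours for 1000 lines
```

Set against the 65 seconds reported for the awk script, this gives an idea of how much a single pass over the files saves.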

score 0 · Answer

You can pass the awk output to paste and redirect it to a new file, like this:

paste <(awk '{print $3}' file1) <(awk '{print $3}' file2) <(awk '{print $3}' file3) > file.txt
