shell - Counts of various elements in various files

Question

So I have about 1000 files, each with multiple columns, but I'm only interested in some stats on two of those columns. If $4 is something like a star's spectral class (i.e. a unique value) and $5 in each of these files is a result, like seen, unseen, unknown, etc., is there a recommended way to grep or awk out the stats across the 1000 or so files, so that I get something like:

Type O, #verified, #not-verified, #property-j, ...
Type B, ...
Type A, ...
.
.
.
Type i,

Where, in each file, you'd see something like:

$1, $2, $3, Spectral Type, Result
foo, foo, foo, A, verified
foo, foo, foo, G, verified
foo, foo, foo, A, unknown
foo, foo, foo, F, verified
foo, foo, foo, G, verified
foo, foo, foo, K, verified
foo, foo, foo, K, seen

score 1 · Accepted Answer

perl -aF, -nle '{${$h{@F[3]}}{@F[4]}=1}END{while(($k,$v)=each%h){print"$k, @{[keys%$v]}";}}' files

Edit

Why this solves the problem.

For information on the flags, type

perl --help

Algorithm

{..} END{..}    # first block is evaluated on each line, END block only once at the end

@F[3] should be written $F[3]. The difference is that @F[3] is a one-element array slice, whereas $F[3] is the element itself.

${$h{$F[3]}}    # gets the value, or creates and returns a new entry, in hash %h for key $F[3], the fourth element of array @F
${..}{$F[4]}=1  # assumes the value in hash %h is a HASHREF and creates a new entry in it with key $F[4]

The whole expression can be written in one go (perhaps more simply), but the form above was the first syntax that came to mind:

$h{$F[3]}{$F[4]}=1

END block

while(($k,$v)=each%h)  # loop over entries of hash %h
"@{[..]}"   # is a trick to display array in a double quote expression
%$v         # dereferences HASHREF

A solution closer to what the question asks for:

perl -lnaF'/\s*,\s*/' -e '{$h{$F[3]}{$F[4]}=1;}END{while(($k,$v)=each%h){print("Type $k, ",join(", ",map("#$_",keys%$v)));}}' files

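Note: in this case the parentheses after print are optional, and so is the semicolon before the closing curly brace; both are kept for readability.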
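The one-liners above record which results occur for each type, not how often. If the question is read as asking for counts, a minimal sketch in shell with awk could look like the following. The function name count_results and the file name sample.csv are hypothetical, and the sketch assumes that every file starts with a header row like the one shown in the question; drop the FNR test if that is not the case. It prints one line per (type, result) pair rather than one line per type.

```shell
# Hypothetical helper: count (type, result) pairs across all files given.
count_results() {
  awk -F' *, *' '
    FNR == 1 { next }            # assumption: line 1 of every file is a header
    { n[$4 ", " $5]++ }          # tally each (type, result) pair
    END { for (k in n) print "Type " k ", " n[k] }
  ' "$@" | LC_ALL=C sort         # awk iteration order is unspecified, so sort
}

# Demo on the sample rows from the question.
tmp=$(mktemp -d)
cat > "$tmp/sample.csv" <<'EOF'
$1, $2, $3, Spectral Type, Result
foo, foo, foo, A, verified
foo, foo, foo, G, verified
foo, foo, foo, A, unknown
foo, foo, foo, F, verified
foo, foo, foo, G, verified
foo, foo, foo, K, verified
foo, foo, foo, K, seen
EOF
count_results "$tmp/sample.csv"
```

With the real data the call would be something like count_results followed by the list of files; awk resets FNR for each file, so the header of every file is skipped.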

score 1 · Accepted Answer

If your question is "How do I generate output of the form 'Type $4, $5', where $4 and $5 are the 4th and 5th columns of the input?", one solution is:

for i in list of input file; do
  awk '{print "Type "$4, $5}' "$i" > "$i.result"
done

This gives the output you want, but relies on none of the columns containing whitespace. If there may be whitespace, you can do:

 awk '{printf( "Type %s, %s\n", $4, $5 )}' FS=, "$i" > "$i.result"

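But you may want to trim the extra whitespace this produces. Note that although the example hard-codes the list of input files as the four names "list", "of", "input" and "file", I don't expect you to type the names in. Instead, you should generate them somehow; I am just demonstrating one (of many!) ways to iterate over a set of files. The core of this question seems to be the awk processing, not the iteration.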
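One way to handle both points, sketched under assumptions: let a glob generate the file list, and give awk a separator that swallows the spaces around each comma so that the fields arrive already trimmed. The directory, the .txt suffix and the two sample files below are made up for the demonstration.

```shell
# Hypothetical input directory with two tiny sample files.
dir=$(mktemp -d)
printf 'foo, foo, foo, A, verified\n' > "$dir/a.txt"
printf 'foo, foo, foo, K, seen\n'     > "$dir/b.txt"

for i in "$dir"/*.txt; do
  # A comma plus any surrounding spaces is the separator, so $4 and $5
  # carry no leading or trailing blanks.
  awk -F' *, *' '{ printf("Type %s, %s\n", $4, $5) }' "$i" > "$i.result"
done

cat "$dir"/*.result
```

The glob is expanded once, before the loop starts, so the .result files written during the loop are not picked up as inputs.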

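A second reading of the question suggests that you have only one line per input file and that you want the results collected in a single file. In that case, simply do: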

cat list of all files | awk '{print "Type "$4, $5}'

score 1 · Accepted Answer

If the delimiter is just a comma and you don't need CSV parsing that handles escapes, use the cut utility:

cat "$file" | cut -d, -f4
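To get from there to counts, cut can keep both columns and hand them to sort and uniq -c. This is only a sketch: the function name pair_counts and the sample files are hypothetical, the leading space after each comma is left in place, and a header row, if the files have one, would be counted as a pair of its own.

```shell
# Hypothetical helper: count identical (type, result) pairs in the given files.
pair_counts() {
  cut -d, -f4,5 "$@" | LC_ALL=C sort | uniq -c
}

# Demo on two small made-up files.
tmp=$(mktemp -d)
printf 'foo, foo, foo, A, verified\nfoo, foo, foo, G, verified\n' > "$tmp/one.txt"
printf 'foo, foo, foo, G, verified\nfoo, foo, foo, K, seen\n'     > "$tmp/two.txt"

pair_counts "$tmp"/*.txt
```

Each output line is a count followed by the pair, for example a 2 in front of "G, verified" for the demo files.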
