linux - Am I using the correct command?

Question

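I am trying to write a one-line command in the terminal to count all the unique "gene-MIR" entries in a very large file. Each "gene-MIR" is followed by a series of digits, e.g. gene-MIR334223, gene-MIR633235, gene-MIR53453, etc., and the same "gene-MIR" can occur several times, e.g. gene-MIR342433 may appear 10 times in the file.

My question is: how can I write a command that will annotate the unique "gene-MIR" entries present in my file?

The commands I have been using so far are: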

grep -c "gene-MIR" myfile.txt | uniq
grep "gene-MIR" myfile.txt | sort -u

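The first command gives me a count; however, I believe it does not take the series of digits after "MIR" into account and only counts how many times "gene-MIR" itself occurs.

Thanks!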

[1]: https://i.stack.imgur.com/Y7EcD.png

score 0 · Accepted Answer

If you have information like this:

Inf1
Inf2
Inf1
Inf2

and you want to know how many distinct "Inf" values there are, you always need to sort the data first, because uniq only collapses identical lines that are adjacent. Only afterwards can you start counting.
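
The effect of sorting first can be checked with a minimal sketch on the four lines above. The temporary file and the variable names are only illustrative assumptions, not part of the original question:

```shell
# Hypothetical sample file holding the four lines from the example above.
tmp=$(mktemp)
printf 'Inf1\nInf2\nInf1\nInf2\n' > "$tmp"

# Without sorting, no two identical lines are adjacent,
# so uniq collapses nothing and all 4 lines remain.
unsorted_kinds=$(uniq "$tmp" | wc -l)

# After sorting, duplicates sit next to each other,
# so uniq leaves one line per kind (2 lines).
sorted_kinds=$(sort "$tmp" | uniq | wc -l)

echo "without sort: $unsorted_kinds, with sort: $sorted_kinds"
rm -f "$tmp"
```

Here sort | uniq could also be written as sort -u; the two-step form is used only to make the role of uniq visible.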

Edit

I've created a similar file containing the examples mentioned in the requester's comment:

Nonsense
gene-MIR4232
gene-MIR2334
gene-MIR93284
gene-MIR4232
gene-MIR2334
gene-MIR93284
More nonsense

I ran both commands from the question on that file:

grep -c "gene-MIR" myfile.txt | uniq

Which results in 6, just like the following command:

grep -c "gene-MIR" myfile.txt

Why? Because grep -c answers the question "How many lines contain the string "gene-MIR"?" and prints a single number, so uniq has nothing left to deduplicate.
This is clearly not the requested information.
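
A small sketch that reproduces this on the sample file above (the temporary file and variable names are assumptions made for illustration):

```shell
# Recreate the sample file from above in a temporary location.
tmp=$(mktemp)
printf '%s\n' Nonsense gene-MIR4232 gene-MIR2334 gene-MIR93284 \
    gene-MIR4232 gene-MIR2334 gene-MIR93284 'More nonsense' > "$tmp"

# grep -c prints one number: how many lines match the pattern.
line_count=$(grep -c "gene-MIR" "$tmp")

# Piping that single line through uniq cannot change it.
piped_count=$(grep -c "gene-MIR" "$tmp" | uniq)

echo "grep -c: $line_count, grep -c | uniq: $piped_count"
rm -f "$tmp"
```

Both variables hold 6, the number of matching lines, regardless of how many different gene-MIR identifiers there are.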

The other command is not correct either:

grep "gene-MIR" myfile.txt | sort -u

The result:

gene-MIR2334
gene-MIR4232
gene-MIR93284

Explanation:
grep "gene-MIR" ... means: show all lines that contain "gene-MIR".
| sort -u means: sort those lines and, where the same line occurs several times, show it only once.

This is not what the requester wants either. I therefore propose the following:

grep "gene-MIR" myfile.txt | sort | uniq -c

With the following result:

      2 gene-MIR2334
      2 gene-MIR4232
      2 gene-MIR93284

This is more what the requester is looking for, I presume.

What does it mean? grep "gene-MIR" myfile.txt : show only the lines that contain "gene-MIR".
| sort : sort those lines, which gives an intermediate result like this:

    gene-MIR2334
    gene-MIR2334
    gene-MIR4232
    gene-MIR4232
    gene-MIR93284
    gene-MIR93284

| uniq -c : group identical adjacent lines together and show the count for each distinct line.

Unfortunately, the example is badly chosen, as every entry occurs exactly twice. To make things clearer, I've created another "myfile.txt":

Nonsense
gene-MIR4232
gene-MIR2334
gene-MIR93284
gene-MIR2334
gene-MIR2334
gene-MIR93284
More nonsense

I've applied the same command again:

grep "gene-MIR" myfile.txt | sort | uniq -c

With the following result:

      3 gene-MIR2334
      1 gene-MIR4232
      2 gene-MIR93284

Here you can see much more clearly that the proposed command is correct.

... and your next question is: "Yes, but is it possible to sort the result by count?", to which I answer:

grep "gene-MIR" myfile.txt | sort | uniq -c | sort -n

With the following result:

      1 gene-MIR4232
      2 gene-MIR93284
      3 gene-MIR2334

Have fun!

score 0 · Accepted Answer

Assuming all entries are on separate lines, try this:

grep "gene-MIR" myfile.txt | sort | uniq -c

If the entries are mixed in with other text and the system has GNU grep, try the following:

grep -o 'gene-MIR[0-9]*' myfile.txt | sort | uniq -c
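
As a rough illustration of why -o matters, here is a sketch on a made-up file in which identifiers are embedded in other text and sometimes share a line. The file contents and variable names are invented for this example and do not come from the question:

```shell
# Hypothetical input: identifiers mixed with other text, several per line.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
chr1 gene-MIR4232 exon; also gene-MIR2334 nearby
chr2 ID=gene-MIR2334;Name=x
chr3 gene-MIR2334 and gene-MIR93284
EOF

# -o prints each match on its own line instead of the whole matching line,
# so sort | uniq -c counts identifiers rather than lines.
per_entry=$(grep -o 'gene-MIR[0-9]*' "$tmp" | sort | uniq -c)

echo "$per_entry"
rm -f "$tmp"
```

In this sample gene-MIR2334 is counted three times even though only three lines exist in total. Note that [0-9]* also matches a bare "gene-MIR" with no digits; 'gene-MIR[0-9][0-9]*' would require at least one digit, if that distinction matters for the data.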

To get the total number of matches:

grep -o 'gene-MIR[0-9]*' myfile.txt  | wc -l
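
The wc -l form counts every occurrence, duplicates included. If what is wanted is a single number of distinct identifiers, one possible variant adds sort -u before wc -l. A sketch, assuming a small sample file (the file contents and variable names are illustrative only):

```shell
# Hypothetical sample: three distinct identifiers, six occurrences in total.
tmp=$(mktemp)
printf '%s\n' Nonsense gene-MIR4232 gene-MIR2334 gene-MIR93284 \
    gene-MIR2334 gene-MIR2334 gene-MIR93284 'More nonsense' > "$tmp"

# Every occurrence, duplicates included.
total=$(grep -o 'gene-MIR[0-9]*' "$tmp" | wc -l)

# sort -u keeps one copy of each identifier, so wc -l counts distinct ones.
distinct=$(grep -o 'gene-MIR[0-9]*' "$tmp" | sort -u | wc -l)

echo "total: $total, distinct: $distinct"
rm -f "$tmp"
```

For this sample, total is 6 and distinct is 3.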
