python - Get the number of unique values

Question

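I have some text files with two columns. The first column is the position of the amino acid and the second is its name. I want to get the total count of each amino acid across all the files. I only need the unique values. In the example below, the total for LEU is 2 (one from file1 and one from file2). Any suggestions would be much appreciated!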

File 1

54   LEU
54   LEU
78   VAL
112  ALA
78   VAL

File 2

54   LEU
113  ALA
113  ALA
12   ALA
112  ALA

Desired output

total no:of LEU - 2
total no:of VAL - 1
total no:of ALA - 4

score 2 · Accepted Answer

If you only have two files, just use awk:

awk '{ a[$2]++ } END { for (i in a) print "total no:of", i, a[i] }' <(awk '!a[$1,$2]++' file1) <(awk '!a[$1,$2]++' file2)

If you have lots and lots of files, try this awk script. Run it like this:

awk -f script.awk file{1..200}

Contents of script.awk:

{
    a[FILENAME,$1,$2]
}

END {
    for (i in a) {
        split (i,x,SUBSEP)
        b[x[3]]++
    }
    for (j in b) {
        print "total no:of", j, b[j]
    }
}

Alternatively, here is the one-liner:

awk '{ a[FILENAME,$1,$2] } END { for (i in a) { split (i,x,SUBSEP); b[x[3]]++ } for (j in b) print "total no:of", j, b[j] }' file{1..200}

Results:

total no:of LEU 2
total no:of ALA 4
total no:of VAL 1
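For comparison, here is a rough Python sketch of the same idea: collect distinct (file, position, name) triples first, then count the triples per name. The function name count_unique_per_file and the dict-of-lines input are my own assumptions for illustration, not part of the awk answer.

```python
from collections import Counter

def count_unique_per_file(files):
    """files maps a file name to an iterable of its lines (hypothetical input shape)."""
    seen = set()  # plays the role of the awk array a[FILENAME,$1,$2]
    for name, lines in files.items():
        for line in lines:
            fields = line.split()
            if len(fields) >= 2:  # ignore blank or malformed lines
                seen.add((name, fields[0], fields[1]))
    # second pass: count how many distinct triples mention each amino acid
    return Counter(amino for _, _, amino in seen)

# The two sample files from the question, inlined as lists of lines.
sample = {
    "file1": ["54   LEU", "54   LEU", "78   VAL", "112  ALA", "78   VAL"],
    "file2": ["54   LEU", "113  ALA", "113  ALA", "12   ALA", "112  ALA"],
}
totals = count_unique_per_file(sample)
for amino in sorted(totals):
    print("total no:of", amino, totals[amino])
```

Like the awk version, the set removes duplicates within a file, while the file name in the key keeps the same pair in two different files as two separate entries.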

score 0 · Accepted Answer

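name_dict = {}
for filename in filenames:  # filenames: your list of input file paths
    with open(filename) as fsock:
        # a set keeps each (position, name) pair only once per file
        pairs = set(tuple(line.split()) for line in fsock if line.strip())
    for pair in pairs:
        key = pair[-1]
        # .get() avoids the KeyError that name_dict[key] raises on a new name
        name_dict[key] = name_dict.get(key, 0) + 1

for name, count in name_dict.items():
    print("total no:of {} - {}".format(name, count))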

score 0 · Accepted Answer

with open('file1.txt') as f1, open('file2.txt') as f2:
    # open both files; they are closed automatically afterwards
    data = []
    for f in (f1, f2):
        # split each line on whitespace and keep each (position, name) pair once per file
        pairs = set(tuple(line.split()) for line in f if line.strip())
        data += [pair[1] for pair in pairs]
d = {elem: data.count(elem) for elem in set(data)}
for i in d:
    print('total no:of {} - {}'.format(i, d[i]))

score 0 · Accepted Answer

Open the file, read a line, and get the name of the protein; if it already exists in the dictionary, add 1, otherwise add it to the dictionary.

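protien_dict = {}
openfile = open(filename)  # filename: path of one input file
while True:
    line = openfile.readline()
    if line == "":  # an empty string means end of file
        break
    values = line.split()  # split on any run of whitespace
    if len(values) < 2:
        continue  # skip blank or malformed lines
    # note: this counts every line; repeated pairs are not removed
    if values[1] in protien_dict:
        protien_dict[values[1]] = protien_dict[values[1]] + 1
    else:
        protien_dict[values[1]] = 1
openfile.close()
for elem in protien_dict:
    print("total no. of " + elem + " = " + str(protien_dict[elem]))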

score 0 · Accepted Answer

collections.Counter is especially useful for (you guessed it!) counting things:

from collections import Counter
counts = Counter()
for filename in filenames:
    with open(filename) as f:
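        # unique (position, name) pairs in this file
        pairs = set(tuple(line.split()) for line in f if line.strip())
        counts.update(name for _, name in pairs)  # count by name only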
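To go from the Counter to the requested report, something like the following might do. It is only a sketch: count_amino_acids is a name I made up, and the demo writes throwaway copies of the two sample files into a temporary directory so that it runs on its own.

```python
import os
import tempfile
from collections import Counter

def count_amino_acids(filenames):
    counts = Counter()
    for filename in filenames:
        with open(filename) as f:
            # one entry per distinct (position, name) pair in this file
            pairs = set(tuple(line.split()[:2]) for line in f if line.strip())
        counts.update(pair[-1] for pair in pairs)
    return counts

# Demo on throwaway copies of the two sample files from the question.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name, text in [("file1", "54 LEU\n54 LEU\n78 VAL\n112 ALA\n78 VAL\n"),
                       ("file2", "54 LEU\n113 ALA\n113 ALA\n12 ALA\n112 ALA\n")]:
        path = os.path.join(tmp, name)
        with open(path, "w") as out:
            out.write(text)
        paths.append(path)
    counts = count_amino_acids(paths)

# most_common() yields (name, count) pairs, largest count first
for amino, total in counts.most_common():
    print("total no:of {} - {}".format(amino, total))
```

In real use you would pass your own list of paths (for example from glob.glob) instead of the temporary files.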

score 0 · Accepted Answer

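You mentioned Python, Perl, and Awk.

In all three, the idea is the same: use a hash to store the values.

A hash is like an array, except that each entry is indexed by a key instead of a position. A hash can hold only one entry for a given key. Because of that, hashes are used to check whether a value has been seen before. Here is a quick Perl example: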

my %value_hash;
for my $value ( qw(one two three one three four) ) {
    if ( exists $value_hash{$value} ) {
       print "I've seen the value $value before\n";
    }
    else {
       print "The value of $value is new\n";
       $value_hash{$value} = 1;
    }
}

This will print out:

The value of one is new
The value of two is new
The value of three is new
I've seen the value one before
I've seen the value three before
The value of four is new
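Since the question is tagged python, here is roughly the same check in Python, where a set does the job of the Perl hash. The function name classify is just something I picked for the sketch.

```python
def classify(values):
    seen = set()  # holds every value encountered so far
    messages = []
    for value in values:
        if value in seen:
            messages.append("I've seen the value {} before".format(value))
        else:
            messages.append("The value of {} is new".format(value))
            seen.add(value)
    return messages

for message in classify("one two three one three four".split()):
    print(message)
```

A dict would work just as well as a set here; the set simply makes it clear that only the keys matter.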

First, you need two loops: one that goes through all the files, and another that goes through each line of a particular file.

for my $file_name ( @file_list ) {
    open my $file_fh, "<", $file_name 
       or die qq(File $file_name doesn't exist);
    while (my $line = <$file_fh>) {
       chomp $line;
       ...
    }
}

Next, we introduce a hash for the total count of each amino acid and a tracking hash for the amino acids already seen:

use strict;
use warnings;
use autodie;

my %total_amino_acids;
my @file_list = qw(file1 file2);   #Your list of files

for my $file_name ( @file_list ) {
    open my $file_fh, "<", $file_name; 
    my %seen_amino_acid_before;  # "Initialize" hash which tracks seen
    while (my $line = <$file_fh>) {
       chomp $line;
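       my ( $location, $amino_acid ) = split ' ', $line;
       next unless defined $amino_acid;   # skip blank or malformed lines
       if ( not $seen_amino_acid_before{$amino_acid} ) {
           $seen_amino_acid_before{$amino_acid} = 1;   # remember it for this file
           $total_amino_acids{$amino_acid} += 1;
       }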
    }
}

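Now, I am assuming that when you say unique, you are talking only about the amino acid and not the position. The split separates the two values, and I look only at the amino acid. If the position also matters, you have to include it in the key of the %seen_amino_acid_before hash. That gets tricky, because I can imagine the following: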

54    LEU
54 LEU
054.00  LEU

These are different strings, but they all carry the same information. You need to make sure you normalize the position/amino acid key.

    while (my $line = <$file_fh>) {
       chomp $line;
       my ( $location, $amino_acid ) = split ' ', $line;
       my $amino_acid_key = sprintf "%04d-%s", $location, uc $amino_acid;
       if ( not $seen_amino_acid_before{$amino_acid_key} ) {
           $seen_amino_acid_before{$amino_acid_key} = 1;   # remember the normalized key
           $total_amino_acids{uc $amino_acid} += 1;
       }
    }

In the above, I am creating an $amino_acid_key. I use sprintf to format the numeric part as a zero-padded decimal and the amino acid in uppercase. This way:

54    LEU
54 leu
054.00  Leu

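all have the key 0054-LEU. This way, how the data was typed into your files will not affect your results. This may be a completely unnecessary step, but it is something you should always consider. For example, if your data is computer-generated, this is probably not an issue. If your data was typed in by a bunch of overworked graduate students in the middle of the night, you may need to worry about the format.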
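The same normalization could look roughly like this in Python. This is a sketch under my own assumptions: normalize_key is a made-up helper name, and going through float() is simply one way to accept a position written as 054.00.

```python
def normalize_key(position, amino_acid):
    # int(float(...)) accepts "54", "054" and "054.00" alike
    return "{:04d}-{}".format(int(float(position)), amino_acid.upper())

for line in ["54    LEU", "54 leu", "054.00  Leu"]:
    position, amino_acid = line.split()
    print(normalize_key(position, amino_acid))
```

All three lines print 0054-LEU, so they would collapse to a single key in a seen-before set or dict.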

Now you just need one loop to read the data back out:

for my $amino_acid ( sort keys %total_amino_acids ) {
     printf "total no:of %4s - %4d\n", $amino_acid, $total_amino_acids{$amino_acid};
}

Note that I used printf to help format the totals so that they line up.
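If you end up doing this in Python, str.format can give a similar aligned layout. A small sketch (format_totals is a hypothetical helper, and the width of 4 just mirrors the %4s and %4d above):

```python
def format_totals(totals):
    # {:>4} right-aligns the name in 4 columns; {:4d} pads the count to 4 columns
    return ["total no:of {:>4} - {:4d}".format(name, totals[name])
            for name in sorted(totals)]

for row in format_totals({"LEU": 2, "VAL": 1, "ALA": 4}):
    print(row)
```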

score 0 · Accepted Answer

Another option:

use strict;
use warnings;

my ( $argv, %hash, %seen ) = '';

while (<>) {
    $argv ne $ARGV and $argv = $ARGV and undef %seen;
    !$seen{ $1 . $2 }++ and $hash{$2}++ if /(.+)\s+(.+)/;
}

print "total no:of $_ - $hash{$_}\n" for keys %hash;

Output on the dataset:

total no:of ALA - 4
total no:of VAL - 1
total no:of LEU - 2

score 0 · Accepted Answer

Just a perl one-liner:

perl -anE'$h{$F[1]}++}{say"total no:of $_ - $h{$_}"for keys%h'

answered 2013-04-07T07:28:27.893

score 0 · Accepted Answer

ls file* | parallel 'sort -u {}  >> tmp' ; awk '{print $2}' tmp | sort | uniq -c

This gives the output:

4 ALA
2 LEU
1 VAL
