python - 相对于第三列的内容重新编号两列中制表符分隔的数据

Question

我还有另一个数据处理问题。所以我有这个制表符分隔数据的 .gtf 文件，我需要提取某些特征。这在我只需要为每个基因中的每个“外显子”类型提取基因 ID、POS1 和 POS2 之前就更简单了。我需要做同样的事情，但是我首先需要找到每个外显子相对于其在基因中的位置的 POS1 和 POS2。现在，POS1 和 POS2 列是根据 TYPE 在整个基因组上的位置进行编号的（这就是数字如此之高的原因）。还有另一个问题，如果链是-，则相反。如果您查看 PITG_00002，您会发现终止密码子似乎位于起始密码子之前。这是因为所有内容都相对于 +（模板）链进行编号。因此，这是数据表的示例：

GENE ID     TYPE        POS1    POS2    STRAND
PITG_00003  start_codon 38775   38777   +   0
PITG_00003  stop_codon  39069   39071   +   0
PITG_00003  exon        38775   39071   +   .
PITG_00003  CDS         38775   39068   +   0
PITG_00004  start_codon 39526   39528   +   0
PITG_00004  stop_codon  41492   41494   +   0
PITG_00004  exon        39526   40416   +   .
PITG_00004  CDS         39526   40416   +   0
PITG_00004  exon        40486   40771   +   .
PITG_00004  CDS         40486   40771   +   0
PITG_00004  exon        40827   41494   +   .
PITG_00004  CDS         40827   41491   +   2
PITG_00002  start_codon 10520   10522   -   0
PITG_00002  stop_codon  10097   10099   -   0
PITG_00002  exon        10474   10522   -   .
PITG_00002  CDS         10474   10522   -   0
PITG_00002  exon        10171   10433   -   .
PITG_00002  CDS         10171   10433   -   2
PITG_00002  exon        10097   10114   -   .
PITG_00002  CDS         10100   10114   -   0

因此，对于每个基因，我需要相对于“起始密码子”类型的位置从 1 开始数字。不幸的是，STRAND 上列出的基因的数字是倒数的（例如，PITG_00002）。所以对于这些情况，编号需要从相对于 start_codon 的 POS2 的 1 开始，并在外显子的 POS1 结束。

所以对于每个外显子，我需要得到一个新的 POS1 和 POS2，我将它们称为 POSA 和 POSB。

要获得每个外显子的 POSA，我会这样做：

POS1 of "exon" - POS1 of "start_codon" + 1 = POSA

要获得每个外显子的 POSB，我会这样做：

POS2 of "exon" - POS1 of "start_codon" + 1 = POSB

以 PITG_00004 为例：

POSA = 39526-39526 + 1 = 1
POSB = 40416 - 39526 + 1 = 891

然后对每个基因中的每个外显子做同样的事情，使用该基因的 start_codon 位置来重置编号。除了负链的情况外，在这种情况下我必须这样做：

要获得每个外显子的 POSA，我会这样做：

POS2 of "start_codon" - POS2 of "exon" + 1 = POSA

要获得每个外显子的 POSB，我会这样做：

POS1 of "start_codon" - POS1 of "exon" + 1 = POSB

最终我想得到这个：

PITG_00002 exon 1 49
PITG_00002 exon 90 352
PITG_00002 exon 409 426
PITG_00003 exon 1 297
PITG_00004 exon 1 891
PITG_00004 exon 961 1246
PITG_00004 exon 1302 1969

我不确定如何为 + 链执行此操作，而为 - 链执行另一种方式。我最近更经常使用 python，但我也可以使用 perl。

score 1 · Accepted Answer

Perl 解决方案。使用哈希来存储每个基因的信息。该@idxs数组用于避免重复公式。

#!/usr/bin/perl
use warnings;
use strict;
use feature qw(switch);

my %hash;
<>;                   # Skip header.
while (<>) {
    my ($id, $type, $pos1, $pos2, $strand, undef) = split;
    given ($type) {
        when ('start_codon') {
            $hash{$id}{start}  = [$pos1, $pos2];
            $hash{$id}{strand} = $strand;
        }
        when ('stop_codon') {
            $hash{$id}{stop}  = [$pos1, $pos2];
        }
        when ('exon') {
            push @{ $hash{$id}{exons} }, [$pos1, $pos2];
        }
    }
}

for my $id (sort keys %hash) {
    my @idxs = '+' eq $hash{$id}{strand} ? (0, 1) : (1, 0);
    for my $exon (@{ $hash{$id}{exons} }) {
        my $posa = 1 + abs $hash{$id}{start}[$idxs[0]] - $exon->[$idxs[0]];
        my $posb = 3 + abs $hash{$id}{start}[$idxs[1]] - $exon->[$idxs[1]];
        print "$id exon $posa $posb\n";
    }
}

score 0 · Accepted Answer

好的，这是您问题的解决方案（至少就我所理解的而言）：它基于 pandas 库（http://pandas.pydata.org/），这是目前 Python 中数据分析的黄金标准。

首先加载您的数据：

data = pd.read_csv('genetest.csv', sep='\t',
                   converters={'STRAND': lambda s: s[0]})

转换后的只是从 strand 列中去除多余的字符，只留下 + 或 -。

现在你去使用 groupby 函数通过链方向和基因名称来分离你的序列

groups = data.groupby(['STRAND', 'GENE_ID'])

这将返回具有相同链方向和基因名称的数据集，您可以分别处理它们中的每一个。因此，我们将它们作为字典的项目（键值对列表）进行迭代并对其进行操作。

corrected = []
for (direction, gene_name), group in groups:
    print direction,gene_name
    # take the index of the element you are going to subtract to the others
    start_exon = group.index[group.TYPE=='start_codon'][0]
    # now you perform your normalization and put it back into your group
    if direction == '+':
        group['POSA'] = 1 + group.POS1 - group.POS1[start_exon]
        group['POSB'] = 1 + group.POS2 - group.POS1[start_exon]
    else:
        group['POSA'] = 1 - group.POS2 + group.POS2[start_exon]
        group['POSB'] = 1 - group.POS1 + group.POS2[start_exon]
    print group
    # put into the result array
    corrected.append(group)
# join them together to obtain the whole dataset with the POSA and POSB
new_data = pd.concat(corrected)
print new_data

这就是你得到的：

    GENE_ID     TYPE    POS1    POS2    STRAND  POSA    POSB
0   PITG_00003  start_codon     38775   38777   +   1   3
1   PITG_00003  stop_codon  39069   39071   +   295     297
2   PITG_00003  exon    38775   39071   +   1   297
3   PITG_00003  CDS     38775   39068   +   1   294
4   PITG_00004  start_codon     39526   39528   +   1   3
5   PITG_00004  stop_codon  41492   41494   +   1967    1969
6   PITG_00004  exon    39526   40416   +   1   891
7   PITG_00004  CDS     39526   40416   +   1   891
8   PITG_00004  exon    40486   40771   +   961     1246
9   PITG_00004  CDS     40486   40771   +   961     1246
10  PITG_00004  exon    40827   41494   +   1302    1969
11  PITG_00004  CDS     40827   41491   +   1302    1966
12  PITG_00002  start_codon     10520   10522   -   1   3
13  PITG_00002  stop_codon  10097   10099   -   424     426
14  PITG_00002  exon    10474   10522   -   1   49
15  PITG_00002  CDS     10474   10522   -   1   49
16  PITG_00002  exon    10171   10433   -   90  352
17  PITG_00002  CDS     10171   10433   -   90  352
18  PITG_00002  exon    10097   10114   -   409     426
19  PITG_00002  CDS     10100   10114   -   409     423

顺便说一句，您在问题中写下错误的距离校正，应该是

POS1 of "start_codon" - POS2 of "exon" + 1 = POSB

使倒置的字符串具有几何意义（并获得您发布的值）

python - 相对于第三列的内容重新编号两列中制表符分隔的数据

2 回答 2

Related

Reference