python - If column A in file 1 = column A in file 2, replace with column B from file 2

Question

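Normally I would use R and do a merge.by, but this file seems to be too large for any computer in the department to handle! (Additional information for anyone working in genetics:) Essentially, imputation seems to have stripped the rs numbers from the SNP IDs, leaving me with only Chromosome:Position information. So I created a link file with all the rs numbers I want, and I would like to replace the Chr:Pos column in file 1 with the rs numbers from file 2.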

So I am trying to come up with a way of coding:

If $3 of file 1 = $5 of file 2, replace $3 of file 1 with $2 of file 2.

File 1 looks like

1111 1111 1:10583  G G
1112 1112 1:10583  G G
1113 1113 1:10583  G G
1114 1114 1:10583  G G
1115 1115 1:10583  G G

File 2 looks like

1   rs58108140  0   10583       1:10583
1   rs192319073 0   105830003   1:105830003
1   rs190151039 0   10583005    1:10583005
1   rs2809302   0   105830229   1:105830229
1   rs191085550 0   105830291   1:105830291

The desired output is:

1111 1111 rs58108140  G G
1112 1112 rs58108140  G G
1113 1113 rs58108140  G G
1114 1114 rs58108140  G G
1115 1115 rs58108140  G G

score 2 · Accepted Answer

A simple awk solution:

$ awk 'FNR==NR{a[$5]=$2;next}$3 in a{$3=a[$3]}1' file2 file1
1111 1111 rs58108140 G G
1112 1112 rs58108140 G G
1113 1113 rs58108140 G G
1114 1114 rs58108140 G G
1115 1115 rs58108140 G G

score 0 · Accepted Answer

join and awk can do it. You could also use cut instead of awk, but then you would have to reorder the fields some other way.

join -1 3 -2 5 file1 file2 | awk '{print $2, $3, $7, $4, $5}'

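Warning: as sudo_O mentioned, this only works if both files are sorted on the join field. I assumed they are, based on the example given. If they are not, this will not be fast, because they have to be sorted first. If they are already sorted, nothing needs to be read into memory, because both commands process the data as they read it.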
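Since join depends on the inputs being sorted on the join field, it can be worth checking that before running it on a very large file. The following is a minimal sketch of such a check, not part of the original answer; the function name `is_sorted_on_field` and the sample list are hypothetical. It compares keys as plain strings (code point order), which corresponds to the order produced by `LC_ALL=C sort`; join itself uses the collation order of the current locale, so this is an assumption about how the files were sorted.

```python
from typing import Iterable


def is_sorted_on_field(lines: Iterable[str], field_index: int) -> bool:
    """Return True if the whitespace-separated field never decreases.

    Hypothetical helper: keys are compared as strings, one line at a
    time, so the whole file is never held in memory.
    """
    previous = None
    for line in lines:
        fields = line.split()
        if len(fields) <= field_index:
            continue  # skip blank or short lines
        key = fields[field_index]
        if previous is not None and key < previous:
            return False
        previous = key
    return True


# Sample rows from the question (file 2, join key in column 5).
file2_sample = [
    "1   rs58108140  0   10583       1:10583",
    "1   rs192319073 0   105830003   1:105830003",
    "1   rs190151039 0   10583005    1:10583005",
    "1   rs2809302   0   105830229   1:105830229",
    "1   rs191085550 0   105830291   1:105830291",
]
print(is_sorted_on_field(file2_sample, 4))
```

An open file object can be passed in place of the list, for example `is_sorted_on_field(open('file2'), 4)` for file 2 and index 2 for file 1.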

score 0 · Accepted Answer

Create a dictionary from file2, then use it to rewrite file1:

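# Build a lookup from Chr:Pos (column 5 of file2) to rs number (column 2).
replacement = {}
with open('file2', 'r') as file2:
    for line in file2:
        split_line = line.split()
        replacement[split_line[4]] = split_line[1]

# Stream file1 and swap column 3 for its rs number when one is known.
with open('file1', 'r') as file1, open('file1_new', 'w') as file1_new:
    for line in file1:
        split_line = line.split()
        # Look up column 3 (index 2); keep the original value if there is no match.
        split_line[2] = replacement.get(split_line[2], split_line[2])
        file1_new.write(' '.join(split_line) + '\n')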
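The same dictionary approach can be tried on the sample rows from the question without touching any files. This is only a sketch: the function names `build_lookup` and `replace_ids` are hypothetical, and the last row of `file1_sample` is made up to show that an unmatched Chr:Pos value is left unchanged.

```python
from typing import Dict, Iterable, Iterator


def build_lookup(link_lines: Iterable[str]) -> Dict[str, str]:
    """Map Chr:Pos (column 5) to rs number (column 2)."""
    lookup = {}
    for line in link_lines:
        fields = line.split()
        if len(fields) >= 5:  # ignore blank or short lines
            lookup[fields[4]] = fields[1]
    return lookup


def replace_ids(data_lines: Iterable[str], lookup: Dict[str, str]) -> Iterator[str]:
    """Yield each row with column 3 replaced when the lookup has a match."""
    for line in data_lines:
        fields = line.split()
        if len(fields) >= 3:
            fields[2] = lookup.get(fields[2], fields[2])
        yield ' '.join(fields)


file2_sample = [
    "1   rs58108140  0   10583       1:10583",
    "1   rs192319073 0   105830003   1:105830003",
]
file1_sample = [
    "1111 1111 1:10583  G G",
    "1112 1112 1:10583  G G",
    "9999 9999 2:500  A T",  # made-up row with no match in file2_sample
]

for row in replace_ids(file1_sample, build_lookup(file2_sample)):
    print(row)
```

Because `replace_ids` is a generator, file 1 can be passed as an open file object and written out row by row, so only the lookup built from file 2 has to fit in memory.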
