bash - Find and extract lines from a large file that match lines in another large file

Question

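I am taking the liberty of opening a new question because some parameters have changed drastically since my first question on optimizing this bash script (Optimizing my script that does lookups in a big compressed file).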

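In short: I want to find and extract all lines of file (1) (a BAM file) whose first-column value matches the first column of a text file (2). For bioinformaticians, this amounts to extracting the matching read IDs from two files. File 1 is a binary compressed 130 GB file. File 2 is a TSV file with 1 billion lines.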

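Recently a user came up with a very elegant one-liner that combines decompressing the file with a lookup in awk, and it worked very well. With the files at their current size, the lookup now takes more than 200 hours (multithreaded).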

Does this "problem" have a name in algorithmics?
What would be a good way to tackle this challenge? (If possible, with a simple solution such as sed, awk, bash ...)

Many thanks

Edit: Sorry about the code; since it was at the link, I thought posting it again would be a duplicate ("doublon"). Here is the one-liner used:

#!/bin/bash

samtools view -@ 2 /data/bismark2/aligned_on_nDNA/bamfile.bam | awk -v st="$1" 'BEGIN {OFS="\t"; while ((getline < st) > 0) {st_array[$1]=$2}} {if ($1 in st_array) {print $0, st_array[$1], "wh_genome"}}'

score 0 · Accepted Answer

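Think of this as a long comment rather than an answer. The "merge sort" method assumes both files are sorted on the key being compared, and can be summarized as follows: if the two current records do not match, advance by one record in the file whose current record is smaller. If they do match, record the match and advance by one record in the large file.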

In pseudocode, this looks like:

currentSmall <- readFirstRecord(smallFile)
currentLarge <- readFirstRecord(largeFile)
searching <- true
while (searching)
  if (currentLarge < currentSmall)
    currentLarge <- readNextRecord(largeFile)
  else if (currentLarge = currentSmall)
    //Bingo!
    saveMatchData(currentLarge, currentSmall)
    currentLarge <- readNextRecord(largeFile)
  else if (currentLarge > currentSmall)
    currentSmall <- readNextRecord(smallFile)
  endif

  if (largeFile.EOF or smallFile.EOF)
    searching <- false
  endif
endwhile

How you would translate this into awk or bash is beyond my knowledge.
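
One way to get this behaviour from standard tools is `sort` followed by `join`, which performs the same single-pass merge over two sorted inputs (often called a sort-merge join). The following is a minimal sketch rather than a tuned solution. It assumes both inputs are plain tab-separated text (for the BAM file, that would be the text output of `samtools view`), and the names `merge_join`, `reads.tsv`, and `ids.tsv` are hypothetical, not from the original post.

```shell
#!/bin/bash
# Sketch of a sort-merge join on column 1 of two tab-separated text files.
# merge_join is a hypothetical helper name; it is not from the original post.
merge_join() {
    reads_file=$1   # large file, e.g. the text output of `samtools view`
    ids_file=$2     # file whose first column holds the ids to look up
    tab=$(printf '\t')
    tmp_dir=$(mktemp -d) || return 1

    # Sort both inputs on column 1 in byte order (LC_ALL=C) so that
    # sort and join agree on what "sorted" means.
    LC_ALL=C sort -t "$tab" -k1,1 "$reads_file" > "$tmp_dir/reads.sorted"
    LC_ALL=C sort -t "$tab" -k1,1 "$ids_file"   > "$tmp_dir/ids.sorted"

    # join reads both sorted files once and pairs lines with equal keys.
    # Output per match: key, remaining read fields, remaining id-file fields.
    LC_ALL=C join -t "$tab" "$tmp_dir/reads.sorted" "$tmp_dir/ids.sorted"

    rm -rf "$tmp_dir"
}
```

It would be called as `merge_join reads.tsv ids.tsv > matches.tsv`. Sorting is the expensive step here, but it is done once per file, and after that neither file has to be held in memory the way the awk array is.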
