python - How can I speed up a Python (FASTA) subsampling program?

Question

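I wrote a small script that subsamples x sequences from an input file. The input is a FASTA file in which each sequence takes two lines, and the program extracts x sequences (both lines of each one together). This is what it looks like: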

#!/usr/bin/env python3
import random
import sys
# How many random sequences do you want?
num = int(input("Enter number of random sequences to select:\n"))

# Import arguments
infile = open(sys.argv[1], "r")
outfile = open(sys.argv[2], "w")

# Define lists
fNames = []
fSeqs = []
# Extract fasta file into the two lists
for line in infile:
    if line.startswith(">"):
        fNames.append(line.rstrip())
    else:
        fSeqs.append(line.rstrip())

# Print total number of sequences in the original file
print("There are "+str(len(fNames))+" in the input file")

# Take random items out of the list for the total number of samples required
for j in range(num):
    a = random.randint(0, (len(fNames)-1))
    print(fNames.pop(a), file = outfile)
    print(fSeqs.pop(a), file = outfile)

infile.close()
outfile.close()
input("Done.")

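Building the lists of IDs and nucleotide sequences (line 1 and line 2 of each record, respectively) is very fast, but writing the output takes a long time. The number of sequences to extract can reach 2M, but the script already slows down from about 10,000.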

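I would like to know whether there is any way to make it faster. Is .pop the problem? Would it be faster to first create a random list of unique numbers and then extract those entries?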

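Finally, after printing "Done." the terminal does not return to its normal finished state, and I don't know why. With all my other scripts I can type again as soon as they finish.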

score 0 · Accepted Answer

random.sample (suggested in the comments) and a dictionary made the script faster. This is the final script:

#!/usr/bin/env python3
import random
import sys
# How many random sequences do you want?
num = int(input("Enter number of random sequences to select:\n"))

# Import arguments
infile = open(sys.argv[1], "r")
outfile = open(sys.argv[2], "w")

# Define list and dictionary
fNames = []
dicfasta = {}
# Extract fasta file into the list (headers) and the dictionary (header -> sequence)
for line in infile:
    if line.startswith(">"):
        fNames.append(line.rstrip())
        Id = line.rstrip()
    else:
        dicfasta[Id] = line.rstrip()

# Print total number of sequences in the original file
print("There are "+str(len(fNames))+" sequences in the input file")

# Create the subsample: random.sample picks num unique headers, without replacement
subsample = random.sample(fNames, num)

# Write each sampled header and its sequence to the output file
for j in subsample:
    print(j, file = outfile)
    print(dicfasta[j], file = outfile)

infile.close()
outfile.close()
input("Done.")
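
For reference, the original version was slow because list.pop(a) on an arbitrary index has to shift every later element, so each of the two pops per sample costs time proportional to the list length. The following is a minimal sketch of the same idea as the final script, not a tested replacement for it. It assumes strictly two-line records, and the function names (read_fasta_pairs, subsample_fasta, write_subsample) and the seed parameter are hypothetical, introduced only for illustration. It stores (header, sequence) pairs instead of a dictionary, so records with identical headers would not overwrite each other, and it uses with blocks so both files are closed automatically.

```python
import random


def read_fasta_pairs(lines):
    """Collect (header, sequence) pairs from two-line FASTA records."""
    records = []
    header = None
    for line in lines:
        line = line.rstrip()
        if line.startswith(">"):
            header = line
        elif header is not None:
            # Assumption: exactly one sequence line follows each header.
            records.append((header, line))
            header = None
    return records


def subsample_fasta(lines, num, seed=None):
    """Return num distinct records chosen at random, without replacement."""
    records = read_fasta_pairs(lines)
    rng = random.Random(seed)  # seed is optional, only for reproducibility
    # sample() never returns the same list position twice, so nothing
    # has to be popped from the middle of a list.
    return rng.sample(records, num)


def write_subsample(in_path, out_path, num, seed=None):
    """Read in_path, write num random records to out_path."""
    with open(in_path, "r") as infile:
        picked = subsample_fasta(infile, num, seed)
    with open(out_path, "w") as outfile:
        for header, seq in picked:
            print(header, file=outfile)
            print(seq, file=outfile)
```

Calling write_subsample(sys.argv[1], sys.argv[2], num) would mirror the script above. The sketch leaves out the trailing input("Done.") call on purpose: input() waits for Enter, which is probably why the terminal did not appear to return after "Done." was printed; print("Done.") would show the message without waiting.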
