
This is more about finding the fastest way to do it. I have a file1 which contains about one million strings (length 6-40), one per line. I want to search for each of them in another file, file2, which contains about 80,000 strings, and count the occurrences (if the small string is found multiple times within one big string, that string still only counts as 1 occurrence). If anyone is interested in comparing performance, here is a link to download file1 and file2: dropbox.com/sh/oj62918p83h8kus/sY2WejWmhu?m
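
A toy illustration of that counting rule (made-up strings, not from the real files):

# each big string counts at most once, even if the small string occurs in it several times
bigstrings = ["ABCABC", "XYZABC", "DEF"]
print sum(1 for s in bigstrings if "ABC" in s)   # prints 2, not 3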

What I am doing now is building a dictionary for file2, using the string IDs as keys and the strings as values (the strings in file2 have duplicate values, so only the string IDs are unique). My code is:

for line in file1:
   substring=line[:-1].split("\t")
   for ID in dictionary.keys():
       bigstring=dictionary[ID]
       IDlist=[]
       if bigstring.find(substring)!=-1:
           IDlist.append(ID)
   output.write("%s\t%s\n" % (substring,str(len(IDlist))))

My code takes several hours to finish. Can anyone suggest a faster way to do it? file1 and file2 are both only around 50M, my PC has 8G of memory, and you can use as much memory as you like to make it faster. Any method that can finish within an hour is acceptable :)

After trying out some of the suggestions from the comments below, here is a performance comparison: first the code, then the runtime.

Some improvements suggested by Mark Amery and others:
import sys
from Bio import SeqIO

#first I load strings in file2 to a dictionary called var_seq, 
var_seq={}
handle=SeqIO.parse(file2,'fasta')
for record in handle:
    var_seq[record.id]=str(record.seq)

print len(var_seq) # prints 76827, which is the right number; loading file2 into var_seq takes only about 1 second, so this part is not worth optimising

output=open(outputfilename,'w')
icount=0
input1=open(file1,'r')
for line in input1:
    icount+=1
    row=line[:-1].split("\t")
    ensp=row[0]   # ensp is just the peptide ID
    peptide=row[1] # peptide is the substring I want to search for in file2
    num=0
    for ID,bigstring in var_seq.iteritems(): 
        if peptide in bigstring:
            num+=1

    newline="%s\t%s\t%s\n" % (ensp,peptide,str(num))
    output.write(newline)
    if icount%1000==0:
        break

input1.close()
handle.close()
output.close()

It takes 1m4s to finish, an improvement of more than 20 seconds over my old version.
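
A rough back-of-the-envelope check on where that time goes, assuming (as the break above implies) that only the first 1000 peptides were processed in this test run:

# 1000 peptides, each tested with `in` against all 76,827 sequences
checks = 1000 * 76827
print checks          # ~7.7e7 membership tests
print checks / 64.0   # ~1.2e6 tests per second over the 1m4s run

At that rate, the full ~1 million peptides would need on the order of 7.7e10 tests, which is why the per-peptide scan over every sequence is the part worth eliminating.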

####### NEXT METHOD suggested by entropy
from collections import defaultdict
var_seq=defaultdict(int)
handle=SeqIO.parse(file2,'fasta')
for record in handle:
    var_seq[str(record.seq)]+=1

print len(var_seq) # prints 59502; duplicates are removed, but the number of occurrences of each duplicate is stored as the value
handle.close()

output=open(outputfilename,'w')
icount=0

with open(file1) as fd:
    for line in fd:
        icount+=1
        row=line[:-1].split("\t")
        ensp=row[0]
        peptide=row[1]
        num=0
        for varseq,num_occurrences in var_seq.items():
            if peptide in varseq:
                num+=num_occurrences

        newline="%s\t%s\t%s\n" % (ensp,peptide,str(num))
        output.write(newline)
        if icount%1000==0:
            break

output.close()

This takes 1m10s. It is not faster, as I had expected it to be since it avoids searching duplicates, and I don't understand why.
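
For what it's worth, simple arithmetic on the two counts printed above bounds how much deduplication alone can help; this version also iterates over var_seq.items() (which, in Python 2, copies the dict into a list of tuples on every outer pass) rather than iteritems(), which may eat into that margin:

# the inner loop shrinks from 76,827 sequences to 59,502 unique ones per peptide
print 1 - 59502.0 / 76827   # ~0.23, i.e. at most ~23% fewer membership tests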

The haystack-and-needle method suggested by Mark Amery turned out to be the fastest. The problem with this method is that the count for every substring comes out as 0, which I don't understand yet.

Here is the code in which I implemented his method:

class Node(object):
    def __init__(self):
        self.words = set()
        self.links = {}

base = Node()

def search_haystack_tree(needle):
    current_node = base
    for char in needle:
        try:
            current_node = current_node.links[char]
        except KeyError:
            return 0
    return len(current_node.words)

input1=open(file1,'r')
needles={}
for line in input1:
    row=line[:-1].split("\t")
    needles[row[1]]=row[0]

print len(needles)

handle=SeqIO.parse(file2,'fasta')
haystacks={}
for record in handle:
    haystacks[record.id]=str(record.seq)

print len(haystacks)

for haystack_id, haystack in haystacks.iteritems(): #should be the same as enumerate(list)
    for i in xrange(len(haystack)):
        current_node = base
        for char in haystack[i:]:
            current_node = current_node.links.setdefault(char, Node())
            current_node.words.add(haystack_id)

icount=0
output=open(outputfilename,'w')
for needle in needles:
    icount+=1
    count = search_haystack_tree(needle)
    newline="%s\t%s\t%s\n" % (needles[needle],needle,str(count))
    output.write(newline)
    if icount%1000==0:
        break

input1.close()
handle.close()
output.close()

It finishes in only 0m11s, much faster than the other methods. However, I don't know whether all the counts coming out as 0 is due to a mistake of mine or a flaw in Mark's method.
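
Since I can't post the intermediate data here, one assumption worth ruling out on my side is stray whitespace on the needles (for example a trailing '\r' that line[:-1] would leave behind if file1 has Windows-style line endings), which would make every tree walk fail at the first unmatched character. A quick sanity check along those lines:

# are any needles different from their stripped form (trailing '\r', spaces, ...)?
print sum(1 for n in needles if n != n.strip())

# does a plain substring test agree with the tree for one example needle?
some_needle = next(iter(needles))
print any(some_needle in seq for seq in haystacks.itervalues())
print search_haystack_tree(some_needle)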


2 Answers


Your code doesn't seem like it works (are you sure you didn't just quote it from memory instead of pasting the actual code?).

For example, this line:

substring=line[:-1].split("\t")

will cause substring to be a list. But later you do:

if bigstring.find(substring)!=-1:

That will raise an error, since you end up calling str.find() with a list.
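
Concretely (the exact message varies between Python versions, but it is a TypeError):

# str.find() only accepts strings, not lists
"SOMEPEPTIDESEQ".find(["PEP"])   # raises TypeError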

In any case, you are building lists uselessly in your innermost loop. This:

IDlist=[]
if bigstring.find(substring)!=-1:
     IDlist.append(ID)

 #later
 len(IDlist)

That allocates and frees a list on every iteration, which causes memory churn and needlessly bogs everything down.

This is code that should work and uses more efficient means to do the counting:

from collections import defaultdict

dictionary = defaultdict(int)
with open(file2) as fd:
    for line in fd:
        for s in line.split("\t"):
            dictionary[s.strip()] += 1

with open(file1) as fd:
    for line in fd:
        for substring in line.split('\t'):
            count = 0
            for bigstring,num_occurrences in dictionary.items():
                if substring in bigstring:
                    count += num_occurrences
            print substring, count
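
To make the counting rule concrete, here is a toy run of the same logic (made-up strings, not your data):

# dictionary maps each big string from file2 to how many times it appears there
dictionary = {"ABCABC": 2, "DEF": 1}
substring = "ABC"
count = 0
for bigstring, num_occurrences in dictionary.items():
    if substring in bigstring:
        count += num_occurrences
print substring, count   # prints: ABC 2  (the two copies of "ABCABC" count once each)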

PS: I am assuming that you have multiple words per line that are tab-split because you do line.split("\t") at some point. If that is wrong, it should be easy to revise the code.

PPS: If this ends up being too slow for your use (you'd have to try it, but my guess is it should run in ~10 min given the number of strings you said you have), you'll have to use suffix trees, as one of the comments suggested.

Edit: Amended the code so that it handles multiple occurrences of the same string in file2 without negatively affecting performance

Edit 2: Trading maximum space for time.

Below is code that will consume quite a bit of memory and take a while to build the dictionary. However, once that's done, each search out of the million strings to search for should complete in the time it takes for a single hashtable lookup, that is O(1).

Note, I have added some statements to log the time each step of the process takes, so you know where the time goes when searching. Keep them: since you are testing with only 1000 strings, this matters a lot. If 90% of the cost is build time rather than search time, then when you run against 1M strings you will still only pay the build cost once, so it won't matter.

Also note that I have amended my code to parse file1 and file2 as you do, so you should be able to just plug this in and test it:

from Bio import SeqIO
from collections import defaultdict
from datetime import datetime

def all_substrings(s):
    result = set()
    for length in range(1,len(s)+1):
        for offset in range(len(s)-length+1):
            result.add(s[offset:offset+length])
    return result

print "Building dictionary...."
build_start = datetime.now()

dictionary = defaultdict(int)
handle = SeqIO.parse(file2, 'fasta')
for record in handle:
    for sub in all_substrings(str(record.seq).strip()):
        dictionary[sub] += 1
build_end = datetime.now()
print "Dictionary built in: %gs" % (build-end-build_start).total_seconds()

print "Searching...\n"
search_start = datetime.now()

with open(file1) as fd:
    for line in fd:
        substring = line.strip().split("\t")[1]
        count = dictionary[substring]
        print substring, count

search_end = datetime.now()
print "Search done in: %gs" % (search_end-search_start).total_seconds()
Answered 2013-03-03T21:08:16.573

I'm no algorithms expert, but I reckon this should give you a healthy performance boost. You need to set 'haystacks' to the list of big words you want to look in, and 'needles' to the list of substrings you're looking for (they can contain duplicates); I'll let you implement that part. It would be great if you could post your list of needles and list of haystacks, so that we can easily compare the performance of the proposed solutions.

haystacks = <some list of words>
needles = <some list of words>

class Node(object):
    def __init__(self):
        self.words = set()
        self.links = {}

base = Node()

for haystack_id, haystack in enumerate(haystacks):
    for i in xrange(len(haystack)):
        current_node = base
        for char in haystack[i:]:
            current_node = current_node.links.setdefault(char, Node())
            current_node.words.add(haystack_id)

def search_haystack_tree(needle):
    current_node = base
    for char in needle:
        try:
            current_node = current_node.links[char]
        except KeyError:
            return 0
    return len(current_node.words)

for needle in needles:
    count = search_haystack_tree(needle)
    print "Count for %s: %i" % (needle, count)

You can probably figure out what's going on by looking at the code, but to put it in words: I build a huge tree of substrings of the haystacks, such that given any needle you can navigate the tree character by character and end up at a node which has attached to it the set of haystack ids of all haystacks containing that substring. Then for each needle we just walk the tree and count the size of the set at the end.
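
As a toy usage example (made-up words, not your data), assuming the tree-building loop above has been run with these lists:

haystacks = ["abcde", "bcd", "xyz"]
needles = ["bcd", "bcq"]

print search_haystack_tree("bcd")   # 2 -- "bcd" is a substring of haystacks 0 and 1
print search_haystack_tree("bcq")   # 0 -- no haystack contains "bcq"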

Answered 2013-03-03T22:07:55.447