1

我试图在列表中找到具有特定频率的子子串,即python中较大字符串的子串(即windows)。也就是说,目标是找出至少一个子串中存在哪些子串(固定长度)(如果有的话,以特定要求的频率):

strand='JJJKJKHHGHJKKLHHGJJJHHGJJJ'
#now, I break the string by windows (substrings) and define the patterns to look (sub-substrings) :

A=20 #(fixed lenght of each window (substring) moving along the string in a one-by-one way)
B=3 #(fixed length of the pattern (sub-substring))
C=3 #(frequency of pattern (sub-substring))
pattcount = {}
for i in range(0, len(strand)-A+1):
  win=strand[i:i+A]
  for n in range(0, len(win)-B+1):
   patt=win[n:n+B]
   pattcount[patt] = pattcount[patt] + 1 if pattcount.has_key(patt) else 1

pattgroup = []
for p,f in pattcount.iteritems():
   if f != C:
     pattgroup = pattgroup
   elif f == C:
     pattgroup += [p]

print (" ".join(pattgroup))

因此,我得到的结果是:JKJ 当答案应该是:HHG(在长度为 20 的窗口中包含 C=3 次)并且没有 JKJ 或 JJJ(后者包含 C=3 次,但在整个字符串,不在长度为 20 的窗口中)

我究竟做错了什么?我怎样才能找到所需频率中存在的模式,但至少在一个窗口中?(没有将来自其他窗口的任何匹配模式添加到最终计数中)提前致谢。

4

2 回答 2

0

您可以简化为

from collections import defaultdict

strand='JJJKJKHHGHJKKLHHGJJJHHGJJJ'
print strand
A=20
B=3
C=3

D = A-B+1
pattcount = defaultdict(int)
res = set()

for i in xrange(len(strand)-A+1):
    pattcount.clear()
    win=strand[i:i+A]
    for n in xrange(D):
        pattcount[win[n:n+B]] += 1
    res.update(k for k,v in pattcount.iteritems() if v==3)
    print 'i==%d   res==%s' % (i,res)

print (" ".join(res))

编辑

我们甚至可以避免使用win

from collections import defaultdict

strand='JJJKJKHHGHJKKLHHGJJJHHGJJJ'
print strand

A=20
B=3
C=3

D = A-B+1
pattcount = defaultdict(int)
res = set()

for i in xrange(len(strand)-A+1):
    pattcount.clear()
    for n in xrange(i,i+D):
        pattcount[strand[n:n+B]] += 1
    res.update(k for k,v in pattcount.iteritems() if v==3)

print (" ".join(res))
于 2013-11-14T20:53:46.623 回答
0

问题是程序通过窗口(在本例中为 7 个)运行并计算每个模式的频率而不重置。所以最后,它发现 HHG 遇到了 18 次(就像 badc0re 说的那样)。您要求它找到 3 次重复,因此提供了 JKJ,因为它在所有窗口中仅找到 3 次。

所以基本上,每次启动新窗口时,您都必须重置该 pattcount 变量。您可以将每个窗口的 pattcount 存储在另一个变量中,例如 pattcounttotal。我做了一个粗略的例子:

strand='JJJKJKHHGHJKKLHHGJJJHHGJJJ'
#now, I break the string by windows (substrings) and define the patterns to look (sub-substrings) :

A=20 #(fixed lenght of each window (substring) moving along the string in a one-by-one way)
B=3 #(fixed length of the pattern (sub-substring))
C=3 #(frequency of pattern (sub-substring))

pattcounttotal = {} #So in this var everything will be stored   

for i in range(0, len(strand)-A+1):
    pattcount = {}  #This one is moved inside the for loop so it is emptied with every window
    win=strand[i:i+A]
    for n in range(0, len(win)-B+1):
        patt=win[n:n+B]

        #I've partly rewritten your code to make it more readable (for me)
        if pattcount.has_key(patt):
            pattcount[patt] = pattcount[patt] + 1
        else:
            pattcount[patt] = 1

    pattcounttotal[i] = pattcount  #This pattcount is stored into the total one

现在您必须通过 pattcounttotal(而不是 pattcount)来找到一个仅在一个窗口中重复 3 次的模式。

当我“打印” pattcounttotal var 时收到的输出如下所示:

{0: {'HGH': 1, 'KJK': 1, 'KLH': 1, 'HJK': 1, 'LHH': 1, 'KHH': 1, 'KKL': 1, 'GJJ': 1, 'HGJ': 1, 'JJK': 1, 'JJJ': 2, 'JKJ': 1, 'JKK': 1, 'JKH': 1, 'HHG': 2, 'GHJ': 1}, 
1: {'JJH': 1, 'HGH': 1, 'KJK': 1, 'JKK': 1, 'KLH': 1, 'LHH': 1, 'KHH': 1, 'KKL': 1, 'GJJ': 1, 'HGJ': 1, 'JJK': 1, 'HJK': 1, 'JKJ': 1, 'JKH': 1, 'HHG': 2, 'GHJ': 1, 'JJJ': 1}, 
2: {'JHH': 1, 'HGH': 1, 'KJK': 1, 'JJH': 1, 'KLH': 1, 'LHH': 1, 'KHH': 1, 'KKL': 1, 'GJJ': 1, 'HGJ': 1, 'JKH': 1, 'HJK': 1, 'JKJ': 1, 'JKK': 1, 'HHG': 2, 'GHJ': 1, 'JJJ': 1}, 
3: {'JHH': 1, 'KJK': 1, 'JJH': 1, 'KLH': 1, 'LHH': 1, 'KHH': 1, 'KKL': 1, 'GJJ': 1, 'HGJ': 1, 'JKH': 1, 'HJK': 1, 'HGH': 1, 'JKK': 1, 'HHG': 3, 'GHJ': 1, 'JJJ': 1}, 
4: {'JHH': 1, 'JJH': 1, 'KLH': 1, 'LHH': 1, 'KHH': 1, 'KKL': 1, 'GJJ': 1, 'HGJ': 2, 'HHG': 3, 'GHJ': 1, 'HGH': 1, 'JKK': 1, 'JKH': 1, 'HJK': 1, 'JJJ': 1}, 
5: {'JHH': 1, 'JJH': 1, 'KLH': 1, 'LHH': 1, 'KHH': 1, 'KKL': 1, 'GJJ': 2, 'HHG': 3, 'GHJ': 1, 'HGH': 1, 'JKK': 1, 'HGJ': 2, 'HJK': 1, 'JJJ': 1}, 
6: {'JHH': 1, 'JJH': 1, 'KLH': 1, 'LHH': 1, 'KKL': 1, 'GJJ': 2, 'HHG': 3, 'GHJ': 1, 'HGH': 1, 'JKK': 1, 'HGJ': 2, 'HJK': 1, 'JJJ': 2}}

所以基本上每个窗口都从 0-6 编号,并呈现每个模式的频率。因此,如果我快速浏览它,我会发现在 Windows 3-5 中遇到了 3 次 HHG 模式。

(如果您想要一个“自动”输出,那么您仍然需要编写一段代码循环遍历每个窗口(因此通过字典的上层)并跟踪遇到的模式三次。)更新

我认为这是一个有趣的问题,所以我也制作了“自动”输出。它与您已经拥有的非常相似。这是我使用的代码(希望它不是剧透:))

pattgroup = []
for i in pattcounttotal.iteritems(): #It first retrieves the upper levels of pattcounttotal
    for p,f in i[1].iteritems(): #i[0] is the window's number and i[1] is the dictionary that contains the frequencies
        if f != C:
            pattgroup = pattgroup
        elif f == C:
            if not(p in pattgroup):  #This makes sure if HHG is not already in pattgroup.
                pattgroup += [p]        

print (" ".join(pattgroup))

如果我将这两个代码放在一起并运行程序,我会得到 HHG。

于 2013-11-14T18:22:25.550 回答