python - 如何在 FASTA 序列中找到反向重复模式？

Question

假设我的长序列看起来像：

5’-AGGGTTTCCC**TGACCT**TCACTGC**AGGTCA**TGCA-3

这个长序列中的两个斜体子序列（这里是两颗星内）一起称为反向重复模式。这两个子序列中四个字母（如 A、T、G、C）的长度和组合会有所不同。但这两个子序列之间存在关系。请注意，当您考虑第一个子序列时，它的互补子序列是 ACTGGA（根据 A 与 T 结合，G 与 C 结合），当您反转此互补子序列（即最后一个字母在前）时，它与第二个子序列匹配。

FASTA 序列中存在大量此类模式（包含 1000 万个 ATGC 字母），我想找到此类模式及其开始和结束位置。

score 5 · Accepted Answer

I'm new to both Python and bioinformatics, but I'm working my way through the rosalind.info web site to learn some of both. You do this with a suffix tree. A suffix tree (see http://en.wikipedia.org/wiki/Suffix_tree) is the magical data structure that makes all things possible in bioinformatics. You quickly locate common substrings in multiple long sequences. Suffix trees only require linear time, so length 10,000,000 is feasible.

So first find the reverse complement of the sequence. Then put both into the suffix tree, and find the common substrings between them (of some minimum length).

The code below uses this suffix tree implementation: http://www.daimi.au.dk/~mailund/suffix_tree.html. It's written in C with Python bindings. It won't handle a large number of sequences, but two is no problem. However I can't say whether this will handle length 10,000,000.

from suffix_tree import GeneralisedSuffixTree

baseComplement = { 'A' : 'T', 'C' : 'G', 'G' : 'C', 'T' : 'A' }

def revc(seq):
    return "".join([baseComplement[base] for base in seq[::-1]])

data = "AGGGTTTCCCTGACCTTCACTGCAGGTCATGCA"
# revc  TGCATGACCTGCAGTGAAGGTCAGGGAAACCCT
#       012345678901234567890123456789012
#                 1         2         3
minlength = 6

n = len(data)
tree = GeneralisedSuffixTree([data, revc(data)])
for shared in tree.sharedSubstrings(minlength):
    #print shared
    _, start, stop = shared[0]
    seq = data[start:stop]
    _, rstart, rstop = shared[1]
    rseq = data[n-rstop:n-rstart]
    print "Match: {0} at [{1}:{2}] and {3} at [{4}:{5}]".format(seq, start, stop, rseq, n-rstop, n-rstart)

This produces output

Match: AGGTCA at [23:29] and TGACCT at [10:16]
Match: TGACCT at [10:16] and AGGTCA at [23:29]
Match: CTGCAG at [19:25] and CTGCAG at [19:25]

It finds each match twice, once from each end. And there's a reverse palindrome in there, too!

score 1 · Accepted Answer

这是一个蛮力实现，尽管它在超长序列上可能不是很有用。

def substrings(s, lmin, lmax):
    for i in range(len(s)):
        for l in range(lmin, lmax+1):
            subst = s[i:i+l]
            if len(subst) == l:
                yield i, l, subst

def ivp(s, lmin, lmax):
    mapping = {'A': 'T', 'G': 'C', 'T': 'A', 'C': 'G'}
    for i, l, sub in substrings(s, lmin, lmax):
        try:
            from string import maketrans
        except ImportError: # we're on Python 3
            condition = sub.translate(
                       {ord(k): v for k, v in mapping.items()})[::-1] in s
        else: # Python 2
            condition = sub.translate(maketrans('ATGC', 'TACG'))[::-1] in s
        if condition:
            yield i, l, sub

让我们找到长度为 6 的“反向重复模式”（以及它们的起始位置和长度）：

>>> list(ivp('AGGGTTTCCCTGACCTTCACTGCAGGTCATGCA', 6, 6))
[(10, 6, 'TGACCT'), (19, 6, 'CTGCAG'), (23, 6, 'AGGTCA')]

不过，这不会检查两种模式是否重叠。例如，'CTGCAG'匹配自身。

score 0 · Accepted Answer

我有一个想法来找到一个长序列的反转回文序列。考虑整个序列的一段 DNA 序列并生成其互补序列。然后反转这个补码序列的部分。然后执行这两个部分之间的动态对齐并计算它的成本（一个来自实际序列，另一个来自反向互补序列）。成本将给出最佳对齐方式的想法。现在，如果最佳对齐的成本 >= 阈值成本，则选择该部分并找到公共区域。该特定部分的两个共同区域将是一个反向重复单元。一旦找到该单元，然后在下一部分重复它，否则连续增加部分的长度。任何人都可以实现这个算法。也许这将是一个有用的解决方案。

score 0 · Accepted Answer

我使用列表推导对其进行了拍摄。我还是 python 新手，但在过去 5 年左右的时间里一直是 C# 开发人员。这会产生您想要的输出，尽管它绝对没有优化到能够有效地处理 1000 万个字符串。

注意：因为我们将列表转换为frequent_words中的集合，为了去除重复条目，结果没有排序。

def pattern_matching(text, pattern):
    """ Returns the start and end positions of each instance of the pattern  """
    return [[x, str(len(pattern) + x)] for x in range(len(text) - len(pattern) + 1) if
            pattern in text[x:len(pattern) + x]]


def frequent_words(text, k):
    """ Finds the most common k-mers of k """
    counts = [len(pattern_matching(text, text[x:x + k])) for x in range(len(text) - k)]
    return set([text[x:x + k] for x in range(len(text) - k) if counts[x] == max(counts)])


def reverse_complement(pattern):
    """ Formed by taking the complement of each nucleotide in Pattern """
    complements = {'A': 'T', 'C': 'G', 'T': 'A', 'G': 'C'}
    return ''.join([complements.get(c, c) for c in pattern][::-1])


def find_invert_repeats(text, pattern_size):
    """ Returns the overlap for frequent words and its reverse """
    freqs = frequent_words(text, pattern_size)
    rev_freqs = frequent_words(reverse_complement(text), pattern_size)
    return [[x, pattern_matching(text, x)] for x in set(freqs).intersection(rev_freqs)]

print(find_invert_repeats("AGGGTTTCCCTGACCTTCACTGCAGGTCATGCA", 6))

[['TGACCT', [[10, '16']]], ['AGGTCA', [[23, '29']]], ['CTGCAG', [[19, '25']]]]

python - 如何在 FASTA 序列中找到反向重复模式？

4 回答 4

Related

Reference