I'm interested in creating a program that searches for a certain string (henceforth string A) in a large library of other strings. If string A exists in the library, it is discarded and the next string is checked against the library. The program then gives me a final list of strings that do not occur as substrings anywhere in the library. I was able to make a program that finds EXACT matches, but I need to add a module that lets the substring search accept partial matches: one or two mismatched characters would be acceptable. The list of string A's (all 4^7 possible 7-letter strings over a, t, g, c) has difficulties with highly diverse libraries.

My initial thought was to use regex and perhaps a Hamming distance algorithm to find all those partial matches. The idea behind this first attempt is to put a "?" wildcard into each position of string A in turn (1-7), but I can only get it into the first position. The wildcard would then let me search for partial matches of that particular string A. If this is the wrong way to approach the problem, I'd gladly change it up. I used fnmatch, as suggested in another question. This is what I have so far:

from Bio import SeqIO
import fnmatch
import random
import itertools

#Define a splitting string algorithm
def split_by_n(seq,n):
    while seq:
        yield seq[:n]
        seq = seq[n:]

#Import all combinations/permutations from fasta file, 4^7
my_combinations = []
fasta_sequences = SeqIO.parse(open("Combinations/base_combinations_7.fasta"),'fasta')
for fasta in fasta_sequences:
    name, sequence = fasta.id, str(fasta.seq)
    x = sequence.lower()
    my_combinations.append(x)

primer = "tgatgag"
final = []

#List to make wildcard permutations
wildCard = ['?']

i = list(split_by_n(primer, 1))

for letter in i:
    wildCard.append(letter)

del wildCard[1]

final.append(''.join(wildCard))

#Search for wildcard permutation
for entry in final:
    filtered = fnmatch.filter(my_combinations, entry)

This is my desired output:

primer = "tgatgag"

['?', 'g', 'a', 't', 'g', 'a', 'g']
['t', '?', 'a', 't', 'g', 'a', 'g']
['t', 'g', '?', 't', 'g', 'a', 'g']
['t', 'g', 'a', '?', 'g', 'a', 'g']
['t', 'g', 'a', 't', '?', 'a', 'g']
['t', 'g', 'a', 't', 'g', '?', 'g']
['t', 'g', 'a', 't', 'g', 'a', '?']
['agatgag', 'tgatgag', 'cgatgag', 'ggatgag']
['taatgag', 'ttatgag', 'tcatgag', 'tgatgag']
['tgatgag', 'tgttgag', 'tgctgag', 'tggtgag']
['tgaagag', 'tgatgag', 'tgacgag', 'tgaggag']
['tgataag', 'tgattag', 'tgatcag', 'tgatgag']
['tgatgag', 'tgatgtg', 'tgatgcg', 'tgatggg']
['tgatgaa', 'tgatgat', 'tgatgac', 'tgatgag']
4
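
For reference, the desired output above can be reproduced with a sketch like the following. It is not the code from the question: the helper name wildcard_patterns is made up for illustration, and all_kmers is a stand-in for my_combinations, built with itertools.product instead of being read from the FASTA file.

```python
import fnmatch
import itertools


def wildcard_patterns(primer, n_wild=1):
    """Yield copies of primer with '?' placed at every choice of n_wild positions."""
    for positions in itertools.combinations(range(len(primer)), n_wild):
        chars = list(primer)
        for pos in positions:
            chars[pos] = '?'
        yield ''.join(chars)


# Stand-in for my_combinations: every 7-letter string over a, t, c, g (4^7 entries).
all_kmers = [''.join(p) for p in itertools.product('atcg', repeat=7)]

primer = "tgatgag"
patterns = list(wildcard_patterns(primer))

# fnmatch treats '?' as "any single character", so each pattern matches 4 k-mers.
matches = {pat: fnmatch.filter(all_kmers, pat) for pat in patterns}

for pat in patterns:
    print(list(pat))
for pat in patterns:
    print(matches[pat])
```

Passing n_wild=2 yields the 21 patterns with two wildcards, each of which matches 16 of the 7-letter strings.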

1 Answer

Here is an example solution for replacing 2 elements:

import itertools

primer = 'cattagc'
bases = ['a','c','g','t']

# generator over all ordered pairs of distinct positions in the primer
p = itertools.permutations(range(len(primer)), 2)
# list of all unordered pairs of bases (a base may be paired with itself)
c = list(itertools.combinations_with_replacement(bases, 2))

results = []
for i1, i2 in p:
    for c1, c2 in c:
        temp = list(primer)
        temp[i1], temp[i2] = c1, c2
        results.append(''.join(temp))

This creates every possible string obtained by replacing any two elements of the original primer.
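
Note that the results list can contain the same string more than once, for example whenever a replacement base equals the base already at that position. A possible variant, sketched below, collects the variants in a set and covers zero, one, or two substitutions in one pass. The function names hamming, neighbours and occurs_with_mismatches are hypothetical, and the library is assumed to be a plain list of lowercase strings.

```python
import itertools


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


def neighbours(primer, max_mismatches=2, alphabet='acgt'):
    """Return the set of strings within max_mismatches substitutions of primer."""
    found = set()
    for k in range(max_mismatches + 1):
        # choose which k positions to overwrite ...
        for positions in itertools.combinations(range(len(primer)), k):
            # ... and which letters to put there (ordered, repeats allowed)
            for letters in itertools.product(alphabet, repeat=k):
                chars = list(primer)
                for pos, letter in zip(positions, letters):
                    chars[pos] = letter
                found.add(''.join(chars))
    return found


def occurs_with_mismatches(primer, library, max_mismatches=2):
    """True if any library string contains primer with at most max_mismatches substitutions."""
    variants = neighbours(primer, max_mismatches)
    return any(v in seq for seq in library for v in variants)


variants = neighbours('cattagc')
print(len(variants))
print(occurs_with_mismatches('cattagc', ['ggcataagcgg']))
```

For a 7-letter primer this gives 1 + 7*3 + 21*9 = 211 distinct variants, so checking each of them against the library with the in operator stays cheap per primer.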

Answered 2014-10-24T20:07:35.980