I'm interested in creating a program that searches for a certain string (henceforth string A) in a large library of other strings. If string A exists in the library, it is discarded and the next string is checked against the library. The program would then give me a final list of the strings that do not occur as substrings anywhere in the large library. I was able to make a program that finds EXACT matches, but I need to add a module that lets the substring search accept partial matches. Namely, a mismatch in one or two of the substring's characters would be acceptable. The list of string A's (every 7-letter string over the letters a, t, g, c, so 4^7 = 16,384 different ones) has difficulties with highly diverse libraries.
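To make the requirement concrete, here is a minimal sketch of the filter described above. It assumes a "partial match" means a same-length window of a library string that differs from string A in at most a given number of positions. The function names and the toy data are hypothetical, and the plain-Python loops are meant to illustrate the idea rather than to scale to a large library.

```python
def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def occurs_within(candidate, library, max_mismatches):
    """True if any window of any library string is within max_mismatches of candidate."""
    n = len(candidate)
    for entry in library:
        # Slide a window of the candidate's length across the library string
        for start in range(len(entry) - n + 1):
            if mismatches(candidate, entry[start:start + n]) <= max_mismatches:
                return True
    return False


def absent_candidates(candidates, library, max_mismatches=1):
    """Candidates that do not occur in the library, even approximately."""
    return [c for c in candidates
            if not occurs_within(c, library, max_mismatches)]


# Hypothetical toy data, not the real library
library = ["aaaaaaaaaa", "aaatgatgagaaa"]
candidates = ["tgatgag", "tgatgac", "ccccccc"]

print(absent_candidates(candidates, library, max_mismatches=0))
# ['tgatgac', 'ccccccc']
print(absent_candidates(candidates, library, max_mismatches=1))
# ['ccccccc']
```

With `max_mismatches=0` this reduces to the exact-match filter; raising it to 1 or 2 also discards strings that nearly occur in the library.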
My initial thought was to use regex and perhaps a Hamming distance algorithm to find all those partial matches. This first attempt is supposed to put a "?" wildcard into each position of the string A in question (1-7) in turn, but I can only get it into the first position. The wildcard would then let me search for partial matches of that particular string A. If this is the wrong way to approach the problem, I'd gladly change it up. I used fnmatch as per a suggestion on another question. This is what I have so far:
from Bio import SeqIO
import fnmatch
import random
import itertools
#Define a splitting string algorithm
def split_by_n(seq,n):
    while seq:
        yield seq[:n]
        seq = seq[n:]
#Import all combinations/permutations from fasta file, 4^7
my_combinations = []
fasta_sequences = SeqIO.parse(open("Combinations/base_combinations_7.fasta"),'fasta')
for fasta in fasta_sequences:
    name, sequence = fasta.id, str(fasta.seq)
    x = sequence.lower()
    my_combinations.append(x)
primer = "tgatgag"
final = []
#List to make wildcard permutations
wildCard = ['?']
i = list(split_by_n(primer, 1))
for letter in i:
    wildCard.append(letter)
del wildCard[1]
final.append(''.join(wildCard))
#Search for wildcard permutation
for entry in final:
    filtered = fnmatch.filter(my_combinations, entry)
This is my desired output:
primer = "tgatgag"
['?', 'g', 'a', 't', 'g', 'a', 'g']
['t', '?', 'a', 't', 'g', 'a', 'g']
['t', 'g', '?', 't', 'g', 'a', 'g']
['t', 'g', 'a', '?', 'g', 'a', 'g']
['t', 'g', 'a', 't', '?', 'a', 'g']
['t', 'g', 'a', 't', 'g', '?', 'g']
['t', 'g', 'a', 't', 'g', 'a', '?']
['agatgag', 'tgatgag', 'cgatgag', 'ggatgag']
['taatgag', 'ttatgag', 'tcatgag', 'tgatgag']
['tgatgag', 'tgttgag', 'tgctgag', 'tggtgag']
['tgaagag', 'tgatgag', 'tgacgag', 'tgaggag']
['tgataag', 'tgattag', 'tgatcag', 'tgatgag']
['tgatgag', 'tgatgtg', 'tgatgcg', 'tgatggg']
['tgatgaa', 'tgatgat', 'tgatgac', 'tgatgag']
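One way the output above could be produced is sketched below, assuming the goal is one pattern per position with that single character replaced by "?". The helper name `wildcard_patterns` is hypothetical, and `my_combinations` is rebuilt here with `itertools.product` as a stand-in for the list read from the FASTA file, so the snippet runs on its own.

```python
import fnmatch
from itertools import product


def wildcard_patterns(primer, wildcard="?"):
    """One pattern per position, with that position replaced by the wildcard."""
    return [primer[:i] + wildcard + primer[i + 1:] for i in range(len(primer))]


# Stand-in for the FASTA contents (assumption): all 4^7 lowercase 7-mers
my_combinations = ["".join(p) for p in product("atcg", repeat=7)]

primer = "tgatgag"
for pattern in wildcard_patterns(primer):
    # "?" in an fnmatch pattern matches exactly one character
    filtered = fnmatch.filter(my_combinations, pattern)
    print(pattern, filtered)
```

Slicing the string directly avoids building and mutating a list of characters. For two mismatches, the same idea could be extended by looping over pairs of positions from `itertools.combinations(range(len(primer)), 2)`.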