All hamming distances can be produced in O(K^2/D) using the python code below.
This is faster in some cases than the trivial code which is O(N*K).
Where N is the number of fixed length strings
K is the length of each string
and D is the size of the dictionary.
# DATABASE is a tuple of the strings
# eg. ('asdfjjajwi...', 'hsjsiei...', ...)
# SINGLE is the string you are matching
# eg. 'jfjdkaks...'
SIZE_OF_STRING = 5000
NUMBER_OF_STRINGS = 400
FIRST_K_REQUIRED = 100
def setup_index():
index = []
for x in xrange(SIZE_OF_STRING):
index_dict = {}
for y in xrange(NUMBER_OF_STRINGS):
temp = index_dict.get(DATABASE[y][x], [])
temp.append(y)
index_dict[DATABASE[y][x]] = temp
index.append(index_dict)
return index
index = setup_index()
output = []
for x in xrange(NUMBER_OF_STRINGS):
output.append([SIZE_OF_STRING, x])
for key, c in enumerate(SINGLE):
for x in index[key][c]:
output[x][0] -= 1
output.sort()
print output[:FIRST_K_REQUIRED]
This is a faster method only when SIZE_OF_STRING / DICTIONARY_SIZE < NUMBER_OF_STRINGS.
Hope this helps.
EDIT:
The complexity of the above code is incorrect.
The Hamming Distances can be produced in O(N*K/D) on average.
This is faster in ALL cases than the trivial code which is O(N*K).
Where N is the number of fixed length strings
K is the length of each string
and D is the size of the dictionary.