So the problem collection is something like:
A = {'abc', 'abc', 'abd', 'bcde', 'acbdg', ...}
Using some type of string metric like Levenshtein distance, it's simple enough to find some sort of heuristic of string similarity between 2 strings.
However, I would like to determine, without evaluating all pairs of strings in the collection (an O(N^2) problem), some type of heuristic based on the entire collection that gives me a good idea of the overall similarity between all the strings.
The brute force approach is:
Sum(Metric(All Pairs in A))
CollectionSimilarity(A) = ---------------------------
N*(N+1)/2
Is there a way to evaluate the similarity of the entire collection of A without evaluating every pair?