algorithm - Find element similarity within a collection of strings without evaluating all element pairs

Question

So the problem collection is something like:

A = {'abc', 'abc', 'abd', 'bcde', 'acbdg', ...}

Using some type of string metric like Levenshtein distance, it's simple enough to find some sort of heuristic of string similarity between 2 strings.

However, I would like to determine, without evaluating all pairs of strings in the collection (an O(N^2) problem), some type of heuristic based on the entire collection that gives me a good idea of the overall similarity between all the strings.

The brute force approach is:

                          Sum(Metric(All Pairs in A))
CollectionSimilarity(A) = ---------------------------
                                 N*(N+1)/2
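
For reference, the brute-force formula above could be sketched in Python roughly as follows. The helper names `levenshtein` and `collection_similarity_brute` are made up for illustration, and the metric is assumed to be plain Levenshtein distance. The denominator keeps the N*(N+1)/2 normalization from the formula, so self-pairs are counted and contribute a distance of 0.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # delete ca
                curr[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),   # substitute (free on a match)
            ))
        prev = curr
    return prev[-1]


def collection_similarity_brute(strings):
    """Average pairwise distance, normalized by N*(N+1)/2 as in the question."""
    n = len(strings)
    # Every unordered pair of distinct positions is evaluated: O(N^2) metric calls.
    total = sum(levenshtein(x, y) for x, y in combinations(strings, 2))
    return total / (n * (n + 1) / 2)


# Pairwise distances are 0, 1 and 1, so the result is 2 / 6.
print(collection_similarity_brute(['abc', 'abc', 'abd']))
```

This is the baseline that any cheaper heuristic would be compared against.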

Is there a way to evaluate the similarity of the entire collection of A without evaluating every pair?

score 0 · Accepted Answer

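Since each string is a vector in some metric space (where each character is a particular coordinate), my solution is to find the distance between the set A and some point P.

Let's look at one property of a metric, the triangle inequality: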

Distance(x, y) <= Distance(x, P) + Distance(y, P)

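Summing this over all pairs, each element appears in N - 1 pairs, so Sum(Distance(All pairs in A)) is at most (N - 1) * Sum(Distance(All elements in A to point P)). That gives us the slightly looser upper bound |A| * Sum(Distance(All elements in A to point P)), with N = |A|: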

  Sum(Distance(x, y))      N * Sum(Distance(x, P))     Sum(Distance(x, P))
---------------------- <= ------------------------- = ---------------------
     N*(N+1)/2                   N*(N+1)/2                  (N+1)/2

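The point P can be a random point, the centroid of the set (in which case you get the mean radius of the set), the empty string (the zero point), or anything else. In general, P can be any hyperplane. Either way, you end up with some kind of mean radius (or diameter) of your set.
A linear pre-transformation (of the set or of the coordinate system, which amounts to the same thing) might also help. Or iterate several times, and on each iteration find the distance to a new random hyperplane.

Hope this helps!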
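
A minimal sketch of this bound, assuming Levenshtein distance as the metric and an ordinary string as the point P. The names `levenshtein` and `pivot_upper_bound` are hypothetical, not from any library. It needs only N distance evaluations instead of one per pair.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # delete ca
                curr[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),   # substitute (free on a match)
            ))
        prev = curr
    return prev[-1]


def pivot_upper_bound(strings, pivot=""):
    """Upper bound on the average pairwise distance: Sum(Distance(x, P)) / ((N+1)/2)."""
    n = len(strings)
    # One distance per element to the pivot P: O(N) metric calls.
    radius_sum = sum(levenshtein(x, pivot) for x in strings)
    return radius_sum / ((n + 1) / 2)


strings = ['abc', 'abc', 'abd']
print(pivot_upper_bound(strings, pivot=""))     # pivot = empty string
print(pivot_upper_bound(strings, pivot="abc"))  # pivot = a member of the set
```

With the empty string as P, each distance is just the string's length, so the bound is cheap but can be very loose. A pivot close to the bulk of the collection tends to give a tighter value, though it remains an upper bound and not an estimate of the true average.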

score 0 · Accepted Answer

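You can always use some approximation, such as sampling pairs. Depending on how large N is, the value should converge with on the order of N log N samples.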
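
A rough sketch of pair sampling, again assuming Levenshtein distance. The function name `sampled_similarity`, the `seed` parameter, and the default of ceil(N * ln N) samples are illustrative choices; the right sample count for a given collection is not something this sketch establishes. Sampled pairs are distinct positions, so the mean is rescaled by (N-1)/(N+1) to match the N*(N+1)/2 normalization used in the question.

```python
import math
import random


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # delete ca
                curr[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),   # substitute (free on a match)
            ))
        prev = curr
    return prev[-1]


def sampled_similarity(strings, num_samples=None, seed=None):
    """Estimate the question's CollectionSimilarity from random pairs."""
    n = len(strings)
    if n < 2:
        return 0.0
    if num_samples is None:
        # Assumed default: about N log N samples, as suggested in the answer.
        num_samples = max(1, math.ceil(n * math.log(n)))
    rng = random.Random(seed)
    total = 0
    for _ in range(num_samples):
        i, j = rng.sample(range(n), 2)  # two distinct positions, uniformly at random
        total += levenshtein(strings[i], strings[j])
    mean_over_distinct_pairs = total / num_samples
    # Convert from the N*(N-1)/2 normalization to the question's N*(N+1)/2.
    return mean_over_distinct_pairs * (n - 1) / (n + 1)


print(sampled_similarity(['abc', 'abc', 'abd', 'bcde', 'acbdg'], seed=0))
```

Unlike the brute-force sum, the result varies from run to run unless `seed` is fixed, and the estimate gets noisier as the spread of pairwise distances grows.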
