
So the problem collection is something like:

A = {'abc', 'abc', 'abd', 'bcde', 'acbdg', ...}

Using a string metric such as Levenshtein distance, it's simple enough to measure the similarity between two strings.

However, I would like to determine, without evaluating all pairs of strings in the collection (an O(N^2) problem), some type of heuristic based on the entire collection that gives me a good idea of the overall similarity between all the strings.

The brute force approach is:

                          Sum(Metric(All Pairs in A))
CollectionSimilarity(A) = ---------------------------
                                 N*(N+1)/2
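
For reference, the brute-force formula above could be sketched in Python roughly as follows. The names `levenshtein` and `collection_similarity` are illustrative, and Levenshtein distance is only assumed as the metric. Self-pairs have distance zero, so summing over distinct pairs gives the same numerator as a sum that includes them.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def collection_similarity(strings):
    """Sum of the metric over all pairs, divided by N*(N+1)/2 as in the formula above.

    Assumes a non-empty collection. Costs O(N^2) metric evaluations.
    """
    n = len(strings)
    total = sum(levenshtein(x, y) for x, y in combinations(strings, 2))
    return total / (n * (n + 1) / 2)
```

This is the O(N^2) baseline that any cheaper heuristic would be compared against.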

Is there a way to evaluate the similarity of the entire collection of A without evaluating every pair?


2 Answers


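Since each string is a vector in some metric space (where each character is a particular coordinate), my solution is to find the distance between the set A and some point P.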

Let's look at one property of a metric, the triangle inequality:

Distance(x, y) <= Distance(x, P) + Distance(y, P)

So we can find an upper bound for Sum(Distance(All pairs in A)), namely |A| * Sum(Distance(All elements in A to point P)), where N = |A|:

  Sum(Distance(x, y))      N * Sum(Distance(x, P))     Sum(Distance(x, P))
----------------------- <= ------------------------- = ---------------------
      N*(N+1)/2                    N*(N+1)/2                 (N+1)/2

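The point P can be a random point, the centroid of the set, the empty string (the zero point), or anything else; with the centroid you get the mean radius of the set. In general, P can be any hyperplane. Either way, you end up with some kind of mean radius (or diameter) of your set.
A linear pre-transformation of the set (or of the coordinate system, which is the same thing) might also help. Alternatively, iterate several times and, on each iteration, measure the distance to a new random hyperplane.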
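
A minimal Python sketch of the bound above, assuming Levenshtein distance as the metric and the empty string as the default point P. The names `pivot_upper_bound` and `exact_average` are illustrative; `exact_average` is the O(N^2) brute force, included only to compare against the O(N) bound.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def pivot_upper_bound(strings, pivot=""):
    """N * Sum(Distance(x, P)) / (N*(N+1)/2), using only N metric evaluations.

    Assumes a non-empty collection; `pivot` plays the role of the point P.
    """
    n = len(strings)
    radius_sum = sum(levenshtein(s, pivot) for s in strings)
    return n * radius_sum / (n * (n + 1) / 2)


def exact_average(strings):
    """Brute-force value from the question, for comparison only (O(N^2))."""
    n = len(strings)
    total = sum(levenshtein(x, y) for x, y in combinations(strings, 2))
    return total / (n * (n + 1) / 2)


A = ["abc", "abc", "abd", "bcde", "acbdg"]
print(pivot_upper_bound(A, ""), pivot_upper_bound(A, "abc"), exact_average(A))
```

With the empty string as P, each distance is just the string length, so the bound can be quite loose. A pivot closer to the "middle" of the collection, such as `"abc"` here, tends to tighten it, but it remains an upper bound rather than an estimate.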

Hope this helps!

answered 2015-01-13T17:05:25.750

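You can always use an approximation, for example by sampling pairs. Depending on the size of N, the estimate should converge with about N log N samples.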
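
A rough Python sketch of pair sampling, assuming Levenshtein distance as the metric. The name `sampled_similarity` and its parameters are illustrative, and the default of roughly N log N samples simply follows the suggestion above rather than any proven guarantee. Note that it estimates the mean over distinct pairs, so it differs from the question's N*(N+1)/2 normalisation by a factor of (N-1)/(N+1).

```python
import math
import random


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def sampled_similarity(strings, num_samples=None, seed=None):
    """Estimate the mean pairwise distance from randomly sampled distinct pairs."""
    n = len(strings)
    if n < 2:
        raise ValueError("need at least two strings to form a pair")
    if num_samples is None:
        # Assumed default: about N log N samples, as suggested in the answer.
        num_samples = max(1, math.ceil(n * math.log(n)))
    rng = random.Random(seed)  # seeded generator for reproducible estimates
    total = 0
    for _ in range(num_samples):
        i, j = rng.sample(range(n), 2)  # two distinct indices
        total += levenshtein(strings[i], strings[j])
    return total / num_samples


A = ["abc", "abc", "abd", "bcde", "acbdg"]
print(sampled_similarity(A, seed=0))
```

Pairs are drawn with replacement across iterations, so the same pair can be sampled twice; for large N that should matter little.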

answered 2015-01-13T15:56:48.273