algorithm - Similarity of two ORDERED sets

Question

I have an ordered set:

[Doc1, Doc2, Doc3, Doc4, Doc5]

where Doc1 ranks ahead of Doc2 (imagine a search query, where this ordered set is the list of results).

Now, suppose I have a second ordered set:

[Doc1, Doc2, Doc3, Doc5, Doc4]

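I need a way to quantify this difference as a numeric score. It must also take weight into account, so that [Doc1, Doc2, Doc3, Doc5, Doc4] is closer to the original set than [Doc2, Doc1, Doc3, Doc4, Doc5] is, because in the latter the difference occurs near the top.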

I have considered the Levenshtein distance, but I can't see how to take the order into account.

score 1 · Accepted Answer

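According to Wikipedia, the Levenshtein distance can be computed with the following pseudocode.

int LevenshteinDistance(string s, string t)
{
  int len_s = length(s), len_t = length(t), cost = 0;
  // Base cases first, so s[0] and t[0] are never read from an empty string.
  if (len_s == 0)
    return len_t;
  if (len_t == 0)
    return len_s;
  // Matching leading characters cost nothing; a substitution costs 1.
  if (s[0] != t[0])
    cost = 1;
  // s[1..len_s] denotes s without its first character.
  return minimum(
      LevenshteinDistance(s[1..len_s], t) + 1,               // deletion
      LevenshteinDistance(s, t[1..len_t]) + 1,               // insertion
      LevenshteinDistance(s[1..len_s], t[1..len_t]) + cost); // match or substitution
}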

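If I understand your requirement correctly, you want differences at the start of the set to matter more than differences at the end. Let's adjust this recursive function to reflect that.

float LevenshteinDistance(string s, string t, float decay)
{
  int len_s = length(s), len_t = length(t), cost = 0;
  // Base cases first, so s[0] and t[0] are never read from an empty string.
  if (len_s == 0)
    return len_t;
  if (len_t == 0)
    return len_s;
  if (s[0] != t[0])
    cost = 1;
  // Every level of recursion scales the remaining cost by decay.
  return decay * minimum(
      LevenshteinDistance(s[1..len_s], t, decay) + 1,
      LevenshteinDistance(s, t[1..len_t], decay) + 1,
      LevenshteinDistance(s[1..len_s], t[1..len_t], decay) + cost);
}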

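When decay is a parameter in the interval (0,1), differences at larger indices become less significant than differences at earlier ones. Each recursive call multiplies its result by decay, so a cost incurred deeper in the strings is scaled by a higher power of decay.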
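For anyone who wants to experiment with this, here is a minimal Python sketch of the decayed recursion. It is not part of the pseudocode above: the name decayed_levenshtein, the default decay value, and the use of memoization over index pairs (to avoid the exponential blow-up of the naive recursion) are my own assumptions. It accepts any two sequences of hashable items, so it works on lists of documents as well as on strings.

```python
from functools import lru_cache


def decayed_levenshtein(s, t, decay=0.9):
    """Levenshtein-style distance in which later differences count less.

    Hypothetical helper: same recurrence as the pseudocode, but expressed
    over start offsets (i, j) so that results can be cached.
    """
    s, t = tuple(s), tuple(t)

    @lru_cache(maxsize=None)
    def dist(i, j):
        # Base cases: one side is exhausted, return what is left of the other.
        if i == len(s):
            return float(len(t) - j)
        if j == len(t):
            return float(len(s) - i)
        cost = 0 if s[i] == t[j] else 1
        # Each level of recursion scales the remaining cost by decay.
        return decay * min(
            dist(i + 1, j) + 1,         # deletion
            dist(i, j + 1) + 1,         # insertion
            dist(i + 1, j + 1) + cost,  # match or substitution
        )

    return dist(0, 0)


if __name__ == "__main__":
    original = ["Doc1", "Doc2", "Doc3", "Doc4", "Doc5"]
    tail_swap = ["Doc1", "Doc2", "Doc3", "Doc5", "Doc4"]
    head_swap = ["Doc2", "Doc1", "Doc3", "Doc4", "Doc5"]
    print(decayed_levenshtein(original, tail_swap))
    print(decayed_levenshtein(original, head_swap))
```

With decay set to 1.0 this reduces to the ordinary Levenshtein distance. With decay below 1, the swap at the end of the list scores lower (closer to the original) than the swap at the start, which is the behaviour the question asks for.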

Here is an example with decay=0.9.

s       t       dist
"1234"  "1234"  0.0000
"1234"  "1243"  1.3851
"1234"  "2134"  1.6290
