algorithm - 比较五种不同的来源

Question

我需要编写一个函数来比较 2-5 个“文件”（实际上是 2-5 组数据库行，但概念相似），我不知道该怎么做。生成的差异应该并排显示 2-5 个文件。输出应显示添加、删除、更改和未更改的行，每个文件都有一列。

我应该使用什么算法来遍历行以保持低复杂度？每个文件的行数少于 10,000。我可能不需要外部合并，因为总数据大小在兆字节范围内。简单易读的代码当然也不错，但这不是必须的。

编辑：文件可能来自一些未知来源，没有“原始”可以与其他 1-4 个文件进行比较；所有文件都必须以某种方式与其他文件进行比较。

编辑 2：我，或者更确切地说是我的同事，意识到可以对内容进行排序，因为输出顺序无关紧要。这个解决方案意味着对应用程序的这一部分使用额外的领域知识，而且差异复杂度是 O(N) 并且代码不太复杂。这个解决方案很简单，当我关闭赏金时，我将忽略对此编辑的任何答案。但是，我将回答我自己的问题以供将来参考。

score 2 · Accepted Answer

如果必须将所有n 个文件（例如 2 <= n <= 5）与其他文件进行比较，那么在我看来，要比较的组合数将是 C( n ,2)，定义为（例如在 Python 中）为：

def C(n,k): 
    return math.factorial(n)/(math.factorial(k)*math.factorial(n-k))

因此，对于n = 2、3、4、5，您将分别进行 1、3、6 或 10 次成对比较。

然后，时间复杂度将是 C( n ,2) 乘以您选择使用的成对diff算法的复杂度，在Myers算法的情况下，这将是预期的 O(ND)，其中 N 是要比较的两个序列 A 和 B 的长度，D 是 A 和 B 的最小编辑脚本的大小。

我不确定您需要此代码的环境，但作为示例，Python 中的difflib可用于查找各种序列之间的差异 - 不仅仅是文本行 - 所以它可能对您有用。difflib文档没有确切说明它使用什么算法，但它对时间复杂度的讨论让我认为它与 Myers 的相似。

score 1 · Accepted Answer

伪代码（用于编辑 2）：

10: stored cells = <empty list>
for each column:
  if cell < stored cells:
    stored cells = cell
  elif cell == lastCell:
    stored cells += cell

if stored cells == <empty>:
  return result
result += stored cells
goto 10

score 0 · Accepted Answer

2个文件的情况可以用标准的diff算法来解决。从 3 个文件中，您可以使用“多数票”算法：

如果超过一半的记录相同：3 条中的 2 条、4 条中的 3 条、5 条中的 3 条比这些是考虑其他记录已更改的参考。

如果更改的数量相对较低，这也意味着算法的相当大的加速。

Pseudocode:
  initialize as many line indexes as there are files
  while there are still at least 3 indexes incrementable
    if all indexed records are the same 
      increment all line indexes
    else 
      if at least one is different - check majority vote
      if there is a majority
        mark minority changes, increment all line indexes
      else
        mark minority additions (maybe randomly deciding e.g. in a 2:2 vote)
        check addition or removing and set line indexes accordingly
        increment all indexes
      endif
    endif
  endwhile

algorithm - 比较五种不同的来源

3 回答 3

Related

Reference