这是一个非常有效的解决方案——我认为它大约是 O(n),但这取决于添加和删除的分布。内存消耗非常低,但也取决于连续添加和删除的数量。
限制:
- 该算法不会使补丁行保持与原始文件中相同的顺序;如果这很关键,您可以执行类似使用 Dictionary<string,int> 的操作,其中键是行,值是原始行号,而不是使用 HashSet<string> 来跟踪添加和删除的行。
- 目标(“新”)文件必须与源(“旧”)文件有些相似。具体来说,所有未更改的行在源和目标中的顺序必须相同。如果不满足此条件,则算法将表现不佳。
- 每条线就其附近的线而言必须是唯一的,其中“附近”表示在源和目标之间不变的最近线之间。如果不满足此条件,算法将错过更改。
- 此实现不考虑修改的行。我认为您可以通过将 == 比较替换为用于检测两行是“相同”行的任何操作来添加该功能,然后如果它们是具有内容更改的“相同”行,则将它们写到补丁中。
该算法使用一对“添加的”和“删除的”缓冲区来跟踪在文件中运行时可能添加和删除的行。当文件之间的行不匹配时,它们会暂时标记为“添加”或“删除”。当在其中一个文件中找到临时标记的行时(如果在目标文件中找到“删除”行,或者在源文件中找到“添加”行),这是一个信号,表明所有行另一个缓冲区确实属于那里,因此另一个缓冲区被刷新到补丁文件,然后阅读器在文件中找到匹配行的地方前进一行。
例如:
源目标已添加已删除
A--------A _ _
B--------X +X +B
C-------B 同花顺 X -B
D--\ \-C _ _
E-\ \---E +E +D
F \----F -E 齐平 D
这是代码:
public void Diff(
string sourcePath,
string targetPath,
string patchPath,
string addedSuffix,
string removedSuffix)
{
using(var sourceReader = new StreamReader(sourcePath))
using(var targetReader = new StreamReader(targetPath))
using(var patchWriter = new StreamWriter(patchPath, append:false))
{
var sourceLine = sourceReader.ReadLine();
var targetLine = targetReader.ReadLine();
var added = new HashSet<string>();
var removed = new HashSet<string>();
do{
if(sourceLine == targetLine)
{
sourceLine = sourceReader.ReadLine();
targetLine = targetReader.ReadLine();
}
else
{
if(removed.Contains(targetLine))
{
// Found targetLine in tentatively removed lines, so it wasn't actually removed.
removed.Remove(targetLine);
// Since we found something we thought had been removed, we know that all tentatively added lines actually are new.
Flush(patchWriter, added, addedSuffix);
added.Clear();
targetLine = targetReader.ReadLine();
}
else if(added.Contains(sourceLine))
{
// Found sourceLine in tentatively added lines, so it wasn't actually added.
added.Remove(sourceLine);
// We found something we thought had been added, so all candidates for removal should actually be removed.
Flush(patchWriter,removed, removedSuffix);
removed.Clear();
sourceLine = sourceReader.ReadLine();
}
else
{
// Source and target don't match, so we assume that the source was removed and the target was added.
// If we're wrong, we'll clean it up when we come across the line later on.
removed.Add(sourceLine);
added.Add(targetLine);
sourceLine = sourceReader.ReadLine();
targetLine = targetReader.ReadLine();
}
}
} while(sourceLine != null || targetLine != null);
Flush(patchWriter, added, addedSuffix);
Flush(patchWriter, removed, removedSuffix);
}
}
public void Flush(StreamWriter writer, IEnumerable<string> lines, string suffix)
{
foreach (var line in lines.Where (l => l != null))
{
writer.WriteLine("{0} {1}", line.Trim(), suffix);
}
}
这是我用来生成测试文件的一些代码:
var path = /* path */;
var sourcePath = Path.Combine(path, "source.txt");
var targetPath = Path.Combine(path, "target.txt");
var expectedPath = Path.Combine(path, "expected.txt");
var rnd = new Random(10);
using(var sourceWriter = new StreamWriter(sourcePath))
using(var targetWriter = new StreamWriter(targetPath))
using(var expectedWriter = new StreamWriter(expectedPath))
{
var limit = 10.0 * 100000;
for (int i = 0; i < limit; i++)
{
if(i % 10000 == 0) Console.Write("{0:P0} ...", i / limit);
var guid = Guid.NewGuid().ToString();
var r = rnd.Next(0,10);
var removed = 3;
var added = 6;
if(r >= 0 && r < removed)
{
sourceWriter.WriteLine(guid);
expectedWriter.WriteLine(guid + " 0");
}
else if(r >= removed && r < added)
{
targetWriter.WriteLine(guid);
expectedWriter.WriteLine(guid + " 1");
}
else if(r >= added)
{
sourceWriter.WriteLine(guid);
targetWriter.WriteLine(guid);
}
}
}
看到任何错误或问题?这是你要找的吗?