c# - Listing duplicate files inside a folder in C#: making use of LINQ.AsParallel

Question

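I have written the following algorithm in C# to recursively list the files in a folder.

Start iterating over the list of files in the directory and its subdirectories.
Store each file name and path in a list.
If the current file matches any other file in the list, mark both files as duplicates.
Fetch all files marked as duplicates from the list.
Group them by name and return them.

On a folder containing 50,000 files and 12,000 subdirectories this runs very slowly, since the disk reads are basically the time-consuming part. Even LINQ's AsParallel() does not help.

Implementation: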

class FileTuple
{
    public string FileName { set; get; }
    public string ContainingFolder { set; get; }
    public bool HasDuplicate { set; get; }
    public override bool Equals(object obj)
    {
        if (this.FileName == (obj as FileTuple).FileName)
            return true;
        return false;
    }
}

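The FileTuple class keeps track of the file name and the containing directory; the flag tracks the duplicate status.
I have overridden the Equals method so that only file names are compared within the fileTuples collection.

The following method finds the duplicate files and returns them as a list.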

    private List<FileTuple> FindDuplicates()
    {
        List<FileTuple> fileTuples = new List<FileTuple>();
        //Read all files from the given path
        List<string> enumeratedFiles = Directory.EnumerateFiles(txtFolderPath.Text, "*.*", SearchOption.AllDirectories).Where(str => str.Contains(".exe") || str.Contains(".zip")).AsParallel().ToList();
        foreach (string filePath in enumeratedFiles)
        {
            var name = Path.GetFileName(filePath);
            var folder = Path.GetDirectoryName(filePath);
            var currentFile = new FileTuple { FileName = name, ContainingFolder = folder, HasDuplicate = false, };

            int foundIndex = fileTuples.IndexOf(currentFile);
            //mark both files as duplicates if a match is found in the list
            //(IndexOf returns only the first match, which is enough to flag both)
            if (foundIndex != -1)
            {
                currentFile.HasDuplicate = true;                    
                fileTuples[foundIndex].HasDuplicate = true;
            }
            //keep track of the files visited so far
            fileTuples.Add(currentFile);
        }

        List<FileTuple> duplicateFiles = fileTuples.Where(fileTuple => fileTuple.HasDuplicate).Select(fileTuple => fileTuple).OrderBy(fileTuple => fileTuple.FileName).AsParallel().ToList();
        return duplicateFiles;
    }

Could you suggest a way to improve the performance?

Thanks for your help.

score 3 · Accepted Answer

Could you suggest a way to improve the performance?

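One obvious improvement would be to use a Dictionary<FileTuple, FileTuple> as well as a List<FileTuple>. That way you avoid an O(N) IndexOf operation on every check. Note that you would also need to override GetHashCode(); you should already be getting a compiler warning about that.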
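As a rough sketch of the dictionary idea, assuming C# 7 or later for the pattern-matching syntax: FileTuple below repeats the question's class with GetHashCode added, and MarkDuplicates is a hypothetical helper (not from the question) that takes the paths as a parameter instead of reading txtFolderPath.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Sketch only: mirrors the question's FileTuple, plus GetHashCode so that
// instances can be used as dictionary keys.
public sealed class FileTuple
{
    public string FileName { get; set; }
    public string ContainingFolder { get; set; }
    public bool HasDuplicate { get; set; }

    // Two tuples are equal when their file names match (case-sensitive).
    public override bool Equals(object obj)
    {
        return obj is FileTuple other && FileName == other.FileName;
    }

    // Must agree with Equals: equal file names give equal hash codes.
    public override int GetHashCode()
    {
        return FileName == null ? 0 : FileName.GetHashCode();
    }
}

public static class DuplicateFinder
{
    // MarkDuplicates is a hypothetical name used for illustration.
    public static List<FileTuple> MarkDuplicates(IEnumerable<string> filePaths)
    {
        var all = new List<FileTuple>();
        // Maps each file name (via FileTuple equality) to the first tuple seen.
        var firstSeen = new Dictionary<FileTuple, FileTuple>();

        foreach (string filePath in filePaths)
        {
            var current = new FileTuple
            {
                FileName = Path.GetFileName(filePath),
                ContainingFolder = Path.GetDirectoryName(filePath)
            };

            // Hash lookup (O(1) on average) instead of the O(N) List.IndexOf scan.
            if (firstSeen.TryGetValue(current, out FileTuple earlier))
            {
                earlier.HasDuplicate = true;
                current.HasDuplicate = true;
            }
            else
            {
                firstSeen.Add(current, current);
            }

            all.Add(current);
        }

        // Keep only the entries that were flagged as duplicates.
        return all.FindAll(tuple => tuple.HasDuplicate);
    }
}
```

It could be called as DuplicateFinder.MarkDuplicates(Directory.EnumerateFiles(folder, "*.*", SearchOption.AllDirectories)), where folder stands in for txtFolderPath.Text.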

I doubt that change will make much difference, though; I'd expect this to be mostly IO-bound.

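Additionally, I doubt that the filtering and ordering at the end will be a significant bottleneck, so using AsParallel in that last step is unlikely to do much. Of course, you should measure all of this.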
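One way to measure this, as a minimal sketch rather than a proper benchmark: time the directory enumeration and the in-memory grouping separately with Stopwatch. DuplicateTiming, Time and Report are hypothetical names, and rootFolder stands in for txtFolderPath.Text.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;

public static class DuplicateTiming
{
    // Runs one phase and returns the elapsed wall-clock time.
    public static TimeSpan Time(Action phase)
    {
        Stopwatch stopwatch = Stopwatch.StartNew();
        phase();
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }

    // Prints how long the disk enumeration and the in-memory grouping take.
    public static void Report(string rootFolder)
    {
        List<string> paths = null;
        TimeSpan enumeration = Time(() =>
            paths = Directory
                .EnumerateFiles(rootFolder, "*.*", SearchOption.AllDirectories)
                .ToList());

        int duplicatedNames = 0;
        TimeSpan grouping = Time(() =>
            duplicatedNames = paths
                .GroupBy(path => Path.GetFileName(path))
                .Count(group => group.Count() > 1));

        Console.WriteLine("Enumeration: {0} ms for {1} files",
                          enumeration.TotalMilliseconds, paths.Count);
        Console.WriteLine("Grouping:    {0} ms, {1} duplicated names",
                          grouping.TotalMilliseconds, duplicatedNames);
    }
}
```

The operating system caches directory data, so a second run over the same folder is often much faster than the first. Comparing a first run against a repeated one gives a rough idea of how much of the time is disk IO.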

Finally, the whole method can be made considerably simpler, without even needing the HasDuplicate flag or any overriding of Equals/GetHashCode:

// Requires: using System.Collections.Generic; using System.IO; using System.Linq;
private List<FileTuple> FindDuplicates()
{
    return Directory.EnumerateFiles(txtFolderPath.Text, "*.*", 
                                    SearchOption.AllDirectories)
                    .Where(str => str.Contains(".exe") || 
                                  str.Contains(".zip"))
                    .Select(str => new FileTuple { 
                               FileName = Path.GetFileName(str),
                               ContainingFolder = Path.GetDirectoryName(str)
                            })
                    .GroupBy(tuple => tuple.FileName)
                    .Where(g => g.Count() > 1) // Only keep duplicates
                    .OrderBy(g => g.Key)       // Order by filename
                    .SelectMany(g => g)        // Flatten groups
                    .ToList();
}
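To try the grouping pipeline without a form or a real directory tree, here is a hedged variant that takes the paths as a parameter. It also swaps str.Contains for a Path.GetExtension check, which is my assumption about the intent: Contains(".exe") also matches a name such as "a.exe.config". FileEntry and GroupingDuplicateFinder are hypothetical names, with FileEntry standing in for a FileTuple without the flag.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical stand-in for FileTuple: no HasDuplicate flag, no Equals override.
public sealed class FileEntry
{
    public string FileName { get; set; }
    public string ContainingFolder { get; set; }
}

public static class GroupingDuplicateFinder
{
    // Extensions of interest, compared case-insensitively (an assumption).
    private static readonly HashSet<string> Extensions =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { ".exe", ".zip" };

    public static List<FileEntry> FindDuplicates(IEnumerable<string> paths)
    {
        return paths
            .Where(path => Extensions.Contains(Path.GetExtension(path)))
            .Select(path => new FileEntry
            {
                FileName = Path.GetFileName(path),
                ContainingFolder = Path.GetDirectoryName(path)
            })
            .GroupBy(entry => entry.FileName)
            .Where(group => group.Count() > 1) // Only keep names seen more than once
            .OrderBy(group => group.Key)       // Order by file name
            .SelectMany(group => group)        // Flatten the groups
            .ToList();
    }
}
```

Against a real folder the call would be GroupingDuplicateFinder.FindDuplicates(Directory.EnumerateFiles(folder, "*.*", SearchOption.AllDirectories)), where folder stands in for txtFolderPath.Text.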

score 1 · Accepted Answer

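If performance is critical, I'd suggest using the third-party library from http://www.voidtools.com/download.php. Try downloading the tool and running a few queries; it is lightning fast. It works by building an index of the files and directories of the whole file system on its first run. The index is built quickly, in under a minute, and takes up some space both in memory and on disk, but queries after that are very fast. You can look at their C# sample to see how to use it from your code.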
