c# - Find the differences between 2 collections of 40K objects each

Question

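I have 2 collections that contain objects of the same type, and each collection holds roughly 40K objects.

The class of the objects in each collection is basically like a dictionary entry, except that I have overridden the Equals and hash functions: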

public class MyClass: IEquatable<MyClass>
{
    public int ID { get; set; }
    public string Name { get; set; }

    public override bool Equals(object obj)
    {
        return obj is MyClass && this.Equals((MyClass)obj);
    }

    public bool Equals(MyClass ot)
    {
        if (ReferenceEquals(this, ot))
        {
            return true;
        }

        return 
         ot.ID.Equals(this.ID) &&
         string.Equals(ot.Name, this.Name, StringComparison.OrdinalIgnoreCase); 
    }

    public override int GetHashCode()
    {
         unchecked
         {
             int result = this.ID.GetHashCode();
             result = (result * 397) ^ this.Name.GetSafeHashCode();
             return result;
         }
    }
}

The code I use to compare the collections and get the differences is just a simple LINQ query using PLINQ.

ParallelQuery p1Coll = sourceColl.AsParallel();
ParallelQuery p2Coll = destColl.AsParallel();

List<object> diffs = p2Coll.Where(r => !p1Coll.Any(m => m.Equals(r))).ToList();

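Does anyone know a faster way to compare this many objects? It currently takes about 40 seconds +/- 2 seconds on a quad-core machine. Would it be faster to group the data first and then compare each group in parallel? If I group the data by Name first I end up with about 490 unique objects, and if I group it by ID first I end up with about 622 unique objects.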

score 15 · Accepted Answer

You can use the Except method, which gives you every item from p2Coll that is not in p1Coll.

var diff = p2Coll.Except(p1Coll);
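Except finds matches through a hash-based lookup, so it only gives correct results when GetHashCode agrees with Equals. The question's Equals compares Name ignoring case, so the hash has to ignore case as well. The following is a minimal sketch under that assumption: Item and ExceptDemo are hypothetical names used for illustration, and the null-safe hashing stands in for the question's GetSafeHashCode helper, whose behavior is not shown in the post.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the question's MyClass.
public sealed class Item : IEquatable<Item>
{
    public int ID { get; set; }
    public string Name { get; set; }

    public bool Equals(Item other)
    {
        if (other is null) return false;
        return ID == other.ID &&
               string.Equals(Name, other.Name, StringComparison.OrdinalIgnoreCase);
    }

    public override bool Equals(object obj) => Equals(obj as Item);

    public override int GetHashCode()
    {
        unchecked
        {
            // Hash Name with the same case-insensitive rule that Equals uses,
            // so that equal objects always produce equal hash codes.
            int nameHash = Name == null ? 0 : StringComparer.OrdinalIgnoreCase.GetHashCode(Name);
            return (ID * 397) ^ nameHash;
        }
    }
}

public static class ExceptDemo
{
    // Items of dest that have no equal item in source.
    public static List<Item> Difference(IEnumerable<Item> source, IEnumerable<Item> dest) =>
        dest.Except(source).ToList();

    public static void Main()
    {
        var source = new List<Item>
        {
            new Item { ID = 1, Name = "alpha" },
            new Item { ID = 2, Name = "beta" }
        };
        var dest = new List<Item>
        {
            new Item { ID = 1, Name = "ALPHA" }, // equal to source item 1 (case ignored)
            new Item { ID = 3, Name = "gamma" }
        };

        foreach (Item item in Difference(source, dest))
            Console.WriteLine($"{item.ID} {item.Name}"); // prints "3 gamma"
    }
}
```

Note that Except also removes duplicates from its result, which the Where/Any query in the question does not.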

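Update (some performance tests):

Disclaimer:

Actual timings depend on many factors (the contents of the collections, the hardware, whatever else is running on the machine, the number of hash code collisions, and so on), which is why we reason in terms of complexity and Big O notation (see Daniel Brückner's comment).

Here are some performance statistics from 10 runs on my 4-year-old machine: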

Median time for Any(): 6973,97658ms
Median time for Except(): 9,23025ms

The source code of my test is available on gist.
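The gist is not reproduced here. As a rough illustration of why the two approaches differ so much, here is a separate minimal sketch (not the answerer's test code): the Where/Any query scans the whole source list for every destination item, which is O(n·m), while Except builds a hash set once and probes it, which is roughly O(n + m). The class name TimingSketch, the collection size, and the use of plain integers are all assumptions made to keep the example short; absolute timings will vary by machine.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

public static class TimingSketch
{
    // Quadratic: for each dest item, scan source until a match is found.
    public static List<int> DiffWithAny(List<int> source, List<int> dest) =>
        dest.Where(d => !source.Any(s => s.Equals(d))).ToList();

    // Roughly linear: source is hashed once, then each dest item is probed.
    public static List<int> DiffWithExcept(List<int> source, List<int> dest) =>
        dest.Except(source).ToList();

    public static void Main()
    {
        const int n = 5000; // smaller than the question's 40K so the slow path finishes quickly
        List<int> source = Enumerable.Range(0, n).ToList();
        List<int> dest = Enumerable.Range(n / 2, n).ToList(); // overlaps the upper half of source

        Stopwatch sw = Stopwatch.StartNew();
        int anyCount = DiffWithAny(source, dest).Count;
        sw.Stop();
        Console.WriteLine($"Any():    {anyCount} items, {sw.Elapsed.TotalMilliseconds} ms");

        sw.Restart();
        int exceptCount = DiffWithExcept(source, dest).Count;
        sw.Stop();
        Console.WriteLine($"Except(): {exceptCount} items, {sw.Elapsed.TotalMilliseconds} ms");
    }
}
```

A single run like this is only indicative; the answer's figures are medians over 10 runs.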

Update 2:

If you want the items that differ in both the first and the second collection, you have to run Except in both directions and Union the results:

var diff = p2Coll.Except(p1Coll).Union(p1Coll.Except(p2Coll));
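The two-way difference above is a symmetric difference. As a small sketch of the same idea on plain integers, the example below compares the Except/Union form with HashSet<T>.SymmetricExceptWith, a standard .NET method that modifies the set in place. SymmetricDiffDemo and its method names are hypothetical; for a custom class, both variants rely on Equals and GetHashCode being consistent.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SymmetricDiffDemo
{
    // Same shape as the answer's query: items only in second, plus items only in first.
    public static List<int> ViaExceptUnion(IEnumerable<int> first, IEnumerable<int> second)
    {
        List<int> a = first.ToList();
        List<int> b = second.ToList();
        return b.Except(a).Union(a.Except(b)).OrderBy(x => x).ToList();
    }

    // Alternative: let HashSet<T> compute the symmetric difference in place.
    public static List<int> ViaHashSet(IEnumerable<int> first, IEnumerable<int> second)
    {
        var set = new HashSet<int>(first);
        set.SymmetricExceptWith(second);
        return set.OrderBy(x => x).ToList();
    }

    public static void Main()
    {
        int[] first = { 1, 2, 3, 4 };
        int[] second = { 3, 4, 5, 6 };

        Console.WriteLine(string.Join(", ", ViaExceptUnion(first, second))); // prints "1, 2, 5, 6"
        Console.WriteLine(string.Join(", ", ViaHashSet(first, second)));     // prints "1, 2, 5, 6"
    }
}
```

The results are sorted here only to make the output easy to compare; neither approach requires sorted input.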

score 0 · Accepted Answer

Intersect

int[] id1 = { 44, 26, 92, 30, 71, 38 };
int[] id2 = { 39, 59, 83, 47, 26, 4, 30 };

IEnumerable<int> both = id1.Intersect(id2);

foreach (int id in both)
    Console.WriteLine(id);

/*
This code produces the following output:

26
30
*/
