asp.net - Linq“包含”查询时间过长

Question

我有这个查询：

var newComponents = from ic in importedComponents
                    where !existingComponents.Contains(ic)
                    select ic;

importedComponents并且existingComponents属于类型List<ImportedComponent>，并且仅存在于内存中（不依赖于数据上下文）。在本例中，importedComponents只有 6,100 多个项目，并且existingComponents有 511 个项目。

这个语句完成的时间太长（我不知道多长时间，我在20 分钟后停止脚本）。我尝试了以下方法，但执行速度没有提高：

var existingComponentIDs = from ec in existingComponents
                           select ec.ID;

var newComponents = from ic in importedComponents
                    where !existingComponentIDs.Contains(ic.ID)
                    select ic;

任何帮助都感激不尽。

score 5 · Accepted Answer

问题是该算法的二次复杂度。将所有现有组件ID 的ID 放入一个HashSet 并使用HashSet.Contains 方法。与列表中包含/任何内容的 O(N) 相比，它的查找成本为 O(1)。

morelinq项目包含一种方法，可以在一个方便的步骤中完成所有这些操作：ExceptBy 。

score 4 · Accepted Answer

您可以使用Except来获得设定的差异：

var existingComponentIDs = existingComponents.Select(c => c.ID); 
var importedComponentIDs = importedComponents.Select(c => c.ID);
var newComponentIDs = importedComponentIDs.Except(existingComponentIDs);
var newComponents = from ic in importedComponents
        join newID in newComponentIDs on ic.ID equals newID
        select ic;
foreach (var c in newComponents)
{ 
    // insert into database?
}

为什么 LINQ JOIN 比使用 WHERE 链接快得多？

简而言之：Join 方法可以设置一个哈希表作为索引来快速将两个表压缩在一起

score 0 · Accepted Answer

好吧，根据您提供的逻辑和数字，这意味着您在运行该语句时基本上是在执行 3117100 比较。显然，这并不完全准确，因为您的条件可能在遍历整个数组之前已经满足，但您明白我的意思。

对于这么大的集合，您将希望使用一个集合，您可以在其中索引您的键（在本例中为您的组件 ID），以帮助减少搜索的开销。要记住的是，即使 LINQ 看起来像 SQL，这里也没有神奇的索引。主要是为了方便。事实上，我看过一些文章，其中链接查找实际上比蛮力查找慢一点。

编辑：如果可能的话，我建议您尝试使用 Dictionary 或 SortedList 作为您的值。我相信任何一个都会有更好的查找性能。

asp.net - Linq“包含”查询时间过长

3 回答 3

Related

Reference