c# - 相交和任何或包含和任何。找到至少一个共同元素哪个更有效？

Question

如果我有两个列表并且我想知道是否至少有一个共同元素，我有这两个选项：

lst1.Intersect(lst2).Any();

Lst1.Any(x => lst2.Contains(x));

这两个选项给了我我期望的结果，但是我不知道什么是最好的选择。哪个更有效率？为什么？

谢谢。

编辑：当我创建这篇文章时，除了解决方案之外，我还在寻找原因。我知道我可以运行测试，但我不知道结果的原因。一个比另一个快？一种解决方案总是比另一种更好吗？

因此，出于这个原因，我接受了 Matthew 的答案，不仅是为了测试代码，而且他解释了什么时候比另一个更好以及为什么。我也非常感谢 Nicholas 和 Oren 的贡献。

谢谢。

score 18 · Accepted Answer

Oren 的回答在使用秒表的方式上存在错误。在测量所用时间后，它不会在循环结束时重置Any()。

请注意它如何回到循环的开始，而秒表从不存在，Reset()因此添加到的时间intersect 包括Any().

以下是修正版。

在任何调试器之外运行的发布版本会在我的 PC 上给出以下结果：

Intersect:    1ms
Any:       6743ms

请注意我如何为此测试制作两个不重叠的字符串列表。另请注意，这是一个最坏情况测试。

如果有许多交叉点（或恰好在数据开始附近发生的交叉点），那么 Oren 说Any()应该更快是完全正确的。

如果真实数据通常包含交叉点，那么最好使用Any(). 否则，使用Intersect(). 它非常依赖数据。

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

namespace Demo
{
    class Program
    {
        void run()
        {
            double intersect = 0;
            double any = 0;

            Stopwatch stopWatch = new Stopwatch();

            List<string> L1 = Enumerable.Range(0, 10000).Select(x => x.ToString()).ToList();
            List<string> L2 = Enumerable.Range(10000, 10000).Select(x => x.ToString()).ToList();

            for (int i = 0; i < 10; i++)
            {
                stopWatch.Restart();
                Intersect(L1, L2);
                stopWatch.Stop();
                intersect += stopWatch.ElapsedMilliseconds;

                stopWatch.Restart();
                Any(L1, L2);
                stopWatch.Stop();
                any += stopWatch.ElapsedMilliseconds;
            }

            Console.WriteLine("Intersect: " + intersect + "ms");
            Console.WriteLine("Any: " + any + "ms");
        }

        private static bool Any(List<string> lst1, List<string> lst2)
        {
            return lst1.Any(lst2.Contains);
        }

        private static bool Intersect(List<string> lst1, List<string> lst2)
        {
            return lst1.Intersect(lst2).Any();
        }            

        static void Main()
        {
            new Program().run();
        }
    }
}

出于比较目的，我编写了自己的测试比较int序列：

intersect took 00:00:00.0065928
any took       00:00:08.6706195

编码：

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

namespace Demo
{
    class Program
    {
        void run()
        {
            var lst1 = Enumerable.Range(0, 10000);
            var lst2 = Enumerable.Range(10000, 10000);

            int count = 10;

            DemoUtil.Time(() => lst1.Intersect(lst2).Any(), "intersect", count);
            DemoUtil.Time(() => lst1.Any(lst2.Contains),     "any",      count);
        }

        static void Main()
        {
            new Program().run();
        }
    }

    static class DemoUtil
    {
        public static void Print(this object self)
        {
            Console.WriteLine(self);
        }

        public static void Print(this string self)
        {
            Console.WriteLine(self);
        }

        public static void Print<T>(this IEnumerable<T> self)
        {
            foreach (var item in self)
                Console.WriteLine(item);
        }

        public static void Time(Action action, string title, int count)
        {
            var sw = Stopwatch.StartNew();

            for (int i = 0; i < count; ++i)
                action();

            (title + " took " + sw.Elapsed).Print();
        }
    }
}

如果我还通过将列表更改为此并将其增加到 10000来为重叠范围计时：count

var lst1 = Enumerable.Range(10000, 10000);
var lst2 = Enumerable.Range(10000, 10000);

我得到这些结果：

intersect took 00:00:03.2607476
any took 00:00:00.0019170

在这种情况下Any()显然要快得多。

结论

最坏情况下的性能对来说非常糟糕Any()但可以接受Intersect()。最佳情况下的性能Any()对Intersect(). （最好的情况Any()可能是最坏的情况Intersect()！）

该Any()方法在最坏情况下为 O(N^2)，在最佳情况下为 O(1)。该Intersect()方法总是 O(N) （因为它使用散列，而不是排序，否则它将是 O(N(Log(N))）。

您还必须考虑内存使用情况：该Intersect()方法需要获取其中一个输入的副本，Any()而不需要。

因此，要做出最佳决策，您确实需要了解真实数据的特征，并实际执行测试。

如果您真的不希望Any()在最坏的情况下变成 O(N^2)，那么您应该使用Intersect(). 但是，您最好使用Any().

当然，大多数时候这些都不重要！

除非您发现这部分代码是一个瓶颈，否则这只是学术兴趣。如果没有问题，您不应该在这种分析上浪费时间。:)

score 3 · Accepted Answer

这取决于您的 IEnumerables 的实现。

您的第一次尝试 ( Intersect/ Any) 会找到所有匹配项，然后确定该集合是否为空。从文档来看，这看起来类似于 O(n) 操作：

当枚举此方法返回的对象时，Intersect 首先枚举，收集该序列的所有不同元素。然后它枚举 [the] 秒，标记出现在两个序列中的那些元素。最后，标记的元素按照它们被收集的顺序产生。

您的第二次尝试 ( Any/ Contains) 枚举第一个集合，一个 O(n) 操作，对于第一个集合中的每个项目，枚举第二个，另一个 O(n) 操作，以查看是否找到匹配元素。这使它类似于 O(n ² ) 操作，不是吗？你认为哪个可能更快？

但是，要考虑的一件事是，如果实现足够智能以利用它，那么Contains()查找某些集合或集合类型（例如，字典、二叉树或允许二分查找或哈希表查找的有序集合）可能是一种廉价的操作Contains()它操作的集合的语义。

但是您需要对您的集合类型进行试验，以找出哪种效果更好。

score 1 · Accepted Answer

See Matthew's answer for a complete and accurate breakdown.

Relatively easy to mock up and try yourself:

        bool found;

        double intersect = 0;
        double any = 0;

        for (int i = 0; i < 100; i++)
        {
            List<string> L1 = GenerateNumberStrings(200000);
            List<string> L2 = GenerateNumberStrings(60000);
            Stopwatch stopWatch = new Stopwatch();

            stopWatch.Start();
            found = Intersect(L1, L2);
            stopWatch.Stop();
            intersect += stopWatch.ElapsedMilliseconds;

            stopWatch.Reset();

            stopWatch.Start();
            found = Any(L1, L2);
            stopWatch.Stop();
            any += stopWatch.ElapsedMilliseconds;
        }

        Console.WriteLine("Intersect: " + intersect + "ms");
        Console.WriteLine("Any: " + any + "ms");
    }

    private static bool Any(List<string> lst1, List<string> lst2)
    {
        return lst1.Any(x => lst2.Contains(x));
    }

    private static bool Intersect(List<string> lst1, List<string> lst2)
    {
        return lst1.Intersect(lst2).Any();
    }

You'll find that the Any method is significantly faster in the long run, likely because it does not require the memory allocations and setup that intersect requires (Any stops and returns true as soon as it finds a match whereas Intersect actually needs to store the matches in a new List<T>).

c# - 相交和任何或包含和任何。找到至少一个共同元素哪个更有效？

3 回答 3

Related

Reference