algorithm - 对两个字典进行排序并查找排序列表中项目索引位置的差异

Question

我面前有一个挑战。让我提出令我困惑的挑战——

有两个字典说- D1和D2。这些字典大部分时间都有相同的键，但不能保证它总是相同的。这两个字典可以表示如下 -

D1 = {["R1", 0.7], ["R2",0.73], ["R3", 1.5], ["R4", 2.5], ["R5", 0.12], ["R6", 1.9] , ["R7", 9.8], ["R8", 6.5], ["R9", 7.2], ["R10", 5.6]};

D2 = {["R1", 0.7], ["R2",0.8], ["R3", 1.5], ["R4", 3.1], ["R5", 0.10], ["R6", 2.0] , ["R7", 8.0], ["R8", 1.0], ["R9", 0.0], ["R10", 5.6], ["R11", 6.23]};

在这些字典中，键是字符串数据类型，值是浮点数据类型。从物理上讲，它们是两个不同时间的系统快照。D1 比 D2 早。我需要根据升序对这些字典进行独立排序。完成后将这些字典更改为 -

D1 = {["R5", 0.12], ["R1", 0.7], ["R2",0.73], ["R3", 1.5], ["R6", 1.9], ["R4", 2.5] , ["R10", 5.6], ["R8", 6.5], ["R9", 7.2], ["R7", 9.8]}; D2 = {["R9", 0.0], ["R5", 0.10], ["R1", 0.7], ["R2",0.8], ["R8", 1.0], ["R3", 1.5 ], ["R6", 2.0], ["R4", 3.1], ["R10", 5.6], ["R11", 6.23], ["R7", 8.0]};

这里将字典 D1 中元素的排序作为参考点。D1 的每个元素都与 D1 中的下一个元素相连。预计在 D2 中识别出在排序后出现在参考字典 D1 中的已破坏序列的元素。在确定向 D2 添加元素（即，密钥不存在于 D1 中但存在于 D2 中）和从 D1 中移除元素（即，密钥存在于 D1 中但不存在于 D2）时，将被忽略。即它们不应在结果中突出显示。

例如，继续上面列出的示例，参考 D1（忽略添加和删除）在 D2 中破坏序列的元素是 -

Breakers = {["R9", 0.0],["R8", 1.0]} 因为，R9 已将序列从 D1 排序字典中的第 8 个索引跳转到 D2 排序字典中的第 0 个索引。同样，R8 将序列从 D1 排序字典中的第 7 个索引跳转到 D2 排序字典中的第 4 个索引（所有索引都从 0 开始）。

注意 - ["R11", 6.23] 预计不会出现在断路器列表中，因为它是 D2 的补充。

请建议一种算法以最佳方式实现此目的，因为需要对从具有 3,256,190 条记录的数据库中获取的数据执行此操作。

编程语言不用担心，如果以逻辑为指导，我可以承担用任何语言实现它的任务。

score 1 · Accepted Answer

我在 C# 中提出了这个算法。它非常适合您的示例数据。我还用 3000000 个完全随机的值进行了测试（因此检测到很多断路器），它在我的笔记本电脑（Intel Core i3 2.1GHz，64 位）上在 3.2 秒内完成。

我首先将您的数据放入临时字典中，以便在将它们放入列表之前复制粘贴您的值。当然，您的应用程序会将它们直接放入列表中。

class Program
{
    struct SingleValue
    {
        public string Key;
        public float Value;
        public SingleValue(string key, float value)
        {
            Key = key;
            Value = value;
        }
        public override string ToString()
        {
            return string.Format("{0}={1}", Key, Value);
        }
    }

    static void Main(string[] args)
    {
        List<SingleValue> D1 = new List<SingleValue>();
        HashSet<string> D1keys = new HashSet<string>();
        List<SingleValue> D2 = new List<SingleValue>();
#if !LARGETEST
        Dictionary<string, double> D1input = new Dictionary<string, double>() { { "R1", 0.7 }, { "R2", 0.73 }, { "R3", 1.5 }, { "R4", 2.5 }, { "R5", 0.12 }, { "R6", 1.9 }, { "R7", 9.8 }, { "R8", 6.5 }, { "R9", 7.2 }, { "R10", 5.6 } };
        Dictionary<string, double> D2input = new Dictionary<string, double>() { { "R1", 0.7 }, { "R2", 0.8 }, { "R3", 1.5 }, { "R4", 3.1 }, { "R5", 0.10 }, { "R6", 2.0 }, { "R7", 8.0 }, { "R8", 1.0 }, { "R9", 0.0 }, { "R10", 5.6 }, { "R11", 6.23 } };

        // You should directly put you values into this list... I converted them from a Dictionary so I didn't have to type over your input values :)
        foreach (KeyValuePair<string, double> kvp in D1input)
        {
            D1.Add(new SingleValue(kvp.Key, (float)kvp.Value));
            D1keys.Add(kvp.Key);
        }
        foreach (KeyValuePair<string, double> kvp in D2input)
            D2.Add(new SingleValue(kvp.Key, (float)kvp.Value));
#else
        Random ran = new Random();
        for (int i = 0; i < 3000000; i++)
        {
            D1.Add(new SingleValue(i.ToString(), (float)ran.NextDouble()));
            D1keys.Add(i.ToString());
            D2.Add(new SingleValue(i.ToString(), (float)ran.NextDouble()));
        }
#endif

        // Sort the lists
        D1.Sort(delegate(SingleValue x, SingleValue y)
        {
            if (y.Value > x.Value)
                return -1;
            else if (y.Value < x.Value)
                return 1;
            return 0;
        });
        D2.Sort(delegate(SingleValue x, SingleValue y)
        {
            if (y.Value > x.Value)
                return -1;
            else if (y.Value < x.Value)
                return 1;
            return 0;
        });

        int start = Environment.TickCount;

        Dictionary<string, float> breakers = new Dictionary<string, float>();
        List<SingleValue> additions = new List<SingleValue>();

        // Walk through D1
        IEnumerator<SingleValue> i1 = D1.GetEnumerator();
        IEnumerator<SingleValue> i2 = D2.GetEnumerator();

        while (i1.MoveNext() && i2.MoveNext())
        {
            while (breakers.ContainsKey(i1.Current.Key))
            {
                if (!i1.MoveNext())
                    break;
            }

            while (i1.Current.Key != i2.Current.Key)
            {
                if (D1keys.Contains(i2.Current.Key))
                    breakers.Add(i2.Current.Key, i2.Current.Value);
                else
                    additions.Add(i2.Current);
                if (!i2.MoveNext())
                    break;
            }
        }

        int duration = Environment.TickCount - start;
        Console.WriteLine("Lookup took {0}ms", duration);
        Console.ReadKey();
    }
}

在此处输入图像描述

score 0 · Accepted Answer

如果您可以在排序之前从 D2 中删除内容，那将很容易，对吧？你说你不能删除数据。但是，您可以创建一个模拟此类删除的附加数据结构（例如，向项目添加一个“已删除”位，或者如果您无法更改其类型，则创建一组“已删除”项目）。然后运行简单的算法，但要确保忽略“删除”的项目。

score 0 · Accepted Answer

我一直在思考这个问题。正如您提到的 Levenshtein 距离，我假设您想通过将谁从 D2 中的位置移动到 D2 中的其他位置来获得这些元素，您将以最少的移动次数从 D2 获得 D1（忽略不存在的元素）两个序列）。

我写了一个贪心算法，它可能足以满足您的需求，但不一定在所有情况下都能给出最佳结果。老实说，我不确定，可能会在稍后（最早周末）回来检查正确性。但是，如果您真的需要对 300 万个元素的序列执行此操作，我相信在这方面做得好的算法都不会足够快，因为我看不到一个 O(n) 算法可以做到干得好，即使在一些琐碎的输入上也不会失败。

该算法尝试将每个元素移动到其预期位置，并计算移动后的误差总和（每个元素距其原始位置的距离）。导致错误总和最低的元素被宣布为断路器并移动。重复此过程，直到序列恢复到 D1。

我认为它具有 O(n^3) 复杂性，虽然元素有时需要移动多次，所以它可能是 O(n^4) 最坏的情况，我不确定，但在 100 万个具有 50 个元素的随机示例中外循环运行的最大次数是 51（n^4 意味着它可以是 2500，不知何故，我在所有的百万次测试中都很幸运）。只有键，没有值。这是因为这些值在此步骤中无关紧要，因此存储它们没有意义。

编辑：我为此写了一个反例生成器，实际上它并不总是最优的。断路器越多，获得最佳解决方案的机会就越小。例如，在 1000 个元素和 50 个随机移动的元素中，当最优解最多为 50 个时，通常会找到一组 55-60 个断路器。

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace Breakers
{
    class Program
    {
        static void Main(string[] args)
        {
            //test case 1
            //List<string> L1 = new List<string> { "R5", "R1", "R2", "R3", "R6", "R4", "R10", "R8", "R9", "R7" };
            //List<string> L2 = new List<string> { "R9", "R5", "R1", "R2", "R8", "R3", "R6", "R4", "R10", "R11", "R7" };
            //GetBreakers<string>(L1, L2);

            //test case 2
            //List<string> L1 = new List<string> { "R5", "R1", "R2", "R3", "R6", "R4", "R10", "R8", "R9", "R7" };
            //List<string> L2 = new List<string> { "R5", "R9", "R1", "R6", "R2", "R3", "R4", "R10", "R8", "R7" };
            //GetBreakers<string>(L1, L2);

            //test case 3
            List<int> L1 = new List<int>();
            List<int> L2 = new List<int>();
            Random r = new Random();
            int n = 100;
            for (int i = 0; i < n; i++)
            {
                L1.Add(i);
                L2.Add(i);
            }
            for (int i = 0; i < 5; i++) // number of random moves, this is the upper bound of the optimal solution
            {
                int a = r.Next() % n;
                int b = r.Next() % n;
                if (a == b)
                {
                    i--;
                    continue;
                }
                int x = L2[a];
                Console.WriteLine(x);
                L2.RemoveAt(a);
                L2.Insert(b, x);
            }
            for (int i = 0; i < L2.Count; i++) Console.Write(L2[i]);
            Console.WriteLine();
            GetBreakers<int>(L1, L2);
        }

        static void GetBreakers<T>(List<T> L1, List<T> L2)
        {
            Dictionary<T, int> Appearances = new Dictionary<T, int>();
            for (int i = 0; i < L1.Count; i++) Appearances[L1[i]] = 1;
            for (int i = 0; i < L2.Count; i++) if (Appearances.ContainsKey(L2[i])) Appearances[L2[i]] = 2;
            for (int i = L1.Count - 1; i >= 0; i--) if (!(Appearances.ContainsKey(L1[i]) && Appearances[L1[i]] == 2)) L1.RemoveAt(i);
            for (int i = L2.Count - 1; i >= 0; i--) if (!(Appearances.ContainsKey(L2[i]) && Appearances[L2[i]] == 2)) L2.RemoveAt(i);
            Dictionary<T, int> IndInL1 = new Dictionary<T, int>();
            for (int i = 0; i < L1.Count; i++) IndInL1[L1[i]] = i;

            Dictionary<T, int> Breakers = new Dictionary<T, int>();

            int steps = 0;
            int me = 0;
            while (true)
            {
                steps++;
                int minError = int.MaxValue;
                int minErrorIndex = -1;

                for (int from = 0; from < L2.Count; from++)
                {
                    T x = L2[from];
                    int to = IndInL1[x];
                    if (from == to) continue;

                    L2.RemoveAt(from);
                    L2.Insert(to, x);

                    int error = 0;
                    for (int i = 0; i < L2.Count; i++)
                        error += Math.Abs((i - IndInL1[L2[i]]));

                    L2.RemoveAt(to);
                    L2.Insert(from, x);

                    if (error < minError)
                    {
                        minError = error;
                        minErrorIndex = from;
                    }
                }

                if (minErrorIndex == -1) break;

                T breaker = L2[minErrorIndex];
                int breakerOriginalPosition = IndInL1[breaker];

                L2.RemoveAt(minErrorIndex);
                L2.Insert(breakerOriginalPosition, breaker);

                Breakers[breaker] = 1;

                me = minError;
            }
            Console.WriteLine("Breakers: " + Breakers.Count + " Steps: " + steps);
            foreach (KeyValuePair<T, int> p in Breakers)
                Console.WriteLine(p.Key);
            Console.ReadLine();
        }
    }
}

algorithm - 对两个字典进行排序并查找排序列表中项目索引位置的差异

3 回答 3

Related

Reference