c# - Need a fast way to deserialize 1 million strings and Guids in C#

Question

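I want to deserialize a list of 1 million (String, Guid) pairs for a performance-critical application. The format can be anything I choose, and serialization does not have the same performance requirements.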

Which approach works best? Text or binary? Writing each (string, guid) pair consecutively, or writing all the strings followed by all the guids?

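I started off using LinqPad (and a simpler example that deserializes strings only) and found, somewhat counterintuitively, that using a TextReader and ReadLine() was much faster than using a BinaryReader and ReadString(). (Was the file system cache fooling me?)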

public string[] DeSerializeBinary()
{
    var tmr = System.Diagnostics.Stopwatch.StartNew();
    long ms = 0;
    string[] arr = null;
    using (var rdr = new BinaryReader(new FileStream(file, FileMode.Open, FileAccess.Read)))
    {
        var num = rdr.ReadInt32();
        arr = new String[num];
        for (int i = 0; i < num; i++)
        {
            arr[i] = rdr.ReadString();
        }
        tmr.Stop();
        ms = tmr.ElapsedMilliseconds;
        Console.WriteLine("DeSerializeBinary took {0}ms", ms);
    }
    return arr;
}

public string[] DeserializeText()
{
    var tmr = System.Diagnostics.Stopwatch.StartNew();
    long ms = 0;
    string[] arr = null;
    using (var rdr = File.OpenText(file))
    {
        var num = Int32.Parse(rdr.ReadLine());
        arr = new String[num];
        for (int i = 0; i < num; i++)
        {
            arr[i] = rdr.ReadLine();
        }
        tmr.Stop();
        ms = tmr.ElapsedMilliseconds;
        Console.WriteLine("DeserializeText took {0}ms", ms);
    }
    return arr;
}

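Some edits:

I used RamMap to clear the file system cache and then found there was almost no difference between the text and binary readers for strings only.
I have a reasonably simple class to hold the string and guid. It also holds an int index, which corresponds to its position in the list. Obviously there is no need to include that in the serialization.
In tests of (binary) deserialization of strings and Guids, I get about 500 ms.
The ideal time is 50 ms, or as close to it as I can get. However, a simple experiment showed that it takes at least 120 ms just to read the (compressed) file into memory from a reasonably fast SSD, without any kind of parsing at all. So 50 ms seems unlikely.
Our strings have no theoretical length limit. However, we can assume the performance target only applies when they are all 20 characters or fewer.
The timings include opening the file.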
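
The raw-read baseline mentioned in these edits can be measured with something like the following minimal sketch. The names `RawReadTimer` and `TimeRawRead` are hypothetical, and the number it reports depends heavily on whether the file is already in the file system cache.

```csharp
using System;
using System.Diagnostics;
using System.IO;

public static class RawReadTimer
{
    // Reads the whole file into memory without parsing it and returns the
    // byte count together with the elapsed time. The timing includes
    // opening the file, as in the measurements described above.
    public static (int ByteCount, long ElapsedMs) TimeRawRead(string path)
    {
        var timer = Stopwatch.StartNew();
        byte[] bytes = File.ReadAllBytes(path);
        timer.Stop();
        return (bytes.Length, timer.ElapsedMilliseconds);
    }
}
```

Any deserializer that reads the same file cannot be faster than this figure, so it is a useful lower bound to compare against.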

Profile of the current code experiments

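Reading the strings is now the obvious bottleneck (which is why I have only been experimenting with serializing strings). JIT_NewFast accounted for 30% before I pre-allocated a 16-byte array for reading the GUIDs.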

score 3 · Accepted Answer

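It's not surprising that reading a bunch of strings is faster with StreamReader than with BinaryReader. StreamReader reads blocks from the underlying stream and parses the strings out of that buffer. BinaryReader has no such buffer. It reads the string's length prefix from the underlying stream and then reads that many bytes. So BinaryReader makes more calls to the base stream's Read method.

But deserializing a (String, Guid) pair is more than just reading. You also have to parse the Guid. If you write the file in binary, the Guid is written in binary form, which makes it easier and faster to create the Guid structure. If it's a string, then after splitting the line into its two fields you have to call new Guid(string) to parse the text and create the Guid.

It's hard to say which would be faster.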
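
For comparison, the text route might look like the minimal sketch below. The file layout is an assumption, not something from the question: the first line holds the record count, and each following line is "string<TAB>guid" with the GUID in the 36-character hyphenated form. It also assumes the strings contain no line breaks. `TextPairReader` is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class TextPairReader
{
    // Assumed layout (not from the original post): first line is the record
    // count, then one "string<TAB>guid" record per line.
    public static List<KeyValuePair<string, Guid>> Read(TextReader reader)
    {
        int count = int.Parse(reader.ReadLine());
        var pairs = new List<KeyValuePair<string, Guid>>(count);
        for (int i = 0; i < count; i++)
        {
            string line = reader.ReadLine();
            // The GUID is the tail of the line, so split at the last tab.
            int tab = line.LastIndexOf('\t');
            string text = line.Substring(0, tab);
            // "D" is the 32-digit form separated by hyphens.
            Guid guid = Guid.ParseExact(line.Substring(tab + 1), "D");
            pairs.Add(new KeyValuePair<string, Guid>(text, guid));
        }
        return pairs;
    }
}
```

Timing this against the binary reader on the same data is the only reliable way to settle which one wins.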

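I can't imagine we're talking about a lot of time here. Surely reading a file of a million lines takes about a second, unless the strings are really long. A GUID is only 36 characters if you count the separators, right?

With BinaryWriter, you could write the file like this: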

writer.Write(count); // integer number of records
foreach (var pair in pairs)
{
    writer.Write(pair.theString);
    writer.Write(pair.theGuid.ToByteArray());
}

To read it, you have:

int count = reader.ReadInt32();
// one buffer, reused for every record to avoid a per-record allocation
byte[] guidBytes = new byte[16];
for (int i = 0; i < count; ++i)
{
    string s = reader.ReadString();
    reader.Read(guidBytes, 0, guidBytes.Length);
    pairs.Add(new Pair(s, new Guid(guidBytes)));
}

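Whether that's faster than splitting a string and calling the Guid constructor that takes a string parameter, I don't know.

I suspect any difference will be very slight. I'd probably go with the simplest approach: a text file.

If you want to get really crazy, you could write a custom format that can be read in just a few large reads (a header, an index, and two arrays for the strings and the GUIDs), and do everything else in memory. That would almost certainly be faster. But fast enough to warrant the extra work? Doubtful.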

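Update

Or maybe not doubtful. Here's some code that writes and reads a custom binary format. The format is:

count (int32)
guids (count * 16 bytes)
strings (one big concatenated string)
index (index of each string's starting character in the big string)

I've assumed that you're using a Dictionary<string, Guid> to hold these things. But your data structure doesn't really matter. The code would be substantially the same.

Note that I tested this very briefly. I won't say the code is 100% bug free, but I think you can get the idea of what I'm doing.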

private void WriteGuidFile(string filename, Dictionary<string, Guid> guids)
{
    using (var fs = File.Create(filename))
    {
        using (var writer = new BinaryWriter(fs, Encoding.UTF8))
        {
            List<int> stringIndex = new List<int>(guids.Count);
            StringBuilder bigString = new StringBuilder();

            // write count
            writer.Write(guids.Count);

            // Write the GUIDs and build the string index
            foreach (var pair in guids)
            {
                writer.Write(pair.Value.ToByteArray(), 0, 16);
                stringIndex.Add(bigString.Length);
                bigString.Append(pair.Key);
            }
            // Add one more entry to the string index.
            // makes deserializing easier
            stringIndex.Add(bigString.Length);

            // Write the string that contains all of the strings, combined
            writer.Write(bigString.ToString());

            // write the index
            foreach (var ix in stringIndex)
            {
                writer.Write(ix);
            }
        }
    }
}

Reading is only slightly more involved:

private Dictionary<string, Guid> ReadGuidFile(string filename)
{
    using (var fs = File.OpenRead(filename))
    {
        using (var reader = new BinaryReader(fs, Encoding.UTF8))
        {
            // read the count
            int count = reader.ReadInt32();

            // The guids are in a huge byte array sized 16*count.
            // ReadBytes keeps reading until the requested number of bytes
            // has arrived (or the stream ends); Read may return fewer.
            byte[] guidsBuffer = reader.ReadBytes(16*count);

            // Strings are all concatenated into one
            var bigString = reader.ReadString();

            // Index is an array of int. We can read it as an array of
            // ((count+1) * 4) bytes.
            byte[] indexBuffer = reader.ReadBytes(4*(count+1));

            var guids = new Dictionary<string, Guid>(count);
            byte[] guidBytes = new byte[16];
            int startix = 0;
            int endix = 0;
            for (int i = 0; i < count; ++i)
            {
                endix = BitConverter.ToInt32(indexBuffer, 4*(i+1));
                string key = bigString.Substring(startix, endix - startix);
                Buffer.BlockCopy(guidsBuffer, (i*16),
                                    guidBytes, 0, 16);
                guids.Add(key, new Guid(guidBytes));
                startix = endix;
            }
            return guids;
        }
    }
}

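A few notes here. First, I'm using BitConverter to convert the data in the byte array to integers. Using unsafe code and an int32* to index the array would likely be faster.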
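
The per-element BitConverter calls can also be avoided without unsafe code. The sketch below is a different technique from the int32* approach: it copies the whole index buffer into an int[] with one Buffer.BlockCopy. `IndexDecoder` is a hypothetical name, and the code assumes a little-endian machine, which matches the byte order BinaryWriter uses.

```csharp
using System;

public static class IndexDecoder
{
    // Converts a buffer of little-endian int32 values into an int[] with a
    // single block copy. Assumes the machine itself is little-endian.
    public static int[] ToInt32Array(byte[] indexBuffer)
    {
        if (indexBuffer.Length % sizeof(int) != 0)
            throw new ArgumentException("Length must be a multiple of 4.", nameof(indexBuffer));

        var result = new int[indexBuffer.Length / sizeof(int)];
        // BlockCopy counts in bytes, not in elements.
        Buffer.BlockCopy(indexBuffer, 0, result, 0, indexBuffer.Length);
        return result;
    }
}
```

With the index available as an int[], the loop can read the end position of string i as index[i + 1] directly.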

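You could gain a little speed by using pointers to index into guidsBuffer and calling the Guid constructor (Int32, Int16, Int16, Byte, Byte, Byte, Byte, Byte, Byte, Byte, Byte), rather than using Buffer.BlockCopy to copy each GUID to a temporary array.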
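
As a rough illustration of that constructor, here is a sketch that builds a Guid directly from 16 bytes at an offset in a larger buffer. It uses BitConverter rather than pointers, so it shows the idea without claiming the same speed. `GuidDecoder` and `ReadGuid` are hypothetical names, and the code assumes a little-endian machine and bytes produced by Guid.ToByteArray().

```csharp
using System;

public static class GuidDecoder
{
    // Builds a Guid from the 16 bytes starting at offset, with no temporary
    // array. Guid.ToByteArray() stores the first three fields (int, short,
    // short) in little-endian order, followed by eight individual bytes.
    public static Guid ReadGuid(byte[] buffer, int offset)
    {
        return new Guid(
            BitConverter.ToInt32(buffer, offset),
            BitConverter.ToInt16(buffer, offset + 4),
            BitConverter.ToInt16(buffer, offset + 6),
            buffer[offset + 8], buffer[offset + 9],
            buffer[offset + 10], buffer[offset + 11],
            buffer[offset + 12], buffer[offset + 13],
            buffer[offset + 14], buffer[offset + 15]);
    }
}
```

In the reading loop this would be called as GuidDecoder.ReadGuid(guidsBuffer, i * 16).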

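You could make the string index an index of lengths rather than starting positions. That would eliminate the need for the extra value at the end of the array, but it's unlikely to make any difference in speed.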
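
A length-based index could be decoded along the lines of this sketch. `LengthIndex` is a hypothetical name, and the lengths are assumed to be character counts, like the starting positions in the code above.

```csharp
using System;

public static class LengthIndex
{
    // Splits one concatenated string into its parts, given each part's
    // length in characters. The running start position replaces the
    // stored starting offsets.
    public static string[] Split(string bigString, int[] lengths)
    {
        var parts = new string[lengths.Length];
        int start = 0;
        for (int i = 0; i < lengths.Length; i++)
        {
            parts[i] = bigString.Substring(start, lengths[i]);
            start += lengths[i];
        }
        return parts;
    }
}
```

The index then needs only count entries instead of count + 1.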

There are probably other optimization opportunities, but I think you get the general idea here.
