c# - Optimizing the conversion of millions of char* to strings

Question

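I have an application that needs to take in millions of char* as input parameters (typically strings of fewer than 512 characters, in Unicode) and convert and store them as .NET strings.

It has turned out to be a real bottleneck in my application's performance. I'm wondering whether there is some design pattern or idea that would make it more efficient.

There is one key aspect that makes me feel it can be improved: there are a lot of duplicates. Say 1 million objects come in; there might be only about 50 unique char* patterns.

For the record, here is the algorithm I use to convert a char* to a string (the algorithm is in C++, but the rest of the project is in C#):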

String ^StringTools::MbCharToStr ( const char *Source ) 
{
   String ^str;

   if( (Source == NULL) || (Source[0] == '\0') )
   {
      str = gcnew String("");
   }
   else
   {
      // Find the number of UTF-16 characters needed to hold the
      // converted UTF-8 string, and allocate a buffer for them.
      const size_t max_strsize = 2048;

      int wstr_size = MultiByteToWideChar (CP_UTF8, 0L, Source, -1, NULL, 0);
      if (wstr_size < max_strsize)
      {
         // Save the malloc/free overhead if it's a reasonable size.
         // Plus, KJN was having fits with exceptions within exception logging due
         // to a corrupted heap.

         wchar_t wstr[max_strsize];

         (void) MultiByteToWideChar (CP_UTF8, 0L, Source, -1, wstr, (int) wstr_size);
         str = gcnew String (wstr);
      }
      else
      {
         wchar_t *wstr = (wchar_t *)calloc (wstr_size, sizeof(wchar_t));
         if (wstr == NULL) 
            throw gcnew PCSException (__FILE__, __LINE__, PCS_INSUF_MEMORY, MSG_SEVERE);

         // Convert the UTF-8 string into the UTF-16 buffer, construct the
         // result String from the UTF-16 buffer, and then free the buffer.

         (void) MultiByteToWideChar (CP_UTF8, 0L, Source, -1, wstr, (int) wstr_size);
         str = gcnew String ( wstr );
         free (wstr);
      }
   }
   return str;
}

score 5 · Accepted Answer

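You could feed each character of the input string into a trie structure. At each leaf, keep a single .NET string object. Then, when a char* that you have seen before comes in, you can quickly find the existing .NET version without allocating any memory.

Pseudocode:

start with an empty trie,
process a char* by searching the trie until you can't go any further
add nodes until your entire char* has been encoded as nodes
at the leaf, attach an actual .NET string

The answer to this other SO question should get you started: How to create a trie in c#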
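The pseudocode above could be sketched roughly as follows in portable C++. This is only an illustration under some assumptions: `TrieCache`, `CachedString`, and `lookup` are hypothetical names, and a `std::shared_ptr<const std::string>` stands in for the managed string that a C++/CLI version would hold (for example through `gcroot<String^>`). The "conversion" here is just a copy; the real one would be the UTF-8 to UTF-16 step from the question.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>

// Stand-in for the managed string handle (assumption, not the .NET type).
using CachedString = std::shared_ptr<const std::string>;

class TrieCache {
    struct Node {
        // Children keyed by byte value; a map keeps sparse nodes small.
        std::map<unsigned char, std::unique_ptr<Node>> children;
        // Set only on nodes where a complete input string ended.
        CachedString value;
    };

    Node root_;
    std::size_t conversions_ = 0;

public:
    // Walks the trie byte by byte, adding nodes where the path runs out,
    // and converts the input only the first time that exact string is seen.
    CachedString lookup(const char* source) {
        assert(source != nullptr);
        Node* node = &root_;
        for (const char* p = source; *p != '\0'; ++p) {
            std::unique_ptr<Node>& next =
                node->children[static_cast<unsigned char>(*p)];
            if (!next) next = std::make_unique<Node>();
            node = next.get();
        }
        if (!node->value) {
            // Placeholder for the real char* -> .NET string conversion.
            node->value = std::make_shared<const std::string>(source);
            ++conversions_;
        }
        return node->value;
    }

    // Number of times the (expensive) conversion actually ran.
    std::size_t conversions() const { return conversions_; }
};
```

With roughly 50 distinct inputs, almost every call would follow existing nodes and return an already-built string. Memory grows with the total length of the distinct strings, so whether this beats a plain map is something to measure rather than assume.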

score 3 · Accepted Answer

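There is one key aspect that makes me feel it can be improved: there are a lot of duplicates. Say 1 million objects come in; there might be only about 50 unique char* patterns.

If that is the case, you may want to consider storing the "found" patterns in a map (for example a std::map<const char*, gcroot<String^>>, though you will need a comparator for the const char*), and using it to return the previously converted value.

Storing the map, doing the comparisons, and so on all carry overhead. However, that overhead may be offset by the significantly reduced memory usage (you can reuse the managed string instances) and by the memory allocations you save (calloc/free). Also, using malloc instead of calloc would likely be a (very small) improvement, since you don't need to zero the memory before calling MultiByteToWideChar.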
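A minimal sketch of such a cache in portable C++ follows. It makes one deliberate substitution: instead of a `const char*` key with a custom comparator, it uses a `std::string` key with the transparent `std::less<>` comparator, so the cache owns its keys (the caller's buffer may not outlive the call) while `find` can still compare against a raw `const char*` without building a temporary. `ConversionCache`, `Handle`, and the injected `convert` function are hypothetical names; `std::shared_ptr<const std::u16string>` stands in for `gcroot<String^>`.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Stand-in for gcroot<String^> (assumption for this sketch).
using Handle = std::shared_ptr<const std::u16string>;

class ConversionCache {
    // std::less<> is transparent: find() accepts a const char* directly
    // and compares it against the stored std::string keys.
    std::map<std::string, Handle, std::less<>> seen_;
    // The real conversion (e.g. the routine from the question) is injected.
    std::function<std::u16string(const char*)> convert_;

public:
    explicit ConversionCache(std::function<std::u16string(const char*)> convert)
        : convert_(std::move(convert)) {}

    // Returns the previously converted value, converting only on a miss.
    Handle get(const char* source) {
        assert(source != nullptr);
        auto it = seen_.find(source);
        if (it == seen_.end()) {
            auto value =
                std::make_shared<const std::u16string>(convert_(source));
            // The key is copied here, once per distinct input.
            it = seen_.emplace(source, std::move(value)).first;
        }
        return it->second;
    }

    std::size_t size() const { return seen_.size(); }
};
```

Each hit costs a handful of string comparisons and no allocation; each miss costs one conversion plus one key copy. A hash map would be a reasonable alternative, at the price of hashing the whole input on every call.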

score 2 · Accepted Answer

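I think the first optimization you could make here is to have your first call to MultiByteToWideChar start with a buffer instead of a null pointer. Because you specified CP_UTF8, MultiByteToWideChar has to walk the entire string to determine the expected length. If there is some length that is longer than the vast majority of your strings, you might consider optimistically allocating a buffer of that size on the stack and falling back to dynamic allocation only if that fails. That is, move the first branch of your if/else block out in front of the if/else.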

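You could also save some time by calculating the length of the source string once and passing it in explicitly; that way MultiByteToWideChar doesn't have to do a strlen every time you call it.

That said, if the rest of your project is C#, you should use the .NET BCL class libraries designed for this rather than keeping a side-by-side C++/CLI assembly just to convert strings. That is what System.Text.Encoding is for.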

I doubt that any kind of caching data structure you could use here would make a significant difference.

Oh, and don't ignore the result of MultiByteToWideChar. Not only should you never cast anything to void, you also have undefined behavior in the event that MultiByteToWideChar fails.
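The control flow suggested in this answer (measure the length once, try a stack buffer first, fall back to the heap, and check every result) might look roughly like the sketch below. To keep it portable, a small hand-written decoder named `utf8_to_utf16` stands in for `MultiByteToWideChar(CP_UTF8, ...)`; it is a simplified, hypothetical helper that returns 0 on failure and does not reject every malformed sequence (overlong forms, for instance). `convert` and `kStackUnits` are likewise names invented for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <utility>

// Simplified stand-in for MultiByteToWideChar: returns the number of UTF-16
// code units written, or 0 if the input is malformed or `out` is too small.
std::size_t utf8_to_utf16(const char* src, std::size_t len,
                          char16_t* out, std::size_t cap) {
    assert(out != nullptr);
    std::size_t written = 0;
    for (std::size_t i = 0; i < len;) {
        const unsigned char lead = static_cast<unsigned char>(src[i]);
        std::size_t extra;
        char32_t cp;
        if (lead < 0x80)                { extra = 0; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
        else return 0;                       // invalid lead byte
        if (i + extra >= len) return 0;      // sequence cut off by end of input
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(src[i + k]);
            if ((c & 0xC0) != 0x80) return 0;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;
        const std::size_t units = (cp >= 0x10000) ? 2 : 1;
        if (cp > 0x10FFFF || written + units > cap) return 0;
        if (units == 1) {
            out[written++] = static_cast<char16_t>(cp);
        } else {                             // encode as a surrogate pair
            cp -= 0x10000;
            out[written++] = static_cast<char16_t>(0xD800 + (cp >> 10));
            out[written++] = static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        }
    }
    return written;
}

// Returns false on malformed input instead of ignoring the failure.
bool convert(const char* source, std::u16string& result) {
    result.clear();
    if (source == nullptr || source[0] == '\0') return true;  // empty string

    const std::size_t len = std::strlen(source);  // measured once, passed on

    // Optimistic attempt: fixed stack buffer, no sizing pass, no heap.
    constexpr std::size_t kStackUnits = 512;
    char16_t stack_buf[kStackUnits];
    std::size_t n = utf8_to_utf16(source, len, stack_buf, kStackUnits);
    if (n != 0) { result.assign(stack_buf, n); return true; }

    // UTF-16 never needs more code units than UTF-8 has bytes, so a failure
    // on an input this short cannot be a size problem.
    if (len <= kStackUnits) return false;

    // Fallback: a heap buffer of `len` code units is always large enough.
    std::u16string heap(len, u'\0');
    n = utf8_to_utf16(source, len, &heap[0], heap.size());
    if (n == 0) return false;
    heap.resize(n);
    result = std::move(heap);
    return true;
}
```

With the real API, the 0 return from the first call would be followed by a check of the error code to tell a too-small buffer apart from invalid input; the byte-count bound used here is one way to get the same distinction in this stand-in.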

score 1 · Accepted Answer

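I would probably use a cache based on a ternary tree structure, or something similar, and look up the input string to see whether it has already been converted before converting even a single character to its .NET representation.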
