c - 在c中构建未知长度的大字符串

Question

我毫不怀疑在某个地方对此有答案，我只是找不到。

经过长时间的休息，我刚刚回到 c 并且非常生疏，所以请原谅愚蠢的错误。我需要生成一个大的（可能相当于 10mb）字符串。不知道要多久才能建成。

我尝试了以下两种方法来测试速度：

int main() {
#if 1
  size_t message_len = 1; /* + 1 for terminating NULL */
  char *buffer = (char*) malloc(message_len);
  for (int i = 0; i < 200000; i++)
  {
    int size = snprintf(NULL, 0, "%d \n", i);
    char * a = malloc(size + 1);
    sprintf(a, "%d \n", i);

    message_len += 1 + strlen(a); /* 1 + for separator ';' */
    buffer = (char*) realloc(buffer, message_len);
    strncat(buffer, a, message_len);
  }
#else
  FILE *f = fopen("test", "w"); 
  if (f == NULL) return -1; 
  for (int i = 0; i < 200000; i++)
  {
    fprintf(f, "%d \n", i);
  }
  fclose(f);
  FILE *fp = fopen("test", "r");
  fseek(fp, 0, SEEK_END);
  long fsize = ftell(f);
  fseek(fp, 0, SEEK_SET);
  char *buffer = malloc(fsize + 1);
  fread(buffer, fsize, 1, f);
  fclose(fp);
  buffer[fsize] = 0;
#endif
  char substr[56];
  memcpy(substr, buffer, 56);
  printf("%s", substr);
  return 1;
}

每次连接字符串的第一个解决方案耗时 3.8 秒，第二个写入文件然后读取的解决方案耗时 0.02 秒。

当然有一种快速的方法可以在 c 中构建一个大字符串，而无需读取和写入文件？我只是在做一些非常低效的事情吗？如果不能，我可以写入某种文件对象，然后在最后读取它并且永远不要保存它吗？

在 C# 中，您将使用字符串缓冲区来避免缓慢的连接，c 中的等价物是什么？

提前致谢。

score 5 · Accepted Answer

这些台词让你的生活变得非常艰难：

for (int i = 0; i < 200000; i++)
  {
    int size = snprintf(NULL, 0, "%d \n", i);  // << executed in first loop only
    char * a = malloc(size + 1);               // allocate enough space for "0 \n" + 1
    sprintf(a, "%d \n", i);                    // may try to squeeze "199999 \n" into a

    message_len += 1 + strlen(a); /* 1 + for separator ';' */
    buffer = (char*) realloc(buffer, message_len);
    strncat(buffer, a, message_len);
  }

您在第一次迭代中计算size并分配空间a- 然后在每次后续迭代中继续使用它（其中i变得更大，原则上您将超过为分配的存储空间a）。如果你正确地做到了这一点（a在每个循环中分配大小），你将不得不free在每个循环中，或者创建一个巨大的内存泄漏。

在 C 中，解决方案是预先分配大量内存 - 并且仅在紧急情况下重新分配。如果您“大致”知道字符串的大小，请立即分配所有内存；跟踪它有多大，如果不足，可以添加更多。最后，您总是可以“归还您未使用的东西”。太多的调用来realloc保持移动内存（因为你经常没有足够的连续内存可用）。正如@Matt 在他的评论中澄清的那样：每次调用都存在realloc 移动整个内存块的真正风险- 随着块变大，这将成为系统上二次增加的负载。这是一个可能的更好的解决方案（完整，用小 N 和 BLOCK 测试只是为了展示原理；您将希望使用大 N（您的值 200000）和更大的 BLOCK - 并摆脱printf那里显示的语句一切正常）：

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <time.h>

#define N 2000000 
#define BLOCK 32 
int main(void) {
size_t message_len = BLOCK; //
  char *buffer = (char*) malloc(message_len);
  int bb;
  int i, n=0;
  char* a = buffer;
  clock_t start, stop;
  for(bb = 1; bb < 128; bb *= 2) {
  int rCount = 0;
  start = clock();
  for (i = 0; i < N; i++)
  {
    a = buffer + n;
    n += sprintf(a, "%d \n", i);
    if ((message_len - n) < BLOCK*bb) {
      rCount++;
      message_len += BLOCK*bb;
      //printf("increasing buffer\n");
      //printf("increased buffer to %ld\n", (long int)message_len);
      buffer = realloc(buffer, message_len);
    }
  }
  stop = clock();
  printf("\nat the end, buffer length is %d; rCount = %d\n", strlen(buffer), rCount);
//  buffer = realloc(buffer, strlen(buffer+1));
  //printf("buffer is now: \n%s\n", buffer);
  printf("time taken with blocksize = %d: %.1f ms\n", BLOCK*bb, (stop - start) * 1000.0 / CLOCKS_PER_SEC);
  }
}

您将希望使用一个相当大的值BLOCK- 这将限制对的调用次数realloc。我会使用类似 100000 的东西；无论如何，您最终都会摆脱空间。

编辑我修改了我发布的代码以允许循环计时 - 将 N 增加到 200 万以获得“合理时间”。我还最小化了初始内存分配（强制大量调用realloc并修复了一个错误（当realloc必须移动内存时，不再a指向buffer.n

这是相当快的 - 最小块为 450 毫秒，对于较大的块（200 万个数字）下降到 350 毫秒。这与您的文件读/写操作相当（在我测量的分辨率范围内）。但是是的 - 文件 I/O 流和相关的内存管理是高度优化的......

score 1 · Accepted Answer

我遗漏了一些细节，但我的方法通常是这样的

创建一个像这样的结构

typedef struct {
    char *curr ;
    char *start ;
    char *end ;
} VBUF ;

按照这些思路编写一些函数：

void vbuf_alloc(VBUF *v,int n)
{
    v->start = malloc(n) ;
    v->end = v->start + n ;
    v->curr = v->start ;
    }

int vbuf_add(VBUF *v,char *s,int length)
{
    if (v->end - v->curr < length) {
        vbuf_realloc(v,(v->end - v->start) * 2) ;
        }
    memcpy(v->curr,s,length) ;
    v->curr += length ;
    return length ;
    }

int vbuf_adds(VBUF *v,char *s)
{
    return vbud_add(v,s,strlen(s)) ;
    }

您可以随心所欲地扩展这套功能。

score 0 · Accepted Answer

如果长度太短，我建议不要realloc在每个连续的字符串上尝试智能地提前。realloc换句话说，尽可能避免重新分配。

伪代码中的幼稚实现可能类似于

Initialize an int/long to "written so far"
Initialize an int/long to remember "buffer size"
Alloc memory for a string up to the "buffer size"
Read in the next chunk into a temporary buffer
Get the "chunk size" from the temporary buffer
If "written so far" + "chunk size" > "buffer size"
    Reallocate the chunk to be much bigger (double "buffer size"?)
    Set the new "buffer size"
Copy the data from the temporary buffer to "buffer address" + "written so far" + 1
Set "written so far" to "written so far" + "chunk size"

我只是把它放在一起，所以可能会有索引错误，但你明白了：只在必要时分配和复制，而不是每次都通过循环。

score 0 · Accepted Answer

c没有对象，因此没有与字符串缓冲区等效的对象C#（尽管C++您会使用std::string）。

通过不在每个追加上调用 realloc 并且从不按你的方式调用 malloc ，你会获得性能提升。

您可以通过简单地声明一个足够大的 char[] 来完全避免您的 malloc 以将最大的 int 打印到；这也可以避免 snprintf，而且尺寸相当小。

与其不断调用 realloc，不如将缓冲区增加一些合理的大小……比如说 4kb（与页面大小相对应的一个很好的大小），并且只有在它接近用尽时（也就是说，当它的当前使用量为小于您从顶部使用的数组大小）。

c - 在c中构建未知长度的大字符串

4 回答 4

Related

Reference