c - 如何在 C 中实现可变长度的“字符串”-y

Question

我用谷歌搜索了很多，但我找不到有关如何在高级语言中通常实现可变长度字符串的信息。我正在创建自己的这种语言，但不确定从哪里开始使用字符串。

我有一个描述string类型的结构，然后是一个create分配这样一个“字符串”的函数：

/* A safer `strcpy()`, using `strncpy()` and `sizeof()` */
#define STRCPY(TO, FROM) \
  strncpy(TO, FROM, sizeof(TO)); TO[sizeof(TO) - 1] = '\0'

struct string {
  // …
  char  native[1024];
};

string String__create(char native[]) {
  string this = malloc(sizeof(struct string));

  // …
  STRCPY(this->native, native);

  return this;
}

但是，这将只允许 1kb 长的字符串。这有点愚蠢，在大多数情况下会浪费大量内存。

鉴于我必须以某种方式声明要使用的内存……我该如何实现一个可以（有效）存储（有效）无限数量的字符的字符串？

score 12 · Accepted Answer

许多 C++std::string实现现在使用“小字符串优化”。在伪代码中：

struct string {
    Int32 length
    union {
        char[12] shortString
        struct {
           char* longerString
           Int32 heapReservedSpace
        }
    }
}

这个想法是将最多 12 个字符的字符串存储在shortString数组中。整个字符串将是连续的，并且只使用一个缓存行。较长的字符串存储在堆上。这会在字符串对象中留下 12 个备用字节。指针不会占用所有这些，因此您还可以记住您在堆上分配了多少内存 ( >=length)。这有助于支持您以小增量增长字符串的场景。

score 4 · Accepted Answer

除了其他人告诉你的之外，我还包括“容量”的概念：不可能知道内存中分配的向量的大小，你必须存储它。如果您对 String 结构执行 sizeof，它将返回 4 个字节 * 3 个数字字段 = 12 个字节（由于结构中使用了填充，可能会更大）。此外，您无法通过 sizeof 获取分配内存的长度。

typedef struct _mystring {
        char * native;
        size_t size;
        size_t capacity;
} String;

这样，容量总是承载字符串所在块的实际大小。假设您的字符串变短：您不必重新分配以获得容量和字符串大小之间的精确匹配。此外，您可以从头开始分配您希望字符串具有的字符，而不是初始字符串具有的字符。最后，您可以模仿 C++ 字符串 char 动态向量，并在每次字符串增长超出容量限制时将容量加倍。所有这些都将内存操作保持在最低限度，这将转化为更好的性能（您也会浪费一些内存，但不会超过 1Kb）。

String String__create(char native[], size_t capacity) {
  String this;

  this.size = strlen( native );
  if ( capacity < ( this.size + 1 ) )
        this.capacity = this.size + 1;
  else  this.capacity = capacity;

  this.native = (char *) malloc( capacity * sizeof( char ) );
  strcpy( this.native, native );

  return this;
}

String * String__set(String *this, char native[]) {
    this->size = strlen( native );

    if ( this->size >= this->capacity ) {
        do {
            this->capacity <<= 1;
        } while( this->size > this->capacity );

        this->native = realloc( this->native, this->capacity );
    }

    strcpy( this->native, native );

    return this;
}

void String__delete(String *this)
{
    free( this->native );
}

score 4 · Accepted Answer

The common approach is to have a field for length and a pointer to a dynamically allocated region of memory to hold the characters.

typedef struct string {
    size_t length;
    unsigned char *native;
} string_t;

string_t String__create(char native[]) {
    string_t this;
    this.length = strlen(native);
    this.native = malloc(this.length+1);
    if (this.native) {
        strncpy(this.native, native, this.length+1);
    } else {
        this.length = 0;
    }
    return this;
}

If you want to dynamically allocate the string_t:

string_t* String__create(char native[]) {
    string_t* this;
    if (this = malloc(sizeof(*this))) {
        this->length = strlen(native);
        this->native = malloc(this->length+1);
        if (this->native) {
            strncpy(this->native, native, this->length+1);
        } else {
            free(this);
            this=NULL;
        }
    }
    return this;
}
void String__delete(string_t** str) {
    free((*str)->native);
    free((*str));
    *str=NULL;
}

score 2 · Accepted Answer

realloc will relocate your memory to a bigger buffer -- simply using this command will allow you to resize your string. Use the following struct for your string:

struct mystring
{
    char * string;
    size_t size;
};

The important part being keeping a track of the size of your string, and having every string manipulation function testing if the size makes sense.

You could pre-allocate a large buffer and keep adding to it, and only realloc when said buffer is full -- you have to implement all the functions for this. It is preferable (far less error prone, and more secure) to mutate string by moving from one immutable string to another (using the semantics of realoc).

score 0 · Accepted Answer

有些人更喜欢使用“绳索”数据结构来存储无限长度的字符串，而不是连续的字符串（C 字符串）。

简化的绳索可以定义为：

#include <stdio.h>

struct non_leaf_rope_node{
    char zero;
    union rope* left_substring;
    union rope* right_substring;
    // real rope implementations have a few more items here
};
#define rope_leaf_max ( sizeof( struct non_leaf_rope_node ) )

typedef union rope {
    char rope_data[ rope_leaf_max ];
    struct non_leaf_rope_node pointers;
} rope;

void print( union rope *node ){
    if( node->rope_data[0] != '\0' ){
        // short literal data
        fputs( node->rope_data, stdout );
    }else{
        // non-root node
        print( node->pointers.left_substring );
        print( node->pointers.right_substring );
    };
};
// other details involving malloc() and free() go here

int main(void){
    rope left = { "Hello," };
    rope right = { " World!" };
    rope root = {0,0,0};
    root.pointers.left_substring = &left;
    root.pointers.right_substring = &right;
    print( &root );

    return 0;
};

少于rope_leaf_max 个字符的rope 与以null 结尾的C 字符串存储相同。包含超过rope_leaf_max 个字符的rope 存储为根non_leaf_rope_node 指向左右子字符串，（可能依次指向左右子子字符串），最终指向叶节点，叶节点每个都包含完整字符串的至少一个字符。

绳索总是至少存储一个字符，因此我们总是可以判断：如果绳索节点的第一个字节非零，则该节点是存储文字字符的叶节点。如果绳索节点的第一个字节为零，则该节点存储指向左右子字符串的指针。（真正的绳索实现通常有第三种绳索节点）。

通常使用绳索比使用 C 字符串需要更少的总 RAM 空间。（包含诸如“纽约市”之类的短语的节点可以在一根绳索中重复使用多次，或者在某些实现中在两条绳索之间共享）。有时使用绳索比使用 C 字符串更快。

c - 如何在 C 中实现可变长度的“字符串”-y

5 回答 5

Related

Reference