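To brush up on my C, I'm writing some useful library code. When reading text files, it's always handy to have a convenient tokenization function that does most of the heavy lifting (looping on strtok is inconvenient and dangerous).
When I wrote this function, I was surprised by how intricate it turned out to be. To tell the truth, I'm almost convinced that it contains bugs (especially memory leaks in case of an allocation error). Here's the code: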
#include <stdlib.h>
#include <string.h>

/* Given an input string and separators, returns an array of
** tokens. Each token is a dynamically allocated, NUL-terminated
** string. The last element of the array is a sentinel NULL
** pointer. The returned array (and all the strings in it) must
** be deallocated by the caller.
**
** In case of errors, NULL is returned.
**
** This function is much slower than a naive in-line tokenization,
** since it copies the input string and does many allocations.
** However, it's much more convenient to use.
*/
char** tokenize(const char* input, const char* sep)
{
    /* strtok ruins its input string, so we'll work on a copy
    */
    char* dup;

    /* This is the array filled with tokens and returned
    */
    char** toks = 0;

    /* Current token
    */
    char* cur_tok;

    /* Size of the 'toks' array. Starts low and is doubled when
    ** exhausted.
    */
    size_t size = 2;

    /* 'ntok' points to the next free element of the 'toks' array
    */
    size_t ntok = 0;
    size_t i;

    if (!(dup = strdup(input)))
        return NULL;

    if (!(toks = malloc(size * sizeof(*toks))))
        goto cleanup_exit;

    cur_tok = strtok(dup, sep);

    /* While we have more tokens to process...
    */
    while (cur_tok)
    {
        /* We should still have 2 empty elements in the array,
        ** one for this token and one for the sentinel.
        */
        if (ntok > size - 2)
        {
            char** newtoks;
            size *= 2;

            newtoks = realloc(toks, size * sizeof(*toks));

            if (!newtoks)
                goto cleanup_exit;

            toks = newtoks;
        }

        /* Now the array is definitely large enough, so we just
        ** copy the new token into it.
        */
        toks[ntok] = strdup(cur_tok);

        if (!toks[ntok])
            goto cleanup_exit;

        ntok++;
        cur_tok = strtok(0, sep);
    }

    free(dup);
    toks[ntok] = 0;
    return toks;

cleanup_exit:
    free(dup);
    for (i = 0; i < ntok; ++i)
        free(toks[i]);
    free(toks);
    return NULL;
}
Here's simple usage:
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    char line[] = "The quick brown fox jumps over the lazy dog";
    char** toks = tokenize(line, " \t");
    int i;

    /* tokenize returns NULL if an allocation failed
    */
    if (!toks)
        return 1;

    for (i = 0; toks[i]; ++i)
        printf("%s\n", toks[i]);

    /* Deallocate
    */
    for (i = 0; toks[i]; ++i)
        free(toks[i]);
    free(toks);

    return 0;
}
Oh, and strdup:
/* strdup isn't ANSI C, so here's one...
*/
char* strdup(const char* str)
{
    size_t len = strlen(str) + 1;
    char* dup = malloc(len);

    if (dup)
        memcpy(dup, str, len);

    return dup;
}
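A few things to note about the code of the tokenize function:
- strtok has the impolite habit of writing over its input string. To preserve the user's data, I only call it on a duplicate of the input. The duplicate is obtained using strdup.
- strdup isn't ANSI C, however, so I had to write one.
- The toks array is grown dynamically with realloc, since we can't know in advance how many tokens there will be. The initial size is 2 just for testing; in real-life code I would probably set it to a much higher value. The array is also returned to the user, who has to deallocate it after use.
- In all cases, extreme care is taken not to leak resources. For example, if realloc returns NULL, its result is not written over the old pointer. The old pointer is freed and the function returns. No resources leak when tokenize returns (except in the nominal case, where the array returned to the user must be deallocated after use).
- A goto is used for more convenient cleanup code, following the philosophy that goto can be good in some cases (this is a good example, IMHO).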
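To make the realloc point concrete, here is a minimal sketch of the same growth idiom pulled out into a helper. The name ensure_room and its signature are my own invention for illustration and are not part of the code above; the sketch also assumes the capacity never gets large enough for the doubling to overflow.

```c
#include <stdlib.h>

/* Hypothetical helper (not part of the original code): makes sure the
** pointer array '*arr' has room for one more element plus the NULL
** sentinel. '*cap' is the current capacity and 'used' is the number
** of elements stored so far.
**
** Returns 1 on success. On failure returns 0 and leaves '*arr' and
** '*cap' untouched, so the caller can still free the old array.
*/
int ensure_room(char*** arr, size_t* cap, size_t used)
{
    char** bigger;
    size_t newcap;

    /* Still room for one element and the sentinel?
    */
    if (used + 2 <= *cap)
        return 1;

    /* Double the capacity until it is large enough
    ** (assumption: no overflow check, this is only a sketch).
    */
    newcap = *cap ? *cap : 2;
    while (newcap < used + 2)
        newcap *= 2;

    /* The result goes into a temporary: when realloc fails it
    ** returns NULL, but the original block stays allocated.
    */
    bigger = realloc(*arr, newcap * sizeof(*bigger));
    if (!bigger)
        return 0;

    *arr = bigger;
    *cap = newcap;
    return 1;
}
```

Writing realloc's result straight into the array pointer would lose the only reference to the old block on failure; keeping it in a temporary is what lets the cleanup path free the tokens collected so far.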
The following function can help with simple deallocation in a single call:
/* Given a pointer to the tokens array returned by 'tokenize',
** frees the array and sets it to point to NULL.
*/
void tokenize_free(char*** toks)
{
    if (toks && *toks)
    {
        int i;

        for (i = 0; (*toks)[i]; ++i)
            free((*toks)[i]);
        free(*toks);
        *toks = 0;
    }
}
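For illustration, a small usage sketch of tokenize_free. To keep it compilable on its own, tokenize_free is repeated verbatim and the token array is built by hand with a hypothetical stand-in, make_two_tokens, instead of calling tokenize; demo_tokenize_free is likewise a made-up name.

```c
#include <stdlib.h>
#include <string.h>

/* Same function as above, repeated so the sketch compiles alone.
*/
void tokenize_free(char*** toks)
{
    if (toks && *toks)
    {
        int i;

        for (i = 0; (*toks)[i]; ++i)
            free((*toks)[i]);
        free(*toks);
        *toks = 0;
    }
}

/* Hypothetical stand-in for tokenize(): builds a NULL-terminated
** array holding heap copies of "a" and "b". Returns NULL if an
** allocation fails.
*/
char** make_two_tokens(void)
{
    static const char* const src[] = { "a", "b" };
    char** toks = malloc(3 * sizeof(*toks));   /* 2 tokens + sentinel */
    size_t i;

    if (!toks)
        return NULL;

    /* Start with all sentinels so a partial array is always valid
    */
    for (i = 0; i < 3; ++i)
        toks[i] = NULL;

    for (i = 0; i < 2; ++i)
    {
        size_t len = strlen(src[i]) + 1;

        toks[i] = malloc(len);
        if (!toks[i])
        {
            tokenize_free(&toks);   /* frees what was built so far */
            return NULL;
        }
        memcpy(toks[i], src[i], len);
    }
    return toks;
}

/* Returns 1 if tokenize_free() reset the caller's pointer to NULL,
** 0 otherwise (including when the array could not be built).
*/
int demo_tokenize_free(void)
{
    char** toks = make_two_tokens();

    if (!toks)
        return 0;

    tokenize_free(&toks);   /* frees the strings and the array */
    tokenize_free(&toks);   /* toks is NULL now, so this does nothing */
    return toks == NULL;
}
```

Because the function takes the address of the array pointer, the caller's variable is reset to NULL, which makes an accidental second call harmless.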
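I'd really like to discuss this code with other users of SO. What could have been done better? Would you recommend a different interface for such a tokenizer? How could the burden of deallocation be taken off the user? Are there memory leaks in the code anyway?
Thanks in advance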