c - C 中字符串标记化函数的复杂性

Question

为了复习我的 C，我正在编写一些有用的库代码。在阅读文本文件时，拥有一个方便的标记化功能总是有用的，它可以完成大部分繁重的工作（循环strtok不方便且危险）。

当我编写这个函数时，我对它的复杂性感到惊讶。说实话，我几乎确信它包含错误（特别是在分配错误的情况下内存泄漏）。这是代码：

/* Given an input string and separators, returns an array of 
** tokens. Each token is a dynamically allocated, NUL-terminated
** string. The last element of the array is a sentinel NULL 
** pointer. The returned array (and all the strings in it) must
** be deallocated by the caller.
**
** In case of errors, NULL is returned.
** 
** This function is much slower than a naive in-line tokenization,
** since it copies the input string and does many allocations.
** However, it's much more convenient to use.
*/ 
char** tokenize(const char* input, const char* sep)
{
    /* strtok ruins its input string, so we'll work on a copy 
    */
    char* dup;

    /* This is the array filled with tokens and returned
    */
    char** toks = 0;

    /* Current token
    */
    char* cur_tok;

    /* Size of the 'toks' array. Starts low and is doubled when
    ** exhausted.
    */
    size_t size = 2;

    /* 'ntok' points to the next free element of the 'toks' array
    */
    size_t ntok = 0;
    size_t i;

    if (!(dup = strdup(input)))
        return NULL;

    if (!(toks = malloc(size * sizeof(*toks))))
        goto cleanup_exit;

    cur_tok = strtok(dup, sep);

    /* While we have more tokens to process...
    */
    while (cur_tok)
    {
        /* We should still have 2 empty elements in the array, 
        ** one for this token and one for the sentinel.
        */
        if (ntok > size - 2)
        {
            char** newtoks;
            size *= 2;

            newtoks = realloc(toks, size * sizeof(*toks));

            if (!newtoks)
                goto cleanup_exit;

            toks = newtoks;
        }

        /* Now the array is definitely large enough, so we just
        ** copy the new token into it.
        */
        toks[ntok] = strdup(cur_tok);

        if (!toks[ntok])
            goto cleanup_exit;

        ntok++;
        cur_tok = strtok(0, sep);
    }    

    free(dup);
    toks[ntok] = 0;
    return toks;

cleanup_exit:
    free(dup);
    for (i = 0; i < ntok; ++i)
        free(toks[i]);
    free(toks);
    return NULL;
}

这是简单的用法：

int main()
{
    char line[] = "The quick brown fox jumps over the lazy dog";
    char** toks = tokenize(line, " \t");
    int i;

    for (i = 0; toks[i]; ++i)
        printf("%s\n", toks[i]);

    /* Deallocate
    */
    for (i = 0; toks[i]; ++i)
        free(toks[i]);
    free(toks);

    return 0;
}

哦，还有strdup：

/* strdup isn't ANSI C, so here's one...
*/
char* strdup(const char* str)
{
    size_t len = strlen(str) + 1;
    char* dup = malloc(len);

    if (dup)
        memcpy(dup, str, len);

    return dup;
}

关于tokenize函数代码的几点注意事项：

strtok有覆盖其输入字符串的不礼貌习惯。为了保存用户的数据，我只在输入的副本上调用它。副本是使用获得的strdup。
strdup然而，它不是 ANSI-C，所以我不得不写一个
数组随着toks动态增长realloc，因为我们事先不知道会有多少令牌。初始大小为 2 仅用于测试，在实际代码中我可能会将其设置为更高的值。它也会返回给用户，用户必须在使用后释放它。
在所有情况下，都特别注意不要泄漏资源。例如，如果realloc返回 NULL，它将不会在旧指针上运行。旧指针将被释放，函数返回。返回时没有资源泄漏tokenize（除非在返回给用户的数组在使用后必须释放的名义情况下）。
Agoto用于更方便的清理代码，根据在某些情况下可能很好的哲学（这是一个很好的例子，恕我直言）。goto

以下函数可以帮助在单个调用中进行简单的释放：

/* Given a pointer to the tokens array returned by 'tokenize',
** frees the array and sets it to point to NULL.
*/
void tokenize_free(char*** toks)
{
    if (toks && *toks)
    {
        int i;

        for (i = 0; (*toks)[i]; ++i)
            free((*toks)[i]);
        free(*toks);
        *toks = 0;
    }
}

我真的很想与 SO 的其他用户讨论此代码。有什么可以做得更好的？你会为这样的分词器推荐一个不同的接口吗？解除分配的负担是如何从用户那里承担的？代码中是否存在内存泄漏？

提前致谢

score 2 · Accepted Answer

我推荐的一件事是提供tokenize_free处理所有释放的方法。这对用户来说更容易，并且让您可以灵活地在未来更改分配策略，而不会破坏您图书馆的用户。

当字符串的第一个字符是分隔符时，下面的代码会失败：

另一个想法是不要费心复制每个单独的令牌。我看不到它添加了什么，只是为您提供了更多可以归档代码的地方。相反，只需保留您制作的完整缓冲区的副本。我的意思是改变：

toks[ntok] = strdup(cur_tok); if (!toks[ntok]) goto cleanup_exit;

到：

toks[ntok] = cur_tok;

free(buf)从非错误路径中删除该行。最后，这将清理更改为：

free(toks[0]); free(toks);

score 1 · Accepted Answer

我看不出用 strtok 方法在线修改字符串有什么问题——如果他们想对重复的字符串进行操作，这是调用者的选择，因为语义很好理解。下面是按预期使用 strtok 稍微简化的相同方法，但仍返回一个方便的 char * 指针数组（现在只是指向原始字符串的标记化段）。它为您的原始 main() 调用提供相同的输出。

这种方法的主要优点是你只需要释放返回的字符数组，而不是循环清除所有元素——我认为这个方面带走了很多简单因素，而且调用者不太可能期望按照任何正常的 C 约定来做。

我还删除了 goto 语句，因为重构了代码，它们对我来说没有多大意义。我认为拥有单个清理点的危险在于它可能会开始变得过于笨重，并且会执行一些不需要在特定位置清理问题的额外步骤。

就我个人而言，我认为我要提出的主要哲学观点是，您应该尊重使用该语言的其他人的期望，尤其是在创建库类型的调用时。即使 strtok 替换行为对您来说似乎很奇怪，但绝大多数 C 程序员习惯于将 \0 放在 C 字符串的中间以拆分它们或创建更短的字符串，因此这看起来很自然。同样如前所述，没有人会期望使用函数的返回值执行单个 free() 之外的任何操作。您需要以任何需要的方式编写代码，以确保代码以这种方式工作，因为人们根本不会阅读您可能提供的任何文档，而是根据返回值的内存约定（即 char * * 所以来电者会期望必须释放它）。

char** tokenize(char* input, const char* sep)
{
  /* Size of the 'toks' array. Starts low and is doubled when
  ** exhausted.
  */
  size_t size = 4;

  /* 'ntok' points to the next free element of the 'toks' array
   */
  size_t ntok = 0;

  /* This is the array filled with tokens and returned
   */
  char** toks = malloc(size * sizeof(*toks));

  if ( toks == NULL )
    return;

  toks[ntok] = strtok( input, sep );


  /* While we have more tokens to process...
   */

  do
    {
      /* We should still have 2 empty elements in the array, 
      ** one for this token and one for the sentinel.
      */
      if (ntok > size - 2)
        {
      char** newtoks;
      size *= 2;

      newtoks = realloc(toks, size * sizeof(*toks));

      if (newtoks == NULL)
        {
          free(toks);
          return NULL;
        }

      toks = newtoks;
        }
      ntok++;
      toks[ntok] = strtok(0, sep);
    }  while (toks[ntok]);

  return toks;
}

score 1 · Accepted Answer

只是几件事：

使用 goto 本质上并不邪恶或不好，就像预处理器一样，它们经常被滥用。在像您这样的情况下，您必须根据事情的进展情况以不同的方式退出函数，它们是合适的。
提供释放返回数组的功能方法。即tok_free(指针)。
最初使用strtok() 的可重入版本，即strtok_r()。有人为此传递一个额外的参数（如果不需要，甚至是 NULL）不会很麻烦。

score 1 · Accepted Answer

您不需要 strdup() 每个令牌；您复制输入字符串，并可以让 strtok() 将其切碎。它也简化了事后释放资源的过程——您只需要释放指针数组和单个字符串。

我同意那些说你需要一个函数来释放数据的人的观点——除非你彻底改变界面并让用户提供指针数组作为输入参数，然后你可能还会决定用户负责复制如果必须保留字符串。这导致一个接口：

int tokenize(char *source, const char *sep, char **tokens, size_t max_tokens);

返回值将是找到的令牌数。

您必须决定当数组中的令牌多于插槽时该怎么做。选项包括：

返回错误指示（负数，可能是 -1），或
找到的令牌的全部数量，但不能分配的指针不是，或者
只是适合的令牌数量，或
比token的数量多1，表示有更多，但没有关于到底有多少的信息。

我选择返回'-1'，它导致了这个代码：

/*
@(#)File:           $RCSfile: tokenise.c,v $
@(#)Version:        $Revision: 1.9 $
@(#)Last changed:   $Date: 2008/02/11 08:44:50 $
@(#)Purpose:        Tokenise a string
@(#)Author:         J Leffler
@(#)Copyright:      (C) JLSS 1987,1989,1991,1997-98,2005,2008
@(#)Product:        :PRODUCT:
*/

/*TABSTOP=4*/

/*
**  1.  A token is 0 or more characters followed by a terminator or separator.
**      The terminator is ASCII NUL '\0'.  The separators are user-defined.
**  2.  A leading separator is preceded by a zero-length token.
**      A trailing separator is followed by a zero-length token.
**  3.  The number of tokens found is returned.
**      The list of token pointers is terminated by a NULL pointer.
**  4.  The routine returns 0 if the arguments are invalid.
**      It returns -1 if too many tokens were found.
*/

#include "jlss.h"
#include <string.h>

#define NO  0
#define YES 1

#define IS_SEPARATOR(c,s,n) (((c) == *(s)) || ((n) > 1 && strchr((s),(c))))
#define DIM(x)              (sizeof(x)/sizeof(*(x)))

#ifndef lint
/* Prevent over-aggressive optimizers from eliminating ID string */
const char jlss_id_tokenise_c[] = "@(#)$Id: tokenise.c,v 1.9 2008/02/11 08:44:50 jleffler Exp $";
#endif /* lint */

int             tokenise(
    char   *str,            /* InOut: String to be tokenised */
    char   *sep,            /* In:    Token separators */
    char  **token,          /* Out:   Pointers to tokens */
    int     maxtok,         /* In:    Maximum number of tokens */
    int     nulls)          /* In:    Are multiple separators OK? */
{
    int             c;
    int             n_tokens;
    int             tokenfound;
    int             n_sep = strlen(sep);

    if (n_sep <= 0 || maxtok <= 2)
        return(0);

    n_tokens = 1;
    *token++ = str;
    while ((c = *str++) != '\0')
    {
        tokenfound = NO;
        while (c != '\0' && IS_SEPARATOR(c, sep, n_sep))
        {
            tokenfound = YES;
            *(str - 1) = '\0';
            if (nulls)
                break;
            c = *str++;
        }
        if (tokenfound)
        {
            if (++n_tokens >= maxtok - 1)
                return(-1);
            if (nulls)
                *token++ = str;
            else
                *token++ = str - 1;
        }
        if (c == '\0')
            break;
    }
    *token++ = 0;
    return(n_tokens);
}

#ifdef TEST

struct
{
    char           *sep;
    int             nulls;
}               data[] =
{
    {   "/.",   0   },
    {   "/.",   1   },
    {   "/",    0   },
    {   "/",    1   },
    {   ".",    0   },
    {   ".",    1   },
    {   "",     0   }
};

static char string[] = "/fred//bill.c/joe.b/";

int main(void)
{
    int             i;
    int             j;
    int             n;
    char            input[100];
    char           *token[20];

    for (i = 0; i < DIM(data); i++)
    {
        strcpy(input, string);
        printf("\n\nTokenising <<%s>> using <<%s>>, null %d\n",
               input, data[i].sep, data[i].nulls);
        n = tokenise(input, data[i].sep, token, DIM(token),
                     data[i].nulls);
        printf("Return value = %d\n", n);
        for (j = 0; j < n; j++)
            printf("Token %d: <<%s>>\n", j, token[j]);
        if (n > 0)
            printf("Token %d: 0x%08lX\n", n, (unsigned long)token[n]);
    }
    return(0);
}

#endif /* TEST */

score 0 · Accepted Answer

0

如果您想查找内存泄漏，一种可能性是使用valgrind运行它。

于 2009-11-07T06:34:04.470 回答

score 0 · Accepted Answer

0

有一个很棒的工具可以检测内存泄漏，称为 Valgrind。

http://valgrind.org/

于 2009-11-07T06:38:12.363 回答

c - C 中字符串标记化函数的复杂性

6 回答 6

Related

Reference