c - 在C中匹配（几个）字符串的最有效方法？

Question

我们的系统需要接受来自终端的用户输入并匹配几个已知的关键字字符串（可能是 10 个）。

我们没有空间/计算机来做正则表达式等，代码需要小而快。

现在，这样做的讨厌的方法是：

   // str is null-terminated, assume we know it's safe/sane here
   if(!strncmp(str,"hello",5)
   {
      do_hello();
   }
   else if(!strncmp(str,"world",5)
   {
      do_world();
   }
   else
   {
      meh(); // Wasn't a match
   }

因此，经过一番谷歌搜索和阅读后，我确信更好的方法是将各种匹配项的哈希值预先计算为 int，然后使用 case 语句：

// Assume hash() stops at NULL
switch(hash(str))
{
   case HASH_OF_HELLO:
      do_hello();
      break;

   case HASH_OF_WORLD:
      do_world();
      break;

   default:
      meh();
      break;
}

我们可以在编译时计算 *HASH_OF_match*。这似乎是一种从相对较小的集合中选择字符串的更快/更优雅的方式。

那么——这看起来合理吗？/ 这样做有明显的问题吗？/ 有人有更优雅的方法吗？

作为一个脚注，这是我今天下午看到的最好看的哈希算法；），归功于 dan bernstein，它看起来很适合手头的工作。

unsigned int
get_hash(const char* s)
{
    unsigned int hash = 0;
    int c;

    while((c = *s++))
    {
        // hash = hash * 33 ^ c 
        hash = ((hash << 5) + hash) ^ c;
    }

    return hash;
}

score 5 · Accepted Answer

散列的问题在于，用户输入的任意字符串可能会生成与您的匹配项相同的散列，并且您将执行错误的内容。对于小至 10 的搜索集，我会坚持使用这种if-else方法。或者使用字符串数组和函数指针数组（假设所有函数具有相同的签名）来选择要执行的函数。

char const *matches[10] = {"first", "second", ..., "tenth"};
void (*fn[10])(void) = {&do_first, &do_second, ..., &do_tenth};

for( i = 0; i < 10; ++i ) {
  if( strcmp( str, matches[i] ) == 0 ) {
    (*fn[i])();
  }
}

score 3 · Accepted Answer

像在 Boyer-Moore 字符串搜索算法中那样在最后一个字符上使用嵌套 switch 语句怎么样？

http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm

score 1 · Accepted Answer

散列和散列表最适合处理大量数据。由于输入字符串的数量是已知且有限的，您或许可以考虑这种方法：

让我们假设已知的字符串是

const char* STR_TABLE [STR_N] =
{
  "hello",
  "world",
  "this",
  "is",
  "a",
  "number",
  "of",
  "ten",
  "test",
  "strings"
};

然后我们可以在编译之前按字母顺序手动对它们进行排序，因为排序表提供了更快的搜索可能性。然后，您可以使用二进制搜索。

#include <stdio.h>
#include <stdlib.h>

#define STR_N 10


const char* STR_TABLE [STR_N] =
{
  "a",
  "hello",
  "is",
  "number",
  "of",
  "strings",
  "ten",
  "test",
  "this",
  "world"
};


int ptr_strcmp (const void* str1, const void* str2)
{
  return strcmp(str1, *(const char**)str2);
}

int main()
{
  const char* user_input = "world"; // worst case
  const char** result;

  result = bsearch (user_input,
                    STR_TABLE,
                    STR_N,
                    sizeof(const char*),
                    ptr_strcmp);

  if(result != NULL)
  {
    printf("%s\n", *result);
  }
  else
  {
    printf("meh\n");
  }

}

这将归结为：

比较 "world" 和 "of"，1 比较 'w' != 'o'。

比较“世界”和“测试”，1比较'w'！='t'。

比较“world”和“this”，1比较'w'！='t'。

比较“世界”和“世界”，5 次比较。

比较总数为 8。

这当然涉及一些开销代码，检查'\0'和二进制搜索调用。您必须在您的特定平台上测量建议的各种方法，以找出最佳方法。

score 1 · Accepted Answer

1

听起来你想使用gperf。

于 2012-09-04T18:39:26.090 回答

score 0 · Accepted Answer

也许解决方案可能是这样的：

struct keyword {
    unsigned int hash;
    const char *str;
    void (*job)();
};

//A table with our keywords with their corresponding hashes. If you could not
//compute the hash at compile time, a simple init() function at the beginning
//of your program could initialize each entry by using the value in 'str'
//You could also implement a dynamic version of this table (linked list of keywords)
//for extending your keyword table during runtime
struct keyword mykeywords[] = {
    {.hash = HASH_OF_HELLO, .str = "hello", .job = do_hello},
    {.hash = HASH_OF_WORLD, .str = "world", .job = do_world},
    ...
    {.str = 0} //signal end of list of keywords

};

void run(const char *cmd)
{
    unsigned int cmdhash = get_hash(cmd);
    struct keyword *kw = mykeywords;
    while(kw->str) {
        //If hash matches then compare the string, since we should consider hashing collisions too!
        //The order of conditions below is important
        if (kw->hash == cmdhash && !strcmp(cmd, kw->str)) { 
             kw->job();
             break;   
        }
        kw++;
    }
}

c - 在C中匹配（几个）字符串的最有效方法？

5 回答 5

Related

Reference