笔记
下面的问题是在 2008 年提出的关于 2003 年的一些代码的问题。正如 OP 的更新所显示的那样,整个帖子已被 2008 年的老式算法所淘汰,并且仅作为历史好奇心保留在这里。
我需要在 C/C++ 中进行快速不区分大小写的子字符串搜索。我的要求如下:
- 应该表现得像 strstr() (即返回一个指向匹配点的指针)。
- 必须不区分大小写 (doh)。
- 必须支持当前语言环境。
- 必须在 Windows (MSVC++ 8.0) 上可用或易于移植到 Windows(即来自开源库)。
这是我正在使用的当前实现(取自 GNU C 库):
/* Return the offset of one string within another.
Copyright (C) 1994,1996,1997,1998,1999,2000 Free Software Foundation, Inc.
This file is part of the GNU C Library.
The GNU C Library is free software; you can redistribute it and/or
modify it under the terms of the GNU Lesser General Public
License as published by the Free Software Foundation; either
version 2.1 of the License, or (at your option) any later version.
The GNU C Library is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
Lesser General Public License for more details.
You should have received a copy of the GNU Lesser General Public
License along with the GNU C Library; if not, write to the Free
Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA
02111-1307 USA. */
/*
* My personal strstr() implementation that beats most other algorithms.
* Until someone tells me otherwise, I assume that this is the
* fastest implementation of strstr() in C.
* I deliberately chose not to comment it. You should have at least
* as much fun trying to understand it, as I had to write it :-).
*
* Stephen R. van den Berg, berg@pool.informatik.rwth-aachen.de */
/*
* Modified to use table lookup instead of tolower(), since tolower() isn't
* worth s*** on Windows.
*
* -- Anders Sandvig (anders@wincue.org)
*/
#if HAVE_CONFIG_H
# include <config.h>
#endif
#include <ctype.h>
#include <string.h>
typedef unsigned chartype;
char char_table[256];
void init_stristr(void)
{
int i;
char string[2];
string[1] = '\0';
for (i = 0; i < 256; i++)
{
string[0] = i;
_strlwr(string);
char_table[i] = string[0];
}
}
#define my_tolower(a) ((chartype) char_table[a])
char *
my_stristr (phaystack, pneedle)
const char *phaystack;
const char *pneedle;
{
register const unsigned char *haystack, *needle;
register chartype b, c;
haystack = (const unsigned char *) phaystack;
needle = (const unsigned char *) pneedle;
b = my_tolower (*needle);
if (b != '\0')
{
haystack--; /* possible ANSI violation */
do
{
c = *++haystack;
if (c == '\0')
goto ret0;
}
while (my_tolower (c) != (int) b);
c = my_tolower (*++needle);
if (c == '\0')
goto foundneedle;
++needle;
goto jin;
for (;;)
{
register chartype a;
register const unsigned char *rhaystack, *rneedle;
do
{
a = *++haystack;
if (a == '\0')
goto ret0;
if (my_tolower (a) == (int) b)
break;
a = *++haystack;
if (a == '\0')
goto ret0;
shloop:
;
}
while (my_tolower (a) != (int) b);
jin:
a = *++haystack;
if (a == '\0')
goto ret0;
if (my_tolower (a) != (int) c)
goto shloop;
rhaystack = haystack-- + 1;
rneedle = needle;
a = my_tolower (*rneedle);
if (my_tolower (*rhaystack) == (int) a)
do
{
if (a == '\0')
goto foundneedle;
++rhaystack;
a = my_tolower (*++needle);
if (my_tolower (*rhaystack) != (int) a)
break;
if (a == '\0')
goto foundneedle;
++rhaystack;
a = my_tolower (*++needle);
}
while (my_tolower (*rhaystack) == (int) a);
needle = rneedle; /* took the register-poor approach */
if (a == '\0')
break;
}
}
foundneedle:
return (char*) haystack;
ret0:
return 0;
}
你能让这段代码更快吗,或者你知道更好的实现吗?
注意:我注意到 GNU C 库现在有一个新的实现strstr()
,但我不确定它是否可以轻松地修改为不区分大小写,或者它实际上是否比旧的更快(在我的情况下)。我还注意到旧的实现仍然用于宽字符串,所以如果有人知道为什么,请分享。
更新
只是为了让事情清楚——以防它还没有——我没有编写这个函数,它是 GNU C 库的一部分。我只是将其修改为不区分大小写。
另外,感谢有关strcasestr()
其他来源(如 OpenBSD、FreeBSD 等)的其他实现的提示和检查。这似乎是要走的路。上面的代码来自 2003 年,这就是为什么我在这里发布它以希望有更好的版本可用,显然是这样。:)