delphi - 不区分大小写的位置

Question

在 D2010 (unicode) 中是否有类似 Pos 的类似功能不区分大小写？

我知道我可以使用 Pos(AnsiUpperCase(FindString), AnsiUpperCase(SourceString)) ，但是每次调用函数时都会通过将字符串转换为大写来增加大量的处理时间。

例如，在 1000000 循环中，Pos 需要 78 毫秒，而转换为大写需要 764 毫秒。

str1 := 'dfkfkL%&/s"#<.676505';
  for i := 0 to 1000000 do
    PosEx('#<.', str1, 1); // Takes 78ms

  for i := 0 to 1000000 do
    PosEx(AnsiUpperCase('#<.'), AnsiUpperCase(str1), 1); // Takes 764ms

我知道为了提高这个特定示例的性能，我可以在循环之前先将字符串转换为大写，但是我希望有一个不区分大小写的类似 Pos 的函数的原因是从 FastStrings 替换一个. 我将使用 Pos 的所有字符串都会有所不同，因此我需要将每个字符串都转换为大写。

是否有任何其他函数可能比 Pos + 将字符串转换为大写更快？

score 25 · Accepted Answer

用于执行此操作的内置 Delphi 函数在AnsiStrings.ContainsText和Unicode 字符串的StrUtils.ContainsText中。

然而，在后台，他们使用的逻辑与您的逻辑非常相似。

无论在哪个库中，这样的函数总是很慢：尤其是要尽可能与 Unicode 兼容，它们需要有相当多的开销。而且由于它们在循环内部，因此成本很高。

避免这种开销的唯一方法是尽可能在循环之外进行这些转换。

所以：按照你自己的建议，你有一个非常好的解决方案。

——杰伦

score 11 · Accepted Answer

我之前的答案的这个版本适用于 D2007 和 D2010。

在 Delphi 2007 中CharUpCaseTable是 256 字节
在 Delphi 2010 中为 128 KB (65535*2)。

原因是字符大小。在旧版本的 Delphi 中，我的原始代码在初始化时只支持当前的语言环境字符集。我InsensPosEx的比你的代码快 4 倍。当然可以走得更快，但我们会失去简单性。

type
  TCharUpCaseTable = array [Char] of Char;

var
  CharUpCaseTable: TCharUpCaseTable;

procedure InitCharUpCaseTable(var Table: TCharUpCaseTable);
var
  n: cardinal;
begin
  for n := 0 to Length(Table) - 1 do
    Table[Char(n)] := Char(n);
  CharUpperBuff(@Table, Length(Table));
end;

function InsensPosEx(const SubStr, S: string; Offset: Integer = 1): Integer;
var
  n:            Integer;
  SubStrLength: Integer;
  SLength:      Integer;
label
  Fail;
begin
  Result := 0;
  if S = '' then Exit;
  if Offset <= 0 then Exit;

  SubStrLength := Length(SubStr);
  SLength := Length(s);

  if SubStrLength > SLength then Exit;

  Result := Offset;
  while SubStrLength <= (SLength-Result+1) do 
  begin
    for n := 1 to SubStrLength do
      if CharUpCaseTable[SubStr[n]] <> CharUpCaseTable[s[Result+n-1]] then
        goto Fail;
      Exit;
Fail:
    Inc(Result);
  end;
  Result := 0;
end;

//...

initialization
  InitCharUpCaseTable({var}CharUpCaseTable);

score 5 · Accepted Answer

我还遇到了转换 FastStrings 的问题，它使用 Boyer-Moore (BM) 搜索来获得一些速度，用于 D2009 和 D2010。由于我的许多搜索都只查找单个字符，并且其中大多数都在查找非字母字符，因此我的 D2010 版本的 SmartPos 有一个重载版本，第一个参数是宽字符，并通过字符串进行简单循环找到这些。我使用两个参数的大写来处理少数不区分大小写的情况。对于我的应用程序，我相信这个解决方案的速度可以与 FastStrings 相媲美。

对于“字符串查找”的情况，我的第一遍是使用 SearchBuf 并进行大写并接受惩罚，但我最近一直在研究使用 Unicode BM 实现的可能性。您可能知道，BM 不能很好地或轻松地扩展到 Unicode 比例的字符集，但是Soft Gems有一个 Unicode BM 实现. 这早于 D2009 和 D2010，但看起来好像很容易转换。作者 Mike Lischke 通过包含一个 67kb Unicode 大写表来解决大写问题，这对于我的适度要求来说可能有点过头了。由于我的搜索字符串通常很短（尽管不像您的单个三字符示例那么短），Unicode BM 的开销也可能是不值得付出的代价：BM 优势随着被搜索字符串的长度而增加。

这绝对是在将 Unicode BM 合并到我自己的应用程序之前需要使用一些实际应用程序特定示例进行基准测试的情况。

编辑：一些基本的基准测试表明我对“Unicode Tuned Boyer-Moore”解决方案保持警惕是正确的。在我的环境中，UTBM 会产生更大的代码和更长的时间。如果我需要此实现提供的一些附加功能（处理代理和仅全词搜索），我可能会考虑使用它。

score 4 · Accepted Answer

这是我编写并使用多年的一个：

function XPos( const cSubStr, cString :string ) :integer;
var
  nLen0, nLen1, nCnt, nCnt2 :integer;
  cFirst :Char;
begin
  nLen0 := Length(cSubStr);
  nLen1 := Length(cString);

  if nLen0 > nLen1 then
    begin
      // the substr is longer than the cString
      result := 0;
    end

  else if nLen0 = 0 then
    begin
      // null substr not allowed
      result := 0;
    end

  else

    begin

      // the outer loop finds the first matching character....
      cFirst := UpCase( cSubStr[1] );
      result := 0;

      for nCnt := 1 to nLen1 - nLen0 + 1 do
        begin

          if UpCase( cString[nCnt] ) = cFirst then
            begin
              // this might be the start of the substring...at least the first
              // character matches....
              result := nCnt;

              for nCnt2 := 2 to nLen0 do
                begin

                  if UpCase( cString[nCnt + nCnt2 - 1] ) <> UpCase( cSubStr[nCnt2] ) then
                    begin
                      // failed
                      result := 0;
                      break;
                    end;

                end;

            end;


          if result > 0 then
            break;
        end;


    end;
end;

score 2 · Accepted Answer

为什么不在常规 Pos 语句中将子字符串和源字符串都转换为小写或大写。结果实际上是不区分大小写的，因为两个参数都在一种情况下。简单而精简。

score 1 · Accepted Answer

Jedi 代码库具有StrIPos 和数以千计的其他有用函数来补充 Delphi 的 RTL。当我仍然在 Delphi 中工作很多时，JCL 和它的视觉兄弟 JVCL 是我添加到新安装的 Delphi 的第一批东西之一。

score 0 · Accepted Answer

相反，您可以使用“AnsiUpperCase”，它更快。我已经重塑了我的旧代码。它非常简单，也非常快。核实：

type
  TAnsiUpCaseTable = array [AnsiChar] of AnsiChar;

var
  AnsiTable: TAnsiUpCaseTable;

procedure InitAnsiUpCaseTable(var Table: TAnsiUpCaseTable);
var
  n: cardinal;
begin
  for n := 0 to SizeOf(TAnsiUpCaseTable) -1 do
  begin
    AnsiTable[AnsiChar(n)] := AnsiChar(n);
    CharUpperBuff(@AnsiTable[AnsiChar(n)], 1);
  end;
end;

function UpCasePosEx(const SubStr, S: string; Offset: Integer = 1): Integer;
var
  n              :integer;
  SubStrLength   :integer;
  SLength        :integer;
label
  Fail;
begin
  SLength := length(s);
  if (SLength > 0) and (Offset > 0) then begin
    SubStrLength := length(SubStr);
    result := Offset;
    while SubStrLength <= SLength - result + 1 do begin
      for n := 1 to SubStrLength do
        if AnsiTable[SubStr[n]] <> AnsiTable[s[result + n -1]] then
          goto Fail;
      exit;
Fail:
      inc(result);
    end;
  end;
  result := 0;
end;

initialization
  InitAnsiUpCaseTable(AnsiTable);
end.

score 0 · Accepted Answer

我认为，在 Pos 之前转换为大写或小写是最好的方法，但你应该尽量少调用 AnsiUpperCase/AnsiLowerCase 函数。

score 0 · Accepted Answer

在这种情况下，我找不到任何比 Pos() + 某种形式的字符串规范化（大写/小写转换）更好的方法。

这并不完全令人惊讶，因为在 Delphi 2009 中对 Unicode 字符串处理进行基准测试时，我发现 Pos() RTL 例程自 Delphi 7 以来有了显着改进，部分原因是 FastCode 库的各个方面已被合并到 RTL 中现在有一段时间了。

另一方面，FastStrings 库 - iirc - 很长一段时间没有得到显着更新。在测试中，我发现许多 FastStrings 例程实际上已被等效的 RTL 函数取代（有几个例外，解释为 Unicode 的额外复杂性导致的不可避免的开销）。

史蒂夫提出的解决方案的“Char-Wise”处理是迄今为止最好的。

任何涉及规范化整个字符串（字符串和子字符串）的方法都有可能在结果中任何基于字符的位置引入错误，因为使用 Unicode 字符串时，大小写转换可能会导致字符串长度发生变化（某些字符在大小写转换中转换为更多/更少的字符）。

这些可能是罕见的情况，但史蒂夫的例行程序避免了它们，并且只比已经相当快的 Pos + Uppercase 慢了大约 10%（你的基准测试结果与我的分数不一致）。

score 0 · Accepted Answer

通常，简单的解决方案就是您想要使用的解决方案：

if AnsiPos(AnsiupperCase('needle'), AnsiupperCase('The Needle in the haystack')) <> 0 then
    DoSomething;

参考：

score 0 · Accepted Answer

Windows 上的任何程序都可以调用 shell-API 函数，这可以减少代码大小。像往常一样，从下往上阅读程序。这仅使用 Ascii 字符串进行了测试，而不是宽字符串。

program PrgDmoPosIns; {$AppType Console} // demo case-insensitive Pos function for Windows

// Free Pascal 3.2.2 [2022/01/02], Win32 for i386
// FPC.EXE -vq -CoOr -Twin32 -oPrgStrPosDmo.EXE PrgStrPosDmo.LPR
//         -vq Verbose: Show message numbers
//             -C Code generation:
//               o Check overflow of integer operations
//                O Check for possible overflow of integer operations - Integer Overflow checking turns on Warning 4048
//                 r Range checking
//                   -Twin32 Target 32 bit Windows operating systems
// 29600 bytes code, 1316 bytes data, 35,840 bytes file

function StrStrIA( pszHaystack, pszNeedle : PChar ) : PChar; stdcall; external 'shlwapi.dll'; // dynamic link to Windows API's case-INsensitive search
// https://docs.microsoft.com/en-us/windows/win32/api/shlwapi/nf-shlwapi-strstria
// "FPC\3.2.2\Source\Packages\winunits-base\src\shlwapi.pp" line 557

function StrPos(        strNeedle, strHaystk : string ) : SizeInt; // return the position of Needle within Haystack, or zero if not found
var
  intRtn       : SizeInt; // function result
  ptrHayStk             , // pointers to
  ptrNeedle             , //   search strings
  strMchFnd    : PChar  ; // pointer to match-found string, or null-pointer/empty-string when not found
  bolFnd       : boolean; // whether Needle was found within Haystack
  intLenHaystk          , // length of haystack
  intLenMchFnd : SizeInt; // length of needle
begin
  strHayStk :=       strHayStk + #0            ; // strings passed to API must be
  strNeedle :=       strNeedle + #0            ; //   null-terminated

  ptrHayStk := Addr( strHayStk[ 1 ] )          ; // set pointers to point at first characters of
  ptrNeedle := Addr( strNeedle[ 1 ] )          ; //   null-terminated strings, so API gets C-style strings

  strMchFnd := StrStrIA( ptrHayStk, ptrNeedle ); // call Windows to perform search; match-found-string now points inside the Haystack
  bolFnd    := ( strMchFnd <> '' )             ; // variable is True when           match-found-string is not null/empty

  if bolFnd then begin                         ; // when Needle was yes found in Haystack
    intLenMchFnd := Length( strMchFnd )        ; // get length of needle
    intLenHaystk := Length( strHayStk )        ; // get length of haystack
    intRtn       := intLenHaystk - intLenMchFnd; // set  function result to the position of needle within haystack, which is the difference in lengths
  end       else                                 // when Needle was not found in Haystack
    intRtn       := 0                          ; // set  function result to tell caller needle does not appear within haystack
  StrPos := intRtn                             ; // pass function result back to caller
end; // StrPos

procedure TstOne( const strNeedle, strHayStk : string ); // run one test with this Needle
var
  intPos : SizeInt; // found-match location of Needle within Haystack, or zero if none
begin
  write  ( 'Searching for : [', strNeedle, ']' ); // bgn output row for this test
  intPos := StrPos(  strNeedle, strHaystk      ); // get Needle position
  writeln(' StrPos is '       , intPos         ); // end output row for this test
end; // TstOne

procedure TstAll(                                     ); // run all tests with various Needles
const
  strHayStk = 'Needle in a Haystack'; // all tests will search in this string
begin
  writeln( 'Searching in  : [', strHayStk, ']' ); // emit header row
  TstOne ( 'Noodle'           , strHayStk      ); // test not-found
  TstOne ( 'Needle'           , strHayStk      ); // test found at yes-first character
  TstOne ( 'Haystack'         , strHayStk      ); // test found at not-first character
end; // TstAll

begin // ***** MAIN *****
  TstAll( ); // run all tests
end.

delphi - 不区分大小写的位置

11 回答 11

Related

Reference