java - Java RegEx performance when using Pattern.CASE_INSENSITIVE

Question

I am using a very simple regular expression:

%%(products?)%%

Now I want it to match both Products? and products?. The obvious answer is to use the CASE_INSENSITIVE flag when compiling the pattern:

Pattern.compile("%%(products?)%%", Pattern.CASE_INSENSITIVE)

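But the documentation says "Specifying this flag may impose a slight performance penalty." So I came up with an alternative that does not use the flag: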

Pattern.compile("%%([Pp]roducts?)%%")

My question is: which one performs better?

score 3 · Accepted Answer

Since the case-insensitive version is equivalent to

Pattern.compile("%%([Pp][Rr][Oo][Dd][Uu][Cc][Tt][Ss]?)%%")

it is clear that you will incur some performance penalty.

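So in your case, the [Pp]roducts? version will be more efficient (and also more limited, since only the first letter may vary in case).
However, in this case (and probably in most cases), I would say the penalty is small enough to ignore. If your application really is performance-critical, you can always benchmark to see whether the speedup is noticeable.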
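
To make "more limited" concrete, here is a minimal sketch that checks which inputs each pattern accepts. The two patterns are the ones from the question; the class name CaseDemo, the constant names, and the helper matches are hypothetical names chosen only for this illustration.

```java
import java.util.regex.Pattern;

public class CaseDemo {
    // Pattern from the question, compiled with the case-insensitive flag.
    static final Pattern FLAGGED =
            Pattern.compile("%%(products?)%%", Pattern.CASE_INSENSITIVE);
    // Alternative from the question: only the first letter may vary in case.
    static final Pattern BRACKETED = Pattern.compile("%%([Pp]roducts?)%%");

    // Hypothetical helper: true if the whole input matches the pattern.
    static boolean matches(Pattern p, String input) {
        return p.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] inputs = {"%%product%%", "%%Products%%", "%%PRODUCTS%%", "%%pRoDuCt%%"};
        for (String s : inputs) {
            System.out.printf("%-14s flagged=%b bracketed=%b%n",
                    s, matches(FLAGGED, s), matches(BRACKETED, s));
        }
    }
}
```

Both patterns accept %%product%% and %%Products%%, but only the flagged one accepts %%PRODUCTS%% or %%pRoDuCt%%. If such inputs can occur, the two options are not interchangeable regardless of speed.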

score 2 · Accepted Answer

Actually, there is a significant difference between these approaches.

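Although Pattern.compile("%%(products?)%%", Pattern.CASE_INSENSITIVE) looks less efficient than Pattern.compile("%%([Pp]roducts?)%%") at first glance, internally it does not simply compare each character against its lowercase and uppercase counterparts. What actually happens is that the first approach performs range checks against the Unicode lowercase and uppercase blocks, while the second performs literal comparisons.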

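I don't have deeper knowledge than that, but the important part is this simple yet quite interesting test (the results on my machine are included at the end):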

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CaseInsensitiveBenchmark {
  public static void main(String[] args) {
    String base = "I have a product that is the product of my hard work."
      + "Products are always nice, because I can win cash if I sell my products.\n"
      + "The product of me making my product is cash, because cash is the product of selling my product.\n"
      + "With the cash I win with my product, I can buy other people's products.";

    int processRepeats = 1000000; //One million runs, enough to take time for each clocking.
    int averageRepeats = 10;

    long averager = 0; //Sum of the elapsed nanos over all clockings.
    int count = 0;     //Total matches found; only used by the optional printf below.

    //Switch the commenting to test the opposing method.
    Pattern p = Pattern.compile("products?", Pattern.CASE_INSENSITIVE);
    //Pattern p = Pattern.compile("[Pp]roducts?");
    Matcher m;
    long clocking;
    for (int i = 0; i < averageRepeats; i++) {
      clocking = System.nanoTime();
      for (int ii = 0; ii < processRepeats; ii++) {
        m = p.matcher(base); //Here because the "base" would change in a real environment.
        while (m.find()) {
          count++;
        }
      }
      clocking = System.nanoTime() - clocking;
      averager += clocking;
      //System.out.printf("This method found %9d matches in %15d nanos [%9.3f ms]\n", count, clocking, clocking / 1000000f);
    }
    System.out.printf("This method averages %15d nanos [%16.3f ms] for %d times executing %d runs.\n",
      averager / averageRepeats, (averager / (float) averageRepeats) / 1000000f, averageRepeats, processRepeats);
  }
}

//RESULTS ON MY MACHINE:

//FIRST METHOD: [3 runs to demonstrate/guarantee consistency]
//This method averages      5024404693 nanos [        5024,404 ms] for 10 times executing 1000000 runs.
//This method averages      5021385539 nanos [        5021,386 ms] for 10 times executing 1000000 runs.
//This method averages      5017170143 nanos [        5017,170 ms] for 10 times executing 1000000 runs.

//SECOND METHOD: [same deal]
//This method averages      5806310774 nanos [        5806,311 ms] for 10 times executing 1000000 runs.
//This method averages      5809879747 nanos [        5809,880 ms] for 10 times executing 1000000 runs.
//This method averages      5804277386 nanos [        5804,277 ms] for 10 times executing 1000000 runs.

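As you can see, not only is the first method faster (although this ultimately depends on the machine it runs on), but over a large number of runs the difference of nearly 800 ms (8/10 of a second, roughly 16% in these measurements) may not be as negligible as one might think!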
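
Timings from a hand-rolled loop like the test above can be skewed by JIT warm-up and by the order in which the variants run, so the gap may look different on another JVM. The following is a minimal sketch, not a substitute for a proper harness such as JMH, that runs an untimed warm-up phase before measuring and alternates the two patterns. The class name RegexBench, the helpers countMatches and nanosPerPass, and the iteration counts are all assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexBench {
    // Same sample text as in the test above.
    static final String TEXT =
            "I have a product that is the product of my hard work."
          + "Products are always nice, because I can win cash if I sell my products.\n"
          + "The product of me making my product is cash, because cash is the product of selling my product.\n"
          + "With the cash I win with my product, I can buy other people's products.";

    // Counts every match of p in text.
    static int countMatches(Pattern p, String text) {
        Matcher m = p.matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    // Average nanoseconds per full pass over TEXT, measured after an untimed warm-up.
    static double nanosPerPass(Pattern p, int warmupPasses, int timedPasses) {
        long sink = 0;
        for (int i = 0; i < warmupPasses; i++) {
            sink += countMatches(p, TEXT);
        }
        long start = System.nanoTime();
        for (int i = 0; i < timedPasses; i++) {
            sink += countMatches(p, TEXT);
        }
        long elapsed = System.nanoTime() - start;
        // Use the accumulated count so the JIT cannot discard the loops as dead code.
        if (sink < 0) {
            System.out.println(sink);
        }
        return (double) elapsed / timedPasses;
    }

    public static void main(String[] args) {
        Pattern flagged = Pattern.compile("products?", Pattern.CASE_INSENSITIVE);
        Pattern bracketed = Pattern.compile("[Pp]roducts?");
        // Alternate the variants so neither one always runs first.
        for (int round = 0; round < 3; round++) {
            System.out.printf("flagged:   %.1f ns/pass%n", nanosPerPass(flagged, 100_000, 1_000_000));
            System.out.printf("bracketed: %.1f ns/pass%n", nanosPerPass(bracketed, 100_000, 1_000_000));
        }
    }
}
```

Reporting nanoseconds per pass rather than a total for a million passes makes it easier to judge whether the difference matters for a given workload.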
