performance - String 构造函数中缺少边界检查消除？

Question

查看 UTF8 解码性能，我注意到 protobuf 的性能UnsafeProcessor::decodeUtf8优于String(byte[] bytes, int offset, int length, Charset charset)以下非 ascii 字符串"Quizdeltagerne spiste jordbær med flØde, mens cirkusklovnen"：

我试图找出原因，所以我复制了相关代码String并将数组访问替换为不安全的数组访问，与UnsafeProcessor::decodeUtf8. 以下是 JMH 基准测试结果：

Benchmark                       Mode  Cnt    Score   Error  Units
StringBenchmark.safeDecoding    avgt   10  127.107 ± 3.642  ns/op
StringBenchmark.unsafeDecoding  avgt   10  100.915 ± 4.090  ns/op

我认为差异是由于缺少我希望启动的边界检查消除，特别是因为checkBoundsOffCount(offset, length, bytes.length)在String(byte[] bytes, int offset, int length, Charset charset).

问题真的是缺少边界检查消除吗？

这是我使用 OpenJDK 17 和 JMH 进行基准测试的代码。请注意，这只是String(byte[] bytes, int offset, int length, Charset charset)构造函数代码的一部分，并且仅适用于这个特定的德语字符串。静态方法是从String. 查找// the unsafe version:表明我将安全访问替换为不安全的位置的注释。

    private static byte[] safeDecode(byte[] bytes, int offset, int length) {
        checkBoundsOffCount(offset, length, bytes.length);
        int sl = offset + length;
        int dp = 0;
        byte[] dst = new byte[length];
        while (offset < sl) {
            int b1 = bytes[offset];
            // the unsafe version:
            // int b1 = UnsafeUtil.getByte(bytes, offset);
            if (b1 >= 0) {
                dst[dp++] = (byte)b1;
                offset++;
                continue;
            }
            if ((b1 == (byte)0xc2 || b1 == (byte)0xc3) &&
                    offset + 1 < sl) {
                // the unsafe version:
                // int b2 = UnsafeUtil.getByte(bytes, offset + 1);
                int b2 = bytes[offset + 1];
                if (!isNotContinuation(b2)) {
                    dst[dp++] = (byte)decode2(b1, b2);
                    offset += 2;
                    continue;
                }
            }
            // anything not a latin1, including the repl
            // we have to go with the utf16
            break;
        }
        if (offset == sl) {
            if (dp != dst.length) {
                dst = Arrays.copyOf(dst, dp);
            }
            return dst;
        }

        return dst;
    }

跟进

显然，如果我将 while 循环条件从更改为offset < sl，0 <= offset && offset < sl 我会在两个版本中获得相似的性能：

Benchmark                       Mode  Cnt    Score    Error  Units
StringBenchmark.safeDecoding    avgt   10  100.802 ± 13.147  ns/op
StringBenchmark.unsafeDecoding  avgt   10  102.774 ± 3.893  ns/op

结论

这个问题被 HotSpot 开发人员选为https://bugs.openjdk.java.net/browse/JDK-8278518。

优化此代码最终使解码上述 Latin1 字符串的速度提高了 2.5 倍。

这种 C2 优化缩小了以下基准之间令人难以置信的超过7 倍的差距，commonBranchFirst并将commonBranchSecond在 Java 19 中实现。

Benchmark                         Mode  Cnt     Score    Error  Units
LoopBenchmark.commonBranchFirst   avgt   25  1737.111 ± 56.526  ns/op
LoopBenchmark.commonBranchSecond  avgt   25   232.798 ± 12.676  ns/op

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class LoopBenchmark {

  private final boolean[] mostlyTrue = new boolean[1000];

  @Setup
  public void setup() {
    for (int i = 0; i < mostlyTrue.length; i++) {
      mostlyTrue[i] = i % 100 > 0;
    }
  }

  @Benchmark
  public int commonBranchFirst() {
    int i = 0;
    while (i < mostlyTrue.length) {
      if (mostlyTrue[i]) {
        i++;
      } else {
        i += 2;
      }
    }
    return i;
  }

  @Benchmark
  public int commonBranchSecond() {
    int i = 0;
    while (i < mostlyTrue.length) {
      if (!mostlyTrue[i]) {
        i += 2;
      } else {
        i++;
      }
    }
    return i;
  }
}

score 3 · Accepted Answer

为了测量您感兴趣的分支，特别是while循环变热时的场景，我使用了以下基准：

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class StringConstructorBenchmark {
  private byte[] array;

  @Setup
  public void setup() {
    String str = "Quizdeltagerne spiste jordbær med fløde, mens cirkusklovnen. Я";
    array = str.getBytes(StandardCharsets.UTF_8);
  }

  @Benchmark
  public String newString()  {
      return new String(array, 0, array.length, StandardCharsets.UTF_8);
  }
}

事实上，使用修改后的构造函数确实会带来显着的改进：

//baseline
Benchmark                             Mode  Cnt    Score   Error  Units
StringConstructorBenchmark.newString  avgt   50  173,092 ± 3,048  ns/op

//patched
Benchmark                             Mode  Cnt    Score   Error  Units
StringConstructorBenchmark.newString  avgt   50  126,908 ± 2,355  ns/op

这可能是一个热点问题：由于某种原因优化编译器未能消除while-loop 中的数组边界检查。我猜原因是offset在循环中修改了：

while (offset < sl) {
    int b1 = bytes[offset];
    if (b1 >= 0) {
        dst[dp++] = (byte)b1;
        offset++;                     // <---
        continue;
    }
    if ((b1 == (byte)0xc2 || b1 == (byte)0xc3) &&
            offset + 1 < sl) {
        int b2 = bytes[offset + 1];
        if (!isNotContinuation(b2)) {
            dst[dp++] = (byte)decode2(b1, b2);
            offset += 2;
            continue;
        }
    }
    // anything not a latin1, including the repl
    // we have to go with the utf16
    break;
}

我还通过查看代码LinuxPerfAsmProfiler，这里是基线链接https://gist.github.com/stsypanov/d2524f98477d633fb1d4a2510fedeea6和这个是修补构造函数https://gist.github.com/stsypanov/16c787e4f9fa3dd122522f16331b68b7

应该注意什么？让我们找到对应的代码int b1 = bytes[offset];（第 538 行）。在基线中，我们有这个：

  3.62%           ││            │  0x00007fed70eb4c1c:   mov    %ebx,%ecx
  2.29%           ││            │  0x00007fed70eb4c1e:   mov    %edx,%r9d
  2.22%           ││            │  0x00007fed70eb4c21:   mov    (%rsp),%r8                   ;*iload_2 {reexecute=0 rethrow=0 return_oop=0}
                  ││            │                                                            ; - java.lang.String::&lt;init&gt;@107 (line 537)
  2.32%           ↘│            │  0x00007fed70eb4c25:   cmp    %r13d,%ecx
                   │            │  0x00007fed70eb4c28:   jge    0x00007fed70eb5388           ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
                   │            │                                                            ; - java.lang.String::&lt;init&gt;@110 (line 537)
  3.05%            │            │  0x00007fed70eb4c2e:   cmp    0x8(%rsp),%ecx
                   │            │  0x00007fed70eb4c32:   jae    0x00007fed70eb5319
  2.38%            │            │  0x00007fed70eb4c38:   mov    %r8,(%rsp)
  2.64%            │            │  0x00007fed70eb4c3c:   movslq %ecx,%r8
  2.46%            │            │  0x00007fed70eb4c3f:   mov    %rax,%rbx
  3.44%            │            │  0x00007fed70eb4c42:   sub    %r8,%rbx
  2.62%            │            │  0x00007fed70eb4c45:   add    $0x1,%rbx
  2.64%            │            │  0x00007fed70eb4c49:   and    $0xfffffffffffffffe,%rbx
  2.30%            │            │  0x00007fed70eb4c4d:   mov    %ebx,%r8d
  3.08%            │            │  0x00007fed70eb4c50:   add    %ecx,%r8d
  2.55%            │            │  0x00007fed70eb4c53:   movslq %r8d,%r8
  2.45%            │            │  0x00007fed70eb4c56:   add    $0xfffffffffffffffe,%r8
  2.13%            │            │  0x00007fed70eb4c5a:   cmp    (%rsp),%r8
                   │            │  0x00007fed70eb4c5e:   jae    0x00007fed70eb5319
  3.36%            │            │  0x00007fed70eb4c64:   mov    %ecx,%edi                    ;*aload_1 {reexecute=0 rethrow=0 return_oop=0}
                   │            │                                                            ; - java.lang.String::&lt;init&gt;@113 (line 538)
  2.86%            │           ↗│  0x00007fed70eb4c66:   movsbl 0x10(%r14,%rdi,1),%r8d       ;*baload {reexecute=0 rethrow=0 return_oop=0}
                   │           ││                                                            ; - java.lang.String::&lt;init&gt;@115 (line 538)
  2.48%            │           ││  0x00007fed70eb4c6c:   mov    %r9d,%edx
  2.26%            │           ││  0x00007fed70eb4c6f:   inc    %edx                         ;*iinc {reexecute=0 rethrow=0 return_oop=0}
                   │           ││                                                            ; - java.lang.String::&lt;init&gt;@127 (line 540)
  3.28%            │           ││  0x00007fed70eb4c71:   mov    %edi,%ebx
  2.44%            │           ││  0x00007fed70eb4c73:   inc    %ebx                         ;*iinc {reexecute=0 rethrow=0 return_oop=0}
                   │           ││                                                            ; - java.lang.String::&lt;init&gt;@134 (line 541)
  2.35%            │           ││  0x00007fed70eb4c75:   test   %r8d,%r8d
                   ╰           ││  0x00007fed70eb4c78:   jge    0x00007fed70eb4c04           ;*iflt {reexecute=0 rethrow=0 return_oop=0}
                               ││                                                            ; - java.lang.String::&lt;init&gt;@120 (line 539)

在补丁代码中，相应的部分是

 17.28%           ││  0x00007f6b88eb6061:   mov    %edx,%r10d                   ;*iload_2 {reexecute=0 rethrow=0 return_oop=0}
                  ││                                                            ; - java.lang.String::&lt;init&gt;@107 (line 537)
  0.11%           ↘│  0x00007f6b88eb6064:   test   %r10d,%r10d
                   │  0x00007f6b88eb6067:   jl     0x00007f6b88eb669c           ;*iflt {reexecute=0 rethrow=0 return_oop=0}
                   │                                                            ; - java.lang.String::&lt;init&gt;@108 (line 537)
  0.39%            │  0x00007f6b88eb606d:   cmp    %r13d,%r10d
                   │  0x00007f6b88eb6070:   jge    0x00007f6b88eb66d0           ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
                   │                                                            ; - java.lang.String::&lt;init&gt;@114 (line 537)
  0.66%            │  0x00007f6b88eb6076:   mov    %ebx,%r9d
 13.70%            │  0x00007f6b88eb6079:   cmp    0x8(%rsp),%r10d
  0.01%            │  0x00007f6b88eb607e:   jae    0x00007f6b88eb6671
  0.14%            │  0x00007f6b88eb6084:   movsbl 0x10(%r14,%r10,1),%edi       ;*baload {reexecute=0 rethrow=0 return_oop=0}
                   │                                                            ; - java.lang.String::&lt;init&gt;@119 (line 538)
  0.37%            │  0x00007f6b88eb608a:   mov    %r9d,%ebx
  0.99%            │  0x00007f6b88eb608d:   inc    %ebx                         ;*iinc {reexecute=0 rethrow=0 return_oop=0}
                   │                                                            ; - java.lang.String::&lt;init&gt;@131 (line 540)
 12.88%            │  0x00007f6b88eb608f:   movslq %r9d,%rsi                    ;*bastore {reexecute=0 rethrow=0 return_oop=0}
                   │                                                            ; - java.lang.String::&lt;init&gt;@196 (line 548)
  0.17%            │  0x00007f6b88eb6092:   mov    %r10d,%edx
  0.39%            │  0x00007f6b88eb6095:   inc    %edx                         ;*iinc {reexecute=0 rethrow=0 return_oop=0}
                   │                                                            ; - java.lang.String::&lt;init&gt;@138 (line 541)
  0.96%            │  0x00007f6b88eb6097:   test   %edi,%edi
  0.02%            │  0x00007f6b88eb6099:   jl     0x00007f6b88eb60dc           ;*iflt {reexecute=0 rethrow=0 return_oop=0}
                   │                                                            ; - java.lang.String::&lt;init&gt;@124 (line 539)

在基线if_icmpge和aload_1字节码指令之间，我们有边界检查，但在补丁代码中我们没有。

所以你最初的假设是正确的：它是关于缺失边界检查的消除。

UPD我必须更正我的答案：原来边界检查仍然存在：

13.70%            │  0x00007f6b88eb6079:   cmp    0x8(%rsp),%r10d
 0.01%            │  0x00007f6b88eb607e:   jae    0x00007f6b88eb6671

我指出的代码是编译器引入的，但它什么也没做。问题本身仍然与边界检查有关，因为它的显式声明解决了临时问题。

performance - String 构造函数中缺少边界检查消除？

跟进

结论

1 回答 1

Related

Reference