JDK-8150730

Improve performance and reduce variance for AVX-assisted arraycopy stubs

    Details

      Description

      If you have a dumb benchmark like this:

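          import java.util.Random;
          import java.util.concurrent.TimeUnit;
          import org.openjdk.jmh.annotations.*;

          // The class wrapper, its name, and the class-level annotations are assumed;
          // the ns/op scores below imply average-time mode in nanoseconds.
          @State(Scope.Thread)
          @BenchmarkMode(Mode.AverageTime)
          @OutputTimeUnit(TimeUnit.NANOSECONDS)
          public class ArrayCopyBench {

              @Param({"1024"})
              int size;

              // Padding allocated just before the arrays under test
              byte[] pad;
              long[] source, destination;

              @Setup(Level.Iteration)
              public void setUp() {
                  Random r = new Random(42);

                  pad = new byte[r.nextInt(1024)];
                  source = new long[size];
                  destination = new long[size];

                  for (int i = 0; i < size; ++i) {
                      source[i] = r.nextInt();
                  }

                  // Promote all the arrays
                  System.gc();
              }

              @Benchmark
              public void arraycopy() {
                  System.arraycopy(source, 0, destination, 0, size);
              }
          }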

      ...and run it with JDK 9b107 on an i7-4790K @ 4.0 GHz, Linux x86_64, then you will see that performance fluctuates a lot, between roughly 136 and 408 ns/op in the measurement iterations:

      # Warmup Iteration 1: 351.178 ns/op
      # Warmup Iteration 2: 385.568 ns/op
      # Warmup Iteration 3: 366.771 ns/op
      # Warmup Iteration 4: 341.570 ns/op
      # Warmup Iteration 5: 420.488 ns/op
      Iteration 1: 309.817 ns/op
      Iteration 2: 346.652 ns/op
      Iteration 3: 408.156 ns/op
      Iteration 4: 343.857 ns/op
      Iteration 5: 137.810 ns/op
      Iteration 6: 283.327 ns/op
      Iteration 7: 356.355 ns/op
      Iteration 8: 319.256 ns/op
      Iteration 9: 136.157 ns/op
      Iteration 10: 302.372 ns/op
      Iteration 11: 299.792 ns/op
      Iteration 12: 389.018 ns/op
      Iteration 13: 329.284 ns/op
      Iteration 14: 142.508 ns/op
      Iteration 15: 297.566 ns/op
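
      To put a number on "fluctuates a lot", here is a minimal sketch that summarizes the fifteen measurement iterations listed above. The class name IterationStats and the 200 ns/op threshold are illustrative assumptions, not part of the original benchmark; the scores are copied from the run.

```java
import java.util.Arrays;

// Hypothetical helper for summarizing the run; not part of the original benchmark.
class IterationStats {
    // Measurement iterations 1..15 from the run above, in ns/op.
    static final double[] SCORES = {
        309.817, 346.652, 408.156, 343.857, 137.810,
        283.327, 356.355, 319.256, 136.157, 302.372,
        299.792, 389.018, 329.284, 142.508, 297.566
    };

    static double min(double[] xs) { return Arrays.stream(xs).min().getAsDouble(); }
    static double max(double[] xs) { return Arrays.stream(xs).max().getAsDouble(); }
    static double mean(double[] xs) { return Arrays.stream(xs).average().getAsDouble(); }

    // Counts iterations that were faster than the given threshold (ns/op).
    static long countBelow(double[] xs, double threshold) {
        return Arrays.stream(xs).filter(x -> x < threshold).count();
    }

    public static void main(String[] args) {
        System.out.printf("min=%.3f max=%.3f mean=%.3f ns/op%n",
                min(SCORES), max(SCORES), mean(SCORES));
        // 200 ns/op is an assumed cut-off that separates the fast iterations from the rest.
        System.out.printf("max/min=%.2f, iterations under 200 ns/op: %d%n",
                max(SCORES) / min(SCORES), countBelow(SCORES, 200.0));
    }
}
```

      Three of the fifteen iterations (5, 9 and 14) land under 200 ns/op, and the slowest iteration is roughly 3x the fastest, even though every iteration copies the same 1024 longs.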

      Since this runs with the same generated code, which ends up calling jlong_disjoint_arraycopy, and the hottest piece of code is the AVX-assisted copy loop:

        1.90% 0.69% │ ↗ │││ 0x00007feb44f11a70: vmovdqu -0x38(%rdi,%rdx,8),%ymm0
       36.10% 36.21% │ │ │││ 0x00007feb44f11a76: vmovdqu %ymm0,-0x38(%rcx,%rdx,8)
       10.28% 11.38% │ │ │││ 0x00007feb44f11a7c: vmovdqu -0x18(%rdi,%rdx,8),%ymm1
       29.87% 26.29% │ │ │││ 0x00007feb44f11a82: vmovdqu %ymm1,-0x18(%rcx,%rdx,8)
       15.40% 18.50% ↘ │ │││ 0x00007feb44f11a88: add $0x8,%rdx
                          ╰ │││ 0x00007feb44f11a8c: jle Stub::jlong_disjoint_arraycopy+48 0x00007feb44f11a70

      ...the suspicion obviously falls on data alignment.
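
      One possible way alignment could matter for 32-byte vmovdqu accesses is that an access may straddle a cache line boundary. The sketch below is a simplified model under stated assumptions (64-byte cache lines, back-to-back 32-byte accesses starting at the array data, no special handling of the stub's prologue or tail); it is not taken from the HotSpot sources, and nothing in this issue confirms that line splits are the actual cause. The names SplitModel and countSplits are hypothetical.

```java
// Hypothetical model of cache-line-split accesses; not taken from the HotSpot sources.
class SplitModel {
    static final int VECTOR_BYTES = 32;     // width of one ymm load/store
    static final int CACHE_LINE_BYTES = 64; // assumed cache line size

    // Counts the 32-byte accesses at base, base+32, ... within `bytes` bytes
    // that cross a cache line boundary. `base` must be non-negative.
    static int countSplits(long base, int bytes) {
        int splits = 0;
        for (long addr = base; addr + VECTOR_BYTES <= base + bytes; addr += VECTOR_BYTES) {
            long offsetInLine = addr % CACHE_LINE_BYTES;
            if (offsetInLine + VECTOR_BYTES > CACHE_LINE_BYTES) {
                splits++;
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        int bytes = 1024 * Long.BYTES; // size = 1024 longs, as in the benchmark
        // Assuming 8-byte-aligned array data, the start address modulo 32 is 0, 8, 16 or 24.
        for (long misalign = 0; misalign < 32; misalign += 8) {
            System.out.printf("base %% 32 = %2d -> %d of %d accesses split%n",
                    misalign, countSplits(misalign, bytes), bytes / VECTOR_BYTES);
        }
    }
}
```

      In this model a 32-byte-aligned start produces no split accesses, while any other 8-byte-aligned start makes every second access cross a line. The model treats source and destination independently and says nothing about how their alignments interact.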

      A more complicated experiment that uses JOL to poll the source/destination array addresses clearly shows the correlation between src/dst alignments and benchmark score:
        http://cr.openjdk.java.net/~shade/8150730/benchmarks.jar
        http://cr.openjdk.java.net/~shade/8150730/ArrayCopyAlign.java

      $ java -jar benchmarks.jar | grep ^Iteration | sort -k 15:
        http://cr.openjdk.java.net/~shade/8150730/scores-sorted.txt

      So it seems "dst = 4, srcBase = 4" is a sweet spot. We need to figure out why, and how to level out performance for other alignments.

        Issue Links

          Activity

          shade Aleksey Shipilev created issue -
          Field Original Value New Value
          shade Aleksey Shipilev made changes -
          Status | New [ 10000 ] | Open [ 1 ]
          shade Aleksey Shipilev made changes -
          Summary | arraycopy stubs performance is dramatically unstable | AVX-assisted arraycopy stubs' performance is dramatically unstable
          shade Aleksey Shipilev made changes -
          Summary | AVX-assisted arraycopy stubs' performance is dramatically unstable | SSE-assisted unaligned arraycopy stubs' performance is dramatically unstable
          shade Aleksey Shipilev made changes -
          Summary | SSE-assisted unaligned arraycopy stubs' performance is dramatically unstable | AVX-assisted unaligned arraycopy stubs' performance is dramatically unstable
          shade Aleksey Shipilev made changes -
          Summary | AVX-assisted unaligned arraycopy stubs' performance is dramatically unstable | Improve performance and variance for AVX-assisted arraycopy stubs
          shade Aleksey Shipilev made changes -
          Labels | | performance
          kvn Vladimir Kozlov made changes -
          Fix Version/s | | 10 [ 16302 ]
          zmajo Zoltan Majo (Inactive) made changes -
          Labels | performance | c1 c2 community-candidate performance
          zmajo Zoltan Majo (Inactive) made changes -
          Summary | Improve performance and variance for AVX-assisted arraycopy stubs | Improve performance and reduce variance for AVX-assisted arraycopy stubs
          zmajo Zoltan Majo (Inactive) made changes -
          Link | | This issue relates to JDK-8149758 [ JDK-8149758 ]
          zmajo Zoltan Majo (Inactive) made changes -
          Labels | c1 c2 community-candidate performance | c1 c2 community-candidate performance wnf-candidate
          thartmann Tobias Hartmann made changes -
          Affects Version/s | | 10 [ 16302 ]
          zmajo Zoltan Majo (Inactive) made changes -
          Labels | c1 c2 community-candidate performance wnf-candidate | c1 c2 c2-cg community-candidate performance wnf-candidate
          thartmann Tobias Hartmann made changes -
          Labels | c1 c2 c2-cg community-candidate performance wnf-candidate | c1 c2 c2-cg community-candidate performance
          shade Aleksey Shipilev made changes -
          Fix Version/s 11 [ 18723 ]
          Fix Version/s 10 [ 16302 ]

            People

            • Assignee:
              shade Aleksey Shipilev
              Reporter:
              shade Aleksey Shipilev
            • Votes:
              1
              Watchers:
              7
