JDK-8150730

Improve performance and reduce variance for AVX-assisted arraycopy stubs

    Details

      Description

      If you have a dumb benchmark like this:

          // JMH harness boilerplate (imports, class wrapper, annotations)
          // added so the snippet compiles stand-alone
          import java.util.Random;
          import java.util.concurrent.TimeUnit;

          import org.openjdk.jmh.annotations.*;

          @BenchmarkMode(Mode.AverageTime)
          @OutputTimeUnit(TimeUnit.NANOSECONDS)
          @State(Scope.Thread)
          public class ArrayCopyBench {

              @Param({"1024"})
              int size;

              byte[] pad;
              long[] source, destination;

              @Setup(Level.Iteration)
              public void setUp() {
                  Random r = new Random(42);

                  pad = new byte[r.nextInt(1024)];
                  source = new long[size];
                  destination = new long[size];

                  for (int i = 0; i < size; ++i) {
                      source[i] = r.nextInt();
                  }

                  // Promote all the arrays
                  System.gc();
              }

              @Benchmark
              public void arraycopy() {
                  System.arraycopy(source, 0, destination, 0, size);
              }
          }

      Run it with JDK 9b107 on an i7-4790K @ 4.0 GHz, Linux x86_64, and you will see that performance fluctuates a lot:

      # Warmup Iteration 1: 351.178 ns/op
      # Warmup Iteration 2: 385.568 ns/op
      # Warmup Iteration 3: 366.771 ns/op
      # Warmup Iteration 4: 341.570 ns/op
      # Warmup Iteration 5: 420.488 ns/op
      Iteration 1: 309.817 ns/op
      Iteration 2: 346.652 ns/op
      Iteration 3: 408.156 ns/op
      Iteration 4: 343.857 ns/op
      Iteration 5: 137.810 ns/op
      Iteration 6: 283.327 ns/op
      Iteration 7: 356.355 ns/op
      Iteration 8: 319.256 ns/op
      Iteration 9: 136.157 ns/op
      Iteration 10: 302.372 ns/op
      Iteration 11: 299.792 ns/op
      Iteration 12: 389.018 ns/op
      Iteration 13: 329.284 ns/op
      Iteration 14: 142.508 ns/op
      Iteration 15: 297.566 ns/op

      Since this runs with the same generated code, which ends up calling jlong_disjoint_arraycopy, and the hottest piece of code is the AVX-assisted copy:

        1.90% 0.69% │ ↗ │││ 0x00007feb44f11a70: vmovdqu -0x38(%rdi,%rdx,8),%ymm0
       36.10% 36.21% │ │ │││ 0x00007feb44f11a76: vmovdqu %ymm0,-0x38(%rcx,%rdx,8)
       10.28% 11.38% │ │ │││ 0x00007feb44f11a7c: vmovdqu -0x18(%rdi,%rdx,8),%ymm1
       29.87% 26.29% │ │ │││ 0x00007feb44f11a82: vmovdqu %ymm1,-0x18(%rcx,%rdx,8)
       15.40% 18.50% ↘ │ │││ 0x00007feb44f11a88: add $0x8,%rdx
                          ╰ │││ 0x00007feb44f11a8c: jle Stub::jlong_disjoint_arraycopy+48 0x00007feb44f11a70

      ...the suspicion obviously falls on data alignment.

      A more complicated experiment that uses JOL to poll the source/destination array addresses clearly shows the correlation between src/dst alignments and the benchmark score:
        http://cr.openjdk.java.net/~shade/8150730/benchmarks.jar
        http://cr.openjdk.java.net/~shade/8150730/ArrayCopyAlign.java

      $ java -jar benchmarks.jar | grep ^Iteration | sort -k 15:
        http://cr.openjdk.java.net/~shade/8150730/scores-sorted.txt

      So it seems "dst = 4, srcBase = 4" is a sweet spot. We need to figure out why, and how to level out performance for the other alignments.

        Issue Links

          Activity

          shade Aleksey Shipilev added a comment (edited)
          Theoretically, there is a problem with aligning both source and destination: think of src at base 0 and dst at base 8 -- no matter how many pre-loop operations you do, either src or dst will end up misaligned at moduli larger than 8. However, a simple experiment that aligns the destination only (it could be demonstrated with source alignment too):
            http://cr.openjdk.java.net/~shade/8150730/poc-align-dest64.patch

          ...trims down the run-to-run variance, and shifts the performance up to the best mode:
            http://cr.openjdk.java.net/~shade/8150730/sorted-align-dest64-minusNewCode.txt
            http://cr.openjdk.java.net/~shade/8150730/sorted-align-dest64-plusNewCode.txt

          There are probably other tricks we can do, but this one already seems to work well. Of course, the chosen alignment value was arbitrarily large, and we would need to follow up on the drawbacks for smaller arrays.
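The "both cannot be aligned" argument above can be sketched with plain arithmetic. The class and helper names below are hypothetical illustrations, not code from the patch: they compute how many 8-byte elements a pre-loop would peel to 32-byte-align the destination, and what alignment the source is left with afterwards.

```java
// Hypothetical sketch: peel count to 32-byte-align dst, and the resulting
// src alignment. Not the actual stub logic, just the arithmetic behind it.
public class AlignBoth {
    // how many longs (8 bytes each) to copy before the vectorized loop
    // so that dst lands on a 32-byte boundary
    static long peelForDst32(long dstAddr) {
        long mis = dstAddr & 31;          // bytes past the 32-byte boundary
        return mis == 0 ? 0 : (32 - mis) / 8;
    }

    // alignment (mod 32) of src after peeling that many longs
    static long srcAlignAfterPeel(long srcAddr, long peeled) {
        return (srcAddr + peeled * 8) & 31;
    }

    public static void main(String[] args) {
        // src at base 0, dst at base 8 (mod 32): after peeling 3 longs,
        // dst is 32-byte aligned, but src now sits at offset 24 -- misaligned.
        long peel = peelForDst32(8);
        System.out.println("peel = " + peel);                            // 3
        System.out.println("src align = " + srcAlignAfterPeel(0, peel)); // 24
    }
}
```

Whatever peel count makes dst aligned leaves src at (src - dst) mod 32, so unless the two bases agree mod 32, one side stays misaligned.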
          shade Aleksey Shipilev added a comment (edited)
          The Intel Optimization Manual seems to say:

          a) "Assembly/Compiler Coding Rule 73. (H impact, M generality) Align data to 32-byte boundary when possible. Prefer store alignment over load alignment." -- the proof-of-concept fix above fits that description. Dropping the alignment from 64 to 32 seems to solve the issue too.

          b) "11.6.2 Consider 16-Byte Memory Access when Memory is Unaligned
          For best results use Intel AVX 32-byte loads and align data to 32-bytes. However, there are cases where you cannot align the data, or data alignment is unknown. This can happen when you are writing a library function and the input data alignment is unknown. In these cases, using 16-Byte memory accesses may be the best alternative."
          shade Aleksey Shipilev added a comment
          A more advanced proof-of-concept that handles aligned-by-8 arraycopies for all types, and does not regress tiny arrays:
           http://cr.openjdk.java.net/~shade/8150730/poc-align-dest32.patch

          However, we should really do something like stubGenerator_sparc does: accurately peel the prefix and suffix, down to single bytes, to align well for destinations that are not aligned by 8.
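The peeling scheme described above can be sketched in pure Java. This is only an editorial illustration, not the stub code: array indices stand in for machine addresses, and System.arraycopy of a 32-byte chunk stands in for one iteration of the vectorized loop.

```java
// Sketch of sparc-style prefix/suffix peeling: copy single bytes until the
// destination index hits a 32-byte boundary, copy the bulk in 32-byte
// chunks, then copy the remaining tail bytes.
import java.util.Arrays;

public class PeeledCopy {
    static void copy(byte[] src, int sp, byte[] dst, int dp, int len) {
        int i = 0;
        // prefix: peel single bytes until (dp + i) is 32-byte aligned
        while (i < len && ((dp + i) & 31) != 0) {
            dst[dp + i] = src[sp + i];
            i++;
        }
        // bulk: whole 32-byte chunks (a vectorized loop in the real stub)
        for (; i + 32 <= len; i += 32) {
            System.arraycopy(src, sp + i, dst, dp + i, 32);
        }
        // suffix: remaining tail bytes
        for (; i < len; i++) {
            dst[dp + i] = src[sp + i];
        }
    }

    public static void main(String[] args) {
        byte[] src = new byte[256], a = new byte[256], b = new byte[256];
        for (int i = 0; i < src.length; i++) src[i] = (byte) i;
        copy(src, 3, a, 7, 200);               // deliberately odd offsets
        System.arraycopy(src, 3, b, 7, 200);   // reference result
        if (!Arrays.equals(a, b)) throw new AssertionError("copies differ");
    }
}
```

Note that only the destination gets aligned; as argued earlier in this issue, the source inherits whatever offset remains.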
          shade Aleksey Shipilev added a comment
          Current set of proof-of-concept patches:
            http://cr.openjdk.java.net/~shade/8150730/poc-bytecopy-dest32-xmm1-ymm7.patch
            http://cr.openjdk.java.net/~shade/8150730/poc-bytecopy-dest32-xmm1-ymm15.patch

          These patches try to do three things:
            a) align destination to 32 bytes to avoid split stores;
            b) do deeper pipelining to alleviate penalties in hot loop;
            c) alleviate aliasing penalties by using 16-byte ops sparingly (this is a surprising result);

          Performance for the 1xmm + 15ymm version is rather good: the worst performance of the patched version matches the average of the baseline:
            http://cr.openjdk.java.net/~shade/8150730/arraycopy-dest32-1xmm-7%2c15ymm.png
            (performance results seem to be periodic with 4K period, only one period is shown)

          Workload used to estimate performance improvements:
            http://cr.openjdk.java.net/~shade/8150730/ArrayCopyByteScan.java
          skuksenko Sergey Kuksenko added a comment
          I'd just like to emphasize the set of issues observed here:
          0) Misalignment issue.
            A small issue: misaligned loads/stores lead to ~10% performance degradation compared with 32-byte aligned loads/stores.
          1) 4K issue.
            Hitting a 4K boundary between source and destination causes ~3x performance degradation.
          2) 1M & LP issue.
            Hitting a 1M boundary between source and destination with Large Pages turned on (whether explicitly, or implicitly via transparent huge pages on Linux) causes ~10x performance degradation.
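The 4K case in the list above stems from 4K aliasing: the store-to-load conflict detection in the CPU compares only the low 12 bits of addresses, so a load whose address matches an earlier store in those bits (but differs above) is conservatively treated as conflicting. A hypothetical predicate for this condition, for illustration only (real hardware may compare a slightly different subset of bits):

```java
// Hypothetical sketch of the 4K-aliasing condition: two distinct addresses
// that agree in bits 0..11 (the 4K page offset) can trigger false
// store-to-load conflicts on Intel cores.
public class Alias4K {
    static boolean mayAlias4K(long loadAddr, long storeAddr) {
        return loadAddr != storeAddr
            && ((loadAddr ^ storeAddr) & 0xFFF) == 0;
    }

    public static void main(String[] args) {
        // same 4K page offset, different pages: potential false conflict
        System.out.println(mayAlias4K(0x10_1000L, 0x20_1000L)); // true
        // different page offsets: no 4K aliasing
        System.out.println(mayAlias4K(0x10_1000L, 0x10_1040L)); // false
    }
}
```

This is why a src/dst distance that is a multiple of 4K makes every load in the copy loop appear to conflict with the preceding stores.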
          shade Aleksey Shipilev added a comment
          Better proof-of-concept patch:
            http://cr.openjdk.java.net/~shade/8150730/poc-bytecopy-dest32-ymm16-v3.patch

          It still tries to do several things:
            a) align destination to 32 bytes to avoid split stores;
            b) do deeper pipelining to alleviate penalties in hot loop, uses the most efficient load/store order;
            c) alleviate aliasing penalties by using 16-byte load once (speculation: my Haswell cannot take *that* many split loads);
            d) (new) handles small arrays efficiently;

          My attempts to further tame split loads proved unsuccessful. Even with dst aligned by 32 and src aligned by 16, individual 16-byte loads + merges are slower than a split 32-byte load, at least on my Haswell. Therefore, I believe we cannot do better than this. Next steps should include extracting the common parts from this new stub and applying them to other stubs (other data types, and backward copy too).

          The latest performance data:
           http://cr.openjdk.java.net/~shade/8150730/arraycopy-dest32-ymm16-v3.png
            (again, performance results seem to be periodic with 4K period, only one period is shown)

          P.S. Somebody, tell Intel their 4K aliasing issue totally blows?
          twisti Christian Thalinger added a comment
          Well, since the bug and your comment are public, you just did.
          lbourges Laurent Bourgès added a comment
          Any progress? It seems very promising!
          shade Aleksey Shipilev added a comment (edited)
          Updated patch for recent jdk/hs:
           http://cr.openjdk.java.net/~shade/8150730/webrev.00/

          Seems like Skylake-X exhibits a more interesting baseline behavior:
            http://cr.openjdk.java.net/~shade/8150730/arraycopy-dest32-ymm16-v3-v3-sklX.png

          Average performance is still better for the patched version, and it is flatter. The 4K aliasing hump is still present.

            People

            • Assignee: shade Aleksey Shipilev
            • Reporter: shade Aleksey Shipilev
            • Votes: 1
            • Watchers: 7