JDK / JDK-8256488

AArch64: Use ldpq/stpq instead of ld4/st4 for small copies in StubGenerator::copy_memory


    Details

    • Resolved In Build: b27
    • CPU: aarch64


        Description

        Submitted by Evgeny Astigeevich (eastig@amazon.co.uk)

        When UseSIMDForMemoryOps is enabled on Graviton2, arraycopy microbenchmarks regress by 27%-48% for copies of 70-80 bytes. Analysis shows that the problematic code is generated in StubGenerator::copy_memory:

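            if (UseSIMDForMemoryOps) {
              // First 64 bytes: ld4/st4 are structure load/store instructions
              // that de-interleave into, then re-interleave out of, v0-v3.
              __ ld4(v0, v1, v2, v3, __ T16B, Address(s, 0));
              // Last 32 bytes, addressed backwards from the end pointers.
              __ ldpq(v4, v5, Address(send, -32));
              __ st4(v0, v1, v2, v3, __ T16B, Address(d, 0));
              __ stpq(v4, v5, Address(dend, -32));
            } else {
              // ... non-SIMD path not shown ...
            }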
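To illustrate why ld4/st4 are an unusual choice for a plain copy, the following C++ sketch models only their architectural effect on 16-byte lanes (not their timing). LD4 de-interleaves 64 bytes so that lane i of register r holds byte 4*i + r; ST4 applies the inverse. The round trip is the identity, so the copy is correct, but both instructions perform permutation work that a copy never needs. The names Reg, model_ld4, model_st4 and copy64_via_ld4_st4 are hypothetical and do not come from HotSpot.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// One 128-bit vector register viewed as 16 byte lanes (the T16B arrangement).
using Reg = std::array<std::uint8_t, 16>;

// Model of LD4 {v0.16B-v3.16B}, [src]: a de-interleaving structure load.
// Lane i of register r receives byte 4*i + r of the 64-byte source.
void model_ld4(std::array<Reg, 4>& v, const std::uint8_t* src) {
  for (int i = 0; i < 16; ++i)
    for (int r = 0; r < 4; ++r)
      v[r][i] = src[4 * i + r];
}

// Model of ST4 {v0.16B-v3.16B}, [dst]: the inverse, re-interleaving store.
void model_st4(const std::array<Reg, 4>& v, std::uint8_t* dst) {
  for (int i = 0; i < 16; ++i)
    for (int r = 0; r < 4; ++r)
      dst[4 * i + r] = v[r][i];
}

// A 64-byte copy through ld4/st4: the two permutations cancel, so the bytes
// arrive in order, but both steps shuffle data that a copy never needs to touch.
void copy64_via_ld4_st4(std::uint8_t* dst, const std::uint8_t* src) {
  std::array<Reg, 4> v{};
  model_ld4(v, src);
  model_st4(v, dst);
}
```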

        Replacing ld4/st4 with ldpq/stpq fixes the regressions. This follows the recommendation in the Arm optimization guides, including the one for Neoverse N1 (the core used in Graviton2): use discrete, non-writeback forms of load and store instructions while interleaving them.
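The issue text does not include the patched code. Assuming the 64-byte ld4/st4 pair is replaced by two ldpq/stpq pairs at offsets 0 and 32, the branch can be modeled in portable C++ as below. Because the snippet copies the first 64 and the last 32 bytes, it presumably serves copies of 65 to 96 bytes; that range is an inference. QPair, load_pair, store_pair and copy_65_to_96 are hypothetical names, and memcpy stands in for the real ldpq/stpq instructions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Stand-in for a pair of q registers: 32 bytes in two 16-byte halves.
struct QPair {
  std::uint8_t lo[16];
  std::uint8_t hi[16];
};

// Model of ldpq: load 32 contiguous bytes into a register pair.
static QPair load_pair(const std::uint8_t* p) {
  QPair r;
  std::memcpy(r.lo, p, 16);
  std::memcpy(r.hi, p + 16, 16);
  return r;
}

// Model of stpq: store a register pair to 32 contiguous bytes.
static void store_pair(std::uint8_t* p, const QPair& r) {
  std::memcpy(p, r.lo, 16);
  std::memcpy(p + 16, r.hi, 16);
}

// Copy n bytes (65 <= n <= 96) with three discrete, non-writeback paired
// loads followed by three paired stores. When n < 96 the tail pair overlaps
// the first 64 bytes; those bytes are simply written twice with equal values.
// All loads happen before any store, so overlapping buffers are also safe.
void copy_65_to_96(std::uint8_t* d, const std::uint8_t* s, std::size_t n) {
  assert(n >= 65 && n <= 96);
  const std::uint8_t* send = s + n;
  std::uint8_t* dend = d + n;
  QPair v01 = load_pair(s);          // bytes [0, 32)
  QPair v23 = load_pair(s + 32);     // bytes [32, 64)
  QPair v45 = load_pair(send - 32);  // bytes [n - 32, n)
  store_pair(d, v01);
  store_pair(d + 32, v23);
  store_pair(dend - 32, v45);
}
```

Each access is a plain paired load or store with an immediate offset and no base-register writeback, which is the form the guidance above recommends.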


                People

                Assignee: Volker Simonis (simonis)
                Reporter: Volker Simonis (simonis)
                Votes: 0
                Watchers: 4
