Uploaded image for project: 'JDK'
  1. JDK
  2. JDK-8231596

[lworld] ineffective 'aaload' codegeneration

    Details

      Description

      Benchmarks location: http://cr.openjdk.java.net/~skuksenko/valhalla/cleanup/

      Benchmark: "aaload.Params.getCfromC"
      The issue related to isolated "aaload" codegeneration, isolated means - not wrapped in the loop.

      All notices below are done with the following assumptions:
      - klassPrt has two injected bits; ArraysStorageProperties: flattened_bit(value) and null_free_bit(value)
      - Only arrays may have these bits set
      - if flattened_bit is set => null_free_bit is also set. That follows from common sense, if it's not true - please explain the reason.

      Please see my comments inside code prefixed with "SK".

      Options: -XX:-UseCompressedOops

      c2, level 4, aaload.Params::getC, version 475 (183 bytes)
                     [Verified Value Entry Point (RO)]
                       # this: rsi:rsi = 'aaload/Params'
                       # parm0: rdx:rdx = object
                       # parm1: rcx = int
                       # [sp+0x30] (sp of caller)
        0.02% 0x00007fbe041c0e70: mov %eax,-0x14000(%rsp)
        5.53% 0x00007fbe041c0e77: push %rbp
        0.02% 0x00007fbe041c0e78: sub $0x20,%rsp
        4.86% 0x00007fbe041c0e7c: mov %rdx,(%rsp)
        0.38% 0x00007fbe041c0e80: mov 0x10(%rdx),%r10d ; implicit exception: dispatches to 0x00007fbe041c0fb7
        0.06% 0x00007fbe041c0e84: mov 0x8(%rdx),%r11
        0.11% 0x00007fbe041c0e88: shl $0x3,%r11
        5.15% 0x00007fbe041c0e8c: sar $0x3,%r11
      SK: ^^^ load clazz ptr + clear bits is done before range check. Why?
      SK: r11 - contains klassPtr
        0.42% 0x00007fbe041c0e90: mov %ecx,%ebp
                       0x00007fbe041c0e92: cmp %r10d,%ecx
               ╭ 0x00007fbe041c0e95: jae 0x00007fbe041c0f35
      SK: range check done

        0.08% │ 0x00007fbe041c0e9b: mov 0x8(%r11),%r10d
      SK: r11 - contains klassPtr; why we have to do addtional load to check isFlattened?
       11.56% │ 0x00007fbe041c0e9f: sar $0x1d,%r10d
        3.31% │ 0x00007fbe041c0ea3: cmp $0xfffffffd,%r10d
      SK: checking both bits when one is enough
               │╭ 0x00007fbe041c0ea7: jne 0x00007fbe041c0f22
      SK: ^^^ finally checked that array is not flattened and jumped to direct load.
      SK: below is allocation started...
               ││ 0x00007fbe041c0ea9: mov 0xe8(%r11),%rsi
               ││ 0x00007fbe041c0eb0: mov 0x8(%rsi),%r11d
               ││ 0x00007fbe041c0eb4: test $0x1,%r11d
               ││╭ 0x00007fbe041c0ebb: jne 0x00007fbe041c0f29
               │││ 0x00007fbe041c0ebd: movslq %r11d,%r10
               │││ 0x00007fbe041c0ec0: and $0xfffffffffffffff8,%r10
               │││ 0x00007fbe041c0ec4: mov 0x120(%r15),%r12
               │││ 0x00007fbe041c0ecb: mov %r12,%r11
               │││ 0x00007fbe041c0ece: add %r10,%r11
               │││ 0x00007fbe041c0ed1: cmp 0x130(%r15),%r11
               │││╭ 0x00007fbe041c0ed8: jae 0x00007fbe041c0f29
               ││││ 0x00007fbe041c0eda: mov %r11,0x120(%r15)
               ││││ 0x00007fbe041c0ee1: prefetchw 0xc0(%r11)
               ││││ 0x00007fbe041c0ee9: mov 0xb8(%rsi),%r10
               ││││ 0x00007fbe041c0ef0: mov %r10,(%r12)
               ││││ 0x00007fbe041c0ef4: mov %rsi,0x8(%r12)
               ││││ ↗ 0x00007fbe041c0ef9: mov (%rsp),%rdi
               ││││ │ 0x00007fbe041c0efd: mov %ebp,%esi
               ││││ │ 0x00007fbe041c0eff: mov %r12,%rdx
               ││││ │ 0x00007fbe041c0f02: movabs $0x7fbe1ce81420,%r10
               ││││ │ 0x00007fbe041c0f0c: callq *%r10
               ││││ │ 0x00007fbe041c0f0f: mov %r12,%rax ;*synchronization entry
               ││││ │ ; - aaload.Params::getC@-1 (line 35)
        4.63% ││││↗│ 0x00007fbe041c0f12: add $0x20,%rsp
               ││││││ 0x00007fbe041c0f16: pop %rbp
               ││││││ 0x00007fbe041c0f17: mov 0x110(%r15),%r10
               ││││││ 0x00007fbe041c0f1e: test %eax,(%r10) ; {poll_return}
        4.92% ││││││ 0x00007fbe041c0f21: retq
        4.02% │↘││││ 0x00007fbe041c0f22: mov 0x18(%rdx,%rcx,8),%rax
        0.02% │ ││╰│ 0x00007fbe041c0f27: jmp 0x00007fbe041c0f12
               │ ↘↘ │ 0x00007fbe041c0f29: xchg %ax,%ax
               │ │ 0x00007fbe041c0f2b: callq 0x00007fbdfc7c3000 ; ImmutableOopMap {[0]=Oop }
               │ │ ;*aaload {reexecute=0 rethrow=0 return_oop=1 return_vt=0}
               │ │ ; - aaload.Params::getC@2 (line 35)
               │ │ ; {runtime_call _new_instance_Java}
               │ │ 0x00007fbe041c0f30: mov %rax,%r12
               │ ╰ 0x00007fbe041c0f33: jmp 0x00007fbe041c0ef9
               ↘ 0x00007fbe041c0f35: mov $0xffffffe4,%esi
                       0x00007fbe041c0f3a: nop
                       0x00007fbe041c0f3b: callq 0x00007fbdfc716800 ; ImmutableOopMap {[0]=Oop }

      Options: +XX:+UseCompressedOops

      c2, level 4, aaload.Params::getC, version 478 (232 bytes)
                     [Verified Value Entry Point (RO)]
                       # this: rsi:rsi = 'aaload/Params'
                       # parm0: rdx:rdx = object
                       # parm1: rcx = int
                       # [sp+0x30] (sp of caller)
        1.86% 0x00007fb173330a90: mov %eax,-0x14000(%rsp)
        3.77% 0x00007fb173330a97: push %rbp
        1.80% 0x00007fb173330a98: sub $0x20,%rsp
        0.48% 0x00007fb173330a9c: mov %ecx,(%rsp)
      SK: ^^^ the last argument (index) is spilled to stack; I don't see any needs for this

        2.85% 0x00007fb173330a9f: mov %rdx,%rbp
      SK: ^^^ unnecessary moving array ptr (2nd argument) to another register.

        1.76% 0x00007fb173330aa2: mov 0xc(%rdx),%r10d ; implicit exception: dispatches to 0x00007fb173330c07
        0.44% 0x00007fb173330aa6: cmp %r10d,%ecx
               ╭ 0x00007fb173330aa9: jae 0x00007fb173330b85
      SK: range check done
        0.19% │ 0x00007fb173330aaf: mov 0x8(%rdx),%r10d
        3.44% │ 0x00007fb173330ab3: mov %r10d,%r11d
        1.57% │ 0x00007fb173330ab6: and $0x1fffffff,%r11d
        0.31% │ 0x00007fb173330abd: movabs $0x7c0000000,%r10
        0.08% │ 0x00007fb173330ac7: lea (%r10,%r11,8),%r10
        2.89% │ 0x00007fb173330acb: mov 0x8(%r10),%r8d
        4.50% │ 0x00007fb173330acf: sar $0x1d,%r8d
        1.95% │ 0x00007fb173330ad3: cmp $0xfffffffd,%r8d
               │╭ 0x00007fb173330ad7: jne 0x00007fb173330b6c
      SK: ^^ finally checked if array is flattened.

               ││ 0x00007fb173330add: mov 0xe8(%r10),%rsi
               ││ 0x00007fb173330ae4: mov 0x8(%rsi),%r10d
               ││ 0x00007fb173330ae8: test $0x1,%r10d
               ││╭ 0x00007fb173330aef: jne 0x00007fb173330b7a
               │││ 0x00007fb173330af5: movslq %r10d,%r10
               │││ 0x00007fb173330af8: and $0xfffffffffffffff8,%r10
               │││ 0x00007fb173330afc: mov 0x120(%r15),%rbx
               │││ 0x00007fb173330b03: mov %rbx,%r11
               │││ 0x00007fb173330b06: add %r10,%r11
               │││ 0x00007fb173330b09: cmp 0x130(%r15),%r11
               │││╭ 0x00007fb173330b10: jae 0x00007fb173330b7a
               ││││ 0x00007fb173330b12: mov %r11,0x120(%r15)
               ││││ 0x00007fb173330b19: prefetchw 0xc0(%r11)
               ││││ 0x00007fb173330b21: mov 0xb8(%rsi),%r10
               ││││ 0x00007fb173330b28: mov %r10,(%rbx)
               ││││ 0x00007fb173330b2b: movabs $0x7c0000000,%r11
               ││││ 0x00007fb173330b35: neg %r11
               ││││ 0x00007fb173330b38: add %rsi,%r11
               ││││ 0x00007fb173330b3b: shr $0x3,%r11
               ││││ 0x00007fb173330b3f: mov %r11d,0x8(%rbx)
               ││││ ↗ 0x00007fb173330b43: mov %rbp,%rdi
               ││││ │ 0x00007fb173330b46: mov (%rsp),%esi
               ││││ │ 0x00007fb173330b49: mov %rbx,%rdx
               ││││ │ 0x00007fb173330b4c: movabs $0x7fb18943a420,%r10
               ││││ │ 0x00007fb173330b56: callq *%r10
               ││││ │ 0x00007fb173330b59: mov %rbx,%rax ;*synchronization entry
               ││││ │ ; - aaload.Params::getC@-1 (line 35)
        3.16% ││││↗│ 0x00007fb173330b5c: add $0x20,%rsp
        1.51% ││││││ 0x00007fb173330b60: pop %rbp
        0.82% ││││││ 0x00007fb173330b61: mov 0x110(%r15),%r10
        0.19% ││││││ 0x00007fb173330b68: test %eax,(%r10) ; {poll_return}
        2.72% ││││││ 0x00007fb173330b6b: retq
        2.43% │↘││││ 0x00007fb173330b6c: mov 0x10(%rdx,%rcx,4),%r10d
        1.34% │ ││││ 0x00007fb173330b71: mov %r10,%rax
        0.54% │ ││││ 0x00007fb173330b74: shl $0x3,%rax
        0.21% │ ││╰│ 0x00007fb173330b78: jmp 0x00007fb173330b5c
               │ ↘↘ │ 0x00007fb173330b7a: nop
               │ │ 0x00007fb173330b7b: callq 0x00007fb16b932600 ; ImmutableOopMap {rbp=Oop }
               │ │ ;*aaload {reexecute=0 rethrow=0 return_oop=1 return_vt=0}
               │ │ ; - aaload.Params::getC@2 (line 35)
               │ │ ; {runtime_call _new_instance_Java}
               │ │ 0x00007fb173330b80: mov %rax,%rbx
               │ ╰ 0x00007fb173330b83: jmp 0x00007fb173330b43
               ↘ 0x00007fb173330b85: mov $0xffffffe4,%esi
                       0x00007fb173330b8a: nop
                       0x00007fb173330b8b: callq 0x00007fb16b885800 ; ImmutableOopMap {rbp=Oop }

      In both cases C2 code is going to inside metadata and check if arrays is flattened there. At the same moment a single test bit in klassPtr is enough.
      Even C1 generated better code here:

        2.61% ╭ 0x00007f5ec4e0aaec: cmp 0x10(%rdx),%ecx ; implicit exception: dispatches to 0x00007f5ec4e0ab22
               │╭ 0x00007f5ec4e0aaef: jae 0x00007f5ec4e0ab2c
        3.71% ││ 0x00007f5ec4e0aaf5: mov 0x8(%rdx),%rdi
        0.67% ││ 0x00007f5ec4e0aaf9: shr $0x3d,%rdi
        0.27% ││ 0x00007f5ec4e0aafd: test $0x1,%dil
               ││ 0x00007f5ec4e0ab01: jne 0x00007f5ec4e0ab3a
        2.31% ││ 0x00007f5ec4e0ab07: movslq %ecx,%rcx
        1.90% ││ 0x00007f5ec4e0ab0a: mov 0x18(%rdx,%rcx,8),%rsi ;*aaload {reexecute=0 rethrow=0 return_oop=0 return_vt=0}
               ││ ; - aaload.Params::getC@2 (line 35)

      It's not an ideal code (don't need shifts to check the singe bit).
      And as result C1 code works faster than C2 code.

        Attachments

          Activity

            People

            • Assignee:
              roland Roland Westrelin
              Reporter:
              skuksenko Sergey Kuksenko
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated: