JDK-8148929

Suboptimal code generated when setting sysroot include with Solaris Studio

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: P2
    • Resolution: Fixed
    • Affects Version/s: 9
    • Fix Version/s: 9
    • Component/s: infrastructure
    • Labels:
      None
    • Resolved In Build:
      b105

      Description

      While implementing some performance-sensitive logic in HotSpot I noticed that the performance of the generated (C++) code on Solaris x64 was not as good as it should be. A deeper analysis showed that this is related to a problem with inlining/intrinsifying various memcpy calls. I tracked this down to whether or not the "sysroot" is explicitly added to the include path when compiling the C++ file(s).

      Specifically, here's a small reproducer:

      ----
      #include <stdlib.h>
      #include <stdint.h>
      #include <string.h>

      uint64_t read_unaligned(void* src) {
        uint64_t tmp;

        memcpy(&tmp, src, sizeof(uint64_t));

        return tmp;
      }
      ----

      When compiled without an explicit sysroot include path like so:

      SS12u4-Solaris11u1/SS12u4/bin/CC -m64 -G -xO4 -o libfoo.so unaligned_read.cpp

      The resulting assembly looks like this:

      0000000000000bf0 <__1cOread_unaligned6Fpv_L_>:
       bf0: 55 push %rbp
       bf1: 48 8b ec mov %rsp,%rbp
       bf4: 48 8b 07 mov (%rdi),%rax
       bf7: 48 89 45 f8 mov %rax,-0x8(%rbp)
       bfb: 48 8b 45 f8 mov -0x8(%rbp),%rax
       bff: c9 leaveq
       c00: c3 retq

      That is, the compiler has "inlined" memcpy and is just reading the value using a normal mov.

      However, when the code is compiled *with* an explicit sysroot include like so:

      SS12u4-Solaris11u1/SS12u4/bin/CC -m64 -G -I/opt/jprt/products/P1/SS12u4-Solaris11u1/SS12u4-Solaris11u1/sysroot/usr/include -xO4 -o libfoo.so unaligned_read.cpp

      The resulting code looks like this:

      0000000000000c40 <__1cOread_unaligned6Fpv_L_>:
       c40: 55 push %rbp
       c41: 48 8b ec mov %rsp,%rbp
       c44: 48 83 ec 10 sub $0x10,%rsp
       c48: 48 8b f7 mov %rdi,%rsi
       c4b: 48 8d 45 f8 lea -0x8(%rbp),%rax
       c4f: 48 8b f8 mov %rax,%rdi
       c52: 48 c7 c2 08 00 00 00 mov $0x8,%rdx
       c59: e8 8a ff ff ff callq be8 <memcpy@plt>
       c5e: 48 8b 45 f8 mov -0x8(%rbp),%rax
       c62: c9 leaveq
       c63: c3 retq

      That is, the call to memcpy remains instead of being inlined.

      The performance difference here is significant, especially if the code in question happens to be in a hot loop.
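      For reference, the pattern can be exercised outside the JDK build with a small standalone harness. The sketch below is not part of the original report; the file layout and the little-endian assumption are mine. It performs the same 8-byte memcpy from a deliberately misaligned address, and disassembling the resulting binary shows whether the memcpy call survived or was collapsed into a plain load:

      ```cpp
      #include <cstdint>
      #include <cstring>
      #include <cstdio>

      // Same pattern as the reproducer above: an 8-byte memcpy that a good
      // optimizer should collapse into a single 64-bit mov.
      static uint64_t read_unaligned(const void* src) {
        uint64_t tmp;
        memcpy(&tmp, src, sizeof(uint64_t));
        return tmp;
      }

      int main() {
        unsigned char buf[16];
        for (int i = 0; i < 16; i++) buf[i] = (unsigned char)(i + 1);

        // Read 8 bytes starting at offset 1, which is misaligned for uint64_t.
        uint64_t v = read_unaligned(buf + 1);

        // On a little-endian target such as x64 this prints 908070605040302
        // (bytes 0x02..0x09 assembled into one 64-bit value).
        printf("%llx\n", (unsigned long long)v);
        return 0;
      }
      ```

      Compiling this with and without the explicit sysroot -I flag and comparing the disassembly reproduces the difference shown above.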


            People

            • Assignee:
              erikj Erik Joelsson
            • Reporter:
              mikael Mikael Vidstedt
            • Votes: 0
            • Watchers: 4
