JDK-8072070

Interpreter wastes lots of time on stack banging

    Details

      Description

      This is seen in just about any profiled -Xint run. E.g. running a simple JMH benchmark like this:
        http://hg.openjdk.java.net/code-tools/jmh/file/492a000a7aea/jmh-samples/src/main/java/org/openjdk/jmh/samples/JMHSample_08_DeadCode.java

      ...yields this hot region in the method entry with zerolocals:

      ....[Hottest Region 1]..............................................................................
       [0x7f1c55010d47:0x7f1c55010de2] in <stub: method entry point (kind = zerolocals)>

                          0x00007f1c55010d18: cmpq $0x1,0x18(%rax,%rcx,8)
                          0x00007f1c55010d21: je 0x00007f1c55010d3c
                          0x00007f1c55010d23: xor 0x18(%rax,%rcx,8),%rdx
                          0x00007f1c55010d28: test $0xfffffffffffffffc,%rdx
                          0x00007f1c55010d2f: je 0x00007f1c55010d41
                          0x00007f1c55010d31: orq $0x2,0x18(%rax,%rcx,8)
                          0x00007f1c55010d3a: jmp 0x00007f1c55010d41
                          0x00007f1c55010d3c: mov %rdx,0x18(%rax,%rcx,8)
                          0x00007f1c55010d41: sub $0x2,%rcx
                          0x00007f1c55010d45: jns 0x00007f1c55010cd3
        0.20% 0.30% 0x00007f1c55010d47: mov %eax,-0x1000(%rsp)
        0.17% 0.26% 0x00007f1c55010d4e: mov %eax,-0x2000(%rsp)
        0.04% 0.30% 0x00007f1c55010d55: mov %eax,-0x3000(%rsp)
                 0.07% 0x00007f1c55010d5c: mov %eax,-0x4000(%rsp)
        0.41% 1.35% 0x00007f1c55010d63: mov %eax,-0x5000(%rsp)
        0.02% 0.46% 0x00007f1c55010d6a: mov %eax,-0x6000(%rsp)
        0.74% 2.61% 0x00007f1c55010d71: mov %eax,-0x7000(%rsp)
        0.41% 0.89% 0x00007f1c55010d78: mov %eax,-0x8000(%rsp)
        2.80% 5.21% 0x00007f1c55010d7f: mov %eax,-0x9000(%rsp)
        0.22% 0.46% 0x00007f1c55010d86: mov %eax,-0xa000(%rsp)
        4.32% 6.76% 0x00007f1c55010d8d: mov %eax,-0xb000(%rsp)
        1.63% 0.76% 0x00007f1c55010d94: mov %eax,-0xc000(%rsp)
        6.82% 5.56% 0x00007f1c55010d9b: mov %eax,-0xd000(%rsp)
        0.28% 0.24% 0x00007f1c55010da2: mov %eax,-0xe000(%rsp)
        5.25% 2.72% 0x00007f1c55010da9: mov %eax,-0xf000(%rsp)
        0.78% 0.20% 0x00007f1c55010db0: mov %eax,-0x10000(%rsp)
        2.13% 0.37% 0x00007f1c55010db7: mov %eax,-0x11000(%rsp)
        0.52% 0.04% 0x00007f1c55010dbe: mov %eax,-0x12000(%rsp)
        3.76% 0.52% 0x00007f1c55010dc5: mov %eax,-0x13000(%rsp)
        0.87% 0.02% 0x00007f1c55010dcc: mov %eax,-0x14000(%rsp)
        1.91% 0.35% 0x00007f1c55010dd3: movb $0x0,0x295(%r15)
        0.54% 0.13% 0x00007f1c55010ddb: cmpb $0x0,0x168da700(%rip) # 0x00007f1c6b8eb4e2
        0.07% 0x00007f1c55010de2: je 0x00007f1c55010e12
                          0x00007f1c55010de8: mov -0x18(%rbp),%rsi
                          0x00007f1c55010dec: mov %r15,%rdi
                          0x00007f1c55010def: test $0xf,%esp
                          0x00007f1c55010df5: je 0x00007f1c55010e0d
                          0x00007f1c55010dfb: sub $0x8,%rsp
                          0x00007f1c55010dff: callq 0x00007f1c6b3080d0
                          0x00007f1c55010e04: add $0x8,%rsp
                          0x00007f1c55010e08: jmpq 0x00007f1c55010e12
                          0x00007f1c55010e0d: callq 0x00007f1c6b3080d0
        0.20% 0x00007f1c55010e12: movzbl 0x0(%r13),%ebx
      ....................................................................................................
       33.88% 29.57% <total for region 1>

      This seems to be due to AbstractInterpreterGenerator::bang_stack_shadow_pages that does:

      void AbstractInterpreterGenerator::bang_stack_shadow_pages(bool native_call) {
        ...
        // Bang each page in the shadow zone. We can't assume it's been done for
        // an interpreter frame with greater than a page of locals, so each page
        // needs to be checked. Only true for non-native.
        if (UseStackBanging) {
          const int start_page = native_call ? StackShadowPages : 1;
          const int page_size = os::vm_page_size();
          for (int pages = start_page; pages <= StackShadowPages ; pages++) {
            __ bang_stack_with_offset(pages*page_size);
          }
        }
      }
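To make the cost concrete, the loop above can be modelled outside the VM. The following is a minimal sketch, not HotSpot code: `bang_offsets` is a hypothetical helper that returns the offsets the generated stub would store to, and the 4 KiB page size is an assumption inferred from the 0x1000 stride in the disassembly.

```cpp
#include <cassert>
#include <vector>

// Model of the loop in bang_stack_shadow_pages: returns the byte offsets below
// the stack pointer that would each receive one store ("bang").
// The name and signature are illustrative only, not a HotSpot API.
std::vector<long> bang_offsets(bool native_call, int stack_shadow_pages, int page_size) {
    assert(stack_shadow_pages >= 1 && page_size > 0);
    std::vector<long> offsets;
    // Native entries bang only the last shadow page; Java entries bang all of them.
    const int start_page = native_call ? stack_shadow_pages : 1;
    for (int pages = start_page; pages <= stack_shadow_pages; pages++) {
        offsets.push_back(static_cast<long>(pages) * page_size);
    }
    return offsets;
}
```

With StackShadowPages=20 and 4 KiB pages, the non-native case produces 20 offsets from 0x1000 to 0x14000, which matches the 20 `mov %eax,-0x...(%rsp)` stores in the hot region, while the native case produces a single store.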

      A quick experiment with lowering StackShadowPages yields a 1.7x-2.0x performance improvement, both on microbenchmarks and on something more heavyweight, like Octane/Box2D:

      -Xint -XX:StackShadowPages=20 (default)
       Box2D.test: 25063.784 ± 961.405 ms
       JMHSample_08_DeadCode.measureRight: 123.027 ± 0.377 ns/op

      -Xint -XX:StackShadowPages=1
       Box2D.test: 14804.336 ± 789.289 ms
       JMHSample_08_DeadCode.measureRight: 62.221 ± 0.234 ns/op
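The improvement factors can be re-derived from the four scores above. A small sketch (the helper name `speedup` is made up for illustration; both scores are "lower is better"):

```cpp
#include <cassert>

// Speedup factor for a "lower is better" score: baseline time / improved time.
double speedup(double baseline, double improved) {
    assert(improved > 0.0);
    return baseline / improved;
}
```

Applied to the numbers above, `speedup(25063.784, 14804.336)` is roughly 1.69 for Box2D and `speedup(123.027, 62.221)` is roughly 1.98 for the JMH sample.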

      With this in mind, and recognizing that interpreter performance can affect warmup and time-to-performance,
      it seems we might want to look for more efficient stack banging in the interpreter. For example, does it make sense to
      bang the stack for a frame with just one page worth of locals? Or, should we really re-bang the stack on each method
      entry, even if we have consumed less than a page of the stack?

      Update: I suddenly realised that in the absence of tiered compilation (C1 compiler, actually) to save us from interpreter
      performance, we may observe the effect on warmup. For example, the same Box2D test, 100 forks, 1 invocation of the
      test yields:

      -XX:-TieredCompilation:
      Box2D.test ss 100 4797.634 ± 120.663 ms

      -XX:-TieredCompilation -XX:StackShadowPages=1:
      Box2D.test ss 100 4491.700 ± 119.548 ms
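One way to sanity-check that this warmup difference is more than noise is to see whether the two reported intervals overlap. The sketch below treats each ± value as an interval half-width, which is an assumption about how the scores were reported; `Measurement` and `intervals_overlap` are illustrative names.

```cpp
#include <cassert>

// A reported score with its +/- error, treated as the interval [mean - error, mean + error].
struct Measurement {
    double mean;
    double error;
};

// True when the two intervals share at least one point.
bool intervals_overlap(const Measurement& a, const Measurement& b) {
    assert(a.error >= 0.0 && b.error >= 0.0);
    return (a.mean - a.error) <= (b.mean + b.error) &&
           (b.mean - b.error) <= (a.mean + a.error);
}
```

For the two single-shot results above, the intervals do not overlap: the lower bound of the default run (about 4677 ms) is above the upper bound of the StackShadowPages=1 run (about 4611 ms).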

          Activity

          fparain Frederic Parain added a comment -
          This bug could be coalesced with JDK-8069196 in order to have an interpreter with a correct and efficient stack banging code.
          shade Aleksey Shipilev added a comment -
          I agree with Frederic's comment above.
          coleenp Coleen Phillimore added a comment -
          Yes we have to bang the stack for < 1 page of locals.
          iklam Ioi Lam added a comment -
          How about doing the stack banging only if we have just crossed a page boundary on the stack?

            // Stack grows towards lower address
            eax = rsp % PAGE_SIZE
            if (eax < PAGE_SIZE - PAGE_SIZE / 16) {
                 jmp done_banging
            }
            // RSP is now at the top 1/16 part of a page in the stack. Let's assume
            // that we just crossed a page boundary on the stack. Bang it.
            0.20% 0.30% 0x00007f1c55010d47: mov %eax,-0x1000(%rsp)
            0.17% 0.26% 0x00007f1c55010d4e: mov %eax,-0x2000(%rsp)
            0.04% 0.30% 0x00007f1c55010d55: mov %eax,-0x3000(%rsp)
                  0.07% 0x00007f1c55010d5c: mov %eax,-0x4000(%rsp)
            0.41% 1.35% 0x00007f1c55010d63: mov %eax,-0x5000(%rsp)
            0.02% 0.46% 0x00007f1c55010d6a: mov %eax,-0x6000(%rsp)
            0.74% 2.61% 0x00007f1c55010d71: mov %eax,-0x7000(%rsp)
            0.41% 0.89% 0x00007f1c55010d78: mov %eax,-0x8000(%rsp)
            2.80% 5.21% 0x00007f1c55010d7f: mov %eax,-0x9000(%rsp)
            0.22% 0.46% 0x00007f1c55010d86: mov %eax,-0xa000(%rsp)
            4.32% 6.76% 0x00007f1c55010d8d: mov %eax,-0xb000(%rsp)
            1.63% 0.76% 0x00007f1c55010d94: mov %eax,-0xc000(%rsp)
            6.82% 5.56% 0x00007f1c55010d9b: mov %eax,-0xd000(%rsp)
            0.28% 0.24% 0x00007f1c55010da2: mov %eax,-0xe000(%rsp)
            5.25% 2.72% 0x00007f1c55010da9: mov %eax,-0xf000(%rsp)
            0.78% 0.20% 0x00007f1c55010db0: mov %eax,-0x10000(%rsp)
            2.13% 0.37% 0x00007f1c55010db7: mov %eax,-0x11000(%rsp)
            0.52% 0.04% 0x00007f1c55010dbe: mov %eax,-0x12000(%rsp)
            3.76% 0.52% 0x00007f1c55010dc5: mov %eax,-0x13000(%rsp)
            0.87% 0.02% 0x00007f1c55010dcc: mov %eax,-0x14000(%rsp)
          done_banging:
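The guard condition in the pseudocode above can be written as a plain predicate. This is only a sketch of the test itself, with a made-up function name, and it assumes a power-of-two page size so the modulo becomes a mask. Note that a frame larger than 1/16 of a page can step over the window entirely, so the predicate alone does not establish that the heuristic is safe.

```cpp
#include <cassert>
#include <cstdint>

// Returns true when rsp lies in the top 1/16 of its page, i.e. the case in
// which the pseudocode falls through to the banging stores instead of
// jumping to done_banging. The stack grows towards lower addresses.
bool in_top_sixteenth_of_page(std::uintptr_t rsp, std::uintptr_t page_size) {
    // Assumption: page_size is a power of two and at least 16 bytes.
    assert(page_size >= 16 && (page_size & (page_size - 1)) == 0);
    const std::uintptr_t offset_in_page = rsp & (page_size - 1);  // rsp % page_size
    return offset_in_page >= page_size - page_size / 16;
}
```

With 4 KiB pages the window is the last 256 bytes of each page, so an rsp whose low 12 bits are 0xF00 or higher triggers the bang and anything lower skips it.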
          shade Aleksey Shipilev added a comment -
          I realized this may be observed with -XX:-TieredCompilation and a compiler-heavy test like Box2D; see the update in the description.
          coleenp Coleen Phillimore added a comment - - edited
          I'm trying to remember why, at the interpreter entry point, we can't assume that the stack has already been banged for each page in framesize+StackShadowPages. With compiled code we can assume that, and only do one stack bang for the lowest page. If the stack isn't banged for each page down the stack, we could skip into the next stack or into other allocated memory that is not on our stack. That said, with N being the sum of Yellow+Red pages, we could maybe just bang every N pages.

          We have to bang at least one page for framesize < pagesize because at some point we will cross into the next page which may not be stack banged already. Stack banging causes the signal at the point where we can detect and throw StackOverflowError.

          Since TieredCompilation is the default, I think correctness greatly overrules performance here. The stack overflow handling code has a lot of pieces that fit together so that we always detect stack overflow rather than crash, so we need to make sure that any optimizations don't break the design (which admittedly isn't documented anywhere).
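The "bang every N pages" idea floated above can be sketched as follows. This is a hypothetical model, not a proposed patch: `strided_bang_pages` is a made-up name, `guard_pages` stands for N (the sum of Yellow+Red pages), and whether such a stride preserves the overflow-detection design is exactly the open question in this comment.

```cpp
#include <cassert>
#include <vector>

// Returns the page indices (1 = first page below rsp) that would be banged
// when stepping by guard_pages instead of by one page. The deepest shadow
// page is always included, as in the existing loop.
std::vector<int> strided_bang_pages(int stack_shadow_pages, int guard_pages) {
    assert(stack_shadow_pages >= 1 && guard_pages >= 1);
    std::vector<int> pages;
    for (int page = guard_pages; page < stack_shadow_pages; page += guard_pages) {
        pages.push_back(page);
    }
    pages.push_back(stack_shadow_pages);  // always bang the deepest page
    return pages;
}
```

For example, with 20 shadow pages and an illustrative N of 3 (a value chosen for the example, not taken from this thread), the stub would bang pages 3, 6, 9, 12, 15, 18 and 20: 7 stores instead of 20. Any run of N consecutive pages within the shadow zone still contains at least one banged page.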
          coleenp Coleen Phillimore added a comment -
          From review comments JDK-8146410

          On 1/5/16 1:01 PM, Andrew Haley wrote:
          > On 01/05/2016 04:33 PM, Lindenmaier, Goetz wrote:
          >
          >> If you are concerned about the TLB pollution, you can
          >> load thread->_stack_overflow_limit and compare against that.
          >> If you are past that limit, you just touch a yellow page to get the
          >> SIGSEGV for the stack overflow.
          > Very nice!

          Yes, that is nice and would be fewer instructions overall.
          >
          >> You touch the thread nearby anyways, so that page should be
          >> in the TLB.
          >>
          >> (There is Thread::stack_overflow_limit_offset()).
          > That has to be far better than what we do today.
          >
          > Unfortunately, the patch we're discussing removes the locally-defined
          > override in which such a change could be made.
          >
          > Coleen, is it actually necessary to remove a cpu-specific override
          > for this code? Banging all these pages blows away 20% of the L1 TLB
          > entries on a Cortex-A57 and 90% (!) of them on a Cortex-A53. (At
          > least, this is true for 4kbyte pages; with 64k pages it's less of an
          > issue.)

          Okay, I'll have to copy the function into the other CPU implementations, but that does leave room for changing them so that we don't have to bang all of the pages in the stack. (The reason for banging every page was that we didn't know where the top/bottom of the stack was to compare against, so we had to bang the stack incrementally, page by page.)
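The limit-compare approach suggested in the quoted email can be modelled roughly like this. It is a sketch under stated assumptions: `ThreadModel` is a stand-in struct, not HotSpot's Thread class, and treating "past the limit" as "strictly below the limit address" is an assumption based on the stack growing downwards.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the per-thread data; the field mirrors the
// _stack_overflow_limit name quoted in the email.
struct ThreadModel {
    std::uintptr_t stack_overflow_limit;  // lowest address the stack pointer may reach
};

// Returns true when sp has moved past (below) the limit. In that case the
// proposal is to touch a yellow page once to raise the SIGSEGV, instead of
// storing to every shadow page on every method entry.
bool past_overflow_limit(const ThreadModel& thread, std::uintptr_t sp) {
    return sp < thread.stack_overflow_limit;
}
```

The common path is then one load and one compare, with a single store only in the overflow case.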

            People

            • Assignee:
              Unassigned
              Reporter:
              shade Aleksey Shipilev