JDK-8048556

Unnecessary GCLocker-initiated young GCs

    Details

    • Subcomponent:
      gc

      Description

      We are seeing very strange behavior in some GC logs (and sometimes this strange behavior happens quite frequently). The common pattern is that a GCLocker-initiated young GC happens immediately after another young GC. The GCLocker-initiated one seems completely unnecessary given that the eden is either empty or has very low occupancy (<10%).

      Here's an example of this (with ParNew):

      {Heap before GC invocations=2 (full 0):
       par new generation total 471872K, used 433003K [0x00000007bae00000, 0x00000007dae00000, 0x00000007dae00000)
        eden space 419456K, 100% used [0x00000007bae00000, 0x00000007d47a0000, 0x00000007d47a0000)
        from space 52416K, 25% used [0x00000007d47a0000, 0x00000007d54dacb0, 0x00000007d7ad0000)
        to space 52416K, 0% used [0x00000007d7ad0000, 0x00000007d7ad0000, 0x00000007dae00000)
       tenured generation total 524288K, used 0K [0x00000007dae00000, 0x00000007fae00000, 0x00000007fae00000)
         the space 524288K, 0% used [0x00000007dae00000, 0x00000007dae00000, 0x00000007dae00200, 0x00000007fae00000)
       compacting perm gen total 21248K, used 2549K [0x00000007fae00000, 0x00000007fc2c0000, 0x0000000800000000)
         the space 21248K, 12% used [0x00000007fae00000, 0x00000007fb07d7a0, 0x00000007fb07d800, 0x00000007fc2c0000)
      No shared spaces configured.
      1.119: [GC (Allocation Failure)[ParNew: 433003K->15843K(471872K), 0.0103090 secs] 433003K->15843K(996160K), 0.0103320 secs] [Times: user=0.03 sys=0.00, real=0.01 secs]
      Heap after GC invocations=3 (full 0):
       par new generation total 471872K, used 15843K [0x00000007bae00000, 0x00000007dae00000, 0x00000007dae00000)
        eden space 419456K, 0% used [0x00000007bae00000, 0x00000007bae00000, 0x00000007d47a0000)
        from space 52416K, 30% used [0x00000007d7ad0000, 0x00000007d8a48c88, 0x00000007dae00000)
        to space 52416K, 0% used [0x00000007d47a0000, 0x00000007d47a0000, 0x00000007d7ad0000)
       tenured generation total 524288K, used 0K [0x00000007dae00000, 0x00000007fae00000, 0x00000007fae00000)
         the space 524288K, 0% used [0x00000007dae00000, 0x00000007dae00000, 0x00000007dae00200, 0x00000007fae00000)
       compacting perm gen total 21248K, used 2549K [0x00000007fae00000, 0x00000007fc2c0000, 0x0000000800000000)
         the space 21248K, 12% used [0x00000007fae00000, 0x00000007fb07d7a0, 0x00000007fb07d800, 0x00000007fc2c0000)
      No shared spaces configured.
      }
      {Heap before GC invocations=3 (full 0):
       par new generation total 471872K, used 24002K [0x00000007bae00000, 0x00000007dae00000, 0x00000007dae00000)
        eden space 419456K, 1% used [0x00000007bae00000, 0x00000007bb5f7c50, 0x00000007d47a0000)
        from space 52416K, 30% used [0x00000007d7ad0000, 0x00000007d8a48c88, 0x00000007dae00000)
        to space 52416K, 0% used [0x00000007d47a0000, 0x00000007d47a0000, 0x00000007d7ad0000)
       tenured generation total 524288K, used 0K [0x00000007dae00000, 0x00000007fae00000, 0x00000007fae00000)
         the space 524288K, 0% used [0x00000007dae00000, 0x00000007dae00000, 0x00000007dae00200, 0x00000007fae00000)
       compacting perm gen total 21248K, used 2549K [0x00000007fae00000, 0x00000007fc2c0000, 0x0000000800000000)
         the space 21248K, 12% used [0x00000007fae00000, 0x00000007fb07d7a0, 0x00000007fb07d800, 0x00000007fc2c0000)
      No shared spaces configured.
      1.130: [GC (GCLocker Initiated GC)[ParNew: 24002K->12748K(471872K), 0.0123930 secs] 24002K->12748K(996160K), 0.0124130 secs] [Times: user=0.04 sys=0.01, real=0.01 secs]
      Heap after GC invocations=4 (full 0):
       par new generation total 471872K, used 12748K [0x00000007bae00000, 0x00000007dae00000, 0x00000007dae00000)
        eden space 419456K, 0% used [0x00000007bae00000, 0x00000007bae00000, 0x00000007d47a0000)
        from space 52416K, 24% used [0x00000007d47a0000, 0x00000007d5413320, 0x00000007d7ad0000)
        to space 52416K, 0% used [0x00000007d7ad0000, 0x00000007d7ad0000, 0x00000007dae00000)
       tenured generation total 524288K, used 0K [0x00000007dae00000, 0x00000007fae00000, 0x00000007fae00000)
         the space 524288K, 0% used [0x00000007dae00000, 0x00000007dae00000, 0x00000007dae00200, 0x00000007fae00000)
       compacting perm gen total 21248K, used 2549K [0x00000007fae00000, 0x00000007fc2c0000, 0x0000000800000000)
         the space 21248K, 12% used [0x00000007fae00000, 0x00000007fb07d7a0, 0x00000007fb07d800, 0x00000007fc2c0000)
      No shared spaces configured.
      }

      Notice that:

      * The timestamp of the second GC (1.130) is almost equal to the timestamp of the first GC plus the duration of the first GC (1.119 + 0.0103320 = 1.1293). In this test young GCs normally happen at a frequency of one every 100ms-110ms or so.
      * The eden at the start of the second GC is almost empty (1% occupancy). We've also seen it very often with a completely empty eden.
      * (the big hint) The second GC is GCLocker-initiated.

      This happens most often with ParNew (in some cases, more than 30% of the GCs are these unnecessary ones), but it also happens with ParallelGC, although less frequently (maybe 1%-1.5% of the GCs). I was unable to reproduce it with G1.

      I can reproduce it with the latest JDK 7, JDK 8, and also the latest hotspot-gc/hotspot workspace. I haven't tried JDK 6.
        Attachments

      1. JNICriticalStressTest.c (2 kB, Tony Printezis)
      2. JNICriticalStressTest.java (4 kB, Tony Printezis)
      3. JNICriticalStressTest2.java (3 kB, Tony Printezis)

        Issue Links

          Activity

          tonyp Tony Printezis added a comment - - edited
          On top of what I mentioned earlier, there's a second cause for this issue. A lot of the time the GC VM op for the GCLocker-initiated GC does have the correct GC count. However, at least for the GCH, that GC VM op is also tagged as "full" (why?!?!?!?!). Here's the code for skip_operation:

            bool skip = (_gc_count_before != Universe::heap()->total_collections());          // POINT_1
            if (_full && skip) {
              skip = (_full_gc_count_before != Universe::heap()->total_full_collections());   // POINT_2
            }

          skip is indeed set to true at POINT_1. However, because _full is true and the GC before this op was not a full GC and didn't update the total full collection count, POINT_2 turns skip back to false. This actually explains why it happens so often in the case of ParNew (even when the GC VM op has the right GC counts, skip_operation() is doing the wrong thing).

          *SIGH*
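
          To make this concrete, here is a small standalone model of that logic (a Java sketch, not VM code; the variable names and counts below are stand-ins for the HotSpot counters):

            // Toy model of the skip_operation() logic quoted above (a sketch, not HotSpot code).
            // Scenario: a young GC has just finished. It bumped total_collections but, being a
            // young GC, it left total_full_collections untouched.
            public class SkipOperationModel {
                public static void main(String[] args) {
                    int gcCountBefore        = 2;    // captured when the GCLocker op was scheduled
                    int fullGcCountBefore    = 0;
                    boolean full             = true; // the op is (incorrectly) tagged as full

                    int totalCollections     = 3;    // bumped by the young GC that just ran
                    int totalFullCollections = 0;    // unchanged: it was not a full GC

                    boolean skip = (gcCountBefore != totalCollections);     // POINT_1: true
                    if (full && skip) {
                        skip = (fullGcCountBefore != totalFullCollections); // POINT_2: back to false
                    }

                    // Prints "skip = false": the stale op is not skipped, so a second,
                    // unnecessary young GC runs on an almost empty eden.
                    System.out.println("skip = " + skip);
                }
            }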

          FWIW, out of curiosity I changed skip_operation() to:

            bool skip = (_gc_count_before != Universe::heap()->total_collections());
            if (_full && skip && _gc_cause != GCCause::_gc_locker) {
              skip = (_full_gc_count_before != Universe::heap()->total_full_collections());
            }

          and the number of occurrences of the issue dropped from over 30% to under 5%. The remaining instances are caused by the race I mentioned earlier. The point at which the safepoint sneaks in while jni_unlock() is running seems to be when the collect() method tries to take the Heap_lock in order to read the before / after GC counts. Because there's a safepoint in progress, the lock operation waits for it, and by the time the lock is taken and the GC counts are read, the GC has already happened and the counts are up-to-date, so the op is not skipped. (I hope this makes sense!!!)
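
          As a rough illustration of that ordering (again only a sketch, not VM code), the problem is simply that the "before" count is read after the competing young GC has already bumped it:

            // Toy model of the race described above (a sketch, not HotSpot code).
            // Shown sequentially for clarity; only the ordering matters.
            public class GCLockerRaceModel {
                public static void main(String[] args) {
                    int totalCollections = 7;   // stand-in for total_collections()

                    // 1. The last thread leaves its JNI critical section; a GC is needed.
                    // 2. Before the locker path can record the GC counts, an ordinary
                    //    allocation-failure young GC sneaks in and completes:
                    totalCollections++;

                    // 3. Only now does the locker path acquire the Heap_lock and capture its
                    //    "before" count -- it already sees the post-GC value:
                    int gcCountBefore = totalCollections;

                    // 4. When the GCLocker-initiated op finally executes, skip_operation() sees
                    //    equal counts, does not skip, and a second young GC runs on an almost
                    //    empty eden.
                    boolean skip = (gcCountBefore != totalCollections);
                    System.out.println("skip = " + skip);  // prints false
                }
            }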

          Note that both fixes I suggested above still stand, with the only difference that Fix 2 will also need to ensure that the scheduled operation has full == false.

          Any thoughts on what the preferred fix would be for this? Any other ideas?
          ysr Y. Ramakrishna added a comment - - edited
          FWIW, I agree that sending the right gc counts and fixing the vm gc op not to signal "full" would be the right fix. (Your modified fix#2 above.)

          I have the vague feeling that somewhere in the hoary past "full" was used because the requested size was 0, but I can't recollect the details.
          Did you try turning off "full" for the GC locker induced GCs and see if it had any unintended repercussions? (I don't have the code in front of me and haven't looked at it in a while, but I vaguely recall something fishy in the GCH policy code that did some goofy stuff with "full".) I'll try and find some time to switch in the old context to see what that history might have been, but I am sure it would be correctable so we can get rid of the "full" signal here.
          mgerdin Mikael Gerdin (Inactive) added a comment -
          It looks like Ramki's suspicions are correct.
          In general it seems like GCH is confused about the semantics of "full" and "complete".
          The boolean "full" is used in do_collection to force a collection.
          The boolean "complete" is used to signal that all generations will be collected.
          This confusion peaks at the call to increment_total_collections:

          // Increment total number of GC collections (started)
          void increment_total_collections(bool full = false);
          where GCH passes the boolean "complete" as the parameter.

          So the _total_full_collections value is only updated when we collect the old gen, but the VM op treats the "full" flag as meaning an actual full GC.
          tonyp Tony Printezis added a comment -
          Ramki and Mikael,

          Thanks for the comments. I think the source of the confusion is probably that the GCLocker initiates a young GC using collect(cause), which was really intended to do Full GCs for System.gc() (at least AFAIK).

          Ramki, re: "Did you try turning off "full" for the GC locker induced GCs and see if it had any unintended repercussions?": the small change I mentioned above (i.e., only look at the Full GC count if _gc_cause != GCCause::_gc_locker) should have a similar effect. But, yes, I'll expand the change for the GCLocker GC VM op to have full == false. I think full == true for that op is simply incorrect.
          tonyp Tony Printezis added a comment -
          Attached a second version of the test that reproduces this: JNICriticalStressTest2. It doesn't need extra JNI code; instead it uses java.util.zip.Deflater, whose deflate() method uses a JNI critical section. So it's a bit easier to use / run.
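
          For reference, a minimal sketch along those lines (not the attached JNICriticalStressTest2; the thread counts, buffer sizes, and flags below are arbitrary): a few threads hammer Deflater.deflate(), which enters a JNI critical section, while other threads allocate to keep young GCs frequent.

            import java.util.zip.Deflater;

            // Sketch of a Deflater-based stress test (not the attached test; all numbers are
            // arbitrary). Run with e.g. -XX:+UseParNewGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
            // and look for "GCLocker Initiated GC" entries right after ordinary young GCs.
            public class DeflaterCriticalStress {
                static volatile byte[] sink;  // keeps the allocation loop from being optimized away

                public static void main(String[] args) throws Exception {
                    final byte[] input = new byte[64 * 1024];

                    // Threads that repeatedly enter JNI critical sections via deflate().
                    for (int i = 0; i < 4; i++) {
                        Thread t = new Thread(() -> {
                            Deflater deflater = new Deflater();
                            byte[] out = new byte[128 * 1024];
                            while (true) {
                                deflater.reset();
                                deflater.setInput(input);
                                deflater.finish();
                                while (!deflater.finished()) {
                                    deflater.deflate(out);
                                }
                            }
                        });
                        t.setDaemon(true);
                        t.start();
                    }

                    // Threads that churn the eden so young GCs happen frequently.
                    for (int i = 0; i < 4; i++) {
                        Thread t = new Thread(() -> {
                            while (true) {
                                sink = new byte[1024];
                            }
                        });
                        t.setDaemon(true);
                        t.start();
                    }

                    Thread.sleep(60_000);  // run for a minute, then exit
                }
            }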

            People

            • Assignee:
              Unassigned
            • Reporter:
              tonyp Tony Printezis
            • Votes:
              0
            • Watchers:
              5

              Dates

              • Created:
                Updated: