Region Pinning in G1

    Details

    • Author:
      Hamlin Li
    • JEP Type:
      Feature
    • Exposure:
      Open
    • Subcomponent:
      gc
    • Scope:
      Implementation
    • Effort:
      L
    • Duration:
      M

      Description

      Summary

      Support region pinning in G1 so that garbage collection no longer needs to be disabled during JNI critical regions, eliminating the additional latency this causes.

      Goals

      • No additional latency to start a garbage collection, and no stalling of other Java threads, due to JNI critical regions
      • Support pinning of arbitrary regions in any G1 garbage collection
      • Remove GCLocker usage in G1
      • No pause time regression when there are no JNI critical regions active during garbage collection
      • Minimal pause time regressions in the presence of JNI critical regions during garbage collection

      Motivation

      For interoperability with unmanaged programming languages like C or C++, JNI provides functions to obtain and release raw pointers to Java objects (e.g. GetXXXCritical and ReleaseXXXCritical). The code running between such a pair of calls (e.g. GetPrimitiveArrayCritical and ReleasePrimitiveArrayCritical) is considered to be running in a "critical region". Whenever any Java thread is in such a critical region, the JVM must take care not to move any "critical objects" (e.g. the memory area returned by GetPrimitiveArrayCritical) during garbage collection, because the native code is using them through raw pointers. An alternative is to disable garbage collection during critical regions.

      The current default garbage collector, G1, implements critical region support in the latter way, i.e. by disabling garbage collection while any Java thread is in such a critical region.

      This way of handling JNI critical regions has a significant undesirable latency impact on Java threads: Java threads requiring memory must wait until no Java thread is running in a critical region any more. The severity of the problem depends on the number of Java threads that use these JNI functions and on the frequency and duration of the critical regions, but users report critical sections blocking garbage collection and the whole application for minutes, as well as spurious out-of-memory conditions due to starvation. This can even lead to premature VM shutdown.

      The current performance characteristics also force Java/JNI applications to choose between performance and the risk of stalling GC due to JNI critical regions. Due to this risk, some applications opt not to use critical functions by default or at all, e.g. [netty 4.1.34], [another netty issue], [libgdx], [javacpp], [opensearch-project/k-NN], [JetBrainsRuntime], [microsoft/libHttpClient], potentially to the detriment of performance.

      With the proposed change, application threads will incur no additional latency from waiting for a garbage collection to start, because there will be no waiting. Garbage collection will be able to run and free memory without regard to critical regions.

      Description

      The current mechanism to disable garbage collections in G1 works as follows: G1 records the Java threads that are in a critical region. If a Java thread requests a garbage collection while any thread is in a JNI critical region, G1 suspends the requesting thread until all Java threads currently in a JNI critical region have exited it. In this case, G1 also records and suspends all subsequent Java threads that try to enter a JNI critical region, perform selected virtual machine mode transitions, or request further garbage collections. G1 uses a global mutex called GCLocker to implement this suspend/resume mechanism. Only after all JNI critical regions have been exited does G1 execute the pending garbage collection; the VM subsequently resumes execution of all previously suspended threads.
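
      The stall behaviour described above can be illustrated with a small single-threaded model. This is only a sketch under simplifying assumptions: the class GcLockerModel and its methods are hypothetical names, it does not reproduce the HotSpot GCLocker implementation, and it ignores mode transitions and real thread suspension.

```java
import java.util.ArrayList;
import java.util.List;

/** Single-threaded model of the stall behaviour; hypothetical, not HotSpot code. */
final class GcLockerModel {
    private int threadsInCritical = 0;   // Java threads currently inside a critical region
    private boolean gcPending = false;   // a collection was requested but had to be deferred
    private final List<String> stalled = new ArrayList<>(); // threads suspended behind the pending GC
    private int collectionsRun = 0;

    /** Returns true if the thread entered; false if it was stalled behind a pending GC. */
    boolean enterCritical(String thread) {
        if (gcPending) {
            stalled.add(thread);
            return false;
        }
        threadsInCritical++;
        return true;
    }

    /** Leaving the last critical region lets the pending GC run. */
    void exitCritical() {
        if (threadsInCritical == 0) {
            throw new IllegalStateException("no thread is in a critical region");
        }
        threadsInCritical--;
        if (threadsInCritical == 0 && gcPending) {
            runCollection();
        }
    }

    /** Returns true if the GC ran immediately; false if the requester was stalled. */
    boolean requestGc(String thread) {
        if (threadsInCritical > 0) {
            gcPending = true;
            stalled.add(thread);
            return false;
        }
        runCollection();
        return true;
    }

    private void runCollection() {
        collectionsRun++;
        gcPending = false;
        stalled.clear(); // all previously suspended threads resume
    }

    int collectionsRun() { return collectionsRun; }
    int stalledThreads() { return stalled.size(); }
}
```

      In this model, a single thread that stays in its critical region keeps every GC requester and every later critical-region entrant stalled, which is the latency problem this JEP targets.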

      The main idea presented in this JEP is to, instead of disabling garbage collection completely, keep collecting garbage in heap regions not containing a critical object.

      G1 is a region-based incremental collector: it can already collect parts of the heap at the granularity of a heap region. Further, some of these regions may already be treated as locked in place (marked as "pinned") during any garbage collection. This JEP aims to extend this capability to any type of region during any kind of garbage collection.

      There is existing generic support to notify the JVM of Java threads obtaining and releasing critical objects.

      Existing Support for Region Pinning in G1

      There already exist a few mechanisms that we intend to exploit for support of pinning of arbitrary regions in the G1 collector.

      • Major (full) collection already completely supports region pinning: any region type that is marked as "pinned" during major collection will not be subject to compaction: live objects (like critical objects) within pinned regions are kept in place while the surrounding areas containing dead objects are formatted as empty. Currently G1 always marks humongous regions (regions containing large objects) and archive regions (containing CDS data) as pinned permanently and any other region that exceeds a liveness threshold as pinned during that collection only.
      • Minor (young) garbage collection does not support region pinning completely at this time: permanently pinned regions as described above will never be put into the collection set, automatically excluding them from any collection effort. However, there is currently no support for pinning other region types (both Young and Old regions) during minor collection. This work will need to remedy this shortcoming.
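
      The pinning rule for major collections described in the first bullet can be written down as a small predicate. This is a sketch only: RegionKind, FullGcPinning and in particular the value of LIVENESS_THRESHOLD are assumptions made for illustration and are not taken from the G1 sources.

```java
/** Illustrative region kinds; the names are assumptions, not HotSpot identifiers. */
enum RegionKind { YOUNG, OLD, HUMONGOUS, ARCHIVE }

final class FullGcPinning {
    /** Hypothetical threshold: fraction of a region's bytes that are live. */
    static final double LIVENESS_THRESHOLD = 0.95;

    /** Humongous and archive regions are pinned permanently. */
    static boolean isPermanentlyPinned(RegionKind kind) {
        return kind == RegionKind.HUMONGOUS || kind == RegionKind.ARCHIVE;
    }

    /** Any other region is pinned for this collection only if its liveness exceeds the threshold. */
    static boolean isPinnedForThisFullGc(RegionKind kind, long liveBytes, long regionBytes) {
        if (isPermanentlyPinned(kind)) {
            return true;
        }
        return (double) liveBytes / regionBytes > LIVENESS_THRESHOLD;
    }
}
```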

      Modifications to G1 Garbage Collection Algorithms

      The existing region pinning support described above suggests the following modifications to the G1 garbage collection algorithms to achieve the desired effect:

      • use the existing critical object obtain/release notifications to manage a count of critical objects per region. If that count is zero, there are no critical objects and that region can be garbage collected as before. Any non-zero count requires the garbage collector to treat that region as pinned for all types of collections.
      • during major collections, the above information can be used directly to temporarily pin the affected regions during that collection only.
      • minor collections can simply exclude Old regions that are pinned due to critical objects from the collection set during collection set selection at the start of the minor collection. Then they will not be collected. The same mechanism is not viable for Young regions: G1 can not exclude individual Young regions from evacuation. However, there is already a fallback mechanism to let live objects stay in place during minor collection: evacuation failure handling. By forcing evacuation failure for all live objects (which naturally include all critical objects) in pinned regions, G1 can achieve the expected effect.
      • remove GCLocker usage in G1.
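
      The per-region counting and the resulting treatment during minor collection can be sketched as follows. All names (PinnedRegionModel, Kind, Action, pin, unpin) are hypothetical, and addressing the heap by byte offset is a simplification chosen for this illustration; it is not the proposed HotSpot code.

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

/** Sketch of per-region pin counting; names and layout are illustrative assumptions. */
final class PinnedRegionModel {
    enum Kind { YOUNG, OLD }
    enum Action { COLLECT_AS_BEFORE, EXCLUDE_FROM_COLLECTION_SET, KEEP_LIVE_OBJECTS_IN_PLACE }

    private final long regionBytes;
    private final Kind[] kinds;
    private final AtomicIntegerArray pinCounts; // critical objects currently held, per region

    PinnedRegionModel(long regionBytes, Kind[] kinds) {
        this.regionBytes = regionBytes;
        this.kinds = kinds.clone();
        this.pinCounts = new AtomicIntegerArray(kinds.length);
    }

    private int regionOf(long heapOffset) {
        return (int) (heapOffset / regionBytes);
    }

    /** Called when a thread obtains a critical object at the given heap offset. */
    void pin(long heapOffset) {
        pinCounts.incrementAndGet(regionOf(heapOffset));
    }

    /** Called when the thread releases that critical object again. */
    void unpin(long heapOffset) {
        if (pinCounts.decrementAndGet(regionOf(heapOffset)) < 0) {
            throw new IllegalStateException("release without matching obtain");
        }
    }

    boolean isPinned(int region) {
        return pinCounts.get(region) > 0;
    }

    /** Decides how a minor collection treats a region. */
    Action minorGcAction(int region) {
        if (!isPinned(region)) {
            return Action.COLLECT_AS_BEFORE;
        }
        // Pinned Old regions are left out of the collection set. Pinned Young regions
        // cannot be excluded, so evacuation failure is forced for their live objects.
        return kinds[region] == Kind.OLD
                ? Action.EXCLUDE_FROM_COLLECTION_SET
                : Action.KEEP_LIVE_OBJECTS_IN_PLACE;
    }
}
```

      A count rather than a flag is used because several threads may hold critical objects in the same region at the same time; the region becomes collectable again only when the last one is released.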

      Reusing Evacuation Failure Handling

      When G1 is unable to find space to evacuate an object during minor collection, an evacuation failure occurs for that object. The object is kept in place and recorded, and both the object and its containing region are marked as "failed". After evacuation, a separate fixup phase clears the recorded marks, formats the space around the objects that failed evacuation as empty, and relabels the affected regions as if they were Old regions.
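
      The part of the fixup phase that formats the space around retained objects as empty can be sketched as below. This is a simplified model, not the G1 implementation: the types Retained and Block and the method fixup are hypothetical, offsets are in words, the input is assumed to be sorted and non-overlapping, and clearing of marks and relabeling of the region are not modeled. The sketch uses records, so it assumes Java 16 or later.

```java
import java.util.ArrayList;
import java.util.List;

final class EvacFailureFixup {
    /** An object kept in place: start offset and size in words. */
    record Retained(int start, int size) {}

    /** A block in the region after fixup: a retained object or empty filler. */
    record Block(int start, int size, boolean filler) {}

    /**
     * Walks the region linearly and turns every gap between retained objects
     * into filler. Assumes 'retained' is sorted by start and non-overlapping.
     */
    static List<Block> fixup(int regionWords, List<Retained> retained) {
        List<Block> layout = new ArrayList<>();
        int cursor = 0;
        for (Retained r : retained) {
            if (r.start() > cursor) {
                layout.add(new Block(cursor, r.start() - cursor, true));
            }
            layout.add(new Block(r.start(), r.size(), false));
            cursor = r.start() + r.size();
        }
        if (cursor < regionWords) {
            layout.add(new Block(cursor, regionWords - cursor, true));
        }
        return layout;
    }
}
```

      For a region of 100 words with retained objects at offsets 10 (5 words) and 40 (20 words), the resulting layout alternates filler and objects: filler 0..10, object 10..15, filler 15..40, object 40..60, filler 60..100.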

      This current implementation assumes that evacuation failure is very rare: typically G1 avoids evacuation failure occurrences completely by proper generation sizing or preventive garbage collections. Even if a garbage collection incurs an evacuation failure, the number of affected objects is typically extremely small.

      When this mechanism is repurposed for handling pinned Young regions, neither assumption holds: the number of regions incurring evacuation failure is still expected to be low, but larger than before, and such failures will occur more frequently. Further, the number of affected objects is bounded only by the size of the regions, because G1 needs to keep in place not only the objects that actually failed evacuation, but all live objects.

      The original assumptions led to the following design decisions that require significant improvement:

      • generally, performance of the path recording evacuation failure and the objects that failed evacuation is not well optimized.
      • due to the rarity of regions incurring evacuation failures, performance and in particular parallelism of the mentioned fixup phase is suboptimal: e.g. the implementation uses a linear walk through the entire region to find objects that failed evacuation, and the unit of work distribution between threads is a whole region.
      • regions that failed evacuation are implicitly promoted to old generation regions, meaning that although we know for certain their liveness after collection (which is generally very low), reclaiming that space requires a significant amount of time and effort. With a larger amount of such regions promoted to Old, there is a risk that these will fill up the heap quickly causing much extra work by the G1 collector.

      The necessary work is summarized in detail in a blog post and in the linked JIRA issues tagged with the gc-g1-pinned-regions label.

      Alternatives

      Implementation alternatives for support of critical regions correspond to the ones mentioned in the JNI specification:

      The first option is to always copy JNI critical objects to a place (e.g. the C heap) where the object does not move and copy it back afterwards: this has been discarded in the past for being very inefficient in time and space. Nothing substantially changed about the effort needed for this mechanism. A small optimization could be to only copy objects in regions G1 does not support pinning for, limiting copying to critical objects in Young regions. We do not expect that this improves the situation significantly: many heuristics in the garbage collection area assume that a large fraction of object modification and use occurs in the young generation. This is generally true given the efficiency of existing collection algorithms. We expect that the same applies to JNI critical functions.

      Another option is to pin objects individually: G1 can only evacuate whole regions, and can only allocate into completely free regions. Since a pinned object keeps a region from being freed (as it is trivially in use), there is no advantage doing that except additional code complexity to keep track of pinned objects on a per object basis.

      Of course, we could keep and refine the existing mechanism that uses the GCLocker to disable garbage collection during critical regions. However, disabling garbage collection fundamentally causes latency problems, and as far as we are aware such refinements cannot improve on the status quo.

      Apart from those we have not found other reasonable ideas that provide extra benefit (performance, simplicity, ...) to implement region pinning differently than suggested in this JEP.

      Testing

      Besides functional tests, we especially need benchmarking and performance measurements to collect performance data.

      Risks and Assumptions

      We assume that there are no changes to the expected usage of JNI critical regions: they are still to be used "sparingly" and these JNI critical regions are "short".

      The existing evacuation failure handling mechanisms G1 uses are well understood, so the risk in reusing them seems manageable. As stated before, there are some performance problems with using them as they are, but initial prototypes of changes show very good promise.

      There is a risk when the application pins lots of regions at the same time, in the extreme case pinning the entire heap, which will lead to an out-of-memory situation. There is no solution for this case currently, but it seems that in practice (the Shenandoah collector already uses region pinning for JNI critical regions) this will not occur.

      One good mitigation for this problem could be allowing allocation in regions that were pinned and sparsely occupied using a first-fit algorithm with a linked list of free space around critical objects. This technique may be further improved by tracking pinning on a per object basis. However we do not see any of these changes as necessary for this JEP for the above mentioned reason.
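
      To make the suggested mitigation more concrete, here is a minimal sketch of first-fit allocation over a linked list of free gaps within one pinned region. It is not part of this JEP's proposed changes; the class FirstFitRegion, its methods, and the word-based offsets are hypothetical and chosen for illustration only.

```java
import java.util.Iterator;
import java.util.LinkedList;

/** First-fit allocation into the free gaps of a single pinned region (sketch). */
final class FirstFitRegion {
    /** A free gap between critical objects, in words. */
    private static final class Gap {
        int start;
        int size;
        Gap(int start, int size) { this.start = start; this.size = size; }
    }

    private final LinkedList<Gap> freeList = new LinkedList<>();

    /** Registers a free gap; gaps are searched in the order they were added. */
    void addGap(int start, int size) {
        freeList.add(new Gap(start, size));
    }

    /** Returns the start offset of the allocation, or -1 if no gap is large enough. */
    int allocate(int words) {
        if (words <= 0) {
            throw new IllegalArgumentException("allocation size must be positive");
        }
        Iterator<Gap> it = freeList.iterator();
        while (it.hasNext()) {
            Gap gap = it.next();
            if (gap.size >= words) {      // first gap that fits wins
                int result = gap.start;
                gap.start += words;       // shrink the gap from the front
                gap.size -= words;
                if (gap.size == 0) {
                    it.remove();          // gap fully consumed
                }
                return result;
            }
        }
        return -1;
    }
}
```

      First-fit keeps the search simple at the cost of possible fragmentation; whether that trade-off is acceptable for sparsely occupied pinned regions would need to be measured.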

      Dependencies

      The work for this JEP is based on several existing and completed features in G1:

      • Region based heap
      • Region pinning support for major collections
      • Evacuation failure handling during young collections

          Issue Links

          1. G1: Fully support pinned regions for full gc (Sub-task, Resolved, Thomas Schatzl)
          2. G1: Forwarding pointer removal thread sizing (Sub-task, Resolved, Thomas Schatzl)
          3. Improve g1 evacuation failure injector performance (Sub-task, Resolved, Thomas Schatzl)
          4. Compile in G1 evacuation failure injection code based on define (Sub-task, Resolved, Thomas Schatzl)
          5. G1: Record regions where evacuation failed to provide targeted iteration (Sub-task, Resolved, Hamlin Li)
          6. G1: Factor out concurrent segmented array from G1CardSetAllocator (Sub-task, Resolved, Hamlin Li)
          7. G1: Optimize evacuation failure for regions with few failed objects (Sub-task, Resolved, Hamlin Li)
          8. G1: Allow forced evacuation failure of first N regions in collection set (Sub-task, Resolved, Hamlin Li)
          9. G1: Log basic statistics of evacuation failure (Sub-task, New, Hamlin Li)
          10. G1: Log further detailed statistics of evacuation failure (Sub-task, New, Unassigned)
          11. G1: Log per region statistics of evacuation failure (Sub-task, New, Unassigned)
          12. G1: Distinguish logging between the real evac failure and region pinning (Sub-task, New, Unassigned)
          13. G1: Add an additional logging category such as "gc+evacfail" to switch on the most detailed statistics (Sub-task, New, Unassigned)
          14. G1: Extend the gc+heap=debug messages to also show the number of failed regions of that category (Sub-task, New, Unassigned)
          15. G1: Improve parallelism in regions that failed evacuation (Sub-task, Open, Hamlin Li)
          16. G1: Factor out G1CardSetFreePool and related classes from G1CardSetXxx (Sub-task, Open, Hamlin Li)
          17. G1: Support reclaiming memory used in G1EvacFailureObjectsSet (Sub-task, Open, Hamlin Li)
          18. G1: Consider putting regions where evacuation failed into next collection set (Sub-task, Open, Hamlin Li)
          19. G1: support concurrent freeing of segments after GC in evacuation failure handling (Sub-task, Open, Hamlin Li)
          20. G1: Improve generation placement heuristics for regions that could not be evacuated (Sub-task, Open, Hamlin Li)
          21. G1: Improve thread sizing for evacuation failure (Sub-task, Open, Hamlin Li)
          22. G1: Improve evacuation failure for regions with many objects (Sub-task, Open, Hamlin Li)
          23. G1: Add objArray splitting when scanning object with evacuation failure (Sub-task, Open, Hamlin Li)
          24. G1: Allow random selection of forced evacuation failure of N regions in collection set (Sub-task, Open, Hamlin Li)
          25. G1: Enable region pinning in Young GC (Sub-task, New, Unassigned)
          26. G1: Enable region pinning in Full GC (Sub-task, New, Hamlin Li)
          27. G1: Remove GCLocker usage and related code in G1 (Sub-task, New, Hamlin Li)

              People

              Assignee:
              Hamlin Li (mli)
              Reporter:
              Hamlin Li (mli)
              Owner:
              Hamlin Li
              Reviewed By:
              Thomas Schatzl, Vladimir Kozlov