Support region pinning in G1 so that garbage collection no longer needs to be disabled during JNI critical regions, eliminating the additional latency this causes.
- No additional latency to start a garbage collection or stalling of other Java threads due to JNI critical regions anymore
- Support pinning of arbitrary regions in any G1 garbage collection
- Remove GCLocker usage in G1
- No pause time regression when there are no JNI critical regions active during garbage collection
- Minimal pause time regressions in the presence of JNI critical regions during garbage collection
For interoperability with unmanaged programming languages like C or C++, JNI provides functions to obtain and release raw pointers to Java objects (e.g. GetXXXCritical and ReleaseXXXCritical). The code running between such a pair of function calls (e.g. GetPrimitiveArrayCritical and ReleasePrimitiveArrayCritical) is considered to be running in a "critical region". Whenever any Java thread is in such a critical region, the JVM must take care not to move any "critical objects" (e.g. the memory area returned by GetPrimitiveArrayCritical) during garbage collection, because that code is using them. An alternative, which G1 currently uses, is to disable garbage collection during critical regions.
This choice of handling JNI critical regions has a significant undesirable latency impact on Java threads: Java threads requiring memory must wait until there are no more Java threads running in a critical region. The severity of these problems depends on the number of Java threads that use these JNI functions and on the frequency and duration of the critical regions, but users report critical regions blocking garbage collection and the whole application for minutes, and spurious out-of-memory conditions due to starvation. This can even lead to premature VM shutdown.
The current performance characteristics also force Java/JNI applications to choose between performance and the risk of stalling GC due to JNI critical regions. Due to this risk, some applications opt not to use critical functions by default or at all, e.g. [netty 4.1.34], [another netty issue], [libgdx], [javacpp], [opensearch-project/k-NN], [JetBrainsRuntime], [microsoft/libHttpClient], potentially to the detriment of performance.
With the proposed change, application threads will incur no additional latency from waiting for garbage collection, because there will be no waiting. Garbage collection will be able to run and free memory without regard to critical regions.
The current mechanism to disable garbage collection in G1 works as follows: G1 records Java threads in a critical region. If a Java thread requests a garbage collection while other threads are in a JNI critical region, G1 suspends the requesting thread until all Java threads currently in a JNI critical region have exited it. In this case, G1 also records and suspends all subsequent Java threads that try to enter a JNI critical region, perform selected virtual machine mode transitions, or request further garbage collections. G1 uses a global mutex called GCLocker to achieve this suspend/resume mechanism.
Only after all JNI critical regions have been exited does G1 execute the pending garbage collection. The VM subsequently resumes execution of all previously suspended threads.
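The protocol described above can be illustrated with a toy, single-threaded model. This is only a sketch of the described behaviour, not HotSpot code: the class `GcLockerSketch` and all of its methods are hypothetical names, and "suspending" a thread is modelled simply by returning `false` and writing a log entry.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a GCLocker-style protocol; all names are hypothetical. */
final class GcLockerSketch {
    private int threadsInCritical = 0;  // Java threads currently inside a JNI critical region
    private boolean gcPending = false;  // a collection was requested while that count was non-zero
    private final List<String> log = new ArrayList<>();

    /** Returns false if the thread would be suspended because a GC is pending. */
    boolean tryEnterCritical(String thread) {
        if (gcPending) {
            log.add(thread + " stalled: GC pending");
            return false;
        }
        threadsInCritical++;
        log.add(thread + " entered critical region");
        return true;
    }

    /** Leaving the last critical region triggers the pending collection, if any. */
    void exitCritical(String thread) {
        threadsInCritical--;
        log.add(thread + " exited critical region");
        if (threadsInCritical == 0 && gcPending) {
            runGc();
        }
    }

    /** Returns true if the collection ran immediately, false if it was deferred. */
    boolean requestGc(String thread) {
        if (threadsInCritical > 0) {
            gcPending = true;
            log.add(thread + " stalled: waiting for critical regions");
            return false;
        }
        runGc();
        return true;
    }

    private void runGc() {
        gcPending = false;
        log.add("GC executed");
    }

    List<String> log() {
        return log;
    }
}
```

In this model, a single long-running critical region defers the collection and, once the collection is pending, also stalls every other thread that tries to enter a critical region. That is the latency problem this JEP sets out to remove.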
The main idea presented in this JEP is to, instead of disabling garbage collection completely, keep collecting garbage in heap regions not containing a critical object.
G1 is a region-based incremental collector: it can already collect parts of the heap with the granularity of a heap region. Further, some of these regions may already be treated as locked in place (marked as "pinned") during any garbage collection. This JEP aims to extend this capability to any type of region during any kind of garbage collection.
There is existing generic support to notify the JVM of Java threads obtaining and releasing critical objects.
Existing Support for Region Pinning in G1
There already exist a few mechanisms that we intend to exploit for support of pinning of arbitrary regions in the G1 collector.
- Major (full) collection already completely supports region pinning: any region type that is marked as "pinned" during major collection will not be subject to compaction: live objects (like critical objects) within pinned regions are kept in place while the surrounding areas containing dead objects are formatted as empty. Currently G1 always marks humongous regions (regions containing large objects) and archive regions (containing CDS data) as pinned permanently and any other region that exceeds a liveness threshold as pinned during that collection only.
- Minor (young) garbage collection does not support region pinning completely at this time: permanently pinned regions as described above will never be put into the collection set, automatically excluding them from any collection effort. However, there is currently no support for pinning other region types (both Young and Old regions) during minor collection. This work will need to remedy this shortcoming.
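The difference between a compacted and a pinned region during major collection can be sketched as follows. This is a simplified per-region illustration under stated assumptions: the class `FullGcPinningSketch`, the `FILLER` marker and the slot representation are hypothetical, and a real major collection compacts across regions rather than within one.

```java
/** Toy model of how a major collection treats one region; all names are hypothetical. */
final class FullGcPinningSketch {
    /** Stands in for dead space that was formatted as empty inside a pinned region. */
    static final String FILLER = "<filler>";

    /**
     * Input: one entry per object slot, null marks a dead object.
     * Output: for a pinned region, live objects stay at their index and dead slots
     * become FILLER; otherwise live objects slide to the front and the trailing
     * null entries represent free space.
     */
    static String[] processRegion(String[] slots, boolean pinned) {
        String[] out = new String[slots.length];
        if (pinned) {
            for (int i = 0; i < slots.length; i++) {
                out[i] = (slots[i] != null) ? slots[i] : FILLER;
            }
        } else {
            int next = 0;
            for (String s : slots) {
                if (s != null) {
                    out[next++] = s;
                }
            }
        }
        return out;
    }
}
```

The key property is that in the pinned case no live object changes its position, so a raw pointer handed out to native code remains valid.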
Modifications to G1 Garbage Collection Algorithms
The existing region pinning support described above suggests the following modifications to the G1 garbage collection algorithms to achieve the desired effect:
- use the existing critical object obtain/release notifications to manage a count of critical objects per region. If that count is zero, there are no critical objects and that region can be garbage collected as before. Any non-zero count requires the garbage collector to treat that region as pinned for all types of collections.
- during major collections, the above information can be used directly to temporarily pin the affected regions during that collection only.
- minor collections can simply exclude Old regions that are pinned due to critical objects from the collection set during collection set selection at the start of the minor collection. Then they will not be collected. The same mechanism is not viable for Young regions: G1 can not exclude individual Young regions from evacuation. However, there is already a fallback mechanism to let live objects stay in place during minor collection: evacuation failure handling. By forcing evacuation failure for all live objects (which naturally include all critical objects) in pinned regions, G1 can achieve the expected effect.
- remove GCLocker usage in G1.
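The per-region bookkeeping and the resulting decisions for a minor collection can be sketched like this. It is a minimal illustration of the modifications listed above, not G1's implementation: `RegionPinningSketch`, `Region`, `Plan` and `planMinorCollection` are hypothetical names introduced only for this example.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of per-region pin counts and collection set selection; names are hypothetical. */
final class RegionPinningSketch {
    enum Kind { YOUNG, OLD }

    static final class Region {
        final String name;
        final Kind kind;
        private int pinCount = 0;  // critical objects currently obtained in this region

        Region(String name, Kind kind) {
            this.name = name;
            this.kind = kind;
        }

        /** Called on the "obtain critical object" notification. */
        void pin() {
            pinCount++;
        }

        /** Called on the "release critical object" notification. */
        void unpin() {
            if (pinCount == 0) {
                throw new IllegalStateException("unbalanced release in " + name);
            }
            pinCount--;
        }

        boolean isPinned() {
            return pinCount > 0;
        }
    }

    /** Decisions taken for the candidate regions of one minor collection. */
    static final class Plan {
        final List<Region> evacuate = new ArrayList<>();     // unpinned: collected as before
        final List<Region> skipped = new ArrayList<>();      // pinned Old: left out of the collection set
        final List<Region> keepInPlace = new ArrayList<>();  // pinned Young: forced evacuation failure
    }

    static Plan planMinorCollection(List<Region> candidates) {
        Plan plan = new Plan();
        for (Region r : candidates) {
            if (!r.isPinned()) {
                plan.evacuate.add(r);
            } else if (r.kind == Kind.OLD) {
                plan.skipped.add(r);
            } else {
                plan.keepInPlace.add(r);
            }
        }
        return plan;
    }
}
```

A counter is used instead of a flag because several critical objects in the same region may be held at the same time; the region only becomes collectable again once every one of them has been released.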
Reusing Evacuation Failure Handling
When G1 is unable to find space to evacuate an object during minor collection, an evacuation failure occurs for that object. The object is kept in place and recorded, and both the object and its containing region are marked as "failed". After evacuation, a separate fixup phase clears the recorded marks, formats the space around the objects that failed evacuation as empty, and relabels these regions as Old regions.
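The two phases described above, evacuation with failure recording followed by fixup, can be sketched as below. This is a simplified model under stated assumptions: `EvacuationFailureSketch` and its members are hypothetical names, to-space is reduced to a single capacity number, and "formatting the surrounding space as empty" is modelled by dropping everything except the failed objects from the region.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of evacuation failure handling; all names are hypothetical. */
final class EvacuationFailureSketch {
    static final class Obj {
        final String id;
        final int size;
        final boolean live;
        boolean failed = false;  // set when this object could not be evacuated

        Obj(String id, int size, boolean live) {
            this.id = id;
            this.size = size;
            this.live = live;
        }
    }

    static final class Region {
        String label = "Young";
        boolean evacuationFailed = false;
        List<Obj> objects = new ArrayList<>();
    }

    /** Copies live objects while to-space lasts; the rest are marked as failed. */
    static List<Obj> evacuate(Region from, int toSpaceCapacity) {
        List<Obj> copied = new ArrayList<>();
        int used = 0;
        for (Obj o : from.objects) {
            if (!o.live) {
                continue;  // dead objects are never copied
            }
            if (used + o.size <= toSpaceCapacity) {
                used += o.size;
                copied.add(o);
            } else {
                o.failed = true;               // object stays in place
                from.evacuationFailed = true;  // region is marked as well
            }
        }
        return copied;
    }

    /** Fixup phase: keep failed objects in place, clear marks, relabel the region as Old. */
    static void fixup(Region region) {
        if (!region.evacuationFailed) {
            return;
        }
        List<Obj> kept = new ArrayList<>();
        for (Obj o : region.objects) {
            if (o.failed) {
                o.failed = false;
                kept.add(o);
            }
        }
        region.objects = kept;  // everything else is now empty space
        region.evacuationFailed = false;
        region.label = "Old";
    }
}
```

Note that the fixup loop walks every object of the region, which mirrors the linear walk described in the text.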
This current implementation assumes that evacuation failure is very rare: typically G1 avoids evacuation failure occurrences completely by proper generation sizing or preventive garbage collections. Even if a garbage collection incurs an evacuation failure, the number of affected objects is typically extremely small.
By repurposing this mechanism for handling pinned Young regions, neither assumption remains valid: a still low, but presumably larger, number of regions will incur evacuation failure, and at a higher frequency. Further, the number of affected objects is bounded only by the size of the regions, as G1 needs to keep in place not only the objects that actually failed evacuation but all live objects in those regions.
The original assumptions led to the following design decisions that require significant improvement:
- generally, performance of the path recording evacuation failure and the objects that failed evacuation is not well optimized.
- due to the rarity of regions incurring evacuation failures, performance and in particular parallelism of the mentioned fixup phase is suboptimal: e.g. the implementation uses a linear walk through the entire region to find objects that failed evacuation, and the unit of work distribution between threads is a whole region.
- regions that failed evacuation are implicitly promoted to old generation regions, meaning that although we know their liveness for certain after collection (it is generally very low), reclaiming that space requires a significant amount of time and effort. With a larger number of such regions promoted to Old, there is a risk that they will fill up the heap quickly, causing much extra work for the G1 collector.
Implementation alternatives for support of critical regions correspond to the ones mentioned in the JNI specification:
The first option is to always copy JNI critical objects to a place (e.g. the C heap) where the object does not move and copy it back afterwards: this has been discarded in the past for being very inefficient in time and space. Nothing substantially changed about the effort needed for this mechanism. A small optimization could be to only copy objects in regions G1 does not support pinning for, limiting copying to critical objects in Young regions. We do not expect that this improves the situation significantly: many heuristics in the garbage collection area assume that a large fraction of object modification and use occurs in the young generation. This is generally true given the efficiency of existing collection algorithms. We expect that the same applies to JNI critical functions.
Another option is to pin objects individually. However, G1 can only evacuate whole regions, and can only allocate into completely free regions. Since a pinned object keeps its region from being freed (as the region is trivially in use), doing so offers no advantage and only adds code complexity to keep track of pinned objects on a per-object basis.
Of course, we could keep and refine the existing mechanism that uses the GCLocker to disable garbage collection during critical regions. However, disabling garbage collection fundamentally causes latency problems and cannot improve on the status quo, as far as we are aware.
Apart from those we have not found other reasonable ideas that provide extra benefit (performance, simplicity, ...) to implement region pinning differently than suggested in this JEP.
Besides functional tests, we especially need to do benchmarking and performance measurements to collect performance data.
Risks and Assumptions
We assume that there are no changes to the expected usage of JNI critical regions: they are still to be used "sparingly" and these JNI critical regions are "short".
The existing evacuation failure handling mechanisms G1 uses are well understood, so the risk in reusing them seems manageable. As stated before, there are some performance problems with using them as they are, but initial prototypes of changes show very good promise.
There is a risk when the application pins lots of regions at the same time, in the extreme case pinning the entire heap, which will lead to an out-of-memory situation. There is no solution for this case currently, but it seems that in practice (the Shenandoah collector already uses region pinning for JNI critical regions) this will not occur.
One good mitigation for this problem could be allowing allocation in regions that were pinned and sparsely occupied using a first-fit algorithm with a linked list of free space around critical objects. This technique may be further improved by tracking pinning on a per object basis. However we do not see any of these changes as necessary for this JEP for the above mentioned reason.
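The first-fit idea mentioned above can be sketched as follows. This is an illustration only, since the JEP does not propose implementing it: `FirstFitSketch`, `FreeRange` and the use of integer offsets within a region are assumptions made for the example.

```java
import java.util.Iterator;
import java.util.LinkedList;

/** Toy first-fit allocator over the free space of a pinned region; names are hypothetical. */
final class FirstFitSketch {
    /** A gap between critical objects, described by offset and length within the region. */
    static final class FreeRange {
        int start;
        int size;

        FreeRange(int start, int size) {
            this.start = start;
            this.size = size;
        }
    }

    private final LinkedList<FreeRange> freeList = new LinkedList<>();

    void addFreeRange(int start, int size) {
        freeList.add(new FreeRange(start, size));
    }

    /** Returns the start offset of the allocation, or -1 if no range is large enough. */
    int allocate(int size) {
        Iterator<FreeRange> it = freeList.iterator();
        while (it.hasNext()) {
            FreeRange r = it.next();
            if (r.size >= size) {        // first range that fits wins
                int result = r.start;
                r.start += size;         // shrink the range from the front
                r.size -= size;
                if (r.size == 0) {
                    it.remove();         // fully used ranges leave the list
                }
                return result;
            }
        }
        return -1;
    }
}
```

For example, with free ranges of 16 bytes at offset 0 and 128 bytes at offset 64, a request for 32 bytes skips the first range and is placed at offset 64, while a later request for 16 bytes still fits at offset 0.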
The work for this JEP is based on several existing and completed features in G1:
- Region based heap
- Region pinning support for major collections
- Evacuation failure handling during young collections