JEP 331: Low-Overhead Heap Profiling

    Details

    • Type: JEP
    • Status: Candidate
    • Priority: P4
    • Resolution: Unresolved
    • Fix Version/s: None
    • Component/s: hotspot
    • Labels:
      None
    • Author:
      JC Beyler
    • JEP Type:
      Feature
    • Exposure:
      Open
    • Subcomponent:
    • Scope:
      JDK
    • Discussion:
      hotspot dash dev at openjdk dot java dot net
    • JEP Number:
      331

      Description

      Summary

      Provide a low-overhead way of sampling Java heap allocations, accessible via JVMTI.

      Goals

      Provide a way to get information about Java object heap allocations from the JVM that:

      • Is low-overhead enough to be enabled by default continuously,
      • Is accessible via a well-defined, programmatic interface,
      • Can sample all allocations (i.e., is not limited to allocations that are in one particular heap region or that were allocated in one particular way),
      • Can be defined in an implementation-independent way (i.e., without relying on any particular GC algorithm or VM implementation), and
      • Can give information about both live and dead Java objects.

      Motivation

      There is a deep need for users to understand the contents of their heaps. Poor heap management can lead to problems such as heap exhaustion and GC thrashing. As a result, a number of tools have been developed to allow users to introspect into their heaps, such as the Java Flight Recorder, jmap, YourKit, and VisualVM tools.

      One piece of information that is lacking from most of the existing tooling is the call site for particular allocations. Heap dumps and heap histograms do not contain this information. This information can be critical to debugging memory issues, because it tells developers exactly where in their code particular (and particularly bad) allocations occurred.

      There are currently two ways of getting this information out of HotSpot:

      • First, you can instrument all of the allocations in your application using a bytecode rewriter such as the Allocation Instrumenter. You can then have the instrumentation take a stack trace (when you want one).

      • Second, you can use Java Flight Recorder, which takes a stack trace on TLAB refills and when allocating directly into the old generation. The downsides of this are that a) it is tied to a particular allocation implementation (TLABs), and misses allocations that don’t meet that pattern; b) it doesn’t allow the user to customize the sampling rate; c) it only logs allocations, so you cannot distinguish between live and dead objects.

      This proposal mitigates those problems by providing an extensible JVMTI interface that allows the user to define the sampling rate and returns a set of live stack traces.

      Description

      A) New Event and new method to JVMTI

      The user-facing API for the heap sampling feature proposed by this JEP consists of an extension to JVMTI that allows for heap profiling. It relies on the JVMTI event notification system to provide a callback such as:

      void JNICALL
      SampledObjectAlloc(jvmtiEnv *jvmti_env,
                  JNIEnv* jni_env,
                  jthread thread,
                  jobject object,
                  jclass object_klass,
                  jlong size)

      where:

      • thread is the thread allocating the jobject
      • object is the reference to the sampled jobject
      • object_klass is the class for the jobject
      • size is the size of the allocation.

      The new API also includes a single new JVMTI method:

      jvmtiError  SetHeapSamplingRate(jvmtiEnv* env, jint sampling_rate)

      where sampling_rate is the average number of bytes allocated between two samples. The specification of the method is:

      • If non-zero, the sampling rate is updated, and callbacks are then sent to the user at the new average rate of one sample per sampling_rate bytes allocated.
        • For example, if the user wants a sample every megabyte, sampling_rate would be 1024 * 1024.
      • If zero is passed to the method, the sampler samples every allocation.
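
      As a rough illustration of what the rate means in practice, the sketch below converts a "one sample per N megabytes" target into a sampling_rate value and estimates how many callbacks to expect. The helper names rate_for_megabytes and expected_samples are hypothetical and not part of JVMTI, and the estimate assumes only what is stated above: one sample per sampling_rate bytes on average, or every allocation when the rate is zero.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not part of JVMTI: the sampling_rate value for
// "one sample per `megabytes` MB on average". Only valid for
// megabytes < 2048, since the result must fit in a 32-bit jint.
constexpr std::int32_t rate_for_megabytes(std::int32_t megabytes) {
  return megabytes * 1024 * 1024;
}

// Hypothetical helper: rough number of SampledObjectAlloc callbacks to
// expect. With a rate of zero every allocation is sampled, so the count
// is the number of allocations; otherwise it is about one callback per
// sampling_rate bytes allocated.
std::int64_t expected_samples(std::int64_t total_bytes,
                              std::int64_t allocation_count,
                              std::int32_t sampling_rate) {
  assert(total_bytes >= 0 && allocation_count >= 0 && sampling_rate >= 0);
  if (sampling_rate == 0) {
    return allocation_count;
  }
  return total_bytes / sampling_rate;
}
```

      For instance, at a rate of 512 * 1024 bytes, an application that allocates 1 GB would be expected to trigger roughly 2048 callbacks.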

      B) Use-case example

      To enable this, a user would make the usual event notification call:

       (*jvmti)->SetEventNotificationMode(jvmti, JVMTI_ENABLE, JVMTI_EVENT_SAMPLED_OBJECT_ALLOC, NULL)

      The event is sent once the allocation has been initialized and set up correctly, so slightly after the actual code performs the allocation. By default, the sampling rate is 512 KB. The minimum required to enable the sampling event system is therefore a call to SetEventNotificationMode with JVMTI_ENABLE and the event type JVMTI_EVENT_SAMPLED_OBJECT_ALLOC. To modify the sampling rate, the user calls the SetHeapSamplingRate method.

      Disabling the system takes a single call:

       (*jvmti)->SetEventNotificationMode(jvmti, JVMTI_DISABLE, JVMTI_EVENT_SAMPLED_OBJECT_ALLOC, NULL)

      which disables the event notifications and, automatically, the sampler as well.

      Enabling the event again via SetEventNotificationMode re-enables the sampler with whatever sampling rate was last set (either the 512 KB default or the last value passed by the user via SetHeapSamplingRate).

      C) New Capability

      To protect the new feature and make it optional for VM implementations, a new capability called can_generate_sampled_alloc_events is added to the jvmtiCapabilities structure.

      D) Global/Thread level sampling

      Using the notification system provides a direct means to send events only for specific threads. This is done by passing the thread of interest as the event_thread argument of SetEventNotificationMode; passing NULL, as in the calls above, enables or disables the event globally.

      E) What the JVMTI agent can do

      The user of the callback can then pick up a stack trace at the moment of the callback, using the JVMTI GetStackTrace method for example. The jobject reference passed to the callback can also be wrapped in a JNI weak reference to help determine when the object has been garbage collected. The idea is to provide data on which sampled objects are still live and which have been garbage collected, which can be a good means of understanding the application's behavior.

      The sampling rate determines the sampling precision, and it can also be used to mitigate the overhead of profiling. With a sampling rate of 512 KB, the overhead should be low enough that a user could reasonably leave the system on by default.

      F) A Full Example

      The following section provides code snippets to illustrate the sampler's API. First, the capability and the event notification are enabled:

      // Callback table: only the sampled allocation event is of interest.
      jvmtiEventCallbacks callbacks;
      memset(&callbacks, 0, sizeof(callbacks));
      callbacks.SampledObjectAlloc = &SampledObjectAlloc;
      
      // Request the capability that guards the new event.
      jvmtiCapabilities caps;
      memset(&caps, 0, sizeof(caps));
      caps.can_generate_sampled_alloc_events = 1;
      if (JVMTI_ERROR_NONE != (*jvmti)->AddCapabilities(jvmti, &caps)) {
        return JNI_ERR;
      }
      
      // Enable the event globally (NULL thread).
      if (JVMTI_ERROR_NONE != (*jvmti)->SetEventNotificationMode(jvmti, JVMTI_ENABLE,
                                             JVMTI_EVENT_SAMPLED_OBJECT_ALLOC, NULL)) {
        return JNI_ERR;
      }
      
      // Install the callback table.
      if (JVMTI_ERROR_NONE != (*jvmti)->SetEventCallbacks(jvmti, &callbacks, sizeof(jvmtiEventCallbacks))) {
        return JNI_ERR;
      }
      
      // Set the sampler to 1MB.
      if (JVMTI_ERROR_NONE != (*jvmti)->SetHeapSamplingRate(jvmti, 1024 * 1024)) {
        return JNI_ERR;
      }

      To disable the sampler (disables events and the sampler):

      if (JVMTI_ERROR_NONE != (*jvmti)->SetEventNotificationMode(jvmti, JVMTI_DISABLE,
                                             JVMTI_EVENT_SAMPLED_OBJECT_ALLOC, NULL)) {
        return JNI_ERR;
      }

      To re-enable the sampler with the 1024 * 1024 byte sampling rate, a simple call to enable the event is all that is required:

      if (JVMTI_ERROR_NONE != (*jvmti)->SetEventNotificationMode(jvmti, JVMTI_ENABLE,
                                             JVMTI_EVENT_SAMPLED_OBJECT_ALLOC, NULL)) {
        return JNI_ERR;
      }

      User Storage of Sampled Allocations

      Once a callback is set up, the agent can create a weak reference to each sampled object and track it to determine whether the object has been garbage collected. A stack trace, obtained with the JVMTI GetStackTrace method, can be stored alongside it to help users profile their code.

      For example, something like this could be done:

      extern "C" JNIEXPORT void JNICALL SampledObjectAlloc(jvmtiEnv *env,
                                                           JNIEnv* jni,
                                                           jthread thread,
                                                           jobject object,
                                                           jclass klass,
                                                           jlong size) {
        jvmtiFrameInfo frames[32];
        jint frame_count;
        jvmtiError err;
      
        err = global_jvmti->GetStackTrace(NULL, 0, 32, frames, &frame_count);
        if (err == JVMTI_ERROR_NONE && frame_count >= 1) {
          jweak ref = jni->NewWeakGlobalRef(object);
          internal_storage.add(jni, ref, size, thread, frames, frame_count);
        }
      }

      where internal_storage is a data structure that can hold the sampled objects, decide whether garbage-collected samples need to be cleaned up, and so on. The internals of that implementation are out of scope for this JEP, since it is up to the user to define and implement the system that consumes the data from the callback.
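
      Purely as an illustration of what such a structure might look like, here is a minimal sketch that compiles without the JDK headers. Everything in it is an assumption rather than part of the proposal: the SampleStorage and Sample names are hypothetical, std::weak_ptr<void> stands in for a JNI jweak, and strings stand in for jvmtiFrameInfo entries. A real agent would test liveness through JNI instead of expired().

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// One sampled allocation. In a real agent `ref` would be a jweak and
// `frames` an array of jvmtiFrameInfo; both are stand-ins here.
struct Sample {
  std::weak_ptr<void> ref;
  std::int64_t size;
  std::vector<std::string> frames;
};

// Hypothetical storage class. Callbacks may arrive on any thread, so
// every operation takes a lock.
class SampleStorage {
 public:
  void add(std::weak_ptr<void> ref, std::int64_t size,
           std::vector<std::string> frames) {
    assert(size >= 0);
    std::lock_guard<std::mutex> lock(mutex_);
    samples_.push_back(Sample{std::move(ref), size, std::move(frames)});
  }

  // Number of stored samples whose object has not been collected yet.
  std::size_t live_count() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return static_cast<std::size_t>(std::count_if(
        samples_.begin(), samples_.end(),
        [](const Sample& s) { return !s.ref.expired(); }));
  }

  // Drops samples whose object is gone; returns how many were dropped.
  std::size_t prune_collected() {
    std::lock_guard<std::mutex> lock(mutex_);
    const std::size_t before = samples_.size();
    samples_.erase(
        std::remove_if(samples_.begin(), samples_.end(),
                       [](const Sample& s) { return s.ref.expired(); }),
        samples_.end());
    return before - samples_.size();
  }

  std::size_t size() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return samples_.size();
  }

 private:
  mutable std::mutex mutex_;
  std::vector<Sample> samples_;
};
```

      Whether collected samples are pruned eagerly, kept for a "dead objects" report, or aggregated per stack trace is a design choice left to the agent.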

      Alternatives

      There are multiple alternatives to the system presented in this JEP. The Motivation section presented two already. The Java Flight Recorder system provides an interesting alternative, but it falls short because it does not allow the sampling rate to be set and does not provide a callback.

      The JFR system does use TLAB creation as a means to track memory allocation but, instead of a callback, JFR events use a buffer system that can lead to missing some sampled allocations. Finally, the JFR event system does not provide a means to track objects that have been garbage collected, which means that it is currently not possible to build a system that provides information about both live and garbage-collected objects using the JFR event system.

      Another alternative is bytecode instrumentation using ASM, but its overhead makes it prohibitive and not a workable solution.

      This JEP adds a new feature into the JVMTI which is an important API/framework for various development and monitoring tools. With it, a JVMTI agent can use a low overhead heap profiling API along with the rest of JVMTI functionality, which provides great flexibility to the tools. For instance, it is up to the agent to decide if a stack trace needs to be collected at each event point.

      Testing

      There are 16 tests in the JTReg framework for this feature. They test turning the sampler on and off with multiple threads, multiple threads allocating at the same time, whether the data is sampled at the right rate, and whether the stacks are consistent with what is expected.

      Risks and Assumptions

      There are no performance hits or risks with the feature disabled. A general user not enabling the system would not perceive a difference with or without the feature.

      However, there is a potential performance/memory hit with the feature enabled. In the prototype implementation, the overhead was minimal (<2%), but it used a mechanism that modified JIT’d code. In the version presented here, the system piggy-backs on the TLAB code and should not have that regression.

      Current evaluation of the DaCapo benchmark puts the overhead at:

      • 0% when the feature is disabled

      • 1% when the feature is enabled at the default 512 KB rate but no callback action is performed (i.e., the SampledObjectAlloc callback is empty but registered with the JVM).

      • 3% overhead with a sampling callback that does a naive implementation to store the data (using the one in the tests)

      Prototype Implementation Details

      The current prototype implementation proves the feasibility of the approach. It consists, in essence, of five parts:

      1) Architecture dependent changes due to a change of a field name in the ThreadLocalAllocationBuffer (TLAB) structure. These changes are minimal as they are just name changes.

      2) The TLAB structure is augmented with a new allocation_end pointer and a current_end pointer. If sampling is disabled, the two pointers are always equal and the code performs as before. If sampling is enabled, current_end is moved to the point where the next sample is requested. Any fast path will then "think" that the TLAB is full at that point and go down the slow path, which is explained in (3).

      3) The gc/shared/collectedHeap code is changed because it serves as the entry point to the allocation slow path. If a TLAB is considered full, the code enters the collectedHeap and tries to allocate a new TLAB. At this point, the TLAB is set back to its original size and an allocation is attempted. If the allocation succeeds, the code returns after sampling the allocation. If it does not, the end of the TLAB really has been reached and a new TLAB is needed. The code path then continues with its normal allocation of a new TLAB and determines whether that allocation requires a sample. If the allocation is considered too big for the TLAB, the system samples the allocation as well, thus covering both in-TLAB and out-of-TLAB allocations.

      4) When a sample is requested, there is a collector object set on the stack in a place safe for sending the information to the native agent. The collector keeps track of sampled allocations and, at destruction of its own frame, sends a callback to the agent. This mechanism ensures the object is initialized correctly.

      5) Though not part of the implementation due to its out-of-JDK nature, the native agent can then register a callback and obtain sampled allocations. The allocations can be associated with a stack trace using a JVMTI method and then wrapped in a weak reference, which provides liveness information. An example implementation of this can be found in the libHeapMonitorTest.c file of the webrev, which is used for the JTReg testing.
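
      Parts (2) and (3) can be illustrated with a toy bump-pointer buffer. This is a minimal sketch under simplifying assumptions, not the HotSpot code: only the current_end and allocation_end names come from the text above, while ToyTlab and its methods are hypothetical, offsets replace real pointers, the next sample point is placed at a fixed distance rather than an averaged one, and refilling with a new TLAB is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model of a TLAB that uses a lowered current_end to force the
// slow path at each sample point.
struct ToyTlab {
  std::size_t top = 0;             // offset of the next free byte
  std::size_t current_end = 0;     // limit checked by the fast path
  std::size_t allocation_end = 0;  // real end of the buffer
  bool sampling = false;           // is the sampler enabled?
  std::size_t interval = 0;        // bytes between sample points
  std::vector<std::size_t> sampled_offsets;  // sampled allocations

  ToyTlab(std::size_t size, bool enable_sampling, std::size_t sample_interval)
      : allocation_end(size),
        sampling(enable_sampling),
        interval(sample_interval) {
    arm_next_sample();
  }

  // Put current_end at the next sample point, or at the real end when
  // sampling is off or the sample point lies beyond the buffer.
  void arm_next_sample() {
    current_end = allocation_end;
    if (sampling && allocation_end - top > interval) {
      current_end = top + interval;
    }
  }

  // Fast path: plain bump-pointer allocation checked against current_end.
  bool fast_allocate(std::size_t size, std::size_t* offset) {
    if (size > current_end - top) return false;
    *offset = top;
    top += size;
    return true;
  }

  // Slow path: restore the real limit, retry, and record a sample if the
  // fast path was stopped by a sample point rather than a full buffer.
  bool slow_allocate(std::size_t size, std::size_t* offset) {
    const bool at_sample_point = current_end != allocation_end;
    current_end = allocation_end;
    const bool ok = fast_allocate(size, offset);
    if (ok && at_sample_point) sampled_offsets.push_back(*offset);
    arm_next_sample();
    return ok;  // false: the buffer really is full
  }

  bool allocate(std::size_t size, std::size_t* offset) {
    return fast_allocate(size, offset) || slow_allocate(size, offset);
  }
};
```

      With a 1024-byte buffer, a 256-byte interval and 64-byte allocations, the fast path fails at offsets 256, 576 and 896, and those three allocations are the ones recorded as samples. With sampling disabled, current_end always equals allocation_end and only the fast path runs until the buffer is full.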

          Activity

          jcbeyler Jean Christophe Beyler added a comment -
          Fixed :)
          mr Mark Reinhold added a comment - - edited
          [~jcbeyler] I've done a light copy-editing pass, and tightened up the title to match what I've seen you use in recent e-mails. If this looks okay to you then please assign it to me and I'll move it to Candidate.
          It'd be helpful to adjust the text about JFR to reflect the fact that it will, shortly, no longer be proprietary (http://openjdk.java.net/jeps/328).
          jcbeyler Jean Christophe Beyler added a comment -
          Hi Mark,

          Your edits look great to me and do you want me to wait for a particular moment to update the text for JFR or should/could I do it just now. It could look like JFR "soon to be open-sourced via JEP-328" and update the text around that.

          Thanks for your help!
          mr Mark Reinhold added a comment -
          To avoid any confusion, it's probably best to update the JFR text before this goes to Candidate. Otherwise people will ask (obvious) questions.
          jcbeyler Jean Christophe Beyler added a comment -
          Done, let me know what you think (I re-assigned it to you if that's ok)

            People

            • Assignee:
              jcbeyler Jean Christophe Beyler
              Reporter:
              rasbold Chuck Rasbold
              Owner:
              Jean Christophe Beyler
              Reviewed By:
              Robbin Ehn, Serguei Spitsyn
            • Votes:
              1
              Watchers:
              17