Expose JDK Flight Recorder data for continuous monitoring.
- Provide an API for the continuous consumption of JFR data on disk, for both in-process and out-of-process applications.
- Record the same set of events as in the non-streaming case, with overhead less than 1% if possible.
- Event streaming must be able to co-exist with non-streaming recordings, both disk and memory based.
- It is not a goal to provide synchronous callbacks for consumers.
- It is not a goal to allow consumption of in-memory recordings.
The HotSpot VM emits more than 500 data points using JFR, most of which are not available through any means other than parsing log files.
To consume the data today, a user must start a recording, stop it, dump the contents to disk and then parse the recording file. This works well for application profiling, where typically at least a minute of data is being recorded at a time, but not for monitoring purposes. An example of monitoring usage is a dashboard which displays dynamic updates to the data.
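The start, stop, dump, and parse workflow described above can be sketched with the existing `jdk.jfr` API. This is a minimal illustration, not a recommended monitoring setup: the class name `DumpAndParse`, the method `recordAndParse`, and the choice of the `jdk.JavaMonitorEnter` event with a 10 ms threshold are assumptions made for the example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.util.List;
import jdk.jfr.Recording;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordingFile;

// Hypothetical helper showing the non-streaming workflow.
final class DumpAndParse {

    // Records for the given number of milliseconds, dumps the recording
    // to a temporary file, and parses every event in that file.
    static List<RecordedEvent> recordAndParse(long millis)
            throws IOException, InterruptedException {
        Path file = Files.createTempFile("recording", ".jfr");
        try (Recording recording = new Recording()) {
            // Only record monitor contention longer than 10 ms.
            recording.enable("jdk.JavaMonitorEnter")
                     .withThreshold(Duration.ofMillis(10));
            recording.start();
            Thread.sleep(millis);      // the application runs here
            recording.stop();
            recording.dump(file);      // copy data into a separate file
        }
        try {
            // Events become visible only now, after the dump.
            return RecordingFile.readAllEvents(file);
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

No event reaches the consumer until the recording has been stopped and dumped, which is why this approach fits profiling better than a dashboard that needs frequent updates.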
There is overhead associated with creating a recording, such as:
- Emitting events that must occur when a new recording is created,
- Writing event metadata, such as the field layout,
- Writing checkpoint data, such as stack traces, and
- Copying data from the disk repository to a separate recording file.
If there were a way to read data being recorded from the disk repository without creating a new recording file, much of this overhead could be avoided.
Define an API by which users can subscribe to events asynchronously.
The following code snippet is a conceptual illustration of how such an API might look. It shows how to print all classes on which threads have blocked for more than 10 ms. If a consumer is not able to keep up, events will be dropped after 600 seconds.
    EventStream.start("jdk.JavaMonitorEnter", "threshold", "10 ms")
        .maxAge(Duration.ofSeconds(600))
        .consume(event -> System.out.println(event.getClass("monitorClass")));
This creates a recording and, at a given interval (perhaps once every two seconds), flushes events stored in memory and thread-local buffers to the disk repository. A separate thread parses the most recent file, up to the point at which data has been written, and pushes the events to the consumers. It's an open question how to handle flow control with multiple subscribers, but perhaps the java.util.concurrent.Flow API could be used.
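To make the flow-control idea concrete, here is a small sketch using `java.util.concurrent.Flow` together with `SubmissionPublisher`. It is not part of the proposal: events are modeled as plain strings, and the names `FlowSketch`, `Collector`, and `publish` are hypothetical. The subscriber requests one item at a time, so the publisher never delivers more than the subscriber has asked for.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Flow;
import java.util.concurrent.SubmissionPublisher;

// Hypothetical sketch: strings stand in for JFR events.
final class FlowSketch {

    // A subscriber that pulls one event at a time (explicit demand).
    static final class Collector implements Flow.Subscriber<String> {
        final List<String> received = new CopyOnWriteArrayList<>();
        final CountDownLatch done = new CountDownLatch(1);
        private Flow.Subscription subscription;

        @Override public void onSubscribe(Flow.Subscription s) {
            subscription = s;
            s.request(1);                 // ask for the first event
        }
        @Override public void onNext(String event) {
            received.add(event);
            subscription.request(1);      // ready for the next event
        }
        @Override public void onError(Throwable t) { done.countDown(); }
        @Override public void onComplete()         { done.countDown(); }
    }

    // Publishes all events to one subscriber and waits for completion.
    static List<String> publish(List<String> events) throws InterruptedException {
        Collector collector = new Collector();
        try (SubmissionPublisher<String> publisher = new SubmissionPublisher<>()) {
            publisher.subscribe(collector);
            for (String event : events) {
                publisher.submit(event);  // blocks if the subscriber's buffer is full
            }
        }                                 // close() signals onComplete
        collector.done.await();
        return collector.received;
    }
}
```

With `submit`, a slow subscriber eventually blocks the producer. `SubmissionPublisher.offer` is a non-blocking alternative that lets the producer drop items for a lagging subscriber, which is closer to the "events will be dropped" behavior described above.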
JMX notifications provide a means for the JDK and third-party applications to expose information for continuous monitoring. There are, however, drawbacks that make JMX unsuited for the purpose of this JEP.
- Data points in the JVM are often collected at places where a call into Java code is not possible, for instance during a GC-induced safepoint.
- Developer time has already been invested in collecting data using JFR. Rewriting all those probe points for JMX would be a very large effort.
- JMX doesn't provide a mechanism to filter out events before they are sent, which means that the system could easily be flooded.
- Complex data structures with references, such as stack traces, can't be efficiently represented using Open MBean types.
- Verify that the feature doesn't have any memory leaks.
- Verify that the feature has stable performance over time (appropriate stress testing).
- Write unit tests for all exported methods.
- Validate that event subscriptions work with other recordings running simultaneously.
- Verify that the API works well out of the box.
- Verify that the API is suitable for forwarding event data for consumption by other frameworks.
- Verify that the API is suitable for environments where low latency is important (minimal GC pauses).
- Verify that the API is suitable for tool vendors, i.e., that data arrives at a rate suitable for charting.
- Verify that the API is secure: it should not be possible to get a callback in a privileged thread context.
- Validate that the overhead is acceptable.
- Verify that it's not possible to create infinite recursion in subscribers.
Risks and Assumptions
- Operations in API callbacks may provoke JFR events, which could lead to infinite recursion. This can be mitigated by not recording events in such a situation.
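One way to picture this mitigation is a per-thread flag that is set while a callback runs, so that any event provoked by the callback on that thread is skipped rather than recorded. The sketch below is an assumption about how such a guard could work, not a description of the JFR implementation: `GuardedRecorder`, `onEvent`, and `record` are hypothetical names, and events are modeled as strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical recorder that suppresses events emitted from inside a callback.
final class GuardedRecorder {
    // True while the current thread is executing a subscriber callback.
    private final ThreadLocal<Boolean> inCallback =
            ThreadLocal.withInitial(() -> Boolean.FALSE);

    final List<String> recorded = new ArrayList<>();
    int suppressed = 0;
    private Consumer<String> callback = event -> { };

    void onEvent(Consumer<String> callback) {
        this.callback = callback;
    }

    void record(String event) {
        if (inCallback.get()) {
            suppressed++;          // event provoked by a callback: do not record
            return;
        }
        recorded.add(event);
        inCallback.set(Boolean.TRUE);
        try {
            callback.accept(event);
        } finally {
            inCallback.set(Boolean.FALSE);
        }
    }
}
```

If a callback itself calls `record`, the nested event is counted in `suppressed` and the recursion stops after one level.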