Uploaded image for project: 'JDK'
  1. JDK
  2. JDK-8245042

Improve scalability of reading Windows Performance counters via PDH when using the Process object

    Details

    • Type: Enhancement
    • Status: Resolved
    • Priority: P3
    • Resolution: Fixed
    • Affects Version/s: 11, 12, 13, 14, 15
    • Fix Version/s: 15
    • Component/s: hotspot
    • Subcomponent:
    • Resolved In Build:
      b24
    • OS:
      windows

      Description

      On Windows platforms, os_perf_windows.cpp is using the Performance Data Helper (PDH) library to interface with the Windows Performance counters, https://docs.microsoft.com/en-us/windows/win32/perfctrs/using-the-pdh-functions-to-consume-counter-data .

      Example: it is possible to read the CPU load (user and kernel) for a process by creating a PDH query, specifying a counter path:

        (general form): "\object(instance#index)\counter"
        (instantiation): "\Process(java#0)\% Processor Time"

      We have recently got reports about the JVM occasionally stalling on the Windows platform. Internally, this has appeared with batches of test runs timing out intermittently. The machines where the timeouts have occurred have been heavily saturated in terms of overall CPU load, also having a relatively large number of java processes running concurrently. For example, on a machine with 16 hardware threads, it was not uncommon to see approximately 30 java processes running at the same time.

      As the machine's overall CPU resources was saturated, in combination with a relatively large number of java processes (with the same base name, e.g. "java"), the PDH API round trip time for reading selected Windows Performance Counters seem to have increased substantially, especially pdh!PdhCollectQueryData was noted. An already slow PdhCollectQueryData call that in lighter circumstances would take about 30-40 ms to complete (10 ms on my local workstation), occasionally could spike up to several seconds. All observations have been on VM's running Windows_Server_2016_Standard with no sightings as yet with Windows_Server_2019_Standard.

      The wt output for pdh!PdhCollectQueryData is attached (taken from a local workstation running Windows 10 Enterprise).

      CPU graphs from the problematic processes indicate that time is spent to a high degree in kernel mode. A hypothesis is that pdh!PdhCollectQueryData runs into scalability issues when the machines gets saturated, and to some degree, this problem might be located in the kernel itself.

      As the PDH API round trip time started to get longer, we also did see another scalability issue, one more amenable to be addressed directly.

      Background / context:

      A process per PDH is not primarily identified by the process id, but by an opaque index (the '0' in "\Process\java#0") into a list maintained by the Windows kernel. To distinguish which process instance is the intended target for a query, the PDH Process API demands, at time of registration, a string describing the target process name concatenated with the value for this index. For example:

        "\Process(java#0)", "\Process(java#1)", ...

      The bad part is that this list is constantly in-flux as processes are exiting. One consequence is that processes with indexes greater than the one that just terminated is now shifted down by one. For example:

       if "\Process(java#1)" exits, "\Process(java#2)" now becomes "\Process(java#1)", "\Process(java#3)" becomes "\Process(java#2)" ...

      To make matters even more exciting, an already registered query is not invalidated when the process list changes. The query will continue to work just as before, or at least, so it seems. Only now, the query will read performance data from another process instance! That's right, the performance data is now read from the process that was shifted down by the kernel to occupy the index slot associated with our original registration.

      If you would like to see this phenomenon, then do the following:

      1. Start two instances of some process, for example, two instances of Notepad.
      2. Start Performance Monitor (perfmon), click to Add counters.
      3. Select the "Process" object, expand the list of counters (small arrow) and select counter, "%Processor Time"
      4. In the "instances of selected object" box, select the two instances started, Notepad and Notepad#1.
      5. The CPU usage for these two processes is displayed, each on a separate counter line. You can see this more clearly if you click 'h' (for highlight).
      6. Shake each Notepad window a little bit to have the CPU graphs rise, representing activity generated by the window corresponding to the process instance.
      7. Now click to close the first process instance of Notepad.
      8. Again shake the other Notepad window a little bit to see the graph counter line render. Only now, it does not continue in the original line as before, but instead picking up to continue in the counter line "belonging to" the process instance that was just terminated.

      In os_perf_windows.cpp, we attempt to parry this complication:

      The #index for a Process query can only decrease after process creation.

      We therefore create an array of counter queries for all process object instances up to and including ourselves:

      E.g. we come in as the third process instance (java#2), we then create and register queries for the following Process object instances:

           java#0, java#1, java#2

      current_process_query_index() finds the currently correct pdh process query index by inspecting the pdh process list, at a particular instant, i.e. just before we issue the real process query. Of course, this is an inherently racy situation because the pdh process list can change at any time.
      We use current_process_query_index() to help keep the number of data errors low, where a data error is defined to be the result of using a stale index to query the wrong process.

      This is an old issue and the convoluted workaround seem to be the best solution so far: https://stackoverflow.com/questions/7584861/race-condition-with-performance-counters-for-current-process

      In current_process_query_index(), to discover the current PDH index corresponding to the process id (i.e. to figure which query to use in the real call, as to not query the wrong process), a query is constructed using the following counter path:

      "\Process(java#0)\ID Process"

      The "ID Process" Windows counter returns the process id.

      The index, here with a value of 0, is a variable that ranges over the set of java processes (having the same name, i.e. "java").

      A problem related to scalability with the current implementation is the iteration that always starts at index 0 searching upwards in the attempt to discover the mapping of pdh index to pid. With a larger number of java processes (having the same name, i.e. "java#index"), with ranges now up to ~30, and PDH API round trips slowing down, a worst case could possibly take up to 30 seconds or more just to discover which index to use for the real query (in actual measurements, the longest query observed was at 16 seconds). Adding to this, the Java thread that issues these queries, at time of invocation, is in thread state, "thread_in_vm". This means a pending safepoint synchronization must wait until the call completes.

      We can do a couple of things to improve this situation:

      1. As a general recommendation, the provisioning strategy for the machines where these situations occur might have to be revised. The very high resource usage on the machines triggers scalability and performance issues, not only for the JVM, but also for the operating system.
      2. We can directly address and improve on the scalability issue related to current_process_query_index()
      3. We can avoid stalling safepoint synchronizations by letting Java threads transition to "thread_in_native" before invoking PDH process related calls.

      Number 3 will be handled in a separate issue, so only 2 will be covered here.

      Suggestion for current_process_query_index():

      Instead of always ranging over the entire set of processes, we can optimistically re-try the pdh index that was successfully used in the previous query, for example, let this index be, say, 29.

      We then construct a query with counter path: "\Process(java#29)\ID Process" to find out if the mapping " (29, pid) " is still valid.
      If it is, and it most likely will be, it is a constant, albeit slow, call to perform a single lookup query.

      If index 29 no longer maps our pid, it means the pdh process list in the kernel has changed, i.e. because other processes with the same name (i.e. "java") have terminated. A change in the pdh process list implies a decrement to the previous index, so we can seek downwards, instead of upwards, to find the updated, now downshifted, list index corresponding to our pid.

      "\Process(java#28)\ID Process"
      "\Process(java#27)\ID Process"

      We will need to find out how many java processes are live at initialization to get at the maximum_process_index (num_processes - 1), this will become the start index instead of 0.

      There exist a PDH API that can be used for this purpose:

      PdhExpandWildCardPath() - https://docs.microsoft.com/en-us/windows/win32/api/pdh/nf-pdh-pdhexpandwildcardpatha

      If we pass the following counter path to PdhExpandWildCardPath(): "\Process(java*)\ID Process", the result will be a listing of all live instances having base name "java". This result can then be tallied to derive our maximum_process_index value.

      Seeking the list downwards will also address another potential, but in reality very rare to occur, bug with the current implementation:

      Let our target index currently be at 3 and we are seeking upwards, from 0, 1 and 2.

      At this point, lets have the pdh process list in the kernel change because the process with index 0 terminated. Our target index just shifted down to index 2. But we have just processed index 2 so we will continue with 3, 4, 5….until __ max_intx __ ….now that could take a very long time indeed.

      We may also take the opportunity to revise os_perf_windows.cpp a bit in regards to better handling of error codes, improve some commentary / descriptions as well as to clean up and simplify a few aspects / constructs.

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                mgronlun Markus Grönlund
                Reporter:
                mgronlun Markus Grönlund
              • Votes:
                0 Vote for this issue
                Watchers:
                3 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: