JDK-8255954

[windows] UseNUMAInterleaving causes VM to balloon and hang

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: P2
    • Resolution: Duplicate
    • Affects Version/s: 16
    • Fix Version/s: 16
    • Component/s: hotspot
    • Labels:
    • Subcomponent:
      gc
    • OS:
      windows

      Description

      Observed on Windows x64 with more than one NUMA node configured.

      A VM started with -XX:+UseNUMAInterleaving hangs and endlessly prints "NUMA page allocation failed". Its virtual memory size balloons into the TB range while the working set size grows slowly. The VM has to be stopped forcefully.

      On that particular machine this was reproducible with a simple
      java -XX:+UseNUMA -XX:+UseNUMAInterleaving -version

      This bug started happening with https://bugs.openjdk.java.net/browse/JDK-8251158 ("Implementation of JEP 387: Elastic Metaspace").

      Analysis shows that we hang during initialization of Metaspace/CDS, in map_or_reserve_memory_aligned() in os_windows.cpp, in the loop starting at os_windows.cpp:3152.

      This function attempts to reserve an aligned region. This involves:
      1) reserving a larger region anywhere (no wish pointer) to take alignment into account
      2) releasing that region
      3) re-reserving at the aligned start address, in the hope that this region is still free.
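
The three steps above can be sketched as follows. This is a minimal, platform-neutral illustration and not the HotSpot code: `ReserveFn`, `ReleaseFn`, `align_up` and `reserve_aligned` are hypothetical names, and the callbacks stand in for whatever the platform uses to reserve and release address space. The sketch also assumes a bounded number of attempts, which the description above does not mention.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>

// Hypothetical callbacks standing in for the platform reserve/release calls.
// reserve(wish, size): with wish == 0 it may place the region anywhere and
// returns its base; with wish != 0 it returns wish on success. 0 means failure.
using ReserveFn = std::function<std::uintptr_t(std::uintptr_t wish, std::size_t size)>;
using ReleaseFn = std::function<void(std::uintptr_t base, std::size_t size)>;

// Round addr up to the next multiple of alignment (must be a power of two).
inline std::uintptr_t align_up(std::uintptr_t addr, std::size_t alignment) {
    const std::uintptr_t mask = static_cast<std::uintptr_t>(alignment) - 1;
    return (addr + mask) & ~mask;
}

// Reserve-release-re-reserve. Returns the aligned base, or 0 after
// max_attempts failed tries (the attempt limit is an assumption of this sketch).
inline std::uintptr_t reserve_aligned(std::size_t size, std::size_t alignment,
                                      const ReserveFn& reserve,
                                      const ReleaseFn& release,
                                      int max_attempts) {
    const std::size_t extra_size = size + alignment;
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        // 1) Over-reserve anywhere so an aligned start must lie inside.
        const std::uintptr_t raw = reserve(0, extra_size);
        if (raw == 0) return 0;
        // 2) Give the whole region back.
        release(raw, extra_size);
        // 3) Try to grab exactly the aligned sub-range before anyone else does.
        const std::uintptr_t aligned = align_up(raw, alignment);
        if (reserve(aligned, size) == aligned) return aligned;
    }
    return 0;
}
```

Step 3 only succeeds if step 2 really freed the whole range; if part of the over-sized region stays reserved, every attempt fails.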

      Note the difference from POSIX platforms, where we use mmap and can simply unmap the unaligned head and tail of the region. Since mappings on Windows are indivisible, this is not possible there, hence the release-and-hope loop.
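
For comparison, the POSIX approach of trimming an over-sized mapping might look like the sketch below. It is an illustration only, not the HotSpot implementation: `reserve_aligned_posix` is a hypothetical name, and the sketch assumes that `size` is a multiple of the page size and that `alignment` is a power of two no smaller than the page size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sys/mman.h>

// Reserve `size` bytes of address space whose start is a multiple of
// `alignment`. Returns nullptr on failure. No retry loop is needed because
// the unaligned head and tail can be unmapped individually.
inline void* reserve_aligned_posix(std::size_t size, std::size_t alignment) {
    const std::size_t extra_size = size + alignment;
    void* raw = mmap(nullptr, extra_size, PROT_NONE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (raw == MAP_FAILED) return nullptr;

    const std::uintptr_t start = reinterpret_cast<std::uintptr_t>(raw);
    const std::uintptr_t mask = static_cast<std::uintptr_t>(alignment) - 1;
    const std::uintptr_t aligned = (start + mask) & ~mask;

    const std::size_t head = aligned - start;            // unaligned front part
    const std::size_t tail = extra_size - head - size;   // surplus at the end

    // Trim both ends; the aligned middle part stays mapped throughout.
    if (head > 0) munmap(raw, head);
    if (tail > 0) munmap(reinterpret_cast<void*>(aligned + size), tail);
    return reinterpret_cast<void*>(aligned);
}
```

Because the aligned range is never released, no other thread can take it in between, which is exactly the guarantee the Windows loop lacks.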

      Current (still unproven) hypothesis is:
      1) We reserve memory in an interleaved fashion. This involves multiple VirtualAlloc calls. This causes the resulting mapping to be a patchwork of multiple mappings.
      2) We attempt to release that mapping using os::release_memory(). But that only releases the first mapping in this patchwork area and leaves the other mappings intact.
      3) We attempt to map into the aligned address and that fails.
      4) We repeat the loop. The unreleased virtual memory segments accumulate and cause virtual size to balloon.
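
The hypothesis can be made concrete with a toy model. The sketch below does not call any Windows API; it only simulates an address space as a set of separate mappings, under the assumption (taken from the hypothesis, still unproven) that an interleaved reservation consists of one mapping per chunk and that a release only removes the mapping starting at the given base. All names (`ToyAddressSpace`, `simulate_aligned_reserve`, `SimResult`) and the "place after the highest mapping" policy are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>

// Toy address space: base address -> size of each live mapping.
struct ToyAddressSpace {
    std::map<std::uintptr_t, std::size_t> mappings;

    bool is_free(std::uintptr_t addr, std::size_t size) const {
        for (const auto& m : mappings)
            if (addr < m.first + m.second && m.first < addr + size) return false;
        return true;
    }
    // Simplistic "anywhere" policy: just past the highest live mapping.
    std::uintptr_t pick_free_address() const {
        if (mappings.empty()) return 0x10000000;
        return mappings.rbegin()->first + mappings.rbegin()->second;
    }
    // Interleaved reservation: one separate mapping per chunk.
    std::uintptr_t reserve_interleaved(std::uintptr_t base, std::size_t size,
                                       std::size_t chunk) {
        for (std::size_t off = 0; off < size; off += chunk)
            mappings[base + off] = (size - off < chunk) ? size - off : chunk;
        return base;
    }
    // Release as hypothesized: only the mapping that starts at `base`.
    void release_first_mapping(std::uintptr_t base) { mappings.erase(base); }
    // Contrast case: release every mapping that starts inside the range.
    void release_range(std::uintptr_t base, std::size_t size) {
        mappings.erase(mappings.lower_bound(base), mappings.lower_bound(base + size));
    }
    std::size_t reserved_bytes() const {
        std::size_t sum = 0;
        for (const auto& m : mappings) sum += m.second;
        return sum;
    }
};

struct SimResult {
    bool succeeded;              // did an aligned reservation succeed?
    std::size_t reserved_bytes;  // address space still reserved at the end
};

inline SimResult simulate_aligned_reserve(int max_attempts, std::size_t size,
                                          std::size_t alignment, std::size_t chunk,
                                          bool release_all_chunks) {
    ToyAddressSpace space;
    const std::size_t extra_size = size + alignment;
    const std::uintptr_t mask = static_cast<std::uintptr_t>(alignment) - 1;
    for (int i = 0; i < max_attempts; ++i) {
        const std::uintptr_t raw =
            space.reserve_interleaved(space.pick_free_address(), extra_size, chunk);
        if (release_all_chunks) space.release_range(raw, extra_size);
        else                    space.release_first_mapping(raw);
        const std::uintptr_t aligned = (raw + mask) & ~mask;
        if (space.is_free(aligned, size)) {
            space.reserve_interleaved(aligned, size, chunk);
            return {true, space.reserved_bytes()};
        }
    }
    return {false, space.reserved_bytes()};
}
```

In this model, with a 4 MiB request, 4 MiB alignment and 64 KiB chunks, each failed attempt leaves all but the first chunk of the 8 MiB over-sized region behind, so the reserved size grows linearly with the number of attempts. Releasing every chunk lets the first attempt succeed.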

      I currently believe this is not caused by JEP 387 itself; rather, allocation patterns changed with JEP 387. For instance, we now allocate with larger alignments.

      Analysis is ongoing.

              People

              • Assignee:
                stuefe Thomas Stuefe
                Reporter:
                stuefe Thomas Stuefe
              • Votes:
                0
                Watchers:
                4
