JDK-4692906

HotSpot JVMs hang if thread suspend/resume is executed by non-Java code

    Details

    • Type: Bug
    • Status: Closed
    • Priority: P4
    • Resolution: Fixed
    • Affects Version/s: 1.3.1
    • Fix Version/s: 1.3.1_05
    • Component/s: hotspot
    • Labels:
    • Subcomponent:
    • Resolved In Build:
      05
    • CPU:
      x86
    • OS:
      windows_2000

        Description

        FULL PRODUCT VERSION :
        java version "1.3.1"
        Java(TM) 2 Runtime Environment, Standard Edition (build 1.3.1-b24)
        Java HotSpot(TM) Client VM (build 1.3.1-internal, mixed mode)

        FULL OPERATING SYSTEM VERSION :

        Microsoft Windows 2000 [Version 5.00.2195]
        Service pack 2

        ADDITIONAL OPERATING SYSTEMS :

        Likely any win32 NT-derived system.


        EXTRA RELEVANT SYSTEM CONFIGURATION :
        Problem occurs only on multiple processor configurations.

        A DESCRIPTION OF THE PROBLEM :
        HotSpot JVMs on win32 platforms mistakenly assume that the
        suspend count for a thread will only ever reach a depth of 1.

        If any third-party code (such as a JNI-reachable DLL)
        invokes win32 APIs to suspend and resume Java threads,
        the JVM falsely interprets the situation as an error
        condition and, one thread at a time, leaves threads in a hung
        state, eventually hanging the entire JVM process.
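
        To make the failure mode concrete, here is a minimal model of the
        documented suspend-count behavior of SuspendThread and ResumeThread
        (each returns the thread's previous suspend count). ModelThread,
        model_suspend and model_resume are hypothetical names used only for
        illustration; this is not HotSpot or Win32 code, and the model
        ignores the failure return value and the maximum suspend count.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the suspend count Windows keeps per thread.
struct ModelThread {
  std::uint32_t suspend_count = 0;
};

// Like SuspendThread: returns the previous count, then increments it.
std::uint32_t model_suspend(ModelThread& t) {
  return t.suspend_count++;
}

// Like ResumeThread: returns the previous count and decrements it only
// if it was non-zero. The thread runs again once the count reaches zero.
std::uint32_t model_resume(ModelThread& t) {
  const std::uint32_t prev = t.suspend_count;
  if (prev > 0) {
    --t.suspend_count;
  }
  return prev;
}

// One ordering in which a third-party DLL suspends the thread first.
void nested_suspend_demo() {
  ModelThread t;
  assert(model_suspend(t) == 0);  // DLL suspends: previous count 0
  assert(model_suspend(t) == 1);  // JVM suspends: previous count 1, not 0
  assert(model_resume(t) == 2);   // JVM resumes: thread is still suspended
  assert(model_resume(t) == 1);   // DLL resumes: thread can run again
  assert(t.suspend_count == 0);
}
```

        In this ordering the JVM's suspend call succeeds but returns 1, which
        is the value the "Win32 nested suspend" assertion rejects.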

        This situation occurs only as a race condition on
        multiple-CPU Windows machines. It doesn't arise in the -classic JVM
        implementation; however, the deprecation of that JVM means
        that we must have a fix for the HotSpot JVM to avoid
        process hangs.

        We have traced the problem to the win32 implementation of the
        JVM's Thread::resume_thread_impl and Thread::suspend_thread_impl
        bodies, in conjunction with the os::pd_resume_thread and
        os::pd_suspend_thread counterparts.

        We patched a JVM using the diffs below. The patched JVM appears to
        run all our apps without problems, though with race conditions
        there is always the chance of false positives.

        Here is a description of the changes:

        Changes are in Windows-specific code and have to do with how the
        JVM handles the return value from the Windows system calls
        SuspendThread and ResumeThread.

        os::pd_suspend_thread() was changed so that it returns 0 if the
        call to SuspendThread was successful and 1 if it was not. This is
        the documented behavior of the method and it is the behavior that
        the single caller of this method expects. Prior to this change, the
        method treated any non-zero return value from SuspendThread() as an
        error. But SuspendThread is documented to return the thread's
        previous suspend count, a value >= 0, on success.
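
        The return-value mapping described above can be sketched as plain
        functions that take the value SuspendThread returned. The names
        old_suspend_status, patched_suspend_status and kSuspendFailed are
        hypothetical and chosen for this illustration; the patch itself
        operates directly on the DWORD inside os::pd_suspend_thread().

```cpp
#include <cassert>
#include <cstdint>

// Value SuspendThread returns when the call fails, i.e. (DWORD)-1.
constexpr std::uint32_t kSuspendFailed = 0xffffffffu;

// Pre-patch behavior: the raw previous suspend count is passed back, so a
// nested suspend (previous count >= 1) looks like a failure to a caller
// that expects 0 on success.
long old_suspend_status(std::uint32_t ret) {
  return static_cast<long>(ret);
}

// Patched behavior: 0 on success, 1 only when the call itself failed.
long patched_suspend_status(std::uint32_t ret) {
  return ret == kSuspendFailed ? 1 : 0;
}

void suspend_status_demo() {
  assert(old_suspend_status(0) == 0);      // un-nested suspend: both agree
  assert(patched_suspend_status(0) == 0);
  assert(old_suspend_status(1) != 0);      // nested suspend misreported
  assert(patched_suspend_status(1) == 0);  // nested suspend is a success
  assert(patched_suspend_status(kSuspendFailed) == 1);
}
```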

        os::pd_resume_thread() was changed so that it returns 0 if the
        call to ResumeThread was successful and 1 if it was not. This is the
        documented behavior of the method and it is the behavior that the
        single caller of this method expects. The change also sets the
        thread state to RUNNABLE if the call to ResumeThread was successful.
        Strictly speaking, a thread is not runnable if the suspend count is
        greater than zero, but for the JVM's purposes, the thread is
        runnable. When the other entity (in our case, database client DLL)
        updates the suspend count so that the thread can run, then the JVM
        will already be in the correct state.
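
        The resume-side change can be sketched the same way. ModelState,
        ResumeOutcome, old_resume, patched_resume and kResumeFailed are
        hypothetical names for this illustration only, and the sketch leaves
        out the debug-build assert that the patch keeps on the failure path.

```cpp
#include <cassert>
#include <cstdint>

// Value ResumeThread returns when the call fails, i.e. (DWORD)-1.
constexpr std::uint32_t kResumeFailed = 0xffffffffu;

enum class ModelState { Suspended, Runnable };

struct ResumeOutcome {
  long status;       // 0 = success as seen by the caller, non-zero = failure
  ModelState state;  // JVM-side thread state after the call
};

// Pre-patch logic: only a previous count of exactly 1 marks the thread
// RUNNABLE and yields status 0; any other count yields a non-zero status.
ResumeOutcome old_resume(std::uint32_t ret) {
  ModelState state = ModelState::Suspended;
  if (ret == 1) {
    state = ModelState::Runnable;
  }
  return {static_cast<long>(ret) - 1, state};
}

// Patched logic: any successful ResumeThread call marks the thread
// RUNNABLE and yields 0; only the failure value yields 1.
ResumeOutcome patched_resume(std::uint32_t ret) {
  if (ret == kResumeFailed) {
    return {1, ModelState::Suspended};
  }
  return {0, ModelState::Runnable};
}

void resume_outcome_demo() {
  // Previous count 2: some other code still holds a suspend on the thread.
  assert(old_resume(2).status != 0);
  assert(old_resume(2).state == ModelState::Suspended);
  assert(patched_resume(2).status == 0);
  assert(patched_resume(2).state == ModelState::Runnable);
}
```

        With a previous count of 2 the pre-patch logic reports a failure and
        leaves the JVM-side state unchanged, while the patched logic records
        the thread as runnable and leaves the remaining suspend to whoever
        applied it.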

        *** hotspot1.3.1\src\os\win32\vm\os_win32.cpp.orig Sun May 6 03:04:54 2001
        --- hotspot1.3.1\src\os\win32\vm\os_win32.cpp Fri Apr 19 13:19:42 2002
        ***************
        *** 1536,1543 ****
                ret = SuspendThread(handle);
              }
              assert(ret != 0xffffffffUL, "SuspendThread failed"); // should propagate back
        ! assert(ret == 0, "Win32 nested suspend");
        ! return ret;
          }

          // Resume a thread by one level. This method assumes that consecutive
        --- 1536,1542 ----
                ret = SuspendThread(handle);
              }
              assert(ret != 0xffffffffUL, "SuspendThread failed"); // should propagate back
        ! return (ret == 0xffffffffUL);
          }

          // Resume a thread by one level. This method assumes that consecutive
        ***************
        *** 1554,1564 ****
          long os::pd_resume_thread(Thread* thread) {
            OSThread* osthread = thread->osthread();
            DWORD ret = ResumeThread(osthread->thread_handle());
        ! assert(ret != 0xffffffffUL, "ResumeThread failed"); // should propagate back
        ! if (ret == 1) {
        ! osthread->set_state(RUNNABLE);
            }
        ! return ret - 1;
          }


        --- 1553,1564 ----
          long os::pd_resume_thread(Thread* thread) {
            OSThread* osthread = thread->osthread();
            DWORD ret = ResumeThread(osthread->thread_handle());
        ! if (ret == 0xffffffffUL) {
        ! assert(false, "ResumeThread failed");
        ! return 1; // error return value
            }
        ! osthread->set_state(RUNNABLE);
        ! return 0; // success return value
          }




        REGRESSION. Last worked in version 1.3.1

        STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
        There is no simple test case. My company will be happy to
        provide the test framework we used for isolating and fixing
        the fault, but it involves use of licensed native code.

        Building a test case involves designing a system in which the
        right pattern of externally applied or JNI-invoked thread
        suspend and resume operations is performed on Java threads.

        It also requires a bit of luck and a multi-processor
        configuration, since it is a race condition generally
        triggered during HotSpot safepoint processing.
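
        As a rough illustration of why timing matters, the following sketch
        enumerates every ordering of one JVM suspend/resume pair and one
        external suspend/resume pair on the same thread, and counts the
        orderings in which the original checks would see an unexpected
        previous suspend count. It is a deterministic model, not a
        reproduction harness: RaceCounts and count_bad_interleavings are
        hypothetical names, and real failures additionally depend on
        safepoint timing on a multi-processor machine.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

struct RaceCounts {
  int total = 0;
  int nested_suspend = 0;  // JVM suspend saw a previous count != 0
  int failed_resume = 0;   // JVM resume saw a previous count != 1
};

RaceCounts count_bad_interleavings() {
  // 'J' = next JVM step, 'E' = next external step. Each actor suspends and
  // then resumes, so these six strings are all the possible orderings.
  const std::vector<std::string> schedules = {
      "JJEE", "JEJE", "JEEJ", "EJJE", "EJEJ", "EEJJ"};
  RaceCounts counts;
  for (const std::string& schedule : schedules) {
    std::uint32_t suspend_count = 0;
    int jvm_step = 0;
    int ext_step = 0;
    bool nested = false;
    bool failed = false;
    for (char actor : schedule) {
      int& step = (actor == 'J') ? jvm_step : ext_step;
      const std::uint32_t prev = suspend_count;  // what the API would return
      if (step == 0) {
        ++suspend_count;  // suspend
        if (actor == 'J' && prev != 0) nested = true;
      } else {
        --suspend_count;  // resume
        if (actor == 'J' && prev != 1) failed = true;
      }
      ++step;
    }
    ++counts.total;
    if (nested) ++counts.nested_suspend;
    if (failed) ++counts.failed_resume;
  }
  return counts;
}

void race_counts_demo() {
  const RaceCounts counts = count_bad_interleavings();
  assert(counts.total == 6);
  assert(counts.nested_suspend == 2);
  assert(counts.failed_resume == 2);
}
```

        In this model two of the six orderings give the JVM's suspend a
        non-zero previous count, and two give its resume a previous count
        other than 1; the remaining orderings behave as the original code
        expected, which is consistent with the bug appearing only
        occasionally.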

        EXPECTED VERSUS ACTUAL BEHAVIOR :
        Expected results: process doesn't hang.
        Actual results: process eventually hangs, with all evidence
        of the cause obscured, since whatever triggered the
        all-threads-waiting symptoms has long since passed by the
        time of the hang.

        ERROR MESSAGES/STACK TRACES THAT OCCUR :
        # HotSpot Virtual Machine Error, assertion failure
        # Please report this error at
        # http://java.sun.com/cgi-bin/bugreport.cgi
        #
        # assert(ret == 0, "Win32 nested suspend")
        #
        # Error happened during: scavenge
        #
        # Error ID: E:\jdk131src\hotspot1.3.1\src\os\win32\vm\os_win32.cpp, 1539
        #
        # Problematic Thread: prio=5 tid=0x9cee98 nid=0x102c runnable

        or

        # HotSpot Virtual Machine Error, assertion failure
        # Please report this error at
        # http://java.sun.com/cgi-bin/bugreport.cgi
        #
        # assert(v_false, "resume thread failed")
        #
        # Error happened during: scavenge
        #
        # Error ID: E:\jdk131src\hotspot1.3.1\src\share\vm\runtime\thread.cpp, 503
        #
        # Problematic Thread: prio=5 tid=0x9cee98 nid=0x1250 runnable

        Windbg C++ and Java stack traces are available on request. They break
        the web-based bug submission if included here.

        This bug can be reproduced occasionally.

        ---------- BEGIN SOURCE ----------
        Simple test source code is unavailable. However diffs to the JVM that
        fix the bug are available, and included in the description if they
        didn't cause the bug submission to break. If not in the description,
        please contact the submitter for fix diffs.

        We're also happy to make available the test bed to reproduce the
        problem, but it isn't a simple test case.
        ---------- END SOURCE ----------

        CUSTOMER WORKAROUND :
        These won't work for our customers who need high performance
        and scalable deployments, but this bug can be worked around
        in four ways:
        1) use the -classic JVM
        2) use a single-processor machine
        3) bind the Java process affinity to one CPU on Windows
        4) use a non-Windows platform.
        workaround:
        comments: (company - eXcelon Corporation , email - ###@###.###)

                People

                Assignee:
                chegar Chris Hegarty
                Reporter:
                chegar Chris Hegarty