Uploaded image for project: 'JDK'
  1. JDK
  2. JDK-4472904

JDK 1.3.1/1.3 - Deadlock in safepoint code (Java Application hangs)

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: Closed
    • Priority: P4
    • Resolution: Cannot Reproduce
    • Affects Version/s: 1.3.1, 1.4.0
    • Fix Version/s: None
    • Component/s: hotspot
    • Subcomponent:
    • CPU:
      generic, x86
    • OS:
      generic, windows_2000

      Description

      Customer Problem Description:
      ------------------------------

      After 3+ hours of execution, our Java application's VM is hanging. We cannot even signal the VM to do a thread dump. We are able to consistently reproduce this problem by running our application for an extended period of time under a considerable load. This problem is detrimental to the stability of our application and is a top priority for us.

      To diagnose the problem, we downloaded to JDK 1.3.0 source and debugged it. The problem is a deadlock in the safepoint synchronization code. At the time of the deadlock, SafepointSynchronize::_waiting_to_block has a value greater than zero, while all threads appear to be at their respective safepoints.

      As you know, the compiled safepoint handler copies the currently executing method and replaces almost all calls out of it with illegal instructions, which are later trapped to stop the thread. If the compiled code makes a JVM call, the thread's state changes from _thread_in_native_trans to _thread_in_native_blocked. If the thread's state is _thread_in_native_blocked when SafepointSynchronize::examine_state_of_thread() is called, roll_forward(_at_safepoint) is called. This causes the main thread to decrement _waiting_to_block on behalf of the running thread, because it is assumed to already be in a safepoint. However, in our case, the initial state of the thread is _thread_in_native. The main thread expects the running thread to decrement _waiting_to_block once it enters a safepoint. But, once our thread reaches the safepoint, it is in _thread_in_native_blocked. In this case SafepointSynchronize::block() does not decrement _waiting_to_block. Therefore, the main thread blocks indefinately waiting for _waiting_to_block to decrement to 0.

      The application consists of four java processes:

      1. A 100% Java home grown message server. This server communicates with
      the other components using TCP connections. This process does NOT
      experience the deadlock problem.
      2. An application server process. This consists of 47 threads. Each
      thread is a subcomponent of the system. Each subcomponent requests data
      objects from the message server, performs an action on the data, and
      outputs processed data objects into the message server. The
      subcomponents form a 6 stage pipeline in which work is performed. Some
      stages have multiple threads, and there are some application management
      threads, so that's how we wind up with 47 of them. This process DOES
      experience the deadlock and does so more frequently than #3.
      3. A cloudscape database server process. This process DOES experience
      the deadlock.
      4. A tomcat web server process. This process DOES NOT experience the
      deadlock.

      We have tested the application against 1.3.1 and we do experience the
      deadlock there as well.

      Here's some text/explainations I've been putting together for my own sanity. I've included some trace diagnostics. You'll notice some fields that don't look familiar. The most useful of these is _decremented_block.
      This is a per-thread variable that gets set whenever a thread decrements
      _waiting_to_block. SafepointSynchronize::begin() sets this value to 0
      for all threads before the still_waiting loop. If the _waiting_to_block
      is decremented on behalf of a thread by roll_forward(_at_safepoint),
      _decremented_block is set to 2. If the decrement is done by
      SafepointSynchronize::block(), _decremented_block is set to 1. If
      _decremented_block_ is 0, the decrement has not been done.

      -- attached diagnostic data --

      Situation : Compiled code handler

      Hang occurred after 3.5 hours of operation running Emissary. Trace log
      had the following status repeated:

      Threads_lock timeout: 0x1a24a870 [0x42c] _waiting_to_block=1
      block_signaled=1 is_synchronizing=1 count=1
      THE HANG occurred
      num_threads = 30 init_waiting=30 stage2_wtb=2 _waiting_to_block=1
      Thread: 0x1a1d42a0 [0x408] State: _compiled_handler ThreadState: 50
      pc: 0x98d7e3 _decremented_block=0
      SafepointHandler: CompiledCodeSafepointHandler
      Thread: 0x1a1e2898 [0x3f8] State: _at_safepoint ThreadState: 11 pc:
      0x0 _decremented_block=2
      Thread: 0x1a1cbc58 [0x728] State: _at_safepoint ThreadState: 4 pc: 0x0
      _decremented_block=2
      Thread: 0x1a1c6420 [0x460] State: _at_safepoint ThreadState: 10 pc:
      0x0 _decremented_block=2
      Thread: 0x78b488 [0x868] State: _at_safepoint ThreadState: 10 pc: 0x0
      _decremented_block=2
      Thread: 0x78a818 [0x504] State: _at_safepoint ThreadState: 10 pc: 0x0
      _decremented_block=2
      Thread: 0x1a1aec80 [0x718] State: _at_safepoint ThreadState: 10 pc:
      0x0 _decremented_block=2
      Thread: 0x1a1adea8 [0x970] State: _at_safepoint ThreadState: 10 pc:
      0x0 _decremented_block=2
      Thread: 0x1a1ad3d8 [0x63c] State: _at_safepoint ThreadState: 10 pc:
      0x0 _decremented_block=2
      Thread: 0x1a1ac990 [0x208] State: _at_safepoint ThreadState: 10 pc:
      0x0 _decremented_block=2
      Thread: 0x1a1bc4b8 [0x210] State: _at_safepoint ThreadState: 10 pc:
      0x0 _decremented_block=2
      Thread: 0x1a1ebca0 [0x588] State: _at_safepoint ThreadState: 10 pc:
      0x0 _decremented_block=2
      Thread: 0x8312c8 [0x628] State: _at_safepoint ThreadState: 10 pc: 0x0
      _decremented_block=2
      Thread: 0x1a24a590 [0x498] State: _at_safepoint ThreadState: 10 pc:
      0x0 _decremented_block=2
      Thread: 0x1a24a408 [0x6ec] State: _at_safepoint ThreadState: 10 pc:
      0x0 _decremented_block=2
      Thread: 0x786a60 [0x68c] State: _at_safepoint ThreadState: 10 pc: 0x0
      _decremented_block=2
      Thread: 0x7868f0 [0x540] State: _at_safepoint ThreadState: 10 pc: 0x0
      _decremented_block=2
      Thread: 0x1a24a870 [0x42c] State: _call_back ThreadState: 7 pc: 0x0
      _decremented_block=1
      Thread: 0x7888a8 [0x3b4] State: _at_safepoint ThreadState: 50 pc: 0x0
      _decremented_block=2
      Thread: 0x786df8 [0x450] State: _at_safepoint ThreadState: 10 pc: 0x0
      _decremented_block=2
      Thread: 0x1afb48a8 [0x6b0] State: _at_safepoint ThreadState: 10 pc:
      0x0 _decremented_block=2
      Thread: 0x787618 [0x5c8] State: _at_safepoint ThreadState: 11 pc: 0x0
      _decremented_block=2
      Thread: 0x835da0 [0x704] State: _at_safepoint ThreadState: 11 pc: 0x0
      _decremented_block=2
      Thread: 0x1a1a3710 [0x880] State: _at_safepoint ThreadState: 50 pc:
      0x0 _decremented_block=2
      Thread: 0x1a16ed70 [0x3a8] State: _at_safepoint ThreadState: 10 pc:
      0x0 _decremented_block=2
      Thread: 0x1a2380b8 [0x1f8] State: _at_safepoint ThreadState: 11 pc:
      0x0 _decremented_block=2
      Thread: 0x8122f0 [0x1e4] State: _at_safepoint ThreadState: 10 pc: 0x0
      _decremented_block=2
      Thread: 0x80f920 [0x144] State: _at_safepoint ThreadState: 10 pc: 0x0
      _decremented_block=2
      Thread: 0x1a0b0c60 [0x720] State: _at_safepoint ThreadState: 10 pc:
      0x0 _decremented_block=2
      Thread: 0x7844c0 [0x590] State: _at_safepoint ThreadState: 10 pc: 0x0
      _decremented_block=2


      Thread 0x42c printed out a timeout message, as part of fix 1. It
      appears to be unrelated, since thread 0x408 is the one that did not
      decrement waiting_to_block.

      Thread 0x408 has the following stack dump:

      Reported by Debugger Interpretation
      NTDLL! 77f827e8()
      KERNEL32! 77e86a3d()
      JVM! 08047031() SafepointSynchronize::block()safepoint.cpp line 322
      JVM! 0802713a() JNI_GetPrimitiveArrayCriticaljni.cpp line 1451
      Java_java_util_zip_CRC32_updateBytes(const JNINativeInterface_ * *
      0x1a1d42f4, _jobject * 0x1cfafa6c, long 0, _jobject * 0x1cfafa80, long
      0, long 4088) line 38 + 18 bytes

      0x408's current state is 50 (_thread_in_native_blocked). The particular
      section of SafepointSynchronize:block() in which the thread is blocked
      is structured as follows:

       
      switch (state)
      {

      // Code deleted

          case _thread_in_native_trans:
          case _thread_blocked_trans:
          case _thread_new_trans:

      // Code and comments deleted

          if (state == _thread_in_native_trans)
              thread->set_thread_state(_thread_in_native_blocked);

          // Block until the safepoint operation is completed.
          Threads_lock->lock_without_safepoint_check();

      The thread's state could not have been _thread_in_native_blocked when
      the function was entered, since it would not have made it into this
      section of the switch statement. Furthermore, the only way it could
      have changed to _thread_in_native_blocked, would have been if its value
      was _thread_in_native_trans. This state is consistent with the
      transition code on the stack below it. Therefore, this thread's state
      must have been _thread_in_native_trans when the function was entered.

      The thread's safepoint state is recorded as _compiled_handler. This
      means that its state was _thread_in_comp_leaf or _thread_in_Java when
      examine_thread_state() was called. The _compiled_handler state expects
      the thread to decrement _waiting_for_block. However, when
      SafepointSynchronize::block() is entered, the state causes a jump to a
      block of code that is normally called when the examine_thread_state()
      has already called roll_forward(_at_safepoint). Since roll_forward()
      automatically decrements _waiting_to_block when the safepoint state is
      set to _at_safepoint, block() does not decrement it again.

      decremented_block is 0 meaning none has decremented _waiting_for_block
      on its behalf.

      -- end of diagnostic data --
       

        Attachments

          Issue Links

            Activity

              People

              Assignee:
              kbr Kenneth Russell (Inactive)
              Reporter:
              cprasadsunw Ck Prasad (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved:
                Imported:
                Indexed: