Details

    • Subcomponent:
    • Resolved In Build:
      b02
    • CPU:
      sparc
    • OS:
      solaris_8

      Description

      Operating System:SUN Solaris
      OS version:Solaris 8
      Product Name:JRE (SUN ORB part of it anyway)
      Product version:1.4.2_05 (SUN ORB part of it anyway)
      Hardware platform:SunFire V880
      Severity: 1

      Short description:
      When the default values of (com.sun.CORBA.connection.ORBHighWaterMark | com.sun.CORBA.connection.ORBLowWaterMark) are exceeded, the "JavaIDL Listener" thread becomes permenantly (perhaps semi-permenantly) hung in trying to cleanup a particular IIOPConnection, which prevents any and all new clients from successfully connecting to this CORBA Server (EventBrokerServer). The server eventually had to be killed in order to recover it from its problematic state. The "JavaIDL Listener" thread was stuck in the state described in the below stack trace. This problem has manifested itself on 2 other servers.

      Full problem description:
      When the default values of (com.sun.CORBA.connection.ORBHighWaterMark | com.sun.CORBA.connection.ORBLowWaterMark) are exceeded and the SUN ORB is trying to cleanup,

      1. the following "com.sun.corba.se.internal.iiop.IIOPConnection$DeleteConn" exceptions to be thrown and noticed in the file that is capturing the stdout and sdterr streams.

      com.sun.corba.se.internal.iiop.IIOPConnection$DeleteConn
      at com.sun.corba.se.internal.iiop.IIOPConnection.delete(Unknown Source)
      at com.sun.corba.se.internal.iiop.IIOPConnection.writeLock(Unknown Source)
      at com.sun.corba.se.internal.iiop.IIOPConnection.purge_calls(Unknown Source)
      at com.sun.corba.se.internal.iiop.ReaderThread.run(Unknown Source)
      com.sun.corba.se.internal.iiop.IIOPConnection$DeleteConn
      at com.sun.corba.se.internal.iiop.IIOPConnection.delete(Unknown Source)
      at com.sun.corba.se.internal.iiop.IIOPConnection.writeLock(Unknown Source)
      at com.sun.corba.se.internal.iiop.IIOPConnection.purge_calls(Unknown Source)
      at com.sun.corba.se.internal.iiop.ReaderThread.run(Unknown Source)
      com.sun.corba.se.internal.iiop.IIOPConnection$DeleteConn
      at com.sun.corba.se.internal.iiop.IIOPConnection.delete(Unknown Source)

      "JavaIDL Listener" daemon prio=5 tid=0x0056cbe0 nid=0xc waiting for monitor entry [ed6ff000..ed6ffc30]
      at com.sun.corba.se.internal.iiop.IIOPConnection.cleanUp(Unknown Source)
      - waiting to lock <0xf0c4f0f8> (a java.lang.Object)
      - locked <0xf0c4ebf0> (a com.sun.corba.se.internal.iiop.IIOPConnection)
      at com.sun.corba.se.internal.iiop.ConnectionTable.cleanUp(Unknown Source)
      at com.sun.corba.se.internal.iiop.ConnectionTable.checkConnectionTable(Unknown Source)
      at com.sun.corba.se.internal.iiop.ListenerThread.run(Unknown Source)

      2. The "JavaIDL Reader" thread shown below in a permenant (possibly semi-permenant) hung state because it had synchronized on com.sun.corba.se.internal.iiop.IIOPConnection's "stateEvent" and within that synchronization block had synchronized on "writeEvent" and then was waiting to be notified on the "writeEvent" while holding on to the synchronization on "stateEvent". No one seems to be notifying it.

      "JavaIDL Reader for 10.97.4.21:2783" daemon prio=5 tid=0x005dcc08 nid=0x24e2 in Object.wait() [e987f000..e987fc30]
      at java.lang.Object.wait(Native Method)
      at java.lang.Object.wait(Unknown Source)
      at com.sun.corba.se.internal.iiop.IIOPConnection.writeLock(Unknown Source)
      - locked <0xf0c4f100> (a java.lang.Object)
      - locked <0xf0c4f0f8> (a java.lang.Object)
      at com.sun.corba.se.internal.iiop.IIOPConnection.purge_calls(Unknown Source)
      at com.sun.corba.se.internal.iiop.ReaderThread.run(Unknown Source)

      3. At a later point in time, when the "JavaIDL Listener" thread detected that it needed to cleanup some connections and selected and tried to cleanup the above mentioned IIOPConnection, it eventually ended up calling writeLock() and getting permenantly (semi-permenantly) blocked on trying to synchronize on "stateEvent".

      "JavaIDL Listener" daemon prio=5 tid=0x0056cbe0 nid=0xc waiting for monitor entry [ed6ff000..ed6ffc30]
      at com.sun.corba.se.internal.iiop.IIOPConnection.cleanUp(Unknown Source)
      - waiting to lock <0xf0c4f0f8> (a java.lang.Object)
      - locked <0xf0c4ebf0> (a com.sun.corba.se.internal.iiop.IIOPConnection)
      at com.sun.corba.se.internal.iiop.ConnectionTable.cleanUp(Unknown Source)
      at com.sun.corba.se.internal.iiop.ConnectionTable.checkConnectionTable(Unknown Source)
      at com.sun.corba.se.internal.iiop.ListenerThread.run(Unknown Source)

      I can be noticed that the "JavaIDL Listener" has locked <0xf0c4ebf0> and is waiting to lock <0xf0c4f0f8> which has been locked by the "JavaIDL Reader for 10.97.4.21:2783", which has subsequently locked <0xf0c4f100> and is waiting for a notification on the same which never happens.

      Code snipets from SUN's CORBA code.
      com.sun.corba.se.internal.iiop.IIOPConnection

          /**
           * Cleans up this Connection.
           * This is called from ConnectionTable, from the ListenerThread.
           * Note:it is possible for this to be called more than once
           */
          public synchronized void cleanUp() throws Exception
          {
              writeLock();
      ...
          }

          /**
           * Sets the writeLock for this connection.
           * If the writeLock is already set by someone else, block till the
           * writeLock is released and can set by us.
           * IMPORTANT: this connection's lock must be acquired before
           * setting the writeLock and must be unlocked after setting the writeLock.
           */
          protected boolean writeLock()
          {

              // Keep looping till we can set the writeLock.
              while ( true ) {
                  synchronized ( stateEvent ){
                      switch ( state ) {
                      
                      case OPENING:
                          try {
                              stateEvent.wait();
                          } catch (InterruptedException ie) {};
                          // Loop back
                          break;
                      
                      case ESTABLISHED:
                          synchronized (writeEvent) {
                              if (!writeLocked) {
                                  writeLocked = true;
                                  return true;
                              }
                          
                              try {
                                  writeEvent.wait();
                              } catch (InterruptedException ie) {};
                          }
                          // Loop back
                          break;
                      
                          //
                          // XXX
                          // Need to distinguish between client and server roles
                          // here probably.
                          //
                      case ABORT:
                          throw new COMM_FAILURE( MinorCodes.WRITE_ERROR_SEND,
                                                  CompletionStatus.COMPLETED_NO);
                           
                      case CLOSE_RECVD:
                          // the connection has been closed or closing
                          // ==> throw rebind exception
                      
                          throw new COMM_FAILURE( MinorCodes.CONN_CLOSE_REBIND,
                                                  CompletionStatus.COMPLETED_NO);
                      
                      default:
                          if (orb.transportDebugFlag)
                              dprint("Connection:writeLock: weird state");
                      
                          delete(Connection.CONN_ABORT);
                          return false;
                      }
                  }
              }
          }

          /**
           * Release the write lock on this connection.
           */
          protected void writeUnlock()
          {
              synchronized (writeEvent) {
                  writeLocked = false;
                  writeEvent.notify(); // wake up one guy waiting to write
              }
          }

      "JavaIDL Reader" thread is stuck on writeEvent.wait() and
      "JavaIDL Listener" thread is stuck on synchronized (stateEvent).

      Basically no one seems to be calling writeUnlock() which is the only way the "JavaIDL Reader" thread will release the lock on "stateEvent" and allow the "JavaIDL Listener " thread to proceed.

      We have three thread dumps from 3 different servers that can be sent via email once you contact us.

      Test case:
      We are working on it, but as you can imagine it is not very easy to create a test case for this since we are not as familiar with the internals of the SUN ORB and do not know exactly when and how the problem can be made to manifest itself.
      What we are sure of are the following things.
      1. The problem seems to be manifesting itself only after the (com.sun.CORBA.connection.ORBHighWaterMark | com.sun.CORBA.connection.ORBLowWaterMark) are exceeded and cleaup code is executed.
      2. It seems that the problem can happen regardless of who is executing the cleanup code i.e. "JavaIDL Listener" or some other thread.
      3. Some "JavaIDL Reader" thread invariably gets stuck waiting for the "writeLocked" on its IIOPConnection to be set to false which seems to never happen.
      4. Since the thread mentioned in #3 also has other lock that eventually "JavaIDL Listener" thread gets stuck on, the problem manifests itself.

      including herewith 3 files each of which represents one of our servers. Each file has multiple threaddumps and the last threaddump in each shows the exact scenario described as part of the problem report.
       
      Also, we have been digging deeper and we have discovered that the following seems to be at the core of the issue.
       
      It seems that when the "JavaIDL Listener" (com.sun.corba.se.internal.iiop.ListenerThread) thread is trying to establish a new connection and discovers that cleanup needs to be done, it calls com.sun.corba.se.internal.iiop.ConnectionTable.checkConnectionTable() which in turn calls com.sun.corba.se.internal.iiop.ConnectionTable.cleanup() which identifies the connection to cleanup and then calls com.sun.corba.se.internal.iiop.IIOPConnection.cleanup(). I have cut and pasted that code below. We believe that when this method calls sendCloseConnection(), this creates a classic race condifiton, because this method performs the writeLock() and then one of two things can happen.
      1. The Listener thread will successfully lock the stateEvent before any other thread has a chance to lock it and in this case all the right things will happen. This is the green light scenario.
      2. The sendCloseConnection() would cause the connection to be closed from the other end, which would cause the "JavaIDL Reader" (com.sun.corba.se.internal.iiop.ReaderThread) thread that is doing socketRead() to get a IOException, in which case it will end up calling the com.sun.corba.se.internal.iiop.IIOPConnection.purge_calls() with lockHeld set to false. I have also cut and pasted that code below. As you can see, since the lochHeld is false, purge_calls() calls com.sun.corba.se.internal.iiop.IIOPConnection.writeLock(). I have also cut and pasted that code below. As you can see, the com.sun.corba.se.internal.iiop.IIOPConnection.writeLock() method locks the stateEvent and then subsequently locks the writeEvent, and since writeLocked would be set to true becaue the Listener thread has already called writeLock(), it simply does a wait() on the writeEvent to be informed when the writeLocked gets set to false. But unfortunately this will be a permenant wait because only the Listener thread would call writeUnlocked() but since it first needs to lock stateEvent (which is currently locked by Reader thread) it is waiting for the stateEvent to be unlocked which will never be unlocked because the Reader thread has it locked and is waiting to be notified on the writeEvent and so on and so forth. So while not actually technically a deadlock situation, it has for all intents and purposes the same impact as a deadlock. This is the red light scenario. Basically if the reader thread beats the listener thread in locking the stateEvent, we are permenantly screwed. This is purely a timing window issue and is exactly what is happening in the problems that we are experiencing.
       
      One of our engineers (Tai-Sin Quah) has developed a partial test case which while not reproducing the exact same situation, does reproduce something pretty similar and close to it. I am calling it partial because it only reporoduces the problem on Windows platform (all the time on Windows XP amd sometimes on Windows 2000) but not on SUN platform. BUT the reason I am mentioning this is that when he commented out the sendCloseConnection() (we are not suggesting that that is the right solution, we were simply hacking to prove our theory) the problem could not be reproduced at all thereby proving our analysis.
       
      We would also like to point out further that the following code in the cleanup() method
                  // stop the reader without causing it to do purge_calls
                  Exception ex = new Exception();
                  reader.stop(ex);
      indicates that the developers were indeed aware of the race condition and were trying to prevent it by stopping the reader thread. Unfortunately there are two problems/concerns with this code.
      1. A deprecated and unsafe method stop() is being used which SUN strongly encourages recommends be not used.
      2. It is being done too late in the game to prevent the race condition mentioned above. In the event that listener thread wins the race condition to lock stateEvent (green light scenario) this may help, but in the event that the reader thread wins the race condition to lock stateEvent (red light scenario) the listener thread will never get to that safety net code and for all intents and purposes the listener thread and the reader thread will be lockedup for ever.
       
      Hopes all this analysis helps us to get to a quick effecient and elegant resolution to the problem at hand.

      cannot upgrade to 1.4.2_08 because the version of our product that the customer is running was QAed and certified only against 1.4.2_05 and also because it happens to be a 24X7 highly available production environment.
       
      However, the testcase that was mentioned has exhibited the following behaviour so far.
       
      1. On Windows XP, it reproduces the issue everytime with 1.4.2_06 as well as 1.4.2_08.
      2. On Windows 2000, we have only tried 1.4.2_08. It reproduces the problem only if
          a) the stdout and stderr are redirected to a flat file
          b) the stderr is redirected to a flat file.
      3. On Solaris, we have only tried 1.4.2_08 and the problem cannot be reproduced.

      *** Attached 4 Files including 3 thread dumps and code snippet file having suspicion ***

      Thanking you in anticipation.

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                hgodugusunw Harsha Godugu (Inactive)
                Reporter:
                sdattatrsunw Sreenatha Dattatri (Inactive)
              • Votes:
                0 Vote for this issue
                Watchers:
                0 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved:
                  Imported:
                  Indexed: