Uploaded image for project: 'JDK'
  1. JDK
  2. JDK-8006395

Race in async socket close on Linux

    XMLWordPrintable

    Details

    • Subcomponent:
    • Resolved In Build:
      b77
    • Verification:
      Verified

      Backports

        Description


        SHORT SUMMARY: Possible race condition which may cause the
        socket close failure.

        INDICATORS:
        Threads stuck in socket read and they correspond to the ESTABLISHED
        connections.

        These sockets are closed from other java threads and the customer suspects a race
        condition.

        "OUT Socket Receiver (/10.4.66.8:39918)" prio=1 tid=0x3220ac78
        nid=0x1bab runnable [2e1ff000..2e1ff86c]
                at java.net.SocketInputStream.socketRead0(Native Method)
                at java.net.SocketInputStream.read(SocketInputStream.java:129)
                at java.io.DataInputStream.readFully(DataInputStream.java:266)
                at java.io.DataInputStream.readFully(DataInputStream.java:242)
                at
        com.x.y.integration.systemconnection.messagedefinition.MessageSepara

        tor.separateByteMessage(MessageSeparator.java:213)
                at
        com.x.y.integration.systemconnection.outbound.socket.SocketEisContro

        ller.run(SocketEisController.java:316)

        NET_Read calls recv(2) in the startOp and endOp pair. startOp registers
        current thread to entry fdEntry and endOp checks the fdEntry's intr field.

        In closefd, fdEntry's intr is set to 1 and pthread_kill is called to the
        thread. However, this can happen before calling recv. In that case, recv is
        not interrupted by the pthread_kill and it blocks (unless close(2)/dup2(2)
        interrupts the recv() system call and it looks it does not happen).

        #define BLOCKING_IO_RETURN_INT(FD, FUNC) { \
            int ret; \
            threadEntry_t self; \
            fdEntry_t *fdEntry = getFdEntry(FD); \
            if (fdEntry == NULL) { \
                errno = EBADF; \
                return -1; \
            } \
            do { \
                startOp(fdEntry, &self); \
                ret = FUNC; \
                endOp(fdEntry, &self); \
            } while (ret == -1 && errno == EINTR); \
            return ret; \
        }

        int NET_Read(int s, void* buf, size_t len) {
            BLOCKING_IO_RETURN_INT( s, recv(s, buf, len, 0) );
        }

        COUNTER INDICATORS: na

        TRIGGERS: na

        KNOWN WORKAROUND: na

        PRESENT SINCE: 1.4.2

        HOW TO VERIFY:
        It looks it is difficult to create the suspected timing window from a java
        program.

        Support tried OpenJDK jdk7u-dev to widen the timing window.
        He put usleep to two places to make the timing window bigger.

        1. added usleep to closefd()
        static int closefd(int fd1, int fd2) {
                :
                threadEntry_t *curr = fdEntry->threads;
                while (curr != NULL) {
                    curr->intr = 1;
                    pthread_kill( curr->thr, sigWakeup );
                    if (getenv("SOCKET_CLOSE_DEBUG")) printf("D: kill %#x", curr->thr);
                    curr = curr->next;
                }


                if (getenv("SLEEP_BEFORE_CLOSE")) {
                  usleep(200000); // ### sleep 200msec after pthread_kill, but before dup2
                }

        2. added usleep just ofter startOp in modified version of
        BLOCKING_IO_RETURN_INT, which is called by NET_Read.

        #define BLOCKING_IO_RETURN_INT_READ(FD, FUNC) { \
            int ret; \
            threadEntry_t self; \
            fdEntry_t *fdEntry = getFdEntry(FD); \
            int blockingIODebug = (int)getenv("BLOCKING_IO_DEBUG"); \
            if (fdEntry == NULL) { \
                errno = EBADF; \
                return -1; \
            } \
            do { \
                startOp(fdEntry, &self); \
                if (blockingIODebug) { \
                  printf("Calling usleep in Net_X"); \
                } \
                usleep(200000); \
                if (blockingIODebug) { \
                  printf("usleep done in Net_X"); \
                } \
                ret = FUNC; \
                if (blockingIODebug) { \
                  printf("FUNC returned ret %d errno %d", ret, errno); \
                } \
                endOp(fdEntry, &self); \
            } while (ret == -1 && errno == EINTR); \
            return ret; \
        }


        int NET_Read(int s, void* buf, size_t len) {
            if (getenv("SOCKET_READ_DEBUG")) printf("D: NET_Read(%d,,%d)
        ", s, len);
            BLOCKING_IO_RETURN_INT_READ( s, recv(s, buf, len, 0) );
        }




        $ SLEEP_BEFORE_CLOSE=1 BLOCKING_IO_DEBUG=1 SOCKET_CLOSE_DEBUG=1
        SOCKET_READ_DEBUG=1 ~/local/openjdk/jdk7u-dev/build/linux-i586/bin/java
        -Dsocket.close.debug=1 AsyncCloseEx
        D: Client Calling read
        D: NET_Read(13,,128)
        Calling usleep in Net_X
        D: main calling close.
        D: PlainSocket close fdUseCount 1 closePending false
        D: PlainSocket calling socketPreClose() for fdUseCount != 0
        D: socketClose0 0xb779cf58 deferred 1 fd 13 use-dup2 1
        D: kill 0x8efffb70
        usleep done in Net_X
        D: dup2(11, 13) called
        D: socketClose0 0xb779cf58 deferred 1 fd 13 use-dup2 1 returning
        D: main close done.
        ### Ctrl- to show thread dump.

        ^2013-01-10 17:29:33
        Full thread dump OpenJDK Server VM (24.0-b28 mixed mode):

        "DestroyJavaVM" prio=10 tid=0x091d4000 nid=0x639 waiting on condition
        [0x00000000]
           java.lang.Thread.State: RUNNABLE

        "client" prio=10 tid=0x09364800 nid=0x647 runnable [0x8effe000] <<< blocked
        in socket read
           java.lang.Thread.State: RUNNABLE
            at java.net.SocketInputStream.socketRead0(Native Method)
            at java.net.SocketInputStream.read(SocketInputStream.java:152)
            at java.net.SocketInputStream.read(SocketInputStream.java:122)
            at AsyncCloseEx$1.run(AsyncCloseEx.java:29)

        "server" prio=10 tid=0x09307c00 nid=0x646 runnable [0x8f1cc000]
           java.lang.Thread.State: RUNNABLE
            at java.net.PlainSocketImpl.socketAccept(Native Method)
            at
        java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398)
            at java.net.ServerSocket.implAccept(ServerSocket.java:522)
            at java.net.ServerSocket.accept(ServerSocket.java:490)
            at AsyncCloseEx$2.run(AsyncCloseEx.java:53)

        "Service Thread" daemon prio=10 tid=0x092e9c00 nid=0x644 runnable
        [0x00000000]
           java.lang.Thread.State: RUNNABLE

        "C2 CompilerThread1" daemon prio=10 tid=0x092e7c00 nid=0x643 waiting on
        condition [0x00000000]
           java.lang.Thread.State: RUNNABLE

        "C2 CompilerThread0" daemon prio=10 tid=0x092e5400 nid=0x642 waiting on
        condition [0x00000000]
           java.lang.Thread.State: RUNNABLE

        "Signal Dispatcher" daemon prio=10 tid=0x092e3800 nid=0x641 waiting on
        condition [0x00000000]
           java.lang.Thread.State: RUNNABLE

        "Finalizer" daemon prio=10 tid=0x092ce800 nid=0x640 in Object.wait()
        [0x8f538000]
           java.lang.Thread.State: WAITING (on object monitor)
            at java.lang.Object.wait(Native Method)
            - waiting on <0xa9f35400> (a java.lang.ref.ReferenceQueue$Lock)
            at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:135)
            - locked <0xa9f35400> (a java.lang.ref.ReferenceQueue$Lock)
            at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:151)
            at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:177)

        "Reference Handler" daemon prio=10 tid=0x092c9c00 nid=0x63e in Object.wait()
        [0x8f588000]
           java.lang.Thread.State: WAITING (on object monitor)
            at java.lang.Object.wait(Native Method)
            - waiting on <0xa9f34fd8> (a java.lang.ref.Reference$Lock)
            at java.lang.Object.wait(Object.java:503)
            at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:133)
            - locked <0xa9f34fd8> (a java.lang.ref.Reference$Lock)

        "VM Thread" prio=10 tid=0x092c8400 nid=0x63c runnable

        "GC task thread#0 (ParallelGC)" prio=10 tid=0x091dd800 nid=0x63a runnable

        "GC task thread#1 (ParallelGC)" prio=10 tid=0x091df000 nid=0x63b runnable

        "VM Periodic Task Thread" prio=10 tid=0x092f3c00 nid=0x645 waiting on
        condition

        JNI global references: 137

        Heap
         PSYoungGen total 9408K, used 983K [0xa9f30000, 0xaa9a0000, 0xb4680000)
          eden space 8128K, 12% used [0xa9f30000,0xaa025f78,0xaa720000)
          from space 1280K, 0% used [0xaa860000,0xaa860000,0xaa9a0000)
          to space 1280K, 0% used [0xaa720000,0xaa720000,0xaa860000)
         ParOldGen total 21376K, used 0K [0x95080000, 0x96560000, 0xa9f30000)
          object space 21376K, 0% used [0x95080000,0x95080000,0x96560000)
         PSPermGen total 16384K, used 1878K [0x91080000, 0x92080000,
        0x95080000)
          object space 16384K, 11% used [0x91080000,0x91255a48,0x92080000)

        So, it was verified that the timing window really exists.

        NOTES FOR SE:

        Support notes :
        ---------------
        Using OpenJDK jdk7u-dev, I implemented an experimental fix. If it is
          
        enabled, the read socket thread won't stack in SocketInputStream.socketRead0
             
        even with the intentionally extended timing window we are suspecting.
             
                                                                                      
            
        Changes are below in solaris/native/java/net/linux_close.c:
             
        1. Added a function isInterrupted() which checks if the current thread entry
             
        has intr flag set.
             
           This must be called after locking corresponding fdEntry's lock.
             
        2. Changed BLOCKING_IO_RETURN_INT macro to use isInterrupted() after locking
             
        self.lock.
             
           It calls FUNC (in this case recv()), only if the current thread is not interrupted.
             
        closefd() calls pthread_kill and sets threadEntry's intr filed. Enclosed incr field change with lock/unlock with curr->lock.
             
        Guarding the recv with the same lock should make the macro not to miss the pthread_kill and thread entry's intr flag.
                         

        REGRESSION: no

        }

          Attachments

            Issue Links

              Activity

                People

                Assignee:
                chegar Chris Hegarty
                Reporter:
                shadowbug Shadow Bug
                Votes:
                0 Vote for this issue
                Watchers:
                10 Start watching this issue

                  Dates

                  Created:
                  Updated:
                  Resolved: