The current recommended patch cluster is loaded, and the exact Java
version is 1.1.7_08 with native threads.
We are working to reproduce the problem with a smaller example, and
hope to have one soon.
Here's the customer's description of the problem:
The servers are Java applications receiving RMI calls from a
Servlet. The problem occurs in the socket handling underneath the
RMI runtime code, but I believe it is a socket problem, not an RMI
problem. The specific situation is that a thread blocked on a
ServerSocket.accept method occasionally gets an I/O exception
"interrupted system call" thrown. If we turn around and retry the
accept method on the socket, it blocks as expected, but about 50% of
the time it will no longer accept any new connections. Clients will
connect to the server and block on I/O, but the server accept method
never returns, so the system appears hung to these new connections.
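The retry behavior described above can be sketched as follows. This is an illustrative stand-alone example, not our actual code; the class and method names (AcceptRetry, acceptWithRetry) are made up for this sketch, and the main method just connects to itself so the accept returns.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.net.ServerSocket;
import java.net.Socket;

public class AcceptRetry {

    // Bounded retry around a blocking accept, for the case where the JVM
    // surfaces an interrupted system call (EINTR) as InterruptedIOException.
    static Socket acceptWithRetry(ServerSocket server, int maxRetries)
            throws IOException {
        for (int attempt = 0; ; attempt++) {
            try {
                return server.accept(); // blocks until a client connects
            } catch (InterruptedIOException e) {
                // "interrupted system call" lands here; as described above,
                // retrying sometimes leaves the socket deaf to new clients
                if (attempt >= maxRetries) {
                    throw e;
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        final ServerSocket server = new ServerSocket(0); // any free port
        final int port = server.getLocalPort();
        // Connect to ourselves so the accept below returns promptly.
        Thread client = new Thread(new Runnable() {
            public void run() {
                try {
                    new Socket("127.0.0.1", port).close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        });
        client.start();
        Socket accepted = acceptWithRetry(server, 3);
        System.out.println("accepted connection: " + (accepted != null));
        accepted.close();
        server.close();
    }
}
```

On a healthy socket this prints "accepted connection: true"; in the failure mode described above, the retried accept simply never returns.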
An interesting side item that may or may not be related to this hang
is that after the server processes are killed, the OS has been
observed to have sockets left in a CLOSE_WAIT state that lasts
indefinitely. The passive close that should wipe them out in a few
minutes or hours never occurs, even days later. A system reboot is
the only thing that cleans this up, which is especially nasty because
they are tying up port numbers used by our application servers.
This customer is using a version of our product, Windchill, which is
running in a 1.1.7 native threads JVM and uses Oracle 8.0 OCI JDBC
drivers (type II, native code), and is running on Solaris 2.6. We
had a similar problem over a year ago on HP/UX and determined that it
was occasionally occurring when the Oracle OCI drivers opened new
database connections. That was solved by configuring the server to
use a fixed size connection pool so it wouldn't need to open new ones
during its normal lifetime. We didn't see it again until now.
This particular site has a customization that makes heavy use of JNDI
to access a Netscape directory server using LDAP, but is otherwise
unexceptional. We have other Solaris sites running the same
configuration that haven't reported this problem yet, but this is
probably the busiest one.
I've looked through the bug parade and seen several reports of
problems with socket methods throwing "interrupted system call"
exceptions, but I haven't been able to determine if the problem has
really been fixed in a production JVM release. Further, none of them
described the situation where retrying the accept method leaves the
server hung. The version of Windchill they are running supports Java
2, and they will eventually upgrade for performance reasons, but I'd
like to be able to say definitively whether it will also solve the
server hanging problem.
We have a wrapper class around the real server socket for logging and
exception recovery, so it's easy to control the response to these
exceptions. This logging is how we discovered that the hangs
correspond to the server socket throwing this exception. Our code
was retrying a couple of times to work around an early Windows/NT
socket bug where clients that connect and die before the server
accept call completes could raise a connection reset exception from
the accept method. That was on NT, but now on Solaris the only
working technique to prevent this hang is to shut down the server
when its accept call fails. Luckily, Windchill can have multiple
servers load balancing and they can fail and be restarted
automatically by a server manager process. Most calls in progress
will be retried against another server. This was simply defensive
programming against bugs or memory leaks in native code (Oracle) or
unstable JVMs. It allows the customer to stay in production, but
it's not a very clean solution to have servers committing suicide
every few hours. There were a total of 22 shutdowns yesterday. And
we still have the lost port numbers waiting for a reboot (not typical
of a Solaris enterprise server).
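A rough sketch of the kind of wrapper described above. The class name and details here are hypothetical, not our real Windchill wrapper, but the policy is the one we have settled on for Solaris: log the failure and close the listener so the server manager can restart the process, rather than retry the accept.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical stand-in for the logging/recovery wrapper described above.
public class LoggingServerSocket {
    private final ServerSocket server;

    public LoggingServerSocket(int port) throws IOException {
        server = new ServerSocket(port);
    }

    public int getLocalPort() {
        return server.getLocalPort();
    }

    public Socket accept() throws IOException {
        try {
            return server.accept();
        } catch (InterruptedIOException e) {
            // On Solaris the only recovery that has worked for us is to
            // close the listener and let the server be restarted, so log
            // the failure and escalate instead of retrying the accept.
            System.err.println("accept interrupted, shutting down: " + e);
            server.close();
            throw e;
        }
    }

    public void close() throws IOException {
        server.close();
    }
}
```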