JDK-6546278

Synchronization problem in the pseudo memory barrier code

    Details

    • Type: Bug
    • Status: Closed
    • Priority: P4
    • Resolution: Fixed
    • Affects Version/s: 5.0u12, 6, 6u1, 6u3
    • Fix Version/s: hs11
    • Component/s: hotspot
    • Resolved In Build: b01
    • CPU: x86, sparc
    • OS: linux, linux_redhat_4.0, solaris_8, solaris_10
    • Verification: Verified

        Description

        FULL PRODUCT VERSION :
        Hotspot/Java:

        - 1.6.0 b105
        - sources:
          jdk-6-fcs-bin-b105-jrl-29_nov_2006.jar
          jdk-6-fcs-src-b105-jrl-29_nov_2006.jar
        - build options: STATIC_MOTIF=false

        FULL OS VERSION :
        - uname: Linux b1c1s9 2.6.9-42.ELsmp #1 SMP Wed Jul 12 23:32:02 EDT 2006
        x86_64 x86_64 x86_64 GNU/Linux
        - RHEL 4, (patch level 4)
        - 2 x dual-core Intel Xeon CPUs (shows as an 8-way machine)

        A DESCRIPTION OF THE PROBLEM :
        The problem shows up as relatively rare, random application pauses of
        7-30 seconds. Typically, these occur once every 1-4 hours in
        production. With application pause time tracking enabled, the pauses
        are easy to see in the output logs as "application stopped" time. During
        these stoppages, a full CPU is consumed in kernel mode.

        After building the JVM from source and inserting debugging statements in
        various places, we determined that the pause was the result
        of a synchronization problem in the pseudo memory barrier code that
        attempts to control multiprocessor JVM safepoint entry.

        We verified this by using the reinstated -XX:+UseMembar
        option. This did appear to clear the problem; however, the overall
        performance of the system was not acceptable with this option enabled,
        since it uses a true memory barrier instruction to synchronize the
        multiple processors.

        Further investigation pointed to a race condition and
        associated thread starvation during entry into the JVM global
        safepoint. The pseudo memory barrier code depends on the SIGSEGV
        processing triggered when a thread attempts to access a block of shared
        memory that another thread has protected. While one thread was blocked
        trying to protect the shared memory in order to enter the safepoint,
        another thread looped repeatedly in the SIGSEGV handler code. This
        continued for random lengths of time until the protecting thread managed
        to get a time slice on the same CPU.
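
        The fault loop described above can be illustrated with a minimal,
        single-threaded C sketch. It is not the HotSpot code. It assumes Linux,
        and every name in it (demo_page, segv_handler, run_fault_loop_demo,
        FAULTS_BEFORE_RELEASE) is hypothetical. A store to a read-only page
        raises SIGSEGV; when the handler returns, the same store is re-executed
        and faults again until the page becomes writable. In the report the
        page state is changed by another thread; here the handler itself lifts
        the protection after a fixed number of faults so that the demo
        terminates on its own.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* exposes MAP_ANONYMOUS and sigaction under strict -std modes */
#endif
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical value, chosen only so the demo finishes quickly. */
enum { FAULTS_BEFORE_RELEASE = 5 };

static volatile sig_atomic_t fault_count = 0;
static void *demo_page = NULL;
static size_t demo_page_size = 0;

/* Counts each fault. Returning from the handler restarts the faulting store,
 * so the thread keeps re-entering here until the page is writable.
 * Assumption: the handler lifts the protection itself after N faults; in the
 * scenario from the report, another thread controls the page protection. */
static void segv_handler(int sig, siginfo_t *info, void *ctx) {
    (void)sig; (void)info; (void)ctx;
    fault_count++;
    if (fault_count >= FAULTS_BEFORE_RELEASE) {
        mprotect(demo_page, demo_page_size, PROT_READ | PROT_WRITE);
    }
}

/* Returns the number of faults taken before the store succeeded, or -1 on
 * setup failure. */
static int run_fault_loop_demo(void) {
    struct sigaction sa, old_sa;

    fault_count = 0;
    demo_page_size = (size_t)sysconf(_SC_PAGESIZE);
    demo_page = mmap(NULL, demo_page_size, PROT_READ,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (demo_page == MAP_FAILED) return -1;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = segv_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_SIGINFO;
    if (sigaction(SIGSEGV, &sa, &old_sa) != 0) return -1;

    /* This store faults repeatedly until the handler makes the page writable. */
    *(volatile int *)demo_page = 1;

    /* Restore the previous SIGSEGV disposition. */
    sigaction(SIGSEGV, &old_sa, NULL);
    return (int)fault_count;
}
```

        Calling run_fault_loop_demo() from a test program returns the number of
        times the handler ran before the store went through. With no bound on
        that count and no yield, a thread in this state keeps the CPU busy,
        which matches the starvation pattern described in the report.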

        We believe this appears random because it only occurs on safepoint
        entry when other threads are executing and when the thread trying
        to force the safepoint and the outstanding threads are on the same CPU.
        The race itself appears to happen very frequently, but long pauses occur
        only rarely: often the number of iterations through the SIGSEGV loop is
        less than 10 and the pause escapes detection.

        THE PROBLEM WAS REPRODUCIBLE WITH -Xint FLAG: Did not try

        THE PROBLEM WAS REPRODUCIBLE WITH -server FLAG: Yes

        STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
        See description

        EXPECTED VERSUS ACTUAL BEHAVIOR :
        See description

        ERROR MESSAGES/STACK TRACES THAT OCCUR :
        Not available

        REPRODUCIBILITY :
        This bug can be reproduced always.

        ---------- BEGIN SOURCE ----------
        Not available
        ---------- END SOURCE ----------

        CUSTOMER SUBMITTED WORKAROUND :
        We can make available a patch that we are using successfully under
        production loads. The patch tracks the number of times a thread iterates
        through the SIGSEGV handler and yields the CPU to the safepoint
        serializing thread if the count exceeds 10. This eliminates the longer
        pauses while still allowing the loop to "spin" briefly, as it frequently
        does under normal operation.
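
        The counting-and-yield idea behind this workaround can be sketched in C
        as follows. This is not the submitter's patch, which is not included in
        the report. The names on_serialize_page_fault,
        on_serialize_page_access_ok and SPIN_LIMIT are hypothetical, the use of
        sched_yield() is an assumption about how the CPU is given up, and
        resetting the counter after a yield is a design choice made here rather
        than something the report states.

```c
#include <sched.h>
#include <stdbool.h>

/* Threshold taken from the report: yield once the count exceeds 10. */
enum { SPIN_LIMIT = 10 };

/* Hypothetical per-thread count of consecutive passes through the handler. */
static _Thread_local int handler_passes = 0;

/* Intended to be called on each pass through the fault handler for the
 * protected page. Returns true if this pass gave up the CPU.
 * Assumption: sched_yield() stands in for whatever yield primitive the real
 * patch uses, and the counter is reset after yielding. */
static bool on_serialize_page_fault(void) {
    handler_passes++;
    if (handler_passes > SPIN_LIMIT) {
        handler_passes = 0;
        (void)sched_yield(); /* let the protecting thread run on this CPU */
        return true;
    }
    return false; /* short spin: return and retry the faulting access */
}

/* Intended to be called once the access finally succeeds, so that the next
 * episode starts counting from zero. */
static void on_serialize_page_access_ok(void) {
    handler_passes = 0;
}
```

        With these definitions, the first ten calls to
        on_serialize_page_fault() return false and the eleventh yields. Keeping
        a small spin budget preserves the common fast case noted in the report,
        where the loop resolves in fewer than 10 iterations.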

        We are not sure this is the optimal patch, but it does clearly
        demonstrate the issue we were encountering with the pseudo memory
        barrier implementation in our system environments.

        Fixed misspelling of "pseudo" in Synopsis field.

                People

                • Assignee: xlu Xiaobin Lu (Inactive)
                • Reporter: ndcosta Nelson Dcosta (Inactive)