Details

    • Subcomponent: gc

      Description

      Work-in-progress implementation: https://cr.openjdk.java.net/~manc/8236485/webrev_wip0/

      Implement an epoch synchronization protocol between G1's concurrent refinement threads and mutator threads. The protocol is necessary for removing the StoreLoad memory fence from G1's post-write barrier. It ensures that all heap stores scheduled before the protocol is initiated are visible once the protocol finishes. In essence, the protocol waits until every mutator thread has executed an operation that implies a StoreLoad fence.

      The protocol maintains the following data structures:
       - global_epoch: a global atomic counter;
       - T.local_epoch: a per-thread counter for each mutator thread T;
       - global_frontier: the minimum of local_epoch across all mutator threads.

      Each mutator thread copies the current global_epoch to its local_epoch after executing certain operations that imply a StoreLoad fence. Currently, it updates local_epoch at the thread state transition from in_vm to in_java, at the transition back from in_vm for a handshake (needed to support the handshake fallback, see below), and when handling a safepoint poll page exception. These operations happen frequently enough at runtime that the protocol returns quickly in most cases.
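
      As a rough illustration, here is a minimal, self-contained sketch of this mutator-side bookkeeping using standard C++ atomics. The names (MutatorThread, publish_epoch) are placeholders for illustration, not the actual HotSpot code:

      #include <atomic>
      #include <cstdint>

      std::atomic<uint64_t> global_epoch{0};   // bumped by refinement threads

      struct MutatorThread {
        std::atomic<uint64_t> local_epoch{0};  // read by refinement threads

        // Called from operations that already imply a StoreLoad fence,
        // e.g. the in_vm -> in_java state transition or handling a
        // safepoint poll page exception.
        void publish_epoch() {
          // The surrounding fencing operation has already made all earlier
          // heap stores visible, so advertising the epoch observed here
          // tells refinement threads that those stores are now visible.
          uint64_t e = global_epoch.load(std::memory_order_acquire);
          local_epoch.store(e, std::memory_order_release);
        }
      };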

      A thread doing concurrent refinement initiates and executes the protocol after cleaning the card(s) to be refined and executing a StoreLoad fence (see G1RefineBufferedCards::refine()). The following pseudocode shows the protocol:

      required_frontier = atomic_inc(&global_epoch); // required_frontier is the value after increment
      … // doing some work to pre-process the card(s)
      if (required_frontier <= global_frontier) { // already synchronized by some other refinement thread
        return;
      }
      while (!timeout) { // spin-wait by just loading each mutator thread’s local_epoch
        current_frontier = MIN(T.local_epoch foreach mutator thread T in in_java state);
        if (current_frontier >= required_frontier) {
          // update global_frontier if current_frontier is larger
          CAS(&global_frontier, global_frontier, MAX(global_frontier, current_frontier));
          return;
        }
      }
      handshake_with_all_mutator_threads(); // fallback to trigger a handshake and wait for it to complete

      The protocol finishes in one of three ways:
       - The fast path: some other refinement thread initiated the protocol later and has already succeeded, so global_frontier has already reached required_frontier.
       - The spin-wait succeeds: within a fixed amount of time, all mutator threads in the in_java state have passed required_frontier.
       - The fallback triggers a heavyweight handshake (JEP 312 / JDK-8185640: Thread-Local Handshakes) with all mutator threads. This forces each mutator thread in the in_java or in_vm state to go through a state transition that implies a StoreLoad fence.
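
      For illustration, the pseudocode above can be restated as a compilable sketch in standard C++ atomics. It reuses MutatorThread from the earlier sketch; mutator_threads, the spin timeout and handshake_with_all_mutator_threads() are placeholders, not the actual HotSpot API:

      #include <algorithm>
      #include <atomic>
      #include <chrono>
      #include <cstdint>
      #include <vector>

      extern std::atomic<uint64_t> global_epoch;
      extern std::atomic<uint64_t> global_frontier;
      extern std::vector<MutatorThread*> mutator_threads;   // threads currently in in_java
      extern void handshake_with_all_mutator_threads();     // blocking fallback

      void epoch_synchronize() {
        // Claim a new epoch; all stores issued before this point must be
        // visible to refinement by the time the protocol finishes.
        uint64_t required_frontier = global_epoch.fetch_add(1) + 1;

        // ... pre-process the card(s) here ...

        // Fast path: another refinement thread already advanced the frontier.
        if (required_frontier <= global_frontier.load(std::memory_order_acquire)) {
          return;
        }

        auto deadline = std::chrono::steady_clock::now() + std::chrono::microseconds(100);
        while (std::chrono::steady_clock::now() < deadline) {
          // Spin-wait: take the minimum local_epoch over mutator threads in in_java.
          uint64_t current_frontier = UINT64_MAX;
          for (const MutatorThread* t : mutator_threads) {
            current_frontier = std::min(current_frontier,
                                        t->local_epoch.load(std::memory_order_acquire));
          }
          if (current_frontier >= required_frontier) {
            // Publish the new frontier so other refiners can take the fast path.
            uint64_t old = global_frontier.load(std::memory_order_relaxed);
            while (old < current_frontier &&
                   !global_frontier.compare_exchange_weak(old, current_frontier)) {
              // compare_exchange_weak reloads 'old' on failure; retry.
            }
            return;
          }
        }

        // Slow path: force every mutator thread through a fencing state transition.
        handshake_with_all_mutator_threads();
      }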

      Challenges
      There are several unresolved challenges and corner cases. The main problem is that the call that triggers a handshake (Handshake::execute()) could block and only return after a safepoint has occurred.
      A preliminary test shows that about 20%-50% of handshakes pass a safepoint. This causes two problems:
      1. Concurrent refinement work cannot span a safepoint. A safepoint could be a collection pause, which requires the cards being refined either to be fully refined or to be added back to the DirtyCardQueueSet. The two-phase approach of batch-cleaning a card buffer and refining it later requires that the two phases do not span a safepoint.
      2. The post-write barrier can only call JRT_LEAF functions, which forbid any blocking operation. When a mutator thread performs refinement work itself, it therefore cannot fall back to the blocking call that triggers a handshake.

      Possible solutions include:
       - Adding a non-blocking API to trigger a handshake, and giving up on the refinement work if the handshake does not finish in time.
       - Using another mechanism, such as the sys_membarrier() syscall, as the fallback instead of a handshake (see the sketch below).
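
      As a rough sketch of the second option, on Linux the refinement thread could issue a global membarrier instead of a handshake. This assumes a kernel that supports MEMBARRIER_CMD_GLOBAL (4.16+); it is not part of the current webrev:

      #include <linux/membarrier.h>   // MEMBARRIER_CMD_QUERY, MEMBARRIER_CMD_GLOBAL
      #include <sys/syscall.h>
      #include <unistd.h>

      // Checks once at startup whether the kernel supports the global command.
      static bool membarrier_supported() {
        long cmds = syscall(__NR_membarrier, MEMBARRIER_CMD_QUERY, 0);
        return cmds >= 0 && (cmds & MEMBARRIER_CMD_GLOBAL);
      }

      // Forces every running thread through a full memory barrier. Unlike
      // Handshake::execute(), this does not interact with safepoints, although
      // the call itself can still take a while to return.
      static bool global_membarrier() {
        return syscall(__NR_membarrier, MEMBARRIER_CMD_GLOBAL, 0) == 0;
      }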

      Another open question is whether the epoch synchronization protocol should also synchronize with threads in the native/VM state that could execute the native post-write barrier (e.g., via Access::oop_store_at()). If it should, the spin-wait loop would also have to check local_epoch for mutator threads in the in_vm state. This would make the protocol take much longer to finish and increase the likelihood of falling back to a handshake. We also need to ensure correctness for non-mutator threads (e.g., concurrent marking threads), as they could also execute the native post-write barrier.
      An alternative is to keep the StoreLoad fence in the native post-write barrier. This would make the native post-write barrier slower if we later remove the filter for young objects in the post-write barrier.
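
      For comparison, keeping the fence would amount to something like the following sketch of the native post-write barrier, written with standard C++ and hypothetical helpers (card_for(), enqueue_dirty_card(), the card values) rather than the real G1BarrierSet code:

      #include <atomic>
      #include <cstdint>

      extern volatile uint8_t* card_for(const void* field);    // hypothetical card table lookup
      extern void enqueue_dirty_card(volatile uint8_t* card);  // hypothetical dirty card queue
      constexpr uint8_t kYoungCard = 2;                        // placeholder card values
      constexpr uint8_t kDirtyCard = 0;

      // Native post-write barrier variant that keeps the StoreLoad fence and
      // therefore needs no epoch synchronization with refinement threads.
      void native_post_write_barrier(const void* field) {
        volatile uint8_t* card = card_for(field);
        if (*card == kYoungCard) {
          return;                    // the young-region filter that might be removed later
        }
        std::atomic_thread_fence(std::memory_order_seq_cst);   // the retained StoreLoad fence
        if (*card != kDirtyCard) {
          *card = kDirtyCard;
          enqueue_dirty_card(card);
        }
      }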

      People

      • Assignee: Man Cao (manc)
      • Reporter: Man Cao (manc)