Uploaded image for project: 'JDK'
  1. JDK
  2. JDK-8059461

Refactor IndexSet for better performance

    Details

    • Type: Enhancement
    • Status: Open
    • Priority: P3
    • Resolution: Unresolved
    • Affects Version/s: 8u20, 9, 10
    • Fix Version/s: tbd
    • Component/s: hotspot

      Description

      If you run jdk9-dev on Nashorn/Octane suite with -XX:+CITime enabled:

      ~/trunks/jdk9-dev/build/linux-x86_64-normal-server-release/images/j2sdk-image/bin/java -XX:+CITime -jar ~/trunks/jdk9-dev/build/linux-x86_64-normal-server-release/images/j2sdk-image/jre/lib/ext/nashorn.jar -Dnashorn.typeInfo.disabled=false --class-cache-size=0 --persistent-code-cache=false -scripting --log=time test/script/basic/run-octane.js -- --iterations 1

      Then you will see this profile:
       http://cr.openjdk.java.net/~shade/8059461/nashorn-compile-baseline.log

      The remarkable time is spent in regalloc, "Coalesce 2". If you dive there in profiler, you will realize we spend most of the time in update_ifg, modifying the IFG neighbors for a new (merged) live range. I see little-to-none easy things to do algorithmic-wise here:

      void PhaseConservativeCoalesce::update_ifg(uint lr1, uint lr2, IndexSet *n_lr1, IndexSet *n_lr2) {
        ...
        // Some neighbors of intermediate copies now interfere with the
        // combined live range.
        IndexSetIterator three(&_ulr);
        while ((neighbor = three.next()) != 0)
          if( _phc._ifg->neighbors(neighbor)->insert(lr1) )
            lrgs(neighbor).inc_degree( lrg1.compute_degree(lrgs(neighbor)) );
      }

      But, the native profiler shows we spend a lot of time walking the actual IndexSet containing the IFG.
      Because of that, I tried to see how IndexSet reacts to simple change of bit distribution:

        // The length of the preallocated top level block array
        enum { preallocated_block_list_size = 16 };

        // The lengths of the index bitfields
        enum { bit_index_length = 5,
               word_index_length = 3,
               block_index_length = 8 // not used
        };

      Current IndexSet allocates 32 bits per word, 8 words per block, and pre-allocates 16 blocks.

      Tuning the word count up (coarsening the blocks) yields a considerable improvement in compile time:

      * Baseline: 32 bits per word, 8 words per block, 16 blocks pre-allocated:

        Total compilation time : 508.008 s
          C2 Compile Time: 393.537 s
             Regalloc: 162.268 s
               Coalesce 2: 53.094 s
        Average compilation speed : 22518 bytes/s

      * 32 bits per word, 16 words per block, 8 blocks pre-allocated:
       http://cr.openjdk.java.net/~shade/8059461/nashorn-compile-w4-b7.log

       Total compilation time : 488.732 s
          C2 Compile Time: 377.150 s
             Regalloc: 149.755 s
               Coalesce 2: 50.160 s
        Average compilation speed : 22568 bytes/s


      * 32 bits per word, 32 words per block, 4 blocks pre-allocated:
       http://cr.openjdk.java.net/~shade/8059461/nashorn-compile-w5-b6.log

        Total compilation time : 461.385 s
          C2 Compile Time: 349.802 s
             Regalloc: 126.326 s
               Coalesce 2: 28.198 s
        Average compilation speed : 23928 bytes/s

      * 32 bits per word, 64 words per block, 2 blocks pre-allocated:
      http://cr.openjdk.java.net/~shade/8059461/nashorn-compile-w6-b5.log

        Total compilation time : 468.099 s
          C2 Compile Time: 351.622 s
             Regalloc: 131.058 s
               Coalesce 2: 29.227 s
        Average compilation speed : 23669 bytes/s

      * 32 bits per word, 128 words per block, 1 block pre-allocated:
      http://cr.openjdk.java.net/~shade/8059461/nashorn-compile-w7-b4.log

        Total compilation time : 464.916 s
          C2 Compile Time: 349.978 s
             Regalloc: 129.082 s
               Coalesce 2: 24.206 s
        Average compilation speed : 23423 bytes/s


      Therefore, the block chain in IndexSet is not bringing the performance benefits, and allocating a single large block is
      better for performance (better locality). With headlines like that, I think we would be better off gutting the IndexSet internals,
      and replacing it with BitMap. BitMap is dynamically resizable, and so we arguably would have similar footprint.
      Moving to BitMap will additionally bring 64 bits per word on 64-bit platforms, since BitMap allocates words with uintptr_t,
      not uint32_t as IndexSet does.

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                shade Aleksey Shipilev
                Reporter:
                shade Aleksey Shipilev
              • Votes:
                0 Vote for this issue
                Watchers:
                4 Start watching this issue

                Dates

                • Created:
                  Updated: