JDK-8086056

ParNew: auto-tune ParGCCardsPerStrideChunk

    Details

    • Type: Enhancement
    • Status: Open
    • Priority: P2
    • Resolution: Unresolved
    • Affects Version/s: 9
    • Fix Version/s: 10
    • Component/s: hotspot
    • Labels:
      None
    • Subcomponent:
      gc

      Description

      We found that increasing the value of the ParGCCardsPerStrideChunk parameter (we've tried up to 8k) can improve ParNew GC times by a non-trivial amount, though it can also degrade them in some cases. Another observation is that, unsurprisingly, larger heaps (specifically, larger old gens) benefit more from higher ParGCCardsPerStrideChunk values.

      We have a change in TwitterJDK that auto-tunes ParGCCardsPerStrideChunk before each ParNew GC based on the old gen capacity. The policy is quite simple:

      old gen capacity < 1G : 256
      1G <= old gen capacity <= 16G : interpolate between 256 and 8k
      old gen capacity > 16G : 8k

      (we picked the [1G,16G] range for old gen capacity somewhat arbitrarily)
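
The policy above can be sketched as a small function. This is only an illustration, not the TwitterJDK change itself: the issue says "interpolate" without giving the formula, so linear interpolation is an assumption here, and the names `cards_per_stride_chunk`, `kMinOldGenCapacity`, `kMaxOldGenCapacity`, `kMinChunk`, and `kMaxChunk` are hypothetical.

```cpp
#include <cstdint>

// Hypothetical constants mirroring the ranges quoted in the description.
constexpr std::uint64_t G = 1024ULL * 1024 * 1024;
constexpr std::uint64_t kMinOldGenCapacity = 1 * G;
constexpr std::uint64_t kMaxOldGenCapacity = 16 * G;
constexpr std::uint64_t kMinChunk = 256;
constexpr std::uint64_t kMaxChunk = 8 * 1024;

// Maps an old gen capacity (in bytes) to a ParGCCardsPerStrideChunk value.
// Assumption: the interpolation between the two end points is linear.
std::uint64_t cards_per_stride_chunk(std::uint64_t old_gen_capacity) {
  if (old_gen_capacity < kMinOldGenCapacity) return kMinChunk;
  if (old_gen_capacity > kMaxOldGenCapacity) return kMaxChunk;
  const double fraction =
      static_cast<double>(old_gen_capacity - kMinOldGenCapacity) /
      static_cast<double>(kMaxOldGenCapacity - kMinOldGenCapacity);
  // Truncate the interpolated offset to a whole number of cards.
  return kMinChunk +
         static_cast<std::uint64_t>(fraction * (kMaxChunk - kMinChunk));
}
```

Under this linear assumption, a 10G old gen would map to a chunk size of 5017, while anything below 1G stays at 256 and anything above 16G stays at 8k.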

      Is there interest in this change? It's really quite small and simple. The old gen capacity and ParGCCardsPerStrideChunk value ranges we used can be fine-tuned further, based on further experimentation, if desired. I'll post some numbers to show the benefit of the change...

          Activity

          tonyp Tony Printezis added a comment - - edited
          Some ParNew numbers from jbb2005 :-) to illustrate the problem and the benefit of this auto-tuning. Did runs with 3 different old gen sizes and for each old gen size used a different warehouse size (but all runs used up to 48 warehouses). All runs were done on a 12-way / 2-ht Intel box. All numbers reported are P50 ParNew times over 4 separate runs. Did separate runs with 4 GC threads and 16 GC threads to see if the GC thread number affects the results.

          2g old gen size / jbb table size = 20,000 (default)

          chunk   4 GCThr   16 GCThr

            256    85.521     30.891
            512    84.887     30.763
             1k    82.871     30.832
             2k    82.541     30.495
             4k    81.459     30.768
             8k    83.349     30.383

           auto    84.755     30.764

          10g old gen size / jbb table size = 125,000

          chunk   4 GCThr   16 GCThr

            256   100.678     36.954
            512    92.584     34.588
             1k    88.919     32.439
             2k    85.135     31.980
             4k    83.343     32.024
             8k    82.229     31.354

           auto    83.257     31.911

          20g old gen size / jbb table size = 250,000

          chunk   4 GCThr   16 GCThr

            256   122.945     45.489
            512   104.772     38.694
             1k    94.479     35.379
             2k    89.789     32.965
             4k    86.112     33.477
             8k    84.393     31.855

           auto    84.330     32.239

          (Tables look terrible; sorry. Is it possible to do fixed size fonts here?)
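
To make the trend in these numbers easier to read, here is a small sketch that computes the fractional reduction in P50 ParNew time relative to the default chunk size of 256. The values are copied from the 4 GC thread column above; the `Row` struct, the `kRows` table, and the `reduction` helper are names made up for this illustration.

```cpp
#include <cstdio>

// One row per old gen size, 4 GC thread column only (values from the comment).
struct Row {
  const char* old_gen;
  double chunk_256;   // P50 ParNew time with ParGCCardsPerStrideChunk=256
  double chunk_8k;    // P50 ParNew time with ParGCCardsPerStrideChunk=8k
  double chunk_auto;  // P50 ParNew time with auto-tuning enabled
};

constexpr Row kRows[] = {
    {"2g", 85.521, 83.349, 84.755},
    {"10g", 100.678, 82.229, 83.257},
    {"20g", 122.945, 84.393, 84.330},
};

// Fractional reduction of `tuned` relative to `baseline` (0.25 means 25% lower).
double reduction(double baseline, double tuned) {
  return (baseline - tuned) / baseline;
}

// Prints the reduction for the 8k and auto settings of every row.
void print_reductions() {
  for (const Row& r : kRows) {
    std::printf("%-4s 8k: %5.1f%%  auto: %5.1f%%\n", r.old_gen,
                100.0 * reduction(r.chunk_256, r.chunk_8k),
                100.0 * reduction(r.chunk_256, r.chunk_auto));
  }
}
```

With 4 GC threads, the auto setting comes out at roughly 0.9%, 17.3%, and 31.4% below the 256 baseline for the 2g, 10g, and 20g old gens respectively, which matches the observation that larger old gens benefit more.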
          tonyp Tony Printezis added a comment -
          One more thing (for the record): We also tried varying the ParGCStridesPerThread argument to see if it had any effect on the ParNew GC times, but it didn't seem to. We tried values 2 (default), 3, and 4.
          jwilhelm Jesper Wilhelmsson added a comment -
          We have seen other problems reported on the openjdk list that could be helped by this change. I think it would be really nice to get this enhancement in, raising priority.
          The problems have been reported on 8, but since there is no more 8u release planned after 8u60, and it is too late to get this into 8u60, I put this on 9.
          tonyp Tony Printezis added a comment -
          Hi Jesper, Thanks for the reply. Targeting JDK 9 for this change is totally fine (we've already included this change in our JDK 8 anyway....). I'll post a webrev after I port it to JDK 9.
          ysr Y. Ramakrishna added a comment -
          As input to the auto-tuning "function" (from heap size and possibly # gc threads) to the parameters ParGCCardsPerStrideChunk and ParGCStridesPerThread, it would be nice if numbers could be obtained with a variety of benchmarks on a variety of hardware to determine an optimal (possibly platform-dependent) "function". Perhaps the "Alacrity" framework could be used by someone inside Oracle to obtain those numbers and see how they look.
          tonyp Tony Printezis added a comment - - edited
          Quick update on this, based on code review comments. We should add a cmd line arg to turn this feature on and off. We should also add additional cmd line args to set the chunk size and old gen capacity min / max values, so that they can be fine-tuned if necessary. Cmd line arg proposed names (feel free to propose alternatives):

          * +/-UseDynamicParGCStrides
          * DynamicParGCStrides{Min,Max}OldGenCapacity
          * DynamicParGCStrides{Min,Max}Size
          * PrintDynamicParGCStrides (suggested by Jon Masa to print the ParGCCardsPerStrideChunk value and associated info to the GC log)

          I'll also make them all manageable so that we can adjust them at runtime if necessary.
          tonyp Tony Printezis added a comment -
          Ramki, You're right that the GC thread # could be another input to the auto-tuning function. Our performance tests showed that the GC thread # doesn't affect the result much (see the numbers above, I got them for 4 and 16 GC threads). But, again, this could be different on another architecture / app.
          tonyp Tony Printezis added a comment -
          An additional small issue: the default max old gen capacity, 16*G, is not very 32-bit-friendly. :-) (thanks to Jon Masa for pointing this out) So I'll try the following:

          32-bit : old gen capacity range [ 1g, 2g ], stride size range [ 256, 512 ]
          64-bit : old gen capacity range [ 1g, 16g ], stride size range [ 256, 8k ]
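
The 32-bit concern can be illustrated with a short sketch: 16G does not fit in a 32-bit word, so the defaults have to depend on the word size. The `fits_in_word` and `defaults_for` helpers and the `StrideDefaults` struct are hypothetical names for this illustration and are not taken from the actual change.

```cpp
#include <cstdint>

constexpr std::uint64_t G = 1024ULL * 1024 * 1024;

// True if `bytes` is representable in an unsigned word of `word_bits` bits.
constexpr bool fits_in_word(std::uint64_t bytes, unsigned word_bits) {
  return word_bits >= 64 || bytes < (std::uint64_t{1} << word_bits);
}

// Hypothetical container for the four default values discussed above.
struct StrideDefaults {
  std::uint64_t min_old_gen_capacity;
  std::uint64_t max_old_gen_capacity;
  std::uint64_t min_chunk;
  std::uint64_t max_chunk;
};

// Picks the proposed defaults for a 32-bit or a 64-bit VM.
constexpr StrideDefaults defaults_for(unsigned word_bits) {
  return word_bits >= 64 ? StrideDefaults{1 * G, 16 * G, 256, 8 * 1024}
                         : StrideDefaults{1 * G, 2 * G, 256, 512};
}
```

Here `fits_in_word(16 * G, 32)` is false while `fits_in_word(2 * G, 32)` is true, which is why the proposed 32-bit range stops at 2g.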

            People

            • Assignee:
              tonyp Tony Printezis
              Reporter:
              tonyp Tony Printezis