JDK-8027547

G1 should not push already forwarded objects into the object task queue but special case them

    Details

    • Type: Enhancement
    • Status: Open
    • Priority: P2
    • Resolution: Unresolved
    • Affects Version/s: hs25, 8
    • Fix Version/s: 10
    • Component/s: hotspot
    • Subcomponent:
      gc

      Description

      Measurements on specjbb2005 show (see JDK-8020306) that G1 object copy times are much larger than parallel gc's.

      One important problem is that the number of items pushed into the object copy task queue in G1 is much larger than for parallel gc. Investigation of this behavior showed that parallel gc has a special fast path handling references to already forwarded objects: it does not push them onto the work queue, but fixes them up and does any missing processing (remembered set) inline.

      E.g.

      template <class T>
      inline void PSPromotionManager::claim_or_forward_internal_depth(T* p) {
        if (p != NULL) { // XXX: error if p != NULL here
          oop o = oopDesc::load_decode_heap_oop_not_null(p);
          if (o->is_forwarded()) { // <-- "fast path" here
            o = o->forwardee();
            // Card mark
            if (PSScavenge::is_obj_in_young(o)) {
              PSScavenge::card_table()->inline_write_ref_field_gc(p, o);
            }
            oopDesc::encode_store_heap_oop_not_null(p, o);
          } else {
            push_depth(p); // push into work queue
          }
        }
      }
      Depending on the number of already forwarded objects this saves a lot of work, as task queue management is relatively expensive: pushing and popping the element, cache-locality effects, additional memory use in the overflow queue, more (slow, possibly contended) stealing work, etc.

        Activity

        tschatzl Thomas Schatzl added a comment -

        Here's the output of task queue stats for parallel gc on specjbb2005:

        thr      qpush       qpop     qpop-s   qattempt     qsteal      opush       omax
        --- ---------- ---------- ---------- ---------- ---------- ---------- ----------
          0     795448     795303         66         28         27          0          0
          1     678503     678391        104         75         74          0          0
          2     612772     612615        110         52         51          0          0
          3     556778     556735        242        306        305          0          0
          4          0          0          0          0          0          0          0
            --------masked-------     arrays      array
        thr       push      steal    chunked     chunks
        --- ---------- ---------- ---------- ----------
          0       1409         12          4       1409
          1        673          3          1        673
          2        676         11          2        676
          3        740         15         20        740
          4          0          0          0          0

        and here a similar one for G1:

        GC Task Stats
        thr      qpush       qpop     qpop-s   qattempt     qsteal      opush       omax
        --- ---------- ---------- ---------- ---------- ---------- ---------- ----------
          0     744163     743894        375        236        233          0          0
          1     774933     774545        313        202        200          0          0
          2     850281     850104        275        311        309          0          0
          3     632904     632758        778        240        238          0          0
        tot    3002281    3001301       1741        989        980          0          0

        In total, PS scavenge pushes 2.64M references while G1 pushes 3M; adding some code that prints the number of forwarded references pushed (in G1ParScanThreadState::push_on_queue) shows that this is about 360k, i.e. almost exactly the difference (accounting for slight differences in the point in time at which the measurements were taken).

        0 forwarded: 124482
        1 forwarded: 75292
        2 forwarded: 76951
        3 forwarded: 86251

        The values also show the difference in the number of slow pops, attempts and steals.

        Vtune shows a (total) ratio of cycles spent in the worker task queue of around 1.8 (parallel)/4 (g1).
        tschatzl Thomas Schatzl added a comment -
        Some statistics about the amount of forwarded references G1 pushes into the task queue:

        specjbb2005: 8.96%
        specjbb2013: 8.96%
        specjvm2008.compiler: 44.6% (!)
        specjvm2008.compress: 31.2%
        specjvm2008.crypto: 30.9%
        specjvm2008.derby: 37.1%
        specjvm2008.mpeg: 9.59%
        specjvm2008.scimark: 13.0%
        specjvm2008.serial: 29.5%
        specjvm2008.startup: 32.2%
        specjvm2008.sunflow: 20.5%
        specjvm2008.xml: 17.4%
        specjvm98: 17.9%

        tonyp Tony Printezis added a comment -
        Thomas,

        There's a very interesting trade-off here. On one hand, checking whether the object is forwarded or not might cause an immediate cache miss (and you have an extra conditional in your fast path). On the other hand, pushing on / popping from the taskqueue might be not too bad cache performance-wise (the top of the taskqueue will be very hot and probably already in the cache). I'm very interested to see whether the extra check will have an impact on GC times and/or cache miss rate (it'd be worth doing a couple of runs with your favorite profiler to check the latter).

        Tony
        tschatzl Thomas Schatzl added a comment - - edited
        I saw the comment in the G1ParScanClosure::do_oop_work_nv (or so) method.

        The problem is that we are not just pushing/popping an object from a queue; managing the queues takes time too. Then there's the synchronization overhead of other threads trying to access and steal from the work queues. Given the task queue stats above, the stealing seems to grow disproportionately with the number of operations on the queues. Also, there is quite a lot of work to do (a few branches at least) from popping a reference off the queue until we are back in the processing code - G1 does not exactly have the most lightweight closures - even if everything is in the caches. I wrote elsewhere, I think, that according to the tool G1 is also decode bound (too long code paths all around).

        From measurements with the update RS phase changes (no CR yet), the code actually executed to do the forwarding and then update the remembered set entry is typically really small (the rset filters work well).

        As the first comment indicates, although G1 does only about 13% more pushes and pops, according to VTune it spends more than twice as much time there (4 vs. 1.8) (specjbb2005; this is repeatable, but as you know when working with these tools this might be a sampling problem).

        Iirc most applications also have a higher amount of forwarded references to process than specjbb, so the saving should generally be larger than there (I've gotten better than expected improvement in specjbb scores in a few quick runs with the change though, probably something wrong with the test). There are other optimizations already in the bug system that work better with specjbb than this one.

        Sure, the change needs to be measured well. :)
        tonyp Tony Printezis added a comment -
        Hey Thomas,

        Thanks for the detailed reply. I was not saying that it's not worth trying this! I just pointed out that the higher cache miss rate might cancel out any benefits the change might have (at least, in some cases).

        I wonder why stealing goes up when there are more operations on the queues. Maybe, when some threads terminate earlier than others, there are simply more items available for stealing?

        Tony
        tschatzl Thomas Schatzl added a comment -
        I think the situation is as follows: typically, towards the end of a GC, the frequency of forwarded references increases. There are two classes of tasks in the queue: one that is fast to process (already forwarded references) and one that is much slower (the normal processing, i.e. a call to copy_to_survivor_space()). At that point the queues contain mainly to-be-forwarded references, so a stolen task is typically a reference that needs to be forwarded. The threads are simply able to attempt steals more often while they wait for a single thread to complete a "real" reference (of which there are probably just a few now and then at this point).

        Also, after completing an already forwarded reference, a thread is required to go steal again right away, as forwarded references do not create new references to process.

        Another issue: with such a high amount of stealing (it does not seem bad in this particular example; I might dig out the specjvm2008.compiler logs), the prefetching may be useless, as the reference we think a thread is going to process soon is in many cases stolen instead.
        tonyp Tony Printezis added a comment -
        Very good point re: "prefetching becoming less effective towards the end of the GC". Not sure what to do about that, though. Maybe only prefetch if your task queue is over a certain length (less chance of other threads stealing, given that they steal from the bottom)?
        tschatzl Thomas Schatzl added a comment -
        More an investigation on what can be done here.

          People

          • Assignee:
            Unassigned
          • Reporter:
            tschatzl Thomas Schatzl