Uploaded image for project: 'JDK'
  1. JDK
  2. JDK-8146828

C2: Subsequent arraycopy does not always eliminate array zeroing

    Details

      Description

      There are many cases where C2 eliminates the array pre-zeroing followed by arraycopy into the array. However, there are cases where this surprisingly does not work.

      For example, see this benchmark:
       http://cr.openjdk.java.net/~shade/8146828/ArrayZeroingBench.java
       http://cr.openjdk.java.net/~shade/8146828/benchmarks.jar

      The most puzzling example is this:

          @Benchmark
          public Foo[] arraycopy_srcLength() {
              Object[] src = this.src;
              Foo[] dst = new Foo[size];
              System.arraycopy(src, 0, dst, 0, src.length);
              return dst;
          }

          @Benchmark
          public Foo[] arraycopy_dstLength() {
              Object[] src = this.src;
              Foo[] dst = new Foo[size];
              System.arraycopy(src, 0, dst, 0, dst.length);
              return dst;
          }

      In srcLength case we successfully eliminate the pre-zeroing on destination array, while in dstLength we don't!

      See "rep stos" in dstLength case, and no zeroing in srcLength case:
       http://cr.openjdk.java.net/~shade/8146828/jdk9b99.perfasm
       http://cr.openjdk.java.net/~shade/8146828/jdk9b99-size10.perfasm

        Activity

        Hide
        roland Roland Westrelin added a comment -
        The problem I think is that we add a null check for src when we intrinsify arraycopy. So there's a null check between the allocation and the arraycopy itself and if we hit that null check we deoptimize with a live allocation and that allocation must be zeroed. In the arraycopy_srcLength(), src.length acts as a null check. It's between the allocation and the arraycopy but the allocation is not live at that null check because if it fails it's going to throw.

        We usually move the allocation below the predicates and arrange for the predicates in case they fell to resume execution in the interpreter right before the allocation. But in that case one of the predicate fells on a first compilation (subtype check) so we don't attempt it on the second one.
        Show
        roland Roland Westrelin added a comment - The problem I think is that we add a null check for src when we intrinsify arraycopy. So there's a null check between the allocation and the arraycopy itself and if we hit that null check we deoptimize with a live allocation and that allocation must be zeroed. In the arraycopy_srcLength(), src.length acts as a null check. It's between the allocation and the arraycopy but the allocation is not live at that null check because if it fails it's going to throw. We usually move the allocation below the predicates and arrange for the predicates in case they fell to resume execution in the interpreter right before the allocation. But in that case one of the predicate fells on a first compilation (subtype check) so we don't attempt it on the second one.
        Hide
        jrose John Rose added a comment -
        Roland is correct. This optimization is fragile because all GCs must be excluded between the new Foo[] and the arraycopy call, but implicit operations like null checks can throw, which in turn may GC.

        The optimization could be made less fragile by either (a) moving implicit checks (not visible in source code) up, as done with loop predication, or (b) splitting the zeroing operation out of the allocation and downward along both paths (the normal and exceptional path).

        //original
        dst = allocate(Foo[].class, size);
        block_zero(&dst[0], size);
        if (src == null) { uncommon_trap(); }
        block_move(&src[0], &dst[0], size);

        //(a)
        PREDICATE: if (src == null) { deopt(); }
        dst = allocate(Foo[].class, size);
        //DOMINATED, ELIMINATED: if (src == null) { uncommon_trap(); }
        //ELIMINATED: block_zero(&dst[0], size);
        block_move(&src[0], &dst[0], size);

        //(b)
        dst = allocate(Foo[].class, size);
        if (src == null) {
          SPLIT: block_zero(&dst[0], size);
          uncommon_trap(); }
        //SPLIT, ELIMINATED: block_zero(&dst[0], size);
        block_move(&src[0], &dst[0], size);
        Show
        jrose John Rose added a comment - Roland is correct. This optimization is fragile because all GCs must be excluded between the new Foo[] and the arraycopy call, but implicit operations like null checks can throw, which in turn may GC. The optimization could be made less fragile by either (a) moving implicit checks (not visible in source code) up, as done with loop predication, or (b) splitting the zeroing operation out of the allocation and downward along both paths (the normal and exceptional path). //original dst = allocate(Foo[].class, size); block_zero(&dst[0], size); if (src == null) { uncommon_trap(); } block_move(&src[0], &dst[0], size); //(a) PREDICATE: if (src == null) { deopt(); } dst = allocate(Foo[].class, size); //DOMINATED, ELIMINATED: if (src == null) { uncommon_trap(); } //ELIMINATED: block_zero(&dst[0], size); block_move(&src[0], &dst[0], size); //(b) dst = allocate(Foo[].class, size); if (src == null) {   SPLIT: block_zero(&dst[0], size);   uncommon_trap(); } //SPLIT, ELIMINATED: block_zero(&dst[0], size); block_move(&src[0], &dst[0], size);
        Hide
        roland Roland Westrelin added a comment -
        The problem that Aleksey reports with ArrayList.toArray is due to the fact that the array is allocated in one method but the arraycopy is in an inlined method. The code that checks for a tightly coupled allocation at parse time sees a SafePointNode that it doesn’t expect and bails out.

        http://cr.openjdk.java.net/~roland/8146828/webrev.02/

        is an attempt at fixing that problem. I moved the tightly coupled allocation logic post parse time. The logic that changes the guards so they resume execution at the allocation and moves the allocation after the guards is at macro expansion time now. That led to a couple of new problems: loop unswitching can cause 2 allocations to trap at a single unc. So I added some code to clone the unc and the control flow that leads to it so each allocation has its own unc. The other problem is that once the guards are changed to resume execution at the allocation method/bci, if a guard fails then we record it at the allocation method/bci but during parsing we check for too many traps at the arraycopy. So if a guard fails, we recompile, emit the guard again and risk recompiling again. To fix that, I suggest we pass a couple extra arguments to the uncommon trap code in that case and that case only: the method/bci of the arraycopy to properly record the trap. I considered other solutions but this one seems the simplest. All the graph surgery code is well tested. The extra argument to the uncommon trap code is only sketched on x86. Note there seems to be an interesting bug in callnode.cpp
        Show
        roland Roland Westrelin added a comment - The problem that Aleksey reports with ArrayList.toArray is due to the fact that the array is allocated in one method but the arraycopy is in an inlined method. The code that checks for a tightly coupled allocation at parse time sees a SafePointNode that it doesn’t expect and bails out. http://cr.openjdk.java.net/~roland/8146828/webrev.02/ is an attempt at fixing that problem. I moved the tightly coupled allocation logic post parse time. The logic that changes the guards so they resume execution at the allocation and moves the allocation after the guards is at macro expansion time now. That led to a couple of new problems: loop unswitching can cause 2 allocations to trap at a single unc. So I added some code to clone the unc and the control flow that leads to it so each allocation has its own unc. The other problem is that once the guards are changed to resume execution at the allocation method/bci, if a guard fails then we record it at the allocation method/bci but during parsing we check for too many traps at the arraycopy. So if a guard fails, we recompile, emit the guard again and risk recompiling again. To fix that, I suggest we pass a couple extra arguments to the uncommon trap code in that case and that case only: the method/bci of the arraycopy to properly record the trap. I considered other solutions but this one seems the simplest. All the graph surgery code is well tested. The extra argument to the uncommon trap code is only sketched on x86. Note there seems to be an interesting bug in callnode.cpp

          People

          • Assignee:
            roland Roland Westrelin
            Reporter:
            shade Aleksey Shipilev
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated: