Many core loop transformations apply to counted loops, which
are those with a calculated trip count. Transformations
include unrolling, iteration range splitting (array RCE),
and strip mining (JDK8186027). The optimizer performs many
complicated pattern matches to detect and transform counted
loop.
Most or all of these pattern matches and transformations
apply to loops with 32bit control variables and arithmetic.
This makes sense as long as bulk operations apply only to
Java arrays, since those arrays can only span a 31bit index
range. Newer APIs for larger blocks of bulk data will
introduce 64bit indexes, such as Panama's native arrays and
(possibly) rangeexpanded byte buffers. Under the hood, the
Unsafe API routinely works with 64bit addresses and address
arithmetic. Loops which work on such data structures
naturally use 64bit values, either as direct Java longs, or
as wrapped cursor structure with incrementing long
components (Panama pointers).
There needs to be a story for transforming such longrunning
loops. This RFE is a request for that story.
A complicating factor is that sometimes counted loops have
no safepoints, on the assumption that the largest possible
iteration (across 32 bits of dynamic range) won't cause the
JVM's safepoint mechanism to malfunction due to a
nonresponsive thread stuck in such a counted loop. This
assumption is invalid in the 64bit case. Luckily, we have
a (relatively new) optimization which can address this
problem, by stripmining a single very long running loop
into a sequence (outer loop) of loops of with appropriately
bounded trip counts.
This observation leads to a suggested approach to this RFE.
Generalize the stripmining transform operate on
tripcounted loops which use 64bit arithmetic. In the
inner loops, use a normal 32bit counter, constrained (of
course) to avoid overflow. This should enable all the other
loop transforms to work on the inner loops, without them
"knowing" that they are working on a stripmined 64bit
loop.
The strip mining transform, along with the index variable
size reduction, would look something like this:
```
void processRange(long start, long end) {
for (long li = start; li < end; li++) {
checkIndexLong(li, a.length());
a.put(li, a.get(li) + 1);
}
==STRIPMINE==>
void processRange(long start, long end) {
for (long li = start; li < end; ) {
int stripMax = LoopStripMiningIter;
int strip = (li + stripMax < end) ? stripMax : (int)(end  li);
for (int si = 0; si < strip; si++) {
long li_ = li + si;
checkIndexLong(li_, a.length());
a.put(li_, a.get(li_) + 1);
}
li += strip;
}
```
In the future, Valhalla will introduce primitivelike
"inline types" that can wrap hardware vectors of 128 bits or
larger. The possibility exists (in theory) of treating
those larger types as capable of arithmetic. In that case,
the JIT may encounter loops over other types besides Java
longs, and wider than 64 bits. The initial version of this
RFE need to support that, but it should be factored so that
adding more kinds of index types (besides T_INT and T_LONG)
will not require a total rewrite. I say "should" not "must"
because this point is a goal not a requirement of this RFE.
A separate part of getting the benefit from this work is
recognizing the places where range checking occurs, so that
iteration range splitting can be done (inside each strip of
the outer loop). The JVM does this in a hardwired way for
Java arrays, and also (as ofJDK8042997) on librarycoded
data structures which happen to use the intrinsic method
`Objects.checkIndex` to check their indexes.
For librarycoded data structures with 64bit indexes, the
JIT can be set up to look for a 64bit version of
`checkIndex`. (That is TBD, but see `checkIndexLong`
above.) Range checks on the long index can be reduced to
range checks on the shortened inner loop index.
Here is an example, based on the previous one, of splitting
the inner loop into pre, main, and postloops:
```
==RCE==>
void processRange(long start, long end) {
long mainStartLI = Long.max(start, 0);
long mainLimitLI = Long.min(end, a.length());
for (long li = start; li < end; ) {
int stripMax = LoopStripMiningIter;
int strip = (li+stripMax < end) ? stripMax : (int)(endli);
int si = 0;
int mainStartSI, mainLimitSI;
if (li+strip <= mainStartLI)
mainStartSI = mainLimitSI = strip; //or 0
else if (li >= mainLimitLI)
mainStartSI = mainLimitSI = 0; //or strip
else {
mainStartSI = (int)Long.max(mainStartLIli, 0);
mainLimitSI = (int)Long.min(mainLimitLIli, strip);
}
for (; si < mainStartSI; si++) { //PreLoop
long li_ = li + si;
checkIndexLong(li_, a.length());
a.put(li_, a.get(li_) + 1);
}
for (; si < mainLimitSI; si++) { //MainLoop
long li_ = li + si;
//RCE!// checkIndexLong(li_, a.length());
a.put(li_, a.get(li_) + 1);
}
for (; si < strip; si++) { //PostLoop
(... same as PreLoop ...)
}
li += strip;
}
```
Or it may be more reasonable to split the outer loop itself
into pre/main/post copies, in which case the 32bit versions
of the split values are not computed.
(Note: The gathering of constraints is a complex matter.
See `PhaseIdealLoop::add_constraint` in C2 for more details.
It seems it would need to be generalized to handle
constraints on 64bit values.)
The request to generalize iteration range splitting over
64bit ranges should be broken out into a separate RFE or
subtask, when work starts on this RFE.
are those with a calculated trip count. Transformations
include unrolling, iteration range splitting (array RCE),
and strip mining (
complicated pattern matches to detect and transform counted
loop.
Most or all of these pattern matches and transformations
apply to loops with 32bit control variables and arithmetic.
This makes sense as long as bulk operations apply only to
Java arrays, since those arrays can only span a 31bit index
range. Newer APIs for larger blocks of bulk data will
introduce 64bit indexes, such as Panama's native arrays and
(possibly) rangeexpanded byte buffers. Under the hood, the
Unsafe API routinely works with 64bit addresses and address
arithmetic. Loops which work on such data structures
naturally use 64bit values, either as direct Java longs, or
as wrapped cursor structure with incrementing long
components (Panama pointers).
There needs to be a story for transforming such longrunning
loops. This RFE is a request for that story.
A complicating factor is that sometimes counted loops have
no safepoints, on the assumption that the largest possible
iteration (across 32 bits of dynamic range) won't cause the
JVM's safepoint mechanism to malfunction due to a
nonresponsive thread stuck in such a counted loop. This
assumption is invalid in the 64bit case. Luckily, we have
a (relatively new) optimization which can address this
problem, by stripmining a single very long running loop
into a sequence (outer loop) of loops of with appropriately
bounded trip counts.
This observation leads to a suggested approach to this RFE.
Generalize the stripmining transform operate on
tripcounted loops which use 64bit arithmetic. In the
inner loops, use a normal 32bit counter, constrained (of
course) to avoid overflow. This should enable all the other
loop transforms to work on the inner loops, without them
"knowing" that they are working on a stripmined 64bit
loop.
The strip mining transform, along with the index variable
size reduction, would look something like this:
```
void processRange(long start, long end) {
for (long li = start; li < end; li++) {
checkIndexLong(li, a.length());
a.put(li, a.get(li) + 1);
}
==STRIPMINE==>
void processRange(long start, long end) {
for (long li = start; li < end; ) {
int stripMax = LoopStripMiningIter;
int strip = (li + stripMax < end) ? stripMax : (int)(end  li);
for (int si = 0; si < strip; si++) {
long li_ = li + si;
checkIndexLong(li_, a.length());
a.put(li_, a.get(li_) + 1);
}
li += strip;
}
```
In the future, Valhalla will introduce primitivelike
"inline types" that can wrap hardware vectors of 128 bits or
larger. The possibility exists (in theory) of treating
those larger types as capable of arithmetic. In that case,
the JIT may encounter loops over other types besides Java
longs, and wider than 64 bits. The initial version of this
RFE need to support that, but it should be factored so that
adding more kinds of index types (besides T_INT and T_LONG)
will not require a total rewrite. I say "should" not "must"
because this point is a goal not a requirement of this RFE.
A separate part of getting the benefit from this work is
recognizing the places where range checking occurs, so that
iteration range splitting can be done (inside each strip of
the outer loop). The JVM does this in a hardwired way for
Java arrays, and also (as of
data structures which happen to use the intrinsic method
`Objects.checkIndex` to check their indexes.
For librarycoded data structures with 64bit indexes, the
JIT can be set up to look for a 64bit version of
`checkIndex`. (That is TBD, but see `checkIndexLong`
above.) Range checks on the long index can be reduced to
range checks on the shortened inner loop index.
Here is an example, based on the previous one, of splitting
the inner loop into pre, main, and postloops:
```
==RCE==>
void processRange(long start, long end) {
long mainStartLI = Long.max(start, 0);
long mainLimitLI = Long.min(end, a.length());
for (long li = start; li < end; ) {
int stripMax = LoopStripMiningIter;
int strip = (li+stripMax < end) ? stripMax : (int)(endli);
int si = 0;
int mainStartSI, mainLimitSI;
if (li+strip <= mainStartLI)
mainStartSI = mainLimitSI = strip; //or 0
else if (li >= mainLimitLI)
mainStartSI = mainLimitSI = 0; //or strip
else {
mainStartSI = (int)Long.max(mainStartLIli, 0);
mainLimitSI = (int)Long.min(mainLimitLIli, strip);
}
for (; si < mainStartSI; si++) { //PreLoop
long li_ = li + si;
checkIndexLong(li_, a.length());
a.put(li_, a.get(li_) + 1);
}
for (; si < mainLimitSI; si++) { //MainLoop
long li_ = li + si;
//RCE!// checkIndexLong(li_, a.length());
a.put(li_, a.get(li_) + 1);
}
for (; si < strip; si++) { //PostLoop
(... same as PreLoop ...)
}
li += strip;
}
```
Or it may be more reasonable to split the outer loop itself
into pre/main/post copies, in which case the 32bit versions
of the split values are not computed.
(Note: The gathering of constraints is a complex matter.
See `PhaseIdealLoop::add_constraint` in C2 for more details.
It seems it would need to be generalized to handle
constraints on 64bit values.)
The request to generalize iteration range splitting over
64bit ranges should be broken out into a separate RFE or
subtask, when work starts on this RFE.
 relates to

JDK8221358 need intrinsics for loop control that can guide iteration range splitting
 Open

JDK8237082 Workaround C2 limitations when working with long loops
 New

JDK8244504 C2: refactor counted loop code in preparation for long counted loop
 Resolved
 links to

Review openjdk/jdk/318