Originally found with a larger benchmark. With the targeted microbenchmark:
http://cr.openjdk.java.net/~shade/8050850/ArrayTest.java
default: 259.322 +- 3.997 ns/op
LoopMaxUnroll=1: 276.164 +- 7.740 ns/op
LoopMaxUnroll=2: 208.690 +- 2.690 ns/op
LoopMaxUnroll=3: 207.843 +- 1.946 ns/op
LoopMaxUnroll=4: 256.309 +- 1.202 ns/op
The default mode seems to be very close to LMU=4 assembly-wise. Below are the hottest loops for different LMUs:
(monospaced: http://cr.openjdk.java.net/~shade/8050850/lmu-hots.txt)
-------------------
LoopMaxUnroll=1:
; working...
3.12% 3.08% 0x00007fec78107f60: mov 0x10(%rsi,%r10,4),%r11d
4.22% 4.20% 0x00007fec78107f65: add 0xc(%r12,%r11,8),%edi
; index increment + back branch
80.62% 82.20% 0x00007fec78107f6a: inc %r10d
2.11% 2.02% 0x00007fec78107f6d: cmp %ecx,%r10d
0x00007fec78107f70: jl 0x00007fec78107f60
-------------------
LoopMaxUnroll=2:
; working...
1.73% 1.80% 0x00007f966d1a15b0: mov 0x10(%rsi,%r11,4),%r9d
2.44% 2.23% 0x00007f966d1a15b5: add 0xc(%r12,%r9,8),%edx
31.49% 43.87% 0x00007f966d1a15ba: movslq %r11d,%r9
1.04% 1.04% 0x00007f966d1a15bd: mov 0x14(%rsi,%r9,4),%r9d
1.91% 1.80% 0x00007f966d1a15c2: mov 0xc(%r12,%r9,8),%r9d
22.86% 23.19% 0x00007f966d1a15c7: add %r9d,%edx
; index increment + back branch
24.86% 15.38% 0x00007f966d1a15ca: add $0x2,%r11d
0.87% 0.76% 0x00007f966d1a15ce: cmp %r10d,%r11d
0x00007f966d1a15d1: jl 0x00007f966d1a15b0
-------------------
LoopMaxUnroll=4:
0.47% 0.17% 0x00007fc25919f5f0: mov %rdx,%rbx
; taking three things from stack
0.11% 0.07% 0x00007fc25919f5f3: mov (%rsp),%r8
8.02% 9.26% 0x00007fc25919f5f7: mov 0x8(%rsp),%rdx
2.62% 2.65% 0x00007fc25919f5fc: mov 0x10(%rsp),%r9
; working...
0.46% 0.43% 0x00007fc25919f601: mov 0x10(%rsi,%r10,4),%r11d
0.13% 0.13% 0x00007fc25919f606: add 0xc(%r12,%r11,8),%edi
14.20% 15.05% 0x00007fc25919f60b: movslq %r10d,%rax
0.25% 0.18% 0x00007fc25919f60e: mov 0x14(%rsi,%rax,4),%r11d
0.04% 0.07% 0x00007fc25919f613: mov 0xc(%r12,%r11,8),%r11d
; putting the same three things back on stack, no usages (!!!)
9.80% 10.28% 0x00007fc25919f618: mov %r9,0x10(%rsp)
2.80% 2.97% 0x00007fc25919f61d: mov %rdx,0x8(%rsp)
0.25% 0.32% 0x00007fc25919f622: mov %r8,(%rsp)
; working...
0.13% 0.15% 0x00007fc25919f626: mov %rbx,%rdx
9.09% 9.92% 0x00007fc25919f629: mov 0x18(%rsi,%rax,4),%r8d
2.71% 3.00% 0x00007fc25919f62e: mov 0xc(%r12,%r8,8),%r8d
4.23% 4.05% 0x00007fc25919f633: mov 0x1c(%rsi,%rax,4),%ebx
0.04% 0.04% 0x00007fc25919f637: mov 0xc(%r12,%rbx,8),%r9d
13.38% 12.47% 0x00007fc25919f63c: add %r11d,%edi
1.32% 0.78% 0x00007fc25919f63f: add %r8d,%edi
4.90% 4.75% 0x00007fc25919f642: add %r9d,%edi
; index increment + back branch
11.02% 11.13% 0x00007fc25919f645: add $0x4,%r10d
2.59% 2.44% 0x00007fc25919f649: cmp %ecx,%r10d
0x00007fc25919f64c: jl 0x00007fc25919f5f0
-------------------
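There, LMU=4 starts to spill without a good reason: three values are reloaded from the stack and stored straight back on every iteration, with no uses in between.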
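For readers without access to the linked ArrayTest.java, here is a minimal sketch of the kind of loop the listings appear to come from. It is inferred from the assembly only (a 4-byte reference load from an array, followed by an int field read through the %r12 heap base), so the class name ArrayTestSketch, the Holder type, its field x, and the helper make are all hypothetical and not taken from the actual benchmark. The second method is a hand-written source-level analogue of what unrolling by 2 does; it is not what C2 emits.

```java
public class ArrayTestSketch {
    // Hypothetical element type: a single int field. With compressed oops
    // such a field would plausibly sit at offset 0xc, as in the listings.
    static final class Holder {
        final int x;
        Holder(int x) { this.x = x; }
    }

    // Assumed shape of the measured loop: load the element reference,
    // read its int field, accumulate.
    static int sum(Holder[] arr) {
        int s = 0;
        for (int i = 0; i < arr.length; i++) {
            s += arr[i].x;
        }
        return s;
    }

    // Source-level analogue of unrolling by 2: two elements per iteration,
    // index stepped by 2, plus a tail loop for an odd-length remainder.
    static int sumUnrolledBy2(Holder[] arr) {
        int s = 0;
        int i = 0;
        int limit = arr.length - 1;
        for (; i < limit; i += 2) {
            s += arr[i].x;
            s += arr[i + 1].x;
        }
        for (; i < arr.length; i++) {
            s += arr[i].x;
        }
        return s;
    }

    // Hypothetical helper: fills the array with Holder(0) .. Holder(n - 1).
    static Holder[] make(int n) {
        Holder[] arr = new Holder[n];
        for (int i = 0; i < n; i++) {
            arr[i] = new Holder(i);
        }
        return arr;
    }

    public static void main(String[] args) {
        Holder[] arr = make(1000);
        System.out.println(sum(arr));
        System.out.println(sumUnrolledBy2(arr));
    }
}
```

This plain main method is not a benchmark harness and will not reproduce the ns/op numbers above; it only shows the loop shape. The numbers in this report come from the linked microbenchmark run with different -XX:LoopMaxUnroll values.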