memcpy() was rewritten for M7 SPARC [bug 19933710]
memcpy        T4 version    new (T7) version
1 thread      5.8 GB/s      14.7 GB/s
16 threads    87.9 GB/s     130.4 GB/s
32 threads    131.3 GB/s    131.3 GB/s
Analysis:
Block store init requires the existing 64-byte cache line to be moved out of
the L2 cache. The L2 cache has a store queue/miss buffer that is limited to 32
entries. Because 8 eight-byte stores share one cache line, normal in-order
stores fill those 32 entries with stores to only 4 cache lines, so at most 4
cache lines can be moving out of the L2 cache at a time. This note proposes an
alternate ordering: store the first element of every cache line in a chunk of
cache lines, then go back and store the rest of each cache line in that chunk.
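The arithmetic behind the 4-line limit can be written out as a small C sketch. It assumes each 8-byte store occupies one store queue entry, which is how the analysis reads but is not stated outright. The function and constant names are made up for illustration.

```c
#include <assert.h>

enum {
    QUEUE_ENTRIES = 32,  /* L2 store queue/miss buffer entries (from the note) */
    LINE_BYTES    = 64,  /* cache line size (from the note) */
    STORE_BYTES   = 8,   /* one stored element (from the note) */
    CHUNK_LINES   = 20   /* CHUNK (from the note) */
};

/* Number of 8-byte stores that land in the same cache line: 64 / 8. */
static int stores_per_line(void)
{
    return LINE_BYTES / STORE_BYTES;
}

/* In-order stores: the queue fills with stores to the same few lines. */
static int lines_in_flight_in_order(void)
{
    return QUEUE_ENTRIES / stores_per_line();
}

/* First-element pass: one queued store per line, so every line of the
 * chunk can be requested at once, provided the chunk fits in the queue. */
static int lines_in_flight_first_pass(void)
{
    assert(CHUNK_LINES <= QUEUE_ENTRIES);
    return CHUNK_LINES;
}
```

Under this assumption, in-order stores keep 4 lines in flight and the first-element pass keeps 20.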
Algorithmic description:
... move data until store pointer is on cache line boundary ...
where CHUNK=20
while (more than CHUNK*64 bytes left) {
    for (i=0;i<CHUNK;i++) { /* load/store first element of CHUNK cache lines */
        prefetch load address
        ld one eight byte element of cache line
        BIS one eight byte element of cache line
        advance load/store pointers by 64 bytes
    }
    reset load/store pointers to beginning of chunk
    for (i=0;i<CHUNK;i++) { /* load/store rest of CHUNK cache lines */
        prefetch load address
        ld other fifty-six bytes of cache line
        BIS other fifty-six bytes of cache line (treated as normal stores)
        advance load/store pointers by 64 bytes
    }
}
... extra finish up loop for final data less than CHUNK size
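The store ordering described above can be sketched in portable C. This is an illustration only, not the actual M7 memcpy. BIS is a SPARC-specific store and cannot be expressed in portable C, so ordinary stores (via memcpy of 8 and 56 bytes) stand in for it. The sketch therefore shows the access order, not the performance effect. The name chunked_copy and the PREFETCH macro are hypothetical, and __builtin_prefetch is a GCC/Clang builtin.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LINE  64   /* cache line size in bytes (from the note) */
#define CHUNK 20   /* cache lines per chunk (from the note) */

#if defined(__GNUC__)
#define PREFETCH(p) __builtin_prefetch(p)
#else
#define PREFETCH(p) ((void)0)
#endif

/* Hypothetical helper: copy n bytes using the two-pass chunk ordering. */
static void *chunked_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Move data until the store pointer is on a cache line boundary. */
    while (n > 0 && ((uintptr_t)d % LINE) != 0) {
        *d++ = *s++;
        n--;
    }
    assert(n == 0 || ((uintptr_t)d % LINE) == 0);

    while (n > (size_t)CHUNK * LINE) {
        /* Pass 1: first 8-byte element of each cache line in the chunk. */
        for (size_t i = 0; i < CHUNK; i++) {
            PREFETCH(s + i * LINE);
            memcpy(d + i * LINE, s + i * LINE, 8);
        }
        /* Pass 2: back at the start of the chunk, the other 56 bytes. */
        for (size_t i = 0; i < CHUNK; i++) {
            PREFETCH(s + i * LINE + 8);
            memcpy(d + i * LINE + 8, s + i * LINE + 8, LINE - 8);
        }
        d += CHUNK * LINE;
        s += CHUNK * LINE;
        n -= CHUNK * LINE;
    }

    /* Finish-up for the remaining data. */
    memcpy(d, s, n);
    return dst;
}
```

Calling chunked_copy(dst, src, len) should leave dst byte-identical to what memcpy would produce. Only the order in which the bytes are written differs.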
Is it possible to enhance arraycopy to use this algorithm for M/T7?