Summary
Provide a second iteration of an incubator module, jdk.incubator.vector, to express vector computations that reliably compile at runtime to optimal vector hardware instructions on supported CPU architectures and thus achieve superior performance to equivalent scalar computations.
History
The Vector API was first proposed by JEP 338 and was integrated into Java 16 as an incubating API. This JEP proposes to incorporate Vector API enhancements based on feedback, performance improvements, and significant implementation enhancements, such as optimizing masked vector operations on supporting hardware.
Goals
Minor enhancements to the Vector API.
Integration of intrinsics based on Intel's Short Vector Math Library (SVML), supporting optimized transcendental vector operations on Intel x64 architectures.
Support the Vector API on the ARM Scalable Vector Extension (SVE) architecture.
Optimal implementation of Vector API mask operations in HotSpot for ARM SVE and Intel AVX-512 architectures that support vector mask registers.
Motivation
The primary motivation of the Vector API remains unchanged, as described in JEP 338.
This JEP has three specific motivations. The first is to improve the Vector API by incorporating feedback, which involves some minor enhancements and adjustments. The second is to improve the performance of the Vector API with enhancements to HotSpot, specifically enhancing vector support in the C2 runtime compiler and for the existing supported architectures, Intel x64 and ARM NEON. Where possible, this may also enhance, or enable future enhancements to, HotSpot's auto-vectorizer. The third is to broaden the support of the Vector API on new CPU architectures, specifically support for ARM SVE.
Description
API enhancements
The following API enhancements are proposed:
Loading and storing ShortVector from and to char[] arrays. This is preferable to adding a new kind of Vector whose element type is char. These methods are useful for data-parallel algorithms on characters, such as parsing UTF-8 encoded characters into a char[] array.
Unsigned comparison operators for vectors whose element type is an integral primitive type, such as short and int. Such functionality is complementary to the prior API enhancement, since char values are unsigned. Unsigned comparison operators will be added, as static fields on VectorOperators of type VectorOperators.Comparison, that complement the existing signed comparison operators. The Vector comparison implementations and HotSpot will be updated to support the unsigned comparison operators.
Loading and storing ByteVector from and to boolean[] arrays. The byte elements of a ByteVector will be normalized to 0 and 1 before storing.
Modify the behavior of Vector.rearrange to wrap around any exceptional source indexes in the VectorShuffle argument, rather than throwing an exception on any exceptional source indexes. In addition, the implementation of VectorShuffle.checkIndex will be optimized. This ensures that Vector.rearrange can produce efficient code, and the developer can, for an increased cost, explicitly check for exceptional source indexes. (See JDK-8262985 for further details.)
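To make the lane-level behavior of two of these enhancements concrete, the following is a minimal scalar sketch in plain Java rather than a use of the Vector API itself. The class and method names are hypothetical. It assumes that char values occupy short lanes with their 16 bits unchanged, and that "wrap around" for rearrange means reducing each source index modulo the vector length; the JEP text does not spell out the exact rule, so treat that part as an assumption.

```java
import java.util.Arrays;

// Scalar model only; names are hypothetical and not part of jdk.incubator.vector.
final class ApiEnhancementSketch {

    // Models loading char[] into short lanes: the 16 bits are kept as-is.
    static short[] charsToShortLanes(char[] src) {
        short[] lanes = new short[src.length];
        for (int i = 0; i < src.length; i++) {
            lanes[i] = (short) src[i];
        }
        return lanes;
    }

    // Models a lanewise unsigned "less than" comparison producing a mask.
    static boolean[] unsignedLessThan(short[] a, short[] b) {
        boolean[] mask = new boolean[a.length];
        for (int i = 0; i < a.length; i++) {
            mask[i] = Short.toUnsignedInt(a[i]) < Short.toUnsignedInt(b[i]);
        }
        return mask;
    }

    // Models rearrange with wrapping source indexes.
    // Assumption: wrapping is index modulo lane count (Math.floorMod).
    static int[] rearrangeWrapping(int[] lanes, int[] sourceIndexes) {
        int[] out = new int[lanes.length];
        for (int i = 0; i < lanes.length; i++) {
            out[i] = lanes[Math.floorMod(sourceIndexes[i], lanes.length)];
        }
        return out;
    }

    public static void main(String[] args) {
        short[] a = charsToShortLanes(new char[] {'A', (char) 0xFFFF});
        short[] b = charsToShortLanes(new char[] {'B', 'A'});
        System.out.println(Arrays.toString(unsignedLessThan(a, b)));
        System.out.println(Arrays.toString(
            rearrangeWrapping(new int[] {10, 20, 30, 40}, new int[] {0, 5, -1, 2})));
    }
}
```

The second lane in the comparison shows why a signed operator is not enough for character data: the char value 0xFFFF is -1 when read as a signed short, so a signed comparison would report it as smaller than 'A'.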
Implementation enhancements
Implementation enhancements are detailed in the following subsections.
Intel SVML intrinsics
The Vector API supports transcendental and trigonometric lanewise operations. Currently, such operations are not optimized, since there are no associated vector hardware instructions available, nor intrinsic implementations consisting of vector hardware instructions.
For x86, the Intel Short Vector Math Library (SVML) can be leveraged to provide optimized intrinsic implementations for such operations.
The assembly source files of SVML operations are placed in the jdk.incubator.vector module under OS-specific directories. The JDK build process compiles the assembly source files for the target OS platform into an SVML-specific shared object library. Note that if a JDK image built using jlink omits the jdk.incubator.vector module, then the SVML library will not be present in that image.
The supported OS platforms are Linux and Windows. macOS support will be considered later, since it is a non-trivial amount of work to provide assembler source files with the required OS-specific directives.
The HotSpot runtime will attempt to load the SVML library and, if it is present, bind the operations in the SVML library to named stub routines. The C2 compiler generates code that calls the appropriate stub routine based on the operation and vector species (element type and shape).
ARM SVE
The C2 compiler is enhanced to support the Vector API on ARM SVE. Such support will leverage general ARM SVE support in C2, which is proposed and integrated separately from this JEP.
Masking
Vector operations that accept masks are not optimally supported on architectures that support masking in hardware. Currently, such operations are implemented by composing the non-masked operation with a blend operation. For example, the masked lanewise operation on DoubleVector is implemented as follows:
@ForceInline
public final
DoubleVector lanewise(VectorOperators.Binary op,
                      Vector<Double> v,
                      VectorMask<Double> m) {
    return blend(lanewise(op, v), m);
}
On hardware that supports mask registers, such as AVX-512 and SVE, the blend operation is not required. Instead, the mask m can be compiled to a mask register, and the vector operation compiled to a vector hardware instruction that operates with the mask register.
For example, consider the following code that loads a vector and mask, then performs a masked lanewise operation:
var vec = IntVector.fromArray(SPECIES_512, int_arr, 0);
var mask = VectorMask.fromArray(SPECIES_512, mask_arr, 0);
var res = vec.lanewise(VectorOperators.ABS, mask);
On AVX-512 hardware the sequence of instructions generated by C2 is:
// LoadVector (IntVector.fromArray)
vmovdqu32 0x10(%r9),%zmm0
// LoadVector (VectorMask.fromArray)
vmovdqu 0x10(%r12,%r8,8),%xmm1
// AbsV (IntVector.lanewise)
vpabsd %zmm0,%zmm2
// VectorLoadMask (VectorMask.fromArray)
vpxord %zmm3,%zmm3,%zmm3
vpsubb %zmm1,%zmm3,%zmm3
vpmovsxbd %xmm3,%zmm3
// VectorBlend (IntVector.blend)
vpcmpeqd -0xeb539(%rip),%zmm3,%k7
vpblendmd %zmm2,%zmm0,%zmm0{%k7}
With hardware masking support the ideal sequence of instructions generated is:
// LoadVector (IntVector.fromArray)
vmovdqu32 0x10(%r9),%zmm1
// LoadVector (VectorMask.fromArray)
vmovdqu 0x10(%r12,%r8,8),%xmm0
// VectorLoadMask (VectorMask.fromArray)
vpcmpb $0x0,-0xee9e1(%rip),%xmm0,%k7
// VectorMaskedOper(IntVector.lanewise)
vpabsd %zmm1,%zmm1{%k7}
A predicated vector hardware instruction is generated using a hardware mask register. Fewer instructions are generated and performance is improved.
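The equivalence between the blend-composed form and the predicated form can be illustrated with a scalar model. The sketch below is plain Java with hypothetical names, not the Vector API or HotSpot code: the first method computes the operation on every lane and then selects per lane, as the blend composition does, while the second applies the operation only where the mask is set, as a predicated instruction would.

```java
import java.util.Arrays;

// Scalar model only; names are hypothetical.
final class MaskedAbsSketch {

    // Blend-composed form: unmasked ABS on all lanes, then a per-lane select.
    static int[] absViaBlend(int[] vec, boolean[] mask) {
        int[] all = new int[vec.length];
        for (int i = 0; i < vec.length; i++) {
            all[i] = Math.abs(vec[i]);
        }
        int[] res = new int[vec.length];
        for (int i = 0; i < vec.length; i++) {
            // Take the computed lane where the mask is set, else the original lane.
            res[i] = mask[i] ? all[i] : vec[i];
        }
        return res;
    }

    // Predicated form: ABS is applied only to lanes whose mask bit is set.
    static int[] absPredicated(int[] vec, boolean[] mask) {
        int[] res = vec.clone();
        for (int i = 0; i < res.length; i++) {
            if (mask[i]) {
                res[i] = Math.abs(res[i]);
            }
        }
        return res;
    }

    public static void main(String[] args) {
        int[] vec = {-1, 2, -3, -4};
        boolean[] mask = {true, false, true, false};
        System.out.println(Arrays.toString(absViaBlend(vec, mask)));
        System.out.println(Arrays.toString(absPredicated(vec, mask)));
    }
}
```

Both methods return the same lanes; the difference the JEP targets is the amount of work, since the predicated form needs neither the intermediate result nor the separate select step.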
The Vector API implementation and generic components of C2 are enhanced to support efficient masked operations, rather than composing explicitly using blend. In addition, special attention will be required for loads and stores of vectors to ensure no out-of-bounds access occurs.
Such support will leverage general enhancements to HotSpot for mask registers and their allocation, which are proposed and integrated separately from this JEP (see JDK-8262355).
Care is taken to ensure C2's masking support allows for efficient generation of code on AVX-512 and SVE, requiring a common intermediate representation that is expressive enough to abstract over the underlying architectural differences.
Further, care is taken to ensure masking support does not unduly increase the following: the number of instruction selection patterns; the size of ad files; and the size of the resulting libjvm shared library.
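The out-of-bounds concern for masked loads and stores mentioned above can be sketched with a scalar model. This is plain Java with hypothetical names, not HotSpot's implementation: each lane's array access is guarded by its mask bit, so lanes that would fall outside the array are never read or written. Here the mask is derived from the array bounds, and lanes that are not loaded are set to zero.

```java
import java.util.Arrays;

// Scalar model only; names are hypothetical.
final class MaskedLoadSketch {

    // Loads laneCount lanes starting at offset; out-of-range lanes are not read.
    static int[] loadInRange(int[] src, int offset, int laneCount) {
        int[] lanes = new int[laneCount];
        for (int i = 0; i < laneCount; i++) {
            int idx = offset + i;
            boolean inRange = idx >= 0 && idx < src.length; // per-lane mask bit
            lanes[i] = inRange ? src[idx] : 0;
        }
        return lanes;
    }

    // Stores lanes starting at offset; out-of-range lanes are not written.
    static void storeInRange(int[] lanes, int[] dst, int offset) {
        for (int i = 0; i < lanes.length; i++) {
            int idx = offset + i;
            if (idx >= 0 && idx < dst.length) {
                dst[idx] = lanes[i];
            }
        }
    }

    public static void main(String[] args) {
        int[] src = {1, 2, 3, 4, 5};
        // Tail of the array: only one of the four lanes is in range.
        System.out.println(Arrays.toString(loadInRange(src, 4, 4)));
    }
}
```

A typical use is the final iteration of a loop over an array whose length is not a multiple of the vector length, where an unguarded full-width access would run past the end of the array.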
Testing
Existing tests will be updated to test enhancements to the Vector API.
Existing tests are considered sufficient to cover enhancements to HotSpot. Testing on ARM SVE and AVX-512 hardware will be aided by the contributors, since such hardware may not be widely available.
Risks and Assumptions
Two features may be deferred to a future JEP if they are not ready in a timely manner and risk delaying the progress of this JEP and its other features. Specifically, if masking and/or ARM SVE are not considered ready, then this JEP will be updated to remove related details.
Issue Links
relates to
- JDK-8262356 Optimize existing masked operation support for AVX-512 (Open)
- JDK-8264563 Add masked vector intrinsics for binary/store operations (Open)
- JDK-8262355 Support for AVX-512 opmask register allocation (Resolved)
- JDK-8262644 Arm SVE scalable predicate register allocation support (Open)
- JDK-8264954 unified handling for VectorMask object re-materialization during de-optimization (Closed)