JDK-8261663

Vector API (Second Incubator)

    Details

    • Type: JEP
    • Status: Draft
    • Priority: P4
    • Resolution: Unresolved
    • Fix Version/s: None
    • Component/s: hotspot
    • Labels:
      None
    • Author:
      Paul Sandoz
    • JEP Type:
      Feature
    • Exposure:
      Open
    • Subcomponent:
    • Scope:
      JDK
    • Discussion:
      panama dash dev at openjdk dot java dot net
    • Effort:
      M
    • Duration:
      M

      Description

      Summary

      Provide a second iteration of an incubator module, jdk.incubator.vector, to express vector computations that reliably compile at runtime to optimal vector hardware instructions on supported CPU architectures and thus achieve superior performance to equivalent scalar computations.

      History

      The Vector API was first proposed by JEP 338 and was integrated into Java 16 as an incubating API. This JEP proposes to incorporate Vector API enhancements based on feedback, performance improvements, and significant implementation enhancements, such as optimizing masked vector operations on supporting hardware.

      Goals

      • Minor enhancements to the Vector API.

      • Integration of intrinsics for Intel's Short Vector Math Library (SVML) operations, supporting optimized transcendental vector operations on Intel x64 architectures.

      • Support the Vector API on the ARM Scalable Vector Extension (SVE) architecture.

      • Optimal implementation of Vector API mask operations in HotSpot for ARM SVE and Intel AVX-512 architectures that support vector mask registers.

      Motivation

      The primary motivation of the Vector API remains unchanged, as described in JEP 338.

      This JEP has three specific motivations. The first is to improve the Vector API by incorporating feedback, which involves some minor enhancements and adjustments. The second is to improve the performance of the Vector API with enhancements to HotSpot, specifically by enhancing vector support in the C2 runtime compiler and for the existing supported architectures, Intel x64 and ARM Neon. Where possible this may also enhance, or enable future enhancements to, HotSpot's auto-vectorizer. The third is to broaden support for the Vector API to new CPU architectures, specifically ARM SVE.

      Description

      API enhancements

      The following API enhancements are proposed:

      • Loading and storing ShortVector from and to char[] arrays. This is preferable to adding a new kind of Vector whose element type is char. These methods are useful for data parallel algorithms on characters, such as parsing UTF-8 encoded characters into a char[] array.

      • Unsigned comparison operators for vectors whose element type is an integral primitive type, such as short and int. Such functionality is complementary to the prior API enhancement, since char values are unsigned. Unsigned comparison operators will be added, as static fields on VectorOperators of type VectorOperators.Comparison, that complement the existing signed comparison operators. The Vector comparison implementations and HotSpot will be updated to support the unsigned comparison operators.

      • Loading and storing ByteVector from and to boolean[] arrays. The byte elements of a ByteVector will be normalized to 0 and 1 before storing.

      • Modify the behavior of Vector.rearrange to wrap around any exceptional source indexes in the VectorShuffle argument, rather than throwing an exception. In addition, the implementation of VectorShuffle.checkIndex will be optimized. This ensures that Vector.rearrange can produce efficient code, while the developer can, for an increased cost, explicitly check for exceptional source indexes. (See JDK-8262985 for further details.)
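
      The lane semantics behind two of these enhancements can be modeled with plain scalar Java. The sketch below is not the Vector API itself: the class and method names are hypothetical, and the wrap-around rule is assumed to be a reduction of the index modulo the vector length, which this document does not spell out. It shows why a char value stored in a short lane needs an unsigned comparison, and what wrapping an exceptional index could look like.

```java
import java.util.Arrays;

// Scalar model of lane semantics; hypothetical names, not the Vector API.
final class LaneSemanticsSketch {

    // Signed per-lane "less than", shown for contrast.
    static boolean[] signedLessThan(short[] a, short[] b) {
        boolean[] mask = new boolean[a.length];
        for (int i = 0; i < a.length; i++) {
            mask[i] = a[i] < b[i];
        }
        return mask;
    }

    // Unsigned per-lane "less than": each short is treated as a 16-bit
    // unsigned quantity, which matches the ordering of char values.
    static boolean[] unsignedLessThan(short[] a, short[] b) {
        boolean[] mask = new boolean[a.length];
        for (int i = 0; i < a.length; i++) {
            mask[i] = Short.toUnsignedInt(a[i]) < Short.toUnsignedInt(b[i]);
        }
        return mask;
    }

    // Rearrange with wrap-around. ASSUMPTION: an out-of-range index is
    // reduced modulo the source length instead of causing an exception.
    static int[] rearrangeWrapping(int[] source, int[] indexes) {
        int[] result = new int[indexes.length];
        for (int i = 0; i < indexes.length; i++) {
            result[i] = source[Math.floorMod(indexes[i], source.length)];
        }
        return result;
    }

    public static void main(String[] args) {
        short[] a = {(short) 'a', (short) 0xFFFF};
        short[] b = {(short) 'b', (short) 'b'};
        System.out.println(Arrays.toString(signedLessThan(a, b)));
        System.out.println(Arrays.toString(unsignedLessThan(a, b)));
        System.out.println(Arrays.toString(
                rearrangeWrapping(new int[] {10, 20, 30, 40}, new int[] {3, -1, 0, -3})));
    }
}
```

      The second lane in the comparison holds the char value 0xFFFF, which is negative when read as a signed short, so the signed and unsigned results differ for that lane only.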

      Implementation enhancements

      Implementation enhancements are detailed in the following subsections.

      Intel SVML intrinsics

      The Vector API supports transcendental and trigonometric lanewise operations. Currently, such operations are not optimized, since there are no associated vector hardware instructions available, nor intrinsic implementations consisting of vector hardware instructions.

      For x86, the Intel Short Vector Math Library (SVML) can be leveraged to provide optimized intrinsic implementations for such operations.

      The assembly source files of SVML operations are placed in the jdk.incubator.vector module under OS-specific directories. The JDK build process compiles the assembly source files for the target OS platform into an SVML-specific shared object library. Note that if a JDK image that omits the jdk.incubator.vector module is built using jlink, then the SVML library will not be present in that image.

      The supported OS platforms are Linux and Windows. macOS support will be considered later, since it is a non-trivial amount of work to provide assembler source files with the required OS-specific directives.

      The HotSpot runtime will attempt to load the SVML library and, if it is present, bind the operations in the SVML library to named stub routines. The C2 compiler generates code that calls the appropriate stub routine based on the operation and vector species (element type and shape).
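
      As a purely conceptual illustration of this lookup, the following Java sketch models a table of routines keyed by operation, element type, and shape, with a scalar fallback when no routine is bound. Everything in it is hypothetical: the key format, the class, and the method names are invented for illustration and do not reflect HotSpot's actual stub names or data structures.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.DoubleUnaryOperator;

// Conceptual model only; the key format and all names are hypothetical
// and do not reflect HotSpot's real stub names or data structures.
final class StubDispatchSketch {
    private final Map<String, DoubleUnaryOperator> stubs = new HashMap<>();

    // Builds a lookup key from the operation and the species (type + shape).
    static String key(String op, String elementType, int shapeBits) {
        return op + "_" + elementType + "_" + shapeBits;
    }

    // Models binding a library operation to a named routine.
    void bind(String op, String elementType, int shapeBits, DoubleUnaryOperator routine) {
        stubs.put(key(op, elementType, shapeBits), routine);
    }

    boolean isBound(String op, String elementType, int shapeBits) {
        return stubs.containsKey(key(op, elementType, shapeBits));
    }

    // Applies the bound routine to every lane, or the scalar fallback
    // when nothing is bound for this operation and species.
    double[] apply(String op, String elementType, int shapeBits,
                   double[] lanes, DoubleUnaryOperator scalarFallback) {
        DoubleUnaryOperator routine =
                stubs.getOrDefault(key(op, elementType, shapeBits), scalarFallback);
        double[] result = new double[lanes.length];
        for (int i = 0; i < lanes.length; i++) {
            result[i] = routine.applyAsDouble(lanes[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        StubDispatchSketch table = new StubDispatchSketch();
        table.bind("sin", "double", 512, Math::sin);
        System.out.println(table.isBound("sin", "double", 512));
        System.out.println(table.isBound("sin", "double", 256));
    }
}
```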

      ARM SVE

      The C2 compiler is enhanced to support the Vector API on ARM SVE. Such support will leverage general ARM SVE support in C2, which is proposed and integrated separately from this JEP.

      Masking

      Vector operations that accept masks are not optimally supported on architectures that support masking in hardware. Currently, such operations are implemented by composing the non-masked operation with a blend operation. For example, the masked lanewise operation on DoubleVector is implemented as follows:

      @ForceInline
      public final
      DoubleVector lanewise(VectorOperators.Binary op,
                            Vector<Double> v,
                            VectorMask<Double> m) {
           return blend(lanewise(op, v), m);
      }

      On hardware that supports mask registers, such as AVX-512 and SVE, the blend operation is not required. Instead, the mask m can be compiled to a mask register, and the vector operation compiled to a vector hardware instruction that operates with the mask register.

      For example, consider the following code that loads a vector and mask, then performs a masked lanewise operation:

      var vec    = IntVector.fromArray(SPECIES_512, int_arr, 0);
      var mask   = VectorMask.fromArray(SPECIES_512, mask_arr, 0);
      var res    = vec.lanewise(VectorOperators.ABS, mask);

      On AVX-512 hardware the sequence of instructions generated by C2 is:

      // LoadVector (IntVector.fromArray)
      vmovdqu32 0x10(%r9),%zmm0          
      // LoadVector (VectorMask.fromArray)
      vmovdqu 0x10(%r12,%r8,8),%xmm1  
      // AbsV   (IntVector.lanewise)
      vpabsd %zmm0,%zmm2                    
      // VectorLoadMask (VectorMask.fromArray)
      vpxord %zmm3,%zmm3,%zmm3        
      vpsubb %zmm1,%zmm3,%zmm3       
      vpmovsxbd %xmm3,%zmm3               
      // VectorBlend (IntVector.blend)  
      vpcmpeqd -0xeb539(%rip),%zmm3,%k7 
      vpblendmd %zmm2,%zmm0,%zmm0{%k7}  

      With hardware masking support the ideal sequence of instructions generated is:

      // LoadVector (IntVector.fromArray)
      vmovdqu32 0x10(%r9),%zmm1 
      // LoadVector (VectorMask.fromArray) 
      vmovdqu 0x10(%r12,%r8,8),%xmm0
      // VectorLoadMask (VectorMask.fromArray)
      vpcmpb $0x0,-0xee9e1(%rip),%xmm0,%k7 
      // VectorMaskedOper(IntVector.lanewise)
      vpabsd %zmm1,%zmm1{%k7}

      A predicated vector hardware instruction is generated using a hardware mask register. Fewer instructions are generated and performance is improved.
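
      The difference between the two strategies can be modeled in scalar Java. This is a sketch of the lane semantics only, with hypothetical class and method names: the first method mirrors the compose-with-blend approach (compute every lane, then select), while the second mirrors a predicated instruction (touch only the lanes the mask enables). Both produce the same result; the second does so in a single pass.

```java
import java.util.Arrays;
import java.util.function.IntUnaryOperator;

// Scalar model of masked lanewise operations; hypothetical names.
final class MaskedLanewiseSketch {

    // Blend-based: apply the operation to all lanes, then use the mask to
    // choose between the computed lane and the original lane.
    static int[] lanewiseThenBlend(int[] v, boolean[] m, IntUnaryOperator op) {
        int[] unmasked = new int[v.length];
        for (int i = 0; i < v.length; i++) {
            unmasked[i] = op.applyAsInt(v[i]);
        }
        int[] result = new int[v.length];
        for (int i = 0; i < v.length; i++) {
            result[i] = m[i] ? unmasked[i] : v[i];
        }
        return result;
    }

    // Predicated: the operation is applied only where the mask is set;
    // unset lanes keep their original value.
    static int[] predicated(int[] v, boolean[] m, IntUnaryOperator op) {
        int[] result = v.clone();
        for (int i = 0; i < v.length; i++) {
            if (m[i]) {
                result[i] = op.applyAsInt(v[i]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[] v = {-1, -2, 3, -4};
        boolean[] m = {true, false, true, false};
        System.out.println(Arrays.toString(lanewiseThenBlend(v, m, Math::abs)));
        System.out.println(Arrays.toString(predicated(v, m, Math::abs)));
    }
}
```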

      The Vector API implementation and generic components of C2 are enhanced to support efficient masked operations, rather than composing explicitly using blend. In addition, special attention will be required for loads and stores of vectors to ensure no out-of-bounds access occurs. Such support will leverage general enhancements to HotSpot for mask registers and their allocation, which are proposed and integrated separately from this JEP (see JDK-8262355).
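
      To illustrate why masked loads need care, the scalar sketch below models processing an array in fixed-size chunks where the final chunk extends past the end of the array. The lane count, class, and method names are hypothetical. A range mask disables the lanes that fall outside the array, so those elements are never read.

```java
import java.util.Arrays;

// Scalar model of a masked tail load; LANES and all names are hypothetical.
final class TailMaskSketch {
    static final int LANES = 8;

    // Lane i is active only if offset + i is a valid index below length.
    static boolean[] indexInRange(int offset, int length) {
        boolean[] mask = new boolean[LANES];
        for (int lane = 0; lane < LANES; lane++) {
            mask[lane] = offset + lane < length;
        }
        return mask;
    }

    // Sums the array chunk by chunk; inactive lanes are skipped, so the
    // last chunk never reads past the end of the array.
    static int maskedSum(int[] data) {
        int sum = 0;
        for (int offset = 0; offset < data.length; offset += LANES) {
            boolean[] mask = indexInRange(offset, data.length);
            for (int lane = 0; lane < LANES; lane++) {
                if (mask[lane]) {
                    sum += data[offset + lane];
                }
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11};
        System.out.println(Arrays.toString(indexInRange(8, data.length)));
        System.out.println(maskedSum(data));
    }
}
```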

      Care is taken to ensure C2's masking support allows for efficient generation of code on AVX-512 and SVE, requiring a common intermediate representation that is expressive enough to abstract over the underlying architectural differences.

      Further, care is taken to ensure masking support does not unduly increase the following: the number of instruction selection patterns; the size of ad files; and the size of the resulting libjvm shared library.

      Testing

      Existing tests will be updated to test enhancements to the Vector API.

      Existing tests are considered sufficient to cover enhancements to HotSpot. Testing on ARM SVE and AVX-512 hardware will be aided by the contributors, since such hardware may not be widely available.

      Risks and Assumptions

      Two features may be deferred to a future JEP if they are not ready in a timely manner and risk delaying the progress of this JEP and its other features. Specifically, if masking and/or ARM SVE are not considered ready, then this JEP will be updated to remove related details.

              People

              Assignee:
              psandoz Paul Sandoz
              Reporter:
              psandoz Paul Sandoz
              Owner:
              Paul Sandoz Paul Sandoz
              Reviewed By:
              John Rose, Maurizio Cimadamore
              Endorsed By:
              John Rose, Maurizio Cimadamore