GCC 11 - Tuning for SVE cores

  • Micro-architectural tuning for Fujitsu A64FX and Arm Neoverse V1 cores
  • SVE intrinsics code-gen improvement
  • Auto vectorization improvements


  • Tune SVE code to specific cores
    • -mcpu=a64fx compiles code specifically for Fujitsu A64FX cores and tunes SVE code for the A64FX micro-architecture. 
    • -mcpu=neoverse-v1 compiles code specifically for Arm Neoverse V1 cores and tunes SVE code for the Neoverse V1 micro-architecture. 
    • Arm strongly recommends using -mcpu=<core> if the target hardware is known.
      • This allows GCC to use all of the available architecture extensions and to optimize for the core's micro-architecture.
      • If code is built and run on the same machine, the easiest way to get the best code is to use -mcpu=native.
  • Use more SVE and SVE2 instructions for auto-vectorization, including:
    • FCADD and FCMLA (SVE)
    • CADD and CMLA (SVE2)
  • Improve SVE instruction selection, including:
    • Remove redundant PTEST instructions. This affects both PTESTs in auto-vectorized code and PTESTs created by svptest intrinsic functions. It significantly improves the code generated for calls to the svrdffr and svrdffr_z intrinsics when they are followed by an svptest of the result.
    • Improve the addressing mode choices for the prefetch intrinsic functions (svprfb, svprfh, svprfw and svprfd).
  • Improve support for “unpacked” integer vector operations, where integer vector elements are stored in wider containers.
    • For example, if a loop is operating on both 32-bit and 64-bit integers, it is sometimes better to operate on 32-bit integers stored in 64-bit containers.
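
The "unpacked" case above can be pictured with a loop like the following. This is only an illustrative sketch: the function name accumulate_mixed and its parameters are hypothetical, and whether GCC actually keeps the 32-bit values in 64-bit containers depends on the GCC version, the flags (for example -O3 with a suitable -mcpu) and the cost model.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical kernel that mixes 32-bit and 64-bit integer work in one loop.
 * A vectorizer can either pack the 32-bit elements tightly and unpack them
 * later, or hold each 32-bit element in a 64-bit container so that both
 * element sizes use the same number of lanes per vector. */
void accumulate_mixed(uint64_t *restrict acc, const uint32_t *restrict x,
                      const uint32_t *restrict y, size_t n)
{
    assert(n == 0 || (acc && x && y));
    for (size_t i = 0; i < n; i++) {
        uint32_t s = x[i] + y[i]; /* 32-bit add, wraps modulo 2^32 */
        acc[i] += s;              /* 64-bit add after zero-extension */
    }
}
```

Inspecting the generated assembly (for example with -S) is the only reliable way to see which strategy the compiler chose for a given build.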

GCC 10 - SVE2 and SVE/2 intrinsic support

  • SVE2 intrinsic and auto-vectorization
  • SVE intrinsic support
  • SVE code-gen improvements
  • Use some additional SVE instructions, including:
    • FABD, SDOT and UDOT
    • Extending loads and truncating stores (including gathers and scatters), as well as SXT[BHW] and UXT[BHW]
  • Use some SVE2 instructions, including:
    • B/T pairs for multiply high with rounding
    • EOR3, BSL, BSL1N, BSL2N and NBSL
    • SRA
  • Improve SVE instruction selection, including:
    • Prefer to use FP registers over core registers for vector+scalar forms (e.g. in CLASTA/B)
    • Make more use of MOVPRFX
    • Make more use of reversed instructions
    • Make more use of immediate forms
    • Improve handling of complicated constants
    • Reduce the number of redundant predicates (e.g. by sharing PTRUEs)
    • Add more constant folds (e.g. constant WHILEs -> PTRUE VLn)
    • Use UABD + UDOT for the sum of absolute differences
    • Try to use 32-bit forms of WHILELO where possible
  • Make more use of predication, including:
    • Support more forms of conditional arithmetic
    • Support predicated FADDA
    • Support predicated dot product and sum of absolute differences
  • Improve length-specific code generation, including:
    • Make -msve-vector-bits=128 generate length-specific code (little-endian only)
    • Optimize the construction of length-specific vectors; previously this often went via the stack
  • Improve the vectorization technique used, including:
    • Make runtime alias checks less expensive and use SVE2 WHILERW and WHILEWR where possible
    • Try vectorizing with and without unpacked vectors, extending loads and truncating stores, and pick the best one
    • Try vectorizing with SVE and Advanced SIMD and pick the best one. This is mostly useful for base SVE vs. Advanced SIMD (where SVE2 isn't available) or for -msve-vector-bits=128
    • Try using SVE to vectorize the scalar tail of an Advanced SIMD loop
    • Further improve the cost model, e.g. to account better for the overhead of using multiple loop predicates
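
As an illustration of the sum-of-absolute-differences item above, the loop below has the shape that the UABD + UDOT selection is aimed at. It is a sketch under assumptions: the name sum_abs_diff_u8 is hypothetical, and the instructions GCC really emits depend on the version, the options used and the cost model.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sum-of-absolute-differences kernel over unsigned bytes.
 * Each absolute difference fits in 8 bits; the running sum is 32 bits wide. */
uint32_t sum_abs_diff_u8(const uint8_t *a, const uint8_t *b, size_t n)
{
    assert(n == 0 || (a && b));
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        int d = a[i] - b[i];               /* promoted to int: -255..255 */
        sum += (uint32_t)(d < 0 ? -d : d); /* accumulate |a[i] - b[i]| */
    }
    return sum;
}
```

The byte-sized differences feeding a wider accumulator are what make a dot-product style accumulation possible here.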

GCC 9 - SVE auto-vectorization improvements

  • Bug fixes 
  • Minor improvements to SVE auto-vectorization
  • Use some additional SVE instructions, including:
    • SDIV and UDIV
  • Improve SVE instruction selection, including:
    • Make better choices between FMLA, FMLS, FNMLA and FNMLS
  • Make more use of predication, including:
    • Support conditional comparisons and some conditional arithmetic
    • Support FMLA reductions in predicated loops
  • Improve length-agnostic code generation, including:
    • Handle more types of permutation
    • Make more use of "loop-aware SLP" (roughly equivalent to unroll + SLP)
  • Improve the vectorization technique used, including: 
    • Use runtime loop versioning to avoid gathers and scatters for strides that might be 1
    • Prevent integer types from being widened more than they need to be (e.g. all the way to int, because of C's promotion rules)
  • Improve the cost model, to stop SVE from being used in expensive ways, especially for low iteration counts
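
The runtime loop versioning item above concerns loops like the following, where the stride is only known at run time. This is a minimal sketch: the name double_strided and its parameters are hypothetical, and whether GCC versions the loop for a stride of 1 depends on the version, the flags and the cost model.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel reading src with a run-time stride.
 * For a general stride the vector load is a gather. If the compiler adds a
 * run-time check for stride == 1, that version of the loop can use ordinary
 * contiguous loads instead. */
void double_strided(float *restrict dst, const float *restrict src,
                    size_t n, size_t stride)
{
    assert(n == 0 || (dst && src));
    for (size_t i = 0; i < n; i++)
        dst[i] = 2.0f * src[i * stride];
}
```

Callers that always pass a stride of 1 may be better served by a separate function without the stride parameter, which removes the need for any run-time check.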

GCC 8 - SVE basic auto-vectorization

  • SVE basic auto-vectorization
  • No intrinsic support

GCC 7 and earlier - No SVE/2 support

No support for SVE or SVE2