GCC 7 and earlier - No SVE/2 support

No support for SVE or SVE2

GCC 8 - SVE basic auto-vectorization

  • SVE basic auto-vectorization
  • No intrinsic support
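
Because there is no intrinsic support in this release, SVE can only be reached through ordinary loops that the compiler vectorizes itself. The sketch below shows the kind of loop auto-vectorization targets. The function name `scale_add` is hypothetical, and the flags in the comment (`-O3 -march=armv8.2-a+sve`) are one plausible way to enable SVE; whether a given loop is actually vectorized depends on the compiler version and its cost model.

```c
#include <stddef.h>

/* Hypothetical example: a plain C loop that an auto-vectorizer can target.
 * Possible build line (an assumption, adjust for your toolchain):
 *   gcc -O3 -march=armv8.2-a+sve -c scale_add.c
 * `restrict` promises that dst and src do not overlap, so the compiler
 * does not need a runtime alias check before vectorizing. */
void scale_add(float *restrict dst, const float *restrict src,
               float k, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] += k * src[i];
}
```

Adding `-fopt-info-vec` to the build line asks GCC to report which loops it vectorized.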

GCC 9 - SVE auto-vectorization improvements

  • Bug fixes 
  • Minor improvements to SVE auto-vectorization
  • Use some additional SVE instructions, including
    • SDIV and UDIV
  • Improve SVE instruction selection, including
    • Make better choices between FMLA, FMLS, FNMLA and FNMLS
  • Make more use of predication, including
    • Support conditional comparisons and some conditional arithmetic
    • Support FMLA reductions in predicated loops
  • Improve length-agnostic code generation, including
    • Handle more types of permutation
    • Make more use of "loop-aware SLP" (roughly equivalent to unroll + SLP)
  • Improve the vectorization technique used, including
    • Use runtime loop versioning to avoid gathers and scatters for strides that might be 1
    • Prevent integer types from being widened more than they need to be (e.g. all the way to int, because of C's promotion rules)
  • Improve the cost model, to stop SVE from being used in expensive ways, especially for low iteration counts
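
Several of the items above are easier to picture with source loops. The following is a minimal sketch of three loop shapes that correspond to conditional arithmetic, strides that might be 1, and narrow integer types. All function names are hypothetical, and the comments describe what the compiler may do, not what any particular GCC build is guaranteed to emit.

```c
#include <stddef.h>
#include <stdint.h>

/* Conditional arithmetic: the `if` can be turned into a predicate on the
 * add, so the loop body needs no branch. */
void add_where_positive(int32_t *restrict dst, const int32_t *restrict src,
                        size_t n)
{
    for (size_t i = 0; i < n; ++i)
        if (src[i] > 0)
            dst[i] += src[i];
}

/* Stride unknown at compile time: a general version needs gather loads,
 * but a runtime check for stride == 1 allows a contiguous-load version. */
void strided_scale(float *restrict dst, const float *restrict src,
                   size_t n, size_t stride)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = 2.0f * src[i * stride];
}

/* Narrow types: C promotes the uint8_t operands to int, but the sum
 * a + b + 1 fits in 9 bits, so 16-bit lanes are enough for the
 * intermediate value and 32-bit widening is unnecessary. */
void average_u8(uint8_t *restrict dst, const uint8_t *restrict a,
                const uint8_t *restrict b, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = (uint8_t)((a[i] + b[i] + 1) >> 1);
}
```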

GCC 10 - SVE2 and SVE/2 intrinsic support

  • SVE2 intrinsics and auto-vectorization
  • SVE intrinsic support
  • SVE code-gen improvements
  • Use some additional SVE instructions, including:
    • FABD, SDOT and UDOT
    • Extending loads and truncating stores (including gathers and scatters)
    • SXT[BHW] and UXT[BHW]
  • Use some SVE2 instructions, including:
    • B/T pairs for multiply high with rounding
    • EOR3, BSL, BSL1N, BSL2N and NBSL
    • SRA
  • Improve SVE instruction selection, including:
    • Prefer to use FP registers over core registers for vector+scalar forms (e.g. in CLASTA/B)
    • Make more use of MOVPRFX
    • Make more use of reversed instructions
    • Make more use of immediate forms
    • Improve handling of complicated constants
    • Reduce the number of redundant predicates (e.g. by sharing PTRUEs)
    • Add more constant folds (e.g. constant WHILEs -> PTRUE VLn)
    • Use UABD + UDOT for the sum of absolute differences
    • Try to use 32-bit forms of WHILELO where possible
  • Make more use of predication, including:
    • Support more forms of conditional arithmetic
    • Support predicated FADDA
    • Support predicated dot product and sum of absolute differences
  • Improve length-specific code generation, including:
    • Make -msve-vector-bits=128 generate length-specific code (little-endian only)
    • Optimize the construction of length-specific vectors; previously this often went via the stack
  • Improve the vectorization technique used, including:
    • Make runtime alias checks less expensive and use SVE2 WHILERW and WHILEWR where possible
    • Try vectorizing with and without unpacked vectors, extending loads and truncating stores, and pick the best one
    • Try vectorizing with SVE and Advanced SIMD and pick the best one; this is mostly useful for base SVE vs. Advanced SIMD (where SVE2 isn't available) or for -msve-vector-bits=128
    • Try using SVE to vectorize the scalar tail of an Advanced SIMD loop
    • Improve the cost model further, e.g. to better account for the overhead of using multiple loop predicates
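
As an illustration of two items from this release, the sketch below pairs a sum-of-absolute-differences loop (the pattern the list associates with UABD + UDOT) with a small length-agnostic loop written with the SVE intrinsics from `<arm_sve.h>`. The function names `sad_u8` and `scale_add_acle` are hypothetical. The intrinsic path is only compiled when `__ARM_FEATURE_SVE` is defined (for example with `-march=armv8.2-a+sve`); otherwise a scalar fallback is used, so the file builds on any target. This is one possible way to write such a loop, not the only one.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

/* Sum of absolute differences over bytes: a reduction loop that the
 * auto-vectorizer may recognize as a SAD pattern. */
uint32_t sad_u8(const uint8_t *a, const uint8_t *b, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += (uint32_t)abs(a[i] - b[i]);
    return sum;
}

/* dst[i] += k * src[i], written with SVE intrinsics when available. */
void scale_add_acle(float *dst, const float *src, float k, int64_t n)
{
#if defined(__ARM_FEATURE_SVE)
    /* svcntw() is the number of 32-bit lanes per vector, known only at
     * run time; svwhilelt_b32 builds a predicate that masks off lanes
     * at or beyond n, so no scalar tail loop is needed. */
    for (int64_t i = 0; i < n; i += (int64_t)svcntw()) {
        svbool_t pg = svwhilelt_b32(i, n);
        svfloat32_t x = svld1(pg, &src[i]);
        svfloat32_t y = svld1(pg, &dst[i]);
        svst1(pg, &dst[i], svmla_x(pg, y, x, k));
    }
#else
    /* Scalar fallback for targets without SVE. */
    for (int64_t i = 0; i < n; ++i)
        dst[i] += k * src[i];
#endif
}
```

Because the intrinsic loop never assumes a vector length, the same binary runs on any SVE implementation; compiling with -msve-vector-bits=128 instead fixes the length at build time.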