
DGEMM parallel performance

Benchmark run on 64-core AWS Graviton 2 (Neoverse N1-based) SoC.

Arm Performance Libraries delivers the most consistent performance across the range of problem sizes, compared with OpenBLAS and BLIS.

OpenBLAS built from git hash 1ef97c47 (February 28, 2022) using target Neoverse N1 with GCC 11.2.

BLIS built from git hash 84732bf9 (February 28, 2022) using target ThunderX2 with GCC 11.2.
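For readers who want to put their own timings on a comparable scale, the following is a minimal sketch of how a DGEMM run is commonly converted to GFLOP/s. It assumes the conventional count of 2·m·n·k floating-point operations for an m-by-k times k-by-n product; the source does not state which metric or flop count the graph uses, and the helper names `dgemm_gflops` and `time_naive_matmul` are hypothetical. The pure-Python triple loop only stands in for a real BLAS call so that the example is self-contained.

```python
import time


def dgemm_gflops(m, n, k, seconds):
    """Return GFLOP/s for an m-by-k times k-by-n product.

    Assumes the conventional 2*m*n*k flop count (one multiply and one
    add per inner-product term). This is an assumption, not a detail
    taken from the benchmark above.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return 2.0 * m * n * k / seconds / 1e9


def time_naive_matmul(n):
    """Time a naive n-by-n matrix product in pure Python.

    A stand-in for a real DGEMM call, used only to produce a timing.
    """
    a = [[1.0] * n for _ in range(n)]
    b = [[2.0] * n for _ in range(n)]
    c = [[0.0] * n for _ in range(n)]
    start = time.perf_counter()
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for p in range(n):
                acc += a[i][p] * b[p][j]
            c[i][j] = acc
    elapsed = time.perf_counter() - start
    return c, elapsed


if __name__ == "__main__":
    size = 64
    _, elapsed = time_naive_matmul(size)
    print(f"n={size}: {dgemm_gflops(size, size, size, elapsed):.4f} GFLOP/s")
```

To measure an optimized library instead, replace the body of `time_naive_matmul` with a call into that library and keep the same conversion.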


Cholesky factorization parallel performance

Benchmark run on 64-core AWS Graviton 2 (Neoverse N1-based) SoC.

OpenBLAS built from git hash 1ef97c47 (February 28, 2022) using target Neoverse N1 with GCC 11.2.


Interleave-batch routine speedup

Benchmark run on 64-core AWS Graviton 2 (Neoverse N1-based) SoC.

Arm Performance Libraries includes optimized implementations for batches of important linear algebra operations.

See this Arm Community blog for details.
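As a rough illustration of what an interleaved batch layout can look like, the sketch below packs a batch of small, equally sized matrices so that the same element of every matrix sits contiguously in memory. That contiguity is what lets one vector instruction work on several matrices at once. This is a generic, assumed layout with a hypothetical `interleave` helper; it is not the Arm Performance Libraries API and does not reproduce its stride parameters.

```python
def interleave(batch):
    """Pack equally sized m-by-n matrices into one flat list.

    Element (i, j) of matrix b is stored at index
    (j * m + i) * nbatch + b, i.e. column-major within each matrix,
    with the batch index varying fastest. The layout is an assumption
    chosen for illustration only.
    """
    nbatch = len(batch)
    m = len(batch[0])
    n = len(batch[0][0])
    flat = [0.0] * (m * n * nbatch)
    for b, mat in enumerate(batch):
        for i in range(m):
            for j in range(n):
                flat[(j * m + i) * nbatch + b] = mat[i][j]
    return flat


if __name__ == "__main__":
    two_matrices = [
        [[1, 2], [3, 4]],
        [[5, 6], [7, 8]],
    ]
    print(interleave(two_matrices))
```

With this layout, a loop over the batch index reads consecutive memory locations, whereas storing each matrix separately would scatter those same elements across the batch.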

FFT performance


Benchmark run on 64-core AWS Graviton 2 (Neoverse N1-based) SoC.

FFTW 3.3.10 built with GCC 11.2 using configuration options --enable-single --enable-neon --enable-fma.

Sparse matrix-matrix multiplication parallel performance


Benchmark run on 64-core AWS Graviton 2 (Neoverse N1-based) SoC.


Math functions (libamath) performance

Benchmark run on 64-core AWS Graviton 2 (Neoverse N1-based) SoC.

Note: All data was generated on AWS Graviton 2 (Neoverse N1-based), using the 'c6gd.16xlarge' instance type.