How to Use Arm SIMD to Achieve Huge Performance Gains
These educational materials are for native app developers who are familiar with C/C++ programming and have a basic knowledge of SIMD.
TOPIC 1
Scalable Matrix Extension (SME)
The latest Armv9 intrinsics for efficient matrix computation, used in AI and computer vision (CV).
TOPIC 2
Optimize with Arm SIMD
Optimize in Assembly and in C/C++ for Arm SIMD extensions.
TOPIC 3
Migrate from x86 and x64 to Arm Intrinsics
Convert x86 intrinsics code to Arm intrinsics.
TOPIC 4
Optimize Your Programs
Learn how to code better so compilers can auto-vectorize for you.
Scalable Matrix Extension (SME)
Learn how to use SME in your code—via intrinsics (C/C++) or in assembly—and explore several SME2 code examples.
- Get started using SME and SME2 in your code to enable matrix operations such as multiplication, inversion, and on-the-fly transposition. SME2 extends SME by introducing multi-vector data-processing instructions, load/store support for multi-vectors, and a multi-vector predication mechanism. Several SME2 code examples are included at the end.
SME Language Extensions and Intrinsics
- A glossary of SME keyword attributes, types, functions, and intrinsics.
SME introduction blogs:
- Part 1: Arm Scalable Matrix Extension (SME) Introduction - An overview of key SME features.
- Part 2: Arm Scalable Matrix Extension (SME) Introduction - A look at some of the instructions SME provides.
Accelerate Matrix Multiplication Performance With SME2
- A hands-on Learning Path to help you get started using SME2 in both assembly and intrinsics, featuring a matrix multiplication example.
Optimize with Arm SIMD
Learn how to optimize in assembly and in C/C++ using Neon, SVE, and SVE2 intrinsics. Arm intrinsics are a set of C/C++ functions whose precise implementation is known to Arm Compiler, GCC, and LLVM. LLVM (open-source Clang) supports SVE from version 5 onwards and SVE2 from version 9 onwards.
- The Arm intrinsics search engine can be filtered by SIMD ISA (Neon, SVE, SVE2, Helium), base type (floating point, integer, etc.), bit size, and architecture.
Optimizing C/C++ and Assembly Code with Arm SIMD
- The Neon Programmer's Guide, the Optimizing C Code with Neon Intrinsics guide, the SVE and SVE2 Programmer's Guide, and the SVE Optimization Guide explain how to use intrinsics in your C/C++ code to take advantage of SIMD in Armv8 and Armv9. For the Cortex-M IoT ecosystem, there is the Helium Programmer's Guide.
C/C++ Case Studies with Open-Source Libraries
- Optimizing the PNG image-processing library with Neon intrinsics, optimizing the TIFF image-processing library with Neon intrinsics, and accelerating DSP functions with the DOT instructions for the VP8 and VP9 video codecs on Arm Neoverse CPUs.
How to Vectorize Loops with Conditional Statements
- C compilers have limited ability to vectorize loops containing conditional statements. Learn how best to use Arm Neon intrinsics to get the most optimized code from C compilers.
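As a minimal sketch of the technique, the usual approach with Neon is to replace the branch with a per-lane mask and select: a vector compare builds the mask, and `vbslq_f32` picks one of two values per lane. The function name below is illustrative, not taken from the guide; a scalar tail handles leftover elements and doubles as a fallback on non-Neon builds.

```c
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Branch-free clamp of negatives to zero: out[i] = in[i] < 0 ? 0 : in[i].
   The conditional becomes a lane mask instead of a branch. */
void relu(float *out, const float *in, size_t n) {
    size_t i = 0;
#if defined(__ARM_NEON)
    float32x4_t zero = vdupq_n_f32(0.0f);
    for (; i + 4 <= n; i += 4) {
        float32x4_t v  = vld1q_f32(in + i);
        uint32x4_t neg = vcltq_f32(v, zero);        /* mask: lanes where v < 0 */
        vst1q_f32(out + i, vbslq_f32(neg, zero, v)); /* select zero where masked */
    }
#endif
    for (; i < n; i++)                               /* scalar tail / fallback */
        out[i] = in[i] < 0.0f ? 0.0f : in[i];
}
```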
Migrate from x86 and x64 to Arm Intrinsics
Learn about the different methods of porting existing x86 and x64 code to Arm SIMD, and get inspired by several case studies spanning cloud to edge.
Porting Architecture-Specific Intrinsics
- Learn about different libraries to migrate the x86 and x64 Intrinsics code to Arm intrinsics, and how to find intrinsics in large code bases.
- Vectorscan is a portable fork of Intel’s Hyperscan. Learn about the porting challenges and the success of the porting project.
Optimize with Arm Intrinsics for Android
- A wealth of resources on how to get started using Arm intrinsics (Neon and SVE2) with Android's NDK.
Porting to Arm Intrinsics with SIMDe
- A case study on how H.266 (VVenC and VVdeC) was converted from x86 and x64 to Arm Neon with SIMDe, achieving performance gains of over 200%.
Evaluating SSE-to-Neon and SIMDe Libraries
- Read about the considerations to weigh when deciding which library is best suited to your SIMD porting needs.
Porting Intel and AMD Intrinsics to Arm Neon Intrinsics
- A blog walking through the different porting options, with the pros and cons of each, when migrating x86 or x64 code to Arm intrinsics.
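One common porting pattern, sketched below under the assumption of a simple element-wise kernel (the function name is hypothetical): keep the x86 path and add a Neon path behind feature macros, so the same source builds on both architectures. SSE2's `_mm_add_epi32` maps directly to Neon's `vaddq_s32`; a scalar loop covers the tail and any other target.

```c
#include <stdint.h>
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#elif defined(__SSE2__)
#include <emmintrin.h>
#endif

/* dst[i] = a[i] + b[i], four 32-bit lanes per iteration on either ISA. */
void add_i32(int32_t *dst, const int32_t *a, const int32_t *b, size_t n) {
    size_t i = 0;
#if defined(__ARM_NEON)
    for (; i + 4 <= n; i += 4)
        vst1q_s32(dst + i, vaddq_s32(vld1q_s32(a + i), vld1q_s32(b + i)));
#elif defined(__SSE2__)
    for (; i + 4 <= n; i += 4)
        _mm_storeu_si128((__m128i *)(dst + i),
            _mm_add_epi32(_mm_loadu_si128((const __m128i *)(a + i)),
                          _mm_loadu_si128((const __m128i *)(b + i))));
#endif
    for (; i < n; i++)          /* tail and portable fallback */
        dst[i] = a[i] + b[i];
}
```

Header-only shims such as SIMDe or sse2neon automate exactly this kind of one-to-one mapping when rewriting by hand is impractical.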
Optimize Your Programs
Learn about tips and techniques to create better performing programs, so that today’s compilers can auto-vectorize leveraging Arm SIMD extensions.
Memory Aliasing and the “Restrict” Keyword
- Learn how to use the C "restrict" keyword correctly. When a compiler auto-vectorizes code, it first needs to be sure that doing so is safe.
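A minimal sketch of the idea (the function name is illustrative): without `restrict`, the compiler must assume `dst` might overlap `a` or `b` and either skip vectorization or emit runtime alias checks; qualifying the pointers removes that doubt.

```c
#include <stddef.h>

/* `restrict` promises that dst, a, and b never overlap, so the
   compiler can auto-vectorize the loop without alias checks. */
void vec_add(float *restrict dst,
             const float *restrict a,
             const float *restrict b,
             size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

The promise is one-way: if the caller passes overlapping buffers anyway, the behavior is undefined, which is why using `restrict` correctly matters.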
- Gain a better understanding of caches, prefetching, and data alignment on Arm platforms, and learn what a programmer can do to improve memory access times.
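As a small illustration of the cache-behavior point (not code from the guide): for a row-major C matrix, iterating rows in the outer loop gives unit-stride accesses that the hardware prefetcher handles well, while swapping the loops produces large strides that miss in cache. Both orders compute the same sum.

```c
#include <stddef.h>

/* Row-major matrix stored as m[r * cols + c]. */

/* Cache-friendly: the inner loop walks consecutive addresses. */
double sum_by_rows(const double *m, size_t rows, size_t cols) {
    double s = 0.0;
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            s += m[r * cols + c];   /* unit stride */
    return s;
}

/* Same result, but the inner loop strides by `cols` doubles,
   touching a new cache line almost every access on large matrices. */
double sum_by_cols(const double *m, size_t rows, size_t cols) {
    double s = 0.0;
    for (size_t c = 0; c < cols; c++)
        for (size_t r = 0; r < rows; r++)
            s += m[r * cols + c];   /* stride of cols elements */
    return s;
}
```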
Learn about Integer and Floating-point conversions
- This how-to guide explains how to avoid pitfalls where the developer inadvertently ends up with floating-point operations, and how to leverage the performance of integer arithmetic.
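A small sketch of the kind of pitfall meant here (these exact examples are mine, not the guide's): a floating-point literal silently promotes an integer expression to floating point, while a shift or fixed-point multiply keeps the whole computation in integers, where it vectorizes cheaply.

```c
#include <stdint.h>

/* Pitfall: the 0.5 literal promotes x to double, so this costs a
   convert, a floating-point multiply, and a convert back. */
uint8_t half_fp(uint8_t x)  { return (uint8_t)(x * 0.5); }

/* Integer-only equivalent: a single shift, trivially vectorizable. */
uint8_t half_int(uint8_t x) { return (uint8_t)(x >> 1); }

/* Fixed-point scaling: multiply by 0.8 as ~205/256, never leaving
   integer arithmetic. */
uint8_t scale_fixed(uint8_t x) { return (uint8_t)((x * 205u) >> 8); }
```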
Leverage Auto-Vectorization in Compilers
- Learn how to structure the flow of your program to make it easier for the compiler to perform auto-vectorization.
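As a concrete sketch of what "structuring the flow" means (my example, not one from the linked material): data-dependent early exits make the trip count unknowable, so compilers typically give up; a plain reduction over a fixed trip count is a pattern GCC and Clang recognize and vectorize.

```c
/* Hard to auto-vectorize: the break makes the trip count
   depend on the data. */
int max_until_limit(const int *a, int n, int limit) {
    int m = 0;
    for (int i = 0; i < n; i++) {
        if (a[i] > limit) break;   /* data-dependent exit */
        if (a[i] > m) m = a[i];
    }
    return m;
}

/* Vectorizer-friendly: fixed trip count, single max reduction. */
int max_all(const int *a, int n) {
    int m = 0;
    for (int i = 0; i < n; i++)
        m = a[i] > m ? a[i] : m;   /* recognized as a max reduction */
    return m;
}
```

When the early exit is genuinely needed, it often pays to split the work: find the cutoff index first, then run the vectorizable loop up to it.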
Modifying Loop Layout to be Auto-Vectorization Friendly
- An efficient data layout can be the difference between a slow and a very fast program. Learn how you can help the compiler, as well as how you can convert your program to hand-written SIMD code.
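The classic layout change, sketched here with illustrative type names: an array-of-structs interleaves fields, so a SIMD operation on one field needs strided gathers, while a struct-of-arrays keeps each field contiguous and lets the compiler emit plain vector loads.

```c
#include <stddef.h>

/* Array-of-structs: x, y, z interleaved in memory. Operating on
   all x values requires a stride of 3 floats per element. */
struct point_aos { float x, y, z; };

/* Struct-of-arrays: each field is its own contiguous array, so a
   loop over one field is unit-stride and auto-vectorizes. */
struct points_soa { float *x, *y, *z; };

void translate_x(struct points_soa *p, size_t n, float dx) {
    for (size_t i = 0; i < n; i++)
        p->x[i] += dx;   /* contiguous loads and stores */
}
```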
Join the Arm Developer Program
Join the Arm Developer Program to build your future on Arm. Get fresh insights directly from Arm experts, connect with like-minded peers for advice, or build on your expertise and become an Arm Ambassador.