Accelerate Mobile AI on Arm With SME2
Build faster, more efficient mobile AI apps with Arm Scalable Matrix Extension 2 (SME2). This guide shows you how to run and optimize on-device Large Language Models (LLMs), voice, vision, and GenAI workloads using SME2-enabled hardware, supported frameworks, and tools for Android and iOS.
What is SME2 and Why Should Developers Use It?
SME2 is Arm’s latest CPU extension for accelerating matrix-oriented compute workloads directly on-device. It is designed to improve performance for AI and ML models – particularly those relying on operations like matrix multiplication, common in Transformers, CNNs, and LLMs.
- Up to 6× faster inference on models like Google’s Gemma 3
- Supported natively in PyTorch, LiteRT, and ONNX Runtime
- Available on iPhone 16 and Apple M4-series chips; Android support coming soon
Supported Frameworks
SME2 is integrated with the following frameworks, where the Arm Kleidi Libraries automatically accelerate compute-intensive workloads with minimal changes to your existing codebase.


Get Started - What Are You Developing?
- GENERATIVE AI
- VOICE AND VISION
- LIBRARIES AND FRAMEWORKS
- Arm SIMD Instructions
Designed for application developers, this section showcases real-world examples – including LLMs, audio generation, and multimodal LLMs – running directly on Arm CPUs using KleidiAI, ExecuTorch, ONNX Runtime, and MediaPipe. It assumes a foundational understanding of Android development and familiarity with Android Studio.
| Resources | Framework | Description |
|---|---|---|
| Generate Audio with Stable Audio on LiteRT | LiteRT | Learn how to deploy the Stable Audio Open Small text-to-audio model using LiteRT on Android and macOS. |
| Vision LLM Inference on Android with KleidiAI + MNN | MNN | Run Vision Transformers (ViT) efficiently on Android with KleidiAI and MNN in this beginner-friendly path. |
| Build an Android Chat App with ExecuTorch + Llama 3 (Learning Path + Docs) | PyTorch / ExecuTorch | Step-by-step guide to building a lightweight, real-time Llama 3 chat app on Arm-based Android devices. |
| Build a Chatbot on Android with ONNX Runtime | ONNX Runtime | Learn to build a powerful Android chat app using ONNX Runtime and the Generate() API for efficient inference. |
| Multimodal AI on Android with MediaPipe + KleidiAI (selfie app + LLM inference) | MediaPipe | Develop high-performance multimodal apps using MediaPipe, KleidiAI, and XNNPACK, from selfie filters to LLM integration. |
| Neural Network Quantization for Mobile AI | N/A | Explore key quantization techniques to reduce model size and improve performance for on-device AI. |
Designed for application developers, this section walks you through accelerating voice assistants, enhancing camera pipelines, and optimizing computer vision apps using frameworks like KleidiAI, PyTorch, and OpenCV. It assumes foundational knowledge of Android development, including experience with Android Studio.
| Resources | Framework | Description |
|---|---|---|
| Accelerate Voice Assistants with KleidiAI + SME2 | Whisper.cpp and Llama.cpp | Learn how to optimize voice assistant performance on Android using KleidiAI and SME2. |
| Enhance Camera Effects with AI Optimization | LiteRT with XNNPack | Discover how KleidiAI and KleidiCV can optimize camera pipelines for real-time visual effects on Android. |
| Train a Digit Classifier with PyTorch for Android | PyTorch | Learn to train a digit classification model with PyTorch and optimize it for Android deployment. |
| Accelerate OpenCV on Android with KleidiCV (CV camera app + face detection) | OpenCV | Three Learning Paths on using KleidiCV and SME2 to accelerate OpenCV apps on Android, from basics to face detection. |
Designed for library and framework developers, this section introduces the Arm Kleidi Libraries: lightweight, open-source building blocks for accelerating AI/ML frameworks and tools.
| Resources | Description |
|---|---|
| Arm Kleidi Libraries | Lightweight, open-source libraries for accelerating AI and ML workloads; an alternative to Arm Compute Library (ACL) with lower overhead. |
| Accelerate Generative AI Workloads Using KleidiAI | Learn how to accelerate GenAI workloads with KleidiAI, including a step-by-step guide to running key functions such as Gemma LLM inference. Read the launch blog. |
| Arm KleidiCV GitLab Repo | A high-performance library for computer vision that integrates easily with any CV framework to accelerate image processing on Arm-based devices. Read the launch blog. |
Designed for developers directly using the Arm SIMD Instruction Set, this section provides practical SME2 examples, compiler toolchain insights, and low-level programming techniques in C/C++ and assembly.
| Resources | Description |
|---|---|
| Introduction to SME2 blogs: Part 1 – SME2 Overview, Part 2 – SME2 Architecture Deep Dive, Part 3 – Matrix Multiplication | A three-part blog series introducing Arm SME2, covering its architecture, programming model, and comparisons with NEON and SVE. |
| SME2 Semantics, Toolchains & Code Examples | A programmer’s guide to Arm SME2, including architecture, semantics, and how to accelerate matrix workloads on Armv9-A CPUs. |
| SME2 Glossary & Intrinsics Reference | Technical reference for SME and SME2 intrinsics in C/C++, including descriptions, syntax, and usage examples. |
| Accelerate Matrix Multiplication with SME2 | Advanced Learning Path for applying SME2 to optimize matrix multiplication on Arm-based platforms. |
| Function Multiversioning for SME2, NEON & SVE2 | Learn to optimize C/C++ apps across SIMD instruction sets using function multiversioning for performance portability. |
| Arm SIMD Extensions Best Practice | Optimize your AI/ML workloads with Arm SIMD code, in assembly or with Arm intrinsics in C/C++, to unlock large performance gains. |
Tools and Libraries
Use these tools to profile, tune, and deploy your AI workloads after model selection and initial integration. These tools support low-level ML optimization by targeting Arm-specific features and analyzing system performance across compute and memory.
Arm Performance Studio
For Android app developers – profile performance for AI/ML workloads on mobile.
Arm KleidiAI
For framework and library developers, this Learning Path covers accelerating GenAI workloads with KleidiAI, featuring examples with Gemma LLM.
Matrix Multiplication with SME2
Learn how to accelerate matrix multiplication on Apple M4 devices and iPhone 16 with SME2.
What's Next?
- CODE-ALONGS
- DEVELOPER PROGRAM
- COURSES and LABS
- DEVELOPER RESEARCH
- MORE RESOURCES

Generate AI Audio on Arm Devices with Stability AI and LiteRT
August 28 | 9 a.m. PT | 6 p.m. BST
Join this session to code along with experts and deploy Stability AI’s audio model on Android using LiteRT.

Arm Developer Program
Have a technical question about AI applications on Arm?
Join the Arm Developer Program and connect with a global community of developers and Arm engineers to build better apps on Arm. Get early access to tools, technical content, workshops, and support to help you debug, optimize, and ship your projects.

Arm Developer Council
Join the Arm Developer Council to share feedback, help shape the tools and platforms you use — and receive a voucher for your time.