What is new in LLVM 22?
Find out about Arm's contributions to the latest LLVM 22 release: new architecture and CPU support, and code generation improvements.

LLVM 22.1.1 was released on March 11, 2026. Cullen Rhodes from Arm managed this release, which delivers improvements in Arm architecture support, performance, and tools. The following sections highlight the key changes.
To find out more about the previous LLVM release, you can read What is new in LLVM 21?
New architecture and CPU support
Armv9.7-A architecture support
By Maciej Gabka
LLVM 22 provides assembly support for several new architecture features announced by Arm in 2025. The 2026 update is Armv9.7-A. For more information, see the annual announcement blog post. The Armv9.7-A architecture extension page documents these extensions and includes additional details.
In 2026, Arm announced an update to the Arm Generic Interrupt Controller (GICv5). LLVM 22 is the first release that provides full assembly support for this extension. This release also adds assembly support for the Future Architecture Technologies, including Permission Overlays Extension 2 (POE2) and Virtual Memory Tagging (vMTE). For more information, see the Arm announcement blog post.
The LLVM 22 assembler is compatible with the December 2025 architecture XML release, which is part of the Exploration Tools. For the full instruction set, see the December 2025 Arm A-profile A64 Instruction Set Architecture release.
LLVM 22 extends the Arm C Language Extensions (ACLE) implementation. It implements intrinsics for the data-processing instructions introduced by the 2024 architecture features, including SVE2p2 and SME2p2. It also defines new __pld_range and __pldx_range prefetch intrinsics that match the semantics of the range prefetch instruction. For more information, see the Arm C Language Extensions specification at https://github.com/ARM-software/acle.
LLVM 22 removes support for the Transactional Memory Extension (FEAT_TME) because Arm withdrew this feature from all architecture versions.
CPU Support
By David Candler
LLVM 22 adds support for the C1 series of Armv9.3-A-based CPUs: C1-Ultra, C1-Premium, C1-Pro, and C1-Nano. These CPUs target high-performance on-device workloads. For more information, see the C1 series announcement blog post.
Performance improvements
Optimizing small AArch64 CPUs
By Ties Stuij
AArch64 backend optimizations have traditionally focused on larger out-of-order CPUs, with less attention paid to smaller cores. This is because these CPU types often sit side by side in devices such as smartphones: the larger CPUs handle compute-intensive tasks, while the smaller cores manage background tasks when the larger cores are inactive. However, there is a growing number of use cases where the smaller cores need to perform well in configurations without the bigger ones, so we are now increasing optimization efforts for these cores.
This release includes several improvements for smaller AArch64 cores. The following sections highlight key changes.
Our initial target for optimizing small cores is the Cortex-A320, a fully fledged Armv9.2-A CPU with an in-order pipeline and a minimal number of execution units. Many improvements come from making code generation more suitable for this processor design. In-order execution requires instructions to execute sequentially, without a large reorder buffer (ROB) from which to select future instructions.
Looping is a key area where this limitation affects performance. An out-of-order core can fill the ROB with instructions from the next loop iteration and execute them in an optimal order. However, in-order CPUs rely more on the compiler to perform this scheduling. Loop unrolling is therefore more important for in-order cores. Loop unrolling reduces branch overhead. It also increases instructions per iteration, which enables the compiler to perform scheduling at compile time instead of relying on the ROB at runtime.
For example, LLVM 22 adds the -aarch64-force-unroll-threshold=<nr> option, which instructs the compiler to unroll loops when the instruction count per loop is below <nr>.
LLVM 22 also adds an aggressive inlining feature to the AArch64 backend, enabled for Cortex-A320. This feature increases the likelihood of unrolling vector loop iterations in more contexts than a conservative configuration allows, for example when trying to unroll larger vector loops.
Code generation improvements
SME code generation improvements
By Sander De Smalen
LLVM 22 supports stack unwinding and exception handling when SME state is present. This means that C++ exceptions and destructors work as expected when a function uses SME state, as described by the ACLE function attributes. The compiler inserts code to save and restore state. It also instructs the unwinder to calculate frame addresses using the appropriate streaming or non-streaming vector length.
LLVM 22 also reduces unnecessary SME state transitions. In the following example, the comments show where the compiler inserts state-transition code:
- LLVM 21:
void func() __arm_inout("za") {
// set up lazy-save
another_func();
// resume PSTATE.ZA and restore ZA from save-buffer
// set up lazy-save
another_func();
// resume PSTATE.ZA and restore ZA from save-buffer
}
- LLVM 22:
void func() __arm_inout("za") {
// set up lazy-save
another_func();
another_func();
// resume PSTATE.ZA and restore ZA from save-buffer
}
SVE code generation improvements
By Benjamin Maxwell
LLVM 22 includes several small code generation improvements for NEON ACLE intrinsics when SVE is available.
This can be seen when mixing NEON and SVE ACLE intrinsics. LLVM 22 can fold constants more efficiently between NEON and SVE vector types. For example, this code:
svuint32_t mix_SVE_and_NEON_divide() {
uint32x4_t vec1 = vdupq_n_u32(9);
svuint32_t vec2 = svdup_n_u32(3);
return svset_neonq(svundef_u32(), vec1) / vec2;
}

gives the following assembly output:
- LLVM 21:
mix_SVE_and_NEON_divide():
mov w8, #43691
movi v0.4s, #9
movk w8, #43690, lsl #16
mov z1.s, w8
umulh z0.s, z0.s, z1.s
lsr z0.s, z0.s, #1
ret
- LLVM 22:
mix_SVE_and_NEON_divide():
movi v0.4s, #3
ret
LLVM 22 uses SVE to lower NEON intrinsics when it provides more compact instructions for the same operation. For example, SVE can apply when it provides an immediate form of an operation that would otherwise require a constant in a register:
uint32x4_t immAdd(uint32x4_t a) {
return a + vdupq_n_u32(1);
}

results in:
- LLVM 21:
immAdd():
movi v1.4s, #1
add v0.4s, v0.4s, v1.4s
ret
- LLVM 22:
immAdd(__Uint32x4_t):
add z0.s, z0.s, #1
ret
Use of SVE2p1 UDOT/SDOT
By Sander De Smalen
LLVM 22 can emit SVE2p1's UDOT and SDOT instructions to implement reductions from i16 to i32 elements. For example, this code:
#include <stdint.h>
int32_t sve2p1_sdot(int16_t *src1, int16_t *src2, int N) {
int32_t sum = 0;
for (int i=0; i<N; ++i)
sum += src1[i] * src2[i];
return sum;
}

results in the following assembly:
.LBB0_5:
ld1h { z1.h }, p0/z, [x0, x8, lsl #1]
ld1h { z2.h }, p0/z, [x1, x8, lsl #1]
inch x8
cmp x10, x8
sdot z0.s, z2.h, z1.h
b.ne .LBB0_5

Speculative devirtualization
By Hassnaa Hamdi
LLVM 22 supports an opt-in speculative devirtualization feature. It transforms a virtual call into a direct call when the object is assumed to have a specific type.
The compiler inserts a runtime check to validate this assumption before the direct call. If the check fails, it uses the original virtual call.
This feature enables more inlining opportunities and improves optimization of the direct call.
This feature works in two scenarios:
- A single implementation of the virtual function exists, as in virtual_function1() in the example below.
- Multiple implementations exist, like the case of virtual_function2(), but all created objects are of the same class.
class Base {
public:
__attribute__((noinline))
virtual void virtual_function1() { asm volatile("NOP"); }
virtual void virtual_function2() { asm volatile("NOP"); }
};
class Derived : public Base {
public:
void virtual_function2() override { asm volatile("NOP"); }
};
__attribute__((noinline))
void func(Base *BV) {
BV->virtual_function2();
}
void another_func() {
Base *b = new Derived();
func(b);
}
Output using the options -O3 -fdevirtualize-speculatively -emit-llvm:
- Clang 21:
define dso_local void @func(Base*)(ptr noundef %0) local_unnamed_addr #0 {
%2 = load ptr, ptr %0, align 4
%3 = getelementptr inbounds nuw i8, ptr %2, i32 4
%4 = load ptr, ptr %3, align 4
tail call void %4(ptr noundef nonnull align 4 dereferenceable(4) %0)
ret void
}
- Clang 22:
define dso_local void @func(Base*)(ptr noundef %0) local_unnamed_addr #0 {
%2 = load ptr, ptr %0, align 4
%3 = getelementptr inbounds nuw i8, ptr %2, i32 4
%4 = load ptr, ptr %3, align 4
%5 = icmp eq ptr %4, @Derived::virtual_function2()
br i1 %5, label %6, label %7
6:
tail call void asm sideeffect "NOP", ""() #7
br label %8
7:
tail call void %4(ptr noundef nonnull align 4 dereferenceable(4) %0)
br label %8
8:
ret void
}

Improved code generation of interleaved and tail-folded loops
By Kerry McLaughlin
LLVM 22 improves code generation for interleaved, tail-folded loops. LoopVectorize can now create a single wide active lane mask for control flow, with a wider return type than the other vector instructions in the loop. This behavior is controlled by the new -enable-wide-lane-mask flag. It improves code generation because the active lane mask intrinsic can be lowered to a single whilelo instruction.
With the following example:
void func(int * __restrict__ a, long n) {
for (long i = 0; i < n; i++) {
a[i] = a[i] * 3;
}
}

Requesting tail-folding and interleaving results in the following code generation:
clang++ -O3 -march=armv8-a+sve2 -mllvm -prefer-predicate-over-epilogue=predicate-dont-vectorize -mllvm -force-vector-interleave=2 -S
.LBB0_2:
ld1w { z0.s }, p0/z, [x0, #1, mul vl]
ld1w { z1.s }, p1/z, [x0]
mul z0.s, z0.s, #3
mul z1.s, z1.s, #3
st1w { z0.s }, p0, [x0, #1, mul vl]
whilelo p0.s, x8, x9
inch x8
st1w { z1.s }, p1, [x0]
incb x0, all, mul #2
mov x10, x8
decw x10, all, mul #3
whilelo p1.s, x10, x9
cset w10, mi
tbnz w10, #0, .LBB0_2

When wide active lane masks are enabled, the compiler considers loop interleaving based on the cost model.
Below is the same example vectorized with a factor of vscale × 4. A single whilelo generates a predicate of vscale × 8 elements, which the vector instructions then unpack:
clang++ -O3 -march=armv8-a+sve2 -mllvm -prefer-predicate-over-epilogue=predicate-dont-vectorize -mllvm -enable-wide-lane-mask -S
.LBB0_2:
ld1w { z0.s }, p1/z, [x0]
ld1w { z1.s }, p0/z, [x0, #1, mul vl]
inch x9
mul z0.s, z0.s, #3
mul z1.s, z1.s, #3
whilelo p2.h, x9, x8
st1w { z0.s }, p1, [x0]
punpklo p1.h, p2.b
st1w { z1.s }, p0, [x0, #1, mul vl]
incb x0, all, mul #2
punpkhi p0.h, p2.b
b.mi .LBB0_2

For targets with SVE2.1 or SME2 enabled, the compiler can lower the active lane mask into a single whilelo instruction that returns a pair of predicates. This improves loop code generation by removing the need for instructions to unpack the predicate, such as the punpklo/punpkhi found in the output above:
clang++ -O3 -march=armv8-a+sve2p1 -mllvm -prefer-predicate-over-epilogue=predicate-dont-vectorize -mllvm -enable-wide-lane-mask -S
.LBB0_2:
ld1w { z0.s }, p0/z, [x0]
ld1w { z1.s }, p1/z, [x0, #1, mul vl]
inch x9
mul z0.s, z0.s, #3
mul z1.s, z1.s, #3
st1w { z0.s }, p0, [x0]
st1w { z1.s }, p1, [x0, #1, mul vl]
incb x0, all, mul #2
whilelo { p0.s, p1.s }, x9, x8
b.mi .LBB0_2

Tools improvements
Flang
By Tom Eccles
This release delivers Flang improvements across Fortran language coverage, OpenMP support, performance, and usability.
A major focus of this release is stronger OpenMP support. Flang expands parsing, semantic checking, lowering, and translation coverage for newer OpenMP features. This includes continued progress on loop constructs, tasking, and reductions. Support for OpenMP 4.0 is nearly complete, and many features from later standards already work.
This release also includes optimization and code generation work: better lowering for do concurrent, improved alias analysis, better array handling and repacking, and more accurate math and intrinsic lowering.
Arm contributed code reviews for many of these improvements and implemented several key features and fixes directly.
The most significant contribution is a collaboration with AMD to deliver full MLIR-to-LLVM IR translation for OpenMP TASKLOOP, including support for important TASKLOOP clauses. Arm contributors also extend OpenMP cancellation support by adding Flang lowering for CANCEL and CANCELLATION POINT.
SIMD and composite loop handling have also improved. These changes include SIMD reductions in MLIR translation, improved privatization for composite DO SIMD and DISTRIBUTE SIMD, and fixes for privatization and clause-handling corner cases. On the frontend side, Arm-authored changes strengthen OpenMP semantics. These changes improve checking of array sections in DEPEND, support substrings and complex-part references in DEPEND, handle reductions on variables involved in EQUIVALENCE, enforce stricter validation around goto inside SECTION, and improve detection of derived-type array-element uses inside OpenMP clauses.
Beyond OpenMP, Arm improved the encoding of alias information for Fortran pointers, which improves benchmark performance. New -f[no-]vectorize and -f[no-]slp-vectorize flags align vectorization behavior with Clang, VecLib handling is fixed so that the mid-end and back-end agree on vector library selection, and support for the VECTOR VECTORLENGTH directive is also added.
BOLT
By Paschalis Mpeis
Lite mode is now supported on AArch64 and can be enabled with -lite=1. In this mode, BOLT reuses the original cold code instead of duplicating it, which reduces the output binary size.
BOLT now supports binaries compiled with Pointer Authentication Code (PAC), a security hardening feature that mitigates ROP-style attacks. As PAC is deployed across distributions and mobile systems, users can optimize these binaries without breaking PAC authentication.
AArch64 build attributes support
LLVM 22 adds support for build attributes for Arm 64-bit (AArch64) ELF files, which record data that the linker uses to assess compatibility across relocatable object files. In most cases, AArch64 targets run a rich operating system that handles compatibility and hardware requirements through runtime capability checks. However, some features are not suitable for runtime checks, for example the use of pointer authentication (PAC) instructions outside the hint space, and branch target identification (BTI). AArch64 build attributes address these cases. They are generated by Clang and handled by LLD, llvm-readobj, llvm-readelf, llvm-objdump, and other tools.
