April 22, 2026

What is new in LLVM 22?

Find out about Arm's contributions to the latest LLVM 22 release: the latest architecture and CPU support, and code generation improvements.

By Volodymyr Turanskyy

Reading time 21 minutes

LLVM 22.1.1 was released on March 11, 2026. Cullen Rhodes from Arm managed this release. It delivers improvements in Arm architecture support, performance, and tools. The following sections highlight the key changes.

To find out more about the previous LLVM release, you can read What is new in LLVM 21?

New architecture and CPU support

Armv9.7-A architecture support

By Maciej Gabka

LLVM 22 provides assembly support for several new architecture features announced by Arm in 2025. The 2026 update is Armv9.7-A. For more information, see the annual announcement blog post. The Armv9.7-A architecture extension page documents these extensions and includes additional details.

In 2026, Arm announced an update to the Arm Generic Interrupt Controller (GICv5). LLVM 22 is the first release that provides full assembly support for this extension. This release also adds assembly support for the Future Architecture Technologies, including Permission Overlays Extension 2 (POE2) and Virtual Memory Tagging (vMTE). For more information, see the Arm announcement blog post.

The LLVM 22 assembler is compatible with the December 2025 architecture XML release, which is part of the Exploration Tools. For the full instruction set, see the December 2025 Arm A-profile A64 Instruction Set Architecture release.

LLVM 22 extends the Arm C Language Extensions (ACLE) implementation. It implements intrinsics for the data processing instructions introduced by the 2024 architecture features, including SVE2p2 and SME2p2. It also defines new __pld_range and __pldx_range prefetch intrinsics that match the semantics of the range prefetch instruction. For more information, see the Arm C Language Extensions specification at https://github.com/ARM-software/acle.

LLVM 22 removes support for the Transactional Memory Extension (FEAT_TME) because Arm withdrew this feature from all architecture versions.

CPU Support

By David Candler

LLVM 22 adds support for the C1 series of Armv9.3-A-based CPUs: C1-Ultra, C1-Premium, C1-Pro, and C1-Nano. These CPUs target high-performance on-device workloads. For more information, see the C1 series announcement blog post.

Performance improvements

Optimizing small AArch64 CPUs

By Ties Stuij

We focus AArch64 backend optimizations on larger out-of-order CPUs, with less attention paid to smaller cores. This is because these CPU types often sit side by side in devices such as smartphones: the larger CPUs handle compute-intensive tasks, while the smaller cores manage background tasks when the larger cores are inactive. However, there are a growing number of use cases where the smaller cores must perform well in configurations without the bigger ones. We are now increasing optimization efforts for these cores.

This release includes several improvements for smaller AArch64 cores. The following sections highlight key changes. 

Our initial use case for optimizing small cores is the Cortex-A320, a fully fledged Armv9.2-A CPU with an in-order pipeline and a minimal number of execution units. Many improvements come from making code generation more suitable for this processor design. In-order execution requires instructions to execute sequentially, without a large reorder buffer (ROB) to select from future instructions.

Looping is a key area where this limitation affects performance. An out-of-order core can fill the ROB with instructions from the next loop iteration and execute them in an optimal order. However, in-order CPUs rely more on the compiler to perform this scheduling. Loop unrolling is therefore more important for in-order cores. Loop unrolling reduces branch overhead. It also increases instructions per iteration, which enables the compiler to perform scheduling at compile time instead of relying on the ROB at runtime.

For example, LLVM 22 adds the -aarch64-force-unroll-threshold=<nr> option. This option instructs the compiler to unroll loops when the instruction count per loop is below <nr>.
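As an illustration of the transformation the compiler performs, the following sketch shows a simple reduction loop unrolled by hand with a factor of four (the unrolled form is written out manually here; the compiler produces an equivalent shape internally):

```cpp
#include <cstddef>
#include <cstdint>

// Original loop: one element per iteration, one branch per element.
uint32_t sum(const uint32_t *a, size_t n) {
    uint32_t s = 0;
    for (size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

// Unrolled by 4: fewer branches, and four independent accumulators per
// iteration that the compiler can schedule at compile time for an
// in-order pipeline instead of relying on a ROB at runtime.
uint32_t sum_unrolled(const uint32_t *a, size_t n) {
    uint32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    uint32_t s = s0 + s1 + s2 + s3;
    for (; i < n; ++i)  // remainder loop for trip counts not divisible by 4
        s += a[i];
    return s;
}
```

Both functions compute the same result; the unrolled version trades code size for fewer branches and more instruction-level parallelism per iteration.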

LLVM 22 also adds an aggressive inlining feature to the AArch64 backend, enabled for Cortex-A320. This feature increases the likelihood of unrolling vector loop iterations across more contexts than a conservative configuration, for example when trying to unroll larger vector loops.

Code generation improvements

SME code generation improvements

By Sander De Smalen

LLVM 22 supports stack unwinding and exception handling when SME state is present. This means that C++ exceptions and destructors work as expected when a function uses SME state, as described by the ACLE function attributes. The compiler inserts code to save and restore state. It also instructs the unwinder to calculate frame addresses using the appropriate streaming or non-streaming vector length.

LLVM 22 also reduces unnecessary SME state transitions. For example:

void func() __arm_inout("za") {
  // set up lazy-save
  another_func();
  // resume PSTATE.ZA and restore ZA from save-buffer
  // set up lazy-save
  another_func();
  // resume PSTATE.ZA and restore ZA from save-buffer
 }
is now optimized to:
void func() __arm_inout("za") {
  // set up lazy-save
  another_func();
  another_func();
  // resume PSTATE.ZA and restore ZA from save-buffer
 }

SVE code generation improvements

By Benjamin Maxwell

LLVM 22 includes several small code generation improvements for NEON ACLE intrinsics when SVE is available.

This can be seen when mixing NEON and SVE ACLE intrinsics. LLVM 22 can fold constants more efficiently between NEON and SVE vector types. For example, this code:

svuint32_t mix_SVE_and_NEON_divide() {
    uint32x4_t vec1 = vdupq_n_u32(9);
    svuint32_t vec2 = svdup_n_u32(3);
    return svset_neonq(svundef_u32(), vec1) / vec2;
}

produces the following assembly output for each release:
  • LLVM 21:
mix_SVE_and_NEON_divide():
    mov w8, #43691
    movi v0.4s, #9
    movk w8, #43690, lsl #16
    mov z1.s, w8
    umulh z0.s, z0.s, z1.s
    lsr z0.s, z0.s, #1
    ret
  • LLVM 22:
mix_SVE_and_NEON_divide():
    movi v0.4s, #3
    ret

LLVM 22 uses SVE to lower NEON intrinsics when it provides more compact instructions for the same operation. For example, SVE can apply when it provides an immediate form of an operation that would otherwise require a constant in a register:

uint32x4_t immAdd(uint32x4_t a) {
    return a + vdupq_n_u32(1);
}

results in:

  • LLVM 21:
immAdd():
    movi v1.4s, #1
    add v0.4s, v0.4s, v1.4s
    ret
  • LLVM 22:
immAdd(__Uint32x4_t):
    add z0.s, z0.s, #1
    ret

Use of SVE2p1 UDOT/SDOT

By Sander De Smalen

LLVM 22 can emit SVE2p1's UDOT and SDOT instructions to implement reductions from i16 to i32 elements. For example, this code:

#include <stdint.h>
 
int32_t sve2p1_sdot(int16_t *src1, int16_t *src2, int N) {
    int32_t sum = 0;
    for (int i=0; i<N; ++i)
        sum += src1[i] * src2[i];
    return sum;
}

results in the following assembly:

.LBB0_5:
        ld1h    { z1.h }, p0/z, [x0, x8, lsl #1]
        ld1h    { z2.h }, p0/z, [x1, x8, lsl #1]
        inch    x8
        cmp     x10, x8
        sdot    z0.s, z2.h, z1.h
        b.ne    .LBB0_5

Speculative devirtualization

By Hassnaa Hamdi

LLVM 22 supports an opt-in speculative devirtualization feature. It transforms a virtual call into a direct call when the object is assumed to have a specific type. 

The compiler inserts a runtime check to validate this assumption before the direct call. If the check fails, it uses the original virtual call. 

This feature enables more inlining opportunities and improves optimization of the direct call. 

This feature works in two scenarios:

  1. A single implementation of the virtual function exists, as with virtual_function1() in the example below.
  2. Multiple implementations exist, as with virtual_function2(), but all created objects are of the same class.

class Base {
public:
    __attribute__((noinline))
    virtual void virtual_function1() { asm volatile("NOP"); }
    virtual void virtual_function2() { asm volatile("NOP"); }
};
class Derived : public Base {
public:
    void virtual_function2() override { asm volatile("NOP"); }
};
__attribute__((noinline))
void func(Base *BV) {
    BV->virtual_function2();
}
void another_func() {
    Base *b = new Derived();
    func(b);
}

Output using the options -O3 -fdevirtualize-speculatively -emit-llvm:

  • Clang 21:
define dso_local void @func(Base*)(ptr noundef %0) local_unnamed_addr #0 {
  %2 = load ptr, ptr %0, align 4
  %3 = getelementptr inbounds nuw i8, ptr %2, i32 4
  %4 = load ptr, ptr %3, align 4
  tail call void %4(ptr noundef nonnull align 4 dereferenceable(4) %0)
  ret void
}
  • Clang 22:
define dso_local void @func(Base*)(ptr noundef %0) local_unnamed_addr #0 {
  %2 = load ptr, ptr %0, align 4
  %3 = getelementptr inbounds nuw i8, ptr %2, i32 4
  %4 = load ptr, ptr %3, align 4
  %5 = icmp eq ptr %4, @Derived::virtual_function2()
  br i1 %5, label %6, label %7
 
6:
  tail call void asm sideeffect "NOP", ""() #7
  br label %8
 
7:
  tail call void %4(ptr noundef nonnull align 4 dereferenceable(4) %0)
  br label %8
 
8:
  ret void
}
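Conceptually, the guard in the Clang 22 IR above corresponds to a source-level rewrite like the following sketch. The names and return values here are illustrative, and the real transform compares the loaded vtable slot against the expected function address rather than using typeid:

```cpp
#include <typeinfo>

struct BaseT {
    virtual int virtual_function2() { return 1; }
    virtual ~BaseT() = default;
};
struct DerivedT : BaseT {
    int virtual_function2() override { return 2; }
};

// Source-level sketch of the speculative-devirtualization guard:
// guess the dynamic type, call directly on success, fall back otherwise.
int func_sketch(BaseT *bv) {
    if (typeid(*bv) == typeid(DerivedT)) {
        // Guess succeeded: a direct, non-virtual call that the
        // optimizer can inline.
        return static_cast<DerivedT *>(bv)->DerivedT::virtual_function2();
    }
    // Guess failed: keep the original virtual dispatch.
    return bv->virtual_function2();
}
```

The direct call in the success branch is what unlocks inlining, as seen in the IR where the NOP body of Derived::virtual_function2() is inlined into func().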

Improved code generation of interleaved and tail-folded loops

By Kerry McLaughlin

LLVM 22 improves code generation for interleaved, tail-folded loops. The loop vectorizer can now create a single wide active lane mask for control flow, with a wider return type than the other vector instructions in the loop. This behavior is controlled by the new -enable-wide-lane-mask flag. It improves code generation because the active lane mask intrinsic can be lowered to a single whilelo instruction.

With the following example:

void func(int * __restrict__ a, long n) {
    for (long i = 0; i < n; i++) {
        a[i] = a[i] * 3;
    }
}

Requesting tail-folding and interleaving currently results in the following code generation:

clang++ -O3 -march=armv8-a+sve2 -mllvm -prefer-predicate-over-epilogue=predicate-dont-vectorize -mllvm -force-vector-interleave=2 -S
.LBB0_2:
    ld1w    { z0.s }, p0/z, [x0, #1, mul vl]
    ld1w    { z1.s }, p1/z, [x0]
    mul     z0.s, z0.s, #3
    mul     z1.s, z1.s, #3
    st1w    { z0.s }, p0, [x0, #1, mul vl]
    whilelo p0.s, x8, x9
    inch    x8
    st1w    { z1.s }, p1, [x0]
    incb    x0, all, mul #2
    mov     x10, x8
    decw    x10, all, mul #3
    whilelo p1.s, x10, x9
    cset    w10, mi
    tbnz    w10, #0, .LBB0_2

When wide active lane masks are enabled, the compiler considers loop interleaving based on the cost model. 

Below is the same example vectorized with a factor of vscale × 4. A single whilelo generates a predicate of vscale × 8 elements, which the vector instructions then unpack:

clang++ -O3 -march=armv8-a+sve2 -mllvm -prefer-predicate-over-epilogue=predicate-dont-vectorize -mllvm -enable-wide-lane-mask -S
.LBB0_2:
    ld1w    { z0.s }, p1/z, [x0]
    ld1w    { z1.s }, p0/z, [x0, #1, mul vl]
    inch    x9
    mul     z0.s, z0.s, #3
    mul     z1.s, z1.s, #3
    whilelo p2.h, x9, x8
    st1w    { z0.s }, p1, [x0]
    punpklo p1.h, p2.b
    st1w    { z1.s }, p0, [x0, #1, mul vl]
    incb    x0, all, mul #2
    punpkhi p0.h, p2.b
    b.mi    .LBB0_2

For targets with SVE2.1 or SME2 enabled, the compiler can lower the active lane mask into a single whilelo instruction that returns a pair of predicates. This improves loop code generation by removing the need for instructions to unpack the predicate, such as the punpklo/punpkhi found in the output above:

clang++ -O3 -march=armv8-a+sve2p1 -mllvm -prefer-predicate-over-epilogue=predicate-dont-vectorize -mllvm -enable-wide-lane-mask -S
.LBB0_2:
    ld1w    { z0.s }, p0/z, [x0]
    ld1w    { z1.s }, p1/z, [x0, #1, mul vl]
    inch    x9
    mul     z0.s, z0.s, #3
    mul     z1.s, z1.s, #3
    st1w    { z0.s }, p0, [x0]
    st1w    { z1.s }, p1, [x0, #1, mul vl]
    incb    x0, all, mul #2
    whilelo { p0.s, p1.s }, x9, x8
    b.mi    .LBB0_2

Tools improvements

Flang

By Tom Eccles

This release delivers Flang improvements across Fortran language coverage, OpenMP support, performance, and usability. 

A major focus of this release is stronger OpenMP support. Flang expands parsing, semantic checking, lowering, and translation coverage for newer OpenMP features. This includes continued progress on loop constructs, tasking, and reductions. Support for OpenMP 4.0 is nearly complete, and many features from later standards already work.

This release also includes optimization and code generation work: better lowering for do concurrent, improved alias analysis, better array handling and repacking, and more accurate math and intrinsic lowering.

Arm contributed code reviews for many of these improvements and implemented several key features and fixes directly. 

The most significant contribution is a collaboration with AMD to deliver full MLIR-to-LLVM IR translation for OpenMP TASKLOOP, including support for important TASKLOOP clauses. Arm contributors also extended OpenMP cancellation support by adding Flang lowering for CANCEL and CANCELLATION POINT.

SIMD and composite loop handling also improved. These changes include SIMD reductions in MLIR translation, improved privatization for composite DO SIMD and DISTRIBUTE SIMD constructs, and fixes for privatization and clause-handling corner cases. On the frontend side, Arm-authored changes strengthen OpenMP semantics: they improve checking of array sections in DEPEND, support substrings and complex-part references in DEPEND, handle reductions on variables involved in EQUIVALENCE, enforce stricter validation around goto inside SECTION, and improve detection of derived-type array-element uses inside OpenMP clauses.

Beyond OpenMP, Arm improved the encoding of alias information for Fortran pointers, which improves benchmark performance. New -f[no-]vectorize and -f[no-]slp-vectorize controls align vectorization behavior with Clang. VecLib handling is fixed so the mid-end and back-end agree on vector library selection, and support for the VECTOR VECTORLENGTH directive is also added.

BOLT

By Paschalis Mpeis

Lite mode is supported on AArch64 and can be enabled with -lite=1. In this mode, BOLT reuses the original cold code instead of duplicating it, reducing output binary size.

BOLT supports binaries compiled with PAC (Pointer Authentication Code), a security hardening feature that mitigates ROP-style attacks. As PAC is deployed across distributions and mobile systems, users can optimize these binaries without breaking PAC authentication.

AArch64 build attributes support

LLVM 22 adds support for build attributes for Arm 64-bit (AArch64) ELF files to record data that the linker uses to assess compatibility across relocatable object files. In most cases, AArch64 software runs on a rich operating system that handles compatibility and hardware requirements through runtime capability checks. However, some features are not suitable for runtime checks, for example, the use of pointer authentication (PAC) instructions outside the hint space and branch target identification (BTI). AArch64 build attributes address these cases. Clang generates them, and LLD, llvm-readobj, llvm-readelf, llvm-objdump, and other tools handle them.

