Smarter Write Barriers for Arm64 in .NET CoreCLR
Enhanced Arm64 write barriers streamline GC scanning and improve .NET runtime behaviour, delivering faster performance for modern, memory-intensive applications.
By Alan Hayward

Last year, I explored how you can use the Arm Scalable Vector Extension (SVE) in .NET to unlock SIMD performance at scale. This year, my focus has shifted to something less visible but just as fundamental to runtime performance: write barriers in the CoreCLR garbage collector (GC).
Write barriers are not a feature most .NET developers ever think about. They do not change how you write C# code, and you do not see them in benchmarks unless you deliberately look. However, they are among the hottest code paths in the runtime. Every time managed code writes a reference field, which happens constantly, the write barrier runs. This behavior makes them prime candidates for micro-optimization.
In this blog post, I cover:
- What write barriers are and why they’re essential to generational GC.
- How Arm64’s implementation used to work in CoreCLR.
- What’s new in the updated design.
- The tradeoffs between write-time and collection-time performance.
What is a write barrier?
A write barrier is a small snippet of code that runs whenever a managed reference is stored into a field or array element. Its job is to keep the GC’s metadata in sync with program memory. Without a write barrier, the GC would not track when one object starts or stops referencing another. This gap can cause incorrect memory reclamation when the GC assumes that no object references that memory. The write barrier is the GC’s eyes and ears during execution.
Depending on the GC mode, a write barrier might:
- Mark cards: Flagging fixed-size regions of memory as “dirty” so the GC knows which areas to rescan.
- Maintain remembered sets: Tracking references from older to younger generations.
- Handle special regions: Accounting for large object heaps or ephemeral generations.
Because this code runs on every reference update, its performance is critical. In many applications, the write barrier is the function that runs most often. For this reason, it is written in Arm64 assembly. While it runs, the write barrier needs to know certain GC state and the locations of various tables. To speed this up, it uses cached copies of key GC state, placed nearby in memory. Some of these caches are static. Others are updated dynamically by the GC to reflect runtime configuration.
To keep the barrier fast, the card marking logic is deliberately simple. It errs on the side of marking more memory dirty than required. This design reduces time during each write, but it requires the garbage collector to scan more memory than it really needs to. On small heaps this overhead is negligible. On large servers with tens or hundreds of gigabytes of managed memory, the additional scanning increases pause times and reduces throughput.
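As a rough sketch of that idea, here is what coarse card marking looks like in illustrative C++. The names, the 2 KB card granularity, and the structure are assumptions for the example; the real barrier is hand-written Arm64 assembly with additional bounds checks.

```cpp
// Illustrative card marking only; names, card size, and layout are assumptions,
// not the CoreCLR implementation.
#include <cstdint>
#include <cstddef>

constexpr size_t kCardShift = 11;        // assume one card byte covers 2 KB of heap
static uint8_t g_cardTable[1 << 20];     // toy card table, large enough for the sketch

// Called after a reference is stored into *slot: mark the card covering the slot as
// dirty so the GC rescans that whole 2 KB region, even though only one field changed.
void MarkCardForWrite(const void* slot, uintptr_t heapBase)
{
    size_t card = (reinterpret_cast<uintptr_t>(slot) - heapBase) >> kCardShift;
    g_cardTable[card] = 0xFF;            // coarse: the entire region is now "dirty"
}
```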
The current design is a compromise that saves time on each write but pays for it later during GC.
The new approach: Multiple write barriers
In .NET Runtime PR #111636, we changed the Arm64 design to match the x64 design that has used a WriteBarrierManager for years.
Instead of one universal helper, the runtime now uses 10 specialized write barrier variants. Each variant is optimized for a particular GC configuration. Some variants are tuned for server GC. Others provide more accurate dirty card marking. The most precise variant uses Armv8.1-LSE (Large System Extensions) instructions. Each variant assumes specific GC state and removes redundant checks. This turns what used to be a complex code path into near straight-line execution. The result is smaller and faster barrier code.
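To see why specialization helps, compare a barrier that must branch on GC configuration during every store with one where that configuration is fixed when the code is generated. The following is only a sketch of the principle, with invented names and placeholder bodies:

```cpp
// Sketch of the specialization principle (invented names, not CoreCLR code).

// A universal barrier has to branch on runtime GC state for every store:
void UniversalBarrier(void** slot, void* ref, bool serverGC, bool preciseCards)
{
    *slot = ref;
    if (serverGC)     { /* extra bookkeeping needed only for server GC */ }
    if (preciseCards) { /* finer-grained card marking                  */ }
    /* ...plus range checks that may be irrelevant in this configuration... */
}

// A specialized variant bakes those decisions in, so the hot path is straight-line:
template <bool ServerGC, bool PreciseCards>
void SpecializedBarrier(void** slot, void* ref)
{
    *slot = ref;
    if constexpr (ServerGC)     { /* server GC bookkeeping */ }
    if constexpr (PreciseCards) { /* precise card marking  */ }
}
```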
The challenge is ensuring the correct variant is called each time a reference changes. As before, all .NET code continues to call a single global write barrier function, as it would be impractical to update every call site. Meanwhile, whenever any relevant state changes inside the GC, the GC calls out to the WriteBarrierManager.
The WriteBarrierManager then decides which specialized write barrier function is correct for the new configuration. It copies that function over the top of the global write barrier function, flushing code caches as necessary. In many cases, this happens once at startup. In others, the active barrier may switch dynamically as the program evolves.
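A minimal sketch of that patching step is shown below. The names are hypothetical; in the runtime the barrier bodies are assembly and the manager lives in the CoreCLR C++ sources, but the mechanism is the same: copy the chosen body over the global entry point and flush the instruction cache so every core sees the new code.

```cpp
// Hypothetical sketch of installing a specialized barrier over the global stub.
#include <cstring>
#include <cstdint>
#include <cstddef>

struct BarrierVariant
{
    const uint8_t* code;   // start of the specialized barrier body
    size_t         size;   // length of the body in bytes
};

// globalBarrier points at a writable mapping of the single global write barrier that
// all JIT-compiled code calls. On Arm64 the instruction cache is not coherent with
// data-side writes, so the patched range must be flushed before it is executed.
void InstallBarrier(uint8_t* globalBarrier, const BarrierVariant& chosen)
{
    std::memcpy(globalBarrier, chosen.code, chosen.size);
#if defined(__aarch64__) && (defined(__GNUC__) || defined(__clang__))
    __builtin___clear_cache(reinterpret_cast<char*>(globalBarrier),
                            reinterpret_cast<char*>(globalBarrier + chosen.size));
#endif
}
```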
Performance tradeoffs: Paying a little to save a lot
At the heart of this change is the tradeoff between the cost of each write and the cost of each collection.
- Before: Each write was as cheap as possible, but the GC had to scan more memory during collection.
- After: Each write pays a small extra cost, but the GC scans less.
Since writes happen very often, it may sound counterintuitive to make them more expensive. But in practice, the cost per write is tiny and the additional work is only a couple of extra instructions. The savings during collection are significant. For example, imagine a large web service with a 64 GB heap, running background GC. Reducing the number of dirty cards by 5–10% can translate into fewer milliseconds of pause time per collection, multiplied across thousands of collections per day. That is a huge win in terms of tail latency and throughput.
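To put rough numbers on it, assume a card granularity of 2 KB. A 64 GB heap is then covered by roughly 33 million cards. If, say, 10% of them are dirty at the start of a collection, the GC has about 6.4 GB of card-covered memory to rescan; trimming the dirty set by 5–10% removes a few hundred megabytes of rescanning from every collection. These figures are illustrative rather than measured, but they show how small gains in card accuracy compound into meaningful pause-time savings.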
This is a classic example of shifting work out of the critical path (GC pause time) into the steady-state path (writes). For modern server workloads, this tradeoff provides a clear benefit.
Constants: to inline or not to inline
In the new version of the code, there are still constants that need to be loaded by the write barrier. For example, it loads the locations of various tables and the offsets of the different regions within those tables. The offsets can change while the program runs and need updating by the WriteBarrierManager. In both the old and new versions of the code, these constants are placed in a small buffer directly after the main write barrier. This locality ensures loading is fast and the CPU can cache the values effectively.
On x64 this is done differently. Instead of using a buffer, x64 writes the constants directly into movabs instructions, which avoids loading the values from memory. This is possible because x64 instructions are variable-length: a single 64-bit constant can be moved into a register in one instruction. Arm64 uses fixed 32-bit instructions and can move only a 16-bit constant at a time. As a result, materializing a 64-bit constant takes up to four instructions (a MOVZ or MOVN followed by MOVKs). The count can be reduced when multiple constants share common parts.
In practice, the cost of a nearby load is low, and the saving from switching to mov instructions is small, especially when you factor in the larger code footprint. Arm does not generally recommend this optimization, so the new Arm64 code continues to load the constants from the buffer.
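As an illustration of the instruction-count difference, the snippet below composes a 64-bit constant from 16-bit halfwords, mirroring the MOVZ/MOVK sequence Arm64 would need, whereas x64 can encode the full value in a single instruction. The value and function are invented for the example:

```cpp
// Illustration only: building a 64-bit constant in 16-bit pieces, as Arm64 must.
#include <cstdint>

uint64_t MaterializeConstant()
{
    const uint64_t value = 0x123456789ABCDEF0ull;  // arbitrary example constant

    uint64_t reg =  value         & 0xFFFF;        // movz x16, #0xDEF0
    reg |= ((value >> 16) & 0xFFFF) << 16;         // movk x16, #0x9ABC, lsl #16
    reg |= ((value >> 32) & 0xFFFF) << 32;         // movk x16, #0x5678, lsl #32
    reg |= ((value >> 48) & 0xFFFF) << 48;         // movk x16, #0x1234, lsl #48
    return reg;                                    // four instructions vs one load
}
```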
Under the hood: What changed in the code
At the code level, the change included the following steps:
- Introducing multiple Arm64 variants of the JIT_WriteBarrier function. Each variant is tuned for a GC mode. Because many parts of the variants are similar, the functions use blocks of assembly macros to reduce the possibility of errors.
- Updating the WriteBarrierManager to handle the differences in Arm64 behaviour compared to x64. These differences result mostly from loading constants from the buffer instead of inlining them.
- Ensuring the WriteBarrierManager is called whenever relevant state in the GC is changed.
Looking forward
This work forms part of a broader, ongoing effort to make .NET on Arm64 as performant and mature as .NET on x64. Write barriers might appear minor, but they are among the most performance-critical components of the runtime. This work is part of the ongoing collaboration between Arm and Microsoft to push .NET performance on Arm64 even further, ensuring that managed code runs smoothly and efficiently on all platforms.
In fact, running .NET on Arm64-based Azure Cobalt 100 shows performance improvements compared with equivalent AMD systems on key workloads. With the announcement of Cobalt 200, we expect performance to improve further.
Conclusion
The new WriteBarrierManager for Arm64 may not change how you write C# code, but it changes how the runtime behaves in subtle and important ways. By trading a little extra work per write for much more efficient GC scanning, we have made the runtime better suited to today’s memory-intensive workloads.
It is another reminder that runtime performance is not just about flashy vector intrinsics or JIT tricks. Sometimes the biggest wins come from making the invisible machinery of the garbage collector just a little bit smarter.
Further reading
- Pull Request #111636: Implementation details.
- Using SVE in C#: My blog from last year.
- Microsoft’s Azure Cobalt 200 with Arm Neoverse CSS V3
