Caches and Self-Modifying Code: Implementing `__clear_cache`
How to implement `__clear_cache` using assembly.

Some time ago, I posted an article about cache maintenance in self-modifying code. I described the use of the __clear_cache function (in Linux) to synchronize the instruction and data caches so that the processor executes what you want it to execute after you have written some code.
Most of the time, using an abstraction (like __clear_cache) is the best solution. However, there may be times when you need to implement it yourself, possibly because you are actually implementing a similar library function, or want something slightly different and want to know where to start from. Perhaps you just want to know how it works. That is what I’ll discuss here.
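For reference, using the abstraction looks roughly like this. This is a minimal sketch, assuming GCC or Clang, whose `__builtin___clear_cache` builtin performs the same job as `__clear_cache`. The helper name `publish_code` is hypothetical, and the destination buffer is assumed to be writable here and made executable by the caller before it is run.

```c
#include <stddef.h>
#include <string.h>

// Hypothetical helper: copy freshly generated machine code into `dst`, then
// synchronize the instruction and data caches for that range.
// Assumes GCC or Clang (for __builtin___clear_cache). Mapping `dst` as
// executable memory is left to the caller.
static void *publish_code(void *dst, const void *code, size_t len) {
  memcpy(dst, code, len);
  char *begin = (char *)dst;
  __builtin___clear_cache(begin, begin + len);
  return dst;
}
```

On architectures with coherent instruction fetch the builtin may compile to nothing, so the same call site can stay in portable code.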
Implementing __clear_cache for AArch64 using the A64 instruction set
The A64 instruction set provides the necessary instructions to perform the required cache maintenance operations in user space (or ‘EL0’ in Arm terminology). This allows self-modifying code updates to be performed directly from EL0, without a system call. For example, __clear_cache on Linux requires no system calls.
Avoiding a dependency on a kernel interface potentially makes this approach fairly portable across operating systems. However, operating system kernels have the ability to deny EL0 access to the necessary instructions. If that’s the case on your target platform, this approach will not work, and you will need to rely on the tools provided by the system.
The Arm ARM (section B2.7.4.2 in DDI0487L.a) tells you exactly what you need to do:
- Write the new instructions to memory.
- Use `dc cvau` to clean the data cache to the point of unification. Loosely, the point of unification is the point at which data and instruction accesses see the same value for a given memory location. I simplified this to “memory” in my earlier article, but it is more typically an L2 cache. The architecture does not specify exactly where it is, but application code using `dc cvau` does not need to know.
- Use a `dsb` barrier to ensure that the data is visible before we move on.
- Use `ic ivau` to invalidate the instruction cache to the point of unification.
- Use another `dsb` barrier to ensure that the `ic` completes before the next instruction.
- Finally, an `isb` barrier flushes the pipeline, ensuring that any subsequent instructions that the processor has already started working on are discarded and reloaded.
The following sequence synchronizes a single cache line at x0:
... // Write code to the cache line at x0.
dc cvau, x0
dsb ish
ic ivau, x0
dsb ish
isb
... // It is now safe to execute the code in the cache line at x0.
These steps are effectively the same as what the Linux kernel does for 32-bit programs.
In practice, code buffers vary in length and often span multiple cache lines, so functions like __clear_cache need to loop over every line in the range. To do that, you need to know the size of the system’s cache lines, which you can determine by reading the cache type register, ctr_el0, described in section D24.2.37 of the Arm ARM (DDI0487L.a). Here’s a simple (but complete) example using GCC inline assembly:
#include <stdint.h>
#include <stddef.h>

void EnsureIAndDCacheCoherency(uintptr_t start, uintptr_t end) {
  // ctr_el0 is a 64-bit system register, so read it into a 64-bit variable.
  uint64_t ctr;
  asm("mrs %[ctr], ctr_el0\n\t" : [ctr] "=r" (ctr));
  // Work out the line sizes for the I and D caches. Each field holds
  // log2 of the line size in 4-byte words.
  uintptr_t const dsize = 4 << ((ctr >> 16) & 0xf);
  uintptr_t const isize = 4 << ((ctr >> 0) & 0xf);
  // Clean every D-cache line that overlaps [start, end).
  for (uintptr_t dline = start & ~(dsize - 1); dline < end; dline += dsize) {
    asm("dc cvau, %[dline]\n\t" : : [dline] "r" (dline) : "memory");
  }
  asm("dsb ish\n\t" : : : "memory");
  // Invalidate every I-cache line that overlaps [start, end).
  for (uintptr_t iline = start & ~(isize - 1); iline < end; iline += isize) {
    asm("ic ivau, %[iline]\n\t" : : [iline] "r" (iline) : "memory");
  }
  asm("dsb ish\n\t"
      "isb\n\t"
      : : : "memory");
}
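The field decoding and the rounding in those loops can be checked without touching any system register. The sketch below is plain C under stated assumptions: the helper names `line_size_bytes` and `lines_touched` are hypothetical, and the register value in the usage note is only an illustrative input, not a reading from real hardware.

```c
#include <stddef.h>
#include <stdint.h>

// Decode a 4-bit line-size field of ctr_el0 into bytes. The field holds
// log2 of the line size in 4-byte words. Use shift 16 for the D-cache field
// and shift 0 for the I-cache field, as in the example above.
static uintptr_t line_size_bytes(uint64_t ctr, unsigned shift) {
  return (uintptr_t)4 << ((ctr >> shift) & 0xf);
}

// Count the cache lines that a maintenance loop visits for [start, end).
// `line` must be a power of two. The start address is rounded down so that
// a range beginning in the middle of a line still covers that line.
static size_t lines_touched(uintptr_t start, uintptr_t end, uintptr_t line) {
  size_t count = 0;
  for (uintptr_t addr = start & ~(line - 1); addr < end; addr += line) {
    count++;
  }
  return count;
}
```

With an illustrative register value of 0x84448004, both fields decode to 64 bytes. A range from 0x1004 to 0x1084 then touches three 64-byte lines (0x1000, 0x1040 and 0x1080), even though it is only 128 bytes long.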
Some other realistic examples can be found in VIXL’s CPU::EnsureIAndDCacheCoherency and in Google V8.
Cache-coherent implementations
Although the Arm architecture does not require automatic coherency between data writes and instruction fetches, it does allow for implementations that provide it. On such implementations, one or both of the cache-maintenance operations can be omitted:
- If `ctr_el0.IDC` is set, the implementation does not require an explicit `dc cvau`. The `dsb` is still required, to ensure that outstanding stores complete.
- If `ctr_el0.DIC` is set, the implementation does not require an explicit `ic ivau`.
Taking this into account, the example implementation looks like this:
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

void EnsureIAndDCacheCoherency(uintptr_t start, uintptr_t end) {
  // ctr_el0 is a 64-bit system register, so read it into a 64-bit variable.
  uint64_t ctr;
  asm("mrs %[ctr], ctr_el0\n\t" : [ctr] "=r" (ctr));
  uintptr_t const dsize = 4 << ((ctr >> 16) & 0xf);
  uintptr_t const isize = 4 << ((ctr >> 0) & 0xf);
  // IDC is bit 28 and DIC is bit 29. A clear bit means that the
  // corresponding maintenance operation is still required.
  bool n_idc = ((ctr >> 28) & 0x1) == 0;
  bool n_dic = ((ctr >> 29) & 0x1) == 0;
  if (n_idc) {
    for (uintptr_t dline = start & ~(dsize - 1); dline < end; dline += dsize) {
      asm("dc cvau, %[dline]\n\t" : : [dline] "r" (dline) : "memory");
    }
  }
  // Always required, so that outstanding stores complete.
  asm("dsb ish\n\t" : : : "memory");
  if (n_dic) {
    for (uintptr_t iline = start & ~(isize - 1); iline < end; iline += isize) {
      asm("ic ivau, %[iline]\n\t" : : [iline] "r" (iline) : "memory");
    }
    asm("dsb ish\n\t" : : : "memory");
  }
  asm("isb\n\t" : : : "memory");
}
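The two feature bits can also be decoded separately from the maintenance loops, which makes the decision easy to test on any host. This is a minimal sketch in plain C; the `sync_plan` struct and `plan_from_ctr` function are hypothetical names, and the bit positions (28 for IDC, 29 for DIC) are the ones used in the example above.

```c
#include <stdbool.h>
#include <stdint.h>

// Which maintenance steps a given ctr_el0 value still requires.
struct sync_plan {
  bool clean_dcache;       // dc cvau loop needed (IDC clear)
  bool invalidate_icache;  // ic ivau loop and second dsb needed (DIC clear)
};

// Hypothetical helper: derive the plan from a raw ctr_el0 value.
// The first dsb and the final isb are not modelled here because they are
// required in every case.
static struct sync_plan plan_from_ctr(uint64_t ctr) {
  struct sync_plan plan;
  plan.clean_dcache = ((ctr >> 28) & 0x1) == 0;
  plan.invalidate_icache = ((ctr >> 29) & 0x1) == 0;
  return plan;
}
```

A value with neither bit set yields a plan with both steps enabled, which matches the unconditional version shown earlier.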
Implementing __clear_cache for AArch32: Arm (A32) and Thumb (T32)
In AArch32, neither A32 nor T32 offers equivalent cache maintenance instructions at EL0, so __clear_cache works by calling into the Linux kernel. You can do this directly as follows:
ldr r0, =start_address
ldr r1, =end_address
mov r2, #0 @ r2 _must_ be zero.
ldr r7, =0x000f0002
svc 0 @ The svc number is ignored.
The important thing is that the registers r0, r1, r2 and r7 are set properly when the svc executes; it doesn’t matter how you achieve this. If the arguments are already in the right registers, for example, you might not need to do anything. The Google V8 JavaScript engine uses GCC inline assembly to do it and lets the compiler worry about the best way to get the values where they need to be.
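If you would rather not write the register setup by hand, the same call can be made from C through the C library's generic `syscall()` wrapper. This is a sketch under stated assumptions, not how V8 or libgcc do it: `sync_code_range` is a hypothetical name, the syscall path assumes 32-bit Arm Linux with the kernel headers available, and every other target falls back to the GCC/Clang builtin so that the file still builds.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // Exposes syscall() in <unistd.h> on glibc.
#endif

#if defined(__linux__) && defined(__arm__)
#include <asm/unistd.h>  // __ARM_NR_cacheflush (0x0f0002)
#include <unistd.h>      // syscall()
#endif

// Hypothetical wrapper: synchronize the caches for [start, end).
// Returns 0 on success.
static int sync_code_range(void *start, void *end) {
#if defined(__linux__) && defined(__arm__)
  // Arguments are start, end and flags, matching r0, r1 and r2 in the
  // assembly version. The flags argument must be zero.
  return (int)syscall(__ARM_NR_cacheflush, start, end, 0);
#else
  // Not 32-bit Arm Linux: assume GCC or Clang and use the builtin instead.
  __builtin___clear_cache((char *)start, (char *)end);
  return 0;
#endif
}
```

Going through `syscall()` costs an extra function call compared with inline assembly, but it leaves register allocation and Thumb-specific handling of r7 to the C library.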
