Hardware Cache Coherency Introduction
Coherency is about ensuring that all processors, or bus masters, in the system see the same view of memory. For example, if a processor creates a data structure and then passes it to a DMA engine to move, both the processor and the DMA engine must see the same data. If that data sits in the processor's cache while the DMA engine reads from external DDR, the DMA engine will read old, stale data.
There are three mechanisms to maintain coherency:
- Disabling caching is the simplest mechanism but can cost significant CPU performance. To reach the highest performance, processors are pipelined to run at high frequency and to run from caches that offer very low latency. Caching data that is accessed multiple times increases performance significantly, and it reduces DRAM accesses and power. Marking shared data as "non-cached" forfeits these benefits, hurting both performance and power.
- Software-managed coherency is the traditional solution to the data-sharing problem. The software, usually a device driver, must clean (write back) dirty data from the caches and invalidate old data to enable sharing with other processors or masters in the system. This costs processor cycles, bus bandwidth, and power.
- Hardware-managed coherency offers an alternative that simplifies software. With this solution, any cached data marked "shared" is always automatically kept up to date: all processors and bus masters in that sharing domain see exactly the same value.
Extending hardware coherency to the system requires a coherent bus protocol. In 2011 Arm released the AMBA 4 ACE specification, which introduced the AXI Coherency Extensions on top of the popular AXI protocol. The full ACE interface enables hardware coherency between processor clusters, allowing an SMP operating system to extend across more cores. With two clusters, for example, any shared access to memory can snoop the other cluster's caches to see if the data is already on chip; if not, it is fetched from external memory (DDR).

The AMBA 4 ACE-Lite interface is designed for IO-coherent (one-way coherent) system masters such as DMA engines, network interfaces, and GPUs. These devices might not have any caches of their own, yet they can still read shared data from the ACE processors' caches. Alternatively, they might have caches but not cache shareable data.

While hardware coherency adds some complexity to the interconnect and processors, it massively simplifies the software and enables applications that would not be possible with software-managed coherency. One example is big.LITTLE Global Task Scheduling.