
Performance difference between synthesized MMU in ARM926EJ-S and custom MMU in ARM92xT

Information in this article applies to:

  • ARM920/922T

  • ARM926EJ-S

Question

Performance difference between synthesized MMU in ARM926EJ-S and custom MMU in ARM92xT

Answer

The ARM920T is implemented using custom content-addressable memory (CAM) cells. Because no standardized CAM memory compiler is available across the industry, this approach could not be used for synthesizable ARM cores such as the ARM926EJ-S.

In ARM926EJ-S, the Tag SRAMs can only be looked up by single reads. For this reason, a small number of TLB entries are copied into registers (micro-TLBs) so that they can be looked up in parallel. This works efficiently until the code or data runs off the end of the current entries in the registers, at which point there will be an overhead of a few cycles while the main TLB is interrogated. It is this overhead which is mainly responsible for the performance difference.

The TRM avoids giving a fully detailed description of this because the TRM is in the public domain and we do not generally make this kind of detail available to non-licensees.

Common Memory System Stalls

There are three types of common memory system stall that can occur for read accesses that hit both the cache and the main TLB. These are:

  • cache micro-TAG miss

  • memory region switch

  • micro-TLB miss

Cache Micro-TAG Miss

The ARM926EJ-S cache contains an 8-entry fully associative structure called a micro-TAG, which is used to speed up cache accesses. Each micro-TAG entry holds the virtual address TAG for a cache line and the corresponding way information. When a read cache lookup is performed, a main TAG lookup and a micro-TAG lookup are performed in parallel. If the address hits in the micro-TAG, the data is returned to the ARM9EJ-S in the same clock cycle. If the address misses in the micro-TAG, the data is returned in the following cycle (a one-cycle stall), and a new micro-TAG entry is loaded for the address that missed. Note that the micro-TAG is used only for read accesses, not for writes.
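The one-cycle stall described above can be approximated with a small behavioral model. The following Python sketch is illustrative only and is not a description of the actual hardware: the round-robin replacement policy for the micro-TAG is an assumption (the text does not state it), the 32-byte line length reflects the 8-word cache line of the ARM926EJ-S rather than anything stated in this article, and the names MicroTag and read_stalls are hypothetical. It models only reads that already hit the cache.

```python
CACHE_LINE_BYTES = 32    # 8-word cache line (not stated in this article)
MICRO_TAG_ENTRIES = 8    # from the text above


class MicroTag:
    """Toy model of the read-side micro-TAG; counts one-cycle stalls."""

    def __init__(self):
        self.lines = [None] * MICRO_TAG_ENTRIES
        self.victim = 0  # ASSUMPTION: round-robin replacement

    def read(self, vaddr):
        """Return the stall cycles (0 or 1) for a read that hits the cache."""
        line = vaddr // CACHE_LINE_BYTES
        if line in self.lines:
            return 0                      # micro-TAG hit: data in the same cycle
        # micro-TAG miss: data arrives one cycle later and the entry is loaded
        self.lines[self.victim] = line
        self.victim = (self.victim + 1) % MICRO_TAG_ENTRIES
        return 1


def read_stalls(addresses):
    """Total micro-TAG stall cycles for a sequence of read addresses."""
    tag = MicroTag()
    return sum(tag.read(addr) for addr in addresses)
```

Under these assumptions, read_stalls(range(0, 256, 4)) walks 64 consecutive words across eight cache lines and reports eight stall cycles, one per line. A loop that cycles over nine distinct lines would miss on every read in this model, because each new line evicts the one needed next.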

Memory Region Switch

The ARM926EJ-S memory system controller uses a predictive memory region scheme, in which the prediction is based on the previous memory region accessed. Switching between memory region types (for example, from data cache to data TCM) causes a 2-cycle stall. The region types are as follows:

For the data side:

TCM = data side TCM region

cacheable = WT or WB region

non-cacheable = NCB or NCNB region

For the instruction side:

TCM = instruction side TCM region

cacheable = C bit set in page-table entry

non-cacheable = C bit clear in page-table entry.

Any change from one region type to another causes the prediction to change to the new region type, which incurs a 2-cycle penalty. For example, changing from a WT region to an NCNB region changes the prediction to non-cacheable and causes a 2-cycle stall.

Changing from a WT region to a WB region will not result in a misprediction as they are the same region type.

Data accesses to instruction TCM do not affect the predicted memory region for the data side, and always incur a 2-cycle penalty.
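The data-side prediction rules above can be sketched as a simple stall counter. This is a minimal Python model written for illustration, not ARM's implementation: the attribute strings ("DTCM", "ITCM", "WT", and so on), the function name data_side_stalls, and in particular the initial prediction (cacheable) are assumptions, since the article does not say what the predictor holds before the first access.

```python
from enum import Enum


class Region(Enum):
    TCM = "TCM"
    CACHEABLE = "cacheable"            # WT or WB
    NON_CACHEABLE = "non-cacheable"    # NCB or NCNB


# Data-side memory attribute -> predicted region type (labels are hypothetical)
DATA_REGION = {
    "DTCM": Region.TCM,
    "WT": Region.CACHEABLE,
    "WB": Region.CACHEABLE,
    "NCB": Region.NON_CACHEABLE,
    "NCNB": Region.NON_CACHEABLE,
}

SWITCH_PENALTY = 2  # cycles, from the text above


def data_side_stalls(accesses, initial=Region.CACHEABLE):
    """Count region-switch stall cycles for a sequence of data accesses.

    ASSUMPTION: `initial` is the prediction before the first access.
    """
    predicted = initial
    stalls = 0
    for attr in accesses:
        if attr == "ITCM":
            # Data access to instruction TCM: always penalized,
            # and the data-side prediction is left unchanged.
            stalls += SWITCH_PENALTY
            continue
        region = DATA_REGION[attr]
        if region is not predicted:
            stalls += SWITCH_PENALTY   # misprediction
            predicted = region         # prediction follows the new region type
    return stalls
```

With the assumed initial prediction, data_side_stalls(["WT", "DTCM", "WT", "DTCM"]) returns 6, because each alternation between cacheable memory and data TCM is mispredicted. Grouping accesses by region type, for example ["WT", "WT", "DTCM", "DTCM"], reduces this to 2.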

Micro-TLB Miss

There are micro-TLBs for both instruction and data sides, each one containing 8 entries.

The micro-TLBs are fully associative and use a round-robin replacement policy. The main TLB is a 64-entry, 2-way set-associative structure, constructed using a standard SRAM component.

Main TLB replacement is round-robin per entry. Each entry can hold page-type entries of 1KB, 4KB, 16KB, 64KB, and 1MB. If a micro-TLB miss occurs, the main TLB is searched. Searching the main TLB requires up to five RAM reads. To reduce the search time when only a subset of page types is in use, the MMU records which page sizes are present in the TLB and searches only for those types. The penalty for a micro-TLB miss that hits in the main TLB is between 11 and 15 cycles.
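To get a rough feel for the cost of micro-TLB misses, the 8-entry round-robin micro-TLB can be modeled in a few lines. The Python sketch below is an estimate under stated assumptions, not a cycle-accurate model: it treats a single micro-TLB (instruction or data side), assumes every page has the same size (4KB by default), assumes every micro-TLB miss hits in the main TLB, and uses the hypothetical names micro_tlb_misses and stall_cycle_bounds.

```python
MICRO_TLB_ENTRIES = 8            # per side, from the text above
MISS_PENALTY_CYCLES = (11, 15)   # micro-TLB miss with main TLB hit


def micro_tlb_misses(addresses, page_bytes=4096):
    """Count micro-TLB misses for a sequence of virtual addresses.

    ASSUMPTION: all pages are `page_bytes` in size.
    """
    entries = [None] * MICRO_TLB_ENTRIES
    victim = 0                   # round-robin replacement pointer
    misses = 0
    for vaddr in addresses:
        page = vaddr // page_bytes
        if page not in entries:
            entries[victim] = page
            victim = (victim + 1) % MICRO_TLB_ENTRIES
            misses += 1
    return misses


def stall_cycle_bounds(misses):
    """Lower and upper bound on stall cycles.

    ASSUMPTION: every micro-TLB miss hits in the main TLB.
    """
    low, high = MISS_PENALTY_CYCLES
    return misses * low, misses * high
```

For example, a loop that touches eight 4KB pages three times in turn misses only on the first pass (8 misses), whereas cycling over nine pages misses on every access in this model: micro_tlb_misses([p * 4096 for p in range(9)] * 3) gives 27, and stall_cycle_bounds(27) gives (297, 405) cycles. Mapping the same data with a larger page size, where that is possible, keeps the working set within fewer micro-TLB entries.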

Related information

Not applicable.
