Choosing the IP configuration
How a piece of IP is configured can have a big effect on a PPA analysis. In particular, the area of a realized piece of IP is affected by how many optional features are included. The choice of options can potentially have a bigger effect than channel length and track size. Becoming familiar with the configuration options for the different IP available through the Arm Flexible Access program, and deciding whether each option is needed, is therefore an important step. Ultimately, if an SoC design requires certain features, then the increase in size must be accepted. In this case, other ways of reducing the size, for example through physical IP library choice or using a more modern process, could be explored.
Running a PPA analysis requires that the analysis team choose how to configure the IP. If they decide to run an analysis multiple times, they may remove optional features for some runs so that they can target, for example, minimal area or high performance.
The following sections look at the configuration options for processors as a starting point. Not all options are available for all processors. In addition, some options may be obligatory for some processors.
Caches
When looking at a PPA analysis, look at the size of the caches used in the trial implementation. If you are concerned about the area of the IP, you could reduce the size of the caches. Because smaller caches impact throughput performance, a tradeoff exists here. Also, be aware that larger caches increase performance up to a sweet spot beyond which further increases in cache size will only lead to marginal performance improvements.
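The diminishing return described above can be illustrated with the standard average memory access time (AMAT) relation: hit time plus miss rate multiplied by miss penalty. The sketch below is illustrative only. The cache sizes, miss rates, and cycle counts are made-up assumptions, not measured figures for any Arm processor, and the function names are hypothetical.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, in cycles."""
    return hit_time + miss_rate * miss_penalty


# Assumed miss rates per cache size in KiB (illustrative values only).
ASSUMED_MISS_RATES = {8: 0.10, 16: 0.05, 32: 0.03, 64: 0.025}
HIT_TIME = 1        # assumed cache hit latency, in cycles
MISS_PENALTY = 50   # assumed main memory latency, in cycles


def amat_by_cache_size():
    """Map each assumed cache size to its average access time."""
    return {
        kib: amat(HIT_TIME, miss_rate, MISS_PENALTY)
        for kib, miss_rate in ASSUMED_MISS_RATES.items()
    }


if __name__ == "__main__":
    for kib, cycles in amat_by_cache_size().items():
        print(f"{kib:>3} KiB cache: {cycles:.2f} cycles per access")
```

With these assumed numbers, each doubling of the cache saves fewer cycles than the previous one, while the area cost of the cache keeps growing with its size.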
Level 1 caches
On Arm Cortex-A series and Arm Cortex-R series processors, it is typical to have a level 1 cache for instructions, called an I-cache, and a level 1 cache for data, called a D-cache, for each core.
Note: It is possible to have a processor without any level 1 memories, and most Arm Cortex-M series processors do not have caches. Without a cache, the processor must fetch all data and instructions from the possibly slower main memory. If the main memory is slower, the system will be significantly slower even if the processor is clocked very fast.
Level 2 caches
Level 2 caches are only available for Arm Cortex-A series processors. All Arm Cortex-A series processors within the Arm Flexible Access program, except for the Arm Cortex-A5 processor, support an internal level 2 cache, which is shared between the cores on the processor. Be aware that a level 2 cache used with an Arm Cortex-A5 processor is external to the processor, and its area will still contribute to the overall area of the SoC. The area figure for Arm Cortex-A5 multi-core processor PPA analysis will include a level 2 cache controller (L2CC).
Note: Some more advanced Arm architectures allow for each core in a multi-core processor to have a level 2 cache. However, these processors are not currently available within the Arm Flexible Access program.
Tightly Coupled Memory
Tightly Coupled Memory (TCM) ensures that critical code and data are always available. Compared with caches, which can introduce indeterminate delays, TCM supports the processing of data with single cycle access. Where applicable, data blocks can be moved into the TCM as a background task, processed while within the TCM, and then written back to the main memory. One or more TCMs can increase performance where it is most needed, and these memories are well suited to deterministic interrupt routine responses.
TCMs are an option for Arm Cortex-R series processors and the Arm Cortex-M7 processor. It is very common to have more than one TCM per core. For example, like level 1 caches, you can have a TCM A for instructions and a TCM B for data. TCMs can vary in size. Larger TCMs will have a bigger impact on the area of your realized IP.
Caution: When you look at PPA analysis data for the Arm Cortex-R5 and Arm Cortex-M7 processors, you will notice that only the TCM interfaces are included. This is because the actual TCMs must be situated in a different block of the SoC from the core. This means that the PPA analysis data shows only the area for the interface, which is negligible, and you must consider that TCMs will add to the final area of the SoC.
Note: The Arm Cortex-R5 processor provides two interfaces for TCM B: B0 and B1. If you are using an Advanced eXtensible Interface (AXI) slave interface, you can make use of the second interface to TCM B. Although the increase in area from having two B interfaces is trivial, an increase in performance could be achieved by this configuration.
Fault protection
Two mechanisms can be employed to protect against faults occurring in data: Error Correction Code (ECC) and parity. Parity is cheaper in terms of area than ECC because only a single bit is used. In an SoC, both mechanisms can be employed, with ECC and parity protecting different data.
Although fault protection requires a larger memory footprint, the impact on performance is minimal because ECC and parity are built into the pipeline.
Error Correction Code
Error Correction Code (ECC) is an optional feature that detects errors and, in the case of single-bit errors, automatically corrects them. This ensures data integrity in caches and TCMs. ECC achieves its function by storing a code that describes the data present in memory. The code itself is stored in extra bits. For example, every 32 bits of memory might have 8 bits allocated for the code. When the memory is written to, the codes are automatically created. When the memory is read from, any errors are detected and, where possible, corrected.
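To see what the extra bits mean for area, the following sketch works through the example above of 8 code bits for every 32 data bits. The 32 KiB memory size and the function names are assumptions chosen for illustration. The last function applies the textbook Hamming bound for a single-error-correcting, double-error-detecting code, which is not necessarily the scheme a given Arm processor uses.

```python
def storage_overhead(data_bits, check_bits):
    """Fraction of extra storage added by the check bits."""
    return check_bits / data_bits


def protected_size_bits(capacity_bytes, data_bits, check_bits):
    """Total bits stored when every data word carries its check bits."""
    words = (capacity_bytes * 8) // data_bits
    return words * (data_bits + check_bits)


def min_secded_check_bits(data_bits):
    """Textbook Hamming bound: smallest r with 2**r >= data_bits + r + 1,
    plus one overall parity bit for double-error detection."""
    r = 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1


if __name__ == "__main__":
    print(storage_overhead(32, 8))                  # 8 code bits per 32 data bits
    print(protected_size_bits(32 * 1024, 32, 8))    # assumed 32 KiB memory
    print(min_secded_check_bits(32))
```

Under these assumptions, a 32 KiB memory grows to the equivalent of 40 KiB of storage, which is the kind of increase to look for when comparing PPA figures with and without ECC.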
On Arm Cortex-R series processors, it is possible to have ECC on buses. When this configuration is used, it is noted in the PPA analysis data. This extra functionality does not affect the area of the processor IP beyond noise.
Parity
Parity is an optional feature that detects errors by using a parity bit. A parity bit is used to protect the data in the tag RAM of a cache controller. The tag RAM contains the address information for a cache.
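As a minimal sketch of how a single parity bit detects an error, the following example computes an even-parity bit for a stored value and checks it on read. The 8-bit tag value and the function names are hypothetical. The example also shows the known limitation of a single parity bit: it cannot correct an error, and it misses errors that flip an even number of bits.

```python
def parity_bit(value):
    """Even-parity bit: 1 when the value has an odd number of set bits."""
    return bin(value).count("1") % 2


def parity_ok(value, stored_parity):
    """True when the value read back still matches its stored parity bit."""
    return parity_bit(value) == stored_parity


if __name__ == "__main__":
    tag = 0b10110010                  # hypothetical tag RAM entry
    stored = parity_bit(tag)          # generated when the entry is written

    one_flip = tag ^ 0b00000001       # single-bit fault
    two_flips = tag ^ 0b00000011      # double-bit fault

    print(parity_ok(tag, stored))        # clean read
    print(parity_ok(one_flip, stored))   # single-bit error is detected
    print(parity_ok(two_flips, stored))  # double-bit error goes unnoticed
```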
Floating Point Unit
A Floating Point Unit (FPU) enables floating point calculations to be made on a processor. An FPU also significantly increases the area of a processor. FPUs can be single precision or double precision. Double precision FPUs use more area than single precision FPUs. Where the FPU is an option, for example for all Arm Cortex-R-series processors, all Arm Cortex-A-series processors, and the Arm Cortex-M23, Arm Cortex-M4, Arm Cortex-M33, and Arm Cortex-M7 processors, your decision regarding an FPU will ultimately be governed by the requirements of the software that you need to run on your SoC.
The Neon instruction set is reliant on an FPU being included in the processor design, and on some Arm Cortex-A-series processors, the FPU and Neon are coupled together as a single option.
Embedded Trace Macrocell
The Embedded Trace Macrocell (ETM) is a debug component that enables reconstruction of program execution. The ETM is designed to be a high-speed, low-power debug tool, which only supports instruction tracing. This ensures that the area that the ETM adds to any realized IP is minimal. The ETM is optional for the Arm Cortex-M23, Arm Cortex-M3, Arm Cortex-M4, Arm Cortex-M33, Arm Cortex-M7, Arm Cortex-R5, Arm Cortex-R8, and Arm Cortex-A5 processors. If you anticipate that processor tracing would have no use in your design, you could reduce the area of your design by excluding it. However, excluding the ETM may have a serious impact on the team developing software for your SoC, and might affect safety certification processes. We recommend that your software team is included in any discussion about the inclusion of an ETM.
Note: The ETM option for the Arm Cortex-R5 processor is a separate component outside of the processor block. Although the ETM option cannot affect the PPA analysis of the Arm Cortex-R5 processor, if the option is included, then the ETM component will still add to the area of the SoC.
Memory Protection Unit
The Memory Protection Unit (MPU) is a hardware unit that controls a limited number of protection regions in memory. MPUs are a major component of the Arm Protected Memory System Architecture (PMSA), which is found on all the Arm Cortex-R-series processors and most of the Arm Cortex-M-series processors (Arm Cortex-M0+, Arm Cortex-M3, Arm Cortex-M4, Arm Cortex-M7, Arm Cortex-M23, and Arm Cortex-M33 processors) that are included in the Arm Flexible Access program. In Arm Cortex-M-series processors and the Arm Cortex-R5 processor, MPUs are optional. In addition, you can specify how many MPU regions you want a core to support.
Neon
Neon is an advanced Single Instruction Multiple Data (SIMD) architecture extension for Arm Cortex-A series processors and the Arm Cortex-R52 processor. The technology is intended to improve the multimedia user experience by accelerating audio and video encoding and decoding, user interfaces, and 2D/3D graphics or gaming. Neon can also accelerate signal processing algorithms and functions. This means that Neon can speed up applications like audio and video processing, voice and facial recognition, computer vision, and deep learning. Neon is optional on all Arm processors that support it. If you do not require its capabilities for your software, you can reduce the area of your SoC design by excluding it.
The Neon instruction set is reliant on an FPU being included. On the Arm Cortex-A32, Arm Cortex-A34, Arm Cortex-A35, and Arm Cortex-A53 processors, Neon and the FPU are coupled together, so you need to include both or none.
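The two rules above can be expressed as a small consistency check to run before choosing a configuration for a PPA analysis. This is a sketch only: the function name, the argument names, and the processor name strings are hypothetical and do not correspond to any Arm configuration tool.

```python
# Processors on which the text says Neon and the FPU form a single option.
COUPLED_NEON_FPU = {"Cortex-A32", "Cortex-A34", "Cortex-A35", "Cortex-A53"}


def neon_fpu_problems(processor, fpu, neon):
    """Return the rule violations for a proposed FPU and Neon selection."""
    problems = []
    if neon and not fpu:
        problems.append("Neon requires an FPU")
    if processor in COUPLED_NEON_FPU and fpu != neon:
        problems.append(
            f"{processor}: Neon and the FPU must be included together or not at all"
        )
    return problems


if __name__ == "__main__":
    print(neon_fpu_problems("Cortex-A53", fpu=True, neon=True))
    print(neon_fpu_problems("Cortex-A53", fpu=True, neon=False))
    print(neon_fpu_problems("Cortex-A7", fpu=False, neon=True))
```

An empty list means the selection is consistent with the rules stated here. It does not confirm that the option exists for that processor, which must still be checked against the processor documentation.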
Snoop Control Unit
A Snoop Control Unit (SCU) maintains data cache coherency between different cores and arbitrates between cores requesting level 2 access. The SCU is required for multi-core processor configurations and for non-multi-core configurations where an Accelerator Coherency Port (ACP) is present. The SCU is available on the Arm Cortex-R8 processor and Arm Cortex-A-series processors. The Arm Cortex-R5 processor has a Micro Snoop Control Unit (µSCU).
SCUs significantly increase the area of a processor when they are included in the design. This tradeoff is necessary when a multi-core processor is required.
Accelerator Coherency Port
An Accelerator Coherency Port (ACP) is an AXI slave interface, which allows external, non-cached, intelligent peripherals, for example DMA controllers, companion DSPs, and Ethernet or Flexray interfaces, to access cacheable memory belonging to the core or cores of the processor.
To maintain cache coherency, access attempts are checked in all shared cached locations in the processor cluster. This data cache sharing typically boosts performance when the external memory access latency is long.
An ACP requires the presence of an SCU in a non-multi-core processor configuration. This technology is not an option for Arm Cortex-M-series processors, the Arm Cortex-R52 processor, or the Arm Cortex-A7 processor.
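The dependencies between cores, the ACP, and the SCU can be summarized as a short check. This is a hedged sketch: the function names and processor name strings are hypothetical, the check encodes only the rules stated in this section, and it does not model the µSCU on the Arm Cortex-R5 processor.

```python
# Processors for which the text says an ACP is not an option,
# in addition to all Cortex-M series processors.
ACP_NOT_OFFERED = {"Cortex-R52", "Cortex-A7"}


def acp_ruled_out(processor):
    """True when the text explicitly excludes an ACP for this processor."""
    return processor.startswith("Cortex-M") or processor in ACP_NOT_OFFERED


def scu_required(cores, acp):
    """An SCU is needed for multi-core, or for single-core with an ACP."""
    return cores > 1 or acp


def coherency_problems(processor, cores, acp, scu):
    """Return the rule violations for a proposed core/ACP/SCU selection."""
    problems = []
    if acp and acp_ruled_out(processor):
        problems.append(f"{processor}: an ACP is not an option")
    if scu_required(cores, acp) and not scu:
        problems.append("an SCU is required for this configuration")
    return problems


if __name__ == "__main__":
    print(coherency_problems("Cortex-A5", cores=1, acp=True, scu=False))
    print(coherency_problems("Cortex-A7", cores=2, acp=True, scu=True))
```

Because an SCU significantly increases area, a check like this makes it visible when adding an ACP to a single-core design also brings in an SCU.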
Interrupt controllers
Interrupt controllers manage interrupt requests (IRQs) and can be integrated or external. In relation to PPA analysis, integrated, built-in interrupt controllers will add to the area figure in the PPA analysis data. The more interrupts that are supported, the more area is used. For example, an Arm Cortex-M0 processor supports 1-32 interrupts, whereas an Arm Cortex-R52 processor can support 32-960 interrupts. When considering how many interrupts your processor needs to support, keep in mind that:
- Each interrupt will contribute to the area of the SoC. External interrupt controllers contribute to the overall area of the SoC even if they do not contribute to PPA analysis area figures.
- Having more interrupts might make it harder for individual cores to achieve higher performance figures.
All Arm Cortex-M series processors and the Arm Cortex-R52 processor have an integrated interrupt controller.
Arm Cortex-A5 multi-core processors and Arm Cortex-R8 processors have integrated interrupt controllers, but these can be set to support 0 IRQs, which effectively disables them and allows an external controller to be used. The Arm Cortex-A7 processor has the option of an integrated interrupt controller, but an external interrupt controller can be used instead.
The Arm Cortex-R5 processor and other Arm Cortex-A series processors available in the Arm Flexible Access program only support an external interrupt controller.
Note: When looking at PPA analysis data, be aware of how the interrupt controllers have been set up. Are they integrated or external? How many IRQs were specified in the implementation? You might have plans for the setup of the interrupt controller that differ from the PPA implementation, and you should consider how your plans could affect the size of your final SoC design.
Low Latency Peripheral Port
A Low Latency Peripheral Port (LLPP) is a feature of Arm Cortex-R series processors which integrates latency-sensitive peripherals more tightly with the processor. LLPPs bypass the main AXI bus and ensure I/O access to the peripheral is not blocked by queued transactions.
The inclusion of an LLPP will not have any noticeable effect on the area figures in a PPA analysis, but it may offer a performance improvement for the SoC overall.