Optimizing the Arm CoreLink NIC-400

This tutorial explains how to optimize the Arm CoreLink NIC-400 Network Interconnect for your application. It introduces virtual prototyping, modeling the NIC-400, and driving and analyzing traffic through it. It highlights the key features of the NIC-400 and presents a methodology, based on real system traffic and software, for making the best design decisions.

Arm CoreLink NIC-400 Overview

The CoreLink NIC-400 Network Interconnect provides a highly configurable network of interconnect switches and bridges to connect up to 128 AXI or AHB-Lite masters to up to 64 AXI, AHB-Lite or APB slaves in a single NIC-400 instantiation.

The NIC-400 is the fourth-generation AXI interconnect from Arm. It is delivered as a base AMBA AXI3 and/or AXI4 interconnect product with three optional, license-managed advanced features: QoS-400 Advanced Quality of Service to dynamically regulate traffic entering the network; QVN-400 QoS Virtual Networks to prevent blocking at arbitration points; and TLX-400 Thin Links to reduce routing congestion and ease timing closure for long paths.

The designer can select the topology of the network of switches to increase the efficiency of the interconnect in different ways, including:

  • Traffic streams from multiple masters can be combined to increase wire utilization, reducing congestion
  • Grouping of masters by location can shorten paths between IP blocks and switches
  • The division of large switches into multiple smaller switches can allow increased frequencies and provide low latencies for critical paths such as between CPU and DDR main memory
  • Each switch can select the appropriate data width and clock frequency to meet performance targets while minimizing power and area
  • Different switches can be placed in different domains allowing hierarchical clock gating to reduce power whenever each domain is idle
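
As a rough illustration of the data width and clock frequency trade-off mentioned above, the theoretical peak bandwidth of one data channel is its width in bytes multiplied by its clock frequency, assuming one data beat per cycle. The sketch below is not part of any Arm tool; the function name and the candidate figures are hypothetical and chosen only for illustration.

```python
def peak_bandwidth_bytes_per_s(data_width_bits: int, clock_hz: float) -> float:
    """Theoretical peak for one data channel, assuming one beat per clock cycle."""
    if data_width_bits % 8 != 0:
        raise ValueError("data width must be a whole number of bytes")
    return (data_width_bits // 8) * clock_hz


# Hypothetical candidates (width in bits, clock in Hz) for a single switch.
candidates = [(64, 400e6), (128, 200e6), (128, 400e6)]

for width, clock in candidates:
    gb_per_s = peak_bandwidth_bytes_per_s(width, clock) / 1e9
    print(f"{width}-bit @ {clock / 1e6:.0f} MHz: {gb_per_s:.1f} GB/s peak")
```

A narrow, fast switch and a wide, slow switch can offer the same peak bandwidth, so the choice between them comes down to the power, area and timing considerations listed above. Sustained bandwidth in a real system will be lower than this peak.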

The NIC-400 switches and bridges support both AMBA AXI3 and the newer AMBA 4 AXI4, which has fewer wires (no WIDs) and enhanced streaming performance (longer burst support). All bridging between AXI4 and AXI3 is handled seamlessly by the NIC-400. APB support is extended to AMBA 4 APB4 with new write strobes and TrustZone signalling.



Configuring the NIC-400

Designers use the AMBA Designer tool to configure the NIC-400:

  1. The designer starts by defining the masters and slaves in the system, specifying the bus protocol, bus width and clock that each uses, and filling in a matrix of the required master-slave connectivity.
  2. The designer then sets up the address maps for each master and a global address map as required along with any remap options.
  3. Finally, the designer establishes the topology of the connectivity between all masters and all slaves in the implementation view. This allows the designer to minimize latency between critical masters and slaves, e.g. between the CPU and main memory, and to group components together to save wires and gates. The GUI then allows the designer to select configuration options for buffer depths, registering options, QoS regulators and other characteristics.
  4. The tool will then automatically instantiate the required master/slave interfaces, switch matrices and bridges in RTL and generate an accompanying IP-XACT description.
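
The connectivity matrix and address map from the first two steps can be pictured as simple lookup tables. The sketch below is only a conceptual model in Python, not the AMBA Designer or IP-XACT format; every master name, slave name and address range in it is a hypothetical example.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical connectivity matrix: each master lists the slaves it may reach.
CONNECTIVITY: Dict[str, List[str]] = {
    "cpu": ["ddr", "ethernet"],
    "dma": ["ddr"],
}

# Hypothetical global address map: slave -> (base address, size in bytes).
ADDRESS_MAP: Dict[str, Tuple[int, int]] = {
    "ddr": (0x8000_0000, 0x4000_0000),
    "ethernet": (0x4000_0000, 0x0001_0000),
}


def decode(master: str, address: int) -> Optional[str]:
    """Return the slave that `master` reaches at `address`, or None if unmapped."""
    for slave in CONNECTIVITY.get(master, []):
        base, size = ADDRESS_MAP[slave]
        if base <= address < base + size:
            return slave
    return None


print(decode("cpu", 0x8000_1000))  # cpu is connected to ddr at this address
print(decode("dma", 0x4000_0000))  # dma has no connection to ethernet
```

Keeping connectivity separate from the address map mirrors the way the steps above treat them: a master only sees the slaves it is connected to, even when an address falls inside another slave's range.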

Modeling the NIC-400

Given the possible combinations of connections, connection types, protocols, QoS settings and other options, there are billions of possible unique NIC-400 models.

  1. After configuring the interconnect using AMBA Designer, upload the IP-XACT file to Arm IP Exchange, which compiles the model and makes it available for download.
  2. Once compiled, the model may be managed using the same portal, and the IP-XACT file is stored there in case the model needs to be modified.
  3. Once created, the NIC-400 model can easily be executed in tandem with traffic generators to quickly start gathering data. This approach, which uses traffic generators to both produce and consume traffic, is not targeted at verifying the NIC-400, since that is assumed to be correct. Instead, it is targeted at measuring the performance characteristics of the model as it is exercised.
  4. Traffic can be parameterized to mimic the behavior of system IP and to sweep across a range of options, including burst length, priority and address.
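
A parameter sweep like the one described in the last step can be sketched as the Cartesian product of the options being varied. This is a minimal illustration in Python, not the interface of any particular traffic generator; the `TrafficConfig` class, the `sweep` function and the values used are all hypothetical.

```python
from dataclasses import dataclass
from itertools import product
from typing import Iterable, Iterator


@dataclass(frozen=True)
class TrafficConfig:
    """One hypothetical traffic generator setting."""
    burst_length: int   # data beats per burst
    priority: int       # QoS value attached to each transaction
    base_address: int   # start address of the generated accesses


def sweep(burst_lengths: Iterable[int],
          priorities: Iterable[int],
          base_addresses: Iterable[int]) -> Iterator[TrafficConfig]:
    """Yield every combination of the supplied option values."""
    for burst, prio, addr in product(burst_lengths, priorities, base_addresses):
        yield TrafficConfig(burst, prio, addr)


# Example values only: 4 burst lengths x 3 priorities x 1 address.
configs = list(sweep([1, 4, 8, 16], [0, 8, 15], [0x8000_0000]))
print(f"{len(configs)} simulation runs to schedule")
```

The number of runs grows multiplicatively with each option added, so it is usually worth sweeping only the parameters expected to matter for the traffic being mimicked.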


In the example shown here, a simple NIC-400 is configured with two masters and two slaves. The masters are set up to mimic the data loads from a CPU and DMA controller and the dummy targets are an Ethernet MAC and a DDR3 memory controller. Since the traffic generators are configurable, it’s possible to model any number of different sources or targets. The graphs shown above track the latency on the CPU interface and the queues in the DDR controller.

This process establishes a framework for gathering quantitative data on the performance of the NIC-400 and tracking how well it meets the requirements. Analyzing the results will likely lead to reconfiguration, recompilation and re-simulation. It is also important to vary the traffic parameters as well as the NIC parameters, because the true measure of NIC-400 performance is its impact on the behavior of the entire system.
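
One simple way to track results across reconfigurations is to reduce each run's latency samples to a few summary figures and compare them against a target. The sketch below assumes the latencies have already been exported from the simulation as a list of cycle counts; the function names, the sample values and the target are hypothetical and not taken from any real run.

```python
import math
from statistics import mean
from typing import Dict, List


def summarize(latencies_cycles: List[int]) -> Dict[str, float]:
    """Reduce one run's latency samples to mean, 95th percentile and maximum."""
    ordered = sorted(latencies_cycles)
    rank = math.ceil(0.95 * len(ordered))  # nearest-rank percentile, 1-based
    return {
        "mean": mean(ordered),
        "p95": ordered[rank - 1],
        "max": ordered[-1],
    }


def meets_target(latencies_cycles: List[int], max_p95_cycles: int) -> bool:
    """Check a run against a hypothetical 95th-percentile latency target."""
    return summarize(latencies_cycles)["p95"] <= max_p95_cycles


# Made-up samples standing in for one simulation run.
samples = [10, 12, 11, 40, 13]
print(summarize(samples))
print(meets_target(samples, max_p95_cycles=32))
```

Looking at a high percentile as well as the mean helps expose occasional long stalls, which an average alone can hide.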

The traffic generator approach can provide fast, accurate results even if there is nothing more than just the NIC-400 and a few sources and consumers. It can be set up quickly and enables easy testing of a wide range of system possibilities including corner cases which might be difficult to set up with real IP. However, no matter how much time is spent assembling traffic generation schemes, these do not reflect the actual behavior of the system running real software and this behavior can vary greatly depending upon the system software and IP configuration. Even a slight reordering of system software calls can have a big impact on overall system performance. Therefore it is important that the system is validated using virtual prototypes to ensure it meets performance targets.


Validating the NIC-400

System level virtual prototypes provide a far more realistic view of what’s going on inside the system. This is important for handling behavior that traffic generators may not model as accurately as the real IP, such as ordering, arbitration and the number of outstanding transactions. Coherency is another extremely important item. While most of the coherent traffic in the system will be handled by the CCI/CCN IP, it still has an impact on system performance, and a few software calls can greatly change the system traffic needed to maintain coherency.

Software can have a dramatic impact on the performance of the overall system, and getting this software up and running with the real hardware enables both to be optimized in advance of actual silicon. For example, system level benchmarks are often used to market the IP once it has been finalized. Leading-edge design teams use these benchmarks during the design process of the SoC, both to drive traffic in the system and to tweak the settings of the IP to maximize performance. This helps ensure that the actual silicon will meet the marketing specifications.



This is an example system level virtual prototype. Here the Arm Cortex-A57 is featured as the main processor in the system. There’s a CCI-400 and multiple DMC-400s to handle most memory accesses and the NIC-400 hangs off the CCI to manage memory accesses to the rest of the system. The system is fully capable of booting Linux and then running a variety of system level benchmarks. While components would still need to be added to model the capabilities of a leading edge SoC, this simple system model is sufficient to optimize the performance of compute oriented benchmarks and maximize the performance of the processor/memory subsystem.

A parameterizable memory has been used on the NIC-400 to provide system level flexibility for additional components. Most systems will use multiple memories, one for each modeled component or, if desired, the actual IP models can be used. Because it can boot an OS, this system level virtual prototype is valuable not only for validating the performance of the various system components but also for enabling pre-silicon firmware development.

Using the system model described in the CPAK above, it is possible to execute real system software, boot an OS and execute system level benchmarks to enable optimization of the NIC-400 as well as the other components in the system.



This article was originally written as a blog by Bill Neifert. Read the original post on Connected Community.