Differentiate your design

Arm Custom Instructions open the door to implementing bespoke data processing operations without adding complexity to the software development flow. By using Arm Custom Instructions, silicon architects can differentiate their designs without compromising quality, ease of use, or security.


Specifications and features

Arm Custom Instructions enable a new level of optimization to meet the increasing industry demand for workload-specific compute. Arm Custom Instructions are:

  • Arm architecture compliant
  • Supported by standard Arm-compliant software development tools, including open-source compilers such as GCC
  • Tightly coupled to the processor pipeline, bringing the highest performance efficiency gains to latency- and power-sensitive applications
  • Supported in hardware and software to enable co-development between hardware and software teams
  • Compatible with TrustZone technology

Implementing Arm Custom Instructions

Arm Custom Instructions enable you to push performance and efficiency further by adding application-domain-specific features to small embedded processors, while maintaining all the advantages of the Arm software ecosystem.

Arm Custom Instructions allow you to add a customizable module inside the Arm Cortex-M33 and Cortex-M55* processors. This module is driven by the pre-decoded instructions and shares the same interface as the standard arithmetic logic unit (ALU) of the CPU. Adding custom instructions to a customizable CPU requires two steps:

  1. Providing a configuration file that lists the regions you want to use for adding your own custom instructions.
  2. Building the data path for your own custom instructions and integrating it into the configuration space.

The configuration space can implement one of the following Arm Custom Instruction formats, which are defined by the Arm Instruction Set Architecture:

Instruction | Assembly

General-purpose registers and NZCV flags
CX1{A} | CX1{A} Pn, {Rd,} Rd, #imm
CX2{A} | CX2{A} Pn, {Rd,} Rd, Rn, #imm
CX3{A} | CX3{A} Pn, {Rd,} Rd, Rn, Rm, #imm
CX1D{A} | CX1D{A} Pn, {Rd,} Rd, #imm
CX2D{A} | CX2D{A} Pn, {Rd,} Rd, Rn, #imm
CX3D{A} | CX3D{A} Pn, {Rd,} Rd, Rn, Rm, #imm

FPU/M-Profile Vector Extension (MVE) registers
VCX1{A}.F | VCX1{A} Pn, {Sd,} Sd, #imm
VCX2{A}.F | VCX2{A} Pn, {Sd,} Sd, Sn, #imm
VCX3{A}.F | VCX3{A} Pn, {Sd,} Sd, Sn, Sm, #imm
VCX1{A}.D | VCX1{A} Pn, {Dd,} Dd, #imm
VCX2{A}.D | VCX2{A} Pn, {Dd,} Dd, Dn, #imm
VCX3{A}.D | VCX3{A} Pn, {Dd,} Dd, Dn, Dm, #imm
VCX1{A}.Q | VCX1{A} Pn, {Qd,} Qd, #imm
VCX2{A}.Q | VCX2{A} Pn, {Qd,} Qd, Qn, #imm
VCX3{A}.Q | VCX3{A} Pn, {Qd,} Qd, Qn, Qm, #imm

*Available in 2021

White paper: Introduction to Arm Custom Instructions

To see an example in practice and explore the capabilities that Arm Custom Instructions enable in more detail, download our white paper.


Your questions answered

  • When will Arm Custom Instructions be available to Arm silicon partners?

    The first implementation of Arm Custom Instructions is now available to Arm Cortex-M33 processor licensees. Arm Custom Instructions are supported as an optional feature in the Cortex-M33 r1p0 revision. Arm Custom Instructions for the Cortex-M55 processor will be available in 2021.

  • How do I implement my Custom Execution Logic?

    Arm Cortex-M processors that support Arm Custom Instructions include a customizable module, called the Partner Custom Execution Logic (see Figure 1), where silicon designers add their custom data path logic. This module receives pre-decoded instructions from the processor decoder in the same way as the native ALUs of the processor. All required control signals and the management of instruction interlocks and data dependencies are handled by the CPU.

    Figure 1: Partner custom execution logic in Armv8-M Cortex-M processor

    Within the Partner Custom Execution Logic, the designer implements one or several of the following Arm Custom Instruction classes, which are defined by the Arm Instruction Set Architecture:

    INSTRUCTION | ASSEMBLY | INPUTS | OUTPUTS

    Scalar registers and NZCV flags as inputs and outputs
    CX1{A} | CX1A Pn,{Rd,}Rd,#imm | Immediate and 1x 32-bit GPR/NZCV | 1x 32-bit GPR or NZCV
    CX1{A} | CX1 Pn,Rd,#imm | Immediate only | 1x 32-bit GPR or NZCV
    CX2{A} | CX2A Pn,{Rd,}Rd,Rn,#imm | Immediate and 2x 32-bit GPR/NZCV | 1x 32-bit GPR or NZCV
    CX2{A} | CX2 Pn,Rd,Rn,#imm | Immediate and 1x 32-bit GPR/NZCV | 1x 32-bit GPR or NZCV
    CX3{A} | CX3A Pn,{Rd,}Rd,Rn,Rm,#imm | Immediate and 3x 32-bit GPR/NZCV | 1x 32-bit GPR or NZCV
    CX3{A} | CX3 Pn,Rd,Rn,Rm,#imm | Immediate and 2x 32-bit GPR/NZCV | 1x 32-bit GPR or NZCV
    CX1D{A} | CX1DA Pn,{Rd,}Rd,#imm | Immediate and 1x 64-bit GPR pair | 1x 64-bit GPR pair
    CX1D{A} | CX1D Pn,Rd,#imm | Immediate only | 1x 64-bit GPR pair
    CX2D{A} | CX2DA Pn,{Rd,}Rd,Rn,#imm | Immediate, 1x 64-bit GPR pair, and 1x 32-bit GPR/NZCV | 1x 64-bit GPR pair
    CX2D{A} | CX2D Pn,Rd,Rn,#imm | Immediate and 1x 32-bit GPR/NZCV | 1x 64-bit GPR pair
    CX3D{A} | CX3DA Pn,{Rd,}Rd,Rn,Rm,#imm | Immediate, 1x 64-bit GPR pair, and 2x 32-bit GPR/NZCV | 1x 64-bit GPR pair
    CX3D{A} | CX3D Pn,Rd,Rn,Rm,#imm | Immediate and 2x 32-bit GPR/NZCV | 1x 64-bit GPR pair

    Floating-point/Vector registers as inputs and outputs
    VCX1{A}.F | VCX1A Pn,{Sd,}Sd,#imm | Immediate and 1x 32-bit fp32 register | 1x 32-bit fp32 register
    VCX1{A}.F | VCX1 Pn,Sd,#imm | Immediate only | 1x 32-bit fp32 register
    VCX2{A}.F | VCX2A Pn,{Sd,}Sd,Sn,#imm | Immediate and 2x 32-bit fp32 registers | 1x 32-bit fp32 register
    VCX2{A}.F | VCX2 Pn,Sd,Sn,#imm | Immediate and 1x 32-bit fp32 register | 1x 32-bit fp32 register
    VCX3{A}.F | VCX3A Pn,{Sd,}Sd,Sn,Sm,#imm | Immediate and 3x 32-bit fp32 registers | 1x 32-bit fp32 register
    VCX3{A}.F | VCX3 Pn,Sd,Sn,Sm,#imm | Immediate and 2x 32-bit fp32 registers | 1x 32-bit fp32 register
    VCX1{A}.D | VCX1A Pn,{Dd,}Dd,#imm | Immediate and 1x 64-bit fp64 register | 1x 64-bit fp64 register
    VCX1{A}.D | VCX1 Pn,Dd,#imm | Immediate only | 1x 64-bit fp64 register
    VCX2{A}.D | VCX2A Pn,{Dd,}Dd,Dn,#imm | Immediate and 2x 64-bit fp64 registers | 1x 64-bit fp64 register
    VCX2{A}.D | VCX2 Pn,Dd,Dn,#imm | Immediate and 1x 64-bit fp64 register | 1x 64-bit fp64 register
    VCX3{A}.D | VCX3A Pn,{Dd,}Dd,Dn,Dm,#imm | Immediate and 3x 64-bit fp64 registers | 1x 64-bit fp64 register
    VCX3{A}.D | VCX3 Pn,Dd,Dn,Dm,#imm | Immediate and 2x 64-bit fp64 registers | 1x 64-bit fp64 register
    VCX1{A}.Q | VCX1A Pn,{Qd,}Qd,#imm | Immediate and 1x 128-bit vector register | 1x 128-bit vector register
    VCX1{A}.Q | VCX1 Pn,Qd,#imm | Immediate only | 1x 128-bit vector register
    VCX2{A}.Q | VCX2A Pn,{Qd,}Qd,Qn,#imm | Immediate and 2x 128-bit vector registers | 1x 128-bit vector register
    VCX2{A}.Q | VCX2 Pn,Qd,Qn,#imm | Immediate and 1x 128-bit vector register | 1x 128-bit vector register
    VCX3{A}.Q | VCX3A Pn,{Qd,}Qd,Qn,Qm,#imm | Immediate and 3x 128-bit vector registers | 1x 128-bit vector register
    VCX3{A}.Q | VCX3 Pn,Qd,Qn,Qm,#imm | Immediate and 2x 128-bit vector registers | 1x 128-bit vector register

    Table 1: Arm Custom Instruction classes

    All Arm Custom Instructions operate on either the general-purpose scalar register file (instructions of the form CX...) or the register file in the Floating-Point Unit (instructions of the form VCX...). Each Arm Custom Instruction can be single-cycle or multi-cycle. Interrupt behavior is the same as on a processor that does not support Arm Custom Instructions.

  • Can Arm Custom Instructions access memory?

    Arm Custom Instructions have some deliberate restrictions to help avoid implementation issues:

    1. They cannot directly access memory or inputs/outputs outside the processor.
    2. They cannot have their own register state.

    Both restrictions are necessary for enforcing security, because the Arm architecture has clear definitions for secure accesses. If Arm Custom Instructions were to define their own state (by having their own registers) or their own path to memory, the architecture could no longer claim to enforce security. For example, an interrupt that forces a transition between Secure and Non-secure states could leak secure state from the custom execution logic block. For hardware accelerators that need internal state or direct memory access, the existing coprocessor interface can be a suitable solution.
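
    Because an Arm Custom Instruction has no private registers and no path to memory, its datapath behaves as a pure function of its register operands and immediate. The Python sketch below models that idea for a made-up CX3/CX3A pair. The operation (Hamming distance plus the immediate), the function names, and the 6-bit immediate width used in the range check are assumptions chosen for illustration; none of them is defined on this page.

```python
MASK32 = 0xFFFFFFFF  # general-purpose registers are 32 bits wide
CX3_IMM_BITS = 6     # assumed immediate width for the CX3 class


def _check(value: int, bits: int, name: str) -> None:
    """Reject operands that would not fit in the modelled field."""
    if not 0 <= value < (1 << bits):
        raise ValueError(f"{name} does not fit in {bits} bits")


def cx3_hamming(rn: int, rm: int, imm: int) -> int:
    """Hypothetical CX3: Rd = popcount(Rn XOR Rm) + imm."""
    _check(rn, 32, "rn")
    _check(rm, 32, "rm")
    _check(imm, CX3_IMM_BITS, "imm")
    return (bin(rn ^ rm).count("1") + imm) & MASK32


def cx3a_hamming(rd: int, rn: int, rm: int, imm: int) -> int:
    """Hypothetical CX3A: the accumulate form also reads Rd as an input."""
    _check(rd, 32, "rd")
    return (rd + cx3_hamming(rn, rm, imm)) & MASK32


if __name__ == "__main__":
    print(cx3_hamming(0b1010, 0b0110, 0))
    print(cx3a_hamming(10, 0xFF, 0x0F, 0))
```

    Nothing in the model survives between calls, which mirrors the rule that the custom logic keeps no state of its own: every result can be recomputed from the register file and the instruction encoding.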

  • How do I decide if an Arm Custom Instruction should be implemented?

    As with all Cortex-M processors, Arm provides fast models and cycle-accurate models so that silicon vendors can evaluate performance requirements for target algorithms. In some cases, the decision to implement an Arm Custom Instruction is obvious, for example when a time-critical operation that requires several standard Arm instructions can be replaced by a single-cycle Arm Custom Instruction. In other cases, weighing the benefit of a reduced cycle count against the cost of extra custom logic is less clear cut, and the fast model can be used to help arrive at a decision.
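
    The trade-off can be put into rough numbers once a model run has produced cycle counts. The sketch below is a back-of-the-envelope estimate only: the function name and all figures in the example call are hypothetical, and it assumes that the replaced instruction sequence accounts for a known number of cycles per call.

```python
def estimated_speedup(total_cycles: int, calls: int,
                      cycles_standard: int, cycles_custom: int) -> float:
    """Estimate whole-workload speedup from replacing one hot operation.

    total_cycles    -- measured cycles for the workload using standard instructions
    calls           -- how many times the operation executes in that workload
    cycles_standard -- cycles per call using standard Arm instructions
    cycles_custom   -- cycles per call using the custom instruction
    """
    if min(total_cycles, calls, cycles_standard, cycles_custom) < 0:
        raise ValueError("cycle counts and call counts must be non-negative")
    saved = calls * (cycles_standard - cycles_custom)
    remaining = total_cycles - saved
    if remaining <= 0:
        raise ValueError("inconsistent inputs: more cycles saved than measured")
    return total_cycles / remaining


if __name__ == "__main__":
    # Hypothetical figures: 50,000 calls, 12 cycles each, replaced by 1 cycle.
    print(f"{estimated_speedup(1_000_000, 50_000, 12, 1):.2f}x")
```

    A result close to 1.0 suggests that the extra logic is hard to justify on performance grounds alone, although area and power would still need to be weighed separately.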

  • How do I verify my custom execution logic implementation?

    The silicon designer is responsible for verifying any custom execution logic they add. Arm provides testbenches that facilitate verification of the added custom datapath extension, but ultimate responsibility for verifying the custom logic rests with the designer.
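
    One common way to approach this is to compare the datapath against an independently written reference over directed and random stimuli. The sketch below shows the idea at model level only; it is not the Arm-provided testbench, and the population-count datapath, function names, and stimulus set are assumptions for illustration.

```python
import random

MASK32 = 0xFFFFFFFF


def popcount_datapath(x: int) -> int:
    """Bit-parallel population count, written the way hardware might sum bits."""
    x &= MASK32
    x = x - ((x >> 1) & 0x55555555)                 # 2-bit partial sums
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333)  # 4-bit partial sums
    x = (x + (x >> 4)) & 0x0F0F0F0F                 # 8-bit partial sums
    return ((x * 0x01010101) & MASK32) >> 24        # add the four bytes


def popcount_reference(x: int) -> int:
    """Independent golden reference: count the '1' characters."""
    return bin(x & MASK32).count("1")


def count_mismatches(trials: int, seed: int = 0) -> int:
    """Run directed corner cases plus seeded random stimuli; return failures."""
    rng = random.Random(seed)
    stimuli = [0, 1, 0x80000000, MASK32]
    stimuli += [rng.getrandbits(32) for _ in range(trials)]
    return sum(popcount_datapath(x) != popcount_reference(x) for x in stimuli)


if __name__ == "__main__":
    print("mismatches:", count_mismatches(10_000, seed=1))
```

    Fixing the seed keeps any failure reproducible, and the same stimulus list could be replayed against an RTL simulation of the real datapath.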

  • How do software developers use Arm Custom Instructions? 

    Arm Custom Instructions are defined by the architecture, which means that all Arm-compliant compilers support them by default. The instruction encodings themselves don’t change; it is the behavior of an instruction that can vary from one silicon implementation to another. Silicon providers are expected to provide libraries that use Arm Custom Instructions to accelerate specific tasks on their silicon. In addition to libraries, C intrinsic functions for Arm Custom Instructions are defined in the Arm C Language Extensions (ACLE), which allow C code to access these instructions on ACLE-compliant compilers.

  • How many Arm Custom Instructions can I implement?

    Arm has redefined the coprocessor instruction set architecture encoding space to enable Arm Custom Instructions. A 3-bit field in the instruction encoding distinguishes between multiple Arm Custom Instruction execution units or coprocessor hardware connected via the coprocessor interface. Each Arm Custom Instruction execution unit can support a range of instructions, as shown in Table 1, and each of those instructions has an immediate value that ranges from 3 bits to 13 bits, depending on the instruction class. Silicon designers can use these immediate values to implement multiple instructions in the same instruction class. For example, the Arm Custom Instruction class below allows an immediate value of 13 bits:

    CX1A Pn, {Rd,} Rd, #<imm>

    The 13-bit immediate value can theoretically allow up to 8192 different Arm Custom Instructions in this class. In practice, hardware cost makes it unlikely that so many Arm Custom Instructions would be implemented; it is more likely that part of the immediate field would be used as an immediate constant for data processing.
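
    The split between "which operation" and "constant operand" is a design choice left to the silicon designer. The sketch below assumes one hypothetical layout, with the top 3 bits of the 13-bit immediate selecting the operation and the remaining 10 bits carrying a constant; the function names and the default split are illustrative only.

```python
CX1_IMM_BITS = 13  # immediate width quoted above for this instruction class


def pack_imm(op: int, const: int, op_bits: int = 3) -> int:
    """Combine an operation selector and a constant into one immediate."""
    if not 0 <= op_bits <= CX1_IMM_BITS:
        raise ValueError("op_bits must be between 0 and 13")
    const_bits = CX1_IMM_BITS - op_bits
    if not 0 <= op < (1 << op_bits):
        raise ValueError("operation selector out of range")
    if not 0 <= const < (1 << const_bits):
        raise ValueError("constant out of range")
    return (op << const_bits) | const


def unpack_imm(imm: int, op_bits: int = 3) -> tuple[int, int]:
    """Recover (operation selector, constant) as the datapath would."""
    if not 0 <= op_bits <= CX1_IMM_BITS:
        raise ValueError("op_bits must be between 0 and 13")
    if not 0 <= imm < (1 << CX1_IMM_BITS):
        raise ValueError("immediate does not fit in 13 bits")
    const_bits = CX1_IMM_BITS - op_bits
    return imm >> const_bits, imm & ((1 << const_bits) - 1)


if __name__ == "__main__":
    print(1 << CX1_IMM_BITS)            # number of distinct immediates
    imm = pack_imm(op=5, const=1000)
    print(imm, unpack_imm(imm))
```

    With this layout the class offers 8 operations, each taking a constant between 0 and 1023. Moving the boundary trades the number of operations against the constant range.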

  • Are Arm Custom Instructions interruptible like STM/LDM?

    Arm Custom Instructions can be interrupted like any other Arm instruction. However, a multi-cycle Arm Custom Instruction cannot be stopped by an interrupt and then resumed from where it stopped. It can be terminated during execution, but it then has to start again from the beginning.
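
    The cost of this restart behavior can be estimated with simple arithmetic. The sketch below assumes a hypothetical instruction latency and counts only the cycles spent inside the custom instruction; interrupt entry, exit, and handler time are deliberately ignored, and the function name is illustrative.

```python
def cycles_executed(latency: int, abandoned_at: list[int]) -> int:
    """Cycles spent in a multi-cycle instruction that restarts after interrupts.

    latency      -- cycles the instruction needs when it runs uninterrupted
    abandoned_at -- for each interrupted attempt, how many cycles had
                    already elapsed when the attempt was terminated
    """
    if latency <= 0:
        raise ValueError("latency must be positive")
    wasted = 0
    for elapsed in abandoned_at:
        if not 0 <= elapsed < latency:
            raise ValueError("an attempt can only be abandoned before it completes")
        wasted += elapsed  # this work is discarded, not resumed
    return wasted + latency  # the final attempt runs from the beginning


if __name__ == "__main__":
    print(cycles_executed(8, []))
    print(cycles_executed(8, [5]))
    print(cycles_executed(8, [5, 2]))
```

    Long multi-cycle instructions in interrupt-heavy systems therefore risk repeated wasted work, which is one reason to keep custom instruction latency short.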

More questions?

If you have any more questions about Arm Custom Instructions or you would like to access the release, talk to an Arm expert.
