Automated instruction sequence optimization for M-profile Vector Extension (MVE)

The M-profile Vector Extension (MVE) is a SIMD extension introduced in the Armv8.1-M architecture to accelerate digital signal processing and machine learning applications. MVE vectors are 128 bits wide, but to reduce area and power consumption, the architecture allows MVE instructions of different classes to partially overlap in execution. This arrangement allows Armv8.1-M processors such as the Cortex-M55 and Cortex-M85 to achieve a level of data processing performance that traditionally requires a 128-bit internal datapath, even though these processors have a 64-bit datapath.

To get the best performance, the scheduling of MVE instructions in the program code is crucial: instructions of different types must be interleaved to maximize the chance that their execution cycles overlap. Doing this optimization manually is challenging, as it requires a good understanding of the processor’s pipeline characteristics and can be very time consuming. Optimizing code with a brute-force solver is not practical either, as complex data processing loops can contain nearly a hundred instructions, which means billions of possible schedules.
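
To make the scheduling problem concrete, here is a minimal Python sketch of a toy cost model. It is an illustration under stated assumptions, not the behaviour of any real core or of the prototype tool: every instruction is assumed to occupy two cycles, an instruction is assumed to start one cycle after its predecessor when the two belong to different classes (and two cycles after otherwise), and data dependencies are ignored entirely. The class labels "L" (load/store) and "A" (arithmetic) and the function names are hypothetical.

```python
from itertools import permutations
from math import factorial


def cycles(schedule):
    """Estimate cycles for a sequence of instruction classes (toy model).

    Assumptions: each instruction occupies two cycles; an instruction
    overlaps its predecessor by one cycle only when their classes differ.
    Data dependencies and real pipeline hazards are not modelled.
    """
    if not schedule:
        return 0
    total = 2  # the first instruction always takes two cycles
    for prev, cur in zip(schedule, schedule[1:]):
        total += 1 if prev != cur else 2
    return total


def best_schedule(instrs):
    """Brute-force search over every distinct ordering (small inputs only)."""
    return min(sorted(set(permutations(instrs))), key=cycles)


def unconstrained_orderings(n):
    """Upper bound on orderings of n distinct instructions, ignoring dependencies."""
    return factorial(n)


if __name__ == "__main__":
    grouped = list("LLLAAA")
    print("grouped:", cycles(grouped), "cycles")
    best = best_schedule(grouped)
    print("best   :", "".join(best), cycles(best), "cycles")
    print("orderings of 20 instructions:", unconstrained_orderings(20))
```

Under these assumptions, six instructions grouped by class take 11 cycles, while a fully interleaved ordering takes 7. The exhaustive search is only feasible because the input is tiny: even 20 distinct instructions have more than 2 × 10^18 unconstrained orderings, which is why a smarter optimizer is needed for real loops.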

Arm has been looking into this issue and working on prototype optimization tools that make such optimization easier. In this presentation, we will share our results and the methodology used in our early prototype. Using an existing optimizer available in the open-source community, we have created an optimization flow that can optimize Helium (MVE) code sequences in several seconds, a task that could normally take days.