Optimize the final implementation
CMSIS-NN provides several versions of each kernel, and which ones can be used depends on the layer dimensions. For instance, the convolution functions have a dedicated version for layers whose input and kernel are square.
There are also _opt versions of the fully connected layers, which require the weights to be reordered beforehand. The paper "CMSIS-NN: Efficient Neural Network Kernels for Arm Cortex-M CPUs" describes this reordering.
The final implementation should use the most efficient version of each layer.
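As an illustration of choosing a version from the layer dimensions, here is a minimal sketch in C. It assumes the legacy q7 convolution API, where, to the best of our knowledge, the fast variants require the input channel count to be a multiple of 4 and the output channel count to be a multiple of 2, and the RGB variant requires exactly 3 input channels. The conv_layer_t struct, the conv_variant_t enum and select_conv_variant() are hypothetical helpers written for this example; they are not part of CMSIS-NN, and the constraints should be checked against the documentation of the library version in use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layer description for this example; not a CMSIS-NN type. */
typedef struct {
    uint16_t dim_im_in_x, dim_im_in_y;   /* input width and height */
    uint16_t ch_im_in, ch_im_out;        /* input and output channel counts */
    uint16_t dim_kernel_x, dim_kernel_y; /* filter width and height */
    uint16_t stride_x, stride_y;
    uint16_t padding_x, padding_y;
} conv_layer_t;

/* Each value stands for one legacy CMSIS-NN function (named in the comment). */
typedef enum {
    CONV_Q7_RGB,             /* arm_convolve_HWC_q7_RGB */
    CONV_Q7_FAST,            /* arm_convolve_HWC_q7_fast */
    CONV_Q7_FAST_NONSQUARE,  /* arm_convolve_HWC_q7_fast_nonsquare */
    CONV_Q7_BASIC,           /* arm_convolve_HWC_q7_basic */
    CONV_Q7_BASIC_NONSQUARE  /* arm_convolve_HWC_q7_basic_nonsquare */
} conv_variant_t;

/* Pick the most specialised variant whose constraints the layer satisfies.
 * The constraints encoded here are assumptions taken from the legacy API. */
conv_variant_t select_conv_variant(const conv_layer_t *l)
{
    assert(l != NULL);

    /* The square functions take a single value for each of these pairs. */
    bool square = l->dim_im_in_x == l->dim_im_in_y &&
                  l->dim_kernel_x == l->dim_kernel_y &&
                  l->stride_x == l->stride_y &&
                  l->padding_x == l->padding_y;

    /* Assumed constraint of the fast variants. */
    bool fast_ok = (l->ch_im_in % 4 == 0) && (l->ch_im_out % 2 == 0);

    if (square && l->ch_im_in == 3) {
        return CONV_Q7_RGB;
    }
    if (fast_ok) {
        return square ? CONV_Q7_FAST : CONV_Q7_FAST_NONSQUARE;
    }
    /* The basic variants have no dimension constraints. */
    return square ? CONV_Q7_BASIC : CONV_Q7_BASIC_NONSQUARE;
}
```

Under these assumptions, a first layer with a square RGB input maps to the RGB variant, and later layers with suitable channel counts map to a fast variant. Other specialised functions that the library may provide are left out of this sketch.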
In addition, several buffers in the network often have the same size. When such buffers are never in use at the same time, they should share the same memory so that the full network needs as few buffers as possible.
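One common way to reuse memory is a ping-pong scheme, sketched below in C. It assumes a purely sequential network in which each activation tensor is read only by the next layer, so two buffers are enough: a layer reads from one and writes to the other, and the roles swap at each layer. The pingpong_plan_t type and plan_pingpong() are hypothetical names for this example and are not part of CMSIS-NN; networks with branches or skip connections need a more careful lifetime analysis than this sketch performs.

```c
#include <assert.h>
#include <stddef.h>

/* Sizes in bytes of the two shared activation buffers. */
typedef struct {
    size_t size[2];
} pingpong_plan_t;

/* tensor_bytes[0] is the network input and tensor_bytes[i] is the output of
 * layer i. Tensors with an even index live in buffer 0 and tensors with an
 * odd index live in buffer 1, so a layer never reads and writes the same
 * buffer. Each buffer must be as large as the largest tensor it hosts. */
pingpong_plan_t plan_pingpong(const size_t *tensor_bytes, size_t n_tensors)
{
    pingpong_plan_t plan = { { 0, 0 } };

    assert(tensor_bytes != NULL || n_tensors == 0);

    for (size_t i = 0; i < n_tensors; ++i) {
        size_t slot = i % 2;
        if (tensor_bytes[i] > plan.size[slot]) {
            plan.size[slot] = tensor_bytes[i];
        }
    }
    return plan;
}
```

The two sizes returned by plan_pingpong() can then be used to declare two static arrays, and the total is never larger than the sum of all activation sizes that separate per-layer buffers would require.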