This guide uses a keyword spotting network as its example, but a keyword spotting network cannot work alone. Its input is the output of a Mel-Frequency Cepstral Coefficient (MFCC) stage, which computes the features that the network uses for recognition.
In theory, one could imagine a network using the audio samples as input instead of the MFCC. Such a network would be trained to learn how to extract features from the audio samples.
On an embedded system, that approach would be inefficient. A network that learns to compute features equivalent to MFCCs would cost more memory and more cycles than an optimized MFCC implementation.
For performance reasons, it is therefore useful on embedded systems to apply some signal processing to the input signal before it reaches the neural network. Hence, in a full solution, CMSIS-DSP should also be used.
For the same reasons, the final layer of a network may be replaced by another classifier when doing so reduces the memory usage and cycle count of the full solution.
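One simple instance of this idea, offered as an illustration rather than as the replacement the guide necessarily has in mind: when only the predicted class is needed, a final softmax layer can be replaced by an argmax over the raw outputs. Softmax is strictly increasing in each input, so the largest raw output is also the largest probability, and the exponentials are never computed. The function name `argmax_s8` and the 8-bit output type are assumptions for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the index of the largest raw network output.
 * This is the same class a softmax followed by "pick the highest
 * probability" would select. On ties, the first index wins. */
size_t argmax_s8(const int8_t *logits, size_t n)
{
    assert(logits != NULL && n > 0);

    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        if (logits[i] > logits[best]) {
            best = i;
        }
    }
    return best;
}
```

This only applies when the probabilities themselves are not needed, for example when no confidence threshold is applied to the result.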