Quantizing a network is difficult because switching from floating-point to fixed-point arithmetic introduces truncation noise and saturation effects.
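
The two effects can be seen on a single value. The following is a minimal sketch, assuming a signed 8-bit word with 7 fractional bits; the helper names to_q7 and from_q7 are hypothetical and are not part of CMSIS-NN.

```python
import math

def to_q7(x, frac_bits=7):
    """Convert a float to a signed 8-bit fixed-point integer (illustrative helper)."""
    # Truncation: the scaled value is floored to an integer, losing the low-order part.
    scaled = math.floor(x * (1 << frac_bits))
    # Saturation: the result is clamped to the range of a signed 8-bit word.
    return max(-128, min(127, scaled))

def from_q7(q, frac_bits=7):
    """Convert a fixed-point integer back to a float."""
    return q / (1 << frac_bits)

# Truncation noise: 0.3 is not representable, so a small error remains.
q = to_q7(0.3)
truncation_error = 0.3 - from_q7(q)

# Saturation: 1.5 is outside the representable range and is clamped.
q_sat = to_q7(1.5)
```

With these assumptions, 0.3 is stored as 38 (that is, 0.296875), and 1.5 is clamped to 127, the largest value the word can hold.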

There are two ways to address the problem of quantizing a network:

  • Training with a quantized network
  • Quantizing an existing network

The first solution should give better results, because the network is trained with the truncation noise and the saturation effects already present. However, it is also more complex than the second method.

Some ML frameworks provide operators to model quantization and saturation. For example, TensorFlow includes quantization operators like fake_quant_with_min_max_vars and fake_quant_with_min_max_vars_gradient.

Other frameworks may have similar operators. If the network is rewritten with those operators in the right places, a rewrite that an automated tool can perform, then:

  • The network can be trained with quantization effects
  • Statistics like min/max can be generated for the input/output of each layer
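
The idea behind such operators can be sketched in plain Python. This is a simplified illustration, not the TensorFlow implementation: the function fake_quant and the class MinMaxObserver are hypothetical names, and details such as how a real framework adjusts the range are left out.

```python
def fake_quant(x, min_val, max_val, num_bits=8):
    """Quantize then dequantize x, so the result stays a float (illustrative)."""
    levels = (1 << num_bits) - 1
    scale = (max_val - min_val) / levels
    # Saturation: values outside [min_val, max_val] are clamped.
    clamped = max(min_val, min(max_val, x))
    # Rounding to the nearest quantization step models the quantization noise.
    return min_val + round((clamped - min_val) / scale) * scale

class MinMaxObserver:
    """Track the min/max of the values seen at a layer input or output."""
    def __init__(self):
        self.min_val = float("inf")
        self.max_val = float("-inf")

    def update(self, values):
        for v in values:
            self.min_val = min(self.min_val, v)
            self.max_val = max(self.max_val, v)

# Collect statistics on some hypothetical activations, then apply fake quantization.
observer = MinMaxObserver()
observer.update([0.2, -0.7, 1.3])
quantized = [fake_quant(v, observer.min_val, observer.max_val) for v in [0.2, -0.7, 1.3]]
```

Because the output remains a float, the rest of the training pipeline is unchanged, while the values it sees are restricted to the quantization grid.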

However, these framework operators are not fully equivalent to the fixed-point implementations. They are float implementations that model the input and output of a layer, but not the internal computations of the fixed-point kernels. Those internal computations can saturate or suffer from sign inversion issues, and they generate computation noise that differs from that of a float implementation.
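
The difference can be illustrated with a hypothetical kernel that accumulates in a saturating 8-bit register. This is a sketch for illustration only and does not reproduce any actual CMSIS-NN kernel; the names sat_add_q7 and sum_saturating are assumptions.

```python
def sat_add_q7(a, b):
    """Add two signed 8-bit values with saturation (illustrative)."""
    return max(-128, min(127, a + b))

def sum_saturating(values):
    """Accumulate values one by one, saturating after every addition."""
    acc = 0
    for v in values:
        acc = sat_add_q7(acc, v)
    return acc

values = [100, 100, -100]

# Ideal result: the final sum fits in 8 bits, so quantizing only the
# input and output of the operation would not reveal any problem.
ideal = sum(values)

# Fixed-point result: the intermediate sum 200 saturates to 127 before
# -100 is added, so the final value differs from the ideal one.
actual = sum_saturating(values)
```

Here the ideal sum is 100, but the saturating accumulation yields 27. An operator that only models the input and output ranges would not reproduce this error.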

If full equivalence with CMSIS-NN is required, you should extend the framework you use with kernels that behave like the CMSIS-NN ones. However, inference and back-propagation do not necessarily require the same accuracy, and you may need a version of the kernels with more fixed-point accuracy for back-propagation.

Therefore, we can see that training with a quantized network:

  • Is more complex
  • Is framework dependent
  • May not model all effects of the fixed-point implementation

For the reasons stated above, we implement the second strategy, quantizing an existing network, because it is simpler and framework independent.

With the second strategy, quantizing the weights and biases is simple. Because the values are known, the fixed-point format is easy to find once the word size is chosen. For CMSIS-NN, the word size can be either 8 bits or 16 bits.
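
One possible way to derive the format from known values is sketched below. It assumes a signed word with one sign bit and a power-of-two scaling; the function names choose_q_format and quantize, and the sample weights, are hypothetical.

```python
import math

def choose_q_format(values, word_size=8):
    """Return (int_bits, frac_bits) for a signed fixed-point word (illustrative)."""
    max_abs = max(abs(v) for v in values)
    # Integer bits needed to cover the largest magnitude (may be negative
    # when all values are small). One bit of the word is kept for the sign.
    int_bits = math.ceil(math.log2(max_abs)) if max_abs > 0 else 0
    frac_bits = (word_size - 1) - int_bits
    return int_bits, frac_bits

def quantize(values, frac_bits, word_size=8):
    """Scale, round, and saturate values to the chosen format."""
    lo = -(1 << (word_size - 1))
    hi = (1 << (word_size - 1)) - 1
    # The clamp also covers a largest magnitude that is an exact power of two,
    # which would otherwise exceed the positive limit by one step.
    return [max(lo, min(hi, round(v * 2 ** frac_bits))) for v in values]

weights = [0.42, -1.73, 0.05, 0.9]  # hypothetical weight values
int_bits, frac_bits = choose_q_format(weights, word_size=8)
q_weights = quantize(weights, frac_bits, word_size=8)
```

For these sample weights, the largest magnitude is 1.73, so one integer bit is needed and six fractional bits remain in an 8-bit word. With a 16-bit word, the same values would get 14 fractional bits.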

For the activation values, such as the inputs and outputs of layers, quantization is more difficult and is described in the following section.
