Mitigating loss of precision

The basic idea of floating-point numbers is that the position of the stored fractional bits changes, or floats, with the magnitude of the number that you are trying to represent. The accuracy that you can store decreases as the magnitude of the stored number increases.

For many types of shader arithmetic, accuracy of small numbers is important. Examples include unorm color outputs, UV texture coordinates, and the components of unit-length vectors, all of which have magnitudes between 0.0 and 1.0.

Preserving the accuracy of numbers in this range is important. The following sections describe how to construct your arithmetic to reduce the errors introduced by precision limitations.

Avoid large magnitudes

Avoid creating numbers with a large magnitude that are later turned into a small number by further mathematical operations. For example, consider the expression:

glsl
float a = 100.00;
float b =   0.01;
float tmp = a + b;
float result = tmp - a;

When executed at FP32 precision, this expression gives the expected answer of approximately 0.01. But when executed at FP16 precision, it gives the answer 0.0.

This happens because of the large difference in magnitude between the two inputs. At FP16, values between 64 and 128 can only be stored in steps of 0.0625, so the intermediate value tmp lacks the accuracy to store the fractional part of 100.01 and contains only the value 100. The final result tmp - a is small enough to be stored accurately, but the fraction has already been lost, so the error introduced by the addition does not cancel out.
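The effect can be reproduced on the CPU. The following Python sketch is an illustration, not shader code: it assumes that every intermediate value is rounded to the target precision, and it emulates FP16 and FP32 storage with the "e" and "f" formats of the standard struct module. The helper names fp16, fp32, and add_then_subtract are hypothetical.

```python
import struct


def fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision (FP16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def fp32(x: float) -> float:
    """Round x to the nearest IEEE 754 single-precision (FP32) value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def add_then_subtract(a: float, b: float, rnd) -> tuple:
    """Evaluate tmp = a + b and result = tmp - a, rounding every value with rnd."""
    a, b = rnd(a), rnd(b)
    tmp = rnd(a + b)
    return tmp, rnd(tmp - a)


tmp32, result32 = add_then_subtract(100.0, 0.01, fp32)
tmp16, result16 = add_then_subtract(100.0, 0.01, fp16)

print("FP32:", tmp32, result32)  # result is close to 0.01
print("FP16:", tmp16, result16)  # tmp is 100.0, so the result is 0.0
```

A real GPU may evaluate a mediump expression at a higher precision than FP16, so treat this as a worst-case model rather than a prediction for a specific device.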

To avoid losing accuracy, construct equations so that intermediate values stay as close as possible to the final magnitude. For example, when passing a rotation from the application into sin() or cos(), the useful input range is [0, 2(PI)). Any higher value is just a repeated rotation of more than 360 degrees, and is visually indistinguishable from a smaller rotation.

So rather than passing in an ever-increasing value from the application, wrap the rotation on the CPU to the range [0, 2(PI)). This preserves as much precision as possible in the useful range.

For this example, if the rotation is not wrapped to a small range on the CPU, then the object eventually ceases rotating. The magnitude of the number becomes so large that adding in a small incremental rotation does not do anything. This is because the small increment is below the accuracy threshold of the stored number.

This happens quickly with FP16 numbers, but it also happens eventually with FP32 numbers.
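The stall can be simulated on the CPU. This Python sketch assumes that the angle is stored at FP16 after every update and advances by a hypothetical 0.01 radians per frame. The names fp16, STEP, and spin are illustrative only.

```python
import math
import struct


def fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision (FP16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


STEP = 0.01  # hypothetical rotation per frame, in radians


def spin(frames: int, wrap: bool) -> float:
    """Accumulate STEP for the given number of frames, storing the angle as FP16."""
    angle = 0.0
    for _ in range(frames):
        angle = fp16(angle + STEP)
        if wrap and angle >= 2.0 * math.pi:
            # Wrap back into [0, 2*pi) so that the magnitude stays small.
            angle = fp16(angle - 2.0 * math.pi)
    return angle


unwrapped_a = spin(10_000, wrap=False)
unwrapped_b = spin(10_001, wrap=False)
wrapped_a = spin(10_000, wrap=True)
wrapped_b = spin(10_001, wrap=True)

print("unwrapped:", unwrapped_a, unwrapped_b)  # identical: the angle has stalled
print("wrapped  :", wrapped_a, wrapped_b)      # still changing every frame
```

Under these assumptions the unwrapped angle stops at 32.0, where the gap between adjacent FP16 values (0.03125) is more than twice the step, so each addition rounds back to the same value.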

Exploit symmetrical functions

The sign bit is always stored in a floating-point number. For many types of periodic mathematical functions, you can use it to improve accuracy, because it reduces the magnitude of the numbers that need to be stored.

For example, a rotation of +270 degrees is the same as a rotation of -90 degrees. So, for inputs into sin() and cos(), it is preferable to use values in the range [-(PI), +(PI)) instead of [0, 2(PI)). The signed range halves the maximum magnitude, which preserves one bit of accuracy that the [0, 2(PI)) range would lose.
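The one-bit gain can be checked numerically. This Python sketch measures the gap between adjacent FP16 values near the top of each range, and shows one possible way to wrap an angle into the signed range on the CPU. The helper names fp16_spacing and wrap_signed are hypothetical.

```python
import math
import struct


def fp16_spacing(x: float) -> float:
    """Gap between x rounded to FP16 and the next larger FP16 value (x > 0)."""
    bits = struct.unpack("<H", struct.pack("<e", x))[0]
    lower = struct.unpack("<e", struct.pack("<H", bits))[0]
    upper = struct.unpack("<e", struct.pack("<H", bits + 1))[0]
    return upper - lower


def wrap_signed(angle: float) -> float:
    """Wrap an angle in radians to [-pi, pi), up to double-precision rounding."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi


near_two_pi = fp16_spacing(6.28)  # largest magnitudes in [0, 2*pi)
near_pi = fp16_spacing(3.14)      # largest magnitudes in [-pi, pi)
print(near_two_pi, near_pi, near_two_pi / near_pi)

print(wrap_signed(math.radians(270.0)))  # approximately -pi/2
```

The spacing near 2(PI) is twice the spacing near PI, which corresponds to the one extra bit of accuracy described above.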

Exploit built-in functions

Built-in functions in the shader libraries are often backed by hardware that preserves more precision than the equivalent function that is implemented in shader code arithmetic.

An example of this is the Fused Multiply Accumulate operation. This operation is very common in compute applications:

glsl
float r = (a * b) + c;

If this operation is implemented as separate multiply and add operations, the result of (a * b) is rounded to fit into a tmp float. The result of tmp + c is rounded again, so that two sets of rounding errors are introduced.

When using a hardware fused multiply accumulate operation, only the final result needs to be rounded to the output precision. This removes the intermediate rounding step, and the error that it introduces.
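The difference between one rounding and two can be shown with a small CPU-side model. This Python sketch assumes FP16 inputs and uses double precision as a stand-in for the unrounded intermediate, which is exact for the hand-picked values below. It emulates the arithmetic only and does not call any GPU built-in; in GLSL versions that provide it, the fused form is requested with the fma() built-in.

```python
import struct


def fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision (FP16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# Hypothetical inputs, all exactly representable in FP16.
a = 1.0 + 1.0 / 1024.0
b = 1.0 + 3.0 / 1024.0
c = -1.0

exact = a * b + c  # exact in double precision for these inputs

separate = fp16(fp16(a * b) + c)  # round the product, then round the sum
fused = fp16(a * b + c)           # round once, at the end

print("exact   :", exact)
print("separate:", separate, "error", abs(separate - exact))
print("fused   :", fused, "error", abs(fused - exact))
```

For these inputs the separately rounded result is three times further from the exact answer than the fused result. Other inputs can give identical results for both forms.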

Minimize memory size

Accessing Double Data Rate (DDR) memory consumes a lot of power, so when reviewing shaders and narrowing precision, remember to also narrow any associated vertex attributes stored in memory.

Support for GL_HALF_FLOAT attributes is a core feature in OpenGL ES 3.0. If you are using OpenGL ES 2.0, remember that all Mali GPUs support the [OES_vertex_half_float][VHF] extension.

OpenGL ES OES Vertex Half-Float Information
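As a rough illustration of the saving, this Python sketch packs the same hypothetical UV coordinates as FP32 and as FP16 with the standard struct module, then measures the error introduced by the narrower format. It covers only the CPU-side packing; uploading the buffer and declaring the attribute as half-float are not shown.

```python
import struct

# Hypothetical UV coordinates, all in the range [0.0, 1.0].
uvs = [0.0, 0.0, 0.25, 0.75, 0.333, 0.667, 1.0, 1.0]

as_fp32 = struct.pack(f"<{len(uvs)}f", *uvs)  # 4 bytes per component
as_fp16 = struct.pack(f"<{len(uvs)}e", *uvs)  # 2 bytes per component

# Decode the FP16 buffer again to see how much accuracy was given up.
decoded = struct.unpack(f"<{len(uvs)}e", as_fp16)
max_error = max(abs(d - u) for d, u in zip(decoded, uvs))

print(len(as_fp32), len(as_fp16))  # 32 16
print(max_error)
```

The FP16 buffer is half the size, and for values between 0.0 and 1.0 the rounding error stays at or below 2^-12, which is about 0.00024.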

In general, lower precision is better. One caveat is that type conversions are not always free. Therefore, try to minimize the number of casts needed in shader code by loading data at a suitable precision level.
