Optimize your shader program

Now that you have identified the critical path, speed up the tone mapping to improve the performance of the shader.

  1. The first change you can make is to reduce precision. Currently the tone mapping uses a highp (fp32) matrix operation, which has more precision than we need to generate an 8-bit per channel color output. Change the float and sampler precision to mediump (fp16) by modifying these two lines at the top of the shader:
    precision mediump float; 
    precision mediump sampler2D;
    Just these two simple changes significantly reduce the cost of the longest path, as Mali GPUs can process twice as many fp16 operations per clock as fp32 operations.
                           A    LS   V    T    Bound
    Longest Path Cycles:   2.7  0.0  0.2  2.5  A
  2. After changing the precision, arithmetic is still the longest path. Move the tone mapping out of the accumulation loop, and apply it to the final color instead of the individual samples. This gives the final shader structure:
    // For each gaussian sample 
    for (int i = 0; i < WINDOW_SIZE; i++) {
        vec2 offsetTexCoord = texCoord + vec2(gaussOffsets[i], 0.0);
        vec4 data = texture(texUnit, offsetTexCoord);
        fragColor += data * gaussWeights[i];
    }

    // Tone map the final color
    if (toneMap) {
        fragColor *= colorModulation;
    }
    This change reduces the arithmetic cost of the longest path to just a single shader core cycle, even if tone mapping is used. The slowest path is now texturing, which needs 2.5 cycles per fragment to load the 5 samples needed. You cannot make this any faster, because this is the architectural performance of this particular shader core.
                                A    LS   V    T    Bound
    Total Instruction Cycles:   1.0  0.0  0.2  2.5  T
    Shortest Path Cycles:       0.5  0.0  0.2  2.5  T
    Longest Path Cycles:        1.0  0.0  0.2  2.5  T
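
For reference, the two steps above could be combined into a complete fragment shader along the following lines. This is only a sketch: the `#version` line, the `WINDOW_SIZE` value of 5 (chosen to match the 5 samples mentioned above), the uniform and input declarations, the `mat4` type of `colorModulation`, and the zero-initialization of `fragColor` are all assumptions, because the original listing shows only the precision lines and the body of the shader.

```glsl
#version 300 es
// Assumed version; the original listing does not show it.

// Step 1: reduced default precision for floats and samplers.
precision mediump float;
precision mediump sampler2D;

// Assumed window size, matching the 5 texture samples per fragment.
#define WINDOW_SIZE 5

// Assumed declarations; names are taken from the shader body above.
uniform sampler2D texUnit;
uniform bool toneMap;
uniform mat4 colorModulation;   // assumed to be a 4x4 matrix
uniform float gaussOffsets[WINDOW_SIZE];
uniform float gaussWeights[WINDOW_SIZE];

in vec2 texCoord;
out vec4 fragColor;

void main() {
    // Start from zero so the loop can accumulate weighted samples.
    fragColor = vec4(0.0);

    // For each gaussian sample
    for (int i = 0; i < WINDOW_SIZE; i++) {
        vec2 offsetTexCoord = texCoord + vec2(gaussOffsets[i], 0.0);
        vec4 data = texture(texUnit, offsetTexCoord);
        fragColor += data * gaussWeights[i];
    }

    // Step 2: tone map the final color once, outside the loop.
    if (toneMap) {
        fragColor *= colorModulation;
    }
}
```

Applying the tone mapping once to the accumulated color gives the same result as applying it per sample only because the matrix multiply is linear. A non-linear tone mapping operator could not be hoisted out of the loop in this way without changing the output.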

Although the last optimization reduced the arithmetic cost from 2.7 cycles to 1.0 cycles, the shader throughput only improved from 2.7 cycles to 2.5 cycles per fragment, because the bottleneck changed from arithmetic (A) to texturing (T). However, reducing the load on any pipeline improves energy efficiency and prolongs battery life, so these types of optimizations are still worth making, even if they do not improve the headline performance.
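
The arithmetic behind these figures can be written out as a short worked check. It assumes that the pipelines run in parallel, so that the cost per fragment is set by the slowest pipeline. The symbols C_1 and C_2 are introduced here for illustration only and stand for the longest-path cost after step 1 and after step 2.

```latex
\begin{align*}
C &= \max(A,\ LS,\ V,\ T) \\
C_1 &= \max(2.7,\ 0.0,\ 0.2,\ 2.5) = 2.7 \quad \text{(bound: A)} \\
C_2 &= \max(1.0,\ 0.0,\ 0.2,\ 2.5) = 2.5 \quad \text{(bound: T)} \\
\frac{C_1}{C_2} &= \frac{2.7}{2.5} = 1.08
\end{align*}
```

Under that assumption, the arithmetic cost falls by roughly 63% (from 2.7 to 1.0 cycles), but the throughput gain is only about 8%, because texturing at 2.5 cycles now sets the limit.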
