Fragment Shader Core

The Utgard fragment shader core consists of a single programmable pipeline wrapped by fixed-function logic that generates new fragment threads and retires completed ones. The fragment shader core data flow for an Utgard GPU is shown in the image below:

[Figure: Utgard fragment shader core data flow]

The programmable fragment shader core is a multi-threaded processing engine that can run up to 128 threads simultaneously, where each running thread equals a single fragment shader program instance.

The large number of threads exists to hide the costs of cache misses and memory fetch latency.

Threads that miss in the L1 cache can fetch data from the L2 cache without any performance penalty. Threads that must fetch data from main memory instead incur only a single cycle of overhead, although this depends on memory latency.

The Utgard programmable pipeline executes all shader programs and contains three types of processing stages, usable by a single instruction issue cycle.

These processing stages are load stages, texture stages, and arithmetic stages.

Load stages

Load stages are responsible for all shader memory accesses that are not related to texture samplers. These accesses include uniform access, varying access and interpolation, and thread stack access.

On a per-clock basis, the load stage can load 64 bits of uniform data, interpolate 64 bits of varying data, or load or store 64 bits of stack data.
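As a rough illustration of the per-clock limit above, the sketch below estimates how many clocks the load stage needs to deliver one value. The helper name `load_cycles` is hypothetical, and the default of 16 bits per component is an assumption (fp16 data), not something the load stage description states.

```python
import math

# From the text: the load stage moves 64 bits of uniform, varying,
# or stack data per clock.
LOAD_BITS_PER_CLOCK = 64


def load_cycles(components: int, bits_per_component: int = 16) -> int:
    """Estimate the clocks needed to load one value (hypothetical helper).

    Assumption: each component occupies `bits_per_component` bits and
    accesses are packed perfectly into 64-bit transfers.
    """
    total_bits = components * bits_per_component
    return math.ceil(total_bits / LOAD_BITS_PER_CLOCK)


if __name__ == "__main__":
    print("vec4 fp16:", load_cycles(4))   # 64 bits
    print("mat4 fp16:", load_cycles(16))  # 256 bits
```

Under these assumptions a vec4 of fp16 values fits in a single clock, while a full 4x4 fp16 matrix needs four.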

Texture stages

The texture stages are responsible for all texture memory access. For most texture formats, one bilinear-filtered texel can be returned per clock. However, performance varies for some texture formats and filtering modes.

Operation         Performance scaling
Trilinear filter  x2
Depth format      x2
YUV format        x number of planes

YUV surfaces imported from camera and video sources can be read directly, at a cost of one cycle per plane, without prior conversion to RGB. However, semi-planar Y + UV sources are preferred to fully planar Y + U + V sources because they have fewer planes to read.
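The scaling factors in the table can be combined into a simple cost estimate. The sketch below is only a model: the function `texture_cycles` is hypothetical, and it assumes that the factors multiply when more than one applies to the same sample, which the table does not state.

```python
def texture_cycles(trilinear: bool = False,
                   depth_format: bool = False,
                   yuv_planes: int = 0) -> int:
    """Estimate clocks per texture sample (hypothetical model).

    Baseline from the text: one bilinear-filtered texel per clock.
    Assumption: scaling factors multiply when combined.
    """
    cycles = 1
    if trilinear:
        cycles *= 2           # trilinear filter: x2
    if depth_format:
        cycles *= 2           # depth format: x2
    if yuv_planes > 0:
        cycles *= yuv_planes  # YUV format: one cycle per plane
    return cycles


if __name__ == "__main__":
    print("bilinear RGB:       ", texture_cycles())
    print("semi-planar Y + UV: ", texture_cycles(yuv_planes=2))
    print("planar Y + U + V:   ", texture_cycles(yuv_planes=3))
```

In this model a semi-planar source costs two cycles per sample and a fully planar source costs three, which is why the semi-planar layout is the cheaper import format.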

Arithmetic stages

The arithmetic stages are a Single Instruction Multiple Data (SIMD) vector processing engine, where arithmetic units operate on vec4 fp16 values, and each pipeline can process a total of 14 FP16 FLOPS.

Caution! Mali-400 supports only `mediump` precision, using fp16 data types, for fragment shading. `highp` fp32 data types are not supported.
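To get a feel for what the fp16 limit means for shader values, the sketch below rounds numbers to half precision on the CPU. It uses the IEEE 754 binary16 format from Python's `struct` module as a stand-in. That the GPU's fp16 arithmetic rounds in exactly the same way is an assumption, and `to_fp16` is a hypothetical helper name.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value.

    Assumption: this approximates the GPU's fp16 storage; the
    hardware's exact rounding behavior may differ.
    """
    return struct.unpack("<e", struct.pack("<e", x))[0]


if __name__ == "__main__":
    for value in (1.0, 0.1, 2048.0, 2049.0):
        print(value, "->", to_fp16(value))
```

With a 10-bit mantissa, values above 2048 are spaced two apart, so 2049.0 cannot be stored exactly. Values that need finer precision at that magnitude lose accuracy in a `mediump` variable.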

Early and late ZS testing

In the OpenGL ES specification, fragment operations occur at the end of the pipeline, after fragment shading has completed. Fragment operations include depth (Z) and stencil (S) testing. Although this model is simple to understand, it also implies that a fragment must be shaded even if the result is then thrown away by ZS testing.

Coloring fragments and then discarding them costs significant performance and wastes energy. So, where possible, early ZS testing is performed before fragment shading occurs. Late ZS testing occurs after fragment shading only when unavoidable. For example, a dependency on a fragment whose shader calls `discard` leaves the depth state indeterminate until that fragment retires.
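The saving from early ZS testing can be illustrated with a small model of a single pixel. This is a sketch under stated assumptions, not a description of the hardware: the function `count_shaded` is hypothetical, the depth test is assumed to be "less than", the depth buffer is assumed to be cleared to 1.0, and no shader calls `discard`.

```python
def count_shaded(fragment_depths, early_zs=True):
    """Count fragment shader runs for fragments hitting one pixel.

    Hypothetical model: depth test is LESS, the depth buffer starts
    at 1.0, and fragments arrive in list order.
    """
    depth_buffer = 1.0
    shaded = 0
    for depth in fragment_depths:
        passes = depth < depth_buffer
        if early_zs and not passes:
            continue          # killed before the shader runs
        shaded += 1           # the fragment shader executes
        if passes:
            depth_buffer = depth  # late ZS would discard failures here
    return shaded


if __name__ == "__main__":
    front_to_back = [0.2, 0.5, 0.8]
    print("early ZS:", count_shaded(front_to_back, early_zs=True))
    print("late ZS: ", count_shaded(front_to_back, early_zs=False))
```

In this model, drawing opaque geometry front to back lets early ZS testing skip every occluded fragment, whereas late ZS testing shades all of them before rejecting the hidden ones.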
