Midgard Tripipe Execution Core
The GPU’s shader core consists of a programmable Tripipe execution core with several fixed-function units wrapped around it.
These fixed-function units either perform the setup for a shader operation or handle the post-shader activities.
The image below shows the Tripipe execution core and the fixed-function units that surround it:
The programmable shader core is a multi-threaded processing engine that can run hundreds of threads simultaneously, where each running thread equals a single shader program instance.
This large number of threads exists to hide the costs of cache misses and memory fetch latency. Performance drops due to cache misses can be avoided if some of the threads are ready to run at each cycle. This is the case even if many of the other threads are stalled and blocked fetching data.
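The latency-hiding argument above can be illustrated with a deliberately idealized model. The sketch below assumes every thread alternates a fixed number of work cycles with a fixed number of stall cycles, and that the core can switch between ready threads at no cost. The function names and the cycle counts used in the example are hypothetical and are not Midgard figures.

```python
def pipeline_utilization(threads: int, work_cycles: int, stall_cycles: int) -> float:
    """Idealized fraction of cycles in which some thread is ready to issue.

    Assumption: each thread repeats `work_cycles` of execution followed by
    `stall_cycles` of waiting on memory, and thread switching is free.
    """
    if threads < 1 or work_cycles < 1 or stall_cycles < 0:
        raise ValueError("threads and work_cycles must be >= 1, stall_cycles >= 0")
    # One thread keeps the pipeline busy for work / (work + stall) of the time.
    # Interleaving more threads scales that linearly until the pipeline is full.
    return min(1.0, threads * work_cycles / (work_cycles + stall_cycles))


def threads_to_saturate(work_cycles: int, stall_cycles: int) -> int:
    """Smallest thread count that keeps the idealized pipeline busy every cycle."""
    total = work_cycles + stall_cycles
    return (total + work_cycles - 1) // work_cycles  # integer ceiling division


if __name__ == "__main__":
    # Hypothetical numbers: 4 cycles of work, then a 200-cycle memory stall.
    for n in (1, 16, 51, 256):
        print(n, round(pipeline_utilization(n, 4, 200), 3))
    print("threads needed:", threads_to_saturate(4, 200))
```

In this model a single thread leaves the pipeline idle most of the time, whereas a few dozen threads are enough to cover the stall completely. Real hardware behaves less regularly, so treat the numbers as illustrative only.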
The Tripipe is the programmable part of the core responsible for the execution of shader programs and it contains the following classes of parallel execution pipeline:
- The arithmetic pipeline, or A-pipe, handles all arithmetic processing.
- The load/store pipeline, or LS-pipe, handles all general purpose memory access, varying interpolation, and read/write image access.
- The texture pipe, or T-pipe, handles all read-only texture accesses and texture filtering.
While all Midgard shader cores have one load/store pipe and one texture pipe, the number of arithmetic pipelines can vary depending on which GPU you are using:
- The Mali-T720 and T820 GPUs have one each.
- The Mali-T880 has three.
- All remaining Midgard GPUs have two.
The Arithmetic pipeline
The Arithmetic pipeline (A-pipe) is a Single Instruction Multiple Data (SIMD) vector processing engine, with arithmetic units that operate on 128-bit quad-word registers. The registers can be accessed as several different data types, for example, as 4xFP32/I32, 8xFP16/I16, or 16xI8. Therefore, a single arithmetic vector operation can process eight mediump values in a single cycle.
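The lane counts quoted above follow directly from dividing the 128-bit register width by the element width. The short sketch below shows that arithmetic. The helper name `simd_lanes` is hypothetical, and treating mediump as a 16-bit type is an assumption that matches the eight-value figure in the text.

```python
REGISTER_BITS = 128  # width of one A-pipe quad-word register, as stated in the text


def simd_lanes(element_bits: int) -> int:
    """Number of elements of the given width that fit in one 128-bit register."""
    if element_bits <= 0 or REGISTER_BITS % element_bits != 0:
        raise ValueError("element width must be a positive divisor of 128")
    return REGISTER_BITS // element_bits


if __name__ == "__main__":
    for name, bits in (("FP32/I32", 32), ("FP16/I16", 16), ("I8", 8)):
        print(f"{simd_lanes(bits)}x{name}")
```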
Although we cannot disclose the internal architecture of the A-pipe, the publicly available performance data for each Midgard GPU gives some idea of the number of mathematical units available. For example, the Mali-T760 with 16 cores is rated at 326 FP32 GFLOPS at 600MHz.
Dividing this rating by the clock frequency and the core count gives approximately 34 FP32 FLOPS per clock cycle, per shader core. Because the Mali-T760 has two arithmetic pipelines, each pipeline delivers approximately 17 FP32 FLOPS per clock cycle. The operational performance increases for smaller data types and decreases for larger ones.
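The per-core and per-pipeline figures can be reproduced from the published rating with a few lines of arithmetic. The sketch below uses only the numbers quoted in the text; the function name is hypothetical.

```python
def flops_per_clock_per_core(gflops: float, clock_mhz: float, cores: int) -> float:
    """FLOPS issued per clock cycle by one shader core, from a whole-GPU rating."""
    total_flops_per_second = gflops * 1e9
    clocks_per_second = clock_mhz * 1e6
    return total_flops_per_second / (clocks_per_second * cores)


if __name__ == "__main__":
    # Mali-T760 figures from the text: 326 FP32 GFLOPS, 600MHz, 16 cores.
    per_core = flops_per_clock_per_core(326, 600, 16)
    per_pipe = per_core / 2  # the Mali-T760 has two arithmetic pipelines
    print(f"per core: {per_core:.2f}, per pipeline: {per_pipe:.2f}")
```

The exact quotient is slightly below 34, which suggests that the published rating is itself a rounded figure.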
The Texture pipeline
The Texture pipeline (T-pipe) is responsible for all texture-related memory accesses. It can return one bilinear-filtered texel per clock for most texture formats. However, performance can vary for some texture formats and filtering modes.
The following table shows how the cost scales. For example, trilinear filtering of a 3D texture requires four cycles, because trilinear filtering doubles the cost and the 3D format doubles it again:
|Operation|Performance scaling|
|---|---|
|Trilinear filter|x 2|
|3D format|x 2|
|32-bit channel format|x channel count|
|YUV format|x planes|
YUV surfaces imported from camera and video sources can be read directly, at a cost of one cycle per plane, without needing prior conversion to RGB. For this reason, semi-planar Y + UV sources are preferred to fully planar Y + U + V sources.
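The texture costs described above can be combined into a rough per-texel cycle estimate. The sketch below assumes that the scaling factors multiply together, which reproduces the four-cycle trilinear 3D example; the function name and its parameters are hypothetical, and the result is an estimate rather than a hardware guarantee.

```python
def texel_cycles(trilinear: bool = False, is_3d: bool = False,
                 channels_32bit: int = 0, yuv_planes: int = 0) -> int:
    """Estimated T-pipe cycles per filtered texel.

    Assumption: each scaling factor multiplies the one-cycle bilinear baseline.
    `channels_32bit` is the channel count of a 32-bit-per-channel format
    (0 if the format is not 32-bit per channel); `yuv_planes` is the plane
    count of a YUV surface (0 if the surface is not YUV).
    """
    cycles = 1  # baseline: one bilinear-filtered texel per clock
    if trilinear:
        cycles *= 2
    if is_3d:
        cycles *= 2
    if channels_32bit:
        cycles *= channels_32bit
    if yuv_planes:
        cycles *= yuv_planes
    return cycles


if __name__ == "__main__":
    print("trilinear 3D:", texel_cycles(trilinear=True, is_3d=True))
    print("semi-planar Y + UV:", texel_cycles(yuv_planes=2))
    print("fully planar Y + U + V:", texel_cycles(yuv_planes=3))
```

Under this model a semi-planar source costs two cycles per texel and a fully planar source costs three, which is why the semi-planar layout is the better choice.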
Note: The T-pipe has a 16KB L1 data cache.
The Load/Store pipeline
The Load/Store pipeline (LS-pipe) is responsible for all shader memory accesses that are not related to texture samplers. This responsibility includes generic pointer-based memory access, buffer access, varying interpolation, atomics, and imageStore() accesses for mutable image data.
Every instruction is a single-cycle memory access operation. However, the hardware also supports wide vector operations and can load an entire highp vec4 varying in a single cycle.
Note: The LS-pipe has a 16KB L1 data cache.
Early and late ZS testing
In the OpenGL ES specification, ‘fragment operations’, which include depth (Z) and stencil (S) testing, occur at the end of the pipeline after fragment shading has completed.
Although the specification is simple to understand, it also implies that fragment shading must occur even if the result is then thrown away by ZS testing.
Coloring fragments and then discarding them costs a significant amount of performance and wastes energy. So, where possible, early ZS testing is performed before fragment shading occurs. Late ZS testing occurs after fragment shading only when unavoidable. For example, a fragment can depend on an earlier fragment whose shader calls discard, and that earlier fragment has an indeterminate depth state until it exits the Tripipe.
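The cost difference between the two testing points can be sketched with a toy model of a single pixel. The code below assumes a "less than" depth test, in which a smaller depth value is nearer to the viewer, and ignores stencil testing and discard entirely. It is an illustration of the principle and not a description of how the hardware schedules work; the function name and the depth values are hypothetical.

```python
def shaded_fragment_count(depths: list[float], early_zs: bool) -> int:
    """Count fragment shader invocations for one pixel.

    `depths` lists the fragments covering the pixel in submission order.
    Assumption: depth test is LESS, and the depth buffer starts at infinity.
    """
    nearest = float("inf")
    shaded = 0
    for depth in depths:
        passes = depth < nearest
        if early_zs:
            if not passes:
                continue  # rejected before any shading work is done
            shaded += 1
            nearest = depth
        else:
            shaded += 1  # late ZS: the shader runs first, then the test
            if passes:
                nearest = depth
    return shaded


if __name__ == "__main__":
    front_to_back = [0.2, 0.5, 0.7, 0.9]
    print("late ZS: ", shaded_fragment_count(front_to_back, early_zs=False))
    print("early ZS:", shaded_fragment_count(front_to_back, early_zs=True))
```

In this model, late ZS testing shades every fragment, whereas early ZS testing skips any fragment that arrives behind geometry that is already in the depth buffer.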
Midgard GPUs also add a hidden surface removal capability, called Forward Pixel Kill (FPK). FPK can stop rasterized fragments that have already passed ZS testing from turning into real rendering work. However, the Mali stack must first determine that the fragments are occluded and do not contribute to the output scene in a useful way.