The Midgard Shader Core
Second generation Mali GPU architecture
March 2018 by Peter Harris
When optimizing applications using a GPU it is useful to have at least a high-level mental model of how the underlying hardware works, as well as an idea of its expected performance and data rates for the different types of operation it can perform. Understanding the block architecture is particularly important when optimizing using the Mali performance counters, because interpreting what the counters are telling you requires an understanding of the blocks they are tied to.
This article presents a stereotypical Mali "Midgard" GPU programmable core – the second generation of Mali GPUs and the first to support OpenGL ES 3.0 and OpenCL – including the Mali-T600, Mali-T700, and Mali-T800 series products.
This article assumes you have read the earlier introductory article on tile-based rendering, as it will build on concepts introduced there.
The "Midgard" family of Mali GPUs (the Mali-T600, Mali-T700, and Mali-T800 series) use a unified shader core architecture, meaning that only a single type of hardware shader processor exists in the design. This single processor type can execute all types of shader code, including vertex shaders, fragment shaders, compute kernels, etc.
The exact number of shader cores present in a particular silicon chip varies; Arm licenses configurable designs to our silicon partners, who can then choose how to configure the GPU in their specific chipset based on their performance needs and silicon area constraints. The Mali-T880 GPU design, the latest of the Midgard GPU products, can scale from a single core for low-end devices all the way up to 16 cores for the highest performance designs.
The shader cores in the system share a level 2 cache to improve performance, and to reduce memory bandwidth caused by repeated data fetches. The size of the L2 is another configurable component which can be tuned by our silicon partners, but is typically in the range of 32-64KB per shader core in the GPU depending on how much silicon area is available.
The number and bus width of the memory ports the L2 cache has to external memory is configurable, again allowing our partners to tune the implementation to meet their performance, power, and area needs. The architecture aims to be able to write one 32-bit pixel per core per clock, so it would be reasonable to expect an 8-core design to have a total of 256-bits of memory bandwidth (for both read and write) per clock cycle, although whether this much bandwidth is available to the GPU will depend on the chipset integration.
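The bus-width expectation above can be checked with a little arithmetic. This is an illustrative sketch of the sizing rule described in the article (one 32-bit pixel per core per clock), not a statement about any specific chipset configuration:

```python
# Sketch: total external bus width (bits per clock) needed to sustain
# one 32-bit pixel write per shader core per clock, as described above.
# The function name is illustrative.

PIXEL_BITS = 32  # one 32-bit pixel written per core per clock

def total_bus_bits(num_cores: int) -> int:
    """Total bus width in bits per clock to match peak core pixel output."""
    return num_cores * PIXEL_BITS

print(total_bus_bits(8))  # an 8-core design -> 256 bits per clock
```

Whether this theoretical bandwidth is actually achievable still depends on the downstream memory system, as the article notes.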
Once the application has finished defining a render pass, the Mali drivers submit its workload as a pair of independently schedulable pieces: one for all of the geometry and compute workloads in the pass, and one for the fragment workload. Because Mali is a tile-based renderer, all geometry processing for a render pass must be complete before fragment shading can start; a finalized "tile list" is needed to provide fragment processing with the primitive coverage information it requires.
The hardware supports two parallel issue queues which the driver can use, one for each workload type. Workloads from both queues can be processed by the GPU at the same time, so geometry processing and fragment processing for different render passes can be running in parallel.
The workload for a single render pass is nearly always large and highly parallelizable, so the GPU hardware will break it into smaller pieces and distribute it across all of the shader cores available in the GPU.
The Midgard shader core
The Midgard shader core is structured as a number of fixed-function hardware blocks wrapped around a programmable "Tripipe" execution core. The fixed-function units either perform the setup for a shader operation – such as rasterizing triangles or performing depth testing – or handle the post-shader activities – such as blending, or writing back a whole tile's worth of data at the end of rendering.
Unlike a traditional CPU architecture, where you will typically only have a single thread of execution at a time running on a single core, the programmable core is a massively multi-threaded processing engine capable of running hundreds of threads concurrently. Each running thread corresponds to a single shader program instance.
This large number of threads exists to hide the costs of cache misses and memory fetch latency; some threads can be stalled and blocked on data fetch, but as long as some of the threads are ready to run each cycle we do not lose significant levels of performance due to cache misses.
The Tripipe is the programmable part of the core responsible for the execution of shader programs. It contains three classes of parallel execution pipeline:
- Arithmetic pipeline (A-pipe) for all arithmetic processing.
- Load/store pipeline (LS-pipe) for general-purpose memory access, varying interpolation, and read/write image access.
- Texture pipeline (T-pipe) for read-only texture access and texture filtering.
All Midgard shader cores have one load/store pipe and one texture pipe, but the number of arithmetic pipelines varies depending on which GPU you are using:

- Mali-T720 and Mali-T820 have one
- Mali-T880 has three
- All other Midgard GPUs have two
The A-pipe is a SIMD (single instruction multiple data) vector processing engine, with arithmetic units which operate on 128-bit quad-word registers. The registers can be flexibly accessed as a number of different data types; for example as 4xFP32/I32, 8xFP16/I16, or 16xI8. It is therefore possible for a single arithmetic vector operation to process 8 "mediump" values in a single cycle.
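The lane counts quoted for each data type follow directly from the 128-bit register width. A quick sketch of that packing arithmetic:

```python
# How many SIMD lanes a 128-bit Midgard vector register provides for
# each data type mentioned above.

REGISTER_BITS = 128

def lanes(type_bits: int) -> int:
    """Number of vector lanes for a given per-element width in bits."""
    return REGISTER_BITS // type_bits

print(lanes(32))  # 4 x FP32/I32
print(lanes(16))  # 8 x FP16/I16 - one "mediump" vector op per cycle
print(lanes(8))   # 16 x I8
```

This is why "mediump" (FP16) arithmetic can double throughput relative to "highp" (FP32) on this architecture.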
While I can't disclose the internal architecture of the A-pipe, our public performance data for each GPU can be used to give some idea of the number of maths units available. For example, the Mali-T760 with 16 cores is rated at 326 FP32 GFLOPS at 600MHz. This gives a total of 34 FP32 FLOPS per clock cycle per shader core; and as it has two pipelines that's 17 FP32 FLOPS per pipeline. The available performance in terms of operations will increase for smaller data types and decrease for larger ones.
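The per-pipeline figure can be re-derived from the public numbers quoted above:

```python
# Re-deriving the per-pipeline FLOPS figure from the public Mali-T760
# performance data quoted in the article.

gflops = 326            # FP32 GFLOPS for a 16-core Mali-T760
freq_hz = 600e6         # rated clock: 600 MHz
cores = 16
a_pipes_per_core = 2    # Mali-T760 has two arithmetic pipelines per core

flops_per_clock = gflops * 1e9 / freq_hz            # ~543 FLOPS/clock, whole GPU
flops_per_core = flops_per_clock / cores            # ~34 FLOPS/clock per core
flops_per_pipe = flops_per_core / a_pipes_per_core  # ~17 FLOPS/clock per pipe
print(round(flops_per_pipe))  # 17
```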
The T-pipe is responsible for all memory access to do with textures. It can return one bilinear filtered texel per clock for most texture formats, but performance can vary for some texture formats and filtering modes.
| Format or filtering mode | Performance cost   |
|--------------------------|--------------------|
| Trilinear filter         | x2                 |
| 3D format                | x2                 |
| Depth format             | x2                 |
| 32-bit channel format    | x channel count    |
| YUV format               | x number of planes |
Using this table we can see that trilinear filtering a 3D texture would require 4 cycles (x2 for using trilinear filtering, and x2 again for being a 3D format).
Note that imported YUV surfaces from camera and video sources can be read directly without needing prior conversion to RGB at a cost of one cycle per plane. Importing semi-planar Y + UV is preferred to importing fully planar Y + U + V sources, if at all possible.
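The multipliers in the table compound, as the trilinear 3D example shows. A small estimator sketch (the function and its interface are illustrative, not part of any Mali API):

```python
# Cycle-cost estimator for the T-pipe, applying the multipliers from
# the table above on top of the one-bilinear-sample-per-clock baseline.

def texture_cycles(trilinear=False, is_3d=False, is_depth=False,
                   wide_channels=0, yuv_planes=0):
    """Estimated cycles per sample for a single texture access."""
    cycles = 1                    # baseline: one bilinear sample per clock
    if trilinear:
        cycles *= 2               # trilinear filter: x2
    if is_3d:
        cycles *= 2               # 3D format: x2
    if is_depth:
        cycles *= 2               # depth format: x2
    if wide_channels:
        cycles *= wide_channels   # 32-bit channel format: x channel count
    if yuv_planes:
        cycles *= yuv_planes      # YUV format: x number of planes
    return cycles

print(texture_cycles(trilinear=True, is_3d=True))  # article example: 4 cycles
print(texture_cycles(yuv_planes=2))                # semi-planar Y+UV: 2 cycles
print(texture_cycles(yuv_planes=3))                # fully planar Y+U+V: 3 cycles
```

The last two lines show why semi-planar YUV import is preferred over fully planar sources.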
The texture pipeline has a 16KB L1 data cache.
The LS-pipe is responsible for all shader memory accesses which are not related to texture samplers. This includes generic pointer-based memory access, buffer access, varying interpolation, atomics, and imageStore() accesses for mutable image data.
In general, every instruction is a single-cycle memory access operation, although, like the arithmetic pipeline, the hardware supports wide vector operations and so can load an entire "highp" vec4 varying in a single cycle.
The load/store pipeline has a 16KB L1 data cache.
Early and late ZS testing
In the OpenGL ES specification "fragment operations" – which include depth (Z) and stencil (S) testing – happen at the end of the pipeline, after fragment shading has completed. This makes the specification very simple to understand, but implies that you have to shade a fragment first, only to throw the result away if it turns out to be killed by ZS testing.
Coloring fragments only to discard them would waste a huge amount of performance and energy, so where possible we perform "early ZS" testing (i.e. before fragment shading), only falling back to "late ZS" testing (i.e. after fragment shading) where it is unavoidable (e.g. a dependency on a fragment shader which may call "discard", and as such has indeterminate depth state until it exits the Tripipe).
In addition to the traditional ZS culling schemes, Midgard GPUs also add a hidden surface removal capability – called Forward Pixel Kill – which can stop fragments which have already been rasterized and passed ZS testing from turning into real rendering work if the Mali stack can determine that they are occluded and do not contribute to the output scene in a useful way.
Based on this simple block model it is possible to outline some of the fundamental performance properties which you can expect from a Mali Midgard GPU. The GPU can:
- issue one new thread per shader core per clock, AND
- retire one fragment per shader core per clock, AND
- write one pixel per shader core per clock, AND
- issue one instruction per pipe per clock
... and from a shader core performance point of view:
- Each A-pipe can process:
- 17 FP32 operations per clock
- The LS-pipe can process:
- 128-bits of vector load/store per clock, OR
- 128-bits of varying interpolation per clock, OR
- one imageStore() per clock, OR
- one atomic access per clock
- The T-pipe can process:
- One bilinear filtered sample per clock, OR
- One bilinear filtered depth sample every two clocks, OR
- One trilinear filtered sample every two clocks, OR
- One plane of YUV sampled data every clock
If we scale this to match an example reference chipset containing a Mali-T760 MP8 running at 600MHz we can calculate the theoretical peak performance as:
- 8 pixels per clock = 4.8 GPix/s
  - That's 2314 complete 1080p layers per second!
- 8 bilinear texels per clock = 4.8 GTex/s
  - That's 38 bilinear filtered texture samples per pixel for 1080p @ 60 FPS!
- 17 FP32 FLOPS per pipe per core = 163 FP32 GFLOPS
  - That's 1311 FLOPS per pixel for 1080p @ 60 FPS!
- 256 bits of memory access per clock = 19.2 GB/s of read and write bandwidth
  - That's 154 bytes per pixel for 1080p @ 60 FPS!
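All of these headline numbers fall out of the per-core rates multiplied up by core count and clock frequency. A sketch reproducing them for the Mali-T760 MP8 @ 600 MHz example:

```python
# Reproducing the Mali-T760 MP8 @ 600 MHz peak-throughput figures
# quoted above from the per-core, per-clock rates.

cores = 8
freq = 600e6                          # 600 MHz clock
pixels_1080p = 1920 * 1080
pix_per_sec_60fps = pixels_1080p * 60

pix_rate = cores * freq               # 4.8e9 pixels/s (one per core per clock)
layers = pix_rate / pixels_1080p      # complete 1080p layers per second
flops = 17 * 2 * cores * freq         # 17 FLOPS x 2 A-pipes x 8 cores -> ~163 GFLOPS
flops_per_pixel = flops / pix_per_sec_60fps
bandwidth = (256 // 8) * freq         # 256 bits -> 32 bytes/clock -> 19.2 GB/s
bytes_per_pixel = bandwidth / pix_per_sec_60fps

print(int(layers))           # 2314 layers per second
print(int(flops_per_pixel))  # 1311 FLOPS per pixel @ 1080p60
print(int(bytes_per_pixel))  # 154 bytes per pixel @ 1080p60
```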
The performance of a Mali GPU in any specific chipset is very dependent on configuration choices made by the silicon implementation and the final device form factor the chipset is used in.
Some characteristics will be visible in the GPU logical configuration the silicon partner has built, including the number of shader cores and the size of the GPU L2 cache.
Some characteristics will depend on the memory system logical configuration, such as the memory latency, bandwidth, DDR memory type, and how memory is shared between multiple users.
Some characteristics will depend on analogue silicon implementation choices, such as which silicon process was used, the target top frequency, and the DVFS voltage and frequency choices available at run time.
Finally, some characteristics will depend on physical device form factor, as this determines the available power budget. An identical chipset can therefore have very different peak performance in different form factor devices.
- A small smartphone has a sustainable GPU power budget of between 1 and 2 Watts
- A large smartphone has a sustainable GPU power budget of between 2 and 3 Watts
- A large tablet has a sustainable GPU power budget of between 4 and 5 Watts
- An embedded device with a heat sink may have a GPU power budget of up to 10 Watts
... combining all of these means that it can be hard to predict the performance of a particular GPU implementation based solely on the GPU product name, core count, and top frequency. If in doubt, write some test scenarios which behave like your real use cases and run them on your target device(s) to see how well they perform.