The Bifrost Shader Core
Third generation Mali GPU architecture
March 2018 by Peter Harris
When optimizing applications using a GPU it is useful to have at least a high-level mental model for how the underlying hardware works, as well as an idea of its expected performance and data rates for the different types of operation it might perform. Understanding the block architecture is particularly important when optimizing using the Mali performance counters, as interpreting what the counters are telling you requires an understanding of the blocks they are tied to.
This article presents a stereotypical Mali "Bifrost" GPU programmable core, the third generation of Mali GPUs, including the Mali-G30, Mali-G50, and Mali-G70 series of products.
This article assumes you have read the earlier introductory article on tile-based rendering, as it builds on concepts introduced there.
The top-level architecture of a Bifrost GPU is the same as that of the earlier Midgard GPUs. Bifrost uses a unified shader core architecture, meaning that only a single type of hardware shader processor exists in the design. This single processor type can execute all types of shader code, including vertex shaders, fragment shaders, and compute kernels.
The exact number of shader cores present in a particular silicon chip varies; Arm licenses configurable designs to our silicon partners, who can then choose how to configure the GPU in their specific chipset based on their performance needs and silicon area constraints. The Mali-G72 GPU design can scale from a single core for low-end devices all the way up to 32 cores for the highest performance designs.
The shader cores in the system share a level 2 cache to improve performance, and to reduce memory bandwidth caused by repeated data fetches. The size of the L2 is another configurable component which can be tuned by our silicon partners, but is typically in the range of 64-128KB per shader core in the GPU depending on how much silicon area is available.
The number and bus width of the memory ports the L2 cache has to external memory is configurable, again allowing our partners to tune the implementation to meet their performance, power, and area needs. The architecture aims to be able to write one 32-bit pixel per core per clock, so it would be reasonable to expect an 8-core design to have a total of 256-bits of memory bandwidth (for both read and write) per clock cycle, although whether this much bandwidth is available to the GPU will depend on the chipset integration.
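As a rough illustration of the scaling described above, the peak pixel-write bandwidth can be estimated from the core count and clock frequency. This is a minimal sketch; the 850 MHz clock and the helper name are hypothetical, not figures from this article:

```python
def peak_write_bandwidth_gbs(num_cores, clock_hz, bits_per_core_per_clock=32):
    """Estimate peak pixel write bandwidth in GB/s.

    The architecture targets one 32-bit pixel write per core per clock,
    so the required bus width scales linearly with core count.
    """
    bits_per_second = num_cores * bits_per_core_per_clock * clock_hz
    return bits_per_second / 8 / 1e9  # bits -> bytes -> GB/s

# A hypothetical 8-core design (256 bits/clock total) at 850 MHz:
print(peak_write_bandwidth_gbs(8, 850e6))  # ~27.2 GB/s of pixel writes
```

Whether the surrounding memory system can actually sustain this rate depends on the chipset integration, as noted above.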
The Mali drivers submit workloads for each render pass as a pair of independently schedulable pieces once the application has completed defining the render pass. One piece is for all geometry and compute related workloads in the pass, and one is for the fragment workload. As a tile-based renderer all geometry processing for a render pass must be complete before the fragment shading can start, as we need a finalized "tile list" to provide fragment processing the primitive coverage information it needs.
The hardware supports two parallel issue queues which the driver can use, one for each workload type. Workloads from both queues can be processed by the GPU at the same time, so geometry processing and fragment processing for different render passes can be running in parallel.
The workload for a single render pass is nearly always large and highly parallelizable, so the GPU hardware will break it into smaller pieces and distribute it across all of the shader cores available in the GPU.
The Bifrost shader core
All Mali shader cores are structured as a number of fixed-function hardware blocks wrapped around a programmable core. The programmable core is the largest area of change in the Bifrost GPU family, with a number of significant changes over the earlier Midgard "Tripipe" design discussed in the previous blog in this series.
The Bifrost programmable Execution Core (EC) consists of one or more Execution Engines (EEs) – three in the case of the Mali-G71 – and a number of shared data processing units, all linked by a messaging fabric. The Bifrost shader cores come in two sizes, depending on product configuration.
The "small" Bifrost core can read one texture sample, blend one fragment, and write one pixel per clock. The "large" Bifrost core, used in later generations to improve energy and area efficiency, can read two texture samples, blend two fragments, and write two pixels per clock.
The execution engines
The Execution Engines are responsible for actually executing the programmable shader instructions, each including a single composite arithmetic processing pipeline as well as all of the required thread state for the threads that the EE is processing.
To improve performance and performance scalability for complex programs, Bifrost implements a substantially larger general-purpose register file for the shader programs to use. Programs can use up to 32x32-bit registers while maintaining full thread occupancy, and more complex programs can use up to 64 at the expense of reduced thread count.
The size of the per-draw call fast constant storage, used for storing OpenGL ES uniforms and Vulkan push constants, has also been increased to 512 bytes per draw call.
The arithmetic units in the Bifrost EE implement a warp-based vectorization scheme to improve functional unit utilization. Multiple threads are grouped into bundles, called a warp, to fill the width of the underlying vector processing hardware and then executed in lockstep.
From the point of view of a single thread this architecture looks like a stream of scalar 32-bit operations, which makes achieving high utilization of the hardware a relatively straightforward task for the shader compiler, but the underlying hardware keeps the efficiency benefit of being a vector unit, with a single set of control logic which can be amortized over all of the threads in the warp.
The example below shows how a vec3 arithmetic operation may map onto a pure SIMD unit (pipeline executes one thread per clock):
... vs a 4-wide warp unit (pipeline executes one lane per thread for four threads per clock):
The advantage in terms of the ability to keep the hardware units full of useful work, irrespective of the vector length used in the program, is clearly highlighted by these diagrams. The former case takes one cycle per thread, and leaves performance on the table whenever a thread cannot fill the vector width. The latter case takes one cycle per 32-bit operation in each thread, and has no idle cycles.
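The cycle counts can be sketched numerically. The following is an illustrative model only (the 4-wide datapath matches the example above; the function names are hypothetical):

```python
import math

LANES = 4  # hypothetical 4-wide datapath, matching the example above

def simd_cycles(num_threads, vec_width):
    """Pure SIMD: one thread per cycle; a vecN operation occupies N of
    the lanes in parallel and any remaining lanes sit idle."""
    return num_threads

def warp_cycles(num_threads, vec_width):
    """Warp unit: each cycle executes one scalar component in lockstep
    for LANES threads, so no lane is ever idle."""
    return math.ceil(num_threads / LANES) * vec_width

def utilization(cycles, num_threads, vec_width):
    """Fraction of lane-cycles spent doing useful work."""
    return (num_threads * vec_width) / (cycles * LANES)

# 8 threads, each executing one vec3 operation:
print(simd_cycles(8, 3), utilization(simd_cycles(8, 3), 8, 3))  # 8 cycles, 0.75
print(warp_cycles(8, 3), utilization(warp_cycles(8, 3), 8, 3))  # 6 cycles, 1.0
```

The SIMD case leaves one of four lanes idle on every cycle for vec3 work, while the warp case stays fully occupied regardless of the vector length in the shader.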
The warp width of the Bifrost shader cores varies. The earlier single pixel per core per cycle products have a 4-wide warp, the two pixel per core per cycle cores found in later products have an 8-wide warp.
The power efficiency and performance provided by the narrower than 32-bit types is still critically important for mobile devices, so Bifrost maintains native support for int8, int16, and fp16 data types which can be packed to fill the 128-bit data width of the data unit. A single 4-wide warp maths unit can therefore perform 8x fp16/int16 operations per clock cycle, or 16x int8 operations per clock cycle.
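The packed-type throughput follows directly from dividing the 128-bit datapath width by the type width. A small sketch of that arithmetic (the helper name is hypothetical):

```python
def ops_per_clock(warp_width, type_bits):
    """Operations per clock when narrow types are packed to fill the
    datapath: warp_width lanes x 32 bits each."""
    datapath_bits = warp_width * 32
    return datapath_bits // type_bits

# A 4-wide warp unit has a 128-bit datapath:
print(ops_per_clock(4, 32))  # 4x fp32/int32
print(ops_per_clock(4, 16))  # 8x fp16/int16
print(ops_per_clock(4, 8))   # 16x int8
```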
The load/store unit
The load/store unit is responsible for all shader memory accesses which are not related to texture samplers. This includes generic pointer-based memory access, buffer access, atomics, and imageStore() accesses for mutable image data.
The load/store cache can access data stored in a single 64-byte cache line per clock cycle, and accesses across a warp are optimized to reduce the number of unique cache access requests required. For example, if all threads access data inside the same cache line that data can be returned in a single cycle.
Due to the wide data access path and the cross-thread merging functionality, it is highly recommended that shader algorithms are designed with access patterns which exploit this:
- Use vector loads and stores in each thread
- Access sequential address ranges across the threads in a warp
The load/store unit includes a 16KB L1 data cache per core, which is backed by the shared L2 cache.
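The coalescing behavior can be sketched by counting the unique cache lines a warp touches. This is a simplified model, assuming perfect merging of same-line accesses (the function name and access patterns are hypothetical):

```python
def cache_transactions(addresses, line_bytes=64):
    """Count unique 64-byte cache lines touched by one warp's accesses.

    Accesses from different threads in a warp that fall in the same
    cache line are merged into a single request.
    """
    return len({addr // line_bytes for addr in addresses})

# 4 threads loading consecutive vec4s (16 bytes each) -> one cache line:
sequential = [thread * 16 for thread in range(4)]
print(cache_transactions(sequential))  # 1 transaction

# The same 4 threads with a 256-byte stride -> four separate lines:
strided = [thread * 256 for thread in range(4)]
print(cache_transactions(strided))     # 4 transactions
```

This is why the two recommendations above matter: vector accesses keep each thread inside few lines, and sequential addressing across the warp keeps the whole warp inside few lines.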
The varying unit
The varying unit is a dedicated fixed-function varying interpolator. It implements a similar optimization strategy to the arithmetic units, using warp vectorization to ensure good functional unit utilization.
The unit can interpolate 32 bits per thread per clock – e.g. interpolating a mediump (fp16) vec4 would take two cycles – so mediump fp16 interpolation is faster and more energy efficient than highp fp32 interpolation.
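A quick cycle estimate for interpolating one varying, assuming a rate of 32 bits per thread per clock, which matches the two-cycle mediump vec4 figure above (the helper name is hypothetical):

```python
import math

def interp_cycles(components, bits_per_component):
    """Cycles for the varying unit to interpolate one varying for a
    warp, assuming 32 bits per thread per clock."""
    return math.ceil(components * bits_per_component / 32)

print(interp_cycles(4, 16))  # mediump vec4: 64 bits  -> 2 cycles
print(interp_cycles(4, 32))  # highp vec4:  128 bits -> 4 cycles
print(interp_cycles(2, 16))  # mediump vec2: 32 bits  -> 1 cycle
```

This is a concrete reason to declare interpolated inputs mediump wherever the precision suffices, such as for color and normal data.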
The texture unit
The texture unit implements all texture memory accesses. The architectural performance of this block is variable, with a "texels per clock" peak performance which matches the "pixels per clock" of the shader core; this may be one or two bilinear filtered samples per clock depending on the product configuration. The unit includes 16KB of L1 data cache for each texel per clock of throughput, and this is backed by the shared L2 cache.
The baseline performance is one or two bilinear filtered texels per clock for most texture formats, but performance can vary for some texture formats and filtering modes.
Mali-G71 and Mali-G72
The first two Bifrost products implement a texture unit similar to the Midgard unit in terms of performance, the only significant difference being optimized depth sampling, which now runs at full rate.
| Operation | Performance impact |
| --- | --- |
| Trilinear filter | x2 |
| 3D format | x2 |
| 32-bit channel format | x channel count |
| YUV format | x planes |
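The multipliers above for Mali-G71 and Mali-G72 stack, so a relative per-sample cost can be sketched as follows. This is an illustrative model only (the function and its parameters are hypothetical, not an Arm API):

```python
def texture_cycles(trilinear=False, is_3d=False,
                   channels_32bit=0, yuv_planes=1):
    """Relative cycles per texture sample on Mali-G71/G72, applying the
    cost multipliers from the table above. Baseline bilinear 2D = 1."""
    cycles = 1
    if trilinear:
        cycles *= 2            # trilinear filter: x2
    if is_3d:
        cycles *= 2            # 3D format: x2
    if channels_32bit:
        cycles *= channels_32bit  # 32-bit channels: x channel count
    if yuv_planes > 1:
        cycles *= yuv_planes   # YUV: x planes
    return cycles

print(texture_cycles())                             # bilinear 2D: 1
print(texture_cycles(trilinear=True))               # trilinear 2D: 2
print(texture_cycles(trilinear=True, is_3d=True))   # trilinear 3D: 4
print(texture_cycles(channels_32bit=4))             # RGBA32F:     4
```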
Other Bifrost products
The later Bifrost products implement a substantially upgraded texturing unit design, which significantly reduces silicon area and improves energy efficiency over earlier designs.
| Operation | Performance impact |
| --- | --- |
| N x anisotropic filter | Up to x N |
| Trilinear filter | x2 |
| 3D format | x2 |
| 32-bit channel format | x channel count |
From an application developer's point of view this texture mapper introduces two major improvements, in addition to the energy efficiency gains.
Firstly it introduces anisotropic filtering, which allows dynamic sample patterns depending on the orientation of a primitive relative to the viewing plane.
Note that 2x anisotropic filtering significantly improves image quality compared to trilinear filtering, in particular for primitives which are almost tangential to the view plane, yet is often faster than trilinear because the filtering reverts to a reduced number of samples whenever the primitive's projection in screen space is close to isotropic.
Secondly it introduces optimized YUV filtering performance, allowing single cycle throughput for bilinear filtered samples irrespective of the number of input planes used to store the image data in memory. This significantly improves the performance of many applications importing camera and video streams, which commonly use semi-planar or fully planar YUV encodings.
ZS & blend unit
The ZS and blend unit is responsible for handling all accesses to the tile-memory, both for built-in OpenGL ES operations – such as ZS testing and color blending – and programmatic access to the tile buffer needed for functionality such as:
- ... and the merged sub-pass functionality in Vulkan.
The blender can write either one or two fragments per clock to the tile memory, depending on the shader core size. All Mali GPUs are designed to support fast multi-sample anti-aliasing (MSAA), so the GPU supports full rate blending of fragments and resolving of pixels when using 4xMSAA.
Index-driven geometry pipeline
Bifrost introduces an Index-Driven Vertex Shading (IDVS) geometry processing pipeline. Earlier Mali GPUs processed all of the vertex shading before culling primitives, often resulting in wasted computation and bandwidth for vertices which are only used in culled triangles (e.g. because they are outside of the frustum, or fail a facing test):
The IDVS pipeline starts by building primitives and submitting shading in primitive order, and splits the shader in two halves. Position shading runs before culling, and varying shading runs after it for the visible vertices which survive culling.
This flow provides two significant optimizations:
- Position shading is only submitted for small batches of vertices where at least one vertex in each batch is referenced by the index buffer. This allows vertex shading to jump spatial gaps in the index buffer which are never referenced.
- Varying shading is only submitted for primitives which survive the clip-and-cull phase; this removes a significant amount of redundant computation and bandwidth for vertices contributing only to triangles which are culled.
To get the most benefit from the IDVS geometry flow it is useful to partially deinterleave packed vertex buffers: place attributes contributing to position in one packed buffer, and attributes contributing to non-position varyings in a second packed buffer. This means that the non-position varyings are not pulled into the cache for vertices which are culled and never contribute to an on-screen primitive. This will be covered in more detail in a later article.
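A rough way to see the saving from splitting the buffers (vertex counts, attribute sizes, and the cull rate here are all hypothetical, and cache-line granularity is ignored):

```python
def attribute_bytes_fetched(num_vertices, culled_fraction,
                            position_bytes, varying_bytes, split_buffers):
    """Approximate attribute bytes read during IDVS geometry processing.

    Position attributes are always fetched, since position shading runs
    before culling. With interleaved buffers the varying attributes are
    dragged into the cache alongside them; with split buffers, varying
    fetches only occur for vertices that survive culling.
    """
    visible = int(num_vertices * (1 - culled_fraction))
    position = num_vertices * position_bytes
    if split_buffers:
        varying = visible * varying_bytes        # fetched after culling
    else:
        varying = num_vertices * varying_bytes   # fetched regardless
    return position + varying

# 10,000 vertices, half culled, 12B position + 20B other attributes:
print(attribute_bytes_fetched(10_000, 0.5, 12, 20, split_buffers=False))  # 320000
print(attribute_bytes_fetched(10_000, 0.5, 12, 20, split_buffers=True))   # 220000
```

In this illustrative case the split layout avoids roughly a third of the attribute traffic; the real saving depends on the cull rate and the relative attribute sizes.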