Valhall Shader Core
All Mali shader cores are structured as a number of fixed-function hardware blocks that are wrapped around a programmable core. The programmable core is the largest area of change in the Valhall GPU family, with some significant changes from the earlier Bifrost designs.
The following image shows a single execution core implementation and the fixed-function units that surround it:
The Valhall programmable Execution Core (EC) consists of one or more Processing Engines (PEs). The Mali-G57 and Mali-G77 have two PEs, and several shared data processing units, all of which are linked by a messaging fabric.
A Valhall core can perform 32 FP32 FMAs, read four bilinear filtered texture samples, blend two fragments, and write two pixels per clock.
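The per-clock rates above can be turned into peak throughput figures once a clock frequency is known. The sketch below does this arithmetic; the clock frequency is a hypothetical example, as real values vary by implementation and DVFS operating point.

```python
# Per-clock peak rates for a single Valhall core, from the text above.
FP32_FMAS_PER_CLOCK = 32
TEXELS_PER_CLOCK = 4      # bilinear filtered texture samples
BLENDED_FRAGMENTS_PER_CLOCK = 2
PIXELS_WRITTEN_PER_CLOCK = 2

# Hypothetical clock frequency for illustration only.
clock_hz = 850e6

# One FMA counts as two floating-point operations (multiply + add).
peak_fp32_flops = FP32_FMAS_PER_CLOCK * 2 * clock_hz
print(f"Peak FP32 throughput per core: {peak_fp32_flops / 1e9:.1f} GFLOPS")
```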
The processing engines
Each PE executes the programmable shader instructions.
Each PE includes three arithmetic processing pipelines:

- FMA pipeline which is used for complex maths operations
- CVT pipeline which is used for simple maths operations
- SFU pipeline which is used for special functions
The FMA and CVT pipelines are 16-wide; the SFU pipeline is 4-wide and runs at one quarter of the throughput of the other two.
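Because a warp is 16 threads wide, the relationship between pipeline width and issue rate can be sketched as follows. This is a simplified model for illustration, not a cycle-accurate description of the hardware:

```python
WARP_WIDTH = 16

# Datapath width of each arithmetic pipeline, per the text above.
PIPE_WIDTH = {
    "FMA": 16,  # complex maths operations
    "CVT": 16,  # simple maths operations
    "SFU": 4,   # special functions
}

def cycles_per_warp(pipe: str) -> int:
    """Cycles to issue one instruction for a full 16-thread warp."""
    return WARP_WIDTH // PIPE_WIDTH[pipe]

for pipe in PIPE_WIDTH:
    print(pipe, cycles_per_warp(pipe))
```

The 4-wide SFU therefore takes four cycles to serve a full warp, which is the "one quarter of the throughput" noted above.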
Programs can use up to 32x32-bit registers while maintaining full thread occupancy. More complex programs can use up to 64 registers at the expense of reduced thread count.
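This register-versus-occupancy trade-off can be modelled as a simple step function. The halving of thread count in the 33-64 register band is an assumption for illustration; the text only states that occupancy is reduced:

```python
def thread_occupancy(registers_used: int) -> float:
    """Fractional thread occupancy for a given register footprint.

    Up to 32 registers keeps full occupancy; 33-64 registers reduces
    the thread count (modelled here as a halving, which is an assumption).
    """
    if registers_used <= 32:
        return 1.0
    if registers_used <= 64:
        return 0.5
    raise ValueError("programs are limited to 64 registers")

print(thread_occupancy(24))  # full occupancy
print(thread_occupancy(48))  # reduced thread count
```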
The arithmetic units in the Valhall PE implement a warp-based vectorization scheme to improve functional unit utilization. To improve processing speed, multiple threads are grouped into a bundle, called a warp, before being executed in parallel by the GPU. Valhall uses 16-wide warps.
From the point of view of a single thread, this architecture looks like a stream of scalar 32-bit operations, so achieving high utilization of the hardware is a relatively straightforward task for the shader compiler. However, the underlying hardware keeps the efficiency benefit of being a vector unit with a single set of control logic, which can be amortized over all of the threads in the warp.
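The lockstep execution model can be sketched in a few lines. Each thread issues what looks like a scalar operation, while the hardware applies it across all lanes of the warp under one set of control logic:

```python
WARP_WIDTH = 16

def warp_execute(op, operands_per_lane):
    """Execute one scalar operation across every lane of a warp in lockstep.

    Each thread sees a plain scalar op; the hardware runs it for all
    16 lanes under a single set of control logic.
    """
    assert len(operands_per_lane) == WARP_WIDTH
    return [op(*args) for args in operands_per_lane]

# All 16 threads issue the same scalar multiply, each on its own data.
result = warp_execute(lambda a, b: a * b, [(i, 2) for i in range(WARP_WIDTH)])
print(result)
```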
The two following diagrams illustrate the advantages of keeping the hardware units busy, regardless of the vector length in the program.
In this example, a vec3 arithmetic operation maps onto a pure SIMD (Single instruction, multiple data) unit, where a pipeline executes one thread per clock:
By contrast, in this example the pipeline in a 4-wide warp unit executes one lane per thread, for four threads per clock:
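The utilization difference between the two diagrams can be quantified with a small sketch, using the 4-wide units from the examples above. A pure SIMD unit running vec3 code leaves a lane idle, while the warp unit stays fully busy whenever enough threads are active:

```python
SIMD_WIDTH = 4

def simd_utilization(vector_length: int) -> float:
    """Pure SIMD: one thread per clock, so a vec3 leaves one lane idle."""
    return vector_length / SIMD_WIDTH

def warp_utilization(active_threads: int, warp_width: int = 4) -> float:
    """Warp unit: one lane per thread per clock, so utilization depends
    on the number of active threads, not on the program's vector length."""
    return min(active_threads, warp_width) / warp_width

print(simd_utilization(3))   # vec3 on pure SIMD: a quarter of the unit idles
print(warp_utilization(4))   # vec3 on a full warp: fully utilized
```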
Valhall maintains native support for int8, int16, and fp16 data types. These data types can be packed using SIMD instructions to fill each 32-bit data processing lane. This arrangement maintains the power efficiency and performance benefits of types that are narrower than 32 bits.
A single 16-wide warp maths unit can therefore perform 32x fp16/int16 operations per clock cycle, or 64x int8 operations per clock cycle.
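The per-clock figures follow directly from packing narrow types into 32-bit lanes, as this sketch shows:

```python
WARP_WIDTH = 16
LANE_BITS = 32

def ops_per_clock(type_bits: int) -> int:
    """Operations per clock for one 16-wide warp maths unit, assuming
    narrow types are packed to fill each 32-bit lane."""
    return WARP_WIDTH * (LANE_BITS // type_bits)

print(ops_per_clock(32))  # 16 fp32 operations
print(ops_per_clock(16))  # 32 fp16/int16 operations
print(ops_per_clock(8))   # 64 int8 operations
```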
The load/store unit
The load/store unit is responsible for all shader memory accesses that are not related to texture samplers. This includes generic pointer-based memory access, buffer access, atomics, and imageLoad() and imageStore() accesses for mutable image data.
The load/store unit can access data in a single 64-byte cache line per clock cycle. Warp accesses are optimized to reduce the number of unique cache access requests. For example, if all threads in a warp access data inside the same cache line, the data can be returned in a single cycle.
We recommend that you use shader algorithms that exploit the wide data access, and the cross-thread merging functionalities, of the load/store unit.
- Use vector loads and stores in each thread.
- Access sequential address ranges across the threads in a warp.
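The benefit of the second recommendation can be seen by counting how many distinct 64-byte cache lines a warp's accesses touch. This is a simplified model for illustration:

```python
CACHE_LINE_BYTES = 64

def unique_cache_lines(addresses):
    """Count the distinct 64-byte cache lines touched by one warp's accesses.
    Fewer unique lines means fewer requests in the load/store unit."""
    return len({addr // CACHE_LINE_BYTES for addr in addresses})

# 16 threads reading sequential 4-byte values: all hits land in one line.
sequential = [thread * 4 for thread in range(16)]
print(unique_cache_lines(sequential))  # serviced in a single cycle

# 16 threads striding 64 bytes apart: every access needs its own line.
strided = [thread * 64 for thread in range(16)]
print(unique_cache_lines(strided))
```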
The load/store unit includes a 16KB L1 data cache per core, which is backed by the shared L2 cache.
The varying unit
The varying unit is a dedicated fixed-function varying interpolator. It uses warp vectorization to ensure good functional unit utilization.
For every clock cycle, the varying unit can interpolate 32 bits for every thread in a warp. For example, interpolating a mediump (fp16) vec4 takes two cycles. Interpolation of mediump fp16 values is therefore both faster, and more energy efficient, than interpolation of highp fp32 values.
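The interpolation cost of any varying follows from the 32 bits-per-thread-per-clock rate, as this sketch shows:

```python
INTERP_BITS_PER_THREAD_PER_CLOCK = 32

def interpolation_cycles(components: int, bits_per_component: int) -> int:
    """Cycles for the varying unit to interpolate one varying per thread,
    at 32 bits per thread per clock, rounded up to whole cycles."""
    total_bits = components * bits_per_component
    return -(-total_bits // INTERP_BITS_PER_THREAD_PER_CLOCK)

print(interpolation_cycles(4, 16))  # mediump (fp16) vec4: 2 cycles
print(interpolation_cycles(4, 32))  # highp (fp32) vec4:   4 cycles
```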