Fourth-Generation Mali GPU architecture
The Valhall family of Mali GPUs uses the same top-level architecture as the earlier Bifrost GPUs. The Valhall family uses a unified shader core architecture. This means that only a single type of hardware shader processor exists in the design. This single processor type can execute all types of shader code, including vertex shaders, fragment shaders, and compute kernels.
The exact number of shader cores that are present in a silicon chip can vary. At Arm we license configurable designs to our silicon partners. Partners choose how to configure the GPU in their chipset based on their performance needs and silicon area constraints.
The following diagram provides a top-level overview of the control bus and data bus of a typical Mali Valhall GPU:
To improve performance, and to reduce memory bandwidth wastage that is caused by repeated data fetches, the shader cores in the system all share access to a level 2 cache. The size of the L2 cache is configurable by our silicon partners, is typically in the range of 64-128KB for each shader core in the GPU. However, the size of the L2 cache depends on how much silicon area is available.
Also, our silicon partners can configure the number, and bus width, of the memory ports that the L2 cache has to external memory.
The Valhall architecture aims to write two 32-bit pixel per core per clock. Therefore, it is reasonable to expect a 4-core design to have a total of 256 bits of memory bandwidth, for both read and write, per clock cycle. This can vary between chipset implementations.
When the application has completed defining the render pass, the Mali driver submits a pair of independent workloads for each render pass.
One section deals with all geometry-related and compute-related workloads in the pass, and the other section is for the fragment workload. Mali GPUs are tile-based renderers, all geometry processing for a render pass must be complete before the fragment shading can start. This is because we need a finalized tile list to provide fragment processing with the primitive coverage information that it needs.
The hardware supports two parallel issue queues which the driver can use, one for each workload type. Workloads from both queues can be processed by the GPU at the same time. This means that geometry processing and fragment processing for different render passes can be running in parallel.
The workload for a single render pass is nearly always large and highly parallelizable. This means that the GPU hardware will break the workload into smaller pieces and distribute it across all of the shader cores that are available in the GPU.