Third Gen Mali GPU Architecture
The Bifrost family of Mali GPUs uses the same top-level architecture as the earlier Midgard GPUs: a unified shader core architecture. This means that the design includes only a single type of hardware shader processor, which can execute all types of shader code, such as vertex shaders, fragment shaders, and compute kernels.
The exact number of shader cores present in a silicon chip can vary. We license configurable designs to our silicon partners, who can then choose how to configure the GPU in their specific chipset, based on their performance needs and silicon area constraints.
For example, the Mali-G72 GPU can scale from a single core, for low-end devices, up to 32 cores for the highest performance designs.
The following diagram provides a top-level overview of the Control Bus and Data Bus of a typical Mali Bifrost GPU:
To improve performance, and to reduce the memory bandwidth wasted by repeated data fetches, all shader cores in the system share access to a level 2 (L2) cache. The size of the L2 cache is configurable by our silicon partners and depends on how much silicon area is available, but it is typically in the range of 64-128KB per shader core in the GPU.
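As a rough illustration of the sizing guideline above, the following sketch scales the typical 64-128KB per-core range by the shader core count. The function name and its parameters are hypothetical and chosen only for this example. The result is a ballpark estimate, not the configuration of any particular chipset.

```python
def l2_cache_range_kb(shader_cores, min_kb_per_core=64, max_kb_per_core=128):
    """Return the (low, high) typical L2 cache size in KB for a core count.

    The 64-128KB per-core defaults come from the typical range quoted in
    the text. Real designs may fall outside this range.
    """
    if shader_cores < 1:
        raise ValueError("a GPU needs at least one shader core")
    return shader_cores * min_kb_per_core, shader_cores * max_kb_per_core


low, high = l2_cache_range_kb(8)
print(f"8 shader cores: roughly {low}-{high} KB of shared L2 cache")
```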
Our silicon partners can also configure the number and bus width of the memory ports that connect the L2 cache to external memory.
The Bifrost architecture aims to write one 32-bit pixel, per core, per clock. Therefore, it is reasonable to expect an eight-core design to have a total of 256 bits (8 cores x 32 bits) of memory bandwidth, for both read and write, per clock cycle. This can vary between chipset implementations.
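The arithmetic behind this figure can be written out as a short sketch. The 32-bit pixel size comes from the text. The helper names and the 800MHz clock used in the usage line are assumptions made only for illustration, and real chipsets can differ.

```python
BITS_PER_PIXEL = 32  # one 32-bit pixel per core per clock, as stated in the text


def bits_per_clock(shader_cores):
    """Expected memory bandwidth in bits per clock cycle."""
    return shader_cores * BITS_PER_PIXEL


def bandwidth_gb_per_s(shader_cores, clock_mhz):
    """Convert the per-clock figure to GB/s for an assumed GPU clock."""
    bytes_per_clock = bits_per_clock(shader_cores) / 8
    return bytes_per_clock * clock_mhz * 1e6 / 1e9


print(bits_per_clock(8))  # 256
# 800 MHz is a hypothetical clock frequency, not a figure from the text.
print(bandwidth_gb_per_s(8, clock_mhz=800))
```

Because the estimate is linear in the core count, halving the number of shader cores also halves the expected bandwidth requirement.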
Once the application has completed defining the render pass, the Mali driver submits a pair of independent workloads for each render pass.
One workload deals with all geometry and compute related processing. The other is the fragment-related workload. Because Mali GPUs are tile-based renderers, all geometry processing for a render pass must be complete before the fragment shading can begin.
Fragment processing requires a finalized tile rendering list, because this list provides the per-tile primitive coverage information that fragment processing needs.
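To illustrate why the tile list must be finalized first, the following sketch bins triangles into per-tile lists using their screen-space bounding boxes. This is a deliberately simplified model, not the algorithm used by the hardware. The 16x16 pixel tile size, the bounding-box test, and all names here are assumptions made for the example.

```python
import math
from collections import defaultdict

TILE_SIZE = 16  # assumed tile edge in pixels, for illustration only


def build_tile_lists(triangles, width, height):
    """Map each (tile_x, tile_y) to the indices of the triangles that may cover it.

    Each triangle is a list of three (x, y) screen-space vertices. Coverage
    is approximated by the bounding box, which can over-estimate.
    """
    tiles_x = (width + TILE_SIZE - 1) // TILE_SIZE
    tiles_y = (height + TILE_SIZE - 1) // TILE_SIZE
    tile_lists = defaultdict(list)
    for index, triangle in enumerate(triangles):
        xs = [x for x, _ in triangle]
        ys = [y for _, y in triangle]
        # Convert the bounding box to tile coordinates, clamped to the screen.
        min_tx = max(math.floor(min(xs)) // TILE_SIZE, 0)
        max_tx = min(math.floor(max(xs)) // TILE_SIZE, tiles_x - 1)
        min_ty = max(math.floor(min(ys)) // TILE_SIZE, 0)
        max_ty = min(math.floor(max(ys)) // TILE_SIZE, tiles_y - 1)
        for ty in range(min_ty, max_ty + 1):
            for tx in range(min_tx, max_tx + 1):
                tile_lists[(tx, ty)].append(index)
    return dict(tile_lists)


triangles = [
    [(2, 2), (10, 2), (2, 10)],      # fits inside tile (0, 0)
    [(10, 10), (40, 10), (10, 20)],  # submitted later, also touches tile (0, 0)
]
tile_lists = build_tile_lists(triangles, width=64, height=32)
print(tile_lists[(0, 0)])  # [0, 1]
```

In this example, the second triangle also lands in tile (0, 0). The list for that tile is therefore only complete after every primitive in the render pass has been processed, which is why fragment shading cannot start earlier.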
Bifrost GPUs support two parallel issue queues that the driver can use, one for each workload type. The GPU can process geometry and fragment workloads from both queues at the same time. This arrangement allows the workload to be distributed across all available shader cores in the GPU.
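The effect of the two queues can be sketched with a small timeline model. This is an idealized illustration, not the behavior of the real driver or hardware: the function name and the job durations are hypothetical, the time units are arbitrary, and the model ignores the fact that both queues compete for the same shader cores.

```python
def schedule(render_passes):
    """Place each render pass on a geometry queue and a fragment queue.

    render_passes is a list of (geometry_time, fragment_time) tuples.
    Returns one (geom_start, geom_end, frag_start, frag_end) tuple per pass.
    """
    geometry_free = 0  # time at which the geometry queue is next idle
    fragment_free = 0  # time at which the fragment queue is next idle
    timeline = []
    for geometry_time, fragment_time in render_passes:
        geom_start = geometry_free
        geom_end = geom_start + geometry_time
        # Fragment work waits for its own geometry and for the queue to drain.
        frag_start = max(geom_end, fragment_free)
        frag_end = frag_start + fragment_time
        geometry_free, fragment_free = geom_end, frag_end
        timeline.append((geom_start, geom_end, frag_start, frag_end))
    return timeline


passes = [(2, 5), (3, 4), (1, 6)]  # hypothetical durations
timeline = schedule(passes)
total_pipelined = timeline[-1][3]
total_serial = sum(g + f for g, f in passes)
print(timeline)
print(total_pipelined, total_serial)  # 17 21
```

In this model, the geometry work for a later render pass overlaps the fragment work for an earlier one, so the three passes finish in 17 time units rather than the 21 that fully serial execution would take.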