Analyzing shaders

If shader programs are maths-heavy, texture-heavy or prevent the GPU from applying optimizations, you could see performance problems. Streamline provides a range of charts to help you analyze shader behavior.

In order to be efficient, shader cores within a GPU should:

  • Execute calculations at moderate precision - in most cases, mediump (16-bit precision) is sufficient. 
  • Be kept busy with work to process.

Mali core warps

The Mali Core Warps chart shows the number of warps created for fragment and non-fragment workloads. Non-fragment workloads include vertex shading, geometry shading, tessellation shading, and compute shading. Each warp represents N threads of shader execution running in lock-step. The value of N - the warp width - varies for different GPUs. Download the Mali GPU datasheet to check this value for different GPUs.
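
As a rough illustration of how warp counts relate to thread counts, the following Python sketch multiplies a warp count by the warp width. The function name and the warp width of 16 are assumptions chosen for the example, not values taken from Streamline; check the Mali GPU datasheet for the width of your GPU. The result is an upper bound, because a warp can contain threads with no coverage.

```python
def threads_from_warps(warp_count: int, warp_width: int) -> int:
    """Return the maximum number of shader threads represented by warp_count warps.

    warp_width is N, the number of threads that run in lock-step per warp.
    It is GPU-specific, so the caller must supply it.
    """
    if warp_count < 0 or warp_width <= 0:
        raise ValueError("warp_count must be >= 0 and warp_width must be > 0")
    return warp_count * warp_width


# Hypothetical sample: 1000 fragment warps on a GPU assumed to have 16-wide warps.
fragment_threads = threads_from_warps(1000, 16)  # 16000
```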

Mali core throughput

The Mali Core Throughput chart shows the average number of cycles it takes to get a single thread shaded by the shader core. Note that this chart shows average throughput, not average cost, so it includes the impact of processing latency and of resource sharing between the fragment and non-fragment workload types.
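
To make the idea of "cycles per thread" concrete, here is a minimal sketch that divides a count of active cycles by an estimated thread count. It assumes that threads can be approximated as warps multiplied by warp width. The function and parameter names are hypothetical and are not Streamline counter names.

```python
from typing import Optional


def cycles_per_thread(active_cycles: int, warp_count: int, warp_width: int) -> Optional[float]:
    """Average shader core cycles spent per thread, or None if no threads ran.

    This is a throughput figure: it includes latency and the effect of
    sharing the core with the other workload type, so it is not the
    cost of a single thread in isolation.
    """
    threads = warp_count * warp_width  # assumed approximation of thread count
    if threads == 0:
        return None
    return active_cycles / threads


# Hypothetical sample: 480000 active cycles, 1000 warps, 16 threads per warp.
average = cycles_per_thread(480_000, 1000, 16)  # 30.0
```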

Mali core utilization

The Mali Core Utilization chart shows the activity levels of the major data paths in the shader core. It tells you the number of cycles where there is something running in the shader core, but doesn’t tell you how busy it is. This chart can indicate which type of workload could be causing a performance problem, and whether there are any scheduling issues.

Fragment FPKB utilization counts the percentage of cycles where the forward pixel kill (FPK) quad buffer, before the execution core, contains at least one quad. Quads in this buffer are queued, waiting to be shaded, so this percentage should remain fairly high. If it drops off significantly, so that only a small percentage of cycles have queued quads waiting to run, either there are dependencies on early ZS (work that has to be early ZS tested and can’t be pushed to late ZS), or the queue is draining faster than it is filling. The latter can happen if there are a number of very small triangles, which are being shaded faster than they can be rasterized.

If the execution core utilization is low, this indicates possible lost performance, because there are spare shader core cycles that could be used if they were accessible. In some use cases this is unavoidable, because there are regions in a render pass where there is no shader workload to process. Examples include a clear color tile that contains no shaded geometry, or a shadow map that can be resolved entirely using early ZS depth updates. If screen regions contain high volumes of redundant geometry, the programmable core can run out of work to process because the fragment front-end cannot generate warps fast enough. This can happen if a high percentage of triangles are killed by ZS testing or FPK hidden surface removal, or if there is a very high density of microtriangles, each of which generates a low number of threads.

Mali core unit utilization

The Mali Core Unit Utilization chart shows the percentage utilization of the functional units inside the shader core: the execution engine, the varying unit, the texture unit and the load/store unit. The most heavily utilized functional unit should be the target for optimizations to improve performance, although reducing load on any of the units is good for energy efficiency.

If the texture unit utilization is a bottleneck, check the texturing charts to look for ways to optimize.
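
If you export or note down the per-unit percentages, choosing the optimization target amounts to finding the maximum. The sketch below shows one way to do that. The dictionary keys and the sample values are invented for illustration and are not real measurements.

```python
def busiest_unit(utilization: dict) -> str:
    """Return the name of the functional unit with the highest utilization.

    utilization maps a unit name to its percentage utilization (0 to 100).
    """
    if not utilization:
        raise ValueError("utilization must contain at least one unit")
    return max(utilization, key=utilization.get)


# Hypothetical percentages read from the Mali Core Unit Utilization chart.
sample = {
    "execution engine": 62.0,
    "varying unit": 18.0,
    "texture unit": 81.5,
    "load/store unit": 40.0,
}
target = busiest_unit(sample)  # "texture unit"
```

In this made-up sample the texture unit is the most heavily loaded, so the texturing charts would be the next place to look.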

Mali core pipe utilization

For Valhall-based GPUs such as the Mali-G77, an additional Mali Core Pipe Utilization chart shows the breakdown of execution engine activity between the FMA (fused multiply-accumulate), CVT (convert), SFU (special functions unit) and MSG (message) pipes.

Mali core workload property rate

The Mali Core Workload Property Rate chart gives information about shader workload behavior that could be optimized:

  • Partial coverage - Warps that contain samples with no coverage. A high number suggests that content has a high density of very small triangles, or microtriangles, which are disproportionately expensive to process.
  • Diverged instructions - Instructions that have control flow divergence across the warp.
  • Constant tile kill - Tile writes that are killed by the transaction elimination CRC check. A high number indicates that a significant part of the framebuffer is static from frame to frame.
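
Each of the properties above can be thought of as a ratio of affected items to total items. The following sketch computes the three rates as percentages. The counter names and the choice of denominators are assumptions made for this example. They are not the expressions that Streamline itself uses.

```python
def workload_property_rates(counters: dict) -> dict:
    """Compute illustrative workload property rates as percentages.

    All keys in counters are hypothetical names chosen for this example.
    """
    def pct(part: float, whole: float) -> float:
        # Avoid dividing by zero when no work of that kind was recorded.
        return 100.0 * part / whole if whole else 0.0

    return {
        "partial_coverage": pct(counters["partial_warps"], counters["fragment_warps"]),
        "diverged_instructions": pct(counters["diverged_instructions"], counters["executed_instructions"]),
        "constant_tile_kill": pct(counters["killed_tiles"], counters["written_tiles"]),
    }


# Hypothetical counter values for one capture interval.
rates = workload_property_rates({
    "partial_warps": 200, "fragment_warps": 1000,
    "diverged_instructions": 50, "executed_instructions": 2000,
    "killed_tiles": 300, "written_tiles": 400,
})
# rates == {"partial_coverage": 20.0, "diverged_instructions": 2.5, "constant_tile_kill": 75.0}
```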

Mali varying usage

The Mali Varying Usage chart shows the amount of interpolation processed by the varying unit, at 16-bit or 32-bit precision. 

16-bit interpolation is twice as fast as 32-bit interpolation. It is recommended to use mediump (16-bit) varying inputs to fragment shaders, rather than highp, whenever possible.
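
The effect of switching precision can be estimated with a very simple model that relies only on the statement that 16-bit interpolation is twice as fast as 32-bit interpolation. The sketch below expresses varying work in "32-bit equivalent" units. It is an illustrative model with a hypothetical function name, and it ignores every other factor that affects the varying unit.

```python
def relative_varying_cost(interp_16bit: int, interp_32bit: int) -> float:
    """Estimate varying unit work in 32-bit-equivalent interpolation units.

    Assumes a 16-bit interpolation costs half as much as a 32-bit one.
    """
    if interp_16bit < 0 or interp_32bit < 0:
        raise ValueError("interpolation counts must be non-negative")
    return interp_32bit + 0.5 * interp_16bit


# Hypothetical shader with 12 interpolated components, all highp.
before = relative_varying_cost(interp_16bit=0, interp_32bit=12)  # 12.0
# The same shader after moving 8 of those components to mediump.
after = relative_varying_cost(interp_16bit=8, interp_32bit=4)    # 8.0
```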

Mali Offline Compiler

You can also use Mali Offline Compiler to generate an offline analysis report of how your shader program performs on a Mali GPU. It provides a cycle cost breakdown for the shader’s arithmetic, load/store, varying and texture usage. You can check the percentage of arithmetic operations that are efficiently performed at 16-bit precision or lower, and you can see information about the shader's use of language features that can impact the performance of shader execution. 

 
