This guide explains the performance counters found in the Arm Streamline tool's profiling template for the Mali-G52 GPU. This GPU is part of the Mali Bifrost architecture family. Note that the Streamline template only shows a subset of the available performance counters. However, it covers the most common types of GPU workload performance analysis.

The counter template in Streamline follows a step-by-step analysis workflow, starting with a coarse analysis of the overall GPU workload, before moving on to a more detailed analysis of the rendering content the application is passing to the GPU, and how the GPU shader cores process that workload.

This guide contains the following sections:

  • CPU performance: look at how to analyze the overall usage of the CPU by observing the activity on the CPU clusters and cores in the system, and how that workload is split across threads.
  • GPU activity: look at how to analyze the overall usage of the GPU by observing the activity on the GPU processing queues, and the workload split between non-fragment and fragment processing.
  • Content behavior: look at how to analyze the content efficiency by observing the number of vertices being processed, the number of primitives being culled, and the number of pixels being processed.
  • Shader core data paths: look at Mali shader core workload, throughput, and data path utilization.
  • Shader core unit overview: look at the macro-scale usage of the shader core by observing the effectiveness of depth and stencil testing, the number of threads spawned for shading, and relative loading of the programmable core processing pipelines.
  • Shader core varying unit: look at performance of the varying unit, and how the unit is being used by the shader programs that are running. This can be used to identify optimization opportunities for varying-bound content that has been identified in the Shader core section.
  • Shader core texture unit: look at performance of the texture filtering unit, and how the unit is being used by the shader programs that are running. This can be used to identify optimization opportunities for texture-bound content that has been identified in the Shader core section.
  • Shader core load/store unit: look at performance of the load/store unit, and how the unit is being used by the shader programs that are running. This can be used to identify optimization opportunities for load/store-bound content that has been identified in the Shader core section.
  • Shader core memory access: look at the memory traffic between the shader core and the L2 cache and external memory system, broken down by unit. This can be used to identify which type of workload is causing memory traffic, helping to narrow down where optimizations should be targeted.

CPU performance

The first charts in the graphics analysis template look at the performance of the CPUs in the system, as many graphics performance issues are caused by high CPU load or poor scheduling of workloads across the CPU and GPU.

CPU activity

The activity charts show the activity of each processor cluster, presented as the percentage of each time slice that the CPU was running. Expand each chart to show the individual cores present inside the cluster. Note that this shows only the percentage of the time slice at whatever CPU frequency was being used; it is not a percentage of peak performance.

For CPU bound applications it is common that a single thread is running all of the time, becoming the bottleneck. The thread activity panel below the Timeline charts can be used to see when each application thread is running; select threads in this thread view to filter CPU charts by those threads.

Scheduling bound applications, where neither CPU nor GPU is busy all of the time due to poor synchronization, can be seen in this view as activity oscillating between the impacted CPU thread and the Mali GPU. The CPU thread will block and wait for the GPU to complete, and then the GPU will go idle waiting for the CPU to submit more work to process.

Streamline variable name:

$CPUActivityUser.Cluster[0..N]

CPU cycles

The cycle charts show the activity of each processor cluster, presented as the number of clock cycles used. This can be cross-referenced against the relevant CPU activity charts to give an indication of the CPU frequency.

Streamline variable name:

$CyclesCPUCycles.Cluster[0..N]

GPU activity

The workloads running on Mali-G52 are coordinated by the Job Manager, which is responsible for scheduling workloads onto the various processing units inside of the GPU. It exposes two FIFO work queues, called Job slots, to the graphics driver. There is one slot for non-fragment workloads, which include compute shading and vertex shading, and one slot for fragment shading workloads.

These two queues run asynchronously to the CPU and can run in parallel to each other, provided that sufficient work is available. The GPU will raise a CPU interrupt when each queued batch of work submitted by the driver has completed. It can continue processing the next item in the work queue while that interrupt is pending.

The diagram below shows the basic processing pipeline data paths through the GPU for different kinds of workload, and the performance counters for each data path or major block in the hierarchy.

Note that some counters track activity in an entire data path, not just a single hardware unit. For example, the Fragment queue active cycles counter will increment every cycle that there is any fragment workload running anywhere in the GPU.

Also note that some counters are common to multiple data paths; for example, both non-fragment and fragment shader programs will run inside the same unified Execution Core. The swim lane diagram below shows how the top-level Job Manager counters will increment for overlapping render passes.

This diagram shows two render passes per frame, shown in different shades of blue, each consisting of a single piece of non-fragment work that is executed before the corresponding fragment shading workload can start. An interrupt is raised back to the CPU at the end of each piece of work on each queue. Note that GPU active cycles will increment whenever any queue contains work.

GPU usage

This set of performance counters provides an overview of the overall load on the GPU, and how the workload is split between non-fragment and fragment processing.

These counters can be used to determine if an application is GPU bound, as they show if the GPU is being kept busy, and the workload distribution across the two main processing queues.

GPU active cycles

This counter increments every clock cycle where the GPU has any pending workload present in one of its processing queues, and therefore shows the overall GPU processing load requested by the application.

This counter will increment every clock cycle where any workload is present in a processing queue, even if the GPU is stalled waiting for external memory to return data; this is still counted as active time even though no forward progress is being made.

Streamline variable name:

$MaliGPUCyclesGPUActive

Non-fragment queue active cycles

This counter increments every clock cycle where the GPU has any workload present in the non-fragment queue. This queue can be used for vertex shaders, tessellation shaders, geometry shaders, fixed function tiling, and compute shaders. This counter can not disambiguate between these workloads.

This counter will increment any clock cycle where a workload is loaded into a queue even if the GPU is stalled waiting for external memory to return data; this is still counted as active time even though no forward progress is being made.

Streamline variable name:

$MaliGPUCyclesNonFragmentQueueActive

Fragment queue active cycles

This counter increments every clock cycle where the GPU has any workload present in the fragment queue.

For most graphics content there are significantly more fragments than vertices, so this queue will normally have the highest processing load. In content that is GPU bound by fragment processing it is normal for Fragment queue active cycles to be approximately equal to GPU active cycles, with vertex and fragment processing running in parallel.

This counter will increment any clock cycle where a workload is loaded into a queue even if the GPU is stalled waiting for external memory to return data; this is still counted as active time even though no forward progress is being made.

Streamline variable name:

$MaliGPUCyclesFragmentQueueActive

Tiler active cycles

This counter increments every cycle the tiler has a workload in its processing queue. The tiler can run in parallel to vertex shading and fragment shading. A high cycle count here does not necessarily imply a bottleneck, unless the Compute active cycles counters in the shader cores are very low relative to this.

Streamline variable name:

$MaliGPUCyclesTilerActive

Interrupt pending cycles

This counter increments every cycle that the GPU has an interrupt pending and is waiting for the CPU to process it.

Note that cycles with a pending interrupt do not necessarily indicate lost performance because the GPU can process other queued work in parallel. However, if Interrupt pending cycles is a high percentage of GPU active cycles, there could be an underlying problem that is preventing the CPU from handling interrupts efficiently. The most common cause for this is a misbehaving device driver, which may not be the Mali device driver, that has masked interrupts for a long period of time.

Streamline variable name:

$MaliGPUCyclesInterruptActive

GPU utilization

These counters provide views of the data path activity cycles, normalized against the total GPU active cycle count.

For GPU bound content it is expected that one of the queues should be close to 100% utilization, with the other running in parallel to it. The most heavily loaded queue therefore becomes the highest priority target for content optimization.

For GPU bound content where the GPU is always busy, but neither queue is running all of the time, application API usage may be preventing parallel processing. When optimizing GPU bound content, aim to minimize these scheduling bubbles and ensure that workloads execute in parallel across the two queues before optimizing the dominant queue's workload. Loss of parallelism can be caused by:

  • The application blocking and waiting for GPU activity to complete, for example by waiting on a query object result which is not yet available. This may cause one or more of the work queues to run out of new work to process.
  • The application submitting rendering workloads that have data dependencies across the queues which prevent parallel execution. For example, a fragment-compute-fragment data flow may mean that no processing can be executed in the fragment queue while the compute shader is running, if no non-dependent work is available.

Mobile systems use dynamic voltage and frequency scaling (DVFS), reducing voltage and clock frequency for light workloads, to improve energy efficiency. When a workload shows high percentage utilization, always check the GPU active cycles counter, because the GPU might be highly utilized but running at a low clock frequency.
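
A minimal sketch of this check in Python, using hypothetical per-sample counter deltas; the 90% threshold is illustrative rather than part of the template:

  # Classify queue scheduling behavior from raw counter deltas for one sample.
  def queue_utilization(active_cycles, gpu_active_cycles):
      return min(100.0 * active_cycles / gpu_active_cycles, 100.0)

  gpu_active = 10_000_000           # hypothetical counter deltas
  non_fragment_active = 4_500_000
  fragment_active = 6_000_000

  nf_util = queue_utilization(non_fragment_active, gpu_active)
  frag_util = queue_utilization(fragment_active, gpu_active)

  if max(nf_util, frag_util) < 90.0:
      # GPU is busy but neither queue dominates: suspect serialization or
      # cross-queue dependencies preventing parallel execution.
      print("Possible scheduling bubble: NF %.0f%%, fragment %.0f%%" % (nf_util, frag_util))
  else:
      print("Dominant queue utilization: %.0f%%" % max(nf_util, frag_util))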

Non-fragment queue utilization

This expression defines the non-fragment queue utilization compared against the GPU active cycles. For GPU bound content it is expected that the GPU queues process work in parallel, so the dominant queue should be close to 100% utilized. If no queue is dominant, but the GPU is close to 100% utilized, then there could be a serialization or dependency problem preventing better overlap across the queues.

Streamline expression:

min(($MaliGPUCyclesNonFragmentQueueActive / $MaliGPUCyclesGPUActive) * 100, 100)

Fragment queue utilization

This expression defines the fragment queue utilization compared against the GPU active cycles. For GPU bound content it is expected that the GPU queues will process work in parallel, so the dominant queue should be close to 100% utilized. If no queue is dominant, but the GPU is close to 100% utilized, then there could be a serialization or dependency problem preventing better overlap across the queues.

Streamline expression:

min(($MaliGPUCyclesFragmentQueueActive / $MaliGPUCyclesGPUActive) * 100, 100)

Tiler utilization

This expression defines the tiler utilization compared to the total GPU active cycles.

Note that this measures the overall processing time for index-driven vertex shading (IDVS) workloads, in addition to the fixed-function tiling process. It is not necessarily indicative of the runtime of the fixed-function tiling process itself.

Streamline expression:

min(($MaliGPUCyclesTilerActive / $MaliGPUCyclesGPUActive) * 100, 100)

Interrupt pending utilization

This expression defines the IRQ pending utilization compared against the GPU active cycles. In a well-functioning system this expression should be less than 2% of the total cycles. If the value is much higher than this then there may be a system issue preventing the CPU from efficiently handling interrupts.

Streamline expression:

min(($MaliGPUCyclesInterruptActive / $MaliGPUCyclesGPUActive) * 100, 100)

External memory bandwidth

These counters show the memory bandwidth between the GPU and the downstream memory system. These accesses may go directly to external DRAM, or may be sent through additional levels of system cache outside of the GPU.

Memory accesses to external DRAM are very power intensive; a good rule of thumb is 100mW per GB/s of bandwidth used. Minimizing GPU memory bandwidth is always a good optimization objective.
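
As a worked example, the sketch below converts the external bus beat counters into bandwidth and applies the 100mW per GB/s rule of thumb; the beat counts and one-second sample period are hypothetical, and the 16 bytes per beat comes from the expressions that follow.

  # Each external bus beat on Mali-G52 transfers 16 bytes.
  BYTES_PER_BEAT = 16

  read_beats = 50_000_000      # hypothetical counter delta for the sample
  write_beats = 20_000_000
  sample_period_s = 1.0        # hypothetical sample period

  read_bytes = read_beats * BYTES_PER_BEAT
  write_bytes = write_beats * BYTES_PER_BEAT

  total_gb_per_s = (read_bytes + write_bytes) / sample_period_s / 1e9
  approx_dram_mw = 100.0 * total_gb_per_s   # ~100mW per GB/s rule of thumb

  print("External bandwidth: %.2f GB/s (~%.0f mW DRAM)" % (total_gb_per_s, approx_dram_mw))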

Output external read bytes

This expression defines the output read bandwidth for the GPU.

Streamline expression:

$MaliExternalBusBeatsReadBeat * 16

Output external write bytes

This expression defines the output write bandwidth for the GPU.

Streamline expression:

$MaliExternalBusBeatsWriteBeat * 16

External memory stalls

These counters show the memory stall rate seen by the GPU when attempting to make accesses to the downstream memory system. A high stall rate is indicative of content which is requesting more data than the memory system can provide, so optimizations that reduce memory bandwidth usage should be attempted.

Output external read stall rate

This expression defines the percentage of read transactions that stall waiting for the external memory interface.

Streamline expression:

min(($MaliExternalBusStallsReadStallCycles / $MaliGPUCyclesGPUActive) * 100, 100)

Output external write stall rate

This expression defines the percentage of write transactions that stall waiting for the external memory interface.

Streamline expression:

min(($MaliExternalBusStallsWriteStallCycles / $MaliGPUCyclesGPUActive) * 100, 100)

External memory read latency

These counters show the distribution of memory read latency seen by the GPU when making memory system accesses. A large proportion of reads with latency over 256 cycles can indicate that the content is requesting more data than the memory system can provide, so optimizations that reduce memory bandwidth usage should be attempted.

Output external read latency 0-127 cycles

This counter increments for every data beat that is returned between 0 and 127 cycles after the read transaction started.

Streamline variable name:

$MaliExternalBusReadLatency0127Cycles

Output external read latency 128-191 cycles

This counter increments for every data beat that is returned between 128 and 191 cycles after the read transaction started.

Streamline variable name:

$MaliExternalBusReadLatency128191Cycles

Output external read latency 192-255 cycles

This counter increments for every data beat that is returned between 192 and 255 cycles after the read transaction started.

Streamline variable name:

$MaliExternalBusReadLatency192255Cycles

Output external read latency 256-319 cycles

This counter increments for every data beat that is returned between 256 and 319 cycles after the read transaction started.

Streamline variable name:

$MaliExternalBusReadLatency256319Cycles

Output external read latency 320-383 cycles

This counter increments for every data beat that is returned between 320 and 383 cycles after the read transaction started.

Streamline variable name:

$MaliExternalBusReadLatency320383Cycles

Output external read latency 384+ cycles

This expression defines the number of read beats that are returned at least 384 cycles after the transaction started.

Streamline expression:

$MaliExternalBusBeatsReadBeat - $MaliExternalBusReadLatency0127Cycles - $MaliExternalBusReadLatency128191Cycles - $MaliExternalBusReadLatency192255Cycles - $MaliExternalBusReadLatency256319Cycles - $MaliExternalBusReadLatency320383Cycles
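
The sketch below assembles the full read latency histogram from these counters in Python, deriving the 384+ bucket as the remainder of the total read beats, as in the expression above; the counter values are hypothetical.

  # Build the external read latency histogram from the raw counters.
  read_beats_total = 1_000_000     # hypothetical counter deltas for one sample
  latency_bins = {
      "0-127":   700_000,
      "128-191": 150_000,
      "192-255":  80_000,
      "256-319":  40_000,
      "320-383":  20_000,
  }
  # The 384+ bucket is whatever is left of the total read beats.
  latency_bins["384+"] = read_beats_total - sum(latency_bins.values())

  high_latency = sum(v for k, v in latency_bins.items()
                     if k in ("256-319", "320-383", "384+"))
  print("Beats with latency >= 256 cycles: %.1f%%"
        % (100.0 * high_latency / read_beats_total))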

Content behavior

Slow rendering performance has three common causes:

  • Content which is efficiently written, but doing too much processing given the capabilities of the target device.
  • Content which is inefficiently written, with redundancy in the workload submitted for rendering, which means it takes longer to render than it should.
  • Application API usage which triggers high workload, or causes idle bubbles, due to GPU-specific or driver-specific behaviors.

This section of the Streamline template focuses on the first two of these causes, looking at the size and efficiency of the workload that has been submitted.

Geometry usage

The first application input processed by the GPU rendering pipeline is the vertex stream. This set of counters looks at the amount of geometry being processed, and how much is discarded due to culling.

Geometry is one of the most expensive inputs to the GPU, as vertices typically need 32-64 bytes of input data and data access is expensive. It is important that high detail geometry is used only when needed. Pseudo-geometry techniques, such as normal mapping, should be preferred to using vertex-based geometry whenever possible. Dynamic mesh level-of-detail, using simpler meshes when objects are further from the camera, should be used.

Total input primitives

This expression defines the total number of input primitives to the rendering process.

Streamline expression:

$MaliPrimitiveCullingFacingTestCulledPrimitives + $MaliPrimitiveCullingFrustumTestCulledPrimitives + $MaliPrimitiveCullingSampleTestCulledPrimitives + $MaliPrimitiveCullingVisiblePrimitives

Total culled primitives

This expression defines the number of primitives that were culled during the rendering process, for any reason.

Streamline expression:

$MaliPrimitiveCullingFacingTestCulledPrimitives + $MaliPrimitiveCullingFrustumTestCulledPrimitives + $MaliPrimitiveCullingSampleTestCulledPrimitives

Visible primitives

This counter increments for every visible primitive that survives culling. There may still be some forms of redundancy present in the set of visible primitives. For example, a complex mesh that is inside the frustum but occluded by a wall is classified as visible by this counter. Software techniques such as portal culling can be used to efficiently cull objects inside the frustum, as they can provide guarantees about visibility between regions of a game level.

Streamline variable name:

$MaliPrimitiveCullingVisiblePrimitives

Geometry culling

All geometry must be processed by the GPU to determine its position in clip-space before it can be put through the culling process. Geometry which is culled is therefore a source of processing cost, even though it contributes no visual output to the final render. This set of counters helps to identify why triangles are being culled, allowing you to correctly target optimizations at the area causing problems.

The Mali culling pipeline executes in the order shown below, and the counters in this section show the percentage of the primitives that enter each pipeline stage that are killed by it.

Visible primitives after culling

This expression defines the percentage of primitives that are visible after culling.

Streamline expression:

min(($MaliPrimitiveCullingVisiblePrimitives / ($MaliPrimitiveCullingFacingTestCulledPrimitives + $MaliPrimitiveCullingFrustumTestCulledPrimitives + $MaliPrimitiveCullingSampleTestCulledPrimitives + $MaliPrimitiveCullingVisiblePrimitives)) * 100, 100)

Input primitives to facing test killed by it

This expression defines the percentage of primitives entering the facing test that are killed by it.

Streamline expression:

min(($MaliPrimitiveCullingFacingTestCulledPrimitives / ($MaliPrimitiveCullingFacingTestCulledPrimitives + $MaliPrimitiveCullingFrustumTestCulledPrimitives + $MaliPrimitiveCullingSampleTestCulledPrimitives + $MaliPrimitiveCullingVisiblePrimitives)) * 100, 100)

Input primitives to frustum test killed by it

This expression defines the percentage of primitives entering the frustum test that are killed by it.

Streamline expression:

min(($MaliPrimitiveCullingFrustumTestCulledPrimitives / (($MaliPrimitiveCullingFacingTestCulledPrimitives + $MaliPrimitiveCullingFrustumTestCulledPrimitives + $MaliPrimitiveCullingSampleTestCulledPrimitives + $MaliPrimitiveCullingVisiblePrimitives) - $MaliPrimitiveCullingFacingTestCulledPrimitives)) * 100, 100)

Input primitives to sample test killed by it

This expression defines the percentage of primitives entering the sample coverage test that are killed by it.

Streamline expression:

min(($MaliPrimitiveCullingSampleTestCulledPrimitives / (($MaliPrimitiveCullingFacingTestCulledPrimitives + $MaliPrimitiveCullingFrustumTestCulledPrimitives + $MaliPrimitiveCullingSampleTestCulledPrimitives + $MaliPrimitiveCullingVisiblePrimitives) - $MaliPrimitiveCullingFacingTestCulledPrimitives - $MaliPrimitiveCullingFrustumTestCulledPrimitives)) * 100, 100)
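
The staged percentages above can be reproduced in Python as shown below, which makes the shrinking denominators explicit: primitives killed by the facing test never reach the frustum test, and primitives killed by either never reach the sample coverage test. The counter values are hypothetical.

  # Staged culling percentages, matching the expressions above.
  facing_killed  = 400_000   # hypothetical counter deltas
  frustum_killed = 150_000
  sample_killed  =  50_000
  visible        = 400_000

  total_input = facing_killed + frustum_killed + sample_killed + visible

  facing_pct  = 100.0 * facing_killed / total_input
  # Primitives that survive the facing test feed the frustum test.
  frustum_pct = 100.0 * frustum_killed / (total_input - facing_killed)
  # Primitives that survive facing and frustum tests feed the sample test.
  sample_pct  = 100.0 * sample_killed / (total_input - facing_killed - frustum_killed)
  visible_pct = 100.0 * visible / total_input

  print(f"Facing kill: {facing_pct:.1f}%, frustum kill: {frustum_pct:.1f}%, "
        f"sample kill: {sample_pct:.1f}%, visible: {visible_pct:.1f}%")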

IDVS shading

Mali Bifrost GPUs use an optimized index-driven vertex shading (IDVS) processing pipeline. In this pipeline vertex shading is split into two pieces - position shading, and varying shading - and varying shading only occurs for vertices that are part of a triangle that survives primitive culling.

These counters show the vertex shading workload submitted to the shader cores by the IDVS pipeline. This pipeline uses a post-transform vertex cache, which contains the positions of recently shaded vertices, to avoid shading vertices that are shared by multiple primitives more than once. Poor index buffer spatial locality can result in a vertex being shaded multiple times, because it can be evicted from the cache before it can be reused.

Position shader thread invocations

This expression defines the number of position shader thread invocations.

Streamline expression:

$MaliTilerShadingRequestsPositionShadingRequests * 4

Varying shader thread invocations

This expression defines the number of varying shader thread invocations.

Streamline expression:

$MaliTilerShadingRequestsVaryingShadingRequests * 4
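
Because varying shading only runs for vertices belonging to primitives that survive culling, the ratio of varying to position invocations gives a rough view of how much position shading work is discarded. The sketch below computes both invocation counts and that ratio; the request counts are hypothetical and the ratio is only an approximation.

  # IDVS shading invocations: each shading request covers 4 vertices.
  position_requests = 250_000   # hypothetical counter deltas
  varying_requests  = 150_000

  position_threads = position_requests * 4
  varying_threads  = varying_requests * 4

  # Rough fraction of position-shaded vertices that also needed varying
  # shading, i.e. that belonged to primitives surviving culling.
  survival_ratio = varying_threads / position_threads
  print(f"Position threads: {position_threads}, varying threads: {varying_threads}, "
        f"~{100.0 * survival_ratio:.0f}% of position work reaches varying shading")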

Fragment overview

These counters look at the GPU processing workload being requested in terms of the total number of output pixels shaded, the average number of GPU cycles spent per pixel, and the average numbers of fragments shaded per output pixel.

It can be a useful exercise to set a cycle budget for an application, measured in terms of cycles per pixel. Compute the maximum possible cycle budget using this equation:

  shaderCyclesPerSecond = MaliCoreCount * MaliFrequency
  pixelsPerSecond = Screen_Resolution * Target_FPS

  // Max cycle budget assuming perfect execution
  maxBudget = shaderCyclesPerSecond / pixelsPerSecond

  // Real-world cycle budget assuming 85% utilization
  realBudget = 0.85 * maxBudget

This can help set expectations of what is possible. For example, consider a mass-market device with a 3 core Mali GPU running at 500MHz. At 1080p60 this device will have a cycle budget of just 10 cycles per pixel, which must be used to cover all processing costs - including vertex shading and fragment shading. It's definitely possible to write content inside this budget, but care must be taken to spend every cycle wisely.
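
The budget calculation above, written as runnable Python for the mass-market example in this section (a 3 core GPU at 500MHz targeting 1080p at 60 FPS):

  # Per-pixel cycle budget for a 3 core Mali GPU at 500MHz targeting 1080p60.
  core_count = 3
  frequency_hz = 500_000_000
  width, height, target_fps = 1920, 1080, 60

  shader_cycles_per_second = core_count * frequency_hz
  pixels_per_second = width * height * target_fps

  max_budget = shader_cycles_per_second / pixels_per_second   # perfect execution
  real_budget = 0.85 * max_budget                             # ~85% utilization

  print(f"Max budget: {max_budget:.1f} cycles/pixel, "
        f"realistic budget: {real_budget:.1f} cycles/pixel")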

Pixels

This expression defines the total number of pixels that are shaded for any render pass. Note that this can be a slight overestimate because the underlying hardware counter rounds the width and height values of the rendered surface to be 32-pixel aligned, even if those pixels are not actually processed during shading because they are out of the active viewport and/or scissor region.

Streamline expression:

$MaliGPUTasksFragmentTasks * 1024

Cycles per pixel

This expression defines the average number of GPU cycles being spent per pixel rendered, including any vertex shading cost.

It can be a useful exercise to set a cycle budget for each render pass in your game, based on the target resolution and frame rate you want to achieve. Rendering 1080p at 60 FPS is possible in a mass-market device, but the number of cycles per pixel you have to work with can be small, especially if you have multiple render passes per frame, so those cycles must be used wisely.

Streamline expression:

$MaliGPUCyclesGPUActive / ($MaliGPUTasksFragmentTasks * 1024)

Fragments per pixel

This expression computes the number of fragments shaded per output pixel. High levels of overdraw can be a significant processing cost, especially when rendering to a high-resolution framebuffer.

Note that this expression assumes a 16x16 tile size is used during shading. For render passes using more than Tile storage per pixel bits per pixel the tile size will be dynamically reduced and this assumption will be invalid.

Streamline expression:

($MaliCoreWarpsFragmentWarps * 8) / ($MaliCoreTilesTiles * 256)

Fragment depth and stencil testing

This section looks at how fragment quads to be shaded interact with early (before fragment shading) and late (after fragment shading) depth and stencil (ZS) testing. It is important that as many fragments as possible are early-ZS tested, as this is more efficient than testing and killing fragments later using late-ZS.

To maximize the efficiency of early-ZS testing it is recommended to draw opaque objects starting with those closest to the camera and then working further away.
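
A minimal sketch of that draw ordering, assuming the application keeps a list of opaque draw calls with a precomputed camera-space depth; the DrawCall record and submit function are hypothetical stand-ins for the real engine code:

  from dataclasses import dataclass

  @dataclass
  class DrawCall:               # hypothetical application-side draw record
      name: str
      camera_depth: float       # distance from the camera, precomputed per frame

  def submit(draw):             # placeholder for the real submission path
      print("draw", draw.name)

  opaque_draws = [
      DrawCall("terrain", 40.0),
      DrawCall("player", 2.5),
      DrawCall("building", 12.0),
  ]

  # Sort opaque geometry front to back so early ZS testing can reject
  # occluded fragments before they are shaded.
  for draw in sorted(opaque_draws, key=lambda d: d.camera_depth):
      submit(draw)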

Early ZS tested quad percentage

This expression defines the percentage of rasterized quads that were subjected to early depth and stencil testing.

Streamline expression:

min(($MaliCoreQuadsEarlyZSTestedQuads / $MaliCoreQuadsRasterizedQuads) * 100, 100)

Early ZS updated quad percentage

This expression defines the percentage of rasterized quads that update the framebuffer during early depth and stencil testing.

Streamline expression:

min(($MaliCoreQuadsEarlyZSUpdatedQuads / $MaliCoreQuadsRasterizedQuads) * 100, 100)

Early ZS killed quad percentage

This expression defines the percentage of rasterized quads that are killed by early depth and stencil testing.

Streamline expression:

min(($MaliCoreQuadsEarlyZSKilledQuads / $MaliCoreQuadsRasterizedQuads) * 100, 100)

FPK killed quad percentage

This expression defines the percentage of rasterized quads that are killed by forward pixel kill (FPK) hidden surface removal. The killed count is derived by subtracting the quads killed by early ZS testing and the quads that were actually shaded (fragment warps × 8 threads per warp, divided by 4 threads per quad) from the total rasterized quads.

Streamline expression:

min((($MaliCoreQuadsRasterizedQuads - $MaliCoreQuadsEarlyZSKilledQuads - (($MaliCoreWarpsFragmentWarps * 8) / 4)) / $MaliCoreQuadsRasterizedQuads) * 100, 100)

Late ZS tested quad percentage

This expression defines the percentage of rasterized quads that are tested by late depth and stencil testing.

Streamline expression:

min(($MaliCoreQuadsLateZSTestedQuads / $MaliCoreQuadsRasterizedQuads) * 100, 100)

Late ZS killed quad percentage

This expression defines the percentage of rasterized quads that are killed by late depth and stencil testing. Quads killed by late ZS testing will execute some of their fragment program before being killed, so a significant number of quads being killed at late ZS testing indicates a significant performance overhead and/or wasted energy. You should minimize the number of quads using and being killed by late ZS testing.

The main causes of late ZS usage are where fragment shader programs:

  • use explicit discard statements
  • use implicit discard (alpha-to-coverage)
  • use a shader-created fragment depth
  • cause side-effects on shared resources, such as shader storage buffer objects, images, or atomics.

In addition to the application use cases, the driver will generate warps for preloading the ZS values in the framebuffer if an attached depth or stencil attachment is not cleared or invalidated at the start of a render pass. These will be reported as quads killed by late ZS testing in the counter values. Always clear or invalidate all attached framebuffer surface attachments unless the algorithm requires the value to be preserved.

Streamline expression:

min(($MaliCoreQuadsLateZSKilledQuads / $MaliCoreQuadsRasterizedQuads) * 100, 100)

Shader core data path

This section describes the counters implemented by the Mali shader core thread issue units for both non-fragment and fragment workloads.

All non-fragment workloads are known as "compute" workloads for this part of the GPU; this includes vertex shading, geometry shading, and tessellation shading, not just compute shading.

Shader core workload

These counters count the number of warps issued for the two workload types. Each warp represents N threads of shader execution running in lock-step. The value of N - the warp width - varies for different GPUs in the Bifrost family. For Mali-G52 the Threads per warp constant is 8.

Compute warps

This counter increments for every created compute warp. To ensure full utilization of the warp capacity, compute workgroup sizes should be a multiple of the warp size.

Note that the warp width varies between Mali devices; the Threads per warp constant defines the number of threads in a single warp.

Streamline variable name:

$MaliCoreWarpsComputeWarps
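
The sketch below illustrates the advice that workgroup sizes should be a multiple of the warp size, by computing how many of the spawned warp lanes a given workgroup size actually fills. It assumes the Mali-G52 warp width of 8 threads and that warps do not span workgroups; the workgroup sizes are examples.

  import math

  WARP_WIDTH = 8  # Threads per warp on Mali-G52

  def warp_lane_utilization(workgroup_size):
      # Each workgroup is packed into whole warps, so a size that is not a
      # multiple of the warp width leaves lanes in the final warp unused.
      warps = math.ceil(workgroup_size / WARP_WIDTH)
      return workgroup_size / (warps * WARP_WIDTH)

  for size in (8, 12, 16, 30, 64):
      print(f"workgroup {size:>2}: {100.0 * warp_lane_utilization(size):.0f}% lane utilization")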

Fragment warps

This counter increments for every created fragment warp.

Note that the warp width varies between Mali devices; the Threads per warp constant defines the number of threads in a single warp.

Streamline variable name:

$MaliCoreWarpsFragmentWarps

Shader core throughput

These counters show the average number of cycles it takes to get a single thread shaded by the shader core. Note that this chart shows average throughput, not average cost, so it includes the impact of processing latency and of resource sharing across the two workload types.

Compute cycles per thread

This expression defines the average number of shader core cycles per compute thread.

Note that this measurement captures the average throughput, which may not be a direct measure of processing cost for content that is sensitive to memory access latency. In addition there will be some crosstalk caused by compute and fragment workloads running concurrently on the same hardware. This expression is therefore indicative of cost, but does not reflect precise costing.

Streamline expression:

$MaliCoreCyclesComputeActive / ($MaliCoreWarpsComputeWarps * 8)

Fragment cycles per thread

This expression defines the average number of shader core cycles per fragment thread. Note that this measurement captures the average throughput, which may not be a direct measure of processing cost for content which is sensitive to memory access latency. In addition there will be some crosstalk caused by compute and fragment workloads running concurrently on the same hardware. This expression is therefore indicative of cost, but does not reflect precise costing.

Streamline expression:

$MaliCoreCyclesFragmentActive / ($MaliCoreWarpsFragmentWarps * 8)

Shader data path utilization

These counters show the total activity level of the major data paths in the shader core. These can help indicate which workload type should be reviewed, and whether there are any scheduling issues.

Compute utilization

This expression defines the percentage utilization of the shader core compute path.

Streamline expression:

min(($MaliCoreCyclesComputeActive / $MaliGPUCyclesGPUActive) * 100, 100)

Fragment utilization

This expression defines the percentage utilization of the shader core fragment path.

Streamline expression:

min(($MaliCoreCyclesFragmentActive / $MaliGPUCyclesGPUActive) * 100, 100)

Fragment FPK buffer active percentage

This expression defines the percentage of cycles where the forward pixel kill (FPK) quad buffer, before the execution core, contains at least one quad.

Streamline expression:

min(($MaliCoreCyclesFragmentFPKBActive / $MaliCoreCyclesFragmentActive) * 100, 100)

Execution core utilization

This expression defines the percentage utilization of the programmable execution core. A low utilization indicates possible lost performance, because there are spare shader core cycles that could be used if they were accessible.

In some use cases this is unavoidable, because there are regions in a render pass where there is no shader workload to process. For example, a clear color tile that contains no shaded geometry, or a shadow map that can be resolved entirely using early ZS depth updates.

Aim to optimize screen regions that contain high volumes of redundant geometry, causing the programmable core to run out of work to process because the fragment front-end can not generate warps fast enough. This can be caused by a high percentage of triangles that are killed by ZS testing or FPK hidden surface removal, or by a very high density of micro-triangles which each generate low numbers of threads.

Streamline expression:

min(($MaliCoreCyclesExecutionCoreActive / $MaliGPUCyclesGPUActive) * 100, 100)

Shader core functional units

These counters provide views of the activity for the various programmable and fixed-function units inside the programmable shader core, which are responsible for executing shader programs.

Shader unit utilization

These counters provide normalized views of the functional unit activity inside the shader core. The most heavily utilized functional unit should be the target for optimizations to improve performance, although reducing any of the units' load will be good for energy efficiency.

Execution engine utilization

This expression defines the percentage utilization of the execution engine.

Streamline expression:

min(($MaliCoreEEInstructionsExecutedInstructions / $MaliCoreCyclesExecutionCoreActive) * 100, 100)

Varying unit utilization

This expression defines the percentage utilization of the varying unit.

Streamline expression:

min((($MaliCoreVaryingCycles32BitInterpolationActive + $MaliCoreVaryingCycles16BitInterpolationActive) / $MaliCoreCyclesExecutionCoreActive) * 100, 100)

Texture unit utilization

This expression defines the percentage utilization of the texturing unit.

Streamline expression:

min(($MaliCoreTextureCyclesTexturingActive / $MaliCoreCyclesExecutionCoreActive) * 100, 100)

Load/store unit utilization

This expression defines the percentage utilization of the load/store unit.

Streamline expression:

min((($MaliCoreLoadStoreCyclesFullReadCycles + $MaliCoreLoadStoreCyclesPartialReadCycles + $MaliCoreLoadStoreCyclesFullWriteCycles + $MaliCoreLoadStoreCyclesPartialWriteCycles + $MaliCoreLoadStoreCyclesAtomicAccessCycles) / $MaliCoreCyclesExecutionCoreActive) * 100, 100)

Shader workload properties

These counters provide normalized views of specific workload properties that can impact efficiency or hint at potential opportunities to optimize.

Partial coverage rate

This expression defines the percentage of warps that contain samples with no coverage. A high percentage can indicate that your content has a high density of small triangles, which is expensive. To avoid this, use mesh level-of-detail algorithms to select simpler meshes as objects move further from the camera.

Streamline expression:

min(($MaliCoreWarpsPartialFragmentWarps / $MaliCoreWarpsFragmentWarps) * 100, 100)

Full quad warp rate

This expression defines the percentage of warps that are fully populated with quads. If there are many warps that are not full then performance may be lower, because thread slots in the warp are unused. Full warps are more likely if:

  • Compute shaders have work groups that are a multiple of warp size.
  • Draw calls avoid high numbers of small primitives.

Streamline expression:

min(($MaliCoreWarpsFullQuadWarps / ($MaliCoreWarpsComputeWarps + $MaliCoreWarpsFragmentWarps)) * 100, 100)

Diverged instruction issue rate

This expression defines the percentage of instructions that have control flow divergence across the warp.

Streamline expression:

min(($MaliCoreEEInstructionsDivergedInstructions / $MaliCoreEEInstructionsExecutedInstructions) * 100, 100)

All registers warp rate

This expression defines the percentage of warps that require more than 32 registers. If this number is high, the reduced number of concurrently running threads will start to impact the ability of the GPU to stay busy, especially under conditions of high memory latency.

Streamline expression:

min(($MaliCoreWarpsAllRegisterWarps / ($MaliCoreWarpsComputeWarps + $MaliCoreWarpsFragmentWarps)) * 100, 100)

Constant tile kill rate

This expression defines the percentage of tiles that are killed by the transaction elimination CRC check. If a high percentage of tile writes are being killed, this indicates that a significant part of the framebuffer is static from frame to frame. Consider using scissor rectangles to reduce the area that is redrawn. To help manage the partial frame updates for window surfaces consider using EGL extensions such as:

  • EGL_KHR_partial_update
  • EGL_EXT_swap_buffers_with_damage

Streamline expression:

min(($MaliCoreTilesConstantTilesKilled / $MaliCoreTilesTiles) * 100, 100)

Shader core varying unit

These counters show the usage of the varying unit, which is used for varying interpolation in fragment shaders. If the shader core utilization counters show that this unit is a bottleneck, these counters might provide some indication of optimization opportunities.

The interpolator has a 32-bit data path, so 16-bit interpolation is twice as fast as 32-bit interpolation. It is recommended to use mediump (16-bit) varying inputs to fragment shaders whenever possible. It is recommended to pack 16-bit values into vec2 or vec4 values; e.g. a single vec4 will interpolate faster than a separate vec3 + float pair.

Varying unit usage

Varying cycles

This expression defines the total number of cycles where the varying interpolator is active.

Streamline expression:

$MaliCoreVaryingCycles32BitInterpolationActive + $MaliCoreVaryingCycles16BitInterpolationActive

16-bit interpolation active

This counter increments for every 16-bit interpolation cycle processed by the varying unit.

Streamline variable name:

$MaliCoreVaryingCycles16BitInterpolationActive

32-bit interpolation active

This counter increments for every 32-bit interpolation cycle processed by the varying unit. 32-bit interpolation is half the performance of 16-bit interpolation, so if content is varying bound and this counter is high consider reducing precision of varying inputs to fragment shaders.

Streamline variable name:

$MaliCoreVaryingCycles32BitInterpolationActive
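
Because the 32-bit path runs at half the rate of the 16-bit path, these two counters give a rough upper bound on the cycles that could be saved by moving highp varyings to mediump. A sketch with hypothetical counter values:

  # Estimate the varying-cycle saving from converting 32-bit (highp)
  # interpolation to 16-bit (mediump), which runs twice as fast.
  cycles_16bit = 300_000   # hypothetical counter deltas
  cycles_32bit = 500_000

  total_cycles = cycles_16bit + cycles_32bit
  # If all 32-bit interpolation became 16-bit it would take ~half the cycles.
  potential_cycles = cycles_16bit + cycles_32bit / 2.0
  saving_pct = 100.0 * (total_cycles - potential_cycles) / total_cycles

  print(f"Varying cycles: {total_cycles}, potential saving ~{saving_pct:.0f}%")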

Shader core texture unit

These counters show the usage of the texturing unit, which is used for all texture sampling and filtering. If the shader core utilization counters show that this unit is a bottleneck, these counters might provide some indication of optimization opportunities.

Texture unit usage

These counters show the usage of the texturing unit, and the average number of cycles per instruction. The performance of the texture unit in the shader core varies for different GPUs in the Bifrost family. For Mali-G52 the best case performance (bilinear filtered samples) is 0.5 cycles per sample.

Texture filtering cycles

This counter increments for every texture filtering issue cycle. Some instructions take more than one cycle due to multi-cycle data access and filtering operations. The costs per 4 sample quad are:

  • 2D bilinear filtering takes two cycles.
  • 2D trilinear filtering takes four cycles.
  • 3D bilinear filtering takes four cycles.
  • 3D trilinear filtering takes eight cycles.

Streamline variable name:

$MaliCoreTextureCyclesTexturingActive
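
Using the per-quad costs listed above, the sketch below estimates the expected filtering cycles for a mix of filter types and the resulting average cycles per sample; the quad counts are hypothetical.

  # Expected texture filtering cycles from the per-quad costs listed above.
  CYCLES_PER_QUAD = {
      "2d_bilinear": 2,
      "2d_trilinear": 4,
      "3d_bilinear": 4,
      "3d_trilinear": 8,
  }

  quads = {                      # hypothetical per-frame quad counts
      "2d_bilinear": 800_000,
      "2d_trilinear": 150_000,
      "3d_bilinear": 0,
      "3d_trilinear": 10_000,
  }

  total_cycles = sum(CYCLES_PER_QUAD[k] * n for k, n in quads.items())
  total_samples = 4 * sum(quads.values())   # each quad is 4 samples

  print(f"Estimated filtering cycles: {total_cycles}, "
        f"average {total_cycles / total_samples:.2f} cycles/sample "
        f"(best case 0.5)")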

Texture filtering cycles per instruction

This expression defines the average number of texture filtering cycles per instruction. For texture unit limited content where the average cost is higher than the best case implied by the Texture samples per cycle constant, review any use of multi-cycle operations and consider using simpler texture filters. See Texture filtering cycles for details of the expected performance for different types of operation.

Streamline expression:

$MaliCoreTextureCyclesTexturingActive / ($MaliCoreTextureQuadsTextureRequests * 4)

Texture unit workload properties

These counters show the content behavior in the texture unit, in terms of the number of accesses using texture compression, mipmapping, or trilinear filtering probes.

Texture accesses using trilinear filter percentage

This expression defines the percentage of texture operations using trilinear filtering.

Streamline expression:

min(($MaliCoreTextureQuadsTrilinearFilteredIssues / $MaliCoreTextureQuadsTextureRequests) * 100, 100)

Texture accesses using mipmapped texture percentage

This expression defines the percentage of texture operations accessing mipmapped textures.

Streamline expression:

min(($MaliCoreTextureQuadsMipmappedTextureIssues / $MaliCoreTextureQuadsTextureIssues) * 100, 100)

Texture unit memory usage

These counters show the average number of bytes read from the L2 cache or external memory per texture sample.

Texture bytes read from L2 per texture cycle

This expression defines the average number of bytes read from the L2 memory system by the texture unit per filtering cycle. This metric indicates how well textures are being cached in the L1 texture cache. If a high number of bytes are being requested per access, where high depends on the texture formats you are using, it can be worth reviewing texture settings:

  • Enable mipmaps for offline generated textures
  • Use ASTC or ETC compression for offline generated textures
  • Replace run-time generated framebuffer and texture formats with a narrower format
  • Reduce any use of negative LOD bias used for texture sharpening
  • Reduce the MAX_ANISOTROPY level for anisotropic filtering

Streamline expression:

($MaliCoreL2ReadsTextureL2ReadBeats * 16) / $MaliCoreTextureCyclesTexturingActive

Texture bytes read from external memory per texture cycle

This expression defines the average number of bytes read from the external memory system by the texture unit per filtering cycle. This metric indicates how well textures are being cached in the L2 cache. If a high number of bytes are being requested per access, where high depends on the texture formats you are using, it can be worth reviewing texture settings:

  • Enable mipmaps for offline generated textures
  • Use ASTC or ETC compression for offline generated textures
  • Replace run-time generated framebuffer and texture formats with a narrower format
  • Reduce any use of negative LOD bias used for texture sharpening
  • Reduce the MAX_ANISOTROPY level for anisotropic filtering

Streamline expression:

($MaliCoreL2ReadsTextureExternalReadBeats * 16) / $MaliCoreTextureCyclesTexturingActive

Shader core load/store unit

These counters show the content behavior in the load/store unit. This unit is used for all shader memory accesses except texturing and framebuffer write-back.

Load/store unit usage

These counters show the content behavior in the load/store unit, in terms of the number of reads and writes being made, and whether those loads use the full width of the available data path.

A key memory access optimization for compute shaders is to make effective use of the data width the load/store interface provides. It is recommended to make vector memory accesses in each thread, and to ensure that threads in the same warp access overlapping or sequential addresses inside a single 64 byte address range.
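
One way to reason about this is to count how many distinct 64 byte ranges the 8 threads of a warp touch for a given access pattern: sequential 4-byte loads fit in a single range, while strided loads spread across several. A rough sketch, with the stride values chosen as examples:

  # Count distinct 64-byte address ranges touched by one warp of 8 threads,
  # each loading 4 bytes at the given stride (addresses are illustrative).
  WARP_WIDTH = 8
  ACCESS_BYTES = 4
  RANGE_BYTES = 64

  def ranges_touched(base, stride_bytes):
      addresses = [base + i * stride_bytes for i in range(WARP_WIDTH)]
      lines = {addr // RANGE_BYTES for addr in addresses}
      # Include the range of the final byte in case an access straddles a boundary.
      lines |= {(addr + ACCESS_BYTES - 1) // RANGE_BYTES for addr in addresses}
      return len(lines)

  for stride in (4, 16, 64):
      print(f"stride {stride:>2} bytes -> {ranges_touched(0, stride)} x 64-byte ranges per warp")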

Load/store total issues

This expression defines the total number of load/store issue cycles. Note that this counter ignores secondary effects such as cache misses, so it provides the best-case cycle usage.

Streamline expression:

$MaliCoreLoadStoreCyclesFullReadCycles + $MaliCoreLoadStoreCyclesPartialReadCycles + $MaliCoreLoadStoreCyclesFullWriteCycles + $MaliCoreLoadStoreCyclesPartialWriteCycles + $MaliCoreLoadStoreCyclesAtomicAccessCycles

Load/store full read issues

This counter increments for every full-width load/store cache read.

Streamline variable name:

$MaliCoreLoadStoreCyclesFullReadCycles

Load/store partial read issues

This counter increments for every partial-width load/store cache read. Partial data accesses do not make full use of the load/store cache capability, so efficiency can be improved by merging short accesses together to make fewer larger access requests. To do this in shader code:

  • Use vector data loads
  • Avoid padding in strided data accesses
  • Write compute shaders so that adjacent threads in a warp access adjacent addresses in memory.

Streamline variable name:

$MaliCoreLoadStoreCyclesPartialReadCycles

Load/store full write issues

This counter increments for every full-width load/store cache write.

Streamline variable name:

$MaliCoreLoadStoreCyclesFullWriteCycles

Load/store partial write issues

This counter increments for every partial-width load/store cache write. Partial data accesses do not make full use of the load/store cache capability, so efficiency can be improved by merging short accesses together to make fewer larger access requests. To do this in shader code:

  • Use vector data stores
  • Avoid padding in strided data accesses
  • Write compute shaders so that adjacent threads in a warp access adjacent addresses in memory.

Streamline variable name:

$MaliCoreLoadStoreCyclesPartialWriteCycles

Load/store atomic issues

This counter increments for every load/store atomic access. Atomic memory accesses are typically multi-cycle operations per thread in the warp, so they are exceptionally expensive. Minimize the use of atomics in performance critical code.

Streamline variable name:

$MaliCoreLoadStoreCyclesAtomicAccessCycles

Load/store unit memory usage

These counters show the average number of bytes read or written to the L2 cache per load/store read or write. This can be used to see how well workloads are using the GPU L1 and L2 caches, although knowing what "good" looks like requires knowledge of the algorithm being executed.

Load/store bytes read from L2 per access cycle

This expression defines the average number of bytes read from the L2 memory system by the load/store unit per read cycle. This metric gives some idea how well data is being cached in the L1 load/store cache. If a high number of bytes are being requested per access, where high depends on the buffer formats you are using, it can be worth reviewing data formats and access patterns.

Streamline expression:

($MaliCoreL2ReadsLoadStoreL2ReadBeats * 16) / ($MaliCoreLoadStoreCyclesFullReadCycles + $MaliCoreLoadStoreCyclesPartialReadCycles)

Load/store bytes read from external memory per access cycle

This expression defines the average number of bytes read from the external memory system by the load/store unit per read cycle. This metric indicates how well data is being cached in the L2 cache. If a high number of bytes are being requested per access, where high depends on the buffer formats you are using, it can be worth reviewing data formats and access patterns.

Streamline expression:

($MaliCoreL2ReadsLoadStoreExternalReadBeats * 16) / ($MaliCoreLoadStoreCyclesFullReadCycles + $MaliCoreLoadStoreCyclesPartialReadCycles)

Load/store bytes written to L2 per access cycle

This expression defines the average number of bytes written to the L2 memory system by the load/store unit per write cycle.

Streamline expression:

(($MaliCoreWritesLoadStoreWritebackWriteBeats + $MaliCoreWritesLoadStoreOtherWriteBeats) * 16) / ($MaliCoreLoadStoreCyclesFullWriteCycles + $MaliCoreLoadStoreCyclesPartialWriteCycles)

Shader core memory traffic

These counters show the total amount of memory access to the L2 and external memory made by the different parts of the shader core. This can be used to identify where memory bandwidth is being consumed.

Load/store read bytes from L2 cache

This expression defines the total number of bytes read from the L2 memory system by the load/store unit.

Streamline expression:

$MaliCoreL2ReadsLoadStoreL2ReadBeats * 16

Texture read bytes from L2 cache

This expression defines the total number of bytes read from the L2 memory system by the texture unit.

Streamline expression:

$MaliCoreL2ReadsTextureL2ReadBeats * 16

Load/store read bytes from external memory

This expression defines the total number of bytes read from the external memory system by the load/store unit.

Streamline expression:

$MaliCoreL2ReadsLoadStoreExternalReadBeats * 16

Texture read bytes from external memory

This expression defines the total number of bytes read from the external memory system by the texture unit.

Streamline expression:

$MaliCoreL2ReadsTextureExternalReadBeats * 16

Load/store write bytes

This expression defines the total number of bytes written to the L2 memory system by the load/store unit.

Streamline expression:

($MaliCoreWritesLoadStoreWritebackWriteBeats + $MaliCoreWritesLoadStoreOtherWriteBeats) * 16

Tile buffer write bytes

This expression defines the total number of bytes written to the L2 memory system by the tile buffer writeback unit.

Streamline expression:

$MaliCoreWritesTileBufferWriteBeats * 16