Mali Midgard counters

Arm Streamline can capture performance counters for each functional block in the design of a Mali Midgard GPU. These blocks are Job Manager, Shader cores, Tiler, and L2 caches. This topic describes the counters in each block.

Midgard GPUs implement many performance counters natively in the hardware. You can also generate derived counters by combining raw hardware counters. This topic describes all the Mali Midgard counters that are available in Arm Streamline, and some of the useful derived counters. To minimize the impact on performance and power from extra hardware logic, many of the counters are close approximations of the described behavior. Therefore, counter values might deviate slightly from the described behavior.

Note

  • The naming convention for Mali Midgard counters in Arm Streamline is ARM_Mali-T<GPUID>_<COUNTER_NAME>.
  • If a counter is only available for specific GPUs, the description states which GPUs it is available for.
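
As a small illustration of the naming convention, the following sketch builds and splits counter names. The helper names and the GPU ID value 760 are hypothetical, and the exact ID string that appears in a given capture might differ.

```python
import re
from typing import Tuple

# Pattern for ARM_Mali-T<GPUID>_<COUNTER_NAME>.
# Assumption: the GPUID part never contains an underscore.
_COUNTER_RE = re.compile(r"^ARM_Mali-T(?P<gpu_id>[^_]+)_(?P<counter>.+)$")


def streamline_name(gpu_id: str, counter: str) -> str:
    """Build the Streamline name for a counter on a given GPU."""
    return f"ARM_Mali-T{gpu_id}_{counter}"


def split_streamline_name(name: str) -> Tuple[str, str]:
    """Split a Streamline name into (GPUID, COUNTER_NAME)."""
    match = _COUNTER_RE.match(name)
    if match is None:
        raise ValueError(f"not a Mali Midgard counter name: {name!r}")
    return match.group("gpu_id"), match.group("counter")


name = streamline_name("760", "GPU_ACTIVE")
print(name)                         # ARM_Mali-T760_GPU_ACTIVE
print(split_streamline_name(name))  # ('760', 'GPU_ACTIVE')
```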

Job Manager counters

This section describes the counters that the Mali Job Manager component implements.

The following counters provide information about the number of cycles that the GPU spends processing workloads:

GPU_ACTIVE
Increments for every cycle that the GPU has any workload queued in a Job slot, or the GPU cycle counter is running for OpenCL profiling. If the GPU is waiting for external memory to return data, it is still counted as active if there is a workload in the queue.
GPU_UTILIZATION (Derived)
The overall GPU utilization.
GPU_UTILIZATION = GPU_ACTIVE / GPU_MHZ

Note

If your device supports Dynamic Voltage and Frequency Scaling (DVFS), the GPU frequency is often not constant while running content. If possible, disable platform DVFS to lock the processor, GPU, and memory bus at a fixed frequency.
JS0_ACTIVE
Increments every cycle that the GPU is running a Job chain in Job slot 0. Corresponds directly to fragment shading workloads because this Job slot is only used for processing fragment Jobs.
JS1_ACTIVE
Increments every cycle that the GPU is running a Job chain in Job slot 1. This Job slot can be used for compute shaders, vertex shaders, and tiling workloads, but the counter does not differentiate between them.
JS2_ACTIVE
Increments every cycle that the GPU is running a Job chain in Job slot 2. This Job slot can be used for compute shaders and vertex shaders.
IRQ_ACTIVE
Increments every cycle that the GPU has an interrupt waiting to be handled by the driver running on the processor.
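
The GPU_UTILIZATION formula does not state its units, so the following sketch makes them explicit. It assumes that GPU_ACTIVE is a cycle count accumulated over a sample window of window_s seconds, and that the GPU runs at a fixed gpu_mhz for the whole window. Both parameter names are hypothetical. With DVFS enabled, the result is only as reliable as the frequency value that you supply.

```python
def gpu_utilization(gpu_active: int, gpu_mhz: float, window_s: float = 1.0) -> float:
    """Fraction of the available GPU cycles in which GPU_ACTIVE incremented.

    Assumes a fixed GPU clock of gpu_mhz during a window of window_s seconds.
    """
    available_cycles = gpu_mhz * 1_000_000 * window_s
    if available_cycles <= 0:
        raise ValueError("frequency and window length must be positive")
    return gpu_active / available_cycles


# 300 million active cycles in one second at 600 MHz.
print(gpu_utilization(300_000_000, 600.0))  # 0.5
```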

The following counters relate to how the Job Manager issues work to the shader cores:

JS0_TASKS (1)

Available for Mali-T600, Mali-T620, and Mali-T720.

Increments for every task the Job Manager issues to a shader core. For JS0, these tasks correspond to a 16x16 pixel screen region.

JS0_TASKS (2)

Available for Mali-T760 and Mali-T800 series.

Increments for every task the Job Manager issues to a shader core. For JS0, these tasks correspond to a 32x32 pixel screen region.

JS1_TASKS
Increments for every task the Job Manager issues to a shader core or the tiler. For JS1, these tasks correspond to a range of vertices or compute work items for shader cores, or a range of indices for the tiler.
JS2_TASKS
Increments for every task the Job Manager issues to a shader core. For JS2, these tasks correspond to a range of vertices or compute work items.

Shader Core counters

This section describes the counters that the Mali Shader Core implements. The GPU hardware records each counter separately for each shader core. Arm Streamline displays the average value of the counter across all the shader cores.

Note

This section refers to fragment workloads or compute workloads. Vertex workloads are treated as a one-dimensional compute problem by the shader core, so they are counted as a compute workload.

The following counters show the total activity level of the shader core:

FRAG_ACTIVE
Increments every cycle that at least one fragment task is active inside the shader core.
COMPUTE_ACTIVE
Increments every cycle that at least one compute task is active inside the shader core.
TRIPIPE_ACTIVE
Increments every cycle that at least one thread is active inside the programmable tri-pipe.
TRIPIPE_UTILIZATION (Derived)
An approximation of the overall utilization of the tri-pipe.
TRIPIPE_UTILIZATION = TRIPIPE_ACTIVE / GPU_ACTIVE
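
A minimal sketch of the TRIPIPE_UTILIZATION calculation follows. The function name and the choice to return 0.0 for an idle sample are assumptions, not behavior defined by Arm Streamline.

```python
def tripipe_utilization(tripipe_active: int, gpu_active: int) -> float:
    """TRIPIPE_ACTIVE / GPU_ACTIVE, or 0.0 when the GPU was idle in the sample."""
    if gpu_active == 0:
        return 0.0
    return tripipe_active / gpu_active


print(tripipe_utilization(750, 1000))  # 0.75
```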

The following counters show the task and thread issue behavior of the fixed-function compute frontend of the shader core, which issues work into the programmable tri-pipe:

COMPUTE_TASKS
Increments for every compute task that the shader core handles.
COMPUTE_THREADS
Increments for every compute thread that the shader core spawns.

The following counters show the task and thread issue behavior of the fixed-function fragment frontend of the shader core:

FRAG_PRIMITIVES
Increments for every primitive that is read from the tile list.
FRAG_PRIMITIVES_DROPPED
Increments for every primitive that is read from the tile list and then discarded because it is not relevant for the tile being rendered.
THREADS_PER_PRIMITIVE_LOAD (Derived)
The number of fragment threads that are issued per primitive.
THREADS_PER_PRIMITIVE_LOAD = FRAG_THREADS / (FRAG_PRIMITIVES - FRAG_PRIMITIVES_DROPPED)
FRAG_QUADS_RAST
Increments for every 2x2 pixel quad that the rasterization unit rasterizes.
FRAG_QUADS_EZS_TEST
Increments for every 2x2 pixel quad that undergoes early depth and stencil (ZS) test and update operations.
FRAG_QUADS_EZS_KILLED
Increments for every 2x2 pixel quad that early ZS testing kills.
FRAG_CYCLES_NO_TILE
Increments every cycle that a lack of available physical tile memory blocks the shader core early ZS unit from progressing.
FRAG_CYCLES_QUADS_BUFFERED

Not available for Mali-T600.

Increments every cycle that the quad buffer contains at least one 2x2 pixel quad waiting to be executed in the tri-pipe.

FRAG_THREADS
Increments for every real or dummy fragment thread that the GPU creates.
FRAG_DUMMY_THREADS
Increments for every dummy fragment thread that the GPU creates.
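
The THREADS_PER_PRIMITIVE_LOAD formula above can be sketched as follows. The function name is hypothetical, and returning 0.0 when no primitives survive is an assumption made to avoid a division by zero.

```python
def threads_per_primitive_load(frag_threads: int,
                               frag_primitives: int,
                               frag_primitives_dropped: int) -> float:
    """FRAG_THREADS divided by the primitives that were read and not dropped."""
    kept = frag_primitives - frag_primitives_dropped
    if kept <= 0:
        return 0.0
    return frag_threads / kept


# 500 primitives read, 100 dropped, 12000 fragment threads created.
print(threads_per_primitive_load(12_000, 500, 100))  # 30.0
```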

The following counters record the fragment backend behavior:

FRAG_THREADS_LZS_TEST
Increments for every thread that triggers late ZS testing.
FRAG_THREADS_LZS_KILLED
Increments for every thread that late ZS testing kills.
FRAG_NUM_TILES (1)

Available for Mali-T600, Mali-T620, and Mali-T720.

Increments for every 16x16 pixel tile that the shader core renders.

FRAG_NUM_TILES (2)

Available for Mali-T760 and Mali-T800 series.

Increments for every 32x32 pixel tile that the shader core renders.

FRAG_TRANS_ELIM
Increments for every physical rendered tile that has its writeback canceled due to a matching transaction elimination CRC hash.

The following counters show the behavior of the arithmetic pipe:

ARITH_WORDS (1)

Not available for Mali-T720, Mali-T820, and Mali-T830.

Increments for every arithmetic instruction that is architecturally executed. This counter is normalized to give the per-pipe performance.

ARITH_WORDS (2)

Available for Mali-T720, Mali-T820, and Mali-T830.

Increments for every batched arithmetic instruction that is executed.

ARITH_ARCH_UTILIZATION (Derived)
The utilization of the arithmetic hardware.
ARITH_ARCH_UTILIZATION = ARITH_WORDS / TRIPIPE_ACTIVE

The following counters show the behavior of the load/store pipe:

LS_WORDS
Increments for every load or store instruction that is architecturally executed.
LS_ARCH_UTILIZATION (Derived)
The architectural utilization of the load/store pipe.
LS_ARCH_UTILIZATION = LS_WORDS / TRIPIPE_ACTIVE
LS_ISSUES (1)

Available for Mali-T600, Mali-T620, and Mali-T720.

Increments for every load or store instruction that is issued, or reissued due to varying or data cache misses.

LS_ISSUES (2)

Available for Mali-T760 and Mali-T800 series.

Increments for every load or store instruction that is issued, or reissued due to varying cache misses.

LS_UARCH_UTILIZATION (Derived)
The microarchitectural utilization of the load/store pipe.
LS_UARCH_UTILIZATION = LS_ISSUES / TRIPIPE_ACTIVE
LS_CPI (Derived)
Cycles Per Instruction.
LS_CPI = LS_ISSUES / LS_WORDS
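
The three derived load/store counters share the same raw inputs, so they can be computed together. The following sketch assumes one sample of raw counter values. The class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoadStoreSample:
    """Raw counter values for one sample (names mirror the counters above)."""
    ls_words: int
    ls_issues: int
    tripipe_active: int

    def arch_utilization(self) -> float:
        # LS_ARCH_UTILIZATION = LS_WORDS / TRIPIPE_ACTIVE
        return self.ls_words / self.tripipe_active if self.tripipe_active else 0.0

    def uarch_utilization(self) -> float:
        # LS_UARCH_UTILIZATION = LS_ISSUES / TRIPIPE_ACTIVE
        return self.ls_issues / self.tripipe_active if self.tripipe_active else 0.0

    def cpi(self) -> float:
        # LS_CPI = LS_ISSUES / LS_WORDS
        return self.ls_issues / self.ls_words if self.ls_words else 0.0


sample = LoadStoreSample(ls_words=400, ls_issues=500, tripipe_active=1000)
print(sample.arch_utilization())   # 0.4
print(sample.uarch_utilization())  # 0.5
print(sample.cpi())                # 1.25
```

An LS_CPI value above 1.0 means that some instructions were issued more than once, which matches the reissue behavior described for LS_ISSUES.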

The following counters monitor the performance of the load/store cache:

LSC_READ_HITS
Increments for every load/store L1 cache read access that is a hit.
LSC_READ_MISSES

Available for Mali-T600, Mali-T620, and Mali-T720.

Increments for every load/store L1 cache read access that is a miss.

LSC_READ_HITRATE (Derived)
The percentage of the total number of read accesses that are hits.
LSC_READ_HITRATE = LSC_READ_HITS / (LSC_READ_HITS + LSC_READ_MISSES)
LSC_READ_OPS

Available for Mali-T760 and Mali-T800 series.

Increments for every load/store L1 cache read access.

LSC_READ_HITRATE (Derived)
The percentage of the total number of read accesses that are hits.
LSC_READ_HITRATE = LSC_READ_HITS / LSC_READ_OPS
LSC_WRITE_HITS
Increments for every load/store L1 cache write access that is a hit.
LSC_WRITE_MISSES

Available for Mali-T600, Mali-T620, and Mali-T720.

Increments for every load/store L1 cache write access that is a miss.

LSC_WRITE_HITRATE (Derived)
The percentage of the total number of write accesses that are hits.
LSC_WRITE_HITRATE = LSC_WRITE_HITS / (LSC_WRITE_HITS + LSC_WRITE_MISSES)
LSC_WRITE_OPS

Available for Mali-T760 and Mali-T800 series.

Increments for every load/store L1 cache write access.

LSC_WRITE_HITRATE (Derived)
The percentage of the total number of write accesses that are hits.
LSC_WRITE_HITRATE = LSC_WRITE_HITS / LSC_WRITE_OPS
LSC_ATOMIC_HITS
Increments for every atomic memory access that hits in the L1 atomic cache.
LSC_ATOMIC_MISSES

Available for Mali-T600, Mali-T620, and Mali-T720.

Increments for every atomic memory access that misses in the L1 atomic cache.

LSC_ATOMIC_HITRATE (Derived)
The percentage of the total number of atomic memory accesses that are hits.
LSC_ATOMIC_HITRATE = LSC_ATOMIC_HITS / (LSC_ATOMIC_HITS + LSC_ATOMIC_MISSES)
LSC_ATOMIC_OPS

Available for Mali-T760 and Mali-T800 series.

Increments for every atomic memory access.

LSC_ATOMIC_HITRATE (Derived)
The percentage of the total number of atomic memory accesses that are hits.
LSC_ATOMIC_HITRATE = LSC_ATOMIC_HITS / LSC_ATOMIC_OPS
LSC_LINE_FETCHES
Increments for every line that the L1 cache fetches from the L2 memory system.
LSC_DIRTY_LINE
Increments for every dirty line that is evicted from the L1 cache into the L2 memory system.
LSC_SNOOPS
Increments for every snoop into the L1 cache from the L2 memory system.
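
The hit rate formulas above come in two forms, depending on whether the GPU exposes a MISSES counter or an OPS counter. The following sketch handles both forms with one hypothetical helper. It returns a fraction between 0 and 1, so multiply by 100 for a percentage.

```python
from typing import Optional


def lsc_hit_rate(hits: int,
                 misses: Optional[int] = None,
                 ops: Optional[int] = None) -> float:
    """Hit rate from HITS plus either MISSES or OPS, whichever the GPU provides."""
    if (misses is None) == (ops is None):
        raise ValueError("pass exactly one of misses or ops")
    # With a MISSES counter the total is hits + misses; with OPS it is ops itself.
    total = hits + misses if misses is not None else ops
    return hits / total if total else 0.0


# GPUs with LSC_READ_MISSES: HITS / (HITS + MISSES)
print(lsc_hit_rate(90, misses=10))  # 0.9
# GPUs with LSC_READ_OPS: HITS / OPS
print(lsc_hit_rate(90, ops=120))    # 0.75
```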

The following counters show the texture pipe behavior:

TEX_WORDS
Increments for every architecturally executed texture instruction.
TEX_ISSUES (1)

Available for Mali-T600, Mali-T620, and Mali-T720.

Increments for every texture issue cycle used. Some instructions take more than one cycle due to data cache misses or multi-cycle filtering operations.

TEX_ISSUES (2)

Available for Mali-T760 and Mali-T800 series.

Increments for every texture issue cycle used. Some instructions take more than one cycle due to multi-cycle filtering operations.

TEX_CPI (Derived)
Cycles Per Instruction.
TEX_CPI = TEX_ISSUES / TEX_WORDS

Tiler counters

The tiler counters provide details of the workload of the fixed-function tiling unit. This unit places primitives into the tile lists that the fragment frontend reads during fragment shading.

The following counters show the overall activity of the tiling unit:

TI_ACTIVE

Available for Mali-T600, Mali-T620, Mali-T760, Mali-T860, and Mali-T880.

Increments every cycle that the tiler processes a task.

The following counters give a functional breakdown of the tiling workload that is given to the GPU by the application:

TI_POINTS
Increments for every point primitive that the tiler processes. This counter increments before any clipping or culling, so it reflects the raw workload from the application.
TI_LINES
Increments for every line segment primitive that the tiler processes. This counter increments before any clipping or culling, so it reflects the raw workload from the application.
TI_TRIANGLES
Increments for every triangle primitive that the tiler processes. This counter increments before any clipping or culling, so it reflects the raw workload from the application.

The following counters show how clipping and culling affect the workload:

TI_PRIM_VISIBLE
Increments for every primitive that is visible according to its type, clip-space coordinates, and front-face or back-face orientation.
TI_PRIM_CULLED
Increments for every primitive that is culled due to the application of front-face or back-face culling rules.
TI_PRIM_CLIPPED
Increments for every primitive that is culled because it is outside of the clip-space volume.
TI_FRONT_FACING
Increments for every triangle that is front-facing. This counter increments after culling, so it only counts visible primitives that are emitted into the tile list.
TI_BACK_FACING
Increments for every triangle that is back-facing. This counter increments after culling, so it only counts visible primitives that are emitted into the tile list.

L2 Cache counters

This section describes the behavior of the L2 memory system counters. In systems that implement multiple L2 caches or bus interfaces, the counters that Arm Streamline displays are the sum of the counters from all the L2 counter blocks.

The following counters profile the internal read traffic into the L2 cache from the various internal masters:

L2_READ_LOOKUP
Increments for every read transaction that the L2 cache receives.
L2_READ_HITS
Increments for every read transaction that the L2 cache receives that also hits in the cache.
L2_READ_HITRATE (Derived)
The percentage of read transactions that the L2 cache receives that also hit in the cache.
L2_READ_HITRATE = L2_READ_HITS / L2_READ_LOOKUP
L2_READ_SNOOP
Increments for every inner coherent read snoop transaction that the L2 cache receives.

The following counters profile the internal write traffic into the L2 cache from the various internal masters:

L2_WRITE_LOOKUP
Increments for every write transaction that the L2 cache receives.
L2_WRITE_HITS
Increments for every write transaction that the L2 cache receives that also hits in the cache.
L2_WRITE_HITRATE (Derived)
The percentage of write transactions that the L2 cache receives that also hit in the cache.
L2_WRITE_HITRATE = L2_WRITE_HITS / L2_WRITE_LOOKUP
L2_WRITE_SNOOP
Increments for every inner coherent write snoop transaction that the L2 cache receives.
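
The read and write hit rates follow the same pattern, so both can be derived in one pass over a set of raw counter values. The following sketch assumes the counters arrive as a dictionary keyed by counter name, which is a hypothetical input format.

```python
from typing import Dict


def l2_hit_rates(counters: Dict[str, int]) -> Dict[str, float]:
    """Derive L2_READ_HITRATE and L2_WRITE_HITRATE from raw LOOKUP and HITS values."""
    rates = {}
    for kind in ("READ", "WRITE"):
        lookups = counters[f"L2_{kind}_LOOKUP"]
        hits = counters[f"L2_{kind}_HITS"]
        # Report 0.0 rather than dividing by zero when there was no traffic.
        rates[f"L2_{kind}_HITRATE"] = hits / lookups if lookups else 0.0
    return rates


sample = {
    "L2_READ_LOOKUP": 2000, "L2_READ_HITS": 1500,
    "L2_WRITE_LOOKUP": 800, "L2_WRITE_HITS": 200,
}
print(l2_hit_rates(sample))
# {'L2_READ_HITRATE': 0.75, 'L2_WRITE_HITRATE': 0.25}
```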

The following counters profile the external read memory interface behavior:

Note

This behavior includes traffic from the entire GPU L2 memory subsystem as some types of access bypass the L2 cache.
L2_EXT_READ_BEATS
Increments on every clock cycle that a read beat is read off the AXI bus.
L2_EXT_READ_BYTES (Derived)
Converts the beat counter into a raw bandwidth counter.
L2_EXT_READ_BYTES = L2_EXT_READ_BEATS * L2_AXI_WIDTH_BYTES
L2_EXT_READ_UTILIZATION (Derived)
The total percentage of available AXI port bandwidth that is used.
L2_EXT_READ_UTILIZATION = L2_EXT_READ_BEATS / (L2_AXI_PORT_COUNT * GPU_ACTIVE)

Note

Normalize the number of read beats accumulated by Arm Streamline into a per-port count. If a design uses one to four shader cores, a single AXI port is present. Otherwise two AXI ports are present.
L2_EXT_R_BUF_FULL

Not available for Mali-T720.

Increments every cycle that the GPU is unable to create a read transaction because there are no free entries in the internal response buffer.

L2_EXT_RD_BUF_FULL

Not available for Mali-T720.

Increments if a read response is received and the internal read data buffer is full.

L2_EXT_AR_STALL
Increments every cycle that the GPU is unable to issue a new read transaction to AXI because AXI is unable to accept the request.
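
The external read formulas can be sketched as follows, using the port count rule from the note above. The function names are hypothetical. The AXI data width depends on the GPU configuration, and the value of 16 bytes in the example is only an assumption for illustration.

```python
def axi_port_count(shader_cores: int) -> int:
    """One AXI port for designs with one to four shader cores, otherwise two."""
    if shader_cores < 1:
        raise ValueError("a GPU has at least one shader core")
    return 1 if shader_cores <= 4 else 2


def ext_read_bytes(read_beats: int, axi_width_bytes: int) -> int:
    """L2_EXT_READ_BYTES = L2_EXT_READ_BEATS * L2_AXI_WIDTH_BYTES"""
    return read_beats * axi_width_bytes


def ext_read_utilization(read_beats: int, shader_cores: int, gpu_active: int) -> float:
    """L2_EXT_READ_UTILIZATION = L2_EXT_READ_BEATS / (L2_AXI_PORT_COUNT * GPU_ACTIVE)"""
    if gpu_active == 0:
        return 0.0
    return read_beats / (axi_port_count(shader_cores) * gpu_active)


# Six shader cores (two ports), assumed 16-byte AXI data width.
print(ext_read_bytes(200_000, 16))                   # 3200000
print(ext_read_utilization(200_000, 6, 1_000_000))   # 0.1
```

The same calculation applies to the write interface if you substitute L2_EXT_WRITE_BEATS for the read beat count.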

The following counters profile the external write memory interface behavior:

Note

This behavior includes traffic from the entire GPU L2 memory subsystem.
L2_EXT_WRITE_BEATS
Increments on every clock cycle that a write data beat is sent on the AXI bus.
L2_EXT_WRITE_BYTES (Derived)
Converts the beat counter into a raw bandwidth counter.
L2_EXT_WRITE_BYTES = L2_EXT_WRITE_BEATS * L2_AXI_WIDTH_BYTES
L2_EXT_WRITE_UTILIZATION (Derived)
The percentage of available AXI port bandwidth used.
L2_EXT_WRITE_UTILIZATION = L2_EXT_WRITE_BEATS / (L2_AXI_PORT_COUNT * GPU_ACTIVE)

Note

Normalize the number of write beats accumulated by Arm Streamline into a per-port count.
L2_EXT_W_BUF_FULL
Increments every cycle that the GPU is unable to create a new write transaction because there are no free entries in the internal write buffer.
L2_EXT_W_STALL
Increments every cycle that the GPU is unable to issue a new write transaction to AXI because AXI is unable to accept the request.

For more information about these counters, see Mali Midgard Family Performance Counters on Arm Community.
