
Mali Bifrost counters

Arm Streamline can capture performance counters for each functional block in the design of a Bifrost GPU. These blocks are Job Manager, Shader cores, Tiler, and L2 caches. This topic describes the counters in each block.

Bifrost GPUs implement many performance counters natively in the hardware. You can also generate derived counters by combining raw hardware counters. This topic describes all the Mali Bifrost counters that are available in Arm Streamline, and some of the useful derived counters. To minimize the impact on performance and power from extra hardware logic, many of the counters are close approximations of the described behavior. Therefore there may be slight deviations from the expected behavior.

Job Manager counters

This section describes the counters that the Mali Job Manager component implements.

The following counters provide top-level activity information. For example, information about the number of cycles that the GPU spends processing a workload, or waiting for the handling of completion interrupts:

JM.GPU_ACTIVE
Increments every cycle that the GPU has any workload queued in a Job slot. If the GPU is waiting for external memory to return data, it is still counted as active if there is a workload in the queue.
JM.GPU_UTILIZATION (Derived)
The overall GPU utilization.
JM.GPU_UTILIZATION = JM.GPU_ACTIVE / GPU_MHZ
JM.JS0_ACTIVE
Increments every cycle that the GPU is running a Job chain in Job slot 0. Corresponds directly to fragment shading workloads because this Job slot is only used for processing fragment Jobs.
JM.JS0_UTILIZATION (Derived)
The percentage JS0 utilization.
JM.JS0_UTILIZATION = JM.JS0_ACTIVE / JM.GPU_ACTIVE
JM.JS1_ACTIVE
Increments every cycle that the GPU is running a Job chain in Job slot 1. This Job slot can be used for compute shaders, vertex shaders, and tiling workloads, but the counter does not differentiate between them.
JM.JS1_UTILIZATION (Derived)
The percentage JS1 utilization.
JM.JS1_UTILIZATION = JM.JS1_ACTIVE / JM.GPU_ACTIVE
JM.IRQ_ACTIVE
Increments every cycle that the GPU has an interrupt waiting for the driver running on the processor to handle it.
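
As a minimal illustration, the derived utilization values above can be recomputed from raw counter deltas in a post-processing script. The variable names and numbers below are illustrative, not part of the Streamline API, and the available clock cycle count stands in for the GPU_MHZ normalization term in the formula above.

gpu_active = 480_000   # JM.GPU_ACTIVE delta for one sample (illustrative)
js0_active = 350_000   # JM.JS0_ACTIVE
js1_active = 210_000   # JM.JS1_ACTIVE
gpu_cycles = 500_000   # GPU clock cycles available in the sample period (assumed)

gpu_utilization = gpu_active / gpu_cycles   # JM.GPU_UTILIZATION
js0_utilization = js0_active / gpu_active   # JM.JS0_UTILIZATION
js1_utilization = js1_active / gpu_active   # JM.JS1_UTILIZATION

print(f"GPU {gpu_utilization:.0%}, JS0 {js0_utilization:.0%}, JS1 {js1_utilization:.0%}")

Note that the two Job slots can run in parallel, so the sum of the slot utilizations can exceed 100%.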

The following counters provide task dispatch information:

JM.JS0_TASKS
Increments for every task the Job Manager issues to a shader core. For JS0, these tasks correspond to a 32x32 pixel screen region.
JM.PIXEL_COUNT (Derived)
An approximation of the total scene pixel count.
JM.PIXEL_COUNT = JM.JS0_TASKS * 32 * 32
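
For example, assuming the 32x32 region size above, a 1920x1080 render target needs ceil(1920/32) x ceil(1080/32) = 60 x 34 = 2040 fragment tasks, so the derived pixel count slightly over-counts targets whose dimensions are not multiples of 32. A minimal check with these illustrative numbers:

js0_tasks = 2_040                   # JM.JS0_TASKS delta for one 1920x1080 frame (illustrative)
pixel_count = js0_tasks * 32 * 32   # JM.PIXEL_COUNT: each task covers a 32x32 screen region
print(pixel_count)                  # 2088960, slightly above 1920 * 1080 = 2073600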

Shader Core counters

This section describes the counters that the Mali Shader Core implements. Arm Streamline displays the average of all the shader core counters that the GPU hardware records.

Note

This section refers to fragment workloads or compute workloads. Vertex, Geometry, and Tessellation workloads are treated as one-dimensional compute problems by the shader core, so they are counted as compute workloads.

The following counters show the total activity level of the shader core:

SC.COMPUTE_ACTIVE
Increments every cycle that at least one compute task is active inside the shader core.
SC.FRAG_ACTIVE
Increments every cycle that at least one fragment task is active inside the shader core.
SC.EXEC_CORE_ACTIVE
Increments every cycle that at least one quad is active inside the programmable execution core.
SC.EXEC_CORE_UTILIZATION (Derived)
An approximation of the overall utilization of the execution core.
SC.EXEC_CORE_UTILIZATION = SC.EXEC_CORE_ACTIVE / JM.GPU_ACTIVE
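
As a quick check of the activity counters above, the execution core utilization can be recomputed from raw values. This is a sketch with illustrative numbers; remember that Streamline reports the per-core average for shader core counters, while JM.GPU_ACTIVE is a single GPU-wide value.

exec_core_active = 430_000   # SC.EXEC_CORE_ACTIVE, average per shader core
gpu_active       = 480_000   # JM.GPU_ACTIVE over the same sample period

exec_core_utilization = exec_core_active / gpu_active   # SC.EXEC_CORE_UTILIZATION
print(f"{exec_core_utilization:.0%}")                   # ~90% of active GPU cycles keep the execution core busy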

The following counters show the task and thread issue behavior of the fixed-function compute frontend of the shader core:

SC.COMPUTE_QUADS
Increments for every compute quad that the shader core spawns.
SC.COMPUTE_QUAD_CYCLES (Derived)
The average number of compute cycles per compute quad.
SC.COMPUTE_QUAD_CYCLES = SC.COMPUTE_ACTIVE / SC.COMPUTE_QUADS
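
For example, dividing the two compute counters gives the average cost of a compute quad (illustrative values):

compute_active = 120_000   # SC.COMPUTE_ACTIVE
compute_quads  = 4_000     # SC.COMPUTE_QUADS

compute_quad_cycles = compute_active / compute_quads   # SC.COMPUTE_QUAD_CYCLES: 30.0 cycles per quad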

The following counters show the task and thread issue behavior of the fixed-function fragment frontend of the shader core:

SC.FRAG_PRIMITIVES_RAST
Increments for every primitive entering the frontend fixed-function rasterization stage.
SC.FRAG_QUADS_RAST
Increments for every 2x2 pixel quad that the rasterization unit rasterizes.
SC.FRAG_QUADS_EZS_TEST
Increments for every 2x2 pixel quad that undergoes ZS testing.
SC.FRAG_QUADS_EZS_UPDATE
Increments for every 2x2 pixel quad that completes an early ZS update operation.
SC.FRAG_QUADS_EZS_KILLED
Increments for every 2x2 pixel quad that early ZS testing kills.
SC.FRAG_QUADS_KILLED_BY_OVERDRAW (Derived)
Increments for every 2x2 pixel quad that survives early ZS testing, but that is overdrawn by an opaque quad before spawning as fragment shading threads in the programmable core.
SC.FRAG_QUADS_KILLED_BY_OVERDRAW = SC.FRAG_QUADS_RAST - SC.FRAG_QUADS_EZS_KILLED - SC.FRAG_QUADS
SC.FRAG_QUADS_OPAQUE
Increments for every architecturally opaque 2x2 pixel quad that survives early ZS testing. Architecturally opaque pixel quads do not use blending, shader discard, or alpha-to-coverage.
SC.FRAG_QUADS_TRANSPARENT (Derived)
Increments for every architecturally transparent 2x2 pixel quad that survives early ZS testing. Architecturally transparent pixel quads use blending, shader discard, or alpha-to-coverage.
SC.FRAG_QUADS_TRANSPARENT = SC.FRAG_QUADS_RAST - SC.FRAG_QUADS_EZS_KILLED - SC.FRAG_QUADS_OPAQUE

Note

Transparent in this context means alpha transparency or a shader-dependent coverage mask.
SC.FRAG_QUAD_BUFFER_NOT_EMPTY
Increments every cycle that the fragment unit is active and the pre-pipe buffer contains at least one 2x2 pixel quad waiting to be executed in the execution core.
SC.FRAG_QUADS
Increments for every fragment quad that the GPU creates.
SC.FRAG_PARTIAL_QUADS
Increments for every fragment quad that contains at least one thread slot that has no sample coverage.
SC.FRAG_PARTIAL_QUAD_PERCENTAGE (Derived)
Calculates the percentage of spawned quads that have partial coverage.
SC.FRAG_PARTIAL_QUAD_PERCENTAGE = SC.FRAG_PARTIAL_QUADS / SC.FRAG_QUADS
SC.FRAG_QUAD_CYCLES (Derived)
Calculates the average number of fragment cycles for each fragment quad.
SC.FRAG_QUAD_CYCLES = SC.FRAG_ACTIVE / SC.FRAG_QUADS
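
The fragment frontend derivations above combine several raw counters. The sketch below recomputes them from illustrative per-sample values; the variable names are not part of the Streamline API.

frag_quads_rast       = 600_000   # SC.FRAG_QUADS_RAST
frag_quads_ezs_killed = 150_000   # SC.FRAG_QUADS_EZS_KILLED
frag_quads            = 420_000   # SC.FRAG_QUADS (quads that spawn in the execution core)
frag_quads_opaque     = 380_000   # SC.FRAG_QUADS_OPAQUE
frag_partial_quads    = 60_000    # SC.FRAG_PARTIAL_QUADS
frag_active           = 400_000   # SC.FRAG_ACTIVE

# Quads that survive early ZS testing but are overdrawn before spawning threads.
killed_by_overdraw = frag_quads_rast - frag_quads_ezs_killed - frag_quads

# Architecturally transparent quads that survive early ZS testing.
transparent_quads = frag_quads_rast - frag_quads_ezs_killed - frag_quads_opaque

partial_quad_percentage = frag_partial_quads / frag_quads   # SC.FRAG_PARTIAL_QUAD_PERCENTAGE
frag_quad_cycles        = frag_active / frag_quads          # SC.FRAG_QUAD_CYCLES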

The following counters show the fragment backend behavior:

SC.FRAG_THREADS_LZS_TEST
Increments for every thread that triggers late depth and stencil (ZS) testing.
SC.FRAG_THREADS_LZS_KILLED
Increments for every thread that late ZS testing kills.
SC.FRAG_NUM_TILES
Increments for every tile that is rendered.
SC.FRAG_TILES_CRC_CULLED
Increments for every physical rendered tile that has its writeback canceled due to a matching transaction elimination CRC hash.

The following counters show the behavior of the arithmetic execution engine:

SC.EE_INSTRS
Increments for every arithmetic instruction that is architecturally executed for a quad in an execution engine. This counter is normalized based on the number of execution engines that the design implements, so it gives the performance per engine.
SC.EE_UTILIZATION (Derived)
The utilization of the arithmetic hardware.
SC.EE_UTILIZATION = SC.EE_INSTRS / SC.EXEC_CORE_ACTIVE
SC.EE_INSTRS_DIVERGED
Increments for every arithmetic instruction architecturally executed where there is control flow divergence in the quad resulting in at least one lane of computation being masked out.
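
As an illustration of the arithmetic counters, the sketch below recomputes the utilization and also forms a divergence ratio from the two instruction counters. The divergence ratio is not one of the derived counters defined in this topic, only an additional ratio that can be useful; the values are illustrative.

ee_instrs          = 350_000   # SC.EE_INSTRS, per execution engine
ee_instrs_diverged = 35_000    # SC.EE_INSTRS_DIVERGED
exec_core_active   = 430_000   # SC.EXEC_CORE_ACTIVE

ee_utilization  = ee_instrs / exec_core_active     # SC.EE_UTILIZATION
divergence_rate = ee_instrs_diverged / ee_instrs   # fraction of instructions issued with a partially masked quad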

The following counters show the behavior of the load/store pipe:

SC.LSC_READS_FULL
Increments for every LS cache access executed that returns 128 bits of data.
SC.LSC_READS_SHORT
Increments for every LS cache access executed that returns less than 128 bits of data.
SC.LSC_WRITES_FULL
Increments for every LS cache access executed that writes 128 bits of data.
SC.LSC_WRITES_SHORT
Increments for every LS cache access executed that writes less than 128 bits of data.
SC.LSC_ATOMICS
Increments for every atomic operation that is issued to the LS cache.
SC.LSC_ISSUES (Derived)
The total number of load/store cache access operations issued.
SC.LSC_ISSUES = SC.LSC_READS_FULL + SC.LSC_READS_SHORT +
                SC.LSC_WRITES_FULL + SC.LSC_WRITES_SHORT +
                SC.LSC_ATOMICS
SC.LSC_UTILIZATION (Derived)
Utilization of the load/store cache.
SC.LSC_UTILIZATION = SC.LSC_ISSUES / SC.EXEC_CORE_ACTIVE
SC.LSC_READ_BEATS
Increments for every 16 bytes of data that is fetched from the L2 memory system.
SC.LSC_L2_BYTES_PER_ISSUE (Derived)
The average number of bytes read from the L2 cache for each load/store L1 cache access.
SC.LSC_L2_BYTES_PER_ISSUE = (SC.LSC_READ_BEATS * 16) / SC.LSC_ISSUES
SC.LSC_READ_BEATS_EXTERNAL
Increments for every 16 bytes of data that are fetched from the L2 memory system that missed in the L2 cache and required a fetch from external memory.
SC.LSC_EXTERNAL_BYTES_PER_ISSUE (Derived)
The average number of bytes that are read from the external memory interface for each load/store L1 cache access.
SC.LSC_EXTERNAL_BYTES_PER_ISSUE = (SC.LSC_READ_BEATS_EXTERNAL * 16) / SC.LSC_ISSUES
SC.LSC_WRITE_BEATS
Increments for every 16 bytes of data that are written to the L2 memory system.
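
A short sketch recomputing the load/store derivations above from illustrative raw values:

lsc_reads_full   = 40_000    # SC.LSC_READS_FULL
lsc_reads_short  = 10_000    # SC.LSC_READS_SHORT
lsc_writes_full  = 20_000    # SC.LSC_WRITES_FULL
lsc_writes_short = 5_000     # SC.LSC_WRITES_SHORT
lsc_atomics      = 1_000     # SC.LSC_ATOMICS
lsc_read_beats   = 60_000    # SC.LSC_READ_BEATS (16 bytes per beat)
exec_core_active = 430_000   # SC.EXEC_CORE_ACTIVE

lsc_issues = (lsc_reads_full + lsc_reads_short +
              lsc_writes_full + lsc_writes_short +
              lsc_atomics)                                  # SC.LSC_ISSUES
lsc_utilization    = lsc_issues / exec_core_active          # SC.LSC_UTILIZATION
l2_bytes_per_issue = (lsc_read_beats * 16) / lsc_issues     # SC.LSC_L2_BYTES_PER_ISSUE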

The following counters show the texture pipe behavior:

Note

The texture pipe event counters increment for each thread (fragment), not for each quad.
SC.TEX_INSTRS
Increments for every architecturally executed texture instruction.
SC.TEX_ISSUES
Increments for every texture issue cycle used.
SC.TEX_UTILIZATION (Derived)
The texture unit utilization.
SC.TEX_UTILIZATION = SC.TEX_ISSUES / SC.EXEC_CORE_ACTIVE
SC.TEX_CPI (Derived)
The average cycle usage of the texture unit per instruction.
SC.TEX_CPI = SC.TEX_ISSUES / SC.TEX_INSTRS
SC.TEX_INSTR_3D
Increments for every architecturally executed texture instruction that accesses a 3D texture.
SC.TEX_INSTR_TRILINEAR
Increments for every architecturally executed texture instruction that uses a trilinear minification filter.
SC.TEX_INSTR_MIPMAP
Increments for every architecturally executed texture instruction that accesses a texture that has mipmaps enabled.
SC.TEX_INSTR_COMPRESSED
Increments for every architecturally executed texture instruction that accesses a texture that is compressed.
SC.TEX_READ_BEATS
Increments for every 16 bytes of texture data that is fetched from the L2 memory system.
SC.TEX_L2_BYTES_PER_ISSUE (Derived)
The average number of bytes read from the L2 cache per texture L1 cache access.
SC.TEX_L2_BYTES_PER_ISSUE = (SC.TEX_READ_BEATS * 16) / SC.TEX_ISSUES
SC.TEX_READ_BEATS_EXTERNAL
Increments for every 16 bytes of texture data fetched from the L2 memory system that missed in the L2 cache and required a fetch from external memory.
SC.TEX_EXTERNAL_BYTES_PER_ISSUE (Derived)
The average number of bytes read from the external memory interface per texture operation.
SC.TEX_EXTERNAL_BYTES_PER_ISSUE = (SC.TEX_READ_BEATS_EXTERNAL * 16) / SC.TEX_ISSUES
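
The texture pipe derivations follow the same pattern; a sketch with illustrative values:

tex_instrs       = 150_000   # SC.TEX_INSTRS
tex_issues       = 180_000   # SC.TEX_ISSUES (some instructions, such as trilinear lookups, can take extra cycles)
tex_read_beats   = 90_000    # SC.TEX_READ_BEATS (16 bytes per beat)
exec_core_active = 430_000   # SC.EXEC_CORE_ACTIVE

tex_utilization    = tex_issues / exec_core_active        # SC.TEX_UTILIZATION
tex_cpi            = tex_issues / tex_instrs              # SC.TEX_CPI: 1.2 cycles per instruction here
l2_bytes_per_issue = (tex_read_beats * 16) / tex_issues   # SC.TEX_L2_BYTES_PER_ISSUE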

The following counters show the varying unit behavior:

SC.VARY_INSTR
Increments for every architecturally executed varying unit instruction for a fragment quad.
SC.VARY_ISSUES_16
Increments for every architecturally executed cycle of mediump 16-bit varying interpolation.
SC.VARY_ISSUES_32
Increments for every architecturally executed cycle of highp 32-bit varying interpolation.
SC.VARY_UTILIZATION (Derived)
The utilization of the varying unit.
SC.VARY_UTILIZATION = (SC.VARY_ISSUES_16 + SC.VARY_ISSUES_32) / SC.EXEC_CORE_ACTIVE
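
And the varying unit utilization, recomputed from the two issue counters (illustrative values):

vary_issues_16   = 40_000    # SC.VARY_ISSUES_16
vary_issues_32   = 30_000    # SC.VARY_ISSUES_32
exec_core_active = 430_000   # SC.EXEC_CORE_ACTIVE

vary_utilization = (vary_issues_16 + vary_issues_32) / exec_core_active   # SC.VARY_UTILIZATION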

Tiler counters

The tiler counters provide details of the workload of the fixed function tiling unit. This unit places primitives into the tile lists that the fragment frontend reads during fragment shading.

The following counters show the overall activity of the tiling unit:

TI.ACTIVE
Increments every cycle that the tiler processes a task.

The following counters give a functional breakdown of the tiling workload that the application submits to the GPU:

TI.PRIMITIVE_POINTS
Increments for every point primitive that the tiler processes. This counter increments before any clipping or culling, so it reflects the raw workload from the application.
TI.PRIMITIVE_LINES
Increments for every line segment primitive that the tiler processes. This counter increments before any clipping or culling, so it reflects the raw workload from the application.
TI.PRIMITIVE_TRIANGLES
Increments for every triangle primitive that the tiler processes. This counter increments before any clipping or culling, so it reflects the raw workload from the application.
TI.INPUT_PRIMITIVES (Derived)
The total number of primitives entering primitive assembly.
TI.INPUT_PRIMITIVES = TI.PRIMITIVE_POINTS + TI.PRIMITIVE_LINES + TI.PRIMITIVE_TRIANGLES

The following counters give a breakdown of how clipping and culling affect the workload. The culling schemes are applied in the following order:

  1. Primitive assembly
  2. Facing culling
  3. Frustum culling
  4. Coverage culling

This order affects how the counters are interpreted, because each culling rate is measured against the primitives that enter that stage rather than the original input total.

TI.CULLED_FACING
Increments for every primitive that is culled due to the application of front-face or back-face culling rules.
TI.CULLED_FRUSTUM
Increments for every primitive that is culled due to being outside of the clip-space volume.
TI.CULLED_COVERAGE
Increments for every microtriangle primitive that is culled because it has no coverage of active sample points.
TI.PRIMITIVE_VISIBLE
Increments for every primitive that is visible and survives all types of culling that are applied.

Note

Visible in this context means that a primitive is inside the viewing frustum, facing in the correct direction, and has at least some sample coverage. Primitives that are visible at this stage may not generate any rendered fragments. For example, ZS testing during fragment processing may determine that a primitive is entirely occluded by other primitives.
TI.CULLED_FACING_PERCENT (Derived)
The percentage of primitive inputs that the facing test culls.
TI.CULLED_FACING_PERCENT = TI.CULLED_FACING / TI.INPUT_PRIMITIVES
TI.CULLED_FRUSTUM_PERCENT (Derived)
The percentage of primitive inputs that the frustum test culls.
TI.CULLED_FRUSTUM_PERCENT = TI.CULLED_FRUSTUM / (TI.INPUT_PRIMITIVES - TI.CULLED_FACING)
TI.CULLED_COVERAGE_PERCENT (Derived)
The percentage of primitive inputs that the coverage test culls.
TI.CULLED_COVERAGE_PERCENT = TI.CULLED_COVERAGE / (TI.INPUT_PRIMITIVES - TI.CULLED_FACING - TI.CULLED_FRUSTUM)
TI.FRONT_FACING
Increments for every triangle that is front-facing. This counter increments after culling, so it only counts visible primitives that are emitted into the tile list.
TI.BACK_FACING
Increments for every triangle that is back-facing. This counter increments after culling, so it only counts visible primitives that are emitted into the tile list.
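
Because the culling stages run in the order listed above, each derived percentage uses the primitives that survive the earlier stages as its denominator. A minimal sketch with illustrative counter values:

input_primitives = 251_000   # TI.INPUT_PRIMITIVES
culled_facing    = 110_000   # TI.CULLED_FACING
culled_frustum   = 30_000    # TI.CULLED_FRUSTUM
culled_coverage  = 11_000    # TI.CULLED_COVERAGE

facing_percent   = culled_facing   / input_primitives                                      # TI.CULLED_FACING_PERCENT
frustum_percent  = culled_frustum  / (input_primitives - culled_facing)                    # TI.CULLED_FRUSTUM_PERCENT
coverage_percent = culled_coverage / (input_primitives - culled_facing - culled_frustum)   # TI.CULLED_COVERAGE_PERCENT

visible = input_primitives - culled_facing - culled_frustum - culled_coverage   # should track TI.PRIMITIVE_VISIBLE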

The following counters track the workload requests for the Index-Driven Vertex Shading (IDVS) pipeline:

TI.IDVS_POSITION_SHADING_REQUEST
Increments for every batch of triangles that are position shaded. Each batch consists of four vertices from sequential index ranges.
TI.IDVS_VARYING_SHADING_REQUEST
Increments for every batch of triangles that are varying shaded. Each batch consists of four vertices from sequential index ranges.

L2 cache counters

This section describes the behavior of the L2 memory system counters. In systems that implement multiple L2 caches or bus interfaces, the counters that are presented in Arm Streamline are the sum of the counters from all the L2 counter blocks.

Note

All derivations in this topic are computations per slice. As Arm Streamline reports the sum of the slices, it may be necessary to divide these derivations by the number of cache slices present in your design.

The following counters profile the internal use of the L2 cache versus the available cycle capacity:

L2.ANY_LOOKUP
Increments for any L2 read or write request from an internal master, or snoop request from an internal or external master.
L2.INTERNAL_UTILIZATION (Derived)
The internal utilization of the L2 cache by the processing masters in the system.
L2.INTERNAL_UTILIZATION = L2.ANY_LOOKUP / JM.GPU_ACTIVE

The following counters profile the internal read traffic into the L2 cache from the various internal masters:

L2.READ_REQUEST
Increments for every read transaction that the L2 cache receives.
L2.EXTERNAL_READ_REQUEST
Increments for every read transaction that the L2 cache sends to external memory.
L2.READ_MISS_RATE (Derived)
Indicates the proportion of read requests that miss in the L2 cache and are sent on the L2 external interface to main memory.
L2.READ_MISS_RATE = L2.EXTERNAL_READ_REQUEST / L2.READ_REQUEST
L2.WRITE_REQUEST
Increments for every write transaction that the L2 cache receives.
L2.EXTERNAL_WRITE_REQUEST
Increments for every write transaction that the L2 cache sends to external memory.
L2.WRITE_MISS_RATE (Derived)
Indicates the proportion of write requests that miss in the L2 cache and are sent on the L2 external interface to main memory.
L2.WRITE_MISS_RATE = L2.EXTERNAL_WRITE_REQUEST / L2.WRITE_REQUEST
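
Taking each miss rate as the fraction of requests that go out to external memory, the two derivations can be recomputed as follows (illustrative values, summed over all cache slices as Streamline reports them):

read_requests  = 200_000   # L2.READ_REQUEST
ext_reads      = 50_000    # L2.EXTERNAL_READ_REQUEST
write_requests = 80_000    # L2.WRITE_REQUEST
ext_writes     = 30_000    # L2.EXTERNAL_WRITE_REQUEST

read_miss_rate  = ext_reads  / read_requests    # L2.READ_MISS_RATE  = 0.25
write_miss_rate = ext_writes / write_requests   # L2.WRITE_MISS_RATE = 0.375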

The following counters profile the external read memory interface behavior:

Note

This behavior includes traffic from the entire GPU L2 memory subsystem as some types of access bypass the L2 cache.
L2.EXTERNAL_READ_BEATS
Increments on every clock cycle that a read beat is read off the external AXI bus.
L2.EXTERNAL_READ_BYTES (Derived)
Converts the beat counter into a raw bandwidth counter.
L2.EXTERNAL_READ_BYTES = SUM(L2.EXTERNAL_READ_BEATS * L2.AXI_WIDTH_BYTES)
L2.EXTERNAL_READ_UTILIZATION (Derived)
The total utilization of the AXI read interface per cache slice.
L2.EXTERNAL_READ_UTILIZATION = L2.EXTERNAL_READ_BEATS / JM.GPU_ACTIVE
L2.EXTERNAL_READ_STALL
Increments every cycle that the GPU is unable to issue a new read transaction to AXI because AXI is unable to accept the request.
L2 Read Latency Histogram
The L2 interface implements a six-entry histogram that tracks the response latency of external reads. The counter for the sixth level is synthesized from multiple raw counter values.
Histogram range Counter equation
0-127 Cycles L2.EXT_RRESP_0_127
128-191 Cycles L2.EXT_RRESP_128_191
192-255 Cycles L2.EXT_RRESP_192_255
256-319 Cycles L2.EXT_RRESP_256_319
320-383 Cycles L2.EXT_RRESP_320_383
> 383 Cycles L2.EXTERNAL_READ_BEATS - L2.EXT_RRESP_0_127 - L2.EXT_RRESP_128_191 - L2.EXT_RRESP_192_255 - L2.EXT_RRESP_256_319 - L2.EXT_RRESP_320_383
L2 Read Outstanding Transaction Histogram
The L2 interface implements a four-entry histogram that tracks the outstanding transaction levels for external reads. The counter for the fourth level is synthesized from multiple raw counter values.
Histogram range Counter equation
0-25% L2.EXT_READ_CNT_Q1
25-50% L2.EXT_READ_CNT_Q2
50-75% L2.EXT_READ_CNT_Q3
75%-100% L2.EXTERNAL_READ - L2.EXT_READ_CNT_Q1 - L2.EXT_READ_CNT_Q2 - L2.EXT_READ_CNT_Q3
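
The sketch below converts read beats into bytes and synthesizes the top bin of the read latency histogram, as described above. The AXI width and the numeric values are assumptions for the example; divide by the number of cache slices if you need per-slice figures.

axi_width_bytes = 16         # L2.AXI_WIDTH_BYTES for this design (assumed)
ext_read_beats  = 400_000    # L2.EXTERNAL_READ_BEATS, summed over slices by Streamline

ext_read_bytes = ext_read_beats * axi_width_bytes   # L2.EXTERNAL_READ_BYTES

# Raw read latency histogram bins, illustrative values.
rresp = {
    "0_127":   300_000,   # L2.EXT_RRESP_0_127
    "128_191":  60_000,   # L2.EXT_RRESP_128_191
    "192_255":  25_000,   # L2.EXT_RRESP_192_255
    "256_319":  10_000,   # L2.EXT_RRESP_256_319
    "320_383":   3_000,   # L2.EXT_RRESP_320_383
}

# The > 383 cycles bin is synthesized from the total beat count.
over_383 = ext_read_beats - sum(rresp.values())   # 2_000 beats in this example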

The following counters profile the external write memory interface behavior:

Note

This behavior includes traffic from the entire GPU L2 memory subsystem as some types of access bypass the L2 cache.
L2.EXTERNAL_WRITE_BEATS
Increments on every clock cycle that a write beat is sent on the external AXI bus.
L2.EXTERNAL_WRITE_BYTES (Derived)
Converts the beat counter into a raw bandwidth counter.
L2.EXTERNAL_WRITE_BYTES = SUM(L2.EXTERNAL_WRITE_BEATS * L2.AXI_WIDTH_BYTES)
L2.EXTERNAL_WRITE_UTILIZATION (Derived)
The total utilization of the AXI write interface per cache slice.
L2.EXTERNAL_WRITE_UTILIZATION = L2.EXTERNAL_WRITE_BEATS / JM.GPU_ACTIVE
L2.EXTERNAL_WRITE_STALL
Increments every cycle that the GPU is unable to issue a new write transaction to AXI because AXI is unable to accept the request.
L2 Write Outstanding Transaction Histogram
The L2 interface implements a four-entry histogram that tracks the outstanding transaction levels for external writes. The counter for the fourth level is synthesized from multiple raw counter values.
Histogram range Counter equation
0-25% L2.EXT_WRITE_CNT_Q1
25-50% L2.EXT_WRITE_CNT_Q2
50-75% L2.EXT_WRITE_CNT_Q3
75%-100% L2.EXTERNAL_WRITE - L2.EXT_WRITE_CNT_Q1 - L2.EXT_WRITE_CNT_Q2 - L2.EXT_WRITE_CNT_Q3

For more information about these counters, see Mali Bifrost Family Performance Counters on Arm Community.
