February 10, 2026

Hidden surface graphics removal methods in Mali GPUs

Explore fragment prepass on Mali GPUs, how it complements Z-culling and FPK, and the trade-offs between added cost and robust HSR across use

By tonli


Hidden surface removal (HSR) is used in modern computer graphics to reduce GPU workload. It avoids shading and blending for surfaces that are not visible from the camera. When HSR works well, fewer fragments reach the expensive parts of the pipeline, which lowers bandwidth use and improves frame time.

In Mali GPUs, several hardware and pipeline techniques contribute to HSR. Each technique has different strengths and limitations. Some techniques interact with early Z-testing and late Z-testing in ways that affect performance.

The orange blocks in the following diagram show the HSR techniques that we have used or are currently using.

Diagram showing the HSR techniques that we have used or are currently using.

This blog post analyzes the advantages and disadvantages of various hidden surface removal techniques and the principles of the fragment prepass. The fragment prepass is available starting with the Immortalis-G925 series, Mali-G725 GPUs, and Mali-G625 GPUs.

Record buffer

A record buffer accepts primitives and reorders them to improve geometry ordering based on depth. It then outputs reordered primitives to the rasterization stage. This reordering can improve culling during rasterization and during early ZS testing. This is because the GPU sees nearer geometry earlier.

Typical limitations include:

  • It does not work with stencil testing.
  • It does not handle intersecting triangles well.
  • The reorder window is limited. As a result, the benefit depends on how much reordering the buffer can do.
  • Older designs, such as Immortalis-G715, Mali-G715, Immortalis-G720 and Mali-G720, used this approach. Newer GPUs no longer include it.
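The effect of a limited reorder window can be illustrated with a small software model. This is a minimal sketch, not the hardware algorithm: the `Primitive` type, the single representative depth per primitive, the fixed-size window, and the single-pixel depth test are all assumptions made for illustration.

```python
from typing import List, NamedTuple


class Primitive(NamedTuple):
    name: str
    depth: float  # hypothetical representative depth; smaller means nearer


def reorder_in_windows(prims: List[Primitive], window: int) -> List[Primitive]:
    """Sort primitives front to back, but only inside fixed-size windows."""
    out: List[Primitive] = []
    for start in range(0, len(prims), window):
        chunk = prims[start:start + window]
        out.extend(sorted(chunk, key=lambda p: p.depth))
    return out


def shaded_count(prims: List[Primitive]) -> int:
    """Count primitives that pass a less-than depth test at one pixel, in order."""
    nearest = float("inf")
    shaded = 0
    for p in prims:
        if p.depth < nearest:
            nearest = p.depth
            shaded += 1
    return shaded


# Worst case for early testing: geometry submitted back to front.
back_to_front = [Primitive(f"p{i}", d) for i, d in enumerate([0.9, 0.7, 0.5, 0.3])]

print(shaded_count(back_to_front))                         # 4
print(shaded_count(reorder_in_windows(back_to_front, 2)))  # 2
print(shaded_count(reorder_in_windows(back_to_front, 4)))  # 1
```

In this model, a window of two halves the shaded work, and only a window that spans the whole stream reaches the ideal of one shaded primitive. This matches the point that the benefit depends on how much reordering the buffer can do.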

Z-culling with early and late Z-testing

Z-culling removes fragments that fail the depth test. Mali GPUs can perform depth testing early or late in the fragment pipeline. Early Z-testing runs before the fragment shader. It can save significant work because it prevents shading fragments that are already occluded.

Late Z-testing runs after the fragment shader. Late Z-testing is required when the shader can change values that affect visibility. Several shader features can force late behavior or reduce the benefit of early testing, including the following:

  • The shader can modify coverage.
  • The shader can modify depth.
  • The shader can conditionally use discard.
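The shader features listed above can be summarized as a simple predicate. This is a rough sketch only: the `ShaderTraits` fields and the `may_need_late_z` helper are hypothetical names, and the real decision made by the driver and hardware depends on more state than these three flags.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShaderTraits:
    # Hypothetical flags that mirror the three features in the list above.
    modifies_coverage: bool = False
    modifies_depth: bool = False
    uses_discard: bool = False


def may_need_late_z(traits: ShaderTraits) -> bool:
    """Return True if any feature could change visibility inside the shader."""
    return traits.modifies_coverage or traits.modifies_depth or traits.uses_discard


opaque = ShaderTraits()
alpha_tested = ShaderTraits(uses_discard=True)

print(may_need_late_z(opaque))        # False
print(may_need_late_z(alpha_tested))  # True
```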

Early and late Z-testing share the same ZS pipeline. The shared pipeline can create dependencies. For example, a newer fragment at a given position might need to wait until processing finishes for the oldest fragment at that position.

Forward Pixel Kill

Diagram showing the Forward Pixel Kill queue

Forward Pixel Kill (FPK) is a hardware feature. You can think of it as forward pixel culling. It uses an FPK queue between early Z-testing and the shader core. The queue stores pixel quads after early Z-testing. It can cull older quads at the same position when a newer quad is known to be in front.

FPK in the GPU engine sends this information toward the shader core. The hardware can then cull quads that are no longer visible before they consume more shader work.

Typical limitations include:

  • The queue length is limited.
  • Larger tiles can reduce culling efficiency.
  • Results can be timing dependent. A quad might not be culled in the first pass but it can be culled in a later pass.
  • FPK can cull only quads at the same position that remain in the queue.
  • With micro-triangles, FPK often culls fewer fragments because quads contain a mix of visible and occluded coverage.
  • FPK is disabled when late Z-testing is enabled.
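The queue behavior and its length limit can be sketched with a toy model. The following is an illustration under stated assumptions, not a description of the hardware: every quad is opaque with full coverage, every quad has already passed early Z-testing (so a newer quad at the same position is in front), and the `Quad`, `FpkQueue`, and `count_shaded` names are hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Tuple


@dataclass
class Quad:
    pos: Tuple[int, int]  # quad position within the tile
    killed: bool = False


class FpkQueue:
    """Toy FIFO between early Z-testing and shading."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.queue: Deque[Quad] = deque()
        self.shaded: List[Quad] = []

    def push(self, quad: Quad) -> None:
        # The incoming quad is assumed to be in front, so older quads
        # at the same position that are still queued can be killed.
        for older in self.queue:
            if older.pos == quad.pos:
                older.killed = True
        self.queue.append(quad)
        if len(self.queue) > self.capacity:
            self._issue_oldest()

    def _issue_oldest(self) -> None:
        quad = self.queue.popleft()
        if not quad.killed:
            self.shaded.append(quad)

    def drain(self) -> None:
        while self.queue:
            self._issue_oldest()


def count_shaded(positions: List[Tuple[int, int]], capacity: int) -> int:
    fpk = FpkQueue(capacity)
    for pos in positions:
        fpk.push(Quad(pos))
    fpk.drain()
    return len(fpk.shaded)


# Position (0, 0) is overdrawn once, with two unrelated quads in between.
stream = [(0, 0), (1, 0), (2, 0), (0, 0)]
print(count_shaded(stream, capacity=2))  # 4
print(count_shaded(stream, capacity=4))  # 3
```

With a capacity of two, the first quad at (0, 0) has already left the queue when its occluder arrives, so it is shaded anyway. With a capacity of four, it is still queued and is culled.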

Enhanced Forward Pixel Kill

Enhanced FPK (EFPK) improves FPK by combining the coverage of an incoming quad with the previous quad at the same position in the queue, if one exists. This change improves the chance that the hardware can identify work that is safe to cull.

EFPK can cull matching quads that are in front of both patches. In practice, EFPK still shares most of the same limits as FPK. These include queue constraints, ordering sensitivity, and behavior that depends on timing.
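The idea of combining coverage can be sketched with 4-bit masks for a 2x2 quad. This is an assumption-heavy illustration: the mask layout and the `fpk_can_kill` and `efpk_can_kill` helpers are hypothetical, and the model assumes that both newer quads are opaque and in front of the older quad.

```python
def fpk_can_kill(older_cov: int, incoming_cov: int) -> bool:
    """Plain FPK model: the incoming quad alone must cover the older quad."""
    return (older_cov & ~incoming_cov) == 0


def efpk_can_kill(older_cov: int, previous_cov: int, incoming_cov: int) -> bool:
    """EFPK model: coverage of the previous and incoming quads is combined."""
    combined = previous_cov | incoming_cov
    return (older_cov & ~combined) == 0


older = 0b1111      # fully covered quad waiting in the queue
previous = 0b1100   # newer quad covering the top two pixels
incoming = 0b0011   # newest quad covering the bottom two pixels

print(fpk_can_kill(older, incoming))             # False
print(efpk_can_kill(older, previous, incoming))  # True
```

In this model, neither partial quad can cull the older quad alone, but their combined coverage can. This is the kind of case that arises when two adjacent triangles share an edge that passes through a quad.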

Patch Forward Pixel Kill culling

Patch FPK culling (PFPK) works at patch granularity during rasterization. A PFPK tracker scans state and culls candidates based on lookup checks.

Typical limitations include:

  • Look-ahead is limited.
  • Efficiency depends on geometry ordering.
  • PFPK is disabled when late Z-testing is enabled.
  • Results can be timing dependent.
  • PFPK can cull only quads that are still in the rasterizer. The time when a quad leaves the rasterizer depends on timing.
  • PFPK runs in the rasterizer (RAST).

Diagram showing PFPK performed in the rasterizer (RAST)

General limitations of HSR techniques

Several patterns reduce the effectiveness of the techniques previously described:

  • Small reorder windows reduce the benefit when geometry ordering is poor.
  • Very small triangles can reduce culling opportunities and increase overhead in queues and tracking logic.
  • Some approaches require multiple triangles to form a single effective culling quad, which reduces benefit for fine geometry.
  • Shader features such as discard, shader-modified coverage, and alpha-to-coverage can force late Z-testing.
  • When late Z-testing is enabled, FPK, EFPK, and PFPK can be disabled.

Fragment prepass

When the limits above become a bottleneck, a fragment prepass can provide more consistent HSR. This approach is available starting with the Immortalis-G925 series, Mali-G725 GPUs, and Mali-G625 GPUs. The detailed theory is covered in the Hidden Surface Removal in Immortalis-G925: The Fragment Prepass Arm blog post.

Fragment prepass process

At a high level, the fragment prepass improves HSR by identifying occluded work earlier and more reliably across a wider range of content.

Key benefits include the following:

  • Improved HSR.
  • Independence from geometry ordering in common cases.
  • An effectively unlimited culling window compared with small hardware queues.
  • The fragment prepass is not disabled when late Z-testing is enabled.
    • FPK, EFPK, and PFPK are disabled when late Z-testing is enabled.
  • Stencil testing is supported.
  • Sustained performance improvement over Immortalis-G720, Mali-G720, and Mali-G620.
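The independence from geometry ordering can be illustrated with a generic two-pass depth prepass in software. This is only an analogy under stated assumptions: it is not the Mali hardware implementation, the fragment representation and function names are hypothetical, and fragments at the same pixel are assumed to have distinct depths.

```python
from typing import Dict, List, Tuple

Fragment = Tuple[Tuple[int, int], float]  # ((x, y), depth); smaller is nearer


def shade_single_pass(frags: List[Fragment]) -> int:
    """Shade every fragment that passes a less-than depth test, in order."""
    depth: Dict[Tuple[int, int], float] = {}
    shaded = 0
    for pos, z in frags:
        if z < depth.get(pos, float("inf")):
            depth[pos] = z
            shaded += 1
    return shaded


def shade_with_prepass(frags: List[Fragment]) -> int:
    """Resolve the nearest depth first, then shade only the visible fragments."""
    depth: Dict[Tuple[int, int], float] = {}
    for pos, z in frags:  # pass 1: depth only, no shading
        if z < depth.get(pos, float("inf")):
            depth[pos] = z
    # Pass 2: only fragments that match the resolved depth are shaded.
    return sum(1 for pos, z in frags if z == depth[pos])


back_to_front = [((0, 0), 0.9), ((0, 0), 0.7), ((0, 0), 0.5)]
front_to_back = list(reversed(back_to_front))

print(shade_single_pass(back_to_front), shade_with_prepass(back_to_front))  # 3 1
print(shade_single_pass(front_to_back), shade_with_prepass(front_to_back))  # 1 1
```

The single-pass count depends on submission order, while the prepass count does not, because the second pass can see the final depth for every pixel.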

Trade-offs to consider

A fragment prepass adds work because it runs an additional pass before the full fragment shader path. In workloads with little occlusion, this extra pass can reduce performance. This is often visible in user interface style content, where most pixels are already visible and there is limited overdraw.
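A back-of-the-envelope cost model makes the trade-off concrete. This is a deliberately simplified sketch: the per-fragment cost units and the `frame_cost` helper are hypothetical, and the baseline assumes that no other HSR technique culls anything, which is a worst case.

```python
def frame_cost(total_fragments: int, occluded_fragments: int,
               shade_cost: int, prepass_cost: int, use_prepass: bool) -> int:
    """Return an abstract cost for shading one frame."""
    if not use_prepass:
        # Worst-case baseline: every fragment is fully shaded.
        return total_fragments * shade_cost
    visible = total_fragments - occluded_fragments
    # The prepass touches every fragment; full shading runs only on visible ones.
    return total_fragments * prepass_cost + visible * shade_cost


# UI-style content: 5% of fragments are occluded.
print(frame_cost(1000, 50, shade_cost=10, prepass_cost=1, use_prepass=False))   # 10000
print(frame_cost(1000, 50, shade_cost=10, prepass_cost=1, use_prepass=True))    # 10500

# Heavy overdraw: 60% of fragments are occluded.
print(frame_cost(1000, 600, shade_cost=10, prepass_cost=1, use_prepass=False))  # 10000
print(frame_cost(1000, 600, shade_cost=10, prepass_cost=1, use_prepass=True))   # 5000
```

Under these assumptions, the prepass pays off only when the shading work saved on occluded fragments exceeds the cost of the extra pass.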

Conclusion

Mali GPUs use several complementary methods to reduce occluded fragment work. These include primitive reordering, depth-based culling, and forward pixel culling queues. These methods can be limited by queue depth, geometry ordering, small triangles, and shader features that require late Z-testing. The fragment prepass addresses many of these limitations and provides more robust HSR. The main trade-off is added overhead when there is little occlusion.



Re-use is only permitted for informational and non-commercial or personal use only.
