Tile-based rendering

Understanding the Mali rendering architecture

Published December 2017 by Pete Harris

All Mali GPUs use a tile-based rendering architecture. This means that the GPU renders the output framebuffer as a number of distinct smaller sub-regions called tiles, writing each tile out to memory as it is completed. In the case of Mali these tiles are small, spanning just 16x16 pixels each. This article explains the pros and cons of the tile-based architecture, and contrasts the tile-based design against a traditional immediate mode GPU you might find in a desktop PC or games console.

Immediate mode GPUs

In a traditional desktop GPU architecture – commonly called an immediate mode architecture – rendering is processed as a strict command stream, with the vertex shaders and fragment shaders executed in sequence for each primitive in each draw call. At a high level, ignoring parallel processing and pipelining, the pseudo-code for this approach is:

for draw in renderPass:
    for primitive in draw:
        for vertex in primitive:
            execute_vertex_shader(vertex)
        for fragment in primitive:
            execute_fragment_shader(fragment)

... and in terms of hardware data flow and interactions with memory this looks like:

 


Advantages

The main advantage of the immediate mode approach is the fact that the output of the vertex shader (and other geometry related shaders) can remain on-chip inside the GPU. The output of these shaders can simply be stored in a FIFO until the next stage in the pipeline is ready to consume the data, which means that little external memory bandwidth cost is incurred for storage and access of intermediate geometry states.

Disadvantages

The major downside of the immediate mode approach is that any triangle in the stream may cover any part of the screen. This means that the framebuffer working set which must be maintained is large; typically a full-screen color buffer, a depth buffer, and possibly a stencil buffer too. A framebuffer for a modern device will usually use 32 bits-per-pixel (bpp) color and 32bpp packed depth/stencil. A 1440p smartphone therefore has a working set of approximately 30MB (2560 × 1440 pixels × 8 bytes per pixel), which is far too large to keep on chip and so must be stored off-chip in DRAM.

Every blending, depth testing, and stencil testing operation requires the current value of the data for the current fragment's pixel coordinate to be fetched from this working set. All fragments shaded will typically touch this working set, so at high resolutions the bandwidth load placed on this memory can be exceptionally high, with multiple read-modify-write operations per fragment, although caching can mitigate this slightly. This need for high bandwidth access in turn drives the need for a wide memory interface with lots of pins, as well as specialized high-frequency DRAM devices, both of which result in external memory accesses which are particularly energy intensive. For mobile and embedded electronics where battery life and passive cooling are important design requirements, this bandwidth to off-chip memory is a significant overall cost.

Tile-based GPUs

The Mali GPU family takes a very different approach to processing render passes, commonly called tile-based rendering, designed to minimize the amount of external memory accesses which are needed during rendering.

Tile-based renderers split the screen into small pieces – Mali renders 16x16 tiles – and process fragment shading on each small tile to completion before writing it out to memory. To make this work the GPU must know up-front which geometry contributes to each tile, so tile-based renderers split each render pass into two distinct processing passes.

  • The first pass executes all of the geometry related processing, and generates the tile lists which indicate which primitives contribute to each screen tile.
  • The second pass executes all of the fragment processing, tile by tile, writing tiles back to memory as they have been completed.

For tile-based architectures the rendering algorithm equates to:

# Pass one
for draw in renderPass:
    for primitive in draw:
        for vertex in primitive:
            execute_vertex_shader(vertex)
        append_tile_list(primitive)

# Pass two
for tile in renderPass:
    for primitive in tile:
        for fragment in primitive:
            execute_fragment_shader(fragment)

... and in terms of hardware data flow and interactions with memory this looks like:

 

Advantages: Bandwidth

The main advantage of tile-based rendering is that a tile is only a small fraction of the total screen area, so it is possible to keep the entire framebuffer working set (color, depth, and stencil) in a fast on-chip RAM which is tightly coupled to the GPU shader core. The intermediate framebuffer states needed for depth/stencil testing and for blending transparent fragments are therefore readily available without needing an external memory access. Reducing the number of external memory accesses needed for common framebuffer operations makes fragment-heavy content significantly more energy efficient.

In addition, a significant proportion of content has a depth and stencil buffer which is transient and only needs to exist for the duration of a single render pass. If developers tell the Mali drivers that depth and stencil buffers do not need to be preserved[1] – ideally via a call to glDiscardFramebufferEXT (OpenGL ES 2.0), glInvalidateFramebuffer (OpenGL ES 3.0), or using appropriate storeOp settings (Vulkan) – then the depth and stencil attachments are never written back to main memory at all.
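As a minimal sketch of how an application can signal this, assuming an OpenGL ES 3.0 context with the relevant framebuffer object bound, and a Vulkan render pass whose depth/stencil attachment is transient (the format, function name, and variable names below are illustrative, not taken from this article):

#include <GLES3/gl3.h>
#include <vulkan/vulkan.h>

/* OpenGL ES 3.0: after the last draw of the render pass, tell the driver
 * that the depth and stencil contents will not be used again, so they
 * never need to be written back to main memory at tile writeback time. */
static void discard_depth_stencil(void)
{
    static const GLenum attachments[] = { GL_DEPTH_ATTACHMENT, GL_STENCIL_ATTACHMENT };
    glInvalidateFramebuffer(GL_FRAMEBUFFER, 2, attachments);
}

/* Vulkan: declare up-front that the depth/stencil attachment does not need
 * to be stored when the render pass ends. */
static const VkAttachmentDescription transientDepthStencil = {
    .format         = VK_FORMAT_D24_UNORM_S8_UINT,
    .samples        = VK_SAMPLE_COUNT_1_BIT,
    .loadOp         = VK_ATTACHMENT_LOAD_OP_CLEAR,
    .storeOp        = VK_ATTACHMENT_STORE_OP_DONT_CARE,   /* depth never written to DRAM */
    .stencilLoadOp  = VK_ATTACHMENT_LOAD_OP_CLEAR,
    .stencilStoreOp = VK_ATTACHMENT_STORE_OP_DONT_CARE,   /* stencil never written to DRAM */
    .initialLayout  = VK_IMAGE_LAYOUT_UNDEFINED,
    .finalLayout    = VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL,
};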

Further framebuffer bandwidth saving optimizations are possible because Mali only has to write the color data for a tile back to memory once it is complete, at which point we know its final state. We can compare the content of a tile with the current data already in main memory via a CRC check – a process called Transaction Elimination – skipping the tile write to external memory completely if the tile contents are the same. This doesn't help performance in most situations – the fragment shaders still have to run to build the tile content – but it will reduce the external memory bandwidth considerably for many common use cases, such as UI rendering and casual gaming where screen regions will be unchanged across multiple frames, and therefore reduce system power consumption.
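To make the mechanism concrete, a conceptual sketch of the writeback decision is shown below. This is not real driver or hardware code: crc_of_tile(), write_tile_to_dram(), and previousSignature are hypothetical names standing in for fixed-function logic inside the GPU.

#include <stdint.h>

/* Hypothetical helpers standing in for fixed-function hardware. */
uint32_t crc_of_tile(const uint32_t *tileColorData);
void write_tile_to_dram(int tileIndex, const uint32_t *tileColorData);

void writeback_tile(int tileIndex, const uint32_t *tileColorData,
                    uint32_t *previousSignature)
{
    /* Signature of the freshly shaded tile. */
    uint32_t signature = crc_of_tile(tileColorData);

    if (signature != previousSignature[tileIndex]) {
        /* Contents changed since the last write: store the tile and
         * remember its new signature. */
        write_tile_to_dram(tileIndex, tileColorData);
        previousSignature[tileIndex] = signature;
    }
    /* else: identical content, so the external memory write is skipped. */
}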

In addition we can also compress the color data for the tiles which are written out using a lossless compression scheme called ARM Frame Buffer Compression (AFBC), which allows us to lower the bandwidth and power consumed even further. This compression can be applied to render-to-texture outputs, which can be read back as textures by the GPU in subsequent render passes, as well as the main window surface, provided there is an AFBC compatible display controller such as Mali-DP650 in the system. Framebuffer compression therefore saves bandwidth multiple times: once on write-out from the GPU, and again each time that framebuffer is read by another processor.

Advantages: Algorithms

In addition to the basic bandwidth saving for framebuffer related operations, tile-based renderers also enable some algorithms which would otherwise be too expensive.

A tile is sufficiently small that Mali can store enough samples locally in the tile memory to allow multi-sample anti-aliasing[2], and the hardware can resolve the MSAA samples to a single pixel color during tile writeback to external memory without needing a separate resolve pass. This enables anti-aliasing with very low overhead, both in terms of shading performance and bandwidth cost.
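To illustrate how an application can request this behaviour, a Vulkan attachment setup might look like the sketch below: a 4x multisampled color attachment that is never stored to memory, paired with a single-sample resolve target. The formats, sample count, and layouts are illustrative assumptions.

#include <vulkan/vulkan.h>

/* 4x multisampled color attachment: lives only in tile memory, never stored. */
static const VkAttachmentDescription msaaColor = {
    .format        = VK_FORMAT_R8G8B8A8_UNORM,
    .samples       = VK_SAMPLE_COUNT_4_BIT,
    .loadOp        = VK_ATTACHMENT_LOAD_OP_CLEAR,
    .storeOp       = VK_ATTACHMENT_STORE_OP_DONT_CARE,  /* per-sample data never hits DRAM */
    .initialLayout = VK_IMAGE_LAYOUT_UNDEFINED,
    .finalLayout   = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL,
};

/* Single-sample resolve target: receives the resolved pixels at tile writeback. */
static const VkAttachmentDescription resolvedColor = {
    .format        = VK_FORMAT_R8G8B8A8_UNORM,
    .samples       = VK_SAMPLE_COUNT_1_BIT,
    .loadOp        = VK_ATTACHMENT_LOAD_OP_DONT_CARE,
    .storeOp       = VK_ATTACHMENT_STORE_OP_STORE,
    .initialLayout = VK_IMAGE_LAYOUT_UNDEFINED,
    .finalLayout   = VK_IMAGE_LAYOUT_PRESENT_SRC_KHR,
};

static const VkAttachmentReference colorRef   = { 0, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL };
static const VkAttachmentReference resolveRef = { 1, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL };

static const VkSubpassDescription subpass = {
    .pipelineBindPoint    = VK_PIPELINE_BIND_POINT_GRAPHICS,
    .colorAttachmentCount = 1,
    .pColorAttachments    = &colorRef,
    .pResolveAttachments  = &resolveRef,   /* resolve happens as the tile is written out */
};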

Some more advanced techniques, such as deferred lighting, can benefit from fragment shaders being able to programmatically access the current value stored in the framebuffer by previous fragments. Traditional algorithms might execute in multiple passes, first rendering to a texture in main memory to create the deferred lighting geometry buffer, and then reading that as a texture in a second render pass, at the cost of perhaps 128bpp of bandwidth per G-Buffer read and write. Tile-based renderers can enable lower bandwidth approaches where intermediate per-pixel data is shared directly from the tile memory, and only the final lit pixels are written back to memory. This functionality is exposed using extensions for OpenGL ES[3], or via the subpass feature in Vulkan.
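The sketch below illustrates this structure in Vulkan: a two-subpass render pass in which the lighting subpass consumes the G-buffer written by the geometry subpass as input attachments. The attachment indices (0-2 for the G-buffer, 3 for the final color target) and the surrounding setup are illustrative assumptions, not something prescribed by this article.

#include <vulkan/vulkan.h>

/* Subpass 0 writes three G-buffer color attachments (indices 0-2).
 * Subpass 1 reads them back as input attachments and writes only the
 * final lit color to attachment 3. Because the dependency is per-pixel,
 * a tile-based GPU can keep the G-buffer data in tile memory, and the
 * G-buffer attachments can use storeOp = DONT_CARE so they are never
 * written to main memory. */

static const VkAttachmentReference gbufferWrite[] = {
    { 0, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL },
    { 1, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL },
    { 2, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL },
};

static const VkAttachmentReference gbufferRead[] = {
    { 0, VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL },
    { 1, VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL },
    { 2, VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL },
};

static const VkAttachmentReference finalColor = { 3, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL };

static const VkSubpassDescription subpasses[2] = {
    {   /* Geometry subpass: fill the G-buffer. */
        .pipelineBindPoint    = VK_PIPELINE_BIND_POINT_GRAPHICS,
        .colorAttachmentCount = 3,
        .pColorAttachments    = gbufferWrite,
    },
    {   /* Lighting subpass: read the G-buffer from tile memory. */
        .pipelineBindPoint    = VK_PIPELINE_BIND_POINT_GRAPHICS,
        .inputAttachmentCount = 3,
        .pInputAttachments    = gbufferRead,
        .colorAttachmentCount = 1,
        .pColorAttachments    = &finalColor,
    },
};

The subpass dependency between the two subpasses should also set VK_DEPENDENCY_BY_REGION_BIT, which tells the driver that the lighting work for a pixel only depends on the G-buffer data for that same pixel – exactly the property that allows the data to stay resident in tile memory.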

Disadvantages

It is clear from the sections above that tile-based rendering carries a number of advantages, in particular giving very significant reductions in the bandwidth and power associated with framebuffer data, as well as being able to provide low-cost anti-aliasing. Nothing ever comes for free, so what is the downside?

The principal additional overhead of any tile-based rendering scheme is the point of hand-over from the geometry processing to the fragment processing. The output of the geometry processing stage – the per-vertex varying data and tiler intermediate state – must be written out to main memory and then subsequently read by the fragment processing stage. There is therefore a balance to be struck between the extra bandwidth costs related to geometry, and the bandwidth savings for the framebuffer data.

It is also important for developers to note that some rendering operations, such as tessellation, are disproportionately expensive on a tile-based architecture, because they are designed around the strengths of immediate mode architectures, where the explosion in geometry data can be buffered inside the on-chip FIFO rather than being written back to main memory.

Conclusions

In modern consumer electronics there is a significant shift towards higher resolution displays; 1080p is now normal in mass-market smartphones, 1440p is commonplace in high-end smartphones, and 4K is widely adopted in the DTV market. Screen resolution – and hence fragment processing and framebuffer bandwidth – is the dominant workload, and this plays directly to the strengths of a tile-based renderer, so this is an area where Mali really shines.

There are some pitfalls which developers must avoid to get the best performance from a tile-based renderer. Firstly, it is critical to set up render passes correctly to make the best use of the optimizations the tile-based approach can provide, or the advantages may be lost. Secondly, it is important to understand the limitations related to geometry processing, and to make sure that you get the most visual benefit per triangle and per byte of bandwidth spent. Both of these are sufficiently important that we will look at them in much more detail in later articles.

Footnotes

  1. Depth and stencil invalidation is automatically applied for EGL window surfaces, but is under application control for off-screen render-to-texture render passes. 
  2. The maximum supported MSAA sample count varies by product. All Mali GPUs support at least 4 samples per pixel, and more recent GPUs support up to 16 samples per pixel. 
  3. The OpenGL ES extensions of interest for in-tile shading are ARM_shader_framebuffer_fetch, ARM_shader_framebuffer_fetch_depth_stencil, and EXT_shader_pixel_local_storage.