Understanding Render Passes

Building efficient rendering pipelines

Published

December 2017 by Pete Harris


In a previous article in this series we looked at the fundamental rendering architecture of tile-based renderers, in particular looking at how Mali GPUs process fragments by coloring the framebuffer as a series of small subregions called tiles. During fragment shading the framebuffer working set for each tile is kept inside the GPU in a memory which is tightly coupled to each shader core, minimizing the number of power-hungry external DRAM accesses which are needed for each render.

To get the most benefit from the tile-based rendering approach it is critical that applications minimize the amount of memory traffic in to and out of the tile memory. Specifically, applications should avoid reading in older framebuffer values at the start of a render pass if they are going to be overdrawn, and avoid writing out values at the end of each pass which are transient and only needed for the duration of that render pass.

Render passes in the APIs

Due to the need to minimize the start of pass and end of pass overheads, render passes are essential concepts for tile-based renderers, but because of how the APIs have evolved not all of the APIs support them natively. It is therefore important to understand what we mean by a render pass, and how they are constructed using the APIs.

For our purposes here a render pass is a single execution of the rendering pipeline, rendering a single output image into a set of framebuffer attachments in memory. Each attachment will need initializing in tile memory at the start of the render pass, and may need writing back out to memory at the end of the render pass. Some attachments are likely to be transient; for example an application may be using both color and depth attachments during the render, but will only need to keep the color attachment for use in later rendering operations.
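
As a rough illustration of this definition, the sketch below estimates the external memory traffic of one render pass from which attachments are loaded at the start and stored at the end. It is a simplified model, not driver behavior: the attachment_model type and render_pass_traffic_bytes() function are hypothetical names introduced here, and the model assumes uncompressed attachments that each cover the full render area.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one attachment; not a type from any API. */
typedef struct {
    uint32_t bytes_per_pixel;
    bool load_at_start;  /* true if old contents are read into tile memory */
    bool store_at_end;   /* true if results are written back to memory */
} attachment_model;

/* External memory traffic, in bytes, for one render pass under this model. */
uint64_t render_pass_traffic_bytes(uint32_t width, uint32_t height,
                                   const attachment_model *attachments,
                                   size_t count)
{
    assert(count == 0 || attachments != NULL);

    uint64_t pixels = (uint64_t)width * height;
    uint64_t bytes = 0;

    for (size_t i = 0; i < count; i++) {
        uint64_t size = pixels * attachments[i].bytes_per_pixel;
        if (attachments[i].load_at_start) {
            bytes += size;  /* start of pass read from memory */
        }
        if (attachments[i].store_at_end) {
            bytes += size;  /* end of pass write to memory */
        }
    }
    return bytes;
}
```

For a 1920x1080 pass with 32bpp color and depth attachments where neither is loaded and only color is stored, this model gives 8,294,400 bytes of traffic; storing the transient depth attachment as well would double it.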

Render passes in Vulkan

Vulkan has added explicit support for render passes in the API via the VkRenderPass structure, and the individual framebuffer attachments the render pass contains via the VkAttachmentDescription structure. Each attachment description must specify explicit loadOp operations to perform at the start of the render pass, and storeOp operations to perform at the end of the render pass, so the API requires a clear statement of intent from the application developer.

Render passes in OpenGL ES

Unlike Vulkan, the older OpenGL ES API has no explicit render passes in the API so the driver must infer which rendering operations form a single render pass.

For the Mali drivers, drawing commands are added to the current render pass, and the render pass is submitted for processing when an API call changes the framebuffer or forces a flush of the queued work. The most common causes for ending a render pass are:

  • The application called glBindFramebuffer() to change the GL_FRAMEBUFFER or GL_DRAW_FRAMEBUFFER target.
  • The application called glFramebufferTexture*() or glFramebufferRenderbuffer() to change the attachments of the currently bound draw framebuffer object.
  • The application called eglSwapBuffers() to signal the end of a frame.
  • The application called glFlush() or glFinish() to explicitly flush any queued rendering.
  • The application created a sync object using glFenceSync() for some rendering in the current render pass and then called glClientWaitSync() to wait on that work completing.

Efficient render passes

To get the best performance for each render pass it is important to follow some basic rules to remove any redundant memory accesses.

Process each render pass once

The first rule to follow – in particular for OpenGL ES where render passes are inferred – is to make sure that each logical render pass in the application only turns into a single render pass when submitted to the hardware. Specifically this means that you should bind each framebuffer object only once, making all required draw calls before switching to the next.
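
To make the cost of rebinding concrete, here is a minimal sketch that counts how many render passes a sequence of framebuffer binds would produce. It is a toy model rather than the Mali driver's actual logic: count_render_passes() is a hypothetical helper, and it assumes that every bind is followed by draw calls and that a new pass starts whenever the bound framebuffer changes.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model (not driver code): returns the number of render passes implied
 * by a sequence of framebuffer object IDs, given in the order they are bound.
 */
size_t count_render_passes(const unsigned *bound_fbos, size_t count)
{
    assert(count == 0 || bound_fbos != NULL);

    size_t passes = 0;
    for (size_t i = 0; i < count; i++) {
        /* The first bind, or a bind of a different FBO, starts a new pass. */
        if (i == 0 || bound_fbos[i] != bound_fbos[i - 1]) {
            passes++;
        }
    }
    return passes;
}
```

Under these assumptions, binding framebuffer A, then B, then A again produces three passes even though the application only has two logical ones; the second visit to A has to reload whatever the first visit wrote out.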

Minimizing start of tile loads

Mali GPUs can cheaply initialize the tile memory to a clear color value at the start of a render pass without having to read back the old framebuffer content from memory. Unless you are deliberately drawing on top of what was rendered in a previous frame, ensure that you clear or invalidate all of your attachments at the start of each render pass before making any draw calls.

For OpenGL ES you can use any of the following calls to prevent a start of tile read from memory:

  • glClear()
  • glClearBuffer*()
  • glInvalidateFramebuffer()

... but note that these must be clears of the entire framebuffer, not a subregion of it.

Beware

Only the start of tile clear is free; calling glClear() or glClearBuffer*() after the first draw call in a render pass is not free and will result in a per-fragment operation.

For Vulkan set the loadOp for each attachment to either of:

  • VK_ATTACHMENT_LOAD_OP_CLEAR
  • VK_ATTACHMENT_LOAD_OP_DONT_CARE

Beware

Calling vkCmdClear*() commands to clear an attachment will result in a per-fragment operation; it is much more efficient to use the render pass loadOp operations to benefit from the fast tile initialization.

Note that on Mali there is no performance difference between a start-of-pass clear and a start-of-pass invalidate, but clears can cost performance on some other vendors' hardware. If you know you are going to completely cover the screen in opaque primitives and have no dependency on the starting value then it is preferable to use an invalidate or a "don't care" operation instead of a clear.

Minimizing end of tile stores

Once a tile has been completed it will be written back to main memory. For many applications some of the attachments may be transient and do not need to be kept beyond the duration of the render pass, so it is important that the driver is notified of which attachments it can safely discard.

For OpenGL ES you can notify the driver that an attachment is transient by marking the content as invalid using a call to glInvalidateFramebuffer() as the last "draw call" in the render pass.

Info

If you are writing applications using OpenGL ES 2.0 you must use glDiscardFramebufferEXT() from the EXT_discard_framebuffer extension, which all Mali GPUs support; glInvalidateFramebuffer() is only present in OpenGL ES 3.0 onwards.

For Vulkan set the storeOp for each transient attachment to VK_ATTACHMENT_STORE_OP_DONT_CARE. For additional efficiency the application can even avoid allocating physical backing memory for transient attachments by allocating the backing memory using VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT and constructing the VkImage with VK_IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT.

Handling packed depth-stencil

Depth and stencil attachments are commonly allocated together in memory using a packed pixel format, such as D24S8. Because the two components share each packed pixel, you only get bandwidth savings for tile loads and stores if you read neither attachment during load and write neither attachment during store.

To reliably get the best performance we therefore recommend:

  • If you only need a depth buffer allocate a depth-only format such as D24 or D24X8, and simply never attach a stencil attachment.
  • If you only need a stencil buffer allocate a stencil-only format such as S8, and then never attach a depth attachment.
  • If you use a packed depth-stencil attachment always attach both attachments, clear both attachments on load, and invalidate both attachments on store.
  • If you use a packed depth-stencil attachment and need to persist one of the attachments for use in a later render pass it is still worth invalidating the other at the end of the render pass as it may allow some bandwidth savings when using framebuffer compression.
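
The packed-format rule above can be sketched as a small model. This is an illustration only: packed_ds_usage and packed_ds_traffic_bytes() are hypothetical names, the model assumes a 32-bit packed pixel such as D24S8, and it ignores the partial savings that framebuffer compression may still give when only one component is invalidated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-component usage of one packed depth-stencil attachment. */
typedef struct {
    bool load_depth;     /* depth is read at the start of the pass */
    bool load_stencil;   /* stencil is read at the start of the pass */
    bool store_depth;    /* depth is written at the end of the pass */
    bool store_stencil;  /* stencil is written at the end of the pass */
} packed_ds_usage;

/*
 * Model assumption: touching either component moves the whole packed pixel,
 * so a load or store is only avoided when neither component needs it.
 */
uint64_t packed_ds_traffic_bytes(uint32_t width, uint32_t height,
                                 uint32_t bytes_per_pixel,
                                 packed_ds_usage usage)
{
    assert(bytes_per_pixel > 0);

    uint64_t size = (uint64_t)width * height * bytes_per_pixel;
    uint64_t bytes = 0;

    if (usage.load_depth || usage.load_stencil) {
        bytes += size;
    }
    if (usage.store_depth || usage.store_stencil) {
        bytes += size;
    }
    return bytes;
}
```

In this model, clearing depth but leaving stencil to be loaded costs exactly as much as loading both, which is why the recommendations treat the two components as a pair.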

Worked example

In the example below we consider a 3D game rendering pipeline which builds a frame from four component render passes.

  • FBO1 is an off-screen pass rendering a depth shadow map for a dynamic light source. It uses only a depth attachment.
  • FBO2 is an off-screen pass rendering a velocity buffer for motion blur. It uses color and depth attachments.
  • FBO3 is an off-screen pass implementing the main 3D render. It uses color, depth, and stencil attachments, and reads the output from FBO1 as an input texture.
  • FBO0 is the final window render which performs some 2D post-processing, combining the outputs from FBO2 and FBO3 as input textures to implement a motion blur effect.

If we view this as a render pass data flow graph we get:

[Figure: render pass data flow graph for the four-pass pipeline]

Each large box represents a single render pass, the smaller boxes to the left of each render pass represent input attachments in memory, and the smaller boxes to the right of each render pass represent the output attachments in memory. Arrows in to the top of each render pass represent attachments rendered earlier being used as textures in later passes.

Inefficient implementation

The abbreviated call sequence below is functional, including the minimal number of clears for correctness, but breaks all of our efficiency rules at least once:

// Start rendering the off-screen shadow map pass
glBindFramebuffer(1)
glClear(GL_DEPTH_BUFFER_BIT)
glDraw...(...)                      // Make some draws to FBO1
...

// Complete rendering the off-screen velocity pass
glBindFramebuffer(2)
glClear(GL_DEPTH_BUFFER_BIT)
glDraw...(...)                      // Make all draws to FBO2
...

// Complete rendering the off-screen shadow map pass
glBindFramebuffer(1)
glDraw...(...)                      // Make remaining draws to FBO1
...

// Complete rendering the off-screen main 3D pass
glBindFramebuffer(3)
glClear(GL_DEPTH_BUFFER_BIT)
glDraw...(...)                      // Make all draws to FBO3
...

// Complete rendering the window surface with motion blur
glBindFramebuffer(0)
glDraw...(...)                      // Make all draws to FBO0
eglSwapBuffers()                    // Finish the frame

 

If we revisit the render pass graph that this will create we can see a number of issues here:

[Figure: render pass graph for the inefficient implementation]

Each light grey arrow represents some bandwidth saved by a clear or an invalidate, but each orange arrow represents some wasted bandwidth where an attachment is being read or written to main memory unnecessarily. This happens because:

  • FBO1 is being created as two hardware render passes because it is rendered as two distinct sets of API calls, separated by the rendering of FBO2.
  • FBO2 has redundant color read-backs because the color attachment is not cleared.
  • FBO2 has redundant depth write-backs because the depth attachment is not invalidated.
  • FBO3 has redundant color and stencil read-backs because the attachments are not cleared. Note that this effectively means that the depth data will also be read back, as it is stored interleaved with the stencil data in memory.
  • FBO3 has redundant depth and stencil write-backs because the attachments are not invalidated.

Info

Even though clears and invalidates are not present for FBO0, the default behavior of EGL window surfaces is to implicitly clear all window surface attachments at the start of the render pass and invalidate depth and stencil at the end, so explicit calls are not necessary.

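If we assume the application is rendering all passes at 1080p, with 32bpp for color (RGBA8), depth (D24X8), and depth-stencil (D24S8), then each redundant read-back or write-back moves one full 1080p surface. The list above contains seven of them: the extra depth write-back and read-back caused by splitting FBO1, the color read-back and depth write-back for FBO2, and the color read-back, depth-stencil read-back, and depth-stencil write-back for FBO3. The total wasted bandwidth in this case is:

frameCost = 1920 * 1080 * 4 * 7
          = 58.1 MB

... which is 3.48 GB/s at 60 FPS, a substantial fraction of the system memory bandwidth and energy budget wasted on completely unnecessary memory traffic. Framebuffer compression and other GPU optimizations, such as hidden surface removal, can help reduce this, but it is far better if the application removes the redundancy completely to guarantee that it has no overhead.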
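
The arithmetic behind this kind of estimate can be written as a pair of small helpers. This is a sketch under stated assumptions: the function names are hypothetical, every redundant transfer is assumed to move one full uncompressed surface of the same size, and a gigabyte is taken as 10^9 bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes wasted per frame by a number of redundant full-surface transfers. */
uint64_t wasted_bytes_per_frame(uint32_t width, uint32_t height,
                                uint32_t bytes_per_pixel,
                                uint32_t redundant_transfers)
{
    return (uint64_t)width * height * bytes_per_pixel * redundant_transfers;
}

/* Converts a per-frame byte count to GB/s, using 1 GB = 10^9 bytes. */
double bandwidth_gb_per_second(uint64_t bytes_per_frame,
                               uint32_t frames_per_second)
{
    assert(frames_per_second > 0);
    return (double)bytes_per_frame * frames_per_second / 1e9;
}
```

Pass the number of redundant read-backs and write-backs counted in the render pass graph as redundant_transfers. Each one costs 8,294,400 bytes per frame at 1080p with 32bpp attachments.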

Efficient implementation

The abbreviated call sequence below implements the same rendering pipeline, but following our efficiency rules:

#define CLEAR_CDS (GL_COLOR_BUFFER_BIT | \
                   GL_DEPTH_BUFFER_BIT | \
                   GL_STENCIL_BUFFER_BIT)

static const GLenum INVALID_DS[2] = {
    GL_DEPTH_ATTACHMENT,
    GL_STENCIL_ATTACHMENT
};

// Complete rendering the off-screen shadow map pass
glBindFramebuffer(1)
glClear(CLEAR_CDS)                  // Clear all attachments
glDraw...(...)                      // Make all draws to FBO1
...

// Complete rendering the off-screen velocity pass
glBindFramebuffer(2)
glClear(CLEAR_CDS)                  // Clear all attachments
glDraw...(...)                      // Make all draws to FBO2
...
glInvalidateFramebuffer(GL_FRAMEBUFFER, 2, INVALID_DS)   // Discard depth and stencil

// Complete rendering the off-screen 3D pass
glBindFramebuffer(3)
glClear(CLEAR_CDS)                  // Clear all attachments
glDraw...(...)                      // Make all draws to FBO3
...
glInvalidateFramebuffer(GL_FRAMEBUFFER, 2, INVALID_DS)   // Discard depth and stencil

// Complete rendering the window surface with motion blur
glBindFramebuffer(0)
glDraw...(...)                      // Make all draws to FBO0
eglSwapBuffers()                    // Finish the frame

 

If we view this in terms of the rendering pipeline graph again, then the efficiency savings become immediately visible when comparing with the inefficient example:

[Figure: render pass graph for the efficient implementation]

Each light grey arrow represents some bandwidth saved by a clear or an invalidate, and we no longer have any wasted memory bandwidth due to unnecessary tile loads or stores.

Multi-sample anti-aliasing

Multi-sample anti-aliasing (MSAA) is a low cost approach for improving the quality of rendering by reducing the impact of "jaggies" – aliasing due to pixel sampling of geometry – along the edges of primitives. It is regaining importance on mobile devices due to the proliferation of augmented reality and virtual reality headsets, which amplify the perceptual strength of jagged edges.

Fundamentally MSAA works by using multiple samples per pixel for color and depth during the main rendering process, including storing these in the framebuffer, and then reducing to a single value per pixel once the render is completed. For Mali the GPU is optimized for storing 4 samples per pixel, but higher MSAA levels are possible in some of the newer Mali GPUs for some additional cost.

A naive implementation of MSAA requires all of the additional samples to be written back to memory and then read back into the GPU to be resolved to a single value per pixel. This is very expensive and should be avoided at all costs. If we assume a 1440p panel rendering RGBA8 pixels at 60 FPS – a common VR configuration – then the bandwidth cost of these additional samples for 4xMSAA would be:

bytesPerFrame4x = 2560 * 1440 * 4 * 4
bytesPerFrame1x = 2560 * 1440 * 4 * 1

# Additional 4x bandwidth is doubled because the additional samples
# are written by one pass and then re-read to resolve the final color
bytesPerFrame = ((bytesPerFrame4x * 2) + bytesPerFrame1x)
bytesPerSecond = bytesPerFrame * 60
               = 7.9 GB/s

Our usual rule of thumb for external DDR bandwidth is that it will cost ~100pJ per byte of access, so supporting this bandwidth would use 790mW of our total device power budget just to write the framebuffer needed.

One of the biggest benefits of a tile-based renderer is that the local memory can store the additional samples needed for MSAA during the render and resolve those back to a single color before the tile is written, so the bandwidth cost should be the same as a single-sampled framebuffer:

bytesPerFrame1x = 2560 * 1440 * 4 * 1

# All additional 4x bandwidth is kept entirely inside the tile memory
bytesPerSecond = bytesPerFrame1x * 60
               = 884 MB/s

... which is much more reasonable. However, to benefit from this low-bandwidth inline resolve the application has to "opt-in" and take explicit steps to use the technique, as it does change the behavior of the framebuffers in memory which means that the driver cannot do it transparently.

For OpenGL ES the application should use the EXT_multisampled_render_to_texture extension, which enables an implicit resolve at the end of a render pass rather than requiring a separate glBlitFramebuffer() resolve render pass.

For Vulkan the render pass should provide single sampled pResolveAttachments to store the resolved data, and set the storeOp for the transient multi-sampled attachments to VK_ATTACHMENT_STORE_OP_DONT_CARE. For additional efficiency the application can even avoid allocating physical backing memory for transient attachments by allocating the backing memory using VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT and constructing the VkImage with VK_IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT.

When correctly keeping all of the additional MSAA bandwidth inside the GPU, we can dramatically reduce the system bandwidth impact to just a ninth of that used by the naive implementation, which will improve both performance and energy efficiency significantly.
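
The MSAA figures in this section can be reproduced with the short sketch below. The function names are hypothetical, and the model follows the assumptions already stated in the text: the naive path writes the multi-sampled data, reads it back for the resolve, and writes the resolved image, while the inline resolve writes only the resolved image. The energy helper takes the cost per byte as a parameter, so the ~100pJ rule of thumb is an input rather than a built-in constant.

```c
#include <assert.h>
#include <stdint.h>

/* Size in bytes of one framebuffer surface with the given sample count. */
uint64_t framebuffer_bytes(uint32_t width, uint32_t height,
                           uint32_t bytes_per_pixel, uint32_t samples)
{
    return (uint64_t)width * height * bytes_per_pixel * samples;
}

/* Naive MSAA: multi-sampled write, multi-sampled read, resolved write. */
uint64_t msaa_naive_bytes_per_frame(uint32_t width, uint32_t height,
                                    uint32_t bytes_per_pixel, uint32_t samples)
{
    assert(samples > 0);
    return 2 * framebuffer_bytes(width, height, bytes_per_pixel, samples)
         + framebuffer_bytes(width, height, bytes_per_pixel, 1);
}

/* Inline resolve: only the resolved single-sampled image leaves the GPU. */
uint64_t msaa_inline_bytes_per_frame(uint32_t width, uint32_t height,
                                     uint32_t bytes_per_pixel)
{
    return framebuffer_bytes(width, height, bytes_per_pixel, 1);
}

/* Power in watts for a given bandwidth and energy cost per byte. */
double dram_power_watts(uint64_t bytes_per_second, double picojoules_per_byte)
{
    return (double)bytes_per_second * picojoules_per_byte * 1e-12;
}
```

For a 2560x1440 RGBA8 target with 4 samples the naive path moves 132,710,400 bytes per frame and the inline resolve moves 14,745,600, which is the factor of nine quoted above. At 60 FPS the naive figure is 7,962,624,000 bytes per second, or roughly 0.8W at 100pJ per byte.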

Conclusions

In this article we have explored how an application can efficiently construct its render passes to get the best performance and minimal external memory traffic when running on a tile-based renderer.

In summary, the guidelines are:

  1. Bind each render pass once, and render it to completion without unbinding it
  2. Clear or invalidate all attachments at the start of each render pass
  3. Invalidate transient attachments at the end of each render pass, before unbinding the pass
  4. Ensure you get an inline resolve when using multi-sample anti-aliasing