Worked Example

This example describes a 3D game rendering pipeline that builds a frame from four component render passes:

  1. FBO1 is an off-screen pass that renders a velocity texture for motion blur. It uses color and depth attachments.
  2. FBO2 is an off-screen pass that renders a depth shadow map for a dynamic light source. It uses only a depth attachment.
  3. FBO3 is an off-screen pass that implements the main 3D render. It uses color, depth, and stencil attachments, and reads the output from FBO2 as an input texture.
  4. FBO0 is the final window render which performs some 2D post-processing, combining the outputs from FBO1 and FBO3 as input textures to implement a motion blur effect.

The following render pass data flow diagram illustrates this rendering pipeline:

Basic pipeline

Each large box represents a single render pass. The smaller boxes to the left of each render pass represent input attachments in memory, and the smaller boxes to the right of each render pass represent the output attachments in memory.

Arrows entering the top of a render pass represent attachments that were rendered by an earlier pass and are now being read as textures.

Inefficient Implementation

This abbreviated call sequence produces the correct output, but it breaks every efficiency rule at least once:

// Start rendering the off-screen shadow map pass
glBindFramebuffer(2);
glClear(GL_DEPTH_BUFFER_BIT);
glDrawElements(...);                 // Make some draws to FBO2
...

// Complete rendering the off-screen velocity pass
glBindFramebuffer(1);
glClear(GL_DEPTH_BUFFER_BIT);
glDrawElements(...);                 // Make all draws to FBO1
...

// Complete rendering the off-screen shadow map pass
glBindFramebuffer(2);
glDrawElements(...);                 // Make remaining draws to FBO2
...

// Complete rendering the off-screen main 3D pass
glBindFramebuffer(3);
glClear(GL_DEPTH_BUFFER_BIT | GL_STENCIL_BUFFER_BIT);
glDrawElements(...);                 // Make all draws to FBO3
...

// Complete rendering the window surface with motion blur
glBindFramebuffer(0);
glDrawElements(...);                 // Make all draws to FBO0
eglSwapBuffers();                    // Finish the frame

This is a render pass data flow diagram of an inefficient rendering pipeline:

Inefficient Timeline

Looking at this diagram, we can identify some issues. Each light grey arrow represents some bandwidth saved by a clear or an invalidate. However, each orange arrow represents wasted bandwidth, where a render pass unnecessarily reads an attachment from main memory or writes one back to it.

This happens because:

  • FBO2 is split into two hardware render passes because its draw calls are issued in two separate batches, with the rendering of FBO1 in between.
  • FBO1 has redundant color read-backs because the call sequence does not clear the color attachment.
  • FBO1 has redundant depth write-backs because the call sequence does not invalidate the depth attachment.
  • FBO3 has redundant color read-backs because the call sequence does not clear the color attachment.
  • FBO3 has redundant depth and stencil write-backs because the call sequence does not invalidate the attachments.

Note: Even though FBO0 does not have clears and invalidates, explicit calls are not necessary as the default behavior of EGL window surfaces is to implicitly clear all window surface attachments at the start of the render pass and then invalidate the depth and stencil at the end.
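To make the first point concrete, the following is a minimal sketch that counts hardware render passes from the order in which framebuffers receive draws. It assumes a deliberately simplified model, not the driver's actual logic: a new pass starts whenever the draw target changes. The names `count_render_passes`, `inefficient_order`, and `grouped_order` are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified model (an assumption for illustration only): a new hardware
 * render pass starts on the first draw and whenever the framebuffer that
 * receives draws changes. */
static int count_render_passes(const int *draw_targets, size_t n)
{
    assert(draw_targets != NULL || n == 0);

    int passes = 0;
    for (size_t i = 0; i < n; ++i) {
        if (i == 0 || draw_targets[i] != draw_targets[i - 1]) {
            ++passes;
        }
    }
    return passes;
}

/* Framebuffer IDs in the order the inefficient sequence draws to them:
 * FBO2 appears twice because FBO1 is rendered in between. */
static const int inefficient_order[] = { 2, 1, 2, 3, 0 };

/* The same work with all FBO2 draws grouped together. */
static const int grouped_order[] = { 2, 1, 3, 0 };
```

Under this model, `count_render_passes(inefficient_order, 5)` yields five passes for four logical framebuffers, while `count_render_passes(grouped_order, 4)` yields four. The extra pass is what forces the additional depth write-back and read-back for FBO2.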

If we assume the application is rendering all passes at 1080p, with 32bpp for color (RGBA8), depth (D24X8), and depth-stencil (D24S8), the total wasted bandwidth per frame is:

frameCost = 1920 * 1080 * 4 * 6
          = 49.8 MB

This is 2.99 GB/s at 60 FPS, a significant share of the system memory bandwidth and energy budget that is wasted on unnecessary memory traffic. Framebuffer compression and other GPU optimizations, such as hidden surface removal, can help reduce this.

However, it is more efficient for the application to remove this redundancy itself, which guarantees that the overhead never occurs.
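The bandwidth estimate above can be reproduced with a short calculation. This is a sketch under stated assumptions: the factor of 6 is taken to be the number of redundant full-frame attachment transfers (one for each issue listed above, with the split FBO2 pass counted as one extra write-back plus one extra read-back), and sizes use decimal units. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes moved per frame by redundant full-frame attachment transfers. */
static uint64_t wasted_bytes_per_frame(uint32_t width, uint32_t height,
                                       uint32_t bytes_per_pixel,
                                       uint32_t redundant_transfers)
{
    assert(bytes_per_pixel > 0);
    return (uint64_t)width * height * bytes_per_pixel * redundant_transfers;
}

/* Scale the per-frame cost by the frame rate. */
static uint64_t wasted_bytes_per_second(uint64_t bytes_per_frame, uint32_t fps)
{
    return bytes_per_frame * fps;
}
```

For 1920 x 1080, 4 bytes per pixel, and 6 transfers, `wasted_bytes_per_frame` returns 49,766,400 bytes, and `wasted_bytes_per_second` at 60 FPS returns 2,985,984,000 bytes, which matches the 2.99 GB/s figure.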

Efficient Implementation

The following abbreviated call sequence implements the same rendering pipeline as the inefficient implementation. However, this time it follows our efficiency rules:

#define CLEAR_CDS (GL_COLOR_BUFFER_BIT | \
                   GL_DEPTH_BUFFER_BIT | \
                   GL_STENCIL_BUFFER_BIT)

static const GLenum INVALID_DS[2] = {
    GL_DEPTH_ATTACHMENT,
    GL_STENCIL_ATTACHMENT
};

// Complete rendering the off-screen shadow map pass
glBindFramebuffer(2);
glClear(CLEAR_CDS);                  // Clear all attachments
glDrawElements(...);                 // Make all draws to FBO2
...

// Complete rendering the off-screen velocity pass
glBindFramebuffer(1);
glClear(CLEAR_CDS);                  // Clear all attachments
glDrawElements(...);                 // Make all draws to FBO1
...
glInvalidateFramebuffer(GL_FRAMEBUFFER, 2, INVALID_DS);

// Complete rendering the off-screen 3D pass
glBindFramebuffer(3);
glClear(CLEAR_CDS);                  // Clear all attachments
glDrawElements(...);                 // Make all draws to FBO3
...
glInvalidateFramebuffer(GL_FRAMEBUFFER, 2, INVALID_DS);

// Complete rendering the window surface with motion blur
glBindFramebuffer(0);
glDrawElements(...);                 // Make all draws to FBO0
eglSwapBuffers();                    // Finish the frame

Looking at the render pass graph that this call sequence creates, the efficiency savings are clear to see when compared to the inefficient example:

Efficient Pipeline 

Each light grey arrow represents some bandwidth saved by a clear or an invalidate, and compared to the inefficient example, there is no wasted memory bandwidth due to unnecessary tile loads or stores.
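The comparison between the two call sequences can be sketched as a small bookkeeping model. It assumes two simplified rules that are not taken from any driver implementation: an attachment that is not cleared at the start of a pass is loaded from memory, and an attachment that is neither invalidated nor read by a later consumer is stored needlessly. The type `AttachmentUse`, the function `count_wasted_transfers`, and both tables are hypothetical. FBO0 is left out because EGL handles its clears and invalidates implicitly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    const char *name;         /* attachment within one hardware render pass */
    bool cleared_at_start;    /* covered by glClear before the first draw   */
    bool invalidated_at_end;  /* named in glInvalidateFramebuffer           */
    bool read_by_consumer;    /* sampled as a texture by a later pass       */
} AttachmentUse;

/* Count avoidable tile loads and stores under the simplified rules above. */
static int count_wasted_transfers(const AttachmentUse *uses, size_t n)
{
    assert(uses != NULL || n == 0);

    int wasted = 0;
    for (size_t i = 0; i < n; ++i) {
        /* Not cleared: old contents are read back from memory. */
        if (!uses[i].cleared_at_start) {
            ++wasted;
        }
        /* Not invalidated and never consumed: a needless write-back. */
        if (!uses[i].invalidated_at_end && !uses[i].read_by_consumer) {
            ++wasted;
        }
    }
    return wasted;
}

/* Inefficient sequence: FBO2 is split into two hardware passes. */
static const AttachmentUse inefficient[] = {
    { "FBO2 depth (first part)",  true,  false, false },
    { "FBO2 depth (second part)", false, false, true  },
    { "FBO1 color",               false, false, true  },
    { "FBO1 depth",               true,  false, false },
    { "FBO3 color",               false, false, true  },
    { "FBO3 depth+stencil",       true,  false, false },
};

/* Efficient sequence: clear everything, invalidate what is not consumed. */
static const AttachmentUse efficient[] = {
    { "FBO2 depth",         true, false, true  },
    { "FBO1 color",         true, false, true  },
    { "FBO1 depth",         true, true,  false },
    { "FBO3 color",         true, false, true  },
    { "FBO3 depth+stencil", true, true,  false },
};
```

With these tables, `count_wasted_transfers(inefficient, 6)` returns 6 and `count_wasted_transfers(efficient, 5)` returns 0. Depth and stencil are counted as one packed D24S8 attachment, in line with the 32bpp assumption used for the bandwidth estimate.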
