Multi-Sample Anti-Aliasing

Multi-Sample Anti-Aliasing (MSAA) is a low-cost approach for improving the quality of rendering by reducing the impact of jaggies along the edges of primitives. Jaggies are the result of aliasing due to pixel sampling of geometry.

MSAA is gaining importance on mobile devices due to the proliferation of augmented reality and virtual reality headsets. These headsets effectively make pixels larger because of the lenses scaling the screen to fill the field of vision.

MSAA works by using multiple samples per pixel for color and depth during the main rendering process, and storing these samples in the framebuffer. Once the render is complete, MSAA reduces these samples to a single value per pixel. Mali GPUs are optimized for storing 4 samples per pixel, but some of the newer products support higher MSAA levels at some additional performance cost.
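As a minimal sketch of the resolve step, the following assumes a simple box filter that averages all samples of a pixel with equal weight. The text does not specify the exact filter the hardware uses, and the function name `resolve_pixel` and the sample values are illustrative only.

```python
def resolve_pixel(samples):
    """Average a list of RGBA8 sample tuples into one RGBA8 tuple.

    Assumes an equal-weight box filter; rounds to the nearest integer.
    """
    count = len(samples)
    return tuple((sum(channel) + count // 2) // count for channel in zip(*samples))


# An edge pixel at 4xMSAA: two samples are covered by a white triangle,
# the other two still hold the black clear color.
edge_samples = [
    (255, 255, 255, 255),
    (255, 255, 255, 255),
    (0, 0, 0, 255),
    (0, 0, 0, 255),
]
resolved = resolve_pixel(edge_samples)  # mid-grey, partially covered edge
```

The partially covered pixel resolves to an intermediate color, which is what softens the jagged edge.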

A naive implementation of MSAA writes all of the additional samples back to DRAM, and then reads them back into the GPU in a separate pass that resolves them to a single value per pixel. This is a very expensive use of limited bandwidth and should therefore be avoided.

If we assume a 1440p panel rendering RGBA8 pixels at 60 FPS, which is a common VR configuration, then the bandwidth cost of these additional samples for 4xMSAA is:

```python
bytesPerFrame4x = 2560 * 1440 * 4 * 4
bytesPerFrame1x = 2560 * 1440 * 4 * 1

# Additional 4x bandwidth is doubled because the additional samples
# are written by one pass and then re-read to resolve the final color
bytesPerFrame = (bytesPerFrame4x * 2) + bytesPerFrame1x
bytesPerSecond = bytesPerFrame * 60
# bytesPerSecond = 7,962,624,000 bytes/s, roughly 7.9 GB/s
```

In general, external DDR bandwidth costs 100mW per GB/s. Therefore, in our example, this overhead uses 790mW of our total 2.5 watt device power budget just to write the framebuffer.
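The power estimate can be reproduced with a short calculation. This is a sketch under the assumptions stated in the text (100 mW per GB/s, a 2.5 W device budget, 1 GB taken as 10^9 bytes); the constant and function names are illustrative.

```python
MW_PER_GB_PER_SECOND = 100  # rule-of-thumb DDR cost quoted in the text
DEVICE_BUDGET_MW = 2500     # 2.5 W total device power budget from the text


def ddr_power_mw(bytes_per_second):
    """Convert a DRAM traffic rate into an approximate power cost in mW."""
    return bytes_per_second / 1e9 * MW_PER_GB_PER_SECOND


# Naive 4xMSAA at 2560x1440, RGBA8, 60 FPS: samples written, re-read,
# and the resolved color written once.
naive_bytes_per_second = ((2560 * 1440 * 4 * 4) * 2 + 2560 * 1440 * 4) * 60
naive_power_mw = ddr_power_mw(naive_bytes_per_second)
budget_share = naive_power_mw / DEVICE_BUDGET_MW

print(f"{naive_power_mw:.0f} mW, {budget_share:.0%} of the budget")
```

Without intermediate rounding the result is about 796 mW, or roughly a third of the budget. The 790 mW figure follows from rounding the bandwidth down to 7.9 GB/s first.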

One of the biggest benefits of a tile-based renderer is that the local tile memory can store all of the additional samples needed for MSAA during the render, and resolve them to a single color before the tile is written out. The bandwidth cost should therefore be the same as for a single-sampled framebuffer:

```python
bytesPerFrame1x = 2560 * 1440 * 4 * 1

# All additional 4x bandwidth is kept entirely inside the tile memory
bytesPerSecond = bytesPerFrame1x * 60
# bytesPerSecond = 884,736,000 bytes/s, roughly 884 MB/s
```

This uses just 88mW of memory power. To benefit from this low-bandwidth inline resolve, the application has to opt in and take explicit steps to use the technique.

For OpenGL ES, the application should use the [EXT_multisampled_render_to_texture][EXT_msaa] extension. This enables an implicit resolve at the end of a render pass rather than requiring a separate glBlitFramebuffer() resolve render pass.

For Vulkan, the render pass should provide single sampled pResolveAttachments to store the resolved data and set the storeOp for the multi-sampled attachments to VK_ATTACHMENT_STORE_OP_DONT_CARE. For additional efficiency the application can even avoid allocating physical backing memory for transient attachments by allocating the backing memory using VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT and constructing the VkImage with VK_IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT.
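The Vulkan requirements above can be summarized as a checklist. The sketch below is a plain-Python model, not a Vulkan binding: the `Attachment` class and `inline_resolve_issues` function are hypothetical helpers invented for illustration, and only the enum names (kept as strings) come from the Vulkan API.

```python
from dataclasses import dataclass


@dataclass
class Attachment:
    """Hypothetical stand-in for the settings of one render pass attachment."""
    samples: int             # 1 for single-sampled, 4 for 4xMSAA
    store_op: str            # Vulkan storeOp enum name, kept as a string
    usage_flags: frozenset   # VkImageUsageFlagBits names
    memory_flags: frozenset  # VkMemoryPropertyFlagBits names


def inline_resolve_issues(msaa, resolve):
    """Return the ways a setup deviates from the recommendations above."""
    issues = []
    if resolve is None or resolve.samples != 1:
        issues.append("no single-sampled resolve attachment")
    if msaa.store_op != "VK_ATTACHMENT_STORE_OP_DONT_CARE":
        issues.append("multi-sampled attachment is stored to memory")
    # The two checks below cover the optional, additional-efficiency step.
    if "VK_IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT" not in msaa.usage_flags:
        issues.append("multi-sampled image is not marked transient")
    if "VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT" not in msaa.memory_flags:
        issues.append("multi-sampled image memory is not lazily allocated")
    return issues


msaa_color = Attachment(
    samples=4,
    store_op="VK_ATTACHMENT_STORE_OP_DONT_CARE",
    usage_flags=frozenset({
        "VK_IMAGE_USAGE_COLOR_ATTACHMENT_BIT",
        "VK_IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT",
    }),
    memory_flags=frozenset({"VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT"}),
)
resolved_color = Attachment(
    samples=1,
    store_op="VK_ATTACHMENT_STORE_OP_STORE",
    usage_flags=frozenset({"VK_IMAGE_USAGE_COLOR_ATTACHMENT_BIT"}),
    memory_flags=frozenset({"VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT"}),
)
issues = inline_resolve_issues(msaa_color, resolved_color)  # empty list
```

In a real application these settings live in the attachment descriptions, the subpass `pResolveAttachments` array, and the image and memory allocation calls. The model only shows which combination of values is expected.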

When you correctly keep all of the additional MSAA bandwidth inside the GPU, you can dramatically reduce the system bandwidth impact to just a ninth of that used by the naive implementation. This greatly improves both performance and energy efficiency.
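The one-ninth figure follows directly from the bandwidth model used earlier: the naive path writes and re-reads N samples and then writes one resolved value, so it costs 2N + 1 times the inline path. The sketch below checks this; the function names are illustrative, and the 2x and 8x sample counts are included only as assumptions for comparison.

```python
def naive_bytes_per_frame(width, height, bytes_per_pixel, samples):
    """Samples written once and read once, plus one resolved write."""
    multisampled = width * height * bytes_per_pixel * samples
    resolved = width * height * bytes_per_pixel
    return multisampled * 2 + resolved


def inline_bytes_per_frame(width, height, bytes_per_pixel):
    """Only the resolved color leaves the tile memory."""
    return width * height * bytes_per_pixel


inline = inline_bytes_per_frame(2560, 1440, 4)
ratios = {
    samples: naive_bytes_per_frame(2560, 1440, 4, samples) // inline
    for samples in (2, 4, 8)
}
# ratios maps each sample count to 2 * samples + 1, so 4xMSAA gives 9
```

Under this model the relative saving grows with the sample count, since the inline cost stays fixed while the naive cost scales with the number of samples.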
