Multi-Sample Anti-Aliasing (MSAA) is a low-cost approach for improving the quality of rendering by reducing the impact of jaggies along the edges of primitives. Jaggies are the result of aliasing due to pixel sampling of geometry.
MSAA is gaining importance on mobile devices due to the proliferation of augmented reality and virtual reality headsets. These headsets effectively make pixels larger because of the lenses scaling the screen to fill the field of vision.
MSAA works by using multiple samples-per-pixel for color and depth during the main rendering process, including storing these in the framebuffer. MSAA then reduces these samples to a single value per pixel once the render is completed. Mali GPUs are optimized for storing 4 samples per pixel, but higher MSAA levels are possible in some of the newer products for some additional performance cost.
A naive implementation of MSAA requires a sequence call that writes all the additional samples back to DRAM and then reads them back into the GPU where the GPU resolves them to a single value per pixel. This is a very expensive use of limited bandwidth and should therefore be avoided.
If we assume a 1440p panel rendering RGBA8 pixels at 60 FPS, which is a common VR configuration, then the bandwidth cost of these additional samples for 4xMSAA is:
python bytesPerFrame4x = 2560 * 1440 * 4 * 4 bytesPerFrame1x = 2560 * 1440 * 4 * 1 # Additional 4x bandwidth is doubled because the additional samples # are written by one pass and then re-read to resolve the final color bytesPerFrame = ((bytesPerFrame4x * 2) + bytesPerFrame1x) bytesPerSecond = bytesPerFrame * 60 = 7.9 GB/s
In general, external DDR bandwidth costs 100mW per GB/s. Therefore, in our example, this overhead uses 790mW of our total 2.5 watt device power budget just to write the framebuffer.
One of the biggest benefits of a tile-based renderer is that the local memory can store all of the additional samples needed for MSAA during the render and resolve those back to a single color before the tile is written, so the bandwidth cost should be the same a single sampled framebuffer:
python bytesPerFrame1x = 2560 * 1440 * 4 * 1 # All additional 4x bandwidth is kept entirely inside the tile memory bytesPerSecond = bytesPerFrame1x * 60 = 884 MB/s
This uses just 88mW of memory power. To benefit from this low-bandwidth inline resolve, the application has to opt-in and take explicit steps to use the technique.
For OpenGL ES, the application should use the Khronos EXT_msaa extension. This enables an implicit resolve at the end of a render pass rather than requiring a separate
glBlitFramebuffer() resolve render pass.
For Vulkan, the render pass should provide single sampled
pResolveAttachments to store the resolved data and set the
storeOp for the multi-sampled attachments to
VK_ATTACHMENT_STORE_OP_DONT_CARE. For additional efficiency the application can even avoid allocating physical backing memory for transient attachments by allocating the backing memory using
VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT and constructing the
When you correctly keep all of the additional MSAA bandwidth inside the GPU, you can dramatically reduce the system bandwidth impact to just a ninth of that used by the naive implementation. This greatly improves both performance and energy.