Attribute Buffer Encoding

Attribute buffers store the data for each vertex. The data is addressed through a base pointer and a row stride, which allows flexible storage layouts. Attribute buffer encoding is the way in which multiple vertex attributes are packed into memory buffers.

Attribute interleaving

There are two high-level strategies for attribute storage that an application can choose to use:

  1. Non-interleaved: A structure of arrays.
  2. Interleaved: An array of structures.

Non-interleaved storage stores each individual attribute in a unique array, with data fetches for a single vertex gathered from multiple arrays.

The image below shows you how attribute storage is handled in a non-interleaved buffer packing array:


Interleaved storage is generally preferred over non-interleaved storage as it stores the different attributes for each vertex serially in memory. Interleaved storage also minimizes the number of unique data fetches needed for each vertex, as well as the number of redundant bytes fetched around the boundaries of the used parts of the vertex buffer.

The image below shows you how attribute storage is handled in an interleaved buffer packing array:

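The difference between the two layouts comes down to the base offset and stride of each attribute. The following is a minimal sketch in C, not taken from any Mali or graphics API: the `attrib_layout` type, the `attrib_offset` helper, and the two-attribute vertex (a 16-byte position and a 4-byte texture coordinate, 1000 vertices) are assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical description of where one attribute lives in a buffer. */
typedef struct {
    size_t base;    /* byte offset of the attribute for vertex 0 */
    size_t stride;  /* bytes between the same attribute of consecutive vertices */
} attrib_layout;

/* Byte offset of the attribute for a given vertex index. */
static size_t attrib_offset(attrib_layout a, size_t vertex)
{
    return a.base + vertex * a.stride;
}

/* Assumed sizes for this sketch only. */
enum { POS_BYTES = 16, UV_BYTES = 4, VERTEX_COUNT = 1000 };

/* Non-interleaved: all positions first, then all texture coordinates. */
static const attrib_layout soa_pos = { 0, POS_BYTES };
static const attrib_layout soa_uv  = { POS_BYTES * VERTEX_COUNT, UV_BYTES };

/* Interleaved: the attributes of one vertex sit next to each other. */
static const attrib_layout aos_pos = { 0, POS_BYTES + UV_BYTES };
static const attrib_layout aos_uv  = { POS_BYTES, POS_BYTES + UV_BYTES };
```

Under these assumptions, both attributes of vertex 2 fall within bytes 40 to 59 of the interleaved buffer, whereas in the non-interleaved buffer they are roughly 16 KB apart and therefore come from different cache lines.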

Interleaving for position shading

The Bifrost family of Mali GPUs comprises the Mali-G30/50/70 series. These GPUs implement an optimized vertex processing flow, index-driven vertex shading (IDVS), that splits the vertex shader into two pieces: position shading and varying shading.

After primitive assembly, position shading is run. Primitives are then put through the fixed-function clip and cull unit, and the varying shading function is run only for the vertices that contribute to visible primitives.

The diagram below shows the vertex processing flow in an IDVS geometry pipeline:


Compared to shading every vertex in full, this approach gives two benefits:

  1. Shading cost is reduced because varying shading is not run for culled vertices.
  2. The amount of data fetched for culled vertices can be reduced if the application optimizes the buffer layout.

Let’s consider the vertex data structure below:

c
struct vertex {
    fp32  position[4];     // 16 bytes
    fp32  xyScale[2];      //  8 bytes
    fp16  texCoord1[2];    //  4 bytes
    fp16  texCoord2[2];    //  4 bytes
    fp16  vertexColor[4];  //  8 bytes
};                         // 40 bytes in total

And the corresponding vertex shader:

glsl
#version 300 es
precision highp float;

uniform mat4 u_mvp;

in vec4 position;
in vec2 xyScale;

in mediump vec2 texCoord1;
in mediump vec2 texCoord2;
in mediump vec4 vertexColor;

out mediump vec2 v_texCoord1;
out mediump vec2 v_texCoord2;
out mediump vec4 v_vertexColor;

void main()
{
    vec4 tmpPos = position;
    tmpPos.xy *= xyScale;
    gl_Position = u_mvp * tmpPos;

    v_texCoord1 = texCoord1;
    v_texCoord2 = texCoord2;
    v_vertexColor = vertexColor;
}

Data is always fetched from main memory as entire 64-byte cache lines. If this structure is stored as a single interleaved attribute data set, a total of 40 bytes is loaded per vertex, even when the vertex is culled and only the position and xyScale values are used.

Ideally, in this scenario, only 24 bytes per vertex need to be loaded for culled vertices. The remaining 16 bytes, 40% of the read bandwidth, are wasted.
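The 40-byte, 24-byte, and 40% figures can be checked with a short C sketch. C has no built-in `fp32` or `fp16` types, so this sketch assumes that `float` is 32 bits wide and stores each fp16 value as raw 16 bits in a `uint16_t`; the helper names `position_bytes` and `wasted_percent` are illustrative only.

```c
#include <stddef.h>
#include <stdint.h>

/* Storage stand-ins (assumptions): 32-bit float, raw 16-bit half floats. */
typedef float    fp32;
typedef uint16_t fp16;

struct vertex {
    fp32 position[4];
    fp32 xyScale[2];
    fp16 texCoord1[2];
    fp16 texCoord2[2];
    fp16 vertexColor[4];
};

/* Bytes that position shading needs: position and xyScale, which sit
 * at the start of the structure, before texCoord1. */
static size_t position_bytes(void)
{
    return offsetof(struct vertex, texCoord1);
}

/* Percentage of each culled vertex's data that is loaded but never used. */
static unsigned wasted_percent(void)
{
    size_t total = sizeof(struct vertex);
    return (unsigned)(100 * (total - position_bytes()) / total);
}
```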

To get the full benefit of split position and varying shading in Bifrost GPUs, we recommend using two interleaved data sets. The first data set should interleave all attributes required to compute gl_Position, and the second set should contain everything else.

Using two interleaved data sets in this way ensures that only position-related data is read from memory for culled vertices and maximizes the bandwidth savings. The two data sets can be stored as separate sub-regions inside a single buffer, or inside two separate buffers. This is shown in the image below:

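One way to produce the two data sets is to repack an existing interleaved mesh when it is loaded or built. The following C sketch assumes the example vertex structure, with `float` standing in for fp32 and `uint16_t` holding raw fp16 bits; the names `position_data`, `varying_data`, `split_vertices`, and `varying_region_offset` are hypothetical and not part of any API.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef float    fp32;  /* assumption: 32-bit float */
typedef uint16_t fp16;  /* assumption: raw 16-bit half-float storage */

/* Original single interleaved layout. */
struct vertex {
    fp32 position[4];
    fp32 xyScale[2];
    fp16 texCoord1[2];
    fp16 texCoord2[2];
    fp16 vertexColor[4];
};

/* Data set 1: everything required to compute gl_Position. */
struct position_data {
    fp32 position[4];
    fp32 xyScale[2];
};

/* Data set 2: everything else. */
struct varying_data {
    fp16 texCoord1[2];
    fp16 texCoord2[2];
    fp16 vertexColor[4];
};

/* Copy each vertex into the two separately interleaved data sets. */
static void split_vertices(const struct vertex *in, size_t count,
                           struct position_data *pos_out,
                           struct varying_data *var_out)
{
    for (size_t i = 0; i < count; ++i) {
        memcpy(pos_out[i].position, in[i].position, sizeof in[i].position);
        memcpy(pos_out[i].xyScale, in[i].xyScale, sizeof in[i].xyScale);
        memcpy(var_out[i].texCoord1, in[i].texCoord1, sizeof in[i].texCoord1);
        memcpy(var_out[i].texCoord2, in[i].texCoord2, sizeof in[i].texCoord2);
        memcpy(var_out[i].vertexColor, in[i].vertexColor, sizeof in[i].vertexColor);
    }
}

/* Byte offset of the varying sub-region when both data sets share one
 * buffer and the position sub-region is placed first. */
static size_t varying_region_offset(size_t count)
{
    return count * sizeof(struct position_data);
}
```

The position data set then has a 24-byte stride and the varying data set a 16-byte stride, so a culled vertex only touches the first sub-region.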

Note: While this type of split packing can incur small overhead costs on older Mali GPUs, the impact is minimal if other best practices, such as ensuring good spatial locality, are also followed.

Buffer specialization

For complex geometry that is reused in multiple render passes, such as a shadow pass and a color pass, it is useful to produce a specialized attribute data set for each pass.

These specialized versions should strip out the attributes that the pass does not use, producing a bandwidth-optimized version of the mesh for each use.

Note: For most use cases, the split data sets for position and varying shading already provide optimal position-only data sets for depth shadow map generation, so further specialization may not be necessary.
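To judge whether a specialized copy is worthwhile, it can help to compare the per-vertex size each pass actually needs. The following C sketch is an illustration only: the attribute flags, the size table (taken from the example vertex structure), and the per-pass masks are hypothetical, and a real application would derive them from its own shaders.

```c
#include <stddef.h>

/* Hypothetical flags, one bit per attribute of the example vertex. */
enum {
    ATTR_POSITION     = 1 << 0,
    ATTR_XY_SCALE     = 1 << 1,
    ATTR_TEX_COORD_1  = 1 << 2,
    ATTR_TEX_COORD_2  = 1 << 3,
    ATTR_VERTEX_COLOR = 1 << 4
};

#define ATTR_COUNT 5

/* Size in bytes of each attribute, indexed by bit position. */
static const size_t attr_bytes[ATTR_COUNT] = { 16, 8, 4, 4, 8 };

/* Per-vertex bytes of a data set that keeps only the attributes in the mask. */
static size_t specialized_stride(unsigned used_mask)
{
    size_t stride = 0;
    for (unsigned i = 0; i < ATTR_COUNT; ++i) {
        if (used_mask & (1u << i)) {
            stride += attr_bytes[i];
        }
    }
    return stride;
}

/* Assumed usage: the shadow pass only needs what feeds gl_Position. */
static const unsigned SHADOW_PASS_ATTRS = ATTR_POSITION | ATTR_XY_SCALE;
static const unsigned COLOR_PASS_ATTRS  = ATTR_POSITION | ATTR_XY_SCALE |
                                          ATTR_TEX_COORD_1 | ATTR_TEX_COORD_2 |
                                          ATTR_VERTEX_COLOR;
```

With these masks, the shadow pass needs 24 bytes per vertex, which is exactly the position data set of the split layout, so no extra copy is needed in this case.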

Attribute vectorization

Because all Mali GPUs are vector processors to some extent, the shader compiler can optimize memory accesses more effectively if it has guarantees that data is contiguous in memory.

In the previous example, the application uploads a pair of texture coordinates as two different vec2 attributes, which the shader declares separately:

glsl
in mediump vec2 texCoord1;
in mediump vec2 texCoord2;

out mediump vec2 v_texCoord1;
out mediump vec2 v_texCoord2;

This means that at compile time the compiler has no guarantee that these are contiguous in memory, because the application can change buffer packing at draw time. If this version of the shader is run through the offline compiler for a Mali-T880 GPU, the vertex shader requires 10 load/store cycles to complete and the load/store unit is the critical path.

If a pair of coordinates is uploaded together as a single vec4, the compiler is given some guarantees that they are contiguous in memory. This allows it to perform better optimizations:

glsl
in mediump vec4 texCoords;

out mediump vec4 v_texCoords;

This version of the shader should run more quickly and use less energy as it only requires 8 load/store cycles to complete.
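On the application side, the matching change is to expose both coordinate pairs through one four-component attribute. The following C sketch assumes fp16 values are stored as raw 16 bits in a `uint16_t`; the `varying_data` structure and the `pack_tex_coords` helper are hypothetical names used for illustration.

```c
#include <stdint.h>

typedef uint16_t fp16;  /* assumption: raw 16-bit half-float storage */

/* Varying data with both coordinate pairs exposed as one vec4 attribute. */
struct varying_data {
    fp16 texCoords[4];    /* xy = first pair, zw = second pair */
    fp16 vertexColor[4];
};

/* Pack two 2-component coordinates into one 4-component attribute. */
static void pack_tex_coords(const fp16 uv1[2], const fp16 uv2[2], fp16 out[4])
{
    out[0] = uv1[0];
    out[1] = uv1[1];
    out[2] = uv2[0];
    out[3] = uv2[1];
}
```

The per-vertex size is unchanged at 16 bytes; only the attribute declaration differs. The vertex shader would then read the first pair as texCoords.xy and the second as texCoords.zw.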
