Immediate Mode GPUs

Traditional desktop GPU architecture is commonly known as immediate mode architecture. Immediate mode GPUs process rendering as a strict command stream, executing the vertex and fragment shaders in sequence on each primitive in every draw call.

Ignoring parallel processing and pipelining, here is a high-level pseudo-code example of this approach:

python
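# Simplified model: one primitive at a time, no parallelism or pipelining.
for draw in renderPass:
    for primitive in draw:
        # Vertex shading always runs, even for primitives that are later culled.
        for vertex in primitive:
            execute_vertex_shader(vertex)
        # Fragment shading follows immediately for primitives that survive culling.
        if not primitive.culled:
            for fragment in primitive:
                execute_fragment_shader(fragment)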

The diagram below shows the hardware data flow and memory interactions:

 

Advantages

The output of the vertex shader, and other geometry related shaders, can remain on-chip inside the GPU. The output of these shaders can be stored in a FIFO buffer until the next stage in the pipeline is ready to use the data. This means that the GPU uses little external memory bandwidth storing and retrieving intermediate geometry results.
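As a rough illustration of the FIFO idea, the sketch below models a bounded queue between the geometry stage and the fragment stage. The capacity of four entries, the `run_pipeline` name, and the `shaded(...)` strings are hypothetical and chosen only for this example; real hardware sizes and schedules its queues very differently.

```python
from collections import deque

FIFO_CAPACITY = 4  # hypothetical on-chip queue depth, for illustration only


def run_pipeline(primitives, capacity=FIFO_CAPACITY):
    """Pass primitives through a bounded FIFO between two pipeline stages.

    Returns the primitives in the order the fragment stage consumed them,
    plus the peak number of entries the FIFO ever held.
    """
    pending = deque(primitives)  # primitives not yet vertex shaded
    fifo = deque()               # vertex shader output waiting on-chip
    consumed = []
    peak = 0
    while pending or fifo:
        # Geometry stage: produce results until the FIFO is full, then stall.
        while pending and len(fifo) < capacity:
            fifo.append(f"shaded({pending.popleft()})")
        peak = max(peak, len(fifo))
        # Fragment stage: consume the oldest entry, freeing one slot.
        consumed.append(fifo.popleft())
    return consumed, peak


order, peak_entries = run_pipeline(range(10))
print(order[:3], peak_entries)
```

In this model the intermediate geometry data never exceeds the FIFO capacity, no matter how many primitives the draw contains, which is why it can stay on-chip.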

Disadvantages

The fragment shading jumps around the screen depending on the locations of the triangles in each draw. This happens because any triangle in the stream may cover any part of the screen and triangles are processed in draw order.

As a result, the active working set is the size of the entire framebuffer. For example, consider a device with a 1440p display that uses 32 bits per pixel (BPP) for color and 32 BPP for packed depth/stencil. At 2560 × 1440 pixels and 8 bytes per pixel, this gives a total working set of roughly 30MB, which is far too large to keep on-chip and must therefore be stored off-chip in DRAM.
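The working set figure can be checked with a short calculation. This sketch assumes that 1440p means 2560 × 1440 pixels and that there is a single color attachment and a single depth/stencil attachment; the function name is hypothetical.

```python
def framebuffer_working_set_bytes(width, height, color_bpp=32, depth_stencil_bpp=32):
    """Return the bytes needed for one color and one depth/stencil attachment."""
    bits_per_pixel = color_bpp + depth_stencil_bpp
    return width * height * bits_per_pixel // 8


# Assumption: 1440p is 2560 x 1440 pixels.
size = framebuffer_working_set_bytes(2560, 1440)
print(f"{size} bytes = {size / 1e6:.1f} MB = {size / 2**20:.1f} MiB")
# prints: 29491200 bytes = 29.5 MB = 28.1 MiB
```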

For every blending, depth testing, and stencil testing operation, the GPU must fetch the current value stored at the pixel coordinate of the current fragment from this working set.
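The read-modify-write pattern can be sketched with a toy framebuffer that counts memory accesses. This is a simplified model under stated assumptions: a single-channel color value, a less-than depth test, standard alpha blending, and no stencil test. The buffer sizes and the `shade_and_write` name are hypothetical.

```python
WIDTH, HEIGHT = 4, 4  # tiny hypothetical framebuffer

color_buffer = [[0.0] * WIDTH for _ in range(HEIGHT)]
depth_buffer = [[1.0] * WIDTH for _ in range(HEIGHT)]  # 1.0 is the far plane
traffic = {"reads": 0, "writes": 0}


def shade_and_write(x, y, depth, src_color, src_alpha):
    """Depth test and blend one fragment, counting framebuffer accesses."""
    # Depth test: read the stored depth for this pixel.
    traffic["reads"] += 1
    if depth >= depth_buffer[y][x]:
        return False  # fragment is hidden, nothing is written
    depth_buffer[y][x] = depth
    traffic["writes"] += 1

    # Blend: read the stored color, combine it with the source, write it back.
    traffic["reads"] += 1
    dst_color = color_buffer[y][x]
    color_buffer[y][x] = src_alpha * src_color + (1.0 - src_alpha) * dst_color
    traffic["writes"] += 1
    return True


first = shade_and_write(1, 2, depth=0.5, src_color=1.0, src_alpha=0.5)
second = shade_and_write(1, 2, depth=0.8, src_color=1.0, src_alpha=1.0)
print(first, second, traffic)
```

The first fragment passes the depth test and costs two reads and two writes. The second fragment is behind the first, so it costs one read and is then discarded.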

Typically, all shaded fragments access this working set. Therefore, at high resolutions the bandwidth load placed on this memory can be very high because of multiple read-modify-write operations for each fragment. However, caching can mitigate high bandwidth load by keeping recently accessed parts of the framebuffer close to the GPU.
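How much caching helps depends on how the accesses are ordered. The sketch below compares two access patterns against a small LRU cache. The cache size, the number of framebuffer cache lines, and the number of overlapping layers are hypothetical values chosen for illustration, and real GPU caches do not necessarily use LRU replacement.

```python
from collections import OrderedDict


def hit_rate(accesses, cache_lines):
    """Return the fraction of accesses served by an LRU cache of the given size."""
    cache = OrderedDict()
    hits = 0
    for line in accesses:
        if line in cache:
            hits += 1
            cache.move_to_end(line)        # mark as most recently used
        else:
            cache[line] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return hits / len(accesses)


FRAMEBUFFER_LINES = 1024  # hypothetical number of cache lines in the framebuffer
CACHE_LINES = 16          # hypothetical cache capacity
LAYERS = 4                # each framebuffer line is touched four times

# Coherent: overlapping fragments hit the same line back to back.
coherent = [line for line in range(FRAMEBUFFER_LINES) for _ in range(LAYERS)]
# Scattered: each layer sweeps the whole screen before any line is revisited.
scattered = [line for _ in range(LAYERS) for line in range(FRAMEBUFFER_LINES)]

print(f"coherent:  {hit_rate(coherent, CACHE_LINES):.2f}")
print(f"scattered: {hit_rate(scattered, CACHE_LINES):.2f}")
```

Both patterns touch every line the same number of times, but in this model the coherent pattern hits the cache on 75% of accesses while the scattered pattern never hits, because each line is evicted before it is revisited.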
