Efficient parallel processing
January 2018 by Peter Harris
In this article we will look at macro-scale pipelining of workloads, the means by which we keep the GPU busy all of the time, and some of the common reasons for that frame level pipeline to stall.
Note: I'm assuming you already have DS-5 Streamline up and running on your platform. If you are yet to do this, there are some work guides posted on the community for getting set up on a variety of Mali-based consumer devices. The examples in this blog were captured using DS-5 v5.16.
What does good content look like?
Before we dive into diagnosing performance problems it is useful to understand what we are aiming for, and what this looks like in Streamline. There are two possible "good" behaviors depending on the performance of the system and the complexity of the content.
- One for content where the GPU is the bottleneck
- One for content where the vsync is the bottleneck
The counters needed for this experiment are:
- Mali Job Manager Cycles: GPU cycles
- This counter increments any clock cycle the GPU is doing something
- Mali Job Manager Cycles: JS0 cycles
- This counter increments any clock cycle the GPU is fragment shading
- Mali Job Manager Cycles: JS1 cycles
- This counter increments any clock cycle the GPU is vertex shading or tiling
If we successfully create and maintain the frame-level rendering pipeline needed for content where the GPU is the bottleneck (e.g. the rendering is too complex to hit 60 FPS), then we would expect one of the GPU workload types (vertex or fragment processing) to be running at full capacity all of the time.
In nearly all content the fragment processing is the dominant part of the GPU execution; applications usually have one or two orders of magnitude more fragments to shade than vertices. In this scenario we would therefore expect JS0 to be active all of the time, and both the CPU and JS1 to be going idle for at least some of the time every frame.
When using Streamline to capture this set of counters we will see three activity graphs which are automatically produced by the tool, in addition to the raw counter values for GPU. We can see that the "GPU Fragment" processing is fully loaded, and that both the "CPU Activity" and the "GPU Vertex-Tiling-Compute" workloads are going idle for a portion of each frame. Note - you need to zoom in down close to the 1ms or 5ms zoom level to see this - we are talking about quite short time periods here.
In systems which are throttled by vsync then we would expect the CPU and the GPU to go idle every frame, as they cannot render the next frame until the vsync signal occurs and a window buffer swap happens. The graph below shows what this would look like in Streamline:
If you are a platform integrator rather than an application developer, testing cases which are running at 60FPS can be a good way to review the effectiveness of your system's DVFS frequency choices. In the example above there is a large amount of time between each burst of activity. This implies that the DVFS frequency selected is too high and that the GPU is running much faster than it needs to, which reduces energy efficiency of the platform as a whole.
Issue #1: Double buffering stalls
In a double-buffered system it is possible to have content which is not hitting 60 FPS, but which is still limited by vsync. This content will look much like the graph above, except the time between workloads will be a multiple of one frame period, and the visible framerate will be an exact division of the maximum screen refresh rate (e.g. a 60 FPS panel could run at 30 FPS, 20 FPS, 15 FPS, etc).
In a double-buffered system which is running at 60 FPS the GPU successfully manages to produce frames in time for each vsync buffer swap. In the figure below we see the lifetime of the two framebuffers (FB0 and FB1), with periods where they are on-screen in green, and periods where they are being rendered by the GPU in blue.
In a system where the GPU is not running fast enough to do this, we will miss one or more vsync deadlines, so the current front-buffer will remain on screen for another vsync period. At the point of the orange line in the diagram below FB1 is the front-buffer is still being displayed on the screen, and FB0 is the back-buffer which has completed rendering and is queued for display. The GPU has no available framebuffer to render on to and so it goes idle. The system performance snaps down to run at 30 FPS, despite having a GPU which is fast enough to run the content at over 45 FPS.
The Android windowing system typically uses triple buffering, so avoids this problem as the GPU has a spare buffer available to render on to, but this is still seen in some X11-based Mali deployments which are double buffered. If you see this issue it is recommended that you disable vsync while doing performing optimization; it is much easier to determine what needs optimizing without additional factors clouding the issue!
For OpenGL ES most Android integrations force the use of triple buffering. The Vulkan API gives the application control over the number of framebuffers in the swapchain, so be aware that each application must explicitly opt-in to use a swapchain of 3 frames to avoid this issue with FPS locking to an integer division of your vsync period when running under 60 FPS.
Issue #2: Pipeline draining
The second issue which you may see is a pipeline break. In this scenario at least one of the CPU or GPU processing parts are busy at any point, but not at the same time; some form of serialization point has been introduced.
In the example below the content is fragment dominated, so we would expect the fragment processing to be active all the time, but we see an oscillating activity which is serializing GPU vertex processing and fragment processing.
The most common reason for this is the use of an OpenGL ES API function which enforces the synchronous behavior of the API, forcing the driver to flush all of the pending operations and drain the rendering pipeline in order to honor the API requirements. The most common culprits here are:
glFinish(): explicitly request a pipeline drain.
glReadPixels(): implicitly request a pipeline drain for the current surface.
GL_MAP_UNSYNCHRONIZED_BITset: explicit pipeline drain for all pending surfaces using the data resource being mapped.
It is almost impossible to make these API calls fast due to their pipeline draining semantics, so I would suggest avoiding these specific uses wherever possible. It is worth noting that OpenGL ES 3.0 allows
glReadPixels to target a Pixel Buffer Object (PBO) which can do the pixel copy asynchronously. This no longer causes a pipeline flush, but may mean you have to wait a while for your data to arrive, and the memory transfer can still be relatively expensive.
Issue #3: CPU limitations
The final issue I will cover in this article is one where the GPU is not the bottleneck at all, but which often shows up as poor graphics performance.
We can only maintain the pipeline of frames if the CPU can produce new frames faster than the GPU consuming them. If the CPU takes 20ms to produce a frame which the GPU takes 5ms to render, then the pipeline will run empty each frame. In the example below the GPU is going idle every frame, but the CPU is running all of the time, which implies that the CPU cannot keep up with the GPU.
"Hang on" I hear you say, "that says the CPU is only 25% loaded". Streamline shows the total capacity of the system as 100%, so if you have 4 CPU cores in your system with one thread maxing out a single processor then this will show up as 25% load. If you click on the arrow in the top right of the "CPU Activity" graph's title box it will expand giving you separate load graphics per CPU core in the system:
As predicted we have one core maxed at 100% load, so this thread is the bottleneck in our system which is limiting the overall performance. There can be many reasons for this, but in terms of the graphics behavior rather than application inefficiency, the main two are:
- Excessive amounts of
- Excessive amounts of dynamic data upload
Draw call cost
Every draw call has a cost for the driver in terms of building control structures and submitting them to the GPU. The number of draw calls per frame should minimized by batching together drawing of objects with similar render state, although there is a balance to be struck between larger batches and efficient culling of things which are not visible. In terms of a target to aim for: most high-end 3D content on mobile today uses only 250 draw calls per frame, with many 2D games coming in under 50.
Dynamic data upload
In terms of dynamic data upload be aware that every data buffer uploaded from client memory to the graphics server requires the driver to copy that data from a client buffer into a server buffer. If this is a new resource rather than sub-buffer update then the driver has to allocate the memory for the buffer too. The most common offender here is the use of client-side vertex attributes. Where possible use static Vertex Buffer Objects (VBOs) which are stored persistently in graphics memory, and use that buffer by reference in all subsequent rendering. This allows you to pay the upload cost once, and amortize that cost over many frames of rendering.
It some cases it may not be Mali graphics stack which is limiting the performance at all. We do sometimes get support cases where the application logic itself is taking more than 16.6ms, so the application could not hit 60 FPS even if the OpenGL ES calls were infinitely fast. DS-5 Streamline contains a very capable software profiler which can help you identify precisely where the bottlenecks are in your code, as well as helping you load balance workloads across multiple CPU cores in your system if you want to parallelize your software using multiple threads, but as this is not directly related to the Mali behavior I'm not going to dwell on it this time around.