This section describes optimizations for high-end Mali GPUs based on the Midgard architecture. Processors in this range are the Mali-T604, Mali-T622, Mali-T624, Mali-T628, Mali-T678 and Mali-T760, as well as the mid-range Mali-T720.
Fragment Bound GPU
If the fragment shader is dominating the GPU workload, we refer to the application as fragment bound. This is a common cause of poor performance in mobile games. There are several common reasons why this might be happening.
The concept of drawing objects from front to back should be well known to game developers, but it is worth reiterating its importance in mobile gaming. Drawing each pixel on the screen more than once creates overdraw, which shows up as the fragment shader dominating the GPU workload. Heavy use of transparency in a scene can also contribute towards a fragment bound application. More details can be found in the Mali GPU optimization guide.
Is the resolution too high?
Smartphone and tablet resolutions are very high, in many cases exceeding those of the previous generation of consoles, which typically rendered at 720p. Compare this with a device like the Google Nexus 10, which has a 2.5K native resolution (2560 x 1600). If you're attempting to run an application at native resolution, this can result in a significant slowdown.
Is texture bandwidth too high?
Textures use a large amount of memory bandwidth, so the fragment shader might not receive data quickly enough and will stall while it waits.
Too many effects or cycles in the shader
Every light and every effect will add to the number of cycles that your shader will take.
For example, on the Google Nexus 10:
- Nexus 10 native resolution = 2560 x 1600 = 4,096,000 pixels
- Quad-core GPU at 533 MHz = 4 x 533,000,000 cycles per second ≈ 520 cycles per pixel per second
- Targeting 30 FPS = 520 / 30 ≈ 17 cycles per pixel in your shader
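The arithmetic above can be reproduced with a short sketch. The function name and parameters are illustrative, not part of any Arm tool, and the calculation assumes the simplified model used in the list: every core contributes one cycle per clock tick, and each pixel is shaded exactly once per frame (no overdraw).

```python
def shader_cycle_budget(width, height, cores, clock_hz, target_fps):
    """Rough per-pixel, per-frame cycle budget under the simplified model above."""
    pixels = width * height                      # pixels per frame
    cycles_per_second = cores * clock_hz         # total GPU cycles available each second
    cycles_per_pixel_per_second = cycles_per_second / pixels
    return cycles_per_pixel_per_second / target_fps


# Nexus 10 figures from the text: 2560 x 1600, quad-core GPU at 533 MHz, 30 FPS target.
budget = shader_cycle_budget(2560, 1600, cores=4, clock_hz=533_000_000, target_fps=30)
print(f"Approximate budget: {budget:.1f} cycles per pixel per frame")
```

Any overdraw reduces this budget proportionally, so the real per-fragment allowance is usually lower than the figure printed here.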
Testing for a Fragment Bound Application in Streamline
Select JS0 (Job Slot 0) Active as a counter and run a Streamline capture. Select a 1 second window (using the blue slider) and center it over a position of interest.
Note: It’s a good idea to avoid the first second of the Streamline capture to allow for settling time.
Using the number to the right of your capture window (in this case 447,709,960) you can now calculate the fragment percentage:
Fragment Percentage = (JS0 Active / GPU frequency) * 100
We can see in our example that the fragment percentage is 84%. If the fragment shader is taking up such a large percentage of GPU activity, then it’s likely to be the primary cause of any bottlenecking and therefore we should investigate how we can improve performance here.
As a general rule, if you’re using up more than 90% of the GPU with the fragment shader then this is almost certainly the main issue impeding the performance of your application.
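As a minimal sketch, the fragment percentage formula can be applied to the counter value quoted above. The GPU frequency of 533 MHz is an assumption taken from the Nexus 10 example earlier in this section; substitute the frequency of your own device. The function name is illustrative only.

```python
def fragment_percentage(js0_active, gpu_frequency_hz):
    """Share of GPU cycles in a 1 second window during which Job Slot 0 was active."""
    return js0_active / gpu_frequency_hz * 100


JS0_ACTIVE = 447_709_960          # counter value read from the 1 second capture window
GPU_FREQUENCY_HZ = 533_000_000    # assumption: 533 MHz, as in the Nexus 10 example

pct = fragment_percentage(JS0_ACTIVE, GPU_FREQUENCY_HZ)
print(f"Fragment percentage: {pct:.0f}%")

# The 90% rule of thumb from the text.
if pct > 90:
    print("Fragment shading is almost certainly the main bottleneck.")
```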
Using the Fragment Threads Started counter, we can test for overdraw (again using a 1 second window):
Overdraw = (Fragment Threads Started * Number of Cores) / (Resolution * FPS)
An overdraw value greater than 3 is another indicator that the fragment shader will be taking up too much of the GPU’s time.
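The overdraw formula can be sketched as follows. The counter value used in the example is hypothetical and chosen only to illustrate the calculation; the function name and parameter names are likewise illustrative.

```python
def overdraw_factor(fragment_threads_started, num_cores, width, height, fps):
    """Average number of times each pixel is shaded per frame."""
    fragments_per_second = fragment_threads_started * num_cores
    pixels_per_second = width * height * fps
    return fragments_per_second / pixels_per_second


# Hypothetical counter reading over a 1 second window on a quad-core GPU,
# rendering at 2560 x 1600 and 30 FPS.
factor = overdraw_factor(100_000_000, num_cores=4, width=2560, height=1600, fps=30)
print(f"Overdraw: {factor:.2f}")
if factor > 3:
    print("Overdraw exceeds 3; the fragment shader is likely overloaded.")
```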
Vertex Bound GPU
In the Mali-T600 series, it is far less likely that your application will be vertex bound than in the mid-range Mali-400 series, because there is more than one vertex shader available. However, it can occasionally be the cause of slowdown, as explained below.
Possible Causes of a Vertex Bound Application
Too many vertices in your geometry
Artists working at game studios are often used to creating content with desktop PC graphics cards in mind. There is a tendency to use too many vertices, which can then in turn cause the vertex shader to dominate the GPU workload. This can be worked around with techniques such as strict budgeting and limits for the number of vertices, level of detail (LOD) switching and culling.
Vertex shader calculation is too heavy
The total vertex shading cost scales with both the number of vertices and the cycles spent per vertex. Because you have a limited number of cycles for vertex shading, it is important to use the techniques mentioned above to reduce the vertex count.
Testing for a Vertex Bound Application in Streamline
Select the JS1 (Job Slot 1) counter and begin another Streamline capture. JS1 is the hardware counter that represents the vertex job. Select a 1 second sample window using the blue slider:
Vertex Percentage = (JS1 Active / GPU frequency) * 100
In the example, our vertex percentage works out at 13%, which is quite low, indicating that we are not vertex bound.
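The same calculation can be sketched for the vertex job slot. The text does not quote the raw JS1 counter value, so the number below is hypothetical, chosen to give roughly 13% at an assumed GPU frequency of 533 MHz.

```python
def job_slot_percentage(active_cycles, gpu_frequency_hz):
    """Share of GPU cycles in a 1 second window during which a job slot was active."""
    return active_cycles / gpu_frequency_hz * 100


JS1_ACTIVE = 69_290_000           # hypothetical counter value, not from the capture
GPU_FREQUENCY_HZ = 533_000_000    # assumption: 533 MHz

vertex_pct = job_slot_percentage(JS1_ACTIVE, GPU_FREQUENCY_HZ)
print(f"Vertex percentage: {vertex_pct:.0f}%")
```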
Measuring Load/Store cycles per instruction (Mali Load/Store Pipe: LS instructions) gives an indication as to whether you're using too many vertices:
Load Store CPI = Full Pipeline issues / Load Store Instruction Words Completed
In our example, this value is 2.02. Any value over 2 is considered a problem worth investigating.
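A minimal sketch of the load/store CPI check is shown below. The two counter values are hypothetical and chosen only so that the ratio matches the 2.02 quoted above; the function name is illustrative.

```python
def load_store_cpi(full_pipeline_issues, ls_instruction_words_completed):
    """Load/store cycles per instruction, computed from two counter readings."""
    if ls_instruction_words_completed == 0:
        raise ValueError("no load/store instruction words completed in this window")
    return full_pipeline_issues / ls_instruction_words_completed


# Hypothetical counter readings over a 1 second window.
cpi = load_store_cpi(2_020_000, 1_000_000)
print(f"Load/store CPI: {cpi:.2f}")
if cpi > 2:
    print("CPI is above 2; worth investigating.")
```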
Bandwidth Bound GPU
Simply put, an application is bandwidth bound if it tries to use more bandwidth than is available. It can be tricky to determine whether an application is bandwidth bound as it impacts all parts of the application and graphics pipeline.
Possible Causes of a Bandwidth Bound Application
Typically, a smartphone or tablet GPU will have a memory bandwidth of around 5 GBps. This presents one of the key hurdles for developers who are used to desktop PC gaming, where typical bandwidth will exceed 100 GBps.
Though screen sizes and resolutions are increasing in mobile devices, it is still possible to use popular texture compression formats such as ASTC or ETC to greatly reduce the bandwidth required. Arm provides the Mali Texture Compression Tool to help with this.
Testing for a Bandwidth Bound Application
To look at the bandwidth utilization, you need to look at the External Bus Read Beats and External Bus Write Beats in the L2 cache. Additionally, you need to know the bus width in bytes. Once again, select a 1 second window:
Bandwidth (Bytes) = (External Bus Read Beats + External Bus Write Beats) * Bus Width
Our example (using a bus width of 16 bytes) gives a bandwidth of 967 MBps.
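The bandwidth formula can be sketched as below. The individual read and write beat counts are hypothetical; they are chosen only so that, with a 16-byte bus, the total reproduces the 967 MBps figure from the example (taking 1 MB as 1,000,000 bytes, which is an assumption).

```python
def bandwidth_bytes(read_beats, write_beats, bus_width_bytes):
    """Bytes transferred over the external bus during the capture window."""
    return (read_beats + write_beats) * bus_width_bytes


# Hypothetical counter readings over a 1 second window.
READ_BEATS = 45_000_000
WRITE_BEATS = 15_437_500
BUS_WIDTH_BYTES = 16              # bus width used in the example above

total = bandwidth_bytes(READ_BEATS, WRITE_BEATS, BUS_WIDTH_BYTES)
print(f"Bandwidth: {total / 1_000_000:.0f} MBps")
```

Comparing this figure with the bandwidth the device can sustain gives a sense of how close the application is to being bandwidth bound.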
Possible Causes of CPU Bound Performance
In some cases, optimizing your graphics won't improve performance, as the CPU is causing the low frame rate. This is very often caused by not taking advantage of the multiple cores that most modern day smartphones and tablets have. If an application is not using multithreading, you might see (for instance on a quad-core CPU) "25%" CPU utilization, which often means 100% of one core is being used.
This might seem like a common sense optimization, but failing to enable multithreading has been the cause of embarrassingly poor performance in certain well-known desktop PC games, so it is definitely worth checking for when it comes to mobile games.
Since the Mali GPU is a deferred architecture, it helps to reduce the number of draw calls you make and to combine draw calls where possible. You can also take advantage of OpenCL 1.1 support to offload work from the CPU to the GPU.
Testing for CPU Bound Performance
This test is easy: simply look at the CPU activity. Make sure that you expand the chart to show all the cores in order to test for the aforementioned lack of multithreading.
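The "25% on a quad-core CPU" pattern can be illustrated with a small sketch. It works on per-core utilization percentages that you read off the expanded chart by hand; it does not query Streamline, and the function name and both thresholds are assumptions picked for illustration.

```python
def looks_single_threaded(per_core_utilization, busy_threshold=90.0, idle_threshold=20.0):
    """Return True if exactly one core is saturated while all the others are mostly idle."""
    busy = [u for u in per_core_utilization if u >= busy_threshold]
    idle = [u for u in per_core_utilization if u <= idle_threshold]
    return len(busy) == 1 and len(idle) == len(per_core_utilization) - 1


# One core at 100% on a quad-core CPU shows up as 25% overall utilization.
cores = [100.0, 0.0, 0.0, 0.0]
overall = sum(cores) / len(cores)
print(f"Overall utilization: {overall:.0f}%")
print("Likely single threaded:", looks_single_threaded(cores))
```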
One feature that can help to narrow down the share of CPU activity is the "per application" activity in Streamline. Click on a thread or application in the activity bar (as shown in the screenshot below) and Streamline will filter activity to show how much CPU workload is taken up with that particular application.