Use Streamline to Optimize Applications for Mali GPUs


Midgard Optimization

This section describes optimizations for Mali GPUs based on the Midgard architecture. GPUs in this range are the high-end Mali-T604, Mali-T622, Mali-T624, Mali-T628, Mali-T678 and Mali-T760, as well as the mid-range Mali-T720.

Fragment Bound GPU

If the fragment shader is dominating the GPU workload, we refer to the application as fragment bound. This is a common cause of poor performance in mobile games. There are several common reasons why this might be happening.


Game developers should already be familiar with drawing objects from front to back, but its importance in mobile gaming is worth reiterating. Drawing each pixel on the screen more than once creates overdraw, which shows up as the fragment shader dominating the GPU workload. Heavy use of transparency in a scene can also contribute to a fragment bound application. More details can be found in the Mali GPU optimization guide.

Is the resolution too high?

Smartphone and tablet resolutions are very high, in many cases exceeding those of consoles: previous-generation console games typically render at 720p. Compare this with a device like the Google Nexus 10, which has a native resolution of 2560 x 1600. Running an application at native resolution on such a device can result in a significant slowdown.

Is texture bandwidth too high?

Textures use a large amount of memory bandwidth. If the fragment shader cannot fetch texture data quickly enough, it stalls.

Too many effects or cycles in the shader

Every light and every effect will add to the number of cycles that your shader will take.

For example, on the Google Nexus 10:

  • Nexus 10 native resolution = 2560 x 1600 = 4,096,000 pixels
  • Quad-core GPU at 533 MHz ≈ 520 cycles per pixel per second
  • Targeting 30 FPS ≈ 17 cycles per pixel in your shader
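The arithmetic above can be sketched in a few lines of Python. This is a rough illustration only: the function name and parameters are hypothetical rather than part of any Arm tool, and the calculation assumes that every core contributes one cycle per clock tick and that each pixel is shaded exactly once (no overdraw).

```python
def cycles_per_pixel_per_frame(width, height, cores, clock_hz, target_fps):
    """Average GPU cycles available for each pixel in one frame.

    Assumes one cycle per core per clock tick and no overdraw.
    """
    pixels = width * height
    cycles_per_second = cores * clock_hz
    return cycles_per_second / (pixels * target_fps)


# Nexus 10 figures from the text: 2560 x 1600, quad-core GPU at 533 MHz, 30 FPS.
budget = cycles_per_pixel_per_frame(2560, 1600, 4, 533_000_000, 30)
print(f"Shader budget: about {budget:.2f} cycles per pixel per frame")
```

Any overdraw divides this budget further, since the same pixel is then shaded more than once per frame.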

Testing for a Fragment Bound Application in Streamline

Select JS0 (Job Slot 0) Active as a counter and run a Streamline capture. Select a 1 second window (using the blue slider) and center it over a position of interest.

Note: It’s a good idea to avoid the first second of the Streamline capture to allow for settling time.

Mali GPU profiling in Arm DS Streamline, showing JS0 counter

Using the number to the right of your capture window (in this case 447,709,960) you can now calculate the fragment percentage:

Fragment Percentage = (JS0 Active / GPU frequency) * 100

We can see in our example that the fragment percentage is 84%. If the fragment shader is taking up such a large percentage of GPU activity, then it’s likely to be the primary cause of any bottlenecking and therefore we should investigate how we can improve performance here.

As a general rule, if you’re using up more than 90% of the GPU with the fragment shader then this is almost certainly the main issue impeding the performance of your application.
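As a quick check of the formula, the sketch below reproduces the example figure. It assumes the capture was taken on the 533 MHz GPU used in the earlier Nexus 10 example; the function name and the `window_seconds` parameter are illustrative additions, not Streamline features.

```python
def fragment_percentage(js0_active_cycles, gpu_frequency_hz, window_seconds=1.0):
    """Share of GPU cycles in the window during which JS0 (fragment) was active."""
    available_cycles = gpu_frequency_hz * window_seconds
    return js0_active_cycles / available_cycles * 100.0


# JS0 Active value read from the example capture, over a 1 second window.
pct = fragment_percentage(447_709_960, 533_000_000)
print(f"Fragment percentage: {pct:.0f}%")

# Rule of thumb from the text: above 90% is almost certainly the main issue.
almost_certainly_fragment_bound = pct > 90.0
```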

Using the Fragment Threads Started counter, we can test for overdraw (again using a 1 second window):

Arm DS Streamline capture for a Mali Midgard GPU, showing overdraw

Overdraw = (Fragment Threads Started * Number of Cores) / (Resolution * FPS)

An overdraw value greater than 3 is another indicator that the fragment shader will be taking up too much of the GPU’s time.
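The overdraw formula can be sketched the same way. The counter value below is hypothetical and chosen purely for illustration; the resolution and frame rate are the Nexus 10 figures used earlier. The formula multiplies by the number of cores, which suggests the counter is read per core, but that reading is an assumption here.

```python
def overdraw_factor(fragment_threads_started, num_cores, width, height, fps):
    """Average number of times each pixel is shaded per frame."""
    threads_total = fragment_threads_started * num_cores
    pixels_per_second = width * height * fps
    return threads_total / pixels_per_second


# Hypothetical counter reading over a 1 second window (not from a real capture).
threads_started = 107_520_000
overdraw = overdraw_factor(threads_started, 4, 2560, 1600, 30)
print(f"Overdraw factor: {overdraw:.2f}")

# Threshold from the text: a value greater than 3 indicates a problem.
too_much_overdraw = overdraw > 3.0
```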

Vertex Bound GPU

In the Mali-T600 series, it is far less likely that your application will be vertex bound than in the mid-range Mali-400 series, because more than one vertex shader is available. However, it can occasionally be the cause of slowdown, as explained below.

Possible Causes of a Vertex Bound Application

Too many vertices in your geometry

Artists working at game studios are often used to creating content with desktop PC graphics cards in mind. There is a tendency to use too many vertices, which can then in turn cause the vertex shader to dominate the GPU workload. This can be worked around with techniques such as strict budgeting and limits for the number of vertices, level of detail (LOD) switching and culling.

Vertex shader calculation is too heavy

The cost of vertex shading depends on both the number of vertices and the number of shader cycles spent on each one. Because you have a limited number of cycles for vertex shading, use the techniques mentioned above to reduce the vertex count, and with it the total number of shader cycles.

Testing for a Vertex Bound Application in Streamline

Select the JS1 (Job Slot 1) counter and begin another Streamline capture. JS1 is the hardware counter that represents the vertex job. Select a 1 second sample window using the blue slider:

JS1 vertex job counter Arm DS Streamline capture

Vertex Percentage = (JS1 Active / GPU frequency) * 100

In the example, our vertex percentage works out at 13%, which is quite low, indicating that we are not vertex bound.

Measuring Load/Store cycles per instruction (Mali Load/Store Pipe: LS instructions) gives an indication as to whether you're using too many vertices:

load / store cycles per instruction to measure vertex count in Arm DS Streamline

Load Store CPI = Full Pipeline issues / Load Store Instruction Words Completed

In our example, this value is 2.02. Any value over 2 is considered a problem worth investigating.
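A minimal sketch of the Load Store CPI calculation follows. The source gives only the result (2.02), not the raw counter values, so the two readings below are hypothetical numbers chosen to produce the same ratio.

```python
def load_store_cpi(full_pipeline_issues, ls_instruction_words_completed):
    """Load/store cycles per instruction, as defined in the formula above."""
    if ls_instruction_words_completed == 0:
        raise ValueError("no load/store instruction words completed in the window")
    return full_pipeline_issues / ls_instruction_words_completed


# Hypothetical counter readings over a 1 second window (not from a real capture).
cpi = load_store_cpi(20_200_000, 10_000_000)
print(f"Load Store CPI: {cpi:.2f}")

# Threshold from the text: any value over 2 is worth investigating.
worth_investigating = cpi > 2.0
```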

Bandwidth Bound GPU

Simply put, an application is bandwidth bound if it tries to use more bandwidth than is available. It can be tricky to determine whether an application is bandwidth bound as it impacts all parts of the application and graphics pipeline.

Possible Causes of a Bandwidth Bound Application

Typically, a smartphone or tablet GPU will have a bandwidth of around 5 GBps. This presents one of the key hurdles for developers who are used to desktop PC gaming where typical bandwidth will exceed 100 GBps.

Though screen sizes and resolutions are increasing in mobile devices, it is still possible to use popular texture compression formats such as ASTC or ETC to greatly reduce the bandwidth required. Arm provides the Mali Texture Compression Tool to help with this.

Testing for a Bandwidth Bound Application

To look at the bandwidth utilization, you need the External Bus Read Beats and External Bus Write Beats counters in the L2 cache. Additionally, you need to know the bus width in bytes. Once again, select a 1 second window:

Arm DS Streamline capture showing a bandwidth bound GPU

Bandwidth (Bytes) = (External Bus Read Beats + External Bus Write Beats) * Bus Width

Our example (using a bus width of 16 bytes) gives a bandwidth of 967 MBps.
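The bandwidth formula can be sketched as below. The source states only the result and the 16 byte bus width, so the read and write beat counts here are hypothetical values chosen to be consistent with 967 MBps. Comparing against the typical figure of around 5 GBps is also an illustrative step, since the real limit depends on the device.

```python
def bandwidth_bytes(read_beats, write_beats, bus_width_bytes):
    """Bytes moved over the external bus during the capture window."""
    return (read_beats + write_beats) * bus_width_bytes


# Hypothetical beat counts over a 1 second window (not from a real capture).
used = bandwidth_bytes(45_000_000, 15_437_500, 16)
print(f"Bandwidth used: {used / 1_000_000:.0f} MB in 1 second")

# Rough comparison against the typical mobile figure quoted earlier (about 5 GBps).
TYPICAL_MOBILE_BANDWIDTH = 5_000_000_000  # bytes per second, approximate
utilization = used / TYPICAL_MOBILE_BANDWIDTH * 100.0
print(f"Approximate utilization: {utilization:.1f}%")
```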

CPU Bound

Possible Causes of CPU Bound Performance

In some cases, optimizing your graphics won't improve performance, because the CPU is causing the low frame rate. This is very often caused by not taking advantage of the multiple cores that most modern smartphones and tablets have. If an application is not using multithreading, you might see (for instance on a quad-core CPU) "25%" CPU utilization, which often means 100% of one core is being used.

This might seem like a common sense optimization, but failing to enable multithreading has been the cause of embarrassingly poor performance in certain well-known desktop PC games, so it is definitely worth checking for when it comes to mobile games.

Because the Mali GPU is a deferred architecture, it helps to reduce the number of draw calls you make and to combine draw calls where possible. You can also take advantage of OpenCL 1.1 support to offload work from the CPU to the GPU.

Testing for CPU Bound Performance

This test is easy: simply look at the CPU activity. Make sure that you expand the chart to show all the cores in order to test for the aforementioned lack of multithreading.
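The pattern to look for in the per-core view can be expressed as a small heuristic. This is only a sketch: the function name and both thresholds are hypothetical choices, and the per-core percentages would have to be read manually from the expanded chart.

```python
def looks_single_threaded(per_core_percent, busy_threshold=90.0, idle_threshold=20.0):
    """True when exactly one core is saturated and all the others are nearly idle."""
    busy = sum(1 for u in per_core_percent if u >= busy_threshold)
    idle = sum(1 for u in per_core_percent if u <= idle_threshold)
    return (
        len(per_core_percent) > 1
        and busy == 1
        and idle == len(per_core_percent) - 1
    )


# Quad-core case from the text: one core fully used, the other three idle.
cores = [100.0, 0.0, 0.0, 0.0]
average = sum(cores) / len(cores)
print(f"Average utilization: {average:.0f}%")
print("Likely single threaded:", looks_single_threaded(cores))
```

An averaged figure of 25% hides the saturated core, which is why the expanded per-core view matters.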

One feature that can help to narrow down the share of CPU activity is the "per application" activity in Streamline. Click on a thread or application in the activity bar (as shown in the screenshot below) and Streamline will filter activity to show how much CPU workload is taken up with that particular application.

Arm DS Streamline capture showing a CPU bound GPU application
