Using Streamline to Optimize Applications for Mali GPUs

Use this guide to optimize the graphics in your Android applications running on Mali-400 and Mali-T600 GPUs using the Arm DS-5 Streamline performance analyzer.

Introduction

Arm DS-5 Streamline can form a useful part of your workflow when optimizing applications for Mali Midgard and Utgard based GPUs. Streamline allows you to see what API calls are being made, how many times API functions are called and how much time is spent in API functions.

This guide is intended to give you a selection of common issues and bottlenecks that can be found using the Streamline performance analyzer. In turn, these should help you diagnose areas of your code or texture assets that can be improved. For the complete guide to Mali GPU optimization, read the Mali Developer OpenGL ES Application Optimization Guide.

Also provided are rule-of-thumb measurements, which give an idea of whether a counter value in Streamline is likely to be causing a performance issue in your application.

However, it’s also important to have an understanding of the specifications and limits of the GPU itself, so that you can begin to create your own tests and rules.

When do I need to use Streamline?

The Streamline performance analyzer gives a complete picture of system performance and is very useful for telling you where a problem is. However, to find out what is causing that problem, you need to use tools such as the Mali Graphics Debugger.

Setting up Streamline

Streamline requires that you have root access on your device so that you can install gator (the daemon that Streamline uses to collect system information) and insert the kernel module. For Android devices, you will also need Android SDK and NDK.

We have a selection of blogs and presentations covering Streamline setup on specific devices.


Midgard Optimization

This section deals with optimizations for graphics processors based on high-end Mali GPUs with Midgard architecture. Processors in this range are the Mali-T604, Mali-T622, Mali-T624, Mali-T628, Mali-T678 and Mali-T760, as well as the mid-range Mali-T720.


Fragment Bound GPU

If the fragment shader is dominating the GPU workload, we refer to the application as fragment bound. This is a common cause of poor performance in mobile games. There are several common reasons why this might be happening:

Possible Causes of a Fragment Bound Application

Overdraw

The concept of drawing objects from front to back should be well known to game developers, but it is worth reiterating its importance in mobile gaming. Drawing each pixel on the screen more than once creates overdraw, which shows up as the fragment shader dominating the GPU workload. Excessive use of transparency in a scene can also contribute to a fragment bound application.

Is the resolution too high?

Smartphone resolution is very high, in many cases exceeding consoles, which in their previous generation typically rendered at 720p. Compare this with a device like the Google Nexus 10, which has a 2.5k native resolution. If you're attempting to run an application at native resolution, this can result in a significant slowdown.

Is texture bandwidth too high?

Since textures use a large amount of memory bandwidth, the fragment shader might not be able to get sufficient data and will stall.

Too many effects or cycles in the shader

Every light and every effect will add to the number of cycles that your shader will take.

Going back to the example of the Google Nexus 10:

Nexus 10 native resolution = 2560 x 1600 = 4,096,000 pixels

Quad-core GPU at 533 MHz = (4 x 533,000,000) / 4,096,000 ≈ 520 cycles per pixel per second

Targeting 30 FPS = 520 / 30 ≈ 17 cycles per pixel per frame in your shader
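The arithmetic above can be reproduced for other devices with a short script. This is a minimal sketch: the function name and parameters are illustrative and not part of Streamline, and it assumes every pixel is shaded exactly once per frame and that each core contributes one cycle per clock tick.

```python
def shader_cycle_budget(width, height, cores, gpu_hz, target_fps):
    """Approximate shader cycles available per pixel per frame."""
    pixels = width * height
    # Total cycles per second across all cores, shared between all pixels.
    cycles_per_pixel_per_second = (cores * gpu_hz) / pixels
    return cycles_per_pixel_per_second / target_fps


# Nexus 10 figures from the example above.
budget = shader_cycle_budget(2560, 1600, cores=4, gpu_hz=533_000_000, target_fps=30)
print(f"{budget:.2f} cycles per pixel per frame")
```

Any overdraw reduces this budget further, since the same cycles are then spread over more shaded fragments.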

Testing for a Fragment Bound Application in Streamline

Select JS0 (Job Slot 0) Active as a counter and run a Streamline capture. Select a 1 second window (using the blue slider) and center it over a position of interest. N.B. It’s a good idea to avoid the first second of the Streamline capture to allow for settling time.

Mali GPU profiling in ARM DS-5 Streamline, showing JS0 counter

Using the number to the right of your capture window (in this case 447,709,960) you can now calculate the fragment percentage:

Fragment Percentage = (JS0 Active / GPU frequency) * 100

We can see in our example that the fragment percentage is 84%. If the fragment shader is taking up such a large percentage of GPU activity, then it’s likely to be the primary cause of any bottlenecking and therefore we should investigate how we can improve performance here.

As a general rule, if you’re using up more than 90% of the GPU with the fragment shader then this is almost certainly the main issue impeding the performance of your application.
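As a rough illustration, the fragment percentage and the 90% rule can be computed as follows. The helper names are hypothetical, and the 533 MHz clock is an assumption taken from the Nexus 10 figures quoted earlier; substitute the frequency of your own device.

```python
def busy_percentage(active_cycles, gpu_hz, window_seconds=1.0):
    """Share of the capture window during which a job slot was active."""
    return active_cycles / (gpu_hz * window_seconds) * 100.0


def likely_fragment_bound(fragment_pct, threshold=90.0):
    """Rule of thumb from the text: above 90% is almost certainly the bottleneck."""
    return fragment_pct > threshold


js0_active = 447_709_960  # JS0 Active value from the example capture
gpu_hz = 533_000_000      # assumption: 533 MHz GPU clock

fragment_pct = busy_percentage(js0_active, gpu_hz)
print(f"Fragment percentage: {fragment_pct:.0f}%")
print("Over the 90% rule of thumb:", likely_fragment_bound(fragment_pct))
```

The same `busy_percentage` helper applies to any job slot counter measured over a window of known length.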

Using the Fragment Threads Started counter, we can test for overdraw (again using a 1 second window):

ARM DS-5 Streamline capture for a Mali Midgard GPU, showing overdraw

Overdraw = (Fragment Threads Started * Number of Cores) / (Resolution * FPS)

An overdraw value greater than 3 is another indicator that the fragment shader will be taking up too much of the GPU’s time.
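The overdraw formula can be sketched in code as shown here. The counter value is hypothetical, chosen only to illustrate the calculation, and the function name is not part of any Arm tool.

```python
def midgard_overdraw(fragment_threads_started, cores, width, height, fps):
    """Average number of times each pixel is shaded per frame."""
    return (fragment_threads_started * cores) / (width * height * fps)


# Hypothetical capture: 76.8 million fragment threads started in a 1 second
# window, on a quad-core GPU rendering 2560 x 1600 at 30 FPS.
factor = midgard_overdraw(76_800_000, cores=4, width=2560, height=1600, fps=30)
print(f"Overdraw: {factor:.2f}")
if factor > 3:
    print("Overdraw exceeds the rule-of-thumb limit of 3")
```

Use the frame rate actually achieved during the capture window, not the target frame rate, or the result will be skewed.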


Vertex Bound GPU

In the Mali-T600 series, it is far less likely that your application will be vertex bound than in the mid-range Mali-400 series, since there is more than one vertex shader available. However, it can occasionally be the cause of slowdown, as explained below.

Possible Causes of a Vertex Bound Application

Too many vertices in your geometry

Artists working at game studios are often used to creating content with desktop PC graphics cards in mind. There is a tendency to use too many vertices, which can then in turn cause the vertex shader to dominate the GPU workload. This can be worked around with techniques such as strict budgeting and limits for the number of vertices, level of detail (LOD) switching and culling.

Vertex shader calculation is too heavy

There is a link between number of vertices and shader cycles. Given that you have a limited number of cycles to do your vertex shading, it is important to use the same techniques mentioned above to reduce this.

Testing for a Vertex Bound Application in Streamline

Select the JS1 (Job Slot 1) counter and begin another Streamline capture. JS1 is the hardware counter that represents the vertex job. Select a 1 second sample window using the blue slider:

JS1 vertex job counter ARM DS-5 Streamline capture

Vertex Percentage = (JS1 Active / GPU frequency) * 100

In the example, our vertex percentage works out at 13%, which is quite low, indicating that we are not vertex bound.

Measuring Load/Store cycles per instruction (Mali Load/Store Pipe: LS instructions) gives an indication as to whether you're using too many vertices:

load / store cycles per instruction to measure vertex count in ARM DS-5 Streamline

Load Store CPI = Full Pipeline issues / Load Store Instruction Words Completed

In our example, this value is 2.02. Any value over 2 is considered a problem worth investigating.
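A minimal sketch of the load/store CPI check follows. The counter values are hypothetical and were chosen only so that the result matches the 2.02 quoted above; the function name is illustrative.

```python
def load_store_cpi(full_pipeline_issues, ls_instruction_words_completed):
    """Load/store cycles per instruction for the sampled window."""
    if ls_instruction_words_completed == 0:
        raise ValueError("no load/store instruction words completed in this window")
    return full_pipeline_issues / ls_instruction_words_completed


# Hypothetical counter values that reproduce the example result.
cpi = load_store_cpi(20_200_000, 10_000_000)
print(f"Load/store CPI: {cpi:.2f}")
print("Worth investigating:", cpi > 2)
```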


Bandwidth Bound GPU

Simply put, an application is bandwidth bound if it tries to use more bandwidth than is available. It can be tricky to determine whether an application is bandwidth bound as it impacts all parts of the application and graphics pipeline.

Possible Causes of a Bandwidth Bound Application

Typically, a smartphone or tablet GPU will have a bandwidth of around 5 GBps. This presents one of the key hurdles for developers who are used to desktop PC gaming where typical bandwidth will exceed 100 GBps.

Though screen sizes and resolutions are increasing in mobile devices, it is still possible to use popular texture compression formats such as ASTC or ETC to greatly reduce the bandwidth required. Arm provides the Mali Texture Compression Tool to help with this.

Testing for a Bandwidth Bound Application

To look at the bandwidth utilization, you need to look at the External Bus Read Beats and External Bus Write Beats in the L2 cache. Additionally, you need to know the bus width in bytes. Once again, look at a 1 second window:

ARM DS-5 Streamline capture showing a bandwidth bound GPU

Bandwidth (Bytes) = (External Bus Read Beats + External Bus Write Beats) * Bus Width

Our example (using a bus width of 16 bytes) gives a bandwidth of 967 MBps.
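The bandwidth calculation can be sketched as below. The beat counts are hypothetical values chosen to reproduce the 967 MBps result with a 16 byte bus, and the 5 GBps ceiling is the approximate figure quoted earlier, not a measured limit for any specific device.

```python
def bus_bandwidth_bytes(read_beats, write_beats, bus_width_bytes):
    """Bytes transferred over the external bus during the sampled window."""
    return (read_beats + write_beats) * bus_width_bytes


read_beats = 45_000_000   # hypothetical External Bus Read Beats
write_beats = 15_437_500  # hypothetical External Bus Write Beats
bus_width_bytes = 16

used = bus_bandwidth_bytes(read_beats, write_beats, bus_width_bytes)
available = 5_000_000_000  # assumption: roughly 5 GBps available
print(f"{used / 1e6:.0f} MB/s, {used / available * 100:.1f}% of available bandwidth")
```

Because the window is 1 second long, the byte count is also the bandwidth in bytes per second; for other window lengths, divide by the window duration.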


CPU Bound

Possible Causes of CPU Bound Performance

In some cases, optimizing your graphics won't improve performance, as the CPU is causing the low frame rate. This is very often caused by not taking advantage of the multiple cores that most modern day smartphones and tablets have. If an application is not using multithreading, you might see (for instance on a quad-core CPU) "25%" CPU utilization, which often means 100% of one core is being used.

This might seem like a common sense optimization, but failing to enable multithreading has been the cause of embarrassingly poor performance in certain well-known desktop PC games, so it is definitely worth checking for when it comes to mobile games.

Since the Mali GPU is a deferred architecture, it helps to reduce the number of draw calls you make and to combine draw calls together. You can also take advantage of OpenCL 1.1 support to offload work from the CPU to the GPU.

Testing for CPU Bound Performance

This test is easy: simply look at the CPU activity. Make sure that you expand the chart to show all the cores in order to test for the aforementioned lack of multithreading.
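The "25% on a quad-core CPU" case can be illustrated with a small check over per-core utilization figures read from the expanded chart. This is only a sketch: the function name and the 95% saturation threshold are assumptions, not values defined by Streamline.

```python
def cpu_summary(per_core_percent, saturated_at=95.0):
    """Return (overall utilization, busiest core, whether one core is saturated)."""
    overall = sum(per_core_percent) / len(per_core_percent)
    busiest = max(per_core_percent)
    return overall, busiest, busiest >= saturated_at


# One core fully busy, three idle: reported as 25% overall.
overall, busiest, saturated = cpu_summary([100.0, 0.0, 0.0, 0.0])
print(f"Overall {overall:.0f}%, busiest core {busiest:.0f}%")
if saturated:
    print("One core is saturated: the application may be single-threaded")
```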

One feature that can help to narrow down the share of CPU activity is the "per application" activity in Streamline. Click on a thread or application in the activity bar (as shown in the screenshot below) and Streamline will filter activity to show how much CPU workload is taken up with that particular application.

ARM DS-5 Streamline capture showing a CPU bound GPU application


Utgard Optimization

This section deals with optimizations for graphics processors based on the mid-range Utgard architecture. Processors in this range are the Mali-300, Mali-400 MP and Mali-450 MP.

There are a good number of counters available to use with Mali-400 GPUs in a single capture. However, you are limited to a maximum of 2 hardware counters per "unit", where a unit might for instance be the L2 cache.


Fragment Bound GPU

Possible Causes of and Tests for Fragment Bound Applications

Just like in the case of the more powerful Mali-T600 series GPUs, your application may be fragment bound for a number of reasons.

Oversized & Uncompressed Textures

A high reading from the Fragment Processor: Total bus reads and Fragment Processor: Texture descriptors reads counters indicates that your application is texture bound.

To check for oversized textures, measure the Mali GPU Fragment Processor X: Texture cache hit count and Mali GPU Fragment Processor X: Texture cache miss count counters and take their ratio. A 10:1 ratio of hit to miss is typical, with lower values suggesting that texture bit-depth is too high, or textures in general are too big.

A similar test with the Mali GPU Fragment Processor X: Compressed texture cache hit count and Mali GPU Fragment Processor X: Compressed texture cache miss count counters helps to identify whether you are using uncompressed textures.

Overdraw

As described in the case of the Midgard architecture, overdraw can be a factor in fragment bound applications.

To test for overdraw factor, measure the Mali GPU Fragment Processor X: Fragment passed z/stencil count counter:

Overdraw Factor = Fragment passed z/stencil count / Number of pixels in a frame

A typical result will be around 2.5. As in the Midgard optimization, a value higher than 3 in a scene that doesn't contain much transparency can indicate that performance is being impeded. Sorting scene objects by depth and drawing opaque objects from front to back can help to reduce this.
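A possible way to compute the overdraw factor from a capture is sketched below. It assumes that the fragment passed z/stencil counter is reported separately for each fragment processor and must be summed, and that the sampled window covers a known number of frames; both points, along with the counter values, are assumptions for illustration.

```python
def utgard_overdraw_factor(fragments_passed_per_core, width, height, frames=1):
    """Fragments passing z/stencil per screen pixel, averaged over `frames` frames."""
    total_fragments = sum(fragments_passed_per_core)
    return total_fragments / (width * height * frames)


# Hypothetical 1 second window at 30 FPS on a four-core GPU at 1280 x 720.
per_core_counts = [17_280_000, 17_280_000, 17_280_000, 17_280_000]
factor = utgard_overdraw_factor(per_core_counts, 1280, 720, frames=30)
print(f"Overdraw factor: {factor:.2f}")
if factor > 3:
    print("Higher than 3: consider sorting opaque objects front to back")
```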

Fragment Shader Problems

Whether fragment shader time counts as high depends on the configuration of your GPU. You need to know the number of shader cores and their clock speeds to work out the number of cycles available per fragment.

Instruction words completed per fragment = Instruction completed count / Fragment passed z/stencil count

Fragment Shader Length & Complexity

Program cache miss count percentage = (Program cache miss count / Program cache hit count) * 100

This value is typically around 0.01%. If it is high, then it's likely the fragment shader is too long. If the graph of program cache hit count is very low, then it's likely the shader is too complex.
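The two fragment shader ratios above can be computed together, as in this sketch. All counter values are hypothetical, and the function names are illustrative rather than part of Streamline.

```python
def instruction_words_per_fragment(instructions_completed, fragments_passed):
    """Average instruction words executed for each fragment that passed z/stencil."""
    if fragments_passed == 0:
        raise ValueError("no fragments passed z/stencil in this window")
    return instructions_completed / fragments_passed


def program_cache_miss_percentage(miss_count, hit_count):
    """Program cache misses as a percentage of program cache hits."""
    if hit_count == 0:
        raise ValueError("no program cache hits in this window")
    return miss_count / hit_count * 100.0


# Hypothetical counter values for one fragment processor.
words = instruction_words_per_fragment(46_080_000, 2_304_000)
miss_pct = program_cache_miss_percentage(500, 5_000_000)
print(f"{words:.1f} instruction words per fragment")
print(f"Program cache miss percentage: {miss_pct:.3f}%")
```

Compare the instruction words per fragment against the cycles per fragment available on your GPU configuration to judge whether the shader fits the budget.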

Too Many Branches

Branches are very unlikely to be an issue in Mali GPUs, as branches have a low computational cost. However, you can test for them with the Mali GPU Fragment Processor X: Pipeline bubbles cycle count counter, with a high value indicating that there are too many branches.


Vertex Bound GPU

Although it is unusual for vertex processing to be the bottleneck in applications, there is the potential that developers not used to mobile GPUs will use too many triangles.

Possible Causes of and Tests for a Vertex Bound Application

Vertex Shader Length & Complexity

If the vertex shader is too complex, it's possible to simplify it by taking some of the workload into the fragment shader, or applying arithmetic optimizations. Likewise, if the shader is too long, try to shorten it. Use the Mali GPU Vertex Processor: Active cycles, Mali GPU Vertex Processor: Active cycles, vertex shader and Mali GPU Vertex Processor: Vertex loader cache misses counters to test for these issues.

If the value of Active cycles, vertex shader is low compared to the value of Active cycles and the value of Vertex loader cache misses is high, then it's likely the shader is too long.

If the value of Active cycles, vertex shader is close to the value of Active cycles and the value of Vertex loader cache misses is low, then the shader is too complex.
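The two rules above can be expressed as a simple decision function. The text does not define what counts as "low", "close" or "high", so the thresholds here (`low_share`, `high_share`, `high_misses`) are placeholder assumptions that you would need to tune for your own device and content.

```python
def classify_vertex_shader(active_cycles, shader_active_cycles, loader_cache_misses,
                           low_share=0.5, high_share=0.9, high_misses=100_000):
    """Apply the long/complex heuristics using placeholder thresholds."""
    if active_cycles == 0:
        return "idle"
    # Fraction of vertex processor time spent running the vertex shader itself.
    share = shader_active_cycles / active_cycles
    misses_high = loader_cache_misses >= high_misses
    if share <= low_share and misses_high:
        return "shader likely too long"
    if share >= high_share and not misses_high:
        return "shader likely too complex"
    return "inconclusive"


# Hypothetical counter values for two different captures.
print(classify_vertex_shader(10_000_000, 3_000_000, 250_000))
print(classify_vertex_shader(10_000_000, 9_600_000, 1_200))
```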

Too Many Vertices

Measure Mali GPU Vertex Processor: Vertices processed and see whether the chart is consistently high. If it is, check if your application is using too many triangles. This can be done by measuring the Mali GPU Vertex Processor: Active cycles, PLBU geometry processing counter to test the Polygon List Builder Unit (PLBU) time. A consistently high chart will again suggest that there are too many triangles. This issue can be solved by reducing the complexity of your objects, using fewer objects in a scene or culling triangles. Once you have culled triangles, you can test for the amount of culling that the GPU is doing by measuring the Mali GPU Vertex Processor: Primitives culled counter.

Underutilization of Vertex Buffer Objects (VBOs)

Using VBOs reduces the amount of data that must be transferred every frame, which in turn increases the performance of your application. Check whether you are making use of them with the BufferProfiling: VBO Upload Time (ms) counter.

If VBOs are being used correctly, the chart will peak after a number of frames and then drop off to zero (or close to zero) for one or more frames. If the chart is consistently low, then it's likely you're not using VBOs.


Bandwidth Bound

Being bandwidth bound affects all areas of an application, so it can be difficult to identify the root cause.

Possible Causes of and Tests for Bandwidth Bound Applications

Too Much Texture Bandwidth

Textures are the largest user of memory bandwidth and a side effect of using too much texture bandwidth is that the texture cache usage will be high. This is usually caused by textures that are too large, uncompressed, or not mipmapped.

Test for this by measuring the Mali GPU Fragment Processor X: Texture cache hit count and Mali GPU Fragment Processor X: Texture cache miss count and taking their ratio. A ratio of 10:1 is typical, with a lower ratio being worse.
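A short sketch of the ratio test follows. The counter values are hypothetical and the helper name is illustrative; the only rule taken from the text is that roughly 10:1 is typical and that a lower ratio is worse.

```python
def hit_to_miss_ratio(hit_count, miss_count):
    """Texture cache hits per miss; infinite when there were no misses."""
    if miss_count == 0:
        return float("inf")
    return hit_count / miss_count


# Hypothetical counter values for one fragment processor.
ratio = hit_to_miss_ratio(9_000_000, 1_500_000)
print(f"Texture cache hit:miss = {ratio:.1f}:1")
if ratio < 10:
    print("Below the typical 10:1: check texture size, compression and mipmapping")
```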

Blitting

Blitting can contribute to using too much bandwidth, especially if you are blitting a high resolution framebuffer.

Use the EGL Counters: Blit Time software counter to measure this.

Other Factors

Fragment processing and vertex processing can also cause an application to be bandwidth bound. Trilinear filtering, complex fragment shaders, too many triangles (without the use of culling), complex vertex shaders, or reading non-localized data can all contribute. Refer to the sections on how to deal with the fragment and vertex shader for the relevant tests.

To get an idea of how your application is performing versus the maximum available bandwidth, sum the following counters:

Mali GPU Vertex Processor: Words read, system bus

Mali GPU Vertex Processor: Words written, system bus

Mali GPU Fragment Processor X: Total bus reads

Mali GPU Fragment Processor X: Total bus writes

Make sure that you have taken measurements from all available fragment processors. Finally, multiply the total by 8 to obtain the value in bytes.

If you don't know the maximum available bandwidth of your system, running a test application that uses as much bandwidth as possible will help to determine this.
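The sum described above might be computed like this. The counter values and the measured maximum are hypothetical, and the function name is illustrative; the multiplication by 8 follows the conversion to bytes given in the text.

```python
def utgard_bus_bytes(vp_words_read, vp_words_written, fp_bus_reads, fp_bus_writes):
    """Total bytes moved by the vertex processor and all fragment processors."""
    total_words = (vp_words_read + vp_words_written
                   + sum(fp_bus_reads) + sum(fp_bus_writes))
    return total_words * 8  # 8 bytes per counted word, as stated in the text


# Hypothetical 1 second capture on a GPU with four fragment processors.
used = utgard_bus_bytes(
    vp_words_read=1_000_000,
    vp_words_written=500_000,
    fp_bus_reads=[4_000_000] * 4,
    fp_bus_writes=[2_000_000] * 4,
)
measured_max = 1_000_000_000  # assumption: peak found with a bandwidth stress test
print(f"{used / 1e6:.0f} MB/s, {used / measured_max * 100:.1f}% of measured maximum")
```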

Further Reading

Mali Blogs

Since Streamline is used frequently in our own development processes, you can find plenty of blogs on the Arm Connected Community, giving you an insight into Mali driver development.

Macro-scale Pipelining

Pete Harris, our lead performance engineer for the Mali OpenGL ES driver team, has put together an additional guide for debugging issues related to macro-scale pipelining.

Cocos2d-x Engine Optimization

Learn about using Streamline to optimize the popular Cocos2d-x game engine.

Profiling Epic Citadel

For greater than 1080p resolutions, optimizing your games and applications is more important than ever. Mali GPU Tools product manager, Lorenzo Dal Col, has written a guide to profiling the Epic Citadel demo, with lots of practical advice and simple calculations to identify areas of high GPU usage.

Setup information for using Streamline with the Google Nexus 10 (used in the Epic Citadel example) can be found here.