Use Streamline to Optimize Applications for Mali GPUs



Utgard Optimization

This section describes optimizations for graphics processors based on the mid-range Utgard architecture. Processors in this range are the Mali-300, Mali-400 MP and Mali-450 MP.

Many counters are available for Mali-400 GPUs in a single capture. However, you are limited to a maximum of two hardware counters per "unit", where a unit might be, for instance, the L2 cache.


Fragment Bound GPU

Possible Causes of and Tests for Fragment Bound Applications

Just like in the case of the more powerful Mali-T600 series GPUs, your application may be fragment bound for a number of reasons.

Oversized & Uncompressed Textures

A high reading from the Fragment Processor: Total bus reads and Fragment Processor: Texture descriptors reads counters indicates that your application is texture bound.

To check for oversized textures, measure the Mali GPU Fragment Processor X: Texture cache hit count and Mali GPU Fragment Processor X: Texture cache miss count counters and take their ratio. A 10:1 ratio of hits to misses is typical; a lower ratio (more misses per hit) suggests that the texture bit depth is too high or that the textures are generally too big.

A similar test with the Mali GPU Fragment Processor X: Compressed texture cache hit count and Mali GPU Fragment Processor X: Compressed texture cache miss count counters helps to identify whether you are using uncompressed textures.

Overdraw

As described in the case of the Midgard architecture, overdraw can be a factor in fragment bound applications.

To test for the overdraw factor, measure the Mali GPU Fragment Processor X: Fragment passed z/stencil count counter:

Overdraw Factor = Fragment passed z/stencil count / Number of pixels in a frame

A typical result will be around 2.5. As in the Midgard optimization, a value higher than 3 in a scene that doesn't contain much transparency can indicate that performance is being impeded. Sorting scene objects by depth and drawing opaque objects from front to back can help to reduce this.
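The overdraw calculation above can be sketched in a few lines of Python. This is a minimal illustration, not part of Streamline: the function names are hypothetical, and it assumes you have already summed the counter over all fragment processors for a single frame.

```python
def overdraw_factor(fragments_passed: int, width: int, height: int) -> float:
    """Overdraw factor = fragments passed z/stencil / pixels in a frame.

    fragments_passed is assumed to be the per-frame counter value summed
    over every fragment processor (an assumption of this sketch).
    """
    pixels = width * height
    if pixels <= 0:
        raise ValueError("frame must contain at least one pixel")
    return fragments_passed / pixels


def overdraw_is_suspicious(factor: float, threshold: float = 3.0) -> bool:
    """Flag values above 3, the guideline for scenes with little transparency."""
    return factor > threshold


# Hypothetical reading for one frame of a 1280x720 framebuffer.
factor = overdraw_factor(2_304_000, 1280, 720)
print(f"overdraw factor: {factor:.2f}, suspicious: {overdraw_is_suspicious(factor)}")
```

A scene with heavy transparency can legitimately exceed the threshold, so treat the flag as a prompt to inspect draw order rather than as a verdict.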

Fragment Shader Problems

Whether fragment shader time is high depends on the configuration of your GPU. You need to know the number of shader cores and their clock speeds to work out the number of cycles available per fragment. More details are available in the Fragment Shader Optimizations guide.

Instruction words completed per fragment = Instruction completed count / Fragment passed z/stencil count
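As a rough sketch of the formula above, the helper below divides the two counter readings. The function name is hypothetical, and both inputs are assumed to come from the same capture window and the same fragment processor.

```python
def instruction_words_per_fragment(instructions_completed: int,
                                   fragments_passed: int) -> float:
    """Instruction completed count / Fragment passed z/stencil count."""
    if fragments_passed == 0:
        # No fragments passed the depth/stencil test in this window.
        return 0.0
    return instructions_completed / fragments_passed


# Hypothetical counter values read from a capture.
print(instruction_words_per_fragment(9_000_000, 1_000_000))
```

Compare the result with the per-fragment cycle budget you worked out for your GPU configuration.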

Fragment Shader Length & Complexity

Program cache miss count percentage = (Program cache miss count / Program cache hit count) * 100

This value is typically around 0.01%. If it is high, then it's likely the fragment shader is too long. If the program cache hit count is very low, then it's likely the shader is too complex.
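The percentage above could be computed as in the following sketch. The function name is hypothetical; note that, as in the formula, misses are expressed relative to hits rather than to total lookups.

```python
def program_cache_miss_percentage(miss_count: int, hit_count: int) -> float:
    """(Program cache miss count / Program cache hit count) * 100."""
    if hit_count == 0:
        raise ValueError("hit count is zero; the ratio is undefined")
    return miss_count / hit_count * 100.0


# Hypothetical readings: 1 miss per 10,000 hits is about the typical 0.01%.
pct = program_cache_miss_percentage(1, 10_000)
print(f"program cache miss percentage: {pct:.4f}%")
```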

Too Many Branches

Branches are very unlikely to be an issue in Mali GPUs, as branches have a low computational cost. However, you can test for them with the Mali GPU Fragment Processor X: Pipeline bubbles cycle count counter, with a high value indicating that there are too many branches. 


Vertex Bound GPU

Although it is unusual for vertex processing to be the bottleneck in applications, developers who are not used to mobile GPUs may use too many triangles.

Possible Causes of and Tests for a Vertex Bound Application

Vertex Shader Length & Complexity

If the vertex shader is too complex, it's possible to simplify it by taking some of the workload into the fragment shader, or applying arithmetic optimizations. Likewise, if the shader is too long, try to shorten it. Use the Mali GPU Vertex Processor: Active cycles, Mali GPU Vertex Processor: Active cycles, vertex shader and Mali GPU Vertex Processor: Vertex loader cache misses counters to test for these issues.

If the value of Active cycles, vertex shader is low compared to the value of Active cycles and the value of Vertex loader cache misses is high, then it's likely the shader is too long.

If the value of Active cycles, vertex shader is close to the value of Active cycles and the value of Vertex loader cache misses is low, then the shader is too complex.
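The two rules above can be expressed as a small decision helper. This is only a sketch: the function name and the numeric thresholds for "low", "close" and "high" are assumptions chosen for illustration, since the guide does not give exact values. Tune them for your own device and capture length.

```python
def classify_vertex_shader(active_cycles: int,
                           shader_active_cycles: int,
                           loader_cache_misses: int,
                           *,
                           low_ratio: float = 0.5,        # assumed meaning of "low"
                           close_ratio: float = 0.8,      # assumed meaning of "close"
                           high_miss_count: int = 10_000  # assumed meaning of "high"
                           ) -> str:
    """Apply the too-long / too-complex heuristics to three counter readings."""
    if active_cycles <= 0:
        return "no data"
    ratio = shader_active_cycles / active_cycles
    misses_high = loader_cache_misses >= high_miss_count
    if ratio <= low_ratio and misses_high:
        return "too long"
    if ratio >= close_ratio and not misses_high:
        return "too complex"
    return "inconclusive"


# Hypothetical readings from one capture window.
print(classify_vertex_shader(1_000, 950, 100))
```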

Too Many Vertices

Measure Mali GPU Vertex Processor: Vertices processed and see whether the chart is consistently high. If it is, check if your application is using too many triangles. This can be done by measuring the Mali GPU Vertex Processor: Active cycles, PLBU geometry processing counter to test the Polygon List Builder Unit (PLBU) time. A consistently high chart will again suggest that there are too many triangles. This issue can be solved by reducing the complexity of your objects, using fewer objects in a scene or culling triangles. Once you have culled triangles, you can test for the amount of culling that the GPU is doing by measuring the Mali GPU Vertex Processor: Primitives culled counter.

Underutilization of Vertex Buffer Objects (VBOs)

Using VBOs reduces the amount of data that must be transferred every frame, which in turn increases the performance of your application. Check whether you are making use of them with the BufferProfiling: VBO Upload Time (ms) counter.

If VBOs are being used correctly, the chart will peak after a number of frames and then drop off to zero (or close to zero) for one or more frames. If the chart is consistently low, then it's likely you're not using VBOs. More details are available in the Vertex Processing Optimizations guide.


Bandwidth Bound

A bandwidth bound application is affected in all areas, so it can be difficult to identify the root cause.

Possible Causes of and Tests for Bandwidth Bound Applications

Too Much Texture Bandwidth

Textures are the largest user of memory bandwidth and a side effect of using too much texture bandwidth is that the texture cache usage will be high. This is usually caused by textures that are too large, uncompressed, or not mipmapped.

Test for this by measuring the Mali GPU Fragment Processor X: Texture cache hit count and Mali GPU Fragment Processor X: Texture cache miss count and taking their ratio. A ratio of 10:1 is typical, with a lower ratio being worse.
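The ratio test above might be sketched as follows. The function names are hypothetical, and the 10:1 default is simply the typical figure quoted in the text, not a hard limit.

```python
def texture_cache_hit_miss_ratio(hits: int, misses: int) -> float:
    """Texture cache hit count / Texture cache miss count."""
    if misses == 0:
        # No misses at all: treat the ratio as unbounded.
        return float("inf")
    return hits / misses


def texture_bandwidth_suspect(hits: int, misses: int,
                              typical_ratio: float = 10.0) -> bool:
    """True when the ratio falls below the typical 10:1 (lower is worse)."""
    return texture_cache_hit_miss_ratio(hits, misses) < typical_ratio


# Hypothetical readings for one fragment processor.
print(texture_bandwidth_suspect(30_000, 10_000))
```

Repeat the check for each fragment processor, since the counters are reported per processor.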

Blitting

Blitting can contribute to using too much bandwidth, especially if you are blitting a high resolution framebuffer.

Use the EGL Counters: Blit Time software counter to measure this.

Other Factors

Fragment processing and vertex processing can also cause an application to be bandwidth bound. Trilinear filtering, complex fragment shaders, too many triangles (without the use of culling), complex vertex shaders, or reading non-localized data can all contribute. Refer to the sections on how to deal with the fragment and vertex shader for the relevant tests.

To get an idea of how your application is performing, versus the maximum available bandwidth, sum the following counters:

Mali GPU Vertex Processor: Words read, system bus
Mali GPU Vertex Processor: Words written, system bus
Mali GPU Fragment Processor X: Total bus reads
Mali GPU Fragment Processor X: Total bus writes

Ensure that you have taken measurements from all available fragment processors. Finally, multiply the total by 8 to obtain the value in bytes.
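A minimal sketch of this sum is shown below. The function names are hypothetical, and the utilisation helper additionally assumes that you know the length of the capture window and the maximum bandwidth of your system, neither of which is provided by the counters themselves.

```python
from typing import Sequence

BYTES_PER_WORD = 8  # the sum of bus words is multiplied by 8 to get bytes


def total_bus_bytes(vp_words_read: int,
                    vp_words_written: int,
                    fp_bus_reads: Sequence[int],
                    fp_bus_writes: Sequence[int]) -> int:
    """Sum vertex and fragment processor bus traffic and convert to bytes.

    fp_bus_reads and fp_bus_writes hold one entry per fragment processor.
    """
    words = (vp_words_read + vp_words_written
             + sum(fp_bus_reads) + sum(fp_bus_writes))
    return words * BYTES_PER_WORD


def bandwidth_utilisation(total_bytes: int,
                          window_seconds: float,
                          max_bytes_per_second: float) -> float:
    """Fraction of the (assumed known) maximum bandwidth used in the window."""
    return (total_bytes / window_seconds) / max_bytes_per_second


# Hypothetical readings for a GPU with two fragment processors.
used = total_bus_bytes(1_000, 500, [2_000, 2_000], [750, 750])
print(used)
```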

If you don't know the maximum available bandwidth of your system, running a test application that uses as much bandwidth as possible will help to determine it. More details are available at Bandwidth Bound.
