Performance Expectations

Here are some of the fundamental performance properties that you can expect from a Mali Midgard GPU. The GPU can do the following tasks simultaneously:

  • Issue one new thread per shader core per clock.
  • Retire one fragment per shader core per clock.
  • Write one pixel per shader core per clock.
  • Issue one instruction per pipe per clock.

From the point of view of shader core performance:

  • Each A-pipe can process:
    • 17 FP32 operations per clock.
  • The LS-pipe can process:
    • 128 bits of vector load/store per clock, or
    • 128 bits of varying interpolation per clock, or
    • One imageLoad() or imageStore() per clock, or
    • One atomic access per clock.
  • The T-pipe can process:
    • One bilinear filtered sample per clock, or
    • One bilinear filtered depth sample every two clocks, or
    • One trilinear filtered sample every two clocks, or
    • One plane of YUV sampled data every clock.
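
The T-pipe costs listed above can be turned into a rough lower bound on texturing clocks for a shader. The following is a minimal sketch, not a profiling tool: the function name, the dictionary keys, and the example sample mix are all hypothetical, and it assumes every sample hits the rates quoted above with no cache misses or stalls.

```python
# Clocks per sample on the T-pipe, taken from the list above.
# The key names are illustrative labels, not API identifiers.
T_PIPE_CLOCKS = {
    "bilinear": 1,        # one bilinear sample per clock
    "bilinear_depth": 2,  # one bilinear depth sample every two clocks
    "trilinear": 2,       # one trilinear sample every two clocks
    "yuv_plane": 1,       # one plane of YUV data per clock
}


def t_pipe_clocks(samples):
    """Return the minimum T-pipe clocks for a mapping of sample kind to count."""
    return sum(T_PIPE_CLOCKS[kind] * count for kind, count in samples.items())


# Hypothetical fragment shader: two bilinear lookups, one trilinear lookup,
# and one two-plane YUV lookup (counted as two planes).
example_mix = {"bilinear": 2, "trilinear": 1, "yuv_plane": 2}
example_clocks = t_pipe_clocks(example_mix)
print(example_clocks)
```

Because each shader core retires at most one fragment per clock, a result greater than one suggests that this hypothetical shader would be limited by texturing rather than by fragment issue rate, all else being equal.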

Theoretical peak performance example

If we scale the shader core performance to match an example reference chipset that contains a Mali-T760 MP8 running at 600 MHz, we can calculate the theoretical peak performance as:

Fill rate:

  • 8 pixels per clock = 4.8 GPix/s.
  • That is 2314 complete 1080p layers per second.

Texture rate:

  • 8 bilinear texels per clock = 4.8 GTex/s.
  • That is 38 bilinear filtered texture samples per pixel for 1080p @ 60 FPS.

Arithmetic rate:

  • 17 FP32 FLOPS per clock per A-pipe, with two A-pipes per core = 163 FP32 GFLOPS.
  • That is 1311 FLOPS per pixel for 1080p @ 60 FPS.

Bandwidth:

  • 256 bits of memory access per clock = 19.2 GB/s read and write bandwidth.
  • That is 154 bytes per pixel for 1080p @ 60 FPS.
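
The arithmetic behind these figures can be reproduced with a short script. This is a sketch under stated assumptions: the constant names are invented for illustration, and the value of two A-pipes per shader core is an assumption needed to arrive at the 163 GFLOPS figure. The per-pixel and per-layer values are rounded down, which matches the numbers quoted above.

```python
# Example chipset from the text: Mali-T760 MP8 at 600 MHz.
CORES = 8
CLOCK_HZ = 600_000_000
A_PIPES_PER_CORE = 2           # assumption: needed to reproduce 163 GFLOPS
FLOPS_PER_PIPE_PER_CLOCK = 17  # FP32 operations per A-pipe per clock
BUS_BITS_PER_CLOCK = 256       # memory access width per clock

PIXELS_1080P = 1920 * 1080
PIXELS_PER_SECOND = PIXELS_1080P * 60  # 1080p at 60 FPS

# Peak rates per second.
fill_rate = CORES * CLOCK_HZ     # one pixel per core per clock
texture_rate = CORES * CLOCK_HZ  # one bilinear sample per core per clock
flops = FLOPS_PER_PIPE_PER_CLOCK * A_PIPES_PER_CORE * CORES * CLOCK_HZ
bandwidth = (BUS_BITS_PER_CLOCK // 8) * CLOCK_HZ  # bytes per second

# Budgets, rounded down.
layers_per_second = fill_rate // PIXELS_1080P
samples_per_pixel = texture_rate // PIXELS_PER_SECOND
flops_per_pixel = flops // PIXELS_PER_SECOND
bytes_per_pixel = bandwidth // PIXELS_PER_SECOND

print(f"Fill rate:   {fill_rate / 1e9:.1f} GPix/s, {layers_per_second} layers/s")
print(f"Texture:     {texture_rate / 1e9:.1f} GTex/s, {samples_per_pixel} samples/pixel")
print(f"Arithmetic:  {flops / 1e9:.1f} GFLOPS, {flops_per_pixel} FLOPS/pixel")
print(f"Bandwidth:   {bandwidth / 1e9:.1f} GB/s, {bytes_per_pixel} bytes/pixel")
```

Changing CORES or CLOCK_HZ gives the equivalent theoretical peaks for another configuration. As with the figures above, these are upper bounds rather than sustained rates.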

Platform dependencies

The performance of a Mali GPU in any specific chipset depends heavily on both the configuration choices made in the silicon implementation and the form factor of the final device in which the chipset is used.

Some characteristics, including the number of shader cores and the size of the GPU L2 cache, are visible in the logical configuration of the GPU that the silicon partner has built.

Other characteristics depend on the memory system’s logical configuration, like the memory latency, bandwidth, DDR memory type, and how memory is shared between multiple users.

Some characteristics depend on analogue silicon implementation choices, like which silicon process was used, the target top frequency, and the DVFS voltage and frequency choices available at run time.

Finally, some characteristics depend on the physical form factor of the device, because this determines the available power budget. Therefore, an identical chipset can have very different peak performance results in different form factor devices.

For example:

  • A small smartphone has a sustainable GPU power budget of between 1 and 2 watts.
  • A large smartphone has a sustainable GPU power budget of between 2 and 3 watts.
  • A large tablet has a sustainable GPU power budget of between 4 and 5 watts.
  • An embedded device with a heat sink may have a GPU power budget of up to 10 watts.

Taken together, these factors make it hard to predict the performance of any GPU implementation based solely on the GPU product name, core count, and top frequency. If in doubt, write some test scenarios that match your own use cases, and then run them to see how well they work on your target devices.
