This guide shows how Neon technology has been used to improve performance in the real world: specifically, in the open source Chromium project.

The goal of this guide is to demonstrate to programmers who might be unfamiliar with Neon how they can use intrinsics in their code to enable SIMD (Single Instruction, Multiple Data) processing. Using Neon in this way can bring huge performance benefits, as we will discover in this case study.

What are Neon intrinsics?

Neon technology is a dedicated extension to the Arm Instruction Set Architecture. It adds instructions that can perform mathematical operations in parallel on multiple data streams.

Neon technology can help speed up a wide variety of applications, including:

  • Audio and video processing.
  • 2D and 3D gaming graphics.
  • Voice and facial recognition.
  • Computer vision and deep learning.

Neon intrinsics are function calls that programmers can use in their C or C++ code. The compiler then replaces these function calls with an appropriate Neon instruction or sequence of Neon instructions.

Intrinsics provide almost as much control as writing assembly language, but leave low-level details such as register allocation and instruction scheduling to the compiler. This frees developers to concentrate on the higher-level behavior of their algorithms, rather than the lower-level implementation details.

Another advantage of using intrinsics is that the same source code can be compiled for different targets. This means, for example, that you can have a single source code implementation that can be built for both 32-bit and 64-bit targets.
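To make this concrete, here is a minimal sketch of what Neon intrinsics look like in C. It is an illustration only and is not taken from Chromium: the function name `add_bytes` is hypothetical, and the scalar fallback is included so that the same source also builds on targets without Neon. The intrinsics used (`vld1q_u8`, `vaddq_u8`, `vst1q_u8`) come from `arm_neon.h`, and the same code compiles for both 32-bit and 64-bit Arm targets.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: dst[i] = a[i] + b[i] (wrapping modulo 256) for n bytes. */
void add_bytes(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON)
    /* Each iteration loads 16 bytes from each input into a 128-bit
     * Neon register, adds all 16 lanes at once, and stores the result. */
    for (; i + 16 <= n; i += 16) {
        uint8x16_t va = vld1q_u8(a + i);
        uint8x16_t vb = vld1q_u8(b + i);
        vst1q_u8(dst + i, vaddq_u8(va, vb));
    }
#endif
    /* Scalar loop: handles the remaining tail bytes, or the whole
     * buffer when the target has no Neon support. */
    for (; i < n; i++) {
        dst[i] = (uint8_t)(a[i] + b[i]);
    }
}
```

Note that the code never names a register or schedules an instruction: the compiler chooses both, which is the division of labor described above.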

Why Chromium?

Why did we choose Chromium to investigate the performance improvements possible with Neon?

Chromium provides the basis for Google Chrome, the world's most popular web browser in terms of user numbers. Any performance improvements we were able to make to the Chromium codebase had the potential to benefit many millions of users worldwide.

Chromium is an open source project, so everyone can inspect the full source code. When learning about a new subject, such as programming with Neon intrinsics, it often helps to have examples to learn from. We hope that the examples provided in this guide will prove especially helpful because they can be seen in the context of a complete, real-world codebase.

Why PNG?

The next question we asked was: where should we look in the Chromium code to make optimizations? With over 25 million lines of code, we needed to pick a specific area to target. When looking at the type of workloads web browsers deal with, the bulk of content is still text and graphics. Images often represent most of the downloaded bytes on a web page, and account for a significant proportion of the processing time. Recent data suggests that 53% of mobile users abandon sites that take over 3 seconds to load, so optimizing image load times (and therefore page load times) should bring tangible benefits.

PNG was developed as an improved, non-patented replacement for Graphics Interchange Format (GIF) and is the standard for transparent images on the web. It is also a popular format for web graphics in general. This led to Arm's decision to investigate opportunities for Neon optimization in PNG image processing.

Introducing Bobby the budgie

To help decide where to look for optimization opportunities, we went in search of performance data.

This image of a budgerigar has complex textures, a reasonably large size, and a transparent background, which makes it a good test case for investigating optimizations to the PNG decoding process.

Image source: Penubag [Public domain], via Wikimedia Commons

The first thing to note is that not all PNG images are created equal. There are a number of different ways to encode PNG images, for example:

  • Compression. Different compression algorithms can result in different file sizes. For example, Zopfli produces PNG image files that are typically around 5% smaller than those produced by zlib, at the cost of taking longer to perform the compression.
  • Pre-compression filters. The PNG format allows filtering of the image data to improve compression results. PNG filters are lossless, so they do not affect the content of the image itself. Filters only change the data representation of the image to make it more compressible. Using pre-compression filters can give smaller file sizes at the cost of increased processing time.
  • Color depth. Reducing the number of colors in an image will reduce file size, but also potentially degrade image quality.
  • Color indexing. The PNG format allows individual pixel colors to be specified as either a TrueColor RGB triple, or an index into a palette of colors. Indexing colors reduces file sizes, but may degrade image quality if the original image contains more colors than the maximum allowed by the palette. Indexed colors also need decoding back to the RGB triple, which may increase processing time.
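As an illustration of what reversing a pre-compression filter involves, the sketch below undoes the PNG "Up" filter, in which each byte of a row is stored as the difference from the byte directly above it. The decoder adds the previous row back, modulo 256. This is a simplified sketch and not the libpng implementation; the function name `unfilter_up` and its parameters are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: reverse the PNG "Up" filter in place.
 * row      - filtered bytes of the current row (overwritten with the result)
 * prev_row - already reconstructed bytes of the row above
 *            (all zeros when decoding the first row of the image) */
void unfilter_up(uint8_t *row, const uint8_t *prev_row, size_t row_bytes)
{
    for (size_t i = 0; i < row_bytes; i++) {
        /* The cast keeps only the low 8 bits, giving modulo-256 addition. */
        row[i] = (uint8_t)(row[i] + prev_row[i]);
    }
}
```

Each output byte depends only on the two input bytes at the same index, so a loop of this shape is a natural fit for processing many bytes per instruction.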

We measured performance with three different versions of the Bobby the budgie image to identify possible areas for optimization.

Image               File size  Number of colors  Palette or TrueColor?  Filters?  Compression  Encoder
Original_Bobby.PNG  2.7 MB     211787            TrueColor              Yes       zlib         libpng
Palette_Bobby.PNG   0.9 MB     256               Palette                No        zlib         libpng
Zopfli_Bobby.PNG    2.6 MB     211787            TrueColor              Yes       Zopfli       ZopfliPNG
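The Palette_Bobby.PNG variant stores each pixel as an index into a table of at most 256 colors, which the decoder must expand back into RGB triples. The sketch below shows the idea under simplifying assumptions: 8-bit indices, no transparency data, and every index within the bounds of the palette. The names `rgb_t` and `expand_palette` are hypothetical and this is not the libpng code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical palette entry: one TrueColor RGB triple. */
typedef struct {
    uint8_t r, g, b;
} rgb_t;

/* Hypothetical helper: expand n_pixels palette indices into packed RGB.
 * dst_rgb must have room for 3 * n_pixels bytes, and every index is
 * assumed to be smaller than the number of palette entries. */
void expand_palette(uint8_t *dst_rgb, const uint8_t *indices,
                    size_t n_pixels, const rgb_t *palette)
{
    for (size_t i = 0; i < n_pixels; i++) {
        const rgb_t color = palette[indices[i]];  /* one table lookup per pixel */
        dst_rgb[3 * i + 0] = color.r;
        dst_rgb[3 * i + 1] = color.g;
        dst_rgb[3 * i + 2] = color.b;
    }
}
```

The file is smaller because it stores one byte per pixel instead of three, but the decoder pays for that with a lookup and three stores for every pixel.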

To obtain performance data for each of these three images, we used the Linux perf tool to profile ContentShell.

== Image has pre-compression filters (2.7MB) ==
Lib     Command     SharedObj  Method                          CPU (%)
zlib    TileWorker  liblink    inflate_fast                    1.96
zlib    TileWorker  libblnk    adler32                         0.88
blink   TileWorker  liblink    ImageFrame::setRGBAPremultiply  0.45
blink   TileWorker  liblink    png_read_filter_row_up          0.03*

== Image has no pre-compression filters (0.9MB) ==
Lib     Command     SharedObj  Method                          CPU (%)
libpng  TileWorker  liblink    cr_png_do_expand_palette        0.88
zlib    TileWorker  liblink    inflate_fast                    0.62
blink   TileWorker  liblink    ImageFrame::setRGBAPremultiply  0.49
zlib    TileWorker  libblnk    adler32                         0.31

== Image was optimized using zopfli (2.6MB) ==
Lib     Command     SharedObj  Method                          CPU (%)
zlib    TileWorker  liblink    inflate_fast                    3.06
zlib    TileWorker  libblnk    adler32                         1.36
blink   TileWorker  liblink    ImageFrame::setRGBAPremultiply  0.70
blink   TileWorker  liblink    png_read_filter_row_up          0.48*

This data helped identify the zlib library as a good target for our optimization efforts, because two of its functions, inflate_fast and adler32, appear among the top entries in every profile.

In addition, zlib was considered a good candidate to target for the following reasons:

  • The zlib library is used in many different software applications and libraries, for example, libpng, Skia, FreeType, Cronet, and Chrome, to name but a few. This meant that any performance improvements we could achieve in zlib would yield performance improvements for a large number of users.
  • Released in 1995, the zlib library has a relatively old codebase. Older codebases with areas that might not have been modified in many years are likely to provide more opportunities for improvement.
  • The zlib library did not contain any existing optimizations for Arm, which meant there were likely to be a wide range of improvements that could be made.
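To give a feel for the kind of code involved, the following is a minimal scalar sketch of the Adler-32 checksum, one of the zlib hotspots in the profiles above. It follows the textbook definition of the checksum and is not zlib's own implementation, which is more elaborate. The function name `adler32_simple` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define ADLER_MOD 65521u  /* largest prime smaller than 65536 */

/* Hypothetical helper: textbook Adler-32 over len bytes of data. */
uint32_t adler32_simple(const uint8_t *data, size_t len)
{
    uint32_t a = 1;  /* running sum of the bytes, plus one */
    uint32_t b = 0;  /* running sum of the successive values of a */

    for (size_t i = 0; i < len; i++) {
        a = (a + data[i]) % ADLER_MOD;
        b = (b + a) % ADLER_MOD;
    }
    /* b forms the high 16 bits of the checksum and a the low 16 bits. */
    return (b << 16) | a;
}
```

As a check, the nine ASCII bytes of "Wikipedia" give 0x11E60398. The loop touches every byte of the decompressed data and performs two running sums, which is why it shows up in the profile.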