This guide shows how Neon technology has been used to improve performance in the real world: specifically, in the open-source Chromium project.

In this guide, we show programmers who are unfamiliar with Neon how to use intrinsics in their code to enable Single Instruction, Multiple Data (SIMD) processing. Using Neon in this way can bring huge performance benefits.

What are Neon intrinsics?

Neon technology is a dedicated extension to the Arm Instruction Set Architecture that provides additional instructions to perform mathematical operations in parallel on multiple data streams.

Neon technology can help speed up a wide variety of applications, including:

  • Audio and video processing
  • 2D and 3D gaming graphics
  • Voice and facial recognition
  • Computer vision and deep learning

Neon intrinsics are function calls that programmers can use in their C or C++ code. The compiler then replaces these function calls with an appropriate Neon instruction or sequence of Neon instructions.

Intrinsics provide almost as much control as writing assembly language, but leave low-level details such as register allocation and instruction scheduling to the compiler. This frees developers to concentrate on the higher-level behavior of their algorithms, rather than the lower-level implementation details.

Another advantage of using intrinsics is that the same source code can be compiled for different targets. This means that, for example, a single source code implementation can be built for both 32-bit and 64-bit targets. 

Why Chromium?

Why did we choose Chromium to investigate the performance improvements possible with Neon?

Chromium provides the basis for Google Chrome, which is the most popular web browser in the world in terms of user numbers. Any performance improvements that we make to the Chromium codebase can benefit many millions of users worldwide.

Chromium is an open-source project, so everyone can inspect the full source code. When learning about a new subject, like programming with Neon intrinsics, it often helps to have examples to learn from. We hope that the examples that are provided in this guide will help, because you can see them in the context of a complete, real-world codebase. 

Why PNG?

Now that we have decided to work in Chromium, where should we look in the Chromium code to make optimizations? With over 25 million lines of code, we must pick a specific area to target. When looking at the type of workloads that web browsers deal with, the bulk of content is still text and graphics. Images often represent most of the downloaded bytes on a web page, and contribute to a significant proportion of the processing time. Recent data suggests that 53% of mobile users abandon sites that take over 3 seconds to load. This means that optimizing image load times, and therefore page load times, should bring tangible benefits.

The Portable Network Graphics (PNG) format was developed as an improved, non-patented replacement for the Graphics Interchange Format (GIF). PNG is the standard for transparent images on the web. It is also a popular format for web graphics in general. Because of this, Arm decided to investigate opportunities for Neon optimization in PNG image processing.

Introducing Bobby the bird

To help decide where to look for optimization opportunities, we went in search of performance data.

The following image of a bird has complex textures, a reasonably large size, and a transparent background. This means that it is a good test case for investigating optimizations to the PNG decoding process:


Image source: Penubag [Public domain], via Wikimedia Commons

The first thing to know is that not all PNG images are created equal. There are several different ways to encode a PNG image, for example:

  • Compression. Different compression algorithms can result in different file sizes. For example, Zopfli produces PNG image files that are typically about 5% smaller than zlib, at the cost of taking longer to perform the compression.
  • Pre-compression filters. The PNG format allows filtering of the image data to improve compression results. PNG filters are lossless, so they do not affect the content of the image itself. Filters only change the data representation of the image to make it more compressible. Using pre-compression filters can give smaller file sizes at the cost of increased processing time.
  • Color depth. Reducing the number of colors in an image reduces file size, but also potentially degrades image quality.
  • Color indexing. The PNG format allows individual pixel colors to be specified as either a TrueColor RGB triple, or an index into a palette of colors. Indexing colors reduces file sizes, but may degrade image quality if the original image contains more colors than the maximum that the palette allows. Indexed colors also need decoding back to the RGB triple, which may increase processing time.

We measured performance with three different versions of the Bobby the budgie image to identify possible areas for optimization.

Image                File size   Number of colors   Palette or TrueColor?   Filters?   Compression   Encoder
Original_Bobby.PNG   2.7M        211787             TrueColor               Yes        zlib          libpng
Palette_Bobby.PNG    0.9M        256                Palette                 No         zlib          libpng
Zopfli_Bobby.PNG     2.6M        211787             TrueColor               Yes        Zopfli        ZopfliPNG

To obtain performance data for each of these three images, we used the Linux perf tool to profile ContentShell. The performance data for each image is as follows:

== Image has pre-compression filters (2.7MB) ==
Lib      Command      Shared object   Method                           CPU (%)
zlib     TileWorker   libblink        inflate_fast                     1.96
zlib     TileWorker   libblink        adler32                          0.88
blink    TileWorker   libblink        ImageFrame::setRGBAPremultiply   0.45
blink    TileWorker   libblink        png_read_filter_row_up           0.03*

== Image has no pre-compression filters (0.9MB) ==
Lib      Command      Shared object   Method                           CPU (%)
libpng   TileWorker   libblink        cr_png_do_expand_palette         0.88
zlib     TileWorker   libblink        inflate_fast                     0.62
blink    TileWorker   libblink        ImageFrame::setRGBAPremultiply   0.49
zlib     TileWorker   libblink        adler32                          0.31

== Image was optimized using Zopfli (2.6MB) ==
Lib      Command      Shared object   Method                           CPU (%)
zlib     TileWorker   libblink        inflate_fast                     3.06
zlib     TileWorker   libblink        adler32                          1.36
blink    TileWorker   libblink        ImageFrame::setRGBAPremultiply   0.70
blink    TileWorker   libblink        png_read_filter_row_up           0.48*

This data identified the zlib library as a good target for our optimization efforts, because it contains several methods that consume a significant share of CPU time.

Zlib was also considered a good candidate to target for the following reasons: 

  • The zlib library is used in many different software applications and libraries, for example libpng, Skia, FreeType, Cronet, and Chrome. This means that any performance improvements that we could achieve in zlib would yield performance improvements for many users.
  • Released in 1995, the zlib library has a relatively old codebase. Older codebases might have areas that have not been modified in many years. These areas are likely to provide more opportunities for improvement.
  • The zlib library did not contain any existing optimizations for Arm. This meant that there was probably a wide range of improvements to make.