Adler-32 is a checksum algorithm used by the zlib compression library to detect data corruption errors. Adler-32 checksums are faster to calculate than CRC32 checksums, but trade reliability for speed as Adler-32 is more prone to collisions.

The PNG format uses Adler-32 for uncompressed data and CRC32 is used for the compressed segments.

An Adler-32 checksum is calculated as follows:

• A is a 16-bit checksum calculated as the sum of all bits in the input stream, plus 1, modulo 65521.
• B is a 16-bit checksum calculated as the sum of all individual A values, modulo 65521. B has the initial value 0.
• The final Adler-32 checksum is a 32-bit checksum formed by concatenating the two 16-bit values of A and B, with B occupying the most significant bytes.

This means that the Adler-32 checksum function can be expressed as follows:

```A = 1 + D1 + D2 + ... + Dn (mod 65521)

B = (1 + D1) + (1 + D1 + D2) + ... + (1 + D1 + D2 + ... + Dn) (mod 65521)
= nxD1 + (n-1) x D2 + (n-2) x D3 + ... + Dn + n (mod 65521)

Adler-32(D) = (B x 65536) + A```

For example, the following table shows the calculation of the Adler-32 checksum of the ASCII string Neon:

Character Decimal ASCII code A B
N 78 (1 + 78) % 65521 = 79 (0 + 79) % 65521 = 79
e 101 (79 + 101) % 65521 = 180 (79 + 180) % 65521 = 259
o 111 (180 + 111) % 65521 = 291 (259 + 291) % 65521 = 550
n 110 (291 + 110) % 65521 = 401 (550 + 401) % 65521 = 951

The decimal Adler-32 checksum is calculated as follows:

```Adler-32 = (B x 65536) + A
= (951 x 65536) + 401
= 62,324,736 + 401
= 62,325,137```

The same calculation in hexadecimal is as follows:

```Adler-32 = (B x 00010000) + A
= (03B7 x 00010000) + 0191
= 03B70000 + 0191
= 03B70191```

### Unoptimized implementation

The following code shows a simplistic implementation of the Adler-32 algorithm, from Wikipedia:

```const uint32_t MOD_ADLER = 65521;

uint32_t adler32(unsigned char *data, size_t len)
/*
where data is the location of the data in physical memory and
len is the length of the data in bytes
*/
{
uint32_t a = 1, b = 0;
size_t index;

// Process each byte of the data in order
for (index = 0; index < len; ++index)
{
a = (a + data[index]) % MOD_ADLER;
b = (b + a) % MOD_ADLER;
}

return (b << 16) | a;
}```

This code simply loops through the data one value at a time, summing and accumulating results. One of the problems with this approach is that performing the modulo operation is expensive. Here, this expensive modulo operation is performed at every single iteration.

### Neon-optimized implementation

Optimizing the Adler-32 algorithm with Neon uses vector multiplication and accumulation to process up to 32 data values at the same time:

```static void NEON_accum32(uint32_t *s, const unsigned char *buf,
z_size_t len)
{
/* Please refer to the 'Algorithm' section of:
* Here, 'taps' represents the 'n' scalar multiplier of 'B', which
* will be multiplied and accumulated.
*/
static const uint8_t taps = {
32, 31, 30, 29, 28, 27, 26, 25,
24, 23, 22, 21, 20, 19, 18, 17,
16, 15, 14, 13, 12, 11, 10, 9,
8, 7, 6, 5, 4, 3, 2, 1 };

/* This may result in some register spilling (and 4 unnecessary VMOVs). */
const uint8x16_t t0 = vld1q_u8(taps);
const uint8x16_t t1 = vld1q_u8(taps + 16);
const uint8x8_t n_first_low = vget_low_u8(t0);
const uint8x8_t n_first_high = vget_high_u8(t0);
const uint8x8_t n_second_low = vget_low_u8(t1);
const uint8x8_t n_second_high = vget_high_u8(t1);

uint8x16_t d0, d1;

uint32x4_t s2acc = vdupq_n_u32(0);
s2acc = vsetq_lane_u32(s, s2acc, 0);

/*  Think of it as a vectorized form of the code implemented to
* handle the tail (or a DO16 on steroids). But in this case
* we handle 32 elements and better exploit the pipeline.
*/
while (len >= 2) {
d0 = vld1q_u8(buf);
d1 = vld1q_u8(buf + 16);
sum2 = vmull_u8(n_first_low, vget_low_u8(d0));
sum2 = vmlal_u8(sum2, n_first_high, vget_high_u8(d0));
sum2 = vmlal_u8(sum2, n_second_low, vget_low_u8(d1));
sum2 = vmlal_u8(sum2, n_second_high, vget_high_u8(d1));
len -= 2;
buf += 32;
}

/* This is the same as before, but we only handle 16 elements as
* we are almost done.
*/
while (len > 0) {
d0 = vld1q_u8(buf);
sum2 = vmull_u8(n_second_low, vget_low_u8(d0));
sum2 = vmlal_u8(sum2, n_second_high, vget_high_u8(d0));
buf += 16;
len--;
}

/* Combine the accumulated components (adler and sum2). */

/* Store the results. */
s = vget_lane_u32(as, 0);
s = vget_lane_u32(as, 1);
}```

The taps optimization that is referred to in the code comments works by computing the checksum of a vector of 32 elements where the n variable is known and fixed. This computed checksum is later recombined with another segment of 32 elements, rolling through the input data array. For more information, you can watch the BlinkOn 9: Optimizing image decoding on Arm presentation.

Elsewhere in the code, the expensive modulo operation is optimized so that it is only run when absolutely needed. The point at which the modulo is needed is just before the accumulated sum could possibly overflow the modulo value. This is calculated to be once every 5552 iterations.

Intrinsic Description
`vaddq_u32` Vector add.
`vdupq_n_u32` Load all lanes of vector to the same literal value.
`vget_high_u32`
`vget_high_u8`
`vget_low_u32`
`vget_low_u8`
Split vectors into two components.
`vget_lane_u32` Extract a single lane from a vector.
`vld1q_u8` Load a single vector or lane.
`vmlal_u8` Vector multiply and accumulate.
`vmull_u8` Vector multiply.
`vpadalq_u16`
`vpadalq_u8`
`vpadd_u32`
`vpaddlq_u8`
`vsetq_lane_u32` Load a single lane of a vector from a literal.
`vshlq_n_u32` Vector shift left by constant.

### Results

Optimizing Adler-32 to use Neon intrinsics to perform SIMD arithmetic yielded significant performance improvements when this optimization started shipping in Chrome M63.

Tests in Armv8 showed an improvement of around 3x. For example, elapsed real time reduced from 350ms to 125ms for a 4096x4096 byte test executed 30 times.

This optimization alone yielded a performance boost for PNG decoding ranging from 5% to 18%.