BOLT instrumentation brings 52% performance uplift for MongoDB on Neoverse N2
BOLT is a post-link optimization technology enabling performance improvement for various workloads. Read more in this post.
By Bolt Liu

BOLT is a post-link optimization technology that improves performance for a variety of workloads. Previously, BOLT was enabled through CoreSight and perf, which improved performance for some typical workloads. A separate blog post covers BOLT optimization technology in more detail. However, that method requires CoreSight to capture branch profile data, which is inconvenient to deploy in a production environment.
BOLT instrumentation is an alternative method that optimizes the executable based on profile data collected by instrumenting and running the binary. Only the llvm-bolt utility is required, as there is no dependency on CoreSight or perf.
This blog illustrates the steps to enable BOLT instrumentation and presents benchmark results for MongoDB.
Test environment
Two Alibaba ECS instances are reserved for the benchmark: the client runs YCSB while the server runs MongoDB. A 200G AutoPL ESSD, which offers higher bandwidth, is attached to the server so that the drive is not a bottleneck.

MongoDB BOLT instrumentation test environment
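A YCSB test has two phases: load and run. The load phase inserts 40000000 records and the run phase performs 5000000 operations. Run the following commands, where $1 is the MongoDB server address passed as the first argument to the script: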
REC_CNT=40000000
OP_CNT=5000000
./bin/ycsb.sh load mongodb -s -P workloads/workloada -p recordcount=$REC_CNT -p operationcount=$OP_CNT -threads 64 -p mongodb.url="mongodb://$1:27017/ali"
./bin/ycsb.sh run mongodb -s -P workloads/workloada -p recordcount=$REC_CNT -p operationcount=$OP_CNT -threads 64 -p mongodb.url="mongodb://$1:27017/ali"
Steps to enable BOLT instrumentation
Build Default MongoDB
- Download the MongoDB source code and check out version 7.0.5
- Upgrade gcc to version 11.4.0, which is required to build MongoDB 7.0.5
- Build mongod (name it mongod.orig) with the following options:
python3 buildscripts/scons.py DESTDIR=$WORKSPACE/install/mongo install-mongod \
    CCFLAGS="-fno-reorder-blocks-and-partition -mcpu=native -O3 -w" \
    LINKFLAGS="-Wl,--emit-relocs" --disable-warnings-as-errors
Collect profile data
1. Build llvm-bolt at version 6841395
2. Instrument mongod.orig to produce mongod.inst:
llvm-bolt mongod.orig -instrument -o mongod.inst --instrumentation-file=`pwd`/prof.fdata --instrumentation-sleep-time=60
3. Start mongod.inst and run YCSB to collect profile data. Run the following commands:
OP_CNT=5000000
./bin/ycsb.sh load mongodb -s -P workloads/workloada -p operationcount=$OP_CNT -threads 64 -p mongodb.url="mongodb://$1:27017/ali"
./bin/ycsb.sh run mongodb -s -P workloads/workloada -p operationcount=$OP_CNT -threads 64 -p mongodb.url="mongodb://$1:27017/ali"
4. Stop mongod.inst
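Because the instrumented binary is started with --instrumentation-sleep-time=60, the profile is written periodically rather than only at exit, so it is worth confirming that prof.fdata exists before stopping the server. The following is a minimal sketch of such a check; the function name wait_for_profile and the default timeout are assumptions for illustration, not part of llvm-bolt.

```shell
# Hypothetical helper: wait until the profile file exists and is non-empty,
# or give up after a timeout given in seconds (default 180).
wait_for_profile() {
  profile="$1"
  timeout="${2:-180}"
  waited=0
  while [ ! -s "$profile" ]; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "no profile at $profile after ${timeout}s" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  echo "profile ready: $profile"
}

# Example use after the YCSB run finishes and before stopping mongod.inst:
#   wait_for_profile "$(pwd)/prof.fdata" 180
```

A timeout of a few multiples of the sleep interval leaves room for at least one periodic write to complete.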
Optimize executable
1. Convert mongod.orig to optimized executable (name it mongod.bolt):
llvm-bolt mongod.orig -o mongod.bolt -data=prof.fdata -reorder-blocks=ext-tsp -reorder-functions=hfsort -split-functions=2 -split-all-cold -split-eh -dyno-stats
2. Run mongod.orig and mongod.bolt, and compare the results.
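One way to compare the two runs is to save each YCSB run's standard output to a file and pull out the overall throughput line. The sketch below assumes the usual YCSB summary format, "[OVERALL], Throughput(ops/sec), <value>"; the function name and log file names are hypothetical.

```shell
# Hypothetical helper: print the overall throughput recorded in a saved YCSB log.
# Assumes summary lines of the form "[OVERALL], Throughput(ops/sec), <value>".
ycsb_throughput() {
  awk -F', ' '$1 == "[OVERALL]" && $2 == "Throughput(ops/sec)" { print $3 }' "$1"
}

# Example use, assuming the output of each run was redirected to these
# (hypothetical) files:
#   orig=$(ycsb_throughput run_orig.log)
#   bolt=$(ycsb_throughput run_bolt.log)
#   echo "default: $orig ops/sec, BOLT: $bolt ops/sec"
```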
Test Results
The benchmark shows that MongoDB throughput improved by 58% for INSERT and by 52% for READ and UPDATE. Latencies also dropped significantly with BOLT enabled.
INSERT:
| Metric | Default | BOLT enhanced | Improvement (%) |
| --- | --- | --- | --- |
| Total time (ms) | 1394331 | 879745 | 36.90 |
| Throughput (ops/sec) | 28687 | 45467 | 58.49 |
| INSERT Average Latency (us) | 2211 | 1390 | 37.13 |
| INSERT 95th Latency (us) | 4103 | 2739 | 33.24 |
| INSERT 99th Latency (us) | 7679 | 5595 | 27.13 |
READ and UPDATE (with ratio 1:1):
| Metric | Default | BOLT enhanced | Improvement (%) |
| --- | --- | --- | --- |
| Total time (ms) | 249593 | 164264 | 34.18 |
| Throughput (ops/sec) | 20032 | 30438 | 51.94 |
| READ Average Latency (us) | 3146 | 2051 | 34.80 |
| READ 95th Latency (us) | 7527 | 6571 | 12.70 |
| READ 99th Latency (us) | 12863 | 10831 | 15.79 |
| UPDATE Average Latency (us) | 3211 | 2122 | 33.91 |
| UPDATE 95th Latency (us) | 7659 | 6771 | 11.59 |
| UPDATE 99th Latency (us) | 13119 | 11111 | 15.30 |
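The improvement column appears to use two different formulas: a relative gain for throughput, where higher is better, and a relative reduction for time and latency, where lower is better. The sketch below shows that arithmetic with values from the tables; the function names are made up for illustration, and the last digit can differ from the table where a published figure seems to be truncated rather than rounded.

```shell
# Throughput gain in percent: (new / old - 1) * 100
throughput_gain() {
  awk -v old="$1" -v new="$2" 'BEGIN { printf "%.2f\n", (new / old - 1) * 100 }'
}

# Run-time or latency reduction in percent: (old - new) / old * 100
reduction() {
  awk -v old="$1" -v new="$2" 'BEGIN { printf "%.2f\n", (old - new) / old * 100 }'
}

throughput_gain 28687 45467   # INSERT throughput, prints 58.49
reduction 2211 1390           # INSERT average latency, prints 37.13
```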
Throughput improvement
With BOLT, throughput increased by 58% for INSERT and by 52% for READ and UPDATE:

Throughput improvement report for BOLT
Latency improvement
With BOLT, average latency dropped by 37% for INSERT, 35% for READ and 34% for UPDATE:

Latency improvement report for BOLT
Perf data
The perf data shows that L1-icache-misses, branch-misses and iTLB-load-misses dropped significantly. Use the following command to capture perf data:
perf stat -e instructions,L1-icache-misses,branches,branch-misses,iTLB-load,iTLB-load-misses -p `pgrep mongo` -- sleep 60

Perf data report for BOLT
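Since perf stat samples a fixed 60-second window, a faster binary retires more instructions in that window, so raw miss counts are easier to compare once normalized. A minimal sketch, assuming the miss and instruction counts are read off the perf stat output by hand; the function name mpki is hypothetical and the numbers in the comment are placeholders, not measurements.

```shell
# Misses per thousand instructions (MPKI): misses / instructions * 1000
mpki() {
  awk -v misses="$1" -v instr="$2" 'BEGIN { printf "%.2f\n", misses / instr * 1000 }'
}

# Placeholder values for illustration only, not taken from the benchmark:
#   mpki 5000000 1000000000
```

Comparing MPKI for mongod.orig and mongod.bolt, rather than absolute counts, separates the change in miss behaviour from the change in work done per window.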
Summary
BOLT instrumentation delivers a 52% performance uplift for the MongoDB READ and UPDATE tests and a 58% uplift for INSERT, while latencies drop significantly. Moreover, the instrumentation method is easy to deploy, as it has no dependency on CoreSight or perf.
Re-use is only permitted for informational and non-commercial or personal use only.
