Free Hardware Performance Counters in OxCaml Runtime Events
At ICFP in Singapore last year I had lunch with Stephen Dolan and we got talking about hardware performance counters. Specifically, about how useful it would be to have them integrated into OxCaml's Runtime Events so you could ask questions like "which phases of my program have the lowest instructions-per-clock?" or "what is causing my TLB misses?". This post covers a prototype I've been working on since then.
OCaml's Runtime Events already gives you lightweight tracing of GC phases and user-defined spans in every OCaml binary. The prototype extends this to sample user-specified hardware counters at the start and end of each span, so you get performance data for free alongside the events you were already collecting. This works not only for built-in runtime events but also for custom events you can add around any logical section of your own code. Since spans can be consumed continuously, you can gather performance counter data on specific program or runtime phases across large numbers of long-running services. The overhead is indistinguishable from noise.
Why hardware performance counters?
Modern CPUs look like they execute instructions sequentially, but underneath they're parallel execution engines, reordering and speculatively executing instructions across deep pipelines. Multiple tiers of cache hide the fact that main memory is hundreds of cycles away and a single CPU might have dozens of cores that all need their caches kept coherent. To expose what's really going on, CPUs provide hardware performance counters. My 'Alder Lake' Intel i7-12700F has hundreds of them1, covering everything from cache misses at each level to branch mispredictions to instructions retired.
On Linux the standard way to access these is through perf. You can run perf stat to get aggregate counter values for a whole program, or use perf record -e .. to sample on counter overflow, which gives you a profile showing which lines of code are responsible for the event you're monitoring2. These work well for standalone questions, like finding the code causing the most L1 cache misses, but they are unsuitable for continuous monitoring of long-running programs, and they can't easily correlate multiple counters with specific program phases, such as asking which GC phases have the lowest instructions-per-clock.
Runtime Events already traces GC phases and user-defined spans in every binary. It also supports Custom Events, which let you emit spans and counters from your own code. If we extend it to capture hardware counters at span boundaries, we get continuous monitoring of both the runtime and user code, correlating hardware performance with logical program phases on real production traffic.
The implementation
The implementation has two sides: changes to the C runtime to set up and sample the counters, and changes to the OCaml API to expose the samples to consumers.
Runtime
We need to call perf_event_open for each counter we want to monitor, on every thread and domain, and clean up when threads and domains exit. To handle this we add domain_start/stop and thread_start/stop hooks in the appropriate places in systhreads, covering both newly spawned threads and existing threads being registered.
When a Runtime Events span is started or stopped, we use rdpmc to sample the counters and append the values to the event data as it's written to the ring buffer. rdpmc takes around 20-40 cycles3, which is what makes the whole approach viable. That's very small compared to the useful work any span is doing.
The sampling is cheap but the configuration interface is less elegant. Counters to monitor are specified via the OCAML_RUNTIME_EVENTS_PERF_COUNTERS environment variable as a comma-separated list of hex-encoded event codes and umasks. You can look up the codes in the Intel Software Developer's Manual, at perfmon-events.intel.com, or by running perf list and extracting the event code from perf stat -vv -e <counter name>.
For example, to count instructions, cpu-cycles, L2 cache misses (l2_request.miss) and L3 cache misses (longest_lat_cache.miss) on a Performance Core of a 12th Gen Intel Core i3-1220P: OCAML_RUNTIME_EVENTS_PERF_COUNTERS=r00c0,r003c,r3f24,r412e
OCaml API side
Consumers receive an additional local_ perf_sample array parameter on runtime_begin and runtime_end callbacks. This is an array of unboxed int64 pairs: the event code from the environment variable and the current counter value, in the same order as specified. The same parameter is available on add_user_event callbacks for custom events, and the samples are also accessible through the C API.
Taking it for a spin
The ahrefs benchmarks from Sandmark exercise code used heavily by their crawler and are known to be memory intensive, which is what we want for correlating GC behaviour with hardware performance. I had Claude port four of them (htmlStream, network, gzip, stre) to OxCaml, fixing up dependencies and using Anil's Oxmono to create a monorepo. You can find the benchmarks and the code to run them here: https://github.com/sadiqj/runtime-events-pmc-benchmarks.
GC phases by instructions-per-clock
Across all four benchmarks I monitored instructions and cpu-cycles for every runtime event span. The table below shows GC phases ordered by their total cycles, with the 5th, 50th and 95th percentile instructions-per-clock (IPC).
| Phase | N | Cycles | % | IPC p5 | IPC p50 | IPC p95 |
|---|---|---|---|---|---|---|
| major | 10,728 | 4.91B | 19.8% | 0.237 | 1.748 | 3.994 |
| major_slice | 10,728 | 4.60B | 18.6% | 0.191 | 1.720 | 4.184 |
| stw_leader | 11,760 | 2.76B | 11.2% | 0.541 | 1.486 | 3.249 |
| minor | 6,093 | 2.59B | 10.5% | 0.650 | 1.955 | 3.917 |
| minor_remembered_set | 6,093 | 2.49B | 10.1% | 0.709 | 2.112 | 4.378 |
| major_mark | 2,622 | 2.38B | 9.6% | 0.612 | 2.639 | 5.029 |
| minor_remembered_set_promote | 6,093 | 2.27B | 9.2% | 0.614 | 2.150 | 4.748 |
| major_sweep | 6,216 | 2.19B | 8.8% | 0.096 | 1.762 | 4.317 |
| major_gc_cycle_domains | 1,689 | 288M | 1.2% | 0.917 | 1.706 | 3.359 |
| minor_local_roots | 6,093 | 85M | 0.3% | 0.411 | 1.145 | 3.823 |
| major_mark_roots | 1,680 | 65M | 0.3% | 0.859 | 1.233 | 1.987 |
| minor_local_roots_promote | 6,093 | 50M | 0.2% | 0.702 | 1.906 | 5.139 |
| minor_finalized | 6,093 | 28M | 0.1% | 0.763 | 2.051 | 2.444 |
| major_gc_stw | 1,689 | 16M | 0.1% | 0.837 | 1.438 | 2.415 |
The p5-to-p95 spread is wide across most phases. major_sweep ranges from 0.10 to 4.32 IPC, which likely reflects variability between dense and sparse shared heap pools: a sparse pool is harder to branch predict, leading to pipeline stalls and low IPC. A good next step would be a runtime or custom events counter inside the sweeping code that emits pool occupancy. That would tell you whether sparse or dense pools are the issue, and could also capture the number of pools swept within each major_sweep phase.
minor_local_roots and major_mark_roots have the lowest median IPC at 1.15 and 1.23 respectively. Root scanning involves chasing pointers through the stack and globals which is unfriendly to caches and branch predictors, so low IPC is expected here.
major_mark has the highest median IPC at 2.64, with a p95 of 5.03. Marking does more computation per memory access than sweeping or root scanning, so the CPU has more useful work to keep its pipelines busy. There was also significant optimisation work done by Stephen Dolan on prefetching during marking.
Since consumers get counter values at the start and end of every span, you aren't restricted to looking at aggregates. You can dump out individual traces and inspect them too:
Beyond the GC — profiling your own code
Performance counters are also emitted for Custom Events. Adding a custom event around a section of code is straightforward: register a span once, then write Begin and End around the code:
type Runtime_events.User.tag += Bench_phase
let span_name = Runtime_events.User.register "bench_phase" Bench_phase Runtime_events.Type.span

Runtime_events.User.write span_name Runtime_events.Type.Begin;
(* your code here *)
Runtime_events.User.write span_name Runtime_events.Type.End
I added custom events around each benchmark phase in the ahrefs benchmarks. This time I monitored L2 and last-level cache (LLC) misses alongside instructions and cycles. The table below shows LLC misses and L2 misses per 1,000 instructions:
| Phase | N | Cycles | % | LLC/ki | L2/ki |
|---|---|---|---|---|---|
| bench:large_blocks | 3 | 54.89B | 32.5% | 0.8 | 1.7 |
| bench:generational_violation | 3 | 29.77B | 17.6% | 0.6 | 2.6 |
| bench:morphing_heap | 3 | 14.81B | 8.8% | 0.6 | 1.5 |
| bench:complex_network_ops | 3 | 10.62B | 6.3% | 5.6 | 12.7 |
| bench:variable_rate | 3 | 10.49B | 6.2% | 1.8 | 5.0 |
| bench:range_operations | 3 | 7.99B | 4.7% | 0.2 | 1.5 |
| bench:attribute_lists | 3 | 6.51B | 3.9% | 0.5 | 2.3 |
| bench:complex_references | 3 | 5.83B | 3.4% | 0.8 | 3.1 |
| bench:fragmentation | 3 | 4.50B | 2.7% | 0.6 | 1.6 |
| bench:cidr_calculations | 3 | 3.25B | 1.9% | 0.4 | 3.7 |
| bench:small_buffer_storm | 3 | 3.21B | 1.9% | 0.4 | 1.3 |
| bench:mixed_size_patterns | 3 | 2.99B | 1.8% | 0.5 | 1.6 |
| bench:compression_pipelines | 3 | 2.63B | 1.6% | 0.4 | 1.6 |
| bench:ip_sorting | 3 | 1.81B | 1.1% | 0.4 | 3.2 |
| bench:large_block_compression | 3 | 1.61B | 1.0% | 0.3 | 1.2 |
| bench:split_storm | 3 | 1.30B | 0.8% | 0.6 | 3.4 |
| bench:streaming_operations | 3 | 1.16B | 0.7% | 0.1 | 1.9 |
| bench:buffer_recycling | 3 | 1.06B | 0.6% | 0.5 | 3.0 |
| bench:broadcast_calculations | 3 | 925M | 0.5% | 0.6 | 4.5 |
| bench:ipv4_parsing | 3 | 855M | 0.5% | 0.6 | 3.8 |
| bench:concurrent_style | 3 | 553M | 0.3% | 0.3 | 0.9 |
| bench:small_strings | 3 | 432M | 0.3% | 0.3 | 1.3 |
| bench:pattern_operations | 3 | 384M | 0.2% | 0.7 | 4.9 |
| bench:mixed_format_parsing | 3 | 350M | 0.2% | 0.5 | 3.2 |
| bench:nat_tables | 3 | 250M | 0.1% | 1.1 | 10.7 |
| bench:headers_metadata | 3 | 235M | 0.1% | 0.8 | 3.1 |
| bench:enum_string_ops | 3 | 201M | 0.1% | 0.3 | 1.4 |
| bench:mixed_size_allocations | 3 | 201M | 0.1% | 0.5 | 2.4 |
| bench:string_building | 3 | 127M | 0.1% | 0.8 | 4.6 |
| bench:transformation_chains | 3 | 94M | 0.1% | 0.5 | 4.0 |
| bench:concatenation_chains | 3 | 32M | 0.0% | 0.3 | 4.0 |
| bench:substring_slicing | 3 | 6M | 0.0% | 0.4 | 5.2 |
Most phases have very low LLC miss rates, well under 1.0 per 1,000 instructions. bench:complex_network_ops is the exception at 5.6 LLC misses per 1,000 instructions, roughly 10x the rate of most other phases. Its L2 miss rate of 12.7 is also the highest in the table, so if you were chasing memory access patterns this is where you'd start.
bench:nat_tables stands out too. Its LLC miss rate of 1.1 is modest but its L2 miss rate of 10.7 is among the highest, suggesting a working set that fits in LLC but not in L2.
With perf alone you would need to run separate profiling passes for each counter and then try to correlate them with specific sections of your code. Here you get it all in one continuous stream.
Like the runtime events, we can look at individual traces. Here are some operations from the gzip benchmark, showing user events alongside runtime ones. You can mouseover for more detail:
How expensive is it?
Each benchmark was run three ways: with runtime events off entirely, with runtime events on but no performance counters, and with runtime events on and four performance counters. Each configuration was run 20 times, pinned to a single P-core on my i3-1220P server, measuring wall-clock time.
| Benchmark | Baseline | Events Overhead | PMC Overhead |
|---|---|---|---|
| htmlStream | 29.26 +/- 0.16s | -0.0% [-0.5, 0.3] | +0.0% [-0.5, 0.5] |
| stre | 5.32 +/- 0.04s | +0.1% [-0.4, 1.1] | +0.5% [-0.0, 1.5] |
| network | 6.03 +/- 0.16s | +0.1% [-1.1, 0.5] | +0.0% [-0.6, 1.0] |
| gzip | 2.79 +/- 0.02s | +0.3% [-0.2, 0.9] | +0.4% [-1.6, 1.6] |
The 5th and 95th percentiles are in square brackets. In every case the range crosses zero, which means the overhead is indistinguishable from noise, even on the longest benchmark's near-30-second run. The numbers are consistent with what the cycle counts predict: rdpmc takes around 30 cycles per counter per span, so even with four counters that's roughly 120 cycles at each span boundary, tiny compared to the work any GC phase is doing.
This makes it practical to leave performance counters enabled on long-running services and collect hardware-level data continuously.
Trying it yourself
Note that this is currently x86-only and Linux-only since it relies on rdpmc and perf_event_open.
To install the prototype, create an opam switch from the branch:
opam switch create oxcaml-dev --empty
opam pin add -ny git+https://github.com/sadiqj/oxcaml#runtime_events_pmc
opam switch set-invariant -y --packages oxcaml-dev
eval $(opam env)
Build your program as normal. To specify which counters to monitor, set OCAML_RUNTIME_EVENTS_PERF_COUNTERS with the hex event codes as described above.
You'll also need a consumer to read the events. There's an example consumer in the benchmarks repo that connects to a running program given its executable directory and PID, and writes a CSV of all events with their durations and counter values.
If you try it out, let me know what you find.
Next steps
This is a prototype and isn't merged upstream to OxCaml yet. There are a few things I want to address before proposing it:
The counter configuration needs to be less clunky. Requiring hex event codes works but ideally you'd be able to specify counters by name, similar to how perf lets you write cache-misses instead of r412e.
The API could also be cleaner. Right now the performance samples are passed as an additional parameter on every callback, which makes the signature noisier even when you don't care about counters.
I'd like to explore the arm64 equivalents; arm64 has comparable facilities through its PMU.
Finally it would be worth extending Olly to support collecting the counters and pushing them through to traces, which would make this accessible without writing a custom consumer.
Longer term, the fact that the overhead is negligible means you could leave counters enabled across an entire fleet of services. Aggregating hardware counter data across hundreds of instances would let you ask questions that are impossible on a single machine, like which program phases have the worst cache behaviour on a specific microarchitecture, or whether certain workloads cause specific GC phases to degrade.
-
In reality it has nearly twice as many because there's one set for the P-cores and one set for the E-cores - you have to be reasonably careful to pin to a specific set of cores. ↩
-
This isn't entirely true: as mentioned earlier, modern CPUs are parallel execution engines, so what does it even mean for an event to "happen" when there are hundreds of instructions in flight? By the time the interrupt that captures the stack fires, a number of further instructions may already have executed. A subset of the Intel counters can more precisely attribute an event to its causing instruction. I recommend these articles. ↩
-
Agner Fog has cycle counts in his excellent data tables ↩
