Sadiq Jaffer's blog

Posted Wed 20 August 2025

Adding an OCaml GC debugging task to terminal-bench

I stumbled onto terminal-bench a few weeks ago while researching datasets to evaluate agents. It contains around 120 tasks that need to be completed using a terminal. They range from easy things like fixing permissions to trickier things like finding lost changes in git and installing Windows XP (!).

To that collection, we can now add a freshly-merged task focused on finding and fixing a bug in the OCaml GC. This task is designed to be difficult; it involves a subtle change that isn't obvious on a first pass, even for experts, and mirrors a real issue I encountered earlier in the year. Solving it requires interactive debugging.

As more model providers report performance on terminal-bench, it's a great opportunity to add tasks for common development workflows that aren't currently covered. This could include more native debugging tasks, as well as some general sysadmin tasks - though the latter might be tricky because the containers that tasks run in are unprivileged. I'm sure Mark Elvers has some good ideas.