Adding an OCaml GC debugging task to terminal-bench
I stumbled onto terminal-bench a few weeks ago while researching datasets for evaluating agents. It contains around 120 tasks that need to be completed using a terminal. They range from easy things like fixing permissions to trickier things like finding lost changes in git and installing Windows XP (!).
To that collection, we can now add a freshly merged task focused on finding and fixing a bug in the OCaml GC. The task is designed to be difficult: it involves a subtle change that isn't obvious at first glance, even to experts, and it mirrors a real issue I encountered earlier in the year. Solving it requires interactive debugging.
With more model providers reporting performance on terminal-bench, now is a great opportunity to add tasks for common development workflows that aren't currently covered. This could include more native debugging tasks, as well as some general sysadmin tasks, though the latter might be tricky because the containers that tasks run in are unprivileged. I'm sure Mark Elvers has some good ideas...