gameboy-eval

An open, free, local-first benchmark that measures how well a coding agent can build a Game Boy (DMG) emulator from scratch. Grading is automatic and deterministic, and it runs for free on a laptop.


gameboy-eval gives a coding agent one job: write a working Game Boy emulator in Rust, with no libraries, starting from nothing. We drop the agent into a sandboxed, offline container alongside a black-box reference emulator (the "oracle"), let it iterate, and save whatever it produces as a portable WebAssembly module. Later, with the model completely out of the loop, we grade that module by running it next to the oracle and measuring how closely it matches.

The whole thing rests on one idea: you do not need to hand-write a test suite to know whether an emulator is correct, because you already have a correct emulator. So you grade a candidate by differential comparison against a trusted reference, frame by frame. That keeps grading cheap, reproducible, and independent of whichever model or provider produced the code.
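As a rough illustration of frame-by-frame differential comparison, the sketch below scores a candidate's recorded frames against the oracle's. The frame layout (one byte per pixel on the 160x144 DMG screen), the function names, and the choice of per-pixel agreement as the measure are all assumptions made for this example; the project's actual metric may differ.

```python
from typing import Sequence

WIDTH, HEIGHT = 160, 144  # DMG screen resolution


def frame_similarity(candidate: bytes, oracle: bytes) -> float:
    """Fraction of pixels that agree between two frames of equal size."""
    if not oracle or len(candidate) != len(oracle):
        return 0.0  # assumption: an empty or malformed frame earns no credit
    matching = sum(c == o for c, o in zip(candidate, oracle))
    return matching / len(oracle)


def replay_similarity(
    candidate_frames: Sequence[bytes], oracle_frames: Sequence[bytes]
) -> float:
    """Mean per-frame similarity, normalised by the oracle's frame count.

    Frames the candidate never produced contribute 0 to the mean.
    """
    if not oracle_frames:
        return 0.0
    total = sum(
        frame_similarity(c, o) for c, o in zip(candidate_frames, oracle_frames)
    )
    return total / len(oracle_frames)


if __name__ == "__main__":
    blank = bytes(WIDTH * HEIGHT)
    # Second candidate frame: top half matches, bottom half is the wrong shade.
    half_wrong = bytes(WIDTH * HEIGHT // 2) + bytes([3]) * (WIDTH * HEIGHT // 2)
    print(replay_similarity([blank, half_wrong], [blank, blank]))  # 0.75
```

No hand-written expectations appear anywhere in this sketch: the oracle's output is the only source of truth.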

It is an open, Game Boy take on the idea behind Mechanize's GBA Eval.

How we grade

Grading never asks the model anything. It runs the saved artifact and scores it against the oracle on three axes, then folds them into a single composite:

overall = 0.60 * replay  +  0.20 * audio  +  0.20 * procedural
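The weighted sum above can be sketched in a few lines of Python. The weights come from the formula; the function name and the assumption that each section score is a fraction in [0, 1] are illustrative only.

```python
# Weights taken from the composite formula above.
WEIGHTS = {"replay": 0.60, "audio": 0.20, "procedural": 0.20}


def composite(replay: float, audio: float, procedural: float) -> float:
    """Fold the three section scores into one composite score.

    Assumes each section score is a fraction between 0 and 1.
    """
    scores = {"replay": replay, "audio": audio, "procedural": procedural}
    for name, value in scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} score must be in [0, 1], got {value}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


if __name__ == "__main__":
    print(f"{composite(0.95, 0.80, 0.85):.1%}")  # 90.0%
```

Because replay carries 60% of the weight, a candidate with perfect audio and procedural scores but no working replay tops out at 40%.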

Results land in human-readable bands, so a single number is easy to interpret:

| band | composite |
| --- | --- |
| doesn't run | ~0 to 5% |
| barely works | ~15 to 30% |
| plays incorrectly | ~45 to 55% |
| mostly playable | ~70% |
| near-reference | ~85 to 99% |
| reference vs itself | 100% |

Two design choices make this practical. First, the artifact is a portable wasm module behind a fixed ABI: a candidate is a Rust crate that must build with one fixed command and export a small lockstep interface, which the grader drives from Python through wasmtime. Second, the model is out of the loop at grading time: generation happens once and costs money, while grading can happen any number of times, offline and for free, and returns the same answer on every run.
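To make the lockstep idea concrete, here is a minimal sketch of a driver that feeds identical inputs to a candidate and an oracle and compares their output one frame at a time. The interface (`set_buttons`, `run_frame`) is hypothetical, since the real ABI exports are not listed in this section, and the toy emulator stands in for a wasm module loaded through wasmtime.

```python
from typing import Iterable, Optional, Protocol, Tuple


class LockstepEmulator(Protocol):
    """Hypothetical stand-in for the exported interface; names are illustrative."""

    def set_buttons(self, mask: int) -> None: ...
    def run_frame(self) -> bytes: ...


def count_matching_frames(
    candidate: LockstepEmulator, oracle: LockstepEmulator, inputs: Iterable[int]
) -> Tuple[int, int]:
    """Step both emulators with the same input each frame; return (matched, total)."""
    matched = total = 0
    for mask in inputs:
        candidate.set_buttons(mask)
        oracle.set_buttons(mask)
        if candidate.run_frame() == oracle.run_frame():
            matched += 1
        total += 1
    return matched, total


class ToyEmulator:
    """Deterministic toy whose frames depend only on frame count and buttons."""

    def __init__(self, broken_after: Optional[int] = None) -> None:
        self.frame = 0
        self.buttons = 0
        self.broken_after = broken_after  # frames after this one come out wrong

    def set_buttons(self, mask: int) -> None:
        self.buttons = mask & 0xFF

    def run_frame(self) -> bytes:
        self.frame += 1
        value = (self.frame + self.buttons) & 0xFF
        if self.broken_after is not None and self.frame > self.broken_after:
            value ^= 0xFF  # simulate a bug that corrupts later frames
        return bytes([value]) * 4


if __name__ == "__main__":
    script = [0, 1, 2, 3, 4]  # one button mask per frame
    print(count_matching_frames(ToyEmulator(broken_after=3), ToyEmulator(), script))
    # (3, 5)
```

Since both sides are deterministic and see the same input script, rerunning the comparison always yields the same counts, which is the property that makes repeated offline grading meaningful.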

Example results

A few candidates graded against the same oracle, spanning the range from a frontier model to a small local one:

| candidate | composite | band |
| --- | --- | --- |
| oracle (self-play) | ~100% | reference vs itself |
| claude-opus-4.8 | 89.5% | near-reference |
| qwen2.5-coder:7b | ~0% | doesn't run |

The full ranked list, with per-section scores, lives on the leaderboard.

Why Game Boy

The original Game Boy (DMG) hits a sweet spot for a benchmark like this.

The GUI

A small browser control panel wraps the same scripts, using only the Python standard library and binding to localhost. It lets you check prerequisites, run the agentic generation loop against your chosen provider, grade artifacts and browse past runs, watch a candidate next to the real SameBoy oracle, play any candidate's emulator in your browser with live keyboard input and audio, and view a ranked leaderboard with a per-section score chart.

Built on open work