Measure what AI coding agents actually do
nomul is the evaluation platform for AI coding agents. Benchmark task accuracy, tool visibility, and efficiency with replayable traces — see which tools agents discover, use, and hallucinate.
Built for everyone evaluating agents
Whether you are choosing an agent for production, benchmarking your own release, or publishing reproducible research — nomul scores what pass/fail suites miss.
Engineering teams
Compare coding agents on the same tasks before rolling one out to your team. See pass rates, latency, and tool-call patterns side by side.
Agent builders
Track regressions between releases with composite scoring. Full traces show redundant reads, missing tool calls, and hallucinated invocations.
Researchers
Every run is a replayable trace with tool schemas, invocations, results, and diffs — reproducible benchmarks beyond a single pass/fail bit.
Why traces matter
Pass/fail benchmarks tell you whether an agent finished a task, not how it got there. Two agents can both pass while one hallucinates tools, edits files without searching first, or burns through redundant reads. nomul captures the full tool trace so you can score accuracy, efficiency, and honesty together.
Task accuracy
Verifiers and tests confirm that agents fix real bugs rather than submit patches that only look plausible.
Tool visibility
Full traces show which tools were available versus which were actually used, along with redundant reads and hallucinated calls.
Efficiency
Steps, latency, and tool-call counts — compared across agents on the same suite.
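To make these three dimensions concrete, here is a minimal sketch of how trace-level signals could be derived from a recorded run. It is illustrative only: the `ToolCall` shape, the `summarize_trace` helper, and the tool names (`read_file`, `edit_file`, `search`, `run_linter`) are assumptions made for this example, not nomul's actual trace schema or scoring method.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """One invocation in a trace (hypothetical shape, not nomul's schema)."""
    tool: str
    argument: str
    latency_ms: float


def summarize_trace(available_tools: set[str], calls: list[ToolCall]) -> dict:
    """Derive simple visibility and efficiency signals from one run."""
    used = {c.tool for c in calls}
    # A call to a tool that was never offered counts as hallucinated.
    hallucinated = sorted(used - available_tools)
    # Tools that were offered but never invoked.
    unused = sorted(available_tools - used)
    # Assumption: re-reading the same path with "read_file" is redundant.
    read_counts = Counter(c.argument for c in calls if c.tool == "read_file")
    redundant_reads = sum(n - 1 for n in read_counts.values() if n > 1)
    return {
        "steps": len(calls),
        "total_latency_ms": sum(c.latency_ms for c in calls),
        "hallucinated_tools": hallucinated,
        "unused_tools": unused,
        "redundant_reads": redundant_reads,
    }


# Example run: the agent reads the same file twice and calls a tool
# ("run_linter") that was never in its tool list.
trace = [
    ToolCall("read_file", "app.py", 12.0),
    ToolCall("read_file", "app.py", 11.0),
    ToolCall("run_linter", "app.py", 30.0),
    ToolCall("edit_file", "app.py", 20.0),
]
summary = summarize_trace({"read_file", "edit_file", "search"}, trace)
```

Two agents that both pass the same task could produce very different summaries here, which is the gap a single pass/fail bit leaves open.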
Start comparing agents today
Explore public leaderboards, read our scoring methodology, or see how nomul helps engineering teams choose the right agent.