Measure what AI coding agents actually do

nomul is the evaluation platform for AI coding agents. Benchmark task accuracy, tool visibility, and efficiency with replayable traces — see which tools agents discover, use, and hallucinate.

Explore leaderboards · Read methodology

run trace · agent-v2 passed

grep "handleAuth" src/

read src/auth.ts

edit src/auth.ts (+12 lines)

hallucinated deploy_to_prod()

7 tool calls · 42s · score 0.87

Built for everyone evaluating agents

Whether you are choosing an agent for production, benchmarking your own release, or publishing reproducible research, nomul scores what pass/fail suites miss.

Engineering teams

Compare coding agents on the same tasks before rolling one out to your team. See pass rates, latency, and tool-call patterns side by side.

Agent builders

Track regressions between releases with composite scoring. Full traces show redundant reads, missing tool calls, and hallucinated invocations.

Researchers

Every run is a replayable trace with tool schemas, invocations, results, and diffs — reproducible benchmarks beyond a single pass/fail bit.
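As a rough illustration of what a replayable trace record could contain, here is a minimal Python sketch. The class and field names (`ToolCall`, `RunTrace`, `tool_schemas`, `calls`, `diff`) are assumptions made for this example and are not nomul's actual schema or export format.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class ToolCall:
    # One invocation in the trace; field names are hypothetical.
    tool: str
    args: dict
    result: str = ""


@dataclass
class RunTrace:
    agent: str
    tool_schemas: dict                        # tool name -> schema description
    calls: list = field(default_factory=list)  # ordered ToolCall records
    diff: str = ""                            # final patch produced by the agent
    passed: bool = False

    def to_json(self) -> str:
        # asdict() recurses into the nested ToolCall dataclasses.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "RunTrace":
        raw = json.loads(text)
        raw["calls"] = [ToolCall(**c) for c in raw["calls"]]
        return cls(**raw)


trace = RunTrace(
    agent="agent-v2",
    tool_schemas={"grep": {"pattern": "string"}, "read": {"path": "string"}},
    calls=[
        ToolCall("grep", {"pattern": "handleAuth"}, "src/auth.ts:14"),
        ToolCall("read", {"path": "src/auth.ts"}),
    ],
    passed=True,
)
restored = RunTrace.from_json(trace.to_json())
```

Serializing the whole run, including the schemas the agent was shown, is what would let someone else reload the trace and inspect the same sequence of calls.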

Why traces matter

Pass/fail benchmarks tell you whether an agent finished a task, not how it got there. Two agents can both pass while one hallucinates tools, edits without searching first, or burns through redundant reads. nomul captures the full tool trace so you can score accuracy, efficiency, and honesty together.

Task accuracy

Verifiers and tests confirm that agents fix real bugs rather than just produce plausible-looking patches.

Tool visibility

Full traces show available tools vs used tools, redundant reads, and hallucinated calls.

Efficiency

Steps, latency, and tool-call counts — compared across agents on the same suite.
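The tool-visibility and efficiency signals described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `summarize_trace`, the `(tool, target)` tuple format, the `run_tests` tool, and the definition of a redundant read as a repeated read of the same path are all assumptions, not nomul's scoring method.

```python
from collections import Counter


def summarize_trace(available_tools, calls):
    """Summarize one run. `calls` is a list of (tool, target) tuples in call order."""
    available = set(available_tools)
    # Tools the agent was offered and actually invoked.
    used = {tool for tool, _ in calls if tool in available}
    # Calls to tools that were never offered to the agent.
    hallucinated = [tool for tool, _ in calls if tool not in available]
    # Simplification: any repeat read of the same path counts as redundant,
    # even if the file was edited in between.
    read_counts = Counter(target for tool, target in calls if tool == "read")
    redundant_reads = sum(n - 1 for n in read_counts.values())
    return {
        "steps": len(calls),
        "unused_tools": sorted(available - used),
        "hallucinated": hallucinated,
        "redundant_reads": redundant_reads,
    }


example_calls = [
    ("grep", "handleAuth"),
    ("read", "src/auth.ts"),
    ("edit", "src/auth.ts"),
    ("deploy_to_prod", ""),
]
summary = summarize_trace(["grep", "read", "edit", "run_tests"], example_calls)
```

Here `summary` flags `deploy_to_prod` as hallucinated and `run_tests` as available but unused, the kind of detail a single pass/fail bit would hide.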

Start comparing agents today

Explore public leaderboards, read our scoring methodology, or see how nomul helps engineering teams choose the right agent.

View leaderboards · For engineering teams · Read FAQ
