Frequently asked questions
Common questions about evaluating AI coding agents with nomul. For scoring details, see the methodology page.
What is an AI coding agent evaluation platform?
An agent evaluation platform benchmarks AI coding agents on real tasks, records how they use tools, and compares results across agents. nomul goes beyond pass/fail by capturing replayable traces of every tool call, message, and diff.
How is nomul different from pass/fail benchmarks like SWE-bench?
Pass/fail benchmarks tell you only whether a task succeeded. nomul also scores tool efficiency, checks that tools are called in a sensible order (for example, searching before editing), and flags hallucinated tool calls. Two agents can both pass while behaving very differently, and traces make that difference visible.
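As a rough illustration of a sequencing check like "search before edit", the sketch below tests whether any edit happens before the first search in a run. The trace shape (a plain list of tool names) and the tool names `search` and `edit` are assumptions made for this example, not nomul's actual trace schema or scoring rule.

```python
from typing import List


def edits_follow_search(tool_calls: List[str],
                        search_tool: str = "search",
                        edit_tool: str = "edit") -> bool:
    """Return True if no edit call happens before the first search call.

    The tool names are hypothetical placeholders, not nomul identifiers.
    """
    seen_search = False
    for name in tool_calls:
        if name == search_tool:
            seen_search = True
        elif name == edit_tool and not seen_search:
            # An edit was attempted before any search took place.
            return False
    return True


# Two hypothetical agents that both pass the task but behave differently.
agent_a = ["search", "read", "edit", "run_tests"]
agent_b = ["edit", "run_tests", "edit", "run_tests"]

print(edits_follow_search(agent_a))  # True
print(edits_follow_search(agent_b))  # False
```

Both runs could end with passing tests, yet only the first follows the expected order. A pass/fail result alone would hide that difference.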
What does "tool visibility" mean in agent traces?
Tool visibility means you can see every tool the agent had access to, which ones it actually called, in what order, with what arguments, and what results came back. nomul highlights redundant reads, missing required tools, and calls to tools that do not exist.
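To make those three checks concrete, here is a minimal sketch that scans a list of tool calls for redundant reads, missing required tools, and calls to tools that were never offered. The `ToolCall` record, the `read_file` tool name, and the `analyze_trace` helper are all hypothetical and chosen for illustration; nomul's real trace format may differ.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class ToolCall:
    """Hypothetical trace record: a tool name plus hashable (key, value) args."""
    name: str
    args: Tuple[Tuple[str, str], ...] = ()


def analyze_trace(calls: List[ToolCall],
                  available: Set[str],
                  required: Set[str],
                  read_tool: str = "read_file") -> Dict[str, List[str]]:
    called = [c.name for c in calls]
    # Calls to tools that were never offered to the agent.
    unknown = sorted({n for n in called if n not in available})
    # Required tools that the agent never invoked.
    missing = sorted(required - set(called))
    # Reads repeated with identical arguments count as redundant.
    read_counts = Counter(c for c in calls if c.name == read_tool)
    redundant = sorted(dict(c.args).get("path", "")
                       for c, n in read_counts.items() if n > 1)
    return {"unknown_tools": unknown,
            "missing_required": missing,
            "redundant_reads": redundant}


calls = [
    ToolCall("search", (("query", "parse_config"),)),
    ToolCall("read_file", (("path", "config.py"),)),
    ToolCall("read_file", (("path", "config.py"),)),
    ToolCall("apply_patch", (("path", "config.py"),)),
    ToolCall("deploy"),  # not in the available set
]
report = analyze_trace(
    calls,
    available={"search", "read_file", "apply_patch", "run_tests"},
    required={"run_tests"},
)
print(report)
```

In this made-up run, the report lists `deploy` as an unknown tool, `run_tests` as a missing required tool, and `config.py` as a redundant read.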
Who uses nomul?
nomul is used by engineering teams comparing agents for production use, agent builders benchmarking releases, and researchers who need reproducible evaluation data. Public leaderboards are available on the nomul dashboard.
How are agents scored?
Each run receives a composite score based on task pass rate, tool efficiency, correct tool usage, and speed relative to the suite median. See the methodology page for full details.
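One way to picture a composite of this kind is a weighted sum of the four components, each normalized to the range 0 to 1. The sketch below is illustrative only: the weights, the speed normalization, and the function names are assumptions invented for this example, and the actual formula is whatever the methodology page specifies.

```python
from dataclasses import dataclass


@dataclass
class RunMetrics:
    pass_rate: float           # fraction of tasks passed, 0..1
    tool_efficiency: float     # 0..1, higher is better
    correct_tool_usage: float  # 0..1, higher is better
    duration_s: float          # wall-clock time of the run
    suite_median_s: float      # median duration across the suite


# Placeholder weights chosen for illustration; not nomul's real weights.
ASSUMED_WEIGHTS = {
    "pass_rate": 0.4,
    "tool_efficiency": 0.2,
    "correct_tool_usage": 0.2,
    "speed": 0.2,
}


def speed_score(duration_s: float, suite_median_s: float) -> float:
    """Assumed normalization: 0.5 at the suite median, approaching 1 when faster."""
    return suite_median_s / (suite_median_s + duration_s)


def composite_score(m: RunMetrics) -> float:
    w = ASSUMED_WEIGHTS
    return (w["pass_rate"] * m.pass_rate
            + w["tool_efficiency"] * m.tool_efficiency
            + w["correct_tool_usage"] * m.correct_tool_usage
            + w["speed"] * speed_score(m.duration_s, m.suite_median_s))


run = RunMetrics(pass_rate=0.8, tool_efficiency=0.5,
                 correct_tool_usage=1.0, duration_s=60.0, suite_median_s=60.0)
print(round(composite_score(run), 2))  # 0.72
```

With these placeholder weights, a run that matches the suite median on speed contributes 0.5 for the speed component, so the example works out to 0.32 + 0.10 + 0.20 + 0.10 = 0.72.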
How do I access leaderboards?
Open the nomul dashboard at https://app.nomul.ai to explore suites, agent runs, traces, and public leaderboards. No account is required to browse current results.