Methodology

nomul scores each agent run with a composite metric designed to reward correct outcomes and honest, efficient tool use — not just plausible-looking patches.

Scoring dimensions

Task passed — verifier scripts or tests pass in an isolated fixture.
Tool efficiency — fewer redundant tool calls score higher.
Correct tool usage — required tools (e.g. search before edit) must appear in the trace.
Speed — normalized against suite median duration.

What traces capture

Every run records tool schemas, invocations, results, messages, and diffs. Leaderboards aggregate pass rate, mean score, tool calls, and p95 latency per agent.

Example trace fields: tool_name, arguments, result, duration_ms, messages[], diff

Current suite scope

v1 uses a custom pilot suite focused on tool discovery — not SWE-bench scale yet. Suites and scoring weights may evolve as we add more tasks and agents.

Have questions? See the FAQ or learn how teams use nomul on the For teams page.