nomul

Methodology

nomul scores each agent run with a composite metric designed to reward correct outcomes and honest, efficient tool use — not just plausible-looking patches.

Scoring dimensions

What traces capture

Every run records tool schemas, invocations, results, messages, and diffs. Leaderboards aggregate pass rate, mean score, tool calls, and p95 latency per agent.

Example trace fields: tool_name, arguments, result, duration_ms, messages[], diff

Current suite scope

v1 uses a custom pilot suite focused on tool discovery — not SWE-bench scale yet. Suites and scoring weights may evolve as we add more tasks and agents.

Have questions? See the FAQ or learn how teams use nomul on the For teams page.