Everything you need to reproduce the public neww.ai benchmark results. The harness is open source.
Each model runs the same fixed set of tasks every Tuesday. Tasks are versioned in packages/evals and cover research dossiers, code generation, structured extraction, and multi-step agent workflows. Prompts and grading rubrics are checked into the repo so anyone can rerun the suite against their own model endpoint.
Outputs are graded with a hybrid rubric: deterministic checks (regex, JSON schema, unit tests) where possible, plus an LLM-as-judge pass using a fixed grader model. Scores are normalised to the 0–1 range per task and reported as percentages. Every graded run writes a row to the eval database so we can audit regressions.
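To make the hybrid rubric concrete, here is a minimal sketch of how a deterministic pass and a judge pass could be blended into one 0–1 score. It is not the harness's actual grader: the function names, the `judge` callable, the key-presence check standing in for full JSON schema validation, and the 50/50 `judge_weight` are all assumptions made for illustration.

```python
import json
import re
from typing import Callable


def deterministic_score(output: str, pattern: str, required_keys: list[str]) -> float:
    """Return the fraction of deterministic checks that pass (0.0 to 1.0)."""
    checks = []
    # Check 1: the output matches the expected regex somewhere.
    checks.append(re.search(pattern, output) is not None)
    # Check 2: the output parses as a JSON object containing the required keys.
    # (A stand-in for real JSON schema validation; an assumption.)
    try:
        parsed = json.loads(output)
        checks.append(
            isinstance(parsed, dict) and all(k in parsed for k in required_keys)
        )
    except json.JSONDecodeError:
        checks.append(False)
    return sum(checks) / len(checks)


def hybrid_score(
    output: str,
    pattern: str,
    required_keys: list[str],
    judge: Callable[[str], float],
    judge_weight: float = 0.5,  # hypothetical weighting, not from the source
) -> float:
    """Blend deterministic checks with a judge score, both on a 0-1 scale."""
    det = deterministic_score(output, pattern, required_keys)
    # Clamp the judge score so a misbehaving grader cannot leave the 0-1 range.
    judged = min(1.0, max(0.0, judge(output)))
    return (1.0 - judge_weight) * det + judge_weight * judged


def as_percentage(score: float) -> float:
    """Convert a normalised 0-1 score to a percentage with one decimal place."""
    return round(score * 100, 1)
```

In practice `judge` would wrap a call to the fixed grader model; here it can be any function that returns a float, which keeps the sketch runnable without network access.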
We measure end-to-end wall-clock latency from the moment we send the request to the moment the final token arrives. Each task is sampled at least 50 times to compute a stable p95. We include tool-call round-trips for agentic tasks because that's what users actually experience.
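The latency measurement can be sketched as follows. This is an illustration under assumptions, not the harness code: `run_task` is a hypothetical callable that blocks until the final token (and any tool-call round-trips) has completed, and the nearest-rank percentile method is our choice, since the text does not say which method is used.

```python
import time
from typing import Callable, Sequence

MIN_SAMPLES = 50  # the text specifies at least 50 samples per task


def time_call(run_task: Callable[[], object]) -> float:
    """Return wall-clock seconds from sending the request to the final token.

    `run_task` is a hypothetical callable that only returns once the full
    response, including tool-call round-trips, has arrived.
    """
    start = time.perf_counter()
    run_task()
    return time.perf_counter() - start


def p95(latencies: Sequence[float]) -> float:
    """Nearest-rank 95th percentile (an assumed method) of latency samples."""
    n = len(latencies)
    if n < MIN_SAMPLES:
        raise ValueError(f"need at least {MIN_SAMPLES} samples, got {n}")
    ordered = sorted(latencies)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding surprises.
    rank = (95 * n + 99) // 100
    return ordered[rank - 1]
```

A caller would collect `[time_call(run_task) for _ in range(50)]` or more and pass the list to `p95`. Interpolating methods such as `statistics.quantiles` give slightly different values on small samples, so the method should be fixed before comparing runs.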
Cost per 1,000 calls is computed from token counts (prompt + completion + tool tokens) multiplied by the provider's public list price on the run date. We do not apply enterprise discounts. Cached tokens, where available, are billed at the provider's cached rate.
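As a worked illustration of the cost calculation, the sketch below prices each token class at its own rate and scales the mean per-call cost to 1,000 calls. Several details are assumptions rather than documented behaviour: the `Usage` and `PriceSheet` types, prices quoted in USD per million tokens, `prompt_tokens` counting only uncached prompt tokens, a separate rate for tool tokens, and averaging over the sampled calls.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Usage:
    """Token counts for one call (hypothetical structure)."""
    prompt_tokens: int       # uncached prompt tokens only (assumption)
    cached_tokens: int       # prompt tokens served from the provider's cache
    completion_tokens: int
    tool_tokens: int


@dataclass(frozen=True)
class PriceSheet:
    """Public list prices on the run date, in USD per 1M tokens (assumed unit)."""
    prompt: float
    cached: float
    completion: float
    tool: float


def call_cost(usage: Usage, prices: PriceSheet) -> float:
    """Cost in USD of a single call at list price, with no discounts applied."""
    total = (
        usage.prompt_tokens * prices.prompt
        + usage.cached_tokens * prices.cached
        + usage.completion_tokens * prices.completion
        + usage.tool_tokens * prices.tool
    )
    return total / 1_000_000


def cost_per_1000_calls(usages: Sequence[Usage], prices: PriceSheet) -> float:
    """Mean per-call cost across the sampled calls, scaled to 1,000 calls."""
    if not usages:
        raise ValueError("at least one call is required")
    mean_cost = sum(call_cost(u, prices) for u in usages) / len(usages)
    return mean_cost * 1000
```

With made-up prices of 2.00, 0.50, 8.00 and 2.00 USD per million prompt, cached, completion and tool tokens, a call using 1,000 prompt, 2,000 cached and 500 completion tokens costs 0.007 USD, which is 7 USD per 1,000 calls.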