Open AI Model Benchmarks

We run the same tasks across the top LLMs every Tuesday. All numbers are reproducible; see the harness on GitHub.

As of April 29, 2026

Model               Task              Score  p95 Latency  Cost / 1k calls
claude-opus-4-7     research-dossier  84.0   6.2s         $4.20
claude-sonnet-4-6   research-dossier  79.0   3.1s         $1.80
gpt-5               research-dossier  81.0   4.4s         $3.50
gemini-2.5-pro      research-dossier  77.0   3.6s         $1.20
groq-llama-3.3-70b  research-dossier  71.0   850ms        $0.40

Methodology

Every score above comes from the same harness, run on the same prompts, with the same scoring rubric. Read the full methodology for tasks, scoring, latency, and cost.

Read the methodology →
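
For readers unfamiliar with the two derived columns, here is a minimal Python sketch of how a p95 latency and a cost per 1k calls could be computed from per-call measurements. It is an illustration only, not the harness's actual code: the function names, the nearest-rank percentile method, and the sample values are all assumptions. The full methodology is the authority on how the published numbers are produced.

```python
def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of numbers.

    Assumption: nearest-rank is one common definition; the real harness
    may interpolate or use a different method.
    """
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Integer ceiling of 0.95 * n, giving a 1-based rank without float error.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def cost_per_1k_calls(total_cost_usd, num_calls):
    """Scale the total spend over a run to a cost per 1,000 calls."""
    if num_calls <= 0:
        raise ValueError("num_calls must be positive")
    return total_cost_usd * 1000 / num_calls


if __name__ == "__main__":
    # Hypothetical measurements for one model on one task (not real data).
    latencies_s = [2.9, 3.0, 3.1, 3.3, 2.8, 3.0, 3.2, 2.7, 3.1, 6.0]
    print("p95 latency (s):", p95(latencies_s))
    print("cost per 1k calls (USD):", cost_per_1k_calls(0.9, 500))
```

With only ten samples the nearest-rank p95 is simply the slowest call, which is why a tail metric like this needs many calls per model to be stable.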