Open AI Model Benchmarks

We run the same tasks across the top LLMs every Tuesday. All numbers are reproducible; see the harness on GitHub.

As of April 29, 2026

Model	Task	Score	p95 Latency	Cost / 1k calls
claude-opus-4-7	research-dossier	84.0	6.2s	$4.20
claude-sonnet-4-6	research-dossier	79.0	3.1s	$1.80
gpt-5	research-dossier	81.0	4.4s	$3.50
gemini-2.5-pro	research-dossier	77.0	3.6s	$1.20
groq-llama-3.3-70b	research-dossier	71.0	0.85s	$0.40

Methodology

Every score above comes from the same harness, run on the same prompts, with the same scoring rubric. Read the full methodology for tasks, scoring, latency, and cost.
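For readers unfamiliar with the two derived columns, here is a minimal sketch of how a p95 latency and a cost per 1k calls could be computed from raw per-call measurements. It assumes a nearest-rank percentile and a simple linear scaling of total spend. The function names `p95_latency` and `cost_per_1k_calls` are hypothetical and are not taken from the harness, which may use a different percentile definition; the full methodology is the authoritative description.

```python
def p95_latency(samples_s):
    """Nearest-rank 95th percentile of latency samples (seconds).

    Assumption: nearest-rank method; the real harness may interpolate.
    """
    if not samples_s:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_s)
    # 1-based rank = ceil(0.95 * n), done in integer arithmetic
    # to avoid floating-point rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def cost_per_1k_calls(total_cost_usd, num_calls):
    """Scale the total spend for a run to a per-1,000-calls figure."""
    if num_calls <= 0:
        raise ValueError("num_calls must be positive")
    return total_cost_usd * 1000 / num_calls


if __name__ == "__main__":
    # Illustrative inputs only, not measurements from the table above.
    latencies = [0.8, 1.1, 0.9, 1.4, 3.0, 1.0, 1.2, 0.7, 1.3, 0.95]
    print(f"p95 latency: {p95_latency(latencies):.2f}s")
    print(f"cost / 1k calls: ${cost_per_1k_calls(0.9, 250):.2f}")
```

Nearest-rank always returns an observed sample rather than an interpolated value, which keeps the figure easy to audit against the raw logs.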

Read the methodology →