We run the same tasks across the top LLMs every Tuesday. All numbers are reproducible; see the harness on GitHub.
As of April 29, 2026
| Model | Task | Score | p95 Latency | Cost / 1k calls |
|---|---|---|---|---|
| claude-opus-4-7 | research-dossier | 84.0 | 6.2s | $4.20 |
| claude-sonnet-4-6 | research-dossier | 79.0 | 3.1s | $1.80 |
| gpt-5 | research-dossier | 81.0 | 4.4s | $3.50 |
| gemini-2.5-pro | research-dossier | 77.0 | 3.6s | $1.20 |
| groq-llama-3.3-70b | research-dossier | 71.0 | 0.85s | $0.40 |
Every score above comes from the same harness, run on the same prompts, with the same scoring rubric. Read the full methodology for tasks, scoring, latency, and cost.
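For readers unfamiliar with the p95 Latency column, here is a minimal sketch of one common way to compute a 95th percentile, the nearest-rank method. Whether the harness uses this exact method is an assumption; the full methodology is the authority. The function name `p95` and the sample values are hypothetical and are not data from the table above.

```python
def p95(latencies_ms):
    """Nearest-rank 95th percentile of per-call latencies (milliseconds).

    Assumption: nearest-rank is used here only for illustration; the
    harness may interpolate or use another percentile definition.
    """
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # 1-based rank = ceil(0.95 * n), done in integer arithmetic
    # to avoid floating-point rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical samples: 20 calls taking 500 ms, 600 ms, ..., 2400 ms.
samples_ms = [500 + 100 * i for i in range(20)]
print(p95(samples_ms))
```

Under this definition, 95% of calls finish at or below the reported value, so a single slow outlier in 20 calls does not move the number.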