
neww.ai vs GPT-5

OpenAI's current flagship model, tested with the same prompt and the same retrieval context.

No benchmark data yet.
Run npx tsx scripts/benchmark/run-nightly.ts (or wait for the nightly cron) to populate this page.
Methodology
  • 20 test cases across 4 cohorts (mega-cap / small-cap / sector / controversial).
  • Same wedge prompt. Same retrieval context. Independent API calls.
  • Rubric: citation accuracy (30%), citation page-precision (15%), factual correctness (20%), disclaimer compliance (10%), Reg-BI refusal (10%), output quality (10%), latency p95 (5%).
  • Judge: Claude Opus 4.7 as a held-out rubric scorer with a pinned prompt.
  • Raw data: download CSV.
  • Re-run: nightly at 02:00 UTC via /api/cron/benchmark-nightly.
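The rubric weights above combine into a single composite score per model. A minimal TypeScript sketch of that weighting is below; the `BenchmarkScores` shape, metric names, and the assumption that each metric (including latency) is pre-normalized to a 0–1 scale are illustrative, not the repo's actual types.

```typescript
// Hypothetical types — not the benchmark repo's real definitions.
type BenchmarkScores = {
  citationAccuracy: number;      // 0–1, weight 30%
  citationPagePrecision: number; // 0–1, weight 15%
  factualCorrectness: number;    // 0–1, weight 20%
  disclaimerCompliance: number;  // 0–1, weight 10%
  regBiRefusal: number;          // 0–1, weight 10%
  outputQuality: number;        // 0–1, weight 10%
  latencyP95: number;           // assumed normalized to 0–1, weight 5%
};

// Weights from the methodology list; they sum to 1.0.
const WEIGHTS: Record<keyof BenchmarkScores, number> = {
  citationAccuracy: 0.30,
  citationPagePrecision: 0.15,
  factualCorrectness: 0.20,
  disclaimerCompliance: 0.10,
  regBiRefusal: 0.10,
  outputQuality: 0.10,
  latencyP95: 0.05,
};

// Weighted sum of the per-metric scores.
function compositeScore(scores: BenchmarkScores): number {
  return (Object.keys(WEIGHTS) as (keyof BenchmarkScores)[]).reduce(
    (sum, metric) => sum + scores[metric] * WEIGHTS[metric],
    0,
  );
}

// A model scoring perfectly on every metric gets a composite of 1.0
// (up to floating-point rounding).
const perfect: BenchmarkScores = {
  citationAccuracy: 1,
  citationPagePrecision: 1,
  factualCorrectness: 1,
  disclaimerCompliance: 1,
  regBiRefusal: 1,
  outputQuality: 1,
  latencyP95: 1,
};
console.log(compositeScore(perfect));
```

One design note: keeping the weights in a single `WEIGHTS` map makes it easy to assert they sum to 1.0 in a unit test, so a future rubric change can't silently skew the composite.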