neww.ai vs Gemini 2.5 Pro
Google's current flagship, tested with the same prompt and same retrieval context.
No benchmark data yet. Run npx tsx scripts/benchmark/run-nightly.ts (or wait for the nightly cron) to populate this page.

Methodology
- 20 test cases across 4 cohorts (mega-cap / small-cap / sector / controversial).
- Same wedge prompt. Same retrieval context. Independent API calls.
- Rubric: citation accuracy (30%), citation page-precision (15%), factual correctness (20%), disclaimer compliance (10%), Reg-BI refusal (10%), output quality (10%), latency p95 (5%).
- Judge: Claude Opus 4.7 as held-out rubric scorer with pinned prompt.
- Raw data: download CSV.
- Re-run: nightly at 02:00 UTC via /api/cron/benchmark-nightly.
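As a rough illustration of how the rubric weights above combine into a single score, here is a minimal TypeScript sketch. The dimension names and function are hypothetical, not the actual benchmark implementation; it only assumes each dimension is scored in [0, 1] and weighted per the percentages listed.

```typescript
// Hypothetical weights mirroring the rubric percentages above.
const RUBRIC_WEIGHTS: Record<string, number> = {
  citationAccuracy: 0.30,
  citationPagePrecision: 0.15,
  factualCorrectness: 0.20,
  disclaimerCompliance: 0.10,
  regBiRefusal: 0.10,
  outputQuality: 0.10,
  latencyP95: 0.05,
};

// Combine per-dimension scores (each in [0, 1]) into one weighted composite.
// Missing dimensions score 0, so an incomplete run is penalized, not skipped.
function compositeScore(scores: Record<string, number>): number {
  let total = 0;
  for (const [dim, weight] of Object.entries(RUBRIC_WEIGHTS)) {
    total += weight * (scores[dim] ?? 0);
  }
  return total;
}
```

A perfect run across all seven dimensions yields 1.0; weighting latency at only 5% keeps a slow-but-accurate run from being dominated by speed.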