neww.ai vs Gemini 2.5 Pro
Google's current flagship, tested with the same prompt and same retrieval context.
No benchmark data yet. Run npx tsx scripts/benchmark/run-nightly.ts (or wait for the nightly cron) to populate this page.

Methodology
- 20 test cases across 4 cohorts (mega-cap / small-cap / sector / controversial).
- Same wedge prompt. Same retrieval context. Independent API calls.
- Rubric: citation accuracy (30%), citation page-precision (15%), factual correctness (20%), disclaimer compliance (10%), Reg-BI refusal (10%), output quality (10%), latency p95 (5%).
- Judge: Claude Opus 4.7 as held-out rubric scorer with pinned prompt.
- Raw data: download CSV.
- Re-run: nightly at 02:00 UTC via /api/cron/benchmark-nightly.
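As a rough illustration of how the rubric weights above combine into a single score, here is a minimal TypeScript sketch. The dimension names and function are hypothetical, not the actual benchmark implementation; it only assumes each dimension is scored in [0, 1] and weighted per the percentages listed.

```typescript
// Hypothetical weights mirroring the rubric percentages above.
const RUBRIC_WEIGHTS: Record<string, number> = {
  citationAccuracy: 0.30,
  citationPagePrecision: 0.15,
  factualCorrectness: 0.20,
  disclaimerCompliance: 0.10,
  regBiRefusal: 0.10,
  outputQuality: 0.10,
  latencyP95: 0.05,
};

// Combine per-dimension scores (each in [0, 1]) into one weighted composite.
// Missing dimensions score 0, so an incomplete run is penalized, not skipped.
function compositeScore(scores: Record<string, number>): number {
  let total = 0;
  for (const [dim, weight] of Object.entries(RUBRIC_WEIGHTS)) {
    total += weight * (scores[dim] ?? 0);
  }
  return total;
}
```

A perfect run across all seven dimensions yields 1.0; weighting latency at only 5% keeps a slow-but-accurate run from being dominated by speed.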