Claude Opus 4.7 vs Gemini 3.5 Flash
| | Anthropic Claude Opus 4.7 | Google Gemini 3.5 Flash |
|---|---|---|
| Overview | | |
| Company | Anthropic | Google |
| Release date | Apr 16 2026 | May 19 2026 |
| Access | Proprietary | Proprietary |
| Specifications | | |
| Context window | 1M | — |
| Benchmarks | | |
| Nonsense detection: BullshitBench v2. Given a confidently worded but nonsensical prompt, does the AI spot that it makes no sense and push back instead of playing along and inventing an answer? The score is how often it clearly called out the nonsense. Higher is better. | 83% (Best) | 20% |
| Agentic coding: SWE-Bench Pro. Can the AI fix real bugs in real software? It is handed actual problems from open-source projects and has to write code that genuinely solves them. Higher is better. | 64.3% (Best) | 55.1% |
| Coding: SWE-Bench Verified. Real coding tasks pulled from open-source projects, where the AI has to find and fix actual bugs. A human-checked version of the original SWE-Bench. Higher is better. | 87.6% | — |
| Multilingual coding: SWE-Bench Multilingual. Like SWE-Bench, but the coding problems span many programming languages, not just one. Tests how broadly the AI can code. Higher is better. | 80.5% | — |
| Agentic coding: CursorBench v3.1. Cursor's own test of harder, real-world coding tasks inside a code editor. Higher is better. | 64.8% (Best) | 49.8% |
| Next.js coding: Next.js Evals. Vercel's open eval of how well AI coding agents build and migrate real Next.js apps, measured as the share of tasks the agent completes successfully. Higher is better. | 75% | — |
| Agentic terminal coding: Terminal-Bench 2.1. Can the AI work in a command-line terminal, running commands and finishing technical setup tasks the way a developer would? Higher is better. | 66.1% | 76.2% (Best) |
| Agentic terminal coding: Terminal-Bench 2.0. Can the AI work in a command-line terminal, running commands and finishing technical setup tasks the way a developer would? (Version 2.0 of the test.) Higher is better. | 69.4% | — |
| Multi-step tool use: MCP Atlas. Can the AI chain together many tools and steps to complete one bigger task, rather than doing just a single thing? Higher is better. | 79.1% | 83.6% (Best) |
| General tool use: Toolathlon. Tests how well the AI uses everyday real-world tools and apps to get things done. Higher is better. | — | 56.5% |
| Web browsing: BrowseComp. Can the AI browse the web and track down hard-to-find answers? Higher is better. | 79.3% | — |
| Cybersecurity: CyberGym. Tests the AI on cybersecurity challenges: finding and exploiting software weaknesses inside a safe sandbox. Higher is better. | 73.1% | — |
| Multidisciplinary reasoning: Humanity's Last Exam · no tools. Extremely hard expert questions across many subjects, written so you can't just look up the answer. “No tools” means the AI answers on its own. Higher is better. | 46.9% (Best) | 40.2% |
| Multidisciplinary reasoning: Humanity's Last Exam · with tools. Extremely hard expert questions across many subjects. “With tools” means the AI is allowed to search the web or run code while answering. Higher is better. | 54.7% | — |
| Abstract reasoning: ARC-AGI-2. Puzzle-style tests of abstract reasoning and pattern-finding, the kind of thing people find easy but AIs often struggle with. Higher is better. | 75.8% (Best) | 72.1% |
| Advanced math: FrontierMath · Tier 1–3. Very hard, research-level math problems. Tiers 1–3 are the (still extremely difficult) lower tiers. Higher is better. | 43.8% | — |
| Advanced math: FrontierMath · Tier 4. Very hard, research-level math problems. Tier 4 is the hardest, close to what professional research mathematicians tackle. Higher is better. | 22.9% | — |
| Science: GPQA Diamond. Graduate-level science questions in biology, physics, and chemistry, hard enough that subject-matter PhDs score around 65%. Higher is better. | 94.2% | — |
| Agentic computer use: OSWorld-Verified. Can the AI actually operate a computer (clicking, typing, and using real apps) to finish tasks on its own? Higher is better. | 78% | 78.4% (Best) |
| Agentic financial analysis: Finance Agent v2. Tests the AI on real financial-analysis work, like digging through reports and making sound decisions. Higher is better. | 51.5% | 57.9% (Best) |
| Knowledge work: GDPval-AA. Measures how well the AI does economically valuable knowledge work, judged against human experts. Shown as a rating (like a chess Elo). Higher is better. | 1753 (Best) | 1656 |
| Knowledge work: GDPval (win/tie rate). How often the AI's work matches or beats a human expert's on real knowledge-work tasks. Higher is better. | 80.3% | — |
| Chart reasoning: CharXiv Reasoning. Can the AI read and reason about complex charts and figures, not just text? Higher is better. | 82.1% | 84.2% (Best) |
| Multimodal reasoning: MMMU-Pro. A tougher version of MMMU, with college-level questions that mix images, diagrams, and text together. Higher is better. | 75.2% | 83.6% (Best) |
| Spatial reasoning: Blueprint-Bench 2. Can the AI reason about space and layout, for example understanding a floor plan or blueprint? Higher is better. | 24.5% | 33.6% (Best) |
| Long context: MRCR v2 (8-needle) · 128k average. Tests whether the AI can find specific details buried inside a very long document (around 128k tokens, roughly a long book). Higher is better. | 59.3% | 77.3% (Best) |
| Long context: MRCR v2 (8-needle) · 1M pointwise. Tests whether the AI can find specific details buried inside an enormous document (around 1 million tokens, many books). Higher is better. | — | 26.6% |
| Timeline | | |
| Release gap | Claude Opus 4.7 shipped 33 days before Gemini 3.5 Flash | |
Which is better: Claude Opus 4.7 or Gemini 3.5 Flash?
Gemini 3.5 Flash leads Claude Opus 4.7 on 8 of the 14 benchmarks they both report; Claude Opus 4.7 leads on the other 6. Claude Opus 4.7 shipped 33 days before Gemini 3.5 Flash, so some of the gap may reflect about a month of intervening progress rather than a lasting difference between the two models.
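The 33-day release gap can be checked with simple date arithmetic. The snippet below is a minimal sketch using the two release dates from the table; the variable names are illustrative and not part of any published tooling.

```python
from datetime import date

# Release dates as listed in the comparison table.
claude_release = date(2026, 4, 16)  # Claude Opus 4.7
gemini_release = date(2026, 5, 19)  # Gemini 3.5 Flash

# Subtracting two dates yields a timedelta; .days is the whole-day gap.
gap_days = (gemini_release - claude_release).days
print(gap_days)  # 33
```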
Published specifications for these two models are limited — see each model page for the latest details.
In each pair below, the leading model's score is listed first.

Claude Opus 4.7 leads on six of the shared benchmarks: BullshitBench v2 (83% vs 20%), SWE-Bench Pro (64.3% vs 55.1%), CursorBench v3.1 (64.8% vs 49.8%), Humanity's Last Exam · no tools (46.9% vs 40.2%), ARC-AGI-2 (75.8% vs 72.1%), and GDPval-AA (1753 vs 1656).

Gemini 3.5 Flash leads on the other eight: Terminal-Bench 2.1 (76.2% vs 66.1%), MCP Atlas (83.6% vs 79.1%), OSWorld-Verified (78.4% vs 78%), Finance Agent v2 (57.9% vs 51.5%), CharXiv Reasoning (84.2% vs 82.1%), MMMU-Pro (83.6% vs 75.2%), Blueprint-Bench 2 (33.6% vs 24.5%), and MRCR v2 (8-needle) · 128k average (77.3% vs 59.3%).
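The head-to-head count can be reproduced from the scores reported on this page. The following is a rough sketch, not the site's own scoring code: the `SHARED` list and the `tally` and `closest` helpers are hypothetical names introduced here, and the values are copied from the 14 benchmarks both models report.

```python
# (benchmark, Claude Opus 4.7, Gemini 3.5 Flash), copied from this page.
# All values are percentages except GDPval-AA, which is a rating.
SHARED = [
    ("BullshitBench v2", 83.0, 20.0),
    ("SWE-Bench Pro", 64.3, 55.1),
    ("CursorBench v3.1", 64.8, 49.8),
    ("Terminal-Bench 2.1", 66.1, 76.2),
    ("MCP Atlas", 79.1, 83.6),
    ("Humanity's Last Exam (no tools)", 46.9, 40.2),
    ("ARC-AGI-2", 75.8, 72.1),
    ("OSWorld-Verified", 78.0, 78.4),
    ("Finance Agent v2", 51.5, 57.9),
    ("GDPval-AA", 1753.0, 1656.0),
    ("CharXiv Reasoning", 82.1, 84.2),
    ("MMMU-Pro", 75.2, 83.6),
    ("Blueprint-Bench 2", 24.5, 33.6),
    ("MRCR v2 (8-needle), 128k average", 59.3, 77.3),
]


def tally(rows):
    """Count the benchmarks each model leads (higher is better on all of them)."""
    claude = sum(1 for _, c, g in rows if c > g)
    gemini = sum(1 for _, c, g in rows if g > c)
    return claude, gemini


def closest(rows):
    """Return the percentage-scored row with the smallest absolute gap."""
    # GDPval-AA is excluded because a rating gap is not comparable to percentage points.
    pct_rows = [r for r in rows if r[0] != "GDPval-AA"]
    return min(pct_rows, key=lambda r: abs(r[1] - r[2]))


claude_wins, gemini_wins = tally(SHARED)
print(claude_wins, gemini_wins, len(SHARED))  # 6 8 14
print(closest(SHARED)[0])  # OSWorld-Verified
```

A plain win count treats a 0.4-point lead on OSWorld-Verified the same as a 63-point lead on BullshitBench v2, so it is best read alongside the individual margins.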
Frequently asked questions
Claude Opus 4.7 was released by Anthropic on Apr 16 2026.
Gemini 3.5 Flash was released by Google on May 19 2026.
Claude Opus 4.7 leads on SWE-Bench Pro — Claude Opus 4.7 64.3% vs Gemini 3.5 Flash 55.1%.
Claude Opus 4.7 leads on Humanity's Last Exam · no tools — Claude Opus 4.7 46.9% vs Gemini 3.5 Flash 40.2%.