Google
Gemini 3.5 Flash
Gemini 3.5 Flash is an AI model released by Google on May 19, 2026, the same day as Gemini Omni. Benchmark results (shown below) cover BullshitBench v2, SWE-Bench Pro, CursorBench v3.1, Terminal-Bench 2.1, MCP Atlas, Toolathlon, and 9 more.
Benchmarks
Nonsense detection
BullshitBench v2
Given a confidently worded but nonsensical prompt, does the AI spot that it makes no sense and push back, instead of playing along and inventing an answer? The score is how often it clearly called out the nonsense. Higher is better.
20% (+9% vs Gemini 3.1 Flash-Lite)
Agentic coding
SWE-Bench Pro
Can the AI fix real bugs in real software? It's handed actual problems from open-source projects and has to write code that genuinely solves them. Higher is better.
55.1% (+5.5% vs Gemini 3.0 Flash)
Agentic coding
CursorBench v3.1
Cursor's own test of harder, real-world coding tasks inside a code editor. Higher is better.
49.8%
Agentic terminal coding
Terminal-Bench 2.1
Can the AI work in a command-line terminal, running commands and finishing technical setup tasks the way a developer would? Higher is better.
76.2% (+18.2% vs Gemini 3.0 Flash)
Multi-step tool use
MCP Atlas
Can the AI chain together many tools and steps to complete one bigger task, rather than doing just a single thing? Higher is better.
83.6% (+21.6% vs Gemini 3.0 Flash)
General tool use
Toolathlon
Tests how well the AI uses everyday real-world tools and apps to get things done. Higher is better.
56.5% (+7.1% vs Gemini 3.0 Flash)
Multidisciplinary reasoning
Humanity's Last Exam
Extremely hard expert questions across many subjects, written so you can't just look up the answer. “No tools” means the AI answers on its own. Higher is better.
40.2% (+6.5% vs Gemini 3.0 Flash), no tools
Abstract reasoning
ARC-AGI-2
Puzzle-style tests of abstract reasoning and pattern-finding, the kind of thing people find easy but AIs often struggle with. Higher is better.
72.1% (+38.5% vs Gemini 3.0 Flash)
Agentic computer use
OSWorld-Verified
Can the AI actually operate a computer (clicking, typing, and using real apps) to finish tasks on its own? Higher is better.
78.4% (+13.3% vs Gemini 3.0 Flash)
Agentic financial analysis
Finance Agent v2
Tests the AI on real financial-analysis work, like digging through reports and making sound decisions. Higher is better.
57.9% (+15.3% vs Gemini 3.0 Flash)
Knowledge work
GDPval-AA
Measures how well the AI does economically valuable knowledge work, judged against human experts. Shown as a rating (like a chess Elo); higher is better.
1656 (+452 vs Gemini 3.0 Flash)
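To get a feel for what a rating gap of this size means, the standard Elo formula converts a rating difference into an expected head-to-head score. The sketch below assumes GDPval-AA ratings follow the conventional chess Elo scale (base 10, 400-point divisor); the page only says the rating is "like a chess Elo", so treat the result as illustrative. The function name `elo_expected_score` is hypothetical.

```python
def elo_expected_score(rating_gap: float, scale: float = 400.0) -> float:
    """Expected score of the higher-rated side under the standard Elo formula.

    Assumption: the benchmark uses the usual chess convention
    (base 10, 400-point scale). This is not confirmed by the source.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


# Rating gap reported for GDPval-AA on this page.
gap = 452
print(f"{elo_expected_score(gap):.3f}")  # prints 0.931
```

Under that assumption, a 452-point gap corresponds to the higher-rated model being preferred in roughly 93% of pairwise comparisons.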
Chart reasoning
CharXiv Reasoning
Can the AI read and reason about complex charts and figures, not just text? Higher is better.
84.2% (+3.9% vs Gemini 3.0 Flash)
Multimodal reasoning
MMMU-Pro
A tougher version of MMMU: college-level questions that mix images, diagrams, and text together. Higher is better.
83.6% (+2.4% vs Gemini 3.0 Flash)
Spatial reasoning
Blueprint-Bench 2
Can the AI reason about space and layout, for example by understanding a floor plan or blueprint? Higher is better.
33.6% (+33.6% vs Gemini 3.0 Flash)
Long context
MRCR v2 (8-needle)
Tests whether the AI can find specific details buried inside a very long document (around 128k tokens, roughly a long book). Higher is better.
128k average: 77.3% (+10.1% vs Gemini 3.0 Flash)
1M pointwise: 26.6% (+4.5% vs Gemini 3.0 Flash)
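Each result above pairs a score with a change against an earlier model, so the earlier model's score can be recovered by subtraction. The sketch below assumes the change is an absolute difference in the same units as the score (percentage points for the percent-based benchmarks), which the page does not state explicitly; if the change is relative, the arithmetic does not apply. The helper name `implied_baseline` is hypothetical.

```python
def implied_baseline(score: float, delta: float) -> float:
    """Score of the comparison model implied by a score and its reported change.

    Assumption: `delta` is an absolute difference in the same units as `score`.
    """
    return round(score - delta, 1)


# Figures taken from the listing above.
print(implied_baseline(55.1, 5.5))    # SWE-Bench Pro: prints 49.6
print(implied_baseline(76.2, 18.2))   # Terminal-Bench 2.1: prints 58.0
```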
Compare Gemini 3.5 Flash with
Gemini Omni, GPT-5.6 Sol, Claude Sonnet 5, Muse Spark, Grok 4.3 Beta, DeepSeek-V4-Pro, Mistral Medium 3.5, Kimi K2.7 Code, Composer 2.5, GLM-5.2