
Can AI Pass an IQ Test? We Tested 5 AI Models


FakeIQ Staff

[Illustration: a friendly robot sitting at a school desk, taking a written test, with a thought bubble of gears and question marks.]

We had a slow Tuesday and a question that wouldn’t go away: if you sat an AI down in a testing center (metaphorically, obviously — they don’t have butts), would it pass an IQ test?

Not a trivia quiz. Not a benchmark specifically designed for language models. An actual, honest-to-goodness IQ test, the kind that millions of humans take every year.

So we did it. We took a standardized IQ test structure — based on the Wechsler Adult Intelligence Scale (WAIS-IV) categories — and ran five leading AI models through it. Here’s what happened.

The Contenders

We tested five AI models that were readily available in early 2026:

  1. GPT-4o (OpenAI)
  2. Claude 3.5 Sonnet (Anthropic)
  3. Gemini 1.5 Pro (Google)
  4. Llama 3.1 405B (Meta)
  5. Mistral Large (Mistral AI)

We gave each model the same 40 questions across four IQ test domains: Verbal Comprehension, Perceptual Reasoning, Working Memory, and Processing Speed. We ran each test three times and took the median score. We used zero-shot prompting — no hints, no coaching, no “think step by step.”

Just: “Here’s a question. Answer it.”
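If you want a sense of the mechanics, here's a minimal sketch of what a harness like this could look like. The `query_model` wrapper and `grade` function are hypothetical stand-ins, not our actual pipeline; each provider has its own SDK, and real verbal items need fuzzier grading than a substring check.

```python
import statistics

def query_model(model_name: str, prompt: str) -> str:
    """Hypothetical wrapper -- swap in the provider SDK of your choice
    (openai, anthropic, google-generativeai, and so on)."""
    raise NotImplementedError

def grade(answer: str, expected: str) -> bool:
    # Stand-in grader; real verbal items need fuzzier matching than this.
    return expected.strip().lower() in answer.strip().lower()

def run_iq_test(model_name: str, questions: list[dict]) -> float:
    """Ask all 40 questions zero-shot, three full passes, keep the median."""
    totals = []
    for _ in range(3):
        correct = sum(
            grade(query_model(model_name, q["prompt"]), q["expected"])
            for q in questions
        )
        totals.append(correct)
    return statistics.median(totals)
```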

Category 1: Verbal Comprehension

This measures vocabulary, general knowledge, and the ability to reason with language. Think “Define the word ‘obsequious’” and “How are a painting and a poem alike?”

Results: All five models crushed it.

| Model | Score (out of 10) |
| --- | --- |
| GPT-4o | 10 |
| Claude 3.5 Sonnet | 10 |
| Gemini 1.5 Pro | 10 |
| Llama 3.1 405B | 9 |
| Mistral Large | 9 |

This was predictable. Language models are, well, language models. They’ve ingested more text than any human could read in a thousand lifetimes. Asking them to define words is like asking a fish to swim.

The interesting part is how they answered. Claude and GPT-4o both gave nuanced, essay-quality explanations. Gemini tended to give slightly more textbook-style responses. Llama and Mistral occasionally missed the subtle connotations of a word, scoring 9 instead of 10 — which, to be clear, is still better than most humans.

Category 2: Perceptual Reasoning

This is where things got weird. Perceptual reasoning tests usually involve visual pattern recognition: matrix puzzles, block designs, picture completion. The kind of stuff that requires you to look at shapes and figure out what comes next.

We adapted these into text descriptions — “Imagine a 3x3 grid where each row follows a pattern…” — because AI models don’t have eyes (yet, and when they do, we should all be slightly nervous).
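For flavor, here's roughly what one of those adaptations looks like, as the kind of string we'd feed the harness above. This is an invented item in the same spirit, not an actual WAIS question:

```python
# An invented matrix-reasoning item, flattened into plain text.
MATRIX_PROMPT = """\
Imagine a 3x3 grid. Reading left to right:
Row 1: one circle, two circles, three circles.
Row 2: one square, two squares, three squares.
Row 3: one triangle, two triangles, ?
What belongs in the missing cell? Answer in a few words."""
```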

Results:

| Model | Score (out of 10) |
| --- | --- |
| GPT-4o | 7 |
| Claude 3.5 Sonnet | 8 |
| Gemini 1.5 Pro | 7 |
| Llama 3.1 405B | 5 |
| Mistral Large | 6 |

Suddenly, the gap opens up. The models can reason about spatial patterns described in text, but they’re clearly operating with a handicap. Claude edged ahead by showing strong abstract reasoning with matrix-style puzzles, while Llama struggled noticeably with rotational symmetry questions.

For context, the average human scores about 5 out of 10 on these sections (that’s how they’re calibrated — 50th percentile). So even the “worst” AI performers are roughly average, and the best are solidly above average.

But here’s the kicker: if you gave a human the perceptual reasoning test as text descriptions instead of visual images, they’d probably score worse too. We’re comparing apples to descriptions of oranges.

Category 3: Working Memory

Working memory tests ask you to hold information in your mind and manipulate it. “Repeat these numbers backward: 7, 3, 9, 1, 4, 6, 2.” Or: “I’ll read you a sequence of letters and numbers. Repeat the numbers in ascending order, then the letters alphabetically.”

Results:

| Model | Score (out of 10) |
| --- | --- |
| GPT-4o | 10 |
| Claude 3.5 Sonnet | 10 |
| Gemini 1.5 Pro | 10 |
| Llama 3.1 405B | 10 |
| Mistral Large | 10 |

Perfect scores across the board. Every model. Every time.

This makes sense when you think about it, but it also completely breaks the concept of “working memory” as applied to AI. Human working memory is limited — we can hold about 7 (plus or minus 2) items in short-term memory, and manipulating them takes cognitive effort. For a language model, reversing a string of digits is trivial computation, not a test of cognitive capacity.
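To make "trivial" concrete: a language model isn't literally running code, but for any machine, the digit-span task from above is a one-liner, and letter-number sequencing (shown here with an invented sequence) is barely harder.

```python
# Digit span backward: no bottleneck, no effort, just a reversal.
digits = [7, 3, 9, 1, 4, 6, 2]
print(list(reversed(digits)))  # [2, 6, 4, 1, 9, 3, 7]

# Letter-number sequencing (invented sequence): numbers in ascending
# order, then letters alphabetically.
mixed = ["K", 3, "A", 9, "M", 1]
numbers = sorted(x for x in mixed if isinstance(x, int))
letters = sorted(x for x in mixed if isinstance(x, str))
print(numbers + letters)  # [1, 3, 9, 'A', 'K', 'M']
```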

This is one of those moments where the test reveals its own assumptions. Working memory subtests are designed to measure a bottleneck that exists in human cognition. AI doesn’t have that bottleneck. Acing the test doesn’t mean AI has “better working memory” — it means the test isn’t measuring what it’s supposed to measure when the test-taker isn’t human.

Category 4: Processing Speed

Processing speed tests are timed tasks. In the WAIS-IV, this involves scanning rows of symbols and marking matches as quickly as possible. It measures how fast you can process simple visual information.

We adapted this by giving models timed batches of pattern-matching tasks: “Here are 50 pairs of symbols. For each pair, respond ‘same’ or ‘different.’ You have 30 seconds.”
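Mechanically, the whole subtest reduces to a comparison loop. Here's a sketch with an invented symbol set, just to show why a 30-second limit means nothing to a machine:

```python
import random
import time

SYMBOLS = "◆◇▲△●○■□"  # invented symbol set for illustration

# Build 50 pairs, roughly half of them matching.
pairs = [
    (s, s if random.random() < 0.5 else random.choice(SYMBOLS))
    for s in random.choices(SYMBOLS, k=50)
]

start = time.perf_counter()
answers = ["same" if a == b else "different" for a, b in pairs]
elapsed = time.perf_counter() - start
print(f"50 judgments in {elapsed * 1000:.3f} ms")  # milliseconds, not seconds
```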

Results:

| Model | Score (out of 10) |
| --- | --- |
| GPT-4o | 10 |
| Claude 3.5 Sonnet | 10 |
| Gemini 1.5 Pro | 10 |
| Llama 3.1 405B | 10 |
| Mistral Large | 10 |

Again, perfect. And again, it’s meaningless. A computer processing pattern-matching tasks quickly is not evidence of intelligence any more than a calculator doing multiplication quickly is evidence of mathematical insight. The test is designed to measure a human speed limit. Computers don’t have that limit.

The Composite: What’s the AI IQ?

If we naively combine the scores and map them to IQ norms (which, for the record, is a terrible idea and we’re going to do it anyway):

| Model | Estimated IQ Range |
| --- | --- |
| GPT-4o | 130-140 |
| Claude 3.5 Sonnet | 135-145 |
| Gemini 1.5 Pro | 130-140 |
| Llama 3.1 405B | 115-125 |
| Mistral Large | 120-130 |
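For transparency, here's the kind of back-of-napkin conversion we mean. This sketch assumes a human average of 20 out of 40 and a standard deviation of 8 points, both invented calibration numbers; real IQ norming involves large standardization samples, not two guessed constants.

```python
def naive_iq(subtest_scores: list[int],
             human_mean: float = 20.0,  # assumed average human total out of 40
             human_sd: float = 8.0) -> float:  # assumed spread; pure guesswork
    """Map a raw total onto the IQ scale (mean 100, SD 15). Don't do this."""
    total = sum(subtest_scores)
    z = (total - human_mean) / human_sd
    return 100 + 15 * z

# GPT-4o's four subtest scores from the tables above:
print(round(naive_iq([10, 7, 10, 10]), 1))  # 131.9, inside the 130-140 band
```

Run the same function on the other models' rows and you land roughly in the neighborhood of the ranges above. Nudge either of the two guessed constants and every number moves, which is rather the point.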

By this math, the top AI models are scoring in the “gifted” range. Claude edges out the competition, GPT-4o and Gemini are neck and neck, and the open-source models put up respectable but lower numbers.

But these numbers are about as meaningful as the famous-people IQ scores that float around the internet. They’re applying a human-calibrated measurement tool to a fundamentally non-human system.

What We Actually Learned

The IQ test was designed to measure human cognitive abilities with all their specific constraints — limited memory, visual perception, time pressure, and fatigue. AI doesn’t share those constraints. It’s not that AI is “smarter” or “dumber” than humans. It’s that the test doesn’t translate.

Where AI dominates: anything involving stored knowledge, pattern matching on structured data, and tasks where speed is a factor. Where AI struggles: novel reasoning about spatial relationships, genuine understanding (as opposed to pattern completion), and anything that requires the test to be taken in good faith.

That last point matters. A human taking an IQ test is demonstrating real cognitive effort. An AI “taking” an IQ test is autocompleting responses based on training data that very likely included IQ test questions and answers. It’s like giving someone a test full of questions they’ve already seen — the score tells you about their memory, not their reasoning.

The Actually Interesting Question

The interesting question isn’t “Can AI pass an IQ test?” (Yes, trivially, in most categories.) The interesting question is: does passing an IQ test mean anything at all, for anyone?

We already know that IQ tests are a flawed measure of human intelligence. Applying them to AI just makes the flaws more obvious. The test can’t distinguish between “knows the answer” and “understands the concept.” It can’t measure creativity, wisdom, emotional intelligence, or common sense. It can’t tell you whether the test-taker is actually thinking or just vibing.

And honestly? It can't for humans either. That's been the whole point of this site. Welcome to FakeIQ.