AI IQ is here: a new site scores frontier AI models on the human IQ scale. The results are already dividing tech.

Contents

12 Benchmarks, 4 Dimensions, and 1 Controversial Number: How AI IQ Actually Works
OpenAI leads the bell curve, but the gap between top AI models is narrower than ever
Why emotional intelligence is becoming the new battleground in AI model rankings
The AI cost-performance chart that enterprise buyers should actually look at
Critics say AI is too "jagged": a single IQ score can be dangerously misleading
The real race isn't about the highest score, it's about the smartest model stack

For decades, IQ tests have been one of the best-known and most controversial measures of human intelligence. Now a startup project called AI IQ is applying the same metaphor to artificial intelligence, assigning estimated intelligence quotients to more than 50 of the world's most powerful language models and plotting them on a standard bell curve.

The resulting series of interactive visualizations at aiiq.org has been making the rounds on social media over the past week, drawing praise from enterprise technologists who say the charts make an extremely complex market easier to read, and sharp criticism from researchers and commentators who warn that the entire framework is misleading.

"This is very handy," technology commentator Thibault Melen wrote on X. "Mapping it like this instead of another giant leaderboard table makes it easier to understand model progress."

Enterprise strategist Brian Vermuir expressed similar support. "This is helpful. Tracks anecdotally based on personal experience."

But the backlash came just as quickly. "This is nonsense. AI is too jagged. Maps are not territories," wrote the AI commentary account AI Deeply, in a post that captures a concern shared by many researchers: reducing the messy, uneven capabilities of language models to a single number creates a dangerous illusion of precision.

12 Benchmarks, 4 Dimensions, and 1 Controversial Number: How AI IQ Actually Works

AI IQ was created by Ryan Shea, an engineer, entrepreneur, and angel investor best known as the co-founder of the blockchain platform Stacks. Shea is also a co-founder of Voterbase and was an early-stage investor in several unicorn companies, including OpenSea, Lattice, Anchorage, and Mercury. He holds a Bachelor of Science in Mechanical Engineering from Princeton University.

The site's methodology is based on a deceptively simple formula. AI IQ groups 12 benchmarks into four reasoning dimensions: abstract, mathematical, programmatic, and academic. Composite IQ is the direct average of these four dimension scores: IQ = 1/4 (IQ_Abstract + IQ_Math + IQ_Prog + IQ_Acad).
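As a minimal sketch of the averaging step described above, the composite can be computed as an unweighted mean of the four dimension scores. The function name, dictionary keys, and example numbers below are illustrative assumptions, not taken from the site.

```python
from statistics import mean

# Dimension names follow the article; the key spellings are an assumption.
DIMENSIONS = ("abstract", "math", "programmatic", "academic")


def composite_iq(dimension_iq: dict[str, float]) -> float:
    """Return the direct (unweighted) average of the four dimension scores."""
    missing = [d for d in DIMENSIONS if d not in dimension_iq]
    if missing:
        raise ValueError(f"missing dimension scores: {missing}")
    return mean(dimension_iq[d] for d in DIMENSIONS)


# Hypothetical dimension scores, for illustration only.
example = {"abstract": 120.0, "math": 130.0, "programmatic": 125.0, "academic": 135.0}
print(composite_iq(example))  # 127.5
```

Because every dimension carries the same weight of 1/4, a 4-point gain in any single dimension moves the composite by exactly 1 point.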

The abstract reasoning dimension is derived from ARC-AGI-1 and ARC-AGI-2, notoriously difficult pattern-recognition benchmarks designed to test general fluid intelligence. Mathematical reasoning includes FrontierMath (Tiers 1-3 and Tier 4), AIME, and ProofBench. Programmatic reasoning uses Terminal-Bench 2.0, SWE-Bench Verified, and SciCode. Academic reasoning draws on Humanity's Last Exam, CritPt, and GPQA Diamond.

Each raw benchmark score is mapped to an implied IQ through what the site describes as a "manually adjusted difficulty curve." Importantly, the method compresses the ceiling of benchmarks that are considered easier or more susceptible to data contamination, preventing those benchmarks from pushing scores above 100. Benchmarks that are harder and less gameable keep higher ceilings. The system also handles missing data conservatively. A model needs scores on at least two of the four dimensions to receive a derived IQ, and when a benchmark is absent, the pipeline deliberately lowers scores instead of raising them. The site states: "All derived IQs average over all four dimensions, so a lack of coverage does not make a model look better by omission."
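The site does not publish its exact curves or its imputation rule, so the following is only a sketch of how such a pipeline could behave. It assumes a piecewise-linear calibration curve whose last anchor acts as the ceiling, and it assumes that a missing dimension is filled with the model's lowest observed dimension score minus a fixed penalty. The anchor points, the penalty, and all names are hypothetical.

```python
from bisect import bisect_right
from statistics import mean

DIMENSIONS = ("abstract", "math", "programmatic", "academic")

# Hypothetical calibration curves: (raw benchmark score in %, implied IQ).
EASY_CURVE = [(0.0, 70.0), (50.0, 90.0), (100.0, 100.0)]   # ceiling compressed to 100
HARD_CURVE = [(0.0, 80.0), (50.0, 130.0), (100.0, 160.0)]  # higher ceiling

MISSING_PENALTY = 10.0  # assumed value, not from the site


def implied_iq(raw_score, curve):
    """Piecewise-linear map from a raw score to an implied IQ, clamped at both ends."""
    xs = [x for x, _ in curve]
    if raw_score <= xs[0]:
        return curve[0][1]
    if raw_score >= xs[-1]:
        return curve[-1][1]  # the last anchor is the ceiling
    i = bisect_right(xs, raw_score)
    (x0, y0), (x1, y1) = curve[i - 1], curve[i]
    return y0 + (y1 - y0) * (raw_score - x0) / (x1 - x0)


def derived_iq(dimension_iq):
    """Average over all four dimensions, filling gaps conservatively.

    Returns None when fewer than two dimensions are covered.
    """
    observed = [dimension_iq[d] for d in DIMENSIONS if d in dimension_iq]
    if len(observed) < 2:
        return None
    fill = min(observed) - MISSING_PENALTY  # assumed imputation rule
    return mean(dimension_iq.get(d, fill) for d in DIMENSIONS)


print(implied_iq(75.0, EASY_CURVE))                  # 95.0
print(implied_iq(100.0, EASY_CURVE))                 # 100.0
print(derived_iq({"abstract": 120.0, "math": 130.0}))  # 117.5
```

In the last call the two observed dimensions average 125, but the derived IQ comes out at 117.5, which illustrates the stated intent that thin coverage should lower a score instead of raising it.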

OpenAI leads the bell curve, but the gap between top AI models is narrower than ever

The AI IQ charts as of mid-May 2026 point to rapid convergence at the top of the frontier and growing diversity below it.

According to the Frontier IQ Over Time chart, OpenAI's GPT-5.5 currently sits at the top of the bell curve, with an estimated IQ of around 136, the highest of any tracked model. It is closely followed by GPT-5.4 (approx. 131), Anthropic's Opus 4.7 (approx. 132), and Opus 4.6 (approx. 129). Google's Gemini 3.1 Pro sits around 131, making the top cluster very tight.

This compression is not unique to AI IQ's framework. Visual Capitalist recently observed a similar pattern in a separate Mensa-based ranking by TrackingAI, noting: "The biggest takeaway is how compressed the top of the leaderboard is." On that scale, Grok-4.20 Expert Mode and GPT 5.4 Pro tied at 145, while Gemini 3.1 Pro scored 141.

The AI IQ chart shows a crowded midfield below the frontier cluster. Models from Chinese labs (Kimi K2.6, GLM-5, DeepSeek-V3.2, Qwen3.6, MiniMax-M2.7) range from around 112 to 118, making the value tier increasingly competitive for enterprise buyers who do not need the best model for every task. One X user, ovsky, said the data appeared to be "confirming the experience that Sonnet 4.6 is an absolute workhorse in contrast to Opus 4.5," an example of how the charts can validate practitioner intuitions that are often overlooked in headline rankings.

Why emotional intelligence is becoming the new battleground in AI model rankings

What sets AI IQ apart from other benchmarking efforts is its "EQ," or emotional intelligence, score. The site maps each model's EQ-Bench 3 Elo score and Arena Elo score to an estimated EQ using an adjusted piecewise-linear scale, then takes a 50/50 weighted composite of the two.

EQ scores produce very different rankings than IQ alone. In the IQ vs. EQ scatter plot, Anthropic's Opus 4.7 leads in EQ with a score of nearly 132, pushing it into the top-right quadrant. That is the most desirable position, indicating both high cognitive intelligence and high emotional intelligence. OpenAI's GPT-5.5 and GPT-5.4 cluster in the high-IQ zone but lag slightly in EQ. Google's Gemini 3.1 Pro holds a strong middle ground on both axes.

One notable methodological choice has drawn attention. EQ-Bench 3 is judged by Claude, an Anthropic model, which the site acknowledges "creates a potential scoring bias in favor of Anthropic models." To correct for this, AI IQ subtracts a 200-point Elo penalty from the EQ-Bench component of all Anthropic models before mapping to implied EQ. The Arena component is not affected because it uses human judges. This kind of self-correction is rare in the world of benchmarking, and it suggests that Shea is aware of the methodological minefield he has entered. Still, the EQ dimension captures something that IQ alone does not: the importance of conversational quality, collaboration, and reliability for models deployed in user-facing tasks.
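The EQ pipeline described above can be sketched roughly as follows. The 200-point penalty and the 50/50 weighting come from the article; the anchor points of the two piecewise-linear scales are invented for illustration, since the site's actual scales are not reproduced here.

```python
from bisect import bisect_right

ANTHROPIC_EQBENCH_PENALTY = 200.0  # Elo points, as described in the article

# Hypothetical anchors: (Elo, implied EQ). The real scales are not published here.
EQBENCH_ANCHORS = [(1000.0, 100.0), (1500.0, 125.0), (2000.0, 140.0)]
ARENA_ANCHORS = [(1200.0, 100.0), (1400.0, 120.0), (1600.0, 130.0)]


def piecewise_linear(x, anchors):
    """Interpolate linearly between anchors, clamping outside their range."""
    xs = [a for a, _ in anchors]
    if x <= xs[0]:
        return anchors[0][1]
    if x >= xs[-1]:
        return anchors[-1][1]
    i = bisect_right(xs, x)
    (x0, y0), (x1, y1) = anchors[i - 1], anchors[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def estimated_eq(eqbench_elo, arena_elo, is_anthropic):
    """50/50 composite of the two implied EQ values.

    The penalty applies only to the EQ-Bench component, because that
    benchmark is judged by an Anthropic model; Arena uses human judges.
    """
    if is_anthropic:
        eqbench_elo -= ANTHROPIC_EQBENCH_PENALTY
    eq_from_bench = piecewise_linear(eqbench_elo, EQBENCH_ANCHORS)
    eq_from_arena = piecewise_linear(arena_elo, ARENA_ANCHORS)
    return 0.5 * eq_from_bench + 0.5 * eq_from_arena


# Same hypothetical Elo ratings, with and without the penalty.
print(estimated_eq(1500.0, 1400.0, is_anthropic=False))  # 122.5
print(estimated_eq(1500.0, 1400.0, is_anthropic=True))   # 117.5
```

Under these assumed anchors, the penalty costs an Anthropic model 10 points on the EQ-Bench component and therefore 5 points on the composite, because only half of the composite is affected.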

The AI cost-performance chart that enterprise buyers should actually look at

Maybe essentially the most helpful graph on this web site shouldn’t be a bell curve, however a scatterplot of IQ versus efficient value. The estimated IQ of every mannequin is "efficient value" Metric — Outlined because the token value of a job that makes use of 2 million enter tokens and 1 million output tokens multiplied by the utilization issue.

The chart reveals a familiar pattern in enterprise technology: the best model does not always offer the best value. GPT-5.5 and Opus 4.7 sit in the top-left corner, with high IQ and high cost, at effective costs per task exceeding $30 and $50, respectively. Meanwhile, models like GPT-5.4-mini, DeepSeek-V3.2, and MiniMax-M2.7 occupy a sweet spot in between, with respectable IQ scores of 112-120 and an effective cost per task of roughly $1-5. At the cheapest end, GPT-oss-20b (an open-source OpenAI model) has an effective cost of about $0.20 and an IQ of about 107, which may make it the most economical option for bulk classification or extraction workloads.

The site also provides a 3D visualization that maps IQ, EQ, and effective cost. The dashed line through the cube represents the ideal direction of higher IQ, higher EQ, and lower cost. Models near the "green end" of that axis offer a stronger overall trade-off; those near the "red end" sacrifice capability, cost efficiency, or both. For CIOs staring at their API bills, the implications are clear. The intelligence gap between a $50 model and a $3 model has narrowed enough that routing hard problems to the expensive model and everything else to the cheap model is no longer optional. It is the primary architecture for AI deployments at scale.
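A routing setup of the kind described above could look roughly like this. It is a sketch under strong assumptions: the difficulty score, the 0.7 threshold, and the tier names are hypothetical, and the $3 and $50 figures simply echo the article's example price points.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str              # hypothetical label, not a real model identifier
    effective_cost: float  # dollars per reference task


def route(difficulty, cheap, expensive, threshold=0.7):
    """Send a task to the expensive tier only when it looks hard enough.

    `difficulty` is assumed to be a score in [0, 1] produced upstream,
    for example by a heuristic or a small classifier.
    """
    return expensive if difficulty >= threshold else cheap


def batch_cost(difficulties, cheap, expensive, threshold=0.7):
    """Total cost of a batch when each task is routed independently."""
    return sum(route(d, cheap, expensive, threshold).effective_cost for d in difficulties)


cheap_tier = ModelTier("cheap-tier", 3.0)
expensive_tier = ModelTier("frontier-tier", 50.0)
tasks = [0.2, 0.9, 0.4, 0.8, 0.1]  # hypothetical difficulty scores

print(batch_cost(tasks, cheap_tier, expensive_tier))         # 109.0
print(len(tasks) * expensive_tier.effective_cost)            # 250.0
```

In this toy batch, routing cuts the bill from $250 to $109. The actual saving depends entirely on how many tasks clear the threshold and on how reliable the difficulty estimate is.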

Critics say AI is too "jagged": a single IQ score can be dangerously misleading

The most vocal opposition to AI IQ is philosophical, and it cuts deep. Critics argue that collapsing a model's uneven capabilities into a single score obscures more than it reveals.

"IQ as a proxy is declining, and we see spikes in reasoning density that do not map to the g factor," technology commentator Zaya posted on X. "GPT-5.5 has already reached saturation on MMLU-Pro, but still fails 50% of the time on ClockBench."

This observation reflects what AI researchers call the "jaggedness" problem: large language models often exhibit markedly uneven performance, excelling at graduate-level physics while failing at tasks even a child can do. Composite scores can paper over these gaps.
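A small numeric illustration of that masking effect: two hypothetical capability profiles with the same unweighted average but very different spreads. All numbers are invented for illustration and do not describe any real model.

```python
from statistics import mean

# Two hypothetical dimension profiles; neither describes a real model.
even_profile = {"abstract": 125.0, "math": 125.0, "programmatic": 125.0, "academic": 125.0}
jagged_profile = {"abstract": 95.0, "math": 150.0, "programmatic": 140.0, "academic": 115.0}


def composite(profile):
    """Unweighted average across dimensions."""
    return mean(profile.values())


def spread(profile):
    """Gap between the strongest and weakest dimension."""
    return max(profile.values()) - min(profile.values())


for label, profile in (("even", even_profile), ("jagged", jagged_profile)):
    print(label, composite(profile), spread(profile))
# even 125.0 0.0
# jagged 125.0 55.0
```

Both profiles report a composite of 125, yet the second hides a 55-point gap between its best and worst dimension. Reporting a spread alongside the composite is one simple way to keep that information visible.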

Another X user, Pressureangle, posted a more detailed critique, citing a "lack of full transparency" and claiming that the site does not fully disclose how its calibration curves were created or validated. To be fair, AI IQ lists all 12 benchmarks and shows the shape of each calibration curve in its methodology modal. However, the raw data and exact mathematical transformations are not published as open datasets. That gap matters to researchers accustomed to fully reproducible methods.

Some questioned the premise itself, calling the exercise "as useless as a human IQ test." Shubham Sharma, an AI and technology writer, offered a constructive alternative: "Why not have your models take an official (MENSA grade) test? Isn't this the most accurate and most 'human-comparable' way to benchmark intelligence?" That approach already exists through TrackingAI, which applies the Mensa Norway IQ test to language models. But while Mensa-style tests measure only abstract pattern recognition, AI IQ attempts a broader composite spanning coding, mathematics, and academic reasoning. As Visual Capitalist pointed out, "IQ-style benchmarks capture only one slice of capability." Each approach has trade-offs, and neither has yet won the debate.

The real race isn't about the highest score, it's about the smartest model stack

Despite all the debate about methodology, the most important signal in the AI IQ data may not be the score of any single model. It is the shape of the market that the charts reveal.

More than 50 frontier-class models are currently available through APIs from at least 14 major providers across the US, China, and Europe. Each provider publishes its own benchmarks, often hand-picked to showcase its strengths. The result is a Tower of Babel in which no two companies measure the same thing in the same way. Academic research has found that "most benchmarks introduce bias by focusing on specific types of domains." AI IQ's Frontier IQ Over Time chart shows how quickly the target is moving. In October 2023, GPT-4 Turbo had an estimated IQ of around 75. By early 2026, the top model was above 135. That is an improvement of about 60 points in roughly 30 months, or about 2 points per month.

This pace raises fundamental questions about whether any scoring system can keep up. The site compresses the ceiling on saturated benchmarks, but if models continue to max out even the hardest tests (ARC-AGI-2, FrontierMath Tier 4, Humanity's Last Exam), the framework will face the same ceiling effect that has plagued every AI evaluation to date. Conor Forsyth pointed to this dynamic on X. "ARC AGI 3 disagrees," he wrote, referring to a next-generation benchmark that may already be undercutting current scores.

AI IQ is not perfect. The methodology is partially opaque. The IQ metaphor can be misleading. And while its creator acknowledges known biases, he may have overlooked others. But the alternative, which is to look at dozens of provider-specific benchmark tables, each with different test suites and scoring rules, is even worse. The site offers enterprise buyers something rare: a single framework for comparing models across providers, dimensions, and price points, updated regularly and with enough nuance to show that the right answer to "Which model is best?" is almost always "It depends on the task."

Debdoot Ghosh wondered on X after looking at the charts: "Is the role of humans now only to coordinate?"

Perhaps. But if the AI IQ data shows us anything clearly, it is that orchestration (knowing which models to deploy, when, and at what price) is itself a form of intelligence. And there are no benchmarks for that yet.
