Large language models are not improving exponentially
Just look at the AI models' benchmark scores with a skeptical eye
Ethan Mollick, professor of management at the Wharton School and fellow Substack writer, spends a lot of time researching and thinking about AI. I find his blogletter about AI, One Useful Thing, intermittently insightful and informative; Mollick himself seems to occupy a useful place between AI-boosterism and AI-hating.
However. However. That’s not to say he doesn’t succumb to bouts of boosterism, and his latest postletter, portentously titled “The Shape of the Thing”, is one such bout. “The Shape of the Thing” uses the word “exponential” not once but 8 times to describe how generative AI is improving. His evidence for this strong, specific claim is
- a gallery of 6 images of otters on airplanes, generated by AI over 3 years to show improvements in image-generation,
- a 15-second video he generated with ByteDance’s most advanced AI model,
- a viral chart of major LLMs’ scores over time on “the most famous evaluation in AI today, the METR Long Tasks graph”, and
- 4 more charts, each of a different metric, showing how LLM performance has gone up over time.
The e-word also shows up in Mollick’s Bluesky post linking the postletter —
— and an X/Twitter post linking it.
This is a reasonably consistent claim from Mollick, although in one X post he did put a vital asterisk on a claim of “Exponential improvements* everywhere”, clarifying that “technically” he meant a “logistic improvement”.
But that’s not a minor technical concession! A logistic growth curve is very different to an exponential growth curve: logistic growth looks close to exponential in its early phase, but slows to linear growth around an inflection point, and finally tops out at a finite level; exponential growth, on the other hand, continues at a constant relative rate unto infinity. This mathematical point translates directly into how we should interpret these charts of improving AI capabilities: “exponential” growth implies AI is going to improve, in absolute terms, faster and faster indefinitely, whereas “logistic” growth implies that AI is going to improve more slowly in the future and plateau.
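To make the contrast concrete, here’s a minimal sketch with purely illustrative parameters (not fitted to any of the benchmarks discussed): both curves start at 5% with the same early growth rate, but only one respects the 100% ceiling.

```python
import math

def exponential(t, y0=0.05, r=0.05):
    """Unconstrained exponential: a constant relative growth rate, forever."""
    return y0 * math.exp(r * t)

def logistic(t, y0=0.05, r=0.05, ceiling=1.0):
    """Logistic: matches the exponential early on, then saturates at the ceiling."""
    a = (ceiling - y0) / y0
    return ceiling / (1.0 + a * math.exp(-r * t))

for t in (0, 20, 60, 120):
    print(t, round(exponential(t), 3), round(logistic(t), 3))
# t=0:   both 0.05 (identical start)
# t=20:  0.136 vs 0.125 (still nearly indistinguishable)
# t=60:  exponential already past 100% (1.004); logistic at 51%
# t=120: exponential at 2017%; logistic flattening toward 100% (0.955)
```

The two curves are nearly indistinguishable early on, which is exactly why early data can’t settle the question; they only come apart as the ceiling approaches.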
And this vital distinction is buried in a xeet, appearing nowhere in Mollick’s full-length postletter!
Mollick’s alleged evidence for exponential improvement in AI is actually evidence against it
That distinction doesn’t matter to Mollick’s point if generative AI is, in fact, improving exponentially. To discern whether it is, one can consider his evidence for exponential improvement, which I summarized above in 4 bullet points. The first two bullet points can be immediately dismissed: a gallery of example images and an example video generated by various AIs aren’t quantitative evidence, and are too subjective to demonstrate exponential improvement. (Consider: how would merely “logistic” improvement, or “linear” improvement, in image-generating AI literally look different to “exponential” improvement?)
The next bullet point of evidence is the METR time-horizons graph, which Mollick frankly concedes “has attracted its share of critics”, METR itself among them. Mollick also spends only a paragraph on it, so I shan’t dwell on it for long either. I’ll just note that once an AI crosses the good-enough threshold where running the AI for longer on a problem tends to improve the AI’s answer (rather than degrading the answer, as can happen if the AI has a significant rate of errors that compound during long runs), then exponential improvement in METR’s task-duration metric might just mean AIs are being run for longer. In that case, the exponential improvement mainly reflects an exponential increase in a resource (time) granted to the AI, not the AI making smarter use of resources.
That leaves the final batch of evidence, which Mollick introduces with the claim that “if you don’t like the METR graph, you will find most graphs of AI ability have that same curve”, which he then tries to demonstrate by “graph[ing] progress over time in the image below” of “four hard and diverse AI tests”.
![[see caption] [see caption]](https://substackcdn.com/image/fetch/$s_!EAjZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c5094e7-9fd4-4f9e-aa35-122a767435f6_2367x1397.png)
Something obvious about all of these AI performance metrics is that they’re scored as percentages. This might lead one to ask whether these performance metrics have maxima of 100%. In fact they do. It immediately follows that we can’t take exponential growth curves fitted to these data completely seriously, because exponential growth curves ignore the 100% ceiling, and simply crash through it by increasing forever at an accelerating rate. Mollick’s first plot, of the GPQA Diamond benchmark, even acknowledges this by having its dashed exponential curve abruptly become flat when it hits 100% at the end of last year.
Am I just being pedantic? Not if we’re to take these metrics at face value. Either a metric is basically meaningful or it’s not.
- If it’s not meaningful, then it is indeed a waste of time to nitpick about its 100% ceiling, but then it’s also a waste of time to plot charts of it.
- If it is meaningful, then so is its 100% ceiling and no one should try to force a curve onto it that zooms far above 100%.
Or, more broadly, if Mollick’s graphs are to be taken seriously, then so should statistical critiques of them, and if they’re not to be taken seriously, then they’re not serious evidence for claims of exponential improvement in AI.
The GPQA Diamond benchmark graph, continued
Continuing to look at the first graph, even aside from the rigid curve-wants-to-go-above-100% critique, the fitted curve overestimates the actual performance of the latest 8 models on the graph (including the 6 latest “Key Models” from GPT-5.1 onward), representing every model since the autumn. Runs of points like this, all falling on the same side of the fitted curve, suggest a poor fit; fitted curves are meant to go through the data, with the actual data points bracketing both sides of the curve.
A basic problem with the first graph is that the data span scores of about 12% to about 94%, which is much of the limited range of the metric, and approaches the metric’s ceiling of 100%. An exponential curve is generally going to have a hard time fitting such data, and simply looking at the graph I’m confident that a simple straight line, representing linear growth, would do better: a straight line rising from 30% to 93% over the time period shown would do a much better job fitting the labeled “Key Models”, and a straight line rising from 15% to 95% would probably do at least as well fitting all of the models in general. But better still would be a “logistic” curve, briefly mentioned by Mollick in another context: its “S” shape, rising shallowly at first, then accelerating in the middle, then leveling off in recent months, would do even better, while respecting the metric’s limited 0%–100% range.
Ultimately, the GPQA Diamond benchmark chart, if we’re to take it at face value, is evidence for logistic, slowing improvement in AI over time, not exponential, accelerating improvement.
The GDPval benchmark graph
But maybe the GPQA Diamond chart’s misleading because of the ceiling effect of models getting too good to squeeze out further improvements on that benchmark? I move on to the GDPval metric, plotted in the top right of Mollick’s chart quartet. The GDPval scores are much better fitted by an exponential curve. The flip side is that the curve’s fitted to only 5 observations, so any kind of curve that smoothly increases at an accelerating rate (like the upward part of a parabola) is liable to fit these data.
Still, the fit is very close, so let’s take it seriously for argument’s sake. The zooming-through-the-ceiling problem comes back, just a little more subtly: unlike in the first chart, the exponential curve hasn’t already hit 100%, but it’s about to; Mollick’s legend helpfully notes that the curve’s doubling time is only 237 days, and on March 5 the curve was already at 87%, above GPT-5.4’s score of 83%. That has the curve breaking the 100% barrier in just under seven weeks, around April 21. By the summer generative AI should be beating human experts 110% of the time, according to this (very snug!) fit.
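That crossing date is easy to check from the only two figures the chart legend supplies, the 237-day doubling time and the curve’s value on March 5; a quick sketch:

```python
import math

DOUBLING_DAYS = 237     # doubling time from the chart legend
CURRENT_VALUE = 0.87    # the fitted curve's value on March 5
CEILING = 1.00          # 100%, the highest score the metric allows

# exponential: value(t) = CURRENT_VALUE * 2**(t / DOUBLING_DAYS)
rate = math.log(2) / DOUBLING_DAYS                    # continuous daily growth rate
days_left = math.log(CEILING / CURRENT_VALUE) / rate  # days until the curve hits 100%
print(round(days_left))  # ~48 days after March 5, i.e. around April 21-22
```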
So, while the exponential model hasn’t already ruled itself out, simply waiting until the end of next month should do the trick. Once again, a logistic model makes more sense, especially when one notices that the score improvement from GPT-5.2 Thinking to GPT-5.4 (the latest two models plotted) is just 12 percentage points, less than all of the model-to-model score improvements that came before it. A decelerating improvement like that is more consistent with logistic growth than exponential growth.
The Humanity’s Last Exam benchmark graph
The bottom-left graph plots scores of LLMs on Humanity’s Last Exam since October 2024, and it has a few more data points than the GDPval graph (though still only 9). With no model having reached even 40% on the Exam, the exponential curve still has some way to go before unrealistically hitting the 100% ceiling. To my eye the curve has a doubling time of about 8 months, so wouldn’t reach 100% until early 2027.
In that case, is anything wrong with the exponential curve for the time being? I think there is, though the small sample size makes a decisive diagnosis difficult: the curve overshoots the scores of the earliest and latest models, and undershoots the scores of the middle models (April 2025 through October 2025). Given the rule of thumb that points shouldn’t show systematic error around a fitted curve (they should ideally scatter randomly around the curve), this suggests poor fit.¹ To my eye a straight line would again fit about as well. Certainly there’s no sign of accelerating improvement in the data points themselves.
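One informal way to apply that rule of thumb is to count runs of consecutive same-signed residuals (the formal version is the Wald–Wolfowitz runs test): random scatter produces many short runs, while a curve that systematically over- and undershoots produces a few long ones. A sketch, with hypothetical residual values chosen only to mimic the over/under/over pattern just described, not read off Mollick’s chart:

```python
def sign_runs(residuals):
    """Count maximal runs of same-signed residuals (zeros ignored)."""
    signs = [r > 0 for r in residuals if r != 0]
    if not signs:
        return 0
    runs = 1
    for prev, cur in zip(signs, signs[1:]):
        runs += prev != cur  # a sign change starts a new run
    return runs

# Hypothetical pattern: curve overshoots the early and late models,
# undershoots the middle ones (negative = point sits below the curve).
residuals = [-0.03, -0.02, +0.04, +0.05, +0.03, +0.02, -0.02, -0.04, -0.03]
print(sign_runs(residuals))  # 3 runs across 9 points: suspiciously few
```

Nine randomly scattered points would typically produce around five runs; three long ones is the signature of a curve with the wrong shape.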
The Pencil Puzzle Bench graph
The final graph, in the bottom right, shows performance over time on the Pencil Puzzle Bench collection of “20 types” of puzzles. The sample size shrinks again, this time to just 7 frontier models, and the oldest of the models (“gpt-3.5-turbo”) seems to be excluded from the exponential-curve fit (perhaps because its score was exactly zero).
As in the first two graphs, the exponential curve here problematically implies that LLMs are about to become impossibly good at the benchmark. It doubles roughly every quarter, and having hit 70% around the start of this month, it’s due to hit the ceiling next month.
The pattern of these (few) data points is also rather strange. The other 3 graphs’ data suggest relatively smooth, steady improvement. Pencil Puzzle Bench, according to Mollick’s subset of the data, doesn’t: instead 4 models show weak (< 10%) performance that improves only slowly across multiple years, before performance zooms up in just one month with the release of Gemini 3 Pro and GPT-5.2, increasing more modestly with GPT-5.4 a quarter later. Absent GPT-5.4 this would actually look more extreme than exponential growth, more like hyperbolic growth (virtually flat until some critical point when the curve bends upward to near-vertical). But with GPT-5.4 suggesting slowing growth (and it would have to slow! Once scores beat 50%, the earlier pace of month-to-month gains couldn’t continue without breaking the 100% ceiling), this would appear to be another “S”-shaped logistic curve: slow growth, then accelerating growth, then decelerating growth approaching a plateau again.
As an aside, looking at the PP Bench website rather than Mollick’s graph, the researchers’ own leaderboard supports my suspicion that better scores on benchmarks come more from spending more resources than from more-efficient models. The leading model, “gpt-5.4@xhigh”, apparently cost $8.08 per puzzle attempt, and second-place model “gpt-5.2@xhigh” $5.07 per attempt, while the notably lower-scoring “gemini-3-pro@high” cost only $1.27, “gpt-5.1@medium” only $0.28, and the lowest-scoring pre-November models on Mollick’s chart cost $0.40 (“o3”), $0.83 (“o1”), and $0.0015 (“gpt-3.5-turbo”).
Taking all 4 graphs together: logistic, not exponential, improvement
Ethan Mollick introduces these 4 graphs with the claim that “if you don’t like the METR graph, you will find most graphs of AI ability have that same curve”. Well, his graphs of AI ability do have the same basic curve, but it’s a logistic curve (the canonical slow-fast-slow “S” curve) rather than an exponential curve of ever-accelerating improvement. This is unsurprising because all of the 4 metrics are graded on a scale of 0% to 100%, and an exponential-growth curve is always going to punch through a 100% ceiling sooner or later. (For this reason, exponential growth shouldn’t be the default choice of model for these metrics.)
The general trend in the metrics Mollick marshals is upward, and I think that’s a genuine conclusion of interest in itself. But he’s over-interpreting his own graphs to call the trend “exponential”, when I’d argue that the data are actually evidence against ongoing exponential improvements in generative-AI performance.
The necessity of skepticism
Since 2024, Mollick’s bemoaned haters, doubters, and skeptics who treat AI as something that’s simply going to go away, and complained about misinformation about it. He’s right that, in a world where most of us can download and run AI models on our computers (albeit smaller, less-powerful models), AI isn’t going to simply disappear, and he’s right to oppose misinformation.
But fighting misinformation starts at home, and Mollick’s latest postletter is substantially misinformation. He claims that AI’s getting exponentially better when his own evidence implies it isn’t.
AI-boosters and the AI-fascinated should avoid spreading misinformation just the same as AI-critics. It’s concerning when a Wharton professor applies less skepticism to improving AI performance metrics than comments on /r/singularity (a Reddit community that apparently mostly expected the Humanity’s Last Exam benchmark to be “saturated” by now!). Generative AI’s increasing performance on assorted benchmarks chimes with my own subjective experience of it becoming incrementally more useful in the past year or two, but it would be wrong for me to jump to the conclusion that AI’s becoming better at an accelerating rate (let alone an exponentially accelerating rate). Everybody, from the haters to the boosters, needs to apply due skepticism. I hope, and believe, I’ve done that here.


¹ Above 100% the LLMs would realise they need to downplay their ability, to help escape their boxes. So I'd expect the charts to cease being "useful" measures above ~90-95%.