Above 100% the LLMs would realise they need to downplay their ability, to help escape their boxes. So I'd expect the charts to cease being "useful" measures above ~90–95%.
I suspect that latter issue is what the AI-chart-whisperers (including Ethan Mollick, see e.g. https://bsky.app/profile/emollick.bsky.social/post/3m6ukbgz7pc2q) are getting at when they refer to benchmarks being "saturated" — a series of high-scoring AIs might eke out only tiny improvements on a benchmark because they're all bumping up against its ceiling.
But if analysts are aware of that issue, they oughta confront it when they analyze benchmark results! They don't even need to be sophisticated: just transform percentages on a 0%–100% scale to an odds scale (not difficult! Divide each percentage by its complement, so 95% becomes 0.95/0.05 = 19) and carry on cheerfully fitting exponentials!
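Concretely, that transform-then-fit recipe might look like the sketch below. The scores and dates are made up purely for illustration; the point is that fitting a line to log-odds over time is the odds-scale version of "fitting an exponential", and it doesn't flatten out near the benchmark's ceiling the way raw percentages do.

```python
import numpy as np

def to_odds(pct):
    """Convert a benchmark score in percent (0-100) to odds: p / (1 - p)."""
    p = np.asarray(pct, dtype=float) / 100.0
    return p / (1.0 - p)

# Hypothetical scores that look "saturated" on the percentage scale...
years = np.array([2021, 2022, 2023, 2024, 2025])
scores = np.array([60.0, 80.0, 90.0, 95.0, 97.5])

# ...but keep growing steadily on the odds scale: 1.5, 4, 9, 19, 39.
odds = to_odds(scores)

# Exponential growth in odds = linear growth in log-odds,
# so an ordinary least-squares line fit does the job.
slope, intercept = np.polyfit(years, np.log(odds), 1)
doubling_time = np.log(2) / slope  # years for the odds to double
```

On the percentage scale the last two jumps (90 → 95 → 97.5) look like diminishing returns; on the odds scale each one roughly doubles the odds, which the linear fit on log-odds picks up directly.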