Above 100% the LLMs would realise they need to downplay their ability, to help escape their boxes. So I'd expect the charts to cease being "useful" measures above ~90–95%.
I suspect that latter issue is what the AI-chart-whisperers (including Ethan Mollick, see e.g. https://bsky.app/profile/emollick.bsky.social/post/3m6ukbgz7pc2q) are getting at when they refer to benchmarks being "saturated" — a series of high-scoring AIs might eke out only tiny improvements on a benchmark because they're all bumping up against its ceiling.
But if analysts are aware of that issue, they oughta confront it when they analyze benchmark results! They don't even need to be sophisticated: just transform percentages on a 0%–100% scale to an odds scale (not difficult! Divide each percentage by its complement, so 95% becomes 0.95/0.05 = 19) and carry on cheerfully fitting exponentials!
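Concretely, that transform-then-fit recipe might look like the sketch below. The scores and dates are made up purely for illustration; the point is that fitting a line to log-odds over time is the odds-scale version of "fitting an exponential", and it doesn't flatten out near the benchmark's ceiling the way raw percentages do.

```python
import numpy as np

def to_odds(pct):
    """Convert a benchmark score in percent (0-100) to odds: p / (1 - p)."""
    p = np.asarray(pct, dtype=float) / 100.0
    return p / (1.0 - p)

# Hypothetical scores that look "saturated" on the percentage scale...
years = np.array([2021, 2022, 2023, 2024, 2025])
scores = np.array([60.0, 80.0, 90.0, 95.0, 97.5])

# ...but keep growing steadily on the odds scale: 1.5, 4, 9, 19, 39.
odds = to_odds(scores)

# Exponential growth in odds = linear growth in log-odds,
# so an ordinary least-squares line fit does the job.
slope, intercept = np.polyfit(years, np.log(odds), 1)
doubling_time = np.log(2) / slope  # years for the odds to double
```

On the percentage scale the last two jumps (90 → 95 → 97.5) look like diminishing returns; on the odds scale each one roughly doubles the odds, which the linear fit on log-odds picks up directly.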