The reported $100 billion profit threshold we mentioned earlier conflates commercial success with cognitive capability, as if a system's ability to generate revenue says anything meaningful about whether it can "think," "reason," or "understand" the world like a human.
Depending on your definition, we may already have AGI, or it may be physically impossible to achieve. If you define AGI as "AI that performs better than most humans at most tasks," then current language models arguably meet that bar for certain types of work (which tasks, which humans, what counts as "better"?), but agreement on whether that's true is far from universal. This says nothing of the even murkier concept of "superintelligence": another nebulous term for a hypothetical, god-like intellect so far beyond human cognition that, like AGI, it defies any solid definition or benchmark.
Given this definitional chaos, researchers have tried to create objective benchmarks to measure progress toward AGI, but these attempts have revealed their own set of problems.
Why benchmarks keep failing us
The search for better AGI benchmarks has produced some interesting alternatives to the Turing Test. The Abstraction and Reasoning Corpus (ARC-AGI), introduced in 2019 by François Chollet, tests whether AI systems can solve novel visual puzzles that demand deep analytical reasoning.
"Almost all current AI benchmarks can be solved purely via memorization," Chollet told Freethink in August 2024. A major problem with AI benchmarks today stems from data contamination: when test questions end up in training data, models can appear to perform well without truly "understanding" the underlying concepts. Large language models serve as master imitators, mimicking patterns found in their training data, but they do not always originate novel solutions to problems.
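To make the contamination idea concrete, here is a minimal sketch of an n-gram overlap check, one common heuristic for spotting test items that leaked into a training corpus. This is an illustration only, not the method used by any particular lab or benchmark; the function names and the tiny example corpus are invented for this sketch.

```python
def ngrams(text, n=8):
    """Return the set of n-word sequences in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_score(test_item, training_corpus, n=8):
    """Fraction of the test item's n-grams that appear verbatim in the
    training corpus. High overlap suggests a model could score well on
    this item by memorization rather than reasoning."""
    test_grams = ngrams(test_item, n)
    if not test_grams:
        return 0.0
    corpus_grams = set()
    for doc in training_corpus:
        corpus_grams |= ngrams(doc, n)
    return len(test_grams & corpus_grams) / len(test_grams)

# A toy corpus with one document, and two candidate test items.
corpus = ["the quick brown fox jumps over the lazy dog near the river bank"]
leaked = "the quick brown fox jumps over the lazy dog near the river bank"
novel = "a completely unseen puzzle about rotating colored grids and symmetry rules here"

print(contamination_score(leaked, corpus))  # 1.0: every n-gram was seen in training
print(contamination_score(novel, corpus))   # 0.0: no overlap with training data
```

Real contamination audits are fuzzier than this (paraphrases, translations, and partial overlaps all evade exact n-gram matching), which is part of why memorization is so hard to rule out.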
But even sophisticated benchmarks like ARC-AGI face a fundamental problem: They're still trying to reduce intelligence to a score. And while improved benchmarks are essential for measuring empirical progress in a scientific framework, intelligence isn't a single thing you can measure like height or weight; it's a complex constellation of abilities that manifest differently in different contexts. Indeed, we don't even have a complete functional definition of human intelligence, so defining artificial intelligence by any single benchmark score is likely to capture only a small part of the full picture.