WWW.NEWSCIENTIST.COM
OpenAI's o3 model aced a test of AI reasoning, but it's still not AGI
OpenAI announced a breakthrough achievement for its new o3 AI model (Image: Rokas Tenys/Alamy)

OpenAI's new o3 artificial intelligence model has achieved a breakthrough high score on a prestigious AI reasoning test called the ARC Challenge, inspiring some AI fans to speculate that o3 has achieved artificial general intelligence (AGI). But even as ARC Challenge organisers described o3's achievement as a major milestone, they also cautioned that it has not won the competition's grand prize, and that it is only one step on the path towards AGI, a term for hypothetical future AI with human-like intelligence.

The o3 model is the latest in a line of AI releases that follow on from the large language models powering ChatGPT. "This is a surprising and important step-function increase in AI capabilities, showing novel task adaptation ability never seen before in the GPT-family models," said François Chollet, an engineer at Google and the main creator of the ARC Challenge, in a blog post.

What did OpenAI's o3 model actually do?

Chollet designed the Abstraction and Reasoning Corpus (ARC) Challenge in 2019 to test how well AIs can find correct patterns linking pairs of coloured grids. Such visual puzzles are intended to make AIs demonstrate a form of general intelligence with basic reasoning capabilities. But throwing enough computing power at the puzzles could let even a non-reasoning program simply solve them through brute force. To prevent this, the competition also requires official score submissions to meet certain limits on computing power.

OpenAI's newly announced o3 model, which is scheduled for release in early 2025, achieved its official breakthrough score of 75.7 per cent on the ARC Challenge's semi-private test, which is used for ranking competitors on a public leaderboard. The computing cost of its achievement was approximately $20 for each visual puzzle task, meeting the competition's limit of less than $10,000 total.
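The grid-puzzle format can be pictured with a toy sketch (this is an illustrative example, not a real ARC task, and the rule shown is far simpler than anything in the benchmark): each task presents a few input-output grid pairs, and a solver must infer the hidden transformation and apply it to a new input.

```python
# Illustrative sketch of the ARC task format (not an actual ARC puzzle):
# grids are small matrices of colour codes, and a task shows a few
# input -> output demonstration pairs plus a test input to solve.

def flip_horizontal(grid):
    """A candidate transformation: mirror each row left-to-right."""
    return [list(reversed(row)) for row in grid]

# Two demonstration pairs that share the same hidden rule.
train_pairs = [
    ([[1, 0], [2, 3]], [[0, 1], [3, 2]]),
    ([[4, 4, 0], [0, 5, 5]], [[0, 4, 4], [5, 5, 0]]),
]

# A solver must find a rule consistent with every demonstration...
assert all(flip_horizontal(inp) == out for inp, out in train_pairs)

# ...and then apply it to the unseen test input.
test_input = [[7, 0, 0], [0, 7, 0]]
print(flip_horizontal(test_input))  # [[0, 0, 7], [0, 7, 0]]
```

Real ARC tasks use larger grids and much less obvious rules, and the competition's compute limits cap how much search a solver may spend inferring each one.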
However, the harder private test that is used to determine grand prize winners has an even more stringent computing power limit, equivalent to spending just 10 cents on each task, which OpenAI did not meet.

The o3 model also achieved an unofficial score of 87.5 per cent by applying approximately 172 times more computing power than it did for the official score. For comparison, the typical human score is 84 per cent, and an 85 per cent score is enough to win the ARC Challenge's $600,000 grand prize if the model can also keep its computing costs within the required limits.

But to reach its unofficial score, o3's cost soared to thousands of dollars spent solving each task. OpenAI requested that the challenge organisers not publish the exact computing costs.

Does this o3 achievement show that AGI has been reached?

No, the ARC Challenge organisers have specifically said they do not consider beating this competition benchmark to be an indicator of having achieved AGI.

The o3 model also failed to solve more than 100 visual puzzle tasks, even when OpenAI applied a very large amount of computing power toward the unofficial score, said Mike Knoop, an ARC Challenge organiser at software company Zapier, in a social media post on X.

In a social media post on Bluesky, Melanie Mitchell at the Santa Fe Institute in New Mexico said the following about o3's progress on the ARC benchmark: "I think solving these tasks by brute-force compute defeats the original purpose."

"While the new model is very impressive and represents a big milestone on the way towards AGI, I don't believe this is AGI; there's still a fair number of very easy [ARC Challenge] tasks that o3 can't solve," said Chollet in another X post.

However, Chollet described how we might know when human-level intelligence has been demonstrated by some form of AGI.
"You'll know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible," he said in the blog post.

Thomas Dietterich at Oregon State University suggests another way to recognise AGI. "Those architectures claim to include all of the functional components required for human cognition," he says. "By this measure, the commercial AI systems are missing episodic memory, planning, logical reasoning and, most importantly, meta-cognition."

So what does o3's high score really mean?

The o3 model's high score comes as the tech industry and AI researchers have been reckoning with a slower pace of progress in the latest AI models for 2024, compared with the initial explosive developments of 2023.

Although it did not win the ARC Challenge, o3's high score indicates that AI models could beat the competition benchmark in the near future. Beyond its unofficial high score, Chollet says many official low-compute submissions have already scored above 81 per cent on the private evaluation test set.

Dietterich also thinks that this is "a very impressive leap in performance". However, he cautions that, without knowing more about how OpenAI's o1 and o3 models work, it is impossible to evaluate just how impressive the high score is. For instance, if o3 was able to practise the ARC problems in advance, that would make its achievement easier. "We will need to await an open-source replication to understand the full significance of this," says Dietterich.

The ARC Challenge organisers are already looking to launch a second and more difficult set of benchmark tests sometime in 2025. They will also keep the ARC Prize 2025 challenge running until someone achieves the grand prize and open-sources their solution.