These researchers used NPR Sunday Puzzle questions to benchmark AI reasoning models
techcrunch.com
Every Sunday, NPR host Will Shortz, The New York Times' crossword puzzle guru, gets to quiz thousands of listeners in a long-running segment called the Sunday Puzzle. While written to be solvable without too much foreknowledge, the brainteasers are usually challenging even for skilled contestants.

That's why some experts think they're a promising way to test the limits of AI's problem-solving abilities.

In a new study, a team of researchers hailing from Wellesley College, Oberlin College, the University of Texas at Austin, Northeastern University, and startup Cursor created an AI benchmark using riddles from Sunday Puzzle episodes. The team says their test uncovers surprising insights, like that so-called reasoning models, OpenAI's o1 among them, sometimes "give up" and provide answers they know aren't correct.

"We wanted to develop a benchmark with problems that humans can understand with only general knowledge," Arjun Guha, a computer science professor at Northeastern and one of the co-authors of the study, told TechCrunch.

The AI industry is in a bit of a benchmarking quandary at the moment. Most of the tests commonly used to evaluate AI models probe for skills, like competency on PhD-level math and science questions, that aren't relevant to the average user. Meanwhile, many benchmarks, even ones released relatively recently, are quickly approaching the saturation point.

The advantage of a public radio quiz game like the Sunday Puzzle is that it doesn't test for esoteric knowledge, and the challenges are phrased such that models can't draw on rote memory to solve them, explained Guha.

"I think what makes these problems hard is that it's really difficult to make meaningful progress on a problem until you solve it; that's when everything clicks together all at once," Guha said. "That requires a combination of insight and a process of elimination."

No benchmark is perfect, of course. The Sunday Puzzle is U.S.-centric and English-only. And because the quizzes are publicly available, it's possible that models trained on them can "cheat" in a sense, although Guha says he hasn't seen evidence of this.

"New questions are released every week, and we can expect the latest questions to be truly unseen," he added. "We intend to keep the benchmark fresh and track how model performance changes over time."

On the researchers' benchmark, which consists of around 600 Sunday Puzzle riddles, reasoning models such as o1 and DeepSeek's R1 far outperform the rest. Reasoning models thoroughly fact-check themselves before giving out results, which helps them avoid some of the pitfalls that normally trip up AI models. The trade-off is that reasoning models take a little longer to arrive at solutions, typically seconds to minutes longer.

At least one model, DeepSeek's R1, gives solutions it knows to be wrong for some of the Sunday Puzzle questions. R1 will state verbatim that it "gives up," followed by an incorrect answer chosen seemingly at random, behavior this human can certainly relate to.

The models make other bizarre choices, like giving a wrong answer only to immediately retract it, attempt to tease out a better one, and fail again. They also get stuck "thinking" forever, give nonsensical explanations for answers, or arrive at a correct answer right away but then go on to consider alternative answers for no obvious reason.

"On hard problems, R1 literally says that it's getting 'frustrated,'" Guha said. "It was funny to see how a model emulates what a human might say."
It remains to be seen how "frustration" in reasoning can affect the quality of model results.

R1 getting "frustrated" on a question in the Sunday Puzzle challenge set. Image Credits: Guha et al.

The current best-performing model on the benchmark is o1 with a score of 59%, followed by the recently released o3-mini set to high reasoning effort (47%). (R1 scored 35%.) As a next step, the researchers plan to broaden their testing to additional reasoning models, which they hope will help to identify areas where these models might be enhanced.

The scores of the models the team tested on their benchmark. Image Credits: Guha et al.

"You don't need a PhD to be good at reasoning, so it should be possible to design reasoning benchmarks that don't require PhD-level knowledge," Guha said. "A benchmark with broader access allows a wider set of researchers to comprehend and analyze the results, which may in turn lead to better solutions in the future. Furthermore, as state-of-the-art models are increasingly deployed in settings that affect everyone, we believe everyone should be able to intuit what these models are and aren't capable of."
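For readers curious what scoring a quiz-style benchmark like this typically involves, here is a minimal, illustrative sketch in Python. It is not the researchers' harness; the riddle format, the `ask_model` callable, and the exact-match scoring are assumptions made purely for illustration.

```python
# Illustrative sketch only: not the researchers' actual evaluation code.
# Assumes each riddle is a question/answer pair and that `ask_model`
# wraps whatever API call returns a model's final answer as a string.

def normalize(text: str) -> str:
    """Lowercase and strip punctuation so formatting quirks aren't scored as errors."""
    return "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace()).strip()

def score_benchmark(riddles, ask_model):
    """Return the fraction of riddles the model answers exactly right."""
    correct = 0
    for item in riddles:
        prediction = ask_model(item["question"])  # hypothetical model call
        if normalize(prediction) == normalize(item["answer"]):
            correct += 1
    return correct / len(riddles)

if __name__ == "__main__":
    # Toy riddle set and a stub "model" standing in for a real API.
    toy_riddles = [
        {"question": "Name a fruit that is also a color.", "answer": "orange"},
    ]
    print(score_benchmark(toy_riddles, lambda q: "Orange"))  # -> 1.0
```

In practice an accuracy number like o1's 59% comes from running a loop of this shape over the full question set, though the actual study may grade answers more carefully than simple exact matching.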