TECHCRUNCH.COM
Will Smith eating spaghetti and other weird AI benchmarks that took off in 2024
When a company releases a new AI video generator, it's not long before someone uses it to make a video of actor Will Smith eating spaghetti. It's become something of a meme as well as a benchmark: seeing whether a new video generator can realistically render Smith slurping down a bowl of noodles. Smith himself parodied the trend in an Instagram post in February.

Will Smith and pasta is but one of several bizarre unofficial benchmarks to take the AI community by storm in 2024. A 16-year-old developer built an app that gives AI control over Minecraft and tests its ability to design structures. Elsewhere, a British programmer created a platform where AIs play games like Pictionary and Connect 4 against each other. It's not like there aren't more academic tests of an AI's performance. So why did the weirder ones blow up?

Image Credits: Paul Calcraft

For one, many of the industry-standard AI benchmarks don't tell the average person very much. Companies often cite their AIs' ability to answer questions on Math Olympiad exams, or figure out plausible solutions to Ph.D.-level problems. Yet most people, yours truly included, use chatbots for things like responding to emails and basic research.

Crowdsourced industry measures aren't necessarily better or more informative. Take, for example, Chatbot Arena, a public benchmark many AI enthusiasts and developers follow obsessively. Chatbot Arena lets anyone on the web rate how well AI performs on particular tasks, like creating a web app or generating an image. But raters tend not to be representative (most come from AI and tech industry circles) and cast their votes based on personal, hard-to-pin-down preferences.

The Chatbot Arena interface. Image Credits: LMSYS

Ethan Mollick, a professor of management at Wharton, recently pointed out in a post on X another problem with many AI industry benchmarks: they don't compare a system's performance to that of the average person.
"The fact that there are not 30 different benchmarks from different organizations in medicine, in law, in advice quality, and so on is a real shame, as people are using systems for these things, regardless," Mollick wrote.

Weird AI benchmarks like Connect 4, Minecraft, and Will Smith eating spaghetti are most certainly not empirical, or even all that generalizable. Just because an AI nails the Will Smith test doesn't mean it'll generate, say, a burger well.

Note the typo; there's no such model as Claude 3.6 Sonnet. Image Credits: Adonis Singh

One expert I spoke to about AI benchmarks suggested that the AI community focus on the downstream impacts of AI instead of its ability in narrow domains. That's sensible. But I have a feeling that weird benchmarks aren't going away anytime soon. Not only are they entertaining (who doesn't like watching AI build Minecraft castles?) but they're easy to understand. And as my colleague Max Zeff wrote about recently, the industry continues to grapple with distilling a technology as complex as AI into digestible marketing.

The only question in my mind is: Which odd new benchmarks will go viral in 2025?