
This Week in AI: Maybe we should ignore AI benchmarks for now
techcrunch.com
Welcome to TechCrunch's regular AI newsletter! We're going on hiatus for a bit, but you can find all our AI coverage, including my columns, our daily analysis, and breaking news stories, at TechCrunch. If you want those stories and much more in your inbox every day, sign up for our daily newsletter here.

This week, billionaire Elon Musk's AI startup, xAI, released its latest flagship AI model, Grok 3, which powers the company's Grok chatbot apps. Trained on around 200,000 GPUs, the model beats a number of other leading models, including from OpenAI, on benchmarks for mathematics, programming, and more.

But what do these benchmarks really tell us? Here at TC, we often reluctantly report benchmark figures because they're one of the few (relatively) standardized ways the AI industry measures model improvements. Popular AI benchmarks tend to test for esoteric knowledge and give aggregate scores that correlate poorly to proficiency on the tasks most people care about.

As Wharton professor Ethan Mollick pointed out in a series of posts on X after Grok 3's unveiling Monday, there's an urgent need for better batteries of tests and independent testing authorities. AI companies self-report benchmark results more often than not, as Mollick alluded to, making those results even tougher to accept at face value.

"Public benchmarks are both meh and saturated, leaving a lot of AI testing to be like food reviews, based on taste," Mollick wrote. "If AI is critical to work, we need more."

There's no shortage of independent tests and organizations proposing new benchmarks for AI, but their relative merit is far from a settled matter within the industry. Some AI commentators and experts propose aligning benchmarks with economic impact to ensure their usefulness, while others argue that adoption and utility are the ultimate benchmarks.

This debate may rage until the end of time.
Perhaps we should instead, as X user Roon prescribes, simply pay less attention to new models and benchmarks barring major AI technical breakthroughs. For our collective sanity, that may not be the worst idea, even if it does induce some level of AI FOMO.

As mentioned above, This Week in AI is going on hiatus. Thanks for sticking with us, readers, through this roller coaster of a journey. Until next time.

News

Image Credits: Nathan Laine/Bloomberg / Getty Images

OpenAI tries to uncensor ChatGPT: Max wrote about how OpenAI is changing its AI development approach to explicitly embrace "intellectual freedom, no matter how challenging or controversial a topic may be."

Mira's new startup: Former OpenAI CTO Mira Murati's new startup, Thinking Machines Lab, intends to build tools to "make AI work for [people's] unique needs and goals."

Grok 3 cometh: Elon Musk's AI startup, xAI, has released its latest flagship AI model, Grok 3, and unveiled new capabilities for the Grok apps for iOS and the web.

A very Llama conference: Meta will host its first developer conference dedicated to generative AI this spring. Called LlamaCon after Meta's Llama family of generative AI models, the conference is scheduled for April 29.

AI and Europe's digital sovereignty: Paul profiled OpenEuroLLM, a collaboration between some 20 organizations to build a series of foundation models for "transparent AI in Europe" that preserves "the linguistic and cultural diversity of all EU languages."

Research paper of the week

Image Credits: Jakub Porzycki/NurPhoto / Getty Images

OpenAI researchers have created a new AI benchmark, SWE-Lancer, that aims to evaluate the coding prowess of powerful AI systems.
The benchmark consists of over 1,400 freelance software engineering tasks that range from bug fixes and feature deployments to manager-level technical implementation proposals.

According to OpenAI, the best-performing AI model, Anthropic's Claude 3.5 Sonnet, scores 40.3% on the full SWE-Lancer benchmark, suggesting that AI has quite a ways to go. It's worth noting that the researchers didn't benchmark newer models like OpenAI's o3-mini or Chinese AI company DeepSeek's R1.

Model of the week

A Chinese AI company named Stepfun has released an open AI model, Step-Audio, that can understand and generate speech in several languages. Step-Audio supports Chinese, English, and Japanese, and lets users adjust the emotion and even the dialect of the synthetic audio it creates, including singing.

Stepfun is one of several well-funded Chinese AI startups releasing models under a permissive license. Founded in 2023, Stepfun reportedly recently closed a funding round worth several hundred million dollars from a host of investors that include Chinese state-owned private equity firms.

Grab bag

Image Credits: Nous Research

Nous Research, an AI research group, has released what it claims is one of the first AI models that unifies reasoning and intuitive language model capabilities.

The model, DeepHermes-3 Preview, can toggle long chains of thought on and off for improved accuracy at the cost of some computational heft. In reasoning mode, DeepHermes-3 Preview, similar to other reasoning AI models, "thinks" longer for harder problems and shows its thought process to arrive at the answer.

Anthropic reportedly plans to release an architecturally similar model soon, and OpenAI has said such a model is on its near-term roadmap.