Did xAI lie about Grok 3's benchmarks?
techcrunch.com
Debates over AI benchmarks, and how they're reported by AI labs, are spilling out into public view. This week, an OpenAI employee accused Elon Musk's AI company, xAI, of publishing misleading benchmark results for its latest AI model, Grok 3. One of the co-founders of xAI, Igor Babushkin, insisted that the company was in the right. The truth lies somewhere in between.

In a post on xAI's blog, the company published a graph showing Grok 3's performance on AIME 2025, a collection of challenging math questions from a recent invitational mathematics exam. Some experts have questioned AIME's validity as an AI benchmark. Nevertheless, AIME 2025 and older versions of the test are commonly used to probe a model's math ability.

xAI's graph showed two variants of Grok 3, Grok 3 Reasoning Beta and Grok 3 mini Reasoning, beating OpenAI's best-performing available model, o3-mini-high, on AIME 2025. But OpenAI employees on X were quick to point out that xAI's graph didn't include o3-mini-high's AIME 2025 score at cons@64.

What is cons@64, you might ask? It's short for "consensus@64," and it basically gives a model 64 tries to answer each problem in a benchmark and takes the answers generated most frequently as the final answers. As you can imagine, cons@64 tends to boost models' benchmark scores quite a bit, and omitting it from a graph might make it appear as though one model surpasses another when in reality that isn't the case.

Grok 3 Reasoning Beta's and Grok 3 mini Reasoning's scores for AIME 2025 at "@1" (meaning the first score the models got on the benchmark) fall below o3-mini-high's score. Grok 3 Reasoning Beta also trails ever so slightly behind OpenAI's o1 model set to medium computing. Yet xAI is advertising Grok 3 as the "world's smartest AI."

Babushkin argued on X that OpenAI has published similarly misleading benchmark charts in the past, albeit charts comparing the performance of its own models.
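The cons@64 mechanic described above is just majority voting over repeated samples. Here is a minimal sketch of the idea; the function name and the example answers are hypothetical, and real evaluations would sample 64 completions from the model rather than take them as an input list.

```python
from collections import Counter

def consensus_at_k(sampled_answers: list[str]) -> str:
    """Return the consensus answer: the most frequent final answer
    among k sampled attempts at the same benchmark problem.

    `sampled_answers` is a hypothetical list of the final answers a
    model produced across k independent tries (k = 64 for cons@64).
    """
    # Majority vote: the modal answer wins, even if no single attempt
    # was especially confident.
    return Counter(sampled_answers).most_common(1)[0][0]

# Hypothetical example: 40 of 64 samples agree on "204",
# so "204" becomes the scored answer.
samples = ["204"] * 40 + ["212"] * 20 + ["0"] * 4
print(consensus_at_k(samples))  # prints "204"
```

This is why cons@64 inflates scores relative to "@1": a model that is right only, say, 60% of the time per attempt will still usually land on the correct answer after voting across 64 tries.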
A more neutral party in the debate put together a more accurate graph showing nearly every model's performance at cons@64. But as AI researcher Nathan Lambert pointed out in a post, perhaps the most important metric remains a mystery: the computational (and monetary) cost it took for each model to achieve its best score. That just goes to show how little most AI benchmarks communicate about models' limitations and strengths.