www.forbes.com
Modern server room, corridor in data centre with supercomputer racks. (3D rendering illustration; Getty)

Around the world, the biggest players in LLM technology are coming out with new versions of their models at staggering speed. But how do they stack up?

Analysts and testers (and others) are coming up with brand-new evaluations of these competing models, detailing their performance on everything from deep PhD-level questions, to coding, to various types of specialized tasks. But in the end, some claim that most of this hard work doesn't really make any difference to the average end user. Let's explore this a little bit through the lens of one of my favorite podcasts.

Grok-3 and o3: AI Daily Brief Observes

Two of the standouts right now are OpenAI's o3-mini model and Grok 3, the new version of the xAI chatbot that has its own reasoning capabilities and new functionality built in.

We can see graphs of these models using GPQA, a graduate-level, Google-proof Q&A benchmark, and the American Invitational Mathematics Examination (AIME) data set dating back to 1983. Some team members at OpenAI claim that o3-mini is better across the board; others at xAI, unsurprisingly, disagree.

And then there's a third argument.

AI Daily Brief Coverage

Over at the AI Daily Brief podcast, host Nathaniel Whittemore covers these types of evolutions, starting with a quote from Matthew Lambert:

"Frankly, there are no industry norms to lean on. Just expect noise. It's fine. May the best models win. Do your own evals anyway. AIME is practically useless to 99% of people."

Whittemore agrees. "At this point, I am fully on the train that these benchmarks are totally soaked," he says.
"There's almost no relevant signal, in that all of the models now are at the very high end of these things, and they just tell you almost nothing."

He has this advice for people who are curious about comparable functionality:

"If you're willing to take the time and the resources to do it, then just try every type of query, and every type of prompt, and every type of challenge, against all of the state of the art (systems) and see which one does best. Or, alternatively, just pick one, assume that it's going to be close to as good as the state of the art, and will be as good as the state of the art in a couple of weeks when they ship the latest update."

Anthropic's Hybrid Model

Later in the podcast, Whittemore goes over the new Claude 3.7 Sonnet, which he calls a hybrid model combining reasoning and expansive non-reasoning capabilities. Calling the innovation a nudge forward rather than a leap forward, he does concede that this model advances SWE-bench scores and agentic tool use.

User Reviews of New Models

For more, let's turn to a recent post by one of my favorite voices in IT, Ethan Mollick, on his blog, One Useful Thing, and a point that also gets mentioned by Whittemore during the podcast.

Mollick has been experimenting with Claude 3.7 Sonnet and Grok 3, and has this to say, in general, about his observations:

"This new generation of AIs is smarter, and the jump in capabilities is striking, particularly in how these models handle complex tasks, math and code," he writes. "These models often give me the same feeling I had when using ChatGPT-4 for the first time, where I am equally impressed and a little unnerved by what it can do."
"Take Claude's native coding ability. I can now get working programs through natural conversation or documents, no programming skill needed."

Showing off demos of impressive interactive experiences built with the models, like a time travel simulation that's intuitive, visual, and multimodal, Mollick then talks about two scaling laws that apply.

One is that larger models are more capable. Or, as many have observed, we can throw compute at systems and make them work better. The second has to do with test-time inference, which can also be called inference-time compute.

"OpenAI discovered that if you let a model spend more computing power working through a problem, it gets better results," Mollick writes. "(It's) kind of like giving a smart person a few extra minutes to solve a puzzle."

Together, these two trends are supercharging AI abilities, and also adding others.

"The Gen3 generation gives the opportunity for a fundamental rethinking of what's possible," he adds. "As models get better, and as they apply more tricks like reasoning and internet access, they hallucinate less (though they still make mistakes) and they are capable of higher-order thinking."

So: less hallucination, better reasoning, more accuracy, more performance, and a greater propensity to outperform human PhDs. As Mollick writes: "Managers and leaders will need to update their beliefs for what AI can do, and how well it can do it, given these new AI models. Rather than assuming they can only do low-level work, we will need to consider the ways in which AI can serve as a genuine intellectual partner. These models can now tackle complex analytical tasks, creative work, and even research-level problems with surprising sophistication."

There's also an interesting part of the post where Mollick mentions an idea he generated with the new model: a video game based on Herman Melville's "Bartleby, the Scrivener."
These are the kinds of projects that will turn heads as we get a clearer view of what AI can now do.

Do-It-Yourself Analysis

What I hear from all of the above thoughts on AI is that end users should be doing their own research and figuring out what works best for them.

This makes sense, because we have a certain amount of "black box" issue with LLMs. We don't know exactly how they're coming to their conclusions. We can't read the actions of digital neurons, obviously. Also, there's quite a bit of subjectivity involved. You can measure model outputs on test sets like GPQA or AIME, but what about the common things that end users will want to do: a teacher planning a lesson, an engineer who wants a git push, or a creative professional looking for something for a presentation?

Here, a lot of our ratings will be based on real-life examples of AI assistance, and not a whole lot of technical benchmarking.
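For readers who want to act on the "do your own evals" advice, the idea can be sketched as a tiny harness that runs the same prompts against several models and tallies scores from your own rating function. This is a minimal illustration under assumptions: run_eval, the lambda "models," and the toy scoring rule are all hypothetical stand-ins, not any vendor's API. In practice you would replace the lambdas with real client calls and the scoring rule with your own judgment.

```python
# Minimal DIY eval harness: run the same prompts against several
# "models" and tally which one a user-supplied scoring function prefers.
# Each entry in `models` is just a callable from prompt -> answer, so a
# real API client can be dropped in without changing the harness.

def run_eval(models, prompts, score_fn):
    """Return a dict mapping model name -> total score across all prompts."""
    totals = {name: 0.0 for name in models}
    for prompt in prompts:
        for name, ask_model in models.items():
            answer = ask_model(prompt)
            totals[name] += score_fn(prompt, answer)
    return totals

# Toy stand-ins so the harness runs without any API keys.
models = {
    "model_a": lambda p: p.upper(),          # pretend model A just echoes loudly
    "model_b": lambda p: p + " ... maybe?",  # pretend model B adds a hedge
}
prompts = [
    "plan a lesson on fractions",
    "write a git commit message for a typo fix",
]

# Your scoring function encodes *your* preferences; this toy rule just
# rewards answers that are not the prompt echoed back verbatim.
def score_fn(prompt, answer):
    return 1.0 if answer.lower() != prompt.lower() else 0.0

print(run_eval(models, prompts, score_fn))
```

The point is less the code than the workflow: your prompts, your scoring criteria, your verdict, rather than a leaderboard's.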