
Meta's surprise Llama 4 drop exposes the gap between AI ambition and reality
Touted 10M token context proves elusive, while early performance tests disappoint experts.

Benj Edwards | Apr 7, 2025 3:54 pm

Credit: Rocter via Getty Images

On Saturday, Meta released its newest Llama 4 multimodal AI models in a surprise weekend move that caught some AI experts off guard. The announcement touted Llama 4 Scout and Llama 4 Maverick as major advancements, with Meta claiming top performance in their categories and an enormous 10 million token context window for Scout. But so far, the open-weights models have received a mixed-to-negative reception from the AI community, highlighting a familiar tension between AI marketing and user experience.

"The vibes around llama 4 so far are decidedly mid," independent AI researcher Simon Willison told Ars Technica in a short interview. Willison often checks the community pulse around open source and open weights AI releases in particular.

While Meta positions Llama 4 in competition with closed-model giants like OpenAI and Google, the company continues to use the term "open source" despite licensing restrictions that prevent truly open use. As we have noted with previous Llama releases, "open weights" more accurately describes Meta's approach. Those who sign in and accept the license terms can download the two smaller Llama 4 models from Hugging Face or llama.com.
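For readers who want to try the weights themselves, here is a minimal download sketch using the huggingface_hub Python client. The repository id shown is an assumption based on Meta's naming pattern; check the actual listing on Hugging Face, where you must first accept Meta's license on the model page.

```python
# Minimal sketch: downloading Llama 4 weights from Hugging Face.
# Assumes you have run `huggingface-cli login` with an account that
# has accepted Meta's Llama 4 license on the model page.
from huggingface_hub import snapshot_download

# Repo id is an assumption; verify it against the listing on huggingface.co.
local_dir = snapshot_download(
    repo_id="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    local_dir="./llama-4-scout",
)
print(f"Model files downloaded to {local_dir}")
```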
The company describes the new Llama 4 models as "natively multimodal," built from the ground up to handle both text and images using a technique called "early fusion." Meta says this allows joint training on text, images, and video frames, giving the models a "broad visual understanding." This approach ostensibly puts Llama 4 in direct competition with existing multimodal models from OpenAI (such as GPT-4o) and Google (Gemini 2.5).

The company trained the two new models with assistance from an even larger, unreleased "teacher" model named Llama 4 Behemoth (with 2 trillion total parameters), which is still in development. Parameters are the numerical values a model adjusts during training to learn patterns. Fewer parameters mean smaller, faster models that can run on phones or laptops, though creating high-performing compact models remains a major AI engineering challenge.

Meta constructed the Llama 4 models using a mixture-of-experts (MoE) architecture, which is one way around the limitations of running huge AI models. Think of MoE like having a large team of specialized workers: instead of everyone working on every task, only the relevant specialists activate for a specific job.

For example, Llama 4 Maverick has 400 billion total parameters, but only 17 billion of those parameters are active at once, routed through one of 128 experts. Likewise, Scout has 109 billion total parameters, but only 17 billion are active at once, routed through one of 16 experts. This design reduces the computation needed to run the model, since only a small portion of the neural network's weights is active at any given moment.
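To make the routing idea concrete, here is a minimal, illustrative sketch of top-1 mixture-of-experts routing in PyTorch. This is not Meta's implementation; the layer sizes and single-expert routing are assumptions chosen for brevity, but it shows how only one expert's weights participate in each forward pass.

```python
import torch
import torch.nn as nn

class TinyMoELayer(nn.Module):
    """Illustrative top-1 mixture-of-experts layer (not Meta's code).

    A router scores each input and dispatches it to a single expert,
    so only that expert's parameters are "active" for that input,
    the same principle behind Llama 4's 17B-active / 400B-total split.
    """

    def __init__(self, dim: int = 64, num_experts: int = 16):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)  # scores experts per input
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim). Pick the single best-scoring expert per input.
        expert_ids = self.router(x).argmax(dim=-1)  # (batch,)
        out = torch.empty_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_ids == i
            if mask.any():  # only the chosen experts do any work
                out[mask] = expert(x[mask])
        return out

layer = TinyMoELayer()
tokens = torch.randn(8, 64)
print(layer(tokens).shape)  # torch.Size([8, 64])
```

Production MoE models typically route each token to the top-k experts and add load-balancing losses during training, but the active-parameter principle is the same.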
Llama's reality check arrives quickly

Current AI models have relatively limited short-term memory. In AI, a context window acts somewhat like that memory, determining how much information a model can process at once. AI language models like Llama typically process that memory as chunks of data called tokens, which can be whole words or fragments of longer words. Large context windows allow AI models to process longer documents, larger code bases, and longer conversations.
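To get a feel for token counts in practice, here is a small sketch that measures how many tokens a document consumes before sending it to a model. It uses Hugging Face's transformers tokenizer; the model id and input filename are assumptions, and loading a gated tokenizer requires an authenticated account.

```python
# Sketch: estimating whether a document fits a provider's context limit.
# The model id is an assumption; substitute whatever tokenizer matches
# the model you are actually calling.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-4-Scout-17B-16E-Instruct")

with open("discussion_thread.txt") as f:  # hypothetical input file
    text = f.read()

num_tokens = len(tokenizer.encode(text))
provider_limit = 128_000  # what Groq and Fireworks offered at launch
print(f"{num_tokens} tokens; fits: {num_tokens <= provider_limit}")
```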
Despite Meta's promotion of Llama 4 Scout's 10 million token context window, developers have so far found that using even a fraction of that amount is challenging due to memory limitations. Willison reported on his blog that third-party services providing access, like Groq and Fireworks, limited Scout's context to just 128,000 tokens, while another provider, Together AI, offered 328,000 tokens.

Evidence suggests that accessing larger contexts requires immense resources. Willison pointed to Meta's own example notebook ("build_with_llama_4"), which states that running a 1.4 million token context needs eight high-end NVIDIA H100 GPUs.

Willison also documented his own testing troubles. When he asked Llama 4 Scout via the OpenRouter service to summarize a long online discussion (around 20,000 tokens), the result wasn't useful: he described it as "complete junk output" that devolved into repetitive loops.
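Willison's test is easy to approximate. Below is a hedged sketch of a summarization request through OpenRouter's OpenAI-compatible chat completions endpoint; the model slug and the input file are assumptions, and you would need your own API key.

```python
# Sketch: asking Llama 4 Scout to summarize a long discussion via
# OpenRouter. The model slug is an assumption; check openrouter.ai
# for the actual listing. Requires an OpenRouter API key.
import os
import requests

with open("discussion_thread.txt") as f:  # hypothetical ~20k-token input
    discussion = f.read()

resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
    json={
        "model": "meta-llama/llama-4-scout",  # assumed slug
        "messages": [
            {"role": "user", "content": f"Summarize this discussion:\n\n{discussion}"}
        ],
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```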
Meta claims that the larger of its two new models, Maverick, outperforms competitors like OpenAI's GPT-4o and Google's Gemini 2.0 on various technical benchmarks, which, as we usually note, are not necessarily reliable reflections of everyday user experience. So far, independent verification of those performance claims remains limited.

More interestingly, a version of Llama 4 is currently perched at No. 2 on the popular Chatbot Arena LLM vibemarking leaderboard. Even this comes with a catch: Willison noted a distinction pointed out in Meta's own announcement. The high-ranking leaderboard entry referred to an "experimental chat version scoring ELO of 1417 on LMArena," which is different from the Maverick model made available for download.

A potential technical dead-end

The Llama 4 release sparked discussion on social media about AI development trends, with reactions including mild disappointment over lackluster multimodal features, concerns that its mixture-of-experts architecture activates too few parameters (only 17 billion), and criticisms that the release felt rushed or poorly managed internally. Some Reddit users also noted that it compared unfavorably with innovative competitors such as DeepSeek and Qwen, particularly highlighting its underwhelming performance in coding tasks and software development benchmarks.

On X, researcher Andriy Burkov, author of "The Hundred-Page Language Models Book," argued that the underwhelming Llama 4 launch reinforces skepticism about monolithic base models. He stated that recent "disappointing releases of both GPT-4.5 and Llama 4 have shown that if you don't train a model to reason with reinforcement learning, increasing its size no longer provides benefits."

Burkov's mention of GPT-4.5 echoes that model's somewhat troubled launch; Ars Technica previously reported that GPT-4.5 faced mixed reviews, with its high cost and performance limitations suggesting a potential dead-end for simply scaling up traditional AI model architectures. This observation aligns with broader discussions in the AI field about the limits of training massive base models without incorporating newer techniques, such as simulated reasoning or training smaller, purpose-built models.

Despite all of the current drawbacks of Meta's new model family, Willison is optimistic that future releases of Llama 4 will prove more useful. "My hope is that we'll see a whole family of Llama 4 models at varying sizes, following the pattern of Llama 3," he wrote on his blog. "I'm particularly excited to see if they produce an improved ~3B model that runs on my phone."

Benj Edwards is Ars Technica's Senior AI Reporter and founded the site's dedicated AI beat in 2022. He's also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.