Meta unveils Llama API said to deliver record-breaking inference speeds
Pradeep Viswanathan · Neowin (@pradeepviswav) · Apr 29, 2025 23:54 EDT

At the first-ever LlamaCon, Meta today made several announcements and introduced tools to make the Llama family of models more accessible to developers. The main highlight was the launch of the Llama API, now available to developers as a limited free preview.

The Llama API lets developers try out different Llama models, including the recently launched Llama 4 Scout and Llama 4 Maverick. It offers one-click API key creation and lightweight TypeScript and Python SDKs. To make it easier to port OpenAI-based applications, the Llama API is also compatible with the OpenAI SDK.

Meta is partnering with Cerebras and Groq to deliver faster inference speeds through the Llama API. Cerebras claims that the Llama 4 Cerebras model in the API can generate tokens up to 18 times faster than GPU-based solutions from NVIDIA and others. According to the Artificial Analysis benchmarking site, the Cerebras solution delivered over 2,600 tokens/s for Llama 4 Scout, compared with ChatGPT at 130 tokens/s and DeepSeek at 25 tokens/s.

Andrew Feldman, CEO and co-founder of Cerebras, said: "Cerebras is proud to make the Llama API the fastest inference API in the world. Developers building agentic and real-time apps need speed. With Cerebras on the Llama API, they can build AI systems that are fundamentally out of reach for leading GPU-based inference clouds."

Developers can access this ultra-fast Llama 4 inference by selecting Cerebras from the model options within the Llama API. Llama 4 Scout is also available from Groq, where it currently runs at over 460 tokens/s: about 6x slower than the Cerebras solution, but roughly 4x faster than other GPU-based solutions.
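Because the Llama API speaks the OpenAI-compatible chat-completions format, porting an existing application should mostly be a matter of changing the base URL and model identifier. The sketch below illustrates the request shape using only Python's standard library; the endpoint URL and the `Llama-4-Scout` model id are illustrative assumptions, not documented values, so check Meta's Llama API documentation for the real ones.

```python
# Sketch of an OpenAI-style chat-completions request aimed at the Llama API.
# NOTE: LLAMA_API_URL and the model id are placeholders (assumptions), not
# documented values. Since Meta says the API is OpenAI-SDK compatible, the
# official OpenAI SDK could also be pointed at the real base URL instead.
import json
import urllib.request

LLAMA_API_URL = "https://api.llama.example/v1/chat/completions"  # placeholder

def build_chat_request(prompt: str, model: str = "Llama-4-Scout") -> dict:
    """Assemble an OpenAI-style chat-completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def send(payload: dict, api_key: str) -> dict:
    """POST the payload; needs a real endpoint and API key to actually run."""
    req = urllib.request.Request(
        LLAMA_API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    payload = build_chat_request("Summarize LlamaCon in one sentence.")
    print(payload["model"])
```

Routing a request to a specific inference partner such as Cerebras would then presumably just mean selecting the corresponding model option in the same payload.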