TOWARDSAI.NET
TAI#149: OpenAI’s Agentic o3; New Open Weights Inference Optimized Models (DeepMind Gemma, Nvidia Nemotron-H)
Author(s): Towards AI Editorial Team
Originally published on Towards AI.
What happened this week in AI by Louie
This week, OpenAI finally released its anticipated o3 and o4-mini models, shifting the focus towards AI agents that skillfully use tools. DeepMind also made significant contributions with its cost-effective Gemini 2.5 Flash model and a highly optimized, quantization-aware-trained version of Gemma 3, the latter underscoring a key industry trend: major labs are increasingly delivering inference-optimized open-weight models ready for deployment.
OpenAI’s new o3 and o4-mini arrived with a notable shift in focus from the December model preview. Rather than just a reasoning upgrade over o1, o3’s core innovation lies in its agentic capabilities, trained via reinforcement learning to intelligently use tools like web search, the code interpreter, and memory in a loop. It behaves more like a streamlined, rapid-response “Deep Research-Lite”. Set it to a task, and o3 can often return an answer in just 30 seconds to three minutes, much faster than the 10–30 minutes Deep Research might take. While o3’s outputs are less detailed, this speed and integrated tool use make it ideal for many real-world questions needing quick, actionable answers. On the BrowseComp complex web search benchmark (which we discussed in depth last week), o3 achieves 49.7%, close behind Deep Research at 51.5% and well ahead of o4-mini at 28.3% and GPT-4o (all with browsing). Of course, this doesn’t measure the more in-depth research tasks where Deep Research still excels.
This agentic nature allows o3 to break past some LLM-based search limitations. Because it actively plans and uses tools like search iteratively and natively, users don’t need to be as wary of it simply summarizing the first low-quality blog post it finds. It can handle multiple files, provide coherent, complete answers, and automatically perform multiple web searches to find up-to-date information, significantly cutting down errors and making the ChatGPT experience far more useful. Another key new strength is its ability to manipulate image inputs using code, cropping and zooming in to make sure it can identify key features. This has led to some fun demonstrations of o3’s skills at the “GeoGuessr” photo location guessing game!
o3 and o4-mini’s agentic reasoning vs. prior generations of reasoning models
Source: Towards AI.
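To make the “tools in a loop” idea concrete, here is a minimal sketch of that control flow using the OpenAI Python SDK’s function-calling interface. It assumes API access to o3; the `web_search` function and its stub implementation are our own placeholders for illustration — inside ChatGPT, o3 calls its built-in tools server-side, so this only mirrors the pattern.

```python
import json
from openai import OpenAI

client = OpenAI()

# Placeholder tool for illustration; ChatGPT's o3 uses its native search instead.
def web_search(query: str) -> str:
    return f"(stub) top results for: {query}"

tools = [{
    "type": "function",
    "function": {
        "name": "web_search",
        "description": "Search the web for up-to-date information.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

messages = [{"role": "user", "content": "What changed in o3 compared to o1?"}]

# Agentic loop: the model decides when to call a tool and when to answer.
while True:
    resp = client.chat.completions.create(model="o3", messages=messages, tools=tools)
    msg = resp.choices[0].message
    if not msg.tool_calls:
        print(msg.content)
        break
    messages.append(msg)
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        result = web_search(**args)
        messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
```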
Benchmark results also show o3’s new strengths. On the Aider polyglot coding benchmark, o3 achieved an impressive 79.6% (at $111.00 for the evaluation), surpassing Gemini 2.5 Pro’s 72.9% (delivered at just $6.30) and GPT-4.1’s 52.4% (at $9.90). However, a hybrid approach using o3-high as the planning “architect” and GPT-4.1 as the code-writing “editor” set a new state of the art, scoring 82.7% on Aider while reducing costs to just $69.30. The cost-efficient o4-mini also impressed, achieving 72.0% on Aider (at $19.60 per evaluation), making it a powerful option for developers balancing performance and budget. On OpenAI’s MRCR long-context test at 128,000-token length, o3 scored 70.3%, demonstrating solid long-context ability, though still trailing Gemini 2.5 Pro’s leading 91.6%.
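Aider implements this architect/editor split natively, but the idea itself is simple to sketch. Below is a rough two-step pipeline using the OpenAI Python SDK, assuming API access to both models; treat it as the shape of the approach rather than Aider’s actual implementation.

```python
from openai import OpenAI

client = OpenAI()
task = "Add retry logic with exponential backoff to fetch_data() in client.py"

# Step 1: the reasoning "architect" produces a plan but writes no code.
plan = client.chat.completions.create(
    model="o3",
    messages=[{"role": "user",
               "content": f"Plan the code changes needed for this task, step by step:\n{task}"}],
).choices[0].message.content

# Step 2: the cheaper "editor" turns the plan into concrete edits.
edits = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user",
               "content": f"Apply this plan and output the full modified code:\n{plan}"}],
).choices[0].message.content

print(edits)
```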
While the o3 and o4-mini releases show clear progress over OpenAI’s predecessors in both performance and cost-efficiency, the perceived capability leap feels somewhat moderated by prior access to similar functionality through Deep Research and by the strong performance of new competitors like Gemini 2.5 Pro.
In other releases this week, DeepMind’s Gemini Flash 2.5 offers great performance at a very affordable base price ($0.15/M input, $0.60/M output), but activating its “Thinking Mode” for reasoning tasks comes with a substantial output token cost premium, jumping to $3.50/M. In contrast, xAI’s Grok-3 Mini, priced consistently at $0.30/M input and $0.50/M output, has emerged as a surprising leader in cost-efficiency for reasoning models. On the GPQA science benchmark, Grok-3 Mini scored 79%, slightly edging out both Flash 2.5 Thinking (78%) and the more expensive o4-mini high (78%). For code generation on LiveCodeBench v5, Grok-3 Mini achieved 70%, again surpassing Flash 2.5 Thinking (63%) while remaining competitive with o4-mini high (80%). These results position Grok-3 Mini as a great option for developers seeking high performance in reasoning and coding without breaking the bank.
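To put those per-token prices in perspective, here is a quick back-of-the-envelope comparison for a hypothetical reasoning-heavy request (10K input tokens, 5K output tokens). We assume Flash 2.5’s input price is unchanged in Thinking Mode, since only the output premium is quoted above.

```python
# Price per million tokens: (input, output), taken from the figures above.
prices = {
    "gemini-2.5-flash (thinking)": (0.15, 3.50),  # input price assumed unchanged
    "grok-3-mini": (0.30, 0.50),
}

input_tokens, output_tokens = 10_000, 5_000  # hypothetical reasoning-heavy request

for model, (p_in, p_out) in prices.items():
    cost = input_tokens / 1e6 * p_in + output_tokens / 1e6 * p_out
    print(f"{model}: ${cost:.4f}")
# gemini-2.5-flash (thinking): $0.0190
# grok-3-mini: $0.0055
```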
This week also brought something new for open-weight models. DeepMind’s latest iteration of Gemma models, optimized through Quantization-Aware Training (QAT), continues an important industry trend: major AI labs are increasingly performing advanced inference optimization internally, rather than leaving it to users. QAT enables Gemma 3’s powerful 27B parameter model — initially needing 54 GB of memory (BF16) — to run smoothly on consumer GPUs like the NVIDIA RTX 3090 using just 14.1 GB (int4), while maintaining high quality. This trend was also demonstrated this week by NVIDIA’s Nemotron-H, a family of efficient hybrid models (8B, 47B, 56B) combining Mamba-2, self-attention, and FFN layers. Their compressed 47B model matches larger 70B-class models like Llama 3 and Qwen 2 while being significantly faster and smaller. NVIDIA used compression techniques like layer dropping and FFN pruning, specifically targeting deployment on consumer hardware like 32 GB GPUs. Similar efforts to release inference-ready models were also seen recently from the DeepSeek team, who distilled their R1 reasoning model into smaller, easier-to-deploy “dense” models. This shift suggests developers will increasingly rely on officially optimized, deployment-ready variants instead of undertaking quantization or pruning themselves.
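The memory math behind those QAT numbers is straightforward. This small sketch estimates the raw weight footprint of a 27B-parameter model at different precisions; it ignores activations, KV cache, and framework overhead, and the official 14.1 GB figure sits slightly above the naive int4 estimate, likely because some tensors remain at higher precision.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Raw weight storage only; ignores activations, KV cache, and overhead."""
    return n_params * bits_per_param / 8 / 1e9

n = 27e9  # Gemma 3 27B
for name, bits in [("BF16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{name}: {weight_memory_gb(n, bits):.1f} GB")
# BF16: 54.0 GB, int8: 27.0 GB, int4: 13.5 GB
```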
Why should you care?
For non-technical LLM users and businesses: As we noted last week, the growing variety of models available — often now with very specific strengths and weaknesses — means selecting the right tool for the job is more important than ever. You might use o3 via ChatGPT for its quick, tool-assisted answers and natural interaction style, but switch to Gemini for tasks requiring deep understanding of very long documents, or explore Grok-3 Mini for quick, cost-sensitive reasoning. Experimenting with how these models use tools (like search or analysis) is key to unlocking their value for everyday tasks. Moving beyond reliance on a single model will become standard practice. We think the new o3 model will later become the “router” layer in OpenAI’s upcoming GPT-5, trained not just to use tools but also to activate different LLM models for specific tasks according to their strengths. This will simplify the user experience, but a strong understanding of the core strengths of different foundation models will still lead to the best results.
For LLM developers and enterprises: OpenAI’s o3 introduces an accessible, ready-to-use smart agent. Its core strength isn’t just raw reasoning, but its trained ability to intelligently select and use tools. Experimenting with this agentic capability in complex LLM workflows is crucial. We anticipate that this architecture, where a central model intelligently routes tasks and orchestrates tools and other specialized models, will be the foundation for most future agent systems. Learning how to leverage o3’s tool-use skills now provides a valuable head start. The success of hybrid approaches on Aider (o3 planner + 4.1 executor) also proves that combining models based on their unique strengths is becoming essential for state-of-the-art performance and efficiency. Developers who master these multi-model strategies and understand the nuances of agentic tool use will be best positioned going forward.
Hottest News
1. OpenAI Launches a Pair of AI Reasoning Models, o3 and o4-Mini
OpenAI has unveiled two new AI models, o3 and o4-mini, replacing the earlier o1 and o3-mini versions. The o3 model stands out as OpenAI’s most advanced reasoning AI to date, capable of integrating visual inputs, such as sketches or whiteboards, into its reasoning processes. It can also manipulate images by zooming or rotating them to aid interpretation. Both models are now available to ChatGPT Plus, Pro, and Team users, with o3-pro support expected soon.
2. xAI Adds a ‘Memory’ Feature to Grok
xAI has announced a new “memory” feature for its chatbot, Grok. This feature enables Grok to remember details from past conversations, allowing it to provide more personalized responses over time. For instance, if you ask Grok for recommendations, it will tailor its suggestions based on your previous interactions, assuming you’ve used it enough to establish your preferences.
3. OpenAI Open Sourced Codex CLI
OpenAI has released Codex CLI, an open-source command-line tool designed to run locally in the terminal. This tool links OpenAI’s models with local code and computing tasks, enabling the models to write and edit code on a desktop and perform actions like moving files. Codex CLI also supports multimodal reasoning by allowing users to pass screenshots or low-fidelity sketches to the model, combined with access to local code.
4. Cohere Launched Embed 4
Cohere has introduced Embed 4, a multimodal search solution tailored for businesses. This tool leverages advanced language models to enhance search capabilities across various data types, offering improved efficiency and scalability for enterprise applications. Embed 4 delivers state-of-the-art accuracy and efficiency, helping enterprises securely retrieve their multimodal data to build agentic AI applications.
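For developers who want to try it, here is a minimal sketch using Cohere’s Python SDK. The `embed` call and `input_type` values follow Cohere’s existing embed API; the `embed-v4.0` model identifier is our assumption and should be checked against the current docs.

```python
import os
import cohere

co = cohere.Client(api_key=os.environ["COHERE_API_KEY"])

docs = ["Q3 revenue grew 12% quarter over quarter.",
        "The onboarding checklist covers SSO and audit logging."]

# Model name assumed for Embed 4; verify the exact identifier in Cohere's docs.
doc_embs = co.embed(texts=docs, model="embed-v4.0",
                    input_type="search_document").embeddings
query_emb = co.embed(texts=["How did revenue change last quarter?"],
                     model="embed-v4.0", input_type="search_query").embeddings[0]

print(len(doc_embs), len(query_emb))  # number of document vectors, embedding size
```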
5. Mistral Released a Suite of Models for Different Classification Tasks
Mistral AI has announced new advancements in AI model optimization, focusing on enhancing scalability and practical applications for businesses. The company aims to refine large-scale models for improved performance across diverse industries. These models are designed to handle various classification tasks, providing businesses with more efficient and scalable AI solutions.
6. Google Released a Preview Version of Gemini 2.5 Flash
Google has unveiled Gemini 2.5 Flash, now available in preview. This new version introduces a “thinking budget” feature, allowing developers to control the amount of computational reasoning the AI uses for different tasks. This provides a balance between quality, cost, and response latency. Gemini 2.5 Flash offers improved speed, efficiency, and performance for developers building AI-powered applications.
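Here is a minimal sketch of the thinking-budget control via the google-genai Python SDK; the preview model identifier is an assumption and may differ from the current one in Google’s docs.

```python
from google import genai
from google.genai import types

client = genai.Client()  # reads the Gemini API key from the environment

response = client.models.generate_content(
    model="gemini-2.5-flash-preview",  # assumed preview identifier; check current docs
    contents="Plan a week of study for linear algebra basics.",
    config=types.GenerateContentConfig(
        # 0 disables thinking; larger budgets trade latency and cost for quality.
        thinking_config=types.ThinkingConfig(thinking_budget=1024),
    ),
)
print(response.text)
```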
7. Meta Unveils Perception Language Model
Meta has introduced PerceptionLM, a dataset and model aimed at advancing AI’s visual understanding capabilities. This open-access release provides tools for training models that interpret complex visual data with greater detail and accuracy. PerceptionLM is designed to enhance AI’s ability to comprehend and reason about visual information, contributing to more sophisticated multimodal AI systems.
Six 5-minute reads/videos to keep you learning
1. Building an AI Study Buddy: A Practical Guide to Developing a Simple Learning Companion
This step-by-step guide walks you through creating a lightweight study companion using Groq’s ultra-fast inference with Llama 3 or Mistral, paired with LangChain, FAISS, and Sentence Transformers for RAG. You’ll also learn how to deploy a simple, modular frontend using Streamlit — perfect for summarizing, generating, and learning on the go.
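To give a flavor of the RAG core in that guide, here is a condensed sketch using sentence-transformers and FAISS directly (the guide itself wires these pieces through LangChain and serves the UI with Streamlit); the embedding model name and sample notes are illustrative.

```python
import faiss
from sentence_transformers import SentenceTransformer

notes = [
    "Gradient descent updates parameters in the direction of the negative gradient.",
    "Bayes' theorem relates conditional and marginal probabilities.",
    "A confusion matrix summarizes classification errors by class.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
vectors = embedder.encode(notes, normalize_embeddings=True)

index = faiss.IndexFlatIP(vectors.shape[1])  # inner product == cosine on normalized vectors
index.add(vectors)

query = "How do a classifier's mistakes get summarized?"
q = embedder.encode([query], normalize_embeddings=True)
scores, ids = index.search(q, k=2)

context = "\n".join(notes[i] for i in ids[0])
print(context)  # pass this as retrieved context into the LLM prompt
```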
2. Identifying and Scaling AI Use Cases
OpenAI released a practical framework to help teams find high-impact AI use cases. The guide includes department-specific examples, real-world stories, and actionable checklists to support adoption and scaling across the organization.
3. Automating Content Creation With Qwen2.5-Omni
Qwen2.5-Omni, a multimodal model by Alibaba’s Qwen team, handles text, images, audio, and video, and generates both text and speech. This tutorial shows you how to set up the model, automate audio content creation, and integrate it with vector databases for advanced workflows.
4. Introducing HELMET: Holistically Evaluating Long-Context Language Models
HELMET offers a holistic benchmark for evaluating long-context language models (LCLMs). This blog post explains key findings and how practitioners can use HELMET to differentiate between various LCLMs in future research and applications. It also includes a guide for using HELMET with HuggingFace.
5. How To Think About Agent Frameworks
This blog analyzes agent frameworks and distinguishes between agents and workflows. It also introduces LangGraph as an orchestration framework that combines declarative and imperative APIs to manage complex agentic systems effectively.
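A minimal LangGraph sketch of the declarative side of that argument: nodes are plain Python functions over a shared state, and edges define the control flow. The node logic below is a stand-in for real LLM calls.

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

class State(TypedDict):
    question: str
    answer: str

def plan(state: State) -> dict:
    # Stand-in for an LLM planning call.
    return {"answer": f"Plan for: {state['question']}"}

def respond(state: State) -> dict:
    # Stand-in for an LLM answering call that uses the plan.
    return {"answer": state["answer"] + " -> final answer"}

graph = StateGraph(State)
graph.add_node("plan", plan)
graph.add_node("respond", respond)
graph.set_entry_point("plan")
graph.add_edge("plan", "respond")
graph.add_edge("respond", END)

app = graph.compile()
print(app.invoke({"question": "What is an agent vs a workflow?", "answer": ""}))
```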
6. Voice AI & Voice Agents: An Illustrated Primer
This guide explores the current state of conversational voice AI in 2025, detailing how LLMs are used to transform unstructured speech into structured data across various applications, including healthcare and customer service. It covers the core technologies involved — speech-to-text, text-to-speech, audio processing, and network transport — and discusses best practices for building production-ready voice agents.
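At their core, most of these voice agents share a simple turn loop: speech-to-text, an LLM call, then text-to-speech, all riding on a low-latency transport. The skeleton below sketches that loop; `transcribe` and `synthesize` are hypothetical stubs standing in for whichever STT/TTS providers you plug in, and the model name is illustrative.

```python
from openai import OpenAI

client = OpenAI()

def transcribe(audio_chunk: bytes) -> str:
    """Hypothetical STT stub; replace with your speech-to-text provider."""
    return "What are your clinic's opening hours?"

def synthesize(text: str) -> bytes:
    """Hypothetical TTS stub; replace with your text-to-speech provider."""
    return text.encode()

def handle_turn(audio_chunk: bytes, history: list[dict]) -> bytes:
    # 1) speech -> text, 2) LLM turns the utterance into a reply, 3) text -> speech
    user_text = transcribe(audio_chunk)
    history.append({"role": "user", "content": user_text})
    reply = client.chat.completions.create(model="gpt-4.1-mini", messages=history)
    text = reply.choices[0].message.content
    history.append({"role": "assistant", "content": text})
    return synthesize(text)
```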
Repositories & Tools
Kimi VL is an open-source Mixture-of-Experts vision-language model that excels in multimodal reasoning and long-context understanding with only 2.8B activated parameters.
Jump Server is an open-source Privileged Access Management tool that provides DevOps and IT teams with on-demand and secure access to SSH, RDP, Kubernetes, Database, and RemoteApp endpoints.
Code Server allows users to run VS Code on any machine and access it in the browser.
BitNet is the first open-source, native 1-bit LLM at the 2-billion parameter scale.
Top Papers of The Week
1. Collaborative Reasoner: Self-Improving Social Agents With Synthetic Conversations
This paper introduces Collaborative Reasoner (Coral), a framework for evaluating and improving collaborative reasoning in LLMs. Coral turns traditional reasoning problems into multi-agent, multi-turn tasks, where two agents must reach a shared solution through natural conversation. These dialogues simulate real-world dynamics, pushing agents to challenge, negotiate, and align on joint conclusions.
2. ReTool: Reinforcement Learning for Strategic Tool Use in LLMs
ReTool enhances long-form reasoning by combining tool use with reinforcement learning. It uses real-time code execution and feedback to refine strategies over time. ReTool’s 32B model scores 67% on the AIME benchmark, outperforming text-only RL baselines and demonstrating emergent behaviors like self-correcting code, pushing the frontier in hybrid neuro-symbolic reasoning.
3. xVerify: Efficient Answer Verifier for Reasoning Model Evaluations
As reasoning models like OpenAI’s o1 adopt slow-thinking strategies, traditional evaluations fall short. xVerify offers a more reliable answer verifier, trained on the VAR dataset, achieving over 95% accuracy. It significantly outperforms existing methods and proves effective and generalizable across tasks.
4. Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
This study questions the assumption that Reinforcement Learning with Verifiable Rewards (RLVR) enhances reasoning abilities beyond what’s already in the base model. RL training shifts output distribution towards more rewarding responses, improving performance at lower pass@k values but restricting reasoning boundaries. Distillation, unlike RLVR, introduces genuinely new capabilities, prompting a reevaluation of RLVR’s impact on reasoning capacities in LLMs.
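For readers less familiar with pass@k: it is the probability that at least one of k sampled answers is correct, and the standard unbiased estimator (from n samples with c correct) is what makes the “better at low k, no better at high k” pattern measurable. A small sketch with made-up numbers:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 200 samples with 30 correct: low at k=1, near-certain at k=100
print(pass_at_k(200, 30, 1))    # 0.15
print(pass_at_k(200, 30, 100))  # ~1.0
```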
5. InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
InternVL3 introduces a new multimodal pre-training approach, simultaneously acquiring linguistic and multimodal skills. Employing advanced techniques like V2PE and SFT, InternVL3 sets a new standard with a 72.2 score on the MMMU benchmark among open-source MLLMs. It rivals ChatGPT-4o and others, planning a public release of its training data and model weights to promote open research.
Quick Links
1. OpenAI is working on a social network prototype, similar to X (formerly Twitter), focused on sharing AI-generated images from ChatGPT. The project adds a new dimension to OpenAI’s rivalry with Elon Musk and Meta, both of which are also exploring social AI integrations. The platform will help feed real-time data back into model training.
2. DeepSeek is open-sourcing its modified inference engine built on vLLM. After running into challenges like code divergence and infrastructure lock-ins, they’re shifting gears — partnering with open-source projects, modularizing components, and contributing their performance optimizations to the community.
3. OpenAI just introduced Flex processing, a lower-priced API tier for tasks that don’t need fast responses. Available in beta for o3 and o4-mini, it’s aimed at non-production use cases like evaluations, data enrichment, or background jobs. The tradeoff: slower responses and occasional downtime.
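A minimal sketch of opting into Flex per request via the OpenAI Python SDK, assuming the `service_tier="flex"` value described in the beta announcement; a generous client timeout is sensible since requests may queue.

```python
from openai import OpenAI

client = OpenAI(timeout=900.0)  # flex requests can queue, so allow a long timeout

resp = client.chat.completions.create(
    model="o4-mini",
    service_tier="flex",  # per OpenAI's Flex processing beta; check current docs
    messages=[{"role": "user", "content": "Classify this ticket: 'refund not received'"}],
)
print(resp.choices[0].message.content)
```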
Who’s Hiring in AI
LLM Data Researcher @Turing (USA/Remote)
Software Engineer II @GumGum (Santa Monica, CA, USA)
Backend Developer @ZenGRC (Remote)
Data Scientist Intern — Singapore @GoTo Group (Singapore)
Senior Software Engineer (Search) @Simpplr (Hybrid, India)
Interested in sharing a job opportunity here? Contact [email protected].
Think a friend would enjoy this too? Share the newsletter and let them join the conversation.
Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI, from research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor.
Published via Towards AI