TAI #148: New API Models from OpenAI (4.1) & xAI (grok-3); Exploring Deep Research’s Scaling Laws
Author(s): Towards AI Editorial Team
Originally published on Towards AI.
What happened this week in AI by Louie
This week, AI developers got their hands on several significant new model options. Adding to last week's new Llama 4 options, OpenAI released GPT-4.1 — its first developer-only API model, not directly available within ChatGPT — and xAI launched its Grok-3 API. OpenAI made significant progress in addressing prior limitations in its non-reasoning models, improving coding capabilities and finally breaking through its previous long-context barrier (GPT-4.1 now supports up to 1 million tokens). OpenAI also enhanced ChatGPT's memory capabilities with access to the user's full conversation history (likely still via summarization and RAG). OpenAI is also expected to imminently release its powerful new reasoning models, o3 and o4-mini.
The GPT-4.1 series offers three models: GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano. GPT-4.1 substantially surpasses GPT-4o across multiple tasks at lower pricing, scoring 54.6% on SWE-bench Verified — a 21.4 percentage point jump over GPT-4o that reflects big gains in practical software engineering capability. Real-world instruction following, long a frustration for developers, has also seen clear gains: GPT-4.1 outperforms GPT-4o by 10.5 percentage points on Scale's MultiChallenge benchmark. Perhaps most significantly, GPT-4.1 offers up to one million tokens of input context, compared to 128k tokens in GPT-4o (though it still handles multiple tasks across a long context significantly less well than Gemini 2.5). This makes the new models suitable for processing massive codebases and extensive documentation. GPT-4.1 mini and nano also offer solid performance boosts at much lower latency and cost, with GPT-4.1 mini beating GPT-4o in several tests while reducing costs by 83%.
Practical usage tips for developers adopting the new GPT-4.1 models include equipping GPT-4.1 with tools and iterative planning to build agentic workflows. Chain-of-thought prompting remains useful, particularly for structured and detailed reasoning tasks, with GPT-4.1 benefiting from clearly specified prompts and systematic thinking instructions. Developers should approach the full million-token context window carefully; while it performs far better than previous models, very complex tasks may still require smaller, targeted contexts for optimal results. Detailed and consistent instructions significantly enhance GPT-4.1's performance; conflicting or underspecified instructions remain a common pitfall that can induce errors or hallucinations.
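To make these tips concrete, here is a minimal sketch of an agentic setup using the OpenAI Python SDK's Chat Completions API with the public gpt-4.1 model id. The system prompt encodes the planning and clarification guidance above; the search_codebase tool and its schema are purely illustrative assumptions, not an official example.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative tool definition in the standard Chat Completions "tools" format;
# the search_codebase function itself is hypothetical.
tools = [{
    "type": "function",
    "function": {
        "name": "search_codebase",
        "description": "Search the repository for files relevant to a query.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

system_prompt = (
    "You are a coding agent. Before acting, write a short numbered plan, "
    "then call tools one step at a time and revise the plan after each result. "
    "If instructions conflict or are underspecified, ask for clarification "
    "instead of guessing."
)

response = client.chat.completions.create(
    model="gpt-4.1",  # swap in gpt-4.1-mini or gpt-4.1-nano for lower latency and cost
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "Find where rate limiting is implemented and summarize it."},
    ],
    tools=tools,
)

print(response.choices[0].message)
```

In a real agent loop, you would execute any tool calls the model returns, append the results as tool messages, and call the API again until the plan is complete.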
Meanwhile, xAI’s Grok-3 API also became publicly available, with pricing marginally above GPT-4o — $3 per million input tokens and $15 per million output tokens, along with premium faster-inference options. Its 131k-token context limit, while large, currently falls short of earlier claims of a million tokens, prompting some disappointment within the developer community. Clarification around Grok-3’s model scale and underlying training specifics remains elusive, including whether it might be a distilled version of a larger internal training run.
Latest LLM API pricing, context windows, and options. Source: Towards AI; Google Gemini, OpenAI, Anthropic Claude, and xAI Grok.
Model choice should expand further later this week with the release of o3, which so far is only available through Deep Research. Together with coding tools, Deep Research agents are among the most valuable LLM-based tools released so far for high-value enterprise work. Competition escalated this week with a big upgrade to Gemini Deep Research after its base model moved to Gemini 2.5 Pro. In my own experience, Gemini 2.5 Pro Deep Research is a big improvement and now offers distinct strengths and weaknesses relative to OpenAI's Deep Research. Leveraging both platforms, given their complementary strengths, may become standard practice for many of my research tasks.
Gemini's search tends to be broader, often rapidly finding many hundreds of sources (possibly thanks to Google's existing web-crawling infrastructure). Its real strength lies in genuine long-context understanding and the ability to synthesize and cohesively present large quantities of information. OpenAI's Deep Research, in contrast, is narrower but often feels more targeted and intelligent — executing more complex, iterative research strategies, exploring deeper niche rabbit holes, and generating intricate next-step research plans after each stage. OpenAI's model also engages in more advanced reasoning, extrapolating between data points with added analysis and insight, although this sophistication can also make checking for hallucinations more challenging.
OpenAI's new BrowseComp test and paper offered further insights into Deep Research this week. On this benchmark, only ~21% of tasks were solved by humans in under two hours, and just 2% by GPT-4o with web search. Deep Research's performance increased from ~10% to ~52% (the released version) by scaling inference tokens in series (22x token scaling, assuming base 2 in their unlabelled chart). Series scaling likely comes from both longer chains of thought and a larger number of agentic steps. On top of this, a further 64x parallel compute scaling (64 parallel attempts at the answer) lifted accuracy to 78%. This parallel sampling used self-judged confidence scores ("best of N") to choose the final answer. These confidence scores outperformed majority voting methods (which instead pick the most common final answer). A similar mechanism may well also be behind o1-pro's effectiveness! Collectively, the test-time or inference scaling of Deep Research increased its score from 10% to 78%. It now becomes a difficult task to find the optimal balance of these scaling vectors and to decide for which tasks the extra capability is worth the extra cost!
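The paper does not spell out its aggregation code, but the difference between the two strategies is easy to sketch. In the minimal, purely illustrative example below, each parallel attempt returns a final answer plus a self-judged confidence score; best-of-N keeps the highest-confidence answer, while majority voting keeps the most frequent one. The sample answers and scores are made up for illustration.

```python
from collections import Counter

def best_of_n(samples):
    """Pick the answer from the attempt with the highest self-reported confidence.

    `samples` is a list of (answer, confidence) pairs, where confidence is a
    score in [0, 1] the model assigned to its own final answer.
    """
    return max(samples, key=lambda s: s[1])[0]

def majority_vote(samples):
    """Pick the most common final answer across attempts (baseline strategy)."""
    answers = [answer for answer, _ in samples]
    return Counter(answers).most_common(1)[0][0]

# Hypothetical outputs from 5 parallel research attempts on the same question.
samples = [
    ("Treaty of Utrecht", 0.55),
    ("Treaty of Utrecht", 0.60),
    ("Peace of Westphalia", 0.92),
    ("Treaty of Utrecht", 0.40),
    ("Peace of Westphalia", 0.88),
]

print(best_of_n(samples))      # Peace of Westphalia (highest confidence wins)
print(majority_vote(samples))  # Treaty of Utrecht (most frequent answer wins)
```

The point is that a single high-confidence attempt can win over a more common but less confident answer, which is roughly the behavior OpenAI reports for confidence-based selection versus majority voting.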
Why should you care?
All these new model API options are great for the AI ecosystem but make LLM model choices harder than ever. Many models now have their own unique strengths and are state-of-the-art for certain use cases. The best LLM development pipelines and agents will now use a combination of several models from different providers, and even non-technical workflows benefit from using multiple models together across ChatGPT, Gemini, and Claude.
Insights from the BrowseComp study highlight another increasing complexity: advanced agents get huge benefits from investing more compute into inference-time scaling. While costs escalate quickly, performance gains from deeper iterative reasoning and parallel sampling can outweigh those costs for strategically important tasks. Many professional and enterprise workflows could see returns by leveraging greater inference scaling than is commonly employed today, provided these additional costs align clearly with the value delivered.
More broadly, the rapid evolution of AI highlights different but equally important considerations for non-technical users and LLM developers alike.
For non-technical users and businesses, the emphasis should shift towards well-informed, deliberate, and deep integration of AI into daily workflows — moving beyond casual experimentation. Selecting the right models requires clearly defined goals and a careful understanding of each model’s unique strengths and limitations. Users must focus on how different AI tools practically enhance their productivity, creativity, and effectiveness in real-world tasks.
For LLM developers, the expanding array of models and development options requires an even deeper grasp of nuances such as inference scaling behaviors, optimal model combinations, and agentic frameworks. Developers need to thoughtfully customize and embed these models within robust pipelines and workflows, carefully balancing performance gains against computational costs. A deep understanding of model-specific strengths and inference techniques, such as strategic parallel or series scaling, will become essential to building highly capable, efficient, and economically viable applications.
In both cases, those who proactively master these subtleties today will be best positioned to drive productivity, innovation, and competitive advantage in their fields.
We have yet another guest post this week with GradientFlow (aka Ben Lorica), diving into the hottest new development in AI: deep research tools. While tools like ChatGPT's web browsing or Perplexity extend chatbot capabilities by gathering context from the internet, they remain limited for complex analytical work. Deep research tools change that by combining conversational AI with autonomous web browsing, tool integrations, and sophisticated reasoning capabilities. If you're building or integrating AI tools, this is essential context for what's coming next.
We’ll provide a detailed comparative analysis of popular deep research platforms, examining their unique approaches and explaining why they represent a fundamental shift in knowledge work. Read the complete article here!
Hottest News
1. OpenAI Introduced GPT-4.1 in the API
OpenAI has launched GPT-4.1, along with GPT-4.1 Mini and Nano variants, exclusively through its API. The new models bring substantial improvements in coding capabilities, instruction-following, and long-context handling — supporting up to 1 million tokens. GPT-4.1 also shows marked performance gains over GPT-4o, achieving a 54.6% score on SWE-bench Verified and outperforming GPT-4o by 10.5 percentage points on Scale’s MultiChallenge benchmark.
2. Google announces the Agent2Agent Protocol (A2A)
Google unveiled a new open-source Agent2Agent Protocol (A2A) designed to let AI agents from different vendors and systems work together seamlessly. While A2A overlaps with Anthropic's Model Context Protocol (MCP), Google has also announced support for MCP, positioning the two protocols as complementary. A2A is aimed particularly at more complex multi-agent interactions and also enables secure agent-to-agent communication.
3. Anthropic Rolls Out a $200-per-Month Claude Subscription
Anthropic has introduced a high-tier subscription plan for its Claude chatbot called Claude Max. In addition to the existing $20/month Claude Pro, there are now two Max tiers: a $100/month plan offering 5x higher usage limits and a $200/month option with 20x the rate limits. Both plans include priority access to Anthropic’s newest models and features.
4. OpenAI Launched Memory in ChatGPT
OpenAI has begun rolling out a new memory feature in ChatGPT that personalizes responses based on a user’s past interactions. Displayed as “reference saved memories” in settings, the feature enhances context awareness across text, voice, and image interactions — making conversations more relevant and adaptive over time.
5. xAI Launches an API for Grok 3
xAI is making its flagship Grok 3 model available via an API. Grok 3 is priced at $3 per million tokens (~750,000 words) fed into the model and $15 per million tokens generated by the model. The company also introduced Grok 3 Mini, available at $0.30 per million input tokens and $0.50 per million output tokens, offering a more cost-effective option.
6. OpenAI Open Sources BrowseComp: A Benchmark for Browsing Agents
OpenAI has released BrowseComp, a benchmark designed to evaluate agents’ ability to persistently browse the web and retrieve challenging, hard-to-find information. The benchmark contains 1,266 fact-seeking tasks, each with a short, unambiguous answer. Solving them involves navigating multiple pages, reconciling conflicting information, and filtering signals from noise.
7. Together AI Released DeepCoder-14B-Preview: A Fully Open-Source Code Reasoning Model
Together AI, in collaboration with Agentica, unveiled DeepCoder-14B-Preview, a model fine-tuned from DeepSeek-R1-Distilled-Qwen-14B using distributed reinforcement learning. Achieving 60.6% Pass@1 on LiveCodeBench, it rivals top-tier models like o3-mini-2025 in output quality — while operating at just 14B parameters, showcasing impressive efficiency in code reasoning.
8. Google Introduces Firebase Studio
Google announced Firebase Studio, a web-based IDE for building and deploying full-stack AI apps. It integrates tools like Project IDX, Genkit, and Gemini into a unified environment. Its standout feature, the App Prototyping Agent, allows users to create entire applications from natural language prompts or hand-drawn sketches.
9. OpenAI Will Soon Phase Out GPT-4 From ChatGPT
OpenAI will fully phase out GPT-4 from ChatGPT by April 30, replacing it with GPT-4o as the default model. While GPT-4 will no longer be available within ChatGPT, it will remain accessible through OpenAI’s API offerings.
10. Measuring Human Leadership Skills with AI Agents
A new study found that skill in managing AI agents strongly correlates with skill in managing people. Harvard researchers demonstrated that leadership-skill assessments based on GPT-4o-simulated interactions align strongly (r = 0.81) with real-world human evaluations. Effective leaders exhibit conversational reciprocity, fluid intelligence, and nuanced social perception.
11. Kunlun Wanwei Open-Sources Skywork-OR1 Series Models
Kunlun Wanwei's Tiangong team released the Skywork-OR1 (Open Reasoner 1) series, featuring open-source models tailored for math, code, and reasoning tasks. The lineup includes Skywork-OR1-Math-7B, Skywork-OR1-7B-Preview for combined skills, and the powerful Skywork-OR1-32B-Preview for complex, general-purpose applications.
12. Google Announced Its New TPU v7 Ironwood
The chip will be available later this year and offers 4.8 PFLOP/s of FP8 compute (Nvidia's Blackwell GB200 is 5.0 PFLOP/s FP8) and 192 GB of High Bandwidth Memory (equal to GB200). Easier pod-size scaling (up to 9,216 chips) is a key strength for TPUs, while Nvidia has its easy-to-use CUDA software and strong ecosystem. GPU-vs-TPU competition aside, High Bandwidth Memory is scaling rapidly across all AI hardware now that sparse Mixture of Experts architectures and inference-time scaling (with their high KV-cache demands) are the norm. HBM costs for these chips are now higher than their TSMC manufacturing costs.
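To see why KV-cache needs push HBM demand, here is a back-of-the-envelope sketch; the layer count, head sizes, and sequence length are illustrative assumptions, not the specs of Ironwood or any particular model.

```python
# Rough KV-cache sizing for a single long-context request. All figures are
# illustrative assumptions chosen only to show the order of magnitude.

num_layers   = 60        # transformer layers
num_kv_heads = 8         # KV heads (grouped-query attention)
head_dim     = 128       # dimension per head
seq_len      = 128_000   # tokens kept in context
bytes_per_el = 2         # fp16/bf16

# Factor of 2 covers both keys and values.
kv_cache_bytes = 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_el
print(f"KV cache per request: {kv_cache_bytes / 1e9:.1f} GB")  # ~31.5 GB
```

Serving many concurrent long-context requests, or running many parallel samples per query, multiplies this figure, which is why memory capacity and bandwidth increasingly dominate accelerator design.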
13. Google Improves Deep Research
Google upgraded its Deep Research capabilities by migrating to the Gemini 2.5 Pro model. It has a different focus and a different mix of strengths and weaknesses than OpenAI's Deep Research, but external raters preferred it 70% to 30% in head-to-head comparisons (per Google's own data).
Five 5-minute reads/videos to keep you learning
1. Are AI Agents Sustainable? It Depends
This article explores whether AI agents are more environmentally and computationally sustainable than other AI systems. It focuses on three key factors: the type of model, the modality used, and how system choices are made in real-world deployments.
2. Decoding Strategies in Large Language Models
This is a deep dive into how LLMs generate text, covering greedy search, beam search, and sampling strategies like top-k and nucleus sampling (a minimal nucleus-sampling sketch appears after this list). The article includes GitHub and Colab links with working code for hands-on experimentation.
3. 5 Reasons Why Traditional Machine Learning is Alive and Well in the Age of LLMs
Despite the rise of LLMs, traditional machine learning still holds strong. This piece highlights five reasons why classical models continue to be crucial — especially for targeted, efficient solutions in specialized domains.
4. Strategic Planning with ChatGPT
This demo showcases how o1 pro mode can tackle complex business scenarios, such as crafting a step-by-step market entry strategy. It highlights the model’s ability to break down competitors, analyze trends, and identify growth opportunities.
5. The Best Open-Source OCR Models
OmniAI evaluated several open-source vision-language models for OCR tasks, including Qwen 2.5 VL, Gemma-3, Mistral-ocr, and Llama 4. Qwen 2.5 VL (72B) achieved the highest accuracy at 75%, outperforming other models. The benchmark assessed models on JSON extraction accuracy, cost, and latency across 1,000 documents using open-source datasets and methodologies.
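As a quick taster of the sampling strategies mentioned in item 2 above, here is a minimal nucleus (top-p) sampling sketch in plain NumPy; the toy logits and the p value are illustrative, and real implementations apply this to full vocabulary-sized distributions inside the decoding loop.

```python
import numpy as np

def nucleus_sample(logits, p=0.9, rng=None):
    """Sample a token id from the smallest set of top tokens whose
    cumulative probability exceeds p (top-p / nucleus sampling)."""
    rng = rng or np.random.default_rng()
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                         # softmax over the vocabulary
    order = np.argsort(probs)[::-1]              # most to least probable
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1  # smallest nucleus covering p
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=nucleus_probs))

# Toy next-token logits over a 6-token vocabulary (illustrative values only).
logits = np.array([2.0, 1.5, 0.3, -1.0, -2.0, -3.0])
print(nucleus_sample(logits, p=0.9))
```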
Repositories & Tools
AI Agents for Beginners contains 10 lessons on how to get started building AI agents.
vLLM is a fast and easy-to-use library for LLM inference and serving.
Wan 2.1 is an open suite of video foundation models for video generation.
Debug Gym is a text-based interactive debugging framework designed for debugging Python programs.
Top Papers of The Week
1. Inference-Time Scaling for Generalist Reward Modeling
This paper introduces Self-Principled Critique Tuning (SPCT) to enhance inference-time scalability in reward modeling for general queries. Using pointwise generative reward modeling, SPCT improves quality and scalability, outperforming existing approaches. The method employs parallel sampling and a meta-reward model for better performance.
2. Self-Steering Language Models
Researchers introduce DisCIPL, a self-steering framework where a Planner model generates recursive inference programs executed by Follower models. This decouples planning from execution, enabling efficient, verifiable reasoning. DisCIPL matches or outperforms larger models like GPT-4o on constrained generation tasks without fine-tuning.
3. Rec-R1: Bridging Generative Large Language Models and User-Centric Recommendation Systems via Reinforcement Learning
The paper presents Rec-R1, a reinforcement learning framework that directly optimizes LLM outputs using feedback from fixed black-box recommendation models. This approach outperforms prompting and SFT methods in product search and sequential recommendation tasks. Rec-R1 maintains the general-purpose capabilities of LLMs while enhancing recommendation performance.
4. OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens
This paper introduces OLMoTrace, the first system capable of tracing LLM outputs back to their multi-trillion-token training data in real-time. By identifying verbatim matches between model outputs and training documents, it aids in understanding model behavior, including fact-checking and detecting hallucinations. The open-source tool operates efficiently, returning results within seconds.
5. Towards Accurate Differential Diagnosis With Large Language Models
This study evaluates an LLM optimized for diagnostic reasoning, assessing its ability to generate differential diagnoses (DDx) independently and as a clinician aid. In trials with 20 clinicians on 302 challenging cases, the LLM outperformed unassisted clinicians and improved DDx quality when used as an assistive tool. Findings suggest the LLM’s potential to enhance diagnostic accuracy and support clinical decision-making.
Quick Links
1. Amazon introduced Nova Sonic, a new foundation model capable of natively processing voice and generating natural-sounding speech. Amazon claims Sonic’s performance is competitive with frontier voice models from OpenAI and Google on benchmarks measuring speed, speech recognition, and conversational quality.
2. ElevenLabs has rolled out a new version of its Professional Voice Clone (PVC) creation flow that makes creating an accurate clone of your voice much easier. Users can upload their recordings, trim clips, remove background noise, and pick up where they left off at any time.
Who’s Hiring in AI
Senior Machine Learning Scientist (Generative AI) — Viator @Tripadvisor (London, England)
A.I. Engineering Intern @Sezzle (Colombia/Remote)
Intern AI Systems Analyst @Acubed (USA/Remote)
Software Engineer @Isomorphic Labs (Switzerland/UK)
Developer Relations Engineer, AI Community Manager @Tenstorrent Inc. (Hybrid/Multiple US Locations)
Software Engineering Intern, Server @Strava (Denver, CO, USA)
Interested in sharing a job opportunity here? Contact [email protected].
Think a friend would enjoy this too? Share the newsletter and let them join the conversation.
Join over 80,000 data leaders on the AI newsletter and keep up to date with the latest developments in AI, from research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor.
Published via Towards AI