Author(s): Towards AI Editorial Team. Originally published on Towards AI.

What happened this week in AI by Louie

This week's AI discourse centered on DeepSeek's r1 release, which sparked a heated debate about its implications for OpenAI, GPUs, and the broader industry. Meanwhile, Google quietly rolled out an improved version of its own reasoning model, Gemini Flash 2.0 Thinking, raising its AIME benchmark score to 73.3% (from ~64% in December). OpenAI's announcement of its planned $500B Stargate data center project (a collaboration with SoftBank and Oracle) painted a contrasting picture: while DeepSeek refined efficiency, OpenAI appears to be doubling down on scale.

We have often covered DeepSeek's model releases and technical innovations over the past year, and last week I outlined r1's reinforcement learning (RL)-driven reasoning, its 30x lower API costs than OpenAI's o1, and its successful distillation into smaller models. This week, DeepSeek's models went viral, its chatbot leaped to the top of the app store, and reactions oscillated between "OpenAI is obsolete" and "DeepSeek's training costs are faked." In particular, many people cottoned on to DeepSeek's impressive training costs (just $5.6M of direct compute for the final v3 training run, announced in December) and r1's lower inference prices relative to OpenAI's o1. This led many to question whether huge multi-billion-dollar training clusters are still needed and whether the US has lost its AI lead.

We think much of the sudden reaction is overblown, and it's entertaining that the r1 price reduction made headlines while the invention and consequences of reasoning models themselves have gone largely unreported. DeepSeek has a very impressive research team with a productive culture and structure (including high vertical integration and fewer silos between teams), but we think the US still has more leading AI researchers and companies.
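As a quick sanity check on those headline numbers, a back-of-the-envelope sketch shows why price ratios overstate true cost ratios whenever one vendor prices near cost while the other takes a healthy margin. The helper name `implied_cost_ratio` is ours, and the inputs are the figures reported in this piece, not audited costs:

```python
def implied_cost_ratio(price_ratio: float, competitor_margin: float) -> float:
    """If vendor A prices near cost while vendor B takes the given gross
    margin, B's unit cost is its price * (1 - margin), so the underlying
    cost gap is the price gap shrunk by that factor."""
    return price_ratio * (1.0 - competitor_margin)

# r1 is priced ~30x below o1, but o1's vendor reportedly runs a 70%+ gross
# margin, so the implied underlying cost gap is closer to ~9x than 30x.
print(round(implied_cost_ratio(30.0, 0.70), 1))  # -> 9.0
```

The same adjustment applied to the ~10x v3-vs-4o price gap implies roughly a 3x cost gap under these assumptions.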
The main difference is that the best AI researchers in the US work for companies that are not GPU-poor, and their expertise has been directed toward scaling quickly rather than toward first-principles improvements and tweaks to LLM architectures and methods. In China, the best researchers instead flocked to DeepSeek, which is still relatively GPU-poor (due to sanctions) and has focused on finding next-generation methods to improve training and inference efficiency. Gains from their 10+ breakthroughs over the last two years (all publicly shared, many already included in the v2 release over eight months ago) added up to what looks like a cost-efficiency advantage over US labs. However, OpenAI and Anthropic reportedly run 70%+ gross margins, while DeepSeek's CEO said they price close to cost, so v3 is likely not 10x more efficient than 4o, nor r1 30x more efficient than o1, as the prices would imply. Nevertheless, it is still significant that such a capable model is now available open-source and that it could be trained and served at such an affordable price.

Why should you care?

We think it's great to see new LLM techniques, efficiency gains, and such a strong open-source reasoning model. Hopefully, the release will pressure OpenAI to also show its o1 reasoning tokens, reduce prices, and release the much stronger o3! We also see huge potential for the open-source community to build on top of these models, in particular by reinforcement fine-tuning them for new domains.

However, we don't think this is the end of building larger training clusters. Scaling laws still hold: all else equal, the more compute we put in, the more capable the models we get out. Algorithmic and technique efficiency gains on top of this just mean we get more out of our GPU clusters; they don't mean we won't get even more from larger training runs. More compute still stacks capability on top of all other improvements, so there is no loss of incentive to have the biggest cluster.
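The point that efficiency gains and bigger clusters stack can be sketched with a Chinchilla-style scaling law. The constants below are the published Hoffmann et al. (2022) fit; modeling an algorithmic gain as a simple multiplier on effective parameters and tokens is our illustrative assumption, not DeepSeek's actual curve:

```python
# Chinchilla-style loss law (Hoffmann et al. 2022 fit): loss falls as a
# power law in parameter count N and training tokens D. An algorithmic
# efficiency gain is modeled here (an illustrative assumption) as a
# multiplier on effective N and D; extra compute then stacks on top.

def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    E, A, B = 1.69, 406.4, 410.7   # fitted irreducible loss and coefficients
    alpha, beta = 0.34, 0.28       # fitted exponents
    return E + A / n_params**alpha + B / n_tokens**beta

N, D = 70e9, 1.4e12                          # e.g., a 70B model on 1.4T tokens
base = chinchilla_loss(N, D)
with_gains = chinchilla_loss(4 * N, 4 * D)   # 4x effective scale from better methods
scaled_too = chinchilla_loss(40 * N, 40 * D) # same methods plus 10x more raw scale

# Efficiency lowers loss, and a bigger cluster lowers it further on top.
assert with_gains < base and scaled_too < with_gains
```

Because the loss is monotonically decreasing in both N and D, any efficiency multiplier shifts the curve without flattening it, which is why the incentive to build the biggest cluster survives.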
It is no surprise to see OpenAI pushing ahead with its $500bn Stargate data center plan! The main news over the past four months is that we now also have test-time compute scaling laws, which add yet another vector to scale along: during training (via the RL process and synthetic data generation) and at inference time.

Lost somewhat in the noise over pricing is a potentially much more significant aspect of r1 we noted this week: despite being trained via rewards for solving math and LeetCode problems, it has also demonstrated significant improvements in creative writing. The r1 model now tops the EQ-Bench leaderboard for creative writing, with large gains over v3. We have also heard from many people who find Gemini Flash Thinking 2.0 better than Gemini Pro 2.0 for creative writing tasks. This raises the question of just how far the generalization potential of this new paradigm of reasoning LLMs can take us.

Hottest News

1. OpenAI Launches Operator, an AI Agent That Performs Tasks Autonomously
OpenAI launched a research preview of Operator, a general-purpose AI agent that can take control of a web browser and independently perform specific actions. Operator is powered by a new model called a Computer-Using Agent (CUA). Combining GPT-4o's vision capabilities with advanced reasoning through reinforcement learning, CUA is trained to interact with the buttons, menus, and text fields people see on a screen.

2. Google Releases Update to Gemini 2.0 Flash Thinking Model
Google quietly released another update to its own reasoning model, Gemini 2.0 Flash Thinking, first released in late December. Similar to DeepSeek-R1, and unlike OpenAI's o1, the Flash Thinking model displays its reasoning process. The model is currently available for free while in its experimental stage. It climbed to a score of 73.3% on AIME (vs. ~64% in December) and 74.2% on the GPQA Diamond science benchmark (vs. ~66% in December and 58.6% for the non-reasoning Flash 2.0 model).

3. 
Anthropic Introduces Citations To Reduce AI Errors
Anthropic unveiled a new feature for its developer API called Citations. The feature lets Claude ground its answers in source documents: it provides detailed references to the exact sentences and passages used to generate responses, leading to more verifiable, trustworthy outputs. Citations is available only for Claude 3.5 Sonnet and Claude 3.5 Haiku, and it may incur charges depending on the length and number of the source documents.

4. Hugging Face Shrinks AI Vision Model SmolVLM to Phone-Friendly Size
Hugging Face introduced vision-language models that run on devices as small as smartphones while outperforming predecessors that required massive data centers. The company's new SmolVLM-256M model, requiring less than one gigabyte of GPU memory, surpasses the performance of its Idefics 80B model from just 17 months ago, a system 300 times larger.

5. OpenAI Teams Up With SoftBank and Oracle on $500B Data Center Project
OpenAI announced it is teaming up with Japanese conglomerate SoftBank and with Oracle, among others, to build multiple data centers for AI in the U.S. The joint venture, the Stargate Project, intends to invest $500 billion over the next four years in new AI infrastructure for OpenAI in the United States.

6. Google Invests a Further $1Bn in OpenAI Rival Anthropic
Google is reportedly investing over $1 billion in Anthropic. This investment is separate from the nearly $2 billion funding round reported earlier this month, led by Lightspeed Venture Partners, which bumped the company's valuation to about $60 billion.

Five 5-minute reads/videos to keep you learning

1. Building Effective Agents
This post combines everything Anthropic has learned from working with customers and building agents itself, and shares practical advice for developers on building effective agents. It covers when and how to use agents, workflows, and more.

2. 
Inside DeepSeek-R1: The Amazing Model that Matches GPT-o1 on Reasoning at a Fraction of the Cost
One dominant thesis holds that big models are necessary to achieve reasoning. DeepSeek-R1 challenges that thesis by matching the performance of o1 at a fraction of the compute cost. This article explores the technical details of the DeepSeek-R1 architecture and training process, highlighting key innovations and contributions.

3. Agents Are All You Need vs. Agents Are Not Enough: A Dueling Perspective on AI's Future
The rapid evolution of AI has sparked a compelling debate: are autonomous agents sufficient to tackle complex tasks, or do they require integration within broader ecosystems to achieve optimal performance? As industry leaders and researchers share insights, the divide between these perspectives has grown more pronounced. This article presents arguments for both sides and offers a middle ground.

4. 10 FAQs on AI Agents: Decoding Google's Whitepaper in Simple Terms
The future of AI agents holds exciting advances, and we've only scratched the surface of what is possible. This article explores AI agents by diving into Google's Agents whitepaper and addressing the ten most common questions about them.

5. Image Segmentation Made Easy: A Guide to Ilastik and EasIlastik for Non-Experts
Tools like Ilastik and EasIlastik empower users to perform sophisticated image segmentation without writing a single line of code. This article explores what makes them so powerful, walks through how to use them, and shows how they can simplify image segmentation tasks, whatever your level of experience.

6. Why Everyone in AI Is Freaking Out About DeepSeek
Only a handful of people knew about DeepSeek a few days ago. Yet, thanks to the release of DeepSeek-R1, it's been arguably the most discussed company in Silicon Valley in the last few days.
This article explains what has led to this popularity.

Repositories & Tools

Open R1 is a fully open reproduction of DeepSeek-R1.
PaSa is an advanced paper search agent powered by LLMs that can autonomously make a series of decisions.

Top Papers of The Week

1. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
DeepSeek-R1-Zero and DeepSeek-R1 are reasoning models that perform comparably to OpenAI's o1-1217 on reasoning tasks. DeepSeek-R1-Zero is trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT), while DeepSeek-R1 adds multi-stage training and cold-start data before RL. Both models are open-sourced, along with dense models (in sizes 1.5B, 7B, 8B, 14B, 32B, and 70B) distilled from DeepSeek-R1 and based on Qwen and Llama.

2. Humanity's Last Exam
Humanity's Last Exam (HLE) is a multi-modal benchmark designed to be the final closed-ended academic benchmark with broad subject coverage. Developed by subject-matter experts, HLE comprises 3,000 multiple-choice and short-answer questions across dozens of subjects, including mathematics, the humanities, and the natural sciences. Each question has a known, unambiguous, and easily verifiable solution that cannot be quickly answered via internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE.

3. Evolving Deeper LLM Thinking
This paper explores an evolutionary search strategy for scaling inference-time compute in LLMs. It proposes a new approach, Mind Evolution, that uses a language model to generate, recombine, and refine candidate responses. Controlling for inference cost, Mind Evolution significantly outperforms other inference strategies, such as Best-of-N and Sequential Revision, on natural language planning tasks.

4. Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training
This paper proposes an iterative self-training framework, Agent-R, that enables language agents to reflect on the fly.
It leverages Monte Carlo Tree Search (MCTS) to construct training samples that recover correct trajectories from erroneous ones. It introduces a model-guided critique construction mechanism: the actor model identifies the first error step in a failed trajectory, and the trajectory is then spliced with the adjacent correct path, which shares the same parent node in the tree.

5. Reasoning Language Models: A Blueprint
This paper proposes a comprehensive blueprint that organizes reasoning language model (RLM) components into a modular framework, based on a survey and analysis of existing RLM work. It incorporates diverse reasoning structures, reasoning strategies, RL concepts, supervision schemes, and other related concepts, and provides detailed mathematical formulations and algorithmic specifications to simplify RLM implementation.

Quick Links

1. Meta AI released Llama Stack 0.1.0, the first stable release of a unified platform designed to simplify building and deploying generative AI applications. The platform offers backward-compatible upgrades, automated provider verification, and a consistent developer experience across local, cloud, and edge environments, addressing the complexity of infrastructure, essential capabilities, and flexibility in AI development.

2. Perplexity launched Sonar, an API service that allows enterprises and developers to integrate the startup's generative AI search tools into their applications.
Perplexity currently offers two tiers for developers: a cheaper, faster base version, Sonar, and a pricier version, Sonar Pro, which is better suited to tough questions.

Who's Hiring in AI

Developer and Technical Communications Lead @Anthropic (Multiple US Locations/Hybrid)
AI Algorithm Intern @INTEL (Poland/Hybrid)
Software Developer 3 @Oracle (Austin, TX, United States)
Data Scientist @Meta (Seattle, WA, USA)
Junior Software Engineer @Re-Leased (Napier, New Zealand)
Designated Technical Support Engineer @Glean (Palo Alto, CA, USA)
Gen AI Engineer | LLMOps @NEORIS (Spain)

Interested in sharing a job opportunity here? Contact [emailprotected].

Think a friend would enjoy this too? Share the newsletter and let them join the conversation.

Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI, from research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor.

Published via Towards AI