• Alibaba Qwen Team Releases Qwen3-Embedding and Qwen3-Reranker Series – Redefining Multilingual Embedding and Ranking Standards

    Text embedding and reranking are foundational to modern information retrieval systems, powering applications such as semantic search, recommendation systems, and retrieval-augmented generation (RAG). However, current approaches often face key challenges, particularly in achieving both high multilingual fidelity and task adaptability without relying on proprietary APIs. Existing models frequently fall short in scenarios requiring nuanced semantic understanding across multiple languages or domain-specific tasks like code retrieval and instruction following. Moreover, most open-source models lack either scale or flexibility, while commercial APIs remain costly and closed.
    Qwen3-Embedding and Qwen3-Reranker: A New Standard for Open-Source Embedding
    Alibaba’s Qwen Team has unveiled the Qwen3-Embedding and Qwen3-Reranker Series, models that set a new benchmark in multilingual text embedding and relevance ranking. Built on the Qwen3 foundation models, the series includes variants in 0.6B, 4B, and 8B parameter sizes and supports 119 languages, making it one of the most versatile and performant open-source offerings to date. These models are now open-sourced under the Apache 2.0 license on Hugging Face, GitHub, and ModelScope, and are also accessible via Alibaba Cloud APIs.
    These models are optimized for use cases such as semantic retrieval, classification, RAG, sentiment analysis, and code search—providing a strong alternative to existing solutions like Gemini Embedding and OpenAI’s embedding APIs.

    Technical Architecture
    Qwen3-Embedding models adopt a dense transformer-based architecture with causal attention, producing embeddings by extracting the hidden state corresponding to the [EOS] token. Instruction-awareness is a key feature: input queries are formatted as {instruction} {query}<|endoftext|>, enabling task-conditioned embeddings. The reranker models are trained with a binary classification format, judging document-query relevance in an instruction-guided manner using a token likelihood-based scoring function.
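    The two mechanisms described above can be illustrated with a minimal, dependency-free sketch. It is not the Qwen team's implementation: the function names are hypothetical, the pooling assumes right-padded sequences, and the reranker score assumes the likelihood is read from two candidate-token logits (here called "yes" and "no"), which the article does not spell out.

```python
import math

EOS = "<|endoftext|>"  # end-of-sequence marker quoted in the article


def format_query(instruction: str, query: str) -> str:
    """Build the instruction-conditioned input: {instruction} {query}<|endoftext|>."""
    return f"{instruction} {query}{EOS}"


def last_token_pool(hidden_states, attention_mask):
    """Take the hidden state of the final real token of each sequence.

    hidden_states: one [seq_len][dim] list of floats per sequence.
    attention_mask: one [seq_len] list of 0/1 per sequence.
    Assumption: sequences are right-padded, so the last real token
    (the EOS position) sits at index sum(mask) - 1.
    """
    pooled = []
    for states, mask in zip(hidden_states, attention_mask):
        last = sum(mask) - 1
        vec = states[last]
        # L2-normalise so that dot products equal cosine similarities.
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        pooled.append([x / norm for x in vec])
    return pooled


def relevance_score(yes_logit: float, no_logit: float) -> float:
    """Softmax probability of the hypothetical 'yes' token over {'yes', 'no'}."""
    return 1.0 / (1.0 + math.exp(no_logit - yes_logit))
```

    In a real pipeline the hidden states and logits would come from the model's forward pass; here they are plain lists so the logic can be checked in isolation.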

    The models are trained using a robust multi-stage training pipeline:

    Large-scale weak supervision: 150M synthetic training pairs generated using Qwen3-32B, covering retrieval, classification, STS, and bitext mining across languages and tasks.
    Supervised fine-tuning: 12M high-quality data pairs, selected using a cosine-similarity threshold (>0.7), are used to fine-tune the models for downstream applications.
    Model merging: Spherical linear interpolation (SLERP) of multiple fine-tuned checkpoints improves robustness and generalization.
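    For readers unfamiliar with the model-merging step, spherical linear interpolation between two weight vectors can be sketched as follows. This is a generic SLERP illustration under stated assumptions, not the merging code used for Qwen3: it treats each checkpoint as a single flattened, non-zero list of floats and merges only two of them, and the function name is hypothetical.

```python
import math


def slerp(w0, w1, t):
    """Spherically interpolate between weight vectors w0 and w1 at fraction t in [0, 1].

    Assumption: both vectors are non-zero and have the same length.
    """
    dot = sum(a * b for a, b in zip(w0, w1))
    n0 = math.sqrt(sum(a * a for a in w0))
    n1 = math.sqrt(sum(b * b for b in w1))
    # Clamp to guard against rounding pushing the cosine outside [-1, 1].
    cos_omega = max(-1.0, min(1.0, dot / (n0 * n1)))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if abs(sin_omega) < 1e-6:
        # Vectors are (anti)parallel: fall back to ordinary linear interpolation.
        return [(1.0 - t) * a + t * b for a, b in zip(w0, w1)]
    c0 = math.sin((1.0 - t) * omega) / sin_omega
    c1 = math.sin(t * omega) / sin_omega
    return [c0 * a + c1 * b for a, b in zip(w0, w1)]
```

    Unlike plain averaging, the interpolated vector follows the arc between the two checkpoints, so merging two unit-length vectors yields another unit-length vector.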

    This synthetic data generation pipeline enables control over data quality, language diversity, task difficulty, and more—resulting in a high degree of coverage and relevance in low-resource settings.
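    The cosine-similarity selection used to pick supervised fine-tuning pairs can be sketched as a simple filter. This is an illustration only: the article does not say which vectors are compared, so the sketch assumes each candidate pair comes with a query embedding and a document embedding, and the tuple layout, function names, and 0.7 default threshold are choices made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def filter_pairs(pairs, threshold=0.7):
    """Keep (query, document) pairs whose embeddings are similar enough.

    pairs: iterable of (query, document, query_vec, doc_vec) tuples
    (a hypothetical layout chosen for this sketch).
    """
    return [
        (query, doc)
        for query, doc, q_vec, d_vec in pairs
        if cosine(q_vec, d_vec) > threshold
    ]
```

    Raising the threshold trades dataset size for pair quality, which is the kind of control over data quality the pipeline is described as providing.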
    Performance Benchmarks and Insights
    The Qwen3-Embedding and Qwen3-Reranker series demonstrate strong empirical performance across several multilingual benchmarks.

    On MMTEB (216 tasks across 250+ languages): Qwen3-Embedding-8B achieves a mean task score of 70.58, surpassing Gemini Embedding and the GTE-Qwen2 series.
    On MTEB (English v2): Qwen3-Embedding-8B reaches 75.22, outperforming other open models including NV-Embed-v2 and GritLM-7B.
    On MTEB-Code: Qwen3-Embedding-8B leads with 80.68, excelling in applications like code retrieval and Stack Overflow QA.

    For reranking:

    Qwen3-Reranker-0.6B already outperforms Jina and BGE rerankers.
    Qwen3-Reranker-8B achieves 81.22 on MTEB-Code and 72.94 on MMTEB-R, marking state-of-the-art performance.

    Ablation studies confirm the necessity of each training stage. Removing synthetic pretraining or model merging led to significant performance drops (up to 6 points on MMTEB), underscoring their contributions.
    Conclusion
    Alibaba’s Qwen3-Embedding and Qwen3-Reranker Series present a robust, open, and scalable solution to multilingual and instruction-aware semantic representation. With strong empirical results across MTEB, MMTEB, and MTEB-Code, these models bridge the gap between proprietary APIs and open-source accessibility. Their thoughtful training design—leveraging high-quality synthetic data, instruction-tuning, and model merging—positions them as ideal candidates for enterprise applications in search, retrieval, and RAG pipelines. By open-sourcing these models, the Qwen team not only pushes the boundaries of language understanding but also empowers the broader community to innovate on top of a solid foundation.

    Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among audiences.
    WWW.MARKTECHPOST.COM
  • Manus has kick-started an AI agent boom in China

    Last year, China saw a boom in foundation models, the do-everything large language models that underpin the AI revolution. This year, the focus has shifted to AI agents—systems that are less about responding to users’ queries and more about autonomously accomplishing things for them. 

    There are now a host of Chinese startups building these general-purpose digital tools, which can answer emails, browse the internet to plan vacations, and even design an interactive website. Many of these have emerged in just the last two months, following in the footsteps of Manus—a general AI agent that sparked weeks of social media frenzy for invite codes after its limited-release launch in early March. 

    These emerging AI agents aren’t large language models themselves. Instead, they’re built on top of them, using a workflow-based structure designed to get things done. A lot of these systems also introduce a different way of interacting with AI. Rather than just chatting back and forth with users, they are optimized for managing and executing multistep tasks—booking flights, managing schedules, conducting research—by using external tools and remembering instructions. 

    China could take the lead on building these kinds of agents. The country’s tightly integrated app ecosystems, rapid product cycles, and digitally fluent user base could provide a favorable environment for embedding AI into daily life. 

    For now, its leading AI agent startups are focusing their attention on the global market, because the best Western models don’t operate inside China’s firewalls. But that could change soon: Tech giants like ByteDance and Tencent are preparing their own AI agents that could bake automation directly into their native super-apps, pulling data from their vast ecosystem of programs that dominate many aspects of daily life in the country. 

    As the race to define what a useful AI agent looks like unfolds, a mix of ambitious startups and entrenched tech giants are now testing how these tools might actually work in practice—and for whom.

    Set the standard

    It’s been a whirlwind few months for Manus, which was developed by the Wuhan-based startup Butterfly Effect. The company raised million in a funding round led by the US venture capital firm Benchmark, took the product on an ambitious global roadshow, and hired dozens of new employees. 

    Even before registration opened to the public in May, Manus had become a reference point for what a broad, consumer‑oriented AI agent should accomplish. Rather than handling narrow chores for businesses, this “general” agent is designed to be able to help with everyday tasks like trip planning, stock comparison, or your kid’s school project. 

    Unlike previous AI agents, Manus uses a browser-based sandbox that lets users supervise the agent like an intern, watching in real time as it scrolls through web pages, reads articles, or codes actions. It also proactively asks clarifying questions and supports long-term memory that serves as context for future tasks.

    “Manus represents a promising product experience for AI agents,” says Ang Li, cofounder and CEO of Simular, a startup based in Palo Alto, California, that’s building computer use agents, AI agents that control a virtual computer. “I believe Chinese startups have a huge advantage when it comes to designing consumer products, thanks to cutthroat domestic competition that leads to fast execution and greater attention to product details.”

    In the case of Manus, the competition is moving fast. Two of the most buzzy follow‑ups, Genspark and Flowith, for example, are already boasting benchmark scores that match or edge past Manus’s. 

    Genspark, led by former Baidu executives Eric Jing and Kay Zhu, links many small “super agents” through what it calls multi‑component prompting. The agent can switch among several large language models, accepts both images and text, and carries out tasks from making slide decks to placing phone calls. Whereas Manus relies heavily on Browser Use, a popular open-source product that lets agents operate a web browser in a virtual window like a human, Genspark directly integrates with a wide array of tools and APIs. Launched in April, the company says that it already has over 5 million users and over million in yearly revenue.

    Flowith, the work of a young team that first grabbed public attention in April 2025 at a developer event hosted by the popular social media app Xiaohongshu, takes a different tack. Marketed as an “infinite agent,” it opens on a blank canvas where each question becomes a node on a branching map. Users can backtrack, take new branches, and store results in personal or sharable “knowledge gardens,” a design that feels more like project management software than a typical chat interface. Every inquiry or task builds its own mind-map-like graph, encouraging a more nonlinear and creative interaction with AI. Flowith’s core agent, NEO, runs in the cloud and can perform scheduled tasks like sending emails and compiling files. The founders want the app to be a “knowledge marketbase” and aim to tap into the social aspect of AI, with the aspiration of becoming “the OnlyFans of AI knowledge creators.”

    What they also share with Manus is the global ambition. Both Genspark and Flowith have stated that their primary focus is the international market.

    A global address

    Startups like Manus, Genspark, and Flowith—though founded by Chinese entrepreneurs—could blend seamlessly into the global tech scene and compete effectively abroad. Founders, investors, and analysts that MIT Technology Review has spoken to believe Chinese companies are moving fast, executing well, and quickly coming up with new products. 

    Money reinforces the pull to launch overseas. Customers there pay more, and there are plenty to go around. “You can price in USD, and with the exchange rate that’s a sevenfold multiplier,” Manus cofounder Xiao Hong quipped on a podcast. “Even if we’re only operating at 10% power because of cultural differences overseas, we’ll still make more than in China.”

    But creating the same functionality in China is a challenge. Major US AI companies including OpenAI and Anthropic have opted out of mainland China because of geopolitical risks and challenges with regulatory compliance. Their absence initially created a black market as users resorted to VPNs and third-party mirrors to access tools like ChatGPT and Claude. That vacuum has since been filled by a new wave of Chinese chatbots—DeepSeek, Doubao, Kimi—but the appetite for foreign models hasn’t gone away. 

    Manus, for example, uses Anthropic’s Claude Sonnet—widely considered the top model for agentic tasks. Manus cofounder Zhang Tao has repeatedly praised Claude’s ability to juggle tools, remember contexts, and hold multi‑round conversations—all crucial for turning chatty software into an effective executive assistant.

    But the company’s use of Sonnet has made its agent functionally unusable inside China without a VPN. If you open Manus from a mainland IP address, you’ll see a notice explaining that the team is “working on integrating Qwen’s model,” a special local version that is built on top of Alibaba’s open-source model. 

    An engineer overseeing ByteDance’s work on developing an agent, who spoke to MIT Technology Review anonymously to avoid sanction, said that the absence of Claude Sonnet models “limits everything we do in China.” DeepSeek’s open models, he added, still hallucinate too often and lack training on real‑world workflows. Developers we spoke with rank Alibaba’s Qwen series as the best domestic alternative, yet most say that switching to Qwen knocks performance down a notch.

    Jiaxin Pei, a postdoctoral researcher at Stanford’s Institute for Human‑Centered AI, thinks that gap will close: “Building agentic capabilities in base LLMs has become a key focus for many LLM builders, and once people realize the value of this, it will only be a matter of time.”

    For now, Manus is doubling down on audiences it can already serve. In a written response, the company said its “primary focus is overseas expansion,” noting that new offices in San Francisco, Singapore, and Tokyo have opened in the past month.

    A super‑app approach

    Although the concept of AI agents is still relatively new, the consumer-facing AI app market in China is already crowded with major tech players. DeepSeek remains the most widely used, while ByteDance’s Doubao and Moonshot’s Kimi have also become household names. However, most of these apps are still optimized for chat and entertainment rather than task execution. This gap in the local market has pushed China’s big tech firms to roll out their own user-facing agents, though early versions remain uneven in quality and rough around the edges. 

    ByteDance is testing Coze Space, an AI agent based on its own Doubao model family that lets users toggle between “plan” and “execute” modes, so they can either directly guide the agent’s actions or step back and watch it work autonomously. It connects up to 14 popular apps, including GitHub, Notion, and the company’s own Lark office suite. Early reviews say the tool can feel clunky and has a high failure rate, but it clearly aims to match what Manus offers.

    Meanwhile, Zhipu AI has released a free agent called AutoGLM Rumination, built on its proprietary ChatGLM models. Shanghai‑based Minimax has launched Minimax Agent. Both products look almost identical to Manus and demo basic tasks such as building a simple website, planning a trip, making a small Flash game, or running quick data analysis.

    Despite the limited usability of most general AI agents launched within China, big companies have plans to change that. During a May 15 earnings call, Tencent president Liu Zhiping teased an agent that would weave automation directly into China’s most ubiquitous app, WeChat. 

    Considered the original super-app, WeChat already handles messaging, mobile payments, news, and millions of mini‑programs that act like embedded apps. These programs give Tencent, its developer, access to data from millions of services that pervade everyday life in China, an advantage most competitors can only envy.

    Historically, China’s consumer internet has splintered into competing walled gardens—share a Taobao link in WeChat and it resolves as plaintext, not a preview card. Unlike the more interoperable Western internet, China’s tech giants have long resisted integration with one another, choosing to wage platform war at the expense of a seamless user experience.

    But the use of mini‑programs has given WeChat unprecedented reach across services that once resisted interoperability, from gym bookings to grocery orders. An agent able to roam that ecosystem could bypass the integration headaches dogging independent startups.

    Alibaba, the e-commerce giant behind the Qwen model series, has been a front-runner in China’s AI race but has been slower to release consumer-facing products. Even though Qwen was the most downloaded open-source model on Hugging Face in 2024, it didn’t power a dedicated chatbot app until early 2025. In March, Alibaba rebranded its cloud storage and search app Quark into an all-in-one AI search tool. By June, Quark had introduced DeepResearch—a new mode that marks its most agent-like effort to date. 

    ByteDance and Alibaba did not reply to MIT Technology Review’s request for comments.

    “Historically, Chinese tech products tend to pursue the all-in-one, super-app approach, and the latest Chinese AI agents reflect just that,” says Li of Simular, who previously worked at Google DeepMind on AI-enabled work automation. “In contrast, AI agents in the US are more focused on serving specific verticals.”

    Pei, the researcher at Stanford, says that existing tech giants could have a huge advantage in bringing the vision of general AI agents to life—especially those with built-in integration across services. “The customer-facing AI agent market is still very early, with tons of problems like authentication and liability,” he says. “But companies that already operate across a wide range of services have a natural advantage in deploying agents at scale.”
  • Enigmata’s Multi-Stage and Mix-Training Reinforcement Learning Recipe Drives Breakthrough Performance in LLM Puzzle Reasoning

    Large Reasoning Models (LRMs), trained from LLMs using reinforcement learning, have demonstrated strong performance in complex reasoning tasks, including mathematics, STEM, and coding. However, existing LRMs struggle with various puzzle tasks that require purely logical reasoning and that humans find easy and obvious. Current work on puzzles focuses only on designing evaluation benchmarks and lacks the training methods and resources that modern LLMs need to tackle this challenge. Existing puzzle datasets also lack diversity and scalability, covering limited puzzle types with little control over generation or difficulty. Moreover, given the success of the “LLM+RLVR” paradigm, it has become crucial to obtain large, diverse, and challenging sets of verifiable puzzle prompts for training agents.
    Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a key method for improving models’ reasoning capabilities, removing the need for reward models by directly assigning rewards based on objectively verifiable answers. Puzzles are particularly well suited to RLVR, yet most prior RLVR research has overlooked their potential for delivering effective reward signals. Existing benchmarks for LLM puzzle reasoning evaluate different types of reasoning, including abstract, deductive, and compositional reasoning. A few benchmarks support scalable generation and difficulty control, but they lack puzzle diversity. Efforts to improve LLMs’ puzzle-solving abilities mainly fall into two categories: tool integration and RLVR.
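To make the idea of a verifiable reward concrete, here is a minimal sketch in Python. It is not taken from the Enigmata code: the `Answer:` line convention and the function names `extract_answer` and `verifiable_reward` are assumptions chosen for illustration. The point is only that the reward comes from a deterministic check rather than from a learned reward model.

```python
import re
from typing import Optional


def extract_answer(response: str) -> Optional[str]:
    """Return the text of the last 'Answer: ...' line, or None if there is none.

    The 'Answer:' convention is an assumption made for this sketch.
    """
    matches = re.findall(r"^Answer:\s*(.+?)\s*$", response, flags=re.MULTILINE)
    return matches[-1] if matches else None


def verifiable_reward(response: str, expected: str) -> float:
    """Binary reward: 1.0 if the extracted answer equals the expected one, else 0.0."""
    answer = extract_answer(response)
    return 1.0 if answer is not None and answer == expected else 0.0


if __name__ == "__main__":
    print(verifiable_reward("17 + 25 is 42.\nAnswer: 42", "42"))
    print(verifiable_reward("I am not sure.", "42"))
```

A response with no parsable answer earns zero reward, so under this sketch the model is rewarded only for producing a checkable final answer.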
    Researchers from ByteDance Seed, Fudan University, Tsinghua University, Nanjing University, and Shanghai Jiao Tong University have proposed Enigmata, the first comprehensive toolkit designed for improving LLMs with puzzle reasoning skills. It contains 36 tasks across seven categories, each featuring a generator that produces unlimited examples with controllable difficulty and a rule-based verifier for automatic evaluation. The researchers further developed Enigmata-Eval as a rigorous benchmark and created optimized multi-task RLVR strategies. When used to train larger models such as Seed1.5-Thinking, puzzle data from Enigmata also improves state-of-the-art performance on advanced math and STEM reasoning tasks such as AIME, BeyondAIME, and GPQA, which shows the generalization benefits of Enigmata.
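The generator-plus-verifier pairing can be sketched as follows. This is a toy illustration, not one of the 36 Enigmata tasks: the subset-sum puzzle, the `Puzzle` class, and the `generate` and `verify` functions are all hypothetical, and the mapping from difficulty to puzzle size is an assumption.

```python
import random
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Puzzle:
    """Toy task: pick some of `numbers` so that they sum to `target`."""
    numbers: Tuple[int, ...]
    target: int


def generate(difficulty: int, seed: int) -> Puzzle:
    """Hypothetical generator: higher difficulty means more numbers to choose from."""
    rng = random.Random(seed)  # seeded, so every instance is reproducible
    count = 3 + 2 * difficulty
    numbers = tuple(rng.randint(1, 20) for _ in range(count))
    # Derive the target from a random non-empty subset so a solution always exists.
    chosen = rng.sample(numbers, rng.randint(1, count))
    return Puzzle(numbers=numbers, target=sum(chosen))


def verify(puzzle: Puzzle, candidate: List[int]) -> bool:
    """Rule-based check: the candidate uses only available numbers and hits the target."""
    if not candidate:
        return False
    available = Counter(puzzle.numbers)
    used = Counter(candidate)
    within_supply = all(used[n] <= available[n] for n in used)
    return within_supply and sum(candidate) == puzzle.target


if __name__ == "__main__":
    puzzle = generate(difficulty=1, seed=7)
    print(puzzle)
    print(verify(puzzle, list(puzzle.numbers)))
```

Because `verify` checks the rules instead of comparing against one stored solution, any valid subset is accepted. That property is what lets a verifier double as a reward signal when a puzzle has several correct answers.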

    Enigmata-Data comprises 36 puzzle tasks organized into 7 primary categories: Crypto, Arithmetic, Logic, Grid, Graph, Search, and Sequential Puzzle. This makes it the only dataset that combines multiple task categories with scalability, automatic verification, and public availability. The data construction follows a three-phase pipeline: Tasks Collection and Design, Auto-Generator and Verifier Development, and Sliding Difficulty Control. Enigmata-Eval is built by systematically sampling from the broader dataset, aiming to extract 50 instances per difficulty level for each task. The final evaluation set contains 4,758 puzzle instances rather than the theoretical maximum of 5,400, because inherent constraints cause some tasks to generate fewer instances per difficulty level.
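The evaluation-set figures above can be checked with a few lines of arithmetic. The number of difficulty levels is not stated in the text; the sketch below infers it from the other figures, assuming every task uses the same number of levels.

```python
TASKS = 36               # puzzle tasks in Enigmata-Data
PER_LEVEL = 50           # target instances per difficulty level, per task
THEORETICAL_MAX = 5_400  # stated upper bound on Enigmata-Eval size
ACTUAL = 4_758           # stated final size of Enigmata-Eval

# Inferred, not stated in the text: difficulty levels per task, assuming uniformity.
levels = THEORETICAL_MAX // (TASKS * PER_LEVEL)
shortfall = THEORETICAL_MAX - ACTUAL
coverage = ACTUAL / THEORETICAL_MAX

if __name__ == "__main__":
    print(f"implied difficulty levels per task: {levels}")
    print(f"instances missing from the theoretical maximum: {shortfall}")
    print(f"share of the theoretical maximum reached: {coverage:.1%}")
```

Under that assumption the figures imply three difficulty levels per task, with 642 instances missing from the theoretical maximum.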

    With 32B parameters, the proposed model, Qwen2.5-32B-Enigmata, outperforms most public models on Enigmata-Eval, showing the effectiveness of the dataset and training recipe. It also stands out on the challenging ARC-AGI benchmark, surpassing strong reasoning models such as Gemini 2.5 Pro, o3-mini, and o1. The model performs especially well in structured reasoning categories, excelling in Crypto, Arithmetic, and Logic tasks, which suggests it has developed effective rule-based reasoning capabilities. It is competitive in search tasks that require strategic exploration and planning. Crypto and Arithmetic tasks tend to yield the highest accuracy, while spatial and sequential tasks remain more difficult.
    In this paper, the researchers introduced Enigmata, a comprehensive suite for equipping LLMs with advanced puzzle reasoning that integrates seamlessly with RL using verifiable rule-based rewards. The trained Enigmata-Model shows superior performance and robust generalization through RLVR training. Experiments reveal that when applied to larger models such as Seed1.5-Thinking, synthetic puzzle data brings additional benefits in other domains, yielding gains in mathematics and STEM reasoning over state-of-the-art models. Enigmata provides a solid foundation for the research community to advance reasoning model development, offering a unified framework that bridges logical puzzle-solving with broader reasoning capabilities in LLMs.

    Sajjad Ansari is a final year undergraduate from IIT Kharagpur. As a tech enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.
    #enigmatas #multistage #mixtraining #reinforcement #learning
    Enigmata’s Multi-Stage and Mix-Training Reinforcement Learning Recipe Drives Breakthrough Performance in LLM Puzzle Reasoning
    Large Reasoning Models, trained from LLMs using reinforcement learning, demonstrated great performance in complex reasoning tasks, including mathematics, STEM, and coding. However, existing LRMs face challenges in completing various puzzle tasks that require purely logical reasoning skills, which are easy and obvious for humans. Current methods targeting puzzles focus only on designing benchmarks for evaluation, lacking the training methods and resources for modern LLMs to tackle this challenge. Current puzzle datasets lack diversity and scalability, covering limited puzzle types with little control over generation or difficulty. Moreover, due to the success of the “LLM+RLVR” paradigm, it has become crucial to obtain large, diverse, and challenging sets of verifiable puzzle prompts for training agents. Reinforcement Learning with Verifiable Rewardshas emerged as a key method for improving models’ reasoning capabilities, removing the need for reward models by directly assigning rewards based on objectively verifiable answers. Puzzles are particularly well-suited for RLVR. However, most prior RLVR research has overlooked the puzzles’ potential for delivering effective reward signals. In puzzle reasoning of LLMs, existing benchmarks evaluate different types of reasoning, including abstract, deductive, and compositional reasoning. Few benchmarks support scalable generation and difficulty control but lack puzzle diversity. Moreover, the improvement of LLMs’ puzzle-solving abilities mainly falls into two categories: tool integration and RLVR. Researchers from ByteDance Seed, Fudan University, Tsinghua University, Nanjing University, and Shanghai Jiao Tong University have proposed Enigmata, the first comprehensive toolkit designed for improving LLMs with puzzle reasoning skills. It contains 36 tasks across seven categories, each featuring a generator that produces unlimited examples with controllable difficulty and a rule-based verifier for automatic evaluation. 
The researchers further developed Enigmata-Eval as a rigorous benchmark and created optimized multi-task RLVR strategies. Puzzle data from Enigmata enhances SoTA performance on advanced math and STEM reasoning tasks like AIME, BeyondAIME, and GPQA when trained on larger models like Seed1.5-Thinking. This shows the generalization benefits of Enigmata. The Enigmata-Data comprises 36 puzzle tasks organized into 7 primary categories, including Crypto, Arithmetic, Logic, Grid, Graph, Search, and Sequential Puzzle, making it the only dataset having multiple task categories with scalability, automatic verification, and public availability. The data construction follows a three-phase pipeline: Tasks Collection and Design, Auto-Generator and Verifier Development, and Sliding Difficulty Control. Moreover, the Enigmata-Eval is developed by systematically sampling from the broader dataset, aiming to extract 50 instances per difficulty level for each task. The final evaluation set contains 4,758 puzzle instances rather than the theoretical maximum of 5,400, due to inherent constraints, where some tasks generate fewer instances per difficulty level. The proposed model outperforms most public models on Enigmata-Eval with 32B parameters, showing the effectiveness of the dataset and training recipe. The model stands out on the challenging ARC-AGI benchmark, surpassing strong reasoning models such as Gemini 2.5 Pro, o3-mini, and o1. The Qwen2.5-32B-Enigmata shows outstanding performance in structured reasoning categories, outperforming in Crypto, Arithmetic, and Logic tasks, suggesting effective development of rule-based reasoning capabilities. The model shows competitive performance in search tasks that require strategic exploration and planning capabilities. Moreover, Crypto and Arithmetic tasks tend to provide the highest accuracy, while spatial and sequential tasks remain more difficult. 
    In this paper, the researchers introduce Enigmata, a comprehensive suite for equipping LLMs with advanced puzzle reasoning that integrates seamlessly with RL using verifiable rule-based rewards. The trained Enigmata-Model shows superior performance and robust generalization through RLVR training. Experiments reveal that when applied to larger models such as Seed1.5-Thinking (20B/200B parameters), synthetic puzzle data brings additional gains over state-of-the-art models in other domains, including mathematics and STEM reasoning. Enigmata provides a solid foundation for the research community to advance reasoning model development, offering a unified framework that bridges logical puzzle-solving with broader reasoning capabilities in LLMs.
    All credit for this research goes to the researchers of this project.
    Sajjad Ansari is a final year undergraduate from IIT Kharagpur. As a tech enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.
    WWW.MARKTECHPOST.COM
    Enigmata’s Multi-Stage and Mix-Training Reinforcement Learning Recipe Drives Breakthrough Performance in LLM Puzzle Reasoning
  • Huawei Supernode 384 disrupts Nvidia’s AI market hold

    Huawei’s AI capabilities have made a breakthrough in the form of the company’s Supernode 384 architecture, marking an important moment in the global processor wars amid US-China tech tensions.
    The Chinese tech giant’s latest innovation emerged from last Friday’s Kunpeng Ascend Developer Conference in Shenzhen, where company executives demonstrated how the computing framework directly challenges Nvidia’s long-standing market dominance, even as the company continues to operate under severe US-led trade restrictions.
    Architectural innovation born from necessity
    Zhang Dixuan, president of Huawei’s Ascend computing business, articulated the fundamental problem driving the innovation during his conference keynote: “As the scale of parallel processing grows, cross-machine bandwidth in traditional server architectures has become a critical bottleneck for training.”
    The Supernode 384 abandons Von Neumann computing principles in favour of a peer-to-peer architecture engineered specifically for modern AI workloads. The change proves especially powerful for Mixture-of-Experts models (machine-learning systems that use multiple specialised sub-networks to solve complex computational challenges).
    Huawei’s CloudMatrix 384 implementation showcases impressive technical specifications: 384 Ascend AI processors spanning 12 computing cabinets and four bus cabinets, generating 300 petaflops of raw computational power paired with 48 terabytes of high-bandwidth memory, a leap in integrated AI computing infrastructure.
    Performance metrics challenge industry leaders
    Real-world benchmark testing reveals the system’s competitive positioning against established solutions. Dense AI models like Meta’s LLaMA 3 achieved 132 tokens per second per card on the Supernode 384, 2.5 times the performance of traditional cluster architectures.
    Communications-intensive applications demonstrate even more dramatic improvements. Models from Alibaba’s Qwen and DeepSeek families reached 600 to 750 tokens per second per card, revealing the architecture’s optimisation for next-generation AI workloads.
    The performance gains stem from fundamental infrastructure redesigns. Huawei replaced conventional Ethernet interconnects with high-speed bus connections, improving communications bandwidth by 15 times while reducing single-hop latency from 2 microseconds to 200 nanoseconds, a tenfold improvement.
    Geopolitical strategy drives technical innovation
    The Supernode 384’s development cannot be divorced from broader US-China technological competition. American sanctions have systematically restricted Huawei’s access to cutting-edge semiconductor technologies, forcing the company to maximise performance within existing constraints.
    Industry analysis from SemiAnalysis suggests the CloudMatrix 384 uses Huawei’s latest Ascend 910C AI processor. The analysis acknowledges inherent performance limitations but highlights architectural advantages: “Huawei is a generation behind in chips, but its scale-up solution is arguably a generation ahead of Nvidia and AMD’s current products in the market.”
    The assessment reveals how Huawei’s AI computing strategies have evolved beyond traditional hardware specifications toward system-level optimisation and architectural innovation.
    Market implications and deployment reality
    Beyond laboratory demonstrations, Huawei has operationalised CloudMatrix 384 systems in multiple Chinese data centres in Anhui Province, Inner Mongolia, and Guizhou Province. Such practical deployments validate the architecture’s viability and establish an infrastructure framework for broader market adoption.
    The system’s scalability potential, supporting tens of thousands of linked processors, positions it as a compelling platform for training increasingly sophisticated AI models. The capability addresses growing industry demands for massive-scale AI implementation in diverse sectors.
    Industry disruption and future considerations
    Huawei’s architectural breakthrough introduces both opportunities and complications for the global AI ecosystem. While providing viable alternatives to Nvidia’s market-leading solutions, it simultaneously accelerates the fragmentation of international technology infrastructure along geopolitical lines.
    The success of Huawei’s AI computing initiatives will depend on developer ecosystem adoption and sustained performance validation. The company’s aggressive developer conference outreach indicated a recognition that technical innovation alone cannot guarantee market acceptance.
    For organisations evaluating AI infrastructure investments, the Supernode 384 represents a new option that combines competitive performance with independence from US-controlled supply chains. However, long-term viability remains contingent on continued innovation cycles and improved geopolitical stability.
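    The headline CloudMatrix 384 figures quoted in this article can be broken down per processor with simple arithmetic. The sketch below uses only numbers from the article and makes two assumptions that the article does not confirm: that compute and memory are spread evenly across processors, and that terabytes are decimal (1 TB = 1,000 GB). The article also does not say which numeric precision the 300-petaflop figure refers to.

```python
PROCESSORS = 384           # Ascend AI processors in one CloudMatrix 384
COMPUTE_CABINETS = 12      # computing cabinets (bus cabinets excluded)
TOTAL_PFLOPS = 300         # quoted aggregate compute
TOTAL_HBM_TB = 48          # quoted aggregate high-bandwidth memory
LATENCY_BEFORE_NS = 2_000  # 2 microseconds over conventional Ethernet
LATENCY_AFTER_NS = 200     # 200 nanoseconds over the high-speed bus

# Assumes an even split across processors, which the article does not state.
chips_per_cabinet = PROCESSORS // COMPUTE_CABINETS
pflops_per_chip = TOTAL_PFLOPS / PROCESSORS
hbm_gb_per_chip = TOTAL_HBM_TB * 1_000 / PROCESSORS  # decimal TB assumed
latency_ratio = LATENCY_BEFORE_NS / LATENCY_AFTER_NS

print(f"{chips_per_cabinet} processors per computing cabinet")
print(f"{pflops_per_chip:.2f} petaflops per processor")
print(f"{hbm_gb_per_chip:.0f} GB of HBM per processor")
print(f"{latency_ratio:.0f}x lower single-hop latency")
```

    The latency ratio matches the "tenfold improvement" stated in the text; the per-processor figures are derived here and are not quoted by Huawei.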
    WWW.ARTIFICIALINTELLIGENCE-NEWS.COM
  • The DeepSeek R1 update proves it's an active threat to OpenAI and Google

    DeepSeek's R1 update, plus the rest of the AI news this week.
    Credit: Thomas Fuller / SOPA Images / LightRocket / Getty Images

    This week, DeepSeek released an updated version of its R1 model on Hugging Face, reigniting the open-source versus closed-source competition. The updated version, called DeepSeek-R1-0528, has 685 billion parameters, up from the 671 billion of January's version. Unlike OpenAI's and Google's models, which are famously closed-source, DeepSeek's model weights are publicly available. According to the benchmarks, the R1-0528 update has improved reasoning and inference capabilities and is closing the gap with OpenAI's o3 and Google's Gemini 2.5 Pro. DeepSeek also introduced a distilled version of R1-0528 using Alibaba's Qwen3 8B model. This is an example of a lightweight model that is less capable but also requires less computing power. DeepSeek-R1-0528-Qwen3-8B outperforms both Google's latest lightweight model, Gemini-2.5-Flash-Thinking-0520, and OpenAI's o3-mini in certain benchmarks. But the bigger deal is that DeepSeek's distilled model can reportedly run on a single GPU, according to TechCrunch.
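    A rough back-of-the-envelope estimate helps show why an 8-billion-parameter distilled model is plausible on a single GPU while the full 685-billion-parameter model is not. The sketch below is only an estimate of weight storage: the bytes-per-parameter values are assumptions (the article does not say which precision is used), `weight_memory_gb` is a hypothetical helper, and the estimate ignores activations, the KV cache, and runtime overhead.

```python
def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate memory needed to hold the model weights, in decimal GB."""
    total_bytes = params_billions * 1e9 * bytes_per_param
    return total_bytes / 1e9


# Assumed storage precisions; the article does not specify any of these.
PRECISIONS = {"16-bit": 2.0, "8-bit": 1.0, "4-bit": 0.5}

# Parameter counts taken from the article (8B is the nominal Qwen3 8B size).
MODELS = {"DeepSeek-R1-0528": 685, "DeepSeek-R1-0528-Qwen3-8B": 8}

for name, params in MODELS.items():
    for label, nbytes in PRECISIONS.items():
        print(f"{name} at {label}: ~{weight_memory_gb(params, nbytes):,.0f} GB")
```

    Under these assumptions, the distilled model's weights need about 16 GB at 16-bit precision, while the full model's weights need well over a terabyte.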

    To… distill all this information, the Chinese rival is catching up to its U.S. competitors with an open-weight approach that's cheaper and more accessible. Plus, DeepSeek continues to prove that AI models may not require as much computing power as OpenAI, Google, and other AI heavyweights currently use. Suffice to say, watch this space.

    That said, DeepSeek's models also have their drawbacks. According to one AI developer, the new DeepSeek update is even more censored than its previous version when it comes to criticism of the Chinese government.

    Of course, a lot more happened in the AI world over the past few days. After last week's parade of AI events from Google, Anthropic, and Microsoft, this week was lighter on product and feature news. That's one reason DeepSeek's R1 update captured the AI world's attention this week. In other AI news, Anthropic finally gets voice mode, AI influencers go viral, Anthropic's CEO warns of mass layoffs, and there's an AI-generated kangaroo.

    Google's Veo 3 takes the internet by storm

    On virtually every social media platform, users are freaking out about Veo 3, Google's new AI video model. The results are impressive, and we're already seeing short films made entirely with Veo 3. Not bad for a product that came out 11 days ago.

    Not to be outdone by AI video artists, a reporter from The Wall Street Journal made a short film about herself and a robot using Veo 3. Mashable's Tech Editor Timothy Werth recapped Veo's big week and had a simple conclusion: We're so cooked.

    More AI product news: Claude's new voice mode and the beginning of the agentic browser era

    After last week's barrage, this week was lighter on the volume of AI news. But what was announced this week is no less significant.

    Anthropic finally introduced its own voice mode for Claude to compete with ChatGPT, Grok, and Gemini. The feature is currently in beta in the Claude mobile app and will even be available on free plans, with a limit of 20 to 30 voice conversations per day. Anthropic says you can ask Claude to summarize your calendar or read documents out loud. Paying subscribers can connect to Google Workspace for Calendar, Gmail, and Docs access.

    OpenAI is exploring the ability to sign into third-party apps with ChatGPT. We don't know much yet, but the company posted an interest form on its site for developers using Codex, its engineering agent, to add this capability to their own apps. It may not sound like a big deal, but it basically means users could easily link their personalized ChatGPT memories and settings to third-party apps, much like the way it works when you sign into a new app with your Google account.

    Opera announced a new agentic AI browser called Neon. "Much more than a place to view web pages, Neon can browse with you or for you, take action, and help you get things done," the announcement read. That includes a chatbot interface within the browser and the ability to fill in web forms for tasks like booking trips and shopping. The announcement, which included a promo video of a humanoid robot browsing the web, is scant on details but says Neon will be a "premium subscription product" and has a waitlist to sign up.

    The browser has suddenly become a new frontier for agentic AI, now that it's capable of automating web search tasks. Perplexity is working on a similar tool called Comet, and The Browser Company pivoted from its Arc browser to a more AI-centric browser called Dia. All of this is happening while Google might be forced to sell off Chrome, which OpenAI has kindly offered to take off its hands.

    Dario Amodei's prediction about AI replacing entry-level jobs is already starting to happen

    Anthropic CEO Dario Amodei warned in an interview with Axios that AI could "wipe out half of all entry-level white-collar jobs." Amodei's predictions might be spot on, because a new study from VC firm SignalFire found that hiring for entry-level jobs is down to 7 percent from 25 percent in the previous year. Some of that is due to changes in the economic climate, but AI is definitely a factor, since firms are opting to automate the less-technical aspects of work that would've been taken on by new hires.

    The latest in AI culture: That AI-generated kangaroo, Judge Judy, and everything else

    Google wants you to know its AI Overviews reach 1.5 billion people a month. They probably don't want you to know AI Overviews still struggles to count, spell, and know what year it is. As Mashable's Tim Marcin put it, would AI Overviews pass concussion protocol?

    The proposal of a 10-year ban on states regulating AI is pretty unpopular, according to a poll from Common Sense Media. The survey found that 57 percent of respondents opposed the moratorium, including half of the Republican respondents. As Mashable's Rebecca Ruiz reported, "the vast majority of respondents, regardless of their political affiliation, agreed that Congress shouldn't ban states from enacting or enforcing their own youth online safety and privacy laws."

    In the private sector, The New York Times signed a licensing deal with Amazon to allow its editorial content to be used for Amazon's AI models. The details are unclear, but from the outside, this seems like a change of tune from the Times, which is currently suing OpenAI for copyright infringement for allegedly using its content to train its models.

    That viral video of an emotional support kangaroo holding a plane ticket and being denied boarding? It's AI-generated, of course. Slightly more obvious, but no less creepy, is another viral trend of using AI to turn public figures like Emmanuel Macron and Judge Judy into babies. These are strange AI-slop-infested times we're living in.

    AI has some positive uses too. This week, we learned about a new humanoid robot from Hugging Face called HopeJr, which could be available for sale later this year for just […]. And to end this recap on a high note, the nonprofit Colossal Foundation has developed an AI algorithm to detect the bird calls of the near-extinct tooth-billed pigeon. Also known as the "little dodo," the tooth-billed pigeon is Samoa's national bird, and scientists are using the bioacoustic algorithm to locate and protect them.

    Want to get the latest AI news, from new product features to viral trends? Check back next week for another AI news recap, and in the meantime, follow @cecily_mauran and @mashable for more news.

    Disclosure: Ziff Davis, Mashable’s parent company, in April filed a lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.

    Cecily Mauran
    Tech Reporter

    Cecily is a tech reporter at Mashable who covers AI, Apple, and emerging tech trends. Before getting her master's degree at Columbia Journalism School, she spent several years working with startups and social impact businesses for Unreasonable Group and B Lab. Before that, she co-founded a startup consulting business for emerging entrepreneurial hubs in South America, Europe, and Asia. You can find her on X at @cecily_mauran.
  • This AI Paper Introduces ARM and Ada-GRPO: Adaptive Reasoning Models for Efficient and Scalable Problem-Solving

    Reasoning tasks are a fundamental aspect of artificial intelligence, encompassing areas like commonsense understanding, mathematical problem-solving, and symbolic reasoning. These tasks often involve multiple steps of logical inference, which large language models (LLMs) attempt to mimic through structured approaches such as chain-of-thought (CoT) prompting. However, as LLMs grow in size and complexity, they tend to produce longer outputs across all tasks, regardless of difficulty, leading to significant inefficiencies. The field has been striving to balance the depth of reasoning with computational cost while also ensuring that models can adapt their reasoning strategies to meet the unique needs of each problem.
    A key issue with current reasoning models is the inability to tailor the reasoning process to different task complexities. Most models, including well-known ones like OpenAI’s o1 and DeepSeek-R1, apply a uniform strategy—typically relying on Long CoT across all tasks. This causes the “overthinking” problem, where models generate unnecessarily verbose explanations for simpler tasks. Not only does this waste resources, but it also degrades accuracy, as excessive reasoning can introduce irrelevant information. Approaches such as prompt-guided generation or token budget estimation have attempted to mitigate this issue. Still, these methods are limited by their dependence on predefined assumptions, which are not always reliable for diverse tasks.

    Attempts to address these issues include methods like GRPO (Group Relative Policy Optimization), length-penalty mechanisms, and rule-based prompt controls. While GRPO enables models to learn different reasoning strategies by rewarding correct answers, it leads to a “format collapse,” where models increasingly rely on Long CoT, crowding out more efficient formats, such as Short CoT or Direct Answer. Length-penalty techniques, such as those applied in methods like THINKPRUNE, control output length during training or inference, but often at the cost of reduced accuracy, especially in complex problem-solving tasks. These solutions struggle to achieve a consistent trade-off between reasoning effectiveness and efficiency, highlighting the need for an adaptive approach.
    A team of researchers from Fudan University and Ohio State University introduced the Adaptive Reasoning Model (ARM), which dynamically adjusts reasoning formats based on task difficulty. ARM supports four distinct reasoning styles: Direct Answer for simple tasks, Short CoT for concise reasoning, Code for structured problem-solving, and Long CoT for deep multi-step reasoning. It operates in an Adaptive Mode by default, automatically selecting the appropriate format, and also provides Instruction-Guided and Consensus-Guided Modes for explicit control or aggregation across formats. The key innovation lies in its training process, which utilizes Ada-GRPO, an extension of GRPO that introduces a format diversity reward mechanism. This prevents the dominance of Long CoT and ensures that ARM continues to explore and use simpler reasoning formats when appropriate.
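
    To make the Consensus-Guided Mode more concrete, here is a minimal sketch of one way "aggregation across formats" could work. The article does not spell out the rule, so the agreement check and the fallback to Long CoT are assumptions, and `generate` is a hypothetical callable standing in for a model call that answers a question in a requested format.

```python
from typing import Callable, Tuple

# Cheaper formats that are tried first (format names are illustrative labels).
EFFICIENT_FORMATS = ("direct", "short_cot", "code")


def consensus_answer(
    question: str,
    generate: Callable[[str, str], str],
) -> Tuple[str, str]:
    """Return (answer, source) using an assumed consensus rule.

    Assumption: if all efficient formats agree, accept that answer;
    otherwise fall back to the most expensive format, Long CoT.
    """
    answers = [generate(question, fmt) for fmt in EFFICIENT_FORMATS]
    if len(set(answers)) == 1:
        return answers[0], "consensus"
    return generate(question, "long_cot"), "long_cot"


if __name__ == "__main__":
    # Hypothetical stub model: every format gives the same answer.
    def stub(question: str, fmt: str) -> str:
        return "4"

    print(consensus_answer("What is 2 + 2?", stub))
```

    Under this reading, easy questions cost three short generations, and only questions on which the cheap formats disagree pay for a Long CoT pass.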

    The ARM methodology is built on a two-stage framework. First, the model undergoes Supervised Fine-Tuning (SFT) with 10.8K questions, each annotated across four reasoning formats, sourced from datasets like AQuA-Rat and generated with tools such as GPT-4o and DeepSeek-R1. This stage teaches the model the structure of each reasoning format but does not instill adaptiveness. The second stage applies Ada-GRPO, where the model receives scaled rewards for using less frequent formats, such as Direct Answer or Short CoT. A decaying factor ensures that this reward gradually shifts back to accuracy as training progresses, preventing long-term bias toward inefficient exploration. This structure enables ARM to avoid format collapse and dynamically match reasoning strategies to task difficulty, achieving a balance of efficiency and performance.
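
    The second stage can be sketched in a few lines. The article does not give the formula, so the rarity factor (group size divided by how often a format appears in the group) and the cosine decay schedule below are assumptions chosen to match the description: rare formats get a boosted reward early on, and the boost fades to 1 by the end of training. The function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple


def format_diversity_scale(count: int, group_size: int,
                           step: int, total_steps: int) -> float:
    """Assumed reward multiplier for a format seen `count` times in a group."""
    share = count / group_size          # fraction of the group using this format
    rarity = group_size / count         # larger for less frequent formats
    # Cosine decay: equals 1 at step 0 and `share` at the final step,
    # so rarity * decay goes from group_size / count down to 1.
    decay = share + 0.5 * (1.0 - share) * (1.0 + math.cos(math.pi * step / total_steps))
    return rarity * decay


def ada_grpo_advantages(formats: List[str], rewards: List[float],
                        step: int, total_steps: int) -> Tuple[List[float], List[float]]:
    """Scale rewards by format rarity, then normalize within the group (GRPO-style)."""
    group_size = len(formats)
    counts = Counter(formats)
    scaled = [
        format_diversity_scale(counts[f], group_size, step, total_steps) * r
        for f, r in zip(formats, rewards)
    ]
    mean = sum(scaled) / group_size
    std = math.sqrt(sum((s - mean) ** 2 for s in scaled) / group_size)
    advantages = [(s - mean) / (std + 1e-8) for s in scaled]
    return scaled, advantages


if __name__ == "__main__":
    # Hypothetical group of 8 sampled responses, all answered correctly (reward 1.0).
    formats = ["long_cot"] * 6 + ["direct", "short_cot"]
    rewards = [1.0] * 8
    for step in (0, 50, 100):
        scaled, _ = ada_grpo_advantages(formats, rewards, step, total_steps=100)
        print(step, [round(s, 3) for s in scaled])
```

    With plain GRPO, equally correct responses in a group would all receive the same advantage, so nothing would push back against Long CoT once it dominates sampling. In this sketch the rare Direct Answer and Short CoT responses get a larger advantage early in training, and the multiplier returns to 1 at the final step so that only accuracy matters.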

    ARM demonstrated impressive results across various benchmarks, including commonsense, mathematical, and symbolic reasoning tasks. It reduced token usage by an average of 30%, with reductions as high as 70% for simpler tasks, compared to models relying solely on Long CoT. ARM achieved a 2x training speedup over GRPO-based models, accelerating model development without sacrificing accuracy. For example, ARM-7B achieved 75.9% accuracy on the challenging AIME’25 task while using 32.5% fewer tokens. ARM-14B achieved 85.6% accuracy on OpenBookQA and 86.4% accuracy on the MATH dataset, with a token usage reduction of over 30% compared to Qwen2.5 SFT+GRPO models. These numbers demonstrate ARM’s ability to maintain competitive performance while delivering significant efficiency gains.
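
    For readers who want to check figures like "30% fewer tokens" against their own runs, the reduction is simply one minus the ratio of token counts. The helper below is a small sketch, and the token counts in the usage example are hypothetical, not numbers from the paper.

```python
def token_reduction(baseline_tokens: float, model_tokens: float) -> float:
    """Fraction of tokens saved relative to a baseline (0.3 means 30% fewer)."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 1.0 - model_tokens / baseline_tokens


if __name__ == "__main__":
    # Hypothetical average response lengths for a Long CoT baseline and an adaptive model.
    saved = token_reduction(baseline_tokens=1000, model_tokens=700)
    print(f"{saved:.1%} fewer tokens")
```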
    Overall, the Adaptive Reasoning Model addresses the persistent inefficiency of reasoning models by enabling the adaptive selection of reasoning formats based on task difficulty. The introduction of Ada-GRPO and the multi-format training framework ensures that models no longer waste resources on overthinking. Instead, ARM provides a flexible and practical solution for balancing accuracy and computational cost in reasoning tasks, making it a promising approach for scalable and efficient large language models.

    Check out the Paper, Models on Hugging Face, and Project Page. All credit for this research goes to the researchers of this project.
    Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Material Science, he is exploring new advancements and creating opportunities to contribute.
    #this #paper #introduces #arm #adagrpo
    This AI Paper Introduces ARM and Ada-GRPO: Adaptive Reasoning Models for Efficient and Scalable Problem-Solving
    Reasoning tasks are a fundamental aspect of artificial intelligence, encompassing areas like commonsense understanding, mathematical problem-solving, and symbolic reasoning. These tasks often involve multiple steps of logical inference, which large language modelsattempt to mimic through structured approaches such as chain-of-thoughtprompting. However, as LLMs grow in size and complexity, they tend to produce longer outputs across all tasks, regardless of difficulty, leading to significant inefficiencies. The field has been striving to balance the depth of reasoning with computational cost while also ensuring that models can adapt their reasoning strategies to meet the unique needs of each problem. A key issue with current reasoning models is the inability to tailor the reasoning process to different task complexities. Most models, including well-known ones like OpenAI’s o1 and DeepSeek-R1, apply a uniform strategy—typically relying on Long CoT across all tasks. This causes the “overthinking” problem, where models generate unnecessarily verbose explanations for simpler tasks. Not only does this waste resources, but it also degrades accuracy, as excessive reasoning can introduce irrelevant information. Approaches such as prompt-guided generation or token budget estimation have attempted to mitigate this issue. Still, these methods are limited by their dependence on predefined assumptions, which are not always reliable for diverse tasks. Attempts to address these issues include methods like GRPO, length-penalty mechanisms, and rule-based prompt controls. While GRPO enables models to learn different reasoning strategies by rewarding correct answers, it leads to a “format collapse,” where models increasingly rely on Long CoT, crowding out more efficient formats, such as Short CoT or Direct Answer. 
Length-penalty techniques, such as those applied in methods like THINKPRUNE, control output length during training or inference, but often at the cost of reduced accuracy, especially in complex problem-solving tasks. These solutions struggle to achieve a consistent trade-off between reasoning effectiveness and efficiency, highlighting the need for an adaptive approach. A team of researchers from Fudan University and Ohio State University introduced the Adaptive Reasoning Model, which dynamically adjusts reasoning formats based on task difficulty. ARM supports four distinct reasoning styles: Direct Answer for simple tasks, Short CoT for concise reasoning, Code for structured problem-solving, and Long CoT for deep multi-step reasoning. It operates in an Adaptive Mode by default, automatically selecting the appropriate format, and also provides Instruction-Guided and Consensus-Guided Modes for explicit control or aggregation across formats. The key innovation lies in its training process, which utilizes Ada-GRPO, an extension of GRPO that introduces a format diversity reward mechanism. This prevents the dominance of Long CoT and ensures that ARM continues to explore and use simpler reasoning formats when appropriate. The ARM methodology is built on a two-stage framework. First, the model undergoes Supervised Fine-Tuningwith 10.8K questions, each annotated across four reasoning formats, sourced from datasets like AQuA-Rat and generated with tools such as GPT-4o and DeepSeek-R1. This stage teaches the model the structure of each reasoning format but does not instill adaptiveness. The second stage applies Ada-GRPO, where the model receives scaled rewards for using less frequent formats, such as Direct Answer or Short CoT. A decaying factor ensures that this reward gradually shifts back to accuracy as training progresses, preventing long-term bias toward inefficient exploration. 
This structure enables ARM to avoid format collapse and dynamically match reasoning strategies to task difficulty, achieving a balance of efficiency and performance. ARM demonstrated impressive results across various benchmarks, including commonsense, mathematical, and symbolic reasoning tasks. It reduced token usage by an average of 30%, with reductions as high as 70% for simpler tasks, compared to models relying solely on Long CoT. ARM achieved a 2x training speedup over GRPO-based models, accelerating model development without sacrificing accuracy. For example, ARM-7B achieved 75.9% accuracy on the challenging AIME’25 task while using 32.5% fewer tokens. ARM-14B achieved 85.6% accuracy on OpenBookQA and 86.4% accuracy on the MATH dataset, with a token usage reduction of over 30% compared to Qwen2.5SFT+GRPO models. These numbers demonstrate ARM’s ability to maintain competitive performance while delivering significant efficiency gains. Overall, the Adaptive Reasoning Model addresses the persistent inefficiency of reasoning models by enabling the adaptive selection of reasoning formats based on task difficulty. The introduction of Ada-GRPO and the multi-format training framework ensures that models no longer waste resources on overthinking. Instead, ARM provides a flexible and practical solution for balancing accuracy and computational cost in reasoning tasks, making it a promising approach for scalable and efficient large language models. Check out the Paper, Models on Hugging Face and Project Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. NikhilNikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. 
With a strong background in Material Science, he is exploring new advancements and creating opportunities to contribute.Nikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces WEB-SHEPHERD: A Process Reward Model for Web Agents with 40K Dataset and 10× Cost EfficiencyNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces MMaDA: A Unified Multimodal Diffusion Model for Textual Reasoning, Visual Understanding, and Image GenerationNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces Differentiable MCMC Layers: A New AI Framework for Learning with Inexact Combinatorial Solvers in Neural NetworksNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces GRIT: A Method for Teaching MLLMs to Reason with Images by Interleaving Text and Visual Grounding #this #paper #introduces #arm #adagrpo
    WWW.MARKTECHPOST.COM
    This AI Paper Introduces ARM and Ada-GRPO: Adaptive Reasoning Models for Efficient and Scalable Problem-Solving
    Reasoning tasks are a fundamental aspect of artificial intelligence, encompassing areas like commonsense understanding, mathematical problem-solving, and symbolic reasoning. These tasks often involve multiple steps of logical inference, which large language models (LLMs) attempt to mimic through structured approaches such as chain-of-thought (CoT) prompting. However, as LLMs grow in size and complexity, they tend to produce longer outputs across all tasks, regardless of difficulty, leading to significant inefficiencies. The field has been striving to balance the depth of reasoning with computational cost while also ensuring that models can adapt their reasoning strategies to meet the unique needs of each problem.
    A key issue with current reasoning models is the inability to tailor the reasoning process to different task complexities. Most models, including well-known ones like OpenAI’s o1 and DeepSeek-R1, apply a uniform strategy, typically relying on Long CoT across all tasks. This causes the “overthinking” problem, where models generate unnecessarily verbose explanations for simpler tasks. Not only does this waste resources, but it can also degrade accuracy, as excessive reasoning can introduce irrelevant information. Approaches such as prompt-guided generation or token budget estimation have attempted to mitigate this issue. Still, these methods are limited by their dependence on predefined assumptions, which are not always reliable for diverse tasks.
    Attempts to address these issues include methods like GRPO (Group Relative Policy Optimization), length-penalty mechanisms, and rule-based prompt controls. While GRPO enables models to learn different reasoning strategies by rewarding correct answers, it leads to “format collapse,” where models increasingly rely on Long CoT, crowding out more efficient formats such as Short CoT or Direct Answer. Length-penalty techniques, such as those applied in methods like THINKPRUNE, control output length during training or inference, but often at the cost of reduced accuracy, especially in complex problem-solving tasks. These solutions struggle to achieve a consistent trade-off between reasoning effectiveness and efficiency, highlighting the need for an adaptive approach.
    A team of researchers from Fudan University and Ohio State University introduced the Adaptive Reasoning Model (ARM), which dynamically adjusts reasoning formats based on task difficulty. ARM supports four distinct reasoning styles: Direct Answer for simple tasks, Short CoT for concise reasoning, Code for structured problem-solving, and Long CoT for deep multi-step reasoning. It operates in an Adaptive Mode by default, automatically selecting the appropriate format, and also provides Instruction-Guided and Consensus-Guided Modes for explicit control or aggregation across formats. The key innovation lies in its training process, which utilizes Ada-GRPO, an extension of GRPO that introduces a format diversity reward mechanism. This prevents the dominance of Long CoT and ensures that ARM continues to explore and use simpler reasoning formats when appropriate.
    The ARM methodology is built on a two-stage framework. First, the model undergoes Supervised Fine-Tuning (SFT) with 10.8K questions, each annotated across four reasoning formats, sourced from datasets like AQuA-Rat and generated with tools such as GPT-4o and DeepSeek-R1. This stage teaches the model the structure of each reasoning format but does not instill adaptiveness. The second stage applies Ada-GRPO, where the model receives scaled rewards for using less frequent formats, such as Direct Answer or Short CoT. A decaying factor ensures that this reward gradually shifts back to accuracy as training progresses, preventing long-term bias toward inefficient exploration.
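    The reward shaping described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the article does not give the formula, so the rarity boost (group size divided by how often a format appears in the sampled group) and the cosine decay schedule are assumptions, and the names format_scale and ada_grpo_advantages are hypothetical.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def format_scale(count, group_size, step, total_steps):
    """Reward multiplier for a response whose format occurs `count` times in the group.

    Assumed form: a rarity boost (group_size / count) combined with a cosine
    decay, so the multiplier starts at group_size / count and ends at 1.0.
    """
    share = count / group_size
    decay = share + 0.5 * (1.0 - share) * (1.0 + math.cos(math.pi * step / total_steps))
    return decay / share


def ada_grpo_advantages(formats, rewards, step, total_steps):
    """Group-relative advantages computed from format-scaled rewards."""
    counts = Counter(formats)
    group_size = len(formats)
    scaled = [
        reward * format_scale(counts[fmt], group_size, step, total_steps)
        for fmt, reward in zip(formats, rewards)
    ]
    mu = mean(scaled)
    sigma = pstdev(scaled)
    if sigma == 0:
        # Every sample earned the same scaled reward, so there is no signal.
        return [0.0] * group_size
    return [(s - mu) / sigma for s in scaled]


if __name__ == "__main__":
    # Four sampled answers, all correct; only one uses the rare Direct Answer format.
    formats = ["long_cot", "long_cot", "long_cot", "direct"]
    rewards = [1.0, 1.0, 1.0, 1.0]
    for step in (0, 50, 100):
        print(step, ada_grpo_advantages(formats, rewards, step, total_steps=100))
```

    Under these assumptions, a correct answer in a rare format receives a positive advantage early in training even when every sample in the group is correct, which is the pressure that keeps shorter formats from being crowded out. By the final step the multiplier is 1.0 for every format, so only accuracy differences remain.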
CGShares https://cgshares.com