• TikTok, apparently, is getting into augmented reality with its own AR glasses. ByteDance wants to take on Meta and Google. It’s a bit the same old story with every new technology, isn’t it? It seems like everyone is trying to build the next big thing without really knowing why. But hey, AR glasses sound... interesting, I suppose.

    Anyway, we’ll see what it amounts to. Maybe it will change something, maybe not. Who knows.

    #TikTok #AugmentedReality #ARGlasses #ByteDance #Technology
    TikTok is preparing its own AR glasses to challenge Meta and Google
    ByteDance, TikTok’s parent company, is reportedly entering the mixed-reality market, according to The Information. […] This article was published on REALITE-VIRTUELLE.COM.
  • ByteDance Researchers Introduce DetailFlow: A 1D Coarse-to-Fine Autoregressive Framework for Faster, Token-Efficient Image Generation

    Autoregressive image generation has been shaped by advances in sequential modeling, originally seen in natural language processing. This field focuses on generating images one token at a time, similar to how sentences are constructed in language models. The appeal of this approach lies in its ability to maintain structural coherence across the image while allowing for high levels of control during the generation process. As researchers began to apply these techniques to visual data, they found that structured prediction not only preserved spatial integrity but also supported tasks like image manipulation and multimodal translation effectively.
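    To make the analogy concrete, here is a minimal, hypothetical sketch of that token-by-token sampling loop; the random-logit `next_token_logits` stub stands in for a real decoder-only transformer, and the vocabulary and sequence sizes are illustrative, not taken from any of the models discussed.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 1024    # codebook size of the image tokenizer (illustrative)
SEQ_LEN = 256   # e.g. a 16x16 grid of patch tokens

def next_token_logits(prefix: list[int]) -> np.ndarray:
    """Stand-in for a decoder-only transformer conditioned on `prefix`."""
    return rng.normal(size=VOCAB)

def sample_image_tokens() -> list[int]:
    tokens: list[int] = []
    for _ in range(SEQ_LEN):
        logits = next_token_logits(tokens)
        probs = np.exp(logits - logits.max())   # softmax over the codebook
        probs /= probs.sum()
        tokens.append(int(rng.choice(VOCAB, p=probs)))  # sample, append, repeat
    return tokens  # a detokenizer would map these back to pixels

print(len(sample_image_tokens()))  # -> 256 tokens for one image
```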
    Despite these benefits, generating high-resolution images remains computationally expensive and slow. A primary issue is the number of tokens needed to represent complex visuals. Raster-scan methods that flatten 2D images into linear sequences require thousands of tokens for detailed images, resulting in long inference times and high memory consumption. Models like Infinity need over 10,000 tokens for a 1024×1024 image. This becomes unsustainable for real-time applications or when scaling to more extensive datasets. Reducing the token burden while preserving or improving output quality has become a pressing challenge.
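    A rough back-of-the-envelope illustration of why raster-scan token counts balloon with resolution (the 16-pixel patch size is an assumption for illustration, not a figure from the article):

```python
# Token count for a raster-scan tokenizer grows quadratically with the
# image side length.
PATCH = 16
for side in (256, 512, 1024):
    print(f"{side}x{side} -> {(side // PATCH) ** 2} tokens")
# 256x256 -> 256, 512x512 -> 1024, 1024x1024 -> 4096; finer tokenizers or
# multi-scale schemes push well past 10,000 tokens at 1024x1024.
```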

    Efforts to mitigate token inflation have led to innovations like next-scale prediction seen in VAR and FlexVAR. These models create images by predicting progressively finer scales, which imitates the human tendency to sketch rough outlines before adding detail. However, they still rely on hundreds of tokens—680 in the case of VAR and FlexVAR for 256×256 images. Moreover, approaches like TiTok and FlexTok use 1D tokenization to compress spatial redundancy, but they often fail to scale efficiently. For example, FlexTok’s gFID increases from 1.9 at 32 tokens to 2.5 at 256 tokens, highlighting a degradation in output quality as the token count grows.
    Researchers from ByteDance introduced DetailFlow, a 1D autoregressive image generation framework. This method arranges token sequences from global to fine detail using a process called next-detail prediction. Unlike traditional 2D raster-scan or scale-based techniques, DetailFlow employs a 1D tokenizer trained on progressively degraded images. This design allows the model to prioritize foundational image structures before refining visual details. By mapping tokens directly to resolution levels, DetailFlow significantly reduces token requirements, enabling images to be generated in a semantically ordered, coarse-to-fine manner.

    The mechanism in DetailFlow centers on a 1D latent space where each token contributes incrementally more detail. Earlier tokens encode global features, while later tokens refine specific visual aspects. To train this, the researchers created a resolution mapping function that links token count to target resolution. During training, the model is exposed to images of varying quality levels and learns to predict progressively higher-resolution outputs as more tokens are introduced. It also implements parallel token prediction by grouping sequences and predicting entire sets at once. Since parallel prediction can introduce sampling errors, a self-correction mechanism was integrated. This system perturbs certain tokens during training and teaches subsequent tokens to compensate, ensuring that final images maintain structural and visual integrity.
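    The paper’s exact formulas aren’t reproduced here, but a minimal sketch can convey the two ideas: a resolution-mapping function (assuming, plausibly, that target resolution grows with the square root of the token count, since tokens scale with image area) and a training-time perturbation for self-correction. All names and constants below are illustrative.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def target_resolution(n_tokens: int, base_res: int = 64, base_tokens: int = 16) -> int:
    """Hypothetical resolution mapping: decoding only the first `n_tokens`
    tokens should reconstruct the image at roughly this resolution.
    Assumes resolution scales with sqrt(token count)."""
    return int(base_res * math.sqrt(n_tokens / base_tokens))

for n in (16, 64, 128, 512):                 # coarse-to-fine schedule
    print(f"{n:4d} tokens -> {target_resolution(n)} px")

def perturb_for_self_correction(tokens: np.ndarray, vocab: int, p: float = 0.1) -> np.ndarray:
    """Training-time corruption (our reading of the self-correction idea):
    randomly flip a fraction of earlier tokens so that later tokens learn
    to compensate for errors introduced by parallel sampling."""
    noisy = tokens.copy()
    mask = rng.random(tokens.shape) < p
    noisy[mask] = rng.integers(0, vocab, size=int(mask.sum()))
    return noisy

tokens = rng.integers(0, 1024, size=128)
print(int((perturb_for_self_correction(tokens, 1024) != tokens).sum()), "tokens perturbed")
```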
    The results from the experiments on the ImageNet 256×256 benchmark were noteworthy. DetailFlow achieved a gFID score of 2.96 using only 128 tokens, outperforming VAR at 3.3 and FlexVAR at 3.05, both of which used 680 tokens. Even more impressive, DetailFlow-64 reached a gFID of 2.62 using 512 tokens. In terms of speed, it delivered nearly double the inference rate of VAR and FlexVAR. A further ablation study confirmed that the self-correction training and semantic ordering of tokens substantially improved output quality. For example, enabling self-correction dropped the gFID from 4.11 to 3.68 in one setting. These metrics demonstrate both higher quality and faster generation compared to established models.

    By focusing on semantic structure and reducing redundancy, DetailFlow presents a viable solution to long-standing issues in autoregressive image generation. The method’s coarse-to-fine approach, efficient parallel decoding, and ability to self-correct highlight how architectural innovations can address performance and scalability limitations. Through their structured use of 1D tokens, the researchers from ByteDance have demonstrated a model that maintains high image fidelity while significantly reducing computational load, making it a valuable addition to image synthesis research.

    Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.
  • Manus has kick-started an AI agent boom in China

    Last year, China saw a boom in foundation models, the do-everything large language models that underpin the AI revolution. This year, the focus has shifted to AI agents—systems that are less about responding to users’ queries and more about autonomously accomplishing things for them. 

    There are now a host of Chinese startups building these general-purpose digital tools, which can answer emails, browse the internet to plan vacations, and even design an interactive website. Many of these have emerged in just the last two months, following in the footsteps of Manus—a general AI agent that sparked weeks of social media frenzy for invite codes after its limited-release launch in early March. 

    These emerging AI agents aren’t large language models themselves. Instead, they’re built on top of them, using a workflow-based structure designed to get things done. A lot of these systems also introduce a different way of interacting with AI. Rather than just chatting back and forth with users, they are optimized for managing and executing multistep tasks—booking flights, managing schedules, conducting research—by using external tools and remembering instructions. 
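    In pseudocode terms, most of these products reduce to a loop like the hedged sketch below: a model proposes an action, the harness executes a tool, and the result is fed back along with remembered instructions. The `llm` stub and tool names are invented for illustration and do not correspond to any product’s API.

```python
from dataclasses import dataclass, field

def llm(prompt: str) -> str:
    """Stand-in for a hosted model call; a real agent would send `prompt`
    to an LLM and parse the chosen action from its reply."""
    return "search_web: cheap flights to Tokyo in July"

TOOLS = {
    "search_web": lambda q: f"[top results for: {q}]",
    "send_email": lambda body: "email queued",
}

@dataclass
class Agent:
    memory: list[str] = field(default_factory=list)  # remembered instructions/results

    def run(self, task: str, max_steps: int = 3) -> str:
        self.memory.append(f"task: {task}")
        observation = ""
        for _ in range(max_steps):
            action = llm("\n".join(self.memory) + observation)
            name, _, arg = action.partition(": ")
            if name not in TOOLS:            # model chose to answer directly
                return action
            observation = TOOLS[name](arg)   # execute the tool...
            self.memory.append(f"{name} -> {observation}")  # ...and remember it
        return observation

print(Agent().run("plan a trip to Tokyo"))
```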

    China could take the lead on building these kinds of agents. The country’s tightly integrated app ecosystems, rapid product cycles, and digitally fluent user base could provide a favorable environment for embedding AI into daily life. 

    For now, its leading AI agent startups are focusing their attention on the global market, because the best Western models don’t operate inside China’s firewalls. But that could change soon: Tech giants like ByteDance and Tencent are preparing their own AI agents that could bake automation directly into their native super-apps, pulling data from their vast ecosystem of programs that dominate many aspects of daily life in the country. 

    As the race to define what a useful AI agent looks like unfolds, a mix of ambitious startups and entrenched tech giants are now testing how these tools might actually work in practice—and for whom.

    Set the standard

    It’s been a whirlwind few months for Manus, which was developed by the Wuhan-based startup Butterfly Effect. The company raised $75 million in a funding round led by the US venture capital firm Benchmark, took the product on an ambitious global roadshow, and hired dozens of new employees. 

    Even before registration opened to the public in May, Manus had become a reference point for what a broad, consumer‑oriented AI agent should accomplish. Rather than handling narrow chores for businesses, this “general” agent is designed to be able to help with everyday tasks like trip planning, stock comparison, or your kid’s school project. 

    Unlike previous AI agents, Manus uses a browser-based sandbox that lets users supervise the agent like an intern, watching in real time as it scrolls through web pages, reads articles, or writes code. It also proactively asks clarifying questions and supports long-term memory that serves as context for future tasks.

    “Manus represents a promising product experience for AI agents,” says Ang Li, cofounder and CEO of Simular, a startup based in Palo Alto, California, that’s building computer-use agents, that is, AI agents that control a virtual computer. “I believe Chinese startups have a huge advantage when it comes to designing consumer products, thanks to cutthroat domestic competition that leads to fast execution and greater attention to product details.”

    In the case of Manus, the competition is moving fast. Two of the buzziest follow‑ups, Genspark and Flowith, are already boasting benchmark scores that match or edge past Manus’s. 

    Genspark, led by former Baidu executives Eric Jing and Kay Zhu, links many small “super agents” through what it calls multi‑component prompting. The agent can switch among several large language models, accepts both images and text, and carries out tasks from making slide decks to placing phone calls. Whereas Manus relies heavily on Browser Use, a popular open-source product that lets agents operate a web browser in a virtual window like a human, Genspark directly integrates with a wide array of tools and APIs. The product launched in April, and the company says it already has over 5 million users and over $36 million in yearly revenue.

    Flowith, the work of a young team that first grabbed public attention in April 2025 at a developer event hosted by the popular social media app Xiaohongshu, takes a different tack. Marketed as an “infinite agent,” it opens on a blank canvas where each question becomes a node on a branching map. Users can backtrack, take new branches, and store results in personal or sharable “knowledge gardens”—a design that feels more like project management software (think Notion) than a typical chat interface. Every inquiry or task builds its own mind-map-like graph, encouraging a more nonlinear and creative interaction with AI. Flowith’s core agent, NEO, runs in the cloud and can perform scheduled tasks like sending emails and compiling files. The founders want the app to be a “knowledge marketbase” and aim to tap into the social aspect of AI, with the aspiration of becoming “the OnlyFans of AI knowledge creators.”

    What they also share with Manus is the global ambition. Both Genspark and Flowith have stated that their primary focus is the international market.

    A global address

    Startups like Manus, Genspark, and Flowith—though founded by Chinese entrepreneurs—could blend seamlessly into the global tech scene and compete effectively abroad. Founders, investors, and analysts that MIT Technology Review has spoken to believe Chinese companies are moving fast, executing well, and quickly coming up with new products. 

    Money reinforces the pull to launch overseas. Customers there pay more, and there are plenty to go around. “You can price in USD, and with the exchange rate that’s a sevenfold multiplier,” Manus cofounder Xiao Hong quipped on a podcast. “Even if we’re only operating at 10% power because of cultural differences overseas, we’ll still make more than in China.”

    But creating the same functionality in China is a challenge. Major US AI companies including OpenAI and Anthropic have opted out of mainland China because of geopolitical risks and challenges with regulatory compliance. Their absence initially created a black market as users resorted to VPNs and third-party mirrors to access tools like ChatGPT and Claude. That vacuum has since been filled by a new wave of Chinese chatbots—DeepSeek, Doubao, Kimi—but the appetite for foreign models hasn’t gone away. 

    Manus, for example, uses Anthropic’s Claude Sonnet—widely considered the top model for agentic tasks. Manus cofounder Zhang Tao has repeatedly praised Claude’s ability to juggle tools, remember contexts, and hold multi‑round conversations—all crucial for turning chatty software into an effective executive assistant.

    But the company’s use of Sonnet has made its agent functionally unusable inside China without a VPN. If you open Manus from a mainland IP address, you’ll see a notice explaining that the team is “working on integrating Qwen’s model” for a special local version built on top of Alibaba’s open-source model. 

    An engineer overseeing ByteDance’s work on developing an agent, who spoke to MIT Technology Review anonymously to avoid sanction, said that the absence of Claude Sonnet models “limits everything we do in China.” DeepSeek’s open models, he added, still hallucinate too often and lack training on real‑world workflows. Developers we spoke with rank Alibaba’s Qwen series as the best domestic alternative, yet most say that switching to Qwen knocks performance down a notch.

    Jiaxin Pei, a postdoctoral researcher at Stanford’s Institute for Human‑Centered AI, thinks that gap will close: “Building agentic capabilities in base LLMs has become a key focus for many LLM builders, and once people realize the value of this, it will only be a matter of time.”

    For now, Manus is doubling down on audiences it can already serve. In a written response, the company said its “primary focus is overseas expansion,” noting that new offices in San Francisco, Singapore, and Tokyo have opened in the past month.

    A super‑app approach

    Although the concept of AI agents is still relatively new, the consumer-facing AI app market in China is already crowded with major tech players. DeepSeek remains the most widely used, while ByteDance’s Doubao and Moonshot’s Kimi have also become household names. However, most of these apps are still optimized for chat and entertainment rather than task execution. This gap in the local market has pushed China’s big tech firms to roll out their own user-facing agents, though early versions remain uneven in quality and rough around the edges. 

    ByteDance is testing Coze Space, an AI agent based on its own Doubao model family that lets users toggle between “plan” and “execute” modes, so they can either directly guide the agent’s actions or step back and watch it work autonomously. It connects up to 14 popular apps, including GitHub, Notion, and the company’s own Lark office suite. Early reviews say the tool can feel clunky and has a high failure rate, but it clearly aims to match what Manus offers.

    Meanwhile, Zhipu AI has released a free agent called AutoGLM Rumination, built on its proprietary ChatGLM models. Shanghai‑based Minimax has launched Minimax Agent. Both products look almost identical to Manus and demo basic tasks such as building a simple website, planning a trip, making a small Flash game, or running quick data analysis.

    Despite the limited usability of most general AI agents launched within China, big companies have plans to change that. During a May 15 earnings call, Tencent president Liu Zhiping teased an agent that would weave automation directly into China’s most ubiquitous app, WeChat. 

    Considered the original super-app, WeChat already handles messaging, mobile payments, news, and millions of mini‑programs that act like embedded apps. These mini‑programs give Tencent, WeChat’s developer, access to data from millions of services that pervade everyday life in China, an advantage most competitors can only envy.

    Historically, China’s consumer internet has splintered into competing walled gardens—share a Taobao link in WeChat and it resolves as plaintext, not a preview card. Unlike the more interoperable Western internet, China’s tech giants have long resisted integration with one another, choosing to wage platform war at the expense of a seamless user experience.

    But the use of mini‑programs has given WeChat unprecedented reach across services that once resisted interoperability, from gym bookings to grocery orders. An agent able to roam that ecosystem could bypass the integration headaches dogging independent startups.

    Alibaba, the e-commerce giant behind the Qwen model series, has been a front-runner in China’s AI race but has been slower to release consumer-facing products. Even though Qwen was the most downloaded open-source model on Hugging Face in 2024, it didn’t power a dedicated chatbot app until early 2025. In March, Alibaba rebranded its cloud storage and search app Quark into an all-in-one AI search tool. By June, Quark had introduced DeepResearch—a new mode that marks its most agent-like effort to date. 

    ByteDance and Alibaba did not reply to MIT Technology Review’s request for comments.

    “Historically, Chinese tech products tend to pursue the all-in-one, super-app approach, and the latest Chinese AI agents reflect just that,” says Li of Simular, who previously worked at Google DeepMind on AI-enabled work automation. “In contrast, AI agents in the US are more focused on serving specific verticals.”

    Pei, the researcher at Stanford, says that existing tech giants could have a huge advantage in bringing the vision of general AI agents to life—especially those with built-in integration across services. “The customer-facing AI agent market is still very early, with tons of problems like authentication and liability,” he says. “But companies that already operate across a wide range of services have a natural advantage in deploying agents at scale.”
    #manus #has #kickstarted #agent #boom
    Manus has kick-started an AI agent boom in China
    Last year, China saw a boom in foundation models, the do-everything large language models that underpin the AI revolution. This year, the focus has shifted to AI agents—systems that are less about responding to users’ queries and more about autonomously accomplishing things for them.  There are now a host of Chinese startups building these general-purpose digital tools, which can answer emails, browse the internet to plan vacations, and even design an interactive website. Many of these have emerged in just the last two months, following in the footsteps of Manus—a general AI agent that sparked weeks of social media frenzy for invite codes after its limited-release launch in early March.  These emerging AI agents aren’t large language models themselves. Instead, they’re built on top of them, using a workflow-based structure designed to get things done. A lot of these systems also introduce a different way of interacting with AI. Rather than just chatting back and forth with users, they are optimized for managing and executing multistep tasks—booking flights, managing schedules, conducting research—by using external tools and remembering instructions.  China could take the lead on building these kinds of agents. The country’s tightly integrated app ecosystems, rapid product cycles, and digitally fluent user base could provide a favorable environment for embedding AI into daily life.  For now, its leading AI agent startups are focusing their attention on the global market, because the best Western models don’t operate inside China’s firewalls. But that could change soon: Tech giants like ByteDance and Tencent are preparing their own AI agents that could bake automation directly into their native super-apps, pulling data from their vast ecosystem of programs that dominate many aspects of daily life in the country.  As the race to define what a useful AI agent looks like unfolds, a mix of ambitious startups and entrenched tech giants are now testing how these tools might actually work in practice—and for whom. Set the standard It’s been a whirlwind few months for Manus, which was developed by the Wuhan-based startup Butterfly Effect. The company raised million in a funding round led by the US venture capital firm Benchmark, took the product on an ambitious global roadshow, and hired dozens of new employees.  Even before registration opened to the public in May, Manus had become a reference point for what a broad, consumer‑oriented AI agent should accomplish. Rather than handling narrow chores for businesses, this “general” agent is designed to be able to help with everyday tasks like trip planning, stock comparison, or your kid’s school project.  Unlike previous AI agents, Manus uses a browser-based sandbox that lets users supervise the agent like an intern, watching in real time as it scrolls through web pages, reads articles, or codes actions. It also proactively asks clarifying questions, supports long-term memory that would serve as context for future tasks. “Manus represents a promising product experience for AI agents,” says Ang Li, cofounder and CEO of Simular, a startup based in Palo Alto, California, that’s building computer use agents, AI agents that control a virtual computer. “I believe Chinese startups have a huge advantage when it comes to designing consumer products, thanks to cutthroat domestic competition that leads to fast execution and greater attention to product details.” In the case of Manus, the competition is moving fast. 
Two of the most buzzy follow‑ups, Genspark and Flowith, for example, are already boasting benchmark scores that match or edge past Manus’s.  Genspark, led by former Baidu executives Eric Jing and Kay Zhu, links many small “super agents” through what it calls multi‑component prompting. The agent can switch among several large language models, accepts both images and text, and carries out tasks from making slide decks to placing phone calls. Whereas Manus relies heavily on Browser Use, a popular open-source product that lets agents operate a web browser in a virtual window like a human, Genspark directly integrates with a wide array of tools and APIs. Launched in April, the company says that it already has over 5 million users and over million in yearly revenue. Flowith, the work of a young team that first grabbed public attention in April 2025 at a developer event hosted by the popular social media app Xiaohongshu, takes a different tack. Marketed as an “infinite agent,” it opens on a blank canvas where each question becomes a node on a branching map. Users can backtrack, take new branches, and store results in personal or sharable “knowledge gardens”—a design that feels more like project management softwarethan a typical chat interface. Every inquiry or task builds its own mind-map-like graph, encouraging a more nonlinear and creative interaction with AI. Flowith’s core agent, NEO, runs in the cloud and can perform scheduled tasks like sending emails and compiling files. The founders want the app to be a “knowledge marketbase”, and aims to tap into the social aspect of AI with the aspiration of becoming “the OnlyFans of AI knowledge creators”. What they also share with Manus is the global ambition. Both Genspark and Flowith have stated that their primary focus is the international market. A global address Startups like Manus, Genspark, and Flowith—though founded by Chinese entrepreneurs—could blend seamlessly into the global tech scene and compete effectively abroad. Founders, investors, and analysts that MIT Technology Review has spoken to believe Chinese companies are moving fast, executing well, and quickly coming up with new products.  Money reinforces the pull to launch overseas. Customers there pay more, and there are plenty to go around. “You can price in USD, and with the exchange rate that’s a sevenfold multiplier,” Manus cofounder Xiao Hong quipped on a podcast. “Even if we’re only operating at 10% power because of cultural differences overseas, we’ll still make more than in China.” But creating the same functionality in China is a challenge. Major US AI companies including OpenAI and Anthropic have opted out of mainland China because of geopolitical risks and challenges with regulatory compliance. Their absence initially created a black market as users resorted to VPNs and third-party mirrors to access tools like ChatGPT and Claude. That vacuum has since been filled by a new wave of Chinese chatbots—DeepSeek, Doubao, Kimi—but the appetite for foreign models hasn’t gone away.  Manus, for example, uses Anthropic’s Claude Sonnet—widely considered the top model for agentic tasks. Manus cofounder Zhang Tao has repeatedly praised Claude’s ability to juggle tools, remember contexts, and hold multi‑round conversations—all crucial for turning chatty software into an effective executive assistant. But the company’s use of Sonnet has made its agent functionally unusable inside China without a VPN. 
If you open Manus from a mainland IP address, you’ll see a notice explaining that the team is “working on integrating Qwen’s model,” a special local version that is built on top of Alibaba’s open-source model.  An engineer overseeing ByteDance’s work on developing an agent, who spoke to MIT Technology Review anonymously to avoid sanction, said that the absence of Claude Sonnet models “limits everything we do in China.” DeepSeek’s open models, he added, still hallucinate too often and lack training on real‑world workflows. Developers we spoke with rank Alibaba’s Qwen series as the best domestic alternative, yet most say that switching to Qwen knocks performance down a notch. Jiaxin Pei, a postdoctoral researcher at Stanford’s Institute for Human‑Centered AI, thinks that gap will close: “Building agentic capabilities in base LLMs has become a key focus for many LLM builders, and once people realize the value of this, it will only be a matter of time.” For now, Manus is doubling down on audiences it can already serve. In a written response, the company said its “primary focus is overseas expansion,” noting that new offices in San Francisco, Singapore, and Tokyo have opened in the past month. A super‑app approach Although the concept of AI agents is still relatively new, the consumer-facing AI app market in China is already crowded with major tech players. DeepSeek remains the most widely used, while ByteDance’s Doubao and Moonshot’s Kimi have also become household names. However, most of these apps are still optimized for chat and entertainment rather than task execution. This gap in the local market has pushed China’s big tech firms to roll out their own user-facing agents, though early versions remain uneven in quality and rough around the edges.  ByteDance is testing Coze Space, an AI agent based on its own Doubao model family that lets users toggle between “plan” and “execute” modes, so they can either directly guide the agent’s actions or step back and watch it work autonomously. It connects up to 14 popular apps, including GitHub, Notion, and the company’s own Lark office suite. Early reviews say the tool can feel clunky and has a high failure rate, but it clearly aims to match what Manus offers. Meanwhile, Zhipu AI has released a free agent called AutoGLM Rumination, built on its proprietary ChatGLM models. Shanghai‑based Minimax has launched Minimax Agent. Both products look almost identical to Manus and demo basic tasks such as building a simple website, planning a trip, making a small Flash game, or running quick data analysis. Despite the limited usability of most general AI agents launched within China, big companies have plans to change that. During a May 15 earnings call, Tencent president Liu Zhiping teased an agent that would weave automation directly into China’s most ubiquitous app, WeChat.  Considered the original super-app, WeChat already handles messaging, mobile payments, news, and millions of mini‑programs that act like embedded apps. These programs give Tencent, its developer, access to data from millions of services that pervade everyday life in China, an advantage most competitors can only envy. Historically, China’s consumer internet has splintered into competing walled gardens—share a Taobao link in WeChat and it resolves as plaintext, not a preview card. Unlike the more interoperable Western internet, China’s tech giants have long resisted integration with one another, choosing to wage platform war at the expense of a seamless user experience. 
But the use of mini‑programs has given WeChat unprecedented reach across services that once resisted interoperability, from gym bookings to grocery orders. An agent able to roam that ecosystem could bypass the integration headaches dogging independent startups. Alibaba, the e-commerce giant behind the Qwen model series, has been a front-runner in China’s AI race but has been slower to release consumer-facing products. Even though Qwen was the most downloaded open-source model on Hugging Face in 2024, it didn’t power a dedicated chatbot app until early 2025. In March, Alibaba rebranded its cloud storage and search app Quark into an all-in-one AI search tool. By June, Quark had introduced DeepResearch—a new mode that marks its most agent-like effort to date.  ByteDance and Alibaba did not reply to MIT Technology Review’s request for comments. “Historically, Chinese tech products tend to pursue the all-in-one, super-app approach, and the latest Chinese AI agents reflect just that,” says Li of Simular, who previously worked at Google DeepMind on AI-enabled work automation. “In contrast, AI agents in the US are more focused on serving specific verticals.” Pei, the researcher at Stanford, says that existing tech giants could have a huge advantage in bringing the vision of general AI agents to life—especially those with built-in integration across services. “The customer-facing AI agent market is still very early, with tons of problems like authentication and liability,” he says. “But companies that already operate across a wide range of services have a natural advantage in deploying agents at scale.” #manus #has #kickstarted #agent #boom
    WWW.TECHNOLOGYREVIEW.COM
    Manus has kick-started an AI agent boom in China
    Last year, China saw a boom in foundation models, the do-everything large language models that underpin the AI revolution. This year, the focus has shifted to AI agents—systems that are less about responding to users’ queries and more about autonomously accomplishing things for them.  There are now a host of Chinese startups building these general-purpose digital tools, which can answer emails, browse the internet to plan vacations, and even design an interactive website. Many of these have emerged in just the last two months, following in the footsteps of Manus—a general AI agent that sparked weeks of social media frenzy for invite codes after its limited-release launch in early March.  These emerging AI agents aren’t large language models themselves. Instead, they’re built on top of them, using a workflow-based structure designed to get things done. A lot of these systems also introduce a different way of interacting with AI. Rather than just chatting back and forth with users, they are optimized for managing and executing multistep tasks—booking flights, managing schedules, conducting research—by using external tools and remembering instructions.  China could take the lead on building these kinds of agents. The country’s tightly integrated app ecosystems, rapid product cycles, and digitally fluent user base could provide a favorable environment for embedding AI into daily life.  For now, its leading AI agent startups are focusing their attention on the global market, because the best Western models don’t operate inside China’s firewalls. But that could change soon: Tech giants like ByteDance and Tencent are preparing their own AI agents that could bake automation directly into their native super-apps, pulling data from their vast ecosystem of programs that dominate many aspects of daily life in the country.  As the race to define what a useful AI agent looks like unfolds, a mix of ambitious startups and entrenched tech giants are now testing how these tools might actually work in practice—and for whom. Set the standard It’s been a whirlwind few months for Manus, which was developed by the Wuhan-based startup Butterfly Effect. The company raised $75 million in a funding round led by the US venture capital firm Benchmark, took the product on an ambitious global roadshow, and hired dozens of new employees.  Even before registration opened to the public in May, Manus had become a reference point for what a broad, consumer‑oriented AI agent should accomplish. Rather than handling narrow chores for businesses, this “general” agent is designed to be able to help with everyday tasks like trip planning, stock comparison, or your kid’s school project.  Unlike previous AI agents, Manus uses a browser-based sandbox that lets users supervise the agent like an intern, watching in real time as it scrolls through web pages, reads articles, or codes actions. It also proactively asks clarifying questions, supports long-term memory that would serve as context for future tasks. “Manus represents a promising product experience for AI agents,” says Ang Li, cofounder and CEO of Simular, a startup based in Palo Alto, California, that’s building computer use agents, AI agents that control a virtual computer. “I believe Chinese startups have a huge advantage when it comes to designing consumer products, thanks to cutthroat domestic competition that leads to fast execution and greater attention to product details.” In the case of Manus, the competition is moving fast. 
Two of the most buzzy follow‑ups, Genspark and Flowith, for example, are already boasting benchmark scores that match or edge past Manus’s.  Genspark, led by former Baidu executives Eric Jing and Kay Zhu, links many small “super agents” through what it calls multi‑component prompting. The agent can switch among several large language models, accepts both images and text, and carries out tasks from making slide decks to placing phone calls. Whereas Manus relies heavily on Browser Use, a popular open-source product that lets agents operate a web browser in a virtual window like a human, Genspark directly integrates with a wide array of tools and APIs. Launched in April, the company says that it already has over 5 million users and over $36 million in yearly revenue. Flowith, the work of a young team that first grabbed public attention in April 2025 at a developer event hosted by the popular social media app Xiaohongshu, takes a different tack. Marketed as an “infinite agent,” it opens on a blank canvas where each question becomes a node on a branching map. Users can backtrack, take new branches, and store results in personal or sharable “knowledge gardens”—a design that feels more like project management software (think Notion) than a typical chat interface. Every inquiry or task builds its own mind-map-like graph, encouraging a more nonlinear and creative interaction with AI. Flowith’s core agent, NEO, runs in the cloud and can perform scheduled tasks like sending emails and compiling files. The founders want the app to be a “knowledge marketbase”, and aims to tap into the social aspect of AI with the aspiration of becoming “the OnlyFans of AI knowledge creators”. What they also share with Manus is the global ambition. Both Genspark and Flowith have stated that their primary focus is the international market. A global address Startups like Manus, Genspark, and Flowith—though founded by Chinese entrepreneurs—could blend seamlessly into the global tech scene and compete effectively abroad. Founders, investors, and analysts that MIT Technology Review has spoken to believe Chinese companies are moving fast, executing well, and quickly coming up with new products.  Money reinforces the pull to launch overseas. Customers there pay more, and there are plenty to go around. “You can price in USD, and with the exchange rate that’s a sevenfold multiplier,” Manus cofounder Xiao Hong quipped on a podcast. “Even if we’re only operating at 10% power because of cultural differences overseas, we’ll still make more than in China.” But creating the same functionality in China is a challenge. Major US AI companies including OpenAI and Anthropic have opted out of mainland China because of geopolitical risks and challenges with regulatory compliance. Their absence initially created a black market as users resorted to VPNs and third-party mirrors to access tools like ChatGPT and Claude. That vacuum has since been filled by a new wave of Chinese chatbots—DeepSeek, Doubao, Kimi—but the appetite for foreign models hasn’t gone away.  Manus, for example, uses Anthropic’s Claude Sonnet—widely considered the top model for agentic tasks. Manus cofounder Zhang Tao has repeatedly praised Claude’s ability to juggle tools, remember contexts, and hold multi‑round conversations—all crucial for turning chatty software into an effective executive assistant. But the company’s use of Sonnet has made its agent functionally unusable inside China without a VPN. 
If you open Manus from a mainland IP address, you’ll see a notice explaining that the team is “working on integrating Qwen’s model,” a special local version that is built on top of Alibaba’s open-source model.  An engineer overseeing ByteDance’s work on developing an agent, who spoke to MIT Technology Review anonymously to avoid sanction, said that the absence of Claude Sonnet models “limits everything we do in China.” DeepSeek’s open models, he added, still hallucinate too often and lack training on real‑world workflows. Developers we spoke with rank Alibaba’s Qwen series as the best domestic alternative, yet most say that switching to Qwen knocks performance down a notch. Jiaxin Pei, a postdoctoral researcher at Stanford’s Institute for Human‑Centered AI, thinks that gap will close: “Building agentic capabilities in base LLMs has become a key focus for many LLM builders, and once people realize the value of this, it will only be a matter of time.” For now, Manus is doubling down on audiences it can already serve. In a written response, the company said its “primary focus is overseas expansion,” noting that new offices in San Francisco, Singapore, and Tokyo have opened in the past month. A super‑app approach Although the concept of AI agents is still relatively new, the consumer-facing AI app market in China is already crowded with major tech players. DeepSeek remains the most widely used, while ByteDance’s Doubao and Moonshot’s Kimi have also become household names. However, most of these apps are still optimized for chat and entertainment rather than task execution. This gap in the local market has pushed China’s big tech firms to roll out their own user-facing agents, though early versions remain uneven in quality and rough around the edges.  ByteDance is testing Coze Space, an AI agent based on its own Doubao model family that lets users toggle between “plan” and “execute” modes, so they can either directly guide the agent’s actions or step back and watch it work autonomously. It connects up to 14 popular apps, including GitHub, Notion, and the company’s own Lark office suite. Early reviews say the tool can feel clunky and has a high failure rate, but it clearly aims to match what Manus offers. Meanwhile, Zhipu AI has released a free agent called AutoGLM Rumination, built on its proprietary ChatGLM models. Shanghai‑based Minimax has launched Minimax Agent. Both products look almost identical to Manus and demo basic tasks such as building a simple website, planning a trip, making a small Flash game, or running quick data analysis. Despite the limited usability of most general AI agents launched within China, big companies have plans to change that. During a May 15 earnings call, Tencent president Liu Zhiping teased an agent that would weave automation directly into China’s most ubiquitous app, WeChat.  Considered the original super-app, WeChat already handles messaging, mobile payments, news, and millions of mini‑programs that act like embedded apps. These programs give Tencent, its developer, access to data from millions of services that pervade everyday life in China, an advantage most competitors can only envy. Historically, China’s consumer internet has splintered into competing walled gardens—share a Taobao link in WeChat and it resolves as plaintext, not a preview card. Unlike the more interoperable Western internet, China’s tech giants have long resisted integration with one another, choosing to wage platform war at the expense of a seamless user experience. 
But the use of mini‑programs has given WeChat unprecedented reach across services that once resisted interoperability, from gym bookings to grocery orders. An agent able to roam that ecosystem could bypass the integration headaches dogging independent startups. Alibaba, the e-commerce giant behind the Qwen model series, has been a front-runner in China’s AI race but has been slower to release consumer-facing products. Even though Qwen was the most downloaded open-source model on Hugging Face in 2024, it didn’t power a dedicated chatbot app until early 2025. In March, Alibaba rebranded its cloud storage and search app Quark into an all-in-one AI search tool. By June, Quark had introduced DeepResearch—a new mode that marks its most agent-like effort to date.  ByteDance and Alibaba did not reply to MIT Technology Review’s request for comments. “Historically, Chinese tech products tend to pursue the all-in-one, super-app approach, and the latest Chinese AI agents reflect just that,” says Li of Simular, who previously worked at Google DeepMind on AI-enabled work automation. “In contrast, AI agents in the US are more focused on serving specific verticals.” Pei, the researcher at Stanford, says that existing tech giants could have a huge advantage in bringing the vision of general AI agents to life—especially those with built-in integration across services. “The customer-facing AI agent market is still very early, with tons of problems like authentication and liability,” he says. “But companies that already operate across a wide range of services have a natural advantage in deploying agents at scale.”
  • Enigmata’s Multi-Stage and Mix-Training Reinforcement Learning Recipe Drives Breakthrough Performance in LLM Puzzle Reasoning

    Large Reasoning Models (LRMs), trained from LLMs using reinforcement learning (RL), have demonstrated great performance in complex reasoning tasks, including mathematics, STEM, and coding. However, existing LRMs face challenges in completing various puzzle tasks that require purely logical reasoning skills, which are easy and obvious for humans. Current methods targeting puzzles focus only on designing benchmarks for evaluation, lacking the training methods and resources for modern LLMs to tackle this challenge. Current puzzle datasets also lack diversity and scalability, covering limited puzzle types with little control over generation or difficulty. Moreover, given the success of the “LLM+RLVR” paradigm, it has become crucial to obtain large, diverse, and challenging sets of verifiable puzzle prompts for training agents.
    Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a key method for improving models’ reasoning capabilities, removing the need for reward models by directly assigning rewards based on objectively verifiable answers. Puzzles are particularly well suited for RLVR, yet most prior RLVR research has overlooked their potential for delivering effective reward signals. Existing benchmarks for LLM puzzle reasoning evaluate different types of reasoning, including abstract, deductive, and compositional reasoning; the few that support scalable generation and difficulty control lack puzzle diversity. Moreover, work on improving LLMs’ puzzle-solving abilities falls mainly into two categories: tool integration and RLVR.
    Researchers from ByteDance Seed, Fudan University, Tsinghua University, Nanjing University, and Shanghai Jiao Tong University have proposed Enigmata, the first comprehensive toolkit designed to equip LLMs with puzzle reasoning skills. It contains 36 tasks across seven categories, each featuring a generator that produces unlimited examples with controllable difficulty and a rule-based verifier for automatic evaluation. The researchers further developed Enigmata-Eval as a rigorous benchmark and created optimized multi-task RLVR strategies. Puzzle data from Enigmata also enhances SoTA performance on advanced math and STEM reasoning benchmarks like AIME, BeyondAIME, and GPQA when used to train larger models like Seed1.5-Thinking (20B/200B parameters), showing the generalization benefits of Enigmata.
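    To make the generator-plus-verifier pattern concrete, here is a minimal sketch of what a single Enigmata-style task could look like; the modular-arithmetic puzzle, function names, and difficulty knob are illustrative assumptions, not the toolkit's actual API.

```python
import random

# Hypothetical Enigmata-style task: each task pairs a generator that
# emits unlimited instances at a requested difficulty with a rule-based
# verifier that scores answers automatically.

def generate_instance(difficulty: int, seed: int | None = None) -> dict:
    """Produce one puzzle; a higher difficulty yields more terms."""
    rng = random.Random(seed)
    modulus = rng.randint(5, 20)
    n_terms = 2 + 2 * difficulty  # the controllable-difficulty knob
    terms = [rng.randint(0, 99) for _ in range(n_terms)]
    prompt = f"Compute ({' + '.join(map(str, terms))}) mod {modulus}."
    return {"prompt": prompt, "terms": terms, "modulus": modulus}

def verify(instance: dict, answer: str) -> bool:
    """Rule-based verifier: true iff the parsed answer is correct."""
    try:
        return int(answer.strip()) == sum(instance["terms"]) % instance["modulus"]
    except ValueError:
        return False  # unparseable answers earn zero reward

if __name__ == "__main__":
    inst = generate_instance(difficulty=2, seed=0)
    print(inst["prompt"])
    print(verify(inst, str(sum(inst["terms"]) % inst["modulus"])))  # True
```

    Because the verifier is deterministic, its boolean output can serve directly as the kind of verifiable reward signal RLVR relies on.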

    The Enigmata-Data comprises 36 puzzle tasks organized into seven primary categories (Crypto, Arithmetic, Logic, Grid, Graph, Search, and Sequential Puzzle), making it the only dataset to combine multiple task categories with scalability, automatic verification, and public availability. The data construction follows a three-phase pipeline: Tasks Collection and Design, Auto-Generator and Verifier Development, and Sliding Difficulty Control. Enigmata-Eval is developed by systematically sampling from the broader dataset, aiming to extract 50 instances per difficulty level for each task. The final evaluation set contains 4,758 puzzle instances rather than the theoretical maximum of 5,400 (36 tasks × 3 difficulty levels × 50 instances), because some tasks cannot generate the full 50 instances at every difficulty level.
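    As a rough sketch of that sampling step (assuming three named difficulty levels and a per-task generator callable, both hypothetical; the article does not show the pipeline's real interfaces):

```python
# Hypothetical reconstruction of the Enigmata-Eval sampling step: draw up
# to 50 instances per difficulty level for each task, accepting that some
# (task, difficulty) cells yield fewer than 50.
TARGET_PER_CELL = 50
DIFFICULTY_LEVELS = ("easy", "medium", "hard")  # assumed three-way split

def build_eval_set(tasks: dict) -> list:
    """tasks maps a task name to callable(level) -> list of instances."""
    eval_set = []
    for name, sample_fn in tasks.items():
        for level in DIFFICULTY_LEVELS:
            eval_set.extend(sample_fn(level)[:TARGET_PER_CELL])
    return eval_set

# Demo with a stub task that can only produce 40 "hard" instances:
stub = {"maze": lambda lvl: [f"maze-{lvl}-{i}"
                             for i in range(40 if lvl == "hard" else 60)]}
print(len(build_eval_set(stub)))  # 50 + 50 + 40 = 140
```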

    With 32B parameters, the proposed model outperforms most public models on Enigmata-Eval, showing the effectiveness of the dataset and training recipe. The model stands out on the challenging ARC-AGI benchmark, surpassing strong reasoning models such as Gemini 2.5 Pro, o3-mini, and o1. Qwen2.5-32B-Enigmata shows outstanding performance in structured reasoning categories, leading in Crypto, Arithmetic, and Logic tasks, which suggests effective development of rule-based reasoning capabilities. The model is also competitive on search tasks that require strategic exploration and planning. Overall, Crypto and Arithmetic tasks yield the highest accuracy, while spatial and sequential tasks remain more difficult.
    In this paper, the researchers introduced Enigmata, a comprehensive suite for equipping LLMs with advanced puzzle reasoning that integrates seamlessly with RL using verifiable rule-based rewards. The trained Enigmata-Model shows superior performance and robust generalization through RLVR training. Experiments reveal that when applied to larger models such as Seed1.5-Thinking, synthetic puzzle data brings additional benefits in other domains, including mathematics and STEM reasoning, over state-of-the-art models. Enigmata provides a solid foundation for the research community to advance reasoning-model development, offering a unified framework that effectively bridges logical puzzle-solving with broader reasoning capabilities in LLMs.

    Check out the Paper, GitHub Page and Project Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter.
  • Augmented World Expo 2025 will draw 400 speakers, 6K attendees and 300 global exhibitors

    Augmented World Expo 2025 will draw more than 6,000 attendees, 400 speakers and 300 global exhibitors to its event June 10 to June 12 in Long Beach, California.
    The speaker lineup includes Snap CEO Evan Spiegel, Atari cofounder Nolan Bushnell and Oculus/Anduril founder Palmer Luckey. If the show is any indication, the XR industry isn’t doing so badly. A variety of market researchers are forecasting fast growth for the industry through 2030. Ori Inbar, CEO of AWE, believes that the XR revolution is “ready to conquer the mainstream.” But to get there, he believes the industry still needs to create “head-turning content that must be experienced.”
    Of course, the red-hot days of the “metaverse,” inspired by Neal Stephenson’s 1992 sci-fi novel Snow Crash, are no longer driving the industry forward. With less focus on sci-fi, the industry is concentrating on practical uses for mixed reality technology in the enterprise and in consumer markets like gaming.
    But will XR and the metaverse be overrun by AI, or will it carry them to the mass market destination?
    Much is riding on how committed Mark Zuckerberg’s Meta will be even as it reprioritizes some resources away from XR to AI. Meta, which acquired Luckey’s Oculus back in 2014, has invested billions every quarter in the technology, with no profits so far. But, in a very unexpected turnaround, Zuckerberg and Luckey buried the hatchet over their past differences and set up an alliance between Meta and Anduril — the latter being Luckey’s AI/drone defense company.
    Zuckerberg has new competition from his own nemesis, Apple, which launched the Apple Vision Pro in February 2024. However, Apple has slowed down its development of the next-generation XR headset, while Zuckerberg has put more emphasis on AR/AI glasses.
    Spiegel, the CEO of Snap, has focused on augmented reality glasses. His Spectacles are now in their fifth generation, powered by the Snap OS and authoring tool Lens Studio.
    Nolan Bushnell, founder of Atari and Chuck E. Cheese, will deliver a one-of-a-kind talk on the main stage with five of his children, who are continuing his pioneering vision in gaming through XR. Brent Bushnell, Nolan’s eldest son, recently debuted DreamPark, a new XR startup that turns any park or playground into a mixed reality theme park.
    Other speakers include Vicki Dobbs Beck – VP, Immersive Content Innovation, Lucasfilm & ILM Immersive; Ziad Asghar – SVP & GM XR, Qualcomm; Brian McClendon – Chief Technology Officer, Niantic Spatial, Inc.; Jason Rubin – VP, Metaverse Experiences, Meta; Hugo Swart – Senior Director of XR Ecosystem Strategy and Technology, Google; Jacqui Bransky – VP Web3 & Innovation, Warner Records; Chi Xu – CEO and Founder, XREAL; Helen Papagiannis – AR Pioneer and XR Hall of Famer; and Tom Furness – Grandfather of VR and Founder, Virtual World Society.
    AWE Builders Nexus will be a new program focused on startups this year. Startup founders, developers, designers, product managers, and business leaders alike will get the resources they need to build something extraordinary, get advice and funding, scale through partnerships, and win customers, Inbar said. The event will also feature the AWE Gaming Hub.
    I also interviewed some companies that are showcasing technology at the show. Here are some snippets of what they are going to show.
    Pico VR
    Pico started out in Beijing, China, in 2015 and is now hitting its 10th anniversary. It is making the standalone Pico XR headsets, and it was acquired by ByteDance, the owner of TikTok, in 2021. In September 2024, the company launched the Pico 4 Ultra Enterprise headset, filling out the high end of its product line in addition to its G3 and Neo 3 legacy headsets.
    Pico has also added a set of full-body motion trackers to its product offerings to allow for full-body and object tracking. That’s helping it with its focus on location-based entertainment (LBE) in markets such as China. It’s focused on Wi-Fi 7, hand tracking and motion tracking.
    Leland Hedges, head of enterprise business at Pico, said that the LBE market in China has grown by 1,000% in the last six to nine months. Pico has an app for PC streaming and another app for managing devices over a LAN. Pico can track play spaces with columns or cordoned-off areas. Hedges said the company will share 15 different user stories at AWE from public places such as zoos, museums, aquariums and planetariums.
    Convai
    Purnendu Mukherjee, CEO of Convai, showed me a bunch of demos at the Game Developers Conference where it has been able to create avatar-based demos of generative AI solutions with 3D animated people. These can be used to show off brands and greet people on web sites or as avatars in games.
    At AWE, Convai will also offer learning and training scenarios for education and enterprises through a variety of simulations. Convai can render high-fidelity avatars that are effectively coming from the cloud. At GDC, Convai scanned me and captured my voice so that it could create a lifelike avatar of me. These avatars can be created quickly and answer a variety of questions from website visitors. The idea is to enable non-technical people to create simulations without the need to code anything.
    In a demo, Convai’s avatar of me said, “I’ve been covering the games industry for many years now at GamesBeat. I’ve seen it evolve from the arcades to the massive global phenomenon it is today. I love digging into the business side of gaming, the technology, the culture, the whole shebang.” Convai will announce pricing for its self-serve platform as well as an enterprise subscription fee.
    Doublepoint
    Ohto Pentikäinen, CEO of Doublepoint, has a technology that detects the gestures you make with your hand. It captures that movement via a smartwatch and allows you to control things on a TV interface or an XR device. With Android XR, Doublepoint is showing off demos where gesture control can unlock a more intuitive and comfortable augmented reality experience for those wearing AR glasses. Xreal is one of the glasses makers that is using the technology for controlling an AR user interface with gestures.
    “Our technology is able to fully control a XR system. A stat that we can update you on is that there’s 150,000 people who have downloaded the technology so far, and we have a developer community of over 2,000 people since January 2024,” Pentikäinen said.
    Now the company is starting its own Doublepoint developer program, which adds layers on top of the enterprise client. So now the company can provide technology for indie developers or startups that are building augmented reality or AI hardware experiences.
    “We’re empowering developers in AR robotics and AI hardware, and we’re providing everything that we’re providing the enterprise clients, but for a much reduced price,” Pentikäinen said.
  • Reinforcement Learning Makes LLMs Search-Savvy: Ant Group Researchers Introduce SEM to Optimize Tool Usage and Reasoning Efficiency

    Recent progress in LLMs has shown their potential in performing complex reasoning tasks and effectively using external tools like search engines. Despite this, teaching models to make smart decisions about when to rely on internal knowledge versus search remains a key challenge. While simple prompt-based methods can guide models to invoke tools, LLMs still struggle with more nuanced behaviors, such as recognizing when an initial search was incorrect and deciding to search again. RL has been explored to improve these behaviors by rewarding effective search usage. However, RL often leads to unnecessary tool use, with models executing redundant searches even for simple tasks, highlighting inefficiencies that must be addressed.
    Various RL strategies, including Proximal Policy Optimization (PPO), Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO), have been used to align LLM behavior with human expectations. PPO helps balance learning exploration with maintaining policy stability, while DPO simplifies alignment by directly optimizing model responses based on user preferences. GRPO introduces group-based evaluations to better capture subtle improvements in reasoning. Meanwhile, treating LLMs as autonomous agents that plan and execute multi-step reasoning tasks is gaining traction. Frameworks like AutoGPT and LangChain showcase how these agents can refine their outputs through iterative reasoning and search. Yet current agent systems often depend on fixed prompts or heuristic-based tool use, limiting their adaptability and efficiency.
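    As a quick illustration of the group-based idea behind GRPO, here is a minimal sketch of the group-relative advantage computation: each sampled response is scored against the mean and standard deviation of its own group's rewards instead of against a learned critic. It shows only the core normalization, not Ant Group's training code.

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: normalize each response's reward by its
    sampling group's mean and standard deviation (no value network)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero spread
    return [(r - mean) / std for r in rewards]

# Four responses to one prompt, rewarded 1.0 when verifiably correct:
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
# -> [1.0, -1.0, -1.0, 1.0]; correct answers get positive advantage
```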
    Researchers at Ant Group introduce SEM, a post-training reinforcement learning framework designed to teach LLMs when to use search tools and when to rely on internal knowledge. By training on a balanced dataset combining questions that do and do not require external retrieval, SEM guides the model to issue search requests only when necessary. Using a structured reasoning format and GRPO, the framework rewards accurate answers without search and penalizes unnecessary tool use. Results show that SEM improves response accuracy and efficiency, helping models better judge when external information is needed, thus enhancing reasoning in complex scenarios. 
    To integrate search tools into a model’s reasoning process, SEM uses reinforcement learning to teach models when and how to use search effectively. The training data combines Musique (questions needing external information) and MMLU (questions answerable from prior knowledge), helping models learn to judge when search is necessary. Using the GRPO framework, the model is rewarded for accurate, efficient answers, discouraging unnecessary searches and encouraging them when internal knowledge falls short. A structured response format (<think>, <answer>, <search>, <result>) standardizes training and allows for precise reward assignment, improving both reasoning quality and search decision-making.
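    A minimal sketch of how such a structured response might be parsed and scored; the tag names follow the format described above, but the reward values, the penalty for needless searching, and the helper functions are illustrative assumptions rather than the paper's exact reward.

```python
import re

def extract(tag: str, text: str) -> str | None:
    """Pull the contents of one <tag>...</tag> block, if present."""
    m = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
    return m.group(1).strip() if m else None

def sem_style_reward(response: str, gold: str, needs_search: bool) -> float:
    """Illustrative reward: 1.0 for a correct answer, minus an assumed
    penalty when the model searches on a question it should know."""
    answer = extract("answer", response)
    searched = extract("search", response) is not None
    reward = 1.0 if answer == gold else 0.0
    if searched and not needs_search:
        reward -= 0.5  # discourage unnecessary tool calls
    return reward

resp = "<think>I know this one.</think><answer>Paris</answer>"
print(sem_style_reward(resp, "Paris", needs_search=False))  # 1.0
```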
    The study evaluates a model trained to determine when to rely on its internal knowledge and when to use external search. It combines Musique (unfamiliar questions) and MMLU (familiar questions) for training and evaluates performance on datasets like HotpotQA, GSM8K, and MMLU. The proposed SEM method outperforms baselines like Naive RAG and ReSearch in answer accuracy and search efficiency. SEM reduces unnecessary searches on known questions while improving reasoning on unknown ones. Case studies and training curves confirm SEM’s stable learning and intelligent decision-making. Overall, SEM enhances retrieval decisions and internal reasoning in large language models.

    In conclusion, SEM is a post-training reinforcement learning framework designed to improve how large language models use external search tools. The model is trained on a dataset combining MuSiQue and MMLU, helping it distinguish between questions it can answer internally and those that require external retrieval. SEM uses a structured reasoning approach and a reward function that penalizes unnecessary searches while promoting accurate and efficient retrieval. Experiments on benchmarks like HotpotQA, GSM8K, and MMLU show that SEM reduces redundant searches and improves accuracy. This approach enhances reasoning efficiency and intelligent use of external knowledge in LLMs. 

    Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit.
    #reinforcement #learning #makes #llms #searchsavvy
    Reinforcement Learning Makes LLMs Search-Savvy: Ant Group Researchers Introduce SEM to Optimize Tool Usage and Reasoning Efficiency
    Recent progress in LLMs has shown their potential in performing complex reasoning tasks and effectively using external tools like search engines. Despite this, teaching models to make smart decisions about when to rely on internal knowledge versus search remains a key challenge. While simple prompt-based methods can guide models to invoke tools, LLMs still struggle with more nuanced behaviors, such as recognizing when an initial search was incorrect and deciding to search again. RL has been explored to improve these behaviors by rewarding effective search usage. However, RL often leads to unnecessary tool use, with models executing redundant searches even for simple tasks, highlighting inefficiencies that must be addressed. Various RL strategies, including Proximal Policy Optimization, Direct Preference Optimization, and Group Relative Policy Optimization, have been used to align LLM behavior with human expectations. PPO helps balance learning exploration with maintaining policy stability, while DPO simplifies alignment by directly optimizing model responses based on user preferences. GRPO introduces group-based evaluations to capture subtle improvements in reasoning better. Meanwhile, treating LLMs as autonomous agents that plan and execute multi-step reasoning tasks is gaining traction. Frameworks like AutoGPT and LangChain showcase how these agents can refine their outputs through iterative reasoning and search. Yet, current agent systems often depend on fixed prompts or heuristic-based tool use, limiting their adaptability and efficiency.  Researchers at Ant Group introduce SEM, a post-training reinforcement learning framework designed to teach LLMs when to use search tools and when to rely on internal knowledge. By training on a balanced dataset combining questions that do and do not require external retrieval, SEM guides the model to issue search requests only when necessary. Using a structured reasoning format and GRPO, the framework rewards accurate answers without search and penalizes unnecessary tool use. Results show that SEM improves response accuracy and efficiency, helping models better judge when external information is needed, thus enhancing reasoning in complex scenarios.  To integrate search tools into a model’s reasoning process, SEM uses reinforcement learning to teach models when and how to use search effectively. The training data combines Musiqueand MMLU, helping models learn to judge when search is necessary. Using the GRPO framework, the model is rewarded for accurate, efficient answers, discouraging unnecessary searches, and encouraging them when internal knowledge falls short. A structured response formatstandardizes training and allows for precise reward assignment, improving both reasoning quality and search decision-making.  The study evaluates a model trained to determine when to rely on its internal knowledge and when to use external search. It combines Musiqueand MMLUfor training and evaluates performance on datasets like HotpotQA, GSM8K, and MMLU. The proposed SEM method outperforms baselines like Naive RAG and ReSearch in answer accuracy and search efficiency. SEM reduces unnecessary searches on known questions while improving reasoning on unknown ones. Case studies and training curves confirm SEM’s stable learning and intelligent decision-making. Overall, SEM enhances retrieval decisions and internal reasoning in large language models.  
In conclusion, SEM is a post-training reinforcement learning framework designed to improve how large language models use external search tools. The model is trained on a dataset combining MuSiQue and MMLU, helping it distinguish between questions it can answer internally and those that require external retrieval. SEM uses a structured reasoning approach and a reward function that penalizes unnecessary searches while promoting accurate and efficient retrieval. Experiments on benchmarks like HotpotQA, GSM8K, and MMLU show that SEM reduces redundant searches and improves accuracy. This approach enhances reasoning efficiency and intelligent use of external knowledge in LLMs.  Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit. Sana HassanSana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.Sana Hassanhttps://www.marktechpost.com/author/sana-hassan/SWE-Bench Performance Reaches 50.8% Without Tool Use: A Case for Monolithic State-in-Context AgentsSana Hassanhttps://www.marktechpost.com/author/sana-hassan/This AI paper from DeepSeek-AI Explores How DeepSeek-V3 Delivers High-Performance Language Modeling by Minimizing Hardware Overhead and Maximizing Computational EfficiencySana Hassanhttps://www.marktechpost.com/author/sana-hassan/Meet LangGraph Multi-Agent Swarm: A Python Library for Creating Swarm-Style Multi-Agent Systems Using LangGraphSana Hassanhttps://www.marktechpost.com/author/sana-hassan/ByteDance Introduces Seed1.5-VL: A Vision-Language Foundation Model Designed to Advance General-Purpose Multimodal Understanding and Reasoning 🚨 Build GenAI you can trust. ⭐️ Parlant is your open-source engine for controlled, compliant, and purposeful AI conversations — Star Parlant on GitHub! #reinforcement #learning #makes #llms #searchsavvy
    WWW.MARKTECHPOST.COM
    Reinforcement Learning Makes LLMs Search-Savvy: Ant Group Researchers Introduce SEM to Optimize Tool Usage and Reasoning Efficiency
    Recent progress in LLMs has shown their potential in performing complex reasoning tasks and effectively using external tools like search engines. Despite this, teaching models to make smart decisions about when to rely on internal knowledge versus search remains a key challenge. While simple prompt-based methods can guide models to invoke tools, LLMs still struggle with more nuanced behaviors, such as recognizing when an initial search was incorrect and deciding to search again. RL has been explored to improve these behaviors by rewarding effective search usage. However, RL often leads to unnecessary tool use, with models executing redundant searches even for simple tasks, highlighting inefficiencies that must be addressed. Various RL strategies, including Proximal Policy Optimization (PPO), Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO), have been used to align LLM behavior with human expectations. PPO helps balance learning exploration with maintaining policy stability, while DPO simplifies alignment by directly optimizing model responses based on user preferences. GRPO introduces group-based evaluations to capture subtle improvements in reasoning better. Meanwhile, treating LLMs as autonomous agents that plan and execute multi-step reasoning tasks is gaining traction. Frameworks like AutoGPT and LangChain showcase how these agents can refine their outputs through iterative reasoning and search. Yet, current agent systems often depend on fixed prompts or heuristic-based tool use, limiting their adaptability and efficiency.  Researchers at Ant Group introduce SEM, a post-training reinforcement learning framework designed to teach LLMs when to use search tools and when to rely on internal knowledge. By training on a balanced dataset combining questions that do and do not require external retrieval, SEM guides the model to issue search requests only when necessary. Using a structured reasoning format and GRPO, the framework rewards accurate answers without search and penalizes unnecessary tool use. Results show that SEM improves response accuracy and efficiency, helping models better judge when external information is needed, thus enhancing reasoning in complex scenarios.  To integrate search tools into a model’s reasoning process, SEM uses reinforcement learning to teach models when and how to use search effectively. The training data combines Musique (questions needing external info) and MMLU (questions answerable from prior knowledge), helping models learn to judge when search is necessary. Using the GRPO framework, the model is rewarded for accurate, efficient answers, discouraging unnecessary searches, and encouraging them when internal knowledge falls short. A structured response format (<think>, <answer>, <search>, <result>) standardizes training and allows for precise reward assignment, improving both reasoning quality and search decision-making.  The study evaluates a model trained to determine when to rely on its internal knowledge and when to use external search. It combines Musique (unfamiliar questions) and MMLU (familiar questions) for training and evaluates performance on datasets like HotpotQA, GSM8K, and MMLU. The proposed SEM method outperforms baselines like Naive RAG and ReSearch in answer accuracy and search efficiency. SEM reduces unnecessary searches on known questions while improving reasoning on unknown ones. Case studies and training curves confirm SEM’s stable learning and intelligent decision-making. 
Overall, SEM enhances retrieval decisions and internal reasoning in large language models.  In conclusion, SEM is a post-training reinforcement learning framework designed to improve how large language models use external search tools. The model is trained on a dataset combining MuSiQue and MMLU, helping it distinguish between questions it can answer internally and those that require external retrieval. SEM uses a structured reasoning approach and a reward function that penalizes unnecessary searches while promoting accurate and efficient retrieval. Experiments on benchmarks like HotpotQA, GSM8K, and MMLU show that SEM reduces redundant searches and improves accuracy. This approach enhances reasoning efficiency and intelligent use of external knowledge in LLMs.  Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit. Sana HassanSana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.Sana Hassanhttps://www.marktechpost.com/author/sana-hassan/SWE-Bench Performance Reaches 50.8% Without Tool Use: A Case for Monolithic State-in-Context AgentsSana Hassanhttps://www.marktechpost.com/author/sana-hassan/This AI paper from DeepSeek-AI Explores How DeepSeek-V3 Delivers High-Performance Language Modeling by Minimizing Hardware Overhead and Maximizing Computational EfficiencySana Hassanhttps://www.marktechpost.com/author/sana-hassan/Meet LangGraph Multi-Agent Swarm: A Python Library for Creating Swarm-Style Multi-Agent Systems Using LangGraphSana Hassanhttps://www.marktechpost.com/author/sana-hassan/ByteDance Introduces Seed1.5-VL: A Vision-Language Foundation Model Designed to Advance General-Purpose Multimodal Understanding and Reasoning 🚨 Build GenAI you can trust. ⭐️ Parlant is your open-source engine for controlled, compliant, and purposeful AI conversations — Star Parlant on GitHub! (Promoted)
  • SWE-Bench Performance Reaches 50.8% Without Tool Use: A Case for Monolithic State-in-Context Agents

    Recent advancements in LM agents have shown promising potential for automating intricate real-world tasks. These agents typically operate by proposing and executing actions through APIs, supporting applications such as software engineering, robotics, and scientific experimentation. As these tasks become more complex, LM agent frameworks have evolved to include multiple agents, multi-step retrieval, and tailored scaffolding to optimize performance. A central challenge lies in effectively exploring and understanding the environment, which has prompted the development of engineered scaffolds using tools, memory mechanisms, and custom pipelines. However, most existing methods assume partial observability, requiring agents to collect observations incrementally. While this assumption holds in dynamic or unfamiliar environments, it is less applicable in fully observable settings like SWE-bench, where all relevant information is accessible from the start.
    In software engineering, research on LM agents has focused on two main strategies: agent-based frameworks and structured pipelines. Agent-based systems, such as SWE-Agent and OpenHands CodeAct, allow LMs to interact autonomously with codebases, often through custom interfaces and retrieval tools. Other models like Moatless and AutoCodeRover enhance localization through search techniques, while SpecRover refines scaffolding design. Alternatively, structured pipelines—such as Agentless and CodeMonkey—decompose tasks into sequential phases like localization, repair, and validation. While these approaches depend on engineered components for performance, the current study proposes leveraging Long-Context LMs (LCLMs) to directly interpret the entire task environment. Advances in LCLM architecture and infrastructure now allow these models to outperform retrieval-augmented systems in many contexts, reducing reliance on complex external scaffolding.
    Researchers from Stanford, IBM, and the University of Toronto explored whether complex scaffolding is necessary for LM agents tackling tasks like SWE-bench. They show that simply using LCLMs, such as Gemini-1.5-Pro, with proper prompting and no scaffolding, can achieve competitive performance—reaching 38% on SWE-Bench-Verified. Gemini-2.5-Pro, using the same simple setup, reaches 50.8%. Their work suggests that many complex agentic designs could be replaced with a single powerful LCLM, simplifying architecture and training. Additionally, a hybrid two-stage approach using Gemini-1.5-Pro and Claude-3.7 achieves a 48.6% solve rate, further supporting this simplified direction. 
    Traditional LM agents rely on interactive exploration due to partial observability, but many tasks, like software debugging, allow full observability. The study proposes state-in-context agents that leverage LCLMs to directly process full or compressed environment states, bypassing the need for complex agentic scaffolding. For large codebases, a ranking-based compression selects relevant files to fit within context limits. Two methods are introduced: DIRECTSOLVE, where LCLMs solve tasks using the full context; and SELECTSOLVE, where LCLMs localize relevant files for short-context LMs (SCLMs) to solve. Both use targeted patch formats and validation to ensure accuracy and reduce hallucination.
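    As a rough illustration of the ranking-based compression step, here is a sketch that uses a crude lexical-overlap score as a stand-in for whatever ranker the system actually uses; the function names and the 4-characters-per-token estimate are assumptions, not the authors' implementation:

    ```python
    # Illustrative sketch of ranking-based context compression: rank repository
    # files against the issue text, then greedily pack the highest-ranked files
    # into the LCLM's token budget, most relevant first.

    def rank_and_pack(issue: str, files: dict[str, str], budget_tokens: int):
        def est_tokens(text: str) -> int:
            return len(text) // 4  # rough 4-chars-per-token heuristic

        def relevance(text: str) -> int:
            # Shared vocabulary between issue and file: a crude stand-in
            # for a learned or lexical ranker.
            return len(set(issue.lower().split()) & set(text.lower().split()))

        ranked = sorted(files.items(), key=lambda kv: relevance(kv[1]),
                        reverse=True)
        kept, used = [], 0
        for path, text in ranked:
            cost = est_tokens(text)
            if used + cost > budget_tokens:
                continue  # this file would overflow the context window
            kept.append(path)
            used += cost
        return kept  # paths to place in the prompt, most relevant first
    ```

    Emitting the highest-ranked files first is consistent with the ablation finding below that positioning relevant files at the start of the prompt improves performance.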
    The experiments evaluate a simplified agent framework using LLMs on the SWE-bench Verified benchmark, which includes 500 real-world software engineering tasks. The proposed methods, DIRECTSOLVE and SELECTSOLVE, utilize LCLMs like Gemini-1.5-Pro and Gemini-2.5-Pro, and in SELECTSOLVE, an additional SCLM (Claude-3.7-Sonnet) for patch generation. Results show that DIRECTSOLVE outperforms complex agentic approaches like Agentless and CodeAct with minimal engineering. SELECTSOLVE further improves accuracy by leveraging stronger models for patching. Ablation studies highlight the importance of CoT prompting, code restatement, and token-efficient context design. Additionally, positioning relevant files at the start of the prompt improves performance, underscoring limitations in long-context processing.

    In conclusion, the cost of using LCLM-based methods is currently higher than existing approaches like Agentless and CodeAct, averaging $2.60 per instance compared to $0.25 and $0.87, respectively. However, rapid drops in inference costs and increasing context lengths make LCLMs more practical. Techniques like KV caching significantly lower costs after initial runs, reducing the figure to about $0.725, although slight codebase changes still limit caching benefits and further improvements could help. The study also suggests that LCLMs can handle long interaction histories, reducing the need for complex memory and retrieval mechanisms. Notably, unscaffolded LCLM models can perform competitively on SWE-bench tasks.
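    A quick back-of-the-envelope comparison of those per-instance figures (values copied from the paragraph above; the caching estimate is approximate):

    ```python
    # Rough per-instance cost comparison using the figures quoted above.
    costs = {"LCLM (DIRECTSOLVE)": 2.60, "Agentless": 0.25, "CodeAct": 0.87}
    kv_cached_lclm = 0.725  # approximate LCLM cost once KV caching applies

    for name, usd in costs.items():
        print(f"{name:<18} ${usd:.2f}")
    ratio = kv_cached_lclm / costs["LCLM (DIRECTSOLVE)"]
    print(f"KV-cached LCLM     ${kv_cached_lclm:.3f} "
          f"(~{ratio:.0%} of the uncached run)")
    ```

    So while an uncached LCLM run costs roughly 3x CodeAct and 10x Agentless, caching brings it to under a third of the uncached price, narrowing the gap considerably.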

  • Former Activision Boss Bobby Kotick Wants To Buy TikTok: Report

    TikTok is in an exceptionally tough spot these days. Despite everyone you know using it for hours on end, the video-sharing app is currently facing legislation that would force its ban in the U.S. pending a potential sale, and prospective buyers are lining up. One of these potential buyers is reportedly Bobby Kotick, the former boss of Activision Blizzard, according to the Wall Street Journal.

    TikTok has been scrutinized for years by U.S. lawmakers, who have argued that its China-based parent company ByteDance may share the data it collects with the Chinese government, or that the app could serve as a propaganda delivery tool. Despite tensions ramping up some time ago, leading many to believe that the app would be banned in the U.S., matters had seemingly cooled until a bill was pushed through the House Energy and Commerce Committee last week, ratcheting up the pressure on ByteDance. The bill is expected to be reviewed and approved by the House of Representatives this week before being sent to the Senate, and President Joe Biden has already said he would sign off on a ban if the bill makes it through Congress.

    The bill requires that ByteDance “divest itself” of TikTok or see the app banned in the U.S., which has led to renewed interest from potential buyers, including Kotick. According to the WSJ's sources, Kotick has floated the idea of a purchase to ByteDance's co-founder and is reportedly looking for partners, which could include Sam Altman of OpenAI. According to the Wall Street Journal, “OpenAI could use TikTok to help train its AI models if a partner such as Kotick could raise the capital for such an acquisition.” TikTok's sale price has been estimated in the range of “hundreds of billions of dollars.”

    Kotick departed from Activision Blizzard late last year after completing the publisher's $68 billion sale to Microsoft. His decades-long tenure at Activision Blizzard came under fire in 2021, when the state of California filed a now-dismissed lawsuit following an investigation into allegations of sexual harassment and discrimination. Ultimately, California's Civil Rights Department withdrew all allegations and claims relating to harassment and settled with Activision Blizzard in December 2023 for $54 million to resolve unsubstantiated pay and promotions claims.

    The court-approved settlement included a statement providing that: “[N]o court or any independent investigation has substantiated any allegations that there has been systemic or widespread sexual harassment at Activision Blizzard; that Activision Blizzard senior executives ignored, condoned, or tolerated a culture of systemic harassment, retaliation, or discrimination; or that Activision Blizzard's Board of Directors, including its Chief Executive Officer, Robert Kotick, acted improperly with regard to the handling of any instances of workplace misconduct.”

    In addition, the settlement noted that a former chair of the EEOC had reviewed the company's policies, practices, and certain complaint data and reported that there was no widespread harassment at the company. The company itself publicly released its Transparency Report, which further asserted that there had never been widespread or systemic harassment or gender pay inequity at Activision Blizzard. Kotick departed with a golden parachute estimated to be worth around $15 million.

    Updated: 04/01/2024, 2:00 p.m. ET: This article has been updated to include details of the CRD settlement, that Activision Blizzard denied any wrongdoing, and that the settlement confirms the CRD could not substantiate those claims. Updated: 05/17/2025, 12:10 p.m. ET: This article has been updated to include additional language from the CRD settlement, and to note that, as part of the settlement, the CRD withdrew the claims related to harassment from its complaint.
CGShares https://cgshares.com