• ByteDance Researchers Introduce DetailFlow: A 1D Coarse-to-Fine Autoregressive Framework for Faster, Token-Efficient Image Generation

    Autoregressive image generation has been shaped by advances in sequential modeling, originally seen in natural language processing. This field focuses on generating images one token at a time, similar to how sentences are constructed in language models. The appeal of this approach lies in its ability to maintain structural coherence across the image while allowing for high levels of control during the generation process. As researchers began to apply these techniques to visual data, they found that structured prediction not only preserved spatial integrity but also supported tasks like image manipulation and multimodal translation effectively.
    Despite these benefits, generating high-resolution images remains computationally expensive and slow. A primary issue is the number of tokens needed to represent complex visuals. Raster-scan methods that flatten 2D images into linear sequences require thousands of tokens for detailed images, resulting in long inference times and high memory consumption. Models like Infinity need over 10,000 tokens for a 1024×1024 image. This becomes unsustainable for real-time applications or when scaling to more extensive datasets. Reducing the token burden while preserving or improving output quality has become a pressing challenge.

    Efforts to mitigate token inflation have led to innovations like next-scale prediction seen in VAR and FlexVAR. These models create images by predicting progressively finer scales, which imitates the human tendency to sketch rough outlines before adding detail. However, they still rely on hundreds of tokens—680 in the case of VAR and FlexVAR for 256×256 images. Moreover, approaches like TiTok and FlexTok use 1D tokenization to compress spatial redundancy, but they often fail to scale efficiently. For example, FlexTok’s gFID increases from 1.9 at 32 tokens to 2.5 at 256 tokens, highlighting a degradation in output quality as the token count grows.
    Researchers from ByteDance introduced DetailFlow, a 1D autoregressive image generation framework. This method arranges token sequences from global to fine detail using a process called next-detail prediction. Unlike traditional 2D raster-scan or scale-based techniques, DetailFlow employs a 1D tokenizer trained on progressively degraded images. This design allows the model to prioritize foundational image structures before refining visual details. By mapping tokens directly to resolution levels, DetailFlow significantly reduces token requirements, enabling images to be generated in a semantically ordered, coarse-to-fine manner.

    The mechanism in DetailFlow centers on a 1D latent space where each token contributes incrementally more detail. Earlier tokens encode global features, while later tokens refine specific visual aspects. To train this, the researchers created a resolution mapping function that links token count to target resolution. During training, the model is exposed to images of varying quality levels and learns to predict progressively higher-resolution outputs as more tokens are introduced. It also implements parallel token prediction by grouping sequences and predicting entire sets at once. Since parallel prediction can introduce sampling errors, a self-correction mechanism was integrated. This system perturbs certain tokens during training and teaches subsequent tokens to compensate, ensuring that final images maintain structural and visual integrity.
    The results from the experiments on the ImageNet 256×256 benchmark were noteworthy. DetailFlow achieved a gFID score of 2.96 using only 128 tokens, outperforming VAR at 3.3 and FlexVAR at 3.05, both of which used 680 tokens. Even more impressive, DetailFlow-64 reached a gFID of 2.62 using 512 tokens. In terms of speed, it delivered nearly double the inference rate of VAR and FlexVAR. A further ablation study confirmed that the self-correction training and semantic ordering of tokens substantially improved output quality. For example, enabling self-correction dropped the gFID from 4.11 to 3.68 in one setting. These metrics demonstrate both higher quality and faster generation compared to established models.

    By focusing on semantic structure and reducing redundancy, DetailFlow presents a viable solution to long-standing issues in autoregressive image generation. The method’s coarse-to-fine approach, efficient parallel decoding, and ability to self-correct highlight how architectural innovations can address performance and scalability limitations. Through their structured use of 1D tokens, the researchers from ByteDance have demonstrated a model that maintains high image fidelity while significantly reducing computational load, making it a valuable addition to image synthesis research.

    Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter.
    NikhilNikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Material Science, he is exploring new advancements and creating opportunities to contribute.Nikhilhttps://www.marktechpost.com/author/nikhil0980/Teaching AI to Say ‘I Don’t Know’: A New Dataset Mitigates Hallucinations from Reinforcement FinetuningNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces LLaDA-V: A Purely Diffusion-Based Multimodal Large Language Model for Visual Instruction Tuning and Multimodal ReasoningNikhilhttps://www.marktechpost.com/author/nikhil0980/NVIDIA AI Introduces Fast-dLLM: A Training-Free Framework That Brings KV Caching and Parallel Decoding to Diffusion LLMsNikhilhttps://www.marktechpost.com/author/nikhil0980/Meet NovelSeek: A Unified Multi-Agent Framework for Autonomous Scientific Research from Hypothesis Generation to Experimental Validation
    #bytedance #researchers #introduce #detailflow #coarsetofine
    ByteDance Researchers Introduce DetailFlow: A 1D Coarse-to-Fine Autoregressive Framework for Faster, Token-Efficient Image Generation
    Autoregressive image generation has been shaped by advances in sequential modeling, originally seen in natural language processing. This field focuses on generating images one token at a time, similar to how sentences are constructed in language models. The appeal of this approach lies in its ability to maintain structural coherence across the image while allowing for high levels of control during the generation process. As researchers began to apply these techniques to visual data, they found that structured prediction not only preserved spatial integrity but also supported tasks like image manipulation and multimodal translation effectively. Despite these benefits, generating high-resolution images remains computationally expensive and slow. A primary issue is the number of tokens needed to represent complex visuals. Raster-scan methods that flatten 2D images into linear sequences require thousands of tokens for detailed images, resulting in long inference times and high memory consumption. Models like Infinity need over 10,000 tokens for a 1024×1024 image. This becomes unsustainable for real-time applications or when scaling to more extensive datasets. Reducing the token burden while preserving or improving output quality has become a pressing challenge. Efforts to mitigate token inflation have led to innovations like next-scale prediction seen in VAR and FlexVAR. These models create images by predicting progressively finer scales, which imitates the human tendency to sketch rough outlines before adding detail. However, they still rely on hundreds of tokens—680 in the case of VAR and FlexVAR for 256×256 images. Moreover, approaches like TiTok and FlexTok use 1D tokenization to compress spatial redundancy, but they often fail to scale efficiently. For example, FlexTok’s gFID increases from 1.9 at 32 tokens to 2.5 at 256 tokens, highlighting a degradation in output quality as the token count grows. Researchers from ByteDance introduced DetailFlow, a 1D autoregressive image generation framework. This method arranges token sequences from global to fine detail using a process called next-detail prediction. Unlike traditional 2D raster-scan or scale-based techniques, DetailFlow employs a 1D tokenizer trained on progressively degraded images. This design allows the model to prioritize foundational image structures before refining visual details. By mapping tokens directly to resolution levels, DetailFlow significantly reduces token requirements, enabling images to be generated in a semantically ordered, coarse-to-fine manner. The mechanism in DetailFlow centers on a 1D latent space where each token contributes incrementally more detail. Earlier tokens encode global features, while later tokens refine specific visual aspects. To train this, the researchers created a resolution mapping function that links token count to target resolution. During training, the model is exposed to images of varying quality levels and learns to predict progressively higher-resolution outputs as more tokens are introduced. It also implements parallel token prediction by grouping sequences and predicting entire sets at once. Since parallel prediction can introduce sampling errors, a self-correction mechanism was integrated. This system perturbs certain tokens during training and teaches subsequent tokens to compensate, ensuring that final images maintain structural and visual integrity. The results from the experiments on the ImageNet 256×256 benchmark were noteworthy. DetailFlow achieved a gFID score of 2.96 using only 128 tokens, outperforming VAR at 3.3 and FlexVAR at 3.05, both of which used 680 tokens. Even more impressive, DetailFlow-64 reached a gFID of 2.62 using 512 tokens. In terms of speed, it delivered nearly double the inference rate of VAR and FlexVAR. A further ablation study confirmed that the self-correction training and semantic ordering of tokens substantially improved output quality. For example, enabling self-correction dropped the gFID from 4.11 to 3.68 in one setting. These metrics demonstrate both higher quality and faster generation compared to established models. By focusing on semantic structure and reducing redundancy, DetailFlow presents a viable solution to long-standing issues in autoregressive image generation. The method’s coarse-to-fine approach, efficient parallel decoding, and ability to self-correct highlight how architectural innovations can address performance and scalability limitations. Through their structured use of 1D tokens, the researchers from ByteDance have demonstrated a model that maintains high image fidelity while significantly reducing computational load, making it a valuable addition to image synthesis research. Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. NikhilNikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Material Science, he is exploring new advancements and creating opportunities to contribute.Nikhilhttps://www.marktechpost.com/author/nikhil0980/Teaching AI to Say ‘I Don’t Know’: A New Dataset Mitigates Hallucinations from Reinforcement FinetuningNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces LLaDA-V: A Purely Diffusion-Based Multimodal Large Language Model for Visual Instruction Tuning and Multimodal ReasoningNikhilhttps://www.marktechpost.com/author/nikhil0980/NVIDIA AI Introduces Fast-dLLM: A Training-Free Framework That Brings KV Caching and Parallel Decoding to Diffusion LLMsNikhilhttps://www.marktechpost.com/author/nikhil0980/Meet NovelSeek: A Unified Multi-Agent Framework for Autonomous Scientific Research from Hypothesis Generation to Experimental Validation #bytedance #researchers #introduce #detailflow #coarsetofine
    WWW.MARKTECHPOST.COM
    ByteDance Researchers Introduce DetailFlow: A 1D Coarse-to-Fine Autoregressive Framework for Faster, Token-Efficient Image Generation
    Autoregressive image generation has been shaped by advances in sequential modeling, originally seen in natural language processing. This field focuses on generating images one token at a time, similar to how sentences are constructed in language models. The appeal of this approach lies in its ability to maintain structural coherence across the image while allowing for high levels of control during the generation process. As researchers began to apply these techniques to visual data, they found that structured prediction not only preserved spatial integrity but also supported tasks like image manipulation and multimodal translation effectively. Despite these benefits, generating high-resolution images remains computationally expensive and slow. A primary issue is the number of tokens needed to represent complex visuals. Raster-scan methods that flatten 2D images into linear sequences require thousands of tokens for detailed images, resulting in long inference times and high memory consumption. Models like Infinity need over 10,000 tokens for a 1024×1024 image. This becomes unsustainable for real-time applications or when scaling to more extensive datasets. Reducing the token burden while preserving or improving output quality has become a pressing challenge. Efforts to mitigate token inflation have led to innovations like next-scale prediction seen in VAR and FlexVAR. These models create images by predicting progressively finer scales, which imitates the human tendency to sketch rough outlines before adding detail. However, they still rely on hundreds of tokens—680 in the case of VAR and FlexVAR for 256×256 images. Moreover, approaches like TiTok and FlexTok use 1D tokenization to compress spatial redundancy, but they often fail to scale efficiently. For example, FlexTok’s gFID increases from 1.9 at 32 tokens to 2.5 at 256 tokens, highlighting a degradation in output quality as the token count grows. Researchers from ByteDance introduced DetailFlow, a 1D autoregressive image generation framework. This method arranges token sequences from global to fine detail using a process called next-detail prediction. Unlike traditional 2D raster-scan or scale-based techniques, DetailFlow employs a 1D tokenizer trained on progressively degraded images. This design allows the model to prioritize foundational image structures before refining visual details. By mapping tokens directly to resolution levels, DetailFlow significantly reduces token requirements, enabling images to be generated in a semantically ordered, coarse-to-fine manner. The mechanism in DetailFlow centers on a 1D latent space where each token contributes incrementally more detail. Earlier tokens encode global features, while later tokens refine specific visual aspects. To train this, the researchers created a resolution mapping function that links token count to target resolution. During training, the model is exposed to images of varying quality levels and learns to predict progressively higher-resolution outputs as more tokens are introduced. It also implements parallel token prediction by grouping sequences and predicting entire sets at once. Since parallel prediction can introduce sampling errors, a self-correction mechanism was integrated. This system perturbs certain tokens during training and teaches subsequent tokens to compensate, ensuring that final images maintain structural and visual integrity. The results from the experiments on the ImageNet 256×256 benchmark were noteworthy. DetailFlow achieved a gFID score of 2.96 using only 128 tokens, outperforming VAR at 3.3 and FlexVAR at 3.05, both of which used 680 tokens. Even more impressive, DetailFlow-64 reached a gFID of 2.62 using 512 tokens. In terms of speed, it delivered nearly double the inference rate of VAR and FlexVAR. A further ablation study confirmed that the self-correction training and semantic ordering of tokens substantially improved output quality. For example, enabling self-correction dropped the gFID from 4.11 to 3.68 in one setting. These metrics demonstrate both higher quality and faster generation compared to established models. By focusing on semantic structure and reducing redundancy, DetailFlow presents a viable solution to long-standing issues in autoregressive image generation. The method’s coarse-to-fine approach, efficient parallel decoding, and ability to self-correct highlight how architectural innovations can address performance and scalability limitations. Through their structured use of 1D tokens, the researchers from ByteDance have demonstrated a model that maintains high image fidelity while significantly reducing computational load, making it a valuable addition to image synthesis research. Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. NikhilNikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Material Science, he is exploring new advancements and creating opportunities to contribute.Nikhilhttps://www.marktechpost.com/author/nikhil0980/Teaching AI to Say ‘I Don’t Know’: A New Dataset Mitigates Hallucinations from Reinforcement FinetuningNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces LLaDA-V: A Purely Diffusion-Based Multimodal Large Language Model for Visual Instruction Tuning and Multimodal ReasoningNikhilhttps://www.marktechpost.com/author/nikhil0980/NVIDIA AI Introduces Fast-dLLM: A Training-Free Framework That Brings KV Caching and Parallel Decoding to Diffusion LLMsNikhilhttps://www.marktechpost.com/author/nikhil0980/Meet NovelSeek: A Unified Multi-Agent Framework for Autonomous Scientific Research from Hypothesis Generation to Experimental Validation
    Like
    Love
    Wow
    Sad
    Angry
    821
    0 Commentarii 0 Distribuiri 0 previzualizare
  • Meet NovelSeek: A Unified Multi-Agent Framework for Autonomous Scientific Research from Hypothesis Generation to Experimental Validation

    Scientific research across fields like chemistry, biology, and artificial intelligence has long relied on human experts to explore knowledge, generate ideas, design experiments, and refine results. Yet, as problems grow more complex and data-intensive, discovery slows. While AI tools, such as language models and robotics, can handle specific tasks, like literature searches or code analysis, they rarely encompass the entire research cycle. Bridging the gap between idea generation and experimental validation remains a key challenge. For AI to autonomously advance science, it must propose hypotheses, design and execute experiments, analyze outcomes, and refine approaches in an iterative loop. Without this integration, AI risks producing disconnected ideas that depend on human supervision for validation.
    Before the introduction of a unified system, researchers relied on separate tools for each stage of the process. Large language models could help find relevant scientific papers, but they didn’t directly feed into experiment design or result analysis. Robotics can assist in automating physical experiments, and coding libraries like PyTorch can help build models; however, these tools operate independently of each other. There was no single system capable of handling the entire process, from forming ideas to verifying them through experiments. This led to bottlenecks, where researchers had to connect the dots manually, slowing progress and leaving room for errors or missed opportunities. The need for an integrated system that could handle the entire research cycle became clear.
    Researchers from the NovelSeek Team at the Shanghai Artificial Intelligence Laboratory developed NovelSeek, an AI system designed to run the entire scientific discovery process autonomously. NovelSeek comprises four main modules that work in tandem: a system that generates and refines research ideas, a feedback loop where human experts can interact with and refine these ideas, a method for translating ideas into code and experiment plans, and a process for conducting multiple rounds of experiments. What makes NovelSeek stand out is its versatility; it works across 12 scientific research tasks, including predicting chemical reaction yields, understanding molecular dynamics, forecasting time-series data, and handling functions like 2D semantic segmentation and 3D object classification. The team designed NovelSeek to minimize human involvement, expedite discoveries, and deliver consistent, high-quality results.

    The system behind NovelSeek involves multiple specialized agents, each focused on a specific part of the research workflow. The “Survey Agent” helps the system understand the problem by searching scientific papers and identifying relevant information based on keywords and task definitions. It adapts its search strategy by first doing a broad survey of papers, then going deeper by analyzing full-text documents for detailed insights. This ensures that the system captures both general trends and specific technical knowledge. The “Code Review Agent” examines existing codebases, whether user-uploaded or sourced from public repositories like GitHub, to understand how current methods work and identify areas for improvement. It checks how code is structured, looks for errors, and creates summaries that help the system build on past work. The “Idea Innovation Agent” generates creative research ideas, pushing the system to explore different approaches and refine them by comparing them to related studies and previous results. The system even includes a “Planning and Execution Agent” that turns ideas into detailed experiments, handles errors during the testing process, and ensures smooth execution of multi-step research plans.

    NovelSeek delivered impressive results across various tasks. In chemical reaction yield prediction, NovelSeek improved performance from a baseline of 24.2%to 34.8%in just 12 hours, progress that human researchers typically need months to achieve. In enhancer activity prediction, a key task in biology, NovelSeek raised the Pearson correlation coefficient from 0.65 to 0.79 within 4 hours. For 2D semantic segmentation, a task used in computer vision, precision improved from 78.8% to 81.0% in just 30 hours. These performance boosts, achieved in a fraction of the time typically needed, highlight the system’s efficiency. NovelSeek also successfully managed large, complex codebases with multiple files, demonstrating its ability to handle research tasks at a project level, not just in small, isolated tests. The team has made the code open-source, allowing others to use, test, and contribute to its improvement.

    Several Key Takeaways from the Research on NovelSeek include:

    NovelSeek supports 12 research tasks, including chemical reaction prediction, molecular dynamics, and 3D object classification.
    Reaction yield prediction accuracy improved from 24.2% to 34.8% in 12 hours.
    Enhancer activity prediction performance increased from 0.65 to 0.79 in 4 hours.
    2D semantic segmentation precision improved from 78.8% to 81.0% in 30 hours.
    NovelSeek includes agents for literature search, code analysis, idea generation, and experiment execution.
    The system is open-source, enabling reproducibility and collaboration across scientific fields.

    In conclusion, NovelSeek demonstrates how combining AI tools into a single system can accelerate scientific discovery and reduce its dependence on human effort. It ties together the key steps, generating ideas, turning them into methods, and testing them through experiments, into one streamlined process. What once took researchers months or years can now be done in days or even hours. By linking every stage of research into a continuous loop, NovelSeek helps teams move from rough ideas to real-world results more quickly. This system highlights the power of AI not just to assist, but to drive scientific research in a way that could reshape how discoveries are made across many fields.

    Check out the Paper and GitHub Page . All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter.
    NikhilNikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Material Science, he is exploring new advancements and creating opportunities to contribute.Nikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces ARM and Ada-GRPO: Adaptive Reasoning Models for Efficient and Scalable Problem-SolvingNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces WEB-SHEPHERD: A Process Reward Model for Web Agents with 40K Dataset and 10× Cost EfficiencyNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces MMaDA: A Unified Multimodal Diffusion Model for Textual Reasoning, Visual Understanding, and Image GenerationNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces Differentiable MCMC Layers: A New AI Framework for Learning with Inexact Combinatorial Solvers in Neural Networks
    #meet #novelseek #unified #multiagent #framework
    Meet NovelSeek: A Unified Multi-Agent Framework for Autonomous Scientific Research from Hypothesis Generation to Experimental Validation
    Scientific research across fields like chemistry, biology, and artificial intelligence has long relied on human experts to explore knowledge, generate ideas, design experiments, and refine results. Yet, as problems grow more complex and data-intensive, discovery slows. While AI tools, such as language models and robotics, can handle specific tasks, like literature searches or code analysis, they rarely encompass the entire research cycle. Bridging the gap between idea generation and experimental validation remains a key challenge. For AI to autonomously advance science, it must propose hypotheses, design and execute experiments, analyze outcomes, and refine approaches in an iterative loop. Without this integration, AI risks producing disconnected ideas that depend on human supervision for validation. Before the introduction of a unified system, researchers relied on separate tools for each stage of the process. Large language models could help find relevant scientific papers, but they didn’t directly feed into experiment design or result analysis. Robotics can assist in automating physical experiments, and coding libraries like PyTorch can help build models; however, these tools operate independently of each other. There was no single system capable of handling the entire process, from forming ideas to verifying them through experiments. This led to bottlenecks, where researchers had to connect the dots manually, slowing progress and leaving room for errors or missed opportunities. The need for an integrated system that could handle the entire research cycle became clear. Researchers from the NovelSeek Team at the Shanghai Artificial Intelligence Laboratory developed NovelSeek, an AI system designed to run the entire scientific discovery process autonomously. NovelSeek comprises four main modules that work in tandem: a system that generates and refines research ideas, a feedback loop where human experts can interact with and refine these ideas, a method for translating ideas into code and experiment plans, and a process for conducting multiple rounds of experiments. What makes NovelSeek stand out is its versatility; it works across 12 scientific research tasks, including predicting chemical reaction yields, understanding molecular dynamics, forecasting time-series data, and handling functions like 2D semantic segmentation and 3D object classification. The team designed NovelSeek to minimize human involvement, expedite discoveries, and deliver consistent, high-quality results. The system behind NovelSeek involves multiple specialized agents, each focused on a specific part of the research workflow. The “Survey Agent” helps the system understand the problem by searching scientific papers and identifying relevant information based on keywords and task definitions. It adapts its search strategy by first doing a broad survey of papers, then going deeper by analyzing full-text documents for detailed insights. This ensures that the system captures both general trends and specific technical knowledge. The “Code Review Agent” examines existing codebases, whether user-uploaded or sourced from public repositories like GitHub, to understand how current methods work and identify areas for improvement. It checks how code is structured, looks for errors, and creates summaries that help the system build on past work. The “Idea Innovation Agent” generates creative research ideas, pushing the system to explore different approaches and refine them by comparing them to related studies and previous results. The system even includes a “Planning and Execution Agent” that turns ideas into detailed experiments, handles errors during the testing process, and ensures smooth execution of multi-step research plans. NovelSeek delivered impressive results across various tasks. In chemical reaction yield prediction, NovelSeek improved performance from a baseline of 24.2%to 34.8%in just 12 hours, progress that human researchers typically need months to achieve. In enhancer activity prediction, a key task in biology, NovelSeek raised the Pearson correlation coefficient from 0.65 to 0.79 within 4 hours. For 2D semantic segmentation, a task used in computer vision, precision improved from 78.8% to 81.0% in just 30 hours. These performance boosts, achieved in a fraction of the time typically needed, highlight the system’s efficiency. NovelSeek also successfully managed large, complex codebases with multiple files, demonstrating its ability to handle research tasks at a project level, not just in small, isolated tests. The team has made the code open-source, allowing others to use, test, and contribute to its improvement. Several Key Takeaways from the Research on NovelSeek include: NovelSeek supports 12 research tasks, including chemical reaction prediction, molecular dynamics, and 3D object classification. Reaction yield prediction accuracy improved from 24.2% to 34.8% in 12 hours. Enhancer activity prediction performance increased from 0.65 to 0.79 in 4 hours. 2D semantic segmentation precision improved from 78.8% to 81.0% in 30 hours. NovelSeek includes agents for literature search, code analysis, idea generation, and experiment execution. The system is open-source, enabling reproducibility and collaboration across scientific fields. In conclusion, NovelSeek demonstrates how combining AI tools into a single system can accelerate scientific discovery and reduce its dependence on human effort. It ties together the key steps, generating ideas, turning them into methods, and testing them through experiments, into one streamlined process. What once took researchers months or years can now be done in days or even hours. By linking every stage of research into a continuous loop, NovelSeek helps teams move from rough ideas to real-world results more quickly. This system highlights the power of AI not just to assist, but to drive scientific research in a way that could reshape how discoveries are made across many fields. Check out the Paper and GitHub Page . All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. NikhilNikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Material Science, he is exploring new advancements and creating opportunities to contribute.Nikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces ARM and Ada-GRPO: Adaptive Reasoning Models for Efficient and Scalable Problem-SolvingNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces WEB-SHEPHERD: A Process Reward Model for Web Agents with 40K Dataset and 10× Cost EfficiencyNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces MMaDA: A Unified Multimodal Diffusion Model for Textual Reasoning, Visual Understanding, and Image GenerationNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces Differentiable MCMC Layers: A New AI Framework for Learning with Inexact Combinatorial Solvers in Neural Networks #meet #novelseek #unified #multiagent #framework
    WWW.MARKTECHPOST.COM
    Meet NovelSeek: A Unified Multi-Agent Framework for Autonomous Scientific Research from Hypothesis Generation to Experimental Validation
    Scientific research across fields like chemistry, biology, and artificial intelligence has long relied on human experts to explore knowledge, generate ideas, design experiments, and refine results. Yet, as problems grow more complex and data-intensive, discovery slows. While AI tools, such as language models and robotics, can handle specific tasks, like literature searches or code analysis, they rarely encompass the entire research cycle. Bridging the gap between idea generation and experimental validation remains a key challenge. For AI to autonomously advance science, it must propose hypotheses, design and execute experiments, analyze outcomes, and refine approaches in an iterative loop. Without this integration, AI risks producing disconnected ideas that depend on human supervision for validation. Before the introduction of a unified system, researchers relied on separate tools for each stage of the process. Large language models could help find relevant scientific papers, but they didn’t directly feed into experiment design or result analysis. Robotics can assist in automating physical experiments, and coding libraries like PyTorch can help build models; however, these tools operate independently of each other. There was no single system capable of handling the entire process, from forming ideas to verifying them through experiments. This led to bottlenecks, where researchers had to connect the dots manually, slowing progress and leaving room for errors or missed opportunities. The need for an integrated system that could handle the entire research cycle became clear. Researchers from the NovelSeek Team at the Shanghai Artificial Intelligence Laboratory developed NovelSeek, an AI system designed to run the entire scientific discovery process autonomously. NovelSeek comprises four main modules that work in tandem: a system that generates and refines research ideas, a feedback loop where human experts can interact with and refine these ideas, a method for translating ideas into code and experiment plans, and a process for conducting multiple rounds of experiments. What makes NovelSeek stand out is its versatility; it works across 12 scientific research tasks, including predicting chemical reaction yields, understanding molecular dynamics, forecasting time-series data, and handling functions like 2D semantic segmentation and 3D object classification. The team designed NovelSeek to minimize human involvement, expedite discoveries, and deliver consistent, high-quality results. The system behind NovelSeek involves multiple specialized agents, each focused on a specific part of the research workflow. The “Survey Agent” helps the system understand the problem by searching scientific papers and identifying relevant information based on keywords and task definitions. It adapts its search strategy by first doing a broad survey of papers, then going deeper by analyzing full-text documents for detailed insights. This ensures that the system captures both general trends and specific technical knowledge. The “Code Review Agent” examines existing codebases, whether user-uploaded or sourced from public repositories like GitHub, to understand how current methods work and identify areas for improvement. It checks how code is structured, looks for errors, and creates summaries that help the system build on past work. The “Idea Innovation Agent” generates creative research ideas, pushing the system to explore different approaches and refine them by comparing them to related studies and previous results. The system even includes a “Planning and Execution Agent” that turns ideas into detailed experiments, handles errors during the testing process, and ensures smooth execution of multi-step research plans. NovelSeek delivered impressive results across various tasks. In chemical reaction yield prediction, NovelSeek improved performance from a baseline of 24.2% (with a variation of ±4.2) to 34.8% (with a much smaller variation of ±1.1) in just 12 hours, progress that human researchers typically need months to achieve. In enhancer activity prediction, a key task in biology, NovelSeek raised the Pearson correlation coefficient from 0.65 to 0.79 within 4 hours. For 2D semantic segmentation, a task used in computer vision, precision improved from 78.8% to 81.0% in just 30 hours. These performance boosts, achieved in a fraction of the time typically needed, highlight the system’s efficiency. NovelSeek also successfully managed large, complex codebases with multiple files, demonstrating its ability to handle research tasks at a project level, not just in small, isolated tests. The team has made the code open-source, allowing others to use, test, and contribute to its improvement. Several Key Takeaways from the Research on NovelSeek include: NovelSeek supports 12 research tasks, including chemical reaction prediction, molecular dynamics, and 3D object classification. Reaction yield prediction accuracy improved from 24.2% to 34.8% in 12 hours. Enhancer activity prediction performance increased from 0.65 to 0.79 in 4 hours. 2D semantic segmentation precision improved from 78.8% to 81.0% in 30 hours. NovelSeek includes agents for literature search, code analysis, idea generation, and experiment execution. The system is open-source, enabling reproducibility and collaboration across scientific fields. In conclusion, NovelSeek demonstrates how combining AI tools into a single system can accelerate scientific discovery and reduce its dependence on human effort. It ties together the key steps, generating ideas, turning them into methods, and testing them through experiments, into one streamlined process. What once took researchers months or years can now be done in days or even hours. By linking every stage of research into a continuous loop, NovelSeek helps teams move from rough ideas to real-world results more quickly. This system highlights the power of AI not just to assist, but to drive scientific research in a way that could reshape how discoveries are made across many fields. Check out the Paper and GitHub Page . All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. NikhilNikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Material Science, he is exploring new advancements and creating opportunities to contribute.Nikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces ARM and Ada-GRPO: Adaptive Reasoning Models for Efficient and Scalable Problem-SolvingNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces WEB-SHEPHERD: A Process Reward Model for Web Agents with 40K Dataset and 10× Cost EfficiencyNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces MMaDA: A Unified Multimodal Diffusion Model for Textual Reasoning, Visual Understanding, and Image GenerationNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces Differentiable MCMC Layers: A New AI Framework for Learning with Inexact Combinatorial Solvers in Neural Networks
    0 Commentarii 0 Distribuiri 0 previzualizare
  • Bioprinted organs ‘10–15 years away,’ says startup regenerating dog skin

    Human organs could be bioprinted for transplants within 10 years, according to Lithuanian startup Vital3D. But before reaching human hearts and kidneys, the company is starting with something simpler: regenerating dog skin.
    Based in Vilnius, Vital3D is already bioprinting functional tissue constructs. Using a proprietary laser system, the startup deposits living cells and biomaterials in precise 3D patterns. The structures mimic natural biological systems — and could one day form entire organs tailored to a patient’s unique anatomy.
    That mission is both professional and personal for CEO Vidmantas Šakalys. After losing a mentor to urinary cancer, he set out to develop 3D-printed kidneys that could save others from the same fate. But before reaching that goal, the company needs a commercial product to fund the long road ahead.
    That product is VitalHeal — the first-ever bioprinted wound patch for pets. Dogs are the initial target, with human applications slated to follow.
    Šakalys calls the patch “a first step” towards bioprinted kidneys. “Printing organs for transplantation is a really challenging task,” he tells TNW after a tour of his lab. “It’s 10 or 15 years away from now, and as a commercial entity, we need to have commercially available products earlier. So we start with simpler products and then move into more difficult ones.”
    Register Now

    The path may be simpler, but the technology is anything but.
    Bioprinting goes to the vet
    VitalHeal is embedded with growth factors that accelerate skin regeneration.
    Across the patch’s surface, tiny pores about one-fifth the width of a human hair enable air circulation while blocking bacteria. Once applied, VitalHeal seals the wound and maintains constant pressure while the growth factors get to work.
    According to Vital3D, the patch can reduce healing time from 10–12 weeks to just four to six. Infection risk can drop from 30% to under 10%, vet visits from eight to two or three, and surgery times by half.
    Current treatments, the startup argues, can be costly, ineffective, and distressing for animals. VitalHeal is designed to provide a safer, faster, and cheaper alternative.
    Vital3D says the market is big — and the data backs up the claim.
    Vital3D’s FemtoBrush system promises high-speed and high-precision bioprinting. Credit: Vital3D
    Commercial prospects
    The global animal wound care market is projected to grow from bnin 2024 to bnby 2030, fuelled by rising pet ownership and demand for advanced veterinary care. Vital3D forecasts an initial serviceable addressable marketof €76.5mn across the EU and US. By 2027-2028, the company aims to sell 100,000 units.
    Dogs are a logical starting point. Their size, activity levels, and surgeries raise their risk of wounds. Around half of dogs over age 10 are also affected by cancer, further increasing demand for effective wound care.
    At €300 retail, the patches won’t be cheap. But Vital3D claims they could slash treatment costs for pet owners from €3,000 to €1,500. Production at scale is expected to bring prices down further. 
    After strong results in rats, trials on dogs will begin this summer in clinics in Lithuania and the UK — Vital3D’s pilot markets.
    If all goes to plan, a non-degradable patch will launch in Europe next year. The company will then progress to a biodegradable version.
    From there, the company plans to adapt the tech for humans. The initial focus will be wound care for people with diabetes, 25% of whom suffer from impaired healing. Future versions could support burn victims, injured soldiers, and others in need of advanced skin restoration.
    Freshly printed fluids in a bio-ink droplet. Credit: Vital3D
    Vital3D is also exploring other medical frontiers. In partnership with Lithuania’s National Cancer Institute, the startup is building organoids — mini versions of organs — for cancer drug testing. Another project involves bioprinted stents, which are showing promise in early animal trials. But all these efforts serve a bigger mission.
    “Our final target is to move to organ printing for transplants,” says Šakalys.
    Bioprinting organs
    A computer engineer by training, Šakalys has worked with photonic innovations for over 10 years. 
    At his previous startup, Femtika, he harnessed lasers to produce tiny components for microelectronics, medical devices, and aerospace engineering. He realised they could also enable precise bioprinting. 
    In 2021, he co-founded Vital3D to advance the concept. The company’s printing system directs light towards a photosensitive bio-ink. The material is hardened and formed into a structure, with living cells and biomaterials moulded into intricate 3D patterns.
    The shape of the laser beam can be adjusted to replicate complex biological forms — potentially even entire organs.
    But there are still major scientific hurdles to overcome. One is vascularisation, the formation of blood vessels in intricate networks. Another is the diverse variety of cell types in many organs. Replicating these sophisticated natural structures will be challenging.
    “First of all, we want to solve the vasculature. Then we will go into the differentiation of cells,” Šakalys says.
    “Our target is to see if we can print from fewer cells, but try to differentiate them while printing into different types of cells.” 
    If successful, Vital3D could help ease the global shortage of transplantable organs. Fewer than 10% of patients who need a transplant receive one each year, according to the World Health Organisation. In the US alone, around 90,000 people are waiting for a kidney — a shortfall that’s fuelling a thriving black market.
    Šakalys believes that could be just the start. He envisions bioprinting not just creating organs, but also advancing a new era of personalised medicine.
    “It can bring a lot of benefits to society,” he says. “Not just bioprinting for transplants, but also tissue engineering as well.”
    Want to discover the next big thing in tech? Then take a trip to TNW Conference, where thousands of founders, investors, and corporate innovators will share their ideas. The event takes place on June 19–20 in Amsterdam and tickets are on sale now. Use the code TNWXMEDIA2025 at the checkout to get 30% off.

    Story by

    Thomas Macaulay

    Managing editor

    Thomas is the managing editor of TNW. He leads our coverage of European tech and oversees our talented team of writers. Away from work, he eThomas is the managing editor of TNW. He leads our coverage of European tech and oversees our talented team of writers. Away from work, he enjoys playing chessand the guitar.

    Get the TNW newsletter
    Get the most important tech news in your inbox each week.

    Also tagged with
    #bioprinted #organs #years #away #says
    Bioprinted organs ‘10–15 years away,’ says startup regenerating dog skin
    Human organs could be bioprinted for transplants within 10 years, according to Lithuanian startup Vital3D. But before reaching human hearts and kidneys, the company is starting with something simpler: regenerating dog skin. Based in Vilnius, Vital3D is already bioprinting functional tissue constructs. Using a proprietary laser system, the startup deposits living cells and biomaterials in precise 3D patterns. The structures mimic natural biological systems — and could one day form entire organs tailored to a patient’s unique anatomy. That mission is both professional and personal for CEO Vidmantas Šakalys. After losing a mentor to urinary cancer, he set out to develop 3D-printed kidneys that could save others from the same fate. But before reaching that goal, the company needs a commercial product to fund the long road ahead. That product is VitalHeal — the first-ever bioprinted wound patch for pets. Dogs are the initial target, with human applications slated to follow. Šakalys calls the patch “a first step” towards bioprinted kidneys. “Printing organs for transplantation is a really challenging task,” he tells TNW after a tour of his lab. “It’s 10 or 15 years away from now, and as a commercial entity, we need to have commercially available products earlier. So we start with simpler products and then move into more difficult ones.” Register Now The path may be simpler, but the technology is anything but. Bioprinting goes to the vet VitalHeal is embedded with growth factors that accelerate skin regeneration. Across the patch’s surface, tiny pores about one-fifth the width of a human hair enable air circulation while blocking bacteria. Once applied, VitalHeal seals the wound and maintains constant pressure while the growth factors get to work. According to Vital3D, the patch can reduce healing time from 10–12 weeks to just four to six. Infection risk can drop from 30% to under 10%, vet visits from eight to two or three, and surgery times by half. Current treatments, the startup argues, can be costly, ineffective, and distressing for animals. VitalHeal is designed to provide a safer, faster, and cheaper alternative. Vital3D says the market is big — and the data backs up the claim. Vital3D’s FemtoBrush system promises high-speed and high-precision bioprinting. Credit: Vital3D Commercial prospects The global animal wound care market is projected to grow from bnin 2024 to bnby 2030, fuelled by rising pet ownership and demand for advanced veterinary care. Vital3D forecasts an initial serviceable addressable marketof €76.5mn across the EU and US. By 2027-2028, the company aims to sell 100,000 units. Dogs are a logical starting point. Their size, activity levels, and surgeries raise their risk of wounds. Around half of dogs over age 10 are also affected by cancer, further increasing demand for effective wound care. At €300 retail, the patches won’t be cheap. But Vital3D claims they could slash treatment costs for pet owners from €3,000 to €1,500. Production at scale is expected to bring prices down further.  After strong results in rats, trials on dogs will begin this summer in clinics in Lithuania and the UK — Vital3D’s pilot markets. If all goes to plan, a non-degradable patch will launch in Europe next year. The company will then progress to a biodegradable version. From there, the company plans to adapt the tech for humans. The initial focus will be wound care for people with diabetes, 25% of whom suffer from impaired healing. Future versions could support burn victims, injured soldiers, and others in need of advanced skin restoration. Freshly printed fluids in a bio-ink droplet. Credit: Vital3D Vital3D is also exploring other medical frontiers. In partnership with Lithuania’s National Cancer Institute, the startup is building organoids — mini versions of organs — for cancer drug testing. Another project involves bioprinted stents, which are showing promise in early animal trials. But all these efforts serve a bigger mission. “Our final target is to move to organ printing for transplants,” says Šakalys. Bioprinting organs A computer engineer by training, Šakalys has worked with photonic innovations for over 10 years.  At his previous startup, Femtika, he harnessed lasers to produce tiny components for microelectronics, medical devices, and aerospace engineering. He realised they could also enable precise bioprinting.  In 2021, he co-founded Vital3D to advance the concept. The company’s printing system directs light towards a photosensitive bio-ink. The material is hardened and formed into a structure, with living cells and biomaterials moulded into intricate 3D patterns. The shape of the laser beam can be adjusted to replicate complex biological forms — potentially even entire organs. But there are still major scientific hurdles to overcome. One is vascularisation, the formation of blood vessels in intricate networks. Another is the diverse variety of cell types in many organs. Replicating these sophisticated natural structures will be challenging. “First of all, we want to solve the vasculature. Then we will go into the differentiation of cells,” Šakalys says. “Our target is to see if we can print from fewer cells, but try to differentiate them while printing into different types of cells.”  If successful, Vital3D could help ease the global shortage of transplantable organs. Fewer than 10% of patients who need a transplant receive one each year, according to the World Health Organisation. In the US alone, around 90,000 people are waiting for a kidney — a shortfall that’s fuelling a thriving black market. Šakalys believes that could be just the start. He envisions bioprinting not just creating organs, but also advancing a new era of personalised medicine. “It can bring a lot of benefits to society,” he says. “Not just bioprinting for transplants, but also tissue engineering as well.” Want to discover the next big thing in tech? Then take a trip to TNW Conference, where thousands of founders, investors, and corporate innovators will share their ideas. The event takes place on June 19–20 in Amsterdam and tickets are on sale now. Use the code TNWXMEDIA2025 at the checkout to get 30% off. Story by Thomas Macaulay Managing editor Thomas is the managing editor of TNW. He leads our coverage of European tech and oversees our talented team of writers. Away from work, he eThomas is the managing editor of TNW. He leads our coverage of European tech and oversees our talented team of writers. Away from work, he enjoys playing chessand the guitar. Get the TNW newsletter Get the most important tech news in your inbox each week. Also tagged with #bioprinted #organs #years #away #says
    THENEXTWEB.COM
    Bioprinted organs ‘10–15 years away,’ says startup regenerating dog skin
    Human organs could be bioprinted for transplants within 10 years, according to Lithuanian startup Vital3D. But before reaching human hearts and kidneys, the company is starting with something simpler: regenerating dog skin. Based in Vilnius, Vital3D is already bioprinting functional tissue constructs. Using a proprietary laser system, the startup deposits living cells and biomaterials in precise 3D patterns. The structures mimic natural biological systems — and could one day form entire organs tailored to a patient’s unique anatomy. That mission is both professional and personal for CEO Vidmantas Šakalys. After losing a mentor to urinary cancer, he set out to develop 3D-printed kidneys that could save others from the same fate. But before reaching that goal, the company needs a commercial product to fund the long road ahead. That product is VitalHeal — the first-ever bioprinted wound patch for pets. Dogs are the initial target, with human applications slated to follow. Šakalys calls the patch “a first step” towards bioprinted kidneys. “Printing organs for transplantation is a really challenging task,” he tells TNW after a tour of his lab. “It’s 10 or 15 years away from now, and as a commercial entity, we need to have commercially available products earlier. So we start with simpler products and then move into more difficult ones.” Register Now The path may be simpler, but the technology is anything but. Bioprinting goes to the vet VitalHeal is embedded with growth factors that accelerate skin regeneration. Across the patch’s surface, tiny pores about one-fifth the width of a human hair enable air circulation while blocking bacteria. Once applied, VitalHeal seals the wound and maintains constant pressure while the growth factors get to work. According to Vital3D, the patch can reduce healing time from 10–12 weeks to just four to six. Infection risk can drop from 30% to under 10%, vet visits from eight to two or three, and surgery times by half. Current treatments, the startup argues, can be costly, ineffective, and distressing for animals. VitalHeal is designed to provide a safer, faster, and cheaper alternative. Vital3D says the market is big — and the data backs up the claim. Vital3D’s FemtoBrush system promises high-speed and high-precision bioprinting. Credit: Vital3D Commercial prospects The global animal wound care market is projected to grow from $1.4bn (€1.24bn) in 2024 to $2.1bn (€1.87bn) by 2030, fuelled by rising pet ownership and demand for advanced veterinary care. Vital3D forecasts an initial serviceable addressable market (ISAM) of €76.5mn across the EU and US. By 2027-2028, the company aims to sell 100,000 units. Dogs are a logical starting point. Their size, activity levels, and surgeries raise their risk of wounds. Around half of dogs over age 10 are also affected by cancer, further increasing demand for effective wound care. At €300 retail (or €150 wholesale), the patches won’t be cheap. But Vital3D claims they could slash treatment costs for pet owners from €3,000 to €1,500. Production at scale is expected to bring prices down further.  After strong results in rats, trials on dogs will begin this summer in clinics in Lithuania and the UK — Vital3D’s pilot markets. If all goes to plan, a non-degradable patch will launch in Europe next year. The company will then progress to a biodegradable version. From there, the company plans to adapt the tech for humans. The initial focus will be wound care for people with diabetes, 25% of whom suffer from impaired healing. Future versions could support burn victims, injured soldiers, and others in need of advanced skin restoration. Freshly printed fluids in a bio-ink droplet. Credit: Vital3D Vital3D is also exploring other medical frontiers. In partnership with Lithuania’s National Cancer Institute, the startup is building organoids — mini versions of organs — for cancer drug testing. Another project involves bioprinted stents, which are showing promise in early animal trials. But all these efforts serve a bigger mission. “Our final target is to move to organ printing for transplants,” says Šakalys. Bioprinting organs A computer engineer by training, Šakalys has worked with photonic innovations for over 10 years.  At his previous startup, Femtika, he harnessed lasers to produce tiny components for microelectronics, medical devices, and aerospace engineering. He realised they could also enable precise bioprinting.  In 2021, he co-founded Vital3D to advance the concept. The company’s printing system directs light towards a photosensitive bio-ink. The material is hardened and formed into a structure, with living cells and biomaterials moulded into intricate 3D patterns. The shape of the laser beam can be adjusted to replicate complex biological forms — potentially even entire organs. But there are still major scientific hurdles to overcome. One is vascularisation, the formation of blood vessels in intricate networks. Another is the diverse variety of cell types in many organs. Replicating these sophisticated natural structures will be challenging. “First of all, we want to solve the vasculature. Then we will go into the differentiation of cells,” Šakalys says. “Our target is to see if we can print from fewer cells, but try to differentiate them while printing into different types of cells.”  If successful, Vital3D could help ease the global shortage of transplantable organs. Fewer than 10% of patients who need a transplant receive one each year, according to the World Health Organisation. In the US alone, around 90,000 people are waiting for a kidney — a shortfall that’s fuelling a thriving black market. Šakalys believes that could be just the start. He envisions bioprinting not just creating organs, but also advancing a new era of personalised medicine. “It can bring a lot of benefits to society,” he says. “Not just bioprinting for transplants, but also tissue engineering as well.” Want to discover the next big thing in tech? Then take a trip to TNW Conference, where thousands of founders, investors, and corporate innovators will share their ideas. The event takes place on June 19–20 in Amsterdam and tickets are on sale now. Use the code TNWXMEDIA2025 at the checkout to get 30% off. Story by Thomas Macaulay Managing editor Thomas is the managing editor of TNW. He leads our coverage of European tech and oversees our talented team of writers. Away from work, he e (show all) Thomas is the managing editor of TNW. He leads our coverage of European tech and oversees our talented team of writers. Away from work, he enjoys playing chess (badly) and the guitar (even worse). Get the TNW newsletter Get the most important tech news in your inbox each week. Also tagged with
    0 Commentarii 0 Distribuiri 0 previzualizare
  • This AI Paper Introduces ARM and Ada-GRPO: Adaptive Reasoning Models for Efficient and Scalable Problem-Solving

    Reasoning tasks are a fundamental aspect of artificial intelligence, encompassing areas like commonsense understanding, mathematical problem-solving, and symbolic reasoning. These tasks often involve multiple steps of logical inference, which large language modelsattempt to mimic through structured approaches such as chain-of-thoughtprompting. However, as LLMs grow in size and complexity, they tend to produce longer outputs across all tasks, regardless of difficulty, leading to significant inefficiencies. The field has been striving to balance the depth of reasoning with computational cost while also ensuring that models can adapt their reasoning strategies to meet the unique needs of each problem.
    A key issue with current reasoning models is the inability to tailor the reasoning process to different task complexities. Most models, including well-known ones like OpenAI’s o1 and DeepSeek-R1, apply a uniform strategy—typically relying on Long CoT across all tasks. This causes the “overthinking” problem, where models generate unnecessarily verbose explanations for simpler tasks. Not only does this waste resources, but it also degrades accuracy, as excessive reasoning can introduce irrelevant information. Approaches such as prompt-guided generation or token budget estimation have attempted to mitigate this issue. Still, these methods are limited by their dependence on predefined assumptions, which are not always reliable for diverse tasks.

    Attempts to address these issues include methods like GRPO, length-penalty mechanisms, and rule-based prompt controls. While GRPO enables models to learn different reasoning strategies by rewarding correct answers, it leads to a “format collapse,” where models increasingly rely on Long CoT, crowding out more efficient formats, such as Short CoT or Direct Answer. Length-penalty techniques, such as those applied in methods like THINKPRUNE, control output length during training or inference, but often at the cost of reduced accuracy, especially in complex problem-solving tasks. These solutions struggle to achieve a consistent trade-off between reasoning effectiveness and efficiency, highlighting the need for an adaptive approach.
    A team of researchers from Fudan University and Ohio State University introduced the Adaptive Reasoning Model, which dynamically adjusts reasoning formats based on task difficulty. ARM supports four distinct reasoning styles: Direct Answer for simple tasks, Short CoT for concise reasoning, Code for structured problem-solving, and Long CoT for deep multi-step reasoning. It operates in an Adaptive Mode by default, automatically selecting the appropriate format, and also provides Instruction-Guided and Consensus-Guided Modes for explicit control or aggregation across formats. The key innovation lies in its training process, which utilizes Ada-GRPO, an extension of GRPO that introduces a format diversity reward mechanism. This prevents the dominance of Long CoT and ensures that ARM continues to explore and use simpler reasoning formats when appropriate.

    The ARM methodology is built on a two-stage framework. First, the model undergoes Supervised Fine-Tuningwith 10.8K questions, each annotated across four reasoning formats, sourced from datasets like AQuA-Rat and generated with tools such as GPT-4o and DeepSeek-R1. This stage teaches the model the structure of each reasoning format but does not instill adaptiveness. The second stage applies Ada-GRPO, where the model receives scaled rewards for using less frequent formats, such as Direct Answer or Short CoT. A decaying factor ensures that this reward gradually shifts back to accuracy as training progresses, preventing long-term bias toward inefficient exploration. This structure enables ARM to avoid format collapse and dynamically match reasoning strategies to task difficulty, achieving a balance of efficiency and performance.

    ARM demonstrated impressive results across various benchmarks, including commonsense, mathematical, and symbolic reasoning tasks. It reduced token usage by an average of 30%, with reductions as high as 70% for simpler tasks, compared to models relying solely on Long CoT. ARM achieved a 2x training speedup over GRPO-based models, accelerating model development without sacrificing accuracy. For example, ARM-7B achieved 75.9% accuracy on the challenging AIME’25 task while using 32.5% fewer tokens. ARM-14B achieved 85.6% accuracy on OpenBookQA and 86.4% accuracy on the MATH dataset, with a token usage reduction of over 30% compared to Qwen2.5SFT+GRPO models. These numbers demonstrate ARM’s ability to maintain competitive performance while delivering significant efficiency gains.
    Overall, the Adaptive Reasoning Model addresses the persistent inefficiency of reasoning models by enabling the adaptive selection of reasoning formats based on task difficulty. The introduction of Ada-GRPO and the multi-format training framework ensures that models no longer waste resources on overthinking. Instead, ARM provides a flexible and practical solution for balancing accuracy and computational cost in reasoning tasks, making it a promising approach for scalable and efficient large language models.

    Check out the Paper, Models on Hugging Face and Project Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter.
    NikhilNikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Material Science, he is exploring new advancements and creating opportunities to contribute.Nikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces WEB-SHEPHERD: A Process Reward Model for Web Agents with 40K Dataset and 10× Cost EfficiencyNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces MMaDA: A Unified Multimodal Diffusion Model for Textual Reasoning, Visual Understanding, and Image GenerationNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces Differentiable MCMC Layers: A New AI Framework for Learning with Inexact Combinatorial Solvers in Neural NetworksNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces GRIT: A Method for Teaching MLLMs to Reason with Images by Interleaving Text and Visual Grounding
    #this #paper #introduces #arm #adagrpo
    This AI Paper Introduces ARM and Ada-GRPO: Adaptive Reasoning Models for Efficient and Scalable Problem-Solving
    Reasoning tasks are a fundamental aspect of artificial intelligence, encompassing areas like commonsense understanding, mathematical problem-solving, and symbolic reasoning. These tasks often involve multiple steps of logical inference, which large language modelsattempt to mimic through structured approaches such as chain-of-thoughtprompting. However, as LLMs grow in size and complexity, they tend to produce longer outputs across all tasks, regardless of difficulty, leading to significant inefficiencies. The field has been striving to balance the depth of reasoning with computational cost while also ensuring that models can adapt their reasoning strategies to meet the unique needs of each problem. A key issue with current reasoning models is the inability to tailor the reasoning process to different task complexities. Most models, including well-known ones like OpenAI’s o1 and DeepSeek-R1, apply a uniform strategy—typically relying on Long CoT across all tasks. This causes the “overthinking” problem, where models generate unnecessarily verbose explanations for simpler tasks. Not only does this waste resources, but it also degrades accuracy, as excessive reasoning can introduce irrelevant information. Approaches such as prompt-guided generation or token budget estimation have attempted to mitigate this issue. Still, these methods are limited by their dependence on predefined assumptions, which are not always reliable for diverse tasks. Attempts to address these issues include methods like GRPO, length-penalty mechanisms, and rule-based prompt controls. While GRPO enables models to learn different reasoning strategies by rewarding correct answers, it leads to a “format collapse,” where models increasingly rely on Long CoT, crowding out more efficient formats, such as Short CoT or Direct Answer. Length-penalty techniques, such as those applied in methods like THINKPRUNE, control output length during training or inference, but often at the cost of reduced accuracy, especially in complex problem-solving tasks. These solutions struggle to achieve a consistent trade-off between reasoning effectiveness and efficiency, highlighting the need for an adaptive approach. A team of researchers from Fudan University and Ohio State University introduced the Adaptive Reasoning Model, which dynamically adjusts reasoning formats based on task difficulty. ARM supports four distinct reasoning styles: Direct Answer for simple tasks, Short CoT for concise reasoning, Code for structured problem-solving, and Long CoT for deep multi-step reasoning. It operates in an Adaptive Mode by default, automatically selecting the appropriate format, and also provides Instruction-Guided and Consensus-Guided Modes for explicit control or aggregation across formats. The key innovation lies in its training process, which utilizes Ada-GRPO, an extension of GRPO that introduces a format diversity reward mechanism. This prevents the dominance of Long CoT and ensures that ARM continues to explore and use simpler reasoning formats when appropriate. The ARM methodology is built on a two-stage framework. First, the model undergoes Supervised Fine-Tuningwith 10.8K questions, each annotated across four reasoning formats, sourced from datasets like AQuA-Rat and generated with tools such as GPT-4o and DeepSeek-R1. This stage teaches the model the structure of each reasoning format but does not instill adaptiveness. The second stage applies Ada-GRPO, where the model receives scaled rewards for using less frequent formats, such as Direct Answer or Short CoT. A decaying factor ensures that this reward gradually shifts back to accuracy as training progresses, preventing long-term bias toward inefficient exploration. This structure enables ARM to avoid format collapse and dynamically match reasoning strategies to task difficulty, achieving a balance of efficiency and performance. ARM demonstrated impressive results across various benchmarks, including commonsense, mathematical, and symbolic reasoning tasks. It reduced token usage by an average of 30%, with reductions as high as 70% for simpler tasks, compared to models relying solely on Long CoT. ARM achieved a 2x training speedup over GRPO-based models, accelerating model development without sacrificing accuracy. For example, ARM-7B achieved 75.9% accuracy on the challenging AIME’25 task while using 32.5% fewer tokens. ARM-14B achieved 85.6% accuracy on OpenBookQA and 86.4% accuracy on the MATH dataset, with a token usage reduction of over 30% compared to Qwen2.5SFT+GRPO models. These numbers demonstrate ARM’s ability to maintain competitive performance while delivering significant efficiency gains. Overall, the Adaptive Reasoning Model addresses the persistent inefficiency of reasoning models by enabling the adaptive selection of reasoning formats based on task difficulty. The introduction of Ada-GRPO and the multi-format training framework ensures that models no longer waste resources on overthinking. Instead, ARM provides a flexible and practical solution for balancing accuracy and computational cost in reasoning tasks, making it a promising approach for scalable and efficient large language models. Check out the Paper, Models on Hugging Face and Project Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. NikhilNikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Material Science, he is exploring new advancements and creating opportunities to contribute.Nikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces WEB-SHEPHERD: A Process Reward Model for Web Agents with 40K Dataset and 10× Cost EfficiencyNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces MMaDA: A Unified Multimodal Diffusion Model for Textual Reasoning, Visual Understanding, and Image GenerationNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces Differentiable MCMC Layers: A New AI Framework for Learning with Inexact Combinatorial Solvers in Neural NetworksNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces GRIT: A Method for Teaching MLLMs to Reason with Images by Interleaving Text and Visual Grounding #this #paper #introduces #arm #adagrpo
    WWW.MARKTECHPOST.COM
    This AI Paper Introduces ARM and Ada-GRPO: Adaptive Reasoning Models for Efficient and Scalable Problem-Solving
    Reasoning tasks are a fundamental aspect of artificial intelligence, encompassing areas like commonsense understanding, mathematical problem-solving, and symbolic reasoning. These tasks often involve multiple steps of logical inference, which large language models (LLMs) attempt to mimic through structured approaches such as chain-of-thought (CoT) prompting. However, as LLMs grow in size and complexity, they tend to produce longer outputs across all tasks, regardless of difficulty, leading to significant inefficiencies. The field has been striving to balance the depth of reasoning with computational cost while also ensuring that models can adapt their reasoning strategies to meet the unique needs of each problem. A key issue with current reasoning models is the inability to tailor the reasoning process to different task complexities. Most models, including well-known ones like OpenAI’s o1 and DeepSeek-R1, apply a uniform strategy—typically relying on Long CoT across all tasks. This causes the “overthinking” problem, where models generate unnecessarily verbose explanations for simpler tasks. Not only does this waste resources, but it also degrades accuracy, as excessive reasoning can introduce irrelevant information. Approaches such as prompt-guided generation or token budget estimation have attempted to mitigate this issue. Still, these methods are limited by their dependence on predefined assumptions, which are not always reliable for diverse tasks. Attempts to address these issues include methods like GRPO (Group Relative Policy Optimization), length-penalty mechanisms, and rule-based prompt controls. While GRPO enables models to learn different reasoning strategies by rewarding correct answers, it leads to a “format collapse,” where models increasingly rely on Long CoT, crowding out more efficient formats, such as Short CoT or Direct Answer. Length-penalty techniques, such as those applied in methods like THINKPRUNE, control output length during training or inference, but often at the cost of reduced accuracy, especially in complex problem-solving tasks. These solutions struggle to achieve a consistent trade-off between reasoning effectiveness and efficiency, highlighting the need for an adaptive approach. A team of researchers from Fudan University and Ohio State University introduced the Adaptive Reasoning Model (ARM), which dynamically adjusts reasoning formats based on task difficulty. ARM supports four distinct reasoning styles: Direct Answer for simple tasks, Short CoT for concise reasoning, Code for structured problem-solving, and Long CoT for deep multi-step reasoning. It operates in an Adaptive Mode by default, automatically selecting the appropriate format, and also provides Instruction-Guided and Consensus-Guided Modes for explicit control or aggregation across formats. The key innovation lies in its training process, which utilizes Ada-GRPO, an extension of GRPO that introduces a format diversity reward mechanism. This prevents the dominance of Long CoT and ensures that ARM continues to explore and use simpler reasoning formats when appropriate. The ARM methodology is built on a two-stage framework. First, the model undergoes Supervised Fine-Tuning (SFT) with 10.8K questions, each annotated across four reasoning formats, sourced from datasets like AQuA-Rat and generated with tools such as GPT-4o and DeepSeek-R1. This stage teaches the model the structure of each reasoning format but does not instill adaptiveness. The second stage applies Ada-GRPO, where the model receives scaled rewards for using less frequent formats, such as Direct Answer or Short CoT. A decaying factor ensures that this reward gradually shifts back to accuracy as training progresses, preventing long-term bias toward inefficient exploration. This structure enables ARM to avoid format collapse and dynamically match reasoning strategies to task difficulty, achieving a balance of efficiency and performance. ARM demonstrated impressive results across various benchmarks, including commonsense, mathematical, and symbolic reasoning tasks. It reduced token usage by an average of 30%, with reductions as high as 70% for simpler tasks, compared to models relying solely on Long CoT. ARM achieved a 2x training speedup over GRPO-based models, accelerating model development without sacrificing accuracy. For example, ARM-7B achieved 75.9% accuracy on the challenging AIME’25 task while using 32.5% fewer tokens. ARM-14B achieved 85.6% accuracy on OpenBookQA and 86.4% accuracy on the MATH dataset, with a token usage reduction of over 30% compared to Qwen2.5SFT+GRPO models. These numbers demonstrate ARM’s ability to maintain competitive performance while delivering significant efficiency gains. Overall, the Adaptive Reasoning Model addresses the persistent inefficiency of reasoning models by enabling the adaptive selection of reasoning formats based on task difficulty. The introduction of Ada-GRPO and the multi-format training framework ensures that models no longer waste resources on overthinking. Instead, ARM provides a flexible and practical solution for balancing accuracy and computational cost in reasoning tasks, making it a promising approach for scalable and efficient large language models. Check out the Paper, Models on Hugging Face and Project Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. NikhilNikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Material Science, he is exploring new advancements and creating opportunities to contribute.Nikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces WEB-SHEPHERD: A Process Reward Model for Web Agents with 40K Dataset and 10× Cost EfficiencyNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces MMaDA: A Unified Multimodal Diffusion Model for Textual Reasoning, Visual Understanding, and Image GenerationNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces Differentiable MCMC Layers: A New AI Framework for Learning with Inexact Combinatorial Solvers in Neural NetworksNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces GRIT: A Method for Teaching MLLMs to Reason with Images by Interleaving Text and Visual Grounding
    5 Commentarii 0 Distribuiri 0 previzualizare
  • This AI Paper Introduces WEB-SHEPHERD: A Process Reward Model for Web Agents with 40K Dataset and 10× Cost Efficiency

    Web navigation focuses on teaching machines how to interact with websites to perform tasks such as searching for information, shopping, or booking services. Building a capable web navigation agent is a complex task because it requires understanding the structure of websites, interpreting user goals, and making a series of decisions across multiple steps. These tasks are further complicated by the need for agents to adapt in dynamic web environments, where content can change frequently and where multimodal information, such as text and images, must be understood together.
    A key problem in web navigation is the absence of reliable and detailed reward models that can guide agents in real-time. Existing methods primarily rely on multimodal large language modelslike GPT-4o and GPT-4o-mini as evaluators, which are expensive, slow, and often inaccurate, especially when handling long sequences of actions in multi-step tasks. These models use prompting-based evaluation or binary success/failure feedback but fail to provide step-level guidance, often leading to errors such as repeated actions or missing critical steps like clicking specific buttons or filling form fields. This limitation reduces the practicality of deploying web agents in real-world scenarios, where efficiency, accuracy, and cost-effectiveness are crucial.

    The research team from Yonsei University and Carnegie Mellon University introduced WEB-SHEPHERD, a process reward model specifically designed for web navigation tasks. WEB-SHEPHERD is the first model to evaluate web navigation agents at the step level, using structured checklists to guide assessments. The researchers also developed the WEBPRM COLLECTION, a dataset of 40,000 step-level annotated web navigation tasks, and the WEBREWARDBENCH benchmark for evaluating PRMs. These resources were designed to enable WEB-SHEPHERD to provide detailed feedback by breaking down complex tasks into smaller, measurable subgoals.

    WEB-SHEPHERD works by generating a checklist for each task based on the user’s instruction, such as “Search for product” or “Click on product page,” and evaluates the agent’s progress against these subgoals. The model uses next-token prediction to generate feedback and assigns rewards based on checklist completion. This process enables WEB-SHEPHERD to assess the correctness of each step with fine-grained judgment. The model estimates the reward for each step by combining the probabilities of “Yes,” “No,” and “In Progress” tokens and averages these across the checklist. This detailed scoring system enables agents to receive targeted feedback on their progress, enhancing their ability to navigate complex websites.
    The researchers demonstrated that WEB-SHEPHERD significantly outperforms existing models. On the WEBREWARDBENCH benchmark, WEB-SHEPHERD achieved a Mean Reciprocal Rankscore of 87.6% and a trajectory accuracy of 55% in the text-only setting, compared to GPT-4o-mini’s 47.5% MRR and 0% trajectory accuracy without checklists. When tested in WebArena-lite using GPT-4o-mini as the policy model, WEB-SHEPHERD achieved a 34.55% success rate, which is 10.9 points higher than using GPT-4o-mini as the evaluator, while also being ten times more cost-efficient. In ablation studies, the researchers observed that WEB-SHEPHERD’s performance dropped significantly when checklists or feedback were removed, proving their importance for accurate reward assignments. They also showed that multimodal input, surprisingly, did not always improve performance and sometimes introduced noise.

    This research highlights the critical role of detailed process-level rewards in building reliable web agents. The team’s work addresses the core challenge of web navigation—evaluating complex, multi-step actions—and offers a solution that is both scalable and cost-effective. With WEB-SHEPHERD, agents can now receive accurate feedback during navigation, enabling them to make better decisions and complete tasks more effectively.

    Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter.
    NikhilNikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Material Science, he is exploring new advancements and creating opportunities to contribute.Nikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces MMaDA: A Unified Multimodal Diffusion Model for Textual Reasoning, Visual Understanding, and Image GenerationNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces Differentiable MCMC Layers: A New AI Framework for Learning with Inexact Combinatorial Solvers in Neural NetworksNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces GRIT: A Method for Teaching MLLMs to Reason with Images by Interleaving Text and Visual GroundingNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces Group Think: A Token-Level Multi-Agent Reasoning Paradigm for Faster and Collaborative LLM Inference
    #this #paper #introduces #webshepherd #process
    This AI Paper Introduces WEB-SHEPHERD: A Process Reward Model for Web Agents with 40K Dataset and 10× Cost Efficiency
    Web navigation focuses on teaching machines how to interact with websites to perform tasks such as searching for information, shopping, or booking services. Building a capable web navigation agent is a complex task because it requires understanding the structure of websites, interpreting user goals, and making a series of decisions across multiple steps. These tasks are further complicated by the need for agents to adapt in dynamic web environments, where content can change frequently and where multimodal information, such as text and images, must be understood together. A key problem in web navigation is the absence of reliable and detailed reward models that can guide agents in real-time. Existing methods primarily rely on multimodal large language modelslike GPT-4o and GPT-4o-mini as evaluators, which are expensive, slow, and often inaccurate, especially when handling long sequences of actions in multi-step tasks. These models use prompting-based evaluation or binary success/failure feedback but fail to provide step-level guidance, often leading to errors such as repeated actions or missing critical steps like clicking specific buttons or filling form fields. This limitation reduces the practicality of deploying web agents in real-world scenarios, where efficiency, accuracy, and cost-effectiveness are crucial. The research team from Yonsei University and Carnegie Mellon University introduced WEB-SHEPHERD, a process reward model specifically designed for web navigation tasks. WEB-SHEPHERD is the first model to evaluate web navigation agents at the step level, using structured checklists to guide assessments. The researchers also developed the WEBPRM COLLECTION, a dataset of 40,000 step-level annotated web navigation tasks, and the WEBREWARDBENCH benchmark for evaluating PRMs. These resources were designed to enable WEB-SHEPHERD to provide detailed feedback by breaking down complex tasks into smaller, measurable subgoals. WEB-SHEPHERD works by generating a checklist for each task based on the user’s instruction, such as “Search for product” or “Click on product page,” and evaluates the agent’s progress against these subgoals. The model uses next-token prediction to generate feedback and assigns rewards based on checklist completion. This process enables WEB-SHEPHERD to assess the correctness of each step with fine-grained judgment. The model estimates the reward for each step by combining the probabilities of “Yes,” “No,” and “In Progress” tokens and averages these across the checklist. This detailed scoring system enables agents to receive targeted feedback on their progress, enhancing their ability to navigate complex websites. The researchers demonstrated that WEB-SHEPHERD significantly outperforms existing models. On the WEBREWARDBENCH benchmark, WEB-SHEPHERD achieved a Mean Reciprocal Rankscore of 87.6% and a trajectory accuracy of 55% in the text-only setting, compared to GPT-4o-mini’s 47.5% MRR and 0% trajectory accuracy without checklists. When tested in WebArena-lite using GPT-4o-mini as the policy model, WEB-SHEPHERD achieved a 34.55% success rate, which is 10.9 points higher than using GPT-4o-mini as the evaluator, while also being ten times more cost-efficient. In ablation studies, the researchers observed that WEB-SHEPHERD’s performance dropped significantly when checklists or feedback were removed, proving their importance for accurate reward assignments. They also showed that multimodal input, surprisingly, did not always improve performance and sometimes introduced noise. This research highlights the critical role of detailed process-level rewards in building reliable web agents. The team’s work addresses the core challenge of web navigation—evaluating complex, multi-step actions—and offers a solution that is both scalable and cost-effective. With WEB-SHEPHERD, agents can now receive accurate feedback during navigation, enabling them to make better decisions and complete tasks more effectively. Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. NikhilNikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Material Science, he is exploring new advancements and creating opportunities to contribute.Nikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces MMaDA: A Unified Multimodal Diffusion Model for Textual Reasoning, Visual Understanding, and Image GenerationNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces Differentiable MCMC Layers: A New AI Framework for Learning with Inexact Combinatorial Solvers in Neural NetworksNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces GRIT: A Method for Teaching MLLMs to Reason with Images by Interleaving Text and Visual GroundingNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces Group Think: A Token-Level Multi-Agent Reasoning Paradigm for Faster and Collaborative LLM Inference #this #paper #introduces #webshepherd #process
    WWW.MARKTECHPOST.COM
    This AI Paper Introduces WEB-SHEPHERD: A Process Reward Model for Web Agents with 40K Dataset and 10× Cost Efficiency
    Web navigation focuses on teaching machines how to interact with websites to perform tasks such as searching for information, shopping, or booking services. Building a capable web navigation agent is a complex task because it requires understanding the structure of websites, interpreting user goals, and making a series of decisions across multiple steps. These tasks are further complicated by the need for agents to adapt in dynamic web environments, where content can change frequently and where multimodal information, such as text and images, must be understood together. A key problem in web navigation is the absence of reliable and detailed reward models that can guide agents in real-time. Existing methods primarily rely on multimodal large language models (MLLMs) like GPT-4o and GPT-4o-mini as evaluators, which are expensive, slow, and often inaccurate, especially when handling long sequences of actions in multi-step tasks. These models use prompting-based evaluation or binary success/failure feedback but fail to provide step-level guidance, often leading to errors such as repeated actions or missing critical steps like clicking specific buttons or filling form fields. This limitation reduces the practicality of deploying web agents in real-world scenarios, where efficiency, accuracy, and cost-effectiveness are crucial. The research team from Yonsei University and Carnegie Mellon University introduced WEB-SHEPHERD, a process reward model specifically designed for web navigation tasks. WEB-SHEPHERD is the first model to evaluate web navigation agents at the step level, using structured checklists to guide assessments. The researchers also developed the WEBPRM COLLECTION, a dataset of 40,000 step-level annotated web navigation tasks, and the WEBREWARDBENCH benchmark for evaluating PRMs. These resources were designed to enable WEB-SHEPHERD to provide detailed feedback by breaking down complex tasks into smaller, measurable subgoals. WEB-SHEPHERD works by generating a checklist for each task based on the user’s instruction, such as “Search for product” or “Click on product page,” and evaluates the agent’s progress against these subgoals. The model uses next-token prediction to generate feedback and assigns rewards based on checklist completion. This process enables WEB-SHEPHERD to assess the correctness of each step with fine-grained judgment. The model estimates the reward for each step by combining the probabilities of “Yes,” “No,” and “In Progress” tokens and averages these across the checklist. This detailed scoring system enables agents to receive targeted feedback on their progress, enhancing their ability to navigate complex websites. The researchers demonstrated that WEB-SHEPHERD significantly outperforms existing models. On the WEBREWARDBENCH benchmark, WEB-SHEPHERD achieved a Mean Reciprocal Rank (MRR) score of 87.6% and a trajectory accuracy of 55% in the text-only setting, compared to GPT-4o-mini’s 47.5% MRR and 0% trajectory accuracy without checklists. When tested in WebArena-lite using GPT-4o-mini as the policy model, WEB-SHEPHERD achieved a 34.55% success rate, which is 10.9 points higher than using GPT-4o-mini as the evaluator, while also being ten times more cost-efficient. In ablation studies, the researchers observed that WEB-SHEPHERD’s performance dropped significantly when checklists or feedback were removed, proving their importance for accurate reward assignments. They also showed that multimodal input, surprisingly, did not always improve performance and sometimes introduced noise. This research highlights the critical role of detailed process-level rewards in building reliable web agents. The team’s work addresses the core challenge of web navigation—evaluating complex, multi-step actions—and offers a solution that is both scalable and cost-effective. With WEB-SHEPHERD, agents can now receive accurate feedback during navigation, enabling them to make better decisions and complete tasks more effectively. Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. NikhilNikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Material Science, he is exploring new advancements and creating opportunities to contribute.Nikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces MMaDA: A Unified Multimodal Diffusion Model for Textual Reasoning, Visual Understanding, and Image GenerationNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces Differentiable MCMC Layers: A New AI Framework for Learning with Inexact Combinatorial Solvers in Neural NetworksNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces GRIT: A Method for Teaching MLLMs to Reason with Images by Interleaving Text and Visual GroundingNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces Group Think: A Token-Level Multi-Agent Reasoning Paradigm for Faster and Collaborative LLM Inference
    0 Commentarii 0 Distribuiri 0 previzualizare
  • This AI Paper Introduces GRIT: A Method for Teaching MLLMs to Reason with Images by Interleaving Text and Visual Grounding

    The core idea of Multimodal Large Language Modelsis to create models that can combine the richness of visual content with the logic of language. However, despite advances in this field, many models struggle to connect the two domains effectively, leading to limited performance in complex reasoning tasks that involve visual components.
    A major challenge in building such models is their limited ability to combine visual understanding with logical thinking. Current systems often produce textual outputs that explain reasoning but fail to reference the specific parts of an image they rely on. This creates a gap where models may arrive at an answer without clearly showing how the visual evidence contributed to their decision. It’s also difficult to ensure that models generate visual reasoning steps directly connecting to their answers. The fundamental problem lies in how to naturally train models to interleave text and image reasoning without needing large datasets annotated with visual references, which are scarce and expensive to produce.

    Existing methods try to address this by using reinforcement learning or prompting strategies. Some systems generate bounding box coordinates as answers, while others produce step-by-step textual reasoning chains. However, these approaches have limitations. Models that only produce bounding boxes lack explanation, while those generating only text risk ignoring visual evidence. Previous methods often separate visual grounding and reasoning, making it hard for models to explain why a particular visual element leads to a certain conclusion. While some models use dense supervision data or additional tools, they generally require heavy annotation and do not scale well. This makes it difficult for developers to create models that can explain their reasoning transparently and handle various visual tasks with minimal data.
    Researchers from UC Santa Cruz and eBay introduced a new method called Grounded Reasoning with Images and Textthat allows MLLMs like Qwen 2.5-VL and InternVL 3 to generate reasoning chains that mix natural language with explicit bounding box coordinates pointing to relevant image regions. This unified approach enables models to reason about and visually ground their answers without requiring dense annotations or labeled reasoning chains. GRIT also uses a lightweight reinforcement learning algorithm called GRPO-GR, which optimizes both the accuracy of the final answer and the structure of the reasoning, encouraging models to include specific tokens like <think> and <rethink>, as well as bounding box formats. This design eliminates the need for costly annotated data while ensuring that models learn to reference visual content meaningfully within their logical steps.

    The methodology in GRIT focuses on generating outputs that combine textual reasoning and visual grounding seamlessly. Instead of requiring models to process cropped images or additional visual data after generating bounding boxes, GRIT teaches models to use their internal understanding of the image. Bounding boxes are generated during the reasoning process, and models learn to reflect on these coordinates within their logical reasoning. The reinforcement learning framework rewards the correct use of bounding box formats and reasoning structure, and it guides models to produce coherent, grounded reasoning chains. GRIT demonstrates remarkable data efficiency by using only 20 image-question-answer triplets sourced from Visual Spatial Reasoning and TallyQA datasets. The model training was conducted on NVIDIA A100 GPUs, with optimization techniques like AdamW and a cosine scheduler applied over 200 training steps, which shows the method’s scalability despite limited data.
    Performance evaluations revealed that GRIT-trained models outperform several baselines in reasoning and grounding accuracy. For example, Qwen 2.5-VL trained with GRIT achieved 72.9% answer accuracy on Visual Spatial Reasoning, 47.8% on TallyQA, and 62.8% on GQA datasets. It also reached a grounding IoU score of 0.325 on VSR and 0.447 on TallyQA. In contrast, baseline models like Direct Query or Chain-of-Thought often performed significantly lower, showing limited ability to unify reasoning with visual grounding. GRIT models demonstrated a strong correlation between visual regions and textual reasoning, producing outputs that reflected a meaningful connection between image evidence and logical thought. GRIT also showed improvements on out-of-domain benchmarks, though gains were more pronounced on in-domain data, highlighting the importance of training data diversity for broader generalization.

    In conclusion, the research addressed the problem of disconnected reasoning and visual grounding in MLLMs by introducing GRIT. The method allows models to reason with images through a simple, efficient approach that requires minimal data. GRIT successfully teaches MLLMs to combine visual evidence with logical reasoning in a unified output, achieving strong performance across multiple benchmarks and demonstrating a promising step toward more interpretable AI systems.

    Check out the Paper, Project, and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter.
    NikhilNikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Material Science, he is exploring new advancements and creating opportunities to contribute.Nikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces Group Think: A Token-Level Multi-Agent Reasoning Paradigm for Faster and Collaborative LLM InferenceNikhilhttps://www.marktechpost.com/author/nikhil0980/Researchers from the National University of Singapore Introduce ‘Thinkless,’ an Adaptive Framework that Reduces Unnecessary Reasoning by up to 90% Using DeGRPONikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces MathCoder-VL and FigCodifier: Advancing Multimodal Mathematical Reasoning with Vision-to-Code AlignmentNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces PARSCALE: A Parallel Computation Method for Efficient and Scalable Language Model Deployment
    #this #paper #introduces #grit #method
    This AI Paper Introduces GRIT: A Method for Teaching MLLMs to Reason with Images by Interleaving Text and Visual Grounding
    The core idea of Multimodal Large Language Modelsis to create models that can combine the richness of visual content with the logic of language. However, despite advances in this field, many models struggle to connect the two domains effectively, leading to limited performance in complex reasoning tasks that involve visual components. A major challenge in building such models is their limited ability to combine visual understanding with logical thinking. Current systems often produce textual outputs that explain reasoning but fail to reference the specific parts of an image they rely on. This creates a gap where models may arrive at an answer without clearly showing how the visual evidence contributed to their decision. It’s also difficult to ensure that models generate visual reasoning steps directly connecting to their answers. The fundamental problem lies in how to naturally train models to interleave text and image reasoning without needing large datasets annotated with visual references, which are scarce and expensive to produce. Existing methods try to address this by using reinforcement learning or prompting strategies. Some systems generate bounding box coordinates as answers, while others produce step-by-step textual reasoning chains. However, these approaches have limitations. Models that only produce bounding boxes lack explanation, while those generating only text risk ignoring visual evidence. Previous methods often separate visual grounding and reasoning, making it hard for models to explain why a particular visual element leads to a certain conclusion. While some models use dense supervision data or additional tools, they generally require heavy annotation and do not scale well. This makes it difficult for developers to create models that can explain their reasoning transparently and handle various visual tasks with minimal data. Researchers from UC Santa Cruz and eBay introduced a new method called Grounded Reasoning with Images and Textthat allows MLLMs like Qwen 2.5-VL and InternVL 3 to generate reasoning chains that mix natural language with explicit bounding box coordinates pointing to relevant image regions. This unified approach enables models to reason about and visually ground their answers without requiring dense annotations or labeled reasoning chains. GRIT also uses a lightweight reinforcement learning algorithm called GRPO-GR, which optimizes both the accuracy of the final answer and the structure of the reasoning, encouraging models to include specific tokens like <think> and <rethink>, as well as bounding box formats. This design eliminates the need for costly annotated data while ensuring that models learn to reference visual content meaningfully within their logical steps. The methodology in GRIT focuses on generating outputs that combine textual reasoning and visual grounding seamlessly. Instead of requiring models to process cropped images or additional visual data after generating bounding boxes, GRIT teaches models to use their internal understanding of the image. Bounding boxes are generated during the reasoning process, and models learn to reflect on these coordinates within their logical reasoning. The reinforcement learning framework rewards the correct use of bounding box formats and reasoning structure, and it guides models to produce coherent, grounded reasoning chains. GRIT demonstrates remarkable data efficiency by using only 20 image-question-answer triplets sourced from Visual Spatial Reasoning and TallyQA datasets. The model training was conducted on NVIDIA A100 GPUs, with optimization techniques like AdamW and a cosine scheduler applied over 200 training steps, which shows the method’s scalability despite limited data. Performance evaluations revealed that GRIT-trained models outperform several baselines in reasoning and grounding accuracy. For example, Qwen 2.5-VL trained with GRIT achieved 72.9% answer accuracy on Visual Spatial Reasoning, 47.8% on TallyQA, and 62.8% on GQA datasets. It also reached a grounding IoU score of 0.325 on VSR and 0.447 on TallyQA. In contrast, baseline models like Direct Query or Chain-of-Thought often performed significantly lower, showing limited ability to unify reasoning with visual grounding. GRIT models demonstrated a strong correlation between visual regions and textual reasoning, producing outputs that reflected a meaningful connection between image evidence and logical thought. GRIT also showed improvements on out-of-domain benchmarks, though gains were more pronounced on in-domain data, highlighting the importance of training data diversity for broader generalization. In conclusion, the research addressed the problem of disconnected reasoning and visual grounding in MLLMs by introducing GRIT. The method allows models to reason with images through a simple, efficient approach that requires minimal data. GRIT successfully teaches MLLMs to combine visual evidence with logical reasoning in a unified output, achieving strong performance across multiple benchmarks and demonstrating a promising step toward more interpretable AI systems. Check out the Paper, Project, and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. NikhilNikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Material Science, he is exploring new advancements and creating opportunities to contribute.Nikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces Group Think: A Token-Level Multi-Agent Reasoning Paradigm for Faster and Collaborative LLM InferenceNikhilhttps://www.marktechpost.com/author/nikhil0980/Researchers from the National University of Singapore Introduce ‘Thinkless,’ an Adaptive Framework that Reduces Unnecessary Reasoning by up to 90% Using DeGRPONikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces MathCoder-VL and FigCodifier: Advancing Multimodal Mathematical Reasoning with Vision-to-Code AlignmentNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces PARSCALE: A Parallel Computation Method for Efficient and Scalable Language Model Deployment #this #paper #introduces #grit #method
    WWW.MARKTECHPOST.COM
    This AI Paper Introduces GRIT: A Method for Teaching MLLMs to Reason with Images by Interleaving Text and Visual Grounding
    The core idea of Multimodal Large Language Models (MLLMs) is to create models that can combine the richness of visual content with the logic of language. However, despite advances in this field, many models struggle to connect the two domains effectively, leading to limited performance in complex reasoning tasks that involve visual components. A major challenge in building such models is their limited ability to combine visual understanding with logical thinking. Current systems often produce textual outputs that explain reasoning but fail to reference the specific parts of an image they rely on. This creates a gap where models may arrive at an answer without clearly showing how the visual evidence contributed to their decision. It’s also difficult to ensure that models generate visual reasoning steps directly connecting to their answers. The fundamental problem lies in how to naturally train models to interleave text and image reasoning without needing large datasets annotated with visual references, which are scarce and expensive to produce. Existing methods try to address this by using reinforcement learning or prompting strategies. Some systems generate bounding box coordinates as answers, while others produce step-by-step textual reasoning chains. However, these approaches have limitations. Models that only produce bounding boxes lack explanation, while those generating only text risk ignoring visual evidence. Previous methods often separate visual grounding and reasoning, making it hard for models to explain why a particular visual element leads to a certain conclusion. While some models use dense supervision data or additional tools, they generally require heavy annotation and do not scale well. This makes it difficult for developers to create models that can explain their reasoning transparently and handle various visual tasks with minimal data. Researchers from UC Santa Cruz and eBay introduced a new method called Grounded Reasoning with Images and Text (GRIT) that allows MLLMs like Qwen 2.5-VL and InternVL 3 to generate reasoning chains that mix natural language with explicit bounding box coordinates pointing to relevant image regions. This unified approach enables models to reason about and visually ground their answers without requiring dense annotations or labeled reasoning chains. GRIT also uses a lightweight reinforcement learning algorithm called GRPO-GR, which optimizes both the accuracy of the final answer and the structure of the reasoning, encouraging models to include specific tokens like <think> and <rethink>, as well as bounding box formats. This design eliminates the need for costly annotated data while ensuring that models learn to reference visual content meaningfully within their logical steps. The methodology in GRIT focuses on generating outputs that combine textual reasoning and visual grounding seamlessly. Instead of requiring models to process cropped images or additional visual data after generating bounding boxes, GRIT teaches models to use their internal understanding of the image. Bounding boxes are generated during the reasoning process, and models learn to reflect on these coordinates within their logical reasoning. The reinforcement learning framework rewards the correct use of bounding box formats and reasoning structure, and it guides models to produce coherent, grounded reasoning chains. GRIT demonstrates remarkable data efficiency by using only 20 image-question-answer triplets sourced from Visual Spatial Reasoning and TallyQA datasets. The model training was conducted on NVIDIA A100 GPUs, with optimization techniques like AdamW and a cosine scheduler applied over 200 training steps, which shows the method’s scalability despite limited data. Performance evaluations revealed that GRIT-trained models outperform several baselines in reasoning and grounding accuracy. For example, Qwen 2.5-VL trained with GRIT achieved 72.9% answer accuracy on Visual Spatial Reasoning, 47.8% on TallyQA, and 62.8% on GQA datasets. It also reached a grounding IoU score of 0.325 on VSR and 0.447 on TallyQA. In contrast, baseline models like Direct Query or Chain-of-Thought often performed significantly lower, showing limited ability to unify reasoning with visual grounding. GRIT models demonstrated a strong correlation between visual regions and textual reasoning, producing outputs that reflected a meaningful connection between image evidence and logical thought. GRIT also showed improvements on out-of-domain benchmarks, though gains were more pronounced on in-domain data, highlighting the importance of training data diversity for broader generalization. In conclusion, the research addressed the problem of disconnected reasoning and visual grounding in MLLMs by introducing GRIT. The method allows models to reason with images through a simple, efficient approach that requires minimal data. GRIT successfully teaches MLLMs to combine visual evidence with logical reasoning in a unified output, achieving strong performance across multiple benchmarks and demonstrating a promising step toward more interpretable AI systems. Check out the Paper, Project, and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. NikhilNikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Material Science, he is exploring new advancements and creating opportunities to contribute.Nikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces Group Think: A Token-Level Multi-Agent Reasoning Paradigm for Faster and Collaborative LLM InferenceNikhilhttps://www.marktechpost.com/author/nikhil0980/Researchers from the National University of Singapore Introduce ‘Thinkless,’ an Adaptive Framework that Reduces Unnecessary Reasoning by up to 90% Using DeGRPONikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces MathCoder-VL and FigCodifier: Advancing Multimodal Mathematical Reasoning with Vision-to-Code AlignmentNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces PARSCALE (Parallel Scaling): A Parallel Computation Method for Efficient and Scalable Language Model Deployment
    0 Commentarii 0 Distribuiri 0 previzualizare
  • This AI Paper Introduces Group Think: A Token-Level Multi-Agent Reasoning Paradigm for Faster and Collaborative LLM Inference

    A prominent area of exploration involves enabling large language modelsto function collaboratively. Multi-agent systems powered by LLMs are now being examined for their potential to coordinate challenging problems by splitting tasks and working simultaneously. This direction has gained attention due to its potential to increase efficiency and reduce latency in real-time applications.
    A common issue in collaborative LLM systems is agents’ sequential, turn-based communication. In such systems, each agent must wait for others to complete their reasoning steps before proceeding. This slows down processing, especially in situations demanding rapid responses. Moreover, agents often duplicate efforts or generate inconsistent outputs, as they cannot see the evolving thoughts of their peers during generation. This latency and redundancy reduce the practicality of deploying multi-agent LLMs, particularly when time and computation are constrained, such as edge devices.

    Most current solutions have relied on sequential or independently parallel sampling techniques to improve reasoning. Methods like Chain-of-Thought prompting help models to solve problems in a structured way but often come with increased inference time. Approaches such as Tree-of-Thoughts and Graph-of-Thoughts expand on this by branching reasoning paths. However, these approaches still do not allow for real-time mutual adaptation among agents. Multi-agent setups have explored collaborative methods, but mostly through alternating message exchanges, which again introduces delays. Some advanced systems propose complex dynamic scheduling or role-based configurations, which are not optimized for efficient inference.
    Research from MediaTek Research introduced a new method called Group Think. This approach enables multiple reasoning agents within a single LLM to operate concurrently, observing each other’s partial outputs at the token level. Each reasoning thread adapts to the evolving thoughts of the others mid-generation. This mechanism reduces duplication and enables agents to shift direction if another thread is better positioned to continue a specific line of reasoning. Group Think is implemented through a token-level attention mechanism that lets each agent attend to previously generated tokens from all agents, supporting real-time collaboration.
    The method works by assigning each agent its own sequence of token indices, allowing their outputs to be interleaved in memory. These interleaved tokens are stored in a shared cache accessible to all agents during generation. This design allows efficient attention across reasoning threads without architectural changes to the transformer model. The implementation works both on personal devices and in data centers. On local devices, it effectively uses idle compute by batching multiple agent outputs, even with a batch size of one. In data centers, Group Think allows multiple requests to be processed together, interleaving tokens across agents while maintaining correct attention dynamics.

    Performance tests demonstrate that Group Think significantly improves latency and output quality. In enumeration tasks, such as listing 100 distinct names, it achieved near-complete results more rapidly than conventional Chain-of-Thought approaches. The acceleration was proportional to the number of thinkers; for example, four thinkers reduced latency by a factor of about four. In divide-and-conquer problems, using the Floyd–Warshall algorithm on a graph of five nodes, four thinkers reduced the completion time to half that of a single agent. Group Think solved code generation challenges in programming tasks more effectively than baseline models. With four or more thinkers, the model produced correct code segments much faster than traditional reasoning models.
    This research shows that existing LLMs, though not explicitly trained for collaboration, can already demonstrate emergent group reasoning behaviors under the Group Think setup. In experiments, agents naturally diversified their work to avoid redundancy, often dividing tasks by topic or focus area. These findings suggest that Group Think’s efficiency and sophistication could be enhanced further with dedicated training on collaborative data.

    Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter.
    NikhilNikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Material Science, he is exploring new advancements and creating opportunities to contribute.Nikhilhttps://www.marktechpost.com/author/nikhil0980/Researchers from the National University of Singapore Introduce ‘Thinkless,’ an Adaptive Framework that Reduces Unnecessary Reasoning by up to 90% Using DeGRPONikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces MathCoder-VL and FigCodifier: Advancing Multimodal Mathematical Reasoning with Vision-to-Code AlignmentNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces PARSCALE: A Parallel Computation Method for Efficient and Scalable Language Model DeploymentNikhilhttps://www.marktechpost.com/author/nikhil0980/Google AI Releases Standalone NotebookLM Mobile App with Offline Audio and Seamless Source Integration
    #this #paper #introduces #group #think
    This AI Paper Introduces Group Think: A Token-Level Multi-Agent Reasoning Paradigm for Faster and Collaborative LLM Inference
    A prominent area of exploration involves enabling large language modelsto function collaboratively. Multi-agent systems powered by LLMs are now being examined for their potential to coordinate challenging problems by splitting tasks and working simultaneously. This direction has gained attention due to its potential to increase efficiency and reduce latency in real-time applications. A common issue in collaborative LLM systems is agents’ sequential, turn-based communication. In such systems, each agent must wait for others to complete their reasoning steps before proceeding. This slows down processing, especially in situations demanding rapid responses. Moreover, agents often duplicate efforts or generate inconsistent outputs, as they cannot see the evolving thoughts of their peers during generation. This latency and redundancy reduce the practicality of deploying multi-agent LLMs, particularly when time and computation are constrained, such as edge devices. Most current solutions have relied on sequential or independently parallel sampling techniques to improve reasoning. Methods like Chain-of-Thought prompting help models to solve problems in a structured way but often come with increased inference time. Approaches such as Tree-of-Thoughts and Graph-of-Thoughts expand on this by branching reasoning paths. However, these approaches still do not allow for real-time mutual adaptation among agents. Multi-agent setups have explored collaborative methods, but mostly through alternating message exchanges, which again introduces delays. Some advanced systems propose complex dynamic scheduling or role-based configurations, which are not optimized for efficient inference. Research from MediaTek Research introduced a new method called Group Think. This approach enables multiple reasoning agents within a single LLM to operate concurrently, observing each other’s partial outputs at the token level. Each reasoning thread adapts to the evolving thoughts of the others mid-generation. This mechanism reduces duplication and enables agents to shift direction if another thread is better positioned to continue a specific line of reasoning. Group Think is implemented through a token-level attention mechanism that lets each agent attend to previously generated tokens from all agents, supporting real-time collaboration. The method works by assigning each agent its own sequence of token indices, allowing their outputs to be interleaved in memory. These interleaved tokens are stored in a shared cache accessible to all agents during generation. This design allows efficient attention across reasoning threads without architectural changes to the transformer model. The implementation works both on personal devices and in data centers. On local devices, it effectively uses idle compute by batching multiple agent outputs, even with a batch size of one. In data centers, Group Think allows multiple requests to be processed together, interleaving tokens across agents while maintaining correct attention dynamics. Performance tests demonstrate that Group Think significantly improves latency and output quality. In enumeration tasks, such as listing 100 distinct names, it achieved near-complete results more rapidly than conventional Chain-of-Thought approaches. The acceleration was proportional to the number of thinkers; for example, four thinkers reduced latency by a factor of about four. In divide-and-conquer problems, using the Floyd–Warshall algorithm on a graph of five nodes, four thinkers reduced the completion time to half that of a single agent. Group Think solved code generation challenges in programming tasks more effectively than baseline models. With four or more thinkers, the model produced correct code segments much faster than traditional reasoning models. This research shows that existing LLMs, though not explicitly trained for collaboration, can already demonstrate emergent group reasoning behaviors under the Group Think setup. In experiments, agents naturally diversified their work to avoid redundancy, often dividing tasks by topic or focus area. These findings suggest that Group Think’s efficiency and sophistication could be enhanced further with dedicated training on collaborative data. Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. NikhilNikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Material Science, he is exploring new advancements and creating opportunities to contribute.Nikhilhttps://www.marktechpost.com/author/nikhil0980/Researchers from the National University of Singapore Introduce ‘Thinkless,’ an Adaptive Framework that Reduces Unnecessary Reasoning by up to 90% Using DeGRPONikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces MathCoder-VL and FigCodifier: Advancing Multimodal Mathematical Reasoning with Vision-to-Code AlignmentNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces PARSCALE: A Parallel Computation Method for Efficient and Scalable Language Model DeploymentNikhilhttps://www.marktechpost.com/author/nikhil0980/Google AI Releases Standalone NotebookLM Mobile App with Offline Audio and Seamless Source Integration #this #paper #introduces #group #think
    WWW.MARKTECHPOST.COM
    This AI Paper Introduces Group Think: A Token-Level Multi-Agent Reasoning Paradigm for Faster and Collaborative LLM Inference
    A prominent area of exploration involves enabling large language models (LLMs) to function collaboratively. Multi-agent systems powered by LLMs are now being examined for their potential to coordinate challenging problems by splitting tasks and working simultaneously. This direction has gained attention due to its potential to increase efficiency and reduce latency in real-time applications. A common issue in collaborative LLM systems is agents’ sequential, turn-based communication. In such systems, each agent must wait for others to complete their reasoning steps before proceeding. This slows down processing, especially in situations demanding rapid responses. Moreover, agents often duplicate efforts or generate inconsistent outputs, as they cannot see the evolving thoughts of their peers during generation. This latency and redundancy reduce the practicality of deploying multi-agent LLMs, particularly when time and computation are constrained, such as edge devices. Most current solutions have relied on sequential or independently parallel sampling techniques to improve reasoning. Methods like Chain-of-Thought prompting help models to solve problems in a structured way but often come with increased inference time. Approaches such as Tree-of-Thoughts and Graph-of-Thoughts expand on this by branching reasoning paths. However, these approaches still do not allow for real-time mutual adaptation among agents. Multi-agent setups have explored collaborative methods, but mostly through alternating message exchanges, which again introduces delays. Some advanced systems propose complex dynamic scheduling or role-based configurations, which are not optimized for efficient inference. Research from MediaTek Research introduced a new method called Group Think. This approach enables multiple reasoning agents within a single LLM to operate concurrently, observing each other’s partial outputs at the token level. Each reasoning thread adapts to the evolving thoughts of the others mid-generation. This mechanism reduces duplication and enables agents to shift direction if another thread is better positioned to continue a specific line of reasoning. Group Think is implemented through a token-level attention mechanism that lets each agent attend to previously generated tokens from all agents, supporting real-time collaboration. The method works by assigning each agent its own sequence of token indices, allowing their outputs to be interleaved in memory. These interleaved tokens are stored in a shared cache accessible to all agents during generation. This design allows efficient attention across reasoning threads without architectural changes to the transformer model. The implementation works both on personal devices and in data centers. On local devices, it effectively uses idle compute by batching multiple agent outputs, even with a batch size of one. In data centers, Group Think allows multiple requests to be processed together, interleaving tokens across agents while maintaining correct attention dynamics. Performance tests demonstrate that Group Think significantly improves latency and output quality. In enumeration tasks, such as listing 100 distinct names, it achieved near-complete results more rapidly than conventional Chain-of-Thought approaches. The acceleration was proportional to the number of thinkers; for example, four thinkers reduced latency by a factor of about four. In divide-and-conquer problems, using the Floyd–Warshall algorithm on a graph of five nodes, four thinkers reduced the completion time to half that of a single agent. Group Think solved code generation challenges in programming tasks more effectively than baseline models. With four or more thinkers, the model produced correct code segments much faster than traditional reasoning models. This research shows that existing LLMs, though not explicitly trained for collaboration, can already demonstrate emergent group reasoning behaviors under the Group Think setup. In experiments, agents naturally diversified their work to avoid redundancy, often dividing tasks by topic or focus area. These findings suggest that Group Think’s efficiency and sophistication could be enhanced further with dedicated training on collaborative data. Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. NikhilNikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Material Science, he is exploring new advancements and creating opportunities to contribute.Nikhilhttps://www.marktechpost.com/author/nikhil0980/Researchers from the National University of Singapore Introduce ‘Thinkless,’ an Adaptive Framework that Reduces Unnecessary Reasoning by up to 90% Using DeGRPONikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces MathCoder-VL and FigCodifier: Advancing Multimodal Mathematical Reasoning with Vision-to-Code AlignmentNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces PARSCALE (Parallel Scaling): A Parallel Computation Method for Efficient and Scalable Language Model DeploymentNikhilhttps://www.marktechpost.com/author/nikhil0980/Google AI Releases Standalone NotebookLM Mobile App with Offline Audio and Seamless Source Integration
    0 Commentarii 0 Distribuiri 0 previzualizare
  • Researchers from the National University of Singapore Introduce ‘Thinkless,’ an Adaptive Framework that Reduces Unnecessary Reasoning by up to 90% Using DeGRPO

    The effectiveness of language models relies on their ability to simulate human-like step-by-step deduction. However, these reasoning sequences are resource-intensive and can be wasteful for simple questions that do not require elaborate computation. This lack of awareness regarding the complexity of the task is one of the core challenges in these models. They often default to detailed reasoning even for queries that could be answered directly. Such an approach increases token usage, extends response time, and increases system latency and memory usage. As a result, there’s a pressing need to equip language models with a mechanism that allows them to make autonomous decisions about whether to think deeply or respond succinctly.
    Current tools attempting to solve this issue either rely on manually set heuristics or prompt engineering to switch between short and long responses. Some methods use separate models and route questions based on complexity estimates. Still, these external routing systems often lack insight into the target model’s strengths and fail to make optimal decisions. Other techniques fine-tune models with prompt-based cues like “reasoning on/off,” but these rely on static rules rather than dynamic understanding. Despite some improvements, these approaches fail to enable fully autonomous and context-sensitive control within a single model.

    Researchers from the National University of Singapore introduced a new framework called Thinkless, which equips a language model with the ability to dynamically decide between using short or long-form reasoning. The framework is built on reinforcement learning and introduces two special control tokens—<short> for concise answers and <think> for detailed responses. By incorporating a novel algorithm called Decoupled Group Relative Policy Optimization, Thinkless separates the training focus between selecting the reasoning mode and improving the accuracy of the generated response. This design prevents the model from falling into one-dimensional behavior and enables adaptive reasoning tailored to each query.
    The methodology involves two stages: warm-up distillation and reinforcement learning. In the distillation phase, Thinkless is trained using outputs from two expert models—one specializing in short responses and the other in detailed reasoning. This stage helps the model establish a firm link between the control token and the desired reasoning format. The reinforcement learning stage then fine-tunes the model’s ability to decide which reasoning mode to use. DeGRPO decomposes the learning into two separate objectives: one for training the control token and another for refining the response tokens. This approach avoids the gradient imbalances in earlier models, where longer responses would overpower the learning signal, leading to a collapse in reasoning diversity. Thinkless ensures that both <short> and <think> tokens receive balanced updates, promoting stable learning across response types.

    When evaluated, Thinkless significantly reduced long-form reasoning while preserving high accuracy. On the Minerva Algebra benchmark, the model used the <think> token in only 25.88% of cases while achieving 94.59% accuracy. In contrast, conventional reasoning models had to use extended chains of thought much more frequently. On the AIME 2024 dataset, Thinkless reached a 27.33% accuracy rate with 100% usage of the reasoning mode, showing that it could maintain performance when full reasoning was necessary. On the GSM8K dataset, it utilized <think> only 13.31% of the time, yet still achieved 84.18% accuracy. These results reflect the model’s ability to handle simple and complex queries with appropriate reasoning depth, cutting down on unnecessary token generation by as much as 90% in some tasks.
    Overall, this study from the National University of Singapore researchers presents a compelling solution to the inefficiencies of uniform reasoning in large language models. By introducing a mechanism that enables models to judge task complexity and adjust their inference strategy accordingly, Thinkless optimizes both accuracy and efficiency. The method balances depth of reasoning and response precision without relying on fixed rules, offering a data-driven approach to more intelligent language model behavior.

    Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter.
    NikhilNikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Material Science, he is exploring new advancements and creating opportunities to contribute.Nikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces MathCoder-VL and FigCodifier: Advancing Multimodal Mathematical Reasoning with Vision-to-Code AlignmentNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces PARSCALE: A Parallel Computation Method for Efficient and Scalable Language Model DeploymentNikhilhttps://www.marktechpost.com/author/nikhil0980/Google AI Releases Standalone NotebookLM Mobile App with Offline Audio and Seamless Source IntegrationNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper from Microsoft Introduces a DiskANN-Integrated System: A Cost-Effective and Low-Latency Vector Search Using Azure Cosmos DB
    #researchers #national #university #singapore #introduce
    Researchers from the National University of Singapore Introduce ‘Thinkless,’ an Adaptive Framework that Reduces Unnecessary Reasoning by up to 90% Using DeGRPO
    The effectiveness of language models relies on their ability to simulate human-like step-by-step deduction. However, these reasoning sequences are resource-intensive and can be wasteful for simple questions that do not require elaborate computation. This lack of awareness regarding the complexity of the task is one of the core challenges in these models. They often default to detailed reasoning even for queries that could be answered directly. Such an approach increases token usage, extends response time, and increases system latency and memory usage. As a result, there’s a pressing need to equip language models with a mechanism that allows them to make autonomous decisions about whether to think deeply or respond succinctly. Current tools attempting to solve this issue either rely on manually set heuristics or prompt engineering to switch between short and long responses. Some methods use separate models and route questions based on complexity estimates. Still, these external routing systems often lack insight into the target model’s strengths and fail to make optimal decisions. Other techniques fine-tune models with prompt-based cues like “reasoning on/off,” but these rely on static rules rather than dynamic understanding. Despite some improvements, these approaches fail to enable fully autonomous and context-sensitive control within a single model. Researchers from the National University of Singapore introduced a new framework called Thinkless, which equips a language model with the ability to dynamically decide between using short or long-form reasoning. The framework is built on reinforcement learning and introduces two special control tokens—<short> for concise answers and <think> for detailed responses. By incorporating a novel algorithm called Decoupled Group Relative Policy Optimization, Thinkless separates the training focus between selecting the reasoning mode and improving the accuracy of the generated response. This design prevents the model from falling into one-dimensional behavior and enables adaptive reasoning tailored to each query. The methodology involves two stages: warm-up distillation and reinforcement learning. In the distillation phase, Thinkless is trained using outputs from two expert models—one specializing in short responses and the other in detailed reasoning. This stage helps the model establish a firm link between the control token and the desired reasoning format. The reinforcement learning stage then fine-tunes the model’s ability to decide which reasoning mode to use. DeGRPO decomposes the learning into two separate objectives: one for training the control token and another for refining the response tokens. This approach avoids the gradient imbalances in earlier models, where longer responses would overpower the learning signal, leading to a collapse in reasoning diversity. Thinkless ensures that both <short> and <think> tokens receive balanced updates, promoting stable learning across response types. When evaluated, Thinkless significantly reduced long-form reasoning while preserving high accuracy. On the Minerva Algebra benchmark, the model used the <think> token in only 25.88% of cases while achieving 94.59% accuracy. In contrast, conventional reasoning models had to use extended chains of thought much more frequently. On the AIME 2024 dataset, Thinkless reached a 27.33% accuracy rate with 100% usage of the reasoning mode, showing that it could maintain performance when full reasoning was necessary. On the GSM8K dataset, it utilized <think> only 13.31% of the time, yet still achieved 84.18% accuracy. These results reflect the model’s ability to handle simple and complex queries with appropriate reasoning depth, cutting down on unnecessary token generation by as much as 90% in some tasks. Overall, this study from the National University of Singapore researchers presents a compelling solution to the inefficiencies of uniform reasoning in large language models. By introducing a mechanism that enables models to judge task complexity and adjust their inference strategy accordingly, Thinkless optimizes both accuracy and efficiency. The method balances depth of reasoning and response precision without relying on fixed rules, offering a data-driven approach to more intelligent language model behavior. Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. NikhilNikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Material Science, he is exploring new advancements and creating opportunities to contribute.Nikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces MathCoder-VL and FigCodifier: Advancing Multimodal Mathematical Reasoning with Vision-to-Code AlignmentNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces PARSCALE: A Parallel Computation Method for Efficient and Scalable Language Model DeploymentNikhilhttps://www.marktechpost.com/author/nikhil0980/Google AI Releases Standalone NotebookLM Mobile App with Offline Audio and Seamless Source IntegrationNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper from Microsoft Introduces a DiskANN-Integrated System: A Cost-Effective and Low-Latency Vector Search Using Azure Cosmos DB #researchers #national #university #singapore #introduce
    WWW.MARKTECHPOST.COM
    Researchers from the National University of Singapore Introduce ‘Thinkless,’ an Adaptive Framework that Reduces Unnecessary Reasoning by up to 90% Using DeGRPO
    The effectiveness of language models relies on their ability to simulate human-like step-by-step deduction. However, these reasoning sequences are resource-intensive and can be wasteful for simple questions that do not require elaborate computation. This lack of awareness regarding the complexity of the task is one of the core challenges in these models. They often default to detailed reasoning even for queries that could be answered directly. Such an approach increases token usage, extends response time, and increases system latency and memory usage. As a result, there’s a pressing need to equip language models with a mechanism that allows them to make autonomous decisions about whether to think deeply or respond succinctly. Current tools attempting to solve this issue either rely on manually set heuristics or prompt engineering to switch between short and long responses. Some methods use separate models and route questions based on complexity estimates. Still, these external routing systems often lack insight into the target model’s strengths and fail to make optimal decisions. Other techniques fine-tune models with prompt-based cues like “reasoning on/off,” but these rely on static rules rather than dynamic understanding. Despite some improvements, these approaches fail to enable fully autonomous and context-sensitive control within a single model. Researchers from the National University of Singapore introduced a new framework called Thinkless, which equips a language model with the ability to dynamically decide between using short or long-form reasoning. The framework is built on reinforcement learning and introduces two special control tokens—<short> for concise answers and <think> for detailed responses. By incorporating a novel algorithm called Decoupled Group Relative Policy Optimization (DeGRPO), Thinkless separates the training focus between selecting the reasoning mode and improving the accuracy of the generated response. This design prevents the model from falling into one-dimensional behavior and enables adaptive reasoning tailored to each query. The methodology involves two stages: warm-up distillation and reinforcement learning. In the distillation phase, Thinkless is trained using outputs from two expert models—one specializing in short responses and the other in detailed reasoning. This stage helps the model establish a firm link between the control token and the desired reasoning format. The reinforcement learning stage then fine-tunes the model’s ability to decide which reasoning mode to use. DeGRPO decomposes the learning into two separate objectives: one for training the control token and another for refining the response tokens. This approach avoids the gradient imbalances in earlier models, where longer responses would overpower the learning signal, leading to a collapse in reasoning diversity. Thinkless ensures that both <short> and <think> tokens receive balanced updates, promoting stable learning across response types. When evaluated, Thinkless significantly reduced long-form reasoning while preserving high accuracy. On the Minerva Algebra benchmark, the model used the <think> token in only 25.88% of cases while achieving 94.59% accuracy. In contrast, conventional reasoning models had to use extended chains of thought much more frequently. On the AIME 2024 dataset, Thinkless reached a 27.33% accuracy rate with 100% usage of the reasoning mode, showing that it could maintain performance when full reasoning was necessary. On the GSM8K dataset, it utilized <think> only 13.31% of the time, yet still achieved 84.18% accuracy. These results reflect the model’s ability to handle simple and complex queries with appropriate reasoning depth, cutting down on unnecessary token generation by as much as 90% in some tasks. Overall, this study from the National University of Singapore researchers presents a compelling solution to the inefficiencies of uniform reasoning in large language models. By introducing a mechanism that enables models to judge task complexity and adjust their inference strategy accordingly, Thinkless optimizes both accuracy and efficiency. The method balances depth of reasoning and response precision without relying on fixed rules, offering a data-driven approach to more intelligent language model behavior. Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. NikhilNikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Material Science, he is exploring new advancements and creating opportunities to contribute.Nikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces MathCoder-VL and FigCodifier: Advancing Multimodal Mathematical Reasoning with Vision-to-Code AlignmentNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces PARSCALE (Parallel Scaling): A Parallel Computation Method for Efficient and Scalable Language Model DeploymentNikhilhttps://www.marktechpost.com/author/nikhil0980/Google AI Releases Standalone NotebookLM Mobile App with Offline Audio and Seamless Source IntegrationNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper from Microsoft Introduces a DiskANN-Integrated System: A Cost-Effective and Low-Latency Vector Search Using Azure Cosmos DB
    0 Commentarii 0 Distribuiri 0 previzualizare
  • This AI Paper Introduces MathCoder-VL and FigCodifier: Advancing Multimodal Mathematical Reasoning with Vision-to-Code Alignment

    Multimodal mathematical reasoning enables machines to solve problems involving textual information and visual components like diagrams and figures. This requires combining language understanding and visual interpretation to make sense of complex mathematical contexts. Such capabilities are vital in education, automated tutoring, and document analysis, where problems are often presented with a blend of text and images.
    A major obstacle in this area is the lack of high-quality, precise alignment between math images and their textual or symbolic representations. Most datasets used to train large multimodal models are derived from image captions in natural settings, which often miss the detailed elements essential for mathematical accuracy. This creates problems for models that rely on these data sources, making them unreliable when dealing with geometry, figures, or technical diagrams. A model’s performance in mathematical reasoning depends heavily on its ability to correctly interpret and link these visual details with mathematical expressions or instructions.

    In the past, some approaches tried to address this by either enhancing the visual encoders or using manually crafted datasets. However, these methods tend to produce low image diversity, relying on hand-coded or template-based generation, which limits their applicability. Some efforts, like Math-LLaVA and MAVIS, developed synthetic datasets and used templates or predefined categories. Still, they could not dynamically create a wide variety of math-related visuals. This shortfall restricts the learning scope of models and leaves them struggling with more complex or less structured mathematical problems.
    Researchers from the Multimedia Laboratory at The Chinese University of Hong Kong and CPII under InnoHK introduced a novel approach called MathCoder-VL. This method combines a vision-to-code model named FigCodifier and a synthetic data engine. They constructed the ImgCode-8.6M dataset using a model-in-the-loop strategy, which allowed them to build the largest image-code dataset to date iteratively. Further, they developed MM-MathInstruct-3M, a multimodal instruction dataset enriched with newly synthesized images. The MathCoder-VL model is trained in two stages: mid-training on ImgCode-8.6M to improve visual-text alignment and fine-tuning on MM-MathInstruct-3M to strengthen reasoning abilities.

    The FigCodifier model works by translating mathematical figures into code that can recreate those figures exactly. This code-image pairing ensures strict alignment and accuracy, unlike caption-based datasets. The process begins with 119K image-code pairs from DaTikZ and expands through iterative training using images collected from textbooks, K12 datasets, and arXiv papers. The final dataset includes 8.6 million code-image pairs and covers various mathematical topics. FigCodifier also supports Python-based rendering, which adds variety to image generation. The system filters low-quality data by checking code validity and removing redundant or unhelpful visuals, resulting in 4.3M high-quality TikZ and 4.3M Python-based pairs.
    Performance evaluations show that MathCoder-VL outperforms multiple open-source models. The 8B version achieved 73.6% accuracy on the MathVista Geometry Problem Solving subset, surpassing GPT-4o and Claude 3.5 Sonnet by 8.9% and 9.2%, respectively. It also scored 26.1% on MATH-Vision and 46.5% on MathVerse. In Chinese-language benchmarks, it achieved 51.2% on GAOKAO-MM. On the We-Math benchmark, it solved two-step problems at 58.6%, outperforming GPT-4o’s 58.1%. Its performance on three-step problems reached 52.1%, again exceeding GPT-4o’s 43.6%. Compared to its base model InternVL2-8B, it showed gains of 6.1% on MATH-Vision and 11.6% on MathVista.

    This work clearly defines the problem of insufficient visual-textual alignment in multimodal math reasoning and provides a scalable and innovative solution. The introduction of FigCodifier and synthetic datasets allows models to learn from accurate, diverse visuals paired with exact code, significantly boosting their reasoning abilities. MathCoder-VL represents a practical advancement in this field, demonstrating how thoughtful model design and high-quality data can overcome longstanding limitations in mathematical AI.

    Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter.
    NikhilNikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Material Science, he is exploring new advancements and creating opportunities to contribute.Nikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces PARSCALE: A Parallel Computation Method for Efficient and Scalable Language Model DeploymentNikhilhttps://www.marktechpost.com/author/nikhil0980/Google AI Releases Standalone NotebookLM Mobile App with Offline Audio and Seamless Source IntegrationNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper from Microsoft Introduces a DiskANN-Integrated System: A Cost-Effective and Low-Latency Vector Search Using Azure Cosmos DBNikhilhttps://www.marktechpost.com/author/nikhil0980/LLMs Struggle to Act on What They Know: Google DeepMind Researchers Use Reinforcement Learning Fine-Tuning to Bridge the Knowing-Doing Gap
    #this #paper #introduces #mathcodervl #figcodifier
    This AI Paper Introduces MathCoder-VL and FigCodifier: Advancing Multimodal Mathematical Reasoning with Vision-to-Code Alignment
    Multimodal mathematical reasoning enables machines to solve problems involving textual information and visual components like diagrams and figures. This requires combining language understanding and visual interpretation to make sense of complex mathematical contexts. Such capabilities are vital in education, automated tutoring, and document analysis, where problems are often presented with a blend of text and images. A major obstacle in this area is the lack of high-quality, precise alignment between math images and their textual or symbolic representations. Most datasets used to train large multimodal models are derived from image captions in natural settings, which often miss the detailed elements essential for mathematical accuracy. This creates problems for models that rely on these data sources, making them unreliable when dealing with geometry, figures, or technical diagrams. A model’s performance in mathematical reasoning depends heavily on its ability to correctly interpret and link these visual details with mathematical expressions or instructions. In the past, some approaches tried to address this by either enhancing the visual encoders or using manually crafted datasets. However, these methods tend to produce low image diversity, relying on hand-coded or template-based generation, which limits their applicability. Some efforts, like Math-LLaVA and MAVIS, developed synthetic datasets and used templates or predefined categories. Still, they could not dynamically create a wide variety of math-related visuals. This shortfall restricts the learning scope of models and leaves them struggling with more complex or less structured mathematical problems. Researchers from the Multimedia Laboratory at The Chinese University of Hong Kong and CPII under InnoHK introduced a novel approach called MathCoder-VL. This method combines a vision-to-code model named FigCodifier and a synthetic data engine. They constructed the ImgCode-8.6M dataset using a model-in-the-loop strategy, which allowed them to build the largest image-code dataset to date iteratively. Further, they developed MM-MathInstruct-3M, a multimodal instruction dataset enriched with newly synthesized images. The MathCoder-VL model is trained in two stages: mid-training on ImgCode-8.6M to improve visual-text alignment and fine-tuning on MM-MathInstruct-3M to strengthen reasoning abilities. The FigCodifier model works by translating mathematical figures into code that can recreate those figures exactly. This code-image pairing ensures strict alignment and accuracy, unlike caption-based datasets. The process begins with 119K image-code pairs from DaTikZ and expands through iterative training using images collected from textbooks, K12 datasets, and arXiv papers. The final dataset includes 8.6 million code-image pairs and covers various mathematical topics. FigCodifier also supports Python-based rendering, which adds variety to image generation. The system filters low-quality data by checking code validity and removing redundant or unhelpful visuals, resulting in 4.3M high-quality TikZ and 4.3M Python-based pairs. Performance evaluations show that MathCoder-VL outperforms multiple open-source models. The 8B version achieved 73.6% accuracy on the MathVista Geometry Problem Solving subset, surpassing GPT-4o and Claude 3.5 Sonnet by 8.9% and 9.2%, respectively. It also scored 26.1% on MATH-Vision and 46.5% on MathVerse. In Chinese-language benchmarks, it achieved 51.2% on GAOKAO-MM. On the We-Math benchmark, it solved two-step problems at 58.6%, outperforming GPT-4o’s 58.1%. Its performance on three-step problems reached 52.1%, again exceeding GPT-4o’s 43.6%. Compared to its base model InternVL2-8B, it showed gains of 6.1% on MATH-Vision and 11.6% on MathVista. This work clearly defines the problem of insufficient visual-textual alignment in multimodal math reasoning and provides a scalable and innovative solution. The introduction of FigCodifier and synthetic datasets allows models to learn from accurate, diverse visuals paired with exact code, significantly boosting their reasoning abilities. MathCoder-VL represents a practical advancement in this field, demonstrating how thoughtful model design and high-quality data can overcome longstanding limitations in mathematical AI. Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. NikhilNikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Material Science, he is exploring new advancements and creating opportunities to contribute.Nikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces PARSCALE: A Parallel Computation Method for Efficient and Scalable Language Model DeploymentNikhilhttps://www.marktechpost.com/author/nikhil0980/Google AI Releases Standalone NotebookLM Mobile App with Offline Audio and Seamless Source IntegrationNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper from Microsoft Introduces a DiskANN-Integrated System: A Cost-Effective and Low-Latency Vector Search Using Azure Cosmos DBNikhilhttps://www.marktechpost.com/author/nikhil0980/LLMs Struggle to Act on What They Know: Google DeepMind Researchers Use Reinforcement Learning Fine-Tuning to Bridge the Knowing-Doing Gap #this #paper #introduces #mathcodervl #figcodifier
    WWW.MARKTECHPOST.COM
    This AI Paper Introduces MathCoder-VL and FigCodifier: Advancing Multimodal Mathematical Reasoning with Vision-to-Code Alignment
    Multimodal mathematical reasoning enables machines to solve problems involving textual information and visual components like diagrams and figures. This requires combining language understanding and visual interpretation to make sense of complex mathematical contexts. Such capabilities are vital in education, automated tutoring, and document analysis, where problems are often presented with a blend of text and images. A major obstacle in this area is the lack of high-quality, precise alignment between math images and their textual or symbolic representations. Most datasets used to train large multimodal models are derived from image captions in natural settings, which often miss the detailed elements essential for mathematical accuracy. This creates problems for models that rely on these data sources, making them unreliable when dealing with geometry, figures, or technical diagrams. A model’s performance in mathematical reasoning depends heavily on its ability to correctly interpret and link these visual details with mathematical expressions or instructions. In the past, some approaches tried to address this by either enhancing the visual encoders or using manually crafted datasets. However, these methods tend to produce low image diversity, relying on hand-coded or template-based generation, which limits their applicability. Some efforts, like Math-LLaVA and MAVIS, developed synthetic datasets and used templates or predefined categories. Still, they could not dynamically create a wide variety of math-related visuals. This shortfall restricts the learning scope of models and leaves them struggling with more complex or less structured mathematical problems. Researchers from the Multimedia Laboratory at The Chinese University of Hong Kong and CPII under InnoHK introduced a novel approach called MathCoder-VL. This method combines a vision-to-code model named FigCodifier and a synthetic data engine. They constructed the ImgCode-8.6M dataset using a model-in-the-loop strategy, which allowed them to build the largest image-code dataset to date iteratively. Further, they developed MM-MathInstruct-3M, a multimodal instruction dataset enriched with newly synthesized images. The MathCoder-VL model is trained in two stages: mid-training on ImgCode-8.6M to improve visual-text alignment and fine-tuning on MM-MathInstruct-3M to strengthen reasoning abilities. The FigCodifier model works by translating mathematical figures into code that can recreate those figures exactly. This code-image pairing ensures strict alignment and accuracy, unlike caption-based datasets. The process begins with 119K image-code pairs from DaTikZ and expands through iterative training using images collected from textbooks, K12 datasets, and arXiv papers. The final dataset includes 8.6 million code-image pairs and covers various mathematical topics. FigCodifier also supports Python-based rendering, which adds variety to image generation. The system filters low-quality data by checking code validity and removing redundant or unhelpful visuals, resulting in 4.3M high-quality TikZ and 4.3M Python-based pairs. Performance evaluations show that MathCoder-VL outperforms multiple open-source models. The 8B version achieved 73.6% accuracy on the MathVista Geometry Problem Solving subset, surpassing GPT-4o and Claude 3.5 Sonnet by 8.9% and 9.2%, respectively. It also scored 26.1% on MATH-Vision and 46.5% on MathVerse. In Chinese-language benchmarks, it achieved 51.2% on GAOKAO-MM. On the We-Math benchmark, it solved two-step problems at 58.6%, outperforming GPT-4o’s 58.1%. Its performance on three-step problems reached 52.1%, again exceeding GPT-4o’s 43.6%. Compared to its base model InternVL2-8B, it showed gains of 6.1% on MATH-Vision and 11.6% on MathVista. This work clearly defines the problem of insufficient visual-textual alignment in multimodal math reasoning and provides a scalable and innovative solution. The introduction of FigCodifier and synthetic datasets allows models to learn from accurate, diverse visuals paired with exact code, significantly boosting their reasoning abilities. MathCoder-VL represents a practical advancement in this field, demonstrating how thoughtful model design and high-quality data can overcome longstanding limitations in mathematical AI. Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. NikhilNikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Material Science, he is exploring new advancements and creating opportunities to contribute.Nikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces PARSCALE (Parallel Scaling): A Parallel Computation Method for Efficient and Scalable Language Model DeploymentNikhilhttps://www.marktechpost.com/author/nikhil0980/Google AI Releases Standalone NotebookLM Mobile App with Offline Audio and Seamless Source IntegrationNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper from Microsoft Introduces a DiskANN-Integrated System: A Cost-Effective and Low-Latency Vector Search Using Azure Cosmos DBNikhilhttps://www.marktechpost.com/author/nikhil0980/LLMs Struggle to Act on What They Know: Google DeepMind Researchers Use Reinforcement Learning Fine-Tuning to Bridge the Knowing-Doing Gap
    0 Commentarii 0 Distribuiri 0 previzualizare
  • Google AI Releases Standalone NotebookLM Mobile App with Offline Audio and Seamless Source Integration

    Google has officially rolled out the NotebookLM mobile app, extending its AI-powered research assistant to Android devices. The app aims to bring personalized learning and content synthesis directly to users’ pockets by introducing new features that combine mobility, context-awareness, and interactive capabilities.
    NotebookLM, which first launched in 2023 as a web-based experimental tool, is designed to help users organize and interact with their own documents and media using a fine-tuned version of Google’s Gemini 1.5 Pro model. With the new mobile release, Google is positioning NotebookLM as more than just a passive summarization tool — it’s evolving into an on-the-go research companion.
    Expanding Contextual AI to Mobile
    One of the core capabilities of NotebookLM is its source-grounded AI assistance. Users can upload documents—such as PDFs, Google Docs, or links—and the assistant will generate summaries, answer questions, and synthesize information based strictly on the materials provided. This grounded approach ensures transparency and relevance, reducing the hallucination risks common in general-purpose LLMs.
    The new Android app brings this paradigm to mobile in a more seamless, integrated way. Users can now add sources directly from their mobile device, whether browsing a web page, reading a PDF, or watching a YouTube video. A simple tap on the “Share” button in any app lets users send content to NotebookLM, which will ingest it as a new source. This makes it easier for users to build and maintain their research libraries without needing to manually upload files later.
    Offline and Background Audio Overviews
    NotebookLM’s Audio Overviews were introduced earlier this year to make content more accessible through auditory summaries. The mobile app enhances this feature significantly by enabling offline downloads and background playback.
    Users can now listen to these overviews while commuting, exercising, or multitasking — even without an internet connection. This supports a broader range of learning contexts, particularly for users who prefer audio-first experiences or have limited screen time. Importantly, audio playback continues in the background, aligning with typical media consumption behaviors on mobile.
    Interactive Audio Conversations with AI Hosts
    In a move to blend passive listening with interactivity, NotebookLM’s mobile app introduces Interactive Audio Overviews. These allow users to engage in real-time conversations with the AI “hosts” that narrate their overviews. By tapping “Join”, users can interrupt the playback to ask questions, request clarifications, or guide the summary in a new direction — a feature that adds conversational depth to the learning experience.
    This marks a departure from static, pre-recorded summaries and moves toward adaptive audio interfaces powered by LLMs. While many AI tools offer Q&A-style interactions, combining them with ambient audio and voice-first navigation is relatively new and signals Google’s intent to create more naturalistic AI assistants.
    Conclusion
    With the NotebookLM mobile app, Google is bridging the gap between context-rich AI research tools and real-world mobile usability. Features like offline Audio Overviews, universal source sharing, and interactive playback demonstrate a clear step toward personalized, context-aware AI that’s accessible anywhere.
    As AI tools continue to evolve beyond chatbots and simple prompts, NotebookLM stands out by focusing on what users already know and want to learn — organizing their own knowledge sources rather than relying solely on web-scale data. The mobile release pushes this philosophy forward, turning everyday moments into opportunities for exploration and deeper understanding.

    Check out the APP on both Android and iOS. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter.
    NikhilNikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Material Science, he is exploring new advancements and creating opportunities to contribute.Nikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper from Microsoft Introduces a DiskANN-Integrated System: A Cost-Effective and Low-Latency Vector Search Using Azure Cosmos DBNikhilhttps://www.marktechpost.com/author/nikhil0980/LLMs Struggle to Act on What They Know: Google DeepMind Researchers Use Reinforcement Learning Fine-Tuning to Bridge the Knowing-Doing GapNikhilhttps://www.marktechpost.com/author/nikhil0980/LLMs Struggle with Real Conversations: Microsoft and Salesforce Researchers Reveal a 39% Performance Drop in Multi-Turn Underspecified TasksNikhilhttps://www.marktechpost.com/author/nikhil0980/Salesforce AI Releases BLIP3-o: A Fully Open-Source Unified Multimodal Model Built with CLIP Embeddings and Flow Matching for Image Understanding and Generation

    Build GenAI you can trust. ⭐️ Parlant is your open-source engine for controlled, compliant, and purposeful AI conversations — Star Parlant on GitHub!
    #google #releases #standalone #notebooklm #mobile
    Google AI Releases Standalone NotebookLM Mobile App with Offline Audio and Seamless Source Integration
    Google has officially rolled out the NotebookLM mobile app, extending its AI-powered research assistant to Android devices. The app aims to bring personalized learning and content synthesis directly to users’ pockets by introducing new features that combine mobility, context-awareness, and interactive capabilities. NotebookLM, which first launched in 2023 as a web-based experimental tool, is designed to help users organize and interact with their own documents and media using a fine-tuned version of Google’s Gemini 1.5 Pro model. With the new mobile release, Google is positioning NotebookLM as more than just a passive summarization tool — it’s evolving into an on-the-go research companion. Expanding Contextual AI to Mobile One of the core capabilities of NotebookLM is its source-grounded AI assistance. Users can upload documents—such as PDFs, Google Docs, or links—and the assistant will generate summaries, answer questions, and synthesize information based strictly on the materials provided. This grounded approach ensures transparency and relevance, reducing the hallucination risks common in general-purpose LLMs. The new Android app brings this paradigm to mobile in a more seamless, integrated way. Users can now add sources directly from their mobile device, whether browsing a web page, reading a PDF, or watching a YouTube video. A simple tap on the “Share” button in any app lets users send content to NotebookLM, which will ingest it as a new source. This makes it easier for users to build and maintain their research libraries without needing to manually upload files later. Offline and Background Audio Overviews NotebookLM’s Audio Overviews were introduced earlier this year to make content more accessible through auditory summaries. The mobile app enhances this feature significantly by enabling offline downloads and background playback. Users can now listen to these overviews while commuting, exercising, or multitasking — even without an internet connection. This supports a broader range of learning contexts, particularly for users who prefer audio-first experiences or have limited screen time. Importantly, audio playback continues in the background, aligning with typical media consumption behaviors on mobile. Interactive Audio Conversations with AI Hosts In a move to blend passive listening with interactivity, NotebookLM’s mobile app introduces Interactive Audio Overviews. These allow users to engage in real-time conversations with the AI “hosts” that narrate their overviews. By tapping “Join”, users can interrupt the playback to ask questions, request clarifications, or guide the summary in a new direction — a feature that adds conversational depth to the learning experience. This marks a departure from static, pre-recorded summaries and moves toward adaptive audio interfaces powered by LLMs. While many AI tools offer Q&A-style interactions, combining them with ambient audio and voice-first navigation is relatively new and signals Google’s intent to create more naturalistic AI assistants. Conclusion With the NotebookLM mobile app, Google is bridging the gap between context-rich AI research tools and real-world mobile usability. Features like offline Audio Overviews, universal source sharing, and interactive playback demonstrate a clear step toward personalized, context-aware AI that’s accessible anywhere. As AI tools continue to evolve beyond chatbots and simple prompts, NotebookLM stands out by focusing on what users already know and want to learn — organizing their own knowledge sources rather than relying solely on web-scale data. The mobile release pushes this philosophy forward, turning everyday moments into opportunities for exploration and deeper understanding. Check out the APP on both Android and iOS. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. NikhilNikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Material Science, he is exploring new advancements and creating opportunities to contribute.Nikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper from Microsoft Introduces a DiskANN-Integrated System: A Cost-Effective and Low-Latency Vector Search Using Azure Cosmos DBNikhilhttps://www.marktechpost.com/author/nikhil0980/LLMs Struggle to Act on What They Know: Google DeepMind Researchers Use Reinforcement Learning Fine-Tuning to Bridge the Knowing-Doing GapNikhilhttps://www.marktechpost.com/author/nikhil0980/LLMs Struggle with Real Conversations: Microsoft and Salesforce Researchers Reveal a 39% Performance Drop in Multi-Turn Underspecified TasksNikhilhttps://www.marktechpost.com/author/nikhil0980/Salesforce AI Releases BLIP3-o: A Fully Open-Source Unified Multimodal Model Built with CLIP Embeddings and Flow Matching for Image Understanding and Generation 🚨 Build GenAI you can trust. ⭐️ Parlant is your open-source engine for controlled, compliant, and purposeful AI conversations — Star Parlant on GitHub! #google #releases #standalone #notebooklm #mobile
    WWW.MARKTECHPOST.COM
    Google AI Releases Standalone NotebookLM Mobile App with Offline Audio and Seamless Source Integration
    Google has officially rolled out the NotebookLM mobile app, extending its AI-powered research assistant to Android devices. The app aims to bring personalized learning and content synthesis directly to users’ pockets by introducing new features that combine mobility, context-awareness, and interactive capabilities. NotebookLM, which first launched in 2023 as a web-based experimental tool, is designed to help users organize and interact with their own documents and media using a fine-tuned version of Google’s Gemini 1.5 Pro model. With the new mobile release, Google is positioning NotebookLM as more than just a passive summarization tool — it’s evolving into an on-the-go research companion. Expanding Contextual AI to Mobile One of the core capabilities of NotebookLM is its source-grounded AI assistance. Users can upload documents—such as PDFs, Google Docs, or links—and the assistant will generate summaries, answer questions, and synthesize information based strictly on the materials provided. This grounded approach ensures transparency and relevance, reducing the hallucination risks common in general-purpose LLMs. The new Android app brings this paradigm to mobile in a more seamless, integrated way. Users can now add sources directly from their mobile device, whether browsing a web page, reading a PDF, or watching a YouTube video. A simple tap on the “Share” button in any app lets users send content to NotebookLM, which will ingest it as a new source. This makes it easier for users to build and maintain their research libraries without needing to manually upload files later. Offline and Background Audio Overviews NotebookLM’s Audio Overviews were introduced earlier this year to make content more accessible through auditory summaries. The mobile app enhances this feature significantly by enabling offline downloads and background playback. Users can now listen to these overviews while commuting, exercising, or multitasking — even without an internet connection. This supports a broader range of learning contexts, particularly for users who prefer audio-first experiences or have limited screen time. Importantly, audio playback continues in the background, aligning with typical media consumption behaviors on mobile. Interactive Audio Conversations with AI Hosts In a move to blend passive listening with interactivity, NotebookLM’s mobile app introduces Interactive Audio Overviews. These allow users to engage in real-time conversations with the AI “hosts” that narrate their overviews. By tapping “Join”, users can interrupt the playback to ask questions, request clarifications, or guide the summary in a new direction — a feature that adds conversational depth to the learning experience. This marks a departure from static, pre-recorded summaries and moves toward adaptive audio interfaces powered by LLMs. While many AI tools offer Q&A-style interactions, combining them with ambient audio and voice-first navigation is relatively new and signals Google’s intent to create more naturalistic AI assistants. Conclusion With the NotebookLM mobile app, Google is bridging the gap between context-rich AI research tools and real-world mobile usability. Features like offline Audio Overviews, universal source sharing, and interactive playback demonstrate a clear step toward personalized, context-aware AI that’s accessible anywhere. As AI tools continue to evolve beyond chatbots and simple prompts, NotebookLM stands out by focusing on what users already know and want to learn — organizing their own knowledge sources rather than relying solely on web-scale data. The mobile release pushes this philosophy forward, turning everyday moments into opportunities for exploration and deeper understanding. Check out the APP on both Android and iOS. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. NikhilNikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Material Science, he is exploring new advancements and creating opportunities to contribute.Nikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper from Microsoft Introduces a DiskANN-Integrated System: A Cost-Effective and Low-Latency Vector Search Using Azure Cosmos DBNikhilhttps://www.marktechpost.com/author/nikhil0980/LLMs Struggle to Act on What They Know: Google DeepMind Researchers Use Reinforcement Learning Fine-Tuning to Bridge the Knowing-Doing GapNikhilhttps://www.marktechpost.com/author/nikhil0980/LLMs Struggle with Real Conversations: Microsoft and Salesforce Researchers Reveal a 39% Performance Drop in Multi-Turn Underspecified TasksNikhilhttps://www.marktechpost.com/author/nikhil0980/Salesforce AI Releases BLIP3-o: A Fully Open-Source Unified Multimodal Model Built with CLIP Embeddings and Flow Matching for Image Understanding and Generation 🚨 Build GenAI you can trust. ⭐️ Parlant is your open-source engine for controlled, compliant, and purposeful AI conversations — Star Parlant on GitHub! (Promoted)
    0 Commentarii 0 Distribuiri 0 previzualizare
Sponsorizeaza Paginile
CGShares https://cgshares.com