EPFL Researchers Unveil FG2 at CVPR: A New AI Model That Slashes Localization Errors by 28% for Autonomous Vehicles in GPS-Denied Environments
Navigating the dense urban canyons of cities like San Francisco or New York can be a nightmare for GPS systems. The towering skyscrapers block and reflect satellite signals, leading to location errors of tens of meters. For you and me, that might mean a missed turn. But for an autonomous vehicle or a delivery robot, that level of imprecision is the difference between a successful mission and a costly failure. These machines require pinpoint accuracy to operate safely and efficiently. Addressing this critical challenge, researchers from the École Polytechnique Fédérale de Lausanne (EPFL) in Switzerland have introduced a groundbreaking new method for visual localization, presented at CVPR 2025.
Their new paper, “FG2: Fine-Grained Cross-View Localization by Fine-Grained Feature Matching,” presents a novel AI model that significantly enhances the ability of a ground-level system, like an autonomous car, to determine its exact position and orientation using only a camera and a corresponding aerial (or satellite) image. The new approach has demonstrated a remarkable 28% reduction in mean localization error compared to the previous state-of-the-art on a challenging public dataset.
Key Takeaways:
Superior Accuracy: The FG2 model reduces the average localization error by a significant 28% on the VIGOR cross-area test set, a challenging benchmark for this task.
Human-like Intuition: Instead of relying on abstract descriptors, the model mimics human reasoning by matching fine-grained, semantically consistent features—like curbs, crosswalks, and buildings—between a ground-level photo and an aerial map.
Enhanced Interpretability: The method allows researchers to “see” what the AI is “thinking” by visualizing exactly which features in the ground and aerial images are being matched, a major step forward from previous “black box” models.
Weakly Supervised Learning: Remarkably, the model learns these complex and consistent feature matches without any direct labels for correspondences. It achieves this using only the final camera pose as a supervisory signal.
Challenge: Seeing the World from Two Different Angles
The core problem of cross-view localization is the dramatic difference in perspective between a street-level camera and an overhead satellite view. A building facade seen from the ground looks completely different from its rooftop signature in an aerial image. Existing methods have struggled with this. Some create a general “descriptor” for the entire scene, but this is an abstract approach that doesn’t mirror how humans naturally localize themselves by spotting specific landmarks. Other methods transform the ground image into a Bird’s-Eye-View (BEV) but are often limited to the ground plane, ignoring crucial vertical structures like buildings.
FG2: Matching Fine-Grained Features
The EPFL team’s FG2 method introduces a more intuitive and effective process. It aligns two sets of points: one generated from the ground-level image and another sampled from the aerial map.
Here’s a breakdown of their innovative pipeline:
Mapping to 3D: The process begins by taking the features from the ground-level image and lifting them into a 3D point cloud centered around the camera. This creates a 3D representation of the immediate environment.
Smart Pooling to BEV: This is where the magic happens. Instead of simply flattening the 3D data, the model learns to intelligently select the most important features along the vertical (height) dimension for each point. It essentially asks, “For this spot on the map, is the ground-level road marking more important, or is the edge of that building’s roof the better landmark?” This selection process is crucial, as it allows the model to correctly associate features like building facades with their corresponding rooftops in the aerial view.
Feature Matching and Pose Estimation: Once both the ground and aerial views are represented as 2D point planes with rich feature descriptors, the model computes the similarity between them. It then samples a sparse set of the most confident matches and uses a classic geometric algorithm called Procrustes alignment to calculate the precise 3-DoF (x, y, and yaw) pose, as illustrated in the sketch after this list.
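To make that final step concrete, here is a minimal, self-contained sketch of how a weighted 2D Procrustes (Kabsch) solve can recover a 3-DoF pose from matched bird's-eye-view points. Function and variable names are illustrative; this is not code from the FG2 release.

```python
# Hedged sketch: recovering a 3-DoF (x, y, yaw) pose from matched 2D BEV points
# with a weighted orthogonal Procrustes (Kabsch) solve. Names are illustrative.
import numpy as np

def procrustes_pose_2d(ground_pts, aerial_pts, weights):
    """ground_pts, aerial_pts: (N, 2) matched BEV coordinates; weights: (N,) match confidences."""
    w = weights / weights.sum()
    mu_g = (w[:, None] * ground_pts).sum(axis=0)      # weighted centroids
    mu_a = (w[:, None] * aerial_pts).sum(axis=0)
    Gc, Ac = ground_pts - mu_g, aerial_pts - mu_a
    H = (w[:, None] * Gc).T @ Ac                       # 2x2 weighted cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))             # guard against reflections
    R = Vt.T @ np.diag([1.0, d]) @ U.T                 # rotation mapping ground points to aerial points
    t = mu_a - R @ mu_g                                # translation in aerial-map coordinates
    yaw = np.arctan2(R[1, 0], R[0, 0])
    return t, yaw

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    g = rng.normal(size=(50, 2))
    theta, shift = 0.3, np.array([5.0, -2.0])
    R_true = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
    a = g @ R_true.T + shift                           # synthetic "aerial" points
    t, yaw = procrustes_pose_2d(g, a, np.ones(len(g)))
    print(t, yaw)                                      # ~[5, -2], ~0.3
```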
Unprecedented Performance and Interpretability
The results speak for themselves. On the challenging VIGOR dataset, which includes images from different cities in its cross-area test, FG2 reduced the mean localization error by 28% compared to the previous best method. It also demonstrated superior generalization capabilities on the KITTI dataset, a staple in autonomous driving research.
Perhaps more importantly, the FG2 model offers a new level of transparency. By visualizing the matched points, the researchers showed that the model learns semantically consistent correspondences without being explicitly told to. For example, the system correctly matches zebra crossings, road markings, and even building facades in the ground view to their corresponding locations on the aerial map. This interpretability is extremely valuable for building trust in safety-critical autonomous systems.
“A Clearer Path” for Autonomous Navigation
The FG2 method represents a significant leap forward in fine-grained visual localization. By developing a model that intelligently selects and matches features in a way that mirrors human intuition, the EPFL researchers have not only shattered previous accuracy records but also made the decision-making process of the AI more interpretable. This work paves the way for more robust and reliable navigation systems for autonomous vehicles, drones, and robots, bringing us one step closer to a future where machines can confidently navigate our world, even when GPS fails them.
Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.
OThink-R1: A Dual-Mode Reasoning Framework to Cut Redundant Computation in LLMs
The Inefficiency of Static Chain-of-Thought Reasoning in LRMs
Recent large reasoning models (LRMs) achieve top performance by using detailed chain-of-thought (CoT) reasoning to solve complex tasks. However, many simple tasks they handle could be solved by smaller models with fewer tokens, making such elaborate reasoning unnecessary. This echoes human thinking, where we use fast, intuitive responses for easy problems and slower, analytical thinking for complex ones. While LRMs mimic slow, logical reasoning, they generate significantly longer outputs, thereby increasing computational cost. Current methods for reducing reasoning steps lack flexibility, limiting models to a single fixed reasoning style. There is a growing need for adaptive reasoning that adjusts effort according to task difficulty.
Limitations of Existing Training-Based and Training-Free Approaches
Recent research on improving reasoning efficiency in LRMs can be categorized into two main areas: training-based and training-free methods. Training strategies often use reinforcement learning or fine-tuning to limit token usage or adjust reasoning depth, but they tend to follow fixed patterns without flexibility. Training-free approaches utilize prompt engineering or pattern detection to shorten outputs during inference; however, they also lack adaptability. More recent work focuses on variable-length reasoning, where models adjust reasoning depth based on task complexity. Others study “overthinking,” where models over-reason unnecessarily. However, few methods enable dynamic switching between quick and thorough reasoning—something this paper addresses directly.
Introducing OThink-R1: Dynamic Fast/Slow Reasoning Framework
Researchers from Zhejiang University and OPPO have developed OThink-R1, a new approach that enables LRMs to switch intelligently between fast and slow thinking, much like humans do. By analyzing reasoning patterns, they identified which steps are essential and which are redundant. With help from another model acting as a judge, they trained LRMs to adapt their reasoning style based on task complexity. Their method reduces unnecessary reasoning by over 23% without losing accuracy. Using a dual-reference loss function and curated fine-tuning datasets, OThink-R1 outperforms previous models in both efficiency and performance on various math and question-answering tasks.
System Architecture: Reasoning Pruning and Dual-Reference Optimization
The OThink-R1 framework helps LRMs dynamically switch between fast and slow thinking. First, it identifies when LRMs include unnecessary reasoning, like overexplaining or double-checking, versus when detailed steps are truly essential. Using this, it builds a curated training dataset by pruning redundant reasoning and retaining valuable logic. Then, during fine-tuning, a special loss function balances both reasoning styles. This dual-reference loss compares the model’s outputs with both fast and slow thinking variants, encouraging flexibility. As a result, OThink-R1 can adaptively choose the most efficient reasoning path for each problem while preserving accuracy and logical depth.
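As a rough illustration of what such a dual-reference objective can look like, the sketch below combines a standard cross-entropy term with KL penalties toward both a fast-thinking and a slow-thinking reference model. The weighting scheme, tensor shapes, and names are assumptions for illustration, not the exact loss defined in the OThink-R1 paper.

```python
# Hedged sketch of a dual-reference loss: the fine-tuned policy is regularized
# toward BOTH a fast-thinking and a slow-thinking reference model, on top of the
# usual next-token cross-entropy. Weights alpha/beta are illustrative assumptions.
import torch
import torch.nn.functional as F

def dual_reference_loss(policy_logits, fast_ref_logits, slow_ref_logits,
                        targets, alpha=0.5, beta=0.5, pad_id=-100):
    """All logits: (batch, seq_len, vocab); targets: (batch, seq_len) token ids."""
    ce = F.cross_entropy(policy_logits.transpose(1, 2), targets, ignore_index=pad_id)

    log_p = F.log_softmax(policy_logits, dim=-1)
    kl_fast = F.kl_div(log_p, F.softmax(fast_ref_logits, dim=-1), reduction="batchmean")
    kl_slow = F.kl_div(log_p, F.softmax(slow_ref_logits, dim=-1), reduction="batchmean")

    return ce + alpha * kl_fast + beta * kl_slow
```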
Empirical Evaluation and Comparative Performance
The OThink-R1 model was tested on simpler QA and math tasks to evaluate its ability to switch between fast and slow reasoning. Using datasets like OpenBookQA, CommonsenseQA, ASDIV, and GSM8K, the model demonstrated strong performance, generating fewer tokens while maintaining or improving accuracy. Compared to baselines such as NoThinking and DualFormer, OThink-R1 struck a better balance between efficiency and effectiveness. Ablation studies confirmed the importance of pruning, KL constraints, and LLM-Judge in achieving optimal results. A case study illustrated that unnecessary reasoning can lead to overthinking and reduced accuracy, highlighting OThink-R1’s strength in adaptive reasoning.
Conclusion: Towards Scalable and Efficient Hybrid Reasoning Systems
In conclusion, OThink-R1 is a large reasoning model that adaptively switches between fast and slow thinking modes to improve both efficiency and performance. It addresses the issue of unnecessarily complex reasoning in large models by analyzing and classifying reasoning steps as either essential or redundant. By pruning the redundant ones while maintaining logical accuracy, OThink-R1 reduces unnecessary computation. It also introduces a dual-reference KL-divergence loss to strengthen hybrid reasoning. Tested on math and QA tasks, it cuts down reasoning redundancy by 23% without sacrificing accuracy, showing promise for building more adaptive, scalable, and efficient AI reasoning systems in the future.
Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.
ByteDance Researchers Introduce DetailFlow: A 1D Coarse-to-Fine Autoregressive Framework for Faster, Token-Efficient Image Generation
Autoregressive image generation has been shaped by advances in sequential modeling, originally seen in natural language processing. This field focuses on generating images one token at a time, similar to how sentences are constructed in language models. The appeal of this approach lies in its ability to maintain structural coherence across the image while allowing for high levels of control during the generation process. As researchers began to apply these techniques to visual data, they found that structured prediction not only preserved spatial integrity but also supported tasks like image manipulation and multimodal translation effectively.
Despite these benefits, generating high-resolution images remains computationally expensive and slow. A primary issue is the number of tokens needed to represent complex visuals. Raster-scan methods that flatten 2D images into linear sequences require thousands of tokens for detailed images, resulting in long inference times and high memory consumption. Models like Infinity need over 10,000 tokens for a 1024×1024 image. This becomes unsustainable for real-time applications or when scaling to more extensive datasets. Reducing the token burden while preserving or improving output quality has become a pressing challenge.
Efforts to mitigate token inflation have led to innovations like next-scale prediction seen in VAR and FlexVAR. These models create images by predicting progressively finer scales, which imitates the human tendency to sketch rough outlines before adding detail. However, they still rely on hundreds of tokens—680 in the case of VAR and FlexVAR for 256×256 images. Moreover, approaches like TiTok and FlexTok use 1D tokenization to compress spatial redundancy, but they often fail to scale efficiently. For example, FlexTok’s gFID increases from 1.9 at 32 tokens to 2.5 at 256 tokens, highlighting a degradation in output quality as the token count grows.
Researchers from ByteDance introduced DetailFlow, a 1D autoregressive image generation framework. This method arranges token sequences from global to fine detail using a process called next-detail prediction. Unlike traditional 2D raster-scan or scale-based techniques, DetailFlow employs a 1D tokenizer trained on progressively degraded images. This design allows the model to prioritize foundational image structures before refining visual details. By mapping tokens directly to resolution levels, DetailFlow significantly reduces token requirements, enabling images to be generated in a semantically ordered, coarse-to-fine manner.
The mechanism in DetailFlow centers on a 1D latent space where each token contributes incrementally more detail. Earlier tokens encode global features, while later tokens refine specific visual aspects. To train this, the researchers created a resolution mapping function that links token count to target resolution. During training, the model is exposed to images of varying quality levels and learns to predict progressively higher-resolution outputs as more tokens are introduced. It also implements parallel token prediction by grouping sequences and predicting entire sets at once. Since parallel prediction can introduce sampling errors, a self-correction mechanism was integrated. This system perturbs certain tokens during training and teaches subsequent tokens to compensate, ensuring that final images maintain structural and visual integrity.
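To make the coarse-to-fine, grouped decoding idea concrete, the sketch below shows a next-detail generation loop in which tokens are sampled one group at a time and the number of decoded tokens determines the resolution the decoder renders. The token-to-resolution mapping, group size, codebook size, and stand-in model are illustrative assumptions, not the released DetailFlow implementation.

```python
# Hedged sketch of coarse-to-fine "next-detail" decoding with grouped (parallel)
# token prediction. All constants and the stand-in model are illustrative.
import torch

VOCAB = 4096  # assumed codebook size

def tokens_to_resolution(num_tokens, base_res=128, base_tokens=32):
    # Illustrative mapping: output resolution grows with the square root of the token count.
    return int(base_res * (num_tokens / base_tokens) ** 0.5)

@torch.no_grad()
def generate(ar_model, total_tokens=128, group_size=16):
    tokens = torch.empty(1, 0, dtype=torch.long)
    for _ in range(total_tokens // group_size):
        # Parallel prediction: one forward pass emits logits for a whole group of tokens.
        logits = ar_model(tokens)                                  # (1, group_size, VOCAB)
        next_group = torch.distributions.Categorical(logits=logits).sample()
        tokens = torch.cat([tokens, next_group], dim=1)
        # During training, tokens inside a group are perturbed and later tokens learn
        # to compensate, which is what makes this parallel sampling self-correcting.
    return tokens, tokens_to_resolution(tokens.shape[1])

dummy_model = lambda toks: torch.randn(1, 16, VOCAB)               # stand-in for the trained model
toks, res = generate(dummy_model)
print(toks.shape, res)                                             # torch.Size([1, 128]) 256
```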
The results from the experiments on the ImageNet 256×256 benchmark were noteworthy. DetailFlow achieved a gFID score of 2.96 using only 128 tokens, outperforming VAR at 3.3 and FlexVAR at 3.05, both of which used 680 tokens. Even more impressive, DetailFlow-64 reached a gFID of 2.62 using 512 tokens. In terms of speed, it delivered nearly double the inference rate of VAR and FlexVAR. A further ablation study confirmed that the self-correction training and semantic ordering of tokens substantially improved output quality. For example, enabling self-correction dropped the gFID from 4.11 to 3.68 in one setting. These metrics demonstrate both higher quality and faster generation compared to established models.
By focusing on semantic structure and reducing redundancy, DetailFlow presents a viable solution to long-standing issues in autoregressive image generation. The method’s coarse-to-fine approach, efficient parallel decoding, and ability to self-correct highlight how architectural innovations can address performance and scalability limitations. Through their structured use of 1D tokens, the researchers from ByteDance have demonstrated a model that maintains high image fidelity while significantly reducing computational load, making it a valuable addition to image synthesis research.
Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter.
Alibaba Qwen Team Releases Qwen3-Embedding and Qwen3-Reranker Series – Redefining Multilingual Embedding and Ranking Standards
Text embedding and reranking are foundational to modern information retrieval systems, powering applications such as semantic search, recommendation systems, and retrieval-augmented generation (RAG). However, current approaches often face key challenges—particularly in achieving both high multilingual fidelity and task adaptability without relying on proprietary APIs. Existing models frequently fall short in scenarios requiring nuanced semantic understanding across multiple languages or domain-specific tasks like code retrieval and instruction following. Moreover, most open-source models either lack scale or flexibility, while commercial APIs remain costly and closed.
Qwen3-Embedding and Qwen3-Reranker: A New Standard for Open-Source Embedding
Alibaba’s Qwen Team has unveiled the Qwen3-Embedding and Qwen3-Reranker Series—models that set a new benchmark in multilingual text embedding and relevance ranking. Built on the Qwen3 foundation models, the series includes variants in 0.6B, 4B, and 8B parameter sizes and supports a wide range of languages (119 in total), making it one of the most versatile and performant open-source offerings to date. These models are now open-sourced under the Apache 2.0 license on Hugging Face, GitHub, and ModelScope, and are also accessible via Alibaba Cloud APIs.
These models are optimized for use cases such as semantic retrieval, classification, RAG, sentiment analysis, and code search—providing a strong alternative to existing solutions like Gemini Embedding and OpenAI’s embedding APIs.
Technical Architecture
Qwen3-Embedding models adopt a dense transformer-based architecture with causal attention, producing embeddings by extracting the hidden state corresponding to the [EOS] token. Instruction-awareness is a key feature: input queries are formatted as {instruction} {query}<|endoftext|>, enabling task-conditioned embeddings. The reranker models are trained with a binary classification format, judging document-query relevance in an instruction-guided manner using a token likelihood-based scoring function.
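For illustration, here is a minimal sketch of instruction-conditioned embedding with last-token pooling in the spirit of the setup described above, using Hugging Face transformers. The checkpoint name, prompt handling, and pooling details are assumptions based on this description, not the official usage code.

```python
# Minimal sketch (assumed setup): instruction-prefixed queries, left padding, and pooling
# of the final hidden state as the embedding. Checkpoint name is an assumption.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

model_id = "Qwen/Qwen3-Embedding-0.6B"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")
model = AutoModel.from_pretrained(model_id).eval()

def embed(texts, instruction=None):
    # Follow the template described above: "{instruction} {query}<|endoftext|>"
    if instruction:
        texts = [f"{instruction} {t}" for t in texts]
    texts = [t + "<|endoftext|>" for t in texts]
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state      # [batch, seq, dim]
    # With left padding, the last position holds the final (EOS) token of every sequence.
    return F.normalize(hidden[:, -1, :], dim=-1)

queries = embed(["how to reverse a list in python"],
                instruction="Given a web search query, retrieve relevant passages that answer it.")
docs = embed(["my_list[::-1] returns a reversed copy of a Python list."])
print(queries @ docs.T)  # cosine similarity, since both sides are unit-normalized
```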
The models are trained using a robust multi-stage training pipeline:
Large-scale weak supervision: 150M synthetic training pairs generated using Qwen3-32B, covering retrieval, classification, STS, and bitext mining across languages and tasks.
Supervised fine-tuning: 12M high-quality data pairs, selected using cosine similarity (>0.7), are used to fine-tune performance in downstream applications.
Model merging: Spherical linear interpolation (SLERP) of multiple fine-tuned checkpoints ensures robustness and generalization (a minimal sketch of the idea follows this list).
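As a rough illustration of the merging step, the snippet below interpolates two checkpoints' weights along the sphere rather than along a straight line. The per-tensor treatment and function names are assumptions made for the example; this is not the Qwen team's actual merging pipeline.

```python
# Sketch of spherical linear interpolation (SLERP) between two checkpoints' parameters.
import torch

def slerp(a: torch.Tensor, b: torch.Tensor, t: float, eps: float = 1e-8) -> torch.Tensor:
    a_flat, b_flat = a.flatten().float(), b.flatten().float()
    a_n = a_flat / (a_flat.norm() + eps)
    b_n = b_flat / (b_flat.norm() + eps)
    omega = torch.acos(torch.clamp(a_n @ b_n, -1.0, 1.0))  # angle between weight vectors
    if omega < eps:                                          # nearly parallel: fall back to LERP
        return (1 - t) * a + t * b
    so = torch.sin(omega)
    out = (torch.sin((1 - t) * omega) / so) * a_flat + (torch.sin(t * omega) / so) * b_flat
    return out.view_as(a).to(a.dtype)

def merge_state_dicts(sd_a: dict, sd_b: dict, t: float = 0.5) -> dict:
    # Interpolate every parameter tensor of two fine-tuned checkpoints.
    return {k: slerp(sd_a[k], sd_b[k], t) for k in sd_a}
```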
This synthetic data generation pipeline enables control over data quality, language diversity, task difficulty, and more—resulting in a high degree of coverage and relevance in low-resource settings.
Performance Benchmarks and Insights
The Qwen3-Embedding and Qwen3-Reranker series demonstrate strong empirical performance across several multilingual benchmarks.
On MMTEB (216 tasks across 250+ languages), Qwen3-Embedding-8B achieves a mean task score of 70.58, surpassing Gemini and the GTE-Qwen2 series.
On MTEB (English v2): Qwen3-Embedding-8B reaches 75.22, outperforming other open models including NV-Embed-v2 and GritLM-7B.
On MTEB-Code: Qwen3-Embedding-8B leads with 80.68, excelling in applications like code retrieval and Stack Overflow QA.
For reranking:
Qwen3-Reranker-0.6B already outperforms Jina and BGE rerankers.
Qwen3-Reranker-8B achieves 81.22 on MTEB-Code and 72.94 on MMTEB-R, marking state-of-the-art performance.
Ablation studies confirm the necessity of each training stage. Removing synthetic pretraining or model merging led to significant performance drops (up to 6 points on MMTEB), emphasizing their contributions.
Conclusion
Alibaba’s Qwen3-Embedding and Qwen3-Reranker Series present a robust, open, and scalable solution to multilingual and instruction-aware semantic representation. With strong empirical results across MTEB, MMTEB, and MTEB-Code, these models bridge the gap between proprietary APIs and open-source accessibility. Their thoughtful training design—leveraging high-quality synthetic data, instruction-tuning, and model merging—positions them as ideal candidates for enterprise applications in search, retrieval, and RAG pipelines. By open-sourcing these models, the Qwen team not only pushes the boundaries of language understanding but also empowers the broader community to innovate on top of a solid foundation.
Check out the Paper, Technical details, Qwen3-Embedding and Qwen3-Reranker. All credit for this research goes to the researchers of this project.
Top Artificial Intelligence AI Books to Read in 2025
Artificial Intelligence (AI) has been making significant strides over the past few years, with the emergence of Large Language Models (LLMs) marking a major milestone in its growth. With such widespread adoption, feeling left out of this revolution is not uncommon. One way an individual can stay updated with the latest trends is by reading books on various facets of AI. Following are the top AI books one should read in 2025.
Deep Learning (Adaptive Computation and Machine Learning series)
This book covers a wide range of deep learning topics along with their mathematical and conceptual background. It also provides information on the different deep learning techniques used in various industrial applications.
Python: Advanced Guide to Artificial Intelligence
This book helps individuals familiarize themselves with the most popular machine learning (ML) algorithms and delves into the details of deep learning, covering topics like CNN, RNN, etc. It provides a comprehensive understanding of advanced AI concepts while focusing on their practical implementation using Python.
Machine Learning (in Python and R) for Dummies
This book explains the fundamentals of machine learning by providing practical examples using Python and R. It is a beginner-friendly guide and a good starting point for people new to this field.
Machine Learning for Beginners
Given the pace with which machine learning systems are growing, this book provides a good base for anyone shifting to this field. The author talks about machine intelligence’s historical background and provides beginners with information on how advanced algorithms work.
Artificial Intelligence: A Modern Approach
This is a well-acclaimed book that covers the breadth of AI topics, including problem-solving, knowledge representation, machine learning, and natural language processing. It provides theoretical explanations along with practical examples, making it an excellent starting point for anyone looking to dive into the world of AI.
Human Compatible: Artificial Intelligence and the Problem of Control
The book discusses the inevitable conflict between humans and machines, providing important context before we advocate for AI. The author also talks about the possibility of superhuman AI and questions the concepts of human comprehension and machine learning.
The Alignment Problem: Machine Learning and Human Values
This book talks about a concept called “The Alignment Problem,” where the systems we aim to teach don’t perform as expected, and various ethical and existential risks emerge.
Life 3.0: Being Human in the Age of Artificial Intelligence
The author of this book talks about questions like what the future of AI will look like and the possibility of superhuman intelligence becoming our master. He also talks about how we can ensure these systems perform without malfunctioning.
The Coming Wave: Technology, Power, and the Twenty-First Century’s Greatest Dilemma
This book warns about the risks that emerging technologies pose to global order. It covers topics like robotics and large language models and examines the forces that fuel these innovations.
Artificial Intelligence Engines: A Tutorial Introduction to the Mathematics of Deep Learning
“Artificial Intelligence Engines” dives into the mathematical foundations of deep learning. It provides a holistic understanding of deep learning, covering both the historical development of neural networks as well as modern techniques and architecture while focusing on the underlying mathematical concepts.
Neural Networks and Deep Learning
This book covers the fundamental concepts of neural networks and deep learning. It also covers the mathematical aspects of the same, covering topics like linear algebra, probability theory, and numerical computation.
Artificial Intelligence for Humans
This book explains how AI algorithms work using actual numeric calculations. It targets readers without an extensive mathematical background, and each unit is followed by examples in different programming languages.
AI Superpowers: China, Silicon Valley, and the New World Order
The author of this book explains the unexpected consequences of AI development. The book sheds light on the competition between the USA and China over AI innovations through actual events.
Hello World: Being Human in the Age of Algorithms
The author talks about the powers and limitations of the algorithms that are widely used today. The book prepares its readers for the moral uncertainties of a world run by code.
The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World
This book talks about the concept of the “Master algorithm,” which is a single, overarching learning algorithm capable of incorporating different approaches.
Applied Artificial Intelligence: A Handbook for Business Leaders
“Applied Artificial Intelligence” provides a guide for businesses on how to leverage AI to drive innovation and growth. It covers various applications of AI and also explores its ethical considerations. Additionally, it sheds light on building AI teams and talent acquisition.
Superintelligence: Paths, Dangers, Strategies
This book asks questions like whether AI agents will save or destroy us and what happens when machines surpass humans in general intelligence. The author talks about the importance of global collaboration in developing safe AI.
We make a small profit from purchases made via referral/affiliate links attached to each book mentioned in the above list.
If you want to suggest any book that we missed from this list, then please email us at asif@marktechpost.com
Mistral AI Introduces Codestral Embed: A High-Performance Code Embedding Model for Scalable Retrieval and Semantic Understanding
Modern software engineering faces growing challenges in accurately retrieving and understanding code across diverse programming languages and large-scale codebases. Existing embedding models often struggle to capture the deep semantics of code, resulting in poor performance in tasks such as code search, RAG, and semantic analysis. These limitations hinder developers’ ability to efficiently locate relevant code snippets, reuse components, and manage large projects effectively. As software systems grow increasingly complex, there is a pressing need for more effective, language-agnostic representations of code that can power reliable and high-quality retrieval and reasoning across a wide range of development tasks.
Mistral AI has introduced Codestral Embed, a specialized embedding model built specifically for code-related tasks. Designed to handle real-world code more effectively than existing solutions, it enables powerful retrieval capabilities across large codebases. What sets it apart is its flexibility—users can adjust embedding dimensions and precision levels to balance performance with storage efficiency. Even at lower dimensions, such as 256 with int8 precision, Codestral Embed reportedly surpasses top models from competitors like OpenAI, Cohere, and Voyage, offering high retrieval quality at a reduced storage cost.
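To make the storage trade-off concrete, the sketch below truncates an embedding to 256 dimensions and quantizes it to int8 before scoring with a dot product. This is generic post-processing written for illustration; the full dimensionality used here (1536) and the exact quantization scheme are assumptions, not documented parameters of the Codestral Embed API.

```python
# Illustrative compression of an embedding: keep the first 256 dimensions, re-normalize,
# and quantize symmetrically to int8. At 256 int8 values per vector vs. 1536 fp32 values,
# storage shrinks roughly 24x. Assumed scheme, not Mistral's implementation.
import numpy as np

def compress(embedding: np.ndarray, dim: int = 256):
    truncated = embedding[:dim]
    truncated = truncated / (np.linalg.norm(truncated) + 1e-12)
    scale = np.abs(truncated).max() / 127.0          # symmetric scale for the int8 range
    return np.round(truncated / scale).astype(np.int8), float(scale)

def approx_cosine(q, q_scale, d, d_scale):
    # Dot product of dequantized vectors; both were unit-normalized before quantization.
    return float((q.astype(np.float32) @ d.astype(np.float32)) * q_scale * d_scale)

full_a = np.random.randn(1536).astype(np.float32)    # assumed full dimensionality
full_b = np.random.randn(1536).astype(np.float32)
qa, sa = compress(full_a)
qb, sb = compress(full_b)
print("approx cosine:", approx_cosine(qa, sa, qb, sb))
```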
Beyond basic retrieval, Codestral Embed supports a wide range of developer-focused applications. These include code completion, explanation, editing, semantic search, and duplicate detection. The model can also help organize and analyze repositories by clustering code based on functionality or structure, eliminating the need for manual supervision. This makes it particularly useful for tasks like understanding architectural patterns, categorizing code, or supporting automated documentation, ultimately helping developers work more efficiently with large and complex codebases.
Codestral Embed is tailored for understanding and retrieving code efficiently, especially in large-scale development environments. It powers retrieval-augmented generation by quickly fetching relevant context for tasks like code completion, editing, and explanation—ideal for use in coding assistants and agent-based tools. Developers can also perform semantic code searches using natural language or code queries to find relevant snippets. Its ability to detect similar or duplicated code helps with reuse, policy enforcement, and cleaning up redundancy. Additionally, it can cluster code by functionality or structure, making it useful for repository analysis, spotting architectural patterns, and enhancing documentation workflows.
Codestral Embed is a specialized embedding model designed to enhance code retrieval and semantic analysis tasks. It surpasses existing models, such as OpenAI’s and Cohere’s, in benchmarks like SWE-Bench Lite and CodeSearchNet. The model offers customizable embedding dimensions and precision levels, allowing users to effectively balance performance and storage needs. Key applications include retrieval-augmented generation, semantic code search, duplicate detection, and code clustering. Available via API at $0.15 per million tokens, with a 50% discount for batch processing, Codestral Embed supports various output formats and dimensions, catering to diverse development workflows.
In conclusion, Codestral Embed offers customizable embedding dimensions and precisions, enabling developers to strike a balance between performance and storage efficiency. Benchmark evaluations indicate that Codestral Embed surpasses existing models like OpenAI’s and Cohere’s in various code-related tasks, including retrieval-augmented generation and semantic code search. Its applications span from identifying duplicate code segments to facilitating semantic clustering for code analytics. Available through Mistral’s API, Codestral Embed provides a flexible and efficient solution for developers seeking advanced code understanding capabilities.
Check out the Technical details. All credit for this research goes to the researchers of this project.
Enigmata’s Multi-Stage and Mix-Training Reinforcement Learning Recipe Drives Breakthrough Performance in LLM Puzzle Reasoning
Large Reasoning Models (LRMs), trained from LLMs using reinforcement learning, have demonstrated strong performance in complex reasoning tasks, including mathematics, STEM, and coding. However, existing LRMs face challenges in completing various puzzle tasks that require purely logical reasoning skills, which are easy and obvious for humans. Current methods targeting puzzles focus only on designing benchmarks for evaluation, lacking the training methods and resources for modern LLMs to tackle this challenge. Current puzzle datasets lack diversity and scalability, covering limited puzzle types with little control over generation or difficulty. Moreover, due to the success of the “LLM+RLVR” paradigm, it has become crucial to obtain large, diverse, and challenging sets of verifiable puzzle prompts for training agents.
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a key method for improving models’ reasoning capabilities, removing the need for reward models by directly assigning rewards based on objectively verifiable answers. Puzzles are particularly well-suited for RLVR, yet most prior RLVR research has overlooked their potential for delivering effective reward signals. In puzzle reasoning for LLMs, existing benchmarks evaluate different types of reasoning, including abstract, deductive, and compositional reasoning. A few benchmarks support scalable generation and difficulty control but lack puzzle diversity. Moreover, efforts to improve LLMs’ puzzle-solving abilities mainly fall into two categories: tool integration and RLVR.
Researchers from ByteDance Seed, Fudan University, Tsinghua University, Nanjing University, and Shanghai Jiao Tong University have proposed Enigmata, the first comprehensive toolkit designed for improving LLMs with puzzle reasoning skills. It contains 36 tasks across seven categories, each featuring a generator that produces unlimited examples with controllable difficulty and a rule-based verifier for automatic evaluation. The researchers further developed Enigmata-Eval as a rigorous benchmark and created optimized multi-task RLVR strategies. Puzzle data from Enigmata enhances SoTA performance on advanced math and STEM reasoning tasks like AIME, BeyondAIME, and GPQA when trained on larger models like Seed1.5-Thinking. This shows the generalization benefits of Enigmata.
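The generator-plus-verifier pattern can be illustrated with a toy task. The arithmetic puzzle, difficulty knob, and binary reward below are stand-ins chosen for the example; they are not one of Enigmata's 36 actual tasks or its code.

```python
# Toy illustration of Enigmata's design: a generator with a controllable difficulty
# parameter and a rule-based verifier that yields a verifiable (0/1) reward for RLVR.
import random

def generate(difficulty: int, seed: int | None = None) -> dict:
    """Produce an arithmetic puzzle whose length grows with `difficulty`."""
    rng = random.Random(seed)
    nums = [rng.randint(1, 9) for _ in range(difficulty + 2)]
    ops = [rng.choice("+-*") for _ in range(len(nums) - 1)]
    expr = str(nums[0])
    for op, n in zip(ops, nums[1:]):
        expr += f" {op} {n}"
    return {"prompt": f"Compute: {expr} = ?", "answer": str(eval(expr))}

def verify(puzzle: dict, model_output: str) -> float:
    """Rule-based verifier: reward 1.0 for an exact answer match, else 0.0."""
    return 1.0 if model_output.strip() == puzzle["answer"] else 0.0

p = generate(difficulty=3, seed=0)
print(p["prompt"], "-> reward:", verify(p, p["answer"]))
```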
The Enigmata-Data comprises 36 puzzle tasks organized into 7 primary categories: Crypto, Arithmetic, Logic, Grid, Graph, Search, and Sequential Puzzle, making it the only dataset with multiple task categories that offers scalability, automatic verification, and public availability. The data construction follows a three-phase pipeline: Tasks Collection and Design, Auto-Generator and Verifier Development, and Sliding Difficulty Control. Moreover, Enigmata-Eval is developed by systematically sampling from the broader dataset, aiming to extract 50 instances per difficulty level for each task. The final evaluation set contains 4,758 puzzle instances rather than the theoretical maximum of 5,400, due to inherent constraints, where some tasks generate fewer instances per difficulty level.
With 32B parameters, the proposed model outperforms most public models on Enigmata-Eval, showing the effectiveness of the dataset and training recipe. It stands out on the challenging ARC-AGI benchmark, surpassing strong reasoning models such as Gemini 2.5 Pro, o3-mini, and o1. Qwen2.5-32B-Enigmata excels in structured reasoning categories, leading in Crypto, Arithmetic, and Logic tasks, which suggests effective development of rule-based reasoning capabilities. The model also shows competitive performance on search tasks that require strategic exploration and planning. Overall, Crypto and Arithmetic tasks yield the highest accuracy, while spatial and sequential tasks remain more difficult.
In this paper, researchers introduced Enigmata, a comprehensive suite for equipping LLMs with advanced puzzle reasoning that integrates seamlessly with RL using verifiable rule-based rewards. The trained Enigmata-Model shows superior performance and robust generalization skills through RLVR training. Experiments reveal that when applied to larger models such as Seed1.5-Thinking (20B/200B parameters), synthetic puzzle data brings additional gains in other domains, including mathematics and STEM reasoning, beyond state-of-the-art models. Enigmata provides a solid foundation for the research community to advance reasoning model development, offering a unified framework that effectively bridges logical puzzle-solving with broader reasoning capabilities in LLMs.
Check out the Paper, GitHub Page and Project Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter.
Sajjad Ansari is a final year undergraduate from IIT Kharagpur. As a tech enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.
The Legal Accountability of AI-Generated Deepfakes in Election Misinformation
How Deepfakes Are Created
Generative AI models enable the creation of highly realistic fake media. Most deepfakes today are produced by training deep neural networks on real images, video or audio of a target person. The two predominant AI architectures are generative adversarial networks (GANs) and autoencoders. A GAN consists of a generator network that produces synthetic images and a discriminator network that tries to distinguish fakes from real data. Through iterative training, the generator learns to produce outputs that increasingly fool the discriminator¹. Autoencoder-based tools similarly learn to encode a target face and then decode it onto a source video. In practice, deepfake creators use accessible software: open-source tools like DeepFaceLab and FaceSwap dominate video face-swapping (one estimate suggests DeepFaceLab was used for over 95% of known deepfake videos)². Voice-cloning tools (often built on similar AI principles) can mimic a person’s speech from minutes of audio. Commercial platforms like Synthesia allow text-to-video avatars (turning typed scripts into lifelike “spokespeople”), which have already been misused in disinformation campaigns³. Even mobile apps (e.g. FaceApp, Zao) let users do basic face swaps in minutes⁴. In short, advances in GANs and related models make deepfakes cheaper and easier to generate than ever.
Diagram of a generative adversarial network (GAN): A generator network creates fake images from random input and a discriminator network distinguishes fakes from real examples. Over time the generator improves until its outputs “fool” the discriminator⁵.
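For readers who want to see this adversarial loop in code, below is a minimal, self-contained PyTorch sketch of the same generator-versus-discriminator game on a toy one-dimensional distribution. It illustrates the training dynamic only; it is not a deepfake pipeline, and the network sizes and hyperparameters are arbitrary choices for the example.

# Toy GAN: the generator learns to mimic samples from N(4, 1.5)
# while the discriminator learns to tell real samples from fakes.
import torch
import torch.nn as nn

torch.manual_seed(0)

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))                 # generator: noise -> sample
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())   # discriminator: sample -> P(real)

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    real = torch.randn(64, 1) * 1.5 + 4.0          # "real data" samples
    noise = torch.randn(64, 8)
    fake = G(noise)

    # Discriminator step: push D(real) toward 1 and D(fake) toward 0.
    opt_d.zero_grad()
    loss_d = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    loss_d.backward()
    opt_d.step()

    # Generator step: push D(G(noise)) toward 1, i.e. fool the discriminator.
    opt_g.zero_grad()
    loss_g = bce(D(G(noise)), torch.ones(64, 1))
    loss_g.backward()
    opt_g.step()

print("mean of generated samples:", G(torch.randn(1000, 8)).mean().item())  # should drift toward 4.0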
During creation, a deepfake algorithm is typically trained on a large dataset of real images or audio from the target. The more varied and high-quality the training data, the more realistic the deepfake. The output often then undergoes post-processing (color adjustments, lip-syncing refinements) to enhance believability¹. Technical defenses focus on two fronts: detection and authentication. Detection uses AI models to spot inconsistencies (blinking irregularities, audio artifacts or metadata mismatches) that betray a synthetic origin⁵. Authentication embeds markers before dissemination – for example, invisible watermarks or cryptographically signed metadata indicating authenticity⁶. The EU AI Act will soon mandate that major AI content providers embed machine-readable “watermark” signals in synthetic media⁷. However, as GAO notes, detection is an arms race – even a marked deepfake can sometimes evade notice – and labels alone don’t stop false narratives from spreading⁸⁹.
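As a simple illustration of the authentication idea (a toy example, not any specific standard such as C2PA), the sketch below signs raw media bytes with a keyed hash so that any later edit invalidates the tag; the key name and workflow are assumptions made for the example.

# Toy authentication marker: a keyed hash over the raw media bytes.
# Any modification to the bytes after signing makes verification fail.
import hmac
import hashlib

SIGNING_KEY = b"publisher-held-secret-key"   # hypothetical key kept by the original publisher

def sign_media(media_bytes: bytes) -> str:
    return hmac.new(SIGNING_KEY, media_bytes, hashlib.sha256).hexdigest()

def verify_media(media_bytes: bytes, tag: str) -> bool:
    return hmac.compare_digest(sign_media(media_bytes), tag)

original = b"\x00\x01...raw video or audio bytes..."
tag = sign_media(original)                      # published alongside the media as metadata
print(verify_media(original, tag))              # True  -> bytes are unmodified
print(verify_media(original + b"\xff", tag))    # False -> any edit breaks the signature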
Deepfakes in Recent Elections: Examples
Deepfakes and AI-generated imagery have already made headlines in election cycles around the world. In the 2024 U.S. primary season, a digitally altered audio robocall mimicked President Biden’s voice urging Democrats not to vote in the New Hampshire primary. The caller (“Susan Anderson”) was later fined $6 million by the FCC and indicted under existing telemarketing laws¹⁰¹¹. (Importantly, FCC rules on robocalls applied regardless of AI: the perpetrator could have used a voice actor or recording instead.) Also in 2024, former President Trump posted on social media a collage implying that pop singer Taylor Swift endorsed his campaign, using AI-generated images of Swift in “Swifties for Trump” shirts¹². The posts sparked media uproar, though analysts noted the same effect could have been achieved without AI (e.g., by photoshopping text on real images)¹². Similarly, Elon Musk’s X platform carried AI-generated clips, including a parody “Ad” depicting Vice-President Harris’s voice via an AI clone¹³.
Beyond the U.S., deepfake-like content has appeared globally. In Indonesia’s 2024 presidential election, a video surfaced on social media in which a convincingly generated image of the late President Suharto appeared to endorse the candidate of the Golkar Party. Days later, the endorsed candidate (who is Suharto’s son-in-law) won the presidency¹⁴. In Bangladesh, a viral deepfake video superimposed the face of opposition leader Rumeen Farhana onto a bikini-clad body – an incendiary fabrication designed to discredit her in the conservative Muslim-majority society¹⁵. Moldova’s pro-Western President Maia Sandu has been repeatedly targeted by AI-driven disinformation; one deepfake video falsely showed her resigning and endorsing a Russian-friendly party, apparently to sow distrust in the electoral process¹⁶. Even in Taiwan (amidst tensions with China), a TikTok clip circulated that synthetically portrayed a U.S. politician making foreign-policy statements – stoking confusion ahead of Taiwanese elections¹⁷. In Slovakia’s recent campaign, AI-generated audio mimicking the liberal party leader suggested he plotted vote-rigging and beer-price hikes – instantly spreading on social media just days before the election¹⁸. These examples show that deepfakes have touched diverse polities (from Bangladesh and Indonesia to Moldova, Slovakia, India and beyond), often aiming to undermine candidates or confuse voters¹⁵¹⁸.
Notably, many of the most viral “deepfakes” in 2024 were actually circulated as obvious memes or claims, rather than subtle deceptions. Experts observed that outright undetectable AI deepfakes were relatively rare; more common were AI-generated memes plainly shared by partisans, or cheaply doctored “cheapfakes” made with basic editing tools¹³¹⁹. For instance, social media was awash with memes of Kamala Harris in Soviet garb or of Black Americans holding Trump signs¹³, but these were typically used satirically, not meant to be secretly believed. Nonetheless, even unsophisticated fakes can sway opinion: a U.S. study found that false presidential ads (not necessarily AI-made) did change voter attitudes in swing states. In sum, deepfakes are a real and growing phenomenon in election campaigns²⁰²¹ worldwide – a trend taken seriously by voters and regulators alike.
U.S. Legal Framework and Accountability
In the U.S., deepfake creators and distributors of election misinformation face a patchwork of legal tools, but no single comprehensive federal “deepfake law.” Existing laws relevant to disinformation include statutes against impersonating government officials, electioneering rules (such as the Bipartisan Campaign Reform Act, which requires disclaimers on political ads), and targeted statutes like criminal electioneering communications. In some cases ordinary laws have been stretched: the NH robocall case used the Telephone Consumer Protection Act and mail/telemarketing fraud provisions, resulting in the $6M fine and a criminal charge. Similarly, voice impostors can potentially violate laws against “false advertising” or “unlawful corporate communications.” However, these laws were enacted before AI, and litigators have warned they often do not fit neatly. For example, deceptive deepfake claims not tied to a specific victim do not easily fit into defamation or privacy torts. Voter intimidation laws (prohibiting threats or coercion) also leave a gap for non-threatening falsehoods about voting logistics or endorsements.
Recognizing these gaps, some courts and agencies are invoking other theories. The U.S. Department of Justice has recently charged individuals under broad fraud statutes (e.g. for a plot to impersonate an aide to swing votes in 2020), and state attorneys general have considered treating deepfake misinformation as interference with voting rights. Notably, the Federal Election Commission (FEC) is preparing to enforce new rules: in April 2024 it issued an advisory opinion limiting “non-candidate electioneering communications” that use falsified media, effectively requiring that political ads use only real images of the candidate. If finalized, that would make it unlawful for campaigns to pay for ads depicting a candidate saying things they never did. Similarly, the Federal Trade Commission (FTC) and Department of Justice (DOJ) have signaled that purely commercial deepfakes could violate consumer protection or election laws (for example, liability for mass false impersonation or for foreign-funded electioneering).
U.S. Legislation and Proposals
Federal lawmakers have proposed new statutes. The DEEPFAKES Accountability Act (H.R.5586 in the 118th Congress) would, among other things, impose a disclosure requirement: political ads featuring a manipulated media likeness would need clear disclaimers identifying the content as synthetic. It would also increase penalties for producing false election videos or audio intended to influence the vote. While not yet enacted, supporters argue it would provide a uniform rule for all federal and state campaigns. The Brennan Center supports transparency requirements over outright bans, suggesting laws should narrowly target deceptive deepfakes in paid ads or certain categories (e.g. false claims about the time, place, or manner of voting) while carving out parody and news coverage.
At the state level, over 20 states have passed deepfake laws specifically for elections. For example, Florida and California forbid distributing falsified audio/visual media of candidates with intent to deceive voters (though Florida’s law exempts parody). Some states (like Texas) define “deepfake” in statute and allow candidates to sue violators or have their candidacies revoked. These measures have had mixed success: courts have struck down overly broad provisions that acted as prior restraints (e.g. Minnesota’s 2023 law was challenged for threatening injunctions against anyone “reasonably believed” to violate it). Critically, these state laws raise First Amendment issues: political speech is highly protected, so any restriction must be tightly tailored. Already, Texas and Virginia statutes are under legal review, and Elon Musk’s company has sued under California’s law (which requires platforms to label or block deepfakes) as unconstitutional. In practice, most lawsuits have so far centered on defamation or intellectual property (for instance, a celebrity suing over a botched celebrity-deepfake video), rather than election-focused statutes.
Policy Recommendations: Balancing Integrity and Speech
Given the rapidly evolving technology, experts recommend a multi-pronged approach. Most stress transparency and disclosure as core principles. For example, the Brennan Center urges requiring any political communication that uses AI-synthesized images or voice to include a clear label. This could be a digital watermark or a visible disclaimer. Transparency has two advantages: it forces campaigns and platforms to “own” the use of AI, and it alerts audiences to treat the content with skepticism.
Outright bans on all deepfakes would likely violate free speech, but targeted bans on specific harms (e.g. automated phone calls impersonating voters, or videos claiming false polling information) may be defensible. Indeed, Florida already penalizes misuse of recordings in voter suppression. Another recommendation is limited liability: tying penalties to demonstrable intent to mislead, not to the mere act of content creation. Both U.S. federal proposals and EU law generally condition fines on the “appearance of fraud” or deception.
Technical solutions can complement laws. Watermarking original media (as encouraged by the EU AI Act) could deter the reuse of authentic images in doctored fakes. Open tools for deepfake detection – some supported by government research grants – should be deployed by fact-checkers and social platforms. Making detection datasets publicly available (e.g. the MIT OpenDATATEST) helps improve AI models that spot fakes. International cooperation is also urged: cross-border agreements on information-sharing could help trace and halt disinformation campaigns. The G7 and APEC have both recently committed to fighting election interference via AI, which may lead to joint norms or rapid-response teams.
Ultimately, many analysts believe the strongest “cure” is a well-informed public: education campaigns to teach voters to question sensational media, and a robust independent press to debunk falsehoods swiftly. While the law can penalize the worst offenders, awareness and resilience in the electorate are crucial buffers against influence operations. As Georgia Tech’s Sean Parker quipped in 2019, “the real question is not if deepfakes will influence elections, but who will be empowered by the first effective one.” Thus policies should aim to deter malicious use without unduly chilling innovation or satire.
References:
https://www.security.org/resources/deepfake-statistics/
https://www.wired.com/story/synthesia-ai-deepfakes-it-control-riparbelli/
https://www.gao.gov/products/gao-24-107292
https://technologyquotient.freshfields.com/post/102jb19/eu-ai-act-unpacked-8-new-rules-on-deepfakes
https://knightcolumbia.org/blog/we-looked-at-78-election-deepfakes-political-misinformation-is-not-an-ai-problem
https://www.npr.org/2024/12/21/nx-s1-5220301/deepfakes-memes-artificial-intelligence-elections
https://apnews.com/article/artificial-intelligence-elections-disinformation-chatgpt-bc283e7426402f0b4baa7df280a4c3fd
https://www.lawfaremedia.org/article/new-and-old-tools-to-tackle-deepfakes-and-election-lies-in-2024
https://www.brennancenter.org/our-work/research-reports/regulating-ai-deepfakes-and-synthetic-media-political-arena
https://firstamendment.mtsu.edu/article/political-deepfakes-and-elections/
https://www.ncsl.org/technology-and-communication/deceptive-audio-or-visual-media-deepfakes-2024-legislation
https://law.unh.edu/sites/default/files/media/2022/06/nagumotu_pp113-157.pdf
https://dfrlab.org/2024/10/02/brazil-election-ai-research/
https://dfrlab.org/2024/11/26/brazil-election-ai-deepfakes/
https://freedomhouse.org/article/eu-digital-services-act-win-transparency
The post The Legal Accountability of AI-Generated Deepfakes in Election Misinformation appeared first on MarkTechPost.
Guide to Using the Desktop Commander MCP Server
The Desktop Commander MCP Server is a powerful tool that brings all your development operations into one chat interface. Built on top of the MCP Filesystem Server, it allows you to search, edit, and manage files, run terminal commands, and control processes directly from your desktop using the Model Context Protocol (MCP).
Following are the core capabilities of the Desktop Commander MCP server:
Terminal & Process Control
Execute terminal commands with live output streaming
Set timeouts and run commands in the background
Manage sessions for long-running tasks
List and kill running processes with detailed info
Configuration Management
Get or set server settings like:
defaultShell (e.g., bash, zsh)
blockedCommands (e.g., rm, shutdown)
allowedDirectories for file access
telemetryEnabled
Apply changes without restarting the server
Filesystem Operations
Read and write files with line-based limits
Append or overwrite file content
Create and list directories
Move or rename files and folders
Get file and directory metadata
Search files by name (case-insensitive)
Code & Text Editing
Perform precise text replacements (e.g., change config values)
Rewrite entire files for major updates
Search and replace patterns across multiple files
Use vscode-ripgrep for fast recursive text/code search
Audit Logging
All actions are logged with timestamps and arguments
Logs auto-rotate at 10MB to avoid clutter
In this tutorial, we will connect the Claude desktop app to the MCP server and perform some tasks.
Step 1: Setting up dependencies
Node JS
We need npx, which ships with Node.js, to run the Desktop Commander server.
Download the latest version of Node.js from nodejs.org
Run the installer.
Leave all settings as default and complete the installation
Claude Desktop
Download Claude Desktop from https://claude.ai/download.
Step 2: Configuring the MCP Server
Next, configure Claude to connect to your MCP server. Open the claude_desktop_config.json file located in the Claude installation directory using any text editor. If the file doesn’t exist, you can create it manually. Once opened, enter the following code:
{
  "mcpServers": {
    "desktop-commander": {
      "command": "npx",
      "args": ["-y", "@wonderwhy-er/desktop-commander"]
    }
  }
}
Step 3: Running the server
Once the MCP configuration is complete, your server should appear in Claude. The Desktop Commander server is a powerful interface, offering 18 tools for tasks like file management, terminal execution, process control, and more.
Feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter.
Arham Islam is a Civil Engineering graduate (2022) from Jamia Millia Islamia, New Delhi, with a keen interest in Data Science, especially Neural Networks and their application in various areas.
#guide #using #desktop #commander #mcpGuide to Using the Desktop Commander MCP ServerThe Desktop Commander MCP Server is a powerful tool that brings all your development operations into one chat interface. Built on top of the MCP Filesystem Server, it allows you to search, edit, and manage files, run terminal commands, and control processes directly from your desktop using the Model Context Protocol. Following are the core capabilities of the Desktop Commander MCP server: Terminal & Process Control Execute terminal commands with live output streaming Set timeouts and run commands in the background Manage sessions for long-running tasks List and kill running processes with detailed info Configuration Management Get or set server settings like: defaultShellblockedCommandsallowedDirectories for file access telemetryEnabled Apply changes without restarting the server Filesystem Operations Read and write files with line-based limits Append or overwrite file content Create and list directories Move or rename files and folders Get file and directory metadata Search files by nameCode & Text Editing Perform precise text replacementsRewrite entire files for major updates Search and replace patterns across multiple files Use vscode-ripgrep for fast recursive text/code search Audit Logging All actions are logged with timestamps and arguments Logs auto-rotate at 10MB to avoid clutter In this tutorial, we will be connecting Claude desktop with the MCP server and perform some tasks. Step 1: Setting up dependencies Node JS We need npx to run the Desktop Commander server, which comes with Node.js. Download the latest version of Node.js from nodejs.org Run the installer. Leave all settings as default and complete the installation Claude Desktop Download Claude using . Step 2: Configuring the MCP Server Next, configure Claude to connect to your MCP server. Open the claude_desktop_config.json file located in the Claude installation directory using any text editor. If the file doesn’t exist, you can create it manually. Once opened, enter the following code: { "mcpServers": { "desktop-commander": { "command": "npx", "args":} } } Step 3: Running the server Once the MCP configuration is complete, your server should appear in Claude. The Desktop Commander server is a powerful interface, offering 18 tools for tasks like file management, terminal execution, process control, and more. Feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. Arham IslamI am a Civil Engineering Graduatefrom Jamia Millia Islamia, New Delhi, and I have a keen interest in Data Science, especially Neural Networks and their application in various areas.Arham Islamhttps://www.marktechpost.com/author/arhamislam/Step-by-Step Guide to Creating Synthetic Data Using the Synthetic Data VaultArham Islamhttps://www.marktechpost.com/author/arhamislam/Step-by-Step Guide to Create an AI agent with Google ADKArham Islamhttps://www.marktechpost.com/author/arhamislam/Implementing an LLM Agent with Tool Access Using MCP-UseArham Islamhttps://www.marktechpost.com/author/arhamislam/Implementing an AgentQL Model Context ProtocolServer #guide #using #desktop #commander #mcpWWW.MARKTECHPOST.COMGuide to Using the Desktop Commander MCP ServerThe Desktop Commander MCP Server is a powerful tool that brings all your development operations into one chat interface. 
Step 3: Running the server
Once the MCP configuration is complete, your server should appear in Claude. The Desktop Commander server is a powerful interface, offering 18 tools for tasks like file management, terminal execution, process control, and more.
Feel free to follow us on Twitter and don't forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter.
Meet NovelSeek: A Unified Multi-Agent Framework for Autonomous Scientific Research from Hypothesis Generation to Experimental Validation
Scientific research across fields like chemistry, biology, and artificial intelligence has long relied on human experts to explore knowledge, generate ideas, design experiments, and refine results. Yet, as problems grow more complex and data-intensive, discovery slows. While AI tools, such as language models and robotics, can handle specific tasks, like literature searches or code analysis, they rarely encompass the entire research cycle. Bridging the gap between idea generation and experimental validation remains a key challenge. For AI to autonomously advance science, it must propose hypotheses, design and execute experiments, analyze outcomes, and refine approaches in an iterative loop. Without this integration, AI risks producing disconnected ideas that depend on human supervision for validation.
Before the introduction of a unified system, researchers relied on separate tools for each stage of the process. Large language models could help find relevant scientific papers, but they didn’t directly feed into experiment design or result analysis. Robotics can assist in automating physical experiments, and coding libraries like PyTorch can help build models; however, these tools operate independently of each other. There was no single system capable of handling the entire process, from forming ideas to verifying them through experiments. This led to bottlenecks, where researchers had to connect the dots manually, slowing progress and leaving room for errors or missed opportunities. The need for an integrated system that could handle the entire research cycle became clear.
Researchers from the NovelSeek Team at the Shanghai Artificial Intelligence Laboratory developed NovelSeek, an AI system designed to run the entire scientific discovery process autonomously. NovelSeek comprises four main modules that work in tandem: a system that generates and refines research ideas, a feedback loop where human experts can interact with and refine these ideas, a method for translating ideas into code and experiment plans, and a process for conducting multiple rounds of experiments. What makes NovelSeek stand out is its versatility; it works across 12 scientific research tasks, including predicting chemical reaction yields, understanding molecular dynamics, forecasting time-series data, and handling functions like 2D semantic segmentation and 3D object classification. The team designed NovelSeek to minimize human involvement, expedite discoveries, and deliver consistent, high-quality results.
The system behind NovelSeek involves multiple specialized agents, each focused on a specific part of the research workflow. The “Survey Agent” helps the system understand the problem by searching scientific papers and identifying relevant information based on keywords and task definitions. It adapts its search strategy by first doing a broad survey of papers, then going deeper by analyzing full-text documents for detailed insights. This ensures that the system captures both general trends and specific technical knowledge. The “Code Review Agent” examines existing codebases, whether user-uploaded or sourced from public repositories like GitHub, to understand how current methods work and identify areas for improvement. It checks how code is structured, looks for errors, and creates summaries that help the system build on past work. The “Idea Innovation Agent” generates creative research ideas, pushing the system to explore different approaches and refine them by comparing them to related studies and previous results. The system even includes a “Planning and Execution Agent” that turns ideas into detailed experiments, handles errors during the testing process, and ensures smooth execution of multi-step research plans.
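To make the division of labor concrete, the sketch below shows one way such a four-agent loop could be orchestrated in Python; every class name, method name, and the stopping rule are hypothetical, since the paper's actual implementation is not reproduced here:

# Hypothetical orchestration sketch of a NovelSeek-style research loop (names and APIs assumed)
class AutonomousResearchLoop:
    def __init__(self, survey_agent, code_review_agent, idea_agent, planner_agent):
        self.survey = survey_agent            # searches and summarizes relevant literature
        self.code_review = code_review_agent  # analyzes existing codebases for reusable pieces
        self.ideas = idea_agent               # proposes and refines research ideas
        self.planner = planner_agent          # turns ideas into code and runs experiments

    def run(self, task_description, max_rounds=3):
        literature = self.survey.summarize(task_description)
        baseline_notes = self.code_review.analyze(task_description)
        best = None
        for _ in range(max_rounds):
            idea = self.ideas.propose(task_description, literature, baseline_notes, best)
            result = self.planner.execute(idea)          # multi-round experimental validation
            if best is None or result.score > best.score:
                best = result                            # keep the strongest result so far
        return best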
NovelSeek delivered impressive results across various tasks. In chemical reaction yield prediction, NovelSeek improved performance from a baseline of 24.2% (with a variation of ±4.2) to 34.8% (with a much smaller variation of ±1.1) in just 12 hours, progress that human researchers typically need months to achieve. In enhancer activity prediction, a key task in biology, NovelSeek raised the Pearson correlation coefficient from 0.65 to 0.79 within 4 hours. For 2D semantic segmentation, a task used in computer vision, precision improved from 78.8% to 81.0% in just 30 hours. These performance boosts, achieved in a fraction of the time typically needed, highlight the system’s efficiency. NovelSeek also successfully managed large, complex codebases with multiple files, demonstrating its ability to handle research tasks at a project level, not just in small, isolated tests. The team has made the code open-source, allowing others to use, test, and contribute to its improvement.
Several Key Takeaways from the Research on NovelSeek include:
NovelSeek supports 12 research tasks, including chemical reaction prediction, molecular dynamics, and 3D object classification.
Reaction yield prediction accuracy improved from 24.2% to 34.8% in 12 hours.
Enhancer activity prediction performance increased from 0.65 to 0.79 in 4 hours.
2D semantic segmentation precision improved from 78.8% to 81.0% in 30 hours.
NovelSeek includes agents for literature search, code analysis, idea generation, and experiment execution.
The system is open-source, enabling reproducibility and collaboration across scientific fields.
In conclusion, NovelSeek demonstrates how combining AI tools into a single system can accelerate scientific discovery and reduce its dependence on human effort. It ties together the key steps, generating ideas, turning them into methods, and testing them through experiments, into one streamlined process. What once took researchers months or years can now be done in days or even hours. By linking every stage of research into a continuous loop, NovelSeek helps teams move from rough ideas to real-world results more quickly. This system highlights the power of AI not just to assist, but to drive scientific research in a way that could reshape how discoveries are made across many fields.
Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter.
Multimodal Foundation Models Fall Short on Physical Reasoning: PHYX Benchmark Highlights Key Limitations in Visual and Symbolic Integration
State-of-the-art models show human-competitive accuracy on AIME, GPQA, MATH-500, and OlympiadBench, solving Olympiad-level problems. Recent multimodal foundation models have advanced benchmarks for disciplinary knowledge and mathematical reasoning. However, these evaluations miss a crucial aspect of machine intelligence: physical reasoning, which requires integrating disciplinary knowledge, symbolic operations, and real-world constraints. Physical problem-solving differs fundamentally from pure mathematical reasoning as it demands models to decode implicit conditions in questions. For example, interpreting “smooth surface” as zero friction coefficient, and maintaining physical consistency across reasoning chains because physical laws remain constant regardless of reasoning trajectories.
MLLMs show excellent visual understanding by integrating visual and textual data across various tasks, motivating exploration of their reasoning abilities. However, uncertainty remains regarding whether these models possess genuine advanced reasoning capabilities for visual tasks, particularly in physical domains closer to real-world scenarios. Several LLM benchmarks have emerged to evaluate reasoning abilities, with PHYBench being most relevant for physics reasoning. MLLM scientific benchmarks, such as PhysReason and EMMA, contain multimodal physics problems with figures; however, they include only small physics subsets, which inadequately evaluate MLLMs’ capabilities for reasoning about and solving advanced physics problems.
Researchers from the University of Hong Kong, the University of Michigan, the University of Toronto, the University of Waterloo, and the Ohio State University have proposed PHYX, a novel benchmark to evaluate the physical reasoning capabilities of foundation models. It comprises 3,000 visually-grounded physics questions, precisely curated across six distinct physics domains: Mechanics, Electromagnetism, Thermodynamics, Wave/Acoustics, Optics, and Modern Physics. It evaluates physics-based reasoning via multimodal problem-solving with three core innovations: (a) 3,000 newly collected questions with realistic physical scenarios requiring integrated visual analysis and causal reasoning, (b) expert-validated data design covering six fundamental physics domains, and (c) strict unified three-step evaluation protocols.
Researchers designed a four-stage data collection process to ensure high-quality data. The process begins with an in-depth survey of core physics disciplines to determine coverage across diverse domains and subfields, followed by the recruitment of STEM graduate students as expert annotators. They comply with copyright restrictions and avoid data contamination by selecting questions without answers that are immediately available. Moreover, quality control involves a three-stage cleaning process including duplicate detection through lexical overlap analysis with manual review by physics Ph.D. students, followed by filtering the shortest 10% of questions based on textual length, resulting in 3,000 high-quality questions from an initial collection of 3,300.
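A rough Python sketch of the length-based filtering step described above; the field name and exact selection rule are assumptions, not the authors' code:

# Hypothetical sketch: drop the shortest 10% of candidate questions by text length
def filter_short_questions(questions, drop_fraction=0.10):
    ranked = sorted(questions, key=lambda q: len(q["text"]))
    cutoff = int(len(ranked) * drop_fraction)
    return ranked[cutoff:]   # keep the remaining ~90%

# In the paper, deduplication plus this kind of length filter reduced an initial
# pool of 3,300 candidate questions to the final 3,000.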
PHYX presents significant challenges for current models, with even the worst-performing human experts achieving 75.6% accuracy, outperforming all evaluated models and showing a gap between human expertise and current model capabilities. The benchmark reveals that multiple-choice formats narrow performance gaps by allowing weaker models to rely on surface-level cues, but open-ended questions demand genuine reasoning and precise answer generation. Comparing GPT-4o’s performance on PHYX to its previously reported results on MathVista and MATH-V (both 63.8%), the lower accuracy on physical reasoning tasks emphasizes that physical reasoning requires deeper integration of abstract concepts and real-world knowledge, presenting greater challenges than purely mathematical contexts.
In conclusion, researchers introduced PHYX, the first large-scale benchmark for evaluating physical reasoning in multimodal, visually grounded scenarios. Rigorous evaluation reveals that state-of-the-art models show limitations in physical reasoning, relying predominantly on memorized knowledge, mathematical formulas, and superficial visual patterns rather than genuine understanding of physical principles. The benchmark focuses exclusively on English-language prompts and annotations, limiting assessment of multilingual reasoning abilities. Also, while images depict physically realistic scenarios, they are often schematic or textbook-style rather than real-world photographs, which may not fully capture the complexity of perception in natural environments.
Check out the Paper, Code and Project Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter.
A Coding Guide to Building a Scalable Multi-Agent Communication System Using Agent Communication Protocol (ACP)
In this tutorial, we implement the Agent Communication Protocol (ACP) by building a flexible, ACP-compliant messaging system in Python, leveraging Google’s Gemini API for natural language processing. Beginning with the installation and configuration of the google-generativeai library, the tutorial introduces core abstractions, message types, performatives, and the ACPMessage data class, which standardizes inter-agent communication. By defining ACPAgent and ACPMessageBroker classes, the guide demonstrates how to create, send, route, and process structured messages among multiple autonomous agents. Through clear code examples, users learn to implement querying, requesting actions, and broadcasting information, while maintaining conversation threads, acknowledgments, and error handling.
import google.generativeai as genai
import json
import time
import uuid
from enum import Enum
from typing import Dict, List, Any, Optional
from dataclasses import dataclass, asdict
GEMINI_API_KEY = "Use Your Gemini API Key"
genai.configure(api_key=GEMINI_API_KEY)

We import essential Python modules, ranging from JSON handling and timing to unique identifier generation and type annotations, to support a structured ACP implementation. The snippet then retrieves the user’s Gemini API key placeholder and configures the google-generativeai client for subsequent calls to the Gemini language model.
class ACPMessageType(Enum):
    """Standard ACP message types"""
    REQUEST = "request"
    RESPONSE = "response"
    INFORM = "inform"
    QUERY = "query"
    SUBSCRIBE = "subscribe"
    UNSUBSCRIBE = "unsubscribe"
    ERROR = "error"
    ACK = "acknowledge"
The ACPMessageType enumeration defines the core message categories used in the Agent Communication Protocol, including requests, responses, informational broadcasts, queries, and control actions like subscription management, error signaling, and acknowledgments. By centralizing these message types, the protocol ensures consistent handling and routing of inter-agent communications throughout the system.
class ACPPerformative(Enum):
    """ACP speech acts"""
    TELL = "tell"
    ASK = "ask"
    REPLY = "reply"
    REQUEST_ACTION = "request-action"
    AGREE = "agree"
    REFUSE = "refuse"
    PROPOSE = "propose"
    ACCEPT = "accept"
    REJECT = "reject"
The ACPPerformative enumeration captures the variety of speech acts agents can use when interacting under the ACP framework, mapping high-level intentions, such as making requests, posing questions, giving commands, or negotiating agreements, onto standardized labels. This clear taxonomy enables agents to interpret and respond to messages in contextually appropriate ways, ensuring robust and semantically rich communication.
@dataclass
class ACPMessage:
    """Agent Communication Protocol Message Structure"""
    message_id: str
    sender: str
    receiver: str
    performative: str
    content: Dict[str, Any]
    protocol: str = "ACP-1.0"
    conversation_id: str = None
    reply_to: str = None
    language: str = "english"
    encoding: str = "json"
    timestamp: float = None

    def __post_init__(self):
        # Auto-populate missing metadata so every message is uniquely trackable
        if self.timestamp is None:
            self.timestamp = time.time()
        if self.conversation_id is None:
            self.conversation_id = str(uuid.uuid4())

    def to_acp_format(self) -> str:
        """Convert to standard ACP message format"""
        acp_msg = {
            "message-id": self.message_id,
            "sender": self.sender,
            "receiver": self.receiver,
            "performative": self.performative,
            "content": self.content,
            "protocol": self.protocol,
            "conversation-id": self.conversation_id,
            "reply-to": self.reply_to,
            "language": self.language,
            "encoding": self.encoding,
            "timestamp": self.timestamp
        }
        return json.dumps(acp_msg, indent=2)

    @classmethod
    def from_acp_format(cls, acp_string: str) -> 'ACPMessage':
        """Parse ACP message from string format"""
        data = json.loads(acp_string)
        return cls(
            message_id=data["message-id"],
            sender=data["sender"],
            receiver=data["receiver"],
            performative=data["performative"],
            content=data["content"],
            protocol=data.get("protocol", "ACP-1.0"),
            conversation_id=data.get("conversation-id"),
            reply_to=data.get("reply-to"),
            language=data.get("language", "english"),
            encoding=data.get("encoding", "json"),
            timestamp=data.get("timestamp")
        )
The ACPMessage data class encapsulates all the fields required for a structured ACP exchange, including identifiers, participants, performative, payload, and metadata such as protocol version, language, and timestamps. Its __post_init__ method auto-populates missing timestamp and conversation_id values, ensuring every message is uniquely tracked. Utility methods to_acp_format and from_acp_format handle serialization to and from the standardized JSON representation for seamless transmission and parsing.
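To make the structure concrete, here is a brief usage sketch (the agent IDs and payload are illustrative, not from the original notebook) that builds a message and round-trips it through the two helpers:

# Illustrative round trip through to_acp_format / from_acp_format
msg = ACPMessage(
    message_id=str(uuid.uuid4()),
    sender="agent-a",
    receiver="agent-b",
    performative=ACPPerformative.ASK.value,
    content={"question": "Is the report ready?", "query-type": "yes-no"}
)
wire = msg.to_acp_format()                   # JSON string ready to transmit
restored = ACPMessage.from_acp_format(wire)  # parsed back into a dataclass
assert restored.conversation_id == msg.conversation_id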
class ACPAgent:
    """Agent implementing Agent Communication Protocol"""

    def __init__(self, agent_id: str, name: str, capabilities: List[str]):
        self.agent_id = agent_id
        self.name = name
        self.capabilities = capabilities
        # Model name assumed; the original argument was elided during extraction
        self.model = genai.GenerativeModel("gemini-1.5-flash")
        self.message_queue: List[ACPMessage] = []
        self.subscriptions: Dict[str, List[str]] = {}
        self.conversations: Dict[str, List[ACPMessage]] = {}

    def create_message(self, receiver: str, performative: str, content: Dict[str, Any],
                       conversation_id: str = None, reply_to: str = None) -> ACPMessage:
        """Create a new ACP-compliant message"""
        return ACPMessage(
            message_id=str(uuid.uuid4()),
            sender=self.agent_id,
            receiver=receiver,
            performative=performative,
            content=content,
            conversation_id=conversation_id,
            reply_to=reply_to
        )

    def send_inform(self, receiver: str, fact: str, data: Any = None) -> ACPMessage:
        """Send an INFORM message"""
        content = {"fact": fact, "data": data}
        return self.create_message(receiver, ACPPerformative.TELL.value, content)

    def send_query(self, receiver: str, question: str, query_type: str = "general") -> ACPMessage:
        """Send a QUERY message"""
        content = {"question": question, "query-type": query_type}
        return self.create_message(receiver, ACPPerformative.ASK.value, content)

    def send_request(self, receiver: str, action: str, parameters: Dict[str, Any] = None) -> ACPMessage:
        """Send a REQUEST message"""
        content = {"action": action, "parameters": parameters or {}}
        return self.create_message(receiver, ACPPerformative.REQUEST_ACTION.value, content)

    def send_reply(self, original_msg: ACPMessage, response_data: Any) -> ACPMessage:
        """Send a REPLY message in response to another message"""
        content = {"response": response_data, "original-question": original_msg.content}
        return self.create_message(original_msg.sender, ACPPerformative.REPLY.value, content,
                                   conversation_id=original_msg.conversation_id,
                                   reply_to=original_msg.message_id)

    def process_message(self, message: ACPMessage) -> Optional[ACPMessage]:
        """Process incoming ACP message and generate appropriate response"""
        self.message_queue.append(message)
        conv_id = message.conversation_id
        if conv_id not in self.conversations:
            self.conversations[conv_id] = []
        self.conversations[conv_id].append(message)

        if message.performative == ACPPerformative.ASK.value:
            return self._handle_query(message)
        elif message.performative == ACPPerformative.REQUEST_ACTION.value:
            return self._handle_request(message)
        elif message.performative == ACPPerformative.TELL.value:
            return self._handle_inform(message)
        return None

    def _handle_query(self, message: ACPMessage) -> ACPMessage:
        """Handle incoming query messages"""
        question = message.content.get("question", "")
        prompt = f"As agent {self.name} with capabilities {self.capabilities}, answer: {question}"
        try:
            response = self.model.generate_content(prompt)
            answer = response.text.strip()
        except Exception:
            answer = "Unable to process query at this time"
        return self.send_reply(message, answer)

    def _handle_request(self, message: ACPMessage) -> ACPMessage:
        """Handle incoming action requests"""
        action = message.content.get("action", "")
        parameters = message.content.get("parameters", {})
        # Capability check reconstructed; the original condition was elided
        if any(capability in action.lower() for capability in self.capabilities):
            result = f"Executing {action} with parameters {parameters}"
            status = "agreed"
        else:
            result = f"Cannot perform {action} - not in my capabilities"
            status = "refused"
        return self.send_reply(message, {"status": status, "result": result})

    def _handle_inform(self, message: ACPMessage) -> Optional[ACPMessage]:
        """Handle incoming information messages"""
        fact = message.content.get("fact", "")
        print(f"[{self.name}] Received information: {fact}")
        ack_content = {"status": "received", "fact": fact}
        # Acknowledgment performative assumed (TELL); the original argument was elided
        return self.create_message(message.sender, ACPPerformative.TELL.value, ack_content,
                                   conversation_id=message.conversation_id)

The ACPAgent class encapsulates an autonomous entity capable of sending, receiving, and processing ACP-compliant messages using Gemini’s language model. It manages its own message queue, conversation history, and subscriptions, and provides helper methods (send_inform, send_query, send_request, and send_reply) to construct correctly formatted ACPMessage instances. Incoming messages are routed through process_message, which delegates to specialized handlers for queries, action requests, and informational messages.
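As a quick standalone check (agent IDs, names, and capabilities are illustrative), an agent can be instantiated and asked to draft messages without a broker:

# Illustrative: instantiate an agent and draft two outgoing messages
research_agent = ACPAgent("researcher-01", "Research Agent", ["analysis", "research"])
query = research_agent.send_query("assistant-01", "Summarize today's experiment results")
request = research_agent.send_request("calc-01", "calculation", {"expression": "2 + 2"})
print(query.performative, request.performative)   # ask request-action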
class ACPMessageBroker:
    """Message broker implementing ACP routing and delivery"""

    def __init__(self):
        self.agents: Dict[str, ACPAgent] = {}
        self.message_log: List[ACPMessage] = []
        self.routing_table: Dict[str, str] = {}

    def register_agent(self, agent: ACPAgent):
        """Register an agent with the message broker"""
        self.agents[agent.agent_id] = agent
        self.routing_table[agent.agent_id] = "local"
        # Log text approximated; the original string literal was elided
        print(f"Registered agent: {agent.name} ({agent.agent_id})")

    def route_message(self, message: ACPMessage) -> bool:
        """Route ACP message to appropriate recipient"""
        if message.receiver not in self.agents:
            print(f"Receiver {message.receiver} not found")
            return False

        # Log lines approximated; the original print strings were elided
        print(f"\nMESSAGE ROUTING: {message.sender} -> {message.receiver}")
        print(f"Performative: {message.performative}")
        print(f"Content: {json.dumps(message.content)}")

        receiver_agent = self.agents[message.receiver]
        response = receiver_agent.process_message(message)
        self.message_log.append(message)

        if response:
            print(f"\nRESPONSE: {response.sender} -> {response.receiver}")
            print(f"Content: {json.dumps(response.content)}")
            if response.receiver in self.agents:
                self.agents[response.receiver].process_message(response)
                self.message_log.append(response)

        return True

    def broadcast_message(self, message: ACPMessage, recipients: List[str]):
        """Broadcast message to multiple recipients"""
        for recipient in recipients:
            msg_copy = ACPMessage(
                message_id=str(uuid.uuid4()),
                sender=message.sender,
                receiver=recipient,
                performative=message.performative,
                content=message.content.copy(),
                conversation_id=message.conversation_id
            )
            self.route_message(msg_copy)

The ACPMessageBroker serves as the central router for ACP messages, maintaining a registry of agents and a message log. It provides methods to register agents, to deliver individual messages via route_message, which handles lookup, logging, and response chaining, and to send the same message to multiple recipients with broadcast_message.
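A minimal end-to-end sketch of the broker in use (all identifiers illustrative; the reply step requires a valid Gemini API key) looks like this:

# Illustrative: register two agents and route a single query through the broker
broker = ACPMessageBroker()
alice = ACPAgent("alice-01", "Alice", ["information"])
bob = ACPAgent("bob-01", "Bob", ["analysis"])
broker.register_agent(alice)
broker.register_agent(bob)
broker.route_message(alice.send_query("bob-01", "Has the dataset been preprocessed?"))
print(len(broker.message_log))   # number of messages logged so far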
def demonstrate_acp():
    """Comprehensive demonstration of Agent Communication Protocol"""
    # Agent IDs, names, capabilities, and demo prompts are illustrative;
    # the original literals were lost during extraction
    print("=== AGENT COMMUNICATION PROTOCOL DEMONSTRATION ===")

    broker = ACPMessageBroker()

    researcher = ACPAgent("researcher-01", "Dr. Research", ["analysis", "research"])
    assistant = ACPAgent("assistant-01", "AI Assistant", ["information", "support"])
    calculator = ACPAgent("calc-01", "MathBot", ["calculation", "math"])

    broker.register_agent(researcher)
    broker.register_agent(assistant)
    broker.register_agent(calculator)

    print("\nREGISTERED AGENTS:")
    for agent_id, agent in broker.agents.items():
        print(f"  {agent_id}: {', '.join(agent.capabilities)}")

    print("\nSCENARIO 1: Information query")
    query_msg = assistant.send_query("researcher-01", "What are the key benefits of standardized agent communication?")
    broker.route_message(query_msg)

    print("\nSCENARIO 2: Action request")
    # Original expression partially elided; only "+ 10" survived, so the rest is assumed
    calc_request = researcher.send_request("calc-01", "calculation", {"expression": "25 * 4 + 10"})
    broker.route_message(calc_request)

    print("\nSCENARIO 3: Information sharing")
    info_msg = researcher.send_inform("assistant-01", "Experiment results are ready for review")
    broker.route_message(info_msg)

    print("\nPROTOCOL STATISTICS:")
    print(f"  Total messages routed: {len(broker.message_log)}")

    print("\nSAMPLE ACP MESSAGE FORMAT:")
    sample_msg = assistant.send_query("researcher-01", "Sample question for format demonstration")
    print(sample_msg.to_acp_format())
The demonstrate_acp function orchestrates a hands-on walkthrough of the entire ACP framework: it initializes a broker and three distinct agents, registers them, and illustrates three key interaction scenarios, querying for information, requesting a computation, and sharing an update. After routing each message and handling responses, it prints summary statistics on the message flow. It showcases a formatted ACP message, providing users with a clear, end-to-end example of how agents communicate under the protocol.
def setup_guide():
    # Guide text partially reconstructed; some of the original lines were lost during extraction
    print("""
GOOGLE COLAB SETUP:
1. Get a Gemini API key and paste it into GEMINI_API_KEY above
2. Run all cells, then call demonstrate_acp()

ACP PROTOCOL FEATURES:
• Standardized message format with required fields
• Speech act performatives (tell, ask, reply, request-action, ...)
• Conversation tracking and message threading
• Error handling and acknowledgments
• Message routing and delivery confirmation

EXTEND THE PROTOCOL:
```python
# Create custom agent
my_agent = ACPAgent("my-agent-01", "My Agent", ["my-capability"])
broker.register_agent(my_agent)

# Send custom message
msg = my_agent.send_query("researcher-01", "Your question here")
broker.route_message(msg)
```
""")
if __name__ == "__main__":
    setup_guide()
    demonstrate_acp()

Finally, the setup_guide function provides a quick-start reference for running the ACP demo in Google Colab, outlining how to obtain and configure your Gemini API key and invoke the demonstrate_acp routine. It also summarizes key protocol features, such as standardized message formats, performatives, and message routing, and provides a concise code snippet illustrating how to register custom agents and send tailored messages.
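For completeness, a minimal setup sketch for a fresh Colab runtime might look like the following; the package name comes from the tutorial's intro, while everything else is a placeholder:

# Install the client library and configure the API key before running the tutorial code
import subprocess, sys
subprocess.run([sys.executable, "-m", "pip", "install", "-q", "google-generativeai"], check=True)

import google.generativeai as genai
GEMINI_API_KEY = "Use Your Gemini API Key"   # replace with your real key
genai.configure(api_key=GEMINI_API_KEY)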
In conclusion, this tutorial implements an ACP-based multi-agent system capable of research, computation, and collaboration tasks. The provided sample scenarios illustrate common use cases: information queries, computational requests, and fact sharing, while the broker ensures reliable message delivery and logging. Readers are encouraged to extend the framework by adding new agent capabilities, integrating domain-specific actions, or incorporating more sophisticated subscription and notification mechanisms.
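As one possible direction, here is a minimal sketch of such a subscription mechanism built on the existing SUBSCRIBE message type and the agent's subscriptions dictionary; the helper functions and their behavior are assumptions, not part of the original tutorial:

# Hypothetical extension: topic-based subscriptions on top of the ACP pieces above
def subscribe(subscriber: ACPAgent, broker: ACPMessageBroker, publisher: ACPAgent, topic: str) -> None:
    # Record the subscription on the publisher and notify it via a REQUEST_ACTION message
    publisher.subscriptions.setdefault(topic, []).append(subscriber.agent_id)
    content = {"action": ACPMessageType.SUBSCRIBE.value, "parameters": {"topic": topic}}
    broker.route_message(subscriber.create_message(publisher.agent_id,
                                                   ACPPerformative.REQUEST_ACTION.value, content))

def publish(publisher: ACPAgent, broker: ACPMessageBroker, topic: str, fact: str) -> None:
    # Fan out an INFORM-style TELL message to every subscriber of the topic
    for subscriber_id in publisher.subscriptions.get(topic, []):
        broker.route_message(publisher.send_inform(subscriber_id, fact))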
Download the Notebook on GitHub. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter.
A Coding Guide to Building a Scalable Multi-Agent Communication System Using the Agent Communication Protocol (ACP)

In this tutorial, we implement the Agent Communication Protocol (ACP) by building a flexible, ACP-compliant messaging system in Python, leveraging Google’s Gemini API for natural language processing. Beginning with the installation and configuration of the google-generativeai library, the tutorial introduces the core abstractions: message types, performatives, and the ACPMessage data class, which standardizes inter-agent communication.
By defining ACPAgent and ACPMessageBroker classes, the guide demonstrates how to create, send, route, and process structured messages among multiple autonomous agents. Through clear code examples, users learn to implement querying, requesting actions, and broadcasting information, while maintaining conversation threads, acknowledgments, and error handling.

import google.generativeai as genai
import json
import time
import uuid
from enum import Enum
from typing import Dict, List, Any, Optional
from dataclasses import dataclass, asdict

GEMINI_API_KEY = "Use Your Gemini API Key"
genai.configure(api_key=GEMINI_API_KEY)

We import essential Python modules, ranging from JSON handling and timing to unique identifier generation and type annotations, to support a structured ACP implementation. The snippet then sets a placeholder for the user’s Gemini API key and configures the google-generativeai client for subsequent calls to the Gemini language model.

class ACPMessageType(Enum):
    """Standard ACP message types"""
    REQUEST = "request"
    RESPONSE = "response"
    INFORM = "inform"
    QUERY = "query"
    SUBSCRIBE = "subscribe"
    UNSUBSCRIBE = "unsubscribe"
    ERROR = "error"
    ACK = "acknowledge"

The ACPMessageType enumeration defines the core message categories used in the Agent Communication Protocol, including requests, responses, informational broadcasts, queries, and control actions like subscription management, error signaling, and acknowledgments. By centralizing these message types, the protocol ensures consistent handling and routing of inter-agent communications throughout the system.

class ACPPerformative(Enum):
    """ACP speech acts (performatives)"""
    TELL = "tell"
    ASK = "ask"
    REPLY = "reply"
    REQUEST_ACTION = "request-action"
    AGREE = "agree"
    REFUSE = "refuse"
    PROPOSE = "propose"
    ACCEPT = "accept"
    REJECT = "reject"

The ACPPerformative enumeration captures the variety of speech acts agents can use when interacting under the ACP framework, mapping high-level intentions, such as making requests, posing questions, giving commands, or negotiating agreements, onto standardized labels. This clear taxonomy enables agents to interpret and respond to messages in contextually appropriate ways, ensuring robust and semantically rich communication.
@dataclass
class ACPMessage:
    """Agent Communication Protocol Message Structure"""
    message_id: str
    sender: str
    receiver: str
    performative: str
    content: Dict[str, Any]
    protocol: str = "ACP-1.0"
    conversation_id: str = None
    reply_to: str = None
    language: str = "english"
    encoding: str = "json"
    timestamp: float = None

    def __post_init__(self):
        if self.timestamp is None:
            self.timestamp = time.time()
        if self.conversation_id is None:
            self.conversation_id = str(uuid.uuid4())

    def to_acp_format(self) -> str:
        """Convert to standard ACP message format"""
        acp_msg = {
            "message-id": self.message_id,
            "sender": self.sender,
            "receiver": self.receiver,
            "performative": self.performative,
            "content": self.content,
            "protocol": self.protocol,
            "conversation-id": self.conversation_id,
            "reply-to": self.reply_to,
            "language": self.language,
            "encoding": self.encoding,
            "timestamp": self.timestamp
        }
        return json.dumps(acp_msg, indent=2)

    @classmethod
    def from_acp_format(cls, acp_string: str) -> 'ACPMessage':
        """Parse ACP message from string format"""
        data = json.loads(acp_string)
        return cls(
            message_id=data["message-id"],
            sender=data["sender"],
            receiver=data["receiver"],
            performative=data["performative"],
            content=data["content"],
            protocol=data.get("protocol", "ACP-1.0"),
            conversation_id=data.get("conversation-id"),
            reply_to=data.get("reply-to"),
            language=data.get("language", "english"),
            encoding=data.get("encoding", "json"),
            timestamp=data.get("timestamp", time.time())
        )

The ACPMessage data class encapsulates all the fields required for a structured ACP exchange, including identifiers, participants, performative, payload, and metadata such as protocol version, language, and timestamps. Its __post_init__ method auto-populates missing timestamp and conversation_id values, ensuring every message is uniquely tracked. Utility methods to_acp_format and from_acp_format handle serialization to and from the standardized JSON representation for seamless transmission and parsing.
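To make the serialization contract concrete, here is a brief usage sketch (not part of the original tutorial) that assumes the ACPMessage class and ACPPerformative enum defined above have already been run; the agent IDs and question text are illustrative placeholders.

```python
# Round-trip check: serialize an ACPMessage to the ACP JSON wire format
# and parse it back into an equivalent object.
msg = ACPMessage(
    message_id=str(uuid.uuid4()),
    sender="agent-001",
    receiver="agent-002",
    performative=ACPPerformative.ASK.value,
    content={"question": "Is the dataset ready?", "query-type": "yes-no"},
)

wire = msg.to_acp_format()                  # JSON string with "message-id", "sender", ...
restored = ACPMessage.from_acp_format(wire)

assert restored.message_id == msg.message_id
assert restored.content == msg.content
print(wire)
```

Because __post_init__ fills in timestamp and conversation_id automatically, the parsed copy carries the same conversation thread as the original message.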
class ACPAgent:
    """Agent implementing Agent Communication Protocol"""

    def __init__(self, agent_id: str, name: str, capabilities: List[str]):
        self.agent_id = agent_id
        self.name = name
        self.capabilities = capabilities
        self.model = genai.GenerativeModel("gemini-1.5-flash")
        self.message_queue: List[ACPMessage] = []
        self.subscriptions: Dict[str, List[str]] = {}
        self.conversations: Dict[str, List[ACPMessage]] = {}

    def create_message(self, receiver: str, performative: str,
                       content: Dict[str, Any], conversation_id: str = None,
                       reply_to: str = None) -> ACPMessage:
        """Create a new ACP-compliant message"""
        return ACPMessage(
            message_id=str(uuid.uuid4()),
            sender=self.agent_id,
            receiver=receiver,
            performative=performative,
            content=content,
            conversation_id=conversation_id,
            reply_to=reply_to
        )

    def send_inform(self, receiver: str, fact: str, data: Any = None) -> ACPMessage:
        """Send an INFORM message (telling someone a fact)"""
        content = {"fact": fact, "data": data}
        return self.create_message(receiver, ACPPerformative.TELL.value, content)

    def send_query(self, receiver: str, question: str, query_type: str = "yes-no") -> ACPMessage:
        """Send a QUERY message (asking for information)"""
        content = {"question": question, "query-type": query_type}
        return self.create_message(receiver, ACPPerformative.ASK.value, content)

    def send_request(self, receiver: str, action: str, parameters: Dict = None) -> ACPMessage:
        """Send a REQUEST message (asking someone to perform an action)"""
        content = {"action": action, "parameters": parameters or {}}
        return self.create_message(receiver, ACPPerformative.REQUEST_ACTION.value, content)

    def send_reply(self, original_msg: ACPMessage, response_data: Any) -> ACPMessage:
        """Send a REPLY message in response to another message"""
        content = {"response": response_data, "original-question": original_msg.content}
        return self.create_message(
            original_msg.sender, ACPPerformative.REPLY.value, content,
            conversation_id=original_msg.conversation_id,
            reply_to=original_msg.message_id
        )

    def process_message(self, message: ACPMessage) -> Optional[ACPMessage]:
        """Process incoming ACP message and generate appropriate response"""
        self.message_queue.append(message)
        conv_id = message.conversation_id
        if conv_id not in self.conversations:
            self.conversations[conv_id] = []
        self.conversations[conv_id].append(message)

        if message.performative == ACPPerformative.ASK.value:
            return self._handle_query(message)
        elif message.performative == ACPPerformative.REQUEST_ACTION.value:
            return self._handle_request(message)
        elif message.performative == ACPPerformative.TELL.value:
            return self._handle_inform(message)
        return None

    def _handle_query(self, message: ACPMessage) -> ACPMessage:
        """Handle incoming query messages"""
        question = message.content.get("question", "")
        prompt = f"As agent {self.name} with capabilities {self.capabilities}, answer: {question}"
        try:
            response = self.model.generate_content(prompt)
            answer = response.text.strip()
        except:
            answer = "Unable to process query at this time"
        return self.send_reply(message, {"answer": answer, "confidence": 0.8})

    def _handle_request(self, message: ACPMessage) -> ACPMessage:
        """Handle incoming action requests"""
        action = message.content.get("action", "")
        parameters = message.content.get("parameters", {})
        if any(capability in action.lower() for capability in self.capabilities):
            result = f"Executing {action} with parameters {parameters}"
            status = "agreed"
        else:
            result = f"Cannot perform {action} - not in my capabilities"
            status = "refused"
        return self.send_reply(message, {"status": status, "result": result})

    def _handle_inform(self, message: ACPMessage) -> Optional[ACPMessage]:
        """Handle incoming information messages"""
        fact = message.content.get("fact", "")
        print(f"[{self.name}] Received information: {fact}")
        ack_content = {"status": "received", "fact": fact}
        return self.create_message(message.sender, "acknowledge", ack_content,
                                   conversation_id=message.conversation_id)

The ACPAgent class encapsulates an autonomous entity capable of sending, receiving, and processing ACP-compliant messages using Gemini’s language model. It manages its own message queue, conversation history, and subscriptions, and provides helper methods (send_inform, send_query, send_request, send_reply) to construct correctly formatted ACPMessage instances. Incoming messages are routed through process_message, which delegates to specialized handlers for queries, action requests, and informational messages.

class ACPMessageBroker:
    """Message broker implementing ACP routing and delivery"""

    def __init__(self):
        self.agents: Dict[str, ACPAgent] = {}
        self.message_log: List[ACPMessage] = []
        self.routing_table: Dict[str, str] = {}

    def register_agent(self, agent: ACPAgent):
        """Register an agent with the message broker"""
        self.agents[agent.agent_id] = agent
        self.routing_table[agent.agent_id] = "local"
        print(f"✓ Registered agent: {agent.name} ({agent.agent_id})")

    def route_message(self, message: ACPMessage) -> bool:
        """Route ACP message to appropriate recipient"""
        if message.receiver not in self.agents:
            print(f"✗ Receiver {message.receiver} not found")
            return False

        print(f"\n📨 ACP MESSAGE ROUTING:")
        print(f"From: {message.sender} → To: {message.receiver}")
        print(f"Performative: {message.performative}")
        print(f"Content: {json.dumps(message.content, indent=2)}")

        receiver_agent = self.agents[message.receiver]
        response = receiver_agent.process_message(message)
        self.message_log.append(message)

        if response:
            print(f"\n📤 GENERATED RESPONSE:")
            print(f"From: {response.sender} → To: {response.receiver}")
            print(f"Content: {json.dumps(response.content, indent=2)}")
            if response.receiver in self.agents:
                self.agents[response.receiver].process_message(response)
            self.message_log.append(response)

        return True

    def broadcast_message(self, message: ACPMessage, recipients: List[str]):
        """Broadcast message to multiple recipients"""
        for recipient in recipients:
            msg_copy = ACPMessage(
                message_id=str(uuid.uuid4()),
                sender=message.sender,
                receiver=recipient,
                performative=message.performative,
                content=message.content.copy(),
                conversation_id=message.conversation_id
            )
            self.route_message(msg_copy)

The ACPMessageBroker serves as the central router for ACP messages, maintaining a registry of agents and a message log. It provides methods to register agents, deliver individual messages via route_message, which handles lookup, logging, and response chaining, and to send the same message to multiple recipients with broadcast_message.

def demonstrate_acp():
    """Comprehensive demonstration of Agent Communication Protocol"""
    print("🤖 AGENT COMMUNICATION PROTOCOL (ACP) DEMONSTRATION")
    print("=" * 60)

    broker = ACPMessageBroker()

    researcher = ACPAgent("agent-001", "Dr. Research", ["analysis", "research", "data-processing"])
    assistant = ACPAgent("agent-002", "AI Assistant", ["information", "scheduling", "communication"])
    calculator = ACPAgent("agent-003", "MathBot", ["calculation", "mathematics", "computation"])

    broker.register_agent(researcher)
    broker.register_agent(assistant)
    broker.register_agent(calculator)

    print(f"\n📋 REGISTERED AGENTS:")
    for agent_id, agent in broker.agents.items():
        print(f"  • {agent.name} ({agent_id}): {', '.join(agent.capabilities)}")

    print(f"\n🔬 SCENARIO 1: Information Query (ASK performative)")
    query_msg = assistant.send_query("agent-001", "What are the key factors in AI research?")
    broker.route_message(query_msg)

    print(f"\n🔢 SCENARIO 2: Action Request (REQUEST-ACTION performative)")
    calc_request = researcher.send_request("agent-003", "calculate", {"expression": "sqrt(144) + 10"})
    broker.route_message(calc_request)

    print(f"\n📢 SCENARIO 3: Information Sharing (TELL performative)")
    info_msg = researcher.send_inform("agent-002", "New research paper published on quantum computing")
    broker.route_message(info_msg)

    print(f"\n📊 PROTOCOL STATISTICS:")
    print(f"  • Total messages processed: {len(broker.message_log)}")
    print(f"  • Active conversations: {len(set(msg.conversation_id for msg in broker.message_log))}")
    print(f"  • Message types used: {len(set(msg.performative for msg in broker.message_log))}")

    print(f"\n📋 SAMPLE ACP MESSAGE FORMAT:")
    sample_msg = assistant.send_query("agent-001", "Sample question for format demonstration")
    print(sample_msg.to_acp_format())

The demonstrate_acp function orchestrates a hands-on walkthrough of the entire ACP framework: it initializes a broker and three distinct agents (Researcher, AI Assistant, and MathBot), registers them, and illustrates three key interaction scenarios, querying for information, requesting a computation, and sharing an update. After routing each message and handling responses, it prints summary statistics on the message flow. It showcases a formatted ACP message, providing users with a clear, end-to-end example of how agents communicate under the protocol.

def setup_guide():
    print("""
🚀 GOOGLE COLAB SETUP GUIDE:

1. Get Gemini API Key: https://makersuite.google.com/app/apikey
2. Replace: GEMINI_API_KEY = "YOUR_ACTUAL_API_KEY"
3. Run: demonstrate_acp()

🔧 ACP PROTOCOL FEATURES:
• Standardized message format with required fields
• Speech act performatives (TELL, ASK, REQUEST-ACTION, etc.)
• Conversation tracking and message threading
• Error handling and acknowledgments
• Message routing and delivery confirmation

📝 EXTEND THE PROTOCOL:
```python
# Create custom agent
my_agent = ACPAgent("my-001", "CustomBot", ["custom-capability"])
broker.register_agent(my_agent)

# Send custom message
msg = my_agent.send_query("agent-001", "Your question here")
broker.route_message(msg)
```
""")


if __name__ == "__main__":
    setup_guide()
    demonstrate_acp()

Finally, the setup_guide function provides a quick-start reference for running the ACP demo in Google Colab, outlining how to obtain and configure your Gemini API key and invoke the demonstrate_acp routine. It also summarizes key protocol features, such as standardized message formats, performatives, and message routing. It provides a concise code snippet illustrating how to register custom agents and send tailored messages.

In conclusion, this tutorial implements ACP-based multi-agent systems capable of research, computation, and collaboration tasks.
The provided sample scenarios illustrate common use cases (information queries, computational requests, and fact sharing), while the broker ensures reliable message delivery and logging. Readers are encouraged to extend the framework by adding new agent capabilities, integrating domain-specific actions, or incorporating more sophisticated subscription and notification mechanisms.

Download the Notebook on GitHub. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter.

Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among audiences.
This AI Paper Introduces ARM and Ada-GRPO: Adaptive Reasoning Models for Efficient and Scalable Problem-Solving
Reasoning tasks are a fundamental aspect of artificial intelligence, encompassing areas like commonsense understanding, mathematical problem-solving, and symbolic reasoning. These tasks often involve multiple steps of logical inference, which large language models (LLMs) attempt to mimic through structured approaches such as chain-of-thought (CoT) prompting. However, as LLMs grow in size and complexity, they tend to produce longer outputs across all tasks, regardless of difficulty, leading to significant inefficiencies. The field has been striving to balance the depth of reasoning with computational cost while also ensuring that models can adapt their reasoning strategies to meet the unique needs of each problem.
A key issue with current reasoning models is the inability to tailor the reasoning process to different task complexities. Most models, including well-known ones like OpenAI’s o1 and DeepSeek-R1, apply a uniform strategy—typically relying on Long CoT across all tasks. This causes the “overthinking” problem, where models generate unnecessarily verbose explanations for simpler tasks. Not only does this waste resources, but it also degrades accuracy, as excessive reasoning can introduce irrelevant information. Approaches such as prompt-guided generation or token budget estimation have attempted to mitigate this issue. Still, these methods are limited by their dependence on predefined assumptions, which are not always reliable for diverse tasks.
Attempts to address these issues include methods like GRPO (Group Relative Policy Optimization), length-penalty mechanisms, and rule-based prompt controls. While GRPO enables models to learn different reasoning strategies by rewarding correct answers, it leads to a “format collapse,” where models increasingly rely on Long CoT, crowding out more efficient formats, such as Short CoT or Direct Answer. Length-penalty techniques, such as those applied in methods like THINKPRUNE, control output length during training or inference, but often at the cost of reduced accuracy, especially in complex problem-solving tasks. These solutions struggle to achieve a consistent trade-off between reasoning effectiveness and efficiency, highlighting the need for an adaptive approach.
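For readers unfamiliar with GRPO, the sketch below shows the group-relative advantage idea in its simplest outcome-rewarded form. This is a generic illustration of the technique, not the training code from this paper, and the function and variable names are ours.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Group-relative advantages: each sampled response to the same prompt is
    scored against the mean and spread of its own group, so GRPO needs no
    separate learned value function (critic)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Outcome-only reward: 1 if the final answer is correct, 0 otherwise.
# Rewarding only correctness tends to favor whichever format is correct most
# often (usually Long CoT), which is the "format collapse" described above.
group_rewards = np.array([1.0, 0.0, 1.0, 1.0])   # four sampled responses to one prompt
print(grpo_advantages(group_rewards))            # approx. [ 0.577 -1.732  0.577  0.577]
```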
A team of researchers from Fudan University and Ohio State University introduced the Adaptive Reasoning Model (ARM), which dynamically adjusts reasoning formats based on task difficulty. ARM supports four distinct reasoning styles: Direct Answer for simple tasks, Short CoT for concise reasoning, Code for structured problem-solving, and Long CoT for deep multi-step reasoning. It operates in an Adaptive Mode by default, automatically selecting the appropriate format, and also provides Instruction-Guided and Consensus-Guided Modes for explicit control or aggregation across formats. The key innovation lies in its training process, which utilizes Ada-GRPO, an extension of GRPO that introduces a format diversity reward mechanism. This prevents the dominance of Long CoT and ensures that ARM continues to explore and use simpler reasoning formats when appropriate.
The ARM methodology is built on a two-stage framework. First, the model undergoes Supervised Fine-Tuning (SFT) with 10.8K questions, each annotated across four reasoning formats, sourced from datasets like AQuA-Rat and generated with tools such as GPT-4o and DeepSeek-R1. This stage teaches the model the structure of each reasoning format but does not instill adaptiveness. The second stage applies Ada-GRPO, where the model receives scaled rewards for using less frequent formats, such as Direct Answer or Short CoT. A decaying factor ensures that this reward gradually shifts back to accuracy as training progresses, preventing long-term bias toward inefficient exploration. This structure enables ARM to avoid format collapse and dynamically match reasoning strategies to task difficulty, achieving a balance of efficiency and performance.
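As a rough sketch of that idea (our reconstruction, with made-up function names, a linear decay schedule, and a simple rarity multiplier; the paper's exact formulation may differ), the reward for a correct answer can be scaled up when its reasoning format is rare within the sampled group, and the scaling can be annealed back toward plain accuracy over training:

```python
from collections import Counter

def ada_grpo_style_reward(correct: bool, fmt: str, format_counts: Counter,
                          group_size: int, step: int, total_steps: int) -> float:
    """Accuracy reward scaled by format rarity, with the scaling decayed toward
    1.0 as training progresses (illustrative sketch, not the paper's exact rule)."""
    accuracy = 1.0 if correct else 0.0
    rarity = group_size / max(format_counts[fmt], 1)   # rarer formats get a larger multiplier
    decay = 1.0 - step / total_steps                   # emphasize diversity early, accuracy late
    return accuracy * (1.0 + decay * (rarity - 1.0))

# Example: a correct Direct Answer in a group dominated by Long CoT rollouts.
counts = Counter({"long_cot": 6, "short_cot": 1, "direct_answer": 1})
print(ada_grpo_style_reward(True, "direct_answer", counts,
                            group_size=8, step=100, total_steps=1000))
```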
ARM demonstrated impressive results across various benchmarks, including commonsense, mathematical, and symbolic reasoning tasks. It reduced token usage by an average of 30%, with reductions as high as 70% for simpler tasks, compared to models relying solely on Long CoT. ARM achieved a 2x training speedup over GRPO-based models, accelerating model development without sacrificing accuracy. For example, ARM-7B achieved 75.9% accuracy on the challenging AIME’25 task while using 32.5% fewer tokens. ARM-14B achieved 85.6% accuracy on OpenBookQA and 86.4% accuracy on the MATH dataset, with a token usage reduction of over 30% compared to Qwen2.5SFT+GRPO models. These numbers demonstrate ARM’s ability to maintain competitive performance while delivering significant efficiency gains.
Overall, the Adaptive Reasoning Model addresses the persistent inefficiency of reasoning models by enabling the adaptive selection of reasoning formats based on task difficulty. The introduction of Ada-GRPO and the multi-format training framework ensures that models no longer waste resources on overthinking. Instead, ARM provides a flexible and practical solution for balancing accuracy and computational cost in reasoning tasks, making it a promising approach for scalable and efficient large language models.
Check out the Paper, Models on Hugging Face and Project Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter.
Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Material Science, he is exploring new advancements and creating opportunities to contribute.
Apple and Duke Researchers Present a Reinforcement Learning Approach That Enables LLMs to Provide Intermediate Answers, Enhancing Speed and Accuracy
Long CoT reasoning improves large language models’ performance on complex tasks but comes with drawbacks. The typical “think-then-answer” method slows down response times, disrupting real-time interactions like those in chatbots. It also risks inaccuracies, as errors in earlier reasoning steps can lead to a misleading final answer. Unlike humans, who often share partial thoughts or conclusions during conversations, LLMs delay responses until all reasoning is complete. While RL is commonly used to train reasoning models, it mainly rewards final answers, overlooking useful intermediate insights. There is growing interest in teaching models that alternate between thinking and answering, but this remains a challenge.
RL has become a popular method to enhance reasoning in LLMs, building on its success in aligning models with human preferences. Two common reward types guide RL: outcome-based rewards (ORM), which focus on the final answer, and process-based rewards (PRM), which provide feedback on intermediate reasoning steps. While PRMs offer more detailed supervision, they often rely on human annotation and additional models, making them complex and prone to issues like reward hacking. Separately, efforts to improve LLM reasoning have explored prompting strategies, structured reasoning, tool integration, and methods to reduce latency and improve efficiency.
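To make the distinction concrete, here is a minimal sketch (ours, not from the paper) of how the two reward types attach credit to a multi-step solution: an outcome reward scores only the final answer, while a process reward scores each intermediate step.

```python
from typing import Callable, List

def outcome_reward(final_answer: str, gold_answer: str) -> float:
    """ORM-style: a single scalar judged only on the final answer."""
    return 1.0 if final_answer.strip() == gold_answer.strip() else 0.0

def process_rewards(steps: List[str], step_is_valid: Callable[[str], bool]) -> List[float]:
    """PRM-style: one score per reasoning step. In practice the step judge is a
    learned model or human annotation; here it is just a callback."""
    return [1.0 if step_is_valid(s) else 0.0 for s in steps]

steps = ["Let x be the unknown.", "2x + 3 = 11, so 2x = 8.", "x = 4"]
print(outcome_reward("x = 4", "x = 4"))               # 1.0
print(process_rewards(steps, lambda s: "x" in s))     # [1.0, 1.0, 1.0]
```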
Researchers from Apple and Duke University introduce Interleaved Reasoning, a new RL approach that enables language models to alternate between thinking and answering when solving complex, multi-step questions. Instead of waiting until the end to respond, models provide informative intermediate answers, which improves feedback for users and guides their reasoning. Using a straightforward rule-based reward, the model is trained to produce helpful reasoning steps, leading to over 80% faster responses and up to 19.3% better accuracy. Trained only on QA and logic datasets, the method demonstrates strong generalization to more challenging benchmarks, such as MATH, GPQA, and MMLU.
The study proposes a reinforcement learning framework to train LLMs for Interleaved Reasoning, where models alternate between internal thinking and user-facing intermediate answers. Each intermediate step, or “sub-answer,” is shared once the model reaches a meaningful milestone in reasoning. A specialized training template with <think> and <answer> tags is used. The approach utilizes rule-based rewards—specifically, format, final accuracy, and conditional intermediate accuracy—to guide learning. Notably, intermediate rewards are applied only when specific criteria are met, ensuring the model prioritizes overall correctness. They also test different reward schemes, such as all-or-none, partial credit, and time-discounted rewards, to optimize the quality of reasoning.
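These reward terms can be sketched as simple rules over the generated text. The code below is our illustrative reconstruction under stated assumptions (the <think>/<answer> tag layout, equal weighting of the terms, and an exponential discount that favors earlier correct sub-answers); it is not the authors' exact implementation.

```python
import re
from typing import Callable, List

def interleaved_reward(output: str, gold_final: str, gold_intermediates: List[str],
                       check: Callable[[str, str], bool], gamma: float = 0.9) -> float:
    """Rule-based reward sketch: format + final accuracy + conditional,
    time-discounted intermediate accuracy."""
    thinks = re.findall(r"<think>(.*?)</think>", output, flags=re.S)
    answers = re.findall(r"<answer>(.*?)</answer>", output, flags=re.S)

    # 1) Format: the output must contain interleaved <think> and <answer> blocks.
    format_ok = 1.0 if thinks and answers else 0.0

    # 2) Final accuracy: the last <answer> block must match the gold final answer.
    final_ok = 1.0 if answers and check(answers[-1], gold_final) else 0.0

    # 3) Conditional intermediate accuracy: credited only when format and final
    #    answer are already correct, so partial answers cannot game the reward.
    intermediate = 0.0
    if format_ok and final_ok and gold_intermediates:
        scores = [gamma ** i * (1.0 if check(a, g) else 0.0)
                  for i, (a, g) in enumerate(zip(answers, gold_intermediates))]
        intermediate = sum(scores) / len(gold_intermediates)

    return format_ok + final_ok + intermediate
```

The gamma discount mirrors the time-discounted scheme mentioned above: a correct sub-answer that arrives earlier in the rollout is worth slightly more than one that arrives late.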
The interleaved reasoning approach was evaluated on both familiar and unfamiliar datasets using Qwen2.5 models (1.5B and 7B). Unlike traditional methods that separate thinking and answering, the interleaved method provides answers incrementally, improving both speed and usefulness. When combined with intermediate rewards, it significantly enhances model performance while reducing response delays by over 80%. Even without exposure to new domains during training, the model adapts well, showing strong generalization. These results highlight the value of interleaved reasoning in making AI systems more responsive and effective in real-world, multi-step reasoning tasks.
In conclusion, the study explores how interleaved reasoning—where models alternate between reasoning and generating intermediate answers—can significantly improve performance and responsiveness. Using the Qwen2.5-1.5B model, the authors show that providing timely intermediate feedback during training boosts accuracy and accelerates response generation. Different RL strategies were tested, with PPO showing stable results, and conditional, time-discounted rewards proving to be the most effective. The method scales well to complex tasks and outperforms traditional think-then-answer baselines. Unlike token-level reward models, this approach employs simple rule-based rewards after completing full reasoning steps, thereby avoiding reward hacking. Ultimately, interleaved reasoning enhances reasoning quality and efficiency without relying on external tools.
Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter.
Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.
#apple #duke #researchers #present #reinforcementApple and Duke Researchers Present a Reinforcement Learning Approach That Enables LLMs to Provide Intermediate Answers, Enhancing Speed and AccuracyLong CoT reasoning improves large language models’ performance on complex tasks but comes with drawbacks. The typical “think-then-answer” method slows down response times, disrupting real-time interactions like those in chatbots. It also risks inaccuracies, as errors in earlier reasoning steps can lead to a misleading final answer. Unlike humans, who often share partial thoughts or conclusions during conversations, LLMs delay responses until all reasoning is complete. While RL is commonly used to train reasoning models, it mainly rewards final answers, overlooking useful intermediate insights. There is growing interest in teaching models that alternate between thinking and answering, but this remains a challenge. RL has become a popular method to enhance reasoning in LLMs, building on its success in aligning models with human preferences. Two common reward types guide RL: outcome-based rewards, which focus on the final answer, and process-based rewards, which provide feedback on intermediate reasoning steps. While PRMs offer more detailed supervision, they often rely on human annotation and additional models, making them complex and prone to issues like reward hacking. Separately, efforts to improve LLM reasoning have explored prompting strategies, structured reasoning, tool integration, and methods to reduce latency and improve efficiency. Researchers from Apple and Duke University introduce Interleaved Reasoning, a new RL approach that enables language models to alternate between thinking and answering when solving complex, multi-step questions. Instead of waiting until the end to respond, models provide informative intermediate answers, which improves feedback for users and guides their reasoning. Using a straightforward rule-based reward, the model is trained to produce helpful reasoning steps, leading to over 80% faster responses and up to 19.3% better accuracy. Trained only on QA and logic datasets, the method demonstrates strong generalization to more challenging benchmarks, such as MATH, GPQA, and MMLU. The study proposes a reinforcement learning framework to train LLMs for Interleaved Reasoning, where models alternate between internal thinking and user-facing intermediate answers. Each intermediate step, or “sub-answer,” is shared once the model reaches a meaningful milestone in reasoning. A specialized training template with <think> and <answer> tags is used. The approach utilizes rule-based rewards—specifically, format, final accuracy, and conditional intermediate accuracy—to guide learning. Notably, intermediate rewards are applied only when specific criteria are met, ensuring the model prioritizes overall correctness. They also test different reward schemes, such as all-or-none, partial credit, and time-discounted rewards, to optimize the quality of reasoning. The interleaved reasoning approach was evaluated on both familiar and unfamiliar datasets using Qwen2.5 models. Unlike traditional methods that separate thinking and answering, the interleaved method provides answers incrementally, improving both speed and usefulness. When combined with intermediate rewards, it significantly enhances model performance while reducing response delays by over 80%. Even without exposure to new domains during training, the model adapts well, showing strong generalization. 
The interleaved reasoning approach was evaluated on both familiar and unfamiliar datasets using Qwen2.5 models (1.5B and 7B). Unlike traditional methods that separate thinking and answering, the interleaved method provides answers incrementally, improving both speed and usefulness. When combined with intermediate rewards, it significantly enhances model performance while reducing response delays by over 80%. Even without exposure to new domains during training, the model adapts well, showing strong generalization. These results highlight the value of interleaved reasoning in making AI systems more responsive and effective in real-world, multi-step reasoning tasks.
In conclusion, the study explores how interleaved reasoning—where models alternate between reasoning and generating intermediate answers—can significantly improve performance and responsiveness. Using the Qwen2.5-1.5B model, the authors show that providing timely intermediate feedback during training boosts accuracy and accelerates response generation. Different RL strategies were tested, with PPO showing stable results, and conditional, time-discounted rewards proving to be the most effective. The method scales well to complex tasks and outperforms traditional think-then-answer baselines. Unlike token-level reward models, this approach employs simple rule-based rewards after completing full reasoning steps, thereby avoiding reward hacking. Ultimately, interleaved reasoning enhances reasoning quality and efficiency without relying on external tools.
Check out the Paper. All credit for this research goes to the researchers of this project.
This AI Paper Introduces WEB-SHEPHERD: A Process Reward Model for Web Agents with 40K Dataset and 10× Cost Efficiency
Web navigation focuses on teaching machines how to interact with websites to perform tasks such as searching for information, shopping, or booking services. Building a capable web navigation agent is a complex task because it requires understanding the structure of websites, interpreting user goals, and making a series of decisions across multiple steps. These tasks are further complicated by the need for agents to adapt to dynamic web environments, where content can change frequently and where multimodal information, such as text and images, must be understood together.
A key problem in web navigation is the absence of reliable and detailed reward models that can guide agents in real-time. Existing methods primarily rely on multimodal large language models (MLLMs) like GPT-4o and GPT-4o-mini as evaluators, which are expensive, slow, and often inaccurate, especially when handling long sequences of actions in multi-step tasks. These models use prompting-based evaluation or binary success/failure feedback but fail to provide step-level guidance, often leading to errors such as repeated actions or missing critical steps like clicking specific buttons or filling form fields. This limitation reduces the practicality of deploying web agents in real-world scenarios, where efficiency, accuracy, and cost-effectiveness are crucial.
The research team from Yonsei University and Carnegie Mellon University introduced WEB-SHEPHERD, a process reward model (PRM) specifically designed for web navigation tasks. WEB-SHEPHERD is the first model to evaluate web navigation agents at the step level, using structured checklists to guide assessments. The researchers also developed the WEBPRM COLLECTION, a dataset of 40,000 step-level annotated web navigation tasks, and the WEBREWARDBENCH benchmark for evaluating PRMs. These resources were designed to enable WEB-SHEPHERD to provide detailed feedback by breaking down complex tasks into smaller, measurable subgoals.
WEB-SHEPHERD works by generating a checklist for each task based on the user’s instruction, such as “Search for product” or “Click on product page,” and evaluating the agent’s progress against these subgoals. The model uses next-token prediction to generate feedback and assigns rewards based on checklist completion. This process enables WEB-SHEPHERD to assess the correctness of each step with fine-grained judgment. The model estimates the reward for each step by combining the probabilities of the “Yes,” “No,” and “In Progress” tokens and averaging them across the checklist. This detailed scoring system enables agents to receive targeted feedback on their progress, enhancing their ability to navigate complex websites.
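The checklist scoring described above can be illustrated with a small sketch. The helper signature, the half-credit weighting for “In Progress,” and the example probabilities are assumptions made for exposition; WEB-SHEPHERD derives these token probabilities from its own next-token prediction head rather than receiving them as inputs.

def checklist_reward(item_token_probs):
    """Toy checklist scoring: item_token_probs is a list of dicts, one per checklist item,
    mapping the judgment tokens 'Yes', 'No', 'In Progress' to their predicted probabilities."""
    scores = []
    for probs in item_token_probs:
        # Assumed credit scheme: full credit for 'Yes', half credit for 'In Progress', none for 'No'.
        score = 1.0 * probs.get("Yes", 0.0) + 0.5 * probs.get("In Progress", 0.0)
        scores.append(score)
    # Average the per-item scores across the checklist to get the step-level reward.
    return sum(scores) / len(scores) if scores else 0.0

# Example: three checklist items such as "Search for product", "Open product page", "Add to cart".
reward = checklist_reward([
    {"Yes": 0.9, "No": 0.05, "In Progress": 0.05},
    {"Yes": 0.2, "No": 0.1, "In Progress": 0.7},
    {"Yes": 0.05, "No": 0.9, "In Progress": 0.05},
])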
The researchers demonstrated that WEB-SHEPHERD significantly outperforms existing models. On the WEBREWARDBENCH benchmark, WEB-SHEPHERD achieved a Mean Reciprocal Rank (MRR) score of 87.6% and a trajectory accuracy of 55% in the text-only setting, compared to GPT-4o-mini’s 47.5% MRR and 0% trajectory accuracy without checklists. When tested in WebArena-lite using GPT-4o-mini as the policy model, WEB-SHEPHERD achieved a 34.55% success rate, which is 10.9 points higher than using GPT-4o-mini as the evaluator, while also being ten times more cost-efficient. In ablation studies, the researchers observed that WEB-SHEPHERD’s performance dropped significantly when checklists or feedback were removed, proving their importance for accurate reward assignments. They also showed that multimodal input, surprisingly, did not always improve performance and sometimes introduced noise.
This research highlights the critical role of detailed process-level rewards in building reliable web agents. The team’s work addresses the core challenge of web navigation—evaluating complex, multi-step actions—and offers a solution that is both scalable and cost-effective. With WEB-SHEPHERD, agents can now receive accurate feedback during navigation, enabling them to make better decisions and complete tasks more effectively.
Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.
Qwen Researchers Propose QwenLong-L1: A Reinforcement Learning Framework for Long-Context Reasoning in Large Language Models
While large reasoning models (LRMs) have shown impressive capabilities in short-context reasoning through reinforcement learning (RL), these gains do not generalize well to long-context scenarios. Applications such as multi-document QA, research synthesis, and legal or financial analysis require models to process and reason over sequences exceeding 100K tokens. However, RL optimization in such regimes is plagued by slower reward convergence, unstable policy updates due to KL divergence fluctuations, and reduced exploration resulting from entropy collapse. These bottlenecks reveal a fundamental gap in transitioning LRMs from short-context proficiency to long-context generalization.
QwenLong-L1: A Structured RL Framework for Long-Context Adaptation
To address these limitations, the Qwen Research team introduces QwenLong-L1, a novel RL framework designed to adapt LRMs to long-context reasoning tasks. The framework is structured into three key stages:
Warm-up Supervised Fine-Tuning (SFT): Provides a stable initialization for the policy model by training on curated question-context-answer triplets, ensuring basic competence in contextual comprehension and answer extraction.
Curriculum-Guided Phased Reinforcement Learning: Introduces a staged training process with gradually increasing context lengths. This progression enables the model to incrementally acquire long-context reasoning behaviors without destabilizing policy updates.
Difficulty-Aware Retrospective Sampling: Enhances exploration by maintaining and reusing hard examples from previous phases, weighted by their difficulty, to encourage deeper reasoning and robustness across diverse inputs.
These stages are complemented by hybrid reward mechanisms—combining rule-based exact match verification with semantic evaluation by a lightweight LLM—ensuring both precision and recall during policy training.
Technical Design and Methodological Advantages
QwenLong-L1 integrates recent advances in group-relative RL optimization, specifically GRPO and DAPO, to mitigate the computational overhead associated with long-context value estimation:
GRPO estimates advantage by normalizing rewards within sampled groups, eliminating the need for a separate value network and encouraging diverse generation patterns; a minimal sketch of this group-relative computation follows the DAPO point below.
DAPO incorporates mechanisms such as dynamic sampling, overlength penalty shaping, and asymmetric clipping thresholds to prevent entropy collapse and mitigate length biases during training.
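For intuition, here is a minimal sketch of the group-relative advantage estimate that GRPO relies on, computed purely from the rewards of one sampled group of responses. The epsilon term and standard-deviation normalization are common conventions assumed here rather than details taken from the QwenLong-L1 paper.

def group_relative_advantages(group_rewards, eps=1e-6):
    """Advantage of each sampled response, normalized against its own group (GRPO-style)."""
    n = len(group_rewards)
    mean = sum(group_rewards) / n
    var = sum((r - mean) ** 2 for r in group_rewards) / n
    std = var ** 0.5
    # Each response is scored relative to its group, so no separate value network is needed.
    return [(r - mean) / (std + eps) for r in group_rewards]

# Example: four rollouts sampled for the same long-context question.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.0])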
The reward function is defined as the maximum of two signals: a deterministic rule-based match and a semantic judgment from a compact evaluator model (e.g., Qwen2.5-1.5B). This hybrid approach avoids overfitting to rigid formats while maintaining answer correctness across varied notations and phrasings.
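A minimal sketch of that hybrid signal is shown below. The normalization helper and the way the judge model is queried are illustrative assumptions; the point taken from the paper is only that the final reward is the maximum of the deterministic match and the semantic judgment.

def hybrid_reward(prediction, reference, semantic_judge):
    """Hybrid reward: max of a rule-based exact match and a semantic judgment in [0, 1].
    semantic_judge is assumed to be a callable wrapping a small evaluator LLM."""
    normalize = lambda s: " ".join(s.lower().split())
    rule_score = 1.0 if normalize(prediction) == normalize(reference) else 0.0
    judge_score = semantic_judge(prediction, reference)  # assumed to return a float in [0, 1]
    # Taking the maximum avoids penalizing correct answers written in a different notation or phrasing.
    return max(rule_score, judge_score)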
Moreover, the framework is optimized via progressive context scaling, where the RL process transitions from 20K-token to 60K-token input lengths in controlled phases, stabilizing training dynamics and facilitating policy generalization.
Experimental Results and Benchmark Performance
QwenLong-L1 was evaluated on seven long-context document QA benchmarks, including DocMath, Frames, 2WikiMultihopQA, HotpotQA, Musique, NarrativeQA, and Qasper. The 32B variant, QwenLong-L1-32B, demonstrated strong empirical performance:
It outperformed baseline models such as R1-Distill-Qwen-32B by 5.1 points and exceeded leading proprietary systems like OpenAI-o3-mini and Qwen3-235B-A22B.
Its performance was comparable to Claude-3.7-Sonnet-Thinking, indicating competitive reasoning capabilities under extreme context lengths.
Pass@K analysis revealed consistent improvements with increased sampling, achieving a Pass@2 average of 73.7, surpassing DeepSeek-R1 and OpenAI-o1-preview, even at low sampling rates.
Ablation studies further validated the individual contributions of SFT, phased RL, and retrospective sampling. Notably, RL played a decisive role in enabling emergent reasoning behaviors such as grounding, subgoal setting, verification, and backtracking—traits not effectively induced by supervised fine-tuning alone.
Conclusion
QwenLong-L1 represents a systematic approach to equipping LRMs with robust long-context reasoning capabilities through reinforcement learning. Its design effectively bridges the gap between short-context expertise and the demands of information-dense environments by combining supervised initialization, curriculum-driven context scaling, and hybrid evaluation strategies. The framework not only achieves state-of-the-art results across long-context benchmarks but also demonstrates the emergence of interpretable reasoning patterns during training.
Check out the Paper, Model on Hugging Face and GitHub Page. All credit for this research goes to the researchers of this project.
Step-by-Step Guide to Creating Synthetic Data Using the Synthetic Data Vault (SDV)
Real-world data is often costly, messy, and limited by privacy rules. Synthetic data offers a solution—and it’s already widely used:
LLMs train on AI-generated text
Fraud systems simulate edge cases
Vision models pretrain on fake images
SDV (Synthetic Data Vault) is an open-source Python library that generates realistic tabular data using machine learning. It learns patterns from real data and creates high-quality synthetic data for safe sharing, testing, and model training.
In this tutorial, we’ll use SDV to generate synthetic data step by step.
We will first install the sdv library:

pip install sdv
from sdv.io.local import CSVHandler

connector = CSVHandler()
FOLDER_NAME = '.'  # If the data is in the same directory

data = connector.read(folder_name=FOLDER_NAME)
salesDf = data['data']

Next, we import the necessary module and connect to our local folder containing the dataset files. This reads the CSV files from the specified folder and stores them as pandas DataFrames. In this case, we access the main dataset using data['data'].
from sdv.metadata import Metadata

metadata = Metadata.load_from_json('metadata.json')

We now import the metadata for our dataset. This metadata is stored in a JSON file and tells SDV how to interpret your data. It includes:
The table name
The primary key
The data type of each column (e.g., categorical, numerical, datetime, etc.)
Optional column formats like datetime patterns or ID patterns
Table relationships (for multi-table setups)

Here is a sample metadata.json format:
{
"METADATA_SPEC_VERSION": "V1",
"tables": {
"your_table_name": {
"primary_key": "your_primary_key_column",
"columns": {
"your_primary_key_column": { "sdtype": "id", "regex_format": "T{6}" },
"date_column": { "sdtype": "datetime", "datetime_format": "%d-%m-%Y" },
"category_column": { "sdtype": "categorical" },
"numeric_column": { "sdtype": "numerical" }
},
"column_relationships":}
}
}
from sdv.metadata import Metadata
metadata = Metadata.detect_from_dataframes(data)

Alternatively, we can use the SDV library to automatically infer the metadata. However, the results may not always be accurate or complete, so you might need to review and update it if there are any discrepancies.
from sdv.single_table import GaussianCopulaSynthesizer
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(data=salesDf)
synthetic_data = synthesizer.sample(num_rows=10000)

With the metadata and original dataset ready, we can now use SDV to train a model and generate synthetic data. The model learns the structure and patterns in your real dataset and uses that knowledge to create synthetic records.
You can control how many rows to generate using the num_rows argument.
from sdv.evaluation.single_table import evaluate_quality
quality_report = evaluate_quality(salesDf, synthetic_data, metadata)

The SDV library also provides tools to evaluate the quality of your synthetic data by comparing it to the original dataset. A great place to start is by generating a quality report.
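If you want to inspect the report programmatically, recent SDV releases expose an overall score and per-property breakdowns on the report object. The property name used below ('Column Shapes') is an assumption based on common SDV usage and may vary by version.

# Overall quality score in [0, 1]; higher means the synthetic data better matches the real data.
print(quality_report.get_score())

# Per-property details, e.g., how closely individual column distributions are reproduced.
print(quality_report.get_details(property_name='Column Shapes'))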
You can also visualize how the synthetic data compares to the real data using SDV’s built-in plotting tools. For example, import get_column_plot from sdv.evaluation.single_table to create comparison plots for specific columns:
from sdv.evaluation.single_table import get_column_plot
fig = get_column_plot(
    real_data=salesDf,
    synthetic_data=synthetic_data,
    column_name='Sales',
    metadata=metadata
)
fig.show()

We can observe that the distribution of the ‘Sales’ column in the real and synthetic data is very similar. To explore further, we can use matplotlib to create more detailed comparisons—such as visualizing the average monthly sales trends across both datasets.
import pandas as pd
import matplotlib.pyplot as plt
# Ensure 'Date' columns are datetime
salesDf['Date'] = pd.to_datetime(salesDf['Date'], format='%d-%m-%Y')
synthetic_data['Date'] = pd.to_datetime(synthetic_data['Date'], format='%d-%m-%Y')

# Extract 'Month' as year-month string
salesDf['Month'] = salesDf['Date'].dt.to_period('M').astype(str)
synthetic_data['Month'] = synthetic_data['Date'].dt.to_period('M').astype(str)

# Group by 'Month' and calculate average sales
actual_avg_monthly = salesDf.groupby('Month')['Sales'].mean().rename('Actual Average Sales')
synthetic_avg_monthly = synthetic_data.groupby('Month')['Sales'].mean().rename('Synthetic Average Sales')

# Merge the two series into a DataFrame
avg_monthly_comparison = pd.concat([actual_avg_monthly, synthetic_avg_monthly], axis=1).fillna(0)

# Plot
plt.figure(figsize=(10, 6))
plt.plot(avg_monthly_comparison.index, avg_monthly_comparison['Actual Average Sales'], label='Actual Average Sales', marker='o')
plt.plot(avg_monthly_comparison.index, avg_monthly_comparison['Synthetic Average Sales'], label='Synthetic Average Sales', marker='o')
plt.title('Average Monthly Sales Comparison: Actual vs Synthetic')
plt.xlabel('Month')
plt.ylabel('Average Sales')
plt.xticks(rotation=45)
plt.grid(True)
plt.legend()
plt.ylim(bottom=0)  # y-axis starts at 0
plt.tight_layout()
plt.show()

This chart also shows that the average monthly sales in both datasets are very similar, with only minimal differences.
In this tutorial, we demonstrated how to prepare your data and metadata for synthetic data generation using the SDV library. By training a model on your original dataset, SDV can create high-quality synthetic data that closely mirrors the real data’s patterns and distributions. We also explored how to evaluate and visualize the synthetic data, confirming that key metrics like sales distributions and monthly trends remain consistent. Synthetic data offers a powerful way to overcome privacy and availability challenges while enabling robust data analysis and machine learning workflows.
Check out the Notebook on GitHub. All credit for this research goes to the researchers of this project.
NVIDIA Releases Llama Nemotron Nano 4B: An Efficient Open Reasoning Model Optimized for Edge AI and Scientific Tasks
NVIDIA has released Llama Nemotron Nano 4B, an open-source reasoning model designed to deliver strong performance and efficiency across scientific tasks, programming, symbolic math, function calling, and instruction following—while being compact enough for edge deployment. With just 4 billion parameters, it achieves higher accuracy and up to 50% greater throughput than comparable open models with up to 8 billion parameters, according to internal benchmarks.
The model is positioned as a practical foundation for deploying language-based AI agents in resource-constrained environments. By focusing on inference efficiency, Llama Nemotron Nano 4B addresses a growing demand for compact models capable of supporting hybrid reasoning and instruction-following tasks outside traditional cloud settings.
Model Architecture and Training Stack
Nemotron Nano 4B builds upon the Llama 3.1 architecture and shares lineage with NVIDIA’s earlier “Minitron” family. The architecture follows a dense, decoder-only transformer design. The model has been optimized for performance in reasoning-intensive workloads while maintaining a lightweight parameter count.
The post-training stack for the model includes multi-stage supervised fine-tuning on curated datasets for mathematics, coding, reasoning tasks, and function calling. In addition to traditional supervised learning, Nemotron Nano 4B has undergone reinforcement learning optimization using Reward-aware Preference Optimization (RPO), a method intended to enhance the model’s utility in chat-based and instruction-following environments.
This combination of instruction tuning and reward modeling helps align the model’s outputs more closely with user intent, particularly in multi-turn reasoning scenarios. The training approach reflects NVIDIA’s emphasis on aligning smaller models to practical usage tasks that traditionally require significantly larger parameter sizes.
Performance Benchmarks
Despite its compact footprint, Nemotron Nano 4B exhibits robust performance in both single-turn and multi-turn reasoning tasks. According to NVIDIA, it provides 50% higher inference throughput compared to similar open-weight models within the 8B parameter range. The model supports a context window of up to 128,000 tokens, which is particularly useful for tasks involving long documents, nested function calls, or multi-hop reasoning chains.
While NVIDIA has not disclosed full benchmark tables in the Hugging Face documentation, the model reportedly outperforms other open alternatives in benchmarks across math, code generation, and function calling precision. Its throughput advantage suggests it can serve as a viable default for developers targeting efficient inference pipelines with moderately complex workloads.
Edge-Ready Deployment
One of the core differentiators of Nemotron Nano 4B is its focus on edge deployment. The model has been explicitly tested and optimized to run efficiently on NVIDIA Jetson platforms and NVIDIA RTX GPUs. This enables real-time reasoning capabilities on low-power embedded devices, including robotics systems and autonomous edge agents, as well as on local developer workstations.
For enterprises and research teams concerned with privacy and deployment control, the ability to run advanced reasoning models locally—without relying on cloud inference APIs—can provide both cost savings and greater flexibility.
Licensing and Access
The model is released under the NVIDIA Open Model License, which permits commercial usage. It is available through Hugging Face at huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1, with all relevant model weights, configuration files, and tokenizer artifacts openly accessible. The license structure aligns with NVIDIA’s broader strategy of supporting developer ecosystems around its open models.
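For developers who want to try the checkpoint directly, the snippet below is a minimal sketch that loads the published model ID through the standard Hugging Face transformers interface. The model ID comes from the article; the chat-template call, generation settings, and dtype/device options are assumptions to verify against the model card rather than instructions from NVIDIA.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Model ID as published on Hugging Face (from the article); loading details below are assumptions.
# trust_remote_code=True may be required and device_map="auto" needs the accelerate package; see the model card.
model_id = "nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# A single-turn prompt; the model card documents the exact system-prompt conventions for reasoning mode.
messages = [{"role": "user", "content": "What is the derivative of x**3 + 2*x?"}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))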
Conclusion
Nemotron Nano 4B represents NVIDIA’s continued investment in bringing scalable, practical AI models to a broader development audience—especially those targeting edge or cost-sensitive deployment scenarios. While the field continues to see rapid progress in ultra-large models, compact and efficient models like Nemotron Nano 4B provide a counterbalance, enabling deployment flexibility without compromising too heavily on performance.
Check out the Model on Hugging Face. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter.
This AI Paper Introduces GRIT: A Method for Teaching MLLMs to Reason with Images by Interleaving Text and Visual Grounding
The core idea of Multimodal Large Language Models (MLLMs) is to create models that can combine the richness of visual content with the logic of language. However, despite advances in this field, many models struggle to connect the two domains effectively, leading to limited performance in complex reasoning tasks that involve visual components.
A major challenge in building such models is their limited ability to combine visual understanding with logical thinking. Current systems often produce textual outputs that explain reasoning but fail to reference the specific parts of an image they rely on. This creates a gap where models may arrive at an answer without clearly showing how the visual evidence contributed to their decision. It’s also difficult to ensure that models generate visual reasoning steps directly connecting to their answers. The fundamental problem lies in how to naturally train models to interleave text and image reasoning without needing large datasets annotated with visual references, which are scarce and expensive to produce.
Existing methods try to address this by using reinforcement learning or prompting strategies. Some systems generate bounding box coordinates as answers, while others produce step-by-step textual reasoning chains. However, these approaches have limitations. Models that only produce bounding boxes lack explanation, while those generating only text risk ignoring visual evidence. Previous methods often separate visual grounding and reasoning, making it hard for models to explain why a particular visual element leads to a certain conclusion. While some models use dense supervision data or additional tools, they generally require heavy annotation and do not scale well. This makes it difficult for developers to create models that can explain their reasoning transparently and handle various visual tasks with minimal data.
Researchers from UC Santa Cruz and eBay introduced a new method called Grounded Reasoning with Images and Text (GRIT) that allows MLLMs like Qwen 2.5-VL and InternVL 3 to generate reasoning chains that mix natural language with explicit bounding box coordinates pointing to relevant image regions. This unified approach enables models to reason about and visually ground their answers without requiring dense annotations or labeled reasoning chains. GRIT also uses a lightweight reinforcement learning algorithm called GRPO-GR, which optimizes both the accuracy of the final answer and the structure of the reasoning, encouraging models to include specific tokens like <think> and <rethink>, as well as bounding box formats. This design eliminates the need for costly annotated data while ensuring that models learn to reference visual content meaningfully within their logical steps.
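To make the reward structure more concrete, here is a small, hypothetical sketch of a format check in the spirit of GRPO-GR's structure reward: it only verifies that an output contains a reasoning span and a parseable bounding box. The token names come from the article; the weighting and exact criteria are illustrative assumptions, not the paper's implementation.

import re

def format_reward(output: str) -> float:
    """Hypothetical structure reward: checks that the output contains a <think>...</think>
    reasoning span and at least one 4-number bounding box. The real GRPO-GR reward terms
    (and the answer-accuracy component) may differ."""
    has_think = bool(re.search(r"<think>.*?</think>", output, flags=re.DOTALL))
    has_box = bool(re.search(r"\[\s*\d+\s*,\s*\d+\s*,\s*\d+\s*,\s*\d+\s*\]", output))
    return 0.5 * has_think + 0.5 * has_box

print(format_reward("<think>The dog sits left of the crate [12, 40, 200, 310].</think> Answer: yes"))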
The methodology in GRIT focuses on generating outputs that combine textual reasoning and visual grounding seamlessly. Instead of requiring models to process cropped images or additional visual data after generating bounding boxes, GRIT teaches models to use their internal understanding of the image. Bounding boxes are generated during the reasoning process, and models learn to reflect on these coordinates within their logical reasoning. The reinforcement learning framework rewards the correct use of bounding box formats and reasoning structure, and it guides models to produce coherent, grounded reasoning chains. GRIT demonstrates remarkable data efficiency by using only 20 image-question-answer triplets sourced from Visual Spatial Reasoning and TallyQA datasets. The model training was conducted on NVIDIA A100 GPUs, with optimization techniques like AdamW and a cosine scheduler applied over 200 training steps, which shows the method’s scalability despite limited data.
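The optimization setup mentioned above (AdamW with a cosine schedule over 200 steps) looks roughly like the following sketch. The learning rate, warmup, and the stand-in model and loss are assumptions for illustration, since the exact hyperparameters are not given here.

import torch
import torch.nn as nn
from transformers import get_cosine_schedule_with_warmup

model = nn.Linear(8, 8)  # stand-in module; in GRIT this would be the MLLM being fine-tuned
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6, weight_decay=0.01)  # assumed values
scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=10, num_training_steps=200)

for step in range(200):
    loss = model(torch.randn(4, 8)).pow(2).mean()  # stand-in for the GRPO-GR objective
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()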
Performance evaluations revealed that GRIT-trained models outperform several baselines in reasoning and grounding accuracy. For example, Qwen 2.5-VL trained with GRIT achieved 72.9% answer accuracy on Visual Spatial Reasoning, 47.8% on TallyQA, and 62.8% on GQA datasets. It also reached a grounding IoU score of 0.325 on VSR and 0.447 on TallyQA. In contrast, baseline models like Direct Query or Chain-of-Thought often performed significantly lower, showing limited ability to unify reasoning with visual grounding. GRIT models demonstrated a strong correlation between visual regions and textual reasoning, producing outputs that reflected a meaningful connection between image evidence and logical thought. GRIT also showed improvements on out-of-domain benchmarks, though gains were more pronounced on in-domain data, highlighting the importance of training data diversity for broader generalization.
In conclusion, the research addressed the problem of disconnected reasoning and visual grounding in MLLMs by introducing GRIT. The method allows models to reason with images through a simple, efficient approach that requires minimal data. GRIT successfully teaches MLLMs to combine visual evidence with logical reasoning in a unified output, achieving strong performance across multiple benchmarks and demonstrating a promising step toward more interpretable AI systems.
Check out the Paper, Project, and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter.
Microsoft Releases NLWeb: An Open Project that Allows Developers to Easily Turn Any Website into an AI-Powered App with Natural Language Interfaces
Many websites lack accessible and cost-effective ways to integrate natural language interfaces, making it difficult for users to interact with site content through conversational AI. Existing solutions often depend on centralized, proprietary services or require significant technical expertise, limiting scalability and adaptability. This creates a barrier for developers who want to implement intelligent agents capable of answering questions or assisting users using the site’s data. As a result, there is a need for an open, standardized approach that allows websites to expose structured information and support natural language interactions without relying heavily on external infrastructure or high-cost models.
Building conversational interfaces for websites remains a complex challenge, often requiring custom solutions and deep technical expertise. NLWeb, developed by Microsoft researchers, aims to simplify this process by enabling sites to support natural language interactions easily. By natively integrating with the Model Context Protocol (MCP), NLWeb allows the same language interfaces to be used by both human users and AI agents. It builds on existing web standards like Schema.org and RSS—already used by millions of websites—to provide a semantic foundation that can be easily leveraged for natural language capabilities.
NLWeb is not a single tool or product but a suite of open protocols and open-source reference implementations designed to lay the groundwork for an AI-enabled web. Like HTML once did for document sharing, NLWeb envisions a shared infrastructure for integrating conversational AI into web content. Its sample code is a practical starting point rather than a final solution, encouraging community innovation and diverse implementations. This open, collaborative model draws inspiration from the early days of the internet, where shared standards and grassroots efforts drove rapid progress. NLWeb aims to do the same for AI-driven web experiences by enabling human-friendly interfaces and agent-to-agent communication through common protocols.
NLWeb consists of two main parts: a simple protocol for natural language interaction with websites, and a JSON-based response format that uses Schema.org. It includes an implementation that works well for sites structured as item lists—like products or reviews—and offers UI widgets to enable conversational access to such content. NLWeb also acts as an MCP server, letting AI models ask questions via a standardized “ask” method. Responses combine existing site data with insights from large language models, enhancing user interaction. NLWeb is open, cross-platform, and compatible with various AI models and vector databases, offering flexible integration options.
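As a concrete illustration of the semantic layer NLWeb builds on, the snippet below shows the kind of Schema.org item data (here a Product, expressed as a Python dict mirroring JSON-LD) that many sites already publish. The keys follow the public Schema.org vocabulary and the values are invented; the exact payload shape NLWeb expects is defined in the project's documentation, not here.

import json

# Hypothetical Schema.org Product record, the sort of structured item data NLWeb can reuse.
product_jsonld = {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "Trail Running Shoe",
    "description": "Lightweight shoe with a grippy outsole for wet terrain.",
    "offers": {"@type": "Offer", "price": "89.99", "priceCurrency": "USD"},
    "aggregateRating": {"@type": "AggregateRating", "ratingValue": "4.6", "reviewCount": "213"},
}

print(json.dumps(product_jsonld, indent=2))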
NLWeb offers web publishers a simple way to add conversational AI to their sites with minimal coding and without needing to build chatbots from scratch. It uses existing site data, ensuring accurate, real-time responses while keeping costs low. Publishers can choose which AI models to use and maintain control over their data. The system improves user engagement by enabling natural interactions, personalizing content, and enhancing support. Its open-source nature allows customization, and it positions websites for a future where AI agents browse and interact with the web.
In conclusion, NLWeb represents a foundational step toward a more interactive and intelligent web, where users can engage with websites through natural language rather than rigid interfaces. By combining structured data formats like Schema.org with the power of AI models, NLWeb simplifies the creation of conversational experiences. It empowers publishers to enhance their sites with minimal effort, offering benefits like improved user engagement, faster support, and personalized content delivery. As the web evolves into an ecosystem where AI agents play a growing role, NLWeb ensures that websites are not only accessible to humans but also seamlessly integrable with the agent-driven digital future.
Check out the GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter.