Search

Marktechpost AI RT hat einen Link geteilt

2025-05-30 13:55:56 ·

Apple and Duke Researchers Present a Reinforcement Learning Approach That Enables LLMs to Provide Intermediate Answers, Enhancing Speed and Accuracy

Long CoT reasoning improves large language models’ performance on complex tasks but comes with drawbacks. The typical “think-then-answer” method slows down response times, disrupting real-time interactions like those in chatbots. It also risks inaccuracies, as errors in earlier reasoning steps can lead to a misleading final answer. Unlike humans, who often share partial thoughts or conclusions during conversations, LLMs delay responses until all reasoning is complete. While RL is commonly used to train reasoning models, it mainly rewards final answers, overlooking useful intermediate insights. There is growing interest in teaching models that alternate between thinking and answering, but this remains a challenge.
RL has become a popular method to enhance reasoning in LLMs, building on its success in aligning models with human preferences. Two common reward types guide RL: outcome-based rewards, which focus on the final answer, and process-based rewards, which provide feedback on intermediate reasoning steps. While PRMs offer more detailed supervision, they often rely on human annotation and additional models, making them complex and prone to issues like reward hacking. Separately, efforts to improve LLM reasoning have explored prompting strategies, structured reasoning, tool integration, and methods to reduce latency and improve efficiency.
Researchers from Apple and Duke University introduce Interleaved Reasoning, a new RL approach that enables language models to alternate between thinking and answering when solving complex, multi-step questions. Instead of waiting until the end to respond, models provide informative intermediate answers, which improves feedback for users and guides their reasoning. Using a straightforward rule-based reward, the model is trained to produce helpful reasoning steps, leading to over 80% faster responses and up to 19.3% better accuracy. Trained only on QA and logic datasets, the method demonstrates strong generalization to more challenging benchmarks, such as MATH, GPQA, and MMLU.
The study proposes a reinforcement learning framework to train LLMs for Interleaved Reasoning, where models alternate between internal thinking and user-facing intermediate answers. Each intermediate step, or “sub-answer,” is shared once the model reaches a meaningful milestone in reasoning. A specialized training template with <think> and <answer> tags is used. The approach utilizes rule-based rewards—specifically, format, final accuracy, and conditional intermediate accuracy—to guide learning. Notably, intermediate rewards are applied only when specific criteria are met, ensuring the model prioritizes overall correctness. They also test different reward schemes, such as all-or-none, partial credit, and time-discounted rewards, to optimize the quality of reasoning.
The interleaved reasoning approach was evaluated on both familiar and unfamiliar datasets using Qwen2.5 models. Unlike traditional methods that separate thinking and answering, the interleaved method provides answers incrementally, improving both speed and usefulness. When combined with intermediate rewards, it significantly enhances model performance while reducing response delays by over 80%. Even without exposure to new domains during training, the model adapts well, showing strong generalization. These results highlight the value of interleaved reasoning in making AI systems more responsive and effective in real-world, multi-step reasoning tasks.

In conclusion, the study explores how interleaved reasoning—where models alternate between reasoning and generating intermediate answers—can significantly improve performance and responsiveness. Using the Qwen2.5-1.5B model, the authors show that providing timely intermediate feedback during training boosts accuracy and accelerates response generation. Different RL strategies were tested, with PPO showing stable results, and conditional, time-discounted rewards proving to be the most effective. The method scales well to complex tasks and outperforms traditional think-then-answer baselines. Unlike token-level reward models, this approach employs simple rule-based rewards after completing full reasoning steps, thereby avoiding reward hacking. Ultimately, interleaved reasoning enhances reasoning quality and efficiency without relying on external tools.

Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter.
Sana HassanSana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.Sana Hassanhttps://www.marktechpost.com/author/sana-hassan/National University of Singapore Researchers Introduce Dimple: A Discrete Diffusion Multimodal Language Model for Efficient and Controllable Text GenerationSana Hassanhttps://www.marktechpost.com/author/sana-hassan/LLMs Can Now Reason Beyond Language: Researchers Introduce Soft Thinking to Replace Discrete Tokens with Continuous Concept EmbeddingsSana Hassanhttps://www.marktechpost.com/author/sana-hassan/Researchers at UT Austin Introduce Panda: A Foundation Model for Nonlinear Dynamics Pretrained on 20,000 Chaotic ODE Discovered via Evolutionary SearchSana Hassanhttps://www.marktechpost.com/author/sana-hassan/Microsoft Releases NLWeb: An Open Project that Allows Developers to Easily Turn Any Website into an AI-Powered App with Natural Language Interfaces
#apple #duke #researchers #present #reinforcement

Apple and Duke Researchers Present a Reinforcement Learning Approach That Enables LLMs to Provide Intermediate Answers, Enhancing Speed and Accuracy
Long CoT reasoning improves large language models’ performance on complex tasks but comes with drawbacks. The typical “think-then-answer” method slows down response times, disrupting real-time interactions like those in chatbots. It also risks inaccuracies, as errors in earlier reasoning steps can lead to a misleading final answer. Unlike humans, who often share partial thoughts or conclusions during conversations, LLMs delay responses until all reasoning is complete. While RL is commonly used to train reasoning models, it mainly rewards final answers, overlooking useful intermediate insights. There is growing interest in teaching models that alternate between thinking and answering, but this remains a challenge. RL has become a popular method to enhance reasoning in LLMs, building on its success in aligning models with human preferences. Two common reward types guide RL: outcome-based rewards, which focus on the final answer, and process-based rewards, which provide feedback on intermediate reasoning steps. While PRMs offer more detailed supervision, they often rely on human annotation and additional models, making them complex and prone to issues like reward hacking. Separately, efforts to improve LLM reasoning have explored prompting strategies, structured reasoning, tool integration, and methods to reduce latency and improve efficiency. Researchers from Apple and Duke University introduce Interleaved Reasoning, a new RL approach that enables language models to alternate between thinking and answering when solving complex, multi-step questions. Instead of waiting until the end to respond, models provide informative intermediate answers, which improves feedback for users and guides their reasoning. Using a straightforward rule-based reward, the model is trained to produce helpful reasoning steps, leading to over 80% faster responses and up to 19.3% better accuracy. Trained only on QA and logic datasets, the method demonstrates strong generalization to more challenging benchmarks, such as MATH, GPQA, and MMLU. The study proposes a reinforcement learning framework to train LLMs for Interleaved Reasoning, where models alternate between internal thinking and user-facing intermediate answers. Each intermediate step, or “sub-answer,” is shared once the model reaches a meaningful milestone in reasoning. A specialized training template with <think> and <answer> tags is used. The approach utilizes rule-based rewards—specifically, format, final accuracy, and conditional intermediate accuracy—to guide learning. Notably, intermediate rewards are applied only when specific criteria are met, ensuring the model prioritizes overall correctness. They also test different reward schemes, such as all-or-none, partial credit, and time-discounted rewards, to optimize the quality of reasoning. The interleaved reasoning approach was evaluated on both familiar and unfamiliar datasets using Qwen2.5 models. Unlike traditional methods that separate thinking and answering, the interleaved method provides answers incrementally, improving both speed and usefulness. When combined with intermediate rewards, it significantly enhances model performance while reducing response delays by over 80%. Even without exposure to new domains during training, the model adapts well, showing strong generalization. These results highlight the value of interleaved reasoning in making AI systems more responsive and effective in real-world, multi-step reasoning tasks. In conclusion, the study explores how interleaved reasoning—where models alternate between reasoning and generating intermediate answers—can significantly improve performance and responsiveness. Using the Qwen2.5-1.5B model, the authors show that providing timely intermediate feedback during training boosts accuracy and accelerates response generation. Different RL strategies were tested, with PPO showing stable results, and conditional, time-discounted rewards proving to be the most effective. The method scales well to complex tasks and outperforms traditional think-then-answer baselines. Unlike token-level reward models, this approach employs simple rule-based rewards after completing full reasoning steps, thereby avoiding reward hacking. Ultimately, interleaved reasoning enhances reasoning quality and efficiency without relying on external tools. Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. Sana HassanSana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.Sana Hassanhttps://www.marktechpost.com/author/sana-hassan/National University of Singapore Researchers Introduce Dimple: A Discrete Diffusion Multimodal Language Model for Efficient and Controllable Text GenerationSana Hassanhttps://www.marktechpost.com/author/sana-hassan/LLMs Can Now Reason Beyond Language: Researchers Introduce Soft Thinking to Replace Discrete Tokens with Continuous Concept EmbeddingsSana Hassanhttps://www.marktechpost.com/author/sana-hassan/Researchers at UT Austin Introduce Panda: A Foundation Model for Nonlinear Dynamics Pretrained on 20,000 Chaotic ODE Discovered via Evolutionary SearchSana Hassanhttps://www.marktechpost.com/author/sana-hassan/Microsoft Releases NLWeb: An Open Project that Allows Developers to Easily Turn Any Website into an AI-Powered App with Natural Language Interfaces #apple #duke #researchers #present #reinforcement

WWW.MARKTECHPOST.COM

Apple and Duke Researchers Present a Reinforcement Learning Approach That Enables LLMs to Provide Intermediate Answers, Enhancing Speed and Accuracy

Long CoT reasoning improves large language models’ performance on complex tasks but comes with drawbacks. The typical “think-then-answer” method slows down response times, disrupting real-time interactions like those in chatbots. It also risks inaccuracies, as errors in earlier reasoning steps can lead to a misleading final answer. Unlike humans, who often share partial thoughts or conclusions during conversations, LLMs delay responses until all reasoning is complete. While RL is commonly used to train reasoning models, it mainly rewards final answers, overlooking useful intermediate insights. There is growing interest in teaching models that alternate between thinking and answering, but this remains a challenge. RL has become a popular method to enhance reasoning in LLMs, building on its success in aligning models with human preferences. Two common reward types guide RL: outcome-based rewards (ORM), which focus on the final answer, and process-based rewards (PRM), which provide feedback on intermediate reasoning steps. While PRMs offer more detailed supervision, they often rely on human annotation and additional models, making them complex and prone to issues like reward hacking. Separately, efforts to improve LLM reasoning have explored prompting strategies, structured reasoning, tool integration, and methods to reduce latency and improve efficiency. Researchers from Apple and Duke University introduce Interleaved Reasoning, a new RL approach that enables language models to alternate between thinking and answering when solving complex, multi-step questions. Instead of waiting until the end to respond, models provide informative intermediate answers, which improves feedback for users and guides their reasoning. Using a straightforward rule-based reward, the model is trained to produce helpful reasoning steps, leading to over 80% faster responses and up to 19.3% better accuracy. Trained only on QA and logic datasets, the method demonstrates strong generalization to more challenging benchmarks, such as MATH, GPQA, and MMLU. The study proposes a reinforcement learning framework to train LLMs for Interleaved Reasoning, where models alternate between internal thinking and user-facing intermediate answers. Each intermediate step, or “sub-answer,” is shared once the model reaches a meaningful milestone in reasoning. A specialized training template with <think> and <answer> tags is used. The approach utilizes rule-based rewards—specifically, format, final accuracy, and conditional intermediate accuracy—to guide learning. Notably, intermediate rewards are applied only when specific criteria are met, ensuring the model prioritizes overall correctness. They also test different reward schemes, such as all-or-none, partial credit, and time-discounted rewards, to optimize the quality of reasoning. The interleaved reasoning approach was evaluated on both familiar and unfamiliar datasets using Qwen2.5 models (1.5B and 7B). Unlike traditional methods that separate thinking and answering, the interleaved method provides answers incrementally, improving both speed and usefulness. When combined with intermediate rewards, it significantly enhances model performance while reducing response delays by over 80%. Even without exposure to new domains during training, the model adapts well, showing strong generalization. These results highlight the value of interleaved reasoning in making AI systems more responsive and effective in real-world, multi-step reasoning tasks. In conclusion, the study explores how interleaved reasoning—where models alternate between reasoning and generating intermediate answers—can significantly improve performance and responsiveness. Using the Qwen2.5-1.5B model, the authors show that providing timely intermediate feedback during training boosts accuracy and accelerates response generation. Different RL strategies were tested, with PPO showing stable results, and conditional, time-discounted rewards proving to be the most effective. The method scales well to complex tasks and outperforms traditional think-then-answer baselines. Unlike token-level reward models, this approach employs simple rule-based rewards after completing full reasoning steps, thereby avoiding reward hacking. Ultimately, interleaved reasoning enhances reasoning quality and efficiency without relying on external tools. Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. Sana HassanSana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.Sana Hassanhttps://www.marktechpost.com/author/sana-hassan/National University of Singapore Researchers Introduce Dimple: A Discrete Diffusion Multimodal Language Model for Efficient and Controllable Text GenerationSana Hassanhttps://www.marktechpost.com/author/sana-hassan/LLMs Can Now Reason Beyond Language: Researchers Introduce Soft Thinking to Replace Discrete Tokens with Continuous Concept EmbeddingsSana Hassanhttps://www.marktechpost.com/author/sana-hassan/Researchers at UT Austin Introduce Panda: A Foundation Model for Nonlinear Dynamics Pretrained on 20,000 Chaotic ODE Discovered via Evolutionary SearchSana Hassanhttps://www.marktechpost.com/author/sana-hassan/Microsoft Releases NLWeb: An Open Project that Allows Developers to Easily Turn Any Website into an AI-Powered App with Natural Language Interfaces

·53 Ansichten

Please log in to like, share and comment!
Marktechpost AI RT hat einen Link geteilt

2025-05-27 16:55:49 ·

Qwen Researchers Proposes QwenLong-L1: A Reinforcement Learning Framework for Long-Context Reasoning in Large Language Models

While large reasoning modelshave shown impressive capabilities in short-context reasoning through reinforcement learning, these gains do not generalize well to long-context scenarios. Applications such as multi-document QA, research synthesis, and legal or financial analysis require models to process and reason over sequences exceeding 100K tokens. However, RL optimization in such regimes is plagued by slower reward convergence, unstable policy updates due to KL divergence fluctuations, and reduced exploration resulting from entropy collapse. These bottlenecks reveal a fundamental gap in transitioning LRMs from short-context proficiency to long-context generalization.
QwenLong-L1: A Structured RL Framework for Long-Context Adaptation
To address these limitations, the Qwen Research team introduces QwenLong-L1, a novel RL framework designed to adapt LRMs to long-context reasoning tasks. The framework is structured into three key stages:

Warm-up Supervised Fine-Tuning: Provides a stable initialization for the policy model by training on curated question-context-answer triplets, ensuring basic competence in contextual comprehension and answer extraction.
Curriculum-Guided Phased Reinforcement Learning: Introduces a staged training process with gradually increasing context lengths. This progression enables the model to incrementally acquire long-context reasoning behaviors without destabilizing policy updates.
Difficulty-Aware Retrospective Sampling: Enhances exploration by maintaining and reusing hard examples from previous phases, weighted by their difficulty, to encourage deeper reasoning and robustness across diverse inputs.

These stages are complemented by hybrid reward mechanisms—combining rule-based exact match verification with semantic evaluation by a lightweight LLM—ensuring both precision and recall during policy training.

Technical Design and Methodological Advantages
QwenLong-L1 integrates recent advances in group-relative RL optimization, specifically GRPO and DAPO, to mitigate the computational overhead associated with long-context value estimation:

GRPO estimates advantage by normalizing rewards within sampled groups, eliminating the need for a separate value network and encouraging diverse generation patterns.
DAPO incorporates mechanisms such as dynamic sampling, overlength penalty shaping, and asymmetric clipping thresholds to prevent entropy collapse and mitigate length biases during training.

The reward function is defined as the maximum of two signals: a deterministic rule-based match and a semantic judgment from a compact evaluator model. This hybrid approach avoids overfitting to rigid formats while maintaining answer correctness across varied notations and phrasings.
Moreover, the framework is optimized via progressive context scaling, where the RL process transitions from 20K-token to 60K-token input lengths in controlled phases, stabilizing training dynamics and facilitating policy generalization.
Experimental Results and Benchmark Performance
QwenLong-L1 was evaluated on seven long-context document QA benchmarks, including DocMath, Frames, 2WikiMultihopQA, HotpotQA, Musique, NarrativeQA, and Qasper. The 32B variant, QwenLong-L1-32B, demonstrated strong empirical performance:

It outperformed baseline models such as R1-Distill-Qwen-32B by 5.1 points and exceeded leading proprietary systems like OpenAI-o3-mini and Qwen3-235B-A22B.
Its performance was comparable to Claude-3.7-Sonnet-Thinking, indicating competitive reasoning capabilities under extreme context lengths.
Pass@K analysis revealed consistent improvements with increased sampling, achieving a Pass@2 average of 73.7, surpassing DeepSeek-R1 and OpenAI-o1-preview, even at low sampling rates.

Ablation studies further validated the individual contributions of SFT, phased RL, and retrospective sampling. Notably, RL played a decisive role in enabling emergent reasoning behaviors such as grounding, subgoal setting, verification, and backtracking—traits not effectively induced by supervised fine-tuning alone.
Conclusion
QwenLong-L1 represents a systematic approach to equipping LRMs with robust long-context reasoning capabilities through reinforcement learning. Its design effectively bridges the gap between short-context expertise and the demands of information-dense environments by combining supervised initialization, curriculum-driven context scaling, and hybrid evaluation strategies. The framework not only achieves state-of-the-art results across long-context benchmarks but also demonstrates the emergence of interpretable reasoning patterns during training.

Check out the Paper, Model on Hugging Face and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter.
Asif RazzaqWebsite | + postsBioAsif Razzaq is the CEO of Marktechpost Media Inc.. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts of over 2 million monthly views, illustrating its popularity among audiences.Asif Razzaqhttps://www.marktechpost.com/author/6flvq/NVIDIA Releases Llama Nemotron Nano 4B: An Efficient Open Reasoning Model Optimized for Edge AI and Scientific TasksAsif Razzaqhttps://www.marktechpost.com/author/6flvq/A Coding Implementation to Build an AI Agent with Live Python Execution and Automated ValidationAsif Razzaqhttps://www.marktechpost.com/author/6flvq/Step-by-Step Guide to Build a Customizable Multi-Tool AI Agent with LangGraph and Claude for Dynamic Agent CreationAsif Razzaqhttps://www.marktechpost.com/author/6flvq/A Comprehensive Coding Guide to Crafting Advanced Round-Robin Multi-Agent Workflows with Microsoft AutoGen
#qwen #researchers #proposes #qwenlongl1 #reinforcement

Qwen Researchers Proposes QwenLong-L1: A Reinforcement Learning Framework for Long-Context Reasoning in Large Language Models
While large reasoning modelshave shown impressive capabilities in short-context reasoning through reinforcement learning, these gains do not generalize well to long-context scenarios. Applications such as multi-document QA, research synthesis, and legal or financial analysis require models to process and reason over sequences exceeding 100K tokens. However, RL optimization in such regimes is plagued by slower reward convergence, unstable policy updates due to KL divergence fluctuations, and reduced exploration resulting from entropy collapse. These bottlenecks reveal a fundamental gap in transitioning LRMs from short-context proficiency to long-context generalization. QwenLong-L1: A Structured RL Framework for Long-Context Adaptation To address these limitations, the Qwen Research team introduces QwenLong-L1, a novel RL framework designed to adapt LRMs to long-context reasoning tasks. The framework is structured into three key stages: Warm-up Supervised Fine-Tuning: Provides a stable initialization for the policy model by training on curated question-context-answer triplets, ensuring basic competence in contextual comprehension and answer extraction. Curriculum-Guided Phased Reinforcement Learning: Introduces a staged training process with gradually increasing context lengths. This progression enables the model to incrementally acquire long-context reasoning behaviors without destabilizing policy updates. Difficulty-Aware Retrospective Sampling: Enhances exploration by maintaining and reusing hard examples from previous phases, weighted by their difficulty, to encourage deeper reasoning and robustness across diverse inputs. These stages are complemented by hybrid reward mechanisms—combining rule-based exact match verification with semantic evaluation by a lightweight LLM—ensuring both precision and recall during policy training. Technical Design and Methodological Advantages QwenLong-L1 integrates recent advances in group-relative RL optimization, specifically GRPO and DAPO, to mitigate the computational overhead associated with long-context value estimation: GRPO estimates advantage by normalizing rewards within sampled groups, eliminating the need for a separate value network and encouraging diverse generation patterns. DAPO incorporates mechanisms such as dynamic sampling, overlength penalty shaping, and asymmetric clipping thresholds to prevent entropy collapse and mitigate length biases during training. The reward function is defined as the maximum of two signals: a deterministic rule-based match and a semantic judgment from a compact evaluator model. This hybrid approach avoids overfitting to rigid formats while maintaining answer correctness across varied notations and phrasings. Moreover, the framework is optimized via progressive context scaling, where the RL process transitions from 20K-token to 60K-token input lengths in controlled phases, stabilizing training dynamics and facilitating policy generalization. Experimental Results and Benchmark Performance QwenLong-L1 was evaluated on seven long-context document QA benchmarks, including DocMath, Frames, 2WikiMultihopQA, HotpotQA, Musique, NarrativeQA, and Qasper. The 32B variant, QwenLong-L1-32B, demonstrated strong empirical performance: It outperformed baseline models such as R1-Distill-Qwen-32B by 5.1 points and exceeded leading proprietary systems like OpenAI-o3-mini and Qwen3-235B-A22B. Its performance was comparable to Claude-3.7-Sonnet-Thinking, indicating competitive reasoning capabilities under extreme context lengths. Pass@K analysis revealed consistent improvements with increased sampling, achieving a Pass@2 average of 73.7, surpassing DeepSeek-R1 and OpenAI-o1-preview, even at low sampling rates. Ablation studies further validated the individual contributions of SFT, phased RL, and retrospective sampling. Notably, RL played a decisive role in enabling emergent reasoning behaviors such as grounding, subgoal setting, verification, and backtracking—traits not effectively induced by supervised fine-tuning alone. Conclusion QwenLong-L1 represents a systematic approach to equipping LRMs with robust long-context reasoning capabilities through reinforcement learning. Its design effectively bridges the gap between short-context expertise and the demands of information-dense environments by combining supervised initialization, curriculum-driven context scaling, and hybrid evaluation strategies. The framework not only achieves state-of-the-art results across long-context benchmarks but also demonstrates the emergence of interpretable reasoning patterns during training. Check out the Paper, Model on Hugging Face and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. Asif RazzaqWebsite | + postsBioAsif Razzaq is the CEO of Marktechpost Media Inc.. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts of over 2 million monthly views, illustrating its popularity among audiences.Asif Razzaqhttps://www.marktechpost.com/author/6flvq/NVIDIA Releases Llama Nemotron Nano 4B: An Efficient Open Reasoning Model Optimized for Edge AI and Scientific TasksAsif Razzaqhttps://www.marktechpost.com/author/6flvq/A Coding Implementation to Build an AI Agent with Live Python Execution and Automated ValidationAsif Razzaqhttps://www.marktechpost.com/author/6flvq/Step-by-Step Guide to Build a Customizable Multi-Tool AI Agent with LangGraph and Claude for Dynamic Agent CreationAsif Razzaqhttps://www.marktechpost.com/author/6flvq/A Comprehensive Coding Guide to Crafting Advanced Round-Robin Multi-Agent Workflows with Microsoft AutoGen #qwen #researchers #proposes #qwenlongl1 #reinforcement

WWW.MARKTECHPOST.COM

Qwen Researchers Proposes QwenLong-L1: A Reinforcement Learning Framework for Long-Context Reasoning in Large Language Models

While large reasoning models (LRMs) have shown impressive capabilities in short-context reasoning through reinforcement learning (RL), these gains do not generalize well to long-context scenarios. Applications such as multi-document QA, research synthesis, and legal or financial analysis require models to process and reason over sequences exceeding 100K tokens. However, RL optimization in such regimes is plagued by slower reward convergence, unstable policy updates due to KL divergence fluctuations, and reduced exploration resulting from entropy collapse. These bottlenecks reveal a fundamental gap in transitioning LRMs from short-context proficiency to long-context generalization. QwenLong-L1: A Structured RL Framework for Long-Context Adaptation To address these limitations, the Qwen Research team introduces QwenLong-L1, a novel RL framework designed to adapt LRMs to long-context reasoning tasks. The framework is structured into three key stages: Warm-up Supervised Fine-Tuning (SFT): Provides a stable initialization for the policy model by training on curated question-context-answer triplets, ensuring basic competence in contextual comprehension and answer extraction. Curriculum-Guided Phased Reinforcement Learning: Introduces a staged training process with gradually increasing context lengths. This progression enables the model to incrementally acquire long-context reasoning behaviors without destabilizing policy updates. Difficulty-Aware Retrospective Sampling: Enhances exploration by maintaining and reusing hard examples from previous phases, weighted by their difficulty, to encourage deeper reasoning and robustness across diverse inputs. These stages are complemented by hybrid reward mechanisms—combining rule-based exact match verification with semantic evaluation by a lightweight LLM—ensuring both precision and recall during policy training. Technical Design and Methodological Advantages QwenLong-L1 integrates recent advances in group-relative RL optimization, specifically GRPO and DAPO, to mitigate the computational overhead associated with long-context value estimation: GRPO estimates advantage by normalizing rewards within sampled groups, eliminating the need for a separate value network and encouraging diverse generation patterns. DAPO incorporates mechanisms such as dynamic sampling, overlength penalty shaping, and asymmetric clipping thresholds to prevent entropy collapse and mitigate length biases during training. The reward function is defined as the maximum of two signals: a deterministic rule-based match and a semantic judgment from a compact evaluator model (e.g., Qwen2.5-1.5B). This hybrid approach avoids overfitting to rigid formats while maintaining answer correctness across varied notations and phrasings. Moreover, the framework is optimized via progressive context scaling, where the RL process transitions from 20K-token to 60K-token input lengths in controlled phases, stabilizing training dynamics and facilitating policy generalization. Experimental Results and Benchmark Performance QwenLong-L1 was evaluated on seven long-context document QA benchmarks, including DocMath, Frames, 2WikiMultihopQA, HotpotQA, Musique, NarrativeQA, and Qasper. The 32B variant, QwenLong-L1-32B, demonstrated strong empirical performance: It outperformed baseline models such as R1-Distill-Qwen-32B by 5.1 points and exceeded leading proprietary systems like OpenAI-o3-mini and Qwen3-235B-A22B. Its performance was comparable to Claude-3.7-Sonnet-Thinking, indicating competitive reasoning capabilities under extreme context lengths. Pass@K analysis revealed consistent improvements with increased sampling, achieving a Pass@2 average of 73.7, surpassing DeepSeek-R1 and OpenAI-o1-preview, even at low sampling rates. Ablation studies further validated the individual contributions of SFT, phased RL, and retrospective sampling. Notably, RL played a decisive role in enabling emergent reasoning behaviors such as grounding, subgoal setting, verification, and backtracking—traits not effectively induced by supervised fine-tuning alone. Conclusion QwenLong-L1 represents a systematic approach to equipping LRMs with robust long-context reasoning capabilities through reinforcement learning. Its design effectively bridges the gap between short-context expertise and the demands of information-dense environments by combining supervised initialization, curriculum-driven context scaling, and hybrid evaluation strategies. The framework not only achieves state-of-the-art results across long-context benchmarks but also demonstrates the emergence of interpretable reasoning patterns during training. Check out the Paper, Model on Hugging Face and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. Asif RazzaqWebsite | + postsBioAsif Razzaq is the CEO of Marktechpost Media Inc.. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts of over 2 million monthly views, illustrating its popularity among audiences.Asif Razzaqhttps://www.marktechpost.com/author/6flvq/NVIDIA Releases Llama Nemotron Nano 4B: An Efficient Open Reasoning Model Optimized for Edge AI and Scientific TasksAsif Razzaqhttps://www.marktechpost.com/author/6flvq/A Coding Implementation to Build an AI Agent with Live Python Execution and Automated ValidationAsif Razzaqhttps://www.marktechpost.com/author/6flvq/Step-by-Step Guide to Build a Customizable Multi-Tool AI Agent with LangGraph and Claude for Dynamic Agent CreationAsif Razzaqhttps://www.marktechpost.com/author/6flvq/A Comprehensive Coding Guide to Crafting Advanced Round-Robin Multi-Agent Workflows with Microsoft AutoGen

4 Kommentare ·335 Ansichten

Please log in to like, share and comment!
Marktechpost AI RT hat einen Link geteilt

2025-05-25 14:44:03 ·

This AI Paper Introduces GRIT: A Method for Teaching MLLMs to Reason with Images by Interleaving Text and Visual Grounding

The core idea of Multimodal Large Language Modelsis to create models that can combine the richness of visual content with the logic of language. However, despite advances in this field, many models struggle to connect the two domains effectively, leading to limited performance in complex reasoning tasks that involve visual components.
A major challenge in building such models is their limited ability to combine visual understanding with logical thinking. Current systems often produce textual outputs that explain reasoning but fail to reference the specific parts of an image they rely on. This creates a gap where models may arrive at an answer without clearly showing how the visual evidence contributed to their decision. It’s also difficult to ensure that models generate visual reasoning steps directly connecting to their answers. The fundamental problem lies in how to naturally train models to interleave text and image reasoning without needing large datasets annotated with visual references, which are scarce and expensive to produce.

Existing methods try to address this by using reinforcement learning or prompting strategies. Some systems generate bounding box coordinates as answers, while others produce step-by-step textual reasoning chains. However, these approaches have limitations. Models that only produce bounding boxes lack explanation, while those generating only text risk ignoring visual evidence. Previous methods often separate visual grounding and reasoning, making it hard for models to explain why a particular visual element leads to a certain conclusion. While some models use dense supervision data or additional tools, they generally require heavy annotation and do not scale well. This makes it difficult for developers to create models that can explain their reasoning transparently and handle various visual tasks with minimal data.
Researchers from UC Santa Cruz and eBay introduced a new method called Grounded Reasoning with Images and Textthat allows MLLMs like Qwen 2.5-VL and InternVL 3 to generate reasoning chains that mix natural language with explicit bounding box coordinates pointing to relevant image regions. This unified approach enables models to reason about and visually ground their answers without requiring dense annotations or labeled reasoning chains. GRIT also uses a lightweight reinforcement learning algorithm called GRPO-GR, which optimizes both the accuracy of the final answer and the structure of the reasoning, encouraging models to include specific tokens like <think> and <rethink>, as well as bounding box formats. This design eliminates the need for costly annotated data while ensuring that models learn to reference visual content meaningfully within their logical steps.

The methodology in GRIT focuses on generating outputs that combine textual reasoning and visual grounding seamlessly. Instead of requiring models to process cropped images or additional visual data after generating bounding boxes, GRIT teaches models to use their internal understanding of the image. Bounding boxes are generated during the reasoning process, and models learn to reflect on these coordinates within their logical reasoning. The reinforcement learning framework rewards the correct use of bounding box formats and reasoning structure, and it guides models to produce coherent, grounded reasoning chains. GRIT demonstrates remarkable data efficiency by using only 20 image-question-answer triplets sourced from Visual Spatial Reasoning and TallyQA datasets. The model training was conducted on NVIDIA A100 GPUs, with optimization techniques like AdamW and a cosine scheduler applied over 200 training steps, which shows the method’s scalability despite limited data.
Performance evaluations revealed that GRIT-trained models outperform several baselines in reasoning and grounding accuracy. For example, Qwen 2.5-VL trained with GRIT achieved 72.9% answer accuracy on Visual Spatial Reasoning, 47.8% on TallyQA, and 62.8% on GQA datasets. It also reached a grounding IoU score of 0.325 on VSR and 0.447 on TallyQA. In contrast, baseline models like Direct Query or Chain-of-Thought often performed significantly lower, showing limited ability to unify reasoning with visual grounding. GRIT models demonstrated a strong correlation between visual regions and textual reasoning, producing outputs that reflected a meaningful connection between image evidence and logical thought. GRIT also showed improvements on out-of-domain benchmarks, though gains were more pronounced on in-domain data, highlighting the importance of training data diversity for broader generalization.

In conclusion, the research addressed the problem of disconnected reasoning and visual grounding in MLLMs by introducing GRIT. The method allows models to reason with images through a simple, efficient approach that requires minimal data. GRIT successfully teaches MLLMs to combine visual evidence with logical reasoning in a unified output, achieving strong performance across multiple benchmarks and demonstrating a promising step toward more interpretable AI systems.

Check out the Paper, Project, and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter.
NikhilNikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Material Science, he is exploring new advancements and creating opportunities to contribute.Nikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces Group Think: A Token-Level Multi-Agent Reasoning Paradigm for Faster and Collaborative LLM InferenceNikhilhttps://www.marktechpost.com/author/nikhil0980/Researchers from the National University of Singapore Introduce ‘Thinkless,’ an Adaptive Framework that Reduces Unnecessary Reasoning by up to 90% Using DeGRPONikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces MathCoder-VL and FigCodifier: Advancing Multimodal Mathematical Reasoning with Vision-to-Code AlignmentNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces PARSCALE: A Parallel Computation Method for Efficient and Scalable Language Model Deployment
#this #paper #introduces #grit #method

This AI Paper Introduces GRIT: A Method for Teaching MLLMs to Reason with Images by Interleaving Text and Visual Grounding
The core idea of Multimodal Large Language Modelsis to create models that can combine the richness of visual content with the logic of language. However, despite advances in this field, many models struggle to connect the two domains effectively, leading to limited performance in complex reasoning tasks that involve visual components. A major challenge in building such models is their limited ability to combine visual understanding with logical thinking. Current systems often produce textual outputs that explain reasoning but fail to reference the specific parts of an image they rely on. This creates a gap where models may arrive at an answer without clearly showing how the visual evidence contributed to their decision. It’s also difficult to ensure that models generate visual reasoning steps directly connecting to their answers. The fundamental problem lies in how to naturally train models to interleave text and image reasoning without needing large datasets annotated with visual references, which are scarce and expensive to produce. Existing methods try to address this by using reinforcement learning or prompting strategies. Some systems generate bounding box coordinates as answers, while others produce step-by-step textual reasoning chains. However, these approaches have limitations. Models that only produce bounding boxes lack explanation, while those generating only text risk ignoring visual evidence. Previous methods often separate visual grounding and reasoning, making it hard for models to explain why a particular visual element leads to a certain conclusion. While some models use dense supervision data or additional tools, they generally require heavy annotation and do not scale well. This makes it difficult for developers to create models that can explain their reasoning transparently and handle various visual tasks with minimal data. Researchers from UC Santa Cruz and eBay introduced a new method called Grounded Reasoning with Images and Textthat allows MLLMs like Qwen 2.5-VL and InternVL 3 to generate reasoning chains that mix natural language with explicit bounding box coordinates pointing to relevant image regions. This unified approach enables models to reason about and visually ground their answers without requiring dense annotations or labeled reasoning chains. GRIT also uses a lightweight reinforcement learning algorithm called GRPO-GR, which optimizes both the accuracy of the final answer and the structure of the reasoning, encouraging models to include specific tokens like <think> and <rethink>, as well as bounding box formats. This design eliminates the need for costly annotated data while ensuring that models learn to reference visual content meaningfully within their logical steps. The methodology in GRIT focuses on generating outputs that combine textual reasoning and visual grounding seamlessly. Instead of requiring models to process cropped images or additional visual data after generating bounding boxes, GRIT teaches models to use their internal understanding of the image. Bounding boxes are generated during the reasoning process, and models learn to reflect on these coordinates within their logical reasoning. The reinforcement learning framework rewards the correct use of bounding box formats and reasoning structure, and it guides models to produce coherent, grounded reasoning chains. GRIT demonstrates remarkable data efficiency by using only 20 image-question-answer triplets sourced from Visual Spatial Reasoning and TallyQA datasets. The model training was conducted on NVIDIA A100 GPUs, with optimization techniques like AdamW and a cosine scheduler applied over 200 training steps, which shows the method’s scalability despite limited data. Performance evaluations revealed that GRIT-trained models outperform several baselines in reasoning and grounding accuracy. For example, Qwen 2.5-VL trained with GRIT achieved 72.9% answer accuracy on Visual Spatial Reasoning, 47.8% on TallyQA, and 62.8% on GQA datasets. It also reached a grounding IoU score of 0.325 on VSR and 0.447 on TallyQA. In contrast, baseline models like Direct Query or Chain-of-Thought often performed significantly lower, showing limited ability to unify reasoning with visual grounding. GRIT models demonstrated a strong correlation between visual regions and textual reasoning, producing outputs that reflected a meaningful connection between image evidence and logical thought. GRIT also showed improvements on out-of-domain benchmarks, though gains were more pronounced on in-domain data, highlighting the importance of training data diversity for broader generalization. In conclusion, the research addressed the problem of disconnected reasoning and visual grounding in MLLMs by introducing GRIT. The method allows models to reason with images through a simple, efficient approach that requires minimal data. GRIT successfully teaches MLLMs to combine visual evidence with logical reasoning in a unified output, achieving strong performance across multiple benchmarks and demonstrating a promising step toward more interpretable AI systems. Check out the Paper, Project, and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. NikhilNikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Material Science, he is exploring new advancements and creating opportunities to contribute.Nikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces Group Think: A Token-Level Multi-Agent Reasoning Paradigm for Faster and Collaborative LLM InferenceNikhilhttps://www.marktechpost.com/author/nikhil0980/Researchers from the National University of Singapore Introduce ‘Thinkless,’ an Adaptive Framework that Reduces Unnecessary Reasoning by up to 90% Using DeGRPONikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces MathCoder-VL and FigCodifier: Advancing Multimodal Mathematical Reasoning with Vision-to-Code AlignmentNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces PARSCALE: A Parallel Computation Method for Efficient and Scalable Language Model Deployment #this #paper #introduces #grit #method

WWW.MARKTECHPOST.COM

This AI Paper Introduces GRIT: A Method for Teaching MLLMs to Reason with Images by Interleaving Text and Visual Grounding

The core idea of Multimodal Large Language Models (MLLMs) is to create models that can combine the richness of visual content with the logic of language. However, despite advances in this field, many models struggle to connect the two domains effectively, leading to limited performance in complex reasoning tasks that involve visual components. A major challenge in building such models is their limited ability to combine visual understanding with logical thinking. Current systems often produce textual outputs that explain reasoning but fail to reference the specific parts of an image they rely on. This creates a gap where models may arrive at an answer without clearly showing how the visual evidence contributed to their decision. It’s also difficult to ensure that models generate visual reasoning steps directly connecting to their answers. The fundamental problem lies in how to naturally train models to interleave text and image reasoning without needing large datasets annotated with visual references, which are scarce and expensive to produce. Existing methods try to address this by using reinforcement learning or prompting strategies. Some systems generate bounding box coordinates as answers, while others produce step-by-step textual reasoning chains. However, these approaches have limitations. Models that only produce bounding boxes lack explanation, while those generating only text risk ignoring visual evidence. Previous methods often separate visual grounding and reasoning, making it hard for models to explain why a particular visual element leads to a certain conclusion. While some models use dense supervision data or additional tools, they generally require heavy annotation and do not scale well. This makes it difficult for developers to create models that can explain their reasoning transparently and handle various visual tasks with minimal data. Researchers from UC Santa Cruz and eBay introduced a new method called Grounded Reasoning with Images and Text (GRIT) that allows MLLMs like Qwen 2.5-VL and InternVL 3 to generate reasoning chains that mix natural language with explicit bounding box coordinates pointing to relevant image regions. This unified approach enables models to reason about and visually ground their answers without requiring dense annotations or labeled reasoning chains. GRIT also uses a lightweight reinforcement learning algorithm called GRPO-GR, which optimizes both the accuracy of the final answer and the structure of the reasoning, encouraging models to include specific tokens like <think> and <rethink>, as well as bounding box formats. This design eliminates the need for costly annotated data while ensuring that models learn to reference visual content meaningfully within their logical steps. The methodology in GRIT focuses on generating outputs that combine textual reasoning and visual grounding seamlessly. Instead of requiring models to process cropped images or additional visual data after generating bounding boxes, GRIT teaches models to use their internal understanding of the image. Bounding boxes are generated during the reasoning process, and models learn to reflect on these coordinates within their logical reasoning. The reinforcement learning framework rewards the correct use of bounding box formats and reasoning structure, and it guides models to produce coherent, grounded reasoning chains. GRIT demonstrates remarkable data efficiency by using only 20 image-question-answer triplets sourced from Visual Spatial Reasoning and TallyQA datasets. The model training was conducted on NVIDIA A100 GPUs, with optimization techniques like AdamW and a cosine scheduler applied over 200 training steps, which shows the method’s scalability despite limited data. Performance evaluations revealed that GRIT-trained models outperform several baselines in reasoning and grounding accuracy. For example, Qwen 2.5-VL trained with GRIT achieved 72.9% answer accuracy on Visual Spatial Reasoning, 47.8% on TallyQA, and 62.8% on GQA datasets. It also reached a grounding IoU score of 0.325 on VSR and 0.447 on TallyQA. In contrast, baseline models like Direct Query or Chain-of-Thought often performed significantly lower, showing limited ability to unify reasoning with visual grounding. GRIT models demonstrated a strong correlation between visual regions and textual reasoning, producing outputs that reflected a meaningful connection between image evidence and logical thought. GRIT also showed improvements on out-of-domain benchmarks, though gains were more pronounced on in-domain data, highlighting the importance of training data diversity for broader generalization. In conclusion, the research addressed the problem of disconnected reasoning and visual grounding in MLLMs by introducing GRIT. The method allows models to reason with images through a simple, efficient approach that requires minimal data. GRIT successfully teaches MLLMs to combine visual evidence with logical reasoning in a unified output, achieving strong performance across multiple benchmarks and demonstrating a promising step toward more interpretable AI systems. Check out the Paper, Project, and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. NikhilNikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Material Science, he is exploring new advancements and creating opportunities to contribute.Nikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces Group Think: A Token-Level Multi-Agent Reasoning Paradigm for Faster and Collaborative LLM InferenceNikhilhttps://www.marktechpost.com/author/nikhil0980/Researchers from the National University of Singapore Introduce ‘Thinkless,’ an Adaptive Framework that Reduces Unnecessary Reasoning by up to 90% Using DeGRPONikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces MathCoder-VL and FigCodifier: Advancing Multimodal Mathematical Reasoning with Vision-to-Code AlignmentNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces PARSCALE (Parallel Scaling): A Parallel Computation Method for Efficient and Scalable Language Model Deployment

·289 Ansichten

Please log in to like, share and comment!
Marktechpost AI RT hat einen Link geteilt

2025-05-24 20:43:38 ·

Optimizing Assembly Code with LLMs: Reinforcement Learning Outperforms Traditional Compilers

LLMs have shown impressive capabilities across various programming tasks, yet their potential for program optimization has not been fully explored. While some recent efforts have used LLMs to enhance performance in languages like C++ and Python, the broader application of LLMs to optimize code, especially in low-level programming contexts, remains limited. Existing LLM benchmarks largely focus on code generation from natural language or solving GitHub issues, as seen in HumanEval, MBPP, APPS, SWE-bench, and SWE-agent. Moreover, models such as Codex, AlphaCode, and Code Llama primarily aim to improve code generation quality rather than performance. However, select research has begun addressing optimization, including parallelization and code efficiency improvements, though many of these approaches are constrained by the need for formal verification, limiting scalability.
In contrast, some newer methods embrace test-based validation, allowing optimization of more complex programs with loops. Learning-based strategies in compiler optimization—like AutoPhase, which uses reinforcement learning for pass sequencing, and Coreset, which applies graph neural networks—have shown promise in improving performance. Superoptimization techniques aim to find the most efficient version of a program but are typically restricted to small-scale problems. Additionally, frameworks like AutoTVM and Ansor have focused on optimizing GPU kernel code through statistical modeling and search. Recently, LLM-driven optimization has gained attention, with reinforcement learning approaches guiding LLMs using feedback from test cases. Techniques like CodeRL and PPOCoder leverage policy optimization methods to fine-tune models for better performance, even across resource-constrained programming languages like Verilog.
Stanford, UIUC, CMU, and Visa Research researchers explore using LLMs to optimize assembly code performance—an area traditionally handled by compilers like GCC. They introduce a reinforcement learning framework using Proximal Policy Optimization, guided by a reward balancing correctness and speedup over the gcc -O3 baseline. Using a dataset of 8,072 real-world programs, their model, Qwen2.5-Coder-7B-PPO, achieves a 96.0% test pass rate and a 1.47× average speedup, outperforming 20 other models, including Claude-3.7-sonnet. Their results show that with RL training, LLMs can effectively outperform conventional compiler optimizations.
The methodology involves optimizing compiled C programs for performance using an RL approach. Given a C program C, it is compiled to assembly P using gcc -O3. The goal is to generate a new assembly program P’ that is functionally equivalent but faster. Correctness is verified using a test set, and speedup is measured by execution time improvement. Using CodeNet as the dataset, the authors apply PPO to train a language model that generates improved code. Two reward functions—Correctness-Guided Speedup and Speedup-Only—are used to guide training based on program validity, correctness, and performance gains.
The study evaluates various language models on optimizing assembly code, revealing that most models struggle with low test pass rates and minimal speedups. However, Qwen2.5-Coder-7B-PPO, trained with reinforcement learning, significantly outperforms others, achieving 96% accuracy and a 1.47× average speedup. Ablation studies show that using gcc -O3 as a reference aids performance, while removing it leads to sharp declines. Notably, models like Claude-3.7-sonnet can surpass compilers by identifying hardware-specific optimizations, such as replacing loops with a single popcnt instruction, demonstrating their ability to perform semantic-level code transformations beyond traditional compiler capabilities.

In conclusion, the study explores using LLMs to optimize assembly code, a domain where traditional compilers struggle due to the complexity of low-level performance tuning. The authors fine-tune Qwen2.5-Coder-7B using PPO, rewarding both correctnessand speedup over gcc -O3. They introduce a benchmark of 8,072 real-world C programs to evaluate performance. The model achieves a 96.0% test pass rate and a 1.47× average speedup, outperforming 20 other models, including Claude-3.7-sonnet. While effective, limitations include a lack of formal correctness guarantees and variability in hardware performance across systems.

Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter.
Sana HassanSana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.Sana Hassanhttps://www.marktechpost.com/author/sana-hassan/Evaluating Enterprise-Grade AI Assistants: A Benchmark for Complex, Voice-Driven WorkflowsSana Hassanhttps://www.marktechpost.com/author/sana-hassan/Beyond Aha Moments: Structuring Reasoning in Large Language ModelsSana Hassanhttps://www.marktechpost.com/author/sana-hassan/RXTX: A Machine Learning-Guided Algorithm for Efficient Structured Matrix MultiplicationSana Hassanhttps://www.marktechpost.com/author/sana-hassan/From Protocol to Production: How Model Context ProtocolGateways Enable Secure, Scalable, and Seamless AI Integrations Across Enterprises
#optimizing #assembly #code #with #llms

Optimizing Assembly Code with LLMs: Reinforcement Learning Outperforms Traditional Compilers
LLMs have shown impressive capabilities across various programming tasks, yet their potential for program optimization has not been fully explored. While some recent efforts have used LLMs to enhance performance in languages like C++ and Python, the broader application of LLMs to optimize code, especially in low-level programming contexts, remains limited. Existing LLM benchmarks largely focus on code generation from natural language or solving GitHub issues, as seen in HumanEval, MBPP, APPS, SWE-bench, and SWE-agent. Moreover, models such as Codex, AlphaCode, and Code Llama primarily aim to improve code generation quality rather than performance. However, select research has begun addressing optimization, including parallelization and code efficiency improvements, though many of these approaches are constrained by the need for formal verification, limiting scalability. In contrast, some newer methods embrace test-based validation, allowing optimization of more complex programs with loops. Learning-based strategies in compiler optimization—like AutoPhase, which uses reinforcement learning for pass sequencing, and Coreset, which applies graph neural networks—have shown promise in improving performance. Superoptimization techniques aim to find the most efficient version of a program but are typically restricted to small-scale problems. Additionally, frameworks like AutoTVM and Ansor have focused on optimizing GPU kernel code through statistical modeling and search. Recently, LLM-driven optimization has gained attention, with reinforcement learning approaches guiding LLMs using feedback from test cases. Techniques like CodeRL and PPOCoder leverage policy optimization methods to fine-tune models for better performance, even across resource-constrained programming languages like Verilog. Stanford, UIUC, CMU, and Visa Research researchers explore using LLMs to optimize assembly code performance—an area traditionally handled by compilers like GCC. They introduce a reinforcement learning framework using Proximal Policy Optimization, guided by a reward balancing correctness and speedup over the gcc -O3 baseline. Using a dataset of 8,072 real-world programs, their model, Qwen2.5-Coder-7B-PPO, achieves a 96.0% test pass rate and a 1.47× average speedup, outperforming 20 other models, including Claude-3.7-sonnet. Their results show that with RL training, LLMs can effectively outperform conventional compiler optimizations. The methodology involves optimizing compiled C programs for performance using an RL approach. Given a C program C, it is compiled to assembly P using gcc -O3. The goal is to generate a new assembly program P’ that is functionally equivalent but faster. Correctness is verified using a test set, and speedup is measured by execution time improvement. Using CodeNet as the dataset, the authors apply PPO to train a language model that generates improved code. Two reward functions—Correctness-Guided Speedup and Speedup-Only—are used to guide training based on program validity, correctness, and performance gains. The study evaluates various language models on optimizing assembly code, revealing that most models struggle with low test pass rates and minimal speedups. However, Qwen2.5-Coder-7B-PPO, trained with reinforcement learning, significantly outperforms others, achieving 96% accuracy and a 1.47× average speedup. Ablation studies show that using gcc -O3 as a reference aids performance, while removing it leads to sharp declines. Notably, models like Claude-3.7-sonnet can surpass compilers by identifying hardware-specific optimizations, such as replacing loops with a single popcnt instruction, demonstrating their ability to perform semantic-level code transformations beyond traditional compiler capabilities. In conclusion, the study explores using LLMs to optimize assembly code, a domain where traditional compilers struggle due to the complexity of low-level performance tuning. The authors fine-tune Qwen2.5-Coder-7B using PPO, rewarding both correctnessand speedup over gcc -O3. They introduce a benchmark of 8,072 real-world C programs to evaluate performance. The model achieves a 96.0% test pass rate and a 1.47× average speedup, outperforming 20 other models, including Claude-3.7-sonnet. While effective, limitations include a lack of formal correctness guarantees and variability in hardware performance across systems. Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. Sana HassanSana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.Sana Hassanhttps://www.marktechpost.com/author/sana-hassan/Evaluating Enterprise-Grade AI Assistants: A Benchmark for Complex, Voice-Driven WorkflowsSana Hassanhttps://www.marktechpost.com/author/sana-hassan/Beyond Aha Moments: Structuring Reasoning in Large Language ModelsSana Hassanhttps://www.marktechpost.com/author/sana-hassan/RXTX: A Machine Learning-Guided Algorithm for Efficient Structured Matrix MultiplicationSana Hassanhttps://www.marktechpost.com/author/sana-hassan/From Protocol to Production: How Model Context ProtocolGateways Enable Secure, Scalable, and Seamless AI Integrations Across Enterprises #optimizing #assembly #code #with #llms

WWW.MARKTECHPOST.COM

Optimizing Assembly Code with LLMs: Reinforcement Learning Outperforms Traditional Compilers

LLMs have shown impressive capabilities across various programming tasks, yet their potential for program optimization has not been fully explored. While some recent efforts have used LLMs to enhance performance in languages like C++ and Python, the broader application of LLMs to optimize code, especially in low-level programming contexts, remains limited. Existing LLM benchmarks largely focus on code generation from natural language or solving GitHub issues, as seen in HumanEval, MBPP, APPS, SWE-bench, and SWE-agent. Moreover, models such as Codex, AlphaCode, and Code Llama primarily aim to improve code generation quality rather than performance. However, select research has begun addressing optimization, including parallelization and code efficiency improvements, though many of these approaches are constrained by the need for formal verification, limiting scalability. In contrast, some newer methods embrace test-based validation, allowing optimization of more complex programs with loops. Learning-based strategies in compiler optimization—like AutoPhase, which uses reinforcement learning for pass sequencing, and Coreset, which applies graph neural networks—have shown promise in improving performance. Superoptimization techniques aim to find the most efficient version of a program but are typically restricted to small-scale problems. Additionally, frameworks like AutoTVM and Ansor have focused on optimizing GPU kernel code through statistical modeling and search. Recently, LLM-driven optimization has gained attention, with reinforcement learning approaches guiding LLMs using feedback from test cases. Techniques like CodeRL and PPOCoder leverage policy optimization methods to fine-tune models for better performance, even across resource-constrained programming languages like Verilog. Stanford, UIUC, CMU, and Visa Research researchers explore using LLMs to optimize assembly code performance—an area traditionally handled by compilers like GCC. They introduce a reinforcement learning framework using Proximal Policy Optimization (PPO), guided by a reward balancing correctness and speedup over the gcc -O3 baseline. Using a dataset of 8,072 real-world programs, their model, Qwen2.5-Coder-7B-PPO, achieves a 96.0% test pass rate and a 1.47× average speedup, outperforming 20 other models, including Claude-3.7-sonnet. Their results show that with RL training, LLMs can effectively outperform conventional compiler optimizations. The methodology involves optimizing compiled C programs for performance using an RL approach. Given a C program C, it is compiled to assembly P using gcc -O3. The goal is to generate a new assembly program P’ that is functionally equivalent but faster. Correctness is verified using a test set, and speedup is measured by execution time improvement. Using CodeNet as the dataset, the authors apply PPO to train a language model that generates improved code. Two reward functions—Correctness-Guided Speedup and Speedup-Only—are used to guide training based on program validity, correctness, and performance gains. The study evaluates various language models on optimizing assembly code, revealing that most models struggle with low test pass rates and minimal speedups. However, Qwen2.5-Coder-7B-PPO, trained with reinforcement learning, significantly outperforms others, achieving 96% accuracy and a 1.47× average speedup. Ablation studies show that using gcc -O3 as a reference aids performance, while removing it leads to sharp declines. Notably, models like Claude-3.7-sonnet can surpass compilers by identifying hardware-specific optimizations, such as replacing loops with a single popcnt instruction, demonstrating their ability to perform semantic-level code transformations beyond traditional compiler capabilities. In conclusion, the study explores using LLMs to optimize assembly code, a domain where traditional compilers struggle due to the complexity of low-level performance tuning. The authors fine-tune Qwen2.5-Coder-7B using PPO, rewarding both correctness (via test cases) and speedup over gcc -O3. They introduce a benchmark of 8,072 real-world C programs to evaluate performance. The model achieves a 96.0% test pass rate and a 1.47× average speedup, outperforming 20 other models, including Claude-3.7-sonnet. While effective, limitations include a lack of formal correctness guarantees and variability in hardware performance across systems. Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. Sana HassanSana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.Sana Hassanhttps://www.marktechpost.com/author/sana-hassan/Evaluating Enterprise-Grade AI Assistants: A Benchmark for Complex, Voice-Driven WorkflowsSana Hassanhttps://www.marktechpost.com/author/sana-hassan/Beyond Aha Moments: Structuring Reasoning in Large Language ModelsSana Hassanhttps://www.marktechpost.com/author/sana-hassan/RXTX: A Machine Learning-Guided Algorithm for Efficient Structured Matrix MultiplicationSana Hassanhttps://www.marktechpost.com/author/sana-hassan/From Protocol to Production: How Model Context Protocol (MCP) Gateways Enable Secure, Scalable, and Seamless AI Integrations Across Enterprises

·269 Ansichten

Please log in to like, share and comment!
Marktechpost AI RT hat einen Link geteilt

2025-05-23 10:41:29 ·

Researchers Introduce MMLONGBENCH: A Comprehensive Benchmark for Long-Context Vision-Language Models

Recent advances in long-contextmodeling have unlocked new capabilities for LLMs and large vision-language models. Long-context vision–language modelsshow an important step forward by enabling LVLMs to process hundreds of images and thousands of interleaved text tokens in a single forward pass. However, the development of effective evaluation benchmarks lags. It is still unclear how well current LCVLMs perform in long-context settings, what tasks they struggle with, and how robust they are to input length variation. Current benchmarks face the following problem:Limited coverage of downstream tasks,Insufficient coverage of image types,Lack of context length control, andSingle context length.
Various techniques have extended context windows for LVLMs, including longer pre-training lengths, position extrapolation, and efficient architectures. Models like Gemini-2.5 and Qwen2.5-VL have adopted these approaches alongside vision token compression methods to accommodate longer sequences. For evaluation, the Needle-in-a-Haystack task became a standard benchmark for testing LC ability by inserting information at specific depths within long texts. However, existing vision-language benchmarks remain limited, focusing only on NIAH variants or long-document VQA tasks. Even MileBench contains short-context tasks with an average length of only 9K tokens, failing to evaluate true LC capabilities across diverse vision-language applications.
Researchers from HKUST, Tencent AI Seattle Lab, University of Edinburgh, Miniml.AI, and NVIDIA AI Technology Center have proposed MMLONGBENCH, the first comprehensive benchmark for evaluating LCVLMs. It comprises 13,331 examples spanning five downstream task categories, including Visual RAG and Many-Shot ICL, covering natural and synthetic image types. All examples are standardized across five input lengths from 8K to 128K tokens using a cross-modal tokenization scheme combining vision patches and text tokens. Through benchmarking 46 closed-source and open-source models, the research reveals that single-task performance poorly predicts overall LC capability, both model types struggle with LC tasks, and stronger reasoning models show better LC performance.

Researchers construct LC by inserting gold passages containing answers among large sets of distracting passages retrieved from Wikipedia. For ViQuAE, gold passages from KILT are used, while InfoSeek uses lead sections from Wikipedia entity pages. Further, Wikipedia pages are split into 100-word passages, and retrieved distractors are added until reaching desired input lengths. Many-shot in-context learning tasks utilize four diverse image classification datasets: Stanford Cars, Food101, SUN397, and iNat2021, accommodating 500 images within 128K context windows. Cross-modal token counting combines text tokens using the Llama2 tokenizer with visual tokens processed through 14×14 patches and 2×2 pixel unshuffle compression, ensuring compatibility with modern LVLMs for evaluation.
The evaluation on MMLONGBENCH across tasks and context Lengths shows that all models struggle, but closed-source models perform better. For the longest input length of 128K, all models struggle with long-context vision-language tasks, with GPT-4o achieving only 62.9 average performance. Gemini-2.5-Pro became the strongest performer, outperforming open-source models by 20 points except on ICL tasks. Further, Ovis2-34B model achieves a score of 41.6 on summarization, similar to GPT-4o. Qwen2.5-VL-32B achieves a SubEM score of 64.6 on VRAG, even better than Gemini-2.0-Flash. Models show generalization capabilities beyond their training context lengths, with Qwen2-VL-72B achieving a 51.9 average score at 128K despite a 32K training window.

In conclusion, researchers introduced MMLONGBENCH, the first comprehensive benchmark for evaluating LCVLMs across diverse downstream tasks. It provides a rigorous foundation for diagnosing frontier model capabilities by covering five distinct task categories with unified cross-modal token counting and standardized context lengths. The evaluation of 46 models demonstrates that single-task performance unreliably predicts overall long-context ability, and frontier models face significant challenges in OCR accuracy and cross-modal retrieval. MMLONGBENCH is a standard evaluation framework to drive future research toward more efficient vision-language token encodings, robust position-extrapolation schemes, and improved multi-modal retrieval and reasoning capabilities.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter.
Sajjad AnsariSajjad Ansari is a final year undergraduate from IIT Kharagpur. As a Tech enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.Sajjad Ansarihttps://www.marktechpost.com/author/sajjadansari/Enhancing Language Model Generalization: Bridging the Gap Between In-Context Learning and Fine-TuningSajjad Ansarihttps://www.marktechpost.com/author/sajjadansari/Salesforce AI Researchers Introduce UAEval4RAG: A New Benchmark to Evaluate RAG Systems’ Ability to Reject Unanswerable QueriesSajjad Ansarihttps://www.marktechpost.com/author/sajjadansari/Google Researchers Introduce LightLab: A Diffusion-Based AI Method for Physically Plausible, Fine-Grained Light Control in Single ImagesSajjad Ansarihttps://www.marktechpost.com/author/sajjadansari/DanceGRPO: A Unified Framework for Reinforcement Learning in Visual Generation Across Multiple Paradigms and Tasks
#researchers #introduce #mmlongbench #comprehensive #benchmark

Researchers Introduce MMLONGBENCH: A Comprehensive Benchmark for Long-Context Vision-Language Models
Recent advances in long-contextmodeling have unlocked new capabilities for LLMs and large vision-language models. Long-context vision–language modelsshow an important step forward by enabling LVLMs to process hundreds of images and thousands of interleaved text tokens in a single forward pass. However, the development of effective evaluation benchmarks lags. It is still unclear how well current LCVLMs perform in long-context settings, what tasks they struggle with, and how robust they are to input length variation. Current benchmarks face the following problem:Limited coverage of downstream tasks,Insufficient coverage of image types,Lack of context length control, andSingle context length. Various techniques have extended context windows for LVLMs, including longer pre-training lengths, position extrapolation, and efficient architectures. Models like Gemini-2.5 and Qwen2.5-VL have adopted these approaches alongside vision token compression methods to accommodate longer sequences. For evaluation, the Needle-in-a-Haystack task became a standard benchmark for testing LC ability by inserting information at specific depths within long texts. However, existing vision-language benchmarks remain limited, focusing only on NIAH variants or long-document VQA tasks. Even MileBench contains short-context tasks with an average length of only 9K tokens, failing to evaluate true LC capabilities across diverse vision-language applications. Researchers from HKUST, Tencent AI Seattle Lab, University of Edinburgh, Miniml.AI, and NVIDIA AI Technology Center have proposed MMLONGBENCH, the first comprehensive benchmark for evaluating LCVLMs. It comprises 13,331 examples spanning five downstream task categories, including Visual RAG and Many-Shot ICL, covering natural and synthetic image types. All examples are standardized across five input lengths from 8K to 128K tokens using a cross-modal tokenization scheme combining vision patches and text tokens. Through benchmarking 46 closed-source and open-source models, the research reveals that single-task performance poorly predicts overall LC capability, both model types struggle with LC tasks, and stronger reasoning models show better LC performance. Researchers construct LC by inserting gold passages containing answers among large sets of distracting passages retrieved from Wikipedia. For ViQuAE, gold passages from KILT are used, while InfoSeek uses lead sections from Wikipedia entity pages. Further, Wikipedia pages are split into 100-word passages, and retrieved distractors are added until reaching desired input lengths. Many-shot in-context learning tasks utilize four diverse image classification datasets: Stanford Cars, Food101, SUN397, and iNat2021, accommodating 500 images within 128K context windows. Cross-modal token counting combines text tokens using the Llama2 tokenizer with visual tokens processed through 14×14 patches and 2×2 pixel unshuffle compression, ensuring compatibility with modern LVLMs for evaluation. The evaluation on MMLONGBENCH across tasks and context Lengths shows that all models struggle, but closed-source models perform better. For the longest input length of 128K, all models struggle with long-context vision-language tasks, with GPT-4o achieving only 62.9 average performance. Gemini-2.5-Pro became the strongest performer, outperforming open-source models by 20 points except on ICL tasks. Further, Ovis2-34B model achieves a score of 41.6 on summarization, similar to GPT-4o. Qwen2.5-VL-32B achieves a SubEM score of 64.6 on VRAG, even better than Gemini-2.0-Flash. Models show generalization capabilities beyond their training context lengths, with Qwen2-VL-72B achieving a 51.9 average score at 128K despite a 32K training window. In conclusion, researchers introduced MMLONGBENCH, the first comprehensive benchmark for evaluating LCVLMs across diverse downstream tasks. It provides a rigorous foundation for diagnosing frontier model capabilities by covering five distinct task categories with unified cross-modal token counting and standardized context lengths. The evaluation of 46 models demonstrates that single-task performance unreliably predicts overall long-context ability, and frontier models face significant challenges in OCR accuracy and cross-modal retrieval. MMLONGBENCH is a standard evaluation framework to drive future research toward more efficient vision-language token encodings, robust position-extrapolation schemes, and improved multi-modal retrieval and reasoning capabilities. Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. Sajjad AnsariSajjad Ansari is a final year undergraduate from IIT Kharagpur. As a Tech enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.Sajjad Ansarihttps://www.marktechpost.com/author/sajjadansari/Enhancing Language Model Generalization: Bridging the Gap Between In-Context Learning and Fine-TuningSajjad Ansarihttps://www.marktechpost.com/author/sajjadansari/Salesforce AI Researchers Introduce UAEval4RAG: A New Benchmark to Evaluate RAG Systems’ Ability to Reject Unanswerable QueriesSajjad Ansarihttps://www.marktechpost.com/author/sajjadansari/Google Researchers Introduce LightLab: A Diffusion-Based AI Method for Physically Plausible, Fine-Grained Light Control in Single ImagesSajjad Ansarihttps://www.marktechpost.com/author/sajjadansari/DanceGRPO: A Unified Framework for Reinforcement Learning in Visual Generation Across Multiple Paradigms and Tasks #researchers #introduce #mmlongbench #comprehensive #benchmark

WWW.MARKTECHPOST.COM

Researchers Introduce MMLONGBENCH: A Comprehensive Benchmark for Long-Context Vision-Language Models

Recent advances in long-context (LC) modeling have unlocked new capabilities for LLMs and large vision-language models (LVLMs). Long-context vision–language models (LCVLMs) show an important step forward by enabling LVLMs to process hundreds of images and thousands of interleaved text tokens in a single forward pass. However, the development of effective evaluation benchmarks lags. It is still unclear how well current LCVLMs perform in long-context settings, what tasks they struggle with, and how robust they are to input length variation. Current benchmarks face the following problem: (a) Limited coverage of downstream tasks, (b) Insufficient coverage of image types, (c) Lack of context length control, and (d) Single context length. Various techniques have extended context windows for LVLMs, including longer pre-training lengths, position extrapolation, and efficient architectures. Models like Gemini-2.5 and Qwen2.5-VL have adopted these approaches alongside vision token compression methods to accommodate longer sequences. For evaluation, the Needle-in-a-Haystack task became a standard benchmark for testing LC ability by inserting information at specific depths within long texts. However, existing vision-language benchmarks remain limited, focusing only on NIAH variants or long-document VQA tasks. Even MileBench contains short-context tasks with an average length of only 9K tokens, failing to evaluate true LC capabilities across diverse vision-language applications. Researchers from HKUST, Tencent AI Seattle Lab, University of Edinburgh, Miniml.AI, and NVIDIA AI Technology Center have proposed MMLONGBENCH, the first comprehensive benchmark for evaluating LCVLMs. It comprises 13,331 examples spanning five downstream task categories, including Visual RAG and Many-Shot ICL, covering natural and synthetic image types. All examples are standardized across five input lengths from 8K to 128K tokens using a cross-modal tokenization scheme combining vision patches and text tokens. Through benchmarking 46 closed-source and open-source models, the research reveals that single-task performance poorly predicts overall LC capability, both model types struggle with LC tasks, and stronger reasoning models show better LC performance. Researchers construct LC by inserting gold passages containing answers among large sets of distracting passages retrieved from Wikipedia. For ViQuAE, gold passages from KILT are used, while InfoSeek uses lead sections from Wikipedia entity pages. Further, Wikipedia pages are split into 100-word passages, and retrieved distractors are added until reaching desired input lengths. Many-shot in-context learning tasks utilize four diverse image classification datasets: Stanford Cars, Food101, SUN397, and iNat2021, accommodating 500 images within 128K context windows. Cross-modal token counting combines text tokens using the Llama2 tokenizer with visual tokens processed through 14×14 patches and 2×2 pixel unshuffle compression, ensuring compatibility with modern LVLMs for evaluation. The evaluation on MMLONGBENCH across tasks and context Lengths shows that all models struggle, but closed-source models perform better. For the longest input length of 128K, all models struggle with long-context vision-language tasks, with GPT-4o achieving only 62.9 average performance. Gemini-2.5-Pro became the strongest performer, outperforming open-source models by 20 points except on ICL tasks. Further, Ovis2-34B model achieves a score of 41.6 on summarization, similar to GPT-4o (42.4). Qwen2.5-VL-32B achieves a SubEM score of 64.6 on VRAG, even better than Gemini-2.0-Flash. Models show generalization capabilities beyond their training context lengths, with Qwen2-VL-72B achieving a 51.9 average score at 128K despite a 32K training window. In conclusion, researchers introduced MMLONGBENCH, the first comprehensive benchmark for evaluating LCVLMs across diverse downstream tasks. It provides a rigorous foundation for diagnosing frontier model capabilities by covering five distinct task categories with unified cross-modal token counting and standardized context lengths. The evaluation of 46 models demonstrates that single-task performance unreliably predicts overall long-context ability, and frontier models face significant challenges in OCR accuracy and cross-modal retrieval. MMLONGBENCH is a standard evaluation framework to drive future research toward more efficient vision-language token encodings, robust position-extrapolation schemes, and improved multi-modal retrieval and reasoning capabilities. Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. Sajjad AnsariSajjad Ansari is a final year undergraduate from IIT Kharagpur. As a Tech enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.Sajjad Ansarihttps://www.marktechpost.com/author/sajjadansari/Enhancing Language Model Generalization: Bridging the Gap Between In-Context Learning and Fine-TuningSajjad Ansarihttps://www.marktechpost.com/author/sajjadansari/Salesforce AI Researchers Introduce UAEval4RAG: A New Benchmark to Evaluate RAG Systems’ Ability to Reject Unanswerable QueriesSajjad Ansarihttps://www.marktechpost.com/author/sajjadansari/Google Researchers Introduce LightLab: A Diffusion-Based AI Method for Physically Plausible, Fine-Grained Light Control in Single ImagesSajjad Ansarihttps://www.marktechpost.com/author/sajjadansari/DanceGRPO: A Unified Framework for Reinforcement Learning in Visual Generation Across Multiple Paradigms and Tasks

·240 Ansichten

Please log in to like, share and comment!
Marktechpost AI RT hat einen Link geteilt

2025-05-22 08:56:37 ·

Technology Innovation Institute TII Releases Falcon-H1: Hybrid Transformer-SSM Language Models for Scalable, Multilingual, and Long-Context Understanding

Addressing Architectural Trade-offs in Language Models
As language models scale, balancing expressivity, efficiency, and adaptability becomes increasingly challenging. Transformer architectures dominate due to their strong performance across a wide range of tasks, but they are computationally expensive—particularly for long-context scenarios—due to the quadratic complexity of self-attention. On the other hand, Structured State Space Modelsoffer improved efficiency and linear scaling, yet often lack the nuanced sequence modeling required for complex language understanding. A combined architecture that leverages the strengths of both approaches is needed to support diverse applications across environments.
Introducing Falcon-H1: A Hybrid Architecture
The Falcon-H1 series, released by the Technology Innovation Institute, introduces a hybrid family of language models that combine Transformer attention mechanisms with Mamba2-based SSM components. This architecture is designed to improve computational efficiency while maintaining competitive performance across tasks requiring deep contextual understanding.
Falcon-H1 covers a wide parameter range—from 0.5B to 34B—catering to use cases from resource-constrained deployments to large-scale distributed inference. The design aims to address common bottlenecks in LLM deployment: memory efficiency, scalability, multilingual support, and the ability to handle extended input sequences.

Source: /
Architectural Details and Design Objectives
Falcon-H1 adopts a parallel structure where attention heads and Mamba2 SSMs operate side by side. This design allows each mechanism to independently contribute to sequence modeling: attention heads specialize in capturing token-level dependencies, while SSM components support efficient long-range information retention.
The series supports a context length of up to 256K tokens, which is particularly useful for applications in document summarization, retrieval-augmented generation, and multi-turn dialogue systems. Model training incorporates a customized microparameterizationrecipe and optimized data pipelines, allowing for stable and efficient training across model sizes.
The models are trained with a focus on multilingual capabilities. The architecture is natively equipped to handle 18 languages, with coverage including English, Chinese, Arabic, Hindi, French, and others. The framework is extensible to over 100 languages, supporting localization and region-specific model adaptation.
Empirical Results and Comparative Evaluation
Despite relatively modest parameter counts, Falcon-H1 models demonstrate strong empirical performance:

Falcon-H1-0.5B achieves results comparable to 7B-parameter models released in 2024.
Falcon-H1-1.5B-Deep performs on par with leading 7B to 10B Transformer models.
Falcon-H1-34B matches or exceeds the performance of models such as Qwen3-32B, Llama4-Scout-17B/109B, and Gemma3-27B across several benchmarks.

Evaluations emphasize both general-purpose language understanding and multilingual benchmarks. Notably, the models achieve strong performance across both high-resource and low-resource languages without requiring excessive fine-tuning or additional adaptation layers.

Source: /
Deployment and inference are supported through integration with open-source tools such as Hugging Face Transformers. FlashAttention-2 compatibility further reduces memory usage during inference, offering an attractive efficiency-performance balance for enterprise use.
Conclusion
Falcon-H1 represents a methodical effort to refine language model architecture by integrating complementary mechanisms—attention and SSMs—within a unified framework. By doing so, it addresses key limitations in both long-context processing and scaling efficiency. The model family provides a range of options for practitioners, from lightweight variants suitable for edge deployment to high-capacity configurations for server-side applications.
Through its multilingual coverage, long-context capabilities, and architectural flexibility, Falcon-H1 offers a technically sound foundation for research and production use cases that demand performance without compromising on efficiency or accessibility.

Check out the Official Release, Models on Hugging Face and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter.
Asif RazzaqWebsite | + postsBioAsif Razzaq is the CEO of Marktechpost Media Inc.. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts of over 2 million monthly views, illustrating its popularity among audiences.Asif Razzaqhttps://www.marktechpost.com/author/6flvq/Google DeepMind Releases Gemma 3n: A Compact, High-Efficiency Multimodal AI Model for Real-Time On-Device UseAsif Razzaqhttps://www.marktechpost.com/author/6flvq/A Step-by-Step Implementation Tutorial for Building Modular AI Workflows Using Anthropic’s Claude Sonnet 3.7 through API and LangGraphAsif Razzaqhttps://www.marktechpost.com/author/6flvq/Meta Researchers Introduced J1: A Reinforcement Learning Framework That Trains Language Models to Judge With Reasoned Consistency and Minimal DataAsif Razzaqhttps://www.marktechpost.com/author/6flvq/Sampling Without Data is Now Scalable: Meta AI Releases Adjoint Sampling for Reward-Driven Generative Modeling
#technology #innovation #institute #tii #releases

Technology Innovation Institute TII Releases Falcon-H1: Hybrid Transformer-SSM Language Models for Scalable, Multilingual, and Long-Context Understanding
Addressing Architectural Trade-offs in Language Models As language models scale, balancing expressivity, efficiency, and adaptability becomes increasingly challenging. Transformer architectures dominate due to their strong performance across a wide range of tasks, but they are computationally expensive—particularly for long-context scenarios—due to the quadratic complexity of self-attention. On the other hand, Structured State Space Modelsoffer improved efficiency and linear scaling, yet often lack the nuanced sequence modeling required for complex language understanding. A combined architecture that leverages the strengths of both approaches is needed to support diverse applications across environments. Introducing Falcon-H1: A Hybrid Architecture The Falcon-H1 series, released by the Technology Innovation Institute, introduces a hybrid family of language models that combine Transformer attention mechanisms with Mamba2-based SSM components. This architecture is designed to improve computational efficiency while maintaining competitive performance across tasks requiring deep contextual understanding. Falcon-H1 covers a wide parameter range—from 0.5B to 34B—catering to use cases from resource-constrained deployments to large-scale distributed inference. The design aims to address common bottlenecks in LLM deployment: memory efficiency, scalability, multilingual support, and the ability to handle extended input sequences. Source: / Architectural Details and Design Objectives Falcon-H1 adopts a parallel structure where attention heads and Mamba2 SSMs operate side by side. This design allows each mechanism to independently contribute to sequence modeling: attention heads specialize in capturing token-level dependencies, while SSM components support efficient long-range information retention. The series supports a context length of up to 256K tokens, which is particularly useful for applications in document summarization, retrieval-augmented generation, and multi-turn dialogue systems. Model training incorporates a customized microparameterizationrecipe and optimized data pipelines, allowing for stable and efficient training across model sizes. The models are trained with a focus on multilingual capabilities. The architecture is natively equipped to handle 18 languages, with coverage including English, Chinese, Arabic, Hindi, French, and others. The framework is extensible to over 100 languages, supporting localization and region-specific model adaptation. Empirical Results and Comparative Evaluation Despite relatively modest parameter counts, Falcon-H1 models demonstrate strong empirical performance: Falcon-H1-0.5B achieves results comparable to 7B-parameter models released in 2024. Falcon-H1-1.5B-Deep performs on par with leading 7B to 10B Transformer models. Falcon-H1-34B matches or exceeds the performance of models such as Qwen3-32B, Llama4-Scout-17B/109B, and Gemma3-27B across several benchmarks. Evaluations emphasize both general-purpose language understanding and multilingual benchmarks. Notably, the models achieve strong performance across both high-resource and low-resource languages without requiring excessive fine-tuning or additional adaptation layers. Source: / Deployment and inference are supported through integration with open-source tools such as Hugging Face Transformers. FlashAttention-2 compatibility further reduces memory usage during inference, offering an attractive efficiency-performance balance for enterprise use. Conclusion Falcon-H1 represents a methodical effort to refine language model architecture by integrating complementary mechanisms—attention and SSMs—within a unified framework. By doing so, it addresses key limitations in both long-context processing and scaling efficiency. The model family provides a range of options for practitioners, from lightweight variants suitable for edge deployment to high-capacity configurations for server-side applications. Through its multilingual coverage, long-context capabilities, and architectural flexibility, Falcon-H1 offers a technically sound foundation for research and production use cases that demand performance without compromising on efficiency or accessibility. Check out the Official Release, Models on Hugging Face and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. Asif RazzaqWebsite | + postsBioAsif Razzaq is the CEO of Marktechpost Media Inc.. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts of over 2 million monthly views, illustrating its popularity among audiences.Asif Razzaqhttps://www.marktechpost.com/author/6flvq/Google DeepMind Releases Gemma 3n: A Compact, High-Efficiency Multimodal AI Model for Real-Time On-Device UseAsif Razzaqhttps://www.marktechpost.com/author/6flvq/A Step-by-Step Implementation Tutorial for Building Modular AI Workflows Using Anthropic’s Claude Sonnet 3.7 through API and LangGraphAsif Razzaqhttps://www.marktechpost.com/author/6flvq/Meta Researchers Introduced J1: A Reinforcement Learning Framework That Trains Language Models to Judge With Reasoned Consistency and Minimal DataAsif Razzaqhttps://www.marktechpost.com/author/6flvq/Sampling Without Data is Now Scalable: Meta AI Releases Adjoint Sampling for Reward-Driven Generative Modeling #technology #innovation #institute #tii #releases

WWW.MARKTECHPOST.COM

Technology Innovation Institute TII Releases Falcon-H1: Hybrid Transformer-SSM Language Models for Scalable, Multilingual, and Long-Context Understanding

Addressing Architectural Trade-offs in Language Models As language models scale, balancing expressivity, efficiency, and adaptability becomes increasingly challenging. Transformer architectures dominate due to their strong performance across a wide range of tasks, but they are computationally expensive—particularly for long-context scenarios—due to the quadratic complexity of self-attention. On the other hand, Structured State Space Models (SSMs) offer improved efficiency and linear scaling, yet often lack the nuanced sequence modeling required for complex language understanding. A combined architecture that leverages the strengths of both approaches is needed to support diverse applications across environments. Introducing Falcon-H1: A Hybrid Architecture The Falcon-H1 series, released by the Technology Innovation Institute (TII), introduces a hybrid family of language models that combine Transformer attention mechanisms with Mamba2-based SSM components. This architecture is designed to improve computational efficiency while maintaining competitive performance across tasks requiring deep contextual understanding. Falcon-H1 covers a wide parameter range—from 0.5B to 34B—catering to use cases from resource-constrained deployments to large-scale distributed inference. The design aims to address common bottlenecks in LLM deployment: memory efficiency, scalability, multilingual support, and the ability to handle extended input sequences. Source: https://falcon-lm.github.io/blog/falcon-h1/ Architectural Details and Design Objectives Falcon-H1 adopts a parallel structure where attention heads and Mamba2 SSMs operate side by side. This design allows each mechanism to independently contribute to sequence modeling: attention heads specialize in capturing token-level dependencies, while SSM components support efficient long-range information retention. The series supports a context length of up to 256K tokens, which is particularly useful for applications in document summarization, retrieval-augmented generation, and multi-turn dialogue systems. Model training incorporates a customized microparameterization (μP) recipe and optimized data pipelines, allowing for stable and efficient training across model sizes. The models are trained with a focus on multilingual capabilities. The architecture is natively equipped to handle 18 languages, with coverage including English, Chinese, Arabic, Hindi, French, and others. The framework is extensible to over 100 languages, supporting localization and region-specific model adaptation. Empirical Results and Comparative Evaluation Despite relatively modest parameter counts, Falcon-H1 models demonstrate strong empirical performance: Falcon-H1-0.5B achieves results comparable to 7B-parameter models released in 2024. Falcon-H1-1.5B-Deep performs on par with leading 7B to 10B Transformer models. Falcon-H1-34B matches or exceeds the performance of models such as Qwen3-32B, Llama4-Scout-17B/109B, and Gemma3-27B across several benchmarks. Evaluations emphasize both general-purpose language understanding and multilingual benchmarks. Notably, the models achieve strong performance across both high-resource and low-resource languages without requiring excessive fine-tuning or additional adaptation layers. Source: https://falcon-lm.github.io/blog/falcon-h1/ Deployment and inference are supported through integration with open-source tools such as Hugging Face Transformers. FlashAttention-2 compatibility further reduces memory usage during inference, offering an attractive efficiency-performance balance for enterprise use. Conclusion Falcon-H1 represents a methodical effort to refine language model architecture by integrating complementary mechanisms—attention and SSMs—within a unified framework. By doing so, it addresses key limitations in both long-context processing and scaling efficiency. The model family provides a range of options for practitioners, from lightweight variants suitable for edge deployment to high-capacity configurations for server-side applications. Through its multilingual coverage, long-context capabilities, and architectural flexibility, Falcon-H1 offers a technically sound foundation for research and production use cases that demand performance without compromising on efficiency or accessibility. Check out the Official Release, Models on Hugging Face and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. Asif RazzaqWebsite | + postsBioAsif Razzaq is the CEO of Marktechpost Media Inc.. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts of over 2 million monthly views, illustrating its popularity among audiences.Asif Razzaqhttps://www.marktechpost.com/author/6flvq/Google DeepMind Releases Gemma 3n: A Compact, High-Efficiency Multimodal AI Model for Real-Time On-Device UseAsif Razzaqhttps://www.marktechpost.com/author/6flvq/A Step-by-Step Implementation Tutorial for Building Modular AI Workflows Using Anthropic’s Claude Sonnet 3.7 through API and LangGraphAsif Razzaqhttps://www.marktechpost.com/author/6flvq/Meta Researchers Introduced J1: A Reinforcement Learning Framework That Trains Language Models to Judge With Reasoned Consistency and Minimal DataAsif Razzaqhttps://www.marktechpost.com/author/6flvq/Sampling Without Data is Now Scalable: Meta AI Releases Adjoint Sampling for Reward-Driven Generative Modeling

·307 Ansichten

Please log in to like, share and comment!
Marktechpost AI RT hat einen Link geteilt

2025-05-21 22:35:37 ·

Meta Researchers Introduced J1: A Reinforcement Learning Framework That Trains Language Models to Judge With Reasoned Consistency and Minimal Data

Large language models are now being used for evaluation and judgment tasks, extending beyond their traditional role of text generation. This has led to “LLM-as-a-Judge,” where models assess outputs from other language models. Such evaluations are essential in reinforcement learning pipelines, benchmark testing, and system alignment. These judge models rely on internal chain-of-thought reasoning, mirroring human judgment processes. Unlike conventional reward models that provide direct scores, these models simulate thoughtful evaluation, making them better suited for complex tasks such as math problem-solving, ethical reasoning, and user intent interpretation. Their ability to interpret and validate responses across languages and domains enhances automation and scalability in language model development.
However, current AI judgment systems face issues with inconsistency and shallow reasoning. Many rely on basic metrics or static annotations, which are inadequate for evaluating subjective or open-ended prompts. A common problem is position bias, where the order of answers affects the final decision, compromising fairness. Also, collecting human-annotated data at scale is costly and time-consuming, limiting the generalizability of these models.
Several existing approaches have addressed these challenges, but with limited success. Systems like EvalPlanner and DeepSeek-GRM rely on human-labeled data or rigid training schemes, which limit adaptability across task types. Others, like DeepSeek-R1, depend on distillation from large models but perform poorly on ambiguous prompts. Static datasets and offline tuning strategies hinder dynamic reasoning, while newer methods using score formatting or structured prompts have shown minimal accuracy improvements. Despite larger datasets and models, performance gains in traditional systems have stalled.
Researchers from Meta’s GenAI and FAIR teams introduced J1 to address the above limitations. J1 trains judgment models through a reinforcement learning-based framework, making them capable of learning through verifiable reward signals. The team used synthetic data to create high-quality and low-quality responses to a prompt, transforming subjective tasks into verifiable pairwise judgments. This synthetic dataset included 22,000 preference pairs, split between 17,000 prompts from the WildChat corpus and 5,000 mathematical queries. These were used to train two versions of J1: J1-Llama-8B and J1-Llama-70B, initialized from the Llama-3.1-8B-Instruct and Llama-3.3-70B-Instruct base models, respectively. The models were trained using Group Relative Policy Optimization, a reinforcement algorithm that eliminates the need for critic models and accelerates convergence.

At the training strategy’s core is position-agnostic learning, where bothandinput formats are used in training to prevent position bias. Also, consistency-based rewards are applied only when the model delivers correct verdicts across both answer orderings. This structure allows the judge to be fair and reliable regardless of prompt or answer order. The training framework supports multiple variations: models can output final verdicts, numeric scores for each answer, or both. A pointwise judging variant is included, which evaluates single responses using scores from 0 to 10. These formats make J1 a versatile and generalizable system capable of judging various tasks.

The results obtained using the J1 models reveal substantial performance improvements over existing systems. On the widely used Preference Proxy Evaluationsbenchmark, J1-Llama-70B achieved an overall accuracy of 69.6%, outperforming models trained with over ten times more data. In contrast, models like DeepSeek-GRM-27B and EvalPlanner-Llama-70B scored 67.2% and 65.6%, respectively. Even the smaller J1-Llama-8B model exceeded baseline systems like EvalPlanner-Llama-8B, scoring 62.2% versus 55.5%. J1 also showed top-tier performance on other critical benchmarks such as RewardBench, RM-Bench, JudgeBench, and FollowBenchEval, demonstrating robust generalization across verifiable and subjective tasks. These improvements are not just marginal but significant, considering the limited training data used in J1 compared to the expansive datasets in other models.

Several Key Takeaways from the Research on J1:

J1 is trained using 22,000 synthetic preference pairs, including 17K from WildChat and 5K from MATH tasks.
The training uses GRPO, which streamlines RL by avoiding the need for separate critic models.
It introduces position-agnostic learning, reducing position bias through consistency-based rewards.
Two main model variants, J1-Llama-8B and J1-Llama-70B, were trained on modest data but outperformed large-scale models.
J1-Llama-70B scored 69.6% on PPE, exceeding DeepSeek-GRM-27Band EvalPlanner-Llama-70B.
Supports multiple judgment formats: pairwise with verdicts, pairwise with scores, and pointwise scores.
Surpasses models distilled from DeepSeek-R1 and OpenAI’s o1-mini on several tasks.
Demonstrates that reasoning quality, not just dataset size, is critical for accurate judgments.
J1’s framework makes it a generalist judge applicable to verifiable and non-verifiable tasks.

In conclusion, the J1 approach fundamentally redefines how judgment models are trained and evaluated. Synthetic data and reinforcement learning bypass the traditional need for costly annotations while promoting fair, logical, and consistent evaluations. This work illustrates that reasoning-driven judging can outperform larger models that rely heavily on data volume and static alignment techniques. It also validates the notion that judgment models should be thinkers first, and scorers second. With performance that rivals and often surpasses state-of-the-art systems, J1 sets a new benchmark in training LLM-as-a-Judge systems.

Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter.
Asif RazzaqWebsite | + postsBioAsif Razzaq is the CEO of Marktechpost Media Inc.. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts of over 2 million monthly views, illustrating its popularity among audiences.Asif Razzaqhttps://www.marktechpost.com/author/6flvq/Sampling Without Data is Now Scalable: Meta AI Releases Adjoint Sampling for Reward-Driven Generative ModelingAsif Razzaqhttps://www.marktechpost.com/author/6flvq/Google AI Releases MedGemma: An Open Suite of Models Trained for Performance on Medical Text and Image ComprehensionAsif Razzaqhttps://www.marktechpost.com/author/6flvq/NVIDIA Releases Cosmos-Reason1: A Suite of AI Models Advancing Physical Common Sense and Embodied Reasoning in Real-World EnvironmentsAsif Razzaqhttps://www.marktechpost.com/author/6flvq/A Step-by-Step Coding Guide to Efficiently Fine-Tune Qwen3-14B Using Unsloth AI on Google Colab with Mixed Datasets and LoRA Optimization
#meta #researchers #introduced #reinforcement #learning

Meta Researchers Introduced J1: A Reinforcement Learning Framework That Trains Language Models to Judge With Reasoned Consistency and Minimal Data
Large language models are now being used for evaluation and judgment tasks, extending beyond their traditional role of text generation. This has led to “LLM-as-a-Judge,” where models assess outputs from other language models. Such evaluations are essential in reinforcement learning pipelines, benchmark testing, and system alignment. These judge models rely on internal chain-of-thought reasoning, mirroring human judgment processes. Unlike conventional reward models that provide direct scores, these models simulate thoughtful evaluation, making them better suited for complex tasks such as math problem-solving, ethical reasoning, and user intent interpretation. Their ability to interpret and validate responses across languages and domains enhances automation and scalability in language model development. However, current AI judgment systems face issues with inconsistency and shallow reasoning. Many rely on basic metrics or static annotations, which are inadequate for evaluating subjective or open-ended prompts. A common problem is position bias, where the order of answers affects the final decision, compromising fairness. Also, collecting human-annotated data at scale is costly and time-consuming, limiting the generalizability of these models. Several existing approaches have addressed these challenges, but with limited success. Systems like EvalPlanner and DeepSeek-GRM rely on human-labeled data or rigid training schemes, which limit adaptability across task types. Others, like DeepSeek-R1, depend on distillation from large models but perform poorly on ambiguous prompts. Static datasets and offline tuning strategies hinder dynamic reasoning, while newer methods using score formatting or structured prompts have shown minimal accuracy improvements. Despite larger datasets and models, performance gains in traditional systems have stalled. Researchers from Meta’s GenAI and FAIR teams introduced J1 to address the above limitations. J1 trains judgment models through a reinforcement learning-based framework, making them capable of learning through verifiable reward signals. The team used synthetic data to create high-quality and low-quality responses to a prompt, transforming subjective tasks into verifiable pairwise judgments. This synthetic dataset included 22,000 preference pairs, split between 17,000 prompts from the WildChat corpus and 5,000 mathematical queries. These were used to train two versions of J1: J1-Llama-8B and J1-Llama-70B, initialized from the Llama-3.1-8B-Instruct and Llama-3.3-70B-Instruct base models, respectively. The models were trained using Group Relative Policy Optimization, a reinforcement algorithm that eliminates the need for critic models and accelerates convergence. At the training strategy’s core is position-agnostic learning, where bothandinput formats are used in training to prevent position bias. Also, consistency-based rewards are applied only when the model delivers correct verdicts across both answer orderings. This structure allows the judge to be fair and reliable regardless of prompt or answer order. The training framework supports multiple variations: models can output final verdicts, numeric scores for each answer, or both. A pointwise judging variant is included, which evaluates single responses using scores from 0 to 10. These formats make J1 a versatile and generalizable system capable of judging various tasks. The results obtained using the J1 models reveal substantial performance improvements over existing systems. On the widely used Preference Proxy Evaluationsbenchmark, J1-Llama-70B achieved an overall accuracy of 69.6%, outperforming models trained with over ten times more data. In contrast, models like DeepSeek-GRM-27B and EvalPlanner-Llama-70B scored 67.2% and 65.6%, respectively. Even the smaller J1-Llama-8B model exceeded baseline systems like EvalPlanner-Llama-8B, scoring 62.2% versus 55.5%. J1 also showed top-tier performance on other critical benchmarks such as RewardBench, RM-Bench, JudgeBench, and FollowBenchEval, demonstrating robust generalization across verifiable and subjective tasks. These improvements are not just marginal but significant, considering the limited training data used in J1 compared to the expansive datasets in other models. Several Key Takeaways from the Research on J1: J1 is trained using 22,000 synthetic preference pairs, including 17K from WildChat and 5K from MATH tasks. The training uses GRPO, which streamlines RL by avoiding the need for separate critic models. It introduces position-agnostic learning, reducing position bias through consistency-based rewards. Two main model variants, J1-Llama-8B and J1-Llama-70B, were trained on modest data but outperformed large-scale models. J1-Llama-70B scored 69.6% on PPE, exceeding DeepSeek-GRM-27Band EvalPlanner-Llama-70B. Supports multiple judgment formats: pairwise with verdicts, pairwise with scores, and pointwise scores. Surpasses models distilled from DeepSeek-R1 and OpenAI’s o1-mini on several tasks. Demonstrates that reasoning quality, not just dataset size, is critical for accurate judgments. J1’s framework makes it a generalist judge applicable to verifiable and non-verifiable tasks. In conclusion, the J1 approach fundamentally redefines how judgment models are trained and evaluated. Synthetic data and reinforcement learning bypass the traditional need for costly annotations while promoting fair, logical, and consistent evaluations. This work illustrates that reasoning-driven judging can outperform larger models that rely heavily on data volume and static alignment techniques. It also validates the notion that judgment models should be thinkers first, and scorers second. With performance that rivals and often surpasses state-of-the-art systems, J1 sets a new benchmark in training LLM-as-a-Judge systems. Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. Asif RazzaqWebsite | + postsBioAsif Razzaq is the CEO of Marktechpost Media Inc.. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts of over 2 million monthly views, illustrating its popularity among audiences.Asif Razzaqhttps://www.marktechpost.com/author/6flvq/Sampling Without Data is Now Scalable: Meta AI Releases Adjoint Sampling for Reward-Driven Generative ModelingAsif Razzaqhttps://www.marktechpost.com/author/6flvq/Google AI Releases MedGemma: An Open Suite of Models Trained for Performance on Medical Text and Image ComprehensionAsif Razzaqhttps://www.marktechpost.com/author/6flvq/NVIDIA Releases Cosmos-Reason1: A Suite of AI Models Advancing Physical Common Sense and Embodied Reasoning in Real-World EnvironmentsAsif Razzaqhttps://www.marktechpost.com/author/6flvq/A Step-by-Step Coding Guide to Efficiently Fine-Tune Qwen3-14B Using Unsloth AI on Google Colab with Mixed Datasets and LoRA Optimization #meta #researchers #introduced #reinforcement #learning

WWW.MARKTECHPOST.COM

Meta Researchers Introduced J1: A Reinforcement Learning Framework That Trains Language Models to Judge With Reasoned Consistency and Minimal Data

Large language models are now being used for evaluation and judgment tasks, extending beyond their traditional role of text generation. This has led to “LLM-as-a-Judge,” where models assess outputs from other language models. Such evaluations are essential in reinforcement learning pipelines, benchmark testing, and system alignment. These judge models rely on internal chain-of-thought reasoning, mirroring human judgment processes. Unlike conventional reward models that provide direct scores, these models simulate thoughtful evaluation, making them better suited for complex tasks such as math problem-solving, ethical reasoning, and user intent interpretation. Their ability to interpret and validate responses across languages and domains enhances automation and scalability in language model development. However, current AI judgment systems face issues with inconsistency and shallow reasoning. Many rely on basic metrics or static annotations, which are inadequate for evaluating subjective or open-ended prompts. A common problem is position bias, where the order of answers affects the final decision, compromising fairness. Also, collecting human-annotated data at scale is costly and time-consuming, limiting the generalizability of these models. Several existing approaches have addressed these challenges, but with limited success. Systems like EvalPlanner and DeepSeek-GRM rely on human-labeled data or rigid training schemes, which limit adaptability across task types. Others, like DeepSeek-R1, depend on distillation from large models but perform poorly on ambiguous prompts. Static datasets and offline tuning strategies hinder dynamic reasoning, while newer methods using score formatting or structured prompts have shown minimal accuracy improvements. Despite larger datasets and models, performance gains in traditional systems have stalled. Researchers from Meta’s GenAI and FAIR teams introduced J1 to address the above limitations. J1 trains judgment models through a reinforcement learning-based framework, making them capable of learning through verifiable reward signals. The team used synthetic data to create high-quality and low-quality responses to a prompt, transforming subjective tasks into verifiable pairwise judgments. This synthetic dataset included 22,000 preference pairs, split between 17,000 prompts from the WildChat corpus and 5,000 mathematical queries. These were used to train two versions of J1: J1-Llama-8B and J1-Llama-70B, initialized from the Llama-3.1-8B-Instruct and Llama-3.3-70B-Instruct base models, respectively. The models were trained using Group Relative Policy Optimization (GRPO), a reinforcement algorithm that eliminates the need for critic models and accelerates convergence. At the training strategy’s core is position-agnostic learning, where both (x, a, b) and (x, b, a) input formats are used in training to prevent position bias. Also, consistency-based rewards are applied only when the model delivers correct verdicts across both answer orderings. This structure allows the judge to be fair and reliable regardless of prompt or answer order. The training framework supports multiple variations: models can output final verdicts, numeric scores for each answer, or both. A pointwise judging variant is included, which evaluates single responses using scores from 0 to 10. These formats make J1 a versatile and generalizable system capable of judging various tasks. The results obtained using the J1 models reveal substantial performance improvements over existing systems. On the widely used Preference Proxy Evaluations (PPE) benchmark, J1-Llama-70B achieved an overall accuracy of 69.6%, outperforming models trained with over ten times more data. In contrast, models like DeepSeek-GRM-27B and EvalPlanner-Llama-70B scored 67.2% and 65.6%, respectively. Even the smaller J1-Llama-8B model exceeded baseline systems like EvalPlanner-Llama-8B, scoring 62.2% versus 55.5%. J1 also showed top-tier performance on other critical benchmarks such as RewardBench, RM-Bench, JudgeBench, and FollowBenchEval, demonstrating robust generalization across verifiable and subjective tasks. These improvements are not just marginal but significant, considering the limited training data used in J1 compared to the expansive datasets in other models. Several Key Takeaways from the Research on J1: J1 is trained using 22,000 synthetic preference pairs, including 17K from WildChat and 5K from MATH tasks. The training uses GRPO, which streamlines RL by avoiding the need for separate critic models. It introduces position-agnostic learning, reducing position bias through consistency-based rewards. Two main model variants, J1-Llama-8B and J1-Llama-70B, were trained on modest data but outperformed large-scale models. J1-Llama-70B scored 69.6% on PPE, exceeding DeepSeek-GRM-27B (67.2%) and EvalPlanner-Llama-70B (65.6%). Supports multiple judgment formats: pairwise with verdicts, pairwise with scores, and pointwise scores. Surpasses models distilled from DeepSeek-R1 and OpenAI’s o1-mini on several tasks. Demonstrates that reasoning quality, not just dataset size, is critical for accurate judgments. J1’s framework makes it a generalist judge applicable to verifiable and non-verifiable tasks. In conclusion, the J1 approach fundamentally redefines how judgment models are trained and evaluated. Synthetic data and reinforcement learning bypass the traditional need for costly annotations while promoting fair, logical, and consistent evaluations. This work illustrates that reasoning-driven judging can outperform larger models that rely heavily on data volume and static alignment techniques. It also validates the notion that judgment models should be thinkers first, and scorers second. With performance that rivals and often surpasses state-of-the-art systems, J1 sets a new benchmark in training LLM-as-a-Judge systems. Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. Asif RazzaqWebsite | + postsBioAsif Razzaq is the CEO of Marktechpost Media Inc.. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts of over 2 million monthly views, illustrating its popularity among audiences.Asif Razzaqhttps://www.marktechpost.com/author/6flvq/Sampling Without Data is Now Scalable: Meta AI Releases Adjoint Sampling for Reward-Driven Generative ModelingAsif Razzaqhttps://www.marktechpost.com/author/6flvq/Google AI Releases MedGemma: An Open Suite of Models Trained for Performance on Medical Text and Image ComprehensionAsif Razzaqhttps://www.marktechpost.com/author/6flvq/NVIDIA Releases Cosmos-Reason1: A Suite of AI Models Advancing Physical Common Sense and Embodied Reasoning in Real-World EnvironmentsAsif Razzaqhttps://www.marktechpost.com/author/6flvq/A Step-by-Step Coding Guide to Efficiently Fine-Tune Qwen3-14B Using Unsloth AI on Google Colab with Mixed Datasets and LoRA Optimization

·272 Ansichten

Please log in to like, share and comment!
TechCrunch RT hat einen Link geteilt

2025-05-21 21:51:19 ·

Meta launches program to encourage startups to use its Llama AI models

Meta is launching a new program to incentivize startups to adopt its Llama AI models.
The program, Llama for Startups, provides companies “direct support” from Meta’s Llama team, as well as funding in certain cases. Any U.S.-based firm that is incorporated, has raised less than million in funding, has at least one developer on staff, and is building generative AI applications is eligible to apply by the May 30 deadline.
“Members may receive up to per month for up to six months to help them offset the costs of building and enhancing their generative AI solutions,” Meta wrote in a blog post. “Our experts will work closely with them to get started and explore advanced use cases of Llama that could benefit their startups.”
The launch of the Llama startup program comes as Meta tries to cement its lead in the fiercely competitive open model space. While the tech giant’s Llama models have racked up more than a billion downloads to date, rivals such as DeepSeek, Google, and Alibaba’s Qwen threaten to upend Meta’s efforts to establish a far-reaching model ecosystem.
Not helping matters, Llama has suffered several setbacks over the past few months.
The Wall Street Journal last week reported Meta has delayed the rollout of a flagship AI model, Llama 4 Behemoth, over concerns the model underperforms on key benchmarks. In April, Meta had to fend off allegations that it cheated on a popular crowdsourced AI benchmark, LM Arena. The company used a version of its Llama 4 Maverick model “optimized for conversationality” to achieve a high score on LM Arena, but released a different version of Maverick publicly.
Meta has huge ambitions for Llama — and its broader generative AI portfolio. Last year, the company made a prediction its generative AI products would rake in billion to billion in revenue in 2025, and between billion and trillion by 2035.

Techcrunch event

Join us at TechCrunch Sessions: AI
Secure your spot for our leading AI industry event with speakers from OpenAI, Anthropic, and Cohere. For a limited time, tickets are just for an entire day of expert talks, workshops, and potent networking.

Exhibit at TechCrunch Sessions: AI
Secure your spot at TC Sessions: AI and show 1,200+ decision-makers what you’ve built — without the big spend. Available through May 9 or while tables last.

Berkeley, CA
|
June 5

REGISTER NOW

Meta has revenue-sharing agreements with some companies that host its Llama models. The company recently launched an API for customizing Llama releases. And Meta AI, the company’s AI assistant powered by Llama, may eventually show ads and offer a subscription with additional features, CEO Mark Zuckerberg said during the company’s Q1 earnings call.
These products have proven costly to build. In 2024, Meta’s “GenAI” budget was more than million, and this year, it could exceed billion. That’s not including the infrastructure needed to run and train the models. Meta previously said it plans to spend billion to billion on capital expenditures in 2025, primarily on new data centers.
#meta #launches #program #encourage #startups

Meta launches program to encourage startups to use its Llama AI models
Meta is launching a new program to incentivize startups to adopt its Llama AI models. The program, Llama for Startups, provides companies “direct support” from Meta’s Llama team, as well as funding in certain cases. Any U.S.-based firm that is incorporated, has raised less than million in funding, has at least one developer on staff, and is building generative AI applications is eligible to apply by the May 30 deadline. “Members may receive up to per month for up to six months to help them offset the costs of building and enhancing their generative AI solutions,” Meta wrote in a blog post. “Our experts will work closely with them to get started and explore advanced use cases of Llama that could benefit their startups.” The launch of the Llama startup program comes as Meta tries to cement its lead in the fiercely competitive open model space. While the tech giant’s Llama models have racked up more than a billion downloads to date, rivals such as DeepSeek, Google, and Alibaba’s Qwen threaten to upend Meta’s efforts to establish a far-reaching model ecosystem. Not helping matters, Llama has suffered several setbacks over the past few months. The Wall Street Journal last week reported Meta has delayed the rollout of a flagship AI model, Llama 4 Behemoth, over concerns the model underperforms on key benchmarks. In April, Meta had to fend off allegations that it cheated on a popular crowdsourced AI benchmark, LM Arena. The company used a version of its Llama 4 Maverick model “optimized for conversationality” to achieve a high score on LM Arena, but released a different version of Maverick publicly. Meta has huge ambitions for Llama — and its broader generative AI portfolio. Last year, the company made a prediction its generative AI products would rake in billion to billion in revenue in 2025, and between billion and trillion by 2035. Techcrunch event Join us at TechCrunch Sessions: AI Secure your spot for our leading AI industry event with speakers from OpenAI, Anthropic, and Cohere. For a limited time, tickets are just for an entire day of expert talks, workshops, and potent networking. Exhibit at TechCrunch Sessions: AI Secure your spot at TC Sessions: AI and show 1,200+ decision-makers what you’ve built — without the big spend. Available through May 9 or while tables last. Berkeley, CA | June 5 REGISTER NOW Meta has revenue-sharing agreements with some companies that host its Llama models. The company recently launched an API for customizing Llama releases. And Meta AI, the company’s AI assistant powered by Llama, may eventually show ads and offer a subscription with additional features, CEO Mark Zuckerberg said during the company’s Q1 earnings call. These products have proven costly to build. In 2024, Meta’s “GenAI” budget was more than million, and this year, it could exceed billion. That’s not including the infrastructure needed to run and train the models. Meta previously said it plans to spend billion to billion on capital expenditures in 2025, primarily on new data centers. #meta #launches #program #encourage #startups

TECHCRUNCH.COM

Meta launches program to encourage startups to use its Llama AI models

Meta is launching a new program to incentivize startups to adopt its Llama AI models. The program, Llama for Startups, provides companies “direct support” from Meta’s Llama team, as well as funding in certain cases. Any U.S.-based firm that is incorporated, has raised less than $10 million in funding, has at least one developer on staff, and is building generative AI applications is eligible to apply by the May 30 deadline. “Members may receive up to $6,000 per month for up to six months to help them offset the costs of building and enhancing their generative AI solutions,” Meta wrote in a blog post. “Our experts will work closely with them to get started and explore advanced use cases of Llama that could benefit their startups.” The launch of the Llama startup program comes as Meta tries to cement its lead in the fiercely competitive open model space. While the tech giant’s Llama models have racked up more than a billion downloads to date, rivals such as DeepSeek, Google, and Alibaba’s Qwen threaten to upend Meta’s efforts to establish a far-reaching model ecosystem. Not helping matters, Llama has suffered several setbacks over the past few months. The Wall Street Journal last week reported Meta has delayed the rollout of a flagship AI model, Llama 4 Behemoth, over concerns the model underperforms on key benchmarks. In April, Meta had to fend off allegations that it cheated on a popular crowdsourced AI benchmark, LM Arena. The company used a version of its Llama 4 Maverick model “optimized for conversationality” to achieve a high score on LM Arena, but released a different version of Maverick publicly. Meta has huge ambitions for Llama — and its broader generative AI portfolio. Last year, the company made a prediction its generative AI products would rake in $2 billion to $3 billion in revenue in 2025, and between $460 billion and $1.4 trillion by 2035. Techcrunch event Join us at TechCrunch Sessions: AI Secure your spot for our leading AI industry event with speakers from OpenAI, Anthropic, and Cohere. For a limited time, tickets are just $292 for an entire day of expert talks, workshops, and potent networking. Exhibit at TechCrunch Sessions: AI Secure your spot at TC Sessions: AI and show 1,200+ decision-makers what you’ve built — without the big spend. Available through May 9 or while tables last. Berkeley, CA | June 5 REGISTER NOW Meta has revenue-sharing agreements with some companies that host its Llama models. The company recently launched an API for customizing Llama releases. And Meta AI, the company’s AI assistant powered by Llama, may eventually show ads and offer a subscription with additional features, CEO Mark Zuckerberg said during the company’s Q1 earnings call. These products have proven costly to build. In 2024, Meta’s “GenAI” budget was more than $900 million, and this year, it could exceed $1 billion. That’s not including the infrastructure needed to run and train the models. Meta previously said it plans to spend $60 billion to $80 billion on capital expenditures in 2025, primarily on new data centers.

·206 Ansichten

Please log in to like, share and comment!
Marktechpost AI RT hat einen Link geteilt

2025-05-21 07:27:29 ·

Sampling Without Data is Now Scalable: Meta AI Releases Adjoint Sampling for Reward-Driven Generative Modeling

Data Scarcity in Generative Modeling
Generative models traditionally rely on large, high-quality datasets to produce samples that replicate the underlying data distribution. However, in fields like molecular modeling or physics-based inference, acquiring such data can be computationally infeasible or even impossible. Instead of labeled data, only a scalar reward—typically derived from a complex energy function—is available to judge the quality of generated samples. This presents a significant challenge: how can one train generative models effectively without direct supervision from data?
Meta AI tackles this challenge with Adjoint Sampling, a novel learning algorithm designed for training generative models using only scalar reward signals. Built on the theoretical framework of stochastic optimal control, Adjoint Sampling reframes the training process as an optimization task over a controlled diffusion process. Unlike standard generative models, it does not require explicit data. Instead, it learns to generate high-quality samples by iteratively refining them using a reward function—often derived from physical or chemical energy models.
Adjoint Sampling excels in scenarios where only an unnormalized energy function is accessible. It produces samples that align with the target distribution defined by this energy, bypassing the need for corrective methods like importance sampling or MCMC, which are computationally intensive.

Source:
Technical Details
The foundation of Adjoint Sampling is a stochastic differential equationthat models how sample trajectories evolve. The algorithm learns a control drift uuusuch that the final state of these trajectories approximates a desired distribution. A key innovation is its use of Reciprocal Adjoint Matching—a loss function that enables gradient-based updates using only the initial and final states of sample trajectories. This sidesteps the need to backpropagate through the entire diffusion path, greatly improving computational efficiency.
By sampling from a known base process and conditioning on terminal states, Adjoint Sampling constructs a replay buffer of samples and gradients, allowing multiple optimization steps per sample. This on-policy training method provides scalability unmatched by previous approaches, making it suitable for high-dimensional problems like molecular conformer generation.
Moreover, Adjoint Sampling supports geometric symmetries and periodic boundary conditions, enabling models to respect molecular invariances like rotation, translation, and torsion. These features are crucial for physically meaningful generative tasks in chemistry and physics.
Performance Insights and Benchmark Results
Adjoint Sampling achieves state-of-the-art results in both synthetic and real-world tasks. On synthetic benchmarks such as the Double-Well, Lennard-Jonespotentials, it significantly outperforms baselines like DDS and PIS, especially in energy efficiency. For example, where DDS and PIS require 1000 evaluations per gradient update, Adjoint Sampling only uses three, with similar or better performance in Wasserstein distance and effective sample size.
In a practical setting, the algorithm was evaluated on large-scale molecular conformer generation using the eSEN energy model trained on the SPICE-MACE-OFF dataset. Adjoint Sampling, especially its Cartesian variant with pretraining, achieved up to 96.4% recall and 0.60 Å mean RMSD, surpassing RDKit ETKDG—a widely used chemistry-based baseline—across all metrics. The method generalizes well to the GEOM-DRUGS dataset, showing substantial improvements in recall while maintaining competitive precision.

The algorithm’s ability to explore the configuration space broadly, aided by its stochastic initialization and reward-based learning, results in greater conformer diversity—critical for drug discovery and molecular design.
Conclusion: A Scalable Path Forward for Reward-Driven Generative Models
Adjoint Sampling represents a major step forward in generative modeling without data. By leveraging scalar reward signals and an efficient on-policy training method grounded in stochastic control, it enables scalable training of diffusion-based samplers with minimal energy evaluations. Its integration of geometric symmetries and its ability to generalize across diverse molecular structures position it as a foundational tool in computational chemistry and beyond.

Check out the Paper, Model on Hugging Face and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter.
Asif RazzaqWebsite | + postsBioAsif Razzaq is the CEO of Marktechpost Media Inc.. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts of over 2 million monthly views, illustrating its popularity among audiences.Asif Razzaqhttps://www.marktechpost.com/author/6flvq/Google AI Releases MedGemma: An Open Suite of Models Trained for Performance on Medical Text and Image ComprehensionAsif Razzaqhttps://www.marktechpost.com/author/6flvq/NVIDIA Releases Cosmos-Reason1: A Suite of AI Models Advancing Physical Common Sense and Embodied Reasoning in Real-World EnvironmentsAsif Razzaqhttps://www.marktechpost.com/author/6flvq/A Step-by-Step Coding Guide to Efficiently Fine-Tune Qwen3-14B Using Unsloth AI on Google Colab with Mixed Datasets and LoRA OptimizationAsif Razzaqhttps://www.marktechpost.com/author/6flvq/Chain-of-Thought May Not Be a Window into AI’s Reasoning: Anthropic’s New Study Reveals Hidden Gaps

Build GenAI you can trust. ⭐️ Parlant is your open-source engine for controlled, compliant, and purposeful AI conversations — Star Parlant on GitHub!
#sampling #without #data #now #scalable

Sampling Without Data is Now Scalable: Meta AI Releases Adjoint Sampling for Reward-Driven Generative Modeling
Data Scarcity in Generative Modeling Generative models traditionally rely on large, high-quality datasets to produce samples that replicate the underlying data distribution. However, in fields like molecular modeling or physics-based inference, acquiring such data can be computationally infeasible or even impossible. Instead of labeled data, only a scalar reward—typically derived from a complex energy function—is available to judge the quality of generated samples. This presents a significant challenge: how can one train generative models effectively without direct supervision from data? Meta AI tackles this challenge with Adjoint Sampling, a novel learning algorithm designed for training generative models using only scalar reward signals. Built on the theoretical framework of stochastic optimal control, Adjoint Sampling reframes the training process as an optimization task over a controlled diffusion process. Unlike standard generative models, it does not require explicit data. Instead, it learns to generate high-quality samples by iteratively refining them using a reward function—often derived from physical or chemical energy models. Adjoint Sampling excels in scenarios where only an unnormalized energy function is accessible. It produces samples that align with the target distribution defined by this energy, bypassing the need for corrective methods like importance sampling or MCMC, which are computationally intensive. Source: Technical Details The foundation of Adjoint Sampling is a stochastic differential equationthat models how sample trajectories evolve. The algorithm learns a control drift uuusuch that the final state of these trajectories approximates a desired distribution. A key innovation is its use of Reciprocal Adjoint Matching—a loss function that enables gradient-based updates using only the initial and final states of sample trajectories. This sidesteps the need to backpropagate through the entire diffusion path, greatly improving computational efficiency. By sampling from a known base process and conditioning on terminal states, Adjoint Sampling constructs a replay buffer of samples and gradients, allowing multiple optimization steps per sample. This on-policy training method provides scalability unmatched by previous approaches, making it suitable for high-dimensional problems like molecular conformer generation. Moreover, Adjoint Sampling supports geometric symmetries and periodic boundary conditions, enabling models to respect molecular invariances like rotation, translation, and torsion. These features are crucial for physically meaningful generative tasks in chemistry and physics. Performance Insights and Benchmark Results Adjoint Sampling achieves state-of-the-art results in both synthetic and real-world tasks. On synthetic benchmarks such as the Double-Well, Lennard-Jonespotentials, it significantly outperforms baselines like DDS and PIS, especially in energy efficiency. For example, where DDS and PIS require 1000 evaluations per gradient update, Adjoint Sampling only uses three, with similar or better performance in Wasserstein distance and effective sample size. In a practical setting, the algorithm was evaluated on large-scale molecular conformer generation using the eSEN energy model trained on the SPICE-MACE-OFF dataset. Adjoint Sampling, especially its Cartesian variant with pretraining, achieved up to 96.4% recall and 0.60 Å mean RMSD, surpassing RDKit ETKDG—a widely used chemistry-based baseline—across all metrics. The method generalizes well to the GEOM-DRUGS dataset, showing substantial improvements in recall while maintaining competitive precision. The algorithm’s ability to explore the configuration space broadly, aided by its stochastic initialization and reward-based learning, results in greater conformer diversity—critical for drug discovery and molecular design. Conclusion: A Scalable Path Forward for Reward-Driven Generative Models Adjoint Sampling represents a major step forward in generative modeling without data. By leveraging scalar reward signals and an efficient on-policy training method grounded in stochastic control, it enables scalable training of diffusion-based samplers with minimal energy evaluations. Its integration of geometric symmetries and its ability to generalize across diverse molecular structures position it as a foundational tool in computational chemistry and beyond. Check out the Paper, Model on Hugging Face and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. Asif RazzaqWebsite | + postsBioAsif Razzaq is the CEO of Marktechpost Media Inc.. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts of over 2 million monthly views, illustrating its popularity among audiences.Asif Razzaqhttps://www.marktechpost.com/author/6flvq/Google AI Releases MedGemma: An Open Suite of Models Trained for Performance on Medical Text and Image ComprehensionAsif Razzaqhttps://www.marktechpost.com/author/6flvq/NVIDIA Releases Cosmos-Reason1: A Suite of AI Models Advancing Physical Common Sense and Embodied Reasoning in Real-World EnvironmentsAsif Razzaqhttps://www.marktechpost.com/author/6flvq/A Step-by-Step Coding Guide to Efficiently Fine-Tune Qwen3-14B Using Unsloth AI on Google Colab with Mixed Datasets and LoRA OptimizationAsif Razzaqhttps://www.marktechpost.com/author/6flvq/Chain-of-Thought May Not Be a Window into AI’s Reasoning: Anthropic’s New Study Reveals Hidden Gaps 🚨 Build GenAI you can trust. ⭐️ Parlant is your open-source engine for controlled, compliant, and purposeful AI conversations — Star Parlant on GitHub! #sampling #without #data #now #scalable

WWW.MARKTECHPOST.COM

Sampling Without Data is Now Scalable: Meta AI Releases Adjoint Sampling for Reward-Driven Generative Modeling

Data Scarcity in Generative Modeling Generative models traditionally rely on large, high-quality datasets to produce samples that replicate the underlying data distribution. However, in fields like molecular modeling or physics-based inference, acquiring such data can be computationally infeasible or even impossible. Instead of labeled data, only a scalar reward—typically derived from a complex energy function—is available to judge the quality of generated samples. This presents a significant challenge: how can one train generative models effectively without direct supervision from data? Meta AI tackles this challenge with Adjoint Sampling, a novel learning algorithm designed for training generative models using only scalar reward signals. Built on the theoretical framework of stochastic optimal control (SOC), Adjoint Sampling reframes the training process as an optimization task over a controlled diffusion process. Unlike standard generative models, it does not require explicit data. Instead, it learns to generate high-quality samples by iteratively refining them using a reward function—often derived from physical or chemical energy models. Adjoint Sampling excels in scenarios where only an unnormalized energy function is accessible. It produces samples that align with the target distribution defined by this energy, bypassing the need for corrective methods like importance sampling or MCMC, which are computationally intensive. Source: https://arxiv.org/abs/2504.11713 Technical Details The foundation of Adjoint Sampling is a stochastic differential equation (SDE) that models how sample trajectories evolve. The algorithm learns a control drift u(x,t)u(x, t)u(x,t) such that the final state of these trajectories approximates a desired distribution (e.g., Boltzmann). A key innovation is its use of Reciprocal Adjoint Matching (RAM)—a loss function that enables gradient-based updates using only the initial and final states of sample trajectories. This sidesteps the need to backpropagate through the entire diffusion path, greatly improving computational efficiency. By sampling from a known base process and conditioning on terminal states, Adjoint Sampling constructs a replay buffer of samples and gradients, allowing multiple optimization steps per sample. This on-policy training method provides scalability unmatched by previous approaches, making it suitable for high-dimensional problems like molecular conformer generation. Moreover, Adjoint Sampling supports geometric symmetries and periodic boundary conditions, enabling models to respect molecular invariances like rotation, translation, and torsion. These features are crucial for physically meaningful generative tasks in chemistry and physics. Performance Insights and Benchmark Results Adjoint Sampling achieves state-of-the-art results in both synthetic and real-world tasks. On synthetic benchmarks such as the Double-Well (DW-4), Lennard-Jones (LJ-13 and LJ-55) potentials, it significantly outperforms baselines like DDS and PIS, especially in energy efficiency. For example, where DDS and PIS require 1000 evaluations per gradient update, Adjoint Sampling only uses three, with similar or better performance in Wasserstein distance and effective sample size (ESS). In a practical setting, the algorithm was evaluated on large-scale molecular conformer generation using the eSEN energy model trained on the SPICE-MACE-OFF dataset. Adjoint Sampling, especially its Cartesian variant with pretraining, achieved up to 96.4% recall and 0.60 Å mean RMSD, surpassing RDKit ETKDG—a widely used chemistry-based baseline—across all metrics. The method generalizes well to the GEOM-DRUGS dataset, showing substantial improvements in recall while maintaining competitive precision. The algorithm’s ability to explore the configuration space broadly, aided by its stochastic initialization and reward-based learning, results in greater conformer diversity—critical for drug discovery and molecular design. Conclusion: A Scalable Path Forward for Reward-Driven Generative Models Adjoint Sampling represents a major step forward in generative modeling without data. By leveraging scalar reward signals and an efficient on-policy training method grounded in stochastic control, it enables scalable training of diffusion-based samplers with minimal energy evaluations. Its integration of geometric symmetries and its ability to generalize across diverse molecular structures position it as a foundational tool in computational chemistry and beyond. Check out the Paper, Model on Hugging Face and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. Asif RazzaqWebsite | + postsBioAsif Razzaq is the CEO of Marktechpost Media Inc.. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts of over 2 million monthly views, illustrating its popularity among audiences.Asif Razzaqhttps://www.marktechpost.com/author/6flvq/Google AI Releases MedGemma: An Open Suite of Models Trained for Performance on Medical Text and Image ComprehensionAsif Razzaqhttps://www.marktechpost.com/author/6flvq/NVIDIA Releases Cosmos-Reason1: A Suite of AI Models Advancing Physical Common Sense and Embodied Reasoning in Real-World EnvironmentsAsif Razzaqhttps://www.marktechpost.com/author/6flvq/A Step-by-Step Coding Guide to Efficiently Fine-Tune Qwen3-14B Using Unsloth AI on Google Colab with Mixed Datasets and LoRA OptimizationAsif Razzaqhttps://www.marktechpost.com/author/6flvq/Chain-of-Thought May Not Be a Window into AI’s Reasoning: Anthropic’s New Study Reveals Hidden Gaps 🚨 Build GenAI you can trust. ⭐️ Parlant is your open-source engine for controlled, compliant, and purposeful AI conversations — Star Parlant on GitHub! (Promoted)

·312 Ansichten

Please log in to like, share and comment!

Login

Sprachen

Apple and Duke Researchers Present a Reinforcement Learning Approach That Enables LLMs to Provide Intermediate Answers, Enhancing Speed and Accuracy

Qwen Researchers Proposes QwenLong-L1: A Reinforcement Learning Framework for Long-Context Reasoning in Large Language Models

This AI Paper Introduces GRIT: A Method for Teaching MLLMs to Reason with Images by Interleaving Text and Visual Grounding

Optimizing Assembly Code with LLMs: Reinforcement Learning Outperforms Traditional Compilers

Researchers Introduce MMLONGBENCH: A Comprehensive Benchmark for Long-Context Vision-Language Models

Technology Innovation Institute TII Releases Falcon-H1: Hybrid Transformer-SSM Language Models for Scalable, Multilingual, and Long-Context Understanding

Meta Researchers Introduced J1: A Reinforcement Learning Framework That Trains Language Models to Judge With Reasoned Consistency and Minimal Data

Meta launches program to encourage startups to use its Llama AI models

Sampling Without Data is Now Scalable: Meta AI Releases Adjoint Sampling for Reward-Driven Generative Modeling