Large language models (LLMs) such as DeepSeek-R1, Kimi-K1.5, and OpenAI-o1 have made significant strides in the post-training phase, showing impressive reasoning capabilities. While DeepSeek-R1 provides open-source model weights, it withholds training code and dataset details, raising questions about how to scale reasoning abilities to smaller models, how to structure training data optimally, and how to replicate the results reliably. Traditional mathematics datasets like GSM8K and Omni-MATH present inconsistent difficulty levels with varying logical depths, complicating controlled experimentation. Targeted datasets with controllable complexity have therefore become critical for isolating variables and studying the emergence of reasoning capabilities in LLMs.

LLMs' reasoning capabilities have been advanced through various techniques, with Chain-of-Thought (CoT) reasoning playing a crucial role in breaking down complex problems into manageable steps. Monte Carlo Tree Search (MCTS), initially successful in AlphaGo, has been adapted to guide model-based planning by balancing exploration and exploitation through tree-based search and random sampling. Further, post-training strategies to enhance reasoning include additional fine-tuning or reinforcement learning (RL) on specialized datasets. Methods such as Direct Preference Optimization (DPO), Proximal Policy Optimization (PPO), Group Relative Policy Optimization (GRPO), and REINFORCE++ are showing promise, forming a frontier for advancing model reasoning alongside test-time scaling methods.

Researchers from Microsoft Research Asia and Ubiquant, together with independent collaborators, have proposed Logic-RL, a rule-based RL framework that acquires reasoning patterns similar to DeepSeek-R1 through training on logic puzzles. It adopts the REINFORCE++ algorithm and the reward designs from DeepSeek-R1 for post-training. As training progresses, the model naturally allocates more computational steps to reasoning, expanding from generating hundreds of tokens to thousands, which enables deeper exploration and refinement of its thought processes. Using only 5K generated logic puzzles, the 7B model shows cross-domain generalization, improving by 125% on AIME and 38% on AMC over the base model. This suggests that RL-trained reasoning develops abstract problem-solving patterns rather than domain-specific pattern matching.

The researchers faced challenges with Qwen2.5-Math-7B's tendency to generate Python code blocks that conflict with the formatting requirements. Testing both Qwen2.5-7B-Base and Qwen2.5-7B-Instruct reveals nearly identical training metrics during RL training, including validation accuracy, response-length growth curves, and reward curves. The implementation shows dramatic improvements in reasoning capability, with average output length increasing from roughly 500 tokens to approximately 2,000 tokens after just 1,000 RL training steps. This enables the emergence of more complex behaviors, such as reflection and exploration of alternative solutions; these behaviors significantly enhance the model's ability to handle complex tasks and closely match the results reported for DeepSeek-R1.

The results demonstrate that while PPO achieves significant advantages in accuracy and reward, it is 138% slower than REINFORCE++ in training speed. REINFORCE++ shows superior stability, performance gains, and training efficiency compared to GRPO, outperforming it across nearly all metrics. GRPO exhibits the weakest performance of the three RL algorithms evaluated.
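To make the rule-based reward design described above more concrete, here is a minimal sketch in Python. The <think>/<answer> tag format follows the DeepSeek-R1-style reward design the article mentions, but the regular expressions, the score values, and the toy Knights-and-Knaves answer check are illustrative assumptions rather than the authors' exact implementation.

```python
import re

def format_reward(response: str) -> float:
    """Reward the DeepSeek-R1-style output format: reasoning inside
    <think>...</think> followed by a final <answer>...</answer> block.
    The +1/-1 values are illustrative, not the paper's exact constants."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>\s*$"
    return 1.0 if re.match(pattern, response.strip(), flags=re.DOTALL) else -1.0

def answer_reward(response: str, gold: dict) -> float:
    """Rule-based correctness check for a Knights-and-Knaves style puzzle:
    every character's role from the gold solution must appear in the answer."""
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    if match is None:
        return -2.0  # unparseable answers are penalized
    answer_text = match.group(1).lower()
    correct = all(f"{name.lower()} is a {role.lower()}" in answer_text
                  for name, role in gold.items())
    return 2.0 if correct else -1.5

def rule_based_reward(response: str, gold: dict) -> float:
    """Total reward = format reward + answer reward; no learned reward model."""
    return format_reward(response) + answer_reward(response, gold)

# Toy example: a two-person puzzle whose gold solution is A = knight, B = knave.
gold_solution = {"A": "knight", "B": "knave"}
sample = ("<think>If A tells the truth, then B must be lying...</think>"
          "<answer>A is a knight, and B is a knave.</answer>")
print(rule_based_reward(sample, gold_solution))  # 3.0 under these toy values
```

The appeal of a fully programmatic reward like this is that it is verifiable and requires no learned reward model, which is what makes controlled experiments on generated logic puzzles practical.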
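The algorithm comparison above largely comes down to how each method estimates advantages without a learned critic. The sketch below contrasts GRPO's per-prompt group normalization with a simplified global-batch baseline in the spirit of REINFORCE++; the toy rewards are invented, and REINFORCE++'s token-level KL penalty and PPO-style clipping are deliberately omitted, so this illustrates the baselines only, not either algorithm in full.

```python
import numpy as np

def grpo_advantages(rewards_per_prompt):
    """GRPO-style advantages: each sampled response is normalized against the
    other samples for the SAME prompt (per-group mean/std)."""
    advantages = []
    for group in rewards_per_prompt:            # rewards of G samples for one prompt
        group = np.asarray(group, dtype=float)
        advantages.append((group - group.mean()) / (group.std() + 1e-8))
    return advantages

def global_baseline_advantages(rewards_per_prompt):
    """Simplified REINFORCE++-flavored advantages: one mean/std baseline over
    ALL responses in the batch, regardless of which prompt they answer."""
    flat = np.asarray([r for group in rewards_per_prompt for r in group], dtype=float)
    normalized = (flat - flat.mean()) / (flat.std() + 1e-8)
    out, i = [], 0
    for group in rewards_per_prompt:            # reshape back into per-prompt groups
        out.append(normalized[i:i + len(group)])
        i += len(group)
    return out

# Toy batch: two prompts, four sampled responses each, scored by a rule-based reward.
batch = [[3.0, -1.0, 2.0, -2.5],   # prompt 1: mixed outcomes
         [0.5, 0.5, 0.5, 0.5]]     # prompt 2: all samples earn the same reward
print(grpo_advantages(batch))
print(global_baseline_advantages(batch))
```

One visible difference in the output: when every sample for a prompt earns the same reward (prompt 2), the group-relative advantages collapse to zero and provide no learning signal for that prompt, whereas the batch-level baseline still assigns nonzero credit.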
The model's "super OOD" (out-of-distribution) generalization capability proves exceptionally strong, with an overall improvement of 125% on the AIME dataset and 38% on the AMC dataset. This synchronized improvement indicates that the RL process both enhances in-distribution performance and facilitates the emergence of robust, transferable reasoning strategies.

This study shows the significant potential of Logic-RL for developing complex reasoning skills in language models through a rule-based RL framework. However, it is important to acknowledge that the findings are based on a relatively small-scale logic dataset, which may limit their applicability. Whether these results generalize to large-scale, real-world mathematical or coding scenarios remains an open question that requires further investigation. Future research should focus on extending the approach to more diverse and complex datasets to thoroughly validate its effectiveness and robustness across domains and problem types. By maintaining this work as an open research project, the researchers aim to benefit the broader scientific community.