• ByteDance Researchers Introduce DetailFlow: A 1D Coarse-to-Fine Autoregressive Framework for Faster, Token-Efficient Image Generation

    Autoregressive image generation has been shaped by advances in sequential modeling, originally seen in natural language processing. This field focuses on generating images one token at a time, similar to how sentences are constructed in language models. The appeal of this approach lies in its ability to maintain structural coherence across the image while allowing for high levels of control during the generation process. As researchers began to apply these techniques to visual data, they found that structured prediction not only preserved spatial integrity but also supported tasks like image manipulation and multimodal translation effectively.
    Despite these benefits, generating high-resolution images remains computationally expensive and slow. A primary issue is the number of tokens needed to represent complex visuals. Raster-scan methods that flatten 2D images into linear sequences require thousands of tokens for detailed images, resulting in long inference times and high memory consumption. Models like Infinity need over 10,000 tokens for a 1024×1024 image. This becomes unsustainable for real-time applications or when scaling to more extensive datasets. Reducing the token burden while preserving or improving output quality has become a pressing challenge.

    Efforts to mitigate token inflation have led to innovations like next-scale prediction seen in VAR and FlexVAR. These models create images by predicting progressively finer scales, which imitates the human tendency to sketch rough outlines before adding detail. However, they still rely on hundreds of tokens—680 in the case of VAR and FlexVAR for 256×256 images. Moreover, approaches like TiTok and FlexTok use 1D tokenization to compress spatial redundancy, but they often fail to scale efficiently. For example, FlexTok’s gFID increases from 1.9 at 32 tokens to 2.5 at 256 tokens, highlighting a degradation in output quality as the token count grows.
    Researchers from ByteDance introduced DetailFlow, a 1D autoregressive image generation framework. This method arranges token sequences from global to fine detail using a process called next-detail prediction. Unlike traditional 2D raster-scan or scale-based techniques, DetailFlow employs a 1D tokenizer trained on progressively degraded images. This design allows the model to prioritize foundational image structures before refining visual details. By mapping tokens directly to resolution levels, DetailFlow significantly reduces token requirements, enabling images to be generated in a semantically ordered, coarse-to-fine manner.

    The mechanism in DetailFlow centers on a 1D latent space where each token contributes incrementally more detail. Earlier tokens encode global features, while later tokens refine specific visual aspects. To train this, the researchers created a resolution mapping function that links token count to target resolution. During training, the model is exposed to images of varying quality levels and learns to predict progressively higher-resolution outputs as more tokens are introduced. It also implements parallel token prediction by grouping sequences and predicting entire sets at once. Since parallel prediction can introduce sampling errors, a self-correction mechanism was integrated. This system perturbs certain tokens during training and teaches subsequent tokens to compensate, ensuring that final images maintain structural and visual integrity.
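    To make the coarse-to-fine idea concrete, here is a minimal Python sketch of next-detail decoding: tokens are predicted in parallel groups, and the length of the decoded prefix determines the reconstruction resolution. The square-root mapping and the names model.predict_group and tokenizer.decode are illustrative assumptions, not the paper's actual implementation.

    import math

    def resolution_for_tokens(n_tokens, total_tokens=128, min_res=16, max_res=256):
        # Illustrative mapping: resolution grows roughly with the square root of
        # the decoded token count (pixel count scales with resolution squared).
        # The paper's actual mapping function may differ.
        frac = n_tokens / total_tokens
        return int(min_res + (max_res - min_res) * math.sqrt(frac))

    def generate_image(model, tokenizer, total_tokens=128, group_size=16):
        # Coarse-to-fine decoding: predict tokens in parallel groups; after each
        # group the 1D tokenizer can already decode a lower-resolution image.
        # `model.predict_group` and `tokenizer.decode` are hypothetical names.
        tokens, image = [], None
        while len(tokens) < total_tokens:
            tokens.extend(model.predict_group(tokens, group_size))
            target_res = resolution_for_tokens(len(tokens), total_tokens)
            image = tokenizer.decode(tokens, target_resolution=target_res)
        return image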
    The results from the experiments on the ImageNet 256×256 benchmark were noteworthy. DetailFlow achieved a gFID score of 2.96 using only 128 tokens, outperforming VAR at 3.3 and FlexVAR at 3.05, both of which used 680 tokens. Even more impressive, DetailFlow-64 reached a gFID of 2.62 using 512 tokens. In terms of speed, it delivered nearly double the inference rate of VAR and FlexVAR. A further ablation study confirmed that the self-correction training and semantic ordering of tokens substantially improved output quality. For example, enabling self-correction dropped the gFID from 4.11 to 3.68 in one setting. These metrics demonstrate both higher quality and faster generation compared to established models.

    By focusing on semantic structure and reducing redundancy, DetailFlow presents a viable solution to long-standing issues in autoregressive image generation. The method’s coarse-to-fine approach, efficient parallel decoding, and ability to self-correct highlight how architectural innovations can address performance and scalability limitations. Through their structured use of 1D tokens, the researchers from ByteDance have demonstrated a model that maintains high image fidelity while significantly reducing computational load, making it a valuable addition to image synthesis research.

    Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.
  • Beyond Aha Moments: Structuring Reasoning in Large Language Models

    Large Reasoning Models (LRMs) like OpenAI’s o1 and o3, DeepSeek-R1, Grok 3.5, and Gemini 2.5 Pro have shown strong capabilities in long chain-of-thought (CoT) reasoning, often displaying advanced behaviors such as self-correction, backtracking, and verification—collectively known as “aha moments.” These behaviors have been observed to emerge through outcome-driven RL without the need for supervised fine-tuning. Models like DeepSeek-R1 and its open-source replications (e.g., TinyZero and Logic-RL) have demonstrated that carefully designed RL pipelines—using rule-based rewards, curriculum learning, and structured training—can induce such reflective reasoning abilities. However, these emergent behaviors tend to be unpredictable and inconsistent, limiting their practical reliability and scalability.
    To address this, researchers have explored structured RL frameworks that target specific reasoning types, such as deduction, abduction, and induction. These approaches involve aligning specialist models, merging them in parameter space, and applying domain-specific continual RL. Tools like Logic-RL use rule-conditioned RL to solve logic puzzles, improving transferability to tasks like math reasoning. Meanwhile, other works propose mechanisms to enhance reasoning robustness, such as training models to reason both forwards and backwards, or iteratively self-critiquing their outputs. Studies analyzing “aha moments” suggest that these behaviors stem from internal shifts in uncertainty, latent representation, and self-assessment, offering new insights into engineering more reliable reasoning models. 
    Researchers from the National University of Singapore, Tsinghua University, and Salesforce AI Research address the limitations of relying on spontaneous “aha moments” in large language models by explicitly aligning them with three core reasoning abilities: deduction, induction, and abduction. They introduce a three-stage pipeline—individual meta-ability alignment, parameter-space merging, and domain-specific reinforcement learning—significantly enhancing model performance. Using a programmatically generated, self-verifiable task suite, their approach boosts accuracy over instruction-tuned baselines by over 10%, with further gains from domain-specific RL. This structured alignment framework offers a scalable, generalizable method for improving reasoning across math, coding, and science domains. 
    The researchers designed tasks aligned with deduction, induction, and abduction by using a structured “given two, infer the third” format based on hypothesis (H), rule (R), and observation (O). Deduction is framed as satisfiability checking, induction as masked-sequence prediction, and abduction as reverse rule-graph inference. These tasks are synthetically generated and automatically verified. The training pipeline includes three stages: (A) independently training models for each reasoning type using REINFORCE++ with structured rewards, (B) merging models through weighted parameter interpolation, and (C) fine-tuning the unified model on domain-specific data via reinforcement learning, isolating the benefit of meta-ability alignment.
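    Stage (B), parameter-space merging, amounts to a weighted average of the specialists' weights. Below is a minimal PyTorch-style sketch; the interpolation coefficients are placeholders, not the values used in the paper.

    def merge_specialists(state_dicts, weights):
        # Weighted parameter-space interpolation of specialist checkpoints
        # (deduction, induction, abduction). Weights should sum to 1; the
        # example coefficients below are placeholders, not the paper's values.
        merged = {}
        for name in state_dicts[0]:
            merged[name] = sum(w * sd[name].float() for w, sd in zip(weights, state_dicts))
        return merged

    # merged_sd = merge_specialists([deduction_sd, induction_sd, abduction_sd],
    #                               weights=[1/3, 1/3, 1/3])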
    The study evaluates models aligned with meta-abilities—deduction, induction, and abduction—using a curriculum learning setup across difficulty levels. Models trained on synthetic tasks strongly generalize to seven unseen math, code, and science benchmarks. At both 7B and 32B scales, meta-ability-aligned and merged models consistently outperform instruction-tuned baselines, with the merged model offering the highest gains. Continued domain-specific RL from these merged checkpoints (Domain-RL-Meta) leads to further improvements over standard RL finetuning (Domain-RL-Ins), especially in math benchmarks. Overall, the alignment strategy enhances reasoning abilities, and its benefits scale with model size, significantly boosting performance ceilings across tasks.

    In conclusion, the study shows that large reasoning models can develop advanced problem-solving skills without depending on unpredictable “aha moments.” By aligning models with three core reasoning abilities—deduction, induction, and abduction—using self-verifiable tasks, the authors create specialist agents that can be effectively combined into a single model. This merged model outperforms instruction-tuned baselines by over 10% on diagnostic tasks and up to 2% on real-world benchmarks. When used as a starting point for domain-specific reinforcement learning, it raises performance by another 4%. This modular, systematic training approach offers a scalable and controllable foundation for building reliable, interpretable reasoning systems. 

    Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.
  • Boost 2-Bit LLM Accuracy with EoRA

    Quantization is one of the key techniques for reducing the memory footprint of large language models (LLMs). It works by converting the data type of model parameters from higher-precision formats such as 32-bit floating point (FP32) or 16-bit floating point (FP16/BF16) to lower-precision integer formats, typically INT8 or INT4. For example, quantizing a model to 4-bit means each parameter uses only 0.5 bytes, compared to 4 bytes in FP32.

    Post-training quantization methods like GPTQ and AWQ can dramatically reduce the size of large models. A model like Llama 3 with 70 billion parameters can occupy around 140 GB in FP16, but this can be reduced to approximately 40 GB using 4-bit quantization, while still maintaining strong performance on downstream tasks.
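    As a rough back-of-the-envelope check of those numbers (weights only, ignoring quantization metadata such as scales and zero points):

    params = 70e9  # Llama 3 70B
    for fmt, bytes_per_param in {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}.items():
        print(f"{fmt}: ~{params * bytes_per_param / 1e9:.0f} GB")
    # FP16: ~140 GB, INT8: ~70 GB, INT4: ~35 GB (plus a few GB of quantization metadata)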

    However, despite this substantial reduction, such models still exceed the memory capacity of most consumer-grade GPUs, which typically offer 24 GB to 32 GB of VRAM. To make these models truly accessible, quantization to even lower bitwidths, such as 2-bit, is required. While recent advances in low-bit quantization are promising, achieving stable and accurate 2-bit quantization remains a significant challenge.

    In this article, we review a technique called EoRA that helps compensate for quantization-induced errors. EoRA is a training-free method, meaning it can be applied quickly and efficiently to any model, even the largest ones. We’ll check how EoRA works and demonstrate how it can significantly improve the performance of 2-bit quantized models, bringing them close to the accuracy of their full-precision counterparts while being up to 5.5x smaller.

    We’ll analyze experimental results obtained using large models such as Qwen3-32B and Qwen2.5-72B, both quantized to 2-bit using state-of-the-art quantization techniques, to illustrate the effectiveness of EoRA.

    Diving into the Eigenspace in Search of an Adapter

    Post-training quantization or, more generally, compression aims to reduce model size or inference cost by minimizing the output difference between the original weights Wl and the compressed weights Ŵl, using only a small calibration dataset.

    Most quantization methods are framed layer-wise, but the choice of compression formats is rigid and limits flexibility across diverse deployment needs.

    To bypass format constraints and improve accuracy, previous work, such as QLoRA [1] and HQQ+ [2], directly fine-tuned a LoRA adapter on top of the frozen quantized models.

    It is also possible to reframe compression as a compensation problem: given a compressed model, introduce low-rank residual paths that specifically correct compression errors.

    A straightforward method uses SVD to decompose the compression error:

    \[\Delta W_l = W_l - \hat{W}_l\]

    into

    \[U_l \Sigma_l V_l^T\]

    forming low-rank approximations via two matrices:

    \[B_l = U_l \Sigma_l\]
    \[A_l = V_l^T\]

    where Al and Bl are the standard tensors of a LoRA adapter.
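    In PyTorch, this plain-SVD baseline is just a few lines (a sketch under the layer-wise setup above, not the paper's implementation):

    import torch

    def svd_compensation(W, W_hat, rank):
        # Plain-SVD baseline: approximate the compression error dW = W - W_hat
        # with a rank-r LoRA-style pair (B, A) so that W ≈ W_hat + B @ A.
        dW = (W - W_hat).float()
        U, S, Vh = torch.linalg.svd(dW, full_matrices=False)
        B = U[:, :rank] * S[:rank]   # B_l = U_l Σ_l (truncated)
        A = Vh[:rank, :]             # A_l = V_l^T (truncated)
        return B, A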

    However, plain SVD has two limitations: it does not minimize the original layerwise compression loss directly, and it allocates capacity uniformly across all error components, ignoring the varying importance of different parts of the model.

    To address this, NVIDIA proposes EoRA [3].

    EoRA: Training-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation

    EoRA first projects the compression error into the eigenspace defined by the input activation covariance:

    \[\tilde{X} \tilde{X}^T\]

    where X̃ is the average activation over the calibration set. Then, by performing eigendecomposition, we get:

    \[\tilde{X} \tilde{X}^T = Q \Lambda Q^T\]

    The compression error ΔW is projected as:

    \[\Delta W' = \Delta W Q'\]

    where Q′ = QΛ. Then SVD is applied on ΔW′ to produce a low-rank approximation, and the result is projected back to the original space, adjusting the low-rank factors accordingly.

    This eigenspace projection changes the optimization objective: it weights the importance of different error components according to their contribution to the layerwise output, making the approximation more efficient. It can be computed quickly without any training, requires only calibration activations, and does not introduce extra inference latency. Moreover, the derivation shows that this approach leads to a direct minimization of the layerwise compression loss, not just the raw weight error.

    Analytically, truncating a singular value in the projected space corresponds to minimizing the true compression error under reasonable assumptions about the calibration activations.
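    The following PyTorch sketch captures the spirit of this procedure: project the error with Q′ = QΛ, truncate via SVD, and map the factors back. It is an illustration under the simplifying assumptions above, not NVIDIA's implementation.

    import torch

    def eora_style_compensation(W, W_hat, X_tilde, rank, eps=1e-6):
        # Eigenspace-projected error compensation in the spirit of EoRA
        # (a sketch, not NVIDIA's implementation). X_tilde: (d_in, n) calibration
        # activations for this layer; W, W_hat: (d_out, d_in) weight matrices.
        dW = (W - W_hat).float()
        cov = X_tilde.float() @ X_tilde.float().T        # input activation covariance
        eigvals, Q = torch.linalg.eigh(cov)              # cov = Q Λ Q^T
        Qp = Q * eigvals.clamp_min(eps)                  # Q' = Q Λ (clamped for stability)
        U, S, Vh = torch.linalg.svd(dW @ Qp, full_matrices=False)  # SVD on ΔW' = ΔW Q'
        B = U[:, :rank] * S[:rank]
        A = Vh[:rank, :] @ torch.linalg.inv(Qp)          # project back: W ≈ W_hat + B @ A
        return B, A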

    In their paper, NVIDIA presents a wide range of strong results showing that EoRA can significantly boost the accuracy of quantized models. However, their experiments focus mostly on older quantization methods like GPTQ and are limited to mid-sized LLMs of up to 13B parameters, at 3-bit and 4-bit precisions.

    This leaves an open question: can EoRA still be effective for much larger models, using more modern quantization techniques, and even pushing down to 2-bit precision?

    Let’s find out.

    Calibrating an EoRA Adapter

    Suppose we have quantized models that show significantly degraded performance compared to their full-precision counterparts on certain tasks. Our goal is to reduce this performance gap using EoRA.

    For the experiments, I used Qwen2.5-72B Instruct and Qwen3-32B, both quantized to 2-bit using AutoRound, a state-of-the-art quantization algorithm developed by Intel. AutoRound leverages SignSGD optimization to fine-tune quantization, and is particularly effective for low-bit settings.

    All the models I made are available here:

    Quantized Qwen3

    Quantized Qwen2.5

    The 2-bit models were quantized with a group size of 32, except for one, which used a group size of 128. A larger group size reduces model size by storing less quantization metadata, but it introduces greater quantization error.
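    To see why the group size matters for both size and accuracy, note that each group stores its own quantization metadata, so the amortized cost per weight shrinks as the group grows. A rough estimate, assuming a single 16-bit scale per group and ignoring zero points:

    def effective_bits_per_weight(weight_bits=2, group_size=32, scale_bits=16):
        # Each group stores its own scale, amortized over group_size weights
        # (zero points and other metadata ignored for simplicity).
        return weight_bits + scale_bits / group_size

    print(effective_bits_per_weight(group_size=32))   # 2.5 bits per weight
    print(effective_bits_per_weight(group_size=128))  # 2.125 bits per weight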

    I evaluated the models on IFEval, a benchmark that measures instruction-following capabilities. Results showed a noticeable drop in performance for the quantized versions.

    Image by the author

    To compensate for this degradation, I applied an EoRA adapter using the implementation provided in the GPTQModel library. The integration is straightforward. If you’re curious about how it’s implemented in PyTorch, the codebase is compact, clean, and easy to follow:

    GPTQModel’s EoRA implementation: eora.py

    EoRA requires a calibration dataset. Ideally, this dataset should reflect the model’s intended use case. However, since we don’t have a specific target task in this context and aim to preserve the model’s general capabilities, I used 1,024 randomly sampled examples from the C4 dataset.

    Another key parameter is the LoRA rank, which greatly influences the effectiveness of the EoRA adapter. Its optimal value depends on the model architecture, the target task, and the calibration data. A higher rank may yield better performance but risks overfitting to the calibration set. It also increases the size of the adapter, counterproductive when the overall goal of quantization is to reduce memory usage. Conversely, a lower rank keeps the adapter lightweight but might not capture enough information to effectively compensate for quantization errors.

    In my experiments, I tested LoRA ranks of 32, 64, and 256.

    Below is the code used to create the EoRA adapter with GPTQModel:

    from gptqmodel import GPTQModel
    from gptqmodel.adapter.adapter import Lora
    from datasets import load_dataset

    # 1,024 randomly sampled C4 examples (the exact C4 file below is an assumption)
    calibration_dataset = load_dataset(
        "allenai/c4", data_files="en/c4-train.00001-of-01024.json.gz", split="train"
    ).select(range(1024))

    eora_adapter_path = "Qwen3-32B-autoround-2bit-gptq-r256"
    model_path = "kaitchup/Qwen3-32B-autoround-2bit-gptq"

    # Rank-256 EoRA adapter; argument names follow GPTQModel's EoRA example
    # and may differ slightly across library versions
    eora = Lora(rank=256, path=eora_adapter_path)
    GPTQModel.adapter.generate(
        adapter=eora, model_id_or_path="Qwen/Qwen3-32B",  # full-precision base model
        quantized_model_id_or_path=model_path,
        calibration_dataset=calibration_dataset)

    Using an NVIDIA A100 GPU on RunPod, it took approximately 4 hours to generate the EoRA adapter for the model Qwen3-32B-autoround-2bit-gptq.

    All EoRA adapters created for these models are publicly available:

    EoRA Adapters for Qwen2.5 and Qwen3

    Evaluating EoRA Adapters for 2-bit LLMs

    Let’s evaluate the effect of the EoRA adapters. Do they improve the accuracy of the 2-bit models?

    Image by the author

    It works!

    The improvements are particularly notable for Qwen3-14B and Qwen3-32B. For instance, applying EoRA to Qwen3-32B, quantized to 2-bit with a group size of 128, resulted in an accuracy gain of nearly 7.5 points. Increasing the LoRA rank, from 32 to 64, also led to improvements, highlighting the impact of rank on performance.

    EoRA is also effective on larger models like Qwen2.5-72B, though the gains are more modest. Lower-rank adapters showed little to no benefit on this model; it wasn’t until I increased the rank to 256 that significant improvements began to appear.

    Memory Consumption of EoRA

    Using the EoRA adapter during inference results in the following increase in memory consumption:

    Image by the author

    The overhead is generally negligible. For instance, for 2-bit Qwen3-14B, the adapters add only 257 MB and 514 MB to the total model size with ranks of 32 and 64, respectively. With larger ranks, using an EoRA adapter becomes questionable, as the total memory consumption may surpass that of the same model quantized at a higher precision. For instance, 2-bit Qwen2.5 72B with an EoRA adapter of rank 256 is larger than 3-bit Qwen2.5 72B.

    Note: This estimate includes only the memory consumed by the adapter’s parameters. For completeness, we could also account for the memory used by adapter activations during inference. However, these are extremely small relative to other tensors and can safely be considered negligible.
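    For a quick estimate of this overhead, the adapter adds roughly rank × (d_in + d_out) parameters per adapted linear layer, stored in 16-bit. A small helper, assuming FP16/BF16 storage:

    def adapter_size_mb(layer_shapes, rank, bytes_per_param=2):
        # Approximate adapter footprint: rank * (d_in + d_out) parameters per
        # adapted linear layer, stored in FP16/BF16 (2 bytes per parameter).
        n_params = sum(rank * (d_in + d_out) for d_out, d_in in layer_shapes)
        return n_params * bytes_per_param / 1e6

    # e.g., adapter_size_mb(shapes_of_all_adapted_projections, rank=32)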

    Conclusion

    EoRA works. We’ve confirmed that it’s a simple yet effective method for compensating quantization errors, even at 2-bit precision. It’s intuitive, training-free, and delivers meaningful performance gains. That said, there are a few trade-offs to consider:

    Rank search: Finding the optimal LoRA rank requires experimentation. It’s difficult to predict in advance whether a rank of 32 will be sufficient or whether a higher rank, like 256, will cause overfitting. The optimal value depends on the model, calibration data, and target task.

    Increased memory consumption: The goal of quantization is to reduce memory usage, often for highly constrained environments. While EoRA adapters are relatively lightweight at lower ranks, they do slightly increase memory consumption, particularly at higher ranks, reducing the overall efficiency of 2-bit quantization.

    Looking ahead, NVIDIA’s paper also demonstrates that EoRA adapters make excellent starting points for QLoRA fine-tuning. In other words, if you plan to fine-tune a 2-bit model using QLoRA, initializing from an EoRA-adapted model can lead to better results with less training effort. I wrote about fine-tuning adapters for GPTQ models last year, in my newsletter:

    QLoRA with AutoRound: Cheaper and Better LLM Fine-tuning on Your GPU

    The main difference is that instead of initializing the adapter from scratch, we would load the EoRA adapter. This adapter will be fine-tuned.

    References
    [1] Dettmers et al., QLoRA: Efficient Finetuning of Quantized LLMs, arXiv
    [2] Badri and Shaji, Towards 1-bit Machine Learning Models, Mobius Labs’ Blog
    [3] Liu et al., EoRA: Training-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation, arXiv
    The post Boost 2-Bit LLM Accuracy with EoRA appeared first on Towards Data Science.
    #boost #2bit #llm #accuracy #with
    Boost 2-Bit LLM Accuracy with EoRA
    Quantization is one of the key techniques for reducing the memory footprint of large language models. It works by converting the data type of model parameters from higher-precision formats such as 32-bit floating pointor 16-bit floating pointto lower-precision integer formats, typically INT8 or INT4. For example, quantizing a model to 4-bit means each parameter uses only 0.5 bytes, compared to 4 bytes in FP32. Post-training quantization methods like GPTQ and AWQ can dramatically reduce the size of large models. A model like Llama 3 with 70 billion parameters can occupy around 140 GB in FP16, but this can be reduced to approximately 40 GB using 4-bit quantization, while still maintaining strong performance on downstream tasks. However, despite this substantial reduction, such models still exceed the memory capacity of most consumer-grade GPUs, which typically offer 24 GB to 32 GB of VRAM. To make these models truly accessible, quantization to even lower bitwidths, such as 2-bit, is required. While recent advances in low-bit quantization are promising, achieving stable and accurate 2-bit quantization remains a significant challenge. In this article, we review a technique called EoRA that helps compensate for quantization-induced errors. EoRA is a training-free method, meaning it can be applied quickly and efficiently to any model, even the largest ones. We’ll check how EoRA works and demonstrate how it can significantly improve the performance of 2-bit quantized models, bringing them close to the accuracy of their full-precision counterparts while being up to 5.5x smaller. We’ll analyze experimental results obtained using large models such as Qwen3-32B and Qwen2.5-72B, both quantized to 2-bit using state-of-the-art quantization techniques, to illustrate the effectiveness of EoRA. Diving into the Eigenspace in Search of an Adapter Post-training quantization or, more generally, compression aims to reduce model size or inference cost by minimizing the output difference between the original weights Wl​ and compressed weights Ŵl  using only a small calibration dataset. Most quantization methods are framed layer-wise, but the choice of compression formats is rigid and limits flexibility across diverse deployment needs. To bypass format constraints and improve accuracy, previous work, such as QLoRAand HQQ+, directly fine-tuned a Lora adapter on top of the frozen quantized models. It is also possible to reframe compression as a compensation problem: given a compressed model, introduce low-rank residual paths that specifically correct compression errors. A straightforward method uses SVD to decompose the compression error: \into \forming low-rank approximations via two matrices: \\where Al and Bl are the standard tensors of a LoRA adapter. However, plain SVD has two limitations: it does not minimize the original layerwise compression loss directly, and it allocates capacity uniformly across all error components, ignoring the varying importance of different parts of the model. To address this, NVIDIA proposes EoRA. EoRA: Training-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation EoRA first projects the compression error into the eigenspace defined by the input activation covariance: \where X̃ is the average activation over the calibration set. Then, by performing eigendecomposition, we get: \The compression error ΔW is projected as: \where Q′=QΛ. 
Then SVD is applied on ΔW′ to produce a low-rank approximation, and the result is projected back to the original space, adjusting the low-rank factors accordingly. This eigenspace projection changes the optimization objective: it weights the importance of different error components according to their contribution to the layerwise output, making the approximation more efficient. It can be computed quickly without any training, requires only calibration activations, and does not introduce extra inference latency. Moreover, the derivation shows that this approach leads to a direct minimization of the layerwise compression loss, not just the raw weight error. Analytically, truncating a singular value in the projected space corresponds to minimizing the true compression error under reasonable assumptions about the calibration activations. In their paper, NVIDIA presents a wide range of strong results showing that EoRA can significantly boost the accuracy of quantized models. However, their experiments focus mostly on older Quantization methods like GPTQ and are limited to mid-sized LLMs, up to 13B parameters, at 3-bit and 4-bit precisions. This leaves an open question: can EoRA still be effective for much larger models, using more modern quantization techniques, and even pushing down to 2-bit precision? Let’s find out. Calibrating an EoRA Adapter Suppose we have quantized models that show significantly degraded performance compared to their full-precision counterparts on certain tasks. Our goal is to reduce this performance gap using EoRA. For the experiments, I used Qwen2.5-72B Instruct and Qwen3-32B, both quantized to 2-bit using AutoRound, a state-of-the-art quantization algorithm developed by Intel. AutoRound leverages SignSGD optimization to fine-tune quantization, and is particularly effective for low-bit settings. All the models I made are available here: Quantized Qwen3 Quantized Qwen2.5 The 2-bit models were quantized with a group size of 32, except for which used a group size of 128. A larger group size reduces model size by storing less quantization metadata, but it introduces greater quantization error. I evaluated the models on IFEval, a benchmark that measures instruction-following capabilities. Results showed a noticeable drop in performance for the quantized versions. Image by the author To compensate for this degradation, I applied an EoRA adapter using the implementation provided in the GPTQModel library. The integration is straightforward. If you’re curious about how it’s implemented in PyTorch, the codebase is compact, clean, and easy to follow: GPTQModel’s EoRA implementation: eora.py EoRA requires a calibration dataset. Ideally, this dataset should reflect the model’s intended use case. However, since we don’t have a specific target task in this context and aim to preserve the model’s general capabilities, I used 1,024 randomly sampled examples from the C4 dataset. Another key parameter is the LoRA rank, which greatly influences the effectiveness of the EoRA adapter. Its optimal value depends on the model architecture, the target task, and the calibration data. A higher rank may yield better performance but risks overfitting to the calibration set. It also increases the size of the adapter, counterproductive when the overall goal of quantization is to reduce memory usage. Conversely, a lower rank keeps the adapter lightweight but might not capture enough information to effectively compensate for quantization errors. 
In my experiments, I tested LoRA ranks of 32, 64, and 256. Below is the code used to create the EoRA adapter with GPTQModel: from gptqmodel import GPTQModel from gptqmodel.adapter.adapter import Lora from datasets import load_dataset calibration_dataset = load_dataset.select)eora_adapter_path = "Qwen3-32B-autoround-2bit-gptq-r256" model_path = "kaitchup/Qwen3-32B-autoround-2bit-gptq" eora = LoraGPTQModel.adapter.generateUsing an NVIDIA A100 GPU on RunPod, it took approximately 4 hours to generate the EoRA adapter for the model Qwen3-32B-autoround-2bit-gptq. All EoRA adapters created for these models are publicly available: EoRA Adapters for Qwen2.5 and Qwen3 Evaluating EoRA Adapters for 2-bit LLMs Let’s evaluate the effect of the EoRA adapters. Do they improve the accuracy of the 2-bit models? Image by the author It works! The improvements are particularly notable for Qwen3-14B and Qwen3-32B. For instance, applying EoRA to Qwen3-32B, quantized to 2-bit with a group size of 128, resulted in an accuracy gain of nearly 7.5 points. Increasing the LoRA rank, from 32 to 64, also led to improvements, highlighting the impact of rank on performance. EoRA is also effective on larger models like Qwen2.5-72B, though the gains are more modest. Lower-rank adapters showed little to no benefit on this model; it wasn’t until I increased the rank to 256 that significant improvements began to appear. Memory Consumption of EoRA Using the EoRA adapter during inference results in the following increase in memory consumption: Image by the author The overhead is generally negligible. For instance for 2-bit Qwen3-14B, the adapters only add 257 MB and 514 MB to the total model size, with ranks of 32 and 64. With larger ranks, using an EoRA adapter becomes questionable as the total memory consumption may surpass the memory consumption of the same model quantized at a higher precision. For instance, 2-bit Qwen2.5 72B with an EoRA adapter of rank 256 is larger than 3-bit Qwen2.5 72B. Note: This estimate includes only the memory consumed by the adapter’s parameters. For completeness, we could also account for the memory used by adapter activations during inference. However, these are extremely small relative to other tensorsand can safely be considered negligible. Conclusion EoRA works. We’ve confirmed that it’s a simple yet effective method for compensating quantization errors, even at 2-bit precision. It’s intuitive, training-free, and delivers meaningful performance gains. That said, there are a few trade-offs to consider: Rank search: Finding the optimal LoRA rank requires experimentation. It’s difficult to predict in advance whether a rank of 32 will be sufficient or whether a higher rank, like 256, will cause overfitting. The optimal value depends on the model, calibration data, and target task. Increased memory consumption: The goal of quantization is to reduce memory usage, often for highly constrained environments. While EoRA adapters are relatively lightweight at lower ranks, they do slightly increase memory consumption, particularly at higher ranks, reducing the overall efficiency of 2-bit quantization. Looking ahead, NVIDIA’s paper also demonstrates that EoRA adapters make excellent starting points for QLoRA fine-tuning. In other words, if you plan to fine-tune a 2-bit model using QLoRA, initializing from an EoRA-adapted model can lead to better results with less training effort. 
Conclusion

EoRA works. We've confirmed that it's a simple yet effective method for compensating quantization errors, even at 2-bit precision. It's intuitive, training-free, and delivers meaningful performance gains. That said, there are a few trade-offs to consider:

Rank search: Finding the optimal LoRA rank requires experimentation. It's difficult to predict in advance whether a rank of 32 will be sufficient or whether a higher rank, like 256, will cause overfitting. The optimal value depends on the model, calibration data, and target task.

Increased memory consumption: The goal of quantization is to reduce memory usage, often for highly constrained environments. While EoRA adapters are relatively lightweight at lower ranks, they do slightly increase memory consumption, particularly at higher ranks, reducing the overall efficiency of 2-bit quantization.

Looking ahead, NVIDIA's paper also demonstrates that EoRA adapters make excellent starting points for QLoRA fine-tuning. In other words, if you plan to fine-tune a 2-bit model using QLoRA, initializing from an EoRA-adapted model can lead to better results with less training effort. I wrote about fine-tuning adapters for GPTQ models last year in my newsletter: QLoRA with AutoRound: Cheaper and Better LLM Fine-tuning on Your GPU. The main difference is that, instead of initializing the adapter from scratch, we would load the EoRA adapter and then fine-tune it.

References

[1] Dettmers et al., QLoRA: Efficient Finetuning of Quantized LLMs (2023), arXiv
[2] Badri and Shaji, Towards 1-bit Machine Learning Models (2024), Mobius Labs' Blog
[3] Liu et al., EoRA: Training-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation (2024), arXiv
  • Reinforcement Learning, Not Fine-Tuning: Nemotron-Tool-N1 Trains LLMs to Use Tools with Minimal Supervision and Maximum Generalization

    Equipping LLMs with external tools or functions has become popular, showing great performance across diverse domains.
    Existing research depends on synthesizing large volumes of tool-use trajectories through advanced language models and SFT to enhance LLMs’ tool-calling capability.
    The critical limitation lies in the synthetic datasets’ inability to capture explicit reasoning steps, resulting in superficial tool call training.
    In many cases, reasoning is either completely omitted during the training or deferred to inference through prompting techniques.
    This results in pseudo-reasoning: models merely learn to mimic surface-level patterns without truly understanding the underlying decision-making process.
    Existing research explores multiple approaches to enhance LLMs’ tool-use capabilities.
    Previous methods have focused on two key strategies for improving tool learning.
    The first approach concentrated on dataset curation and model refinement, involving the creation of large-scale supervised datasets and the application of advanced training techniques such as SFT and preference optimization with DPO.
    LLMs are combined with various external tools, including search engines, calculators, vision tools, and Python interpreters, to expand their functional capabilities.
    The second approach targeted reasoning improvement, shifting from traditional train-time scaling to more complex test-time scaling strategies.
    Earlier methods relied on step-level supervision and learned reward models to guide reasoning trajectories.
    Researchers from NVIDIA, Pennsylvania State University, and the University of Washington have proposed the Nemotron-Research-Tool-N1 series to address the limitations of existing tool-use methods.
    It diverges from traditional SFT and reasoning trace distillation techniques by implementing a unique RL paradigm.
    Drawing inspiration from DeepSeek-R1’s success, a lightweight supervision method has been developed to focus on the structural validity and functional correctness evaluation of tool invocations.
    The Nemotron-Research-Tool-N1 model employs a binary reward mechanism that enables the model to autonomously develop reasoning strategies without relying on explicitly annotated reasoning trajectories.
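    To make this supervision signal concrete, here is a minimal sketch of what such a binary, rule-based reward could look like, using the <think> and <tool_call> tags from the prompting template described below. The parsing details and function name are illustrative assumptions, not the paper's actual implementation.

import json
import re

def binary_tool_reward(response: str, expected_calls: list[dict]) -> int:
    """Illustrative rule-based reward: 1 only if the response is structurally
    valid and the tool calls are functionally correct, 0 otherwise."""
    # Structural validity: a reasoning section and at least one tool call must be present.
    if not re.search(r"<think>.*?</think>", response, re.DOTALL):
        return 0
    raw_calls = re.findall(r"<tool_call>(.*?)</tool_call>", response, re.DOTALL)
    if not raw_calls:
        return 0
    try:
        predicted = [json.loads(c) for c in raw_calls]
    except json.JSONDecodeError:
        return 0  # malformed JSON inside a tool call
    # Functional correctness: predicted calls must match the ground truth
    # (tool name and arguments), regardless of argument ordering.
    def canonical(call: dict):
        return (call.get("name"), json.dumps(call.get("arguments", {}), sort_keys=True))
    return int(sorted(map(canonical, predicted)) == sorted(map(canonical, expected_calls)))

    Because the reward is binary, the model receives no partial credit for near-misses, which is what allows training without annotated reasoning traces.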
    Researchers unify and preprocess data from existing tool-calling datasets, xLAM, and a subset of ToolACE, which provide single-turn and multi-turn synthetic tool-calling trajectories.
    A lightweight prompting template is created to guide tool call generation, featuring explicit instructions for intermediate reasoning within <think>…</think> tags and tool invocation enclosed in <tool_call>…</tool_call>.
    The template helps to minimize rigid formatting constraints and reduce the risk of overfitting to specific prompt patterns.
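    As a rough illustration of such a lightweight template (the exact wording of the Nemotron-Research-Tool-N1 prompt is not reproduced here, so the phrasing and placeholder names below are my assumptions):

# Hypothetical prompting template in the spirit described above; the actual
# Nemotron-Research-Tool-N1 template wording differs.
TOOL_PROMPT_TEMPLATE = """You are given a list of available tools:
{tool_descriptions}

User request:
{user_query}

First reason about which tool(s) to use inside <think>...</think> tags,
then output each tool invocation as JSON inside <tool_call>...</tool_call> tags."""

def build_prompt(tool_descriptions: str, user_query: str) -> str:
    # Fill the template; the only hard formatting constraints are the two tag pairs.
    return TOOL_PROMPT_TEMPLATE.format(
        tool_descriptions=tool_descriptions,
        user_query=user_query,
    )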
    The primary backbone model utilized is Qwen2.5-7B/14B-Instruct, and to evaluate the generalization ability of the proposed method, evaluations are performed on alternative backbone models, including multiple variants from the LLaMA family.
    Results on the BFCL and API-Bank benchmarks show Nemotron-Research-Tool-N1 models’ superior performance.
    On the BFCL benchmark, the Tool-N1-7B/14B models outperform closed-source models like GPT-4o and specialized fine-tuned models such as xLAM-2-70B and ToolACE-8B.
    The models surpass SFT baselines trained on identical data sources, highlighting the effectiveness of the R1-style RL approach.
    Further, the API-Bank benchmark validates these findings, with Tool-N1-7B/14B achieving 4.12% and 5.03% higher accuracy than GPT-4o.
    These results conclusively demonstrate the potential of the proposed method in enhancing large language models’ tool-calling capabilities through a novel reinforcement learning paradigm.
    In conclusion, researchers introduced Nemotron-Research-Tool-N1, a significant advancement in LLM tool-use capabilities.
    The research shows a paradigm shift from traditional SFT methodologies by introducing a novel rule-based RL approach.
    The proposed method enables models to develop sophisticated reasoning strategies without relying on explicitly annotated reasoning trajectories.
    Benchmark evaluations across BFCL and API-Bank consistently validate the approach’s effectiveness, showing substantial performance improvements over existing baselines.
    The findings open new avenues for developing more adaptable and intelligent language models that can autonomously generate reasoning strategies.
    Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.

    Source: https://www.marktechpost.com/2025/05/13/reinforcement-learning-not-fine-tuning-nemotron-tool-n1-trains-llms-to-use-tools-with-minimal-supervision-and-maximum-generalization/