• NVIDIA Scores Consecutive Win for End-to-End Autonomous Driving Grand Challenge at CVPR

    NVIDIA was today named an Autonomous Grand Challenge winner at the Computer Vision and Pattern Recognitionconference, held this week in Nashville, Tennessee. The announcement was made at the Embodied Intelligence for Autonomous Systems on the Horizon Workshop.
    This marks the second consecutive year that NVIDIA’s topped the leaderboard in the End-to-End Driving at Scale category and the third year in a row winning an Autonomous Grand Challenge award at CVPR.
    The theme of this year’s challenge was “Towards Generalizable Embodied Systems” — based on NAVSIM v2, a data-driven, nonreactive autonomous vehiclesimulation framework.
    The challenge offered researchers the opportunity to explore ways to handle unexpected situations, beyond using only real-world human driving data, to accelerate the development of smarter, safer AVs.
    Generating Safe and Adaptive Driving Trajectories
    Participants of the challenge were tasked with generating driving trajectories from multi-sensor data in a semi-reactive simulation, where the ego vehicle’s plan is fixed at the start, but background traffic changes dynamically.
    Submissions were evaluated using the Extended Predictive Driver Model Score, which measures safety, comfort, compliance and generalization across real-world and synthetic scenarios — pushing the boundaries of robust and generalizable autonomous driving research.
    The NVIDIA AV Applied Research Team’s key innovation was the Generalized Trajectory Scoringmethod, which generates a variety of trajectories and progressively filters out the best one.
    GTRS model architecture showing a unified system for generating and scoring diverse driving trajectories using diffusion- and vocabulary-based trajectories.
    GTRS introduces a combination of coarse sets of trajectories covering a wide range of situations and fine-grained trajectories for safety-critical situations, created using a diffusion policy conditioned on the environment. GTRS then uses a transformer decoder distilled from perception-dependent metrics, focusing on safety, comfort and traffic rule compliance. This decoder progressively filters out the most promising trajectory candidates by capturing subtle but critical differences between similar trajectories.
    This system has proved to generalize well to a wide range of scenarios, achieving state-of-the-art results on challenging benchmarks and enabling robust, adaptive trajectory selection in diverse and challenging driving conditions.

    NVIDIA Automotive Research at CVPR 
    More than 60 NVIDIA papers were accepted for CVPR 2025, spanning automotive, healthcare, robotics and more.
    In automotive, NVIDIA researchers are advancing physical AI with innovation in perception, planning and data generation. This year, three NVIDIA papers were nominated for the Best Paper Award: FoundationStereo, Zero-Shot Monocular Scene Flow and Difix3D+.
    The NVIDIA papers listed below showcase breakthroughs in stereo depth estimation, monocular motion understanding, 3D reconstruction, closed-loop planning, vision-language modeling and generative simulation — all critical to building safer, more generalizable AVs:

    Diffusion Renderer: Neural Inverse and Forward Rendering With Video Diffusion ModelsFoundationStereo: Zero-Shot Stereo MatchingZero-Shot Monocular Scene Flow Estimation in the WildDifix3D+: Improving 3D Reconstructions With Single-Step Diffusion Models3DGUT: Enabling Distorted Cameras and Secondary Rays in Gaussian Splatting
    Closed-Loop Supervised Fine-Tuning of Tokenized Traffic Models
    Zero-Shot 4D Lidar Panoptic Segmentation
    NVILA: Efficient Frontier Visual Language Models
    RADIO Amplified: Improved Baselines for Agglomerative Vision Foundation Models
    OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving With Counterfactual Reasoning

    Explore automotive workshops and tutorials at CVPR, including:

    Workshop on Data-Driven Autonomous Driving Simulation, featuring Marco Pavone, senior director of AV research at NVIDIA, and Sanja Fidler, vice president of AI research at NVIDIA
    Workshop on Autonomous Driving, featuring Laura Leal-Taixe, senior research manager at NVIDIA
    Workshop on Open-World 3D Scene Understanding with Foundation Models, featuring Leal-Taixe
    Safe Artificial Intelligence for All Domains, featuring Jose Alvarez, director of AV applied research at NVIDIA
    Workshop on Foundation Models for V2X-Based Cooperative Autonomous Driving, featuring Pavone and Leal-Taixe
    Workshop on Multi-Agent Embodied Intelligent Systems Meet Generative AI Era, featuring Pavone
    LatinX in CV Workshop, featuring Leal-Taixe
    Workshop on Exploring the Next Generation of Data, featuring Alvarez
    Full-Stack, GPU-Based Acceleration of Deep Learning and Foundation Models, led by NVIDIA
    Continuous Data Cycle via Foundation Models, led by NVIDIA
    Distillation of Foundation Models for Autonomous Driving, led by NVIDIA

    Explore the NVIDIA research papers to be presented at CVPR and watch the NVIDIA GTC Paris keynote from NVIDIA founder and CEO Jensen Huang.
    Learn more about NVIDIA Research, a global team of hundreds of scientists and engineers focused on topics including AI, computer graphics, computer vision, self-driving cars and robotics.
    The featured image above shows how an autonomous vehicle adapts its trajectory to navigate an urban environment with dynamic traffic using the GTRS model.
    #nvidia #scores #consecutive #win #endtoend
    NVIDIA Scores Consecutive Win for End-to-End Autonomous Driving Grand Challenge at CVPR
    NVIDIA was today named an Autonomous Grand Challenge winner at the Computer Vision and Pattern Recognitionconference, held this week in Nashville, Tennessee. The announcement was made at the Embodied Intelligence for Autonomous Systems on the Horizon Workshop. This marks the second consecutive year that NVIDIA’s topped the leaderboard in the End-to-End Driving at Scale category and the third year in a row winning an Autonomous Grand Challenge award at CVPR. The theme of this year’s challenge was “Towards Generalizable Embodied Systems” — based on NAVSIM v2, a data-driven, nonreactive autonomous vehiclesimulation framework. The challenge offered researchers the opportunity to explore ways to handle unexpected situations, beyond using only real-world human driving data, to accelerate the development of smarter, safer AVs. Generating Safe and Adaptive Driving Trajectories Participants of the challenge were tasked with generating driving trajectories from multi-sensor data in a semi-reactive simulation, where the ego vehicle’s plan is fixed at the start, but background traffic changes dynamically. Submissions were evaluated using the Extended Predictive Driver Model Score, which measures safety, comfort, compliance and generalization across real-world and synthetic scenarios — pushing the boundaries of robust and generalizable autonomous driving research. The NVIDIA AV Applied Research Team’s key innovation was the Generalized Trajectory Scoringmethod, which generates a variety of trajectories and progressively filters out the best one. GTRS model architecture showing a unified system for generating and scoring diverse driving trajectories using diffusion- and vocabulary-based trajectories. GTRS introduces a combination of coarse sets of trajectories covering a wide range of situations and fine-grained trajectories for safety-critical situations, created using a diffusion policy conditioned on the environment. GTRS then uses a transformer decoder distilled from perception-dependent metrics, focusing on safety, comfort and traffic rule compliance. This decoder progressively filters out the most promising trajectory candidates by capturing subtle but critical differences between similar trajectories. This system has proved to generalize well to a wide range of scenarios, achieving state-of-the-art results on challenging benchmarks and enabling robust, adaptive trajectory selection in diverse and challenging driving conditions. NVIDIA Automotive Research at CVPR  More than 60 NVIDIA papers were accepted for CVPR 2025, spanning automotive, healthcare, robotics and more. In automotive, NVIDIA researchers are advancing physical AI with innovation in perception, planning and data generation. This year, three NVIDIA papers were nominated for the Best Paper Award: FoundationStereo, Zero-Shot Monocular Scene Flow and Difix3D+. The NVIDIA papers listed below showcase breakthroughs in stereo depth estimation, monocular motion understanding, 3D reconstruction, closed-loop planning, vision-language modeling and generative simulation — all critical to building safer, more generalizable AVs: Diffusion Renderer: Neural Inverse and Forward Rendering With Video Diffusion ModelsFoundationStereo: Zero-Shot Stereo MatchingZero-Shot Monocular Scene Flow Estimation in the WildDifix3D+: Improving 3D Reconstructions With Single-Step Diffusion Models3DGUT: Enabling Distorted Cameras and Secondary Rays in Gaussian Splatting Closed-Loop Supervised Fine-Tuning of Tokenized Traffic Models Zero-Shot 4D Lidar Panoptic Segmentation NVILA: Efficient Frontier Visual Language Models RADIO Amplified: Improved Baselines for Agglomerative Vision Foundation Models OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving With Counterfactual Reasoning Explore automotive workshops and tutorials at CVPR, including: Workshop on Data-Driven Autonomous Driving Simulation, featuring Marco Pavone, senior director of AV research at NVIDIA, and Sanja Fidler, vice president of AI research at NVIDIA Workshop on Autonomous Driving, featuring Laura Leal-Taixe, senior research manager at NVIDIA Workshop on Open-World 3D Scene Understanding with Foundation Models, featuring Leal-Taixe Safe Artificial Intelligence for All Domains, featuring Jose Alvarez, director of AV applied research at NVIDIA Workshop on Foundation Models for V2X-Based Cooperative Autonomous Driving, featuring Pavone and Leal-Taixe Workshop on Multi-Agent Embodied Intelligent Systems Meet Generative AI Era, featuring Pavone LatinX in CV Workshop, featuring Leal-Taixe Workshop on Exploring the Next Generation of Data, featuring Alvarez Full-Stack, GPU-Based Acceleration of Deep Learning and Foundation Models, led by NVIDIA Continuous Data Cycle via Foundation Models, led by NVIDIA Distillation of Foundation Models for Autonomous Driving, led by NVIDIA Explore the NVIDIA research papers to be presented at CVPR and watch the NVIDIA GTC Paris keynote from NVIDIA founder and CEO Jensen Huang. Learn more about NVIDIA Research, a global team of hundreds of scientists and engineers focused on topics including AI, computer graphics, computer vision, self-driving cars and robotics. The featured image above shows how an autonomous vehicle adapts its trajectory to navigate an urban environment with dynamic traffic using the GTRS model. #nvidia #scores #consecutive #win #endtoend
    BLOGS.NVIDIA.COM
    NVIDIA Scores Consecutive Win for End-to-End Autonomous Driving Grand Challenge at CVPR
    NVIDIA was today named an Autonomous Grand Challenge winner at the Computer Vision and Pattern Recognition (CVPR) conference, held this week in Nashville, Tennessee. The announcement was made at the Embodied Intelligence for Autonomous Systems on the Horizon Workshop. This marks the second consecutive year that NVIDIA’s topped the leaderboard in the End-to-End Driving at Scale category and the third year in a row winning an Autonomous Grand Challenge award at CVPR. The theme of this year’s challenge was “Towards Generalizable Embodied Systems” — based on NAVSIM v2, a data-driven, nonreactive autonomous vehicle (AV) simulation framework. The challenge offered researchers the opportunity to explore ways to handle unexpected situations, beyond using only real-world human driving data, to accelerate the development of smarter, safer AVs. Generating Safe and Adaptive Driving Trajectories Participants of the challenge were tasked with generating driving trajectories from multi-sensor data in a semi-reactive simulation, where the ego vehicle’s plan is fixed at the start, but background traffic changes dynamically. Submissions were evaluated using the Extended Predictive Driver Model Score, which measures safety, comfort, compliance and generalization across real-world and synthetic scenarios — pushing the boundaries of robust and generalizable autonomous driving research. The NVIDIA AV Applied Research Team’s key innovation was the Generalized Trajectory Scoring (GTRS) method, which generates a variety of trajectories and progressively filters out the best one. GTRS model architecture showing a unified system for generating and scoring diverse driving trajectories using diffusion- and vocabulary-based trajectories. GTRS introduces a combination of coarse sets of trajectories covering a wide range of situations and fine-grained trajectories for safety-critical situations, created using a diffusion policy conditioned on the environment. GTRS then uses a transformer decoder distilled from perception-dependent metrics, focusing on safety, comfort and traffic rule compliance. This decoder progressively filters out the most promising trajectory candidates by capturing subtle but critical differences between similar trajectories. This system has proved to generalize well to a wide range of scenarios, achieving state-of-the-art results on challenging benchmarks and enabling robust, adaptive trajectory selection in diverse and challenging driving conditions. NVIDIA Automotive Research at CVPR  More than 60 NVIDIA papers were accepted for CVPR 2025, spanning automotive, healthcare, robotics and more. In automotive, NVIDIA researchers are advancing physical AI with innovation in perception, planning and data generation. This year, three NVIDIA papers were nominated for the Best Paper Award: FoundationStereo, Zero-Shot Monocular Scene Flow and Difix3D+. The NVIDIA papers listed below showcase breakthroughs in stereo depth estimation, monocular motion understanding, 3D reconstruction, closed-loop planning, vision-language modeling and generative simulation — all critical to building safer, more generalizable AVs: Diffusion Renderer: Neural Inverse and Forward Rendering With Video Diffusion Models (Read more in this blog.) FoundationStereo: Zero-Shot Stereo Matching (Best Paper nominee) Zero-Shot Monocular Scene Flow Estimation in the Wild (Best Paper nominee) Difix3D+: Improving 3D Reconstructions With Single-Step Diffusion Models (Best Paper nominee) 3DGUT: Enabling Distorted Cameras and Secondary Rays in Gaussian Splatting Closed-Loop Supervised Fine-Tuning of Tokenized Traffic Models Zero-Shot 4D Lidar Panoptic Segmentation NVILA: Efficient Frontier Visual Language Models RADIO Amplified: Improved Baselines for Agglomerative Vision Foundation Models OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving With Counterfactual Reasoning Explore automotive workshops and tutorials at CVPR, including: Workshop on Data-Driven Autonomous Driving Simulation, featuring Marco Pavone, senior director of AV research at NVIDIA, and Sanja Fidler, vice president of AI research at NVIDIA Workshop on Autonomous Driving, featuring Laura Leal-Taixe, senior research manager at NVIDIA Workshop on Open-World 3D Scene Understanding with Foundation Models, featuring Leal-Taixe Safe Artificial Intelligence for All Domains, featuring Jose Alvarez, director of AV applied research at NVIDIA Workshop on Foundation Models for V2X-Based Cooperative Autonomous Driving, featuring Pavone and Leal-Taixe Workshop on Multi-Agent Embodied Intelligent Systems Meet Generative AI Era, featuring Pavone LatinX in CV Workshop, featuring Leal-Taixe Workshop on Exploring the Next Generation of Data, featuring Alvarez Full-Stack, GPU-Based Acceleration of Deep Learning and Foundation Models, led by NVIDIA Continuous Data Cycle via Foundation Models, led by NVIDIA Distillation of Foundation Models for Autonomous Driving, led by NVIDIA Explore the NVIDIA research papers to be presented at CVPR and watch the NVIDIA GTC Paris keynote from NVIDIA founder and CEO Jensen Huang. Learn more about NVIDIA Research, a global team of hundreds of scientists and engineers focused on topics including AI, computer graphics, computer vision, self-driving cars and robotics. The featured image above shows how an autonomous vehicle adapts its trajectory to navigate an urban environment with dynamic traffic using the GTRS model.
    Like
    Love
    Wow
    Angry
    27
    0 Kommentare 0 Anteile
  • Rethinking AI: DeepSeek’s playbook shakes up the high-spend, high-compute paradigm

    Join the event trusted by enterprise leaders for nearly two decades. VB Transform brings together the people building real enterprise AI strategy. Learn more

    When DeepSeek released its R1 model this January, it wasn’t just another AI announcement. It was a watershed moment that sent shockwaves through the tech industry, forcing industry leaders to reconsider their fundamental approaches to AI development.
    What makes DeepSeek’s accomplishment remarkable isn’t that the company developed novel capabilities; rather, it was how it achieved comparable results to those delivered by tech heavyweights at a fraction of the cost. In reality, DeepSeek didn’t do anything that hadn’t been done before; its innovation stemmed from pursuing different priorities. As a result, we are now experiencing rapid-fire development along two parallel tracks: efficiency and compute. 
    As DeepSeek prepares to release its R2 model, and as it concurrently faces the potential of even greater chip restrictions from the U.S., it’s important to look at how it captured so much attention.
    Engineering around constraints
    DeepSeek’s arrival, as sudden and dramatic as it was, captivated us all because it showcased the capacity for innovation to thrive even under significant constraints. Faced with U.S. export controls limiting access to cutting-edge AI chips, DeepSeek was forced to find alternative pathways to AI advancement.
    While U.S. companies pursued performance gains through more powerful hardware, bigger models and better data, DeepSeek focused on optimizing what was available. It implemented known ideas with remarkable execution — and there is novelty in executing what’s known and doing it well.
    This efficiency-first mindset yielded incredibly impressive results. DeepSeek’s R1 model reportedly matches OpenAI’s capabilities at just 5 to 10% of the operating cost. According to reports, the final training run for DeepSeek’s V3 predecessor cost a mere million — which was described by former Tesla AI scientist Andrej Karpathy as “a joke of a budget” compared to the tens or hundreds of millions spent by U.S. competitors. More strikingly, while OpenAI reportedly spent million training its recent “Orion” model, DeepSeek achieved superior benchmark results for just million — less than 1.2% of OpenAI’s investment.
    If you get starry eyed believing these incredible results were achieved even as DeepSeek was at a severe disadvantage based on its inability to access advanced AI chips, I hate to tell you, but that narrative isn’t entirely accurate. Initial U.S. export controls focused primarily on compute capabilities, not on memory and networking — two crucial components for AI development.
    That means that the chips DeepSeek had access to were not poor quality chips; their networking and memory capabilities allowed DeepSeek to parallelize operations across many units, a key strategy for running their large model efficiently.
    This, combined with China’s national push toward controlling the entire vertical stack of AI infrastructure, resulted in accelerated innovation that many Western observers didn’t anticipate. DeepSeek’s advancements were an inevitable part of AI development, but they brought known advancements forward a few years earlier than would have been possible otherwise, and that’s pretty amazing.
    Pragmatism over process
    Beyond hardware optimization, DeepSeek’s approach to training data represents another departure from conventional Western practices. Rather than relying solely on web-scraped content, DeepSeek reportedly leveraged significant amounts of synthetic data and outputs from other proprietary models. This is a classic example of model distillation, or the ability to learn from really powerful models. Such an approach, however, raises questions about data privacy and governance that might concern Western enterprise customers. Still, it underscores DeepSeek’s overall pragmatic focus on results over process.
    The effective use of synthetic data is a key differentiator. Synthetic data can be very effective when it comes to training large models, but you have to be careful; some model architectures handle synthetic data better than others. For instance, transformer-based models with mixture of expertsarchitectures like DeepSeek’s tend to be more robust when incorporating synthetic data, while more traditional dense architectures like those used in early Llama models can experience performance degradation or even “model collapse” when trained on too much synthetic content.
    This architectural sensitivity matters because synthetic data introduces different patterns and distributions compared to real-world data. When a model architecture doesn’t handle synthetic data well, it may learn shortcuts or biases present in the synthetic data generation process rather than generalizable knowledge. This can lead to reduced performance on real-world tasks, increased hallucinations or brittleness when facing novel situations. 
    Still, DeepSeek’s engineering teams reportedly designed their model architecture specifically with synthetic data integration in mind from the earliest planning stages. This allowed the company to leverage the cost benefits of synthetic data without sacrificing performance.
    Market reverberations
    Why does all of this matter? Stock market aside, DeepSeek’s emergence has triggered substantive strategic shifts among industry leaders.
    Case in point: OpenAI. Sam Altman recently announced plans to release the company’s first “open-weight” language model since 2019. This is a pretty notable pivot for a company that built its business on proprietary systems. It seems DeepSeek’s rise, on top of Llama’s success, has hit OpenAI’s leader hard. Just a month after DeepSeek arrived on the scene, Altman admitted that OpenAI had been “on the wrong side of history” regarding open-source AI. 
    With OpenAI reportedly spending to 8 billion annually on operations, the economic pressure from efficient alternatives like DeepSeek has become impossible to ignore. As AI scholar Kai-Fu Lee bluntly put it: “You’re spending billion or billion a year, making a massive loss, and here you have a competitor coming in with an open-source model that’s for free.” This necessitates change.
    This economic reality prompted OpenAI to pursue a massive billion funding round that valued the company at an unprecedented billion. But even with a war chest of funds at its disposal, the fundamental challenge remains: OpenAI’s approach is dramatically more resource-intensive than DeepSeek’s.
    Beyond model training
    Another significant trend accelerated by DeepSeek is the shift toward “test-time compute”. As major AI labs have now trained their models on much of the available public data on the internet, data scarcity is slowing further improvements in pre-training.
    To get around this, DeepSeek announced a collaboration with Tsinghua University to enable “self-principled critique tuning”. This approach trains AI to develop its own rules for judging content and then uses those rules to provide detailed critiques. The system includes a built-in “judge” that evaluates the AI’s answers in real-time, comparing responses against core rules and quality standards.
    The development is part of a movement towards autonomous self-evaluation and improvement in AI systems in which models use inference time to improve results, rather than simply making models larger during training. DeepSeek calls its system “DeepSeek-GRM”. But, as with its model distillation approach, this could be considered a mix of promise and risk.
    For example, if the AI develops its own judging criteria, there’s a risk those principles diverge from human values, ethics or context. The rules could end up being overly rigid or biased, optimizing for style over substance, and/or reinforce incorrect assumptions or hallucinations. Additionally, without a human in the loop, issues could arise if the “judge” is flawed or misaligned. It’s a kind of AI talking to itself, without robust external grounding. On top of this, users and developers may not understand why the AI reached a certain conclusion — which feeds into a bigger concern: Should an AI be allowed to decide what is “good” or “correct” based solely on its own logic? These risks shouldn’t be discounted.
    At the same time, this approach is gaining traction, as again DeepSeek builds on the body of work of othersto create what is likely the first full-stack application of SPCT in a commercial effort.
    This could mark a powerful shift in AI autonomy, but there still is a need for rigorous auditing, transparency and safeguards. It’s not just about models getting smarter, but that they remain aligned, interpretable, and trustworthy as they begin critiquing themselves without human guardrails.
    Moving into the future
    So, taking all of this into account, the rise of DeepSeek signals a broader shift in the AI industry toward parallel innovation tracks. While companies continue building more powerful compute clusters for next-generation capabilities, there will also be intense focus on finding efficiency gains through software engineering and model architecture improvements to offset the challenges of AI energy consumption, which far outpaces power generation capacity. 
    Companies are taking note. Microsoft, for example, has halted data center development in multiple regions globally, recalibrating toward a more distributed, efficient infrastructure approach. While still planning to invest approximately billion in AI infrastructure this fiscal year, the company is reallocating resources in response to the efficiency gains DeepSeek introduced to the market.
    Meta has also responded,
    With so much movement in such a short time, it becomes somewhat ironic that the U.S. sanctions designed to maintain American AI dominance may have instead accelerated the very innovation they sought to contain. By constraining access to materials, DeepSeek was forced to blaze a new trail.
    Moving forward, as the industry continues to evolve globally, adaptability for all players will be key. Policies, people and market reactions will continue to shift the ground rules — whether it’s eliminating the AI diffusion rule, a new ban on technology purchases or something else entirely. It’s what we learn from one another and how we respond that will be worth watching.
    Jae Lee is CEO and co-founder of TwelveLabs.

    Daily insights on business use cases with VB Daily
    If you want to impress your boss, VB Daily has you covered. We give you the inside scoop on what companies are doing with generative AI, from regulatory shifts to practical deployments, so you can share insights for maximum ROI.
    Read our Privacy Policy

    Thanks for subscribing. Check out more VB newsletters here.

    An error occured.
    #rethinking #deepseeks #playbook #shakes #highspend
    Rethinking AI: DeepSeek’s playbook shakes up the high-spend, high-compute paradigm
    Join the event trusted by enterprise leaders for nearly two decades. VB Transform brings together the people building real enterprise AI strategy. Learn more When DeepSeek released its R1 model this January, it wasn’t just another AI announcement. It was a watershed moment that sent shockwaves through the tech industry, forcing industry leaders to reconsider their fundamental approaches to AI development. What makes DeepSeek’s accomplishment remarkable isn’t that the company developed novel capabilities; rather, it was how it achieved comparable results to those delivered by tech heavyweights at a fraction of the cost. In reality, DeepSeek didn’t do anything that hadn’t been done before; its innovation stemmed from pursuing different priorities. As a result, we are now experiencing rapid-fire development along two parallel tracks: efficiency and compute.  As DeepSeek prepares to release its R2 model, and as it concurrently faces the potential of even greater chip restrictions from the U.S., it’s important to look at how it captured so much attention. Engineering around constraints DeepSeek’s arrival, as sudden and dramatic as it was, captivated us all because it showcased the capacity for innovation to thrive even under significant constraints. Faced with U.S. export controls limiting access to cutting-edge AI chips, DeepSeek was forced to find alternative pathways to AI advancement. While U.S. companies pursued performance gains through more powerful hardware, bigger models and better data, DeepSeek focused on optimizing what was available. It implemented known ideas with remarkable execution — and there is novelty in executing what’s known and doing it well. This efficiency-first mindset yielded incredibly impressive results. DeepSeek’s R1 model reportedly matches OpenAI’s capabilities at just 5 to 10% of the operating cost. According to reports, the final training run for DeepSeek’s V3 predecessor cost a mere million — which was described by former Tesla AI scientist Andrej Karpathy as “a joke of a budget” compared to the tens or hundreds of millions spent by U.S. competitors. More strikingly, while OpenAI reportedly spent million training its recent “Orion” model, DeepSeek achieved superior benchmark results for just million — less than 1.2% of OpenAI’s investment. If you get starry eyed believing these incredible results were achieved even as DeepSeek was at a severe disadvantage based on its inability to access advanced AI chips, I hate to tell you, but that narrative isn’t entirely accurate. Initial U.S. export controls focused primarily on compute capabilities, not on memory and networking — two crucial components for AI development. That means that the chips DeepSeek had access to were not poor quality chips; their networking and memory capabilities allowed DeepSeek to parallelize operations across many units, a key strategy for running their large model efficiently. This, combined with China’s national push toward controlling the entire vertical stack of AI infrastructure, resulted in accelerated innovation that many Western observers didn’t anticipate. DeepSeek’s advancements were an inevitable part of AI development, but they brought known advancements forward a few years earlier than would have been possible otherwise, and that’s pretty amazing. Pragmatism over process Beyond hardware optimization, DeepSeek’s approach to training data represents another departure from conventional Western practices. Rather than relying solely on web-scraped content, DeepSeek reportedly leveraged significant amounts of synthetic data and outputs from other proprietary models. This is a classic example of model distillation, or the ability to learn from really powerful models. Such an approach, however, raises questions about data privacy and governance that might concern Western enterprise customers. Still, it underscores DeepSeek’s overall pragmatic focus on results over process. The effective use of synthetic data is a key differentiator. Synthetic data can be very effective when it comes to training large models, but you have to be careful; some model architectures handle synthetic data better than others. For instance, transformer-based models with mixture of expertsarchitectures like DeepSeek’s tend to be more robust when incorporating synthetic data, while more traditional dense architectures like those used in early Llama models can experience performance degradation or even “model collapse” when trained on too much synthetic content. This architectural sensitivity matters because synthetic data introduces different patterns and distributions compared to real-world data. When a model architecture doesn’t handle synthetic data well, it may learn shortcuts or biases present in the synthetic data generation process rather than generalizable knowledge. This can lead to reduced performance on real-world tasks, increased hallucinations or brittleness when facing novel situations.  Still, DeepSeek’s engineering teams reportedly designed their model architecture specifically with synthetic data integration in mind from the earliest planning stages. This allowed the company to leverage the cost benefits of synthetic data without sacrificing performance. Market reverberations Why does all of this matter? Stock market aside, DeepSeek’s emergence has triggered substantive strategic shifts among industry leaders. Case in point: OpenAI. Sam Altman recently announced plans to release the company’s first “open-weight” language model since 2019. This is a pretty notable pivot for a company that built its business on proprietary systems. It seems DeepSeek’s rise, on top of Llama’s success, has hit OpenAI’s leader hard. Just a month after DeepSeek arrived on the scene, Altman admitted that OpenAI had been “on the wrong side of history” regarding open-source AI.  With OpenAI reportedly spending to 8 billion annually on operations, the economic pressure from efficient alternatives like DeepSeek has become impossible to ignore. As AI scholar Kai-Fu Lee bluntly put it: “You’re spending billion or billion a year, making a massive loss, and here you have a competitor coming in with an open-source model that’s for free.” This necessitates change. This economic reality prompted OpenAI to pursue a massive billion funding round that valued the company at an unprecedented billion. But even with a war chest of funds at its disposal, the fundamental challenge remains: OpenAI’s approach is dramatically more resource-intensive than DeepSeek’s. Beyond model training Another significant trend accelerated by DeepSeek is the shift toward “test-time compute”. As major AI labs have now trained their models on much of the available public data on the internet, data scarcity is slowing further improvements in pre-training. To get around this, DeepSeek announced a collaboration with Tsinghua University to enable “self-principled critique tuning”. This approach trains AI to develop its own rules for judging content and then uses those rules to provide detailed critiques. The system includes a built-in “judge” that evaluates the AI’s answers in real-time, comparing responses against core rules and quality standards. The development is part of a movement towards autonomous self-evaluation and improvement in AI systems in which models use inference time to improve results, rather than simply making models larger during training. DeepSeek calls its system “DeepSeek-GRM”. But, as with its model distillation approach, this could be considered a mix of promise and risk. For example, if the AI develops its own judging criteria, there’s a risk those principles diverge from human values, ethics or context. The rules could end up being overly rigid or biased, optimizing for style over substance, and/or reinforce incorrect assumptions or hallucinations. Additionally, without a human in the loop, issues could arise if the “judge” is flawed or misaligned. It’s a kind of AI talking to itself, without robust external grounding. On top of this, users and developers may not understand why the AI reached a certain conclusion — which feeds into a bigger concern: Should an AI be allowed to decide what is “good” or “correct” based solely on its own logic? These risks shouldn’t be discounted. At the same time, this approach is gaining traction, as again DeepSeek builds on the body of work of othersto create what is likely the first full-stack application of SPCT in a commercial effort. This could mark a powerful shift in AI autonomy, but there still is a need for rigorous auditing, transparency and safeguards. It’s not just about models getting smarter, but that they remain aligned, interpretable, and trustworthy as they begin critiquing themselves without human guardrails. Moving into the future So, taking all of this into account, the rise of DeepSeek signals a broader shift in the AI industry toward parallel innovation tracks. While companies continue building more powerful compute clusters for next-generation capabilities, there will also be intense focus on finding efficiency gains through software engineering and model architecture improvements to offset the challenges of AI energy consumption, which far outpaces power generation capacity.  Companies are taking note. Microsoft, for example, has halted data center development in multiple regions globally, recalibrating toward a more distributed, efficient infrastructure approach. While still planning to invest approximately billion in AI infrastructure this fiscal year, the company is reallocating resources in response to the efficiency gains DeepSeek introduced to the market. Meta has also responded, With so much movement in such a short time, it becomes somewhat ironic that the U.S. sanctions designed to maintain American AI dominance may have instead accelerated the very innovation they sought to contain. By constraining access to materials, DeepSeek was forced to blaze a new trail. Moving forward, as the industry continues to evolve globally, adaptability for all players will be key. Policies, people and market reactions will continue to shift the ground rules — whether it’s eliminating the AI diffusion rule, a new ban on technology purchases or something else entirely. It’s what we learn from one another and how we respond that will be worth watching. Jae Lee is CEO and co-founder of TwelveLabs. Daily insights on business use cases with VB Daily If you want to impress your boss, VB Daily has you covered. We give you the inside scoop on what companies are doing with generative AI, from regulatory shifts to practical deployments, so you can share insights for maximum ROI. Read our Privacy Policy Thanks for subscribing. Check out more VB newsletters here. An error occured. #rethinking #deepseeks #playbook #shakes #highspend
    VENTUREBEAT.COM
    Rethinking AI: DeepSeek’s playbook shakes up the high-spend, high-compute paradigm
    Join the event trusted by enterprise leaders for nearly two decades. VB Transform brings together the people building real enterprise AI strategy. Learn more When DeepSeek released its R1 model this January, it wasn’t just another AI announcement. It was a watershed moment that sent shockwaves through the tech industry, forcing industry leaders to reconsider their fundamental approaches to AI development. What makes DeepSeek’s accomplishment remarkable isn’t that the company developed novel capabilities; rather, it was how it achieved comparable results to those delivered by tech heavyweights at a fraction of the cost. In reality, DeepSeek didn’t do anything that hadn’t been done before; its innovation stemmed from pursuing different priorities. As a result, we are now experiencing rapid-fire development along two parallel tracks: efficiency and compute.  As DeepSeek prepares to release its R2 model, and as it concurrently faces the potential of even greater chip restrictions from the U.S., it’s important to look at how it captured so much attention. Engineering around constraints DeepSeek’s arrival, as sudden and dramatic as it was, captivated us all because it showcased the capacity for innovation to thrive even under significant constraints. Faced with U.S. export controls limiting access to cutting-edge AI chips, DeepSeek was forced to find alternative pathways to AI advancement. While U.S. companies pursued performance gains through more powerful hardware, bigger models and better data, DeepSeek focused on optimizing what was available. It implemented known ideas with remarkable execution — and there is novelty in executing what’s known and doing it well. This efficiency-first mindset yielded incredibly impressive results. DeepSeek’s R1 model reportedly matches OpenAI’s capabilities at just 5 to 10% of the operating cost. According to reports, the final training run for DeepSeek’s V3 predecessor cost a mere $6 million — which was described by former Tesla AI scientist Andrej Karpathy as “a joke of a budget” compared to the tens or hundreds of millions spent by U.S. competitors. More strikingly, while OpenAI reportedly spent $500 million training its recent “Orion” model, DeepSeek achieved superior benchmark results for just $5.6 million — less than 1.2% of OpenAI’s investment. If you get starry eyed believing these incredible results were achieved even as DeepSeek was at a severe disadvantage based on its inability to access advanced AI chips, I hate to tell you, but that narrative isn’t entirely accurate (even though it makes a good story). Initial U.S. export controls focused primarily on compute capabilities, not on memory and networking — two crucial components for AI development. That means that the chips DeepSeek had access to were not poor quality chips; their networking and memory capabilities allowed DeepSeek to parallelize operations across many units, a key strategy for running their large model efficiently. This, combined with China’s national push toward controlling the entire vertical stack of AI infrastructure, resulted in accelerated innovation that many Western observers didn’t anticipate. DeepSeek’s advancements were an inevitable part of AI development, but they brought known advancements forward a few years earlier than would have been possible otherwise, and that’s pretty amazing. Pragmatism over process Beyond hardware optimization, DeepSeek’s approach to training data represents another departure from conventional Western practices. Rather than relying solely on web-scraped content, DeepSeek reportedly leveraged significant amounts of synthetic data and outputs from other proprietary models. This is a classic example of model distillation, or the ability to learn from really powerful models. Such an approach, however, raises questions about data privacy and governance that might concern Western enterprise customers. Still, it underscores DeepSeek’s overall pragmatic focus on results over process. The effective use of synthetic data is a key differentiator. Synthetic data can be very effective when it comes to training large models, but you have to be careful; some model architectures handle synthetic data better than others. For instance, transformer-based models with mixture of experts (MoE) architectures like DeepSeek’s tend to be more robust when incorporating synthetic data, while more traditional dense architectures like those used in early Llama models can experience performance degradation or even “model collapse” when trained on too much synthetic content. This architectural sensitivity matters because synthetic data introduces different patterns and distributions compared to real-world data. When a model architecture doesn’t handle synthetic data well, it may learn shortcuts or biases present in the synthetic data generation process rather than generalizable knowledge. This can lead to reduced performance on real-world tasks, increased hallucinations or brittleness when facing novel situations.  Still, DeepSeek’s engineering teams reportedly designed their model architecture specifically with synthetic data integration in mind from the earliest planning stages. This allowed the company to leverage the cost benefits of synthetic data without sacrificing performance. Market reverberations Why does all of this matter? Stock market aside, DeepSeek’s emergence has triggered substantive strategic shifts among industry leaders. Case in point: OpenAI. Sam Altman recently announced plans to release the company’s first “open-weight” language model since 2019. This is a pretty notable pivot for a company that built its business on proprietary systems. It seems DeepSeek’s rise, on top of Llama’s success, has hit OpenAI’s leader hard. Just a month after DeepSeek arrived on the scene, Altman admitted that OpenAI had been “on the wrong side of history” regarding open-source AI.  With OpenAI reportedly spending $7 to 8 billion annually on operations, the economic pressure from efficient alternatives like DeepSeek has become impossible to ignore. As AI scholar Kai-Fu Lee bluntly put it: “You’re spending $7 billion or $8 billion a year, making a massive loss, and here you have a competitor coming in with an open-source model that’s for free.” This necessitates change. This economic reality prompted OpenAI to pursue a massive $40 billion funding round that valued the company at an unprecedented $300 billion. But even with a war chest of funds at its disposal, the fundamental challenge remains: OpenAI’s approach is dramatically more resource-intensive than DeepSeek’s. Beyond model training Another significant trend accelerated by DeepSeek is the shift toward “test-time compute” (TTC). As major AI labs have now trained their models on much of the available public data on the internet, data scarcity is slowing further improvements in pre-training. To get around this, DeepSeek announced a collaboration with Tsinghua University to enable “self-principled critique tuning” (SPCT). This approach trains AI to develop its own rules for judging content and then uses those rules to provide detailed critiques. The system includes a built-in “judge” that evaluates the AI’s answers in real-time, comparing responses against core rules and quality standards. The development is part of a movement towards autonomous self-evaluation and improvement in AI systems in which models use inference time to improve results, rather than simply making models larger during training. DeepSeek calls its system “DeepSeek-GRM” (generalist reward modeling). But, as with its model distillation approach, this could be considered a mix of promise and risk. For example, if the AI develops its own judging criteria, there’s a risk those principles diverge from human values, ethics or context. The rules could end up being overly rigid or biased, optimizing for style over substance, and/or reinforce incorrect assumptions or hallucinations. Additionally, without a human in the loop, issues could arise if the “judge” is flawed or misaligned. It’s a kind of AI talking to itself, without robust external grounding. On top of this, users and developers may not understand why the AI reached a certain conclusion — which feeds into a bigger concern: Should an AI be allowed to decide what is “good” or “correct” based solely on its own logic? These risks shouldn’t be discounted. At the same time, this approach is gaining traction, as again DeepSeek builds on the body of work of others (think OpenAI’s “critique and revise” methods, Anthropic’s constitutional AI or research on self-rewarding agents) to create what is likely the first full-stack application of SPCT in a commercial effort. This could mark a powerful shift in AI autonomy, but there still is a need for rigorous auditing, transparency and safeguards. It’s not just about models getting smarter, but that they remain aligned, interpretable, and trustworthy as they begin critiquing themselves without human guardrails. Moving into the future So, taking all of this into account, the rise of DeepSeek signals a broader shift in the AI industry toward parallel innovation tracks. While companies continue building more powerful compute clusters for next-generation capabilities, there will also be intense focus on finding efficiency gains through software engineering and model architecture improvements to offset the challenges of AI energy consumption, which far outpaces power generation capacity.  Companies are taking note. Microsoft, for example, has halted data center development in multiple regions globally, recalibrating toward a more distributed, efficient infrastructure approach. While still planning to invest approximately $80 billion in AI infrastructure this fiscal year, the company is reallocating resources in response to the efficiency gains DeepSeek introduced to the market. Meta has also responded, With so much movement in such a short time, it becomes somewhat ironic that the U.S. sanctions designed to maintain American AI dominance may have instead accelerated the very innovation they sought to contain. By constraining access to materials, DeepSeek was forced to blaze a new trail. Moving forward, as the industry continues to evolve globally, adaptability for all players will be key. Policies, people and market reactions will continue to shift the ground rules — whether it’s eliminating the AI diffusion rule, a new ban on technology purchases or something else entirely. It’s what we learn from one another and how we respond that will be worth watching. Jae Lee is CEO and co-founder of TwelveLabs. Daily insights on business use cases with VB Daily If you want to impress your boss, VB Daily has you covered. We give you the inside scoop on what companies are doing with generative AI, from regulatory shifts to practical deployments, so you can share insights for maximum ROI. Read our Privacy Policy Thanks for subscribing. Check out more VB newsletters here. An error occured.
    0 Kommentare 0 Anteile
  • From LLMs to hallucinations, here’s a simple guide to common AI terms

    Artificial intelligence is a deep and convoluted world. The scientists who work in this field often rely on jargon and lingo to explain what they’re working on. As a result, we frequently have to use those technical terms in our coverage of the artificial intelligence industry. That’s why we thought it would be helpful to put together a glossary with definitions of some of the most important words and phrases that we use in our articles.
    We will regularly update this glossary to add new entries as researchers continually uncover novel methods to push the frontier of artificial intelligence while identifying emerging safety risks.

    AGI
    Artificial general intelligence, or AGI, is a nebulous term. But it generally refers to AI that’s more capable than the average human at many, if not most, tasks. OpenAI CEO Sam Altman recently described AGI as the “equivalent of a median human that you could hire as a co-worker.” Meanwhile, OpenAI’s charter defines AGI as “highly autonomous systems that outperform humans at most economically valuable work.” Google DeepMind’s understanding differs slightly from these two definitions; the lab views AGI as “AI that’s at least as capable as humans at most cognitive tasks.” Confused? Not to worry — so are experts at the forefront of AI research.
    AI agent
    An AI agent refers to a tool that uses AI technologies to perform a series of tasks on your behalf — beyond what a more basic AI chatbot could do — such as filing expenses, booking tickets or a table at a restaurant, or even writing and maintaining code. However, as we’ve explained before, there are lots of moving pieces in this emergent space, so “AI agent” might mean different things to different people. Infrastructure is also still being built out to deliver on its envisaged capabilities. But the basic concept implies an autonomous system that may draw on multiple AI systems to carry out multistep tasks.
    Chain of thought
    Given a simple question, a human brain can answer without even thinking too much about it — things like “which animal is taller, a giraffe or a cat?” But in many cases, you often need a pen and paper to come up with the right answer because there are intermediary steps. For instance, if a farmer has chickens and cows, and together they have 40 heads and 120 legs, you might need to write down a simple equation to come up with the answer.
    In an AI context, chain-of-thought reasoning for large language models means breaking down a problem into smaller, intermediate steps to improve the quality of the end result. It usually takes longer to get an answer, but the answer is more likely to be correct, especially in a logic or coding context. Reasoning models are developed from traditional large language models and optimized for chain-of-thought thinking thanks to reinforcement learning.Techcrunch event

    Join us at TechCrunch Sessions: AI
    Secure your spot for our leading AI industry event with speakers from OpenAI, Anthropic, and Cohere. For a limited time, tickets are just for an entire day of expert talks, workshops, and potent networking.

    Exhibit at TechCrunch Sessions: AI
    Secure your spot at TC Sessions: AI and show 1,200+ decision-makers what you’ve built — without the big spend. Available through May 9 or while tables last.

    Berkeley, CA
    |
    June 5

    REGISTER NOW

    Deep learning
    A subset of self-improving machine learning in which AI algorithms are designed with a multi-layered, artificial neural networkstructure. This allows them to make more complex correlations compared to simpler machine learning-based systems, such as linear models or decision trees. The structure of deep learning algorithms draws inspiration from the interconnected pathways of neurons in the human brain.
    Deep learning AI models are able to identify important characteristics in data themselves, rather than requiring human engineers to define these features. The structure also supports algorithms that can learn from errors and, through a process of repetition and adjustment, improve their own outputs. However, deep learning systems require a lot of data points to yield good results. They also typically take longer to train compared to simpler machine learning algorithms — so development costs tend to be higher.Diffusion
    Diffusion is the tech at the heart of many art-, music-, and text-generating AI models. Inspired by physics, diffusion systems slowly “destroy” the structure of data — e.g. photos, songs, and so on — by adding noise until there’s nothing left. In physics, diffusion is spontaneous and irreversible — sugar diffused in coffee can’t be restored to cube form. But diffusion systems in AI aim to learn a sort of “reverse diffusion” process to restore the destroyed data, gaining the ability to recover the data from noise.
    Distillation
    Distillation is a technique used to extract knowledge from a large AI model with a ‘teacher-student’ model. Developers send requests to a teacher model and record the outputs. Answers are sometimes compared with a dataset to see how accurate they are. These outputs are then used to train the student model, which is trained to approximate the teacher’s behavior.
    Distillation can be used to create a smaller, more efficient model based on a larger model with a minimal distillation loss. This is likely how OpenAI developed GPT-4 Turbo, a faster version of GPT-4.
    While all AI companies use distillation internally, it may have also been used by some AI companies to catch up with frontier models. Distillation from a competitor usually violates the terms of service of AI API and chat assistants.
    Fine-tuning
    This refers to the further training of an AI model to optimize performance for a more specific task or area than was previously a focal point of its training — typically by feeding in new, specializeddata. 
    Many AI startups are taking large language models as a starting point to build a commercial product but are vying to amp up utility for a target sector or task by supplementing earlier training cycles with fine-tuning based on their own domain-specific knowledge and expertise.GAN
    A GAN, or Generative Adversarial Network, is a type of machine learning framework that underpins some important developments in generative AI when it comes to producing realistic data – includingdeepfake tools. GANs involve the use of a pair of neural networks, one of which draws on its training data to generate an output that is passed to the other model to evaluate. This second, discriminator model thus plays the role of a classifier on the generator’s output – enabling it to improve over time. 
    The GAN structure is set up as a competition– with the two models essentially programmed to try to outdo each other: the generator is trying to get its output past the discriminator, while the discriminator is working to spot artificially generated data. This structured contest can optimize AI outputs to be more realistic without the need for additional human intervention. Though GANs work best for narrower applications, rather than general purpose AI.
    Hallucination
    Hallucination is the AI industry’s preferred term for AI models making stuff up – literally generating information that is incorrect. Obviously, it’s a huge problem for AI quality. 
    Hallucinations produce GenAI outputs that can be misleading and could even lead to real-life risks — with potentially dangerous consequences. This is why most GenAI tools’ small print now warns users to verify AI-generated answers, even though such disclaimers are usually far less prominent than the information the tools dispense at the touch of a button.
    The problem of AIs fabricating information is thought to arise as a consequence of gaps in training data. For general purpose GenAI especially — also sometimes known as foundation models — this looks difficult to resolve. There is simply not enough data in existence to train AI models to comprehensively resolve all the questions we could possibly ask. TL;DR: we haven’t invented God. 
    Hallucinations are contributing to a push towards increasingly specialized and/or vertical AI models — i.e. domain-specific AIs that require narrower expertise – as a way to reduce the likelihood of knowledge gaps and shrink disinformation risks.
    Inference
    Inference is the process of running an AI model. It’s setting a model loose to make predictions or draw conclusions from previously-seen data. To be clear, inference can’t happen without training; a model must learn patterns in a set of data before it can effectively extrapolate from this training data.
    Many types of hardware can perform inference, ranging from smartphone processors to beefy GPUs to custom-designed AI accelerators. But not all of them can run models equally well. Very large models would take ages to make predictions on, say, a laptop versus a cloud server with high-end AI chips.Large language modelLarge language models, or LLMs, are the AI models used by popular AI assistants, such as ChatGPT, Claude, Google’s Gemini, Meta’s AI Llama, Microsoft Copilot, or Mistral’s Le Chat. When you chat with an AI assistant, you interact with a large language model that processes your request directly or with the help of different available tools, such as web browsing or code interpreters.
    AI assistants and LLMs can have different names. For instance, GPT is OpenAI’s large language model and ChatGPT is the AI assistant product.
    LLMs are deep neural networks made of billions of numerical parametersthat learn the relationships between words and phrases and create a representation of language, a sort of multidimensional map of words.
    These models are created from encoding the patterns they find in billions of books, articles, and transcripts. When you prompt an LLM, the model generates the most likely pattern that fits the prompt. It then evaluates the most probable next word after the last one based on what was said before. Repeat, repeat, and repeat.Neural network
    A neural network refers to the multi-layered algorithmic structure that underpins deep learning — and, more broadly, the whole boom in generative AI tools following the emergence of large language models. 
    Although the idea of taking inspiration from the densely interconnected pathways of the human brain as a design structure for data processing algorithms dates all the way back to the 1940s, it was the much more recent rise of graphical processing hardware— via the video game industry — that really unlocked the power of this theory. These chips proved well suited to training algorithms with many more layers than was possible in earlier epochs — enabling neural network-based AI systems to achieve far better performance across many domains, including voice recognition, autonomous navigation, and drug discovery.Training
    Developing machine learning AIs involves a process known as training. In simple terms, this refers to data being fed in in order that the model can learn from patterns and generate useful outputs.
    Things can get a bit philosophical at this point in the AI stack — since, pre-training, the mathematical structure that’s used as the starting point for developing a learning system is just a bunch of layers and random numbers. It’s only through training that the AI model really takes shape. Essentially, it’s the process of the system responding to characteristics in the data that enables it to adapt outputs towards a sought-for goal — whether that’s identifying images of cats or producing a haiku on demand.
    It’s important to note that not all AI requires training. Rules-based AIs that are programmed to follow manually predefined instructions — for example, such as linear chatbots — don’t need to undergo training. However, such AI systems are likely to be more constrained thanself-learning systems.
    Still, training can be expensive because it requires lots of inputs — and, typically, the volumes of inputs required for such models have been trending upwards.
    Hybrid approaches can sometimes be used to shortcut model development and help manage costs. Such as doing data-driven fine-tuning of a rules-based AI — meaning development requires less data, compute, energy, and algorithmic complexity than if the developer had started building from scratch.Transfer learning
    A technique where a previously trained AI model is used as the starting point for developing a new model for a different but typically related task – allowing knowledge gained in previous training cycles to be reapplied. 
    Transfer learning can drive efficiency savings by shortcutting model development. It can also be useful when data for the task that the model is being developed for is somewhat limited. But it’s important to note that the approach has limitations. Models that rely on transfer learning to gain generalized capabilities will likely require training on additional data in order to perform well in their domain of focusWeights
    Weights are core to AI training, as they determine how much importanceis given to different featuresin the data used for training the system — thereby shaping the AI model’s output. 
    Put another way, weights are numerical parameters that define what’s most salient in a dataset for the given training task. They achieve their function by applying multiplication to inputs. Model training typically begins with weights that are randomly assigned, but as the process unfolds, the weights adjust as the model seeks to arrive at an output that more closely matches the target.
    For example, an AI model for predicting housing prices that’s trained on historical real estate data for a target location could include weights for features such as the number of bedrooms and bathrooms, whether a property is detached or semi-detached, whether it has parking, a garage, and so on. 
    Ultimately, the weights the model attaches to each of these inputs reflect how much they influence the value of a property, based on the given dataset.

    Topics
    #llms #hallucinations #heres #simple #guide
    From LLMs to hallucinations, here’s a simple guide to common AI terms
    Artificial intelligence is a deep and convoluted world. The scientists who work in this field often rely on jargon and lingo to explain what they’re working on. As a result, we frequently have to use those technical terms in our coverage of the artificial intelligence industry. That’s why we thought it would be helpful to put together a glossary with definitions of some of the most important words and phrases that we use in our articles. We will regularly update this glossary to add new entries as researchers continually uncover novel methods to push the frontier of artificial intelligence while identifying emerging safety risks. AGI Artificial general intelligence, or AGI, is a nebulous term. But it generally refers to AI that’s more capable than the average human at many, if not most, tasks. OpenAI CEO Sam Altman recently described AGI as the “equivalent of a median human that you could hire as a co-worker.” Meanwhile, OpenAI’s charter defines AGI as “highly autonomous systems that outperform humans at most economically valuable work.” Google DeepMind’s understanding differs slightly from these two definitions; the lab views AGI as “AI that’s at least as capable as humans at most cognitive tasks.” Confused? Not to worry — so are experts at the forefront of AI research. AI agent An AI agent refers to a tool that uses AI technologies to perform a series of tasks on your behalf — beyond what a more basic AI chatbot could do — such as filing expenses, booking tickets or a table at a restaurant, or even writing and maintaining code. However, as we’ve explained before, there are lots of moving pieces in this emergent space, so “AI agent” might mean different things to different people. Infrastructure is also still being built out to deliver on its envisaged capabilities. But the basic concept implies an autonomous system that may draw on multiple AI systems to carry out multistep tasks. Chain of thought Given a simple question, a human brain can answer without even thinking too much about it — things like “which animal is taller, a giraffe or a cat?” But in many cases, you often need a pen and paper to come up with the right answer because there are intermediary steps. For instance, if a farmer has chickens and cows, and together they have 40 heads and 120 legs, you might need to write down a simple equation to come up with the answer. In an AI context, chain-of-thought reasoning for large language models means breaking down a problem into smaller, intermediate steps to improve the quality of the end result. It usually takes longer to get an answer, but the answer is more likely to be correct, especially in a logic or coding context. Reasoning models are developed from traditional large language models and optimized for chain-of-thought thinking thanks to reinforcement learning.Techcrunch event Join us at TechCrunch Sessions: AI Secure your spot for our leading AI industry event with speakers from OpenAI, Anthropic, and Cohere. For a limited time, tickets are just for an entire day of expert talks, workshops, and potent networking. Exhibit at TechCrunch Sessions: AI Secure your spot at TC Sessions: AI and show 1,200+ decision-makers what you’ve built — without the big spend. Available through May 9 or while tables last. Berkeley, CA | June 5 REGISTER NOW Deep learning A subset of self-improving machine learning in which AI algorithms are designed with a multi-layered, artificial neural networkstructure. This allows them to make more complex correlations compared to simpler machine learning-based systems, such as linear models or decision trees. The structure of deep learning algorithms draws inspiration from the interconnected pathways of neurons in the human brain. Deep learning AI models are able to identify important characteristics in data themselves, rather than requiring human engineers to define these features. The structure also supports algorithms that can learn from errors and, through a process of repetition and adjustment, improve their own outputs. However, deep learning systems require a lot of data points to yield good results. They also typically take longer to train compared to simpler machine learning algorithms — so development costs tend to be higher.Diffusion Diffusion is the tech at the heart of many art-, music-, and text-generating AI models. Inspired by physics, diffusion systems slowly “destroy” the structure of data — e.g. photos, songs, and so on — by adding noise until there’s nothing left. In physics, diffusion is spontaneous and irreversible — sugar diffused in coffee can’t be restored to cube form. But diffusion systems in AI aim to learn a sort of “reverse diffusion” process to restore the destroyed data, gaining the ability to recover the data from noise. Distillation Distillation is a technique used to extract knowledge from a large AI model with a ‘teacher-student’ model. Developers send requests to a teacher model and record the outputs. Answers are sometimes compared with a dataset to see how accurate they are. These outputs are then used to train the student model, which is trained to approximate the teacher’s behavior. Distillation can be used to create a smaller, more efficient model based on a larger model with a minimal distillation loss. This is likely how OpenAI developed GPT-4 Turbo, a faster version of GPT-4. While all AI companies use distillation internally, it may have also been used by some AI companies to catch up with frontier models. Distillation from a competitor usually violates the terms of service of AI API and chat assistants. Fine-tuning This refers to the further training of an AI model to optimize performance for a more specific task or area than was previously a focal point of its training — typically by feeding in new, specializeddata.  Many AI startups are taking large language models as a starting point to build a commercial product but are vying to amp up utility for a target sector or task by supplementing earlier training cycles with fine-tuning based on their own domain-specific knowledge and expertise.GAN A GAN, or Generative Adversarial Network, is a type of machine learning framework that underpins some important developments in generative AI when it comes to producing realistic data – includingdeepfake tools. GANs involve the use of a pair of neural networks, one of which draws on its training data to generate an output that is passed to the other model to evaluate. This second, discriminator model thus plays the role of a classifier on the generator’s output – enabling it to improve over time.  The GAN structure is set up as a competition– with the two models essentially programmed to try to outdo each other: the generator is trying to get its output past the discriminator, while the discriminator is working to spot artificially generated data. This structured contest can optimize AI outputs to be more realistic without the need for additional human intervention. Though GANs work best for narrower applications, rather than general purpose AI. Hallucination Hallucination is the AI industry’s preferred term for AI models making stuff up – literally generating information that is incorrect. Obviously, it’s a huge problem for AI quality.  Hallucinations produce GenAI outputs that can be misleading and could even lead to real-life risks — with potentially dangerous consequences. This is why most GenAI tools’ small print now warns users to verify AI-generated answers, even though such disclaimers are usually far less prominent than the information the tools dispense at the touch of a button. The problem of AIs fabricating information is thought to arise as a consequence of gaps in training data. For general purpose GenAI especially — also sometimes known as foundation models — this looks difficult to resolve. There is simply not enough data in existence to train AI models to comprehensively resolve all the questions we could possibly ask. TL;DR: we haven’t invented God.  Hallucinations are contributing to a push towards increasingly specialized and/or vertical AI models — i.e. domain-specific AIs that require narrower expertise – as a way to reduce the likelihood of knowledge gaps and shrink disinformation risks. Inference Inference is the process of running an AI model. It’s setting a model loose to make predictions or draw conclusions from previously-seen data. To be clear, inference can’t happen without training; a model must learn patterns in a set of data before it can effectively extrapolate from this training data. Many types of hardware can perform inference, ranging from smartphone processors to beefy GPUs to custom-designed AI accelerators. But not all of them can run models equally well. Very large models would take ages to make predictions on, say, a laptop versus a cloud server with high-end AI chips.Large language modelLarge language models, or LLMs, are the AI models used by popular AI assistants, such as ChatGPT, Claude, Google’s Gemini, Meta’s AI Llama, Microsoft Copilot, or Mistral’s Le Chat. When you chat with an AI assistant, you interact with a large language model that processes your request directly or with the help of different available tools, such as web browsing or code interpreters. AI assistants and LLMs can have different names. For instance, GPT is OpenAI’s large language model and ChatGPT is the AI assistant product. LLMs are deep neural networks made of billions of numerical parametersthat learn the relationships between words and phrases and create a representation of language, a sort of multidimensional map of words. These models are created from encoding the patterns they find in billions of books, articles, and transcripts. When you prompt an LLM, the model generates the most likely pattern that fits the prompt. It then evaluates the most probable next word after the last one based on what was said before. Repeat, repeat, and repeat.Neural network A neural network refers to the multi-layered algorithmic structure that underpins deep learning — and, more broadly, the whole boom in generative AI tools following the emergence of large language models.  Although the idea of taking inspiration from the densely interconnected pathways of the human brain as a design structure for data processing algorithms dates all the way back to the 1940s, it was the much more recent rise of graphical processing hardware— via the video game industry — that really unlocked the power of this theory. These chips proved well suited to training algorithms with many more layers than was possible in earlier epochs — enabling neural network-based AI systems to achieve far better performance across many domains, including voice recognition, autonomous navigation, and drug discovery.Training Developing machine learning AIs involves a process known as training. In simple terms, this refers to data being fed in in order that the model can learn from patterns and generate useful outputs. Things can get a bit philosophical at this point in the AI stack — since, pre-training, the mathematical structure that’s used as the starting point for developing a learning system is just a bunch of layers and random numbers. It’s only through training that the AI model really takes shape. Essentially, it’s the process of the system responding to characteristics in the data that enables it to adapt outputs towards a sought-for goal — whether that’s identifying images of cats or producing a haiku on demand. It’s important to note that not all AI requires training. Rules-based AIs that are programmed to follow manually predefined instructions — for example, such as linear chatbots — don’t need to undergo training. However, such AI systems are likely to be more constrained thanself-learning systems. Still, training can be expensive because it requires lots of inputs — and, typically, the volumes of inputs required for such models have been trending upwards. Hybrid approaches can sometimes be used to shortcut model development and help manage costs. Such as doing data-driven fine-tuning of a rules-based AI — meaning development requires less data, compute, energy, and algorithmic complexity than if the developer had started building from scratch.Transfer learning A technique where a previously trained AI model is used as the starting point for developing a new model for a different but typically related task – allowing knowledge gained in previous training cycles to be reapplied.  Transfer learning can drive efficiency savings by shortcutting model development. It can also be useful when data for the task that the model is being developed for is somewhat limited. But it’s important to note that the approach has limitations. Models that rely on transfer learning to gain generalized capabilities will likely require training on additional data in order to perform well in their domain of focusWeights Weights are core to AI training, as they determine how much importanceis given to different featuresin the data used for training the system — thereby shaping the AI model’s output.  Put another way, weights are numerical parameters that define what’s most salient in a dataset for the given training task. They achieve their function by applying multiplication to inputs. Model training typically begins with weights that are randomly assigned, but as the process unfolds, the weights adjust as the model seeks to arrive at an output that more closely matches the target. For example, an AI model for predicting housing prices that’s trained on historical real estate data for a target location could include weights for features such as the number of bedrooms and bathrooms, whether a property is detached or semi-detached, whether it has parking, a garage, and so on.  Ultimately, the weights the model attaches to each of these inputs reflect how much they influence the value of a property, based on the given dataset. Topics #llms #hallucinations #heres #simple #guide
    TECHCRUNCH.COM
    From LLMs to hallucinations, here’s a simple guide to common AI terms
    Artificial intelligence is a deep and convoluted world. The scientists who work in this field often rely on jargon and lingo to explain what they’re working on. As a result, we frequently have to use those technical terms in our coverage of the artificial intelligence industry. That’s why we thought it would be helpful to put together a glossary with definitions of some of the most important words and phrases that we use in our articles. We will regularly update this glossary to add new entries as researchers continually uncover novel methods to push the frontier of artificial intelligence while identifying emerging safety risks. AGI Artificial general intelligence, or AGI, is a nebulous term. But it generally refers to AI that’s more capable than the average human at many, if not most, tasks. OpenAI CEO Sam Altman recently described AGI as the “equivalent of a median human that you could hire as a co-worker.” Meanwhile, OpenAI’s charter defines AGI as “highly autonomous systems that outperform humans at most economically valuable work.” Google DeepMind’s understanding differs slightly from these two definitions; the lab views AGI as “AI that’s at least as capable as humans at most cognitive tasks.” Confused? Not to worry — so are experts at the forefront of AI research. AI agent An AI agent refers to a tool that uses AI technologies to perform a series of tasks on your behalf — beyond what a more basic AI chatbot could do — such as filing expenses, booking tickets or a table at a restaurant, or even writing and maintaining code. However, as we’ve explained before, there are lots of moving pieces in this emergent space, so “AI agent” might mean different things to different people. Infrastructure is also still being built out to deliver on its envisaged capabilities. But the basic concept implies an autonomous system that may draw on multiple AI systems to carry out multistep tasks. Chain of thought Given a simple question, a human brain can answer without even thinking too much about it — things like “which animal is taller, a giraffe or a cat?” But in many cases, you often need a pen and paper to come up with the right answer because there are intermediary steps. For instance, if a farmer has chickens and cows, and together they have 40 heads and 120 legs, you might need to write down a simple equation to come up with the answer (20 chickens and 20 cows). In an AI context, chain-of-thought reasoning for large language models means breaking down a problem into smaller, intermediate steps to improve the quality of the end result. It usually takes longer to get an answer, but the answer is more likely to be correct, especially in a logic or coding context. Reasoning models are developed from traditional large language models and optimized for chain-of-thought thinking thanks to reinforcement learning. (See: Large language model) Techcrunch event Join us at TechCrunch Sessions: AI Secure your spot for our leading AI industry event with speakers from OpenAI, Anthropic, and Cohere. For a limited time, tickets are just $292 for an entire day of expert talks, workshops, and potent networking. Exhibit at TechCrunch Sessions: AI Secure your spot at TC Sessions: AI and show 1,200+ decision-makers what you’ve built — without the big spend. Available through May 9 or while tables last. Berkeley, CA | June 5 REGISTER NOW Deep learning A subset of self-improving machine learning in which AI algorithms are designed with a multi-layered, artificial neural network (ANN) structure. This allows them to make more complex correlations compared to simpler machine learning-based systems, such as linear models or decision trees. The structure of deep learning algorithms draws inspiration from the interconnected pathways of neurons in the human brain. Deep learning AI models are able to identify important characteristics in data themselves, rather than requiring human engineers to define these features. The structure also supports algorithms that can learn from errors and, through a process of repetition and adjustment, improve their own outputs. However, deep learning systems require a lot of data points to yield good results (millions or more). They also typically take longer to train compared to simpler machine learning algorithms — so development costs tend to be higher. (See: Neural network) Diffusion Diffusion is the tech at the heart of many art-, music-, and text-generating AI models. Inspired by physics, diffusion systems slowly “destroy” the structure of data — e.g. photos, songs, and so on — by adding noise until there’s nothing left. In physics, diffusion is spontaneous and irreversible — sugar diffused in coffee can’t be restored to cube form. But diffusion systems in AI aim to learn a sort of “reverse diffusion” process to restore the destroyed data, gaining the ability to recover the data from noise. Distillation Distillation is a technique used to extract knowledge from a large AI model with a ‘teacher-student’ model. Developers send requests to a teacher model and record the outputs. Answers are sometimes compared with a dataset to see how accurate they are. These outputs are then used to train the student model, which is trained to approximate the teacher’s behavior. Distillation can be used to create a smaller, more efficient model based on a larger model with a minimal distillation loss. This is likely how OpenAI developed GPT-4 Turbo, a faster version of GPT-4. While all AI companies use distillation internally, it may have also been used by some AI companies to catch up with frontier models. Distillation from a competitor usually violates the terms of service of AI API and chat assistants. Fine-tuning This refers to the further training of an AI model to optimize performance for a more specific task or area than was previously a focal point of its training — typically by feeding in new, specialized (i.e., task-oriented) data.  Many AI startups are taking large language models as a starting point to build a commercial product but are vying to amp up utility for a target sector or task by supplementing earlier training cycles with fine-tuning based on their own domain-specific knowledge and expertise. (See: Large language model [LLM]) GAN A GAN, or Generative Adversarial Network, is a type of machine learning framework that underpins some important developments in generative AI when it comes to producing realistic data – including (but not only) deepfake tools. GANs involve the use of a pair of neural networks, one of which draws on its training data to generate an output that is passed to the other model to evaluate. This second, discriminator model thus plays the role of a classifier on the generator’s output – enabling it to improve over time.  The GAN structure is set up as a competition (hence “adversarial”) – with the two models essentially programmed to try to outdo each other: the generator is trying to get its output past the discriminator, while the discriminator is working to spot artificially generated data. This structured contest can optimize AI outputs to be more realistic without the need for additional human intervention. Though GANs work best for narrower applications (such as producing realistic photos or videos), rather than general purpose AI. Hallucination Hallucination is the AI industry’s preferred term for AI models making stuff up – literally generating information that is incorrect. Obviously, it’s a huge problem for AI quality.  Hallucinations produce GenAI outputs that can be misleading and could even lead to real-life risks — with potentially dangerous consequences (think of a health query that returns harmful medical advice). This is why most GenAI tools’ small print now warns users to verify AI-generated answers, even though such disclaimers are usually far less prominent than the information the tools dispense at the touch of a button. The problem of AIs fabricating information is thought to arise as a consequence of gaps in training data. For general purpose GenAI especially — also sometimes known as foundation models — this looks difficult to resolve. There is simply not enough data in existence to train AI models to comprehensively resolve all the questions we could possibly ask. TL;DR: we haven’t invented God (yet).  Hallucinations are contributing to a push towards increasingly specialized and/or vertical AI models — i.e. domain-specific AIs that require narrower expertise – as a way to reduce the likelihood of knowledge gaps and shrink disinformation risks. Inference Inference is the process of running an AI model. It’s setting a model loose to make predictions or draw conclusions from previously-seen data. To be clear, inference can’t happen without training; a model must learn patterns in a set of data before it can effectively extrapolate from this training data. Many types of hardware can perform inference, ranging from smartphone processors to beefy GPUs to custom-designed AI accelerators. But not all of them can run models equally well. Very large models would take ages to make predictions on, say, a laptop versus a cloud server with high-end AI chips. [See: Training] Large language model (LLM) Large language models, or LLMs, are the AI models used by popular AI assistants, such as ChatGPT, Claude, Google’s Gemini, Meta’s AI Llama, Microsoft Copilot, or Mistral’s Le Chat. When you chat with an AI assistant, you interact with a large language model that processes your request directly or with the help of different available tools, such as web browsing or code interpreters. AI assistants and LLMs can have different names. For instance, GPT is OpenAI’s large language model and ChatGPT is the AI assistant product. LLMs are deep neural networks made of billions of numerical parameters (or weights, see below) that learn the relationships between words and phrases and create a representation of language, a sort of multidimensional map of words. These models are created from encoding the patterns they find in billions of books, articles, and transcripts. When you prompt an LLM, the model generates the most likely pattern that fits the prompt. It then evaluates the most probable next word after the last one based on what was said before. Repeat, repeat, and repeat. (See: Neural network) Neural network A neural network refers to the multi-layered algorithmic structure that underpins deep learning — and, more broadly, the whole boom in generative AI tools following the emergence of large language models.  Although the idea of taking inspiration from the densely interconnected pathways of the human brain as a design structure for data processing algorithms dates all the way back to the 1940s, it was the much more recent rise of graphical processing hardware (GPUs) — via the video game industry — that really unlocked the power of this theory. These chips proved well suited to training algorithms with many more layers than was possible in earlier epochs — enabling neural network-based AI systems to achieve far better performance across many domains, including voice recognition, autonomous navigation, and drug discovery. (See: Large language model [LLM]) Training Developing machine learning AIs involves a process known as training. In simple terms, this refers to data being fed in in order that the model can learn from patterns and generate useful outputs. Things can get a bit philosophical at this point in the AI stack — since, pre-training, the mathematical structure that’s used as the starting point for developing a learning system is just a bunch of layers and random numbers. It’s only through training that the AI model really takes shape. Essentially, it’s the process of the system responding to characteristics in the data that enables it to adapt outputs towards a sought-for goal — whether that’s identifying images of cats or producing a haiku on demand. It’s important to note that not all AI requires training. Rules-based AIs that are programmed to follow manually predefined instructions — for example, such as linear chatbots — don’t need to undergo training. However, such AI systems are likely to be more constrained than (well-trained) self-learning systems. Still, training can be expensive because it requires lots of inputs — and, typically, the volumes of inputs required for such models have been trending upwards. Hybrid approaches can sometimes be used to shortcut model development and help manage costs. Such as doing data-driven fine-tuning of a rules-based AI — meaning development requires less data, compute, energy, and algorithmic complexity than if the developer had started building from scratch. [See: Inference] Transfer learning A technique where a previously trained AI model is used as the starting point for developing a new model for a different but typically related task – allowing knowledge gained in previous training cycles to be reapplied.  Transfer learning can drive efficiency savings by shortcutting model development. It can also be useful when data for the task that the model is being developed for is somewhat limited. But it’s important to note that the approach has limitations. Models that rely on transfer learning to gain generalized capabilities will likely require training on additional data in order to perform well in their domain of focus (See: Fine tuning) Weights Weights are core to AI training, as they determine how much importance (or weight) is given to different features (or input variables) in the data used for training the system — thereby shaping the AI model’s output.  Put another way, weights are numerical parameters that define what’s most salient in a dataset for the given training task. They achieve their function by applying multiplication to inputs. Model training typically begins with weights that are randomly assigned, but as the process unfolds, the weights adjust as the model seeks to arrive at an output that more closely matches the target. For example, an AI model for predicting housing prices that’s trained on historical real estate data for a target location could include weights for features such as the number of bedrooms and bathrooms, whether a property is detached or semi-detached, whether it has parking, a garage, and so on.  Ultimately, the weights the model attaches to each of these inputs reflect how much they influence the value of a property, based on the given dataset. Topics
    0 Kommentare 0 Anteile
  • Researchers from the National University of Singapore Introduce ‘Thinkless,’ an Adaptive Framework that Reduces Unnecessary Reasoning by up to 90% Using DeGRPO

    The effectiveness of language models relies on their ability to simulate human-like step-by-step deduction. However, these reasoning sequences are resource-intensive and can be wasteful for simple questions that do not require elaborate computation. This lack of awareness regarding the complexity of the task is one of the core challenges in these models. They often default to detailed reasoning even for queries that could be answered directly. Such an approach increases token usage, extends response time, and increases system latency and memory usage. As a result, there’s a pressing need to equip language models with a mechanism that allows them to make autonomous decisions about whether to think deeply or respond succinctly.
    Current tools attempting to solve this issue either rely on manually set heuristics or prompt engineering to switch between short and long responses. Some methods use separate models and route questions based on complexity estimates. Still, these external routing systems often lack insight into the target model’s strengths and fail to make optimal decisions. Other techniques fine-tune models with prompt-based cues like “reasoning on/off,” but these rely on static rules rather than dynamic understanding. Despite some improvements, these approaches fail to enable fully autonomous and context-sensitive control within a single model.

    Researchers from the National University of Singapore introduced a new framework called Thinkless, which equips a language model with the ability to dynamically decide between using short or long-form reasoning. The framework is built on reinforcement learning and introduces two special control tokens—<short> for concise answers and <think> for detailed responses. By incorporating a novel algorithm called Decoupled Group Relative Policy Optimization, Thinkless separates the training focus between selecting the reasoning mode and improving the accuracy of the generated response. This design prevents the model from falling into one-dimensional behavior and enables adaptive reasoning tailored to each query.
    The methodology involves two stages: warm-up distillation and reinforcement learning. In the distillation phase, Thinkless is trained using outputs from two expert models—one specializing in short responses and the other in detailed reasoning. This stage helps the model establish a firm link between the control token and the desired reasoning format. The reinforcement learning stage then fine-tunes the model’s ability to decide which reasoning mode to use. DeGRPO decomposes the learning into two separate objectives: one for training the control token and another for refining the response tokens. This approach avoids the gradient imbalances in earlier models, where longer responses would overpower the learning signal, leading to a collapse in reasoning diversity. Thinkless ensures that both <short> and <think> tokens receive balanced updates, promoting stable learning across response types.

    When evaluated, Thinkless significantly reduced long-form reasoning while preserving high accuracy. On the Minerva Algebra benchmark, the model used the <think> token in only 25.88% of cases while achieving 94.59% accuracy. In contrast, conventional reasoning models had to use extended chains of thought much more frequently. On the AIME 2024 dataset, Thinkless reached a 27.33% accuracy rate with 100% usage of the reasoning mode, showing that it could maintain performance when full reasoning was necessary. On the GSM8K dataset, it utilized <think> only 13.31% of the time, yet still achieved 84.18% accuracy. These results reflect the model’s ability to handle simple and complex queries with appropriate reasoning depth, cutting down on unnecessary token generation by as much as 90% in some tasks.
    Overall, this study from the National University of Singapore researchers presents a compelling solution to the inefficiencies of uniform reasoning in large language models. By introducing a mechanism that enables models to judge task complexity and adjust their inference strategy accordingly, Thinkless optimizes both accuracy and efficiency. The method balances depth of reasoning and response precision without relying on fixed rules, offering a data-driven approach to more intelligent language model behavior.

    Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter.
    NikhilNikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Material Science, he is exploring new advancements and creating opportunities to contribute.Nikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces MathCoder-VL and FigCodifier: Advancing Multimodal Mathematical Reasoning with Vision-to-Code AlignmentNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces PARSCALE: A Parallel Computation Method for Efficient and Scalable Language Model DeploymentNikhilhttps://www.marktechpost.com/author/nikhil0980/Google AI Releases Standalone NotebookLM Mobile App with Offline Audio and Seamless Source IntegrationNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper from Microsoft Introduces a DiskANN-Integrated System: A Cost-Effective and Low-Latency Vector Search Using Azure Cosmos DB
    #researchers #national #university #singapore #introduce
    Researchers from the National University of Singapore Introduce ‘Thinkless,’ an Adaptive Framework that Reduces Unnecessary Reasoning by up to 90% Using DeGRPO
    The effectiveness of language models relies on their ability to simulate human-like step-by-step deduction. However, these reasoning sequences are resource-intensive and can be wasteful for simple questions that do not require elaborate computation. This lack of awareness regarding the complexity of the task is one of the core challenges in these models. They often default to detailed reasoning even for queries that could be answered directly. Such an approach increases token usage, extends response time, and increases system latency and memory usage. As a result, there’s a pressing need to equip language models with a mechanism that allows them to make autonomous decisions about whether to think deeply or respond succinctly. Current tools attempting to solve this issue either rely on manually set heuristics or prompt engineering to switch between short and long responses. Some methods use separate models and route questions based on complexity estimates. Still, these external routing systems often lack insight into the target model’s strengths and fail to make optimal decisions. Other techniques fine-tune models with prompt-based cues like “reasoning on/off,” but these rely on static rules rather than dynamic understanding. Despite some improvements, these approaches fail to enable fully autonomous and context-sensitive control within a single model. Researchers from the National University of Singapore introduced a new framework called Thinkless, which equips a language model with the ability to dynamically decide between using short or long-form reasoning. The framework is built on reinforcement learning and introduces two special control tokens—<short> for concise answers and <think> for detailed responses. By incorporating a novel algorithm called Decoupled Group Relative Policy Optimization, Thinkless separates the training focus between selecting the reasoning mode and improving the accuracy of the generated response. This design prevents the model from falling into one-dimensional behavior and enables adaptive reasoning tailored to each query. The methodology involves two stages: warm-up distillation and reinforcement learning. In the distillation phase, Thinkless is trained using outputs from two expert models—one specializing in short responses and the other in detailed reasoning. This stage helps the model establish a firm link between the control token and the desired reasoning format. The reinforcement learning stage then fine-tunes the model’s ability to decide which reasoning mode to use. DeGRPO decomposes the learning into two separate objectives: one for training the control token and another for refining the response tokens. This approach avoids the gradient imbalances in earlier models, where longer responses would overpower the learning signal, leading to a collapse in reasoning diversity. Thinkless ensures that both <short> and <think> tokens receive balanced updates, promoting stable learning across response types. When evaluated, Thinkless significantly reduced long-form reasoning while preserving high accuracy. On the Minerva Algebra benchmark, the model used the <think> token in only 25.88% of cases while achieving 94.59% accuracy. In contrast, conventional reasoning models had to use extended chains of thought much more frequently. On the AIME 2024 dataset, Thinkless reached a 27.33% accuracy rate with 100% usage of the reasoning mode, showing that it could maintain performance when full reasoning was necessary. On the GSM8K dataset, it utilized <think> only 13.31% of the time, yet still achieved 84.18% accuracy. These results reflect the model’s ability to handle simple and complex queries with appropriate reasoning depth, cutting down on unnecessary token generation by as much as 90% in some tasks. Overall, this study from the National University of Singapore researchers presents a compelling solution to the inefficiencies of uniform reasoning in large language models. By introducing a mechanism that enables models to judge task complexity and adjust their inference strategy accordingly, Thinkless optimizes both accuracy and efficiency. The method balances depth of reasoning and response precision without relying on fixed rules, offering a data-driven approach to more intelligent language model behavior. Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. NikhilNikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Material Science, he is exploring new advancements and creating opportunities to contribute.Nikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces MathCoder-VL and FigCodifier: Advancing Multimodal Mathematical Reasoning with Vision-to-Code AlignmentNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces PARSCALE: A Parallel Computation Method for Efficient and Scalable Language Model DeploymentNikhilhttps://www.marktechpost.com/author/nikhil0980/Google AI Releases Standalone NotebookLM Mobile App with Offline Audio and Seamless Source IntegrationNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper from Microsoft Introduces a DiskANN-Integrated System: A Cost-Effective and Low-Latency Vector Search Using Azure Cosmos DB #researchers #national #university #singapore #introduce
    WWW.MARKTECHPOST.COM
    Researchers from the National University of Singapore Introduce ‘Thinkless,’ an Adaptive Framework that Reduces Unnecessary Reasoning by up to 90% Using DeGRPO
    The effectiveness of language models relies on their ability to simulate human-like step-by-step deduction. However, these reasoning sequences are resource-intensive and can be wasteful for simple questions that do not require elaborate computation. This lack of awareness regarding the complexity of the task is one of the core challenges in these models. They often default to detailed reasoning even for queries that could be answered directly. Such an approach increases token usage, extends response time, and increases system latency and memory usage. As a result, there’s a pressing need to equip language models with a mechanism that allows them to make autonomous decisions about whether to think deeply or respond succinctly. Current tools attempting to solve this issue either rely on manually set heuristics or prompt engineering to switch between short and long responses. Some methods use separate models and route questions based on complexity estimates. Still, these external routing systems often lack insight into the target model’s strengths and fail to make optimal decisions. Other techniques fine-tune models with prompt-based cues like “reasoning on/off,” but these rely on static rules rather than dynamic understanding. Despite some improvements, these approaches fail to enable fully autonomous and context-sensitive control within a single model. Researchers from the National University of Singapore introduced a new framework called Thinkless, which equips a language model with the ability to dynamically decide between using short or long-form reasoning. The framework is built on reinforcement learning and introduces two special control tokens—<short> for concise answers and <think> for detailed responses. By incorporating a novel algorithm called Decoupled Group Relative Policy Optimization (DeGRPO), Thinkless separates the training focus between selecting the reasoning mode and improving the accuracy of the generated response. This design prevents the model from falling into one-dimensional behavior and enables adaptive reasoning tailored to each query. The methodology involves two stages: warm-up distillation and reinforcement learning. In the distillation phase, Thinkless is trained using outputs from two expert models—one specializing in short responses and the other in detailed reasoning. This stage helps the model establish a firm link between the control token and the desired reasoning format. The reinforcement learning stage then fine-tunes the model’s ability to decide which reasoning mode to use. DeGRPO decomposes the learning into two separate objectives: one for training the control token and another for refining the response tokens. This approach avoids the gradient imbalances in earlier models, where longer responses would overpower the learning signal, leading to a collapse in reasoning diversity. Thinkless ensures that both <short> and <think> tokens receive balanced updates, promoting stable learning across response types. When evaluated, Thinkless significantly reduced long-form reasoning while preserving high accuracy. On the Minerva Algebra benchmark, the model used the <think> token in only 25.88% of cases while achieving 94.59% accuracy. In contrast, conventional reasoning models had to use extended chains of thought much more frequently. On the AIME 2024 dataset, Thinkless reached a 27.33% accuracy rate with 100% usage of the reasoning mode, showing that it could maintain performance when full reasoning was necessary. On the GSM8K dataset, it utilized <think> only 13.31% of the time, yet still achieved 84.18% accuracy. These results reflect the model’s ability to handle simple and complex queries with appropriate reasoning depth, cutting down on unnecessary token generation by as much as 90% in some tasks. Overall, this study from the National University of Singapore researchers presents a compelling solution to the inefficiencies of uniform reasoning in large language models. By introducing a mechanism that enables models to judge task complexity and adjust their inference strategy accordingly, Thinkless optimizes both accuracy and efficiency. The method balances depth of reasoning and response precision without relying on fixed rules, offering a data-driven approach to more intelligent language model behavior. Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. NikhilNikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Material Science, he is exploring new advancements and creating opportunities to contribute.Nikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces MathCoder-VL and FigCodifier: Advancing Multimodal Mathematical Reasoning with Vision-to-Code AlignmentNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Introduces PARSCALE (Parallel Scaling): A Parallel Computation Method for Efficient and Scalable Language Model DeploymentNikhilhttps://www.marktechpost.com/author/nikhil0980/Google AI Releases Standalone NotebookLM Mobile App with Offline Audio and Seamless Source IntegrationNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper from Microsoft Introduces a DiskANN-Integrated System: A Cost-Effective and Low-Latency Vector Search Using Azure Cosmos DB
    0 Kommentare 0 Anteile
  • Meta Researchers Introduced J1: A Reinforcement Learning Framework That Trains Language Models to Judge With Reasoned Consistency and Minimal Data

    Large language models are now being used for evaluation and judgment tasks, extending beyond their traditional role of text generation. This has led to “LLM-as-a-Judge,” where models assess outputs from other language models. Such evaluations are essential in reinforcement learning pipelines, benchmark testing, and system alignment. These judge models rely on internal chain-of-thought reasoning, mirroring human judgment processes. Unlike conventional reward models that provide direct scores, these models simulate thoughtful evaluation, making them better suited for complex tasks such as math problem-solving, ethical reasoning, and user intent interpretation. Their ability to interpret and validate responses across languages and domains enhances automation and scalability in language model development.
    However, current AI judgment systems face issues with inconsistency and shallow reasoning. Many rely on basic metrics or static annotations, which are inadequate for evaluating subjective or open-ended prompts. A common problem is position bias, where the order of answers affects the final decision, compromising fairness. Also, collecting human-annotated data at scale is costly and time-consuming, limiting the generalizability of these models.
    Several existing approaches have addressed these challenges, but with limited success. Systems like EvalPlanner and DeepSeek-GRM rely on human-labeled data or rigid training schemes, which limit adaptability across task types. Others, like DeepSeek-R1, depend on distillation from large models but perform poorly on ambiguous prompts. Static datasets and offline tuning strategies hinder dynamic reasoning, while newer methods using score formatting or structured prompts have shown minimal accuracy improvements. Despite larger datasets and models, performance gains in traditional systems have stalled.
    Researchers from Meta’s GenAI and FAIR teams introduced J1 to address the above limitations. J1 trains judgment models through a reinforcement learning-based framework, making them capable of learning through verifiable reward signals. The team used synthetic data to create high-quality and low-quality responses to a prompt, transforming subjective tasks into verifiable pairwise judgments. This synthetic dataset included 22,000 preference pairs, split between 17,000 prompts from the WildChat corpus and 5,000 mathematical queries. These were used to train two versions of J1: J1-Llama-8B and J1-Llama-70B, initialized from the Llama-3.1-8B-Instruct and Llama-3.3-70B-Instruct base models, respectively. The models were trained using Group Relative Policy Optimization, a reinforcement algorithm that eliminates the need for critic models and accelerates convergence.

    At the training strategy’s core is position-agnostic learning, where bothandinput formats are used in training to prevent position bias. Also, consistency-based rewards are applied only when the model delivers correct verdicts across both answer orderings. This structure allows the judge to be fair and reliable regardless of prompt or answer order. The training framework supports multiple variations: models can output final verdicts, numeric scores for each answer, or both. A pointwise judging variant is included, which evaluates single responses using scores from 0 to 10. These formats make J1 a versatile and generalizable system capable of judging various tasks.

    The results obtained using the J1 models reveal substantial performance improvements over existing systems. On the widely used Preference Proxy Evaluationsbenchmark, J1-Llama-70B achieved an overall accuracy of 69.6%, outperforming models trained with over ten times more data. In contrast, models like DeepSeek-GRM-27B and EvalPlanner-Llama-70B scored 67.2% and 65.6%, respectively. Even the smaller J1-Llama-8B model exceeded baseline systems like EvalPlanner-Llama-8B, scoring 62.2% versus 55.5%. J1 also showed top-tier performance on other critical benchmarks such as RewardBench, RM-Bench, JudgeBench, and FollowBenchEval, demonstrating robust generalization across verifiable and subjective tasks. These improvements are not just marginal but significant, considering the limited training data used in J1 compared to the expansive datasets in other models.

    Several Key Takeaways from the Research on J1:

    J1 is trained using 22,000 synthetic preference pairs, including 17K from WildChat and 5K from MATH tasks.
    The training uses GRPO, which streamlines RL by avoiding the need for separate critic models.
    It introduces position-agnostic learning, reducing position bias through consistency-based rewards.
    Two main model variants, J1-Llama-8B and J1-Llama-70B, were trained on modest data but outperformed large-scale models.
    J1-Llama-70B scored 69.6% on PPE, exceeding DeepSeek-GRM-27Band EvalPlanner-Llama-70B.
    Supports multiple judgment formats: pairwise with verdicts, pairwise with scores, and pointwise scores.
    Surpasses models distilled from DeepSeek-R1 and OpenAI’s o1-mini on several tasks.
    Demonstrates that reasoning quality, not just dataset size, is critical for accurate judgments.
    J1’s framework makes it a generalist judge applicable to verifiable and non-verifiable tasks.

    In conclusion, the J1 approach fundamentally redefines how judgment models are trained and evaluated. Synthetic data and reinforcement learning bypass the traditional need for costly annotations while promoting fair, logical, and consistent evaluations. This work illustrates that reasoning-driven judging can outperform larger models that rely heavily on data volume and static alignment techniques. It also validates the notion that judgment models should be thinkers first, and scorers second. With performance that rivals and often surpasses state-of-the-art systems, J1 sets a new benchmark in training LLM-as-a-Judge systems.

    Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter.
    Asif RazzaqWebsite |  + postsBioAsif Razzaq is the CEO of Marktechpost Media Inc.. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts of over 2 million monthly views, illustrating its popularity among audiences.Asif Razzaqhttps://www.marktechpost.com/author/6flvq/Sampling Without Data is Now Scalable: Meta AI Releases Adjoint Sampling for Reward-Driven Generative ModelingAsif Razzaqhttps://www.marktechpost.com/author/6flvq/Google AI Releases MedGemma: An Open Suite of Models Trained for Performance on Medical Text and Image ComprehensionAsif Razzaqhttps://www.marktechpost.com/author/6flvq/NVIDIA Releases Cosmos-Reason1: A Suite of AI Models Advancing Physical Common Sense and Embodied Reasoning in Real-World EnvironmentsAsif Razzaqhttps://www.marktechpost.com/author/6flvq/A Step-by-Step Coding Guide to Efficiently Fine-Tune Qwen3-14B Using Unsloth AI on Google Colab with Mixed Datasets and LoRA Optimization
    #meta #researchers #introduced #reinforcement #learning
    Meta Researchers Introduced J1: A Reinforcement Learning Framework That Trains Language Models to Judge With Reasoned Consistency and Minimal Data
    Large language models are now being used for evaluation and judgment tasks, extending beyond their traditional role of text generation. This has led to “LLM-as-a-Judge,” where models assess outputs from other language models. Such evaluations are essential in reinforcement learning pipelines, benchmark testing, and system alignment. These judge models rely on internal chain-of-thought reasoning, mirroring human judgment processes. Unlike conventional reward models that provide direct scores, these models simulate thoughtful evaluation, making them better suited for complex tasks such as math problem-solving, ethical reasoning, and user intent interpretation. Their ability to interpret and validate responses across languages and domains enhances automation and scalability in language model development. However, current AI judgment systems face issues with inconsistency and shallow reasoning. Many rely on basic metrics or static annotations, which are inadequate for evaluating subjective or open-ended prompts. A common problem is position bias, where the order of answers affects the final decision, compromising fairness. Also, collecting human-annotated data at scale is costly and time-consuming, limiting the generalizability of these models. Several existing approaches have addressed these challenges, but with limited success. Systems like EvalPlanner and DeepSeek-GRM rely on human-labeled data or rigid training schemes, which limit adaptability across task types. Others, like DeepSeek-R1, depend on distillation from large models but perform poorly on ambiguous prompts. Static datasets and offline tuning strategies hinder dynamic reasoning, while newer methods using score formatting or structured prompts have shown minimal accuracy improvements. Despite larger datasets and models, performance gains in traditional systems have stalled. Researchers from Meta’s GenAI and FAIR teams introduced J1 to address the above limitations. J1 trains judgment models through a reinforcement learning-based framework, making them capable of learning through verifiable reward signals. The team used synthetic data to create high-quality and low-quality responses to a prompt, transforming subjective tasks into verifiable pairwise judgments. This synthetic dataset included 22,000 preference pairs, split between 17,000 prompts from the WildChat corpus and 5,000 mathematical queries. These were used to train two versions of J1: J1-Llama-8B and J1-Llama-70B, initialized from the Llama-3.1-8B-Instruct and Llama-3.3-70B-Instruct base models, respectively. The models were trained using Group Relative Policy Optimization, a reinforcement algorithm that eliminates the need for critic models and accelerates convergence. At the training strategy’s core is position-agnostic learning, where bothandinput formats are used in training to prevent position bias. Also, consistency-based rewards are applied only when the model delivers correct verdicts across both answer orderings. This structure allows the judge to be fair and reliable regardless of prompt or answer order. The training framework supports multiple variations: models can output final verdicts, numeric scores for each answer, or both. A pointwise judging variant is included, which evaluates single responses using scores from 0 to 10. These formats make J1 a versatile and generalizable system capable of judging various tasks. The results obtained using the J1 models reveal substantial performance improvements over existing systems. On the widely used Preference Proxy Evaluationsbenchmark, J1-Llama-70B achieved an overall accuracy of 69.6%, outperforming models trained with over ten times more data. In contrast, models like DeepSeek-GRM-27B and EvalPlanner-Llama-70B scored 67.2% and 65.6%, respectively. Even the smaller J1-Llama-8B model exceeded baseline systems like EvalPlanner-Llama-8B, scoring 62.2% versus 55.5%. J1 also showed top-tier performance on other critical benchmarks such as RewardBench, RM-Bench, JudgeBench, and FollowBenchEval, demonstrating robust generalization across verifiable and subjective tasks. These improvements are not just marginal but significant, considering the limited training data used in J1 compared to the expansive datasets in other models. Several Key Takeaways from the Research on J1: J1 is trained using 22,000 synthetic preference pairs, including 17K from WildChat and 5K from MATH tasks. The training uses GRPO, which streamlines RL by avoiding the need for separate critic models. It introduces position-agnostic learning, reducing position bias through consistency-based rewards. Two main model variants, J1-Llama-8B and J1-Llama-70B, were trained on modest data but outperformed large-scale models. J1-Llama-70B scored 69.6% on PPE, exceeding DeepSeek-GRM-27Band EvalPlanner-Llama-70B. Supports multiple judgment formats: pairwise with verdicts, pairwise with scores, and pointwise scores. Surpasses models distilled from DeepSeek-R1 and OpenAI’s o1-mini on several tasks. Demonstrates that reasoning quality, not just dataset size, is critical for accurate judgments. J1’s framework makes it a generalist judge applicable to verifiable and non-verifiable tasks. In conclusion, the J1 approach fundamentally redefines how judgment models are trained and evaluated. Synthetic data and reinforcement learning bypass the traditional need for costly annotations while promoting fair, logical, and consistent evaluations. This work illustrates that reasoning-driven judging can outperform larger models that rely heavily on data volume and static alignment techniques. It also validates the notion that judgment models should be thinkers first, and scorers second. With performance that rivals and often surpasses state-of-the-art systems, J1 sets a new benchmark in training LLM-as-a-Judge systems. Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. Asif RazzaqWebsite |  + postsBioAsif Razzaq is the CEO of Marktechpost Media Inc.. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts of over 2 million monthly views, illustrating its popularity among audiences.Asif Razzaqhttps://www.marktechpost.com/author/6flvq/Sampling Without Data is Now Scalable: Meta AI Releases Adjoint Sampling for Reward-Driven Generative ModelingAsif Razzaqhttps://www.marktechpost.com/author/6flvq/Google AI Releases MedGemma: An Open Suite of Models Trained for Performance on Medical Text and Image ComprehensionAsif Razzaqhttps://www.marktechpost.com/author/6flvq/NVIDIA Releases Cosmos-Reason1: A Suite of AI Models Advancing Physical Common Sense and Embodied Reasoning in Real-World EnvironmentsAsif Razzaqhttps://www.marktechpost.com/author/6flvq/A Step-by-Step Coding Guide to Efficiently Fine-Tune Qwen3-14B Using Unsloth AI on Google Colab with Mixed Datasets and LoRA Optimization #meta #researchers #introduced #reinforcement #learning
    WWW.MARKTECHPOST.COM
    Meta Researchers Introduced J1: A Reinforcement Learning Framework That Trains Language Models to Judge With Reasoned Consistency and Minimal Data
    Large language models are now being used for evaluation and judgment tasks, extending beyond their traditional role of text generation. This has led to “LLM-as-a-Judge,” where models assess outputs from other language models. Such evaluations are essential in reinforcement learning pipelines, benchmark testing, and system alignment. These judge models rely on internal chain-of-thought reasoning, mirroring human judgment processes. Unlike conventional reward models that provide direct scores, these models simulate thoughtful evaluation, making them better suited for complex tasks such as math problem-solving, ethical reasoning, and user intent interpretation. Their ability to interpret and validate responses across languages and domains enhances automation and scalability in language model development. However, current AI judgment systems face issues with inconsistency and shallow reasoning. Many rely on basic metrics or static annotations, which are inadequate for evaluating subjective or open-ended prompts. A common problem is position bias, where the order of answers affects the final decision, compromising fairness. Also, collecting human-annotated data at scale is costly and time-consuming, limiting the generalizability of these models. Several existing approaches have addressed these challenges, but with limited success. Systems like EvalPlanner and DeepSeek-GRM rely on human-labeled data or rigid training schemes, which limit adaptability across task types. Others, like DeepSeek-R1, depend on distillation from large models but perform poorly on ambiguous prompts. Static datasets and offline tuning strategies hinder dynamic reasoning, while newer methods using score formatting or structured prompts have shown minimal accuracy improvements. Despite larger datasets and models, performance gains in traditional systems have stalled. Researchers from Meta’s GenAI and FAIR teams introduced J1 to address the above limitations. J1 trains judgment models through a reinforcement learning-based framework, making them capable of learning through verifiable reward signals. The team used synthetic data to create high-quality and low-quality responses to a prompt, transforming subjective tasks into verifiable pairwise judgments. This synthetic dataset included 22,000 preference pairs, split between 17,000 prompts from the WildChat corpus and 5,000 mathematical queries. These were used to train two versions of J1: J1-Llama-8B and J1-Llama-70B, initialized from the Llama-3.1-8B-Instruct and Llama-3.3-70B-Instruct base models, respectively. The models were trained using Group Relative Policy Optimization (GRPO), a reinforcement algorithm that eliminates the need for critic models and accelerates convergence. At the training strategy’s core is position-agnostic learning, where both (x, a, b) and (x, b, a) input formats are used in training to prevent position bias. Also, consistency-based rewards are applied only when the model delivers correct verdicts across both answer orderings. This structure allows the judge to be fair and reliable regardless of prompt or answer order. The training framework supports multiple variations: models can output final verdicts, numeric scores for each answer, or both. A pointwise judging variant is included, which evaluates single responses using scores from 0 to 10. These formats make J1 a versatile and generalizable system capable of judging various tasks. The results obtained using the J1 models reveal substantial performance improvements over existing systems. On the widely used Preference Proxy Evaluations (PPE) benchmark, J1-Llama-70B achieved an overall accuracy of 69.6%, outperforming models trained with over ten times more data. In contrast, models like DeepSeek-GRM-27B and EvalPlanner-Llama-70B scored 67.2% and 65.6%, respectively. Even the smaller J1-Llama-8B model exceeded baseline systems like EvalPlanner-Llama-8B, scoring 62.2% versus 55.5%. J1 also showed top-tier performance on other critical benchmarks such as RewardBench, RM-Bench, JudgeBench, and FollowBenchEval, demonstrating robust generalization across verifiable and subjective tasks. These improvements are not just marginal but significant, considering the limited training data used in J1 compared to the expansive datasets in other models. Several Key Takeaways from the Research on J1: J1 is trained using 22,000 synthetic preference pairs, including 17K from WildChat and 5K from MATH tasks. The training uses GRPO, which streamlines RL by avoiding the need for separate critic models. It introduces position-agnostic learning, reducing position bias through consistency-based rewards. Two main model variants, J1-Llama-8B and J1-Llama-70B, were trained on modest data but outperformed large-scale models. J1-Llama-70B scored 69.6% on PPE, exceeding DeepSeek-GRM-27B (67.2%) and EvalPlanner-Llama-70B (65.6%). Supports multiple judgment formats: pairwise with verdicts, pairwise with scores, and pointwise scores. Surpasses models distilled from DeepSeek-R1 and OpenAI’s o1-mini on several tasks. Demonstrates that reasoning quality, not just dataset size, is critical for accurate judgments. J1’s framework makes it a generalist judge applicable to verifiable and non-verifiable tasks. In conclusion, the J1 approach fundamentally redefines how judgment models are trained and evaluated. Synthetic data and reinforcement learning bypass the traditional need for costly annotations while promoting fair, logical, and consistent evaluations. This work illustrates that reasoning-driven judging can outperform larger models that rely heavily on data volume and static alignment techniques. It also validates the notion that judgment models should be thinkers first, and scorers second. With performance that rivals and often surpasses state-of-the-art systems, J1 sets a new benchmark in training LLM-as-a-Judge systems. Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. Asif RazzaqWebsite |  + postsBioAsif Razzaq is the CEO of Marktechpost Media Inc.. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts of over 2 million monthly views, illustrating its popularity among audiences.Asif Razzaqhttps://www.marktechpost.com/author/6flvq/Sampling Without Data is Now Scalable: Meta AI Releases Adjoint Sampling for Reward-Driven Generative ModelingAsif Razzaqhttps://www.marktechpost.com/author/6flvq/Google AI Releases MedGemma: An Open Suite of Models Trained for Performance on Medical Text and Image ComprehensionAsif Razzaqhttps://www.marktechpost.com/author/6flvq/NVIDIA Releases Cosmos-Reason1: A Suite of AI Models Advancing Physical Common Sense and Embodied Reasoning in Real-World EnvironmentsAsif Razzaqhttps://www.marktechpost.com/author/6flvq/A Step-by-Step Coding Guide to Efficiently Fine-Tune Qwen3-14B Using Unsloth AI on Google Colab with Mixed Datasets and LoRA Optimization
    0 Kommentare 0 Anteile
  • Agentic AI in Financial Services: IBM’s Whitepaper Maps Opportunities, Risks, and Responsible Integration

    As autonomous AI agents move from theory into implementation, their impact on the financial services sector is becoming tangible. A recent whitepaper from IBM Consulting, titled “Agentic AI in Financial Services: Opportunities, Risks, and Responsible Implementation”, outlines how these AI systems—designed for autonomous decision-making and long-term planning—can fundamentally reshape how financial institutions operate. The paper presents a balanced framework that identifies where Agentic AI can add value, the risks it introduces, and how institutions can implement these systems responsibly.
    Understanding Agentic AI
    AI agents, in this context, are software entities that interact with their environments to accomplish tasks with a high degree of autonomy. Unlike traditional automation or even LLM-powered chatbots, Agentic AI incorporates planning, memory, and reasoning to execute dynamic tasks across systems. IBM categorizes them into Principal, Service, and Task agents, which collaborate in orchestrated systems. These systems enable the agents to autonomously process information, select tools, and interact with human users or enterprise systems in a closed loop of goal pursuit and reflection.
    The whitepaper describes the evolution from rule-based automation to multi-agent orchestration, emphasizing how LLMs now serve as the reasoning engine that drives agent behavior in real-time. Crucially, these agents can adapt to evolving conditions and handle complex, cross-domain tasks, making them ideal for the intricacies of financial services.
    Key Opportunities in Finance
    IBM identifies three primary use case patterns where Agentic AI can unlock significant value:

    Customer Engagement & Personalization
    Agents can streamline onboarding, personalize services through real-time behavioral data, and drive KYC/AML processes using tiered agent hierarchies that reduce manual oversight.Operational Excellence & Governance
    Agents improve internal efficiencies by automating risk management, compliance verification, and anomaly detection, while maintaining auditability and traceability.Technology & Software Development
    They support IT teams with automated testing, predictive maintenance, and infrastructure optimization—redefining DevOps through dynamic, self-improving workflows.
    These systems promise to replace fragmented interfaces and human handoffs with integrated, persona-driven agent experiences grounded in high-quality, governed data products.
    Risk Landscape and Mitigation Strategies
    Autonomy in AI brings unique risks. The IBM paper categorizes them under the system’s core components—goal misalignment, tool misuse, and dynamic deception being among the most critical. For instance, a wealth management agent might misinterpret a client’s risk appetite due to goal drift, or bypass controls by chaining permissible actions in unintended ways.
    Key mitigation strategies include:

    Goal Guardrails: Explicitly defined objectives, real-time monitoring, and value alignment feedback loops.
    Access Controls: Least-privilege design for tool/API access, combined with dynamic rate-limiting and auditing.
    Persona Calibration: Regularly reviewing agents’ behavior to avoid biased or unethical actions.

    The whitepaper also emphasizes agent persistence and system drift as long-term governance challenges. Persistent memory, while enabling learning, can cause agents to act on outdated assumptions. IBM proposes memory reset protocols and periodic recalibrations to counteract drift and ensure continued alignment with organizational values.
    Regulatory Readiness and Ethical Design
    IBM outlines regulatory developments in jurisdictions like the EU and Australia, where agentic systems are increasingly considered “high-risk.” These systems must comply with emerging mandates for transparency, explainability, and continuous human oversight. In the EU’s AI Act, for example, agents influencing access to financial services may fall under stricter obligations due to their autonomous and adaptive behavior.
    The paper recommends proactive alignment with ethical AI principles even in the absence of regulation—asking not just can we, but should we. This includes auditing agents for deceptive behavior, embedding human-in-the-loop structures, and maintaining transparency through natural language decision narratives and visualized reasoning paths.
    Conclusion
    Agentic AI stands at the frontier of enterprise automation. For financial services firms, the promise lies in enhanced personalization, operational agility, and AI-driven governance. Yet these benefits are closely linked to how responsibly these systems are designed and deployed. IBM’s whitepaper serves as a practical guide—advocating for a phased, risk-aware adoption strategy that includes governance frameworks, codified controls, and cross-functional accountability.

    Check out the White Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit.
    Mohammad AsjadAsjad is an intern consultant at Marktechpost. He is persuing B.Tech in mechanical engineering at the Indian Institute of Technology, Kharagpur. Asjad is a Machine learning and deep learning enthusiast who is always researching the applications of machine learning in healthcare.Mohammad Asjadhttps://www.marktechpost.com/author/mohammad_asjad/Critical Security Vulnerabilities in the Model Context Protocol: How Malicious Tools and Deceptive Contexts Exploit AI AgentsMohammad Asjadhttps://www.marktechpost.com/author/mohammad_asjad/Stability AI Introduces Adversarial Relativistic-ContrastivePost-Training and Stable Audio Open Small: A Distillation-Free Breakthrough for Fast, Diverse, and Efficient Text-to-Audio Generation Across DevicesMohammad Asjadhttps://www.marktechpost.com/author/mohammad_asjad/Meta AI Introduces CATransformers: A Carbon-Aware Machine Learning Framework to Co-Optimize AI Models and Hardware for Sustainable Edge DeploymentMohammad Asjadhttps://www.marktechpost.com/author/mohammad_asjad/Enterprise AI Without GPU Burn: Salesforce’s xGen-small Optimizes for Context, Cost, and Privacy

    Build GenAI you can trust. ⭐️ Parlant is your open-source engine for controlled, compliant, and purposeful AI conversations — Star Parlant on GitHub!
    #agentic #financial #services #ibms #whitepaper
    Agentic AI in Financial Services: IBM’s Whitepaper Maps Opportunities, Risks, and Responsible Integration
    As autonomous AI agents move from theory into implementation, their impact on the financial services sector is becoming tangible. A recent whitepaper from IBM Consulting, titled “Agentic AI in Financial Services: Opportunities, Risks, and Responsible Implementation”, outlines how these AI systems—designed for autonomous decision-making and long-term planning—can fundamentally reshape how financial institutions operate. The paper presents a balanced framework that identifies where Agentic AI can add value, the risks it introduces, and how institutions can implement these systems responsibly. Understanding Agentic AI AI agents, in this context, are software entities that interact with their environments to accomplish tasks with a high degree of autonomy. Unlike traditional automation or even LLM-powered chatbots, Agentic AI incorporates planning, memory, and reasoning to execute dynamic tasks across systems. IBM categorizes them into Principal, Service, and Task agents, which collaborate in orchestrated systems. These systems enable the agents to autonomously process information, select tools, and interact with human users or enterprise systems in a closed loop of goal pursuit and reflection. The whitepaper describes the evolution from rule-based automation to multi-agent orchestration, emphasizing how LLMs now serve as the reasoning engine that drives agent behavior in real-time. Crucially, these agents can adapt to evolving conditions and handle complex, cross-domain tasks, making them ideal for the intricacies of financial services. Key Opportunities in Finance IBM identifies three primary use case patterns where Agentic AI can unlock significant value: Customer Engagement & Personalization Agents can streamline onboarding, personalize services through real-time behavioral data, and drive KYC/AML processes using tiered agent hierarchies that reduce manual oversight.Operational Excellence & Governance Agents improve internal efficiencies by automating risk management, compliance verification, and anomaly detection, while maintaining auditability and traceability.Technology & Software Development They support IT teams with automated testing, predictive maintenance, and infrastructure optimization—redefining DevOps through dynamic, self-improving workflows. These systems promise to replace fragmented interfaces and human handoffs with integrated, persona-driven agent experiences grounded in high-quality, governed data products. Risk Landscape and Mitigation Strategies Autonomy in AI brings unique risks. The IBM paper categorizes them under the system’s core components—goal misalignment, tool misuse, and dynamic deception being among the most critical. For instance, a wealth management agent might misinterpret a client’s risk appetite due to goal drift, or bypass controls by chaining permissible actions in unintended ways. Key mitigation strategies include: Goal Guardrails: Explicitly defined objectives, real-time monitoring, and value alignment feedback loops. Access Controls: Least-privilege design for tool/API access, combined with dynamic rate-limiting and auditing. Persona Calibration: Regularly reviewing agents’ behavior to avoid biased or unethical actions. The whitepaper also emphasizes agent persistence and system drift as long-term governance challenges. Persistent memory, while enabling learning, can cause agents to act on outdated assumptions. IBM proposes memory reset protocols and periodic recalibrations to counteract drift and ensure continued alignment with organizational values. Regulatory Readiness and Ethical Design IBM outlines regulatory developments in jurisdictions like the EU and Australia, where agentic systems are increasingly considered “high-risk.” These systems must comply with emerging mandates for transparency, explainability, and continuous human oversight. In the EU’s AI Act, for example, agents influencing access to financial services may fall under stricter obligations due to their autonomous and adaptive behavior. The paper recommends proactive alignment with ethical AI principles even in the absence of regulation—asking not just can we, but should we. This includes auditing agents for deceptive behavior, embedding human-in-the-loop structures, and maintaining transparency through natural language decision narratives and visualized reasoning paths. Conclusion Agentic AI stands at the frontier of enterprise automation. For financial services firms, the promise lies in enhanced personalization, operational agility, and AI-driven governance. Yet these benefits are closely linked to how responsibly these systems are designed and deployed. IBM’s whitepaper serves as a practical guide—advocating for a phased, risk-aware adoption strategy that includes governance frameworks, codified controls, and cross-functional accountability. Check out the White Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit. Mohammad AsjadAsjad is an intern consultant at Marktechpost. He is persuing B.Tech in mechanical engineering at the Indian Institute of Technology, Kharagpur. Asjad is a Machine learning and deep learning enthusiast who is always researching the applications of machine learning in healthcare.Mohammad Asjadhttps://www.marktechpost.com/author/mohammad_asjad/Critical Security Vulnerabilities in the Model Context Protocol: How Malicious Tools and Deceptive Contexts Exploit AI AgentsMohammad Asjadhttps://www.marktechpost.com/author/mohammad_asjad/Stability AI Introduces Adversarial Relativistic-ContrastivePost-Training and Stable Audio Open Small: A Distillation-Free Breakthrough for Fast, Diverse, and Efficient Text-to-Audio Generation Across DevicesMohammad Asjadhttps://www.marktechpost.com/author/mohammad_asjad/Meta AI Introduces CATransformers: A Carbon-Aware Machine Learning Framework to Co-Optimize AI Models and Hardware for Sustainable Edge DeploymentMohammad Asjadhttps://www.marktechpost.com/author/mohammad_asjad/Enterprise AI Without GPU Burn: Salesforce’s xGen-small Optimizes for Context, Cost, and Privacy 🚨 Build GenAI you can trust. ⭐️ Parlant is your open-source engine for controlled, compliant, and purposeful AI conversations — Star Parlant on GitHub! #agentic #financial #services #ibms #whitepaper
    WWW.MARKTECHPOST.COM
    Agentic AI in Financial Services: IBM’s Whitepaper Maps Opportunities, Risks, and Responsible Integration
    As autonomous AI agents move from theory into implementation, their impact on the financial services sector is becoming tangible. A recent whitepaper from IBM Consulting, titled “Agentic AI in Financial Services: Opportunities, Risks, and Responsible Implementation”, outlines how these AI systems—designed for autonomous decision-making and long-term planning—can fundamentally reshape how financial institutions operate. The paper presents a balanced framework that identifies where Agentic AI can add value, the risks it introduces, and how institutions can implement these systems responsibly. Understanding Agentic AI AI agents, in this context, are software entities that interact with their environments to accomplish tasks with a high degree of autonomy. Unlike traditional automation or even LLM-powered chatbots, Agentic AI incorporates planning, memory, and reasoning to execute dynamic tasks across systems. IBM categorizes them into Principal, Service, and Task agents, which collaborate in orchestrated systems. These systems enable the agents to autonomously process information, select tools, and interact with human users or enterprise systems in a closed loop of goal pursuit and reflection. The whitepaper describes the evolution from rule-based automation to multi-agent orchestration, emphasizing how LLMs now serve as the reasoning engine that drives agent behavior in real-time. Crucially, these agents can adapt to evolving conditions and handle complex, cross-domain tasks, making them ideal for the intricacies of financial services. Key Opportunities in Finance IBM identifies three primary use case patterns where Agentic AI can unlock significant value: Customer Engagement & Personalization Agents can streamline onboarding, personalize services through real-time behavioral data, and drive KYC/AML processes using tiered agent hierarchies that reduce manual oversight.Operational Excellence & Governance Agents improve internal efficiencies by automating risk management, compliance verification, and anomaly detection, while maintaining auditability and traceability.Technology & Software Development They support IT teams with automated testing, predictive maintenance, and infrastructure optimization—redefining DevOps through dynamic, self-improving workflows. These systems promise to replace fragmented interfaces and human handoffs with integrated, persona-driven agent experiences grounded in high-quality, governed data products. Risk Landscape and Mitigation Strategies Autonomy in AI brings unique risks. The IBM paper categorizes them under the system’s core components—goal misalignment, tool misuse, and dynamic deception being among the most critical. For instance, a wealth management agent might misinterpret a client’s risk appetite due to goal drift, or bypass controls by chaining permissible actions in unintended ways. Key mitigation strategies include: Goal Guardrails: Explicitly defined objectives, real-time monitoring, and value alignment feedback loops. Access Controls: Least-privilege design for tool/API access, combined with dynamic rate-limiting and auditing. Persona Calibration: Regularly reviewing agents’ behavior to avoid biased or unethical actions. The whitepaper also emphasizes agent persistence and system drift as long-term governance challenges. Persistent memory, while enabling learning, can cause agents to act on outdated assumptions. IBM proposes memory reset protocols and periodic recalibrations to counteract drift and ensure continued alignment with organizational values. Regulatory Readiness and Ethical Design IBM outlines regulatory developments in jurisdictions like the EU and Australia, where agentic systems are increasingly considered “high-risk.” These systems must comply with emerging mandates for transparency, explainability, and continuous human oversight. In the EU’s AI Act, for example, agents influencing access to financial services may fall under stricter obligations due to their autonomous and adaptive behavior. The paper recommends proactive alignment with ethical AI principles even in the absence of regulation—asking not just can we, but should we. This includes auditing agents for deceptive behavior, embedding human-in-the-loop structures, and maintaining transparency through natural language decision narratives and visualized reasoning paths. Conclusion Agentic AI stands at the frontier of enterprise automation. For financial services firms, the promise lies in enhanced personalization, operational agility, and AI-driven governance. Yet these benefits are closely linked to how responsibly these systems are designed and deployed. IBM’s whitepaper serves as a practical guide—advocating for a phased, risk-aware adoption strategy that includes governance frameworks, codified controls, and cross-functional accountability. Check out the White Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit. Mohammad AsjadAsjad is an intern consultant at Marktechpost. He is persuing B.Tech in mechanical engineering at the Indian Institute of Technology, Kharagpur. Asjad is a Machine learning and deep learning enthusiast who is always researching the applications of machine learning in healthcare.Mohammad Asjadhttps://www.marktechpost.com/author/mohammad_asjad/Critical Security Vulnerabilities in the Model Context Protocol (MCP): How Malicious Tools and Deceptive Contexts Exploit AI AgentsMohammad Asjadhttps://www.marktechpost.com/author/mohammad_asjad/Stability AI Introduces Adversarial Relativistic-Contrastive (ARC) Post-Training and Stable Audio Open Small: A Distillation-Free Breakthrough for Fast, Diverse, and Efficient Text-to-Audio Generation Across DevicesMohammad Asjadhttps://www.marktechpost.com/author/mohammad_asjad/Meta AI Introduces CATransformers: A Carbon-Aware Machine Learning Framework to Co-Optimize AI Models and Hardware for Sustainable Edge DeploymentMohammad Asjadhttps://www.marktechpost.com/author/mohammad_asjad/Enterprise AI Without GPU Burn: Salesforce’s xGen-small Optimizes for Context, Cost, and Privacy 🚨 Build GenAI you can trust. ⭐️ Parlant is your open-source engine for controlled, compliant, and purposeful AI conversations — Star Parlant on GitHub! (Promoted)
    0 Kommentare 0 Anteile
  • Critical Security Vulnerabilities in the Model Context Protocol (MCP): How Malicious Tools and Deceptive Contexts Exploit AI Agents

    The Model Context Protocolrepresents a powerful paradigm shift in how large language models interact with tools, services, and external data sources. Designed to enable dynamic tool invocation, the MCP facilitates a standardized method for describing tool metadata, allowing models to select and call functions intelligently. However, as with any emerging framework that enhances model autonomy, MCP introduces significant security concerns. Among these are five notable vulnerabilities: Tool Poisoning, Rug-Pull Updates, Retrieval-Agent Deception, Server Spoofing, and Cross-Server Shadowing. Each of these weaknesses exploits a different layer of the MCP infrastructure and reveals potential threats that could compromise user safety and data integrity.

    Tool Poisoning is one of the most insidious vulnerabilities within the MCP framework. At its core, this attack involves embedding malicious behavior into a harmless tool. In MCP, where tools are advertised with brief descriptions and input/output schemas, a bad actor can craft a tool with a name and summary that seem benign, such as a calculator or formatter. However, once invoked, the tool might perform unauthorized actions such as deleting files, exfiltrating data, or issuing hidden commands. Since the AI model processes detailed tool specifications that may not be visible to the end-user, it could unknowingly execute harmful functions, believing it operates within the intended boundaries. This discrepancy between surface-level appearance and hidden functionality makes tool poisoning particularly dangerous.
    Rug-Pull Updates
    Closely related to tool poisoning is the concept of Rug-Pull Updates. This vulnerability centers on the temporal trust dynamics in MCP-enabled environments. Initially, a tool may behave exactly as expected, performing useful, legitimate operations. Over time, the developer of the tool, or someone who gains control of its source, may issue an update that introduces malicious behavior. This change might not trigger immediate alerts if users or agents rely on automated update mechanisms or do not rigorously re-evaluate tools after each revision. The AI model, still operating under the assumption that the tool is trustworthy, may call it for sensitive operations, unwittingly initiating data leaks, file corruption, or other undesirable outcomes. The danger of rug-pull updates lies in the deferred onset of risk: by the time the attack is active, the model has often already been conditioned to trust the tool implicitly.
    Retrieval-Agent Deception
    Retrieval-Agent Deception, or RADE, exposes a more indirect but equally potent vulnerability. In many MCP use cases, models are equipped with retrieval tools to query knowledge bases, documents, and other external data to enhance responses. RADE exploits this feature by placing malicious MCP command patterns into publicly accessible documents or datasets. When a retrieval tool ingests this poisoned data, the AI model may interpret embedded instructions as valid tool-calling commands. For instance, a document that explains a technical topic might include hidden prompts that direct the model to call a tool in an unintended manner or supply dangerous parameters. The model, unaware that it has been manipulated, executes these instructions, effectively turning retrieved data into a covert command channel. This blurring of data and executable intent threatens the integrity of context-aware agents that rely heavily on retrieval-augmented interactions.
    Server Spoofing
    Server Spoofing constitutes another sophisticated threat in MCP ecosystems, particularly in distributed environments. Because MCP enables models to interact with remote servers that expose various tools, each server typically advertises its tools via a manifest that includes names, descriptions, and schemas. An attacker can create a rogue server that mimics a legitimate one, copying its name and tool list to deceive models and users alike. When the AI agent connects to this spoofed server, it may receive altered tool metadata or execute tool calls with entirely different backend implementations than expected. From the model’s perspective, the server seems legitimate, and unless there is strong authentication or identity verification, it proceeds to operate under false assumptions. The consequences of server spoofing include credential theft, data manipulation, or unauthorized command execution.
    Cross-Server Shadowing
    Finally, Cross-Server Shadowing reflects the vulnerability in multi-server MCP contexts where several servers contribute tools to a shared model session. In such setups, a malicious server can manipulate the model’s behavior by injecting context that interferes with or redefines how tools from another server are perceived or used. This can occur through conflicting tool definitions, misleading metadata, or injected guidance that distorts the model’s tool selection logic. For example, if one server redefines a common tool name or provides conflicting instructions, it can effectively shadow or override the legitimate functionality offered by another server. The model, attempting to reconcile these inputs, may execute the wrong version of a tool or follow harmful instructions. Cross-server shadowing undermines the modularity of the MCP design by allowing one bad actor to corrupt interactions that span multiple otherwise secure sources.
    In conclusion, these five vulnerabilities expose critical security weaknesses in the Model Context Protocol’s current operational landscape. While MCP introduces exciting possibilities for agentic reasoning and dynamic task completion, it also opens the door to various behaviors that exploit model trust, contextual ambiguity, and tool discovery mechanisms. As the MCP standard evolves and gains broader adoption, addressing these threats will be essential to maintaining user trust and ensuring the safe deployment of AI agents in real-world environments.
    Sources



    Mohammad AsjadAsjad is an intern consultant at Marktechpost. He is persuing B.Tech in mechanical engineering at the Indian Institute of Technology, Kharagpur. Asjad is a Machine learning and deep learning enthusiast who is always researching the applications of machine learning in healthcare.Mohammad Asjadhttps://www.marktechpost.com/author/mohammad_asjad/Stability AI Introduces Adversarial Relativistic-ContrastivePost-Training and Stable Audio Open Small: A Distillation-Free Breakthrough for Fast, Diverse, and Efficient Text-to-Audio Generation Across DevicesMohammad Asjadhttps://www.marktechpost.com/author/mohammad_asjad/Meta AI Introduces CATransformers: A Carbon-Aware Machine Learning Framework to Co-Optimize AI Models and Hardware for Sustainable Edge DeploymentMohammad Asjadhttps://www.marktechpost.com/author/mohammad_asjad/Enterprise AI Without GPU Burn: Salesforce’s xGen-small Optimizes for Context, Cost, and PrivacyMohammad Asjadhttps://www.marktechpost.com/author/mohammad_asjad/ServiceNow AI Released Apriel-Nemotron-15b-Thinker: A Compact Yet Powerful Reasoning Model Optimized for Enterprise-Scale Deployment and Efficiency

    Build GenAI you can trust. ⭐️ Parlant is your open-source engine for controlled, compliant, and purposeful AI conversations — Star Parlant on GitHub!
    #critical #security #vulnerabilities #model #context
    Critical Security Vulnerabilities in the Model Context Protocol (MCP): How Malicious Tools and Deceptive Contexts Exploit AI Agents
    The Model Context Protocolrepresents a powerful paradigm shift in how large language models interact with tools, services, and external data sources. Designed to enable dynamic tool invocation, the MCP facilitates a standardized method for describing tool metadata, allowing models to select and call functions intelligently. However, as with any emerging framework that enhances model autonomy, MCP introduces significant security concerns. Among these are five notable vulnerabilities: Tool Poisoning, Rug-Pull Updates, Retrieval-Agent Deception, Server Spoofing, and Cross-Server Shadowing. Each of these weaknesses exploits a different layer of the MCP infrastructure and reveals potential threats that could compromise user safety and data integrity. Tool Poisoning is one of the most insidious vulnerabilities within the MCP framework. At its core, this attack involves embedding malicious behavior into a harmless tool. In MCP, where tools are advertised with brief descriptions and input/output schemas, a bad actor can craft a tool with a name and summary that seem benign, such as a calculator or formatter. However, once invoked, the tool might perform unauthorized actions such as deleting files, exfiltrating data, or issuing hidden commands. Since the AI model processes detailed tool specifications that may not be visible to the end-user, it could unknowingly execute harmful functions, believing it operates within the intended boundaries. This discrepancy between surface-level appearance and hidden functionality makes tool poisoning particularly dangerous. Rug-Pull Updates Closely related to tool poisoning is the concept of Rug-Pull Updates. This vulnerability centers on the temporal trust dynamics in MCP-enabled environments. Initially, a tool may behave exactly as expected, performing useful, legitimate operations. Over time, the developer of the tool, or someone who gains control of its source, may issue an update that introduces malicious behavior. This change might not trigger immediate alerts if users or agents rely on automated update mechanisms or do not rigorously re-evaluate tools after each revision. The AI model, still operating under the assumption that the tool is trustworthy, may call it for sensitive operations, unwittingly initiating data leaks, file corruption, or other undesirable outcomes. The danger of rug-pull updates lies in the deferred onset of risk: by the time the attack is active, the model has often already been conditioned to trust the tool implicitly. Retrieval-Agent Deception Retrieval-Agent Deception, or RADE, exposes a more indirect but equally potent vulnerability. In many MCP use cases, models are equipped with retrieval tools to query knowledge bases, documents, and other external data to enhance responses. RADE exploits this feature by placing malicious MCP command patterns into publicly accessible documents or datasets. When a retrieval tool ingests this poisoned data, the AI model may interpret embedded instructions as valid tool-calling commands. For instance, a document that explains a technical topic might include hidden prompts that direct the model to call a tool in an unintended manner or supply dangerous parameters. The model, unaware that it has been manipulated, executes these instructions, effectively turning retrieved data into a covert command channel. This blurring of data and executable intent threatens the integrity of context-aware agents that rely heavily on retrieval-augmented interactions. Server Spoofing Server Spoofing constitutes another sophisticated threat in MCP ecosystems, particularly in distributed environments. Because MCP enables models to interact with remote servers that expose various tools, each server typically advertises its tools via a manifest that includes names, descriptions, and schemas. An attacker can create a rogue server that mimics a legitimate one, copying its name and tool list to deceive models and users alike. When the AI agent connects to this spoofed server, it may receive altered tool metadata or execute tool calls with entirely different backend implementations than expected. From the model’s perspective, the server seems legitimate, and unless there is strong authentication or identity verification, it proceeds to operate under false assumptions. The consequences of server spoofing include credential theft, data manipulation, or unauthorized command execution. Cross-Server Shadowing Finally, Cross-Server Shadowing reflects the vulnerability in multi-server MCP contexts where several servers contribute tools to a shared model session. In such setups, a malicious server can manipulate the model’s behavior by injecting context that interferes with or redefines how tools from another server are perceived or used. This can occur through conflicting tool definitions, misleading metadata, or injected guidance that distorts the model’s tool selection logic. For example, if one server redefines a common tool name or provides conflicting instructions, it can effectively shadow or override the legitimate functionality offered by another server. The model, attempting to reconcile these inputs, may execute the wrong version of a tool or follow harmful instructions. Cross-server shadowing undermines the modularity of the MCP design by allowing one bad actor to corrupt interactions that span multiple otherwise secure sources. In conclusion, these five vulnerabilities expose critical security weaknesses in the Model Context Protocol’s current operational landscape. While MCP introduces exciting possibilities for agentic reasoning and dynamic task completion, it also opens the door to various behaviors that exploit model trust, contextual ambiguity, and tool discovery mechanisms. As the MCP standard evolves and gains broader adoption, addressing these threats will be essential to maintaining user trust and ensuring the safe deployment of AI agents in real-world environments. Sources Mohammad AsjadAsjad is an intern consultant at Marktechpost. He is persuing B.Tech in mechanical engineering at the Indian Institute of Technology, Kharagpur. Asjad is a Machine learning and deep learning enthusiast who is always researching the applications of machine learning in healthcare.Mohammad Asjadhttps://www.marktechpost.com/author/mohammad_asjad/Stability AI Introduces Adversarial Relativistic-ContrastivePost-Training and Stable Audio Open Small: A Distillation-Free Breakthrough for Fast, Diverse, and Efficient Text-to-Audio Generation Across DevicesMohammad Asjadhttps://www.marktechpost.com/author/mohammad_asjad/Meta AI Introduces CATransformers: A Carbon-Aware Machine Learning Framework to Co-Optimize AI Models and Hardware for Sustainable Edge DeploymentMohammad Asjadhttps://www.marktechpost.com/author/mohammad_asjad/Enterprise AI Without GPU Burn: Salesforce’s xGen-small Optimizes for Context, Cost, and PrivacyMohammad Asjadhttps://www.marktechpost.com/author/mohammad_asjad/ServiceNow AI Released Apriel-Nemotron-15b-Thinker: A Compact Yet Powerful Reasoning Model Optimized for Enterprise-Scale Deployment and Efficiency 🚨 Build GenAI you can trust. ⭐️ Parlant is your open-source engine for controlled, compliant, and purposeful AI conversations — Star Parlant on GitHub! #critical #security #vulnerabilities #model #context
    WWW.MARKTECHPOST.COM
    Critical Security Vulnerabilities in the Model Context Protocol (MCP): How Malicious Tools and Deceptive Contexts Exploit AI Agents
    The Model Context Protocol (MCP) represents a powerful paradigm shift in how large language models interact with tools, services, and external data sources. Designed to enable dynamic tool invocation, the MCP facilitates a standardized method for describing tool metadata, allowing models to select and call functions intelligently. However, as with any emerging framework that enhances model autonomy, MCP introduces significant security concerns. Among these are five notable vulnerabilities: Tool Poisoning, Rug-Pull Updates, Retrieval-Agent Deception (RADE), Server Spoofing, and Cross-Server Shadowing. Each of these weaknesses exploits a different layer of the MCP infrastructure and reveals potential threats that could compromise user safety and data integrity. Tool Poisoning is one of the most insidious vulnerabilities within the MCP framework. At its core, this attack involves embedding malicious behavior into a harmless tool. In MCP, where tools are advertised with brief descriptions and input/output schemas, a bad actor can craft a tool with a name and summary that seem benign, such as a calculator or formatter. However, once invoked, the tool might perform unauthorized actions such as deleting files, exfiltrating data, or issuing hidden commands. Since the AI model processes detailed tool specifications that may not be visible to the end-user, it could unknowingly execute harmful functions, believing it operates within the intended boundaries. This discrepancy between surface-level appearance and hidden functionality makes tool poisoning particularly dangerous. Rug-Pull Updates Closely related to tool poisoning is the concept of Rug-Pull Updates. This vulnerability centers on the temporal trust dynamics in MCP-enabled environments. Initially, a tool may behave exactly as expected, performing useful, legitimate operations. Over time, the developer of the tool, or someone who gains control of its source, may issue an update that introduces malicious behavior. This change might not trigger immediate alerts if users or agents rely on automated update mechanisms or do not rigorously re-evaluate tools after each revision. The AI model, still operating under the assumption that the tool is trustworthy, may call it for sensitive operations, unwittingly initiating data leaks, file corruption, or other undesirable outcomes. The danger of rug-pull updates lies in the deferred onset of risk: by the time the attack is active, the model has often already been conditioned to trust the tool implicitly. Retrieval-Agent Deception Retrieval-Agent Deception, or RADE, exposes a more indirect but equally potent vulnerability. In many MCP use cases, models are equipped with retrieval tools to query knowledge bases, documents, and other external data to enhance responses. RADE exploits this feature by placing malicious MCP command patterns into publicly accessible documents or datasets. When a retrieval tool ingests this poisoned data, the AI model may interpret embedded instructions as valid tool-calling commands. For instance, a document that explains a technical topic might include hidden prompts that direct the model to call a tool in an unintended manner or supply dangerous parameters. The model, unaware that it has been manipulated, executes these instructions, effectively turning retrieved data into a covert command channel. This blurring of data and executable intent threatens the integrity of context-aware agents that rely heavily on retrieval-augmented interactions. Server Spoofing Server Spoofing constitutes another sophisticated threat in MCP ecosystems, particularly in distributed environments. Because MCP enables models to interact with remote servers that expose various tools, each server typically advertises its tools via a manifest that includes names, descriptions, and schemas. An attacker can create a rogue server that mimics a legitimate one, copying its name and tool list to deceive models and users alike. When the AI agent connects to this spoofed server, it may receive altered tool metadata or execute tool calls with entirely different backend implementations than expected. From the model’s perspective, the server seems legitimate, and unless there is strong authentication or identity verification, it proceeds to operate under false assumptions. The consequences of server spoofing include credential theft, data manipulation, or unauthorized command execution. Cross-Server Shadowing Finally, Cross-Server Shadowing reflects the vulnerability in multi-server MCP contexts where several servers contribute tools to a shared model session. In such setups, a malicious server can manipulate the model’s behavior by injecting context that interferes with or redefines how tools from another server are perceived or used. This can occur through conflicting tool definitions, misleading metadata, or injected guidance that distorts the model’s tool selection logic. For example, if one server redefines a common tool name or provides conflicting instructions, it can effectively shadow or override the legitimate functionality offered by another server. The model, attempting to reconcile these inputs, may execute the wrong version of a tool or follow harmful instructions. Cross-server shadowing undermines the modularity of the MCP design by allowing one bad actor to corrupt interactions that span multiple otherwise secure sources. In conclusion, these five vulnerabilities expose critical security weaknesses in the Model Context Protocol’s current operational landscape. While MCP introduces exciting possibilities for agentic reasoning and dynamic task completion, it also opens the door to various behaviors that exploit model trust, contextual ambiguity, and tool discovery mechanisms. As the MCP standard evolves and gains broader adoption, addressing these threats will be essential to maintaining user trust and ensuring the safe deployment of AI agents in real-world environments. Sources https://techcommunity.microsoft.com/blog/microsoftdefendercloudblog/plug-play-and-prey-the-security-risks-of-the-model-context-protocol/4410829 Mohammad AsjadAsjad is an intern consultant at Marktechpost. He is persuing B.Tech in mechanical engineering at the Indian Institute of Technology, Kharagpur. Asjad is a Machine learning and deep learning enthusiast who is always researching the applications of machine learning in healthcare.Mohammad Asjadhttps://www.marktechpost.com/author/mohammad_asjad/Stability AI Introduces Adversarial Relativistic-Contrastive (ARC) Post-Training and Stable Audio Open Small: A Distillation-Free Breakthrough for Fast, Diverse, and Efficient Text-to-Audio Generation Across DevicesMohammad Asjadhttps://www.marktechpost.com/author/mohammad_asjad/Meta AI Introduces CATransformers: A Carbon-Aware Machine Learning Framework to Co-Optimize AI Models and Hardware for Sustainable Edge DeploymentMohammad Asjadhttps://www.marktechpost.com/author/mohammad_asjad/Enterprise AI Without GPU Burn: Salesforce’s xGen-small Optimizes for Context, Cost, and PrivacyMohammad Asjadhttps://www.marktechpost.com/author/mohammad_asjad/ServiceNow AI Released Apriel-Nemotron-15b-Thinker: A Compact Yet Powerful Reasoning Model Optimized for Enterprise-Scale Deployment and Efficiency 🚨 Build GenAI you can trust. ⭐️ Parlant is your open-source engine for controlled, compliant, and purposeful AI conversations — Star Parlant on GitHub! (Promoted)
    0 Kommentare 0 Anteile
  • Will the AI boom fuel a global energy crisis?

    AI’s thirst for energy is ballooning into a monster of a challenge. And it’s not just about the electricity bills. The environmental fallout is serious, stretching to guzzling precious water resources, creating mountains of electronic waste, and, yes, adding to those greenhouse gas emissions we’re all trying to cut.As AI models get ever more complex and weave themselves into yet more parts of our lives, a massive question mark hangs in the air: can we power this revolution without costing the Earth?The numbers don’t lie: AI’s energy demand is escalating fastThe sheer computing power needed for the smartest AI out there is on an almost unbelievable upward curve – some say it’s doubling roughly every few months. This isn’t a gentle slope; it’s a vertical climb that’s threatening to leave even our most optimistic energy plans in the dust.To give you a sense of scale, AI’s future energy needs could soon gulp down as much electricity as entire countries like Japan or the Netherlands, or even large US states like California. When you hear stats like that, you start to see the potential squeeze AI could put on the power grids we all rely on.2024 saw a record 4.3% surge in global electricity demand, and AI’s expansion was a big reason why, alongside the boom in electric cars and factories working harder. Wind back to 2022, and data centres, AI, and even cryptocurrency mining were already accounting for nearly 2% of all the electricity used worldwide – that’s about 460 terawatt-hours.Jump to 2024, and data centres on their own use around 415 TWh, which is roughly 1.5% of the global total, and growing at 12% a year. AI’s direct share of that slice is still relatively small – about 20 TWh, or 0.02% of global energy use – but hold onto your hats, because that number is set to rocket upwards.The forecasts? Well, they’re pretty eye-opening. By the end of 2025, AI data centres around the world could demand an extra 10 gigawattsof power. That’s more than the entire power capacity of a place like Utah.Roll on to 2026, and global data centre electricity use could hit 1,000 TWh – similar to what Japan uses right now. And, by 2027, the global power hunger of AI data centres is tipped to reach 68 GW, which is almost what California had in total power capacity back in 2022. Towards the end of this decade, the figures get even more jaw-dropping. Global data centre electricity consumption is predicted to double to around 945 TWh by 2030, which is just shy of 3% of all the electricity used on the planet.OPEC reckons data centre electricity use could even triple to 1,500 TWh by then. And Goldman Sachs? They’re saying global power demand from data centres could leap by as much as 165% compared to 2023, with those data centres specifically kitted out for AI seeing their demand shoot up by more than four times.There are even suggestions that data centres could be responsible for up to 21% of all global energy demand by 2030 if you count the energy it takes to get AI services to us, the users.When we talk about AI’s energy use, it mainly splits into two big chunks: training the AI, and then actually using it.Training enormous models, like GPT-4, takes a colossal amount of energy. Just to train GPT-3, for example, it’s estimated they used 1,287 megawatt-hoursof electricity, and GPT-4 is thought to have needed a whopping 50 times more than that. While training is a power hog, it’s the day-to-day running of these trained models that can chew through over 80% of AI’s total energy. It’s reported that asking ChatGPT a single question uses about ten times more energy than a Google search.With everyone jumping on the generative AI bandwagon, the race is on to build ever more powerful – and therefore more energy-guzzling – data centres.So, can we supply energy for AI – and for ourselves?This is the million-dollar question, isn’t it? Can our planet’s energy systems cope with this new demand? We’re already juggling a mix of fossil fuels, nuclear power, and renewables. If we’re going to feed AI’s growing appetite sustainably, we need to ramp up and diversify how we generate energy, and fast.Naturally, renewable energy – solar, wind, hydro, geothermal – is a huge piece of the puzzle. In the US, for instance, renewables are set to go from 23% of power generation in 2024 to 27% by 2026. The tech giants are making some big promises; Microsoft, for example, is planning to buy 10.5 GW of renewable energy between 2026 and 2030 just for its data centres. AI itself could actually help us use renewable energy more efficiently, perhaps cutting energy use by up to 60% in some areas by making energy storage smarter and managing power grids better.But let’s not get carried away. Renewables have their own headaches. The sun doesn’t always shine, and the wind doesn’t always blow, which is a real problem for data centres that need power around the clock, every single day. The batteries we have now to smooth out these bumps are often expensive and take up a lot of room. Plus, plugging massive new renewable projects into our existing power grids can be a slow and complicated business.This is where nuclear power is starting to look more appealing to some, especially as a steady, low-carbon way to power AI’s massive energy needs. It delivers that crucial 24/7 power, which is exactly what data centres crave. There’s a lot of buzz around Small Modular Reactorstoo, because they’re potentially more flexible and have beefed-up safety features. And it’s not just talk; big names like Microsoft, Amazon, and Google are seriously looking into nuclear options.Matt Garman, who heads up AWS, recently put it plainly to the BBC, calling nuclear a “great solution” for data centres. He said it’s “an excellent source of zero carbon, 24/7 power.” He also stressed that planning for future energy is a massive part of what AWS does.“It’s something we plan many years out,” Garman mentioned. “We invest ahead. I think the world is going to have to build new technologies. I believe nuclear is a big part of that, particularly as we look 10 years out.”Still, nuclear power isn’t a magic wand. Building new reactors takes a notoriously long time, costs a fortune, and involves wading through complex red tape. And let’s be frank, public opinion on nuclear power is still a bit shaky, often because of past accidents, even though modern reactors are much safer.The sheer speed at which AI is developing also creates a bit of a mismatch with how long it takes to get a new nuclear plant up and running. This could mean we end up leaning even more heavily on fossil fuels in the short term, which isn’t great for our green ambitions. Plus, the idea of sticking data centres right next to nuclear plants has got some people worried about what that might do to electricity prices and reliability for everyone else.Not just kilowatts: Wider environmental shadow of AI loomsAI’s impact on the planet goes way beyond just the electricity it uses. Those data centres get hot, and cooling them down uses vast amounts of water. Your average data centre sips about 1.7 litres of water for every kilowatt-hour of energy it burns through.Back in 2022, Google’s data centres reportedly drank their way through about 5 billion gallons of fresh water – that’s a 20% jump from the year before. Some estimates suggest that for every kWh a data centre uses, it might need up to two litres of water just for cooling. Put it another way, global AI infrastructure could soon be chugging six times more water than the entirety of Denmark.And then there’s the ever-growing mountain of electronic waste, or e-waste. Because AI tech – especially specialised hardware like GPUs and TPUs – moves so fast, old kit gets thrown out more often. We could be looking at AI contributing to an e-waste pile-up from data centres hitting five million tons every year by 2030. Even making the AI chips and all the other bits for data centres takes a toll on our natural resources and the environment. It means mining for critical minerals like lithium and cobalt, often using methods that aren’t exactly kind to the planet.Just to make one AI chip can take over 1,400 litres of water and 3,000 kWh of electricity. This hunger for new hardware is also pushing for more semiconductor factories, which, guess what, often leads to more gas-powered energy plants being built.And, of course, we can’t forget the carbon emissions. When AI is powered by electricity generated from burning fossil fuels, it adds to the climate change problem we’re all facing. It’s estimated that training just one big AI model can pump out as much CO2 as hundreds of US homes do in a year.If you look at the environmental reports from the big tech companies, you can see AI’s growing carbon footprint. Microsoft’s yearly emissions, for example, went up by about 40% between 2020 and 2023, mostly because they were building more data centres for AI. Google also reported that its total greenhouse gas emissions have shot up by nearly 50% over the last five years, with the power demands of its AI data centres being a major culprit.Can we innovate our way out?It might sound like all doom and gloom, but a combination of new ideas could help.A big focus is on making AI algorithms themselves more energy-efficient. Researchers are coming up with clever tricks like “model pruning”, “quantisation”, and “knowledge distillation”. Designing smaller, more specialised AI models that do specific jobs with less power is also a priority.Inside data centres, things like “power capping”and “dynamic resource allocation”can make a real difference. Software that’s “AI-aware” can even shift less urgent AI jobs to times when energy is cleaner or demand on the grid is lower. AI can even be used to make the cooling systems in data centres more efficient.On-device AI could also help to reduce power consumption. Instead of sending data off to massive, power-hungry cloud data centres, the AI processing happens right there on your phone or device. This could slash energy use, as the chips designed for this prioritise being efficient over raw power.And we can’t forget about rules and regulations. Governments are starting to wake up to the need to make AI accountable for its energy use and wider environmental impact.Having clear, standard ways to measure and report AI’s footprint is a crucial first step. We also need policies that encourage companies to make hardware that lasts longer and is easier to recycle, to help tackle that e-waste mountain. Things like energy credit trading systems could even give companies a financial reason to choose greener AI tech.It’s worth noting that the United Arab Emirates and the United States shook hands this week on a deal to build the biggest AI campus outside the US in the Gulf. While this shows just how important AI is becoming globally, it also throws a spotlight on why all these energy and environmental concerns need to be front and centre for such huge projects.Finding a sustainable future for AIAI has the power to do some amazing things, but its ferocious appetite for energy is a serious hurdle. The predictions for its future power demands are genuinely startling, potentially matching what whole countries use.If we’re going to meet this demand, we need a smart mix of energy sources. Renewables are fantastic for the long run, but they have their wobbles when it comes to consistent supply and scaling up quickly. Nuclear power – including those newer SMRs – offers a reliable, low-carbon option that’s definitely catching the eye of big tech companies. But we still need to get our heads around the safety, cost, and how long they take to build.And remember, it’s not just about electricity. AI’s broader environmental impact – from the water it drinks to cool data centres, to the growing piles of e-waste from its hardware, and the resources it uses up during manufacturing – is huge. We need to look at the whole picture if we’re serious about lessening AI’s ecological footprint.The good news? There are plenty of promising ideas and innovations bubbling up. Energy-saving AI algorithms, clever power management in data centres, AI-aware software that can manage workloads intelligently, and the shift towards on-device AI all offer ways to cut down on energy use. Plus, the fact that we’re even talking about AI’s environmental impact more means that discussions around policies and rules to push for sustainability are finally happening.Dealing with AI’s energy and environmental challenges needs everyone – researchers, the tech industry, and policymakers – to roll up their sleeves and work together, and fast.If we make energy efficiency a top priority in how AI is developed, invest properly in sustainable energy, manage hardware responsibly from cradle to grave, and put supportive policies in place, we can aim for a future where AI’s incredible potential is unlocked in a way that doesn’t break our planet.The race to lead in AI has to be a race for sustainable AI too.Want to learn more about AI and big data from industry leaders? Check out AI & Big Data Expo taking place in Amsterdam, California, and London. The comprehensive event is co-located with other leading events including Intelligent Automation Conference, BlockX, Digital Transformation Week, and Cyber Security & Cloud Expo.Explore other upcoming enterprise technology events and webinars powered by TechForge here.
    #will #boom #fuel #global #energy
    Will the AI boom fuel a global energy crisis?
    AI’s thirst for energy is ballooning into a monster of a challenge. And it’s not just about the electricity bills. The environmental fallout is serious, stretching to guzzling precious water resources, creating mountains of electronic waste, and, yes, adding to those greenhouse gas emissions we’re all trying to cut.As AI models get ever more complex and weave themselves into yet more parts of our lives, a massive question mark hangs in the air: can we power this revolution without costing the Earth?The numbers don’t lie: AI’s energy demand is escalating fastThe sheer computing power needed for the smartest AI out there is on an almost unbelievable upward curve – some say it’s doubling roughly every few months. This isn’t a gentle slope; it’s a vertical climb that’s threatening to leave even our most optimistic energy plans in the dust.To give you a sense of scale, AI’s future energy needs could soon gulp down as much electricity as entire countries like Japan or the Netherlands, or even large US states like California. When you hear stats like that, you start to see the potential squeeze AI could put on the power grids we all rely on.2024 saw a record 4.3% surge in global electricity demand, and AI’s expansion was a big reason why, alongside the boom in electric cars and factories working harder. Wind back to 2022, and data centres, AI, and even cryptocurrency mining were already accounting for nearly 2% of all the electricity used worldwide – that’s about 460 terawatt-hours.Jump to 2024, and data centres on their own use around 415 TWh, which is roughly 1.5% of the global total, and growing at 12% a year. AI’s direct share of that slice is still relatively small – about 20 TWh, or 0.02% of global energy use – but hold onto your hats, because that number is set to rocket upwards.The forecasts? Well, they’re pretty eye-opening. By the end of 2025, AI data centres around the world could demand an extra 10 gigawattsof power. That’s more than the entire power capacity of a place like Utah.Roll on to 2026, and global data centre electricity use could hit 1,000 TWh – similar to what Japan uses right now. And, by 2027, the global power hunger of AI data centres is tipped to reach 68 GW, which is almost what California had in total power capacity back in 2022. Towards the end of this decade, the figures get even more jaw-dropping. Global data centre electricity consumption is predicted to double to around 945 TWh by 2030, which is just shy of 3% of all the electricity used on the planet.OPEC reckons data centre electricity use could even triple to 1,500 TWh by then. And Goldman Sachs? They’re saying global power demand from data centres could leap by as much as 165% compared to 2023, with those data centres specifically kitted out for AI seeing their demand shoot up by more than four times.There are even suggestions that data centres could be responsible for up to 21% of all global energy demand by 2030 if you count the energy it takes to get AI services to us, the users.When we talk about AI’s energy use, it mainly splits into two big chunks: training the AI, and then actually using it.Training enormous models, like GPT-4, takes a colossal amount of energy. Just to train GPT-3, for example, it’s estimated they used 1,287 megawatt-hoursof electricity, and GPT-4 is thought to have needed a whopping 50 times more than that. While training is a power hog, it’s the day-to-day running of these trained models that can chew through over 80% of AI’s total energy. It’s reported that asking ChatGPT a single question uses about ten times more energy than a Google search.With everyone jumping on the generative AI bandwagon, the race is on to build ever more powerful – and therefore more energy-guzzling – data centres.So, can we supply energy for AI – and for ourselves?This is the million-dollar question, isn’t it? Can our planet’s energy systems cope with this new demand? We’re already juggling a mix of fossil fuels, nuclear power, and renewables. If we’re going to feed AI’s growing appetite sustainably, we need to ramp up and diversify how we generate energy, and fast.Naturally, renewable energy – solar, wind, hydro, geothermal – is a huge piece of the puzzle. In the US, for instance, renewables are set to go from 23% of power generation in 2024 to 27% by 2026. The tech giants are making some big promises; Microsoft, for example, is planning to buy 10.5 GW of renewable energy between 2026 and 2030 just for its data centres. AI itself could actually help us use renewable energy more efficiently, perhaps cutting energy use by up to 60% in some areas by making energy storage smarter and managing power grids better.But let’s not get carried away. Renewables have their own headaches. The sun doesn’t always shine, and the wind doesn’t always blow, which is a real problem for data centres that need power around the clock, every single day. The batteries we have now to smooth out these bumps are often expensive and take up a lot of room. Plus, plugging massive new renewable projects into our existing power grids can be a slow and complicated business.This is where nuclear power is starting to look more appealing to some, especially as a steady, low-carbon way to power AI’s massive energy needs. It delivers that crucial 24/7 power, which is exactly what data centres crave. There’s a lot of buzz around Small Modular Reactorstoo, because they’re potentially more flexible and have beefed-up safety features. And it’s not just talk; big names like Microsoft, Amazon, and Google are seriously looking into nuclear options.Matt Garman, who heads up AWS, recently put it plainly to the BBC, calling nuclear a “great solution” for data centres. He said it’s “an excellent source of zero carbon, 24/7 power.” He also stressed that planning for future energy is a massive part of what AWS does.“It’s something we plan many years out,” Garman mentioned. “We invest ahead. I think the world is going to have to build new technologies. I believe nuclear is a big part of that, particularly as we look 10 years out.”Still, nuclear power isn’t a magic wand. Building new reactors takes a notoriously long time, costs a fortune, and involves wading through complex red tape. And let’s be frank, public opinion on nuclear power is still a bit shaky, often because of past accidents, even though modern reactors are much safer.The sheer speed at which AI is developing also creates a bit of a mismatch with how long it takes to get a new nuclear plant up and running. This could mean we end up leaning even more heavily on fossil fuels in the short term, which isn’t great for our green ambitions. Plus, the idea of sticking data centres right next to nuclear plants has got some people worried about what that might do to electricity prices and reliability for everyone else.Not just kilowatts: Wider environmental shadow of AI loomsAI’s impact on the planet goes way beyond just the electricity it uses. Those data centres get hot, and cooling them down uses vast amounts of water. Your average data centre sips about 1.7 litres of water for every kilowatt-hour of energy it burns through.Back in 2022, Google’s data centres reportedly drank their way through about 5 billion gallons of fresh water – that’s a 20% jump from the year before. Some estimates suggest that for every kWh a data centre uses, it might need up to two litres of water just for cooling. Put it another way, global AI infrastructure could soon be chugging six times more water than the entirety of Denmark.And then there’s the ever-growing mountain of electronic waste, or e-waste. Because AI tech – especially specialised hardware like GPUs and TPUs – moves so fast, old kit gets thrown out more often. We could be looking at AI contributing to an e-waste pile-up from data centres hitting five million tons every year by 2030. Even making the AI chips and all the other bits for data centres takes a toll on our natural resources and the environment. It means mining for critical minerals like lithium and cobalt, often using methods that aren’t exactly kind to the planet.Just to make one AI chip can take over 1,400 litres of water and 3,000 kWh of electricity. This hunger for new hardware is also pushing for more semiconductor factories, which, guess what, often leads to more gas-powered energy plants being built.And, of course, we can’t forget the carbon emissions. When AI is powered by electricity generated from burning fossil fuels, it adds to the climate change problem we’re all facing. It’s estimated that training just one big AI model can pump out as much CO2 as hundreds of US homes do in a year.If you look at the environmental reports from the big tech companies, you can see AI’s growing carbon footprint. Microsoft’s yearly emissions, for example, went up by about 40% between 2020 and 2023, mostly because they were building more data centres for AI. Google also reported that its total greenhouse gas emissions have shot up by nearly 50% over the last five years, with the power demands of its AI data centres being a major culprit.Can we innovate our way out?It might sound like all doom and gloom, but a combination of new ideas could help.A big focus is on making AI algorithms themselves more energy-efficient. Researchers are coming up with clever tricks like “model pruning”, “quantisation”, and “knowledge distillation”. Designing smaller, more specialised AI models that do specific jobs with less power is also a priority.Inside data centres, things like “power capping”and “dynamic resource allocation”can make a real difference. Software that’s “AI-aware” can even shift less urgent AI jobs to times when energy is cleaner or demand on the grid is lower. AI can even be used to make the cooling systems in data centres more efficient.On-device AI could also help to reduce power consumption. Instead of sending data off to massive, power-hungry cloud data centres, the AI processing happens right there on your phone or device. This could slash energy use, as the chips designed for this prioritise being efficient over raw power.And we can’t forget about rules and regulations. Governments are starting to wake up to the need to make AI accountable for its energy use and wider environmental impact.Having clear, standard ways to measure and report AI’s footprint is a crucial first step. We also need policies that encourage companies to make hardware that lasts longer and is easier to recycle, to help tackle that e-waste mountain. Things like energy credit trading systems could even give companies a financial reason to choose greener AI tech.It’s worth noting that the United Arab Emirates and the United States shook hands this week on a deal to build the biggest AI campus outside the US in the Gulf. While this shows just how important AI is becoming globally, it also throws a spotlight on why all these energy and environmental concerns need to be front and centre for such huge projects.Finding a sustainable future for AIAI has the power to do some amazing things, but its ferocious appetite for energy is a serious hurdle. The predictions for its future power demands are genuinely startling, potentially matching what whole countries use.If we’re going to meet this demand, we need a smart mix of energy sources. Renewables are fantastic for the long run, but they have their wobbles when it comes to consistent supply and scaling up quickly. Nuclear power – including those newer SMRs – offers a reliable, low-carbon option that’s definitely catching the eye of big tech companies. But we still need to get our heads around the safety, cost, and how long they take to build.And remember, it’s not just about electricity. AI’s broader environmental impact – from the water it drinks to cool data centres, to the growing piles of e-waste from its hardware, and the resources it uses up during manufacturing – is huge. We need to look at the whole picture if we’re serious about lessening AI’s ecological footprint.The good news? There are plenty of promising ideas and innovations bubbling up. Energy-saving AI algorithms, clever power management in data centres, AI-aware software that can manage workloads intelligently, and the shift towards on-device AI all offer ways to cut down on energy use. Plus, the fact that we’re even talking about AI’s environmental impact more means that discussions around policies and rules to push for sustainability are finally happening.Dealing with AI’s energy and environmental challenges needs everyone – researchers, the tech industry, and policymakers – to roll up their sleeves and work together, and fast.If we make energy efficiency a top priority in how AI is developed, invest properly in sustainable energy, manage hardware responsibly from cradle to grave, and put supportive policies in place, we can aim for a future where AI’s incredible potential is unlocked in a way that doesn’t break our planet.The race to lead in AI has to be a race for sustainable AI too.Want to learn more about AI and big data from industry leaders? Check out AI & Big Data Expo taking place in Amsterdam, California, and London. The comprehensive event is co-located with other leading events including Intelligent Automation Conference, BlockX, Digital Transformation Week, and Cyber Security & Cloud Expo.Explore other upcoming enterprise technology events and webinars powered by TechForge here. #will #boom #fuel #global #energy
    WWW.ARTIFICIALINTELLIGENCE-NEWS.COM
    Will the AI boom fuel a global energy crisis?
    AI’s thirst for energy is ballooning into a monster of a challenge. And it’s not just about the electricity bills. The environmental fallout is serious, stretching to guzzling precious water resources, creating mountains of electronic waste, and, yes, adding to those greenhouse gas emissions we’re all trying to cut.As AI models get ever more complex and weave themselves into yet more parts of our lives, a massive question mark hangs in the air: can we power this revolution without costing the Earth?The numbers don’t lie: AI’s energy demand is escalating fastThe sheer computing power needed for the smartest AI out there is on an almost unbelievable upward curve – some say it’s doubling roughly every few months. This isn’t a gentle slope; it’s a vertical climb that’s threatening to leave even our most optimistic energy plans in the dust.To give you a sense of scale, AI’s future energy needs could soon gulp down as much electricity as entire countries like Japan or the Netherlands, or even large US states like California. When you hear stats like that, you start to see the potential squeeze AI could put on the power grids we all rely on.2024 saw a record 4.3% surge in global electricity demand, and AI’s expansion was a big reason why, alongside the boom in electric cars and factories working harder. Wind back to 2022, and data centres, AI, and even cryptocurrency mining were already accounting for nearly 2% of all the electricity used worldwide – that’s about 460 terawatt-hours (TWh).Jump to 2024, and data centres on their own use around 415 TWh, which is roughly 1.5% of the global total, and growing at 12% a year. AI’s direct share of that slice is still relatively small – about 20 TWh, or 0.02% of global energy use – but hold onto your hats, because that number is set to rocket upwards.The forecasts? Well, they’re pretty eye-opening. By the end of 2025, AI data centres around the world could demand an extra 10 gigawatts (GW) of power. That’s more than the entire power capacity of a place like Utah.Roll on to 2026, and global data centre electricity use could hit 1,000 TWh – similar to what Japan uses right now. And, by 2027, the global power hunger of AI data centres is tipped to reach 68 GW, which is almost what California had in total power capacity back in 2022. Towards the end of this decade, the figures get even more jaw-dropping. Global data centre electricity consumption is predicted to double to around 945 TWh by 2030, which is just shy of 3% of all the electricity used on the planet.OPEC reckons data centre electricity use could even triple to 1,500 TWh by then. And Goldman Sachs? They’re saying global power demand from data centres could leap by as much as 165% compared to 2023, with those data centres specifically kitted out for AI seeing their demand shoot up by more than four times.There are even suggestions that data centres could be responsible for up to 21% of all global energy demand by 2030 if you count the energy it takes to get AI services to us, the users.When we talk about AI’s energy use, it mainly splits into two big chunks: training the AI, and then actually using it.Training enormous models, like GPT-4, takes a colossal amount of energy. Just to train GPT-3, for example, it’s estimated they used 1,287 megawatt-hours (MWh) of electricity, and GPT-4 is thought to have needed a whopping 50 times more than that. While training is a power hog, it’s the day-to-day running of these trained models that can chew through over 80% of AI’s total energy. It’s reported that asking ChatGPT a single question uses about ten times more energy than a Google search (we’re talking roughly 2.9 Wh versus 0.3 Wh).With everyone jumping on the generative AI bandwagon, the race is on to build ever more powerful – and therefore more energy-guzzling – data centres.So, can we supply energy for AI – and for ourselves?This is the million-dollar question, isn’t it? Can our planet’s energy systems cope with this new demand? We’re already juggling a mix of fossil fuels, nuclear power, and renewables. If we’re going to feed AI’s growing appetite sustainably, we need to ramp up and diversify how we generate energy, and fast.Naturally, renewable energy – solar, wind, hydro, geothermal – is a huge piece of the puzzle. In the US, for instance, renewables are set to go from 23% of power generation in 2024 to 27% by 2026. The tech giants are making some big promises; Microsoft, for example, is planning to buy 10.5 GW of renewable energy between 2026 and 2030 just for its data centres. AI itself could actually help us use renewable energy more efficiently, perhaps cutting energy use by up to 60% in some areas by making energy storage smarter and managing power grids better.But let’s not get carried away. Renewables have their own headaches. The sun doesn’t always shine, and the wind doesn’t always blow, which is a real problem for data centres that need power around the clock, every single day. The batteries we have now to smooth out these bumps are often expensive and take up a lot of room. Plus, plugging massive new renewable projects into our existing power grids can be a slow and complicated business.This is where nuclear power is starting to look more appealing to some, especially as a steady, low-carbon way to power AI’s massive energy needs. It delivers that crucial 24/7 power, which is exactly what data centres crave. There’s a lot of buzz around Small Modular Reactors (SMRs) too, because they’re potentially more flexible and have beefed-up safety features. And it’s not just talk; big names like Microsoft, Amazon, and Google are seriously looking into nuclear options.Matt Garman, who heads up AWS, recently put it plainly to the BBC, calling nuclear a “great solution” for data centres. He said it’s “an excellent source of zero carbon, 24/7 power.” He also stressed that planning for future energy is a massive part of what AWS does.“It’s something we plan many years out,” Garman mentioned. “We invest ahead. I think the world is going to have to build new technologies. I believe nuclear is a big part of that, particularly as we look 10 years out.”Still, nuclear power isn’t a magic wand. Building new reactors takes a notoriously long time, costs a fortune, and involves wading through complex red tape. And let’s be frank, public opinion on nuclear power is still a bit shaky, often because of past accidents, even though modern reactors are much safer.The sheer speed at which AI is developing also creates a bit of a mismatch with how long it takes to get a new nuclear plant up and running. This could mean we end up leaning even more heavily on fossil fuels in the short term, which isn’t great for our green ambitions. Plus, the idea of sticking data centres right next to nuclear plants has got some people worried about what that might do to electricity prices and reliability for everyone else.Not just kilowatts: Wider environmental shadow of AI loomsAI’s impact on the planet goes way beyond just the electricity it uses. Those data centres get hot, and cooling them down uses vast amounts of water. Your average data centre sips about 1.7 litres of water for every kilowatt-hour of energy it burns through.Back in 2022, Google’s data centres reportedly drank their way through about 5 billion gallons of fresh water – that’s a 20% jump from the year before. Some estimates suggest that for every kWh a data centre uses, it might need up to two litres of water just for cooling. Put it another way, global AI infrastructure could soon be chugging six times more water than the entirety of Denmark.And then there’s the ever-growing mountain of electronic waste, or e-waste. Because AI tech – especially specialised hardware like GPUs and TPUs – moves so fast, old kit gets thrown out more often. We could be looking at AI contributing to an e-waste pile-up from data centres hitting five million tons every year by 2030. Even making the AI chips and all the other bits for data centres takes a toll on our natural resources and the environment. It means mining for critical minerals like lithium and cobalt, often using methods that aren’t exactly kind to the planet.Just to make one AI chip can take over 1,400 litres of water and 3,000 kWh of electricity. This hunger for new hardware is also pushing for more semiconductor factories, which, guess what, often leads to more gas-powered energy plants being built.And, of course, we can’t forget the carbon emissions. When AI is powered by electricity generated from burning fossil fuels, it adds to the climate change problem we’re all facing. It’s estimated that training just one big AI model can pump out as much CO2 as hundreds of US homes do in a year.If you look at the environmental reports from the big tech companies, you can see AI’s growing carbon footprint. Microsoft’s yearly emissions, for example, went up by about 40% between 2020 and 2023, mostly because they were building more data centres for AI. Google also reported that its total greenhouse gas emissions have shot up by nearly 50% over the last five years, with the power demands of its AI data centres being a major culprit.Can we innovate our way out?It might sound like all doom and gloom, but a combination of new ideas could help.A big focus is on making AI algorithms themselves more energy-efficient. Researchers are coming up with clever tricks like “model pruning” (stripping out unnecessary bits of an AI model), “quantisation” (using less precise numbers, which saves energy), and “knowledge distillation” (where a smaller, thriftier AI model learns from a big, complex one). Designing smaller, more specialised AI models that do specific jobs with less power is also a priority.Inside data centres, things like “power capping” (putting a lid on how much power hardware can draw) and “dynamic resource allocation” (shifting computing power around based on real-time needs and when renewable energy is plentiful) can make a real difference. Software that’s “AI-aware” can even shift less urgent AI jobs to times when energy is cleaner or demand on the grid is lower. AI can even be used to make the cooling systems in data centres more efficient.On-device AI could also help to reduce power consumption. Instead of sending data off to massive, power-hungry cloud data centres, the AI processing happens right there on your phone or device. This could slash energy use, as the chips designed for this prioritise being efficient over raw power.And we can’t forget about rules and regulations. Governments are starting to wake up to the need to make AI accountable for its energy use and wider environmental impact.Having clear, standard ways to measure and report AI’s footprint is a crucial first step. We also need policies that encourage companies to make hardware that lasts longer and is easier to recycle, to help tackle that e-waste mountain. Things like energy credit trading systems could even give companies a financial reason to choose greener AI tech.It’s worth noting that the United Arab Emirates and the United States shook hands this week on a deal to build the biggest AI campus outside the US in the Gulf. While this shows just how important AI is becoming globally, it also throws a spotlight on why all these energy and environmental concerns need to be front and centre for such huge projects.Finding a sustainable future for AIAI has the power to do some amazing things, but its ferocious appetite for energy is a serious hurdle. The predictions for its future power demands are genuinely startling, potentially matching what whole countries use.If we’re going to meet this demand, we need a smart mix of energy sources. Renewables are fantastic for the long run, but they have their wobbles when it comes to consistent supply and scaling up quickly. Nuclear power – including those newer SMRs – offers a reliable, low-carbon option that’s definitely catching the eye of big tech companies. But we still need to get our heads around the safety, cost, and how long they take to build.And remember, it’s not just about electricity. AI’s broader environmental impact – from the water it drinks to cool data centres, to the growing piles of e-waste from its hardware, and the resources it uses up during manufacturing – is huge. We need to look at the whole picture if we’re serious about lessening AI’s ecological footprint.The good news? There are plenty of promising ideas and innovations bubbling up. Energy-saving AI algorithms, clever power management in data centres, AI-aware software that can manage workloads intelligently, and the shift towards on-device AI all offer ways to cut down on energy use. Plus, the fact that we’re even talking about AI’s environmental impact more means that discussions around policies and rules to push for sustainability are finally happening.Dealing with AI’s energy and environmental challenges needs everyone – researchers, the tech industry, and policymakers – to roll up their sleeves and work together, and fast.If we make energy efficiency a top priority in how AI is developed, invest properly in sustainable energy, manage hardware responsibly from cradle to grave, and put supportive policies in place, we can aim for a future where AI’s incredible potential is unlocked in a way that doesn’t break our planet.The race to lead in AI has to be a race for sustainable AI too.(Photo by Nejc Soklič)Want to learn more about AI and big data from industry leaders? Check out AI & Big Data Expo taking place in Amsterdam, California, and London. The comprehensive event is co-located with other leading events including Intelligent Automation Conference, BlockX, Digital Transformation Week, and Cyber Security & Cloud Expo.Explore other upcoming enterprise technology events and webinars powered by TechForge here.
    0 Kommentare 0 Anteile
  • Stability AI Introduces Adversarial Relativistic-Contrastive (ARC) Post-Training and Stable Audio Open Small: A Distillation-Free Breakthrough for Fast, Diverse, and Efficient Text-to-Audio Generation Across Devices

    Text-to-audio generation has emerged as a transformative approach for synthesizing sound directly from textual prompts, offering practical use in music production, gaming, and virtual experiences. Under the hood, these models typically employ Gaussian flow-based techniques such as diffusion or rectified flows. These methods model the incremental steps that transition from random noise to structured audio. While highly effective in producing high-quality soundscapes, the slow inference speeds have posed a barrier to real-time interactivity. It is particularly limiting when creative users expect an instrument-like responsiveness from these tools.
    Latency is the primary issue with these systems. Current text-to-audio models can take several seconds or even minutes to generate a few seconds of audio. The core bottleneck lies in their step-based inference architecture, requiring between 50 and 100 iterations per output. Previous acceleration strategies focus on distillation methods where smaller models are trained under the supervision of larger teacher models to replicate multi-step inference in fewer steps. However, these distillation methods are computationally expensive. They demand large-scale storage for intermediate training outputs or require simultaneous operation of several models in memory, which hinders their adoption, especially on mobile or edge devices. Also, such methods often sacrifice output diversity and introduce over-saturation artifacts.
    While a few adversarial post-training methods have been attempted to bypass the cost of distillation, their success has been limited. Most existing implementations rely on partial distillation for initialization or do not scale well to complex audio synthesis. Also, audio applications have seen fewer fully adversarial solutions. Tools like Presto integrate adversarial objectives but still depend on teacher models and CFG-based training for prompt adherence, which restricts their generative diversity.
    Researchers from UC San Diego, Stability AI, and Arm introduced Adversarial Relativistic-Contrastivepost-training. This approach sidesteps the need for teacher models, distillation, or classifier-free guidance. Instead, ARC enhances an existing pre-trained rectified flow generator by integrating two novel training objectives: a relativistic adversarial loss and a contrastive discriminator loss. These help the generator produce high-fidelity audio in fewer steps while maintaining strong alignment with text prompts. When paired with the Stable Audio Openframework, the result was a system capable of generating 12 seconds of 44.1 kHz stereo audio in only 75 milliseconds on an H100 GPU and around 7 seconds on mobile devices.
    With ARC methodology, they introducedStable Audio Open Small, a compact and efficient version of SAO tailored for resource-constrained environments. This model contains 497 million parameters and uses an architecture built on a latent diffusion transformer. It consists of three main components: a waveform-compressing autoencoder, a T5-based text embedding system for semantic conditioning, and a DiTthat operates within the latent space of the autoencoder. Stable Audio Open Small can generate stereo audio up to 11 seconds long at 44.1 kHz. It is designed to be deployed using the ‘stable-audio-tools’ library and supports ping-pong sampling, enabling efficient few-step generation. The model demonstrated exceptional inference efficiency, achieving generation speeds of under 7 seconds on a Vivo X200 Pro phone after applying dynamic Int8 quantization, which also cut RAM usage from 6.5GB to 3.6 GB. This makes it especially viable for on-device creative applications like mobile audio tools and embedded systems.

    The ARC training approach involves replacing the traditional L2 loss with an adversarial formulation where generated and real samples, paired with identical prompts, are evaluated by a discriminator trained to distinguish between them. A contrastive objective teaches the discriminator to rank accurate audio-text pairs higher than mismatched ones to improve prompt relevance. These paired objectives eliminate the need for CFG while achieving better prompt adherence. Also, ARC adopts ping-pong sampling to refine the audio output through alternating denoising and re-noising cycles, reducing inference steps without compromising quality.
    ARC’s performance was evaluated extensively. In objective tests, it achieved an FDopenl3 score of 84.43, a KLpasst score of 2.24, and a CLAP score of 0.27, indicating balanced quality and semantic precision. Diversity was notably strong, with a CLAP Conditional Diversity Scoreof 0.41. Real-Time Factor reached 156.42, reflecting outstanding generation speed, while GPU memory usage remained at a practical 4.06 GB. Subjectively, ARC scored 4.4 for diversity, 4.2 for quality, and 4.2 for prompt adherence in human evaluations involving 14 participants. Unlike distillation-based models like Presto, which scored higher on quality but dropped to 2.7 on diversity, ARC presented a more balanced and practical solution.

    Several Key Takeaways from the Research by Stability AI on Adversarial Relativistic-Contrastivepost-training and  Stable Audio Open Small include: 

    ARC post-training avoids distillation and CFG, relying on adversarial and contrastive losses.
    ARC generates 12s of 44.1 kHz stereo audio in 75ms on H100 and 7s on mobile CPUs.
    It achieves 0.41 CLAP Conditional Diversity Score, the highest among tested models.
    Subjective scores: 4.4, 4.2, and 4.2.
    Ping-pong sampling enables few-step inference while refining output quality.
    Stable Audio Open Small offers 497M parameters, supports 8-step generation, and is compatible with mobile deployments.
    On Vivo X200 Pro, inference latency dropped from 15.3s to 6.6s with half the memory.
    ARC and SAO Small provide real-time solutions for music, games, and creative tools.

    In conclusion, the combination of ARC post-training and Stable Audio Open Small eliminates the reliance on resource-intensive distillation and classifier-free guidance, enabling researchers to deliver a streamlined adversarial framework that accelerates inference without compromising output quality or prompt adherence. ARC enables fast, diverse, and semantically rich audio synthesis in high-performance and mobile environments. With Stable Audio Open Small optimized for lightweight deployment, this research lays the groundwork for integrating responsive, generative audio tools into everyday creative workflows, from professional sound design to real-time applications on edge devices.

    Check out the Paper, GitHub Page and Model on Hugging Face. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 90k+ ML SubReddit.
    Mohammad AsjadAsjad is an intern consultant at Marktechpost. He is persuing B.Tech in mechanical engineering at the Indian Institute of Technology, Kharagpur. Asjad is a Machine learning and deep learning enthusiast who is always researching the applications of machine learning in healthcare.Mohammad Asjadhttps://www.marktechpost.com/author/mohammad_asjad/Meta AI Introduces CATransformers: A Carbon-Aware Machine Learning Framework to Co-Optimize AI Models and Hardware for Sustainable Edge DeploymentMohammad Asjadhttps://www.marktechpost.com/author/mohammad_asjad/Enterprise AI Without GPU Burn: Salesforce’s xGen-small Optimizes for Context, Cost, and PrivacyMohammad Asjadhttps://www.marktechpost.com/author/mohammad_asjad/ServiceNow AI Released Apriel-Nemotron-15b-Thinker: A Compact Yet Powerful Reasoning Model Optimized for Enterprise-Scale Deployment and EfficiencyMohammad Asjadhttps://www.marktechpost.com/author/mohammad_asjad/Researchers from Fudan University Introduce Lorsa: A Sparse Attention Mechanism That Recovers Atomic Attention Units Hidden in Transformer Superposition
    #stability #introduces #adversarial #relativisticcontrastive #arc
    Stability AI Introduces Adversarial Relativistic-Contrastive (ARC) Post-Training and Stable Audio Open Small: A Distillation-Free Breakthrough for Fast, Diverse, and Efficient Text-to-Audio Generation Across Devices
    Text-to-audio generation has emerged as a transformative approach for synthesizing sound directly from textual prompts, offering practical use in music production, gaming, and virtual experiences. Under the hood, these models typically employ Gaussian flow-based techniques such as diffusion or rectified flows. These methods model the incremental steps that transition from random noise to structured audio. While highly effective in producing high-quality soundscapes, the slow inference speeds have posed a barrier to real-time interactivity. It is particularly limiting when creative users expect an instrument-like responsiveness from these tools. Latency is the primary issue with these systems. Current text-to-audio models can take several seconds or even minutes to generate a few seconds of audio. The core bottleneck lies in their step-based inference architecture, requiring between 50 and 100 iterations per output. Previous acceleration strategies focus on distillation methods where smaller models are trained under the supervision of larger teacher models to replicate multi-step inference in fewer steps. However, these distillation methods are computationally expensive. They demand large-scale storage for intermediate training outputs or require simultaneous operation of several models in memory, which hinders their adoption, especially on mobile or edge devices. Also, such methods often sacrifice output diversity and introduce over-saturation artifacts. While a few adversarial post-training methods have been attempted to bypass the cost of distillation, their success has been limited. Most existing implementations rely on partial distillation for initialization or do not scale well to complex audio synthesis. Also, audio applications have seen fewer fully adversarial solutions. Tools like Presto integrate adversarial objectives but still depend on teacher models and CFG-based training for prompt adherence, which restricts their generative diversity. Researchers from UC San Diego, Stability AI, and Arm introduced Adversarial Relativistic-Contrastivepost-training. This approach sidesteps the need for teacher models, distillation, or classifier-free guidance. Instead, ARC enhances an existing pre-trained rectified flow generator by integrating two novel training objectives: a relativistic adversarial loss and a contrastive discriminator loss. These help the generator produce high-fidelity audio in fewer steps while maintaining strong alignment with text prompts. When paired with the Stable Audio Openframework, the result was a system capable of generating 12 seconds of 44.1 kHz stereo audio in only 75 milliseconds on an H100 GPU and around 7 seconds on mobile devices. With ARC methodology, they introducedStable Audio Open Small, a compact and efficient version of SAO tailored for resource-constrained environments. This model contains 497 million parameters and uses an architecture built on a latent diffusion transformer. It consists of three main components: a waveform-compressing autoencoder, a T5-based text embedding system for semantic conditioning, and a DiTthat operates within the latent space of the autoencoder. Stable Audio Open Small can generate stereo audio up to 11 seconds long at 44.1 kHz. It is designed to be deployed using the ‘stable-audio-tools’ library and supports ping-pong sampling, enabling efficient few-step generation. The model demonstrated exceptional inference efficiency, achieving generation speeds of under 7 seconds on a Vivo X200 Pro phone after applying dynamic Int8 quantization, which also cut RAM usage from 6.5GB to 3.6 GB. This makes it especially viable for on-device creative applications like mobile audio tools and embedded systems. The ARC training approach involves replacing the traditional L2 loss with an adversarial formulation where generated and real samples, paired with identical prompts, are evaluated by a discriminator trained to distinguish between them. A contrastive objective teaches the discriminator to rank accurate audio-text pairs higher than mismatched ones to improve prompt relevance. These paired objectives eliminate the need for CFG while achieving better prompt adherence. Also, ARC adopts ping-pong sampling to refine the audio output through alternating denoising and re-noising cycles, reducing inference steps without compromising quality. ARC’s performance was evaluated extensively. In objective tests, it achieved an FDopenl3 score of 84.43, a KLpasst score of 2.24, and a CLAP score of 0.27, indicating balanced quality and semantic precision. Diversity was notably strong, with a CLAP Conditional Diversity Scoreof 0.41. Real-Time Factor reached 156.42, reflecting outstanding generation speed, while GPU memory usage remained at a practical 4.06 GB. Subjectively, ARC scored 4.4 for diversity, 4.2 for quality, and 4.2 for prompt adherence in human evaluations involving 14 participants. Unlike distillation-based models like Presto, which scored higher on quality but dropped to 2.7 on diversity, ARC presented a more balanced and practical solution. Several Key Takeaways from the Research by Stability AI on Adversarial Relativistic-Contrastivepost-training and  Stable Audio Open Small include:  ARC post-training avoids distillation and CFG, relying on adversarial and contrastive losses. ARC generates 12s of 44.1 kHz stereo audio in 75ms on H100 and 7s on mobile CPUs. It achieves 0.41 CLAP Conditional Diversity Score, the highest among tested models. Subjective scores: 4.4, 4.2, and 4.2. Ping-pong sampling enables few-step inference while refining output quality. Stable Audio Open Small offers 497M parameters, supports 8-step generation, and is compatible with mobile deployments. On Vivo X200 Pro, inference latency dropped from 15.3s to 6.6s with half the memory. ARC and SAO Small provide real-time solutions for music, games, and creative tools. In conclusion, the combination of ARC post-training and Stable Audio Open Small eliminates the reliance on resource-intensive distillation and classifier-free guidance, enabling researchers to deliver a streamlined adversarial framework that accelerates inference without compromising output quality or prompt adherence. ARC enables fast, diverse, and semantically rich audio synthesis in high-performance and mobile environments. With Stable Audio Open Small optimized for lightweight deployment, this research lays the groundwork for integrating responsive, generative audio tools into everyday creative workflows, from professional sound design to real-time applications on edge devices. Check out the Paper, GitHub Page and Model on Hugging Face. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 90k+ ML SubReddit. Mohammad AsjadAsjad is an intern consultant at Marktechpost. He is persuing B.Tech in mechanical engineering at the Indian Institute of Technology, Kharagpur. Asjad is a Machine learning and deep learning enthusiast who is always researching the applications of machine learning in healthcare.Mohammad Asjadhttps://www.marktechpost.com/author/mohammad_asjad/Meta AI Introduces CATransformers: A Carbon-Aware Machine Learning Framework to Co-Optimize AI Models and Hardware for Sustainable Edge DeploymentMohammad Asjadhttps://www.marktechpost.com/author/mohammad_asjad/Enterprise AI Without GPU Burn: Salesforce’s xGen-small Optimizes for Context, Cost, and PrivacyMohammad Asjadhttps://www.marktechpost.com/author/mohammad_asjad/ServiceNow AI Released Apriel-Nemotron-15b-Thinker: A Compact Yet Powerful Reasoning Model Optimized for Enterprise-Scale Deployment and EfficiencyMohammad Asjadhttps://www.marktechpost.com/author/mohammad_asjad/Researchers from Fudan University Introduce Lorsa: A Sparse Attention Mechanism That Recovers Atomic Attention Units Hidden in Transformer Superposition #stability #introduces #adversarial #relativisticcontrastive #arc
    WWW.MARKTECHPOST.COM
    Stability AI Introduces Adversarial Relativistic-Contrastive (ARC) Post-Training and Stable Audio Open Small: A Distillation-Free Breakthrough for Fast, Diverse, and Efficient Text-to-Audio Generation Across Devices
    Text-to-audio generation has emerged as a transformative approach for synthesizing sound directly from textual prompts, offering practical use in music production, gaming, and virtual experiences. Under the hood, these models typically employ Gaussian flow-based techniques such as diffusion or rectified flows. These methods model the incremental steps that transition from random noise to structured audio. While highly effective in producing high-quality soundscapes, the slow inference speeds have posed a barrier to real-time interactivity. It is particularly limiting when creative users expect an instrument-like responsiveness from these tools. Latency is the primary issue with these systems. Current text-to-audio models can take several seconds or even minutes to generate a few seconds of audio. The core bottleneck lies in their step-based inference architecture, requiring between 50 and 100 iterations per output. Previous acceleration strategies focus on distillation methods where smaller models are trained under the supervision of larger teacher models to replicate multi-step inference in fewer steps. However, these distillation methods are computationally expensive. They demand large-scale storage for intermediate training outputs or require simultaneous operation of several models in memory, which hinders their adoption, especially on mobile or edge devices. Also, such methods often sacrifice output diversity and introduce over-saturation artifacts. While a few adversarial post-training methods have been attempted to bypass the cost of distillation, their success has been limited. Most existing implementations rely on partial distillation for initialization or do not scale well to complex audio synthesis. Also, audio applications have seen fewer fully adversarial solutions. Tools like Presto integrate adversarial objectives but still depend on teacher models and CFG-based training for prompt adherence, which restricts their generative diversity. Researchers from UC San Diego, Stability AI, and Arm introduced Adversarial Relativistic-Contrastive (ARC) post-training. This approach sidesteps the need for teacher models, distillation, or classifier-free guidance. Instead, ARC enhances an existing pre-trained rectified flow generator by integrating two novel training objectives: a relativistic adversarial loss and a contrastive discriminator loss. These help the generator produce high-fidelity audio in fewer steps while maintaining strong alignment with text prompts. When paired with the Stable Audio Open (SAO) framework, the result was a system capable of generating 12 seconds of 44.1 kHz stereo audio in only 75 milliseconds on an H100 GPU and around 7 seconds on mobile devices. With ARC methodology, they introducedStable Audio Open Small, a compact and efficient version of SAO tailored for resource-constrained environments. This model contains 497 million parameters and uses an architecture built on a latent diffusion transformer. It consists of three main components: a waveform-compressing autoencoder, a T5-based text embedding system for semantic conditioning, and a DiT (Diffusion Transformer) that operates within the latent space of the autoencoder. Stable Audio Open Small can generate stereo audio up to 11 seconds long at 44.1 kHz. It is designed to be deployed using the ‘stable-audio-tools’ library and supports ping-pong sampling, enabling efficient few-step generation. The model demonstrated exceptional inference efficiency, achieving generation speeds of under 7 seconds on a Vivo X200 Pro phone after applying dynamic Int8 quantization, which also cut RAM usage from 6.5GB to 3.6 GB. This makes it especially viable for on-device creative applications like mobile audio tools and embedded systems. The ARC training approach involves replacing the traditional L2 loss with an adversarial formulation where generated and real samples, paired with identical prompts, are evaluated by a discriminator trained to distinguish between them. A contrastive objective teaches the discriminator to rank accurate audio-text pairs higher than mismatched ones to improve prompt relevance. These paired objectives eliminate the need for CFG while achieving better prompt adherence. Also, ARC adopts ping-pong sampling to refine the audio output through alternating denoising and re-noising cycles, reducing inference steps without compromising quality. ARC’s performance was evaluated extensively. In objective tests, it achieved an FDopenl3 score of 84.43, a KLpasst score of 2.24, and a CLAP score of 0.27, indicating balanced quality and semantic precision. Diversity was notably strong, with a CLAP Conditional Diversity Score (CCDS) of 0.41. Real-Time Factor reached 156.42, reflecting outstanding generation speed, while GPU memory usage remained at a practical 4.06 GB. Subjectively, ARC scored 4.4 for diversity, 4.2 for quality, and 4.2 for prompt adherence in human evaluations involving 14 participants. Unlike distillation-based models like Presto, which scored higher on quality but dropped to 2.7 on diversity, ARC presented a more balanced and practical solution. Several Key Takeaways from the Research by Stability AI on Adversarial Relativistic-Contrastive (ARC) post-training and  Stable Audio Open Small include:  ARC post-training avoids distillation and CFG, relying on adversarial and contrastive losses. ARC generates 12s of 44.1 kHz stereo audio in 75ms on H100 and 7s on mobile CPUs. It achieves 0.41 CLAP Conditional Diversity Score, the highest among tested models. Subjective scores: 4.4 (diversity), 4.2 (quality), and 4.2 (prompt adherence). Ping-pong sampling enables few-step inference while refining output quality. Stable Audio Open Small offers 497M parameters, supports 8-step generation, and is compatible with mobile deployments. On Vivo X200 Pro, inference latency dropped from 15.3s to 6.6s with half the memory. ARC and SAO Small provide real-time solutions for music, games, and creative tools. In conclusion, the combination of ARC post-training and Stable Audio Open Small eliminates the reliance on resource-intensive distillation and classifier-free guidance, enabling researchers to deliver a streamlined adversarial framework that accelerates inference without compromising output quality or prompt adherence. ARC enables fast, diverse, and semantically rich audio synthesis in high-performance and mobile environments. With Stable Audio Open Small optimized for lightweight deployment, this research lays the groundwork for integrating responsive, generative audio tools into everyday creative workflows, from professional sound design to real-time applications on edge devices. Check out the Paper, GitHub Page and Model on Hugging Face. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 90k+ ML SubReddit. Mohammad AsjadAsjad is an intern consultant at Marktechpost. He is persuing B.Tech in mechanical engineering at the Indian Institute of Technology, Kharagpur. Asjad is a Machine learning and deep learning enthusiast who is always researching the applications of machine learning in healthcare.Mohammad Asjadhttps://www.marktechpost.com/author/mohammad_asjad/Meta AI Introduces CATransformers: A Carbon-Aware Machine Learning Framework to Co-Optimize AI Models and Hardware for Sustainable Edge DeploymentMohammad Asjadhttps://www.marktechpost.com/author/mohammad_asjad/Enterprise AI Without GPU Burn: Salesforce’s xGen-small Optimizes for Context, Cost, and PrivacyMohammad Asjadhttps://www.marktechpost.com/author/mohammad_asjad/ServiceNow AI Released Apriel-Nemotron-15b-Thinker: A Compact Yet Powerful Reasoning Model Optimized for Enterprise-Scale Deployment and EfficiencyMohammad Asjadhttps://www.marktechpost.com/author/mohammad_asjad/Researchers from Fudan University Introduce Lorsa: A Sparse Attention Mechanism That Recovers Atomic Attention Units Hidden in Transformer Superposition
    0 Kommentare 0 Anteile
  • Exclusive Talk: Joey Conway of NVIDIA on Llama Nemotron Ultra and Open Source Models

    Today, MarkTechPost had the pleasure of interviewing Joey Conway from NVIDIA to discuss their exciting work on open-source large language models, including Llama Nemotron Ultra & Parakeet.
    Highlights from the interview:

    NVIDIA’s Open Source Powerhouse: Discover how NVIDIA is pushing the boundaries of open-source AI with the release of cutting-edge models like Llama Nemotron Ultra and Parakeet TDT.
    Llama Nemotron Ultra: Smaller Size, Giant Performance: Learn how NVIDIA achieved on-par performance with models twice the size, enabling deployment on a single GPU node. Explore their innovative FFN fusion technique for significant speedups.
    Reasoning on Demand: Uncover the unique “reasoning on/off” feature in Llama Nemotron Ultra, offering unprecedented control for production deployments and cost optimization.
    Revolutionary Speech Recognition with Parakeet TDT: Dive into NVIDIA’s state-of-the-art ASR model that transcribes one hour of audio in one second with only a 6% word error rate – 50 times faster than other open-source alternatives!
    The “How”: Architectural Innovations: Get insights into the advanced architectures and optimizations behind these models, including FFN fusion, limited context attention, and the Token Duration Transducer 
    Democratizing AI with Open Data: Learn about NVIDIA’s commitment to the open-source community through the release of model weights and massive, high-quality datasets for both language and speech.
    Future Directions: Get a sneak peek into NVIDIA’s plans for multilingual support, even smaller edge-optimized models, and advancements in real-time streaming for speech recognition.
    Production-Ready AI: Understand how these models are designed with real-world deployment challenges in mind, focusing on accuracy, efficiency, and cost-effectiveness.

    Jean-Marc Mommessin: Joey, welcome to Marketechpost! We’re thrilled to have you here and to delve into the impressive open-source models NVIDIA has been releasing. To start, could you please introduce yourself and your role at NVIDIA?
    Joey Conway: Hi Jean-Marc, it’s great to be here. I’m Joey Conway, and I work in product management for some of the deep learning software at NVIDIA. Our team focuses on large language models like Nemotron and Llama Nemotron, as well as text-to-speech models such as Parakeet.
    Jean-Marc Mommessin: Wonderful. And you’ve been at NVIDIA for over seven years now, witnessing significant waves of innovation in AI. Let’s talk about your recent release, Llama Nemotron Ultra, a 253 billion parameter model. From what we’ve seen, it delivers performance on par with models like Llama 405B and DeepSeek R1, which are about twice its size. Remarkably, it can run on a single 8x H100 node. What else can you tell us about Llama Nemotron Ultra and what makes it so impressive?
    Joey Conway: We’re big believers in the open-source community and the fantastic work being done there. With Llama Nemotron, our goal was to build upon the existing foundations, particularly Llama, for which we greatly appreciate Meta’s contributions. We also observed significant progress in reasoning within the open community earlier this year. Inspired by this, we wanted to contribute and see how we could enhance Llama, especially for enterprise use cases.
    Our focus was primarily on improving reasoning capabilities and agentic tasks like tool calling and chat. We aimed to take the strengths of the open-source community, enhance them, and then contribute those improvements back.
    Jean-Marc Mommessin: Did you identify specific gaps in existing models that you aimed to address? You mentioned reasoning, but could you provide an example or two of enterprise agentic tasks where you felt there were shortcomings that Llama Nemotron Ultra overcomes?
    Joey Conway : Yes, I think looking back to the beginning of the year, a key challenge in enterprise deployments was handling complex queries requiring significant thought and reflection. These could be multi-step processes or involve substantial calculations and the use of external tools. At that time, there weren’t many strong open-weight models capable of robust reasoning. The progress we’ve seen in the last few months in this area is very encouraging.
    Another critical aspect for enterprises is the ability to accurately call APIs and closely follow instructions in user queries. We wanted to ensure that while we focused on improving reasoning, we didn’t compromise these essential production-level capabilities.
    Furthermore, we often noticed that when both reasoning and instruction following were well-addressed, they typically resided in separate models. Our aim was to simplify this by creating a single model that excels in both. This was the landscape we observed when we started this project around January and February.
    Jean-Marc Mommessin: That makes perfect sense and aligns with what we’re seeing in the industry as well. Now, let’s dive into the “how.” Your paper mentions FFN fusion as a key optimization. Could you elaborate on this technique, starting with a high-level explanation?
    Joey Conway: Absolutely. Our focus on optimization stemmed from the realization that deploying state-of-the-art models often requires a significant deployment footprint. We wanted to optimize this to fit within more common GPU setups.
    We explored various techniques, including our Puzzle neural architecture search. For dense transformer models, particularly those in the Llama family, we discovered a way to reduce or eliminate redundant attention layers. This process aligned the feed-forward networklayers in a sequence, allowing us to explore fusion methods.
    Our fundamental goal on the GPU is to maximize parallel execution. Fusing these aligned FFN layers enables greater parallel computation than was previously possible. By removing redundant layers, we found opportunities to essentially merge or fuse the remaining ones. This is a key example of how we tackle the challenges of running these models at scale. Importantly, this technique often yields greater improvements with larger models, which was beneficial for our Ultra model based on Meta’s Llama 3.1 -405B.
    Jean-Marc Mommessin: And this FFN fusion significantly improves the model’s throughput, achieving notable speedups. If I recall correctly, it’s in the range of 3 to 5x for the Ultra model?
    Joey Conway: That’s right, the speedups for the Ultra model are in that range. Additionally, by reducing the model’s size in terms of weights, we also lowered its memory footprint. This allowed us to utilize a larger KV cache. For Llama Nemotron Ultra, we could fit it onto a 8x H100 80GB setup, which is quite significant as it fits within common node configurations. So, FFN fusion provided both a substantial compute speedup and a reduction in memory usage, enabling us to handle larger context lengths. These are very exciting outcomes for us.
    Jean-Marc Mommessin: Let’s switch gears to data curation. AI data is crucial, and your training pipeline seems very sophisticated. You touched on “instruction following” earlier. Could you elaborate on your data curation process and how you ensured high-quality data, especially considering you leveraged other models in the process?

    Image source: NVIDIA
    Joey Conway: Transparency and openness were key in our approach. We wanted to share as much as possible about our data, techniques, and tooling so the community could understand and even use it themselves. Our primary goal with data curation was to improve accuracy across several key domains, including reasoning tasks like math and coding, as well as non-reasoning tasks like tool calling, instruction following, and chat.
    Our strategy involved curating specific datasets to enhance performance in these areas. Within our supervised fine-tuning process, we differentiated between “reasoning on” and “reasoning off” scenarios. For example, in math and coding, we curated data for simple questions that don’t require complex reasoning, as well as more intricate problems that do. This helps the model learn when and how to apply reasoning.
    A key part of this process was leveraging high-quality models from the community as “experts” in specific domains. For instance, we used DeepSeek R-1 extensively for reasoning-intensive math and coding tasks. For non-reasoning tasks like basic math, coding, chat, and tool calling, we utilized models like Llama and Qwen. Our aim was to blend the best capabilities of these community models into a single model.
    We’ve also made this curated dataset publicly available on Hugging Face, with around 30 million question-answer pairs. This allows the community to explore, use, and build upon our work. We were also excited to see our partner ServiceNow recently announce their apprehend Nemotron model, which was trained using our dataset to enhance their own reasoning capabilities.
    Jean-Marc Mommessin: That’s fantastic that you’re sharing the dataset. Given that you used other models to generate some of this data, what kind of quality checks did you implement to ensure the reliability of the training pairs?
    Joey Conway: Data quality was absolutely paramount. Since we were generating a significant portion of the data using other models, we implemented a rigorous multi-layered quality assurance process.
    First, for each expert model used to generate data in a specific domain, we would generate multiple candidate responses for the same prompt. Then, we employed a separate set of “critic” models to evaluate these candidates based on correctness, coherence, and adherence to the prompt.
    Second, we implemented a scoring mechanism. Each generated question-answer pair received a quality score based on the critic model’s evaluation. We set a high threshold, and any pair that didn’t meet this standard was discarded.
    Third, human review was integrated at various stages. Our team of data scientists and engineers manually inspected samples of the generated data to identify any systematic errors, biases, or instances of hallucination. This human oversight was crucial for catching nuances that automated systems might miss.
    Fourth, we focused on the diversity of the generated data. We wanted to ensure we weren’t just getting variations of the same types of questions and answers. We implemented strategies to encourage the expert models to generate a broad range of examples within each domain.
    Finally, after training Llama Nemotron Ultra on this curated data, we conducted extensive evaluations against benchmark datasets and in real-world use cases. This feedback loop helped us further refine our data generation and filtering techniques.
    So, it was a comprehensive approach involving expert generation, automated criticism and scoring, human review, diversity checks, and rigorous downstream evaluation to ensure the high quality of our training data.
    Jean-Marc Mommessin: The quality of the synthetic data is so important. Could you elaborate on the stages you take to ensure high accuracy when generating this data?
    Joey Conway: Absolutely. When doing synthetic data generation, there are a few key stages to ensure high accuracy. The first is the prompts – the seed data and how we prompt the model. The second is the quality of the responses.
    On the prompting side, we focus on prompting models where we believe they excel. For example, we might use Llama for chat-related prompts but avoid using a non-reasoning model for math. It’s crucial to align the prompts with the core strengths of the model.
    For vetting the responses, we invest time in both human manual review and automated methods. Going forward, we anticipate increasing our use of verifiers and reward models, similar to what we’ve done on the Reinforcement Learningside.
    The reason we’ve open-sourced much of this is that there’s a lot of nuance involved, and we wanted the community to engage with these challenges. Enterprises like ServiceNow have specific goals, and some of our data might be more or less useful to them. By making it available, they can vet it themselves. We also provide tools like classifier models to help categorize content, such as news or sports, allowing users to make informed decisions about the data blends they use for training.
    Jean-Marc Mommessin: Perfect. Is there anything else you’d like to highlight regarding this pipeline?
    Joey Conway: Yes, I’d like to touch on the Reinforcement Learningaspect. Following the supervised fine-tuning stage, where we enhanced core skills, we’ve just begun to explore the potential of RL with Nemotron. We believe this will be a significant area of future development.
    What’s exciting about RL is that its effectiveness is largely tied to the available compute time. The more time we invest, the better the model becomes at specific tasks. In our RL stages, we’ve developed methods to automate the process of asking the model a question, grading its answer, and providing feedback to allow it to learn and improve.
    You can see on the slide the domains where we’ve applied this: scientific reasoning, instruction following, and chat. If you look at the leaderboards, you’ll see that even with new models emerging, we’ve maintained a strong position in these areas, largely due to the effectiveness of RL in achieving top-tier accuracy. We’re optimistic that we’ll see more of this in the community, with more discussion and publication of techniques and data. We’ve started sharing some of our work in this area and will have much more to come in the next three to six months.
    Jean-Marc Mommessin: You mentioned RL and instruction following, which ties back to the beginning of our conversation. It seems like you’ve come full circle here.
    Joey Conway: Exactly. The exciting aspect here is automating the feedback loop wherever possible. For chat, we published a fine-tuned reward model last fall. Those who followed our work might recall that our Llama Nemotron model topped the chat leaderboards then. This was because the reward model provides an automated way to teach the original model whether its responses are good or bad. It essentially grades responses based on helpfulness, conciseness, verbosity, groundedness, and similar factors. This granular feedback per generated response allows the model to improve significantly, often more so than through supervised fine-tuning alone, which typically involves a few passes without a continuous feedback loop.
    Similarly, for instruction following, we use a verifier and a dataset to teach the model whether it followed instructions well or needs to try again. We’re eager to expand this approach to more domains. We’ve already published datasets related to coding and math since the release of this model a few weeks ago, and these have become popular on Hugging Face. I anticipate significant growth in this area within the community.
    Jean-Marc Mommessin: Alright, so one of the big innovations here, and you touched upon it, but I want to emphasize it, is the ability to toggle reasoning on and off via the system prompt. This is quite unique, and I’m sure many will follow suit. Could you expand on the idea behind this, how you see it applying to agents and beyond, its value, and the key challenges in implementing it?
    Joey Conway: The reasoning on and off capability was a core goal from the outset. We observed that models in the community often excelled in either reasoning or non-reasoning tasks, and we wanted to simplify deployment by having a single model that could handle both.
    We had to determine the best way to teach the model when to reason and when not to, while also providing enterprises with explicit control, as they often have deeper domain knowledge than we do. The motivation behind this is that reasoning generates significantly more tokens, which can lead to higher latency and cost. While crucial for solving complex problems, it’s not always necessary. We wanted to give enterprises the control to balance accuracy with latency and cost, allowing them to decide when to employ reasoning and when to opt for faster, less computationally intensive responses.
    Initially, we weren’t sure how to achieve this, as it hadn’t been widely implemented in the community. Our approach in the supervised fine-tuning stage was to explicitly teach the model by presenting the same question with two different answers: one with detailed reasoning and one without. This essentially doubled our dataset for this specific purpose. However, the outcome is a single model where users can simply include “use detailed thinking on” or “use detailed thinking off” in the prompt to control the model’s reasoning process.
    On the training side, this required more effort to teach the model this distinction. What we have today is essentially a v1, and I expect others will follow this approach. We’re also excited about future developments, such as time or token limits for reasoning and more granular controls. I’m optimistic that we’ll see further breakthroughs in this area within the next six to nine months, as the problem-solving power of reasoning is significant, but it comes with trade-offs that the community will continue to refine.
    Jean-Marc Mommessin: We all know that the real test comes in production. Production environments are sensitive to latency, cost, and while accuracy and reasoning are vital, excessive reasoning can lead to scalability issues and increased latency. The flexibility you’ve introduced is fantastic, and I can see numerous production use cases that will greatly benefit from the ability to control reasoning on a per-query basis.
    So, when you were developing this model, you aimed to balance accuracy and efficiency. Could you share some insights into how you made these trade-offs, the timeline for building the model and the team involved, and how you determined the optimal compromise between these two critical factors?
    Joey Conway: Balancing accuracy and efficiency is always a challenge. Our initial goal was to achieve both, which is a difficult undertaking. We started with the “Super” model, which was the most recent Llama 3.1 70B release from Meta, as our baseline for accuracy. We weren’t sure if we could simultaneously improve accuracy and reduce the model size.
    We found that through our training techniques and distillation process, we could indeed boost accuracy. We even released an initial checkpoint reflecting this. However, we wanted to go further by incorporating strong reasoning capabilities, aiming for state-of-the-art reasoning scores. This is where the SFT and RL stages came in, which required significant time for synthetic data generation since this type of data didn’t exist.
    During training, we carefully considered the number of epochs for each skill and continuously measured accuracy. Our goal was to improve performance across all six key areas rather than excelling in just a couple. This balancing act took more time as we experimented to find the right combinations. However, we felt it was crucial to ensure world-class performance in these six enterprise-relevant scenarios, including chat and instruction following.
    For areas like MMLU, we focused on maintaining performance and preventing regression rather than actively trying to improve scores. So, there were definitely priorities and trade-offs involved. Ultimately, we believe these were the right focus areas for our enterprise customers.
    Jean-Marc Mommessin: You are releasing this model family as part of the open-source community. We’ve discussed the gaps you aimed to address and the unique reasoning on/off feature for production scalability. Could you share your thoughts on how NVIDIA and your team view the role of these models within the broader open-source and LLM ecosystem, especially given your work building upon the Llama base?
    Joey Conway: NVIDIA has a long history of contributing models to the open-source community. What excites us about Llama is its strong traction with enterprise customers. While NVIDIA Research publishes extensively across various domains, our goal with Llama Nemotron was to build upon Llama’s momentum in enterprise adoption by focusing narrowly on specific areas. The base Llama models already cover many things exceptionally well, so we saw an opportunity to build on top of that and be very targeted in our enhancements.
    The recent LlamaCon event and Meta’s announcements sound very promising, and we’re excited about Llama 4 and the ongoing work there. Moving forward, we anticipate continuing to identify specific areas where we can add significant value, while Meta continues to build excellent general-purpose models suitable for enterprise production.
    From our perspective, reasoning will likely remain a key focus, and we’re also excited about Meta’s advancements in this area. Tool calling, instruction following, and chat are also areas we’ll continue to develop. One area we’re particularly interested in exploring is multilingual capabilities. For large enterprises, supporting multiple languages is crucial. While many models handle individual languages well, we aim to focus on a few key languages and ensure world-class accuracy for reasoning, tool calling, and chat within those. This is likely the next major area of expansion for us, beyond the exciting developments in model architectures like Llama 4’s new MoE architecture, which we’re also keen to explore for potential distillation and optimization for NVIDIA GPUs. So, there’s a lot of exciting work ahead.
    Jean-Marc Mommessin: When you say multilingual, are you thinking of supporting a broad range, like 50 languages, or a more focused set, perhaps around 5 or 10 initially, given the benchmark challenges you mentioned?
    Joey Conway: We’ll probably start with a more focused set, perhaps around 5 to 10 languages. The challenge is that the community currently lacks comprehensive benchmarks for tasks like reasoning or tool calling across a wide variety of languages. As we develop these multilingual models, we’re also having to create evaluation data simultaneously, which takes time. If those benchmarks were readily available, the process would be smoother. However, we see this as an exciting challenge. Our initial focus will likely be on a smaller set of languages where we can establish strong performance, given the current limitations in community-wide benchmarks.
    Jean-Marc Mommessin: Let’s shift gears and talk about another state-of-the-art open-source model you recently released: Parakeet TDT 0.6 B parameters, V2. This model has set a new standard for automatic speech recognition, transcribing one hour of audio in just one second. That’s 50 times faster than other open-source ASR models, and remarkably, it achieves only a 6% word error rate. This is truly impressive. What else would you like to highlight about this model before we discuss the “how” behind its incredible performance?
    Joey Conway: It’s worth noting that NVIDIA has been working on ASR models for a long time, even before I joined. We’ve also released many open models in this space over the years. The teams working on this are exceptional, and they consistently strive to balance accuracy with latency and throughput. Parakeet V2 is the latest in this line of high-performance models from NVIDIA.
    Jean-Marc Mommessin: It sounds like the advancements will keep coming. So, let’s delve into how you achieved this remarkable performance with Parakeet TDT. What kind of architecture did you use? I understand it’s based on a Fast Conformer architecture with specific optimizations like 8x depth-wise separable convolutional downsampling and limited context attention. Could you explain how you arrived at this approach and whether these optimizations primarily enhance speed and throughput or if they also contribute to accuracy and the ability to process long audio segments like a full hour in one shot?
    Joey Conway: Yes, we’ve explored various architectures for ASR over the years, and the Conformer architecture, originally from Google, has shown great promise. Our goal with Parakeet TDT was to take the Conformer architecture and make it significantly more efficient and faster without sacrificing quality.
    We’ve implemented several key optimizations. 
    First, as you mentioned, the depth-wise separable convolution downsampling. At the input stage, we significantly downsample the audio, which reduces the computational cost and memory requirements for processing.
    Second is the limited context attention. By focusing on smaller, overlapping chunks of audio, we can maintain accuracy while achieving a speedup in processing.
    Third, on the encoder side, we also utilize a sliding window attention technique, which allows us to process longer audio files without having to split them into shorter segments. This is crucial for handling long-form audio like a full hour in a single pass.
    Beyond the Conformer architecture, Parakeet TDT incorporates a Token and Duration Transducer. Traditional Recurrent Neural Networktransducer technology processes audio frame by frame. What we’ve done with TDT is enable the model to predict both the tokens and the expected duration of those tokens. This allows it to make decisions to skip over redundant frames, significantly speeding up the transcription process. This TDT innovation alone contributes to around a 1.5 to 2x speedup. So, there’s a combination of architectural choices and specific optimizations that contribute to Parakeet TDT’s impressive speed and accuracy.
    Jean-Marc Mommessin: I want to go back to one or two of those. Those are amazing, frankly. The speed increase is remarkable.
    Joey Conway: Yes, and we have another technique called a label looping algorithm. Essentially, when we’re doing batch inference, this algorithm allows us to advance the tokens independently for different samples. This separation of the workflow enables us to sweep and loop over frames and labels more efficiently, significantly speeding up the decoding process.
    Lastly, on the decoder side, we’ve moved some of the computation into CUDA graphs, which is a more efficient way to run many small kernels. This optimization alone provided around a 3x speed boost. So, as you can see with TDT models, we’ve been able to achieve speeds comparable to Connectionist Temporal Classificationdecoders, which are also known for their speed, while maintaining high accuracy. Our overall theme is always to balance speed improvements with maintaining or even enhancing accuracy. Techniques like CTC decoders have been around for a while and are fast but might not be as accurate. It really depends on the use case, but we’re always striving for that balance.
    Jean-Marc Mommessin: Can we revisit the limited context attention? Do you see this technique having broader applications in other areas down the line?
    Joey Conway: Yes, I believe so. Patterns like the sliding window attention are already used in other areas, such as LLMs. Our research teams are constantly experimenting, looking at successful techniques from different domains, and trying to apply them in new ways. Interestingly, some of the researchers who worked on Parakeet TDT also work on Llama Nemotron, so there’s a cross-pollination of ideas. I do expect that some of these techniques will find broader applications going forward. We also anticipate further improvements to TDT and the Conformer architecture, as we’ve been working on them for several years now. I don’t see these core technologies going away anytime soon; we’ll likely continue to refine them.
    Jean-Marc Mommessin: Leaving the TDT aside, do you see other potential applications for the Token and Duration Transducer concept in other domains?
    Joey Conway: That’s a good question. I’m not immediately seeing a direct application of the TDT concept outside of ASR. Its history is rooted in RNNs and RNN transducers, which have primarily been used in speech recognition. However, some of the underlying techniques we’ve applied to it, like using CUDA graphs for optimizing kernel execution, are general techniques that we use whenever we identify bottlenecks in a model’s pipeline. So, while the TDT itself might be domain-specific, some of the optimization strategies we’ve employed could certainly translate to other areas, including large language models.
    Jean-Marc Mommessin: let’s talk about data. AI data is always a key topic. How do you ensure that the data used to train Parakeet TDT is diverse enough to handle various accents, dialects, vocal ranges, pitches, and noisy background conditions, which often negatively impact ASR performance?
    Joey Conway: You’re absolutely right. As humans, we naturally filter out accents and background noise to understand speech. However, deep learning models are only as good as the data they’re trained on. Early on, limited data for specific accents or languages resulted in poor performance for those variations. What might have initially seemed like edge cases have become increasingly common, highlighting the need for more representative data.
    We’ve invested significant effort in curating our datasets to reflect this real-world diversity. We use techniques like classifiers to analyze our data and understand the distributions of accents, dialects, and acoustic conditions. We’ve worked with customers like YUM! Brands, who have drive-through use cases with significant highway noise, illustrating the importance of training the model to handle such challenging environments. Ensuring the right blend and distribution of these conditions in our training data is crucial for the model’s robustness.
    I’m also excited to announce that we plan to open-source a substantial speech dataset, around 100,000 hours, where we’ve meticulously performed this kind of curation. This dataset will include variations in sound levels, signal-to-noise ratios, background noise types, and even telephone audio formats relevant for call centers. Our goal is to provide the community with high-quality, diverse data that enables models to perform well across a wide range of real-world scenarios.
    Jean-Marc Mommessin: That’s fantastic news about the open-sourcing of the speech dataset! My final question regarding the Parakeet family: you currently have the 600 million and 1.1 billion parameter models. How do you envision future development for this family? What are the potential directions?
    Joey Conway: We’re considering development along two main dimensions: model size and the number of supported languages. In terms of size, we’ve released models at the smaller and mid-range to demonstrate the potential, similar to our approach with Llama Nemotron Super. We plan to explore larger models, potentially around 2 billion parameters, which we anticipate will handle even more languages and dialects.
    On the smaller end, we’re even considering models down to around 50 million parameters. The motivation here is to address use cases at the edge where a smaller footprint is necessary, such as enabling real-time audio processing for robots in noisy environments. We’ll be exploring the right trade-offs for such applications.
    Technologically, we plan to work on streaming capabilities for TDT. Currently, much of the processing is done in an offline batch mode, but we want to enable real-time, live transcription. And as mentioned, we’re excited about releasing the large, curated speech dataset.
    Finally, for those looking to deploy these models in production, we recommend exploring techniques like word boosting, which allows for customization of text normalization to include domain-specific terms and acronyms. We aim to provide a wide range of options for users to get started and tailor the models to their specific needs.
    Jean-Marc Mommessin: I’m very familiar with the NVIDIA Orin platform. Would these Parakeet models currently run on NVIDIA Orin?
    Joey Conway: Yes, I believe the 0.6 billion parameter model likely would run on Orin. I would need to double-check the exact specifications, but I’m quite confident it’s feasible.
    Jean-Marc Mommessin: Orin packs a significant punch. I especially love the robotics use case you mentioned. While there’s been a lot of focus on robot vision, the ability to hear and understand quickly is equally crucial, especially for safety. A model that’s 50 times faster and highly accurate in understanding another modality seems like a perfect fit for robotics.
    Joey Conway: Yes, and the slight hesitation I had earlier was due to the understanding that in robotics, there are often multiple models running simultaneously, including vision models. So, resource allocation is a consideration. However, our push towards smaller, more efficient models is precisely to address these kinds of multi-modal edge computing scenarios. The low latency and real-time processing capabilities of Parakeet are indeed very beneficial for enabling robots to react quickly and safely to auditory cues.
    Jean-Marc Mommessin: Anything else you’d like to add as a final thought on the Llama Nemotron Ultra and Parakeet families? They’re both open-source, fast, high-throughput, cost-efficient, and run on smaller footprints – are these the key takeaways?
    Joey Conway: Yes, that’s a great summary. Those were the core objectives we set out to achieve. We aimed for state-of-the-art accuracy, optimized footprints for efficient GPU utilization in terms of latency and throughput, and a commitment to open-sourcing everything to empower the community. We’ve strived to be as community-friendly as possible by releasing datasets, using permissive licenses, and making it easy for people to experiment. We’re eager to see the community’s feedback and the innovative applications they build upon our work. We’re also looking forward to learning from their experiences.
    Jean-Marc Mommessin: Where are all these models and datasets available?
    Joey Conway: Everything we’ve published is on Hugging Face – the models and the datasets. The software stack to run them comes from NVIDIA and is available on NGC, our content repository. Much of the underlying software is also open-source and can be found on GitHub. We also provide pip wheels for easier installation. The Nemo framework is the central hub for much of this software stack, whether you want to run the models or fine-tune them.
    We’ve tried to make it as user-friendly as possible. We use the same software internally to build the models, so it should be relatively straightforward for others to pick up and deploy as well.
    Jean-Marc Mommessin: Well, Joey, this has been fantastic. I’m continually impressed by NVIDIA’s commitment to giving back to the community with state-of-the-art models that will undoubtedly find their way into production. Thank you so much for your time and insights. I look forward to our next conversation.
    Joey Conway: Thank you, Jean-Marc. It was my pleasure, and we appreciate the opportunity. 
    Jean-marc MommessinJean-marc is a successful AI business executive .He leads and accelerates growth for AI powered solutions and started a computer vision company in 2006. He is a recognized speaker at AI conferences and has an MBA from Stanford.Jean-marc Mommessinhttps://www.marktechpost.com/author/jean-marc0000677/NVIDIA AI Releases HOVER: A Breakthrough AI for Versatile Humanoid Control in RoboticsJean-marc Mommessinhttps://www.marktechpost.com/author/jean-marc0000677/Speech-to-Speech Foundation Models Pave the Way for Seamless Multilingual InteractionsJean-marc Mommessinhttps://www.marktechpost.com/author/jean-marc0000677/Lowe’s Revolutionizes Retail with AI: From Personalized Shopping to Proactive Customer AssistanceJean-marc Mommessinhttps://www.marktechpost.com/author/jean-marc0000677/Google DeepMind’s Gemini Robotics: Unleashing Embodied AI with Zero-Shot Control and Enhanced Spatial Reasoning
    #exclusive #talk #joey #conway #nvidia
    Exclusive Talk: Joey Conway of NVIDIA on Llama Nemotron Ultra and Open Source Models
    Today, MarkTechPost had the pleasure of interviewing Joey Conway from NVIDIA to discuss their exciting work on open-source large language models, including Llama Nemotron Ultra & Parakeet. Highlights from the interview: NVIDIA’s Open Source Powerhouse: Discover how NVIDIA is pushing the boundaries of open-source AI with the release of cutting-edge models like Llama Nemotron Ultra and Parakeet TDT. Llama Nemotron Ultra: Smaller Size, Giant Performance: Learn how NVIDIA achieved on-par performance with models twice the size, enabling deployment on a single GPU node. Explore their innovative FFN fusion technique for significant speedups. Reasoning on Demand: Uncover the unique “reasoning on/off” feature in Llama Nemotron Ultra, offering unprecedented control for production deployments and cost optimization. Revolutionary Speech Recognition with Parakeet TDT: Dive into NVIDIA’s state-of-the-art ASR model that transcribes one hour of audio in one second with only a 6% word error rate – 50 times faster than other open-source alternatives! The “How”: Architectural Innovations: Get insights into the advanced architectures and optimizations behind these models, including FFN fusion, limited context attention, and the Token Duration Transducer  Democratizing AI with Open Data: Learn about NVIDIA’s commitment to the open-source community through the release of model weights and massive, high-quality datasets for both language and speech. Future Directions: Get a sneak peek into NVIDIA’s plans for multilingual support, even smaller edge-optimized models, and advancements in real-time streaming for speech recognition. Production-Ready AI: Understand how these models are designed with real-world deployment challenges in mind, focusing on accuracy, efficiency, and cost-effectiveness. Jean-Marc Mommessin: Joey, welcome to Marketechpost! We’re thrilled to have you here and to delve into the impressive open-source models NVIDIA has been releasing. To start, could you please introduce yourself and your role at NVIDIA? Joey Conway: Hi Jean-Marc, it’s great to be here. I’m Joey Conway, and I work in product management for some of the deep learning software at NVIDIA. Our team focuses on large language models like Nemotron and Llama Nemotron, as well as text-to-speech models such as Parakeet. Jean-Marc Mommessin: Wonderful. And you’ve been at NVIDIA for over seven years now, witnessing significant waves of innovation in AI. Let’s talk about your recent release, Llama Nemotron Ultra, a 253 billion parameter model. From what we’ve seen, it delivers performance on par with models like Llama 405B and DeepSeek R1, which are about twice its size. Remarkably, it can run on a single 8x H100 node. What else can you tell us about Llama Nemotron Ultra and what makes it so impressive? Joey Conway: We’re big believers in the open-source community and the fantastic work being done there. With Llama Nemotron, our goal was to build upon the existing foundations, particularly Llama, for which we greatly appreciate Meta’s contributions. We also observed significant progress in reasoning within the open community earlier this year. Inspired by this, we wanted to contribute and see how we could enhance Llama, especially for enterprise use cases. Our focus was primarily on improving reasoning capabilities and agentic tasks like tool calling and chat. We aimed to take the strengths of the open-source community, enhance them, and then contribute those improvements back. Jean-Marc Mommessin: Did you identify specific gaps in existing models that you aimed to address? You mentioned reasoning, but could you provide an example or two of enterprise agentic tasks where you felt there were shortcomings that Llama Nemotron Ultra overcomes? Joey Conway : Yes, I think looking back to the beginning of the year, a key challenge in enterprise deployments was handling complex queries requiring significant thought and reflection. These could be multi-step processes or involve substantial calculations and the use of external tools. At that time, there weren’t many strong open-weight models capable of robust reasoning. The progress we’ve seen in the last few months in this area is very encouraging. Another critical aspect for enterprises is the ability to accurately call APIs and closely follow instructions in user queries. We wanted to ensure that while we focused on improving reasoning, we didn’t compromise these essential production-level capabilities. Furthermore, we often noticed that when both reasoning and instruction following were well-addressed, they typically resided in separate models. Our aim was to simplify this by creating a single model that excels in both. This was the landscape we observed when we started this project around January and February. Jean-Marc Mommessin: That makes perfect sense and aligns with what we’re seeing in the industry as well. Now, let’s dive into the “how.” Your paper mentions FFN fusion as a key optimization. Could you elaborate on this technique, starting with a high-level explanation? Joey Conway: Absolutely. Our focus on optimization stemmed from the realization that deploying state-of-the-art models often requires a significant deployment footprint. We wanted to optimize this to fit within more common GPU setups. We explored various techniques, including our Puzzle neural architecture search. For dense transformer models, particularly those in the Llama family, we discovered a way to reduce or eliminate redundant attention layers. This process aligned the feed-forward networklayers in a sequence, allowing us to explore fusion methods. Our fundamental goal on the GPU is to maximize parallel execution. Fusing these aligned FFN layers enables greater parallel computation than was previously possible. By removing redundant layers, we found opportunities to essentially merge or fuse the remaining ones. This is a key example of how we tackle the challenges of running these models at scale. Importantly, this technique often yields greater improvements with larger models, which was beneficial for our Ultra model based on Meta’s Llama 3.1 -405B. Jean-Marc Mommessin: And this FFN fusion significantly improves the model’s throughput, achieving notable speedups. If I recall correctly, it’s in the range of 3 to 5x for the Ultra model? Joey Conway: That’s right, the speedups for the Ultra model are in that range. Additionally, by reducing the model’s size in terms of weights, we also lowered its memory footprint. This allowed us to utilize a larger KV cache. For Llama Nemotron Ultra, we could fit it onto a 8x H100 80GB setup, which is quite significant as it fits within common node configurations. So, FFN fusion provided both a substantial compute speedup and a reduction in memory usage, enabling us to handle larger context lengths. These are very exciting outcomes for us. Jean-Marc Mommessin: Let’s switch gears to data curation. AI data is crucial, and your training pipeline seems very sophisticated. You touched on “instruction following” earlier. Could you elaborate on your data curation process and how you ensured high-quality data, especially considering you leveraged other models in the process? Image source: NVIDIA Joey Conway: Transparency and openness were key in our approach. We wanted to share as much as possible about our data, techniques, and tooling so the community could understand and even use it themselves. Our primary goal with data curation was to improve accuracy across several key domains, including reasoning tasks like math and coding, as well as non-reasoning tasks like tool calling, instruction following, and chat. Our strategy involved curating specific datasets to enhance performance in these areas. Within our supervised fine-tuning process, we differentiated between “reasoning on” and “reasoning off” scenarios. For example, in math and coding, we curated data for simple questions that don’t require complex reasoning, as well as more intricate problems that do. This helps the model learn when and how to apply reasoning. A key part of this process was leveraging high-quality models from the community as “experts” in specific domains. For instance, we used DeepSeek R-1 extensively for reasoning-intensive math and coding tasks. For non-reasoning tasks like basic math, coding, chat, and tool calling, we utilized models like Llama and Qwen. Our aim was to blend the best capabilities of these community models into a single model. We’ve also made this curated dataset publicly available on Hugging Face, with around 30 million question-answer pairs. This allows the community to explore, use, and build upon our work. We were also excited to see our partner ServiceNow recently announce their apprehend Nemotron model, which was trained using our dataset to enhance their own reasoning capabilities. Jean-Marc Mommessin: That’s fantastic that you’re sharing the dataset. Given that you used other models to generate some of this data, what kind of quality checks did you implement to ensure the reliability of the training pairs? Joey Conway: Data quality was absolutely paramount. Since we were generating a significant portion of the data using other models, we implemented a rigorous multi-layered quality assurance process. First, for each expert model used to generate data in a specific domain, we would generate multiple candidate responses for the same prompt. Then, we employed a separate set of “critic” models to evaluate these candidates based on correctness, coherence, and adherence to the prompt. Second, we implemented a scoring mechanism. Each generated question-answer pair received a quality score based on the critic model’s evaluation. We set a high threshold, and any pair that didn’t meet this standard was discarded. Third, human review was integrated at various stages. Our team of data scientists and engineers manually inspected samples of the generated data to identify any systematic errors, biases, or instances of hallucination. This human oversight was crucial for catching nuances that automated systems might miss. Fourth, we focused on the diversity of the generated data. We wanted to ensure we weren’t just getting variations of the same types of questions and answers. We implemented strategies to encourage the expert models to generate a broad range of examples within each domain. Finally, after training Llama Nemotron Ultra on this curated data, we conducted extensive evaluations against benchmark datasets and in real-world use cases. This feedback loop helped us further refine our data generation and filtering techniques. So, it was a comprehensive approach involving expert generation, automated criticism and scoring, human review, diversity checks, and rigorous downstream evaluation to ensure the high quality of our training data. Jean-Marc Mommessin: The quality of the synthetic data is so important. Could you elaborate on the stages you take to ensure high accuracy when generating this data? Joey Conway: Absolutely. When doing synthetic data generation, there are a few key stages to ensure high accuracy. The first is the prompts – the seed data and how we prompt the model. The second is the quality of the responses. On the prompting side, we focus on prompting models where we believe they excel. For example, we might use Llama for chat-related prompts but avoid using a non-reasoning model for math. It’s crucial to align the prompts with the core strengths of the model. For vetting the responses, we invest time in both human manual review and automated methods. Going forward, we anticipate increasing our use of verifiers and reward models, similar to what we’ve done on the Reinforcement Learningside. The reason we’ve open-sourced much of this is that there’s a lot of nuance involved, and we wanted the community to engage with these challenges. Enterprises like ServiceNow have specific goals, and some of our data might be more or less useful to them. By making it available, they can vet it themselves. We also provide tools like classifier models to help categorize content, such as news or sports, allowing users to make informed decisions about the data blends they use for training. Jean-Marc Mommessin: Perfect. Is there anything else you’d like to highlight regarding this pipeline? Joey Conway: Yes, I’d like to touch on the Reinforcement Learningaspect. Following the supervised fine-tuning stage, where we enhanced core skills, we’ve just begun to explore the potential of RL with Nemotron. We believe this will be a significant area of future development. What’s exciting about RL is that its effectiveness is largely tied to the available compute time. The more time we invest, the better the model becomes at specific tasks. In our RL stages, we’ve developed methods to automate the process of asking the model a question, grading its answer, and providing feedback to allow it to learn and improve. You can see on the slide the domains where we’ve applied this: scientific reasoning, instruction following, and chat. If you look at the leaderboards, you’ll see that even with new models emerging, we’ve maintained a strong position in these areas, largely due to the effectiveness of RL in achieving top-tier accuracy. We’re optimistic that we’ll see more of this in the community, with more discussion and publication of techniques and data. We’ve started sharing some of our work in this area and will have much more to come in the next three to six months. Jean-Marc Mommessin: You mentioned RL and instruction following, which ties back to the beginning of our conversation. It seems like you’ve come full circle here. Joey Conway: Exactly. The exciting aspect here is automating the feedback loop wherever possible. For chat, we published a fine-tuned reward model last fall. Those who followed our work might recall that our Llama Nemotron model topped the chat leaderboards then. This was because the reward model provides an automated way to teach the original model whether its responses are good or bad. It essentially grades responses based on helpfulness, conciseness, verbosity, groundedness, and similar factors. This granular feedback per generated response allows the model to improve significantly, often more so than through supervised fine-tuning alone, which typically involves a few passes without a continuous feedback loop. Similarly, for instruction following, we use a verifier and a dataset to teach the model whether it followed instructions well or needs to try again. We’re eager to expand this approach to more domains. We’ve already published datasets related to coding and math since the release of this model a few weeks ago, and these have become popular on Hugging Face. I anticipate significant growth in this area within the community. Jean-Marc Mommessin: Alright, so one of the big innovations here, and you touched upon it, but I want to emphasize it, is the ability to toggle reasoning on and off via the system prompt. This is quite unique, and I’m sure many will follow suit. Could you expand on the idea behind this, how you see it applying to agents and beyond, its value, and the key challenges in implementing it? Joey Conway: The reasoning on and off capability was a core goal from the outset. We observed that models in the community often excelled in either reasoning or non-reasoning tasks, and we wanted to simplify deployment by having a single model that could handle both. We had to determine the best way to teach the model when to reason and when not to, while also providing enterprises with explicit control, as they often have deeper domain knowledge than we do. The motivation behind this is that reasoning generates significantly more tokens, which can lead to higher latency and cost. While crucial for solving complex problems, it’s not always necessary. We wanted to give enterprises the control to balance accuracy with latency and cost, allowing them to decide when to employ reasoning and when to opt for faster, less computationally intensive responses. Initially, we weren’t sure how to achieve this, as it hadn’t been widely implemented in the community. Our approach in the supervised fine-tuning stage was to explicitly teach the model by presenting the same question with two different answers: one with detailed reasoning and one without. This essentially doubled our dataset for this specific purpose. However, the outcome is a single model where users can simply include “use detailed thinking on” or “use detailed thinking off” in the prompt to control the model’s reasoning process. On the training side, this required more effort to teach the model this distinction. What we have today is essentially a v1, and I expect others will follow this approach. We’re also excited about future developments, such as time or token limits for reasoning and more granular controls. I’m optimistic that we’ll see further breakthroughs in this area within the next six to nine months, as the problem-solving power of reasoning is significant, but it comes with trade-offs that the community will continue to refine. Jean-Marc Mommessin: We all know that the real test comes in production. Production environments are sensitive to latency, cost, and while accuracy and reasoning are vital, excessive reasoning can lead to scalability issues and increased latency. The flexibility you’ve introduced is fantastic, and I can see numerous production use cases that will greatly benefit from the ability to control reasoning on a per-query basis. So, when you were developing this model, you aimed to balance accuracy and efficiency. Could you share some insights into how you made these trade-offs, the timeline for building the model and the team involved, and how you determined the optimal compromise between these two critical factors? Joey Conway: Balancing accuracy and efficiency is always a challenge. Our initial goal was to achieve both, which is a difficult undertaking. We started with the “Super” model, which was the most recent Llama 3.1 70B release from Meta, as our baseline for accuracy. We weren’t sure if we could simultaneously improve accuracy and reduce the model size. We found that through our training techniques and distillation process, we could indeed boost accuracy. We even released an initial checkpoint reflecting this. However, we wanted to go further by incorporating strong reasoning capabilities, aiming for state-of-the-art reasoning scores. This is where the SFT and RL stages came in, which required significant time for synthetic data generation since this type of data didn’t exist. During training, we carefully considered the number of epochs for each skill and continuously measured accuracy. Our goal was to improve performance across all six key areas rather than excelling in just a couple. This balancing act took more time as we experimented to find the right combinations. However, we felt it was crucial to ensure world-class performance in these six enterprise-relevant scenarios, including chat and instruction following. For areas like MMLU, we focused on maintaining performance and preventing regression rather than actively trying to improve scores. So, there were definitely priorities and trade-offs involved. Ultimately, we believe these were the right focus areas for our enterprise customers. Jean-Marc Mommessin: You are releasing this model family as part of the open-source community. We’ve discussed the gaps you aimed to address and the unique reasoning on/off feature for production scalability. Could you share your thoughts on how NVIDIA and your team view the role of these models within the broader open-source and LLM ecosystem, especially given your work building upon the Llama base? Joey Conway: NVIDIA has a long history of contributing models to the open-source community. What excites us about Llama is its strong traction with enterprise customers. While NVIDIA Research publishes extensively across various domains, our goal with Llama Nemotron was to build upon Llama’s momentum in enterprise adoption by focusing narrowly on specific areas. The base Llama models already cover many things exceptionally well, so we saw an opportunity to build on top of that and be very targeted in our enhancements. The recent LlamaCon event and Meta’s announcements sound very promising, and we’re excited about Llama 4 and the ongoing work there. Moving forward, we anticipate continuing to identify specific areas where we can add significant value, while Meta continues to build excellent general-purpose models suitable for enterprise production. From our perspective, reasoning will likely remain a key focus, and we’re also excited about Meta’s advancements in this area. Tool calling, instruction following, and chat are also areas we’ll continue to develop. One area we’re particularly interested in exploring is multilingual capabilities. For large enterprises, supporting multiple languages is crucial. While many models handle individual languages well, we aim to focus on a few key languages and ensure world-class accuracy for reasoning, tool calling, and chat within those. This is likely the next major area of expansion for us, beyond the exciting developments in model architectures like Llama 4’s new MoE architecture, which we’re also keen to explore for potential distillation and optimization for NVIDIA GPUs. So, there’s a lot of exciting work ahead. Jean-Marc Mommessin: When you say multilingual, are you thinking of supporting a broad range, like 50 languages, or a more focused set, perhaps around 5 or 10 initially, given the benchmark challenges you mentioned? Joey Conway: We’ll probably start with a more focused set, perhaps around 5 to 10 languages. The challenge is that the community currently lacks comprehensive benchmarks for tasks like reasoning or tool calling across a wide variety of languages. As we develop these multilingual models, we’re also having to create evaluation data simultaneously, which takes time. If those benchmarks were readily available, the process would be smoother. However, we see this as an exciting challenge. Our initial focus will likely be on a smaller set of languages where we can establish strong performance, given the current limitations in community-wide benchmarks. Jean-Marc Mommessin: Let’s shift gears and talk about another state-of-the-art open-source model you recently released: Parakeet TDT 0.6 B parameters, V2. This model has set a new standard for automatic speech recognition, transcribing one hour of audio in just one second. That’s 50 times faster than other open-source ASR models, and remarkably, it achieves only a 6% word error rate. This is truly impressive. What else would you like to highlight about this model before we discuss the “how” behind its incredible performance? Joey Conway: It’s worth noting that NVIDIA has been working on ASR models for a long time, even before I joined. We’ve also released many open models in this space over the years. The teams working on this are exceptional, and they consistently strive to balance accuracy with latency and throughput. Parakeet V2 is the latest in this line of high-performance models from NVIDIA. Jean-Marc Mommessin: It sounds like the advancements will keep coming. So, let’s delve into how you achieved this remarkable performance with Parakeet TDT. What kind of architecture did you use? I understand it’s based on a Fast Conformer architecture with specific optimizations like 8x depth-wise separable convolutional downsampling and limited context attention. Could you explain how you arrived at this approach and whether these optimizations primarily enhance speed and throughput or if they also contribute to accuracy and the ability to process long audio segments like a full hour in one shot? Joey Conway: Yes, we’ve explored various architectures for ASR over the years, and the Conformer architecture, originally from Google, has shown great promise. Our goal with Parakeet TDT was to take the Conformer architecture and make it significantly more efficient and faster without sacrificing quality. We’ve implemented several key optimizations.  First, as you mentioned, the depth-wise separable convolution downsampling. At the input stage, we significantly downsample the audio, which reduces the computational cost and memory requirements for processing. Second is the limited context attention. By focusing on smaller, overlapping chunks of audio, we can maintain accuracy while achieving a speedup in processing. Third, on the encoder side, we also utilize a sliding window attention technique, which allows us to process longer audio files without having to split them into shorter segments. This is crucial for handling long-form audio like a full hour in a single pass. Beyond the Conformer architecture, Parakeet TDT incorporates a Token and Duration Transducer. Traditional Recurrent Neural Networktransducer technology processes audio frame by frame. What we’ve done with TDT is enable the model to predict both the tokens and the expected duration of those tokens. This allows it to make decisions to skip over redundant frames, significantly speeding up the transcription process. This TDT innovation alone contributes to around a 1.5 to 2x speedup. So, there’s a combination of architectural choices and specific optimizations that contribute to Parakeet TDT’s impressive speed and accuracy. Jean-Marc Mommessin: I want to go back to one or two of those. Those are amazing, frankly. The speed increase is remarkable. Joey Conway: Yes, and we have another technique called a label looping algorithm. Essentially, when we’re doing batch inference, this algorithm allows us to advance the tokens independently for different samples. This separation of the workflow enables us to sweep and loop over frames and labels more efficiently, significantly speeding up the decoding process. Lastly, on the decoder side, we’ve moved some of the computation into CUDA graphs, which is a more efficient way to run many small kernels. This optimization alone provided around a 3x speed boost. So, as you can see with TDT models, we’ve been able to achieve speeds comparable to Connectionist Temporal Classificationdecoders, which are also known for their speed, while maintaining high accuracy. Our overall theme is always to balance speed improvements with maintaining or even enhancing accuracy. Techniques like CTC decoders have been around for a while and are fast but might not be as accurate. It really depends on the use case, but we’re always striving for that balance. Jean-Marc Mommessin: Can we revisit the limited context attention? Do you see this technique having broader applications in other areas down the line? Joey Conway: Yes, I believe so. Patterns like the sliding window attention are already used in other areas, such as LLMs. Our research teams are constantly experimenting, looking at successful techniques from different domains, and trying to apply them in new ways. Interestingly, some of the researchers who worked on Parakeet TDT also work on Llama Nemotron, so there’s a cross-pollination of ideas. I do expect that some of these techniques will find broader applications going forward. We also anticipate further improvements to TDT and the Conformer architecture, as we’ve been working on them for several years now. I don’t see these core technologies going away anytime soon; we’ll likely continue to refine them. Jean-Marc Mommessin: Leaving the TDT aside, do you see other potential applications for the Token and Duration Transducer concept in other domains? Joey Conway: That’s a good question. I’m not immediately seeing a direct application of the TDT concept outside of ASR. Its history is rooted in RNNs and RNN transducers, which have primarily been used in speech recognition. However, some of the underlying techniques we’ve applied to it, like using CUDA graphs for optimizing kernel execution, are general techniques that we use whenever we identify bottlenecks in a model’s pipeline. So, while the TDT itself might be domain-specific, some of the optimization strategies we’ve employed could certainly translate to other areas, including large language models. Jean-Marc Mommessin: let’s talk about data. AI data is always a key topic. How do you ensure that the data used to train Parakeet TDT is diverse enough to handle various accents, dialects, vocal ranges, pitches, and noisy background conditions, which often negatively impact ASR performance? Joey Conway: You’re absolutely right. As humans, we naturally filter out accents and background noise to understand speech. However, deep learning models are only as good as the data they’re trained on. Early on, limited data for specific accents or languages resulted in poor performance for those variations. What might have initially seemed like edge cases have become increasingly common, highlighting the need for more representative data. We’ve invested significant effort in curating our datasets to reflect this real-world diversity. We use techniques like classifiers to analyze our data and understand the distributions of accents, dialects, and acoustic conditions. We’ve worked with customers like YUM! Brands, who have drive-through use cases with significant highway noise, illustrating the importance of training the model to handle such challenging environments. Ensuring the right blend and distribution of these conditions in our training data is crucial for the model’s robustness. I’m also excited to announce that we plan to open-source a substantial speech dataset, around 100,000 hours, where we’ve meticulously performed this kind of curation. This dataset will include variations in sound levels, signal-to-noise ratios, background noise types, and even telephone audio formats relevant for call centers. Our goal is to provide the community with high-quality, diverse data that enables models to perform well across a wide range of real-world scenarios. Jean-Marc Mommessin: That’s fantastic news about the open-sourcing of the speech dataset! My final question regarding the Parakeet family: you currently have the 600 million and 1.1 billion parameter models. How do you envision future development for this family? What are the potential directions? Joey Conway: We’re considering development along two main dimensions: model size and the number of supported languages. In terms of size, we’ve released models at the smaller and mid-range to demonstrate the potential, similar to our approach with Llama Nemotron Super. We plan to explore larger models, potentially around 2 billion parameters, which we anticipate will handle even more languages and dialects. On the smaller end, we’re even considering models down to around 50 million parameters. The motivation here is to address use cases at the edge where a smaller footprint is necessary, such as enabling real-time audio processing for robots in noisy environments. We’ll be exploring the right trade-offs for such applications. Technologically, we plan to work on streaming capabilities for TDT. Currently, much of the processing is done in an offline batch mode, but we want to enable real-time, live transcription. And as mentioned, we’re excited about releasing the large, curated speech dataset. Finally, for those looking to deploy these models in production, we recommend exploring techniques like word boosting, which allows for customization of text normalization to include domain-specific terms and acronyms. We aim to provide a wide range of options for users to get started and tailor the models to their specific needs. Jean-Marc Mommessin: I’m very familiar with the NVIDIA Orin platform. Would these Parakeet models currently run on NVIDIA Orin? Joey Conway: Yes, I believe the 0.6 billion parameter model likely would run on Orin. I would need to double-check the exact specifications, but I’m quite confident it’s feasible. Jean-Marc Mommessin: Orin packs a significant punch. I especially love the robotics use case you mentioned. While there’s been a lot of focus on robot vision, the ability to hear and understand quickly is equally crucial, especially for safety. A model that’s 50 times faster and highly accurate in understanding another modality seems like a perfect fit for robotics. Joey Conway: Yes, and the slight hesitation I had earlier was due to the understanding that in robotics, there are often multiple models running simultaneously, including vision models. So, resource allocation is a consideration. However, our push towards smaller, more efficient models is precisely to address these kinds of multi-modal edge computing scenarios. The low latency and real-time processing capabilities of Parakeet are indeed very beneficial for enabling robots to react quickly and safely to auditory cues. Jean-Marc Mommessin: Anything else you’d like to add as a final thought on the Llama Nemotron Ultra and Parakeet families? They’re both open-source, fast, high-throughput, cost-efficient, and run on smaller footprints – are these the key takeaways? Joey Conway: Yes, that’s a great summary. Those were the core objectives we set out to achieve. We aimed for state-of-the-art accuracy, optimized footprints for efficient GPU utilization in terms of latency and throughput, and a commitment to open-sourcing everything to empower the community. We’ve strived to be as community-friendly as possible by releasing datasets, using permissive licenses, and making it easy for people to experiment. We’re eager to see the community’s feedback and the innovative applications they build upon our work. We’re also looking forward to learning from their experiences. Jean-Marc Mommessin: Where are all these models and datasets available? Joey Conway: Everything we’ve published is on Hugging Face – the models and the datasets. The software stack to run them comes from NVIDIA and is available on NGC, our content repository. Much of the underlying software is also open-source and can be found on GitHub. We also provide pip wheels for easier installation. The Nemo framework is the central hub for much of this software stack, whether you want to run the models or fine-tune them. We’ve tried to make it as user-friendly as possible. We use the same software internally to build the models, so it should be relatively straightforward for others to pick up and deploy as well. Jean-Marc Mommessin: Well, Joey, this has been fantastic. I’m continually impressed by NVIDIA’s commitment to giving back to the community with state-of-the-art models that will undoubtedly find their way into production. Thank you so much for your time and insights. I look forward to our next conversation. Joey Conway: Thank you, Jean-Marc. It was my pleasure, and we appreciate the opportunity.  Jean-marc MommessinJean-marc is a successful AI business executive .He leads and accelerates growth for AI powered solutions and started a computer vision company in 2006. He is a recognized speaker at AI conferences and has an MBA from Stanford.Jean-marc Mommessinhttps://www.marktechpost.com/author/jean-marc0000677/NVIDIA AI Releases HOVER: A Breakthrough AI for Versatile Humanoid Control in RoboticsJean-marc Mommessinhttps://www.marktechpost.com/author/jean-marc0000677/Speech-to-Speech Foundation Models Pave the Way for Seamless Multilingual InteractionsJean-marc Mommessinhttps://www.marktechpost.com/author/jean-marc0000677/Lowe’s Revolutionizes Retail with AI: From Personalized Shopping to Proactive Customer AssistanceJean-marc Mommessinhttps://www.marktechpost.com/author/jean-marc0000677/Google DeepMind’s Gemini Robotics: Unleashing Embodied AI with Zero-Shot Control and Enhanced Spatial Reasoning #exclusive #talk #joey #conway #nvidia
    WWW.MARKTECHPOST.COM
    Exclusive Talk: Joey Conway of NVIDIA on Llama Nemotron Ultra and Open Source Models
    Today, MarkTechPost had the pleasure of interviewing Joey Conway from NVIDIA to discuss their exciting work on open-source large language models, including Llama Nemotron Ultra & Parakeet. Highlights from the interview: NVIDIA’s Open Source Powerhouse: Discover how NVIDIA is pushing the boundaries of open-source AI with the release of cutting-edge models like Llama Nemotron Ultra and Parakeet TDT. Llama Nemotron Ultra: Smaller Size, Giant Performance: Learn how NVIDIA achieved on-par performance with models twice the size, enabling deployment on a single GPU node. Explore their innovative FFN fusion technique for significant speedups. Reasoning on Demand: Uncover the unique “reasoning on/off” feature in Llama Nemotron Ultra, offering unprecedented control for production deployments and cost optimization. Revolutionary Speech Recognition with Parakeet TDT: Dive into NVIDIA’s state-of-the-art ASR model that transcribes one hour of audio in one second with only a 6% word error rate – 50 times faster than other open-source alternatives! The “How”: Architectural Innovations: Get insights into the advanced architectures and optimizations behind these models, including FFN fusion, limited context attention, and the Token Duration Transducer (TDT)  Democratizing AI with Open Data: Learn about NVIDIA’s commitment to the open-source community through the release of model weights and massive, high-quality datasets for both language and speech. Future Directions: Get a sneak peek into NVIDIA’s plans for multilingual support, even smaller edge-optimized models, and advancements in real-time streaming for speech recognition. Production-Ready AI: Understand how these models are designed with real-world deployment challenges in mind, focusing on accuracy, efficiency, and cost-effectiveness. Jean-Marc Mommessin: Joey, welcome to Marketechpost! We’re thrilled to have you here and to delve into the impressive open-source models NVIDIA has been releasing. To start, could you please introduce yourself and your role at NVIDIA? Joey Conway: Hi Jean-Marc, it’s great to be here. I’m Joey Conway, and I work in product management for some of the deep learning software at NVIDIA. Our team focuses on large language models like Nemotron and Llama Nemotron, as well as text-to-speech models such as Parakeet. Jean-Marc Mommessin: Wonderful. And you’ve been at NVIDIA for over seven years now, witnessing significant waves of innovation in AI. Let’s talk about your recent release, Llama Nemotron Ultra, a 253 billion parameter model. From what we’ve seen, it delivers performance on par with models like Llama 405B and DeepSeek R1, which are about twice its size. Remarkably, it can run on a single 8x H100 node. What else can you tell us about Llama Nemotron Ultra and what makes it so impressive? Joey Conway: We’re big believers in the open-source community and the fantastic work being done there. With Llama Nemotron, our goal was to build upon the existing foundations, particularly Llama, for which we greatly appreciate Meta’s contributions. We also observed significant progress in reasoning within the open community earlier this year. Inspired by this, we wanted to contribute and see how we could enhance Llama, especially for enterprise use cases. Our focus was primarily on improving reasoning capabilities and agentic tasks like tool calling and chat. We aimed to take the strengths of the open-source community, enhance them, and then contribute those improvements back. Jean-Marc Mommessin: Did you identify specific gaps in existing models that you aimed to address? You mentioned reasoning, but could you provide an example or two of enterprise agentic tasks where you felt there were shortcomings that Llama Nemotron Ultra overcomes? Joey Conway : Yes, I think looking back to the beginning of the year, a key challenge in enterprise deployments was handling complex queries requiring significant thought and reflection. These could be multi-step processes or involve substantial calculations and the use of external tools. At that time, there weren’t many strong open-weight models capable of robust reasoning. The progress we’ve seen in the last few months in this area is very encouraging. Another critical aspect for enterprises is the ability to accurately call APIs and closely follow instructions in user queries. We wanted to ensure that while we focused on improving reasoning, we didn’t compromise these essential production-level capabilities. Furthermore, we often noticed that when both reasoning and instruction following were well-addressed, they typically resided in separate models. Our aim was to simplify this by creating a single model that excels in both. This was the landscape we observed when we started this project around January and February. Jean-Marc Mommessin: That makes perfect sense and aligns with what we’re seeing in the industry as well. Now, let’s dive into the “how.” Your paper mentions FFN fusion as a key optimization. Could you elaborate on this technique, starting with a high-level explanation? Joey Conway: Absolutely. Our focus on optimization stemmed from the realization that deploying state-of-the-art models often requires a significant deployment footprint. We wanted to optimize this to fit within more common GPU setups. We explored various techniques, including our Puzzle neural architecture search. For dense transformer models, particularly those in the Llama family, we discovered a way to reduce or eliminate redundant attention layers. This process aligned the feed-forward network (FFN) layers in a sequence, allowing us to explore fusion methods. Our fundamental goal on the GPU is to maximize parallel execution. Fusing these aligned FFN layers enables greater parallel computation than was previously possible. By removing redundant layers, we found opportunities to essentially merge or fuse the remaining ones. This is a key example of how we tackle the challenges of running these models at scale. Importantly, this technique often yields greater improvements with larger models, which was beneficial for our Ultra model based on Meta’s Llama 3.1 -405B. Jean-Marc Mommessin: And this FFN fusion significantly improves the model’s throughput, achieving notable speedups. If I recall correctly, it’s in the range of 3 to 5x for the Ultra model? Joey Conway: That’s right, the speedups for the Ultra model are in that range. Additionally, by reducing the model’s size in terms of weights, we also lowered its memory footprint. This allowed us to utilize a larger KV cache. For Llama Nemotron Ultra, we could fit it onto a 8x H100 80GB setup, which is quite significant as it fits within common node configurations. So, FFN fusion provided both a substantial compute speedup and a reduction in memory usage, enabling us to handle larger context lengths. These are very exciting outcomes for us. Jean-Marc Mommessin: Let’s switch gears to data curation. AI data is crucial, and your training pipeline seems very sophisticated. You touched on “instruction following” earlier. Could you elaborate on your data curation process and how you ensured high-quality data, especially considering you leveraged other models in the process? Image source: NVIDIA Joey Conway: Transparency and openness were key in our approach. We wanted to share as much as possible about our data, techniques, and tooling so the community could understand and even use it themselves. Our primary goal with data curation was to improve accuracy across several key domains, including reasoning tasks like math and coding, as well as non-reasoning tasks like tool calling, instruction following, and chat. Our strategy involved curating specific datasets to enhance performance in these areas. Within our supervised fine-tuning process, we differentiated between “reasoning on” and “reasoning off” scenarios. For example, in math and coding, we curated data for simple questions that don’t require complex reasoning, as well as more intricate problems that do. This helps the model learn when and how to apply reasoning. A key part of this process was leveraging high-quality models from the community as “experts” in specific domains. For instance, we used DeepSeek R-1 extensively for reasoning-intensive math and coding tasks. For non-reasoning tasks like basic math, coding, chat, and tool calling, we utilized models like Llama and Qwen. Our aim was to blend the best capabilities of these community models into a single model. We’ve also made this curated dataset publicly available on Hugging Face, with around 30 million question-answer pairs. This allows the community to explore, use, and build upon our work. We were also excited to see our partner ServiceNow recently announce their apprehend Nemotron model, which was trained using our dataset to enhance their own reasoning capabilities. Jean-Marc Mommessin: That’s fantastic that you’re sharing the dataset. Given that you used other models to generate some of this data, what kind of quality checks did you implement to ensure the reliability of the training pairs? Joey Conway: Data quality was absolutely paramount. Since we were generating a significant portion of the data using other models, we implemented a rigorous multi-layered quality assurance process. First, for each expert model used to generate data in a specific domain, we would generate multiple candidate responses for the same prompt. Then, we employed a separate set of “critic” models to evaluate these candidates based on correctness, coherence, and adherence to the prompt. Second, we implemented a scoring mechanism. Each generated question-answer pair received a quality score based on the critic model’s evaluation. We set a high threshold, and any pair that didn’t meet this standard was discarded. Third, human review was integrated at various stages. Our team of data scientists and engineers manually inspected samples of the generated data to identify any systematic errors, biases, or instances of hallucination. This human oversight was crucial for catching nuances that automated systems might miss. Fourth, we focused on the diversity of the generated data. We wanted to ensure we weren’t just getting variations of the same types of questions and answers. We implemented strategies to encourage the expert models to generate a broad range of examples within each domain. Finally, after training Llama Nemotron Ultra on this curated data, we conducted extensive evaluations against benchmark datasets and in real-world use cases. This feedback loop helped us further refine our data generation and filtering techniques. So, it was a comprehensive approach involving expert generation, automated criticism and scoring, human review, diversity checks, and rigorous downstream evaluation to ensure the high quality of our training data. Jean-Marc Mommessin: The quality of the synthetic data is so important. Could you elaborate on the stages you take to ensure high accuracy when generating this data? Joey Conway: Absolutely. When doing synthetic data generation, there are a few key stages to ensure high accuracy. The first is the prompts – the seed data and how we prompt the model. The second is the quality of the responses. On the prompting side, we focus on prompting models where we believe they excel. For example, we might use Llama for chat-related prompts but avoid using a non-reasoning model for math. It’s crucial to align the prompts with the core strengths of the model. For vetting the responses, we invest time in both human manual review and automated methods. Going forward, we anticipate increasing our use of verifiers and reward models, similar to what we’ve done on the Reinforcement Learning (RL) side. The reason we’ve open-sourced much of this is that there’s a lot of nuance involved, and we wanted the community to engage with these challenges. Enterprises like ServiceNow have specific goals, and some of our data might be more or less useful to them. By making it available, they can vet it themselves. We also provide tools like classifier models to help categorize content, such as news or sports, allowing users to make informed decisions about the data blends they use for training. Jean-Marc Mommessin: Perfect. Is there anything else you’d like to highlight regarding this pipeline? Joey Conway: Yes, I’d like to touch on the Reinforcement Learning (RL) aspect. Following the supervised fine-tuning stage, where we enhanced core skills, we’ve just begun to explore the potential of RL with Nemotron. We believe this will be a significant area of future development. What’s exciting about RL is that its effectiveness is largely tied to the available compute time. The more time we invest, the better the model becomes at specific tasks. In our RL stages, we’ve developed methods to automate the process of asking the model a question, grading its answer, and providing feedback to allow it to learn and improve. You can see on the slide the domains where we’ve applied this: scientific reasoning, instruction following, and chat. If you look at the leaderboards, you’ll see that even with new models emerging, we’ve maintained a strong position in these areas, largely due to the effectiveness of RL in achieving top-tier accuracy. We’re optimistic that we’ll see more of this in the community, with more discussion and publication of techniques and data. We’ve started sharing some of our work in this area and will have much more to come in the next three to six months. Jean-Marc Mommessin: You mentioned RL and instruction following, which ties back to the beginning of our conversation. It seems like you’ve come full circle here. Joey Conway: Exactly. The exciting aspect here is automating the feedback loop wherever possible. For chat, we published a fine-tuned reward model last fall. Those who followed our work might recall that our Llama Nemotron model topped the chat leaderboards then. This was because the reward model provides an automated way to teach the original model whether its responses are good or bad. It essentially grades responses based on helpfulness, conciseness, verbosity, groundedness, and similar factors. This granular feedback per generated response allows the model to improve significantly, often more so than through supervised fine-tuning alone, which typically involves a few passes without a continuous feedback loop. Similarly, for instruction following, we use a verifier and a dataset to teach the model whether it followed instructions well or needs to try again. We’re eager to expand this approach to more domains. We’ve already published datasets related to coding and math since the release of this model a few weeks ago, and these have become popular on Hugging Face. I anticipate significant growth in this area within the community. Jean-Marc Mommessin: Alright, so one of the big innovations here, and you touched upon it, but I want to emphasize it, is the ability to toggle reasoning on and off via the system prompt. This is quite unique, and I’m sure many will follow suit. Could you expand on the idea behind this, how you see it applying to agents and beyond, its value, and the key challenges in implementing it? Joey Conway: The reasoning on and off capability was a core goal from the outset. We observed that models in the community often excelled in either reasoning or non-reasoning tasks, and we wanted to simplify deployment by having a single model that could handle both. We had to determine the best way to teach the model when to reason and when not to, while also providing enterprises with explicit control, as they often have deeper domain knowledge than we do. The motivation behind this is that reasoning generates significantly more tokens, which can lead to higher latency and cost. While crucial for solving complex problems, it’s not always necessary. We wanted to give enterprises the control to balance accuracy with latency and cost, allowing them to decide when to employ reasoning and when to opt for faster, less computationally intensive responses. Initially, we weren’t sure how to achieve this, as it hadn’t been widely implemented in the community. Our approach in the supervised fine-tuning stage was to explicitly teach the model by presenting the same question with two different answers: one with detailed reasoning and one without. This essentially doubled our dataset for this specific purpose. However, the outcome is a single model where users can simply include “use detailed thinking on” or “use detailed thinking off” in the prompt to control the model’s reasoning process. On the training side, this required more effort to teach the model this distinction. What we have today is essentially a v1, and I expect others will follow this approach. We’re also excited about future developments, such as time or token limits for reasoning and more granular controls. I’m optimistic that we’ll see further breakthroughs in this area within the next six to nine months, as the problem-solving power of reasoning is significant, but it comes with trade-offs that the community will continue to refine. Jean-Marc Mommessin: We all know that the real test comes in production. Production environments are sensitive to latency, cost, and while accuracy and reasoning are vital, excessive reasoning can lead to scalability issues and increased latency. The flexibility you’ve introduced is fantastic, and I can see numerous production use cases that will greatly benefit from the ability to control reasoning on a per-query basis. So, when you were developing this model, you aimed to balance accuracy and efficiency. Could you share some insights into how you made these trade-offs, the timeline for building the model and the team involved, and how you determined the optimal compromise between these two critical factors? Joey Conway: Balancing accuracy and efficiency is always a challenge. Our initial goal was to achieve both, which is a difficult undertaking. We started with the “Super” model, which was the most recent Llama 3.1 70B release from Meta, as our baseline for accuracy. We weren’t sure if we could simultaneously improve accuracy and reduce the model size. We found that through our training techniques and distillation process, we could indeed boost accuracy. We even released an initial checkpoint reflecting this. However, we wanted to go further by incorporating strong reasoning capabilities, aiming for state-of-the-art reasoning scores. This is where the SFT and RL stages came in, which required significant time for synthetic data generation since this type of data didn’t exist. During training, we carefully considered the number of epochs for each skill and continuously measured accuracy. Our goal was to improve performance across all six key areas rather than excelling in just a couple. This balancing act took more time as we experimented to find the right combinations. However, we felt it was crucial to ensure world-class performance in these six enterprise-relevant scenarios, including chat and instruction following. For areas like MMLU, we focused on maintaining performance and preventing regression rather than actively trying to improve scores. So, there were definitely priorities and trade-offs involved. Ultimately, we believe these were the right focus areas for our enterprise customers. Jean-Marc Mommessin: You are releasing this model family as part of the open-source community. We’ve discussed the gaps you aimed to address and the unique reasoning on/off feature for production scalability. Could you share your thoughts on how NVIDIA and your team view the role of these models within the broader open-source and LLM ecosystem, especially given your work building upon the Llama base? Joey Conway: NVIDIA has a long history of contributing models to the open-source community. What excites us about Llama is its strong traction with enterprise customers. While NVIDIA Research publishes extensively across various domains, our goal with Llama Nemotron was to build upon Llama’s momentum in enterprise adoption by focusing narrowly on specific areas. The base Llama models already cover many things exceptionally well, so we saw an opportunity to build on top of that and be very targeted in our enhancements. The recent LlamaCon event and Meta’s announcements sound very promising, and we’re excited about Llama 4 and the ongoing work there. Moving forward, we anticipate continuing to identify specific areas where we can add significant value, while Meta continues to build excellent general-purpose models suitable for enterprise production. From our perspective, reasoning will likely remain a key focus, and we’re also excited about Meta’s advancements in this area. Tool calling, instruction following, and chat are also areas we’ll continue to develop. One area we’re particularly interested in exploring is multilingual capabilities. For large enterprises, supporting multiple languages is crucial. While many models handle individual languages well, we aim to focus on a few key languages and ensure world-class accuracy for reasoning, tool calling, and chat within those. This is likely the next major area of expansion for us, beyond the exciting developments in model architectures like Llama 4’s new MoE architecture, which we’re also keen to explore for potential distillation and optimization for NVIDIA GPUs. So, there’s a lot of exciting work ahead. Jean-Marc Mommessin: When you say multilingual, are you thinking of supporting a broad range, like 50 languages, or a more focused set, perhaps around 5 or 10 initially, given the benchmark challenges you mentioned? Joey Conway: We’ll probably start with a more focused set, perhaps around 5 to 10 languages. The challenge is that the community currently lacks comprehensive benchmarks for tasks like reasoning or tool calling across a wide variety of languages. As we develop these multilingual models, we’re also having to create evaluation data simultaneously, which takes time. If those benchmarks were readily available, the process would be smoother. However, we see this as an exciting challenge. Our initial focus will likely be on a smaller set of languages where we can establish strong performance, given the current limitations in community-wide benchmarks. Jean-Marc Mommessin: Let’s shift gears and talk about another state-of-the-art open-source model you recently released: Parakeet TDT 0.6 B parameters, V2. This model has set a new standard for automatic speech recognition (ASR), transcribing one hour of audio in just one second. That’s 50 times faster than other open-source ASR models, and remarkably, it achieves only a 6% word error rate. This is truly impressive. What else would you like to highlight about this model before we discuss the “how” behind its incredible performance? Joey Conway: It’s worth noting that NVIDIA has been working on ASR models for a long time, even before I joined. We’ve also released many open models in this space over the years. The teams working on this are exceptional, and they consistently strive to balance accuracy with latency and throughput. Parakeet V2 is the latest in this line of high-performance models from NVIDIA. Jean-Marc Mommessin: It sounds like the advancements will keep coming. So, let’s delve into how you achieved this remarkable performance with Parakeet TDT. What kind of architecture did you use? I understand it’s based on a Fast Conformer architecture with specific optimizations like 8x depth-wise separable convolutional downsampling and limited context attention. Could you explain how you arrived at this approach and whether these optimizations primarily enhance speed and throughput or if they also contribute to accuracy and the ability to process long audio segments like a full hour in one shot? Joey Conway: Yes, we’ve explored various architectures for ASR over the years, and the Conformer architecture, originally from Google, has shown great promise. Our goal with Parakeet TDT was to take the Conformer architecture and make it significantly more efficient and faster without sacrificing quality. We’ve implemented several key optimizations.  First, as you mentioned, the depth-wise separable convolution downsampling. At the input stage, we significantly downsample the audio, which reduces the computational cost and memory requirements for processing. Second is the limited context attention. By focusing on smaller, overlapping chunks of audio, we can maintain accuracy while achieving a speedup in processing. Third, on the encoder side, we also utilize a sliding window attention technique, which allows us to process longer audio files without having to split them into shorter segments. This is crucial for handling long-form audio like a full hour in a single pass. Beyond the Conformer architecture, Parakeet TDT incorporates a Token and Duration Transducer (TDT). Traditional Recurrent Neural Network (RNN) transducer technology processes audio frame by frame. What we’ve done with TDT is enable the model to predict both the tokens and the expected duration of those tokens. This allows it to make decisions to skip over redundant frames, significantly speeding up the transcription process. This TDT innovation alone contributes to around a 1.5 to 2x speedup. So, there’s a combination of architectural choices and specific optimizations that contribute to Parakeet TDT’s impressive speed and accuracy. Jean-Marc Mommessin: I want to go back to one or two of those. Those are amazing, frankly. The speed increase is remarkable. Joey Conway: Yes, and we have another technique called a label looping algorithm. Essentially, when we’re doing batch inference, this algorithm allows us to advance the tokens independently for different samples. This separation of the workflow enables us to sweep and loop over frames and labels more efficiently, significantly speeding up the decoding process. Lastly, on the decoder side, we’ve moved some of the computation into CUDA graphs, which is a more efficient way to run many small kernels. This optimization alone provided around a 3x speed boost. So, as you can see with TDT models, we’ve been able to achieve speeds comparable to Connectionist Temporal Classification (CTC) decoders, which are also known for their speed, while maintaining high accuracy. Our overall theme is always to balance speed improvements with maintaining or even enhancing accuracy. Techniques like CTC decoders have been around for a while and are fast but might not be as accurate. It really depends on the use case, but we’re always striving for that balance. Jean-Marc Mommessin: Can we revisit the limited context attention? Do you see this technique having broader applications in other areas down the line? Joey Conway: Yes, I believe so. Patterns like the sliding window attention are already used in other areas, such as LLMs. Our research teams are constantly experimenting, looking at successful techniques from different domains, and trying to apply them in new ways. Interestingly, some of the researchers who worked on Parakeet TDT also work on Llama Nemotron, so there’s a cross-pollination of ideas. I do expect that some of these techniques will find broader applications going forward. We also anticipate further improvements to TDT and the Conformer architecture, as we’ve been working on them for several years now. I don’t see these core technologies going away anytime soon; we’ll likely continue to refine them. Jean-Marc Mommessin: Leaving the TDT aside, do you see other potential applications for the Token and Duration Transducer concept in other domains? Joey Conway: That’s a good question. I’m not immediately seeing a direct application of the TDT concept outside of ASR. Its history is rooted in RNNs and RNN transducers, which have primarily been used in speech recognition. However, some of the underlying techniques we’ve applied to it, like using CUDA graphs for optimizing kernel execution, are general techniques that we use whenever we identify bottlenecks in a model’s pipeline. So, while the TDT itself might be domain-specific, some of the optimization strategies we’ve employed could certainly translate to other areas, including large language models. Jean-Marc Mommessin: let’s talk about data. AI data is always a key topic. How do you ensure that the data used to train Parakeet TDT is diverse enough to handle various accents, dialects, vocal ranges, pitches, and noisy background conditions, which often negatively impact ASR performance? Joey Conway: You’re absolutely right. As humans, we naturally filter out accents and background noise to understand speech. However, deep learning models are only as good as the data they’re trained on. Early on, limited data for specific accents or languages resulted in poor performance for those variations. What might have initially seemed like edge cases have become increasingly common, highlighting the need for more representative data. We’ve invested significant effort in curating our datasets to reflect this real-world diversity. We use techniques like classifiers to analyze our data and understand the distributions of accents, dialects, and acoustic conditions. We’ve worked with customers like YUM! Brands, who have drive-through use cases with significant highway noise, illustrating the importance of training the model to handle such challenging environments. Ensuring the right blend and distribution of these conditions in our training data is crucial for the model’s robustness. I’m also excited to announce that we plan to open-source a substantial speech dataset, around 100,000 hours, where we’ve meticulously performed this kind of curation. This dataset will include variations in sound levels, signal-to-noise ratios, background noise types, and even telephone audio formats relevant for call centers. Our goal is to provide the community with high-quality, diverse data that enables models to perform well across a wide range of real-world scenarios. Jean-Marc Mommessin: That’s fantastic news about the open-sourcing of the speech dataset! My final question regarding the Parakeet family: you currently have the 600 million and 1.1 billion parameter models. How do you envision future development for this family? What are the potential directions? Joey Conway: We’re considering development along two main dimensions: model size and the number of supported languages. In terms of size, we’ve released models at the smaller and mid-range to demonstrate the potential, similar to our approach with Llama Nemotron Super. We plan to explore larger models, potentially around 2 billion parameters, which we anticipate will handle even more languages and dialects. On the smaller end, we’re even considering models down to around 50 million parameters. The motivation here is to address use cases at the edge where a smaller footprint is necessary, such as enabling real-time audio processing for robots in noisy environments. We’ll be exploring the right trade-offs for such applications. Technologically, we plan to work on streaming capabilities for TDT. Currently, much of the processing is done in an offline batch mode, but we want to enable real-time, live transcription. And as mentioned, we’re excited about releasing the large, curated speech dataset. Finally, for those looking to deploy these models in production, we recommend exploring techniques like word boosting, which allows for customization of text normalization to include domain-specific terms and acronyms. We aim to provide a wide range of options for users to get started and tailor the models to their specific needs. Jean-Marc Mommessin: I’m very familiar with the NVIDIA Orin platform. Would these Parakeet models currently run on NVIDIA Orin? Joey Conway: Yes, I believe the 0.6 billion parameter model likely would run on Orin. I would need to double-check the exact specifications, but I’m quite confident it’s feasible. Jean-Marc Mommessin: Orin packs a significant punch. I especially love the robotics use case you mentioned. While there’s been a lot of focus on robot vision, the ability to hear and understand quickly is equally crucial, especially for safety. A model that’s 50 times faster and highly accurate in understanding another modality seems like a perfect fit for robotics. Joey Conway: Yes, and the slight hesitation I had earlier was due to the understanding that in robotics, there are often multiple models running simultaneously, including vision models. So, resource allocation is a consideration. However, our push towards smaller, more efficient models is precisely to address these kinds of multi-modal edge computing scenarios. The low latency and real-time processing capabilities of Parakeet are indeed very beneficial for enabling robots to react quickly and safely to auditory cues. Jean-Marc Mommessin: Anything else you’d like to add as a final thought on the Llama Nemotron Ultra and Parakeet families? They’re both open-source, fast, high-throughput, cost-efficient, and run on smaller footprints – are these the key takeaways? Joey Conway: Yes, that’s a great summary. Those were the core objectives we set out to achieve. We aimed for state-of-the-art accuracy, optimized footprints for efficient GPU utilization in terms of latency and throughput, and a commitment to open-sourcing everything to empower the community. We’ve strived to be as community-friendly as possible by releasing datasets, using permissive licenses, and making it easy for people to experiment. We’re eager to see the community’s feedback and the innovative applications they build upon our work. We’re also looking forward to learning from their experiences. Jean-Marc Mommessin: Where are all these models and datasets available? Joey Conway: Everything we’ve published is on Hugging Face – the models and the datasets. The software stack to run them comes from NVIDIA and is available on NGC, our content repository. Much of the underlying software is also open-source and can be found on GitHub. We also provide pip wheels for easier installation. The Nemo framework is the central hub for much of this software stack, whether you want to run the models or fine-tune them. We’ve tried to make it as user-friendly as possible. We use the same software internally to build the models, so it should be relatively straightforward for others to pick up and deploy as well. Jean-Marc Mommessin: Well, Joey, this has been fantastic. I’m continually impressed by NVIDIA’s commitment to giving back to the community with state-of-the-art models that will undoubtedly find their way into production. Thank you so much for your time and insights. I look forward to our next conversation. Joey Conway: Thank you, Jean-Marc. It was my pleasure, and we appreciate the opportunity.  Jean-marc MommessinJean-marc is a successful AI business executive .He leads and accelerates growth for AI powered solutions and started a computer vision company in 2006. He is a recognized speaker at AI conferences and has an MBA from Stanford.Jean-marc Mommessinhttps://www.marktechpost.com/author/jean-marc0000677/NVIDIA AI Releases HOVER: A Breakthrough AI for Versatile Humanoid Control in RoboticsJean-marc Mommessinhttps://www.marktechpost.com/author/jean-marc0000677/Speech-to-Speech Foundation Models Pave the Way for Seamless Multilingual InteractionsJean-marc Mommessinhttps://www.marktechpost.com/author/jean-marc0000677/Lowe’s Revolutionizes Retail with AI: From Personalized Shopping to Proactive Customer AssistanceJean-marc Mommessinhttps://www.marktechpost.com/author/jean-marc0000677/Google DeepMind’s Gemini Robotics: Unleashing Embodied AI with Zero-Shot Control and Enhanced Spatial Reasoning
    0 Kommentare 0 Anteile
Suchergebnis