Exclusive Talk: Joey Conway of NVIDIA on Llama Nemotron Ultra and Open Source Models
Today, MarkTechPost had the pleasure of interviewing Joey Conway from NVIDIA to discuss their exciting work on open-source large language models, including Llama Nemotron Ultra & Parakeet.
Highlights from the interview:
NVIDIA’s Open Source Powerhouse: Discover how NVIDIA is pushing the boundaries of open-source AI with the release of cutting-edge models like Llama Nemotron Ultra and Parakeet TDT.
Llama Nemotron Ultra: Smaller Size, Giant Performance: Learn how NVIDIA achieved on-par performance with models twice the size, enabling deployment on a single GPU node. Explore their innovative FFN fusion technique for significant speedups.
Reasoning on Demand: Uncover the unique “reasoning on/off” feature in Llama Nemotron Ultra, offering unprecedented control for production deployments and cost optimization.
Revolutionary Speech Recognition with Parakeet TDT: Dive into NVIDIA’s state-of-the-art ASR model that transcribes one hour of audio in one second with only a 6% word error rate – 50 times faster than other open-source alternatives!
The “How”: Architectural Innovations: Get insights into the advanced architectures and optimizations behind these models, including FFN fusion, limited context attention, and the Token and Duration Transducer (TDT).
Democratizing AI with Open Data: Learn about NVIDIA’s commitment to the open-source community through the release of model weights and massive, high-quality datasets for both language and speech.
Future Directions: Get a sneak peek into NVIDIA’s plans for multilingual support, even smaller edge-optimized models, and advancements in real-time streaming for speech recognition.
Production-Ready AI: Understand how these models are designed with real-world deployment challenges in mind, focusing on accuracy, efficiency, and cost-effectiveness.
Jean-Marc Mommessin: Joey, welcome to MarkTechPost! We’re thrilled to have you here and to delve into the impressive open-source models NVIDIA has been releasing. To start, could you please introduce yourself and your role at NVIDIA?
Joey Conway: Hi Jean-Marc, it’s great to be here. I’m Joey Conway, and I work in product management for some of the deep learning software at NVIDIA. Our team focuses on large language models like Nemotron and Llama Nemotron, as well as speech recognition models such as Parakeet.
Jean-Marc Mommessin: Wonderful. And you’ve been at NVIDIA for over seven years now, witnessing significant waves of innovation in AI. Let’s talk about your recent release, Llama Nemotron Ultra, a 253 billion parameter model. From what we’ve seen, it delivers performance on par with models like Llama 405B and DeepSeek R1, which are about twice its size. Remarkably, it can run on a single 8x H100 node. What else can you tell us about Llama Nemotron Ultra and what makes it so impressive?
Joey Conway: We’re big believers in the open-source community and the fantastic work being done there. With Llama Nemotron, our goal was to build upon the existing foundations, particularly Llama, for which we greatly appreciate Meta’s contributions. We also observed significant progress in reasoning within the open community earlier this year. Inspired by this, we wanted to contribute and see how we could enhance Llama, especially for enterprise use cases.
Our focus was primarily on improving reasoning capabilities and agentic tasks like tool calling and chat. We aimed to take the strengths of the open-source community, enhance them, and then contribute those improvements back.
Jean-Marc Mommessin: Did you identify specific gaps in existing models that you aimed to address? You mentioned reasoning, but could you provide an example or two of enterprise agentic tasks where you felt there were shortcomings that Llama Nemotron Ultra overcomes?
Joey Conway: Yes, I think looking back to the beginning of the year, a key challenge in enterprise deployments was handling complex queries requiring significant thought and reflection. These could be multi-step processes or involve substantial calculations and the use of external tools. At that time, there weren’t many strong open-weight models capable of robust reasoning. The progress we’ve seen in the last few months in this area is very encouraging.
Another critical aspect for enterprises is the ability to accurately call APIs and closely follow instructions in user queries. We wanted to ensure that while we focused on improving reasoning, we didn’t compromise these essential production-level capabilities.
Furthermore, we often noticed that when both reasoning and instruction following were well-addressed, they typically resided in separate models. Our aim was to simplify this by creating a single model that excels in both. This was the landscape we observed when we started this project around January and February.
Jean-Marc Mommessin: That makes perfect sense and aligns with what we’re seeing in the industry as well. Now, let’s dive into the “how.” Your paper mentions FFN fusion as a key optimization. Could you elaborate on this technique, starting with a high-level explanation?
Joey Conway: Absolutely. Our focus on optimization stemmed from the realization that deploying state-of-the-art models often requires a significant deployment footprint. We wanted to optimize this to fit within more common GPU setups.
We explored various techniques, including our Puzzle neural architecture search. For dense transformer models, particularly those in the Llama family, we discovered a way to reduce or eliminate redundant attention layers. This process aligned the feed-forward network (FFN) layers in a sequence, allowing us to explore fusion methods.
Our fundamental goal on the GPU is to maximize parallel execution. Fusing these aligned FFN layers enables greater parallel computation than was previously possible. By removing redundant layers, we found opportunities to essentially merge or fuse the remaining ones. This is a key example of how we tackle the challenges of running these models at scale. Importantly, this technique often yields greater improvements with larger models, which was beneficial for our Ultra model, based on Meta’s Llama 3.1 405B.
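To make the idea concrete, here is a minimal PyTorch sketch of fusing back-to-back gated FFN blocks once the attention layers between them have been removed. The module layout and the sanity check are illustrative assumptions, not NVIDIA’s actual Puzzle or FFN-fusion code; the point is simply that stacking the weights turns several small sequential matmuls into one wide, well-parallelized one, which (per the description above) closely approximates the original sequential pass.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFFN(nn.Module):
    """SwiGLU-style feed-forward block (dimensions are illustrative)."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))

def fuse_ffns(ffns: list[GatedFFN], d_model: int) -> GatedFFN:
    """Stack several FFNs that now sit back-to-back (attention removed)
    into one wider FFN. Its output equals the sum of the individual FFN
    outputs applied to the same input, so one large matmul replaces
    several small ones, and the residual stream adds this sum back just
    as it would have added the individual outputs."""
    fused = GatedFFN(d_model, sum(f.up.out_features for f in ffns))
    with torch.no_grad():
        fused.up.weight.copy_(torch.cat([f.up.weight for f in ffns], dim=0))
        fused.gate.weight.copy_(torch.cat([f.gate.weight for f in ffns], dim=0))
        fused.down.weight.copy_(torch.cat([f.down.weight for f in ffns], dim=1))
    return fused

# Sanity check: the fused block matches the sum of the originals on one input.
blocks = [GatedFFN(1024, 4096) for _ in range(3)]
x = torch.randn(2, 1024)
assert torch.allclose(fuse_ffns(blocks, 1024)(x),
                      sum(b(x) for b in blocks), atol=1e-4)
```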
Jean-Marc Mommessin: And this FFN fusion significantly improves the model’s throughput, achieving notable speedups. If I recall correctly, it’s in the range of 3 to 5x for the Ultra model?
Joey Conway: That’s right, the speedups for the Ultra model are in that range. Additionally, by reducing the model’s size in terms of weights, we also lowered its memory footprint. This allowed us to utilize a larger KV cache. For Llama Nemotron Ultra, we could fit it onto an 8x H100 80GB setup, which is quite significant as it fits within common node configurations. So, FFN fusion provided both a substantial compute speedup and a reduction in memory usage, enabling us to handle larger context lengths. These are very exciting outcomes for us.
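As a rough back-of-the-envelope illustration of why the reduced footprint matters (our own estimate, not NVIDIA’s, and it ignores activations and runtime overheads), the weights alone leave very different amounts of headroom for the KV cache depending on precision:

```python
# Rough, illustrative memory arithmetic for a 253B-parameter model on 8x H100 80GB.
# Exact numbers depend on precision, parallelism layout, and runtime overheads.
params = 253e9
total_hbm_gb = 8 * 80                      # 640 GB of HBM across the node

weights_fp8_gb = params * 1 / 1e9          # ~253 GB at 1 byte/param
weights_bf16_gb = params * 2 / 1e9         # ~506 GB at 2 bytes/param

print(f"Headroom for KV cache/activations (FP8):  ~{total_hbm_gb - weights_fp8_gb:.0f} GB")
print(f"Headroom for KV cache/activations (BF16): ~{total_hbm_gb - weights_bf16_gb:.0f} GB")
```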
Jean-Marc Mommessin: Let’s switch gears to data curation. AI data is crucial, and your training pipeline seems very sophisticated. You touched on “instruction following” earlier. Could you elaborate on your data curation process and how you ensured high-quality data, especially considering you leveraged other models in the process?
Image source: NVIDIA
Joey Conway: Transparency and openness were key in our approach. We wanted to share as much as possible about our data, techniques, and tooling so the community could understand and even use it themselves. Our primary goal with data curation was to improve accuracy across several key domains, including reasoning tasks like math and coding, as well as non-reasoning tasks like tool calling, instruction following, and chat.
Our strategy involved curating specific datasets to enhance performance in these areas. Within our supervised fine-tuning process, we differentiated between “reasoning on” and “reasoning off” scenarios. For example, in math and coding, we curated data for simple questions that don’t require complex reasoning, as well as more intricate problems that do. This helps the model learn when and how to apply reasoning.
A key part of this process was leveraging high-quality models from the community as “experts” in specific domains. For instance, we used DeepSeek R-1 extensively for reasoning-intensive math and coding tasks. For non-reasoning tasks like basic math, coding, chat, and tool calling, we utilized models like Llama and Qwen. Our aim was to blend the best capabilities of these community models into a single model.
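As a hypothetical illustration of what such a blended SFT mix can look like, the records below pair the same question with and without a reasoning trace. The field names and examples are ours, but the expert-model roles and the thinking toggle follow what Joey describes here and later in the interview.

```python
# Hypothetical layout for blended SFT records (field names are illustrative).
# The same question appears twice: once with a reasoning trace generated by a
# reasoning expert, once with a direct answer from a non-reasoning expert.
sft_records = [
    {
        "system": "detailed thinking on",
        "source_expert": "deepseek-r1",              # reasoning expert (math/code)
        "prompt": "Which is larger, 9.11 or 9.9?",
        "response": "<think>9.9 = 9.90 and 9.90 > 9.11 ...</think> 9.9 is larger.",
    },
    {
        "system": "detailed thinking off",
        "source_expert": "llama-or-qwen-instruct",   # non-reasoning expert (chat/tools)
        "prompt": "Which is larger, 9.11 or 9.9?",
        "response": "9.9 is larger than 9.11.",
    },
]
```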
We’ve also made this curated dataset publicly available on Hugging Face, with around 30 million question-answer pairs. This allows the community to explore, use, and build upon our work. We were also excited to see our partner ServiceNow recently announce their Apriel Nemotron model, which was trained using our dataset to enhance their own reasoning capabilities.
Jean-Marc Mommessin: That’s fantastic that you’re sharing the dataset. Given that you used other models to generate some of this data, what kind of quality checks did you implement to ensure the reliability of the training pairs?
Joey Conway: Data quality was absolutely paramount. Since we were generating a significant portion of the data using other models, we implemented a rigorous multi-layered quality assurance process.
First, for each expert model used to generate data in a specific domain, we would generate multiple candidate responses for the same prompt. Then, we employed a separate set of “critic” models to evaluate these candidates based on correctness, coherence, and adherence to the prompt.
Second, we implemented a scoring mechanism. Each generated question-answer pair received a quality score based on the critic model’s evaluation. We set a high threshold, and any pair that didn’t meet this standard was discarded.
Third, human review was integrated at various stages. Our team of data scientists and engineers manually inspected samples of the generated data to identify any systematic errors, biases, or instances of hallucination. This human oversight was crucial for catching nuances that automated systems might miss.
Fourth, we focused on the diversity of the generated data. We wanted to ensure we weren’t just getting variations of the same types of questions and answers. We implemented strategies to encourage the expert models to generate a broad range of examples within each domain.
Finally, after training Llama Nemotron Ultra on this curated data, we conducted extensive evaluations against benchmark datasets and in real-world use cases. This feedback loop helped us further refine our data generation and filtering techniques.
So, it was a comprehensive approach involving expert generation, automated criticism and scoring, human review, diversity checks, and rigorous downstream evaluation to ensure the high quality of our training data.
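Schematically, the generate-and-filter loop described above might look like the sketch below; the function names, critic interface, and threshold are hypothetical placeholders rather than NVIDIA’s actual tooling.

```python
from statistics import mean

QUALITY_THRESHOLD = 0.8  # hypothetical cutoff

def curate(prompts, expert_model, critic_models, n_candidates=4):
    """Generate several candidate answers per prompt with a domain-expert
    model, score them with critic models, and keep only pairs that clear
    a quality bar. `generate` and `score` stand in for real inference calls."""
    kept = []
    for prompt in prompts:
        candidates = [expert_model.generate(prompt) for _ in range(n_candidates)]
        for answer in candidates:
            # Average the critics' judgments of correctness, coherence, adherence.
            score = mean(c.score(prompt, answer) for c in critic_models)
            if score >= QUALITY_THRESHOLD:
                kept.append({"prompt": prompt, "response": answer, "score": score})
    return kept
```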
Jean-Marc Mommessin: The quality of the synthetic data is so important. Could you elaborate on the stages you take to ensure high accuracy when generating this data?
Joey Conway: Absolutely. When doing synthetic data generation, there are a few key stages to ensure high accuracy. The first is the prompts – the seed data and how we prompt the model. The second is the quality of the responses.
On the prompting side, we focus on prompting models where we believe they excel. For example, we might use Llama for chat-related prompts but avoid using a non-reasoning model for math. It’s crucial to align the prompts with the core strengths of the model.
For vetting the responses, we invest time in both human manual review and automated methods. Going forward, we anticipate increasing our use of verifiers and reward models, similar to what we’ve done on the Reinforcement Learning side.
The reason we’ve open-sourced much of this is that there’s a lot of nuance involved, and we wanted the community to engage with these challenges. Enterprises like ServiceNow have specific goals, and some of our data might be more or less useful to them. By making it available, they can vet it themselves. We also provide tools like classifier models to help categorize content, such as news or sports, allowing users to make informed decisions about the data blends they use for training.
Jean-Marc Mommessin: Perfect. Is there anything else you’d like to highlight regarding this pipeline?
Joey Conway: Yes, I’d like to touch on the Reinforcement Learning aspect. Following the supervised fine-tuning stage, where we enhanced core skills, we’ve just begun to explore the potential of RL with Nemotron. We believe this will be a significant area of future development.
What’s exciting about RL is that its effectiveness is largely tied to the available compute time. The more time we invest, the better the model becomes at specific tasks. In our RL stages, we’ve developed methods to automate the process of asking the model a question, grading its answer, and providing feedback to allow it to learn and improve.
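A schematic sketch of that automated ask-grade-improve loop is shown below; `grade` and `update` are placeholders for whatever verifier or reward model and policy-optimization step are used, not NVIDIA’s actual RL stack.

```python
def rl_round(policy_model, prompts, grade, update):
    """One schematic round of the automated ask/grade/improve loop.
    `grade` could be a verifier (did the answer follow the instruction?)
    or a reward model; `update` stands in for the policy-optimization step
    (e.g., a PPO/GRPO-style update). All of these are placeholders."""
    trajectories = []
    for prompt in prompts:
        answer = policy_model.generate(prompt)
        reward = grade(prompt, answer)          # automated feedback signal
        trajectories.append((prompt, answer, reward))
    update(policy_model, trajectories)          # learn from the graded rollouts
    return trajectories
```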
You can see on the slide the domains where we’ve applied this: scientific reasoning, instruction following, and chat. If you look at the leaderboards, you’ll see that even with new models emerging, we’ve maintained a strong position in these areas, largely due to the effectiveness of RL in achieving top-tier accuracy. We’re optimistic that we’ll see more of this in the community, with more discussion and publication of techniques and data. We’ve started sharing some of our work in this area and will have much more to come in the next three to six months.
Jean-Marc Mommessin: You mentioned RL and instruction following, which ties back to the beginning of our conversation. It seems like you’ve come full circle here.
Joey Conway: Exactly. The exciting aspect here is automating the feedback loop wherever possible. For chat, we published a fine-tuned reward model last fall. Those who followed our work might recall that our Llama Nemotron model topped the chat leaderboards then. This was because the reward model provides an automated way to teach the original model whether its responses are good or bad. It essentially grades responses based on helpfulness, conciseness, verbosity, groundedness, and similar factors. This granular feedback per generated response allows the model to improve significantly, often more so than through supervised fine-tuning alone, which typically involves a few passes without a continuous feedback loop.
Similarly, for instruction following, we use a verifier and a dataset to teach the model whether it followed instructions well or needs to try again. We’re eager to expand this approach to more domains. We’ve already published datasets related to coding and math since the release of this model a few weeks ago, and these have become popular on Hugging Face. I anticipate significant growth in this area within the community.
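For instruction following specifically, a verifier can be as simple as a programmatic check that a formatting constraint was met. The toy example below is our own IFEval-style illustration of the kind of binary signal that can feed back into RL.

```python
import re

def verify_instruction(response: str, constraint: str) -> bool:
    """Toy verifier for a single formatting constraint, illustrating the kind
    of automated check that can grade instruction following. Real verifiers
    cover many more constraint types."""
    if constraint == "exactly_three_bullets":
        return len(re.findall(r"^- ", response, flags=re.MULTILINE)) == 3
    if constraint == "no_more_than_50_words":
        return len(response.split()) <= 50
    raise ValueError(f"unknown constraint: {constraint}")

# The verdict (True/False) becomes the reward signal for that sample.
print(verify_instruction("- a\n- b\n- c", "exactly_three_bullets"))  # True
```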
Jean-Marc Mommessin: Alright, so one of the big innovations here, and you touched upon it, but I want to emphasize it, is the ability to toggle reasoning on and off via the system prompt. This is quite unique, and I’m sure many will follow suit. Could you expand on the idea behind this, how you see it applying to agents and beyond, its value, and the key challenges in implementing it?
Joey Conway: The reasoning on and off capability was a core goal from the outset. We observed that models in the community often excelled in either reasoning or non-reasoning tasks, and we wanted to simplify deployment by having a single model that could handle both.
We had to determine the best way to teach the model when to reason and when not to, while also providing enterprises with explicit control, as they often have deeper domain knowledge than we do. The motivation behind this is that reasoning generates significantly more tokens, which can lead to higher latency and cost. While crucial for solving complex problems, it’s not always necessary. We wanted to give enterprises the control to balance accuracy with latency and cost, allowing them to decide when to employ reasoning and when to opt for faster, less computationally intensive responses.
Initially, we weren’t sure how to achieve this, as it hadn’t been widely implemented in the community. Our approach in the supervised fine-tuning stage was to explicitly teach the model by presenting the same question with two different answers: one with detailed reasoning and one without. This essentially doubled our dataset for this specific purpose. However, the outcome is a single model where users can simply include “use detailed thinking on” or “use detailed thinking off” in the prompt to control the model’s reasoning process.
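In practice, flipping the toggle is just a change to the system prompt. Here is a minimal sketch assuming the model is served behind an OpenAI-compatible endpoint (for example a local vLLM or NIM deployment); the URL and model id are placeholders, and the exact toggle string should be confirmed against the model card.

```python
from openai import OpenAI

# Assumes an OpenAI-compatible serving endpoint; URL and model id are illustrative.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def ask(question: str, reasoning: bool) -> str:
    # The system prompt toggles the model between detailed reasoning and
    # fast, direct answers, as described in the interview.
    system = "detailed thinking on" if reasoning else "detailed thinking off"
    completion = client.chat.completions.create(
        model="llama-nemotron-ultra-253b",   # placeholder model id
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": question},
        ],
    )
    return completion.choices[0].message.content

# Reasoning on for a hard problem, off for a quick factual lookup.
print(ask("Prove that the sum of two odd numbers is even.", reasoning=True))
print(ask("What is the capital of France?", reasoning=False))
```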
On the training side, this required more effort to teach the model this distinction. What we have today is essentially a v1, and I expect others will follow this approach. We’re also excited about future developments, such as time or token limits for reasoning and more granular controls. I’m optimistic that we’ll see further breakthroughs in this area within the next six to nine months, as the problem-solving power of reasoning is significant, but it comes with trade-offs that the community will continue to refine.
Jean-Marc Mommessin: We all know that the real test comes in production. Production environments are sensitive to latency, cost, and while accuracy and reasoning are vital, excessive reasoning can lead to scalability issues and increased latency. The flexibility you’ve introduced is fantastic, and I can see numerous production use cases that will greatly benefit from the ability to control reasoning on a per-query basis.
So, when you were developing this model, you aimed to balance accuracy and efficiency. Could you share some insights into how you made these trade-offs, the timeline for building the model and the team involved, and how you determined the optimal compromise between these two critical factors?
Joey Conway: Balancing accuracy and efficiency is always a challenge. Our initial goal was to achieve both, which is a difficult undertaking. We started with the “Super” model, built from the most recent Llama 70B release from Meta (Llama 3.3 70B), as our baseline for accuracy. We weren’t sure if we could simultaneously improve accuracy and reduce the model size.
We found that through our training techniques and distillation process, we could indeed boost accuracy. We even released an initial checkpoint reflecting this. However, we wanted to go further by incorporating strong reasoning capabilities, aiming for state-of-the-art reasoning scores. This is where the SFT and RL stages came in, which required significant time for synthetic data generation since this type of data didn’t exist.
During training, we carefully considered the number of epochs for each skill and continuously measured accuracy. Our goal was to improve performance across all six key areas rather than excelling in just a couple. This balancing act took more time as we experimented to find the right combinations. However, we felt it was crucial to ensure world-class performance in these six enterprise-relevant scenarios, including chat and instruction following.
For areas like MMLU, we focused on maintaining performance and preventing regression rather than actively trying to improve scores. So, there were definitely priorities and trade-offs involved. Ultimately, we believe these were the right focus areas for our enterprise customers.
Jean-Marc Mommessin: You are releasing this model family as part of the open-source community. We’ve discussed the gaps you aimed to address and the unique reasoning on/off feature for production scalability. Could you share your thoughts on how NVIDIA and your team view the role of these models within the broader open-source and LLM ecosystem, especially given your work building upon the Llama base?
Joey Conway: NVIDIA has a long history of contributing models to the open-source community. What excites us about Llama is its strong traction with enterprise customers. While NVIDIA Research publishes extensively across various domains, our goal with Llama Nemotron was to build upon Llama’s momentum in enterprise adoption by focusing narrowly on specific areas. The base Llama models already cover many things exceptionally well, so we saw an opportunity to build on top of that and be very targeted in our enhancements.
The recent LlamaCon event and Meta’s announcements sound very promising, and we’re excited about Llama 4 and the ongoing work there. Moving forward, we anticipate continuing to identify specific areas where we can add significant value, while Meta continues to build excellent general-purpose models suitable for enterprise production.
From our perspective, reasoning will likely remain a key focus, and we’re also excited about Meta’s advancements in this area. Tool calling, instruction following, and chat are also areas we’ll continue to develop. One area we’re particularly interested in exploring is multilingual capabilities. For large enterprises, supporting multiple languages is crucial. While many models handle individual languages well, we aim to focus on a few key languages and ensure world-class accuracy for reasoning, tool calling, and chat within those. This is likely the next major area of expansion for us, beyond the exciting developments in model architectures like Llama 4’s new MoE architecture, which we’re also keen to explore for potential distillation and optimization for NVIDIA GPUs. So, there’s a lot of exciting work ahead.
Jean-Marc Mommessin: When you say multilingual, are you thinking of supporting a broad range, like 50 languages, or a more focused set, perhaps around 5 or 10 initially, given the benchmark challenges you mentioned?
Joey Conway: We’ll probably start with a more focused set, perhaps around 5 to 10 languages. The challenge is that the community currently lacks comprehensive benchmarks for tasks like reasoning or tool calling across a wide variety of languages. As we develop these multilingual models, we’re also having to create evaluation data simultaneously, which takes time. If those benchmarks were readily available, the process would be smoother. However, we see this as an exciting challenge. Our initial focus will likely be on a smaller set of languages where we can establish strong performance, given the current limitations in community-wide benchmarks.
Jean-Marc Mommessin: Let’s shift gears and talk about another state-of-the-art open-source model you recently released: Parakeet TDT 0.6B V2. This model has set a new standard for automatic speech recognition, transcribing one hour of audio in just one second. That’s 50 times faster than other open-source ASR models, and remarkably, it achieves only a 6% word error rate. This is truly impressive. What else would you like to highlight about this model before we discuss the “how” behind its incredible performance?
Joey Conway: It’s worth noting that NVIDIA has been working on ASR models for a long time, even before I joined. We’ve also released many open models in this space over the years. The teams working on this are exceptional, and they consistently strive to balance accuracy with latency and throughput. Parakeet V2 is the latest in this line of high-performance models from NVIDIA.
Jean-Marc Mommessin: It sounds like the advancements will keep coming. So, let’s delve into how you achieved this remarkable performance with Parakeet TDT. What kind of architecture did you use? I understand it’s based on a Fast Conformer architecture with specific optimizations like 8x depth-wise separable convolutional downsampling and limited context attention. Could you explain how you arrived at this approach and whether these optimizations primarily enhance speed and throughput or if they also contribute to accuracy and the ability to process long audio segments like a full hour in one shot?
Joey Conway: Yes, we’ve explored various architectures for ASR over the years, and the Conformer architecture, originally from Google, has shown great promise. Our goal with Parakeet TDT was to take the Conformer architecture and make it significantly more efficient and faster without sacrificing quality.
We’ve implemented several key optimizations.
First, as you mentioned, the depth-wise separable convolution downsampling. At the input stage, we significantly downsample the audio, which reduces the computational cost and memory requirements for processing.
Second is the limited context attention. By focusing on smaller, overlapping chunks of audio, we can maintain accuracy while achieving a speedup in processing.
Third, on the encoder side, we also utilize a sliding window attention technique, which allows us to process longer audio files without having to split them into shorter segments. This is crucial for handling long-form audio like a full hour in a single pass.
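To ground the first of these optimizations, here is a minimal PyTorch sketch of a depthwise-separable downsampling block; channel counts, kernel sizes, and the 2x-per-stage factor are illustrative assumptions, not Parakeet’s exact configuration.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableDownsample(nn.Module):
    """Illustrative 2x temporal downsampling: a depthwise conv (one filter per
    channel) followed by a 1x1 pointwise conv. Stacking three such blocks
    gives an 8x reduction in the number of frames the encoder must attend over."""
    def __init__(self, channels: int):
        super().__init__()
        self.depthwise = nn.Conv1d(channels, channels, kernel_size=3,
                                   stride=2, padding=1, groups=channels)
        self.pointwise = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time) -> (batch, channels, time // 2)
        return self.pointwise(self.depthwise(x))

downsample_8x = nn.Sequential(*[DepthwiseSeparableDownsample(256) for _ in range(3)])
features = torch.randn(1, 256, 8000)       # ~80 s of 10 ms frames (illustrative)
print(downsample_8x(features).shape)       # torch.Size([1, 256, 1000])
```

For the limited-context and sliding-window attention, the core ingredient is a mask that restricts each frame to a local neighborhood, so attention cost grows linearly with audio length instead of quadratically; the window size below is illustrative.

```python
import torch

def sliding_window_mask(num_frames: int, window: int) -> torch.Tensor:
    """Boolean attention mask where each frame may only attend to frames
    within +/- `window` positions rather than the full sequence."""
    idx = torch.arange(num_frames)
    return (idx[None, :] - idx[:, None]).abs() <= window

mask = sliding_window_mask(num_frames=6, window=2)
print(mask.int())
# Each row has at most 2*window + 1 True entries, so per-frame attention
# cost stays constant no matter how long the audio is.
```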
Beyond the Conformer architecture, Parakeet TDT incorporates a Token and Duration Transducer. Traditional Recurrent Neural Network (RNN) transducer technology processes audio frame by frame. What we’ve done with TDT is enable the model to predict both the tokens and the expected duration of those tokens. This allows it to make decisions to skip over redundant frames, significantly speeding up the transcription process. This TDT innovation alone contributes to around a 1.5 to 2x speedup. So, there’s a combination of architectural choices and specific optimizations that contribute to Parakeet TDT’s impressive speed and accuracy.
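A schematic greedy-decoding loop for this token-plus-duration idea might look like the sketch below; `joint_step` is a placeholder for the real prediction and joint networks, and the loop structure is our own simplification rather than NeMo’s implementation.

```python
def tdt_greedy_decode(joint_step, encoder_frames, blank_id=0, max_symbols_per_frame=5):
    """Schematic greedy loop for a Token-and-Duration Transducer (TDT).
    `joint_step(frame, history)` is assumed to return (token, duration): the
    predicted label plus how many encoder frames to skip. A classic RNN-T
    advances one frame at a time; jumping ahead by `duration` is where the
    decoding speedup comes from."""
    tokens, t = [], 0
    while t < len(encoder_frames):
        for _ in range(max_symbols_per_frame):       # bounded emissions per frame
            token, duration = joint_step(encoder_frames[t], tokens)
            if token != blank_id:
                tokens.append(token)
            if token == blank_id or duration > 0:
                break                                # stop emitting at this frame
        t += max(duration, 1)                        # skip ahead; always make progress
    return tokens
```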
Jean-Marc Mommessin: I want to go back to one or two of those. Those are amazing, frankly. The speed increase is remarkable.
Joey Conway: Yes, and we have another technique called a label looping algorithm. Essentially, when we’re doing batch inference, this algorithm allows us to advance the tokens independently for different samples. This separation of the workflow enables us to sweep and loop over frames and labels more efficiently, significantly speeding up the decoding process.
Lastly, on the decoder side, we’ve moved some of the computation into CUDA graphs, which is a more efficient way to run many small kernels. This optimization alone provided around a 3x speed boost. So, as you can see with TDT models, we’ve been able to achieve speeds comparable to Connectionist Temporal Classification (CTC) decoders, which are also known for their speed, while maintaining high accuracy. Our overall theme is always to balance speed improvements with maintaining or even enhancing accuracy. Techniques like CTC decoders have been around for a while and are fast but might not be as accurate. It really depends on the use case, but we’re always striving for that balance.
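For the CUDA graphs point, the usual PyTorch pattern is to warm up the small decoder kernels, capture them once, and then replay the captured graph each step so launch overhead is paid only once. The tiny MLP below stands in for the real prediction and joint networks; sizes are illustrative and a CUDA GPU is required.

```python
import torch

device = "cuda"
# Stand-in for the small per-step decoder computation (illustrative sizes).
decoder_step = torch.nn.Sequential(
    torch.nn.Linear(640, 640), torch.nn.ReLU(), torch.nn.Linear(640, 1025)
).to(device).eval()

static_input = torch.zeros(16, 640, device=device)

with torch.no_grad():
    # Warm up on a side stream (required before capture), then capture once.
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side):
        for _ in range(3):
            decoder_step(static_input)
    torch.cuda.current_stream().wait_stream(side)

    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        static_output = decoder_step(static_input)

    # Each decode step: copy fresh activations into the static buffer and
    # replay the whole captured kernel sequence with a single launch.
    static_input.copy_(torch.randn(16, 640, device=device))
    graph.replay()
    print(static_output.argmax(dim=-1)[:4])
```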
Jean-Marc Mommessin: Can we revisit the limited context attention? Do you see this technique having broader applications in other areas down the line?
Joey Conway: Yes, I believe so. Patterns like the sliding window attention are already used in other areas, such as LLMs. Our research teams are constantly experimenting, looking at successful techniques from different domains, and trying to apply them in new ways. Interestingly, some of the researchers who worked on Parakeet TDT also work on Llama Nemotron, so there’s a cross-pollination of ideas. I do expect that some of these techniques will find broader applications going forward. We also anticipate further improvements to TDT and the Conformer architecture, as we’ve been working on them for several years now. I don’t see these core technologies going away anytime soon; we’ll likely continue to refine them.
Jean-Marc Mommessin: Leaving the TDT aside, do you see other potential applications for the Token and Duration Transducer concept in other domains?
Joey Conway: That’s a good question. I’m not immediately seeing a direct application of the TDT concept outside of ASR. Its history is rooted in RNNs and RNN transducers, which have primarily been used in speech recognition. However, some of the underlying techniques we’ve applied to it, like using CUDA graphs for optimizing kernel execution, are general techniques that we use whenever we identify bottlenecks in a model’s pipeline. So, while the TDT itself might be domain-specific, some of the optimization strategies we’ve employed could certainly translate to other areas, including large language models.
Jean-Marc Mommessin: Let’s talk about data. AI data is always a key topic. How do you ensure that the data used to train Parakeet TDT is diverse enough to handle various accents, dialects, vocal ranges, pitches, and noisy background conditions, which often negatively impact ASR performance?
Joey Conway: You’re absolutely right. As humans, we naturally filter out accents and background noise to understand speech. However, deep learning models are only as good as the data they’re trained on. Early on, limited data for specific accents or languages resulted in poor performance for those variations. What might have initially seemed like edge cases have become increasingly common, highlighting the need for more representative data.
We’ve invested significant effort in curating our datasets to reflect this real-world diversity. We use techniques like classifiers to analyze our data and understand the distributions of accents, dialects, and acoustic conditions. We’ve worked with customers like YUM! Brands, who have drive-through use cases with significant highway noise, illustrating the importance of training the model to handle such challenging environments. Ensuring the right blend and distribution of these conditions in our training data is crucial for the model’s robustness.
I’m also excited to announce that we plan to open-source a substantial speech dataset, around 100,000 hours, where we’ve meticulously performed this kind of curation. This dataset will include variations in sound levels, signal-to-noise ratios, background noise types, and even telephone audio formats relevant for call centers. Our goal is to provide the community with high-quality, diverse data that enables models to perform well across a wide range of real-world scenarios.
Jean-Marc Mommessin: That’s fantastic news about the open-sourcing of the speech dataset! My final question regarding the Parakeet family: you currently have the 600 million and 1.1 billion parameter models. How do you envision future development for this family? What are the potential directions?
Joey Conway: We’re considering development along two main dimensions: model size and the number of supported languages. In terms of size, we’ve released models at the smaller and mid-range to demonstrate the potential, similar to our approach with Llama Nemotron Super. We plan to explore larger models, potentially around 2 billion parameters, which we anticipate will handle even more languages and dialects.
On the smaller end, we’re even considering models down to around 50 million parameters. The motivation here is to address use cases at the edge where a smaller footprint is necessary, such as enabling real-time audio processing for robots in noisy environments. We’ll be exploring the right trade-offs for such applications.
Technologically, we plan to work on streaming capabilities for TDT. Currently, much of the processing is done in an offline batch mode, but we want to enable real-time, live transcription. And as mentioned, we’re excited about releasing the large, curated speech dataset.
Finally, for those looking to deploy these models in production, we recommend exploring techniques like word boosting, which allows for customization of text normalization to include domain-specific terms and acronyms. We aim to provide a wide range of options for users to get started and tailor the models to their specific needs.
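As a toy illustration of the word-boosting idea (our own sketch, not NeMo’s actual API), hypotheses that contain domain-specific terms can receive a score bonus during decoding so they survive beam pruning:

```python
# Toy word boosting: hypotheses containing domain terms get a score bonus.
# The terms, weights, and interface are illustrative placeholders.
BOOSTED_TERMS = {"nemotron": 2.0, "parakeet": 2.0, "tdt": 1.5}

def boosted_score(hypothesis_text: str, base_log_prob: float) -> float:
    bonus = sum(weight for term, weight in BOOSTED_TERMS.items()
                if term in hypothesis_text.lower())
    return base_log_prob + bonus

# A hypothesis mentioning "Parakeet" now outranks an otherwise equal one.
print(boosted_score("the parakeet tdt model", base_log_prob=-12.0))  # -8.5
```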
Jean-Marc Mommessin: I’m very familiar with the NVIDIA Orin platform. Would these Parakeet models currently run on NVIDIA Orin?
Joey Conway: Yes, I believe the 0.6 billion parameter model likely would run on Orin. I would need to double-check the exact specifications, but I’m quite confident it’s feasible.
Jean-Marc Mommessin: Orin packs a significant punch. I especially love the robotics use case you mentioned. While there’s been a lot of focus on robot vision, the ability to hear and understand quickly is equally crucial, especially for safety. A model that’s 50 times faster and highly accurate in understanding another modality seems like a perfect fit for robotics.
Joey Conway: Yes, and the slight hesitation I had earlier was due to the understanding that in robotics, there are often multiple models running simultaneously, including vision models. So, resource allocation is a consideration. However, our push towards smaller, more efficient models is precisely to address these kinds of multi-modal edge computing scenarios. The low latency and real-time processing capabilities of Parakeet are indeed very beneficial for enabling robots to react quickly and safely to auditory cues.
Jean-Marc Mommessin: Anything else you’d like to add as a final thought on the Llama Nemotron Ultra and Parakeet families? They’re both open-source, fast, high-throughput, cost-efficient, and run on smaller footprints – are these the key takeaways?
Joey Conway: Yes, that’s a great summary. Those were the core objectives we set out to achieve. We aimed for state-of-the-art accuracy, optimized footprints for efficient GPU utilization in terms of latency and throughput, and a commitment to open-sourcing everything to empower the community. We’ve strived to be as community-friendly as possible by releasing datasets, using permissive licenses, and making it easy for people to experiment. We’re eager to see the community’s feedback and the innovative applications they build upon our work. We’re also looking forward to learning from their experiences.
Jean-Marc Mommessin: Where are all these models and datasets available?
Joey Conway: Everything we’ve published is on Hugging Face – the models and the datasets. The software stack to run them comes from NVIDIA and is available on NGC, our content repository. Much of the underlying software is also open-source and can be found on GitHub. We also provide pip wheels for easier installation. The Nemo framework is the central hub for much of this software stack, whether you want to run the models or fine-tune them.
We’ve tried to make it as user-friendly as possible. We use the same software internally to build the models, so it should be relatively straightforward for others to pick up and deploy as well.
Jean-Marc Mommessin: Well, Joey, this has been fantastic. I’m continually impressed by NVIDIA’s commitment to giving back to the community with state-of-the-art models that will undoubtedly find their way into production. Thank you so much for your time and insights. I look forward to our next conversation.
Joey Conway: Thank you, Jean-Marc. It was my pleasure, and we appreciate the opportunity.
Jean-Marc Mommessin is a successful AI business executive. He leads and accelerates growth for AI-powered solutions and started a computer vision company in 2006. He is a recognized speaker at AI conferences and has an MBA from Stanford.
#exclusive #talk #joey #conway #nvidia
Exclusive Talk: Joey Conway of NVIDIA on Llama Nemotron Ultra and Open Source Models
Today, MarkTechPost had the pleasure of interviewing Joey Conway from NVIDIA to discuss their exciting work on open-source large language models, including Llama Nemotron Ultra & Parakeet.
Highlights from the interview:
NVIDIA’s Open Source Powerhouse: Discover how NVIDIA is pushing the boundaries of open-source AI with the release of cutting-edge models like Llama Nemotron Ultra and Parakeet TDT.
Llama Nemotron Ultra: Smaller Size, Giant Performance: Learn how NVIDIA achieved on-par performance with models twice the size, enabling deployment on a single GPU node. Explore their innovative FFN fusion technique for significant speedups.
Reasoning on Demand: Uncover the unique “reasoning on/off” feature in Llama Nemotron Ultra, offering unprecedented control for production deployments and cost optimization.
Revolutionary Speech Recognition with Parakeet TDT: Dive into NVIDIA’s state-of-the-art ASR model that transcribes one hour of audio in one second with only a 6% word error rate – 50 times faster than other open-source alternatives!
The “How”: Architectural Innovations: Get insights into the advanced architectures and optimizations behind these models, including FFN fusion, limited context attention, and the Token Duration Transducer
Democratizing AI with Open Data: Learn about NVIDIA’s commitment to the open-source community through the release of model weights and massive, high-quality datasets for both language and speech.
Future Directions: Get a sneak peek into NVIDIA’s plans for multilingual support, even smaller edge-optimized models, and advancements in real-time streaming for speech recognition.
Production-Ready AI: Understand how these models are designed with real-world deployment challenges in mind, focusing on accuracy, efficiency, and cost-effectiveness.
Jean-Marc Mommessin: Joey, welcome to Marketechpost! We’re thrilled to have you here and to delve into the impressive open-source models NVIDIA has been releasing. To start, could you please introduce yourself and your role at NVIDIA?
Joey Conway: Hi Jean-Marc, it’s great to be here. I’m Joey Conway, and I work in product management for some of the deep learning software at NVIDIA. Our team focuses on large language models like Nemotron and Llama Nemotron, as well as text-to-speech models such as Parakeet.
Jean-Marc Mommessin: Wonderful. And you’ve been at NVIDIA for over seven years now, witnessing significant waves of innovation in AI. Let’s talk about your recent release, Llama Nemotron Ultra, a 253 billion parameter model. From what we’ve seen, it delivers performance on par with models like Llama 405B and DeepSeek R1, which are about twice its size. Remarkably, it can run on a single 8x H100 node. What else can you tell us about Llama Nemotron Ultra and what makes it so impressive?
Joey Conway: We’re big believers in the open-source community and the fantastic work being done there. With Llama Nemotron, our goal was to build upon the existing foundations, particularly Llama, for which we greatly appreciate Meta’s contributions. We also observed significant progress in reasoning within the open community earlier this year. Inspired by this, we wanted to contribute and see how we could enhance Llama, especially for enterprise use cases.
Our focus was primarily on improving reasoning capabilities and agentic tasks like tool calling and chat. We aimed to take the strengths of the open-source community, enhance them, and then contribute those improvements back.
Jean-Marc Mommessin: Did you identify specific gaps in existing models that you aimed to address? You mentioned reasoning, but could you provide an example or two of enterprise agentic tasks where you felt there were shortcomings that Llama Nemotron Ultra overcomes?
Joey Conway : Yes, I think looking back to the beginning of the year, a key challenge in enterprise deployments was handling complex queries requiring significant thought and reflection. These could be multi-step processes or involve substantial calculations and the use of external tools. At that time, there weren’t many strong open-weight models capable of robust reasoning. The progress we’ve seen in the last few months in this area is very encouraging.
Another critical aspect for enterprises is the ability to accurately call APIs and closely follow instructions in user queries. We wanted to ensure that while we focused on improving reasoning, we didn’t compromise these essential production-level capabilities.
Furthermore, we often noticed that when both reasoning and instruction following were well-addressed, they typically resided in separate models. Our aim was to simplify this by creating a single model that excels in both. This was the landscape we observed when we started this project around January and February.
Jean-Marc Mommessin: That makes perfect sense and aligns with what we’re seeing in the industry as well. Now, let’s dive into the “how.” Your paper mentions FFN fusion as a key optimization. Could you elaborate on this technique, starting with a high-level explanation?
Joey Conway: Absolutely. Our focus on optimization stemmed from the realization that deploying state-of-the-art models often requires a significant deployment footprint. We wanted to optimize this to fit within more common GPU setups.
We explored various techniques, including our Puzzle neural architecture search. For dense transformer models, particularly those in the Llama family, we discovered a way to reduce or eliminate redundant attention layers. This process aligned the feed-forward networklayers in a sequence, allowing us to explore fusion methods.
Our fundamental goal on the GPU is to maximize parallel execution. Fusing these aligned FFN layers enables greater parallel computation than was previously possible. By removing redundant layers, we found opportunities to essentially merge or fuse the remaining ones. This is a key example of how we tackle the challenges of running these models at scale. Importantly, this technique often yields greater improvements with larger models, which was beneficial for our Ultra model based on Meta’s Llama 3.1 -405B.
Jean-Marc Mommessin: And this FFN fusion significantly improves the model’s throughput, achieving notable speedups. If I recall correctly, it’s in the range of 3 to 5x for the Ultra model?
Joey Conway: That’s right, the speedups for the Ultra model are in that range. Additionally, by reducing the model’s size in terms of weights, we also lowered its memory footprint. This allowed us to utilize a larger KV cache. For Llama Nemotron Ultra, we could fit it onto a 8x H100 80GB setup, which is quite significant as it fits within common node configurations. So, FFN fusion provided both a substantial compute speedup and a reduction in memory usage, enabling us to handle larger context lengths. These are very exciting outcomes for us.
Jean-Marc Mommessin: Let’s switch gears to data curation. AI data is crucial, and your training pipeline seems very sophisticated. You touched on “instruction following” earlier. Could you elaborate on your data curation process and how you ensured high-quality data, especially considering you leveraged other models in the process?
Image source: NVIDIA
Joey Conway: Transparency and openness were key in our approach. We wanted to share as much as possible about our data, techniques, and tooling so the community could understand and even use it themselves. Our primary goal with data curation was to improve accuracy across several key domains, including reasoning tasks like math and coding, as well as non-reasoning tasks like tool calling, instruction following, and chat.
Our strategy involved curating specific datasets to enhance performance in these areas. Within our supervised fine-tuning process, we differentiated between “reasoning on” and “reasoning off” scenarios. For example, in math and coding, we curated data for simple questions that don’t require complex reasoning, as well as more intricate problems that do. This helps the model learn when and how to apply reasoning.
A key part of this process was leveraging high-quality models from the community as “experts” in specific domains. For instance, we used DeepSeek R-1 extensively for reasoning-intensive math and coding tasks. For non-reasoning tasks like basic math, coding, chat, and tool calling, we utilized models like Llama and Qwen. Our aim was to blend the best capabilities of these community models into a single model.
We’ve also made this curated dataset publicly available on Hugging Face, with around 30 million question-answer pairs. This allows the community to explore, use, and build upon our work. We were also excited to see our partner ServiceNow recently announce their apprehend Nemotron model, which was trained using our dataset to enhance their own reasoning capabilities.
Jean-Marc Mommessin: That’s fantastic that you’re sharing the dataset. Given that you used other models to generate some of this data, what kind of quality checks did you implement to ensure the reliability of the training pairs?
Joey Conway: Data quality was absolutely paramount. Since we were generating a significant portion of the data using other models, we implemented a rigorous multi-layered quality assurance process.
First, for each expert model used to generate data in a specific domain, we would generate multiple candidate responses for the same prompt. Then, we employed a separate set of “critic” models to evaluate these candidates based on correctness, coherence, and adherence to the prompt.
Second, we implemented a scoring mechanism. Each generated question-answer pair received a quality score based on the critic model’s evaluation. We set a high threshold, and any pair that didn’t meet this standard was discarded.
Third, human review was integrated at various stages. Our team of data scientists and engineers manually inspected samples of the generated data to identify any systematic errors, biases, or instances of hallucination. This human oversight was crucial for catching nuances that automated systems might miss.
Fourth, we focused on the diversity of the generated data. We wanted to ensure we weren’t just getting variations of the same types of questions and answers. We implemented strategies to encourage the expert models to generate a broad range of examples within each domain.
Finally, after training Llama Nemotron Ultra on this curated data, we conducted extensive evaluations against benchmark datasets and in real-world use cases. This feedback loop helped us further refine our data generation and filtering techniques.
So, it was a comprehensive approach involving expert generation, automated criticism and scoring, human review, diversity checks, and rigorous downstream evaluation to ensure the high quality of our training data.
Jean-Marc Mommessin: The quality of the synthetic data is so important. Could you elaborate on the stages you take to ensure high accuracy when generating this data?
Joey Conway: Absolutely. When doing synthetic data generation, there are a few key stages to ensure high accuracy. The first is the prompts – the seed data and how we prompt the model. The second is the quality of the responses.
On the prompting side, we focus on prompting models where we believe they excel. For example, we might use Llama for chat-related prompts but avoid using a non-reasoning model for math. It’s crucial to align the prompts with the core strengths of the model.
For vetting the responses, we invest time in both human manual review and automated methods. Going forward, we anticipate increasing our use of verifiers and reward models, similar to what we’ve done on the Reinforcement Learningside.
The reason we’ve open-sourced much of this is that there’s a lot of nuance involved, and we wanted the community to engage with these challenges. Enterprises like ServiceNow have specific goals, and some of our data might be more or less useful to them. By making it available, they can vet it themselves. We also provide tools like classifier models to help categorize content, such as news or sports, allowing users to make informed decisions about the data blends they use for training.
Jean-Marc Mommessin: Perfect. Is there anything else you’d like to highlight regarding this pipeline?
Joey Conway: Yes, I’d like to touch on the Reinforcement Learningaspect. Following the supervised fine-tuning stage, where we enhanced core skills, we’ve just begun to explore the potential of RL with Nemotron. We believe this will be a significant area of future development.
What’s exciting about RL is that its effectiveness is largely tied to the available compute time. The more time we invest, the better the model becomes at specific tasks. In our RL stages, we’ve developed methods to automate the process of asking the model a question, grading its answer, and providing feedback to allow it to learn and improve.
You can see on the slide the domains where we’ve applied this: scientific reasoning, instruction following, and chat. If you look at the leaderboards, you’ll see that even with new models emerging, we’ve maintained a strong position in these areas, largely due to the effectiveness of RL in achieving top-tier accuracy. We’re optimistic that we’ll see more of this in the community, with more discussion and publication of techniques and data. We’ve started sharing some of our work in this area and will have much more to come in the next three to six months.
Jean-Marc Mommessin: You mentioned RL and instruction following, which ties back to the beginning of our conversation. It seems like you’ve come full circle here.
Joey Conway: Exactly. The exciting aspect here is automating the feedback loop wherever possible. For chat, we published a fine-tuned reward model last fall. Those who followed our work might recall that our Llama Nemotron model topped the chat leaderboards then. This was because the reward model provides an automated way to teach the original model whether its responses are good or bad. It essentially grades responses based on helpfulness, conciseness, verbosity, groundedness, and similar factors. This granular feedback per generated response allows the model to improve significantly, often more so than through supervised fine-tuning alone, which typically involves a few passes without a continuous feedback loop.
Similarly, for instruction following, we use a verifier and a dataset to teach the model whether it followed instructions well or needs to try again. We’re eager to expand this approach to more domains. We’ve already published datasets related to coding and math since the release of this model a few weeks ago, and these have become popular on Hugging Face. I anticipate significant growth in this area within the community.
Jean-Marc Mommessin: Alright, so one of the big innovations here, and you touched upon it, but I want to emphasize it, is the ability to toggle reasoning on and off via the system prompt. This is quite unique, and I’m sure many will follow suit. Could you expand on the idea behind this, how you see it applying to agents and beyond, its value, and the key challenges in implementing it?
Joey Conway: The reasoning on and off capability was a core goal from the outset. We observed that models in the community often excelled in either reasoning or non-reasoning tasks, and we wanted to simplify deployment by having a single model that could handle both.
We had to determine the best way to teach the model when to reason and when not to, while also providing enterprises with explicit control, as they often have deeper domain knowledge than we do. The motivation behind this is that reasoning generates significantly more tokens, which can lead to higher latency and cost. While crucial for solving complex problems, it’s not always necessary. We wanted to give enterprises the control to balance accuracy with latency and cost, allowing them to decide when to employ reasoning and when to opt for faster, less computationally intensive responses.
Initially, we weren’t sure how to achieve this, as it hadn’t been widely implemented in the community. Our approach in the supervised fine-tuning stage was to explicitly teach the model by presenting the same question with two different answers: one with detailed reasoning and one without. This essentially doubled our dataset for this specific purpose. However, the outcome is a single model where users can simply include “use detailed thinking on” or “use detailed thinking off” in the prompt to control the model’s reasoning process.
On the training side, this required more effort to teach the model this distinction. What we have today is essentially a v1, and I expect others will follow this approach. We’re also excited about future developments, such as time or token limits for reasoning and more granular controls. I’m optimistic that we’ll see further breakthroughs in this area within the next six to nine months, as the problem-solving power of reasoning is significant, but it comes with trade-offs that the community will continue to refine.
Jean-Marc Mommessin: We all know that the real test comes in production. Production environments are sensitive to latency, cost, and while accuracy and reasoning are vital, excessive reasoning can lead to scalability issues and increased latency. The flexibility you’ve introduced is fantastic, and I can see numerous production use cases that will greatly benefit from the ability to control reasoning on a per-query basis.
So, when you were developing this model, you aimed to balance accuracy and efficiency. Could you share some insights into how you made these trade-offs, the timeline for building the model and the team involved, and how you determined the optimal compromise between these two critical factors?
Joey Conway: Balancing accuracy and efficiency is always a challenge. Our initial goal was to achieve both, which is a difficult undertaking. We started with the “Super” model, which was the most recent Llama 3.1 70B release from Meta, as our baseline for accuracy. We weren’t sure if we could simultaneously improve accuracy and reduce the model size.
We found that through our training techniques and distillation process, we could indeed boost accuracy. We even released an initial checkpoint reflecting this. However, we wanted to go further by incorporating strong reasoning capabilities, aiming for state-of-the-art reasoning scores. This is where the SFT and RL stages came in, which required significant time for synthetic data generation since this type of data didn’t exist.
During training, we carefully considered the number of epochs for each skill and continuously measured accuracy. Our goal was to improve performance across all six key areas rather than excelling in just a couple. This balancing act took more time as we experimented to find the right combinations. However, we felt it was crucial to ensure world-class performance in these six enterprise-relevant scenarios, including chat and instruction following.
For areas like MMLU, we focused on maintaining performance and preventing regression rather than actively trying to improve scores. So, there were definitely priorities and trade-offs involved. Ultimately, we believe these were the right focus areas for our enterprise customers.
Jean-Marc Mommessin: You are releasing this model family as part of the open-source community. We’ve discussed the gaps you aimed to address and the unique reasoning on/off feature for production scalability. Could you share your thoughts on how NVIDIA and your team view the role of these models within the broader open-source and LLM ecosystem, especially given your work building upon the Llama base?
Joey Conway: NVIDIA has a long history of contributing models to the open-source community. What excites us about Llama is its strong traction with enterprise customers. While NVIDIA Research publishes extensively across various domains, our goal with Llama Nemotron was to build upon Llama’s momentum in enterprise adoption by focusing narrowly on specific areas. The base Llama models already cover many things exceptionally well, so we saw an opportunity to build on top of that and be very targeted in our enhancements.
The recent LlamaCon event and Meta’s announcements sound very promising, and we’re excited about Llama 4 and the ongoing work there. Moving forward, we anticipate continuing to identify specific areas where we can add significant value, while Meta continues to build excellent general-purpose models suitable for enterprise production.
From our perspective, reasoning will likely remain a key focus, and we’re also excited about Meta’s advancements in this area. Tool calling, instruction following, and chat are also areas we’ll continue to develop. One area we’re particularly interested in exploring is multilingual capabilities. For large enterprises, supporting multiple languages is crucial. While many models handle individual languages well, we aim to focus on a few key languages and ensure world-class accuracy for reasoning, tool calling, and chat within those. This is likely the next major area of expansion for us, beyond the exciting developments in model architectures like Llama 4’s new MoE architecture, which we’re also keen to explore for potential distillation and optimization for NVIDIA GPUs. So, there’s a lot of exciting work ahead.
Jean-Marc Mommessin: When you say multilingual, are you thinking of supporting a broad range, like 50 languages, or a more focused set, perhaps around 5 or 10 initially, given the benchmark challenges you mentioned?
Joey Conway: We’ll probably start with a more focused set, perhaps around 5 to 10 languages. The challenge is that the community currently lacks comprehensive benchmarks for tasks like reasoning or tool calling across a wide variety of languages. As we develop these multilingual models, we’re also having to create evaluation data simultaneously, which takes time. If those benchmarks were readily available, the process would be smoother. However, we see this as an exciting challenge. Our initial focus will likely be on a smaller set of languages where we can establish strong performance, given the current limitations in community-wide benchmarks.
Jean-Marc Mommessin: Let’s shift gears and talk about another state-of-the-art open-source model you recently released: Parakeet TDT 0.6B V2, a 600-million-parameter model. It has set a new standard for automatic speech recognition, transcribing one hour of audio in just one second. That’s 50 times faster than other open-source ASR models, and remarkably, it achieves only a 6% word error rate. This is truly impressive. What else would you like to highlight about this model before we discuss the “how” behind its incredible performance?
Joey Conway: It’s worth noting that NVIDIA has been working on ASR models for a long time, even before I joined. We’ve also released many open models in this space over the years. The teams working on this are exceptional, and they consistently strive to balance accuracy with latency and throughput. Parakeet V2 is the latest in this line of high-performance models from NVIDIA.
Jean-Marc Mommessin: It sounds like the advancements will keep coming. So, let’s delve into how you achieved this remarkable performance with Parakeet TDT. What kind of architecture did you use? I understand it’s based on a Fast Conformer architecture with specific optimizations like 8x depth-wise separable convolutional downsampling and limited context attention. Could you explain how you arrived at this approach and whether these optimizations primarily enhance speed and throughput or if they also contribute to accuracy and the ability to process long audio segments like a full hour in one shot?
Joey Conway: Yes, we’ve explored various architectures for ASR over the years, and the Conformer architecture, originally from Google, has shown great promise. Our goal with Parakeet TDT was to take the Conformer architecture and make it significantly more efficient and faster without sacrificing quality.
We’ve implemented several key optimizations.
First, as you mentioned, the depth-wise separable convolution downsampling. At the input stage, we significantly downsample the audio, which reduces the computational cost and memory requirements for processing.
Second is the limited context attention. By focusing on smaller, overlapping chunks of audio, we can maintain accuracy while achieving a speedup in processing.
Third, on the encoder side, we also utilize a sliding window attention technique, which allows us to process longer audio files without having to split them into shorter segments. This is crucial for handling long-form audio like a full hour in a single pass.
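The three encoder-side ideas above can be sketched compactly. The following is a toy PyTorch illustration, with layer sizes chosen for readability rather than matching the Fast Conformer’s actual configuration: a depthwise-separable convolutional stack that downsamples the input 8x in time, plus a helper that builds the local (sliding-window) mask used for limited context attention.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableSubsampler(nn.Module):
    """Toy 8x temporal downsampler built from depthwise-separable convolutions.
    Sizes are illustrative, not the Fast Conformer's exact configuration."""
    def __init__(self, feat_in=80, d_model=512):
        super().__init__()
        layers, ch = [], 1
        for _ in range(3):  # three stride-2 stages -> 8x downsampling in time
            layers += [
                nn.Conv2d(ch, ch, kernel_size=3, stride=2, padding=1, groups=ch),  # depthwise
                nn.Conv2d(ch, d_model, kernel_size=1),                             # pointwise
                nn.ReLU(),
            ]
            ch = d_model
        self.conv = nn.Sequential(*layers)
        self.proj = nn.Linear(d_model * ((feat_in + 7) // 8), d_model)

    def forward(self, feats):                     # feats: (batch, time, feat_in)
        x = self.conv(feats.unsqueeze(1))         # -> (batch, d_model, time/8, feat_in/8)
        b, c, t, f = x.shape
        return self.proj(x.permute(0, 2, 1, 3).reshape(b, t, c * f))


def limited_context_mask(num_frames, left=64, right=64):
    """Boolean mask where each frame may attend only to a local window of frames."""
    idx = torch.arange(num_frames)
    rel = idx[None, :] - idx[:, None]             # relative offset between frames
    return (rel >= -left) & (rel <= right)        # True = attend, False = masked out


# Quick shape check on ~10 seconds of 80-dim log-mel features.
frames = DepthwiseSeparableSubsampler()(torch.randn(2, 1000, 80))
print(frames.shape)                               # torch.Size([2, 125, 512])
mask = limited_context_mask(frames.size(1))
```

In practice the mask (or an equivalent chunked attention kernel) is applied inside each self-attention layer, so memory grows with the window size rather than with the full length of the audio.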
Beyond the Conformer architecture, Parakeet TDT incorporates a Token and Duration Transducer. Traditional Recurrent Neural Network (RNN) transducer technology processes audio frame by frame. What we’ve done with TDT is enable the model to predict both the tokens and the expected duration of those tokens. This allows it to make decisions to skip over redundant frames, significantly speeding up the transcription process. This TDT innovation alone contributes to around a 1.5 to 2x speedup. So, there’s a combination of architectural choices and specific optimizations that contribute to Parakeet TDT’s impressive speed and accuracy.
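To illustrate the frame-skipping idea, here is a toy greedy decoding loop in the TDT spirit: at each step a joint network (a stand-in callable here, not NeMo’s actual API) scores a token and a duration, and the predicted duration tells the decoder how many encoder frames to jump ahead. Forcing the loop to always advance at least one frame is a simplification so this sketch terminates.

```python
import torch

@torch.no_grad()
def tdt_greedy_decode(joint, encoder_frames, blank_id=0, durations=(0, 1, 2, 3, 4)):
    """Toy TDT-style greedy decode. `joint(frame, prev_token)` is assumed to
    return (token_logits, duration_logits); the duration head decides how many
    encoder frames to skip before the next prediction."""
    t, prev, tokens = 0, blank_id, []
    num_frames = encoder_frames.size(0)
    while t < num_frames:
        token_logits, dur_logits = joint(encoder_frames[t], prev)
        token = int(token_logits.argmax())
        skip = durations[int(dur_logits.argmax())]
        if token != blank_id:
            tokens.append(token)
            prev = token
        t += max(skip, 1)   # simplification: always move forward so the loop ends
    return tokens
```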
Jean-Marc Mommessin: I want to go back to one or two of those. Those are amazing, frankly. The speed increase is remarkable.
Joey Conway: Yes, and we have another technique called a label looping algorithm. Essentially, when we’re doing batch inference, this algorithm allows us to advance the tokens independently for different samples. This separation of the workflow enables us to sweep and loop over frames and labels more efficiently, significantly speeding up the decoding process.
Lastly, on the decoder side, we’ve moved some of the computation into CUDA graphs, which is a more efficient way to run many small kernels. This optimization alone provided around a 3x speed boost. So, as you can see with TDT models, we’ve been able to achieve speeds comparable to Connectionist Temporal Classification (CTC) decoders, which are also known for their speed, while maintaining high accuracy. Our overall theme is always to balance speed improvements with maintaining or even enhancing accuracy. Techniques like CTC decoders have been around for a while and are fast but might not be as accurate. It really depends on the use case, but we’re always striving for that balance.
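The CUDA graphs trick is a general PyTorch pattern rather than anything Parakeet-specific: capture the many small kernels of one fixed-shape decoder step into a single graph, then replay the whole graph each step so kernel-launch overhead stops dominating. A minimal sketch, assuming a fixed-shape `step_fn` (the actual NeMo decoder integration is more involved):

```python
import torch

def make_graphed_step(step_fn, example_input):
    """Capture `step_fn` once with CUDA Graphs, then replay it cheaply per step."""
    assert torch.cuda.is_available()
    static_in = example_input.clone().cuda()

    # Warm up on a side stream so capture sees steady-state allocations.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            static_out = step_fn(static_in)
    torch.cuda.current_stream().wait_stream(s)

    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        static_out = step_fn(static_in)

    def run(new_input):
        static_in.copy_(new_input)   # write into the captured input buffer
        graph.replay()               # relaunch all captured kernels at once
        return static_out

    return run
```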
Jean-Marc Mommessin: Can we revisit the limited context attention? Do you see this technique having broader applications in other areas down the line?
Joey Conway: Yes, I believe so. Patterns like the sliding window attention are already used in other areas, such as LLMs. Our research teams are constantly experimenting, looking at successful techniques from different domains, and trying to apply them in new ways. Interestingly, some of the researchers who worked on Parakeet TDT also work on Llama Nemotron, so there’s a cross-pollination of ideas. I do expect that some of these techniques will find broader applications going forward. We also anticipate further improvements to TDT and the Conformer architecture, as we’ve been working on them for several years now. I don’t see these core technologies going away anytime soon; we’ll likely continue to refine them.
Jean-Marc Mommessin: Leaving the TDT aside, do you see other potential applications for the Token and Duration Transducer concept in other domains?
Joey Conway: That’s a good question. I’m not immediately seeing a direct application of the TDT concept outside of ASR. Its history is rooted in RNNs and RNN transducers, which have primarily been used in speech recognition. However, some of the underlying techniques we’ve applied to it, like using CUDA graphs for optimizing kernel execution, are general techniques that we use whenever we identify bottlenecks in a model’s pipeline. So, while the TDT itself might be domain-specific, some of the optimization strategies we’ve employed could certainly translate to other areas, including large language models.
Jean-Marc Mommessin: Let’s talk about data. AI data is always a key topic. How do you ensure that the data used to train Parakeet TDT is diverse enough to handle various accents, dialects, vocal ranges, pitches, and noisy background conditions, which often negatively impact ASR performance?
Joey Conway: You’re absolutely right. As humans, we naturally filter out accents and background noise to understand speech. However, deep learning models are only as good as the data they’re trained on. Early on, limited data for specific accents or languages resulted in poor performance for those variations. What might have initially seemed like edge cases have become increasingly common, highlighting the need for more representative data.
We’ve invested significant effort in curating our datasets to reflect this real-world diversity. We use techniques like classifiers to analyze our data and understand the distributions of accents, dialects, and acoustic conditions. We’ve worked with customers like YUM! Brands, who have drive-through use cases with significant highway noise, illustrating the importance of training the model to handle such challenging environments. Ensuring the right blend and distribution of these conditions in our training data is crucial for the model’s robustness.
I’m also excited to announce that we plan to open-source a substantial speech dataset, around 100,000 hours, where we’ve meticulously performed this kind of curation. This dataset will include variations in sound levels, signal-to-noise ratios, background noise types, and even telephone audio formats relevant for call centers. Our goal is to provide the community with high-quality, diverse data that enables models to perform well across a wide range of real-world scenarios.
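As a concrete, deliberately crude example of the kind of curation pass described here, the sketch below estimates a rough per-clip signal-to-noise ratio and buckets a corpus by it, so you can check whether both quiet-room and noisy drive-through-style audio are represented. The energy-percentile SNR proxy and the `soundfile` dependency are assumptions for illustration, not NVIDIA’s pipeline.

```python
import numpy as np
import soundfile as sf   # assumed dependency for reading WAV files

def rough_snr_db(path, frame_ms=25):
    """Crude energy-based SNR proxy: treat the quietest frames as the noise floor."""
    audio, sr = sf.read(path)
    if audio.ndim > 1:
        audio = audio.mean(axis=1)            # downmix to mono
    hop = int(sr * frame_ms / 1000)
    frames = [audio[i:i + hop] for i in range(0, len(audio) - hop, hop)]
    energies = np.array([float(np.mean(f ** 2)) + 1e-12 for f in frames])
    noise = np.percentile(energies, 10)       # assume quietest 10% is background
    speech = np.percentile(energies, 90)      # loudest frames approximate speech
    return 10 * np.log10(speech / noise)

def bucket_by_snr(paths, edges=(0, 10, 20, 30)):
    """Histogram clips into SNR buckets to inspect the dataset's noise mix."""
    buckets = {e: [] for e in edges}
    for p in paths:
        snr = rough_snr_db(p)
        key = max((e for e in edges if snr >= e), default=edges[0])
        buckets[key].append(p)
    return buckets
```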
Jean-Marc Mommessin: That’s fantastic news about the open-sourcing of the speech dataset! My final question regarding the Parakeet family: you currently have the 600 million and 1.1 billion parameter models. How do you envision future development for this family? What are the potential directions?
Joey Conway: We’re considering development along two main dimensions: model size and the number of supported languages. In terms of size, we’ve released models at the smaller and mid-range to demonstrate the potential, similar to our approach with Llama Nemotron Super. We plan to explore larger models, potentially around 2 billion parameters, which we anticipate will handle even more languages and dialects.
On the smaller end, we’re even considering models down to around 50 million parameters. The motivation here is to address use cases at the edge where a smaller footprint is necessary, such as enabling real-time audio processing for robots in noisy environments. We’ll be exploring the right trade-offs for such applications.
Technologically, we plan to work on streaming capabilities for TDT. Currently, much of the processing is done in an offline batch mode, but we want to enable real-time, live transcription. And as mentioned, we’re excited about releasing the large, curated speech dataset.
Finally, for those looking to deploy these models in production, we recommend exploring techniques like word boosting, which allows for customization of text normalization to include domain-specific terms and acronyms. We aim to provide a wide range of options for users to get started and tailor the models to their specific needs.
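Word boosting itself is exposed through NVIDIA’s inference stack, but the underlying idea is easy to show with a toy re-ranking pass: give each candidate transcript a score bonus for every boosted domain term it contains, then pick the best. This is an illustration of the concept, not the actual NeMo/Riva word-boosting API, and the menu-item example is hypothetical.

```python
def boost_hypotheses(nbest, boosted_terms, bonus=1.5):
    """Toy shallow rescoring: reward hypotheses that contain boosted domain terms."""
    rescored = []
    for text, score in nbest:          # nbest: list of (transcript, log-prob score)
        hits = sum(term.lower() in text.lower() for term in boosted_terms)
        rescored.append((text, score + bonus * hits))
    return sorted(rescored, key=lambda x: x[1], reverse=True)

nbest = [("order a crunch rap supreme", -4.2), ("order a crunchwrap supreme", -4.5)]
print(boost_hypotheses(nbest, ["Crunchwrap"])[0][0])   # -> "order a crunchwrap supreme"
```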
Jean-Marc Mommessin: I’m very familiar with the NVIDIA Orin platform. Would these Parakeet models currently run on NVIDIA Orin?
Joey Conway: Yes, I believe the 0.6 billion parameter model likely would run on Orin. I would need to double-check the exact specifications, but I’m quite confident it’s feasible.
Jean-Marc Mommessin: Orin packs a significant punch. I especially love the robotics use case you mentioned. While there’s been a lot of focus on robot vision, the ability to hear and understand quickly is equally crucial, especially for safety. A model that’s 50 times faster and highly accurate in understanding another modality seems like a perfect fit for robotics.
Joey Conway: Yes, and the slight hesitation I had earlier was due to the understanding that in robotics, there are often multiple models running simultaneously, including vision models. So, resource allocation is a consideration. However, our push towards smaller, more efficient models is precisely to address these kinds of multi-modal edge computing scenarios. The low latency and real-time processing capabilities of Parakeet are indeed very beneficial for enabling robots to react quickly and safely to auditory cues.
Jean-Marc Mommessin: Anything else you’d like to add as a final thought on the Llama Nemotron Ultra and Parakeet families? They’re both open-source, fast, high-throughput, cost-efficient, and run on smaller footprints – are these the key takeaways?
Joey Conway: Yes, that’s a great summary. Those were the core objectives we set out to achieve. We aimed for state-of-the-art accuracy, optimized footprints for efficient GPU utilization in terms of latency and throughput, and a commitment to open-sourcing everything to empower the community. We’ve strived to be as community-friendly as possible by releasing datasets, using permissive licenses, and making it easy for people to experiment. We’re eager to see the community’s feedback and the innovative applications they build upon our work. We’re also looking forward to learning from their experiences.
Jean-Marc Mommessin: Where are all these models and datasets available?
Joey Conway: Everything we’ve published is on Hugging Face – the models and the datasets. The software stack to run them comes from NVIDIA and is available on NGC, our content repository. Much of the underlying software is also open-source and can be found on GitHub. We also provide pip wheels for easier installation. The NeMo framework is the central hub for much of this software stack, whether you want to run the models or fine-tune them.
We’ve tried to make it as user-friendly as possible. We use the same software internally to build the models, so it should be relatively straightforward for others to pick up and deploy as well.
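For readers who want to try this, a minimal getting-started sketch with the NeMo toolkit’s ASR collection looks roughly like the following; the pip extra and the Hugging Face model id are assumptions based on public model-card conventions, so check the current documentation before relying on them.

```python
# pip install -U "nemo_toolkit[asr]"   # assumed install command; see the NeMo docs
import nemo.collections.asr as nemo_asr

# Model id as published on Hugging Face (verify against the model card).
asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2")

# Transcribe a local audio file (16 kHz mono WAV is the safest input format).
outputs = asr_model.transcribe(["meeting_recording.wav"])
print(outputs[0])
```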
Jean-Marc Mommessin: Well, Joey, this has been fantastic. I’m continually impressed by NVIDIA’s commitment to giving back to the community with state-of-the-art models that will undoubtedly find their way into production. Thank you so much for your time and insights. I look forward to our next conversation.
Joey Conway: Thank you, Jean-Marc. It was my pleasure, and we appreciate the opportunity.