ARSTECHNICA.COM
Nvidia's new AI audio model can synthesize sounds that have never existed
You've never heard anything like it
What does a screaming saxophone sound like? The Fugatto model has an answer...
Kyle Orland | Nov 25, 2024 4:40 pm

An audio wave can contain so much. An angry cello, for instance... Credit: Getty Images

At this point, anyone who has been following AI research is long familiar with generative models that can synthesize speech or melodic music from nothing but text prompting. Nvidia's newly revealed "Fugatto" model looks to go a step further, using new synthetic training methods and inference-level combination techniques to "transform any mix of music, voices, and sounds," including the synthesis of sounds that have never existed.

While Fugatto isn't available for public testing yet, a sample-filled website showcases how Fugatto can be used to dial a number of distinct audio traits and descriptions up or down, resulting in everything from the sound of saxophones barking to people speaking underwater to ambulance sirens singing in a kind of choir. While the results on display can be a bit hit or miss, the sheer range of capabilities helps support Nvidia's description of Fugatto as "a Swiss Army knife for sound."

You're only as good as your data

In an explanatory research paper, over a dozen Nvidia researchers explain the difficulty of crafting a training dataset that can "reveal meaningful relationships between audio and language." While standard language models can often infer how to handle various instructions from text-based data itself, it can be hard to generalize descriptions and traits from audio without more explicit guidance.

To that end, the researchers start by using an LLM to generate a Python script that can create a large number of template-based and free-form instructions describing different audio "personas" (e.g., "standard, young-crowd, thirty-somethings, professional"). They then generate a set of both absolute (e.g., "synthesize a happy voice") and relative (e.g., "increase the happiness of this voice") instructions that can be applied to those personas.
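Nvidia hasn't published the generated script itself, but a minimal sketch gives a flavor of what template-based instruction generation of this kind could look like. The persona names echo the paper's examples; the templates, trait list, and function names below are purely illustrative assumptions, not Nvidia's code.

```python
# Illustrative sketch only: the persona/trait templates and helper names here are
# hypothetical, meant to show the general shape of template-based instruction
# generation, not Nvidia's actual LLM-generated script.
import itertools
import random

PERSONAS = ["standard", "young-crowd", "thirty-somethings", "professional"]
TRAITS = ["happiness", "anger", "reverb"]
TRAIT_ADJECTIVES = {"happiness": "happy", "anger": "angry", "reverb": "reverberant"}

ABSOLUTE_TEMPLATES = [
    "synthesize a {trait_adj} voice in a {persona} style",
    "generate {persona} speech with high {trait}",
]
RELATIVE_TEMPLATES = [
    "increase the {trait} of this voice",
    "make this {persona} recording slightly less {trait_adj}",
]

def build_instructions():
    """Expand every (persona, trait) pair into absolute and relative instructions."""
    instructions = []
    for persona, trait in itertools.product(PERSONAS, TRAITS):
        adj = TRAIT_ADJECTIVES[trait]
        for template in ABSOLUTE_TEMPLATES:
            instructions.append(("absolute", template.format(trait=trait, trait_adj=adj, persona=persona)))
        for template in RELATIVE_TEMPLATES:
            instructions.append(("relative", template.format(trait=trait, trait_adj=adj, persona=persona)))
    random.shuffle(instructions)
    return instructions

if __name__ == "__main__":
    for kind, text in build_instructions()[:5]:
        print(kind, "->", text)
```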
The wide array of open source audio datasets used as the basis for Fugatto generally don't have these kinds of trait measurements embedded in them by default. But the researchers make use of existing audio understanding models to create "synthetic captions" for their training clips based on their prompts, creating natural language descriptions that can automatically quantify traits such as gender, emotion, and speech quality. Audio processing tools are also used to describe and quantify training clips on a more acoustic level (e.g., "fundamental frequency variance" or "reverb").

For relational comparisons, the researchers rely on datasets where one factor is held constant while another changes, such as different emotional readings of the same text or different instruments playing the same notes. By comparing these samples across a large enough set of data, the model can start to learn what kinds of audio characteristics tend to appear in "happier" speech, for instance, or how to differentiate the sound of a saxophone from that of a flute.

After running a variety of different open source audio collections through this process, the researchers ended up with a heavily annotated dataset of 20 million separate samples representing at least 50,000 hours of audio. From there, a set of 32 Nvidia tensor cores was used to create a model with 2.5 billion parameters that started to show reliable scores on a variety of audio quality tests.

It's all in the mix

OK, Fugatto, can we get a little more barking and a little less saxophone in the monitors? Credit: Getty Images

Beyond the training, Nvidia is also talking up Fugatto's "ComposableART" system (for "Audio Representation Transformation"). When provided with a prompt in text and/or audio, this system can use "conditional guidance" to "independently control and generate (unseen) combinations of instructions and tasks" and produce "highly customizable audio outputs outside the training distribution." In other words, it can combine different traits from its training set to create entirely new sounds that have never been heard before.

I won't pretend to understand all of the complex math described in the paper, which involves a "weighted combination of vector fields between instructions, frame indices and models." But the end results, as shown in examples on the project's webpage and in an Nvidia trailer, highlight how ComposableART can be used to create the sound of, say, a violin that "sounds like a laughing baby or a banjo that's playing in front of gentle rainfall" or "factory machinery that screams in metallic agony." While some of these examples are more convincing to our ears than others, the fact that Fugatto can take a decent stab at these kinds of combinations at all is a testament to the way the model characterizes and mixes extremely disparate audio data from multiple open source datasets.
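For readers who want a rough intuition, the general idea of weighting several conditions against each other at inference time can be sketched in a few lines. This is a loose, classifier-free-guidance-style illustration under stated assumptions, not the paper's actual formulation: the `model` interface, the function name, and the latent-update framing are all hypothetical.

```python
# Rough, hypothetical sketch of blending multiple instructions at inference time.
# Fugatto's ComposableART combines weighted "vector fields" per instruction; the
# math in the paper is more involved. `model` here is an assumed denoiser-style
# network that returns an update for an audio latent given a single instruction.
import numpy as np

def guided_update(model, latent, instructions, weights, unconditional=""):
    """Blend per-instruction model outputs, each pulled toward or away from
    the unconditional output in proportion to its weight."""
    base = model(latent, unconditional)        # output with no instruction
    blended = np.zeros_like(base)
    for text, weight in zip(instructions, weights):
        cond = model(latent, text)             # output for one instruction
        blended += weight * (cond - base)      # push in that instruction's direction
    return base + blended

# Hypothetical usage: weighting "acoustic guitar" more heavily than "running
# water" tilts the mix toward the guitar, the kind of tunable continuum
# described below.
# update = guided_update(model, latent,
#                        ["acoustic guitar strumming", "running water"],
#                        weights=[0.8, 0.2])
```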
Perhaps the most interesting part of Fugatto is the way it treats each individual audio trait as a tunable continuum rather than a binary. For an example that melds the sound of an acoustic guitar and running water, the result ends up very different depending on whether the guitar or the water is weighted more heavily in Fugatto's interpolated mix. Nvidia also mentions examples of tuning a French accent to be heavier or lighter, or varying the "degree of sorrow" inherent in a spoken clip.

Beyond tuning and combining different audio traits, Fugatto can also perform the kinds of audio tasks we've seen in previous models, like changing the emotion in a piece of spoken text or isolating the vocal track in a piece of music. Fugatto can also detect individual notes in a piece of MIDI music and replace them with a variety of vocal performances, or detect the beat of a piece of music and add effects from drums to barking dogs to ticking clocks in a way that matches the rhythm.

Fugatto's generated audio (magenta) matches the melody of an input MIDI file (cyan) very closely. Credit: Nvidia Research

While the researchers describe Fugatto as just the first step "towards a future where unsupervised multitask learning emerges from data and model scale," Nvidia is already talking up use cases from song prototyping to dynamically changing video game scores to international ad targeting.

But Nvidia was also quick to highlight that models like Fugatto are best seen as a new tool for audio artists rather than a replacement for their creative talents.

"The history of music is also a history of technology," Nvidia Inception participant and producer/songwriter Ido Zmishlany said in Nvidia's blog post. "The electric guitar gave the world rock and roll. When the sampler showed up, hip-hop was born. With AI, we're writing the next chapter of music. We have a new instrument, a new tool for making music, and that's super exciting."

Kyle Orland, Senior Gaming Editor
Kyle Orland has been the Senior Gaming Editor at Ars Technica since 2012, writing primarily about the business, tech, and culture behind video games. He has journalism and computer science degrees from the University of Maryland. He once wrote a whole book about Minesweeper.