The Nvidia AI interview: Inside DLSS 4 and machine learning with Bryan Catanzaro
www.eurogamer.net
DF talks with Nvidia's VP of applied deep learning research.

Image credit: Digital Foundry

Interview by Alex Battaglia, Video Producer, Digital Foundry. Additional contributions by Will Judd. Published on Jan. 21, 2025.

At CES 2025, Nvidia announced its RTX 50-series graphics cards with DLSS 4. While at the show, we spoke with Nvidia VP of applied deep learning research Bryan Catanzaro about the finer details of how the new DLSS works, from its revised transformer model for super resolution and ray reconstruction to the new multi frame generation (MFG) feature.

Despite coming just over a year after our last interview with Bryan, which coincided with the release of DLSS 3.5 and Cyberpunk 2077 Phantom Liberty, there are some fairly major advancements here, some of which will be reserved for RTX 50-series owners and others that will be available for a wider range of Nvidia graphics cards.

The interview follows below, with light edits for length and clarity as usual. The full interview is available via the video embed below if you prefer. Enjoy!

Here's the full video interview with Bryan and Alex from the CES 2025 show floor.
Watch on YouTube

00:00 Introduction
00:48 Why switch from CNNs to transformers?
02:08 What are some image characteristics that are improved with DLSS 4 Super Resolution?
03:17 Is there headroom to continue to improve on Super Resolution?
04:12 How much more expensive is DLSS 4 Super Resolution to run?
05:25 How does the transformer model improve Ray Reconstruction?
09:43 Why is frame gen no longer using hardware optical flow?
13:06 Could the new Frame Generation run on RTX 3000?
13:44 What has changed for frame pacing with DLSS 4 Frame Generation?
15:37 Will Frame Generation ever support standard v-sync?
17:18 Could you explain how Reflex 2 works?
21:11 What is the lowest acceptable input frame-rate for DLSS 4 Frame Generation?
22:13 What does the future of real-time graphics look like?

The last time we talked was when ray reconstruction first came out, and now, with RTX 5000, there's a new DLSS model - the first time since 2020 that we're seeing such a big change in how things are done. So why switch over to this new transformer model? To start, how does it improve super resolution specifically?

Bryan Catanzaro: We've been evolving the super resolution model now for about five or six years, and it gets increasingly challenging to make the model smarter; trying to cram more and more intelligence into the same space. You have to innovate; you have to try something new. The transformer architecture has been such a wonderful thing for language modeling, for image generation; all of the advances that we see today like ChatGPT or Stable Diffusion - these are all built on transformer models. Transformer models have this great property in that they're very scalable. You can train them on large amounts of data, and because they're able to direct attention around an image, it allows the model to make smarter choices about what's happening and what to generate. We can train it on much more data, get a smarter model and then breakthrough results.
We're really excited about the kinds of image quality that we're able to achieve with our new ray reconstruction and super resolution models in DLSS 4.

What are some key image characteristics that are improved with the new transformer model in the super resolution mode?

Bryan Catanzaro: You know what the issues are with super resolution - it's things like stability, ghosting and detail. We're always trying to push on all of those dimensions, and they usually trade off. It's easier to get more detail if you accumulate more, but then that leads to ghosting. Or the opposite of ghosting, when you have stability problems because the model makes different choices each frame and then you have something like geometry in the distance that's shimmering and flickering, which is also really bad. Those are the standard problems with any sort of image reconstruction. I think that the tradeoffs we're making with our new super resolution and ray reconstruction models are just way better than what we've had in the past.

Here's our DF Direct discussing the Nvidia news, featuring Alex and Oliver. Watch on YouTube

Is there better potential with this kind of model also? With the old models, it seems like we're hitting a wall in terms of the quality that can be achieved. Is there a better trajectory with a transformer model?

Bryan Catanzaro: Yeah, absolutely. It's always been true in machine learning that a bigger model trained on more data is going to get better results if the data is high quality. And of course, with DLSS or any sort of real-time graphics algorithm, we have a strict compute budget in terms of milliseconds per frame. One of the reasons we were brave enough to try building a transformer-based image reconstruction algorithm for super resolution and ray reconstruction is because we knew that Blackwell [RTX 50-series] was going to have amazing Tensor cores.
It was designed as a neural rendering GPU; the amount of compute horsepower that's going into the Tensor cores is going up exponentially. And so we have the opportunity to try something a little bit more ambitious, and that's what we've done.

The specific performance cost of super resolution at 4K on an RTX 4090 was sub-0.5ms, if I recall correctly. Can you give me a ballpark difference in terms of milliseconds per frame for what the new transformer model costs?

Bryan Catanzaro: The new super resolution model has four times more compute in it than the old one, but it doesn't take four times as long to execute, especially on Blackwell, because we have designed the algorithm along with the Tensor core to make sure that we're running at really high efficiencies. I can't quote the exact number of milliseconds on a 50-series card, but I can say that it's got four times more compute. And on Blackwell, we think it's the best way to play.

The last time we talked, it was really obvious to see that ray reconstruction was the direction that the industry should go in, because you can't just hand-tune a denoiser for every single environmental setting. It made sense, but we noticed problem points in the beginning, both specific to certain titles and more universal ones. How is the transformer model improving these specific areas?

Bryan Catanzaro: Some of it's just polish - we've had another year to iterate on it, and we're always increasing the quality of our data sets. We're analysing failure cases, adding them to our training sets and our evaluation methodology. But also, the new model being much bigger and having much more compute in it just gives it more capacity to learn. A lot of times when we have a failure in one of these DLSS models, it looks like shimmering, ghosting or blurring in-game. We consider those model failures; the model is just making a poor choice. It needs to, for example, decide not to accumulate if that's going to lead to ghosting.
It needs to, for example, not have a bias to make crenelated stair-step patterns on edges, because that's the whole point of anti-aliasing. Due to a lot of technical reasons, we've been fighting that in DLSS for years, and I think these models are just smarter, so they fail less.

Here's the DLSS 4 first look video Alex and Bryan refer to during the interview.

Yeah, that was one of my key takeaways about DLSS 4. Sometimes with AI there's a slight stylisation of the output, and I didn't see that at all [in the DLSS 4 b-roll Rich recorded], so I was very happy to see that.

Bryan Catanzaro: I noticed [in the Digital Foundry video] that Rich was looking at animated textures, which have always really bothered me too. And it's a really tricky thing for DLSS super resolution or ray reconstruction to deal with, because the motion vectors from the game that are describing how things are moving around don't go along with the texture. The TV is just sitting there, and yet you don't want the screen on the TV to just blur as stuff moves around. That requires the model to ignore the motion vectors that are coming from the game, basically analyse the scene and recognise "oh, this area is actually a TV with an animated texture on it - I'm going to make sure not to blur that." It was really hard to teach the prior CNN models about that. We did our best, and we did make a lot of progress, but I feel like this new transformer model opens up a new space for us to solve these problems.

I hope we get to do a dedicated look at ray reconstruction. Because it was so nascent a technology, it feels like this is almost a larger leap than what we're seeing with super resolution.

Bryan Catanzaro: I think that's true.

Another part of this is frame gen, which now doesn't use hardware optical flow as it did on RTX 40-series. Why make that change?

Bryan Catanzaro: Well, because we get better results that way. Technology is always a function of the time in which it's built.
When we built DLSS 3 frame generation, we absolutely needed hardware acceleration to compute optical flow, as we didn't have enough Tensor cores and we didn't have a real-time optical flow algorithm that ran on Tensor cores that could fit our compute budget. So we instead used the optical flow accelerator, which Nvidia had been building for years as an evolution of our video encoder technology and our automotive computer vision acceleration for self-driving cars and so on. The difficult part about any sort of hardware implementation of an algorithm like optical flow is that it's really difficult to improve it; it is what it is. The failures that arose from that hardware optical flow couldn't be undone with a smarter neural network, so we decided to just replace them with a fully AI-based solution, which is what we've done for frame generation in DLSS 4.

This new frame generation algorithm is significantly more Tensor core heavy, and so it still has a lot of hardware requirements, but it has a few good properties. One is it uses less memory, which is important as we're always trying to save every megabyte. Two is it has better image quality, and that's especially important for the 50-series MFG, because the percentage of time that a gamer is looking at generated frames is much higher and therefore any artefacts are going to be much more visible. So we needed to make image quality better. Three is we needed to make the algorithm cheaper to run in terms of milliseconds, especially for the 50-series cards when we're doing MFG.

What we wanted to do was make it possible to amortise a lot of the work over the multiple frames that we're generating. If you think about it, there's really two rendered frames that we're analysing in order to create a series of frames in between those. And it seems like you should do that comparison once, and then you should do some other thing to generate each frame. And so that required a different algorithm.
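To make the "compare once, then generate each frame" idea concrete, here is a toy sketch in Python. It is not Nvidia's algorithm: DLSS 4 uses a neural network running on Tensor cores, whereas this example assumes a one-dimensional "frame", a brute-force shift search standing in for the costly analysis step, and a simple wrap-around shift standing in for per-frame generation. All function names are hypothetical. It also works out the share of displayed frames that are generated at each multiplier, which is why artefacts matter more with MFG.

```python
from typing import List

Frame = List[float]  # one scanline of pixel intensities; a toy stand-in for an image


def estimate_motion(prev: Frame, curr: Frame, max_shift: int = 8) -> int:
    """Costly analysis step: brute-force the shift that best maps prev onto curr."""
    n = len(prev)
    best_shift, best_err = 0, float("inf")
    for shift in range(-max_shift, max_shift + 1):
        # Sum of absolute differences after shifting prev (wrapping at the edges).
        err = sum(abs(curr[i] - prev[(i - shift) % n]) for i in range(n))
        if err < best_err:
            best_shift, best_err = shift, err
    return best_shift


def warp(frame: Frame, shift: int) -> Frame:
    """Cheap per-frame step: move every pixel by `shift` positions (wrapping)."""
    n = len(frame)
    return [frame[(i - shift) % n] for i in range(n)]


def multi_frame_generate(prev: Frame, curr: Frame, generated: int) -> List[Frame]:
    """Analyse the rendered pair once, then reuse the result for every generated frame."""
    motion = estimate_motion(prev, curr)  # amortised: runs once per rendered pair
    return [
        warp(prev, round(motion * k / (generated + 1)))  # fraction of the full motion
        for k in range(1, generated + 1)
    ]


def generated_fraction(multiplier: int) -> float:
    """Share of displayed frames that are generated for 2x, 3x, 4x, ... output."""
    return (multiplier - 1) / multiplier


if __name__ == "__main__":
    prev = [0.0] * 16
    curr = [0.0] * 16
    prev[2] = 1.0  # a bright pixel at position 2 in the first rendered frame
    curr[6] = 1.0  # the same pixel at position 6 in the next rendered frame
    frames = multi_frame_generate(prev, curr, generated=3)
    print([f.index(1.0) for f in frames])  # [3, 4, 5]
    print(generated_fraction(2), generated_fraction(4))  # 0.5 0.75
```

The point of the structure is that `estimate_motion` runs once regardless of how many frames are generated, so going from one to three generated frames only adds more of the cheap `warp` calls.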
Now that frame generation is running wholly on Tensor cores, obviously it's more intensive, but what's keeping it from running on RTX 3000?

Bryan Catanzaro: I think this is a question of optimisation, engineering and user experience. We're launching this multi frame generation with the 50-series, and we'll see what we're able to squeeze out of older hardware in the future.

Another part of this is frame pacing, which has always actually been an extreme challenge, especially in a VRR scenario. What has changed with regards to frame pacing, between DLSS 3 frame generation and DLSS 4 frame generation?

Bryan Catanzaro: We have an updated flip metering system in Blackwell that has much lower variability and takes the CPU out of the equation when deciding exactly when to present a frame. Because of that, we're able to reduce the displayed frame time variability by about a factor of five or 10 compared with our previous best frame pacing. This is especially important for multi frame generation, because the more frames you're trying to show, the more the variability really starts throwing a wrench into the experience.

I'm very curious to see if those frame pacing improvements would affect, for example, RTX 40-series as well?

Bryan Catanzaro: DLSS 4 is just better than DLSS 3, so I expect that things will be better on 40-series as well.

Another element of Nvidia's frame generation is using Reflex to reduce latency, which now has a generative AI aspect to it with Reflex 2. Can you talk a bit about it?

Bryan Catanzaro: I'm always thinking about real-time graphics in three dimensions: smoothness, responsiveness and image quality - which includes ray tracing and higher resolution and better textures and all that. With DLSS, we want to improve on all those areas. We're excited about Reflex 2 because it's a new way of thinking about lowering latency.
What we're doing is actually rendering the scene in the normal way, but right before we go to finalise the image, we sample the camera position again to see if the user has moved the camera while the GPU has been rendering that frame. If that happens, we warp the image to the new camera position. For most pixels, that's going to look really good and it dramatically lowers the latency between the mouse and the camera. Sometimes when the camera moves, something that was hidden before is revealed, and you would then have a hole with no information on what should be there: disocclusion. The trick with a technique like Reflex 2 is filling in those holes to make a convincing-looking image. And the trade-offs that we've made with Reflex 2 are going to be really exciting for gamers that are really latency sensitive. I think there's still more work to do to make the image quality even better, and you can imagine that AI has a big role to play here as well.

Yeah, it's interesting too, because input latency is a matter of perception, and this is completely playing with that. On a technical level, it's not actually moving the real 3D scene - it's a 2D image manipulation, right? But you're almost getting the same effect.

Bryan Catanzaro: It's pretty fun to me. It feels totally different playing a game with Reflex 2, it just feels so much more connected. I think a lot of gamers are going to love it, especially in certain titles that are very latency sensitive. But you know, DLSS is trying to give people more options so they can play how they want: if they want to lower latency, if they want to increase image quality, if they want smoothness. DLSS has something for everybody.

The ability to choose two, three or four inserted frames with frame generation.
Bryan Catanzaro: Yeah, it's a big deal, and you can do that in the Nvidia app as well, which is useful to override games that were developed with DLSS 3 frame generation and don't have a UI for selecting 2x, 3x or 4x frame generation. Rather than trying to update all the UIs for all the games, we figured it would be useful for gamers to be able to choose what they'd like.

Coming onto multi frame generation, what is the lowest acceptable input frame-rate for MFG?

Bryan Catanzaro: I think that the acceptable input frame rate is still about the same for 3x or 4x as it was for 2x. I think the challenges really have to do with how large the movement is between two consecutive rendered frames. When the movement gets very large, it becomes much harder to figure out what to do in between those frames. But if you understand how an object is moving, dividing the motion into smaller pieces isn't actually that tricky, right? So the trick is figuring out how the objects are moving, and so that's kind of independent of how many frames we're generating.

Where do you see the future of frame generation? Now we're taking whatever kind of raw performance we can get and blowing it up for a minor performance and latency cost, but eventually we're going to have 1000Hz monitors. Where does frame generation fit into that future?

Bryan Catanzaro: Well, I'm excited about 1000Hz monitors. I think that's going to feel amazing - and we're going to be using a lot of frame gen to get to 1000Hz. Graphics is shifting; we've been on this journey of redefining graphics with neural rendering for almost seven years and we're still at the beginning. If we think about the approximations that we use for graphics, there's still a lot that we would like to get rid of. One that you brought up earlier is subsurface scattering. It's kind of crazy that in 3D graphics today we're mostly simulating a 2D manifold; we're not actually doing 3D graphics.
We're bouncing light off of pieces of paper that are like origami heads or something, but we're not actually moving rays through 3D objects. Most of the time for opaque things that probably doesn't matter, but for a lot of things that are semi-translucent - a lot of the things that make the world feel real and textured - we actually do need to do a better job of working with light transport in three dimensions, like through materials.

And so you ask yourself, what's the role of a polygon? If the job is to think about how light interacts through three-dimensional objects, the model that we've been using for the past 50 years - "let's really carefully model the outside surface of an object" - that's probably not the right representation. And so this phenomenon is that we're finding neural representations and neural rendering algorithms that are able to learn from real-world data and from very expensive simulations that would never be real time, so we're able to come up with technologies that are going to be much more realistic and convincing than we could ever do with traditional "bottom-up" rendering.

Bottom-up rendering is when you're trying to model every fuzzy hair and every snowflake and every drop of water and every light photon, so that we can simulate reality. At some point, you know, we're making a shift away from this explicit, bottom-up kind of graphics towards a more top-down generated graphics where we learn, for example, how snowflakes look. When a painter paints a scene, they're not actually simulating every photon and every facet of every piece of geometry. They just know what it's supposed to look like. And so I think neural rendering is moving in that direction, and I'm very excited about the prospects of overcoming a lot of the limitations of today's graphics, which I think are really difficult to scale. You know, the more fidelity we put in bottom-up simulation, the more work we have to do to capture textures and geometry and animate it.
It becomes very expensive and really challenging. A lot is held back because we just don't have the artist bandwidth, we don't have the time or the storage to save everything. But we're going to have neural materials, neural rendering algorithms, neural radiance caches; we're going to find ways of using AI in order to understand how the world should be drawn, and that's going to open up a lot of new possibilities to make games more interesting-looking and more fun.

Yeah, one of the things that I've always lamented about polygon-based graphics is that inability to represent anything like heterogeneous volumes and ray tracing that is almost impossible in real time. So I'm happy that neural rendering is going to start bridging that gap, for more complex deformable materials, fluid simulations, all these things. So that's what I hope we see in the future.

Bryan Catanzaro: That's where we're headed, for sure.
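As a closing illustration of the Reflex 2 late warp Bryan describes earlier in the interview (render normally, re-sample the camera, warp the image, then fill the disoccluded holes), here is a minimal Python sketch. Everything in it is an assumption made for illustration: the image is a single scanline, the camera only moves sideways, parallax is modelled as `camera_dx / depth`, and holes are filled by copying the farther of the two neighbouring pixels. The interview does not say how Nvidia fills holes, so `fill_holes` is a naive placeholder and both function names are hypothetical.

```python
from math import inf
from typing import List, Optional, Tuple

Scanline = List[Optional[str]]  # one row of "pixels"; None marks a hole


def late_warp(color: List[str], depth: List[float],
              camera_dx: float) -> Tuple[Scanline, List[float]]:
    """Reproject a finished scanline to a camera that moved camera_dx to the right."""
    n = len(color)
    out: Scanline = [None] * n
    out_depth = [inf] * n
    for x in range(n):
        shift = round(camera_dx / depth[x])  # nearer pixels move further (parallax)
        tx = x - shift                       # camera moves right, scene moves left
        # Keep the nearest surface when two pixels land on the same target.
        if 0 <= tx < n and depth[x] < out_depth[tx]:
            out[tx] = color[x]
            out_depth[tx] = depth[x]
    return out, out_depth


def fill_holes(color: Scanline, depth: List[float]) -> Scanline:
    """Naive placeholder: fill each hole from the farther neighbouring pixel,
    on the assumption that the revealed area belongs to the background."""
    n = len(color)
    filled = list(color)
    for x in range(n):
        if color[x] is not None:
            continue
        left = next((i for i in range(x - 1, -1, -1) if color[i] is not None), None)
        right = next((i for i in range(x + 1, n) if color[i] is not None), None)
        candidates = [i for i in (left, right) if i is not None]
        if candidates:
            src = max(candidates, key=lambda i: depth[i])
            filled[x] = color[src]
    return filled


if __name__ == "__main__":
    # 'F' is a near foreground object (depth 1) in front of a far background 'B' (depth 8).
    color = ["B", "B", "F", "F", "B", "B", "B", "B"]
    depth = [8.0, 8.0, 1.0, 1.0, 8.0, 8.0, 8.0, 8.0]
    warped, warped_depth = late_warp(color, depth, camera_dx=2.0)
    print(warped)                            # ['F', 'F', None, None, 'B', 'B', 'B', 'B']
    print(fill_holes(warped, warped_depth))  # ['F', 'F', 'B', 'B', 'B', 'B', 'B', 'B']
```

The two `None` entries are the disocclusion: background that was hidden behind the foreground object in the rendered frame and therefore has no colour information after the warp.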