• NVIDIA CEO Drops the Blueprint for Europe’s AI Boom

    At GTC Paris — held alongside VivaTech, Europe’s largest tech event — NVIDIA founder and CEO Jensen Huang delivered a clear message: Europe isn’t just adopting AI — it’s building it.
    “We now have a new industry, an AI industry, and it’s now part of the new infrastructure, called intelligence infrastructure, that will be used by every country, every society,” Huang said, addressing an audience gathered online and at the iconic Dôme de Paris.
    From exponential inference growth to quantum breakthroughs, and from infrastructure to industry, agentic AI to robotics, Huang outlined how the region is laying the groundwork for an AI-powered future.

    A New Industrial Revolution
    At the heart of this transformation, Huang explained, are systems like GB200 NVL72 — “one giant GPU” and NVIDIA’s most powerful AI platform yet — now in full production and powering everything from sovereign models to quantum computing.
    “This machine was designed to be a thinking machine, a thinking machine, in the sense that it reasons, it plans, it spends a lot of time talking to itself,” Huang said, walking the audience through the size and scale of these machines and their performance.
    At GTC Paris, Huang showed audience members the innards of some of NVIDIA’s latest hardware.
    There’s more coming, with Huang saying NVIDIA’s partners are now producing 1,000 GB200 systems a week, “and this is just the beginning.” He walked the audience through a range of available systems ranging from the tiny NVIDIA DGX Spark to rack-mounted RTX PRO Servers.
    Huang explained that NVIDIA is working to help countries use technologies like these to build both AI infrastructure — services built for third parties to use and innovate on — and AI factories, which companies build for their own use, to generate revenue.
    NVIDIA is partnering with European governments, telcos and cloud providers to deploy NVIDIA technologies across the region. NVIDIA is also expanding its network of technology centers across Europe — including new hubs in Finland, Germany, Spain, Italy and the U.K. — to accelerate skills development and quantum growth.
    Quantum Meets Classical
    Europe’s quantum ambitions just got a boost.
    The NVIDIA CUDA-Q platform is live on Denmark’s Gefion supercomputer, opening new possibilities for hybrid AI and quantum engineering. In addition, Huang announced that CUDA-Q is now available on NVIDIA Grace Blackwell systems.
    Across the continent, NVIDIA is partnering with supercomputing centers and quantum hardware builders to advance hybrid quantum-AI research and accelerate quantum error correction.
    “Quantum computing is reaching an inflection point,” Huang said. “We are within reach of being able to apply quantum computing, quantum classical computing, in areas that can solve some interesting problems in the coming years.”
    Sovereign Models, Smarter Agents
    European developers want more control over their models. Enter NVIDIA Nemotron, designed to help build large language models tuned to local needs.
    “And so now you know that you have access to an enhanced open model that is still open, that is top of the leader chart,” Huang said.
    These models will be coming to Perplexity, a reasoning search engine, enabling secure, multilingual AI deployment across Europe.
    “You can now ask and get questions answered in the language, in the culture, in the sensibility of your country,” Huang said.
    Huang explained how NVIDIA is helping countries across Europe build AI infrastructure.
    Every company will build its own agents, Huang said. To help create those agents, Huang introduced a suite of agentic AI blueprints, including an Agentic AI Safety blueprint for enterprises and governments.
    The new NVIDIA NeMo Agent toolkit and NVIDIA AI Blueprint for building data flywheels further accelerate the development of safe, high-performing AI agents.
    To help deploy these agents, NVIDIA is partnering with European governments, telcos and cloud providers to deploy the DGX Cloud Lepton platform across the region, providing instant access to accelerated computing capacity.
    “One model architecture, one deployment, and you can run it anywhere,” Huang said, adding that Lepton is now integrated with Hugging Face, giving developers direct access to global compute.
    The Industrial Cloud Goes Live
    AI isn’t just virtual. It’s powering physical systems, too, sparking a new industrial revolution.
    “We’re working on industrial AI with one company after another,” Huang said, describing work to build digital twins based on the NVIDIA Omniverse platform with companies across the continent.
    Huang explained that everything he showed during his keynote was “computer simulation, not animation” and that it looks beautiful because “it turns out the world is beautiful, and it turns out math is beautiful.”
    To further this work, Huang announced NVIDIA is launching the world’s first industrial AI cloud — to be built in Germany — to help Europe’s manufacturers simulate, automate and optimize at scale.
    “Soon, everything that moves will be robotic,” Huang said. “And the car is the next one.”
    NVIDIA DRIVE, NVIDIA’s full-stack AV platform, is now in production to accelerate the large-scale deployment of safe, intelligent transportation.
    And to show what’s coming next, Huang was joined on stage by Grek, a pint-sized robot, as Huang talked about how NVIDIA partnered with DeepMind and Disney to build Newton, the world’s most advanced physics training engine for robotics.
    The Next Wave
    The next wave of AI has begun — and it’s exponential, Huang explained.
    “We have physical robots, and we have information robots. We call them agents,” Huang said. “The technology necessary to teach a robot to manipulate, to simulate — and of course, the manifestation of an incredible robot — is now right in front of us.”
    This new era of AI is being driven by a surge in inference workloads. “The number of people using inference has gone from 8 million to 800 million — 100x in just a couple of years,” Huang said.
    To meet this demand, Huang emphasized the need for a new kind of computer: “We need a special computer designed for thinking, designed for reasoning. And that’s what Blackwell is — a thinking machine.”
    Huang and Grek, as he explained how AI is driving advancements in robotics.
    These Blackwell-powered systems will live in a new class of data centers — AI factories — built to generate tokens, the raw material of modern intelligence.
    “These AI factories are going to generate tokens,” Huang said, turning to Grek with a smile. “And these tokens are going to become your food, little Grek.”
    With that, the keynote closed on a bold vision: a future powered by sovereign infrastructure, agentic AI, robotics — and exponential inference — all built in partnership with Europe.
    Watch the NVIDIA GTC Paris keynote from Huang at VivaTech and explore GTC Paris sessions.
    #nvidia #ceo #drops #blueprint #europes
    NVIDIA CEO Drops the Blueprint for Europe’s AI Boom
    At GTC Paris — held alongside VivaTech, Europe’s largest tech event — NVIDIA founder and CEO Jensen Huang delivered a clear message: Europe isn’t just adopting AI — it’s building it. “We now have a new industry, an AI industry, and it’s now part of the new infrastructure, called intelligence infrastructure, that will be used by every country, every society,” Huang said, addressing an audience gathered online and at the iconic Dôme de Paris. From exponential inference growth to quantum breakthroughs, and from infrastructure to industry, agentic AI to robotics, Huang outlined how the region is laying the groundwork for an AI-powered future. A New Industrial Revolution At the heart of this transformation, Huang explained, are systems like GB200 NVL72 — “one giant GPU” and NVIDIA’s most powerful AI platform yet — now in full production and powering everything from sovereign models to quantum computing. “This machine was designed to be a thinking machine, a thinking machine, in the sense that it reasons, it plans, it spends a lot of time talking to itself,” Huang said, walking the audience through the size and scale of these machines and their performance. At GTC Paris, Huang showed audience members the innards of some of NVIDIA’s latest hardware. There’s more coming, with Huang saying NVIDIA’s partners are now producing 1,000 GB200 systems a week, “and this is just the beginning.” He walked the audience through a range of available systems ranging from the tiny NVIDIA DGX Spark to rack-mounted RTX PRO Servers. Huang explained that NVIDIA is working to help countries use technologies like these to build both AI infrastructure — services built for third parties to use and innovate on — and AI factories, which companies build for their own use, to generate revenue. NVIDIA is partnering with European governments, telcos and cloud providers to deploy NVIDIA technologies across the region. NVIDIA is also expanding its network of technology centers across Europe — including new hubs in Finland, Germany, Spain, Italy and the U.K. — to accelerate skills development and quantum growth. Quantum Meets Classical Europe’s quantum ambitions just got a boost. The NVIDIA CUDA-Q platform is live on Denmark’s Gefion supercomputer, opening new possibilities for hybrid AI and quantum engineering. In addition, Huang announced that CUDA-Q is now available on NVIDIA Grace Blackwell systems. Across the continent, NVIDIA is partnering with supercomputing centers and quantum hardware builders to advance hybrid quantum-AI research and accelerate quantum error correction. “Quantum computing is reaching an inflection point,” Huang said. “We are within reach of being able to apply quantum computing, quantum classical computing, in areas that can solve some interesting problems in the coming years.” Sovereign Models, Smarter Agents European developers want more control over their models. Enter NVIDIA Nemotron, designed to help build large language models tuned to local needs. “And so now you know that you have access to an enhanced open model that is still open, that is top of the leader chart,” Huang said. These models will be coming to Perplexity, a reasoning search engine, enabling secure, multilingual AI deployment across Europe. “You can now ask and get questions answered in the language, in the culture, in the sensibility of your country,” Huang said. Huang explained how NVIDIA is helping countries across Europe build AI infrastructure. Every company will build its own agents, Huang said. To help create those agents, Huang introduced a suite of agentic AI blueprints, including an Agentic AI Safety blueprint for enterprises and governments. The new NVIDIA NeMo Agent toolkit and NVIDIA AI Blueprint for building data flywheels further accelerate the development of safe, high-performing AI agents. To help deploy these agents, NVIDIA is partnering with European governments, telcos and cloud providers to deploy the DGX Cloud Lepton platform across the region, providing instant access to accelerated computing capacity. “One model architecture, one deployment, and you can run it anywhere,” Huang said, adding that Lepton is now integrated with Hugging Face, giving developers direct access to global compute. The Industrial Cloud Goes Live AI isn’t just virtual. It’s powering physical systems, too, sparking a new industrial revolution. “We’re working on industrial AI with one company after another,” Huang said, describing work to build digital twins based on the NVIDIA Omniverse platform with companies across the continent. Huang explained that everything he showed during his keynote was “computer simulation, not animation” and that it looks beautiful because “it turns out the world is beautiful, and it turns out math is beautiful.” To further this work, Huang announced NVIDIA is launching the world’s first industrial AI cloud — to be built in Germany — to help Europe’s manufacturers simulate, automate and optimize at scale. “Soon, everything that moves will be robotic,” Huang said. “And the car is the next one.” NVIDIA DRIVE, NVIDIA’s full-stack AV platform, is now in production to accelerate the large-scale deployment of safe, intelligent transportation. And to show what’s coming next, Huang was joined on stage by Grek, a pint-sized robot, as Huang talked about how NVIDIA partnered with DeepMind and Disney to build Newton, the world’s most advanced physics training engine for robotics. The Next Wave The next wave of AI has begun — and it’s exponential, Huang explained. “We have physical robots, and we have information robots. We call them agents,” Huang said. “The technology necessary to teach a robot to manipulate, to simulate — and of course, the manifestation of an incredible robot — is now right in front of us.” This new era of AI is being driven by a surge in inference workloads. “The number of people using inference has gone from 8 million to 800 million — 100x in just a couple of years,” Huang said. To meet this demand, Huang emphasized the need for a new kind of computer: “We need a special computer designed for thinking, designed for reasoning. And that’s what Blackwell is — a thinking machine.” Huang and Grek, as he explained how AI is driving advancements in robotics. These Blackwell-powered systems will live in a new class of data centers — AI factories — built to generate tokens, the raw material of modern intelligence. “These AI factories are going to generate tokens,” Huang said, turning to Grek with a smile. “And these tokens are going to become your food, little Grek.” With that, the keynote closed on a bold vision: a future powered by sovereign infrastructure, agentic AI, robotics — and exponential inference — all built in partnership with Europe. Watch the NVIDIA GTC Paris keynote from Huang at VivaTech and explore GTC Paris sessions. #nvidia #ceo #drops #blueprint #europes
    BLOGS.NVIDIA.COM
    NVIDIA CEO Drops the Blueprint for Europe’s AI Boom
    At GTC Paris — held alongside VivaTech, Europe’s largest tech event — NVIDIA founder and CEO Jensen Huang delivered a clear message: Europe isn’t just adopting AI — it’s building it. “We now have a new industry, an AI industry, and it’s now part of the new infrastructure, called intelligence infrastructure, that will be used by every country, every society,” Huang said, addressing an audience gathered online and at the iconic Dôme de Paris. From exponential inference growth to quantum breakthroughs, and from infrastructure to industry, agentic AI to robotics, Huang outlined how the region is laying the groundwork for an AI-powered future. A New Industrial Revolution At the heart of this transformation, Huang explained, are systems like GB200 NVL72 — “one giant GPU” and NVIDIA’s most powerful AI platform yet — now in full production and powering everything from sovereign models to quantum computing. “This machine was designed to be a thinking machine, a thinking machine, in the sense that it reasons, it plans, it spends a lot of time talking to itself,” Huang said, walking the audience through the size and scale of these machines and their performance. At GTC Paris, Huang showed audience members the innards of some of NVIDIA’s latest hardware. There’s more coming, with Huang saying NVIDIA’s partners are now producing 1,000 GB200 systems a week, “and this is just the beginning.” He walked the audience through a range of available systems ranging from the tiny NVIDIA DGX Spark to rack-mounted RTX PRO Servers. Huang explained that NVIDIA is working to help countries use technologies like these to build both AI infrastructure — services built for third parties to use and innovate on — and AI factories, which companies build for their own use, to generate revenue. NVIDIA is partnering with European governments, telcos and cloud providers to deploy NVIDIA technologies across the region. NVIDIA is also expanding its network of technology centers across Europe — including new hubs in Finland, Germany, Spain, Italy and the U.K. — to accelerate skills development and quantum growth. Quantum Meets Classical Europe’s quantum ambitions just got a boost. The NVIDIA CUDA-Q platform is live on Denmark’s Gefion supercomputer, opening new possibilities for hybrid AI and quantum engineering. In addition, Huang announced that CUDA-Q is now available on NVIDIA Grace Blackwell systems. Across the continent, NVIDIA is partnering with supercomputing centers and quantum hardware builders to advance hybrid quantum-AI research and accelerate quantum error correction. “Quantum computing is reaching an inflection point,” Huang said. “We are within reach of being able to apply quantum computing, quantum classical computing, in areas that can solve some interesting problems in the coming years.” Sovereign Models, Smarter Agents European developers want more control over their models. Enter NVIDIA Nemotron, designed to help build large language models tuned to local needs. “And so now you know that you have access to an enhanced open model that is still open, that is top of the leader chart,” Huang said. These models will be coming to Perplexity, a reasoning search engine, enabling secure, multilingual AI deployment across Europe. “You can now ask and get questions answered in the language, in the culture, in the sensibility of your country,” Huang said. Huang explained how NVIDIA is helping countries across Europe build AI infrastructure. Every company will build its own agents, Huang said. To help create those agents, Huang introduced a suite of agentic AI blueprints, including an Agentic AI Safety blueprint for enterprises and governments. The new NVIDIA NeMo Agent toolkit and NVIDIA AI Blueprint for building data flywheels further accelerate the development of safe, high-performing AI agents. To help deploy these agents, NVIDIA is partnering with European governments, telcos and cloud providers to deploy the DGX Cloud Lepton platform across the region, providing instant access to accelerated computing capacity. “One model architecture, one deployment, and you can run it anywhere,” Huang said, adding that Lepton is now integrated with Hugging Face, giving developers direct access to global compute. The Industrial Cloud Goes Live AI isn’t just virtual. It’s powering physical systems, too, sparking a new industrial revolution. “We’re working on industrial AI with one company after another,” Huang said, describing work to build digital twins based on the NVIDIA Omniverse platform with companies across the continent. Huang explained that everything he showed during his keynote was “computer simulation, not animation” and that it looks beautiful because “it turns out the world is beautiful, and it turns out math is beautiful.” To further this work, Huang announced NVIDIA is launching the world’s first industrial AI cloud — to be built in Germany — to help Europe’s manufacturers simulate, automate and optimize at scale. “Soon, everything that moves will be robotic,” Huang said. “And the car is the next one.” NVIDIA DRIVE, NVIDIA’s full-stack AV platform, is now in production to accelerate the large-scale deployment of safe, intelligent transportation. And to show what’s coming next, Huang was joined on stage by Grek, a pint-sized robot, as Huang talked about how NVIDIA partnered with DeepMind and Disney to build Newton, the world’s most advanced physics training engine for robotics. The Next Wave The next wave of AI has begun — and it’s exponential, Huang explained. “We have physical robots, and we have information robots. We call them agents,” Huang said. “The technology necessary to teach a robot to manipulate, to simulate — and of course, the manifestation of an incredible robot — is now right in front of us.” This new era of AI is being driven by a surge in inference workloads. “The number of people using inference has gone from 8 million to 800 million — 100x in just a couple of years,” Huang said. To meet this demand, Huang emphasized the need for a new kind of computer: “We need a special computer designed for thinking, designed for reasoning. And that’s what Blackwell is — a thinking machine.” Huang and Grek, as he explained how AI is driving advancements in robotics. These Blackwell-powered systems will live in a new class of data centers — AI factories — built to generate tokens, the raw material of modern intelligence. “These AI factories are going to generate tokens,” Huang said, turning to Grek with a smile. “And these tokens are going to become your food, little Grek.” With that, the keynote closed on a bold vision: a future powered by sovereign infrastructure, agentic AI, robotics — and exponential inference — all built in partnership with Europe. Watch the NVIDIA GTC Paris keynote from Huang at VivaTech and explore GTC Paris sessions.
    Like
    Love
    Sad
    23
    0 Commentarii 0 Distribuiri
  • EPFL Researchers Unveil FG2 at CVPR: A New AI Model That Slashes Localization Errors by 28% for Autonomous Vehicles in GPS-Denied Environments

    Navigating the dense urban canyons of cities like San Francisco or New York can be a nightmare for GPS systems. The towering skyscrapers block and reflect satellite signals, leading to location errors of tens of meters. For you and me, that might mean a missed turn. But for an autonomous vehicle or a delivery robot, that level of imprecision is the difference between a successful mission and a costly failure. These machines require pinpoint accuracy to operate safely and efficiently. Addressing this critical challenge, researchers from the École Polytechnique Fédérale de Lausannein Switzerland have introduced a groundbreaking new method for visual localization during CVPR 2025
    Their new paper, “FG2: Fine-Grained Cross-View Localization by Fine-Grained Feature Matching,” presents a novel AI model that significantly enhances the ability of a ground-level system, like an autonomous car, to determine its exact position and orientation using only a camera and a corresponding aerialimage. The new approach has demonstrated a remarkable 28% reduction in mean localization error compared to the previous state-of-the-art on a challenging public dataset.
    Key Takeaways:

    Superior Accuracy: The FG2 model reduces the average localization error by a significant 28% on the VIGOR cross-area test set, a challenging benchmark for this task.
    Human-like Intuition: Instead of relying on abstract descriptors, the model mimics human reasoning by matching fine-grained, semantically consistent features—like curbs, crosswalks, and buildings—between a ground-level photo and an aerial map.
    Enhanced Interpretability: The method allows researchers to “see” what the AI is “thinking” by visualizing exactly which features in the ground and aerial images are being matched, a major step forward from previous “black box” models.
    Weakly Supervised Learning: Remarkably, the model learns these complex and consistent feature matches without any direct labels for correspondences. It achieves this using only the final camera pose as a supervisory signal.

    Challenge: Seeing the World from Two Different Angles
    The core problem of cross-view localization is the dramatic difference in perspective between a street-level camera and an overhead satellite view. A building facade seen from the ground looks completely different from its rooftop signature in an aerial image. Existing methods have struggled with this. Some create a general “descriptor” for the entire scene, but this is an abstract approach that doesn’t mirror how humans naturally localize themselves by spotting specific landmarks. Other methods transform the ground image into a Bird’s-Eye-Viewbut are often limited to the ground plane, ignoring crucial vertical structures like buildings.

    FG2: Matching Fine-Grained Features
    The EPFL team’s FG2 method introduces a more intuitive and effective process. It aligns two sets of points: one generated from the ground-level image and another sampled from the aerial map.

    Here’s a breakdown of their innovative pipeline:

    Mapping to 3D: The process begins by taking the features from the ground-level image and lifting them into a 3D point cloud centered around the camera. This creates a 3D representation of the immediate environment.
    Smart Pooling to BEV: This is where the magic happens. Instead of simply flattening the 3D data, the model learns to intelligently select the most important features along the verticaldimension for each point. It essentially asks, “For this spot on the map, is the ground-level road marking more important, or is the edge of that building’s roof the better landmark?” This selection process is crucial, as it allows the model to correctly associate features like building facades with their corresponding rooftops in the aerial view.
    Feature Matching and Pose Estimation: Once both the ground and aerial views are represented as 2D point planes with rich feature descriptors, the model computes the similarity between them. It then samples a sparse set of the most confident matches and uses a classic geometric algorithm called Procrustes alignment to calculate the precise 3-DoFpose.

    Unprecedented Performance and Interpretability
    The results speak for themselves. On the challenging VIGOR dataset, which includes images from different cities in its cross-area test, FG2 reduced the mean localization error by 28% compared to the previous best method. It also demonstrated superior generalization capabilities on the KITTI dataset, a staple in autonomous driving research.

    Perhaps more importantly, the FG2 model offers a new level of transparency. By visualizing the matched points, the researchers showed that the model learns semantically consistent correspondences without being explicitly told to. For example, the system correctly matches zebra crossings, road markings, and even building facades in the ground view to their corresponding locations on the aerial map. This interpretability is extremenly valuable for building trust in safety-critical autonomous systems.
    “A Clearer Path” for Autonomous Navigation
    The FG2 method represents a significant leap forward in fine-grained visual localization. By developing a model that intelligently selects and matches features in a way that mirrors human intuition, the EPFL researchers have not only shattered previous accuracy records but also made the decision-making process of the AI more interpretable. This work paves the way for more robust and reliable navigation systems for autonomous vehicles, drones, and robots, bringing us one step closer to a future where machines can confidently navigate our world, even when GPS fails them.

    Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.
    Jean-marc MommessinJean-marc is a successful AI business executive .He leads and accelerates growth for AI powered solutions and started a computer vision company in 2006. He is a recognized speaker at AI conferences and has an MBA from Stanford.Jean-marc Mommessinhttps://www.marktechpost.com/author/jean-marc0000677/AI-Generated Ad Created with Google’s Veo3 Airs During NBA Finals, Slashing Production Costs by 95%Jean-marc Mommessinhttps://www.marktechpost.com/author/jean-marc0000677/Highlighted at CVPR 2025: Google DeepMind’s ‘Motion Prompting’ Paper Unlocks Granular Video ControlJean-marc Mommessinhttps://www.marktechpost.com/author/jean-marc0000677/Snowflake Charts New AI Territory: Cortex AISQL & Snowflake Intelligence Poised to Reshape Data AnalyticsJean-marc Mommessinhttps://www.marktechpost.com/author/jean-marc0000677/Exclusive Talk: Joey Conway of NVIDIA on Llama Nemotron Ultra and Open Source Models
    #epfl #researchers #unveil #fg2 #cvpr
    EPFL Researchers Unveil FG2 at CVPR: A New AI Model That Slashes Localization Errors by 28% for Autonomous Vehicles in GPS-Denied Environments
    Navigating the dense urban canyons of cities like San Francisco or New York can be a nightmare for GPS systems. The towering skyscrapers block and reflect satellite signals, leading to location errors of tens of meters. For you and me, that might mean a missed turn. But for an autonomous vehicle or a delivery robot, that level of imprecision is the difference between a successful mission and a costly failure. These machines require pinpoint accuracy to operate safely and efficiently. Addressing this critical challenge, researchers from the École Polytechnique Fédérale de Lausannein Switzerland have introduced a groundbreaking new method for visual localization during CVPR 2025 Their new paper, “FG2: Fine-Grained Cross-View Localization by Fine-Grained Feature Matching,” presents a novel AI model that significantly enhances the ability of a ground-level system, like an autonomous car, to determine its exact position and orientation using only a camera and a corresponding aerialimage. The new approach has demonstrated a remarkable 28% reduction in mean localization error compared to the previous state-of-the-art on a challenging public dataset. Key Takeaways: Superior Accuracy: The FG2 model reduces the average localization error by a significant 28% on the VIGOR cross-area test set, a challenging benchmark for this task. Human-like Intuition: Instead of relying on abstract descriptors, the model mimics human reasoning by matching fine-grained, semantically consistent features—like curbs, crosswalks, and buildings—between a ground-level photo and an aerial map. Enhanced Interpretability: The method allows researchers to “see” what the AI is “thinking” by visualizing exactly which features in the ground and aerial images are being matched, a major step forward from previous “black box” models. Weakly Supervised Learning: Remarkably, the model learns these complex and consistent feature matches without any direct labels for correspondences. It achieves this using only the final camera pose as a supervisory signal. Challenge: Seeing the World from Two Different Angles The core problem of cross-view localization is the dramatic difference in perspective between a street-level camera and an overhead satellite view. A building facade seen from the ground looks completely different from its rooftop signature in an aerial image. Existing methods have struggled with this. Some create a general “descriptor” for the entire scene, but this is an abstract approach that doesn’t mirror how humans naturally localize themselves by spotting specific landmarks. Other methods transform the ground image into a Bird’s-Eye-Viewbut are often limited to the ground plane, ignoring crucial vertical structures like buildings. FG2: Matching Fine-Grained Features The EPFL team’s FG2 method introduces a more intuitive and effective process. It aligns two sets of points: one generated from the ground-level image and another sampled from the aerial map. Here’s a breakdown of their innovative pipeline: Mapping to 3D: The process begins by taking the features from the ground-level image and lifting them into a 3D point cloud centered around the camera. This creates a 3D representation of the immediate environment. Smart Pooling to BEV: This is where the magic happens. Instead of simply flattening the 3D data, the model learns to intelligently select the most important features along the verticaldimension for each point. It essentially asks, “For this spot on the map, is the ground-level road marking more important, or is the edge of that building’s roof the better landmark?” This selection process is crucial, as it allows the model to correctly associate features like building facades with their corresponding rooftops in the aerial view. Feature Matching and Pose Estimation: Once both the ground and aerial views are represented as 2D point planes with rich feature descriptors, the model computes the similarity between them. It then samples a sparse set of the most confident matches and uses a classic geometric algorithm called Procrustes alignment to calculate the precise 3-DoFpose. Unprecedented Performance and Interpretability The results speak for themselves. On the challenging VIGOR dataset, which includes images from different cities in its cross-area test, FG2 reduced the mean localization error by 28% compared to the previous best method. It also demonstrated superior generalization capabilities on the KITTI dataset, a staple in autonomous driving research. Perhaps more importantly, the FG2 model offers a new level of transparency. By visualizing the matched points, the researchers showed that the model learns semantically consistent correspondences without being explicitly told to. For example, the system correctly matches zebra crossings, road markings, and even building facades in the ground view to their corresponding locations on the aerial map. This interpretability is extremenly valuable for building trust in safety-critical autonomous systems. “A Clearer Path” for Autonomous Navigation The FG2 method represents a significant leap forward in fine-grained visual localization. By developing a model that intelligently selects and matches features in a way that mirrors human intuition, the EPFL researchers have not only shattered previous accuracy records but also made the decision-making process of the AI more interpretable. This work paves the way for more robust and reliable navigation systems for autonomous vehicles, drones, and robots, bringing us one step closer to a future where machines can confidently navigate our world, even when GPS fails them. Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Jean-marc MommessinJean-marc is a successful AI business executive .He leads and accelerates growth for AI powered solutions and started a computer vision company in 2006. He is a recognized speaker at AI conferences and has an MBA from Stanford.Jean-marc Mommessinhttps://www.marktechpost.com/author/jean-marc0000677/AI-Generated Ad Created with Google’s Veo3 Airs During NBA Finals, Slashing Production Costs by 95%Jean-marc Mommessinhttps://www.marktechpost.com/author/jean-marc0000677/Highlighted at CVPR 2025: Google DeepMind’s ‘Motion Prompting’ Paper Unlocks Granular Video ControlJean-marc Mommessinhttps://www.marktechpost.com/author/jean-marc0000677/Snowflake Charts New AI Territory: Cortex AISQL & Snowflake Intelligence Poised to Reshape Data AnalyticsJean-marc Mommessinhttps://www.marktechpost.com/author/jean-marc0000677/Exclusive Talk: Joey Conway of NVIDIA on Llama Nemotron Ultra and Open Source Models #epfl #researchers #unveil #fg2 #cvpr
    WWW.MARKTECHPOST.COM
    EPFL Researchers Unveil FG2 at CVPR: A New AI Model That Slashes Localization Errors by 28% for Autonomous Vehicles in GPS-Denied Environments
    Navigating the dense urban canyons of cities like San Francisco or New York can be a nightmare for GPS systems. The towering skyscrapers block and reflect satellite signals, leading to location errors of tens of meters. For you and me, that might mean a missed turn. But for an autonomous vehicle or a delivery robot, that level of imprecision is the difference between a successful mission and a costly failure. These machines require pinpoint accuracy to operate safely and efficiently. Addressing this critical challenge, researchers from the École Polytechnique Fédérale de Lausanne (EPFL) in Switzerland have introduced a groundbreaking new method for visual localization during CVPR 2025 Their new paper, “FG2: Fine-Grained Cross-View Localization by Fine-Grained Feature Matching,” presents a novel AI model that significantly enhances the ability of a ground-level system, like an autonomous car, to determine its exact position and orientation using only a camera and a corresponding aerial (or satellite) image. The new approach has demonstrated a remarkable 28% reduction in mean localization error compared to the previous state-of-the-art on a challenging public dataset. Key Takeaways: Superior Accuracy: The FG2 model reduces the average localization error by a significant 28% on the VIGOR cross-area test set, a challenging benchmark for this task. Human-like Intuition: Instead of relying on abstract descriptors, the model mimics human reasoning by matching fine-grained, semantically consistent features—like curbs, crosswalks, and buildings—between a ground-level photo and an aerial map. Enhanced Interpretability: The method allows researchers to “see” what the AI is “thinking” by visualizing exactly which features in the ground and aerial images are being matched, a major step forward from previous “black box” models. Weakly Supervised Learning: Remarkably, the model learns these complex and consistent feature matches without any direct labels for correspondences. It achieves this using only the final camera pose as a supervisory signal. Challenge: Seeing the World from Two Different Angles The core problem of cross-view localization is the dramatic difference in perspective between a street-level camera and an overhead satellite view. A building facade seen from the ground looks completely different from its rooftop signature in an aerial image. Existing methods have struggled with this. Some create a general “descriptor” for the entire scene, but this is an abstract approach that doesn’t mirror how humans naturally localize themselves by spotting specific landmarks. Other methods transform the ground image into a Bird’s-Eye-View (BEV) but are often limited to the ground plane, ignoring crucial vertical structures like buildings. FG2: Matching Fine-Grained Features The EPFL team’s FG2 method introduces a more intuitive and effective process. It aligns two sets of points: one generated from the ground-level image and another sampled from the aerial map. Here’s a breakdown of their innovative pipeline: Mapping to 3D: The process begins by taking the features from the ground-level image and lifting them into a 3D point cloud centered around the camera. This creates a 3D representation of the immediate environment. Smart Pooling to BEV: This is where the magic happens. Instead of simply flattening the 3D data, the model learns to intelligently select the most important features along the vertical (height) dimension for each point. It essentially asks, “For this spot on the map, is the ground-level road marking more important, or is the edge of that building’s roof the better landmark?” This selection process is crucial, as it allows the model to correctly associate features like building facades with their corresponding rooftops in the aerial view. Feature Matching and Pose Estimation: Once both the ground and aerial views are represented as 2D point planes with rich feature descriptors, the model computes the similarity between them. It then samples a sparse set of the most confident matches and uses a classic geometric algorithm called Procrustes alignment to calculate the precise 3-DoF (x, y, and yaw) pose. Unprecedented Performance and Interpretability The results speak for themselves. On the challenging VIGOR dataset, which includes images from different cities in its cross-area test, FG2 reduced the mean localization error by 28% compared to the previous best method. It also demonstrated superior generalization capabilities on the KITTI dataset, a staple in autonomous driving research. Perhaps more importantly, the FG2 model offers a new level of transparency. By visualizing the matched points, the researchers showed that the model learns semantically consistent correspondences without being explicitly told to. For example, the system correctly matches zebra crossings, road markings, and even building facades in the ground view to their corresponding locations on the aerial map. This interpretability is extremenly valuable for building trust in safety-critical autonomous systems. “A Clearer Path” for Autonomous Navigation The FG2 method represents a significant leap forward in fine-grained visual localization. By developing a model that intelligently selects and matches features in a way that mirrors human intuition, the EPFL researchers have not only shattered previous accuracy records but also made the decision-making process of the AI more interpretable. This work paves the way for more robust and reliable navigation systems for autonomous vehicles, drones, and robots, bringing us one step closer to a future where machines can confidently navigate our world, even when GPS fails them. Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Jean-marc MommessinJean-marc is a successful AI business executive .He leads and accelerates growth for AI powered solutions and started a computer vision company in 2006. He is a recognized speaker at AI conferences and has an MBA from Stanford.Jean-marc Mommessinhttps://www.marktechpost.com/author/jean-marc0000677/AI-Generated Ad Created with Google’s Veo3 Airs During NBA Finals, Slashing Production Costs by 95%Jean-marc Mommessinhttps://www.marktechpost.com/author/jean-marc0000677/Highlighted at CVPR 2025: Google DeepMind’s ‘Motion Prompting’ Paper Unlocks Granular Video ControlJean-marc Mommessinhttps://www.marktechpost.com/author/jean-marc0000677/Snowflake Charts New AI Territory: Cortex AISQL & Snowflake Intelligence Poised to Reshape Data AnalyticsJean-marc Mommessinhttps://www.marktechpost.com/author/jean-marc0000677/Exclusive Talk: Joey Conway of NVIDIA on Llama Nemotron Ultra and Open Source Models
    Like
    Love
    Wow
    Angry
    Sad
    601
    0 Commentarii 0 Distribuiri
  • Alibaba Qwen Team Releases Qwen3-Embedding and Qwen3-Reranker Series – Redefining Multilingual Embedding and Ranking Standards

    Text embedding and reranking are foundational to modern information retrieval systems, powering applications such as semantic search, recommendation systems, and retrieval-augmented generation. However, current approaches often face key challenges—particularly in achieving both high multilingual fidelity and task adaptability without relying on proprietary APIs. Existing models frequently fall short in scenarios requiring nuanced semantic understanding across multiple languages or domain-specific tasks like code retrieval and instruction following. Moreover, most open-source models either lack scale or flexibility, while commercial APIs remain costly and closed.
    Qwen3-Embedding and Qwen3-Reranker: A New Standard for Open-Source Embedding
    Alibaba’s Qwen Team has unveiled the Qwen3-Embedding and Qwen3-Reranker Series—models that set a new benchmark in multilingual text embedding and relevance ranking. Built on the Qwen3 foundation models, the series includes variants in 0.6B, 4B, and 8B parameter sizes and supports a wide range of languages, making it one of the most versatile and performant open-source offerings to date. These models are now open-sourced under the Apache 2.0 license on Hugging Face, GitHub, and ModelScope, and are also accessible via Alibaba Cloud APIs.
    These models are optimized for use cases such as semantic retrieval, classification, RAG, sentiment analysis, and code search—providing a strong alternative to existing solutions like Gemini Embedding and OpenAI’s embedding APIs.

    Technical Architecture
    Qwen3-Embedding models adopt a dense transformer-based architecture with causal attention, producing embeddings by extracting the hidden state corresponding to thetoken. Instruction-awareness is a key feature: input queries are formatted as {instruction} {query}<|endoftext|>, enabling task-conditioned embeddings. The reranker models are trained with a binary classification format, judging document-query relevance in an instruction-guided manner using a token likelihood-based scoring function.

    The models are trained using a robust multi-stage training pipeline:

    Large-scale weak supervision: 150M synthetic training pairs generated using Qwen3-32B, covering retrieval, classification, STS, and bitext mining across languages and tasks.
    Supervised fine-tuning: 12M high-quality data pairs are selected using cosine similarity, fine-tuning performance in downstream applications.
    Model merging: Spherical linear interpolationof multiple fine-tuned checkpoints ensures robustness and generalization.

    This synthetic data generation pipeline enables control over data quality, language diversity, task difficulty, and more—resulting in a high degree of coverage and relevance in low-resource settings.
    Performance Benchmarks and Insights
    The Qwen3-Embedding and Qwen3-Reranker series demonstrate strong empirical performance across several multilingual benchmarks.

    On MMTEB, Qwen3-Embedding-8B achieves a mean task score of 70.58, surpassing Gemini and GTE-Qwen2 series.
    On MTEB: Qwen3-Embedding-8B reaches 75.22, outperforming other open models including NV-Embed-v2 and GritLM-7B.
    On MTEB-Code: Qwen3-Embedding-8B leads with 80.68, excelling in applications like code retrieval and Stack Overflow QA.

    For reranking:

    Qwen3-Reranker-0.6B already outperforms Jina and BGE rerankers.
    Qwen3-Reranker-8B achieves 81.22 on MTEB-Code and 72.94 on MMTEB-R, marking state-of-the-art performance.

    Ablation studies confirm the necessity of each training stage. Removing synthetic pretraining or model merging led to significant performance drops, emphasizing their contributions.
    Conclusion
    Alibaba’s Qwen3-Embedding and Qwen3-Reranker Series present a robust, open, and scalable solution to multilingual and instruction-aware semantic representation. With strong empirical results across MTEB, MMTEB, and MTEB-Code, these models bridge the gap between proprietary APIs and open-source accessibility. Their thoughtful training design—leveraging high-quality synthetic data, instruction-tuning, and model merging—positions them as ideal candidates for enterprise applications in search, retrieval, and RAG pipelines. By open-sourcing these models, the Qwen team not only pushes the boundaries of language understanding but also empowers the broader community to innovate on top of a solid foundation.

    Check out the Paper, Technical details, Qwen3-Embedding and Qwen3-Reranker. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter.
    Asif RazzaqWebsite |  + postsBioAsif Razzaq is the CEO of Marktechpost Media Inc.. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts of over 2 million monthly views, illustrating its popularity among audiences.Asif Razzaqhttps://www.marktechpost.com/author/6flvq/A Step-by-Step Coding Guide to Building an Iterative AI Workflow Agent Using LangGraph and GeminiAsif Razzaqhttps://www.marktechpost.com/author/6flvq/From Clicking to Reasoning: WebChoreArena Benchmark Challenges Agents with Memory-Heavy and Multi-Page TasksAsif Razzaqhttps://www.marktechpost.com/author/6flvq/Mistral AI Introduces Mistral Code: A Customizable AI Coding Assistant for Enterprise WorkflowsAsif Razzaqhttps://www.marktechpost.com/author/6flvq/NVIDIA AI Releases Llama Nemotron Nano VL: A Compact Vision-Language Model Optimized for Document Understanding
    #alibaba #qwen #team #releases #qwen3embedding
    Alibaba Qwen Team Releases Qwen3-Embedding and Qwen3-Reranker Series – Redefining Multilingual Embedding and Ranking Standards
    Text embedding and reranking are foundational to modern information retrieval systems, powering applications such as semantic search, recommendation systems, and retrieval-augmented generation. However, current approaches often face key challenges—particularly in achieving both high multilingual fidelity and task adaptability without relying on proprietary APIs. Existing models frequently fall short in scenarios requiring nuanced semantic understanding across multiple languages or domain-specific tasks like code retrieval and instruction following. Moreover, most open-source models either lack scale or flexibility, while commercial APIs remain costly and closed. Qwen3-Embedding and Qwen3-Reranker: A New Standard for Open-Source Embedding Alibaba’s Qwen Team has unveiled the Qwen3-Embedding and Qwen3-Reranker Series—models that set a new benchmark in multilingual text embedding and relevance ranking. Built on the Qwen3 foundation models, the series includes variants in 0.6B, 4B, and 8B parameter sizes and supports a wide range of languages, making it one of the most versatile and performant open-source offerings to date. These models are now open-sourced under the Apache 2.0 license on Hugging Face, GitHub, and ModelScope, and are also accessible via Alibaba Cloud APIs. These models are optimized for use cases such as semantic retrieval, classification, RAG, sentiment analysis, and code search—providing a strong alternative to existing solutions like Gemini Embedding and OpenAI’s embedding APIs. Technical Architecture Qwen3-Embedding models adopt a dense transformer-based architecture with causal attention, producing embeddings by extracting the hidden state corresponding to thetoken. Instruction-awareness is a key feature: input queries are formatted as {instruction} {query}<|endoftext|>, enabling task-conditioned embeddings. The reranker models are trained with a binary classification format, judging document-query relevance in an instruction-guided manner using a token likelihood-based scoring function. The models are trained using a robust multi-stage training pipeline: Large-scale weak supervision: 150M synthetic training pairs generated using Qwen3-32B, covering retrieval, classification, STS, and bitext mining across languages and tasks. Supervised fine-tuning: 12M high-quality data pairs are selected using cosine similarity, fine-tuning performance in downstream applications. Model merging: Spherical linear interpolationof multiple fine-tuned checkpoints ensures robustness and generalization. This synthetic data generation pipeline enables control over data quality, language diversity, task difficulty, and more—resulting in a high degree of coverage and relevance in low-resource settings. Performance Benchmarks and Insights The Qwen3-Embedding and Qwen3-Reranker series demonstrate strong empirical performance across several multilingual benchmarks. On MMTEB, Qwen3-Embedding-8B achieves a mean task score of 70.58, surpassing Gemini and GTE-Qwen2 series. On MTEB: Qwen3-Embedding-8B reaches 75.22, outperforming other open models including NV-Embed-v2 and GritLM-7B. On MTEB-Code: Qwen3-Embedding-8B leads with 80.68, excelling in applications like code retrieval and Stack Overflow QA. For reranking: Qwen3-Reranker-0.6B already outperforms Jina and BGE rerankers. Qwen3-Reranker-8B achieves 81.22 on MTEB-Code and 72.94 on MMTEB-R, marking state-of-the-art performance. Ablation studies confirm the necessity of each training stage. Removing synthetic pretraining or model merging led to significant performance drops, emphasizing their contributions. Conclusion Alibaba’s Qwen3-Embedding and Qwen3-Reranker Series present a robust, open, and scalable solution to multilingual and instruction-aware semantic representation. With strong empirical results across MTEB, MMTEB, and MTEB-Code, these models bridge the gap between proprietary APIs and open-source accessibility. Their thoughtful training design—leveraging high-quality synthetic data, instruction-tuning, and model merging—positions them as ideal candidates for enterprise applications in search, retrieval, and RAG pipelines. By open-sourcing these models, the Qwen team not only pushes the boundaries of language understanding but also empowers the broader community to innovate on top of a solid foundation. Check out the Paper, Technical details, Qwen3-Embedding and Qwen3-Reranker. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. Asif RazzaqWebsite |  + postsBioAsif Razzaq is the CEO of Marktechpost Media Inc.. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts of over 2 million monthly views, illustrating its popularity among audiences.Asif Razzaqhttps://www.marktechpost.com/author/6flvq/A Step-by-Step Coding Guide to Building an Iterative AI Workflow Agent Using LangGraph and GeminiAsif Razzaqhttps://www.marktechpost.com/author/6flvq/From Clicking to Reasoning: WebChoreArena Benchmark Challenges Agents with Memory-Heavy and Multi-Page TasksAsif Razzaqhttps://www.marktechpost.com/author/6flvq/Mistral AI Introduces Mistral Code: A Customizable AI Coding Assistant for Enterprise WorkflowsAsif Razzaqhttps://www.marktechpost.com/author/6flvq/NVIDIA AI Releases Llama Nemotron Nano VL: A Compact Vision-Language Model Optimized for Document Understanding #alibaba #qwen #team #releases #qwen3embedding
    WWW.MARKTECHPOST.COM
    Alibaba Qwen Team Releases Qwen3-Embedding and Qwen3-Reranker Series – Redefining Multilingual Embedding and Ranking Standards
    Text embedding and reranking are foundational to modern information retrieval systems, powering applications such as semantic search, recommendation systems, and retrieval-augmented generation (RAG). However, current approaches often face key challenges—particularly in achieving both high multilingual fidelity and task adaptability without relying on proprietary APIs. Existing models frequently fall short in scenarios requiring nuanced semantic understanding across multiple languages or domain-specific tasks like code retrieval and instruction following. Moreover, most open-source models either lack scale or flexibility, while commercial APIs remain costly and closed. Qwen3-Embedding and Qwen3-Reranker: A New Standard for Open-Source Embedding Alibaba’s Qwen Team has unveiled the Qwen3-Embedding and Qwen3-Reranker Series—models that set a new benchmark in multilingual text embedding and relevance ranking. Built on the Qwen3 foundation models, the series includes variants in 0.6B, 4B, and 8B parameter sizes and supports a wide range of languages (119 in total), making it one of the most versatile and performant open-source offerings to date. These models are now open-sourced under the Apache 2.0 license on Hugging Face, GitHub, and ModelScope, and are also accessible via Alibaba Cloud APIs. These models are optimized for use cases such as semantic retrieval, classification, RAG, sentiment analysis, and code search—providing a strong alternative to existing solutions like Gemini Embedding and OpenAI’s embedding APIs. Technical Architecture Qwen3-Embedding models adopt a dense transformer-based architecture with causal attention, producing embeddings by extracting the hidden state corresponding to the [EOS] token. Instruction-awareness is a key feature: input queries are formatted as {instruction} {query}<|endoftext|>, enabling task-conditioned embeddings. The reranker models are trained with a binary classification format, judging document-query relevance in an instruction-guided manner using a token likelihood-based scoring function. The models are trained using a robust multi-stage training pipeline: Large-scale weak supervision: 150M synthetic training pairs generated using Qwen3-32B, covering retrieval, classification, STS, and bitext mining across languages and tasks. Supervised fine-tuning: 12M high-quality data pairs are selected using cosine similarity (>0.7), fine-tuning performance in downstream applications. Model merging: Spherical linear interpolation (SLERP) of multiple fine-tuned checkpoints ensures robustness and generalization. This synthetic data generation pipeline enables control over data quality, language diversity, task difficulty, and more—resulting in a high degree of coverage and relevance in low-resource settings. Performance Benchmarks and Insights The Qwen3-Embedding and Qwen3-Reranker series demonstrate strong empirical performance across several multilingual benchmarks. On MMTEB (216 tasks across 250+ languages), Qwen3-Embedding-8B achieves a mean task score of 70.58, surpassing Gemini and GTE-Qwen2 series. On MTEB (English v2): Qwen3-Embedding-8B reaches 75.22, outperforming other open models including NV-Embed-v2 and GritLM-7B. On MTEB-Code: Qwen3-Embedding-8B leads with 80.68, excelling in applications like code retrieval and Stack Overflow QA. For reranking: Qwen3-Reranker-0.6B already outperforms Jina and BGE rerankers. Qwen3-Reranker-8B achieves 81.22 on MTEB-Code and 72.94 on MMTEB-R, marking state-of-the-art performance. Ablation studies confirm the necessity of each training stage. Removing synthetic pretraining or model merging led to significant performance drops (up to 6 points on MMTEB), emphasizing their contributions. Conclusion Alibaba’s Qwen3-Embedding and Qwen3-Reranker Series present a robust, open, and scalable solution to multilingual and instruction-aware semantic representation. With strong empirical results across MTEB, MMTEB, and MTEB-Code, these models bridge the gap between proprietary APIs and open-source accessibility. Their thoughtful training design—leveraging high-quality synthetic data, instruction-tuning, and model merging—positions them as ideal candidates for enterprise applications in search, retrieval, and RAG pipelines. By open-sourcing these models, the Qwen team not only pushes the boundaries of language understanding but also empowers the broader community to innovate on top of a solid foundation. Check out the Paper, Technical details, Qwen3-Embedding and Qwen3-Reranker. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. Asif RazzaqWebsite |  + postsBioAsif Razzaq is the CEO of Marktechpost Media Inc.. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts of over 2 million monthly views, illustrating its popularity among audiences.Asif Razzaqhttps://www.marktechpost.com/author/6flvq/A Step-by-Step Coding Guide to Building an Iterative AI Workflow Agent Using LangGraph and GeminiAsif Razzaqhttps://www.marktechpost.com/author/6flvq/From Clicking to Reasoning: WebChoreArena Benchmark Challenges Agents with Memory-Heavy and Multi-Page TasksAsif Razzaqhttps://www.marktechpost.com/author/6flvq/Mistral AI Introduces Mistral Code: A Customizable AI Coding Assistant for Enterprise WorkflowsAsif Razzaqhttps://www.marktechpost.com/author/6flvq/NVIDIA AI Releases Llama Nemotron Nano VL: A Compact Vision-Language Model Optimized for Document Understanding
    Like
    Love
    Wow
    Angry
    Sad
    332
    0 Commentarii 0 Distribuiri
  • Enigmata’s Multi-Stage and Mix-Training Reinforcement Learning Recipe Drives Breakthrough Performance in LLM Puzzle Reasoning

    Large Reasoning Models, trained from LLMs using reinforcement learning, demonstrated great performance in complex reasoning tasks, including mathematics, STEM, and coding. However, existing LRMs face challenges in completing various puzzle tasks that require purely logical reasoning skills, which are easy and obvious for humans. Current methods targeting puzzles focus only on designing benchmarks for evaluation, lacking the training methods and resources for modern LLMs to tackle this challenge. Current puzzle datasets lack diversity and scalability, covering limited puzzle types with little control over generation or difficulty. Moreover, due to the success of the “LLM+RLVR” paradigm, it has become crucial to obtain large, diverse, and challenging sets of verifiable puzzle prompts for training agents.
    Reinforcement Learning with Verifiable Rewardshas emerged as a key method for improving models’ reasoning capabilities, removing the need for reward models by directly assigning rewards based on objectively verifiable answers. Puzzles are particularly well-suited for RLVR. However, most prior RLVR research has overlooked the puzzles’ potential for delivering effective reward signals. In puzzle reasoning of LLMs, existing benchmarks evaluate different types of reasoning, including abstract, deductive, and compositional reasoning. Few benchmarks support scalable generation and difficulty control but lack puzzle diversity. Moreover, the improvement of LLMs’ puzzle-solving abilities mainly falls into two categories: tool integration and RLVR.
    Researchers from ByteDance Seed, Fudan University, Tsinghua University, Nanjing University, and Shanghai Jiao Tong University have proposed Enigmata, the first comprehensive toolkit designed for improving LLMs with puzzle reasoning skills. It contains 36 tasks across seven categories, each featuring a generator that produces unlimited examples with controllable difficulty and a rule-based verifier for automatic evaluation. The researchers further developed Enigmata-Eval as a rigorous benchmark and created optimized multi-task RLVR strategies. Puzzle data from Enigmata enhances SoTA performance on advanced math and STEM reasoning tasks like AIME, BeyondAIME, and GPQA when trained on larger models like Seed1.5-Thinking. This shows the generalization benefits of Enigmata.

    The Enigmata-Data comprises 36 puzzle tasks organized into 7 primary categories, including Crypto, Arithmetic, Logic, Grid, Graph, Search, and Sequential Puzzle, making it the only dataset having multiple task categories with scalability, automatic verification, and public availability. The data construction follows a three-phase pipeline: Tasks Collection and Design, Auto-Generator and Verifier Development, and Sliding Difficulty Control. Moreover, the Enigmata-Eval is developed by systematically sampling from the broader dataset, aiming to extract 50 instances per difficulty level for each task. The final evaluation set contains 4,758 puzzle instances rather than the theoretical maximum of 5,400, due to inherent constraints, where some tasks generate fewer instances per difficulty level.

    The proposed model outperforms most public models on Enigmata-Eval with 32B parameters, showing the effectiveness of the dataset and training recipe. The model stands out on the challenging ARC-AGI benchmark, surpassing strong reasoning models such as Gemini 2.5 Pro, o3-mini, and o1. The Qwen2.5-32B-Enigmata shows outstanding performance in structured reasoning categories, outperforming in Crypto, Arithmetic, and Logic tasks, suggesting effective development of rule-based reasoning capabilities. The model shows competitive performance in search tasks that require strategic exploration and planning capabilities. Moreover, Crypto and Arithmetic tasks tend to provide the highest accuracy, while spatial and sequential tasks remain more difficult.
    In this paper, researchers introduced Enigmata, a comprehensive suite for equipping LLMs with advanced puzzle reasoning that integrates seamlessly with RL using verifiable rule-based rewards. The trained Enigmata-Model shows superior performance and robust generalization skills through RLVR training. Experiments reveal that when applied to larger models such as Seed1.5-Thinking, synthetic puzzle data brings additional benefits in other domains, including mathematics and STEM reasoning over state-of-the-art models. Enigmata provides a solid foundation for the research community to advance reasoning model development, offering a unified framework that effectively bridges logical puzzle-solving with broader reasoning capabilities in LLMs.

    Check out the Paper, GitHub Page and Project Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter.
    Sajjad AnsariSajjad Ansari is a final year undergraduate from IIT Kharagpur. As a Tech enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.Sajjad Ansarihttps://www.marktechpost.com/author/sajjadansari/Multimodal Foundation Models Fall Short on Physical Reasoning: PHYX Benchmark Highlights Key Limitations in Visual and Symbolic IntegrationSajjad Ansarihttps://www.marktechpost.com/author/sajjadansari/Meta AI Introduces Multi-SpatialMLLM: A Multi-Frame Spatial Understanding with Multi-modal Large Language ModelsSajjad Ansarihttps://www.marktechpost.com/author/sajjadansari/Can LLMs Really Judge with Reasoning? Microsoft and Tsinghua Researchers Introduce Reward Reasoning Models to Dynamically Scale Test-Time Compute for Better AlignmentSajjad Ansarihttps://www.marktechpost.com/author/sajjadansari/NVIDIA AI Introduces AceReason-Nemotron for Advancing Math and Code Reasoning through Reinforcement Learning
    #enigmatas #multistage #mixtraining #reinforcement #learning
    Enigmata’s Multi-Stage and Mix-Training Reinforcement Learning Recipe Drives Breakthrough Performance in LLM Puzzle Reasoning
    Large Reasoning Models, trained from LLMs using reinforcement learning, demonstrated great performance in complex reasoning tasks, including mathematics, STEM, and coding. However, existing LRMs face challenges in completing various puzzle tasks that require purely logical reasoning skills, which are easy and obvious for humans. Current methods targeting puzzles focus only on designing benchmarks for evaluation, lacking the training methods and resources for modern LLMs to tackle this challenge. Current puzzle datasets lack diversity and scalability, covering limited puzzle types with little control over generation or difficulty. Moreover, due to the success of the “LLM+RLVR” paradigm, it has become crucial to obtain large, diverse, and challenging sets of verifiable puzzle prompts for training agents. Reinforcement Learning with Verifiable Rewardshas emerged as a key method for improving models’ reasoning capabilities, removing the need for reward models by directly assigning rewards based on objectively verifiable answers. Puzzles are particularly well-suited for RLVR. However, most prior RLVR research has overlooked the puzzles’ potential for delivering effective reward signals. In puzzle reasoning of LLMs, existing benchmarks evaluate different types of reasoning, including abstract, deductive, and compositional reasoning. Few benchmarks support scalable generation and difficulty control but lack puzzle diversity. Moreover, the improvement of LLMs’ puzzle-solving abilities mainly falls into two categories: tool integration and RLVR. Researchers from ByteDance Seed, Fudan University, Tsinghua University, Nanjing University, and Shanghai Jiao Tong University have proposed Enigmata, the first comprehensive toolkit designed for improving LLMs with puzzle reasoning skills. It contains 36 tasks across seven categories, each featuring a generator that produces unlimited examples with controllable difficulty and a rule-based verifier for automatic evaluation. The researchers further developed Enigmata-Eval as a rigorous benchmark and created optimized multi-task RLVR strategies. Puzzle data from Enigmata enhances SoTA performance on advanced math and STEM reasoning tasks like AIME, BeyondAIME, and GPQA when trained on larger models like Seed1.5-Thinking. This shows the generalization benefits of Enigmata. The Enigmata-Data comprises 36 puzzle tasks organized into 7 primary categories, including Crypto, Arithmetic, Logic, Grid, Graph, Search, and Sequential Puzzle, making it the only dataset having multiple task categories with scalability, automatic verification, and public availability. The data construction follows a three-phase pipeline: Tasks Collection and Design, Auto-Generator and Verifier Development, and Sliding Difficulty Control. Moreover, the Enigmata-Eval is developed by systematically sampling from the broader dataset, aiming to extract 50 instances per difficulty level for each task. The final evaluation set contains 4,758 puzzle instances rather than the theoretical maximum of 5,400, due to inherent constraints, where some tasks generate fewer instances per difficulty level. The proposed model outperforms most public models on Enigmata-Eval with 32B parameters, showing the effectiveness of the dataset and training recipe. The model stands out on the challenging ARC-AGI benchmark, surpassing strong reasoning models such as Gemini 2.5 Pro, o3-mini, and o1. The Qwen2.5-32B-Enigmata shows outstanding performance in structured reasoning categories, outperforming in Crypto, Arithmetic, and Logic tasks, suggesting effective development of rule-based reasoning capabilities. The model shows competitive performance in search tasks that require strategic exploration and planning capabilities. Moreover, Crypto and Arithmetic tasks tend to provide the highest accuracy, while spatial and sequential tasks remain more difficult. In this paper, researchers introduced Enigmata, a comprehensive suite for equipping LLMs with advanced puzzle reasoning that integrates seamlessly with RL using verifiable rule-based rewards. The trained Enigmata-Model shows superior performance and robust generalization skills through RLVR training. Experiments reveal that when applied to larger models such as Seed1.5-Thinking, synthetic puzzle data brings additional benefits in other domains, including mathematics and STEM reasoning over state-of-the-art models. Enigmata provides a solid foundation for the research community to advance reasoning model development, offering a unified framework that effectively bridges logical puzzle-solving with broader reasoning capabilities in LLMs. Check out the Paper, GitHub Page and Project Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. Sajjad AnsariSajjad Ansari is a final year undergraduate from IIT Kharagpur. As a Tech enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.Sajjad Ansarihttps://www.marktechpost.com/author/sajjadansari/Multimodal Foundation Models Fall Short on Physical Reasoning: PHYX Benchmark Highlights Key Limitations in Visual and Symbolic IntegrationSajjad Ansarihttps://www.marktechpost.com/author/sajjadansari/Meta AI Introduces Multi-SpatialMLLM: A Multi-Frame Spatial Understanding with Multi-modal Large Language ModelsSajjad Ansarihttps://www.marktechpost.com/author/sajjadansari/Can LLMs Really Judge with Reasoning? Microsoft and Tsinghua Researchers Introduce Reward Reasoning Models to Dynamically Scale Test-Time Compute for Better AlignmentSajjad Ansarihttps://www.marktechpost.com/author/sajjadansari/NVIDIA AI Introduces AceReason-Nemotron for Advancing Math and Code Reasoning through Reinforcement Learning #enigmatas #multistage #mixtraining #reinforcement #learning
    WWW.MARKTECHPOST.COM
    Enigmata’s Multi-Stage and Mix-Training Reinforcement Learning Recipe Drives Breakthrough Performance in LLM Puzzle Reasoning
    Large Reasoning Models (LRMs), trained from LLMs using reinforcement learning (RL), demonstrated great performance in complex reasoning tasks, including mathematics, STEM, and coding. However, existing LRMs face challenges in completing various puzzle tasks that require purely logical reasoning skills, which are easy and obvious for humans. Current methods targeting puzzles focus only on designing benchmarks for evaluation, lacking the training methods and resources for modern LLMs to tackle this challenge. Current puzzle datasets lack diversity and scalability, covering limited puzzle types with little control over generation or difficulty. Moreover, due to the success of the “LLM+RLVR” paradigm, it has become crucial to obtain large, diverse, and challenging sets of verifiable puzzle prompts for training agents. Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a key method for improving models’ reasoning capabilities, removing the need for reward models by directly assigning rewards based on objectively verifiable answers. Puzzles are particularly well-suited for RLVR. However, most prior RLVR research has overlooked the puzzles’ potential for delivering effective reward signals. In puzzle reasoning of LLMs, existing benchmarks evaluate different types of reasoning, including abstract, deductive, and compositional reasoning. Few benchmarks support scalable generation and difficulty control but lack puzzle diversity. Moreover, the improvement of LLMs’ puzzle-solving abilities mainly falls into two categories: tool integration and RLVR. Researchers from ByteDance Seed, Fudan University, Tsinghua University, Nanjing University, and Shanghai Jiao Tong University have proposed Enigmata, the first comprehensive toolkit designed for improving LLMs with puzzle reasoning skills. It contains 36 tasks across seven categories, each featuring a generator that produces unlimited examples with controllable difficulty and a rule-based verifier for automatic evaluation. The researchers further developed Enigmata-Eval as a rigorous benchmark and created optimized multi-task RLVR strategies. Puzzle data from Enigmata enhances SoTA performance on advanced math and STEM reasoning tasks like AIME, BeyondAIME, and GPQA when trained on larger models like Seed1.5-Thinking. This shows the generalization benefits of Enigmata. The Enigmata-Data comprises 36 puzzle tasks organized into 7 primary categories, including Crypto, Arithmetic, Logic, Grid, Graph, Search, and Sequential Puzzle, making it the only dataset having multiple task categories with scalability, automatic verification, and public availability. The data construction follows a three-phase pipeline: Tasks Collection and Design, Auto-Generator and Verifier Development, and Sliding Difficulty Control. Moreover, the Enigmata-Eval is developed by systematically sampling from the broader dataset, aiming to extract 50 instances per difficulty level for each task. The final evaluation set contains 4,758 puzzle instances rather than the theoretical maximum of 5,400, due to inherent constraints, where some tasks generate fewer instances per difficulty level. The proposed model outperforms most public models on Enigmata-Eval with 32B parameters, showing the effectiveness of the dataset and training recipe. The model stands out on the challenging ARC-AGI benchmark, surpassing strong reasoning models such as Gemini 2.5 Pro, o3-mini, and o1. The Qwen2.5-32B-Enigmata shows outstanding performance in structured reasoning categories, outperforming in Crypto, Arithmetic, and Logic tasks, suggesting effective development of rule-based reasoning capabilities. The model shows competitive performance in search tasks that require strategic exploration and planning capabilities. Moreover, Crypto and Arithmetic tasks tend to provide the highest accuracy, while spatial and sequential tasks remain more difficult. In this paper, researchers introduced Enigmata, a comprehensive suite for equipping LLMs with advanced puzzle reasoning that integrates seamlessly with RL using verifiable rule-based rewards. The trained Enigmata-Model shows superior performance and robust generalization skills through RLVR training. Experiments reveal that when applied to larger models such as Seed1.5-Thinking (20B/200B parameters), synthetic puzzle data brings additional benefits in other domains, including mathematics and STEM reasoning over state-of-the-art models. Enigmata provides a solid foundation for the research community to advance reasoning model development, offering a unified framework that effectively bridges logical puzzle-solving with broader reasoning capabilities in LLMs. Check out the Paper, GitHub Page and Project Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. Sajjad AnsariSajjad Ansari is a final year undergraduate from IIT Kharagpur. As a Tech enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.Sajjad Ansarihttps://www.marktechpost.com/author/sajjadansari/Multimodal Foundation Models Fall Short on Physical Reasoning: PHYX Benchmark Highlights Key Limitations in Visual and Symbolic IntegrationSajjad Ansarihttps://www.marktechpost.com/author/sajjadansari/Meta AI Introduces Multi-SpatialMLLM: A Multi-Frame Spatial Understanding with Multi-modal Large Language ModelsSajjad Ansarihttps://www.marktechpost.com/author/sajjadansari/Can LLMs Really Judge with Reasoning? Microsoft and Tsinghua Researchers Introduce Reward Reasoning Models to Dynamically Scale Test-Time Compute for Better AlignmentSajjad Ansarihttps://www.marktechpost.com/author/sajjadansari/NVIDIA AI Introduces AceReason-Nemotron for Advancing Math and Code Reasoning through Reinforcement Learning
    0 Commentarii 0 Distribuiri
  • Multimodal Foundation Models Fall Short on Physical Reasoning: PHYX Benchmark Highlights Key Limitations in Visual and Symbolic Integration

    State-of-the-art models show human-competitive accuracy on AIME, GPQA, MATH-500, and OlympiadBench, solving Olympiad-level problems. Recent multimodal foundation models have advanced benchmarks for disciplinary knowledge and mathematical reasoning. However, these evaluations miss a crucial aspect of machine intelligence: physical reasoning, which requires integrating disciplinary knowledge, symbolic operations, and real-world constraints. Physical problem-solving differs fundamentally from pure mathematical reasoning as it demands models to decode implicit conditions in questions. For example, interpreting “smooth surface” as zero friction coefficient, and maintaining physical consistency across reasoning chains because physical laws remain constant regardless of reasoning trajectories.
    MLLM shows excellent visual understanding by integrating visual and textual data across various tasks, motivating exploration of its reasoning abilities. However, uncertainty remains regarding whether these models possess genuine advanced reasoning capabilities for visual tasks, particularly in physical domains closer to real-world scenarios. Several LLM benchmarks have emerged to evaluate reasoning abilities, with PHYBench being most relevant for physics reasoning. MLLM scientific benchmarks, such as PhysReason and EMMA, contain multimodal physics problems with figures, however, they include only small physics subsets, which inadequately evaluate MLLMs’ capabilities for reasoning and solving advanced physics problems.
    Researchers from the University of Hong Kong, the University of Michigan, the University of Toronto, the University of Waterloo, and the Ohio State University have proposed PHYX, a novel benchmark to evaluate the physical reasoning capabilities of foundation models. It comprises 3,000 visually-grounded physics questions, precisely curated across six distinct physics domains: Mechanics, Electromagnetism, Thermodynamics, Wave/Acoustics, Optics, and Modern Physics. It evaluates physics-based reasoning via multimodal problem-solving with three core innovations:3,000 newly collected questions with realistic physical scenarios requiring integrated visual analysis and causal reasoning,Expert-validated data design covering six fundamental physics domains, andStrict unified three-step evaluation protocols.

    Researchers designed a four-stage data collection process to ensure high-quality data. The process begins with an in-depth survey of core physics disciplines to determine coverage across diverse domains and subfields, followed by the recruitment of STEM graduate students as expert annotators. They comply with copyright restrictions and avoid data contamination by selecting questions without answers that are immediately available. Moreover, quality control involves a three-stage cleaning process including duplicate detection through lexical overlap analysis with manual review by physics Ph.D. students, followed by filtering the shortest 10% of questions based on textual length, resulting in 3,000 high-quality questions from an initial collection of 3,300.

    PHYX presents significant challenges for current models, with even the worst-performing human experts achieving 75.6% accuracy, outperforming all evaluated models and showing a gap between human expertise and current model capabilities. The benchmark reveals that multiple-choice formats narrow performance gaps by allowing weaker models to rely on surface-level cues, but open-ended questions demand genuine reasoning and precise answer generation. Comparing GPT-4o’s performance on PHYX to previously reported results on MathVista and MATH-V, lower accuracy in physical reasoning tasks emphasizes that physical reasoning requires deeper integration of abstract concepts and real-world knowledge, presenting greater challenges than purely mathematical contexts.
    In conclusion, researchers introduced PHYX, the first large-scale benchmark for evaluating physical reasoning in multimodal, visually grounded scenarios. Rigorous evaluation reveals that state-of-the-art models show limitations in physical reasoning, relying predominantly on memorized knowledge, mathematical formulas, and superficial visual patterns rather than genuine understanding of physical principles. The benchmark focuses exclusively on English-language prompts and annotations, limiting assessment of multilingual reasoning abilities. Also, while images depict physically realistic scenarios, they are often schematic or textbook-style rather than real-world photographs, which may not fully capture the complexity of perception in natural environments.

    Check out the Paper, Code and Project Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter.
    Sajjad AnsariSajjad Ansari is a final year undergraduate from IIT Kharagpur. As a Tech enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.Sajjad Ansarihttps://www.marktechpost.com/author/sajjadansari/Meta AI Introduces Multi-SpatialMLLM: A Multi-Frame Spatial Understanding with Multi-modal Large Language ModelsSajjad Ansarihttps://www.marktechpost.com/author/sajjadansari/Can LLMs Really Judge with Reasoning? Microsoft and Tsinghua Researchers Introduce Reward Reasoning Models to Dynamically Scale Test-Time Compute for Better AlignmentSajjad Ansarihttps://www.marktechpost.com/author/sajjadansari/NVIDIA AI Introduces AceReason-Nemotron for Advancing Math and Code Reasoning through Reinforcement LearningSajjad Ansarihttps://www.marktechpost.com/author/sajjadansari/Researchers Introduce MMLONGBENCH: A Comprehensive Benchmark for Long-Context Vision-Language Models
    #multimodal #foundation #models #fall #short
    Multimodal Foundation Models Fall Short on Physical Reasoning: PHYX Benchmark Highlights Key Limitations in Visual and Symbolic Integration
    State-of-the-art models show human-competitive accuracy on AIME, GPQA, MATH-500, and OlympiadBench, solving Olympiad-level problems. Recent multimodal foundation models have advanced benchmarks for disciplinary knowledge and mathematical reasoning. However, these evaluations miss a crucial aspect of machine intelligence: physical reasoning, which requires integrating disciplinary knowledge, symbolic operations, and real-world constraints. Physical problem-solving differs fundamentally from pure mathematical reasoning as it demands models to decode implicit conditions in questions. For example, interpreting “smooth surface” as zero friction coefficient, and maintaining physical consistency across reasoning chains because physical laws remain constant regardless of reasoning trajectories. MLLM shows excellent visual understanding by integrating visual and textual data across various tasks, motivating exploration of its reasoning abilities. However, uncertainty remains regarding whether these models possess genuine advanced reasoning capabilities for visual tasks, particularly in physical domains closer to real-world scenarios. Several LLM benchmarks have emerged to evaluate reasoning abilities, with PHYBench being most relevant for physics reasoning. MLLM scientific benchmarks, such as PhysReason and EMMA, contain multimodal physics problems with figures, however, they include only small physics subsets, which inadequately evaluate MLLMs’ capabilities for reasoning and solving advanced physics problems. Researchers from the University of Hong Kong, the University of Michigan, the University of Toronto, the University of Waterloo, and the Ohio State University have proposed PHYX, a novel benchmark to evaluate the physical reasoning capabilities of foundation models. It comprises 3,000 visually-grounded physics questions, precisely curated across six distinct physics domains: Mechanics, Electromagnetism, Thermodynamics, Wave/Acoustics, Optics, and Modern Physics. It evaluates physics-based reasoning via multimodal problem-solving with three core innovations:3,000 newly collected questions with realistic physical scenarios requiring integrated visual analysis and causal reasoning,Expert-validated data design covering six fundamental physics domains, andStrict unified three-step evaluation protocols. Researchers designed a four-stage data collection process to ensure high-quality data. The process begins with an in-depth survey of core physics disciplines to determine coverage across diverse domains and subfields, followed by the recruitment of STEM graduate students as expert annotators. They comply with copyright restrictions and avoid data contamination by selecting questions without answers that are immediately available. Moreover, quality control involves a three-stage cleaning process including duplicate detection through lexical overlap analysis with manual review by physics Ph.D. students, followed by filtering the shortest 10% of questions based on textual length, resulting in 3,000 high-quality questions from an initial collection of 3,300. PHYX presents significant challenges for current models, with even the worst-performing human experts achieving 75.6% accuracy, outperforming all evaluated models and showing a gap between human expertise and current model capabilities. The benchmark reveals that multiple-choice formats narrow performance gaps by allowing weaker models to rely on surface-level cues, but open-ended questions demand genuine reasoning and precise answer generation. Comparing GPT-4o’s performance on PHYX to previously reported results on MathVista and MATH-V, lower accuracy in physical reasoning tasks emphasizes that physical reasoning requires deeper integration of abstract concepts and real-world knowledge, presenting greater challenges than purely mathematical contexts. In conclusion, researchers introduced PHYX, the first large-scale benchmark for evaluating physical reasoning in multimodal, visually grounded scenarios. Rigorous evaluation reveals that state-of-the-art models show limitations in physical reasoning, relying predominantly on memorized knowledge, mathematical formulas, and superficial visual patterns rather than genuine understanding of physical principles. The benchmark focuses exclusively on English-language prompts and annotations, limiting assessment of multilingual reasoning abilities. Also, while images depict physically realistic scenarios, they are often schematic or textbook-style rather than real-world photographs, which may not fully capture the complexity of perception in natural environments. Check out the Paper, Code and Project Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. Sajjad AnsariSajjad Ansari is a final year undergraduate from IIT Kharagpur. As a Tech enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.Sajjad Ansarihttps://www.marktechpost.com/author/sajjadansari/Meta AI Introduces Multi-SpatialMLLM: A Multi-Frame Spatial Understanding with Multi-modal Large Language ModelsSajjad Ansarihttps://www.marktechpost.com/author/sajjadansari/Can LLMs Really Judge with Reasoning? Microsoft and Tsinghua Researchers Introduce Reward Reasoning Models to Dynamically Scale Test-Time Compute for Better AlignmentSajjad Ansarihttps://www.marktechpost.com/author/sajjadansari/NVIDIA AI Introduces AceReason-Nemotron for Advancing Math and Code Reasoning through Reinforcement LearningSajjad Ansarihttps://www.marktechpost.com/author/sajjadansari/Researchers Introduce MMLONGBENCH: A Comprehensive Benchmark for Long-Context Vision-Language Models #multimodal #foundation #models #fall #short
    WWW.MARKTECHPOST.COM
    Multimodal Foundation Models Fall Short on Physical Reasoning: PHYX Benchmark Highlights Key Limitations in Visual and Symbolic Integration
    State-of-the-art models show human-competitive accuracy on AIME, GPQA, MATH-500, and OlympiadBench, solving Olympiad-level problems. Recent multimodal foundation models have advanced benchmarks for disciplinary knowledge and mathematical reasoning. However, these evaluations miss a crucial aspect of machine intelligence: physical reasoning, which requires integrating disciplinary knowledge, symbolic operations, and real-world constraints. Physical problem-solving differs fundamentally from pure mathematical reasoning as it demands models to decode implicit conditions in questions. For example, interpreting “smooth surface” as zero friction coefficient, and maintaining physical consistency across reasoning chains because physical laws remain constant regardless of reasoning trajectories. MLLM shows excellent visual understanding by integrating visual and textual data across various tasks, motivating exploration of its reasoning abilities. However, uncertainty remains regarding whether these models possess genuine advanced reasoning capabilities for visual tasks, particularly in physical domains closer to real-world scenarios. Several LLM benchmarks have emerged to evaluate reasoning abilities, with PHYBench being most relevant for physics reasoning. MLLM scientific benchmarks, such as PhysReason and EMMA, contain multimodal physics problems with figures, however, they include only small physics subsets, which inadequately evaluate MLLMs’ capabilities for reasoning and solving advanced physics problems. Researchers from the University of Hong Kong, the University of Michigan, the University of Toronto, the University of Waterloo, and the Ohio State University have proposed PHYX, a novel benchmark to evaluate the physical reasoning capabilities of foundation models. It comprises 3,000 visually-grounded physics questions, precisely curated across six distinct physics domains: Mechanics, Electromagnetism, Thermodynamics, Wave/Acoustics, Optics, and Modern Physics. It evaluates physics-based reasoning via multimodal problem-solving with three core innovations: (a) 3,000 newly collected questions with realistic physical scenarios requiring integrated visual analysis and causal reasoning, (b) Expert-validated data design covering six fundamental physics domains, and (c) Strict unified three-step evaluation protocols. Researchers designed a four-stage data collection process to ensure high-quality data. The process begins with an in-depth survey of core physics disciplines to determine coverage across diverse domains and subfields, followed by the recruitment of STEM graduate students as expert annotators. They comply with copyright restrictions and avoid data contamination by selecting questions without answers that are immediately available. Moreover, quality control involves a three-stage cleaning process including duplicate detection through lexical overlap analysis with manual review by physics Ph.D. students, followed by filtering the shortest 10% of questions based on textual length, resulting in 3,000 high-quality questions from an initial collection of 3,300. PHYX presents significant challenges for current models, with even the worst-performing human experts achieving 75.6% accuracy, outperforming all evaluated models and showing a gap between human expertise and current model capabilities. The benchmark reveals that multiple-choice formats narrow performance gaps by allowing weaker models to rely on surface-level cues, but open-ended questions demand genuine reasoning and precise answer generation. Comparing GPT-4o’s performance on PHYX to previously reported results on MathVista and MATH-V (both 63.8%), lower accuracy in physical reasoning tasks emphasizes that physical reasoning requires deeper integration of abstract concepts and real-world knowledge, presenting greater challenges than purely mathematical contexts. In conclusion, researchers introduced PHYX, the first large-scale benchmark for evaluating physical reasoning in multimodal, visually grounded scenarios. Rigorous evaluation reveals that state-of-the-art models show limitations in physical reasoning, relying predominantly on memorized knowledge, mathematical formulas, and superficial visual patterns rather than genuine understanding of physical principles. The benchmark focuses exclusively on English-language prompts and annotations, limiting assessment of multilingual reasoning abilities. Also, while images depict physically realistic scenarios, they are often schematic or textbook-style rather than real-world photographs, which may not fully capture the complexity of perception in natural environments. Check out the Paper, Code and Project Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. Sajjad AnsariSajjad Ansari is a final year undergraduate from IIT Kharagpur. As a Tech enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.Sajjad Ansarihttps://www.marktechpost.com/author/sajjadansari/Meta AI Introduces Multi-SpatialMLLM: A Multi-Frame Spatial Understanding with Multi-modal Large Language ModelsSajjad Ansarihttps://www.marktechpost.com/author/sajjadansari/Can LLMs Really Judge with Reasoning? Microsoft and Tsinghua Researchers Introduce Reward Reasoning Models to Dynamically Scale Test-Time Compute for Better AlignmentSajjad Ansarihttps://www.marktechpost.com/author/sajjadansari/NVIDIA AI Introduces AceReason-Nemotron for Advancing Math and Code Reasoning through Reinforcement LearningSajjad Ansarihttps://www.marktechpost.com/author/sajjadansari/Researchers Introduce MMLONGBENCH: A Comprehensive Benchmark for Long-Context Vision-Language Models
    14 Commentarii 0 Distribuiri
  • NVIDIA’s Bartley Richardson on How Teams of AI Agents Provide Next-Level Automation

    Building effective agentic AI systems requires rethinking how technology interacts and delivers value across organizations.
    Bartley Richardson, senior director of engineering and AI infrastructure at NVIDIA, joined the NVIDIA AI Podcast to discuss how enterprises can successfully deploy agentic AI systems.
    “When I talk with people about agents and agentic AI, what I really want to say is automation,” Richardson said. “It is that next level of automation.”

    Richardson explains that AI reasoning models play a critical role in these systems by “thinking out loud” and enabling better planning capabilities.
    “Reasoning models have been trained and tuned in a very specific way to think — almost like thinking out loud,” Richardson said. “It’s kind of like when you’re brainstorming with your colleagues or family.”
    What makes NVIDIA’s Llama Nemotron models distinctive is that they give users the ability to toggle reasoning on or off within the same model, optimizing for specific tasks.
    Enterprise IT leaders must acknowledge the multi-vendor reality of modern environments,  Richardson explained, saying organizations will have agent systems from various sources working together simultaneously.
    “You’re going to have all these agents working together, and the trick is discovering how to let them all mesh together in a somewhat seamless way for your employees,” Richardson said.
    To address this challenge, NVIDIA developed the AI-Q Blueprint for developing advanced agentic AI systems. Teams can build AI agents to automate complex tasks, break down operational silos and drive efficiency across industries. The blueprint uses the open-source NVIDIA Agent Intelligencetoolkit to evaluate and profile agent workflows, making it easier to optimize and ensure interoperability among agents, tools and data sources.
    “We have customers that optimize their tool-calling chains and get 15x speedups through their pipeline using AI-Q,” Richardson said.
    He also emphasized the importance of maintaining realistic expectations that still provide significant business value.
    “Agentic systems will make mistakes,” Richardson added. “But if it gets you 60%, 70%, 80% of the way there, that’s amazing.”
    Time Stamps
    1:15 – Defining agentic AI as the next evolution of enterprise automation.
    4:06 – How reasoning models enhance agentic system capabilities.
    12:41 – Enterprise considerations for implementing multi-vendor agent systems.
    19:33 – Introduction to the NVIDIA Agent Intelligence toolkit for observability and traceability.
    You Might Also Like… 
    NVIDIA’s Rama Akkiraju on How AI Platform Architects Help Bridge Business Vision and Technical Execution
    Enterprises are exploring AI to rethink problem-solving and business processes. These initiatives require the right infrastructure, such as AI factories, which allow businesses to convert data into tokens and outcomes. Rama Akkiraju, vice president of IT for AI and machine learning at NVIDIA, joined the AI Podcast to discuss how enterprises can build the right foundations for AI success, and the critical role of AI platform architects in designing and building AI infrastructure based on specific business needs.
    Roboflow Helps Unlock Computer Vision for Every Kind of AI Builder
    Roboflow’s mission is to make the world programmable through computer vision. By simplifying computer vision development, the company helps bridge the gap between AI and people looking to harness it. Cofounder and CEO Joseph Nelson discusses how Roboflow empowers users in manufacturing, healthcare and automotive to solve complex problems with visual AI.
    NVIDIA’s Jacob Liberman on Bringing Agentic AI to Enterprises
    Agentic AI enables developers to create intelligent multi-agent systems that reason, act and execute complex tasks with a degree of autonomy. Jacob Liberman, director of product management at NVIDIA, explains how agentic AI bridges the gap between powerful AI models and practical enterprise applications.
    #nvidias #bartley #richardson #how #teams
    NVIDIA’s Bartley Richardson on How Teams of AI Agents Provide Next-Level Automation
    Building effective agentic AI systems requires rethinking how technology interacts and delivers value across organizations. Bartley Richardson, senior director of engineering and AI infrastructure at NVIDIA, joined the NVIDIA AI Podcast to discuss how enterprises can successfully deploy agentic AI systems. “When I talk with people about agents and agentic AI, what I really want to say is automation,” Richardson said. “It is that next level of automation.” Richardson explains that AI reasoning models play a critical role in these systems by “thinking out loud” and enabling better planning capabilities. “Reasoning models have been trained and tuned in a very specific way to think — almost like thinking out loud,” Richardson said. “It’s kind of like when you’re brainstorming with your colleagues or family.” What makes NVIDIA’s Llama Nemotron models distinctive is that they give users the ability to toggle reasoning on or off within the same model, optimizing for specific tasks. Enterprise IT leaders must acknowledge the multi-vendor reality of modern environments,  Richardson explained, saying organizations will have agent systems from various sources working together simultaneously. “You’re going to have all these agents working together, and the trick is discovering how to let them all mesh together in a somewhat seamless way for your employees,” Richardson said. To address this challenge, NVIDIA developed the AI-Q Blueprint for developing advanced agentic AI systems. Teams can build AI agents to automate complex tasks, break down operational silos and drive efficiency across industries. The blueprint uses the open-source NVIDIA Agent Intelligencetoolkit to evaluate and profile agent workflows, making it easier to optimize and ensure interoperability among agents, tools and data sources. “We have customers that optimize their tool-calling chains and get 15x speedups through their pipeline using AI-Q,” Richardson said. He also emphasized the importance of maintaining realistic expectations that still provide significant business value. “Agentic systems will make mistakes,” Richardson added. “But if it gets you 60%, 70%, 80% of the way there, that’s amazing.” Time Stamps 1:15 – Defining agentic AI as the next evolution of enterprise automation. 4:06 – How reasoning models enhance agentic system capabilities. 12:41 – Enterprise considerations for implementing multi-vendor agent systems. 19:33 – Introduction to the NVIDIA Agent Intelligence toolkit for observability and traceability. You Might Also Like…  NVIDIA’s Rama Akkiraju on How AI Platform Architects Help Bridge Business Vision and Technical Execution Enterprises are exploring AI to rethink problem-solving and business processes. These initiatives require the right infrastructure, such as AI factories, which allow businesses to convert data into tokens and outcomes. Rama Akkiraju, vice president of IT for AI and machine learning at NVIDIA, joined the AI Podcast to discuss how enterprises can build the right foundations for AI success, and the critical role of AI platform architects in designing and building AI infrastructure based on specific business needs. Roboflow Helps Unlock Computer Vision for Every Kind of AI Builder Roboflow’s mission is to make the world programmable through computer vision. By simplifying computer vision development, the company helps bridge the gap between AI and people looking to harness it. Cofounder and CEO Joseph Nelson discusses how Roboflow empowers users in manufacturing, healthcare and automotive to solve complex problems with visual AI. NVIDIA’s Jacob Liberman on Bringing Agentic AI to Enterprises Agentic AI enables developers to create intelligent multi-agent systems that reason, act and execute complex tasks with a degree of autonomy. Jacob Liberman, director of product management at NVIDIA, explains how agentic AI bridges the gap between powerful AI models and practical enterprise applications. #nvidias #bartley #richardson #how #teams
    BLOGS.NVIDIA.COM
    NVIDIA’s Bartley Richardson on How Teams of AI Agents Provide Next-Level Automation
    Building effective agentic AI systems requires rethinking how technology interacts and delivers value across organizations. Bartley Richardson, senior director of engineering and AI infrastructure at NVIDIA, joined the NVIDIA AI Podcast to discuss how enterprises can successfully deploy agentic AI systems. “When I talk with people about agents and agentic AI, what I really want to say is automation,” Richardson said. “It is that next level of automation.” Richardson explains that AI reasoning models play a critical role in these systems by “thinking out loud” and enabling better planning capabilities. “Reasoning models have been trained and tuned in a very specific way to think — almost like thinking out loud,” Richardson said. “It’s kind of like when you’re brainstorming with your colleagues or family.” What makes NVIDIA’s Llama Nemotron models distinctive is that they give users the ability to toggle reasoning on or off within the same model, optimizing for specific tasks. Enterprise IT leaders must acknowledge the multi-vendor reality of modern environments,  Richardson explained, saying organizations will have agent systems from various sources working together simultaneously. “You’re going to have all these agents working together, and the trick is discovering how to let them all mesh together in a somewhat seamless way for your employees,” Richardson said. To address this challenge, NVIDIA developed the AI-Q Blueprint for developing advanced agentic AI systems. Teams can build AI agents to automate complex tasks, break down operational silos and drive efficiency across industries. The blueprint uses the open-source NVIDIA Agent Intelligence (AIQ) toolkit to evaluate and profile agent workflows, making it easier to optimize and ensure interoperability among agents, tools and data sources. “We have customers that optimize their tool-calling chains and get 15x speedups through their pipeline using AI-Q,” Richardson said. He also emphasized the importance of maintaining realistic expectations that still provide significant business value. “Agentic systems will make mistakes,” Richardson added. “But if it gets you 60%, 70%, 80% of the way there, that’s amazing.” Time Stamps 1:15 – Defining agentic AI as the next evolution of enterprise automation. 4:06 – How reasoning models enhance agentic system capabilities. 12:41 – Enterprise considerations for implementing multi-vendor agent systems. 19:33 – Introduction to the NVIDIA Agent Intelligence toolkit for observability and traceability. You Might Also Like…  NVIDIA’s Rama Akkiraju on How AI Platform Architects Help Bridge Business Vision and Technical Execution Enterprises are exploring AI to rethink problem-solving and business processes. These initiatives require the right infrastructure, such as AI factories, which allow businesses to convert data into tokens and outcomes. Rama Akkiraju, vice president of IT for AI and machine learning at NVIDIA, joined the AI Podcast to discuss how enterprises can build the right foundations for AI success, and the critical role of AI platform architects in designing and building AI infrastructure based on specific business needs. Roboflow Helps Unlock Computer Vision for Every Kind of AI Builder Roboflow’s mission is to make the world programmable through computer vision. By simplifying computer vision development, the company helps bridge the gap between AI and people looking to harness it. Cofounder and CEO Joseph Nelson discusses how Roboflow empowers users in manufacturing, healthcare and automotive to solve complex problems with visual AI. NVIDIA’s Jacob Liberman on Bringing Agentic AI to Enterprises Agentic AI enables developers to create intelligent multi-agent systems that reason, act and execute complex tasks with a degree of autonomy. Jacob Liberman, director of product management at NVIDIA, explains how agentic AI bridges the gap between powerful AI models and practical enterprise applications.
    0 Commentarii 0 Distribuiri
  • Qwen Researchers Proposes QwenLong-L1: A Reinforcement Learning Framework for Long-Context Reasoning in Large Language Models

    While large reasoning modelshave shown impressive capabilities in short-context reasoning through reinforcement learning, these gains do not generalize well to long-context scenarios. Applications such as multi-document QA, research synthesis, and legal or financial analysis require models to process and reason over sequences exceeding 100K tokens. However, RL optimization in such regimes is plagued by slower reward convergence, unstable policy updates due to KL divergence fluctuations, and reduced exploration resulting from entropy collapse. These bottlenecks reveal a fundamental gap in transitioning LRMs from short-context proficiency to long-context generalization.
    QwenLong-L1: A Structured RL Framework for Long-Context Adaptation
    To address these limitations, the Qwen Research team introduces QwenLong-L1, a novel RL framework designed to adapt LRMs to long-context reasoning tasks. The framework is structured into three key stages:

    Warm-up Supervised Fine-Tuning: Provides a stable initialization for the policy model by training on curated question-context-answer triplets, ensuring basic competence in contextual comprehension and answer extraction.
    Curriculum-Guided Phased Reinforcement Learning: Introduces a staged training process with gradually increasing context lengths. This progression enables the model to incrementally acquire long-context reasoning behaviors without destabilizing policy updates.
    Difficulty-Aware Retrospective Sampling: Enhances exploration by maintaining and reusing hard examples from previous phases, weighted by their difficulty, to encourage deeper reasoning and robustness across diverse inputs.

    These stages are complemented by hybrid reward mechanisms—combining rule-based exact match verification with semantic evaluation by a lightweight LLM—ensuring both precision and recall during policy training.

    Technical Design and Methodological Advantages
    QwenLong-L1 integrates recent advances in group-relative RL optimization, specifically GRPO and DAPO, to mitigate the computational overhead associated with long-context value estimation:

    GRPO estimates advantage by normalizing rewards within sampled groups, eliminating the need for a separate value network and encouraging diverse generation patterns.
    DAPO incorporates mechanisms such as dynamic sampling, overlength penalty shaping, and asymmetric clipping thresholds to prevent entropy collapse and mitigate length biases during training.

    The reward function is defined as the maximum of two signals: a deterministic rule-based match and a semantic judgment from a compact evaluator model. This hybrid approach avoids overfitting to rigid formats while maintaining answer correctness across varied notations and phrasings.
    Moreover, the framework is optimized via progressive context scaling, where the RL process transitions from 20K-token to 60K-token input lengths in controlled phases, stabilizing training dynamics and facilitating policy generalization.
    Experimental Results and Benchmark Performance
    QwenLong-L1 was evaluated on seven long-context document QA benchmarks, including DocMath, Frames, 2WikiMultihopQA, HotpotQA, Musique, NarrativeQA, and Qasper. The 32B variant, QwenLong-L1-32B, demonstrated strong empirical performance:

    It outperformed baseline models such as R1-Distill-Qwen-32B by 5.1 points and exceeded leading proprietary systems like OpenAI-o3-mini and Qwen3-235B-A22B.
    Its performance was comparable to Claude-3.7-Sonnet-Thinking, indicating competitive reasoning capabilities under extreme context lengths.
    Pass@K analysis revealed consistent improvements with increased sampling, achieving a Pass@2 average of 73.7, surpassing DeepSeek-R1 and OpenAI-o1-preview, even at low sampling rates.

    Ablation studies further validated the individual contributions of SFT, phased RL, and retrospective sampling. Notably, RL played a decisive role in enabling emergent reasoning behaviors such as grounding, subgoal setting, verification, and backtracking—traits not effectively induced by supervised fine-tuning alone.
    Conclusion
    QwenLong-L1 represents a systematic approach to equipping LRMs with robust long-context reasoning capabilities through reinforcement learning. Its design effectively bridges the gap between short-context expertise and the demands of information-dense environments by combining supervised initialization, curriculum-driven context scaling, and hybrid evaluation strategies. The framework not only achieves state-of-the-art results across long-context benchmarks but also demonstrates the emergence of interpretable reasoning patterns during training.

    Check out the Paper, Model on Hugging Face and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter.
    Asif RazzaqWebsite |  + postsBioAsif Razzaq is the CEO of Marktechpost Media Inc.. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts of over 2 million monthly views, illustrating its popularity among audiences.Asif Razzaqhttps://www.marktechpost.com/author/6flvq/NVIDIA Releases Llama Nemotron Nano 4B: An Efficient Open Reasoning Model Optimized for Edge AI and Scientific TasksAsif Razzaqhttps://www.marktechpost.com/author/6flvq/A Coding Implementation to Build an AI Agent with Live Python Execution and Automated ValidationAsif Razzaqhttps://www.marktechpost.com/author/6flvq/Step-by-Step Guide to Build a Customizable Multi-Tool AI Agent with LangGraph and Claude for Dynamic Agent CreationAsif Razzaqhttps://www.marktechpost.com/author/6flvq/A Comprehensive Coding Guide to Crafting Advanced Round-Robin Multi-Agent Workflows with Microsoft AutoGen
    #qwen #researchers #proposes #qwenlongl1 #reinforcement
    Qwen Researchers Proposes QwenLong-L1: A Reinforcement Learning Framework for Long-Context Reasoning in Large Language Models
    While large reasoning modelshave shown impressive capabilities in short-context reasoning through reinforcement learning, these gains do not generalize well to long-context scenarios. Applications such as multi-document QA, research synthesis, and legal or financial analysis require models to process and reason over sequences exceeding 100K tokens. However, RL optimization in such regimes is plagued by slower reward convergence, unstable policy updates due to KL divergence fluctuations, and reduced exploration resulting from entropy collapse. These bottlenecks reveal a fundamental gap in transitioning LRMs from short-context proficiency to long-context generalization. QwenLong-L1: A Structured RL Framework for Long-Context Adaptation To address these limitations, the Qwen Research team introduces QwenLong-L1, a novel RL framework designed to adapt LRMs to long-context reasoning tasks. The framework is structured into three key stages: Warm-up Supervised Fine-Tuning: Provides a stable initialization for the policy model by training on curated question-context-answer triplets, ensuring basic competence in contextual comprehension and answer extraction. Curriculum-Guided Phased Reinforcement Learning: Introduces a staged training process with gradually increasing context lengths. This progression enables the model to incrementally acquire long-context reasoning behaviors without destabilizing policy updates. Difficulty-Aware Retrospective Sampling: Enhances exploration by maintaining and reusing hard examples from previous phases, weighted by their difficulty, to encourage deeper reasoning and robustness across diverse inputs. These stages are complemented by hybrid reward mechanisms—combining rule-based exact match verification with semantic evaluation by a lightweight LLM—ensuring both precision and recall during policy training. Technical Design and Methodological Advantages QwenLong-L1 integrates recent advances in group-relative RL optimization, specifically GRPO and DAPO, to mitigate the computational overhead associated with long-context value estimation: GRPO estimates advantage by normalizing rewards within sampled groups, eliminating the need for a separate value network and encouraging diverse generation patterns. DAPO incorporates mechanisms such as dynamic sampling, overlength penalty shaping, and asymmetric clipping thresholds to prevent entropy collapse and mitigate length biases during training. The reward function is defined as the maximum of two signals: a deterministic rule-based match and a semantic judgment from a compact evaluator model. This hybrid approach avoids overfitting to rigid formats while maintaining answer correctness across varied notations and phrasings. Moreover, the framework is optimized via progressive context scaling, where the RL process transitions from 20K-token to 60K-token input lengths in controlled phases, stabilizing training dynamics and facilitating policy generalization. Experimental Results and Benchmark Performance QwenLong-L1 was evaluated on seven long-context document QA benchmarks, including DocMath, Frames, 2WikiMultihopQA, HotpotQA, Musique, NarrativeQA, and Qasper. The 32B variant, QwenLong-L1-32B, demonstrated strong empirical performance: It outperformed baseline models such as R1-Distill-Qwen-32B by 5.1 points and exceeded leading proprietary systems like OpenAI-o3-mini and Qwen3-235B-A22B. Its performance was comparable to Claude-3.7-Sonnet-Thinking, indicating competitive reasoning capabilities under extreme context lengths. Pass@K analysis revealed consistent improvements with increased sampling, achieving a Pass@2 average of 73.7, surpassing DeepSeek-R1 and OpenAI-o1-preview, even at low sampling rates. Ablation studies further validated the individual contributions of SFT, phased RL, and retrospective sampling. Notably, RL played a decisive role in enabling emergent reasoning behaviors such as grounding, subgoal setting, verification, and backtracking—traits not effectively induced by supervised fine-tuning alone. Conclusion QwenLong-L1 represents a systematic approach to equipping LRMs with robust long-context reasoning capabilities through reinforcement learning. Its design effectively bridges the gap between short-context expertise and the demands of information-dense environments by combining supervised initialization, curriculum-driven context scaling, and hybrid evaluation strategies. The framework not only achieves state-of-the-art results across long-context benchmarks but also demonstrates the emergence of interpretable reasoning patterns during training. Check out the Paper, Model on Hugging Face and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. Asif RazzaqWebsite |  + postsBioAsif Razzaq is the CEO of Marktechpost Media Inc.. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts of over 2 million monthly views, illustrating its popularity among audiences.Asif Razzaqhttps://www.marktechpost.com/author/6flvq/NVIDIA Releases Llama Nemotron Nano 4B: An Efficient Open Reasoning Model Optimized for Edge AI and Scientific TasksAsif Razzaqhttps://www.marktechpost.com/author/6flvq/A Coding Implementation to Build an AI Agent with Live Python Execution and Automated ValidationAsif Razzaqhttps://www.marktechpost.com/author/6flvq/Step-by-Step Guide to Build a Customizable Multi-Tool AI Agent with LangGraph and Claude for Dynamic Agent CreationAsif Razzaqhttps://www.marktechpost.com/author/6flvq/A Comprehensive Coding Guide to Crafting Advanced Round-Robin Multi-Agent Workflows with Microsoft AutoGen #qwen #researchers #proposes #qwenlongl1 #reinforcement
    WWW.MARKTECHPOST.COM
    Qwen Researchers Proposes QwenLong-L1: A Reinforcement Learning Framework for Long-Context Reasoning in Large Language Models
    While large reasoning models (LRMs) have shown impressive capabilities in short-context reasoning through reinforcement learning (RL), these gains do not generalize well to long-context scenarios. Applications such as multi-document QA, research synthesis, and legal or financial analysis require models to process and reason over sequences exceeding 100K tokens. However, RL optimization in such regimes is plagued by slower reward convergence, unstable policy updates due to KL divergence fluctuations, and reduced exploration resulting from entropy collapse. These bottlenecks reveal a fundamental gap in transitioning LRMs from short-context proficiency to long-context generalization. QwenLong-L1: A Structured RL Framework for Long-Context Adaptation To address these limitations, the Qwen Research team introduces QwenLong-L1, a novel RL framework designed to adapt LRMs to long-context reasoning tasks. The framework is structured into three key stages: Warm-up Supervised Fine-Tuning (SFT): Provides a stable initialization for the policy model by training on curated question-context-answer triplets, ensuring basic competence in contextual comprehension and answer extraction. Curriculum-Guided Phased Reinforcement Learning: Introduces a staged training process with gradually increasing context lengths. This progression enables the model to incrementally acquire long-context reasoning behaviors without destabilizing policy updates. Difficulty-Aware Retrospective Sampling: Enhances exploration by maintaining and reusing hard examples from previous phases, weighted by their difficulty, to encourage deeper reasoning and robustness across diverse inputs. These stages are complemented by hybrid reward mechanisms—combining rule-based exact match verification with semantic evaluation by a lightweight LLM—ensuring both precision and recall during policy training. Technical Design and Methodological Advantages QwenLong-L1 integrates recent advances in group-relative RL optimization, specifically GRPO and DAPO, to mitigate the computational overhead associated with long-context value estimation: GRPO estimates advantage by normalizing rewards within sampled groups, eliminating the need for a separate value network and encouraging diverse generation patterns. DAPO incorporates mechanisms such as dynamic sampling, overlength penalty shaping, and asymmetric clipping thresholds to prevent entropy collapse and mitigate length biases during training. The reward function is defined as the maximum of two signals: a deterministic rule-based match and a semantic judgment from a compact evaluator model (e.g., Qwen2.5-1.5B). This hybrid approach avoids overfitting to rigid formats while maintaining answer correctness across varied notations and phrasings. Moreover, the framework is optimized via progressive context scaling, where the RL process transitions from 20K-token to 60K-token input lengths in controlled phases, stabilizing training dynamics and facilitating policy generalization. Experimental Results and Benchmark Performance QwenLong-L1 was evaluated on seven long-context document QA benchmarks, including DocMath, Frames, 2WikiMultihopQA, HotpotQA, Musique, NarrativeQA, and Qasper. The 32B variant, QwenLong-L1-32B, demonstrated strong empirical performance: It outperformed baseline models such as R1-Distill-Qwen-32B by 5.1 points and exceeded leading proprietary systems like OpenAI-o3-mini and Qwen3-235B-A22B. Its performance was comparable to Claude-3.7-Sonnet-Thinking, indicating competitive reasoning capabilities under extreme context lengths. Pass@K analysis revealed consistent improvements with increased sampling, achieving a Pass@2 average of 73.7, surpassing DeepSeek-R1 and OpenAI-o1-preview, even at low sampling rates. Ablation studies further validated the individual contributions of SFT, phased RL, and retrospective sampling. Notably, RL played a decisive role in enabling emergent reasoning behaviors such as grounding, subgoal setting, verification, and backtracking—traits not effectively induced by supervised fine-tuning alone. Conclusion QwenLong-L1 represents a systematic approach to equipping LRMs with robust long-context reasoning capabilities through reinforcement learning. Its design effectively bridges the gap between short-context expertise and the demands of information-dense environments by combining supervised initialization, curriculum-driven context scaling, and hybrid evaluation strategies. The framework not only achieves state-of-the-art results across long-context benchmarks but also demonstrates the emergence of interpretable reasoning patterns during training. Check out the Paper, Model on Hugging Face and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. Asif RazzaqWebsite |  + postsBioAsif Razzaq is the CEO of Marktechpost Media Inc.. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts of over 2 million monthly views, illustrating its popularity among audiences.Asif Razzaqhttps://www.marktechpost.com/author/6flvq/NVIDIA Releases Llama Nemotron Nano 4B: An Efficient Open Reasoning Model Optimized for Edge AI and Scientific TasksAsif Razzaqhttps://www.marktechpost.com/author/6flvq/A Coding Implementation to Build an AI Agent with Live Python Execution and Automated ValidationAsif Razzaqhttps://www.marktechpost.com/author/6flvq/Step-by-Step Guide to Build a Customizable Multi-Tool AI Agent with LangGraph and Claude for Dynamic Agent CreationAsif Razzaqhttps://www.marktechpost.com/author/6flvq/A Comprehensive Coding Guide to Crafting Advanced Round-Robin Multi-Agent Workflows with Microsoft AutoGen
    4 Commentarii 0 Distribuiri
  • NVIDIA Releases Llama Nemotron Nano 4B: An Efficient Open Reasoning Model Optimized for Edge AI and Scientific Tasks

    NVIDIA has released Llama Nemotron Nano 4B, an open-source reasoning model designed to deliver strong performance and efficiency across scientific tasks, programming, symbolic math, function calling, and instruction following—while being compact enough for edge deployment. With just 4 billion parameters, it achieves higher accuracy and up to 50% greater throughput than comparable open models with up to 8 billion parameters, according to internal benchmarks.
    The model is positioned as a practical foundation for deploying language-based AI agents in resource-constrained environments. By focusing on inference efficiency, Llama Nemotron Nano 4B addresses a growing demand for compact models capable of supporting hybrid reasoning and instruction-following tasks outside traditional cloud settings.
    Model Architecture and Training Stack
    Nemotron Nano 4B builds upon the Llama 3.1 architecture and shares lineage with NVIDIA’s earlier “Minitron” family. The architecture follows a dense, decoder-only transformer design. The model has been optimized for performance in reasoning-intensive workloads while maintaining a lightweight parameter count.
    The post-training stack for the model includes multi-stage supervised fine-tuning on curated datasets for mathematics, coding, reasoning tasks, and function calling. In addition to traditional supervised learning, Nemotron Nano 4B has undergone reinforcement learning optimization using Reward-aware Preference Optimization, a method intended to enhance the model’s utility in chat-based and instruction-following environments.
    This combination of instruction tuning and reward modeling helps align the model’s outputs more closely with user intent, particularly in multi-turn reasoning scenarios. The training approach reflects NVIDIA’s emphasis on aligning smaller models to practical usage tasks that traditionally require significantly larger parameter sizes.

    Performance Benchmarks
    Despite its compact footprint, Nemotron Nano 4B exhibits robust performance in both single-turn and multi-turn reasoning tasks. According to NVIDIA, it provides 50% higher inference throughput compared to similar open-weight models within the 8B parameter range. The model supports a context window of up to 128,000 tokens, which is particularly useful for tasks involving long documents, nested function calls, or multi-hop reasoning chains.
    While NVIDIA has not disclosed full benchmark tables in the Hugging Face documentation, the model reportedly outperforms other open alternatives in benchmarks across math, code generation, and function calling precision. Its throughput advantage suggests it can serve as a viable default for developers targeting efficient inference pipelines with moderately complex workloads.
    Edge-Ready Deployment
    One of the core differentiators of Nemotron Nano 4B is its focus on edge deployment. The model has been explicitly tested and optimized to run efficiently on NVIDIA Jetson platforms and NVIDIA RTX GPUs. This enables real-time reasoning capabilities on low-power embedded devices, including robotics systems, autonomous edge agents, or local developer workstations.
    For enterprises and research teams concerned with privacy and deployment control, the ability to run advanced reasoning models locally—without relying on cloud inference APIs—can provide both cost savings and greater flexibility.
    Licensing and Access
    The model is released under the NVIDIA Open Model License, which permits commercial usage. It is available through Hugging Face at huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1, with all relevant model weights, configuration files, and tokenizer artifacts openly accessible. The license structure aligns with NVIDIA’s broader strategy of supporting developer ecosystems around its open models.
    Conclusion
    Nemotron Nano 4B represents NVIDIA’s continued investment in bringing scalable, practical AI models to a broader development audience—especially those targeting edge or cost-sensitive deployment scenarios. While the field continues to see rapid progress in ultra-large models, compact and efficient models like Nemotron Nano 4B provide a counterbalance, enabling deployment flexibility without compromising too heavily on performance.

    Check out the Model on Hugging Face. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter.
    Asif RazzaqWebsite |  + postsBioAsif Razzaq is the CEO of Marktechpost Media Inc.. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts of over 2 million monthly views, illustrating its popularity among audiences.Asif Razzaqhttps://www.marktechpost.com/author/6flvq/A Coding Implementation to Build an AI Agent with Live Python Execution and Automated ValidationAsif Razzaqhttps://www.marktechpost.com/author/6flvq/Step-by-Step Guide to Build a Customizable Multi-Tool AI Agent with LangGraph and Claude for Dynamic Agent CreationAsif Razzaqhttps://www.marktechpost.com/author/6flvq/A Comprehensive Coding Guide to Crafting Advanced Round-Robin Multi-Agent Workflows with Microsoft AutoGenAsif Razzaqhttps://www.marktechpost.com/author/6flvq/Microsoft AI Introduces Magentic-UI: An Open-Source Agent Prototype that Works with People to Complete Complex Tasks that Require Multi-Step Planning and Browser Use
    #nvidia #releases #llama #nemotron #nano
    NVIDIA Releases Llama Nemotron Nano 4B: An Efficient Open Reasoning Model Optimized for Edge AI and Scientific Tasks
    NVIDIA has released Llama Nemotron Nano 4B, an open-source reasoning model designed to deliver strong performance and efficiency across scientific tasks, programming, symbolic math, function calling, and instruction following—while being compact enough for edge deployment. With just 4 billion parameters, it achieves higher accuracy and up to 50% greater throughput than comparable open models with up to 8 billion parameters, according to internal benchmarks. The model is positioned as a practical foundation for deploying language-based AI agents in resource-constrained environments. By focusing on inference efficiency, Llama Nemotron Nano 4B addresses a growing demand for compact models capable of supporting hybrid reasoning and instruction-following tasks outside traditional cloud settings. Model Architecture and Training Stack Nemotron Nano 4B builds upon the Llama 3.1 architecture and shares lineage with NVIDIA’s earlier “Minitron” family. The architecture follows a dense, decoder-only transformer design. The model has been optimized for performance in reasoning-intensive workloads while maintaining a lightweight parameter count. The post-training stack for the model includes multi-stage supervised fine-tuning on curated datasets for mathematics, coding, reasoning tasks, and function calling. In addition to traditional supervised learning, Nemotron Nano 4B has undergone reinforcement learning optimization using Reward-aware Preference Optimization, a method intended to enhance the model’s utility in chat-based and instruction-following environments. This combination of instruction tuning and reward modeling helps align the model’s outputs more closely with user intent, particularly in multi-turn reasoning scenarios. The training approach reflects NVIDIA’s emphasis on aligning smaller models to practical usage tasks that traditionally require significantly larger parameter sizes. Performance Benchmarks Despite its compact footprint, Nemotron Nano 4B exhibits robust performance in both single-turn and multi-turn reasoning tasks. According to NVIDIA, it provides 50% higher inference throughput compared to similar open-weight models within the 8B parameter range. The model supports a context window of up to 128,000 tokens, which is particularly useful for tasks involving long documents, nested function calls, or multi-hop reasoning chains. While NVIDIA has not disclosed full benchmark tables in the Hugging Face documentation, the model reportedly outperforms other open alternatives in benchmarks across math, code generation, and function calling precision. Its throughput advantage suggests it can serve as a viable default for developers targeting efficient inference pipelines with moderately complex workloads. Edge-Ready Deployment One of the core differentiators of Nemotron Nano 4B is its focus on edge deployment. The model has been explicitly tested and optimized to run efficiently on NVIDIA Jetson platforms and NVIDIA RTX GPUs. This enables real-time reasoning capabilities on low-power embedded devices, including robotics systems, autonomous edge agents, or local developer workstations. For enterprises and research teams concerned with privacy and deployment control, the ability to run advanced reasoning models locally—without relying on cloud inference APIs—can provide both cost savings and greater flexibility. Licensing and Access The model is released under the NVIDIA Open Model License, which permits commercial usage. It is available through Hugging Face at huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1, with all relevant model weights, configuration files, and tokenizer artifacts openly accessible. The license structure aligns with NVIDIA’s broader strategy of supporting developer ecosystems around its open models. Conclusion Nemotron Nano 4B represents NVIDIA’s continued investment in bringing scalable, practical AI models to a broader development audience—especially those targeting edge or cost-sensitive deployment scenarios. While the field continues to see rapid progress in ultra-large models, compact and efficient models like Nemotron Nano 4B provide a counterbalance, enabling deployment flexibility without compromising too heavily on performance. Check out the Model on Hugging Face. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. Asif RazzaqWebsite |  + postsBioAsif Razzaq is the CEO of Marktechpost Media Inc.. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts of over 2 million monthly views, illustrating its popularity among audiences.Asif Razzaqhttps://www.marktechpost.com/author/6flvq/A Coding Implementation to Build an AI Agent with Live Python Execution and Automated ValidationAsif Razzaqhttps://www.marktechpost.com/author/6flvq/Step-by-Step Guide to Build a Customizable Multi-Tool AI Agent with LangGraph and Claude for Dynamic Agent CreationAsif Razzaqhttps://www.marktechpost.com/author/6flvq/A Comprehensive Coding Guide to Crafting Advanced Round-Robin Multi-Agent Workflows with Microsoft AutoGenAsif Razzaqhttps://www.marktechpost.com/author/6flvq/Microsoft AI Introduces Magentic-UI: An Open-Source Agent Prototype that Works with People to Complete Complex Tasks that Require Multi-Step Planning and Browser Use #nvidia #releases #llama #nemotron #nano
    WWW.MARKTECHPOST.COM
    NVIDIA Releases Llama Nemotron Nano 4B: An Efficient Open Reasoning Model Optimized for Edge AI and Scientific Tasks
    NVIDIA has released Llama Nemotron Nano 4B, an open-source reasoning model designed to deliver strong performance and efficiency across scientific tasks, programming, symbolic math, function calling, and instruction following—while being compact enough for edge deployment. With just 4 billion parameters, it achieves higher accuracy and up to 50% greater throughput than comparable open models with up to 8 billion parameters, according to internal benchmarks. The model is positioned as a practical foundation for deploying language-based AI agents in resource-constrained environments. By focusing on inference efficiency, Llama Nemotron Nano 4B addresses a growing demand for compact models capable of supporting hybrid reasoning and instruction-following tasks outside traditional cloud settings. Model Architecture and Training Stack Nemotron Nano 4B builds upon the Llama 3.1 architecture and shares lineage with NVIDIA’s earlier “Minitron” family. The architecture follows a dense, decoder-only transformer design. The model has been optimized for performance in reasoning-intensive workloads while maintaining a lightweight parameter count. The post-training stack for the model includes multi-stage supervised fine-tuning on curated datasets for mathematics, coding, reasoning tasks, and function calling. In addition to traditional supervised learning, Nemotron Nano 4B has undergone reinforcement learning optimization using Reward-aware Preference Optimization (RPO), a method intended to enhance the model’s utility in chat-based and instruction-following environments. This combination of instruction tuning and reward modeling helps align the model’s outputs more closely with user intent, particularly in multi-turn reasoning scenarios. The training approach reflects NVIDIA’s emphasis on aligning smaller models to practical usage tasks that traditionally require significantly larger parameter sizes. Performance Benchmarks Despite its compact footprint, Nemotron Nano 4B exhibits robust performance in both single-turn and multi-turn reasoning tasks. According to NVIDIA, it provides 50% higher inference throughput compared to similar open-weight models within the 8B parameter range. The model supports a context window of up to 128,000 tokens, which is particularly useful for tasks involving long documents, nested function calls, or multi-hop reasoning chains. While NVIDIA has not disclosed full benchmark tables in the Hugging Face documentation, the model reportedly outperforms other open alternatives in benchmarks across math, code generation, and function calling precision. Its throughput advantage suggests it can serve as a viable default for developers targeting efficient inference pipelines with moderately complex workloads. Edge-Ready Deployment One of the core differentiators of Nemotron Nano 4B is its focus on edge deployment. The model has been explicitly tested and optimized to run efficiently on NVIDIA Jetson platforms and NVIDIA RTX GPUs. This enables real-time reasoning capabilities on low-power embedded devices, including robotics systems, autonomous edge agents, or local developer workstations. For enterprises and research teams concerned with privacy and deployment control, the ability to run advanced reasoning models locally—without relying on cloud inference APIs—can provide both cost savings and greater flexibility. Licensing and Access The model is released under the NVIDIA Open Model License, which permits commercial usage. It is available through Hugging Face at huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1, with all relevant model weights, configuration files, and tokenizer artifacts openly accessible. The license structure aligns with NVIDIA’s broader strategy of supporting developer ecosystems around its open models. Conclusion Nemotron Nano 4B represents NVIDIA’s continued investment in bringing scalable, practical AI models to a broader development audience—especially those targeting edge or cost-sensitive deployment scenarios. While the field continues to see rapid progress in ultra-large models, compact and efficient models like Nemotron Nano 4B provide a counterbalance, enabling deployment flexibility without compromising too heavily on performance. Check out the Model on Hugging Face. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. Asif RazzaqWebsite |  + postsBioAsif Razzaq is the CEO of Marktechpost Media Inc.. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts of over 2 million monthly views, illustrating its popularity among audiences.Asif Razzaqhttps://www.marktechpost.com/author/6flvq/A Coding Implementation to Build an AI Agent with Live Python Execution and Automated ValidationAsif Razzaqhttps://www.marktechpost.com/author/6flvq/Step-by-Step Guide to Build a Customizable Multi-Tool AI Agent with LangGraph and Claude for Dynamic Agent CreationAsif Razzaqhttps://www.marktechpost.com/author/6flvq/A Comprehensive Coding Guide to Crafting Advanced Round-Robin Multi-Agent Workflows with Microsoft AutoGenAsif Razzaqhttps://www.marktechpost.com/author/6flvq/Microsoft AI Introduces Magentic-UI: An Open-Source Agent Prototype that Works with People to Complete Complex Tasks that Require Multi-Step Planning and Browser Use
    0 Commentarii 0 Distribuiri
  • Enhancing Language Model Generalization: Bridging the Gap Between In-Context Learning and Fine-Tuning

    Language modelshave great capabilities as in-context learners when pretrained on vast internet text corpora, allowing them to generalize effectively from just a few task examples. However, fine-tuning these models for downstream tasks presents significant challenges. While fine-tuning requires hundreds to thousands of examples, the resulting generalization patterns show limitations. For example, models fine-tuned on statements like “B’s mother is A” struggle to answer related questions like “Who is A’s son?” However, the LMs can handle such reverse relations in context. This raises questions about the differences between in-context learning and fine-tuning generalization patterns, and how these differences should inform adaptation strategies for downstream tasks.
    Research into improving LMs’ adaptability has followed several key approaches. In-context learning studies have examined learning and generalization patterns through empirical, mechanistic, and theoretical analyses. Out-of-context learning research explores how models utilize information not explicitly included in prompts. Data augmentation techniques use LLMs to enhance performance from limited datasets, with specific solutions targeting issues like the reversal curse through hardcoded augmentations, deductive closure training, and generating reasoning pathways. Moreover, synthetic data approaches have evolved from early hand-designed data to improve generalization in domains like linguistics or mathematics to more recent methods that generate data directly from language models.
    Researchers from Google DeepMind and Stanford University have constructed several datasets that isolate knowledge from pretraining data to create clean generalization tests. Performance is evaluated across various generalization types by exposing pretrained models to controlled information subsets, both in-context and through fine-tuning. Their findings reveal that in-context learning shows more flexible generalization than fine-tuning in data-matched settings, though there are some exceptions where fine-tuning can generalize to reversals within larger knowledge structures. Building on these insights, researchers have developed a method that enhances fine-tuning generalization by including in-context inferences into the fine-tuning data.
    Researchers employ multiple datasets carefully designed to isolate specific generalization challenges or insert them within broader learning contexts. Evaluation relies on multiple-choice likelihood scoring without providing answer choices in context. The experiments involve fine-tuning Gemini 1.5 Flash using batch sizes of 8 or 16. For in-context evaluation, the researchers combine training documents as context for the instruction-tuned model, randomly subsampling by 8x for larger datasets to minimize interference issues. The key innovation is a dataset augmentation approach using in-context generalization to enhance fine-tuning dataset coverage. This includes local and global strategies, each employing distinct contexts and prompts.
    On the Reversal Curse dataset, in-context learning achieves near-ceiling performance on reversals, while conventional fine-tuning shows near-zero accuracy as models favor incorrect celebrity names seen during training. Fine-tuning with data augmented by in-context inferences matches the high performance of pure in-context learning. Testing on simple nonsense reversals reveals similar patterns, though with less pronounced benefits. For simple syllogisms, while the pretrained model performs at chance level, fine-tuning does produce above-chance generalization for certain syllogism types where logical inferences align with simple linguistic patterns. However, in-context learning outperforms fine-tuning, with augmented fine-tuning showing the best overall results.

    In conclusion, this paper explores generalization differences between in-context learning and fine-tuning when LMs face novel information structures. Results show in-context learning’s superior generalization for certain inference types, prompting the researchers to develop methods that enhance fine-tuning performance by incorporating in-context inferences into training data. Despite promising outcomes, several limitations affect the study. The first one is the dependency on nonsense words and implausible operations. Second, the research focuses on specific LMs, limiting the results’ generality. Future research should investigate learning and generalization differences across various models to expand upon these findings, especially newer reasoning models.

    Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter.
    Sajjad AnsariSajjad Ansari is a final year undergraduate from IIT Kharagpur. As a Tech enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.Sajjad Ansarihttps://www.marktechpost.com/author/sajjadansari/Salesforce AI Researchers Introduce UAEval4RAG: A New Benchmark to Evaluate RAG Systems’ Ability to Reject Unanswerable QueriesSajjad Ansarihttps://www.marktechpost.com/author/sajjadansari/Google Researchers Introduce LightLab: A Diffusion-Based AI Method for Physically Plausible, Fine-Grained Light Control in Single ImagesSajjad Ansarihttps://www.marktechpost.com/author/sajjadansari/DanceGRPO: A Unified Framework for Reinforcement Learning in Visual Generation Across Multiple Paradigms and TasksSajjad Ansarihttps://www.marktechpost.com/author/sajjadansari/Reinforcement Learning, Not Fine-Tuning: Nemotron-Tool-N1 Trains LLMs to Use Tools with Minimal Supervision and Maximum Generalization

    Build GenAI you can trust. ⭐️ Parlant is your open-source engine for controlled, compliant, and purposeful AI conversations — Star Parlant on GitHub!
    #enhancing #language #model #generalization #bridging
    Enhancing Language Model Generalization: Bridging the Gap Between In-Context Learning and Fine-Tuning
    Language modelshave great capabilities as in-context learners when pretrained on vast internet text corpora, allowing them to generalize effectively from just a few task examples. However, fine-tuning these models for downstream tasks presents significant challenges. While fine-tuning requires hundreds to thousands of examples, the resulting generalization patterns show limitations. For example, models fine-tuned on statements like “B’s mother is A” struggle to answer related questions like “Who is A’s son?” However, the LMs can handle such reverse relations in context. This raises questions about the differences between in-context learning and fine-tuning generalization patterns, and how these differences should inform adaptation strategies for downstream tasks. Research into improving LMs’ adaptability has followed several key approaches. In-context learning studies have examined learning and generalization patterns through empirical, mechanistic, and theoretical analyses. Out-of-context learning research explores how models utilize information not explicitly included in prompts. Data augmentation techniques use LLMs to enhance performance from limited datasets, with specific solutions targeting issues like the reversal curse through hardcoded augmentations, deductive closure training, and generating reasoning pathways. Moreover, synthetic data approaches have evolved from early hand-designed data to improve generalization in domains like linguistics or mathematics to more recent methods that generate data directly from language models. Researchers from Google DeepMind and Stanford University have constructed several datasets that isolate knowledge from pretraining data to create clean generalization tests. Performance is evaluated across various generalization types by exposing pretrained models to controlled information subsets, both in-context and through fine-tuning. Their findings reveal that in-context learning shows more flexible generalization than fine-tuning in data-matched settings, though there are some exceptions where fine-tuning can generalize to reversals within larger knowledge structures. Building on these insights, researchers have developed a method that enhances fine-tuning generalization by including in-context inferences into the fine-tuning data. Researchers employ multiple datasets carefully designed to isolate specific generalization challenges or insert them within broader learning contexts. Evaluation relies on multiple-choice likelihood scoring without providing answer choices in context. The experiments involve fine-tuning Gemini 1.5 Flash using batch sizes of 8 or 16. For in-context evaluation, the researchers combine training documents as context for the instruction-tuned model, randomly subsampling by 8x for larger datasets to minimize interference issues. The key innovation is a dataset augmentation approach using in-context generalization to enhance fine-tuning dataset coverage. This includes local and global strategies, each employing distinct contexts and prompts. On the Reversal Curse dataset, in-context learning achieves near-ceiling performance on reversals, while conventional fine-tuning shows near-zero accuracy as models favor incorrect celebrity names seen during training. Fine-tuning with data augmented by in-context inferences matches the high performance of pure in-context learning. Testing on simple nonsense reversals reveals similar patterns, though with less pronounced benefits. For simple syllogisms, while the pretrained model performs at chance level, fine-tuning does produce above-chance generalization for certain syllogism types where logical inferences align with simple linguistic patterns. However, in-context learning outperforms fine-tuning, with augmented fine-tuning showing the best overall results. In conclusion, this paper explores generalization differences between in-context learning and fine-tuning when LMs face novel information structures. Results show in-context learning’s superior generalization for certain inference types, prompting the researchers to develop methods that enhance fine-tuning performance by incorporating in-context inferences into training data. Despite promising outcomes, several limitations affect the study. The first one is the dependency on nonsense words and implausible operations. Second, the research focuses on specific LMs, limiting the results’ generality. Future research should investigate learning and generalization differences across various models to expand upon these findings, especially newer reasoning models. Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. Sajjad AnsariSajjad Ansari is a final year undergraduate from IIT Kharagpur. As a Tech enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.Sajjad Ansarihttps://www.marktechpost.com/author/sajjadansari/Salesforce AI Researchers Introduce UAEval4RAG: A New Benchmark to Evaluate RAG Systems’ Ability to Reject Unanswerable QueriesSajjad Ansarihttps://www.marktechpost.com/author/sajjadansari/Google Researchers Introduce LightLab: A Diffusion-Based AI Method for Physically Plausible, Fine-Grained Light Control in Single ImagesSajjad Ansarihttps://www.marktechpost.com/author/sajjadansari/DanceGRPO: A Unified Framework for Reinforcement Learning in Visual Generation Across Multiple Paradigms and TasksSajjad Ansarihttps://www.marktechpost.com/author/sajjadansari/Reinforcement Learning, Not Fine-Tuning: Nemotron-Tool-N1 Trains LLMs to Use Tools with Minimal Supervision and Maximum Generalization 🚨 Build GenAI you can trust. ⭐️ Parlant is your open-source engine for controlled, compliant, and purposeful AI conversations — Star Parlant on GitHub! #enhancing #language #model #generalization #bridging
    WWW.MARKTECHPOST.COM
    Enhancing Language Model Generalization: Bridging the Gap Between In-Context Learning and Fine-Tuning
    Language models (LMs) have great capabilities as in-context learners when pretrained on vast internet text corpora, allowing them to generalize effectively from just a few task examples. However, fine-tuning these models for downstream tasks presents significant challenges. While fine-tuning requires hundreds to thousands of examples, the resulting generalization patterns show limitations. For example, models fine-tuned on statements like “B’s mother is A” struggle to answer related questions like “Who is A’s son?” However, the LMs can handle such reverse relations in context. This raises questions about the differences between in-context learning and fine-tuning generalization patterns, and how these differences should inform adaptation strategies for downstream tasks. Research into improving LMs’ adaptability has followed several key approaches. In-context learning studies have examined learning and generalization patterns through empirical, mechanistic, and theoretical analyses. Out-of-context learning research explores how models utilize information not explicitly included in prompts. Data augmentation techniques use LLMs to enhance performance from limited datasets, with specific solutions targeting issues like the reversal curse through hardcoded augmentations, deductive closure training, and generating reasoning pathways. Moreover, synthetic data approaches have evolved from early hand-designed data to improve generalization in domains like linguistics or mathematics to more recent methods that generate data directly from language models. Researchers from Google DeepMind and Stanford University have constructed several datasets that isolate knowledge from pretraining data to create clean generalization tests. Performance is evaluated across various generalization types by exposing pretrained models to controlled information subsets, both in-context and through fine-tuning. Their findings reveal that in-context learning shows more flexible generalization than fine-tuning in data-matched settings, though there are some exceptions where fine-tuning can generalize to reversals within larger knowledge structures. Building on these insights, researchers have developed a method that enhances fine-tuning generalization by including in-context inferences into the fine-tuning data. Researchers employ multiple datasets carefully designed to isolate specific generalization challenges or insert them within broader learning contexts. Evaluation relies on multiple-choice likelihood scoring without providing answer choices in context. The experiments involve fine-tuning Gemini 1.5 Flash using batch sizes of 8 or 16. For in-context evaluation, the researchers combine training documents as context for the instruction-tuned model, randomly subsampling by 8x for larger datasets to minimize interference issues. The key innovation is a dataset augmentation approach using in-context generalization to enhance fine-tuning dataset coverage. This includes local and global strategies, each employing distinct contexts and prompts. On the Reversal Curse dataset, in-context learning achieves near-ceiling performance on reversals, while conventional fine-tuning shows near-zero accuracy as models favor incorrect celebrity names seen during training. Fine-tuning with data augmented by in-context inferences matches the high performance of pure in-context learning. Testing on simple nonsense reversals reveals similar patterns, though with less pronounced benefits. For simple syllogisms, while the pretrained model performs at chance level (indicating no data contamination), fine-tuning does produce above-chance generalization for certain syllogism types where logical inferences align with simple linguistic patterns. However, in-context learning outperforms fine-tuning, with augmented fine-tuning showing the best overall results. In conclusion, this paper explores generalization differences between in-context learning and fine-tuning when LMs face novel information structures. Results show in-context learning’s superior generalization for certain inference types, prompting the researchers to develop methods that enhance fine-tuning performance by incorporating in-context inferences into training data. Despite promising outcomes, several limitations affect the study. The first one is the dependency on nonsense words and implausible operations. Second, the research focuses on specific LMs, limiting the results’ generality. Future research should investigate learning and generalization differences across various models to expand upon these findings, especially newer reasoning models. Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. Sajjad AnsariSajjad Ansari is a final year undergraduate from IIT Kharagpur. As a Tech enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.Sajjad Ansarihttps://www.marktechpost.com/author/sajjadansari/Salesforce AI Researchers Introduce UAEval4RAG: A New Benchmark to Evaluate RAG Systems’ Ability to Reject Unanswerable QueriesSajjad Ansarihttps://www.marktechpost.com/author/sajjadansari/Google Researchers Introduce LightLab: A Diffusion-Based AI Method for Physically Plausible, Fine-Grained Light Control in Single ImagesSajjad Ansarihttps://www.marktechpost.com/author/sajjadansari/DanceGRPO: A Unified Framework for Reinforcement Learning in Visual Generation Across Multiple Paradigms and TasksSajjad Ansarihttps://www.marktechpost.com/author/sajjadansari/Reinforcement Learning, Not Fine-Tuning: Nemotron-Tool-N1 Trains LLMs to Use Tools with Minimal Supervision and Maximum Generalization 🚨 Build GenAI you can trust. ⭐️ Parlant is your open-source engine for controlled, compliant, and purposeful AI conversations — Star Parlant on GitHub! (Promoted)
    0 Commentarii 0 Distribuiri
  • AI Blueprint for Video Search and Summarization Now Available to Deploy Video Analytics AI Agents Across Industries

    The age of video analytics AI agents is here.
    Video is one of the defining features of the modern digital landscape, accounting for over 50% of all global data traffic. Dominant in media and increasingly important for enterprises across industries, it is one of the largest and most ubiquitous data sources in the world. Yet less than 1% of it is analyzed for insights.
    Nearly half of global GDP comes from physical industries — spanning energy to automotive and electronics. With labor shortage concerns, manufacturing onshoring efforts and rising demand for automation, video analytics AI agents will play a more critical role than ever, helping bridge the physical and digital worlds.
    To accelerate the development of these agents, NVIDIA today is making the AI Blueprint for video search and summarization, powered by the NVIDIA Metropolis platform, generally available — giving developers the tools to create and deploy highly capable AI agents for analyzing vast sums of real-time and archived videos.
    A wave of vision AI agents and productivity assistants powered by vision language modelsare coming online. Combining powerful computer vision models with the skills of super intelligent large language models, these video analytics AI agents allow enterprises to easily see, search and summarize huge volumes of video. By analyzing videos in real time or reviewing terabytes of recorded video, video analytics AI agents are unlocking unprecedented value and opportunities across a range of important industries.
    Manufacturers and warehouses are using AI agents to help increase worker safety and productivity. For example, agents can help distribute forklifts and position workers for optimal efficiency. Smart cities are deploying video analytics AI agents to reduce traffic congestion and increase safety, and the uses go on and on.

    A Blueprint to Create Diverse Fleets of Video Analytics AI Agents
    The VSS blueprint is built on top of the NVIDIA Metropolis platform and boosted by VLMs and LLMs such as NVIDIA VILA and NVIDIA Llama Nemotron, NVIDIA NeMo Retriever microservices, and retrieval-augmented generation— a technique that connects LLMs to a company’s enterprise data.
    The VSS blueprint incorporates the NVIDIA AI Enterprise software platform, including NVIDIA NIM microservices for VLMs, LLMs and advanced AI frameworks for RAG. With the VSS blueprint, users can summarize a video 100x faster than watching in real time. For example, an hourlong video can be summarized in text in less than one minute.
    The VSS blueprint offers a host of powerful features designed to provide robust video understanding, performance and scalability.
    This release introduces expanded hardware support, including the ability to deploy on a single NVIDIA A100 or H100 GPU for smaller workloads, offering greater flexibility in resource allocation. The blueprint can also be deployed at the edge on the NVIDIA RTX 6000 PRO and NVIDIA DGX Spark computing platforms.
    The VSS blueprint can process hundreds of live video streams or burst clips simultaneously. In addition to visual understanding, it offers audio transcription. Converting speech to text adds contextual depth in scenarios where audio is critical — such as training videos, keynotes or team meetings.
    Industry Leaders Deploy Video Analytics AI Agents to Drive Business Value
    Everyone from the world’s leading manufacturers to smart cities and sports leagues are using the VSS blueprint to develop AI agents for optimizing operations.
    Pegatron, a leading electronics manufacturing company, uses the VSS blueprint to study operating procedures and train employees on best practices. The company is also integrating the blueprint into its PEGAAi platform so organizations can build AI agents to transform manufacturing processes.

    These agents can ingest and analyze massive volumes of video, enabling advanced capabilities like automated monitoring, anomaly detection, video search and incident reporting. Pegatron’s Visual Analytics Agent can be used to understand operating procedures for printed circuit board assembly and identify when actions are correct or incorrect. To date, the agents have reduced Pegatron’s labor costs by 7% and defect rates by 67%.
    Additional leading Taiwanese semiconductor and electronics manufacturers are building AI agents and digital twins to optimize their planning and operational applications.
    Kaohsiung City, Taiwan, is using a unified smart city vision AI application developed by its partner, Linker Vision, to improve incident response times. Previously, city departments such as waste management, transportation and emergency response were isolated by siloed infrastructure — leading to slow response times due to lack of access to critical information.
    Powered by the VSS blueprint, Linker Vision’s AI-powered application has agents that combine real-time video analytics with generative AI to not just detect visual elements but also understand and narrate complex urban events like floods or traffic accidents.
    Linker Vision currently delivers timely insights to 12 city departments and is on track to scale from 30,000 city cameras to over 50,000 by 2026. These insights are providing improved situational awareness and data-driven decision-making across city services, and reducing incident response times by up to 80%.

    The National Hockey League used the VAST InsightEngine with the VSS blueprint to streamline and accelerate vision AI workflows. It manages massive volumes of game footage.
    With the VAST InsightEngine, the NHL is positioned to search through petabytes of video in sub-seconds, enabling near-instant retrieval of highlights and in-game moments. AI-driven agentic workflows further enhance content creation by automatically clipping, tagging and assembling video content for ease of access and use.
    In the future, the League could potentially use real-time AI reasoning to enable tailored insights — such as player stats, strategy analyses or fantasy recommendations — generated dynamically during live games. This end-to-end automation could transform how media is created, curated and delivered, setting a new standard for AI-driven sports content production.

    Siemens is using its Industrial Copilot for Operations to assist factory floor workers with equipment maintenance tasks, error handling and performance optimization. This generative AI-powered assistant offers real-time answers to equipment errors using information about operational and document data.
    The copilot was built with a fusion of VSS components like VLMs, LLMs and NVIDIA NeMo microservices. The Industrial Copilot has resulted in rapid decision-making and reduced machine downtime. Siemens has reported a 30% increase in productivity, with the potential to reach 50%.
    Supported by an Expanding Partner Ecosystem Creating Sophisticated AI Agents
    NVIDIA partners are using the VSS blueprint to expedite the creation of agentic AI video analytics capabilities for their workflows, reducing development time from months to weeks.
    Superb AI, a leader in intelligent video analytics, set up a sophisticated airport operations project at Incheon Airport to reduce passenger wait times in a matter of weeks. In Malaysia, solution provider ITMAX is building advanced visual AI agents with the VSS blueprint for the City of Kuala Lumpur to improve overall city management and reduce incident response times.
    In the advertising sector, PYLER integrated the VSS blueprint into its brand safetyand ad targetingsolutions in just a few weeks. Using AiD and AiM, Samsung Electronics increased advertising effectiveness with brand- and product-aligned, high-value ad placements. BYD saw its ad-click through rates increase 4x by targeting contextually relevant and positive content, while Hana Financial Group surpassed multiple brand campaign goals.
    Fingermark is the application provider of Eyecue, a real-time computer vision platform used by quick service restaurants. Fingermark is adding the VSS blueprint into Eyecue to turn video footage into clear, actionable insights regarding drive-thru wait times, service bottlenecks and staff-related incidents at scale.
    Try the VSS blueprint on build.nvidia.com and read this technical blog for more details.
    Watch the COMPUTEX keynote from NVIDIA founder and CEO Jensen Huang, as well as NVIDIA GTC Taipei 2025 sessions.
    #blueprint #video #search #summarization #now
    AI Blueprint for Video Search and Summarization Now Available to Deploy Video Analytics AI Agents Across Industries
    The age of video analytics AI agents is here. Video is one of the defining features of the modern digital landscape, accounting for over 50% of all global data traffic. Dominant in media and increasingly important for enterprises across industries, it is one of the largest and most ubiquitous data sources in the world. Yet less than 1% of it is analyzed for insights. Nearly half of global GDP comes from physical industries — spanning energy to automotive and electronics. With labor shortage concerns, manufacturing onshoring efforts and rising demand for automation, video analytics AI agents will play a more critical role than ever, helping bridge the physical and digital worlds. To accelerate the development of these agents, NVIDIA today is making the AI Blueprint for video search and summarization, powered by the NVIDIA Metropolis platform, generally available — giving developers the tools to create and deploy highly capable AI agents for analyzing vast sums of real-time and archived videos. A wave of vision AI agents and productivity assistants powered by vision language modelsare coming online. Combining powerful computer vision models with the skills of super intelligent large language models, these video analytics AI agents allow enterprises to easily see, search and summarize huge volumes of video. By analyzing videos in real time or reviewing terabytes of recorded video, video analytics AI agents are unlocking unprecedented value and opportunities across a range of important industries. Manufacturers and warehouses are using AI agents to help increase worker safety and productivity. For example, agents can help distribute forklifts and position workers for optimal efficiency. Smart cities are deploying video analytics AI agents to reduce traffic congestion and increase safety, and the uses go on and on. A Blueprint to Create Diverse Fleets of Video Analytics AI Agents The VSS blueprint is built on top of the NVIDIA Metropolis platform and boosted by VLMs and LLMs such as NVIDIA VILA and NVIDIA Llama Nemotron, NVIDIA NeMo Retriever microservices, and retrieval-augmented generation— a technique that connects LLMs to a company’s enterprise data. The VSS blueprint incorporates the NVIDIA AI Enterprise software platform, including NVIDIA NIM microservices for VLMs, LLMs and advanced AI frameworks for RAG. With the VSS blueprint, users can summarize a video 100x faster than watching in real time. For example, an hourlong video can be summarized in text in less than one minute. The VSS blueprint offers a host of powerful features designed to provide robust video understanding, performance and scalability. This release introduces expanded hardware support, including the ability to deploy on a single NVIDIA A100 or H100 GPU for smaller workloads, offering greater flexibility in resource allocation. The blueprint can also be deployed at the edge on the NVIDIA RTX 6000 PRO and NVIDIA DGX Spark computing platforms. The VSS blueprint can process hundreds of live video streams or burst clips simultaneously. In addition to visual understanding, it offers audio transcription. Converting speech to text adds contextual depth in scenarios where audio is critical — such as training videos, keynotes or team meetings. Industry Leaders Deploy Video Analytics AI Agents to Drive Business Value Everyone from the world’s leading manufacturers to smart cities and sports leagues are using the VSS blueprint to develop AI agents for optimizing operations. Pegatron, a leading electronics manufacturing company, uses the VSS blueprint to study operating procedures and train employees on best practices. The company is also integrating the blueprint into its PEGAAi platform so organizations can build AI agents to transform manufacturing processes. These agents can ingest and analyze massive volumes of video, enabling advanced capabilities like automated monitoring, anomaly detection, video search and incident reporting. Pegatron’s Visual Analytics Agent can be used to understand operating procedures for printed circuit board assembly and identify when actions are correct or incorrect. To date, the agents have reduced Pegatron’s labor costs by 7% and defect rates by 67%. Additional leading Taiwanese semiconductor and electronics manufacturers are building AI agents and digital twins to optimize their planning and operational applications. Kaohsiung City, Taiwan, is using a unified smart city vision AI application developed by its partner, Linker Vision, to improve incident response times. Previously, city departments such as waste management, transportation and emergency response were isolated by siloed infrastructure — leading to slow response times due to lack of access to critical information. Powered by the VSS blueprint, Linker Vision’s AI-powered application has agents that combine real-time video analytics with generative AI to not just detect visual elements but also understand and narrate complex urban events like floods or traffic accidents. Linker Vision currently delivers timely insights to 12 city departments and is on track to scale from 30,000 city cameras to over 50,000 by 2026. These insights are providing improved situational awareness and data-driven decision-making across city services, and reducing incident response times by up to 80%. The National Hockey League used the VAST InsightEngine with the VSS blueprint to streamline and accelerate vision AI workflows. It manages massive volumes of game footage. With the VAST InsightEngine, the NHL is positioned to search through petabytes of video in sub-seconds, enabling near-instant retrieval of highlights and in-game moments. AI-driven agentic workflows further enhance content creation by automatically clipping, tagging and assembling video content for ease of access and use. In the future, the League could potentially use real-time AI reasoning to enable tailored insights — such as player stats, strategy analyses or fantasy recommendations — generated dynamically during live games. This end-to-end automation could transform how media is created, curated and delivered, setting a new standard for AI-driven sports content production. Siemens is using its Industrial Copilot for Operations to assist factory floor workers with equipment maintenance tasks, error handling and performance optimization. This generative AI-powered assistant offers real-time answers to equipment errors using information about operational and document data. The copilot was built with a fusion of VSS components like VLMs, LLMs and NVIDIA NeMo microservices. The Industrial Copilot has resulted in rapid decision-making and reduced machine downtime. Siemens has reported a 30% increase in productivity, with the potential to reach 50%. Supported by an Expanding Partner Ecosystem Creating Sophisticated AI Agents NVIDIA partners are using the VSS blueprint to expedite the creation of agentic AI video analytics capabilities for their workflows, reducing development time from months to weeks. Superb AI, a leader in intelligent video analytics, set up a sophisticated airport operations project at Incheon Airport to reduce passenger wait times in a matter of weeks. In Malaysia, solution provider ITMAX is building advanced visual AI agents with the VSS blueprint for the City of Kuala Lumpur to improve overall city management and reduce incident response times. In the advertising sector, PYLER integrated the VSS blueprint into its brand safetyand ad targetingsolutions in just a few weeks. Using AiD and AiM, Samsung Electronics increased advertising effectiveness with brand- and product-aligned, high-value ad placements. BYD saw its ad-click through rates increase 4x by targeting contextually relevant and positive content, while Hana Financial Group surpassed multiple brand campaign goals. Fingermark is the application provider of Eyecue, a real-time computer vision platform used by quick service restaurants. Fingermark is adding the VSS blueprint into Eyecue to turn video footage into clear, actionable insights regarding drive-thru wait times, service bottlenecks and staff-related incidents at scale. Try the VSS blueprint on build.nvidia.com and read this technical blog for more details. Watch the COMPUTEX keynote from NVIDIA founder and CEO Jensen Huang, as well as NVIDIA GTC Taipei 2025 sessions. #blueprint #video #search #summarization #now
    BLOGS.NVIDIA.COM
    AI Blueprint for Video Search and Summarization Now Available to Deploy Video Analytics AI Agents Across Industries
    The age of video analytics AI agents is here. Video is one of the defining features of the modern digital landscape, accounting for over 50% of all global data traffic. Dominant in media and increasingly important for enterprises across industries, it is one of the largest and most ubiquitous data sources in the world. Yet less than 1% of it is analyzed for insights. Nearly half of global GDP comes from physical industries — spanning energy to automotive and electronics. With labor shortage concerns, manufacturing onshoring efforts and rising demand for automation, video analytics AI agents will play a more critical role than ever, helping bridge the physical and digital worlds. To accelerate the development of these agents, NVIDIA today is making the AI Blueprint for video search and summarization (VSS), powered by the NVIDIA Metropolis platform, generally available — giving developers the tools to create and deploy highly capable AI agents for analyzing vast sums of real-time and archived videos. A wave of vision AI agents and productivity assistants powered by vision language models (VLMs) are coming online. Combining powerful computer vision models with the skills of super intelligent large language models (LLMs), these video analytics AI agents allow enterprises to easily see, search and summarize huge volumes of video. By analyzing videos in real time or reviewing terabytes of recorded video, video analytics AI agents are unlocking unprecedented value and opportunities across a range of important industries. Manufacturers and warehouses are using AI agents to help increase worker safety and productivity. For example, agents can help distribute forklifts and position workers for optimal efficiency. Smart cities are deploying video analytics AI agents to reduce traffic congestion and increase safety, and the uses go on and on. A Blueprint to Create Diverse Fleets of Video Analytics AI Agents The VSS blueprint is built on top of the NVIDIA Metropolis platform and boosted by VLMs and LLMs such as NVIDIA VILA and NVIDIA Llama Nemotron, NVIDIA NeMo Retriever microservices, and retrieval-augmented generation (RAG) — a technique that connects LLMs to a company’s enterprise data. The VSS blueprint incorporates the NVIDIA AI Enterprise software platform, including NVIDIA NIM microservices for VLMs, LLMs and advanced AI frameworks for RAG. With the VSS blueprint, users can summarize a video 100x faster than watching in real time. For example, an hourlong video can be summarized in text in less than one minute. The VSS blueprint offers a host of powerful features designed to provide robust video understanding, performance and scalability. This release introduces expanded hardware support, including the ability to deploy on a single NVIDIA A100 or H100 GPU for smaller workloads, offering greater flexibility in resource allocation. The blueprint can also be deployed at the edge on the NVIDIA RTX 6000 PRO and NVIDIA DGX Spark computing platforms. The VSS blueprint can process hundreds of live video streams or burst clips simultaneously. In addition to visual understanding, it offers audio transcription. Converting speech to text adds contextual depth in scenarios where audio is critical — such as training videos, keynotes or team meetings. Industry Leaders Deploy Video Analytics AI Agents to Drive Business Value Everyone from the world’s leading manufacturers to smart cities and sports leagues are using the VSS blueprint to develop AI agents for optimizing operations. Pegatron, a leading electronics manufacturing company, uses the VSS blueprint to study operating procedures and train employees on best practices. The company is also integrating the blueprint into its PEGAAi platform so organizations can build AI agents to transform manufacturing processes. These agents can ingest and analyze massive volumes of video, enabling advanced capabilities like automated monitoring, anomaly detection, video search and incident reporting. Pegatron’s Visual Analytics Agent can be used to understand operating procedures for printed circuit board assembly and identify when actions are correct or incorrect. To date, the agents have reduced Pegatron’s labor costs by 7% and defect rates by 67%. Additional leading Taiwanese semiconductor and electronics manufacturers are building AI agents and digital twins to optimize their planning and operational applications. Kaohsiung City, Taiwan, is using a unified smart city vision AI application developed by its partner, Linker Vision, to improve incident response times. Previously, city departments such as waste management, transportation and emergency response were isolated by siloed infrastructure — leading to slow response times due to lack of access to critical information. Powered by the VSS blueprint, Linker Vision’s AI-powered application has agents that combine real-time video analytics with generative AI to not just detect visual elements but also understand and narrate complex urban events like floods or traffic accidents. Linker Vision currently delivers timely insights to 12 city departments and is on track to scale from 30,000 city cameras to over 50,000 by 2026. These insights are providing improved situational awareness and data-driven decision-making across city services, and reducing incident response times by up to 80%. The National Hockey League used the VAST InsightEngine with the VSS blueprint to streamline and accelerate vision AI workflows. It manages massive volumes of game footage. With the VAST InsightEngine, the NHL is positioned to search through petabytes of video in sub-seconds, enabling near-instant retrieval of highlights and in-game moments. AI-driven agentic workflows further enhance content creation by automatically clipping, tagging and assembling video content for ease of access and use. In the future, the League could potentially use real-time AI reasoning to enable tailored insights — such as player stats, strategy analyses or fantasy recommendations — generated dynamically during live games. This end-to-end automation could transform how media is created, curated and delivered, setting a new standard for AI-driven sports content production. Siemens is using its Industrial Copilot for Operations to assist factory floor workers with equipment maintenance tasks, error handling and performance optimization. This generative AI-powered assistant offers real-time answers to equipment errors using information about operational and document data. The copilot was built with a fusion of VSS components like VLMs, LLMs and NVIDIA NeMo microservices. The Industrial Copilot has resulted in rapid decision-making and reduced machine downtime. Siemens has reported a 30% increase in productivity, with the potential to reach 50%. Supported by an Expanding Partner Ecosystem Creating Sophisticated AI Agents NVIDIA partners are using the VSS blueprint to expedite the creation of agentic AI video analytics capabilities for their workflows, reducing development time from months to weeks. Superb AI, a leader in intelligent video analytics, set up a sophisticated airport operations project at Incheon Airport to reduce passenger wait times in a matter of weeks. In Malaysia, solution provider ITMAX is building advanced visual AI agents with the VSS blueprint for the City of Kuala Lumpur to improve overall city management and reduce incident response times. In the advertising sector, PYLER integrated the VSS blueprint into its brand safety (AiD) and ad targeting (AiM) solutions in just a few weeks. Using AiD and AiM, Samsung Electronics increased advertising effectiveness with brand- and product-aligned, high-value ad placements. BYD saw its ad-click through rates increase 4x by targeting contextually relevant and positive content, while Hana Financial Group surpassed multiple brand campaign goals. Fingermark is the application provider of Eyecue, a real-time computer vision platform used by quick service restaurants. Fingermark is adding the VSS blueprint into Eyecue to turn video footage into clear, actionable insights regarding drive-thru wait times, service bottlenecks and staff-related incidents at scale. Try the VSS blueprint on build.nvidia.com and read this technical blog for more details. Watch the COMPUTEX keynote from NVIDIA founder and CEO Jensen Huang, as well as NVIDIA GTC Taipei 2025 sessions.
    0 Commentarii 0 Distribuiri
Sponsorizeaza Paginile