Posts Directory | CGShares


Microsoft Academic @MicrosoftAcademic shared a link
2025-02-25 19:23:18 ·

Magma: A foundation model for multimodal AI agents across digital and physical worlds

www.microsoft.com
Imagine an AI system capable of guiding a robot to manipulate physical objects as effortlessly as it navigates software menus. Such seamless integration of digital and physical tasks has long been the stuff of science fiction.
Today, Microsoft researchers are bringing that vision closer to reality with Magma, a multimodal AI foundation model designed to process information and generate action proposals across both digital and physical environments. It is designed to enable AI agents to interpret user interfaces and suggest actions like button clicks, while also orchestrating robotic movements and interactions in the physical world.
Built on the foundation model paradigm, Magma is pretrained on an expansive and diverse dataset, allowing it to generalize better across tasks and environments than smaller, task-specific models. As illustrated in Figure 1, Magma synthesizes visual and textual inputs to generate meaningful actions, whether executing a command in software or grabbing a tool in the physical world. This new model represents a significant step toward AI agents that can serve as versatile, general-purpose assistants.
Figure 1: Magma is one of the first foundation models that is capable of interpreting and grounding multimodal inputs within both digital and physical environments. Given a described goal, Magma can formulate plans and execute actions to achieve it. By effectively transferring knowledge from freely available visual and language data, Magma bridges verbal, spatial and temporal intelligence to navigate complex tasks and settings.
Vision-Language-Action (VLA) models integrate visual perception, language comprehension, and action reasoning to enable AI systems to interpret images, process textual instructions, and propose actions. These models bridge the gap between multimodal understanding and real-world interaction. Typically pretrained on large numbers of VLA datasets, they acquire the ability to understand visual content, process language, and perceive and interact with the spatial world, allowing them to perform a wide range of tasks. However, due to the dramatic differences among various digital and physical environments, separate VLA models are trained and used for different environments. As a result, these models struggle to generalize to new tasks and environments outside of their training data. Moreover, most of these models do not leverage pretrained vision-language (VL) models or diverse VL datasets, which hampers their understanding of VL relations and generalizability.
Magma, to the best of our knowledge, is one of the first VLA foundation models that can adapt to new tasks in both digital and physical environments, which helps AI-powered assistants or robots understand their surroundings and suggest appropriate actions. For example, it could enable a home assistant robot to learn how to organize a new type of object it has never encountered or help a virtual assistant generate step-by-step user interface navigation instructions for an unfamiliar task.
Through Magma, we demonstrate the advantages of pretraining a single VLA model for AI agents across multiple environments while still achieving state-of-the-art results on user interface navigation and robotic manipulation tasks, outperforming previous models that are tailored to these specific domains.
On VL tasks, Magma also compares favorably to popular VL models that are trained on much larger datasets.
Building a foundation model that spans such different modalities has required us to rethink how we train and supervise AI agents. Magma introduces a novel training paradigm centered on two key innovations: Set-of-Mark (SoM) and Trace-of-Mark (ToM) annotations. These techniques, developed by Microsoft Research, imbue the model with a structured understanding of tasks in both user interface navigation and robotic manipulation domains.
Set-of-Mark (SoM): SoM is an annotated set of key objects or interface elements that are relevant to achieving a given goal. For example, if the task is to navigate a web page, the SoM includes all the bounding boxes for clickable user interface elements. In a physical task like setting a table, the SoM could include the plate, the cup, and the position of each item on the table. By providing SoM, we give Magma a high-level hint of what needs attention (the essential elements of the task) without yet specifying the order or method.
Figure 2: Set-of-Mark (SoM) for Action Grounding. Set-of-Mark prompting enables effective action grounding in images for UI screenshots (left), robot manipulation (middle), and human video (right) by having the model predict numeric marks for clickable buttons or robot arms in image space. These marks give Magma a high-level hint of what needs attention: the essential elements of the task.
Trace-of-Mark (ToM): In ToM, we extend the strategy of overlaying marks from static images to dynamic videos by incorporating tracing lines that follow object movements over time. While SoM highlights key objects or interface elements relevant to a task, ToM captures how these elements change or move throughout an interaction. For example, in a physical task like moving an object on a table, ToM might illustrate the motion of a hand placing the object and adjusting its position.
By providing these temporal traces, ToM offers Magma a richer understanding of how actions unfold, complementing SoM's focus on what needs attention.
Figure 3: Trace-of-Mark (ToM) for Action Planning. Trace-of-Mark supervisions for robot manipulation (left) and human action (right). It compels the model to comprehend temporal video dynamics and anticipate future states before acting, while using fewer tokens than next-frame prediction to capture longer temporal horizons and action-related dynamics without ambient distractions.
Performance and evaluation
Zero-shot agentic intelligence
Table 1: Zero-shot evaluation on agentic intelligence. We report the results for pretrained Magma without any domain-specific finetuning. In this experiment, Magma is the only model that can conduct the full task spectrum.
Figure 4: Zero-shot evaluation on Google Robots and Bridge with SimplerEnv. Magma shows strong zero-shot cross-domain robustness and demonstrates impressive results in cross-embodiment manipulation simulation tasks.
Efficient finetuning
Table 2: Efficient finetuning on Mind2Web for web UI navigation.
Figure 5: Few-shot finetuning on Widow-X robot (left) and LIBERO (right). Magma achieves a significantly higher average success rate in all task suites. Additionally, removing SoM and ToM during pretraining has a negative impact on model performance.
Table 3: Without task-specific data, Magma performs competitively and even outperforms some state-of-the-art approaches such as Video-Llama2 and ShareGPT4Video on most benchmarks, despite using far less video instruction-tuning data.
Relation to broader research
Magma is one component of a much larger vision within Microsoft Research for the future of agentic AI systems.
Across various teams and projects at Microsoft, we are collectively exploring how AI systems can detect, analyze, and respond in the world to amplify human capabilities.
Earlier this month, we announced AutoGen v0.4, a fully reimagined open-source library for building advanced agentic AI systems. While AutoGen focuses on the structure and management of AI agents, Magma enhances those agents by empowering them with a new level of capability. Developers can already use AutoGen to set up an AI assistant that uses a conventional LLM for planning and dialogue. Now, with Magma, if developers want to build agents that execute physical or user interface/browser tasks, that same assistant would call upon Magma to understand the environment, perform reasoning, and take a sequence of actions to complete the task.
The reasoning ability of Magma can be further developed by incorporating test-time search and reinforcement learning, as described in ExACT. ExACT shows an approach for teaching AI agents to explore more effectively, enabling them to intelligently navigate their environments, gather valuable information, evaluate options, and identify optimal decision-making and planning strategies.
At the application level, we are also exploring new user experiences (UX) powered by foundation models for the next generation of agentic AI systems. Data Formulator is a prime example. Announced late last year, Data Formulator is an AI-driven visualization tool developed by Microsoft Research that translates high-level analytical intents into rich visual representations by handling complex data transformations behind the scenes.
Looking ahead, the integration of reasoning, exploration and action capabilities will pave the way for highly capable, robust agentic AI systems.
Magma is available on Azure AI Foundry Labs as well as on Hugging Face with an MIT license.
Please refer to the Magma project page for more technical details. We invite you to test and explore these cutting-edge agentic model innovations from Microsoft Research.
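To make the Set-of-Mark and Trace-of-Mark ideas described above more concrete, here is a minimal Python sketch. It is not Magma's code or API: the `Mark` class and the `build_marks`, `ground_action`, and `build_trace` functions are hypothetical names invented for illustration, and the sketch assumes a predicted mark is turned into a click at the center of its bounding box, which the article does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mark:
    """One numbered Set-of-Mark candidate: a bounding box in pixel coordinates."""
    mark_id: int
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def center(self) -> tuple[float, float]:
        # Midpoint of the box; used here as the click target (an assumption).
        return ((self.x_min + self.x_max) / 2, (self.y_min + self.y_max) / 2)


def build_marks(boxes):
    """Assign consecutive numeric marks (starting at 1) to candidate boxes."""
    return [Mark(i, *box) for i, box in enumerate(boxes, start=1)]


def ground_action(marks, predicted_mark_id):
    """Turn a predicted mark number into a click point (the box center)."""
    by_id = {m.mark_id: m for m in marks}
    if predicted_mark_id not in by_id:
        raise ValueError(f"unknown mark {predicted_mark_id}")
    return by_id[predicted_mark_id].center()


def build_trace(frames_of_marks, mark_id):
    """Trace-of-Mark style trace: the center of one mark in each frame, in order."""
    trace = []
    for marks in frames_of_marks:
        for m in marks:
            if m.mark_id == mark_id:
                trace.append(m.center())
                break  # at most one position per frame
    return trace
```

In this sketch the model would only need to output a mark number rather than raw pixel coordinates, and a trace is simply the ordered list of positions one mark takes across video frames.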

Science News @ScienceNews shared a link
2025-02-25 19:23:51 ·

A new book chronicles the science of life in the air

www.sciencenews.org
Air-Borne
Carl Zimmer
Dutton, $32
On March 10, 2020, 61 choir members rehearsed in a church hall in Skagit County, Wash. As they sang, a microscopic germ wafted through the air. Before the month's end, 58 members were infected and five fell gravely ill. Across the United States, the virus wreaked havoc. Within weeks, thousands of people died, schools and businesses shuttered and 700,000 people lost their jobs.
Many scientists determined in 2020 that the coronavirus spread through the air, but it would take public health agencies months longer to acknowledge that. The Skagit County superspreader event helped the World Health Organization and the U.S. Centers for Disease Control and Prevention to consider the airborne transmission of COVID-19. But to this day, some scientists believe the delay in calling the virus airborne was a mistake, one that stalled vital public health measures and allowed the disease to spread faster. In his new book, Air-Borne, science journalist Carl Zimmer roots the mistake in the past of a historically neglected field: aerobiology, or the science of airborne life.
Zimmer begins his chronicle in the 19th century with Louis Pasteur's climb up a towering glacier in the French Alps. As part of a grand experiment, the microbiologist tipped a glass chamber to the sky, snared life and proved that microscopic germs floated in the air. Pasteur's discovery inspired generations of scientists to look for airborne life themselves, including pathologist Fred Meier, who stuck Petri dishes out of various aircraft and ultimately named the field.
Through the stories of Pasteur, Meier and dozens of other scientists, Zimmer seamlessly weaves together centuries of aerobiology science. He richly humanizes the characters with honesty and complexity, simultaneously highlighting the publicly revered and the unsung.
His pithy, punchy and accessible language gives life to glamorous experiments, like those conducted from hot-air balloons, as well as unassuming ones run in university basements.
But aerobiology is more than science-laden joyrides through the sky. The field was mired in humankind's darkest moments, which Zimmer brings out of the shadows and into the light. Aerobiologists were central to debates on how life-threatening diseases like the Black Death, cholera and tuberculosis spread. And while some scientists worked to fight airborne infections, others committed to creating them, Zimmer writes. During World War II, the United States was one of several countries to create biological weapons. Some U.S. researchers helped build an arsenal of deadly germs and spores to potentially use against the nation's enemies. For years after the war, aerobiology remained shrouded in secrecy and was largely ignored by public health officials. It wasn't until COVID-19 that this began to change.
Readers will end the book with a better understanding of just how high life can fly and how far public knowledge of aerobiology has come. It's a reminder that the decisions humans make today regarding airborne life are informed by a deep history. Zimmer concludes his chronicle with a vision of harmonious coexistence with the life that teems in the atmosphere: "As long as there is life on Earth, it will fly, and as long as we are here, we will breathe."
Buy Air-Borne from Bookshop.org. Science News is a Bookshop.org affiliate and will earn a commission on purchases made from links in this article.

Nature @Nature shared a link
2025-02-25 19:24:07 ·

My experience of speaking at a conference with a baby

www.nature.com
Nature, Published online: 25 February 2025; doi:10.1038/d41586-025-00281-2
Hannah Chance spent a day during her parental leave telling a group of postdocs about her science-policy work. Aidan, aged six months, came along too.

Nature @Nature shared a link
2025-02-25 19:24:25 ·

Sharing benefits of research is key to effective science communication

www.nature.com
Nature, Published online: 25 February 2025; doi:10.1038/d41586-025-00586-2
Sharing benefits of research is key to effective science communication

LiveScience @LiveScience shared a link
2025-02-25 19:24:42 ·

Mars was once a 'vacation-style' beach planet, Chinese rover scans reveal

www.livescience.com
China's Zhurong rover has found evidence of an ancient shoreline buried deep beneath the planet's surface. That could point to an ocean, a beach, and to life.

Gleb Alexandrov @GlebAlexandrovs shared a link
2025-02-25 19:27:50 ·

What happened to Photopea?

x.com
What happened to Photopea?

Gleb Alexandrov @GlebAlexandrovs shared a link
2025-02-25 19:27:53 ·

Re @Khamurai3D That. Is. Brilliant!

x.com
Re @Khamurai3D That. Is. Brilliant!

FlippedNormals @FlippedNormals shared a link
2025-02-25 19:27:58 ·

Instantly upgrade your texturing with our Scars HQ Brush Pack: your shortcut to hyper-realistic, high-quality scar details. https://flipnm.co/4kefNss

x.com
Instantly upgrade your texturing with our Scars HQ Brush Pack: your shortcut to hyper-realistic, high-quality scar details. https://flipnm.co/4kefNss

80 Level @80Level shared a link
2025-02-25 19:29:16 ·

Game Developer Turned Unity's 2D Tilemaps Into Fully 3D

cgshares.com
A Reddit user named nodoxi demonstrated his custom version of Unity's 2D Tilemaps, revealing some details about his approach, which makes them appear 3D. Nodoxi shared a screenshot of the Rule Tile editor with GameObjects and mentioned that all mesh parts are merged into one to improve rendering efficiency.
For the 3D pieces, nodoxi used a shader with tri-planar mapping and two textures (one for the upper rocks and one for the bottom). The snow is just a regular tilemap.
Sure, optimization is a big concern with this workflow, but the developer mentioned that the map isn't very large and the polygon count is low. However, he sees room for improvement and is considering breaking it into chunks.
For more details, check out the original Reddit post. If you're interested in the project nodoxi used these tiles for, you can wishlist Crimsonwood, a roguelike top-down shooter set in a cursed national park.
If you're looking for a nice Unity 3D tilemap tool, we recommend checking out MAPGrid by TulioAndMiguelMPG, which is still in development.
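For readers unfamiliar with tri-planar mapping, the following is a rough Python sketch of the general technique, not nodoxi's shader, whose details the post does not give. The function names, the `sharpness` and `scale` parameters, and the choice of which coordinate pair feeds each projection are assumptions made for illustration; the split between an upper-rock texture and a bottom texture is not modeled.

```python
def triplanar_weights(normal, sharpness=4.0):
    """Blend weights for the X, Y and Z projections, derived from the surface normal.

    Each weight is |component| ** sharpness, normalized so the three sum to 1.
    A higher sharpness gives a harder transition between projections.
    """
    wx, wy, wz = (abs(c) ** sharpness for c in normal)
    total = wx + wy + wz
    if total == 0:
        raise ValueError("normal must be non-zero")
    return (wx / total, wy / total, wz / total)


def triplanar_sample(sample, position, normal, scale=1.0, sharpness=4.0):
    """Sample a 2D texture three times using world-space coordinates and blend.

    `sample(u, v)` is any callable returning a number (for example, a grayscale
    texture lookup). No UV unwrap is needed because the coordinates come from
    the world position itself.
    """
    x, y, z = (c * scale for c in position)
    wx, wy, wz = triplanar_weights(normal, sharpness)
    return (
        wx * sample(y, z)    # projection along the X axis
        + wy * sample(x, z)  # projection along the Y axis
        + wz * sample(x, y)  # projection along the Z axis
    )
```

A surface facing straight along one axis receives only that axis's projection, while slanted surfaces get a weighted mix, which is why the technique tends to suit rocks and cliffs built from merged tile meshes.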

80 Level @80Level shared a link
2025-02-25 19:29:18 ·

Atmospheric Unreal Engine 5 Scene Inspired by The Witcher

cgshares.com
Environment Artist and Sculptor Sherif Dawoud showcased a mysterious Unreal Engine 5 environment called Crossroads. The scene is part of a bigger project and was inspired by The Witcher.
The play of the light is especially stunning here and reminds me of a painting. Notably, all the models and textures were created by the artist himself using ZBrush, Substance 3D Designer, Blender, and SpeedTree.
It might not be that surprising considering his many talents. You might have seen him in another role: an artist with an eye for anatomy, producing beautifully realistic characters like Hank Schrader from Breaking Bad and body parts, including hands, triceps, and more, which you can see on ArtStation.
I'm looking forward to where Dawoud will take Crossroads.
