TOWARDSAI.NET
Building Large Action Models: Insights from Microsoft
January 7, 2025
Author(s): Jesus Rodriguez
Originally published on Towards AI.

Created Using Midjourney

I recently started an AI-focused educational newsletter that already has over 175,000 subscribers. TheSequence is a no-BS (meaning no hype, no news, etc.) ML-oriented newsletter that takes 5 minutes to read. The goal is to keep you up to date with machine learning projects, research papers, and concepts. Please give it a try by subscribing at thesequence.ai.

Action execution is one of the key building blocks of agentic workflows. One of the most interesting debates in that area is whether actions are executed by the model itself or by an external coordination layer. Supporters of the former view have lined up behind a theory known as large action models (LAMs), with projects like Gorilla and Rabbit r1 as key pioneers. However, there are still only a few practical examples of LAM frameworks. Recently, Microsoft Research published one of the most thorough papers in this area, outlining a complete framework for building LAMs. Microsoft's core idea is to bridge the gap between the language understanding prowess of LLMs and the need for real-world action execution.

From LLMs to LAMs: A Paradigm Shift
The limitations of traditional LLMs in interacting with and manipulating the physical world necessitate the development of LAMs. While LLMs excel at generating intricate textual responses, their inability to translate understanding into tangible actions restricts their applicability in real-world scenarios. LAMs address this challenge by extending the expertise of LLMs from language processing to action generation, enabling them to perform actions in both physical and digital environments.
This transition signifies a shift from passive language understanding to active task completion, marking a significant milestone in AI development.

Image Credit: Microsoft Research

Key Architectural Components: A Step-by-Step Approach
Microsoft's framework for developing LAMs outlines a systematic process, encompassing crucial stages from inception to deployment. The key architectural components include:

Data Collection and Preparation
This foundational step involves gathering and curating high-quality, action-oriented data for specific use cases. This data includes user queries, environmental context, potential actions, and any other relevant information required to train the LAM effectively. A two-phase data collection approach is adopted:

Task-Plan Collection
This phase focuses on collecting data consisting of tasks and their corresponding plans. Tasks represent user requests expressed in natural language, while plans outline detailed step-by-step procedures designed to fulfill these requests. This data is crucial for training the model to generate effective plans and enhance its high-level reasoning and planning capabilities. Sources for this data include application documentation, online how-to guides like WikiHow, and historical search queries.

Task-Action Collection
This phase converts task-plan data into executable steps. It involves refining tasks and plans to be more concrete and grounded within a specific environment. Action sequences are generated, representing actionable instructions that directly interact with the environment, such as select_text(text="hello") or click(on=Button("20"), how="left", double=False). This data provides the necessary granularity for training a LAM to perform reliable and accurate task executions in real-world scenarios.

Image Credit: Microsoft Research

Model Training
This stage involves training or fine-tuning LLMs to perform actions rather than merely generate text.
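
To make the difference between generating text and generating actions concrete, here is a minimal sketch of how an action string from the task-action data could be turned into a structured call that an executor can dispatch. The `Action` class and `parse_action` helper are hypothetical names introduced for illustration and are not part of Microsoft's framework. The example strings are written with quoted string arguments so that they parse as Python expressions, which is an assumption about the original notation.

```python
import ast
from dataclasses import dataclass


@dataclass
class Action:
    """Hypothetical structured form of one predicted action."""
    function: str
    args: dict


def parse_action(text: str) -> Action:
    """Parse a call-like action string into a function name and keyword arguments."""
    node = ast.parse(text, mode="eval").body
    if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
        raise ValueError(f"not a simple function call: {text!r}")
    if node.args:
        raise ValueError("this sketch only supports keyword arguments")

    args = {}
    for kw in node.keywords:
        if kw.arg is None:
            raise ValueError("**kwargs expansion is not supported")
        try:
            # Plain literals (strings, numbers, booleans) become Python values.
            args[kw.arg] = ast.literal_eval(kw.value)
        except ValueError:
            # Anything else, such as a nested Button(...) reference,
            # is kept as source text for a later resolution step.
            args[kw.arg] = ast.unparse(kw.value)
    return Action(function=node.func.id, args=args)


if __name__ == "__main__":
    print(parse_action('select_text(text="hello")'))
    print(parse_action('click(on=Button("20"), how="left", double=False)'))
```

Parsing with `ast` instead of calling `eval` means a model output is never executed directly. The agent can first check the function name and arguments against the set of actions it allows before anything touches the UI.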
A staged training strategy, consisting of four phases, is employed:

Phase 1: Task-Plan Pretraining: This phase focuses on training the model to generate coherent and logical plans for various tasks, utilizing a dataset of 76,672 task-plan pairs. This pretraining establishes a foundational understanding of task structures, enabling the model to decompose tasks into logical steps.

Phase 2: Learning from Experts: The model learns to execute actions by imitating expert-labeled task-action trajectories. This phase aligns plan generation with actionable steps, teaching the model how to perform actions based on observed UI states and corresponding actions.

Phase 3: Self-Boosting Exploration: This phase encourages the model to explore and handle tasks that even expert demonstrations failed to solve. By interacting with the environment and trying alternative strategies, the model autonomously generates new success cases, promoting diversity and adaptability.

Phase 4: Learning from a Reward Model: This phase incorporates reinforcement learning (RL) principles to optimize decision-making. A reward model is trained on success and failure data to predict the quality of actions. This model is then used to fine-tune the LAM in an offline RL setting, allowing the model to learn from failures and improve action selection without additional environmental interactions.

Image Credit: Microsoft Research

Integration and Grounding
The trained LAM is integrated into an agent framework, enabling interaction with external tools, maintaining memory, and interfacing with the environment. This integration transforms the model into a functional agent capable of making meaningful impacts in the physical world. Microsoft's UFO, a GUI agent for Windows OS interaction, exemplifies this integration.
The AppAgent within UFO serves as the operational platform for the LAM.

Evaluation
Rigorous evaluation processes are essential to assess the reliability, robustness, and safety of the LAM before real-world deployment. This evaluation involves testing the model in a variety of scenarios to ensure generalization across different environments and tasks, as well as effective handling of unexpected situations. Both offline and online evaluations are conducted:

Offline Evaluation: The LAM's performance is assessed using an offline dataset in a controlled, static environment. This allows for systematic analysis of task success rates, precision, and recall metrics.

Online Evaluation: The LAM's performance is evaluated in a real-world environment. This involves measuring aspects like task completion accuracy, efficiency, and effectiveness.

Image Credit: Microsoft Research

Key Building Blocks: Essential Features of LAMs
Several key building blocks empower LAMs to perform complex real-world tasks:

Action Generation: The ability to translate user intentions into actionable steps grounded in the environment is a defining feature of LAMs. These actions can manifest as operations on graphical user interfaces (GUIs), API calls for software applications, physical manipulations by robots, or even code generation.

Dynamic Planning and Adaptation: LAMs are capable of decomposing complex tasks into subtasks and dynamically adjusting their plans in response to environmental changes. This adaptive planning ensures robust performance in dynamic, real-world scenarios where unexpected situations are common.

Specialization and Efficiency: LAMs can be tailored for specific domains or tasks, achieving high accuracy and efficiency within their operational scope.
This specialization allows for reduced computational overhead and improved response times compared to general-purpose LLMs.

Agent Systems: Agent systems provide the operational framework for LAMs, equipping them with tools, memory, and feedback mechanisms. This integration allows LAMs to interact with the world and execute actions effectively. UFO's AppAgent, for example, employs components like action executors, memory, and environment data collection to facilitate seamless interaction between the LAM and the Windows OS environment.

The UFO Agent: Grounding LAMs in Windows OS
Microsoft's UFO agent exemplifies the integration and grounding of LAMs in a real-world environment. Key aspects of UFO include:

Architecture: UFO comprises a HostAgent for decomposing user requests into subtasks and an AppAgent for executing these subtasks within specific applications. This hierarchical structure facilitates the handling of complex, cross-application tasks.

AppAgent Structure: The AppAgent, where the LAM resides, consists of:

Environment Data Collection: The agent gathers information about the application environment, including UI elements and their properties, to provide context for the LAM.

LAM Inference Engine: The LAM, serving as the brain of the AppAgent, processes the collected information and infers the necessary actions to fulfill the user request.

Action Executor: This component grounds the LAM's predicted actions, translating them into concrete interactions with the application's UI, such as mouse clicks, keyboard inputs, or API calls.

Memory: The agent maintains a memory of previous actions and plans, providing crucial context for the LAM to make informed and adaptive decisions.

Image Credit: Microsoft Research

Evaluation and Performance: Benchmarking LAMs
Microsoft employs a comprehensive evaluation framework to assess the performance of LAMs in both controlled and real-world environments.
Key metrics include:

Task Success Rate (TSR): This measures the percentage of tasks successfully completed out of the total attempted. It evaluates the agent's ability to accurately and reliably complete tasks.

Task Completion Time: This measures the total time taken to complete a task, from the initial request to the final action. It reflects the efficiency of the LAM and agent system.

Object Accuracy: This measures the accuracy of selecting the correct UI element for each task step. It assesses the agent's ability to interact with the appropriate UI components.

Step Success Rate (SSR): This measures the percentage of individual steps completed successfully within a task. It provides a granular assessment of action execution accuracy.

In online evaluations using Microsoft Word as the target application, LAM achieved a TSR of 71.0%, demonstrating competitive performance compared to baseline models like GPT-4o. Importantly, LAM exhibited superior efficiency, achieving the shortest task completion times and lowest average step latencies. These results underscore the efficacy of Microsoft's framework in building LAMs that are not only accurate but also efficient in real-world applications.

Limitations
Despite the advancements made, LAMs are still in their early stages of development. Key limitations and future research areas include:

Safety Risks: The ability of LAMs to interact with the real world introduces potential safety concerns. Robust mechanisms are needed to ensure that LAMs operate safely and reliably, minimizing the risk of unintended consequences.

Ethical Considerations: The development and deployment of LAMs raise ethical considerations, particularly regarding bias, fairness, and accountability. Future research needs to address these concerns to ensure responsible LAM development and deployment.

Scalability and Adaptability: Scaling LAMs to new domains and tasks can be challenging due to the need for extensive data collection and training.
Developing more efficient training methods and exploring techniques like transfer learning are crucial for enhancing the scalability and adaptability of LAMs.

Conclusion
Microsoft's framework for building LAMs represents a significant advancement in AI, enabling a shift from passive language understanding to active real-world engagement. The framework's comprehensive approach, encompassing data collection, model training, agent integration, and rigorous evaluation, provides a robust foundation for building LAMs. While challenges remain, the transformative potential of LAMs in revolutionizing human-computer interaction and automating complex tasks is undeniable. Continued research and development efforts will pave the way for more sophisticated, reliable, and ethically sound LAM applications, bringing us closer to a future where AI seamlessly integrates with our lives, augmenting human capabilities and transforming our interaction with the world around us.
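
As a worked illustration of the evaluation metrics described earlier (Task Success Rate, Step Success Rate, Object Accuracy, and Task Completion Time), the sketch below computes them from a list of logged task runs. The `StepResult` and `TaskResult` records and the `summarize` function are hypothetical names introduced here. The aggregation is an assumption: step-level metrics are pooled over all steps, and the paper's exact definitions may differ, for example by averaging per task.

```python
from dataclasses import dataclass, field


@dataclass
class StepResult:
    success: bool         # the step executed as intended
    object_correct: bool  # the correct UI element was selected


@dataclass
class TaskResult:
    completed: bool       # the whole task reached its goal
    duration_s: float     # time from initial request to final action
    steps: list = field(default_factory=list)


def summarize(results):
    """Aggregate task-level and step-level metrics over a list of task runs."""
    steps = [step for result in results for step in result.steps]
    if not results or not steps:
        raise ValueError("need at least one task with at least one step")
    n_tasks, n_steps = len(results), len(steps)
    return {
        "tsr": sum(r.completed for r in results) / n_tasks,
        "ssr": sum(s.success for s in steps) / n_steps,
        "object_accuracy": sum(s.object_correct for s in steps) / n_steps,
        "mean_time_s": sum(r.duration_s for r in results) / n_tasks,
    }


if __name__ == "__main__":
    runs = [
        TaskResult(True, 10.0, [StepResult(True, True), StepResult(True, True)]),
        TaskResult(False, 20.0, [StepResult(True, True), StepResult(False, False)]),
    ]
    print(summarize(runs))
```

Keeping step-level and task-level results separate shows why the two rates can diverge: a run can complete most of its steps correctly and still fail the task because of a single wrong action.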