
Marktechpost AI shared a link

2025-05-23 02:36:00 ·

Microsoft AI Introduces Magentic-UI: An Open-Source Agent Prototype that Works with People to Complete Complex Tasks that Require Multi-Step Planning and Browser Use

Modern web usage spans many digital interactions, from filling out forms and managing accounts to executing data queries and navigating complex dashboards. Despite the web being deeply intertwined with productivity and work processes, many of these actions still demand repetitive human input. This scenario is especially true for environments that require detailed instructions or decisions beyond mere searches. While artificial intelligence agents have emerged to support task automation, many prioritize complete autonomy. However, this frequently sidelines user control, leading to outcomes that diverge from user expectations. The next leap forward in productivity-enhancing AI involves agents designed not to replace users but to collaborate with them, blending automation with continuous, real-time human input for more accurate and trusted results.
A key challenge in deploying AI agents for web-based tasks is the lack of visibility and of opportunities to intervene. Users often cannot see what steps the agent is planning, how it intends to execute them, or when it might go off track. In scenarios that involve consequential decisions, like entering payment information, interpreting dynamic content, or running scripts, users need mechanisms to step in and redirect the process. Without these capabilities, systems risk making irreversible mistakes or drifting from user goals. This highlights a significant limitation in current AI automation: the absence of structured human-in-the-loop design, in which users actively guide and supervise agent behavior rather than acting merely as spectators.
Previous solutions approached web automation through rule-based scripts or general-purpose AI agents driven by language models. These systems interpret user commands and attempt to carry them out autonomously. However, they often execute plans without surfacing intermediate decisions or allowing meaningful user feedback. A few offer command-line-like interactions, which are inaccessible to the average user and rarely include layered safety mechanisms. Moreover, minimal support for task reuse or performance learning across sessions limits long-term value. These systems also tend to lack adaptability when the context changes mid-task or errors must be corrected collaboratively.
Researchers at Microsoft introduced Magentic-UI, an open-source prototype that emphasizes collaborative human-AI interaction for web-based tasks. Unlike previous systems aiming for full independence, this tool promotes real-time co-planning, execution sharing, and step-by-step user oversight. Magentic-UI is built on Microsoft’s AutoGen framework and is tightly integrated with Azure AI Foundry Labs. It’s a direct evolution from the previously introduced Magentic-One system. With its launch, Microsoft Research aims to address fundamental questions about human oversight, safety mechanisms, and learning in agentic systems by offering an experimental platform for researchers and developers.
Magentic-UI includes four core interactive features: co-planning, co-tasking, action guards, and plan learning. Co-planning lets users view and adjust the agent’s proposed steps before execution begins, offering full control over what the AI will do. Co-tasking enables real-time visibility during operation, letting users pause, edit, or take over specific actions. Action guards are customizable confirmations for high-risk activities like closing browser tabs or clicking “submit” on a form, actions that could have unintended consequences. Plan learning allows Magentic-UI to remember and refine steps for future tasks, improving over time through experience. These capabilities are supported by a modular team of agents: the Orchestrator leads planning and decision-making, WebSurfer handles browser interactions, Coder executes code in a sandbox, and FileSurfer interprets files and data.

Technically, when a user submits a request, the Orchestrator agent generates a step-by-step plan. Users can modify it through a graphical interface by editing, deleting, or regenerating steps. Once finalized, the plan is delegated across specialized agents. Each agent reports after performing its task, and the Orchestrator determines whether to proceed, repeat, or request user feedback. All actions are visible on the interface, and users can halt execution at any point. This architecture not only ensures transparency but also allows for adaptive task flows. For example, if a step fails due to a broken link, the Orchestrator can dynamically adjust the plan with user consent.
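The control flow described above (delegate a step, read the agent's report, then proceed, repeat, or ask the user) can be sketched in a few lines of Python. This is a minimal illustration only, not Magentic-UI's implementation: the `Step` and `Report` types, the `run_plan` function, the retry limit, and the "retry"/"skip"/"stop" vocabulary are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    agent: str        # name of the agent that should handle this step
    instruction: str  # what the agent is asked to do


@dataclass
class Report:
    ok: bool     # did the agent consider the step successful?
    detail: str  # short human-readable summary shown to the user


# Hypothetical agent interface: a function from an instruction to a report.
Agent = Callable[[str], Report]


def run_plan(plan: List[Step], agents: Dict[str, Agent],
             ask_user: Callable[[Step, Report], str],
             max_retries: int = 1) -> List[str]:
    """Run steps in order; repeat a failed step, then hand the decision to the user.

    ask_user must return "retry", "skip", or "stop".
    """
    log: List[str] = []
    for step in plan:
        attempts = 0
        while True:
            report = agents[step.agent](step.instruction)
            attempts += 1
            log.append(f"{step.agent}: {report.detail}")
            if report.ok:
                break                      # proceed to the next step
            if attempts <= max_retries:
                continue                   # repeat the step automatically
            decision = ask_user(step, report)
            log.append(f"user: {decision}")
            if decision == "retry":
                attempts = 0
                continue
            if decision == "stop":
                return log                 # user halted execution
            break                          # "skip": move on without this step
    return log


# Toy agents standing in for WebSurfer and Coder.
def web_surfer(instruction: str) -> Report:
    return Report(False, "broken link")


def coder(instruction: str) -> Report:
    return Report(True, "script ran")


plan = [Step("WebSurfer", "open the pricing page"),
        Step("Coder", "summarise the table")]
log = run_plan(plan, {"WebSurfer": web_surfer, "Coder": coder},
               ask_user=lambda step, report: "skip")
print("\n".join(log))
```

In this toy run the browser step fails twice, the user is consulted once, and execution continues with the next agent. Every entry in `log` corresponds to something the user could see on the interface.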
In controlled evaluations, Magentic-UI was tested on the GAIA benchmark, which includes complex tasks like navigating the web and interpreting documents. GAIA consists of 162 tasks requiring multimodal understanding. When operating autonomously, Magentic-UI completed 30.3% of tasks successfully. When supported by a simulated user with access to additional task information, the success rate rose to 51.9%, a relative improvement of about 71%. Another configuration, using a smarter simulated user, reached 42.6%. Interestingly, Magentic-UI requested help in only 10% of the enhanced tasks and asked for final answers in 18%. In those cases, the system asked for help an average of just 1.1 times. This suggests that minimal but well-timed human intervention can significantly boost task completion without high oversight costs.
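The 71% figure is a relative gain over the autonomous baseline, not a difference in percentage points. A short check using only the rates quoted above makes the distinction explicit (the function name `relative_improvement` is just a label for this example):

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Relative gain of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


autonomous = 30.3         # % of GAIA tasks completed with no help
with_side_info = 51.9     # simulated user with additional task information
with_smarter_user = 42.6  # smarter simulated user

print(round(with_side_info - autonomous, 1))                              # 21.6 percentage points
print(round(relative_improvement(autonomous, with_side_info), 1))         # 71.3 (% relative)
print(round(relative_improvement(autonomous, with_smarter_user), 1))      # 40.6 (% relative)
```

So the jump from 30.3% to 51.9% is 21.6 percentage points, which is about 71% of the baseline.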

Magentic-UI also features a “Saved Plans” gallery that stores strategies from past tasks for reuse. Retrieval from this gallery is approximately three times faster than generating a new plan. A predictive mechanism surfaces these plans while users type, streamlining repeated tasks like flight searches or form submissions. Safety is handled in layers. Every browser or code action runs inside a Docker container, so that user credentials are not exposed to the agents. Users can define allow-lists for site access, and every action can be gated behind approval prompts. A red-team evaluation further tested it against phishing attacks and prompt injections, where the system either sought user clarification or blocked execution, reinforcing its layered defense model.
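To make the allow-list and approval-prompt idea concrete, here is a small Python sketch of how such a gate might be composed. It is an assumption-laden illustration, not Magentic-UI's code: the `RISKY_ACTIONS` set, the `host_allowed` and `guard` functions, and the three return values are hypothetical names for this example.

```python
from typing import Callable, Iterable
from urllib.parse import urlparse

# Hypothetical set of actions that should always trigger a confirmation.
RISKY_ACTIONS = {"submit_form", "close_tab", "run_code"}


def host_allowed(url: str, allow_list: Iterable[str]) -> bool:
    """True if the URL's host is an allow-listed domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == domain or host.endswith("." + domain)
               for domain in allow_list)


def guard(action: str, url: str, allow_list: Iterable[str],
          approve: Callable[[str, str], bool]) -> str:
    """Return "blocked", "declined", or "allowed" for a proposed action."""
    if not host_allowed(url, allow_list):
        return "blocked"    # site is outside the user's allow-list
    if action in RISKY_ACTIONS and not approve(action, url):
        return "declined"   # user refused the confirmation prompt
    return "allowed"


allow = ["example.com"]
always_no = lambda action, url: False

print(guard("click", "https://docs.example.com/a", allow, always_no))
print(guard("submit_form", "https://example.com/checkout", allow, always_no))
print(guard("click", "https://example.com.evil.test/login", allow, always_no))
```

The suffix check is done on the parsed hostname rather than on the raw URL string, so a look-alike host such as `example.com.evil.test` does not pass as `example.com`.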

Several Key Takeaways from the Research on Magentic-UI:

With simple human input, Magentic-UI boosts task completion by 71% (from 30.3% to 51.9%).
Requests user help in only 10% of enhanced tasks, averaging 1.1 help requests in those cases.
It features a co-planning UI that allows full user control before execution.
Executes tasks via four modular agents: Orchestrator, WebSurfer, Coder, and FileSurfer.
Stores and reuses plans; retrieving a saved plan is roughly 3x faster than generating a new one.
All actions are sandboxed via Docker containers; no user credentials are ever exposed.
Passed red-team evaluations against phishing and injection threats.
Supports fully user-configurable “action guards” for high-risk steps.
Fully open-source and integrated with Azure AI Foundry Labs.

In conclusion, Magentic-UI addresses a long-standing problem in AI automation, the lack of transparency and controllability. Rather than replacing users, it enables them to remain central to the process. The system performs well even with minimal help and learns to improve each time. The modular design, robust safeguards, and detailed interaction model create a strong foundation for future intelligent assistants.

About the author: Asif Razzaq is the CEO of Marktechpost Media Inc.


Source: www.marktechpost.com

Marktechpost AI shared a link

2025-05-22 18:31:34 ·

Anthropic Releases Claude Opus 4 and Claude Sonnet 4: A Technical Leap in Reasoning, Coding, and AI Agent Design

Anthropic has announced the release of its next-generation language models: Claude Opus 4 and Claude Sonnet 4. The update marks a significant technical refinement in the Claude model family, particularly in areas involving structured reasoning, software engineering, and autonomous agent behaviors.
This release is not another reinvention but a focused improvement, bringing increased consistency, interpretability, and performance across complex reasoning tasks. With extended context handling, long-horizon planning, and more efficient coding capabilities, these models reflect a maturing shift toward functional generalist systems that can serve a range of high-complexity applications.
Claude Opus 4: Scaling Advanced Reasoning and Multi-file Code Understanding
Positioned as the flagship model, Claude Opus 4 has been benchmarked as Anthropic’s most capable model to date. Designed to handle intricate reasoning workflows and software development scenarios, Opus 4 has achieved:

72.5% accuracy on the SWE-bench benchmark, which tests models against real-world GitHub issue resolution.
43.2% on TerminalBench, which evaluates correctness in terminal-based code generation tasks requiring multi-step planning.

A notable aspect of Claude Opus 4 is its agentic behavior in software environments. In practical testing, the model was able to autonomously sustain nearly seven hours of uninterrupted code generation and task execution. This is a marked improvement from Claude 3 Opus, which previously sustained such tasks for under an hour.
These improvements are attributed to enhanced memory management, broader context retention, and a more robust internal planning loop. From a developer’s perspective, Opus 4 reduces the need for frequent interventions and exhibits stronger consistency in handling edge cases across software stacks.

Claude Sonnet 4: A Balanced Model for General Reasoning and Code Tasks
Claude Sonnet 4 replaces its predecessor, Claude Sonnet 3.7, with a more stable and balanced architecture that brings improvements in both speed and quality without significantly increasing computational costs.
Sonnet 4 is optimized for mid-scale deployments where cost-performance trade-offs are critical. While not matching Opus 4’s reasoning ceiling, it inherits many architectural upgrades, including support for multi-file code navigation, intermediate tool use, and structured text processing with improved latency.
It serves as the new default model for free-tier users on Claude.ai and is also available via API. This makes Sonnet 4 a practical option for lightweight development tools, user-facing assistants, and analytical pipelines requiring consistent but less intensive model calls.
Architectural Highlights: Hybrid Reasoning and Extended Thinking
Both models incorporate hybrid reasoning capabilities, introducing two distinct response modes:

Fast Mode for low-latency responses suitable for short prompts and conversational tasks.
Extended Thinking Mode for computationally intensive tasks requiring deeper inference, longer memory chains, or multi-turn agentic behavior.

This dual-mode reasoning strategy allows users to dynamically allocate compute and latency budgets based on task complexity. It is especially relevant in agent frameworks, where LLMs must balance fast reaction time with deliberative planning.
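One way an agent framework might act on this is to decide per request whether to ask for extended thinking and how large a budget to grant. The sketch below only builds the request parameters and makes no network call. Several things in it are assumptions: the model identifier, the helper name `build_request`, and the shape of the `thinking` field, which follows the form Anthropic's Messages API is understood to accept. Verify all of them against the provider's current documentation before use.

```python
from typing import Any, Dict

# Assumed model identifier; check the provider's model list before use.
MODEL_ID = "claude-sonnet-4-20250514"


def build_request(prompt: str, deliberate: bool,
                  thinking_budget: int = 4096) -> Dict[str, Any]:
    """Build request parameters for a fast or an extended-thinking call.

    Hypothetical helper: it returns a plain dict and sends nothing.
    """
    request: Dict[str, Any] = {
        "model": MODEL_ID,
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": prompt}],
    }
    if deliberate:
        # Assumed field shape for enabling extended thinking with a token budget.
        request["thinking"] = {"type": "enabled", "budget_tokens": thinking_budget}
        # Leave room for the visible answer on top of the thinking budget.
        request["max_tokens"] = thinking_budget + 1024
    return request


fast = build_request("Rename this variable.", deliberate=False)
deep = build_request("Plan a multi-file refactor.", deliberate=True, thinking_budget=2048)
print(sorted(fast))
print(deep["thinking"], deep["max_tokens"])
```

Keeping the decision in one function makes the latency and compute trade-off explicit: short conversational turns skip the thinking budget entirely, while planning steps pay for it deliberately.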
Deployment and Integration
Claude Opus 4 and Sonnet 4 are accessible through multiple cloud platforms:

Anthropic’s Claude API
Amazon Bedrock
Google Cloud Vertex AI

This cross-platform availability simplifies model deployment into diverse enterprise environments, supporting use cases ranging from autonomous agents to code analysis, decision support, and retrieval-augmented generation pipelines.
Conclusion
The Claude 4 series does not introduce radical design changes but instead demonstrates measured improvements in reliability, interpretability, and task generalization. With Claude Opus 4, Anthropic positions itself firmly in the upper tier of AI model providers for reasoning and coding automation. Meanwhile, Claude Sonnet 4 offers a technically sound, cost-efficient entry point for developers and researchers working on mid-scale AI applications.
For engineering teams evaluating LLMs for long-context planning, software agents, or structured data workflows, the Claude 4 models present a competitive, technically capable alternative.

Check out the Technical details and Get started today on Claude, Claude Code, or the platform of your choice. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among audiences.
Marktechpost AI shared a link

2025-05-22 08:56:37 ·

Technology Innovation Institute TII Releases Falcon-H1: Hybrid Transformer-SSM Language Models for Scalable, Multilingual, and Long-Context Understanding

Addressing Architectural Trade-offs in Language Models
As language models scale, balancing expressivity, efficiency, and adaptability becomes increasingly challenging. Transformer architectures dominate due to their strong performance across a wide range of tasks, but the quadratic complexity of self-attention makes them computationally expensive, particularly for long-context scenarios. Structured State Space Models (SSMs), on the other hand, offer improved efficiency and linear scaling, yet often lack the nuanced sequence modeling required for complex language understanding. A combined architecture that leverages the strengths of both approaches is needed to support diverse applications across environments.
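To make the scaling gap concrete, the sketch below counts the work each approach does as sequence length grows. It is a back-of-the-envelope illustration only: the function names are hypothetical, and the counts ignore constant factors such as hidden size, head count, and state dimension.

```python
def attention_pair_count(seq_len: int) -> int:
    """Number of query-key score entries in full self-attention: n * n."""
    return seq_len * seq_len


def ssm_step_count(seq_len: int) -> int:
    """Number of recurrent state updates in a sequential SSM scan: n."""
    return seq_len


for n in (1_000, 8_000, 64_000):
    ratio = attention_pair_count(n) // ssm_step_count(n)
    print(f"n={n:>6}: attention entries={attention_pair_count(n):>13,} "
          f"ssm steps={ssm_step_count(n):>6,} ratio={ratio:,}")
```

Doubling the sequence length quadruples the attention term but only doubles the SSM term, which is why the difference matters most at long context lengths.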
Introducing Falcon-H1: A Hybrid Architecture
The Falcon-H1 series, released by the Technology Innovation Institute (TII), introduces a hybrid family of language models that combine Transformer attention mechanisms with Mamba2-based SSM components. This architecture is designed to improve computational efficiency while maintaining competitive performance across tasks requiring deep contextual understanding.
Falcon-H1 covers a wide parameter range—from 0.5B to 34B—catering to use cases from resource-constrained deployments to large-scale distributed inference. The design aims to address common bottlenecks in LLM deployment: memory efficiency, scalability, multilingual support, and the ability to handle extended input sequences.

Source: https://falcon-lm.github.io/blog/falcon-h1/
Architectural Details and Design Objectives
Falcon-H1 adopts a parallel structure where attention heads and Mamba2 SSMs operate side by side. This design allows each mechanism to independently contribute to sequence modeling: attention heads specialize in capturing token-level dependencies, while SSM components support efficient long-range information retention.
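The side-by-side idea can be sketched with a toy block in plain Python. This is not Falcon-H1's implementation and not Mamba2: the attention branch is a single head with queries, keys, and values all equal to the input, the SSM branch is a fixed-decay linear recurrence standing in for a real selective state-space layer, and merging the branches by concatenation is an assumption made for illustration. All function names are hypothetical.

```python
import math


def causal_attention(xs: list[list[float]]) -> list[list[float]]:
    """Single-head causal self-attention with queries = keys = values = xs."""
    dim = len(xs[0])
    out = []
    for t, query in enumerate(xs):
        # Scaled dot-product scores against positions 0..t only (causal mask).
        scores = [sum(q * k for q, k in zip(query, xs[s])) / math.sqrt(dim)
                  for s in range(t + 1)]
        peak = max(scores)
        weights = [math.exp(score - peak) for score in scores]
        total = sum(weights)
        out.append([sum(w * xs[s][d] for s, w in enumerate(weights)) / total
                    for d in range(dim)])
    return out


def diagonal_ssm(xs: list[list[float]], decay: float = 0.9) -> list[list[float]]:
    """Linear recurrence h_t = decay * h_{t-1} + (1 - decay) * x_t per channel."""
    state = [0.0] * len(xs[0])
    out = []
    for x in xs:
        state = [decay * h + (1.0 - decay) * v for h, v in zip(state, x)]
        out.append(list(state))
    return out


def parallel_hybrid_block(xs: list[list[float]]) -> list[list[float]]:
    """Run both branches on the same input and concatenate their outputs."""
    attn = causal_attention(xs)
    ssm = diagonal_ssm(xs)
    return [a + s for a, s in zip(attn, ssm)]  # list concatenation per token


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = parallel_hybrid_block(tokens)
```

Each output token carries both views of the sequence: the attention half re-weights every earlier token explicitly, while the recurrence half summarizes history in a fixed-size state. A real model would typically follow the merge with a learned projection.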
The series supports a context length of up to 256K tokens, which is particularly useful for applications in document summarization, retrieval-augmented generation, and multi-turn dialogue systems. Model training incorporates a customized microparameterization (μP) recipe and optimized data pipelines, allowing for stable and efficient training across model sizes.
The models are trained with a focus on multilingual capabilities. The architecture is natively equipped to handle 18 languages, with coverage including English, Chinese, Arabic, Hindi, French, and others. The framework is extensible to over 100 languages, supporting localization and region-specific model adaptation.
Empirical Results and Comparative Evaluation
Despite relatively modest parameter counts, Falcon-H1 models demonstrate strong empirical performance:

Falcon-H1-0.5B achieves results comparable to 7B-parameter models released in 2024.
Falcon-H1-1.5B-Deep performs on par with leading 7B to 10B Transformer models.
Falcon-H1-34B matches or exceeds the performance of models such as Qwen3-32B, Llama4-Scout-17B/109B, and Gemma3-27B across several benchmarks.

Evaluations emphasize both general-purpose language understanding and multilingual benchmarks. Notably, the models achieve strong performance across both high-resource and low-resource languages without requiring excessive fine-tuning or additional adaptation layers.

Source: https://falcon-lm.github.io/blog/falcon-h1/
Deployment and inference are supported through integration with open-source tools such as Hugging Face Transformers. FlashAttention-2 compatibility further reduces memory usage during inference, offering an attractive efficiency-performance balance for enterprise use.
Conclusion
Falcon-H1 represents a methodical effort to refine language model architecture by integrating complementary mechanisms—attention and SSMs—within a unified framework. By doing so, it addresses key limitations in both long-context processing and scaling efficiency. The model family provides a range of options for practitioners, from lightweight variants suitable for edge deployment to high-capacity configurations for server-side applications.
Through its multilingual coverage, long-context capabilities, and architectural flexibility, Falcon-H1 offers a technically sound foundation for research and production use cases that demand performance without compromising on efficiency or accessibility.

Check out the Official Release, Models on Hugging Face and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter.
NeowinFeed shared a link

2025-05-22 06:25:24 ·

Vercel releases first AI model for v0, now in beta


David Uzondu · Neowin · May 22, 2025 02:18 EDT

Google recently showered us with AI goodies, including Gemma 3n, an AI model that's designed to run on low-end devices, like smartphones. Now, Vercel has stepped further into the ring with its own generative UI system, v0, by releasing its very first dedicated model. If you do not know what v0 is, it is a sort of competitor to tools like the recently announced Google Stitch, which also aims to let you describe a user interface and have AI generate the design. The tool first saw the light of day back in 2023 as an invite-only beta, promising to turn natural language into front-end code.
The newly available model is dubbed v0-1.0-md, and Vercel states it is specifically designed for building modern web applications. This multimodal model supports both text and image inputs, offers a 128,000-token context window with a 32,000-token output limit, and is priced at $3 per million input tokens and $15 per million output tokens.
It offers features like 'auto-fix' for common coding blunders and 'quick edit' for streaming inline changes as they are generated. Crucially, v0-1.0-md uses an OpenAI-compatible API, meaning you can plug it into existing tools like Cursor, Codex, or your own custom applications that already speak OpenAI's language, including Vercel's own AI SDK. It even supports function and tool calls, and promises low-latency streaming responses. Developers can poke around with this new model in the Vercel AI Playground to see how it handles different prompts.

Currently, access to the v0 API, and thus the v0-1.0-md model, is in beta, and you will need a Premium or Team plan on Vercel with usage-based billing enabled. To get started, you grab an API key from v0.dev and send POST requests to the api.v0.dev/v1/chat/completions endpoint, authenticating with a bearer token. There is a daily limit of around 200 messages, along with context size constraints that mirror the model's advertised capabilities, but Vercel notes you can request higher limits if you hit those ceilings.
If you want to dig into the details or see how to set it up, the official v0 docs on Vercel's site have everything you need, including examples.
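For readers who want a feel for what such a call might look like, here is a minimal sketch in Python using only the standard library. The endpoint path, model name, and bearer-token authentication come from the article. The `https://` scheme, the `V0_API_KEY` environment variable name, the helper name `build_v0_request`, and the OpenAI-style request and response shapes are assumptions for illustration, so check the official v0 docs before relying on any of them.

```python
import json
import os
import urllib.request

# Endpoint and model name as given in the article; the https scheme is assumed.
V0_ENDPOINT = "https://api.v0.dev/v1/chat/completions"
V0_MODEL = "v0-1.0-md"


def build_v0_request(prompt: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": V0_MODEL,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        V0_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    # V0_API_KEY is a hypothetical environment variable name for this sketch.
    key = os.environ.get("V0_API_KEY")
    if key:
        request = build_v0_request("Create a pricing page with three tiers", key)
        with urllib.request.urlopen(request) as response:
            body = json.load(response)
        # Assumes the OpenAI-style response shape: choices[0].message.content
        print(body["choices"][0]["message"]["content"])
```

Keeping request construction separate from sending makes it easy to inspect the headers and payload without spending any of the daily message quota.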


·162 Views

Marktechpost AI shared a link

2025-05-22 04:40:48 ·

Google DeepMind Releases Gemma 3n: A Compact, High-Efficiency Multimodal AI Model for Real-Time On-Device Use

Researchers are reimagining how models operate as demand skyrockets for faster, smarter, and more private AI on phones, tablets, and laptops. The next generation of AI isn’t just lighter and faster; it’s local. By embedding intelligence directly into devices, developers are unlocking near-instant responsiveness, slashing memory demands, and putting privacy back into users’ hands. With mobile hardware rapidly advancing, the race is on to build compact, lightning-fast models that are intelligent enough to redefine everyday digital experiences.
A major concern is delivering high-quality, multimodal intelligence within the constrained environments of mobile devices. Unlike cloud-based systems that have access to extensive computational power, on-device models must perform under strict RAM and processing limits. Multimodal AI, capable of interpreting text, images, audio, and video, typically requires large models, which most mobile devices cannot handle efficiently. Also, cloud dependency introduces latency and privacy concerns, making it essential to design models that can run locally without sacrificing performance.
Earlier models like Gemma 3 and Gemma 3 QAT attempted to bridge this gap by reducing size while maintaining performance. Designed for use on cloud or desktop GPUs, they significantly improved model efficiency. However, these models still required robust hardware and could not fully overcome mobile platforms’ memory and responsiveness constraints. Despite supporting advanced functions, they often involved compromises limiting their real-time smartphone usability.
Researchers from Google and Google DeepMind introduced Gemma 3n. The architecture behind Gemma 3n has been optimized for mobile-first deployment, targeting performance across Android and Chrome platforms. It also forms the underlying basis for the next version of Gemini Nano. The innovation represents a significant leap forward by supporting multimodal AI functionalities with a much lower memory footprint while maintaining real-time response capabilities. This marks the first open model built on this shared infrastructure and is made available to developers in preview, allowing immediate experimentation.

The core innovation in Gemma 3n is the application of Per-Layer Embeddings (PLE), a method that drastically reduces RAM usage. While the raw model sizes are 5 billion and 8 billion parameters, the models run with memory footprints comparable to those of 2 billion and 4 billion parameter models. The dynamic memory consumption is just 2GB for the 5B model and 3GB for the 8B version. The model also uses a nested configuration in which the model with a 4B active memory footprint includes a 2B submodel, trained through a technique known as MatFormer. This allows developers to dynamically switch performance modes without loading separate models. Further advancements include key-value cache (KVC) sharing and activation quantization, which reduce latency and increase response speed. For example, responses on mobile are 1.5x faster than with Gemma 3 4B, with better output quality.
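To make the nested-submodel idea more concrete, the toy sketch below shows one way a smaller model can live inside a larger one: the small configuration uses only the first few hidden units of a feed-forward layer, while the large one uses all of them, so both share a single set of weights. This is an illustration of the general MatFormer-style nesting idea under that assumption, not Gemma 3n's actual implementation. The function name `nested_ffn` and the tiny weight matrices are made up for the example.

```python
from typing import List

Matrix = List[List[float]]


def relu(x: float) -> float:
    return x if x > 0.0 else 0.0


def nested_ffn(x: List[float], w_in: Matrix, w_out: Matrix, hidden_units: int) -> List[float]:
    """Feed-forward pass that uses only the first `hidden_units` hidden neurons.

    w_in has one row per hidden neuron (each row is as long as x);
    w_out has one row per hidden neuron (each row is as long as the output).
    """
    out = [0.0] * len(w_out[0])
    for h in range(hidden_units):
        activation = relu(sum(wi * xi for wi, xi in zip(w_in[h], x)))
        for j, w in enumerate(w_out[h]):
            out[j] += activation * w
    return out


# Toy weights (hypothetical): 4 hidden neurons, 2 inputs, 2 outputs.
W_IN = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
W_OUT = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.25, -0.25]]

x = [2.0, 1.0]
small = nested_ffn(x, W_IN, W_OUT, hidden_units=2)  # smaller nested submodel
large = nested_ffn(x, W_IN, W_OUT, hidden_units=4)  # full model, same weights
print(small, large)
```

Because the small configuration is a prefix of the large one, switching between them is a matter of choosing how many hidden units to evaluate rather than loading a second set of weights.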

The performance metrics achieved by Gemma 3n reinforce its suitability for mobile deployment. It excels in automatic speech recognition and translation, allowing seamless speech conversion to translated text. On multilingual benchmarks like WMT24++ (ChrF), it scores 50.1%, highlighting its strength in Japanese, German, Korean, Spanish, and French. Its mix’n’match capability allows the creation of submodels optimized for various quality and latency combinations, offering developers further customization. The architecture supports interleaved inputs from different modalities (text, audio, images, and video), allowing more natural and context-rich interactions. It also runs offline, ensuring privacy and reliability even without network connectivity. Use cases include live visual and auditory feedback, context-aware content generation, and advanced voice-based applications.
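The WMT24++ figure is a ChrF score, a translation metric built on character n-gram overlap between a system output and a reference. The sketch below is a simplified version for intuition only: it averages character n-gram precision and recall for n from 1 to 6 and combines them into an F-score with beta = 2. It is not the official scorer, so its numbers should not be compared with published results; the function names are chosen for this example.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams, ignoring spaces (a simplification)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: mean char n-gram precision and recall, combined as F-beta."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip n-gram orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0.0:
        return 0.0
    return 100.0 * (1 + beta**2) * p * r / (beta**2 * p + r)


print(round(simple_chrf("the cat sat", "the cat sat"), 1))
print(round(simple_chrf("the cat sat", "the hat sat"), 1))
```

With beta = 2, recall counts more heavily than precision, which rewards translations that cover the reference even if they add a little extra material.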

Several Key Takeaways from the Research on Gemma 3n include:

Built in collaboration between Google, DeepMind, Qualcomm, MediaTek, and Samsung System LSI, and designed for mobile-first deployment.
Raw model size of 5B and 8B parameters, with operational footprints of 2GB and 3GB, respectively, using Per-Layer Embeddings.
1.5x faster response on mobile vs Gemma 3 4B. Multilingual benchmark score of 50.1% on WMT24++.
Accepts and understands audio, text, image, and video, enabling complex multimodal processing and interleaved inputs.
Supports dynamic trade-offs using MatFormer training with nested submodels and mix’n’match capabilities.
Operates without an internet connection, ensuring privacy and reliability.
Preview is available via Google AI Studio and Google AI Edge, with text and image processing capabilities.

In conclusion, this innovation provides a clear pathway for making high-performance AI portable and private. By tackling RAM constraints through innovative architecture and enhancing multilingual and multimodal capabilities, researchers offer a viable solution for bringing sophisticated AI directly into everyday devices. The flexible submodel switching, offline readiness, and fast response time mark a comprehensive approach to mobile-first AI. The research addresses the balance of computational efficiency, user privacy, and dynamic responsiveness. The result is a system capable of delivering real-time AI experiences without sacrificing capability or versatility, fundamentally expanding what users can expect from on-device intelligence.

Check out the Technical details and Try it here. All credit for this research goes to the researchers of this project.


·208 Views

Marktechpost AI shared a link

2025-05-22 00:33:58 ·

A Step-by-Step Implementation Tutorial for Building Modular AI Workflows Using Anthropic’s Claude Sonnet 3.7 through API and LangGraph

In this tutorial, we provide a practical guide for implementing LangGraph, a streamlined, graph-based AI orchestration framework, integrated seamlessly with Anthropic’s Claude API. Through detailed, executable code optimized for Google Colab, developers learn how to build and visualize AI workflows as interconnected nodes performing distinct tasks, such as generating concise answers, critically analyzing responses, and automatically composing technical blog content. The compact implementation highlights LangGraph’s intuitive node-graph architecture. It can manage complex sequences of Claude-powered natural language tasks, from basic question-answering scenarios to advanced content generation pipelines.
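from getpass import getpass
import os

# Prompt for the key without echoing it to the screen
anthropic_key = getpass("Enter your Anthropic API key: ")
os.environ["ANTHROPIC_API_KEY"] = anthropic_key

# Confirm the key is stored without printing its value
print("Key set:", "ANTHROPIC_API_KEY" in os.environ)
We securely prompt users to input their Anthropic API key using Python’s getpass module, ensuring sensitive data isn’t displayed. We then set this key as an environment variable (ANTHROPIC_API_KEY) and confirm successful storage.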
import os
import json
import requests
from typing import Dict, List, Any, Callable, Optional, Union
from dataclasses import dataclass, field
import networkx as nx
import matplotlib.pyplot as plt
from IPython.display import display, HTML, clear_output
We import essential libraries for building and visualizing structured AI workflows. These include modules for handling data, graph creation and visualization, interactive notebook display, and type annotations for clarity and maintainability.
try:
    import anthropic
except ImportError:
    print("Installing anthropic package...")
    # Notebook shell command; works in Google Colab or Jupyter
    !pip install -q anthropic
    import anthropic

from anthropic import Anthropic
We ensure the anthropic Python package is available for use. The code attempts to import the module and, if it is not found, automatically installs it using pip in a Google Colab environment. After installation, it imports the Anthropic client, essential for interacting with Claude models via the Anthropic API.
@dataclass
class NodeConfig:
    name: str
    function: Callable
    inputs: List[str] = field(default_factory=list)
    outputs: List[str] = field(default_factory=list)
    config: Dict[str, Any] = field(default_factory=dict)
This NodeConfig data class defines the structure of each node in the LangGraph workflow. Each node has a name, an executable function, optional inputs and outputs, and an optional config dictionary to store additional parameters. This setup allows for modular, reusable node definitions for graph-based AI tasks.
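As a quick illustration of how such a node definition might be used, the sketch below creates two nodes and checks that their optional fields are independent. It assumes the optional fields default to empty containers via default_factory, which is the usual dataclass idiom for mutable defaults. The node function `shout` is hypothetical and exists only for this example.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class NodeConfig:
    name: str
    function: Callable
    inputs: List[str] = field(default_factory=list)       # assumed default
    outputs: List[str] = field(default_factory=list)      # assumed default
    config: Dict[str, Any] = field(default_factory=dict)  # assumed default


def shout(state: Dict[str, Any], **kwargs: Any) -> str:
    """Hypothetical node function: upper-case the 'text' entry of the state."""
    return str(state.get("text", "")).upper()


first = NodeConfig(name="shout", function=shout, outputs=["loud_text"])
second = NodeConfig(name="other", function=shout)

first.inputs.append("text")  # mutate one node's input list
print(first.inputs, second.inputs)
print(first.function({"text": "hello"}))
```

Using default_factory gives every node its own list and dictionary, so editing one node's inputs never leaks into another node.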
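class LangGraph:
    def __init__(self, api_key: Optional[str] = None):
        self.api_key = api_key or os.environ.get("ANTHROPIC_API_KEY")
        if not self.api_key:
            try:
                # Colab secrets; the import fails outside Colab and falls through to the prompt
                from google.colab import userdata
                self.api_key = userdata.get('ANTHROPIC_API_KEY')
                if not self.api_key:
                    raise ValueError("No API key found")
            except Exception:
                print("No Anthropic API key found in environment variables or Colab secrets.")
                self.api_key = input("Please enter your Anthropic API key: ")
                if not self.api_key:
                    raise ValueError("Please provide an Anthropic API key")

        self.client = Anthropic(api_key=self.api_key)
        self.graph = nx.DiGraph()  # dependency graph between nodes
        self.nodes = {}            # node name -> NodeConfig
        self.state = {}            # shared key/value state passed between nodes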

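    def add_node(self, node_config: NodeConfig):
        self.nodes[node_config.name] = node_config
        self.graph.add_node(node_config.name)
        for input_name in node_config.inputs:
            # An input may name a node directly or a state key that an earlier node outputs.
            # Matching on output keys is what actually wires the examples' dependencies.
            for other in self.nodes.values():
                if other.name != node_config.name and (
                    other.name == input_name or input_name in other.outputs
                ):
                    self.graph.add_edge(other.name, node_config.name)
        return self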

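    def claude_node(self, name: str, prompt_template: str,
                    model: str = "claude-3-7-sonnet-20250219",
                    inputs: Optional[List[str]] = None,
                    outputs: Optional[List[str]] = None,
                    system_prompt: Optional[str] = None):
        """Convenience method to create a Claude API node"""
        inputs = inputs or []
        outputs = outputs or [name + "_response"]

        def claude_fn(state, **kwargs):
            # Fill each {key} placeholder with the matching string value from the state
            prompt = prompt_template
            for k, v in state.items():
                if isinstance(v, str):
                    prompt = prompt.replace(f"{{{k}}}", v)

            message_params = {
                "model": model,
                "max_tokens": 1000,
                "messages": [{"role": "user", "content": prompt}]
            }

            if system_prompt:
                message_params["system"] = system_prompt

            response = self.client.messages.create(**message_params)
            return response.content[0].text

        node_config = NodeConfig(
            name=name,
            function=claude_fn,
            inputs=inputs,
            outputs=outputs,
            config={"model": model, "prompt_template": prompt_template}
        )
        return self.add_node(node_config)

    def transform_node(self, name: str, transform_fn: Callable,
                       inputs: Optional[List[str]] = None,
                       outputs: Optional[List[str]] = None):
        """Add a data transformation node"""
        inputs = inputs or []
        outputs = outputs or [name + "_output"]

        node_config = NodeConfig(
            name=name,
            function=transform_fn,
            inputs=inputs,
            outputs=outputs
        )
        return self.add_node(node_config)

    def visualize(self):
        """Visualize the graph"""
        plt.figure(figsize=(10, 6))
        pos = nx.spring_layout(self.graph)
        nx.draw(self.graph, pos, with_labels=True, node_color="lightblue",
                node_size=1500, arrowsize=20, font_size=10)
        plt.title("LangGraph Flow")
        plt.tight_layout()
        plt.show()

        print("\nGraph Structure:")
        for node in self.graph.nodes():
            successors = list(self.graph.successors(node))
            if successors:
                print(f"  {node} → {', '.join(successors)}")
            else:
                print(f"  {node} (endpoint)")
        print()

    def _get_execution_order(self):
        """Determine execution order based on dependencies"""
        try:
            return list(nx.topological_sort(self.graph))
        except nx.NetworkXUnfeasible:
            raise ValueError("Graph contains a cycle")

    def execute(self, initial_state: Optional[Dict[str, Any]] = None):
        """Execute the graph in topological order"""
        self.state = initial_state or {}
        execution_order = self._get_execution_order()

        print("Executing LangGraph flow:")

        for node_name in execution_order:
            print(f"- Running node: {node_name}")
            node = self.nodes[node_name]
            # Pass the declared inputs as keyword arguments alongside the full state
            inputs = {k: self.state.get(k) for k in node.inputs if k in self.state}

            result = node.function(self.state, **inputs)

            if len(node.outputs) == 1:
                self.state[node.outputs[0]] = result
            elif isinstance(result, (list, tuple)) and len(result) == len(node.outputs):
                for i, output_name in enumerate(node.outputs):
                    self.state[output_name] = result[i]

        print("Execution completed!")
        return self.state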

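def run_example(question="What are the key benefits of using a graph-based architecture for AI workflows?"):
    """Run an example LangGraph flow with a predefined question"""
    print(f"Running example with question: '{question}'")

    graph = LangGraph()

    def question_provider(state, **kwargs):
        return question

    graph.transform_node(
        name="question_provider",
        transform_fn=question_provider,
        outputs=["user_question"]
    )

    graph.claude_node(
        name="question_answerer",
        prompt_template="Answer this question clearly and concisely: {user_question}",
        inputs=["user_question"],
        outputs=["answer"],
        system_prompt="You are a helpful AI assistant."
    )

    graph.claude_node(
        name="answer_analyzer",
        prompt_template="Analyze if this answer addresses the question well: Question: {user_question}\nAnswer: {answer}",
        inputs=["user_question", "answer"],
        outputs=["analysis"],
        system_prompt="You are a critical evaluator. Be brief but thorough."
    )

    graph.visualize()

    result = graph.execute()

    print("\n" + "="*50)
    print("EXECUTION RESULTS:")
    print("="*50)
    print(f"\n🔍 QUESTION:\n{result.get('user_question')}\n")
    print(f"📝 ANSWER:\n{result.get('answer')}\n")
    print(f"✅ ANALYSIS:\n{result.get('analysis')}")
    print("="*50 + "\n")

    return graph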
The LangGraph class, defined in this notebook on top of networkx rather than imported from an external package, implements a lightweight framework for constructing and executing graph-based AI workflows with Anthropic’s Claude. It lets users define modular nodes, either Claude-powered prompts or custom transformation functions, connect them through their declared inputs and outputs, visualize the pipeline, and execute the nodes in topological order. The run_example function demonstrates this by building a simple flow in which one node answers a question and a second node evaluates the answer.
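To make "execute them in topological order" concrete without calling the API, here is a minimal sketch that uses stub functions in place of Claude. It swaps networkx for the standard library's `graphlib` (Python 3.9+), and the `run_graph` helper and node names are hypothetical. Dependencies are derived by mapping each input key to the node that produces it, and a cycle is reported as a `ValueError`, as in the class above.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Any, Callable, Dict, List, Optional, Tuple

# Each node: (function, input keys, output key). All names here are hypothetical.
Node = Tuple[Callable[..., Any], List[str], str]


def run_graph(nodes: Dict[str, Node],
              state: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Run `nodes` in dependency order and return the final shared state."""
    state = dict(state or {})
    # Map every output key to the node that produces it.
    producer = {out: name for name, (_, _, out) in nodes.items()}
    # A node depends on the producers of its input keys.
    deps = {
        name: {producer[k] for k in inputs if k in producer}
        for name, (_, inputs, _) in nodes.items()
    }
    try:
        order = list(TopologicalSorter(deps).static_order())
    except CycleError as exc:
        raise ValueError("Graph contains a cycle") from exc
    for name in order:
        fn, inputs, out = nodes[name]
        state[out] = fn(state, **{k: state[k] for k in inputs})
    return state


# Stub functions stand in for the Claude-backed nodes; they are listed out of
# order on purpose to show that the sort, not the dict order, drives execution.
nodes = {
    "analyzer": (lambda s, question, answer: f"checked: {answer}",
                 ["question", "answer"], "analysis"),
    "answerer": (lambda s, question: f"answer to {question}",
                 ["question"], "answer"),
    "provider": (lambda s: "why graphs?", [], "question"),
}
result = run_graph(nodes)
print(result["analysis"])  # checked: answer to why graphs?
```

Because the provider has no dependencies, it runs first; the answerer waits for `question`, and the analyzer waits for both `question` and `answer`.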
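def run_advanced_example():
    """Run a more advanced example with multiple nodes for content generation"""
    graph = LangGraph()

    def topic_selector(state, **kwargs):
        return "Graph-based AI systems"

    graph.transform_node(
        name="topic_selector",
        transform_fn=topic_selector,
        outputs=["topic"]
    )

    graph.claude_node(
        name="outline_generator",
        prompt_template="Create a brief outline for a technical blog post about {topic}. Include 3-4 main sections only.",
        inputs=["topic"],
        outputs=["outline"],
        system_prompt="You are a technical writer specializing in AI technologies."
    )

    graph.claude_node(
        name="intro_writer",
        prompt_template="Write an engaging introduction for a blog post with this outline: {outline}\nTopic: {topic}",
        inputs=["topic", "outline"],
        outputs=["introduction"],
        system_prompt="You are a technical writer. Write in a clear, engaging style."
    )

    graph.claude_node(
        name="conclusion_writer",
        prompt_template="Write a conclusion for a blog post with this outline: {outline}\nTopic: {topic}",
        inputs=["topic", "outline"],
        outputs=["conclusion"],
        system_prompt="You are a technical writer. Summarize key points and include a forward-looking statement."
    )

    def assembler(state, introduction, outline, conclusion, **kwargs):
        # The topic arrives through **kwargs as well; it is read from the state here
        return f"# {state['topic']}\n\n{introduction}\n\n## Outline\n{outline}\n\n## Conclusion\n{conclusion}"

    graph.transform_node(
        name="content_assembler",
        transform_fn=assembler,
        inputs=["topic", "introduction", "outline", "conclusion"],
        outputs=["final_content"]
    )

    graph.visualize()

    result = graph.execute()

    print("\n" + "="*50)
    print("BLOG POST GENERATED:")
    print("="*50 + "\n")
    print(result.get("final_content"))
    print("\n" + "="*50)

    return graph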
The run_advanced_example function showcases a more sophisticated use of LangGraph by orchestrating multiple Claude-powered nodes to generate a complete blog post. It starts by selecting a topic, then creates an outline, an introduction, and a conclusion, all using structured Claude prompts. Finally, a transformation node assembles the content into a formatted blog post. This example demonstrates how LangGraph can automate complex, multi-step content generation tasks using modular, connected nodes in a clear and executable flow.
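The flow described above (fill a prompt template from the shared state, call the model, then assemble the pieces) can be sketched offline as follows. The `fake_llm` function is a hypothetical stand-in for a Claude call that simply echoes its prompt, and the shortened prompt strings are illustrative rather than the tutorial's own.

```python
from typing import Any, Dict


def fill_template(template: str, state: Dict[str, Any]) -> str:
    """Replace each {key} placeholder with the matching string value from state."""
    for key, value in state.items():
        if isinstance(value, str):
            template = template.replace(f"{{{key}}}", value)
    return template


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a Claude call; it just echoes the prompt."""
    return f"<reply to: {prompt}>"


def assemble(state: Dict[str, str]) -> str:
    """Join the generated pieces into one Markdown document."""
    return (
        f"# {state['topic']}\n\n{state['introduction']}\n\n"
        f"## Outline\n{state['outline']}\n\n## Conclusion\n{state['conclusion']}"
    )


state = {"topic": "Graph-based AI systems"}
state["outline"] = fake_llm(fill_template("Outline a post about {topic}.", state))
state["introduction"] = fake_llm(fill_template("Introduce {topic} using: {outline}", state))
state["conclusion"] = fake_llm(fill_template("Conclude {topic} using: {outline}", state))

post = assemble(state)
print(post.splitlines()[0])  # # Graph-based AI systems
```

Using `str.replace` rather than `str.format` means placeholders with no matching key, and any literal braces in a prompt, are left untouched instead of raising an error.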
print("1. Running simple question-answering example")
question = "What are the three main advantages of using graph-based AI architectures?"
simple_graph = run_example(question)

print("\n2. Running advanced blog post creation example")
advanced_graph = run_advanced_example()

Finally, we trigger both workflows. The code first runs the simple question-answering example by passing a predefined question to run_example(), then starts the blog post generation workflow with run_advanced_example(). Together, these calls show the range of the approach, from basic prompt-based interactions to multi-step content automation using Anthropic’s Claude API.
In conclusion, we have implemented LangGraph integrated with Anthropic’s Claude API, which illustrates the ease of designing modular AI workflows that leverage powerful language models in structured, graph-based pipelines. Through visualizing task flows and separating responsibilities among nodes, such as question processing, analytical evaluation, content outlining, and assembly, developers gain practical experience in building maintainable, scalable AI systems. LangGraph’s clear node dependencies and Claude’s sophisticated language capabilities provide an efficient solution for orchestrating complex AI processes, especially for rapid prototyping and execution in environments like Google Colab.

Check out the Colab Notebook. All credit for this research goes to the researchers of this project.
Asif Razzaq is the CEO of Marktechpost Media Inc.
#stepbystep #implementation #tutorial #building #modular

A Step-by-Step Implementation Tutorial for Building Modular AI Workflows Using Anthropic’s Claude Sonnet 3.7 through API and LangGraph

WWW.MARKTECHPOST.COM


Business Insider shared a link

2025-05-21 22:52:46 ·

Google launched a dizzying array of new AI products, and it's getting harder to make sense of them all

Google CEO Sundar Pichai speaks during a Google I/O conference.

Justin Sullivan/Getty Images

2025-05-21T22:01:46Z


Google announced over two dozen new AI updates at its I/O developer conference.
It's impressive, though some of the new products seem to overlap significantly.
Google's approach could lose to more focused rivals as tech races to build an "everything app."

Attending Google's I/O developer conference is like being doused with a firehose of new AI announcements.

At I/O's keynote event on Tuesday, Business Insider counted at least two dozen new models, features, and updates.

"We are shipping faster than ever," Google CEO Sundar Pichai boasted onstage.

Indeed. But it's starting to get a little confusing. For one, some of the launches seem to overlap with each other. Launching so many AI products in such a short timeframe is impressive, and it can also feel scatterbrained.

AI Mode allows you to chat with Google as you browse the web, creating a more conversational search experience. Don't confuse it with Gemini in Chrome, which allows you to ask Gemini questions while you browse.

With Gemini Live, you can point your phone at whatever you want and talk to the AI assistant about it. Don't mistake it for Search Live, which allows you to chat with Search about whatever your phone sees.

Project Mariner is an experimental AI agent that can take actions like booking tickets. Gemini's upcoming Agent Mode also has agentic capabilities, like helping users find just the right Zillow listing.

Not all the new tools seemed that similar. Google launched an impressive new AI filmmaking tool called Flow, powered by its new model Veo 3.

Google also touted updates to an entirely separate AI model family from Gemini called Gemma which, incidentally, can help decipher how dolphins talk to each other (that's DolphinGemma).

Multiple Googlers that Business Insider spoke with at I/O used a single word to describe Google's current rate of shipping: "intense."

Google's approach complicates its own vision of building a single, universal AI assistant.

OpenAI is also moving fast towards this goal and appears intent on launching a dedicated device to run it, given its recent purchase of Apple designer Jony Ive's hardware startup.

Google risks building so many overlapping AI products that it will be tough to compete with a single, more stand-alone solution, such as an AI-native phone.

No one's counting Google out, though. The tech giant has become an undeniable AI leader, inventing much of the core research behind the current boom and successfully launching transformational technology like Waymo. Time will tell whether Google's more sprawling approach wins out.

Google did not immediately respond to a request for comment from Business Insider.

WWW.BUSINESSINSIDER.COM

Google launched a dizzying array of new AI products, and it's getting harder to make sense of them all

Google CEO Sundar Pichai speaks during a Google I/O conference. Justin Sullivan/Getty Images
2025-05-21T22:01:46Z
Google announced over two dozen new AI updates at its I/O developer conference.
It's impressive, though some of the new products seem to overlap significantly.
Google's approach could lose to more focused rivals as tech races to build an "everything app."

Computer Weekly shared a link

2025-05-21 22:49:20 ·

Google I/O: LLM capabilities power agentic AI search

BillionPhotos.com - stock.adobe.

As Google strives to make AI universal, it is starting to integrate agentic AI into Google Search to fast-track purchasing on websites

By Cliff Saran, Managing Editor

Published: 21 May 2025 17:00

Google has taken steps to advance artificial intelligence (AI) language models closer to what it calls “world models”, as it tries to make them more useful and universal.
The company used its annual developer event, Google I/O, to showcase the Gemini 2.5 large language model (LLM), new application programming interfaces (APIs) and programming tools, and agentic AI functionality built into Google’s internet search engine.
Gemini is Google’s primary AI engine, but it offers several others including Gemma 3n, a small language model for mobile devices.
Demis Hassabis, CEO of Google DeepMind, said: “Our ultimate vision is to transform the Gemini app into a universal AI assistant that will perform everyday tasks for us, take care of our mundane admin and surface delightful new recommendations – making us more productive and enriching our lives.”
Hassabis said the company was beginning to develop new AI capabilities, following on from work on a research prototype called Project Astra, which explored concepts such as video understanding, screen sharing and memory. “Over the past year, we’ve been integrating capabilities like these into Gemini Live for more people to experience today.”
Google has been working to make its main AI model, Gemini, a world model. With Gemini 2.5 Pro, Hassabis said the model can make plans and imagine new experiences by understanding and simulating aspects of the world.
Hassabis said the progress the company has made is based on training AI agents to master complex games such as Go and StarCraft, with its Genie 2 software able to generate 3D-simulated interactive worlds.
According to Hassabis, Gemini is making use of this work in how it handles world knowledge and reasoning to represent and simulate natural environments. Other examples include Veo, Google’s AI-based video content generator, which Hassabis said has a deep understanding of “intuitive physics”.
As it strives to make its AI more useful, the company has released a Gemini 2.5-powered feature called AI Mode on its North American internet search site, to support more in-depth querying than is possible with the AI Overview functionality currently available.
An agentic AI feature called Project Mariner is also now part of AI Mode, which Google said can help people searching the internet get tasks done more quickly. As an example, Google said a query to find affordable tickets would use AI Mode to look across multiple websites, analysing hundreds of potential ticket options with real-time pricing and inventory, and handle the work of filling in forms.
“AI Mode will present ticket options that meet your exact criteria, and you can complete the purchase on whichever site you prefer, saving you time while keeping you in control,” Google said.
Another agentic AI feature uses AI Mode to fast-track browsing and purchases on websites, with the entire payment process automated using Google Pay.
To support software developers, Google has integrated Gemini 2.5 Pro into the native code editor of Google AI Studio, which it said would help programmers prototype faster.
It has also released a beta version of Jules, an asynchronous code agent, which works directly with a software developer’s GitHub repositories.
Google said users can ask Jules to take on tasks such as version upgrades, writing tests, updating features and bug fixes.

Read more about Google AI models

Gemini vs. ChatGPT – what’s the difference: ChatGPT took the early lead among AI-generated chatbots before Google answered with Gemini. While ChatGPT and Gemini perform similar tasks, there are differences.
Google Gemini 2.5 Pro explained – Everything you need to know: Google’s latest multimodal model – Gemini 2.5 Pro – entered the AI race with enhanced reasoning and improved performance across coding, math and science benchmarks.



WWW.COMPUTERWEEKLY.COM


Marktechpost AI shared a link

2025-05-21 22:35:37 ·

Meta Researchers Introduced J1: A Reinforcement Learning Framework That Trains Language Models to Judge With Reasoned Consistency and Minimal Data

Large language models are now being used for evaluation and judgment tasks, extending beyond their traditional role of text generation. This has led to “LLM-as-a-Judge,” where models assess outputs from other language models. Such evaluations are essential in reinforcement learning pipelines, benchmark testing, and system alignment. These judge models rely on internal chain-of-thought reasoning, mirroring human judgment processes. Unlike conventional reward models that provide direct scores, these models simulate thoughtful evaluation, making them better suited for complex tasks such as math problem-solving, ethical reasoning, and user intent interpretation. Their ability to interpret and validate responses across languages and domains enhances automation and scalability in language model development.
However, current AI judgment systems face issues with inconsistency and shallow reasoning. Many rely on basic metrics or static annotations, which are inadequate for evaluating subjective or open-ended prompts. A common problem is position bias, where the order of answers affects the final decision, compromising fairness. Also, collecting human-annotated data at scale is costly and time-consuming, limiting the generalizability of these models.
Several existing approaches have addressed these challenges, but with limited success. Systems like EvalPlanner and DeepSeek-GRM rely on human-labeled data or rigid training schemes, which limit adaptability across task types. Others, like DeepSeek-R1, depend on distillation from large models but perform poorly on ambiguous prompts. Static datasets and offline tuning strategies hinder dynamic reasoning, while newer methods using score formatting or structured prompts have shown minimal accuracy improvements. Despite larger datasets and models, performance gains in traditional systems have stalled.
Researchers from Meta’s GenAI and FAIR teams introduced J1 to address the above limitations. J1 trains judgment models through a reinforcement learning-based framework, making them capable of learning through verifiable reward signals. The team used synthetic data to create high-quality and low-quality responses to a prompt, transforming subjective tasks into verifiable pairwise judgments. This synthetic dataset included 22,000 preference pairs, split between 17,000 prompts from the WildChat corpus and 5,000 mathematical queries. These were used to train two versions of J1: J1-Llama-8B and J1-Llama-70B, initialized from the Llama-3.1-8B-Instruct and Llama-3.3-70B-Instruct base models, respectively. The models were trained using Group Relative Policy Optimization (GRPO), a reinforcement learning algorithm that eliminates the need for a separate critic model and accelerates convergence.
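To make the "no critic model" point concrete, here is a minimal sketch of the group-relative advantage that GRPO is generally described as using: each sampled response's reward is normalized against the mean and standard deviation of the other samples drawn for the same prompt. The function name, the `eps` constant, the use of the population standard deviation, and the example rewards are illustrative assumptions, not details taken from the J1 paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    The group mean stands in for the baseline that a learned critic
    would otherwise provide. Population std is an assumption here;
    implementations differ on this detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four sampled judgments for one prompt,
# rewarded 1.0 for a correct verdict and 0.0 for a wrong one.
rewards = [1.0, 0.0, 1.0, 1.0]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.1f} advantage={a:+.3f}")
```

Correct judgments in the group receive a positive advantage and the incorrect one a negative advantage; if every sample earns the same reward, all advantages are zero and the group contributes no learning signal.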

At the training strategy’s core is position-agnostic learning, where both orderings of a response pair (answer A shown first, and answer B shown first) are used as input formats in training to prevent position bias. Also, consistency-based rewards are applied only when the model delivers correct verdicts across both answer orderings. This structure allows the judge to be fair and reliable regardless of prompt or answer order. The training framework supports multiple variations: models can output final verdicts, numeric scores for each answer, or both. A pointwise judging variant is included, which evaluates single responses using scores from 0 to 10. These formats make J1 a versatile and generalizable system capable of judging various tasks.
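The consistency-based reward described above can be sketched as follows. This is an illustration under stated assumptions rather than J1's actual implementation: the judge is a hypothetical callable that returns "first" or "second", the reward values 1.0 and 0.0 are placeholders, and the two toy judges exist only to show the behavior.

```python
from typing import Callable

# Hypothetical judge signature: (prompt, first_answer, second_answer)
# returns "first" or "second" to name the preferred answer.
Judge = Callable[[str, str, str], str]


def consistency_reward(judge: Judge, prompt: str, chosen: str, rejected: str) -> float:
    """Return 1.0 only if the judge prefers `chosen` in both orderings."""
    verdict_ab = judge(prompt, chosen, rejected)  # chosen shown first
    verdict_ba = judge(prompt, rejected, chosen)  # chosen shown second
    correct_ab = verdict_ab == "first"
    correct_ba = verdict_ba == "second"
    return 1.0 if (correct_ab and correct_ba) else 0.0


def always_first(prompt: str, a: str, b: str) -> str:
    """Toy judge with pure position bias: always picks the first answer."""
    return "first"


def prefers_longer(prompt: str, a: str, b: str) -> str:
    """Toy judge that ignores position and picks the longer answer."""
    return "first" if len(a) >= len(b) else "second"


biased = consistency_reward(always_first, "2+2?", "4", "5")
fair = consistency_reward(prefers_longer, "Explain.", "a detailed answer", "meh")
print(biased, fair)
```

The position-biased judge is right in one ordering and wrong in the other, so it earns no reward, while the judge whose verdict does not depend on ordering earns the full reward.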

The results obtained using the J1 models reveal substantial performance improvements over existing systems. On the widely used Preference Proxy Evaluations (PPE) benchmark, J1-Llama-70B achieved an overall accuracy of 69.6%, outperforming models trained with over ten times more data. In contrast, models like DeepSeek-GRM-27B and EvalPlanner-Llama-70B scored 67.2% and 65.6%, respectively. Even the smaller J1-Llama-8B model exceeded baseline systems like EvalPlanner-Llama-8B, scoring 62.2% versus 55.5%. J1 also showed top-tier performance on other critical benchmarks such as RewardBench, RM-Bench, JudgeBench, and FollowBenchEval, demonstrating robust generalization across verifiable and subjective tasks. These improvements are not just marginal but significant, considering the limited training data used in J1 compared to the expansive datasets in other models.

Several Key Takeaways from the Research on J1:

J1 is trained using 22,000 synthetic preference pairs, including 17K from WildChat and 5K from MATH tasks.
The training uses GRPO, which streamlines RL by avoiding the need for separate critic models.
It introduces position-agnostic learning, reducing position bias through consistency-based rewards.
Two main model variants, J1-Llama-8B and J1-Llama-70B, were trained on modest data but outperformed large-scale models.
J1-Llama-70B scored 69.6% on PPE, exceeding DeepSeek-GRM-27B (67.2%) and EvalPlanner-Llama-70B (65.6%).
Supports multiple judgment formats: pairwise with verdicts, pairwise with scores, and pointwise scores.
Surpasses models distilled from DeepSeek-R1 and OpenAI’s o1-mini on several tasks.
Demonstrates that reasoning quality, not just dataset size, is critical for accurate judgments.
J1’s framework makes it a generalist judge applicable to verifiable and non-verifiable tasks.

In conclusion, the J1 approach fundamentally redefines how judgment models are trained and evaluated. Synthetic data and reinforcement learning bypass the traditional need for costly annotations while promoting fair, logical, and consistent evaluations. This work illustrates that reasoning-driven judging can outperform larger models that rely heavily on data volume and static alignment techniques. It also validates the notion that judgment models should be thinkers first, and scorers second. With performance that rivals and often surpasses state-of-the-art systems, J1 sets a new benchmark in training LLM-as-a-Judge systems.

All credit for this research goes to the researchers of this project.
Bio: Asif Razzaq is the CEO of Marktechpost Media Inc.

Meta Researchers Introduced J1: A Reinforcement Learning Framework That Trains Language Models to Judge With Reasoned Consistency and Minimal Data
Large language models are now being used for evaluation and judgment tasks, extending beyond their traditional role of text generation. This has led to “LLM-as-a-Judge,” where models assess outputs from other language models. Such evaluations are essential in reinforcement learning pipelines, benchmark testing, and system alignment. These judge models rely on internal chain-of-thought reasoning, mirroring human judgment processes. Unlike conventional reward models that provide direct scores, these models simulate thoughtful evaluation, making them better suited for complex tasks such as math problem-solving, ethical reasoning, and user intent interpretation. Their ability to interpret and validate responses across languages and domains enhances automation and scalability in language model development. However, current AI judgment systems face issues with inconsistency and shallow reasoning. Many rely on basic metrics or static annotations, which are inadequate for evaluating subjective or open-ended prompts. A common problem is position bias, where the order of answers affects the final decision, compromising fairness. Also, collecting human-annotated data at scale is costly and time-consuming, limiting the generalizability of these models. Several existing approaches have addressed these challenges, but with limited success. Systems like EvalPlanner and DeepSeek-GRM rely on human-labeled data or rigid training schemes, which limit adaptability across task types. Others, like DeepSeek-R1, depend on distillation from large models but perform poorly on ambiguous prompts. Static datasets and offline tuning strategies hinder dynamic reasoning, while newer methods using score formatting or structured prompts have shown minimal accuracy improvements. Despite larger datasets and models, performance gains in traditional systems have stalled. Researchers from Meta’s GenAI and FAIR teams introduced J1 to address the above limitations. 
J1 trains judgment models through a reinforcement learning-based framework, making them capable of learning through verifiable reward signals. The team used synthetic data to create high-quality and low-quality responses to a prompt, transforming subjective tasks into verifiable pairwise judgments. This synthetic dataset included 22,000 preference pairs, split between 17,000 prompts from the WildChat corpus and 5,000 mathematical queries. These were used to train two versions of J1: J1-Llama-8B and J1-Llama-70B, initialized from the Llama-3.1-8B-Instruct and Llama-3.3-70B-Instruct base models, respectively. The models were trained using Group Relative Policy Optimization, a reinforcement algorithm that eliminates the need for critic models and accelerates convergence. At the training strategy’s core is position-agnostic learning, where bothandinput formats are used in training to prevent position bias. Also, consistency-based rewards are applied only when the model delivers correct verdicts across both answer orderings. This structure allows the judge to be fair and reliable regardless of prompt or answer order. The training framework supports multiple variations: models can output final verdicts, numeric scores for each answer, or both. A pointwise judging variant is included, which evaluates single responses using scores from 0 to 10. These formats make J1 a versatile and generalizable system capable of judging various tasks. The results obtained using the J1 models reveal substantial performance improvements over existing systems. On the widely used Preference Proxy Evaluationsbenchmark, J1-Llama-70B achieved an overall accuracy of 69.6%, outperforming models trained with over ten times more data. In contrast, models like DeepSeek-GRM-27B and EvalPlanner-Llama-70B scored 67.2% and 65.6%, respectively. Even the smaller J1-Llama-8B model exceeded baseline systems like EvalPlanner-Llama-8B, scoring 62.2% versus 55.5%. 
J1 also showed top-tier performance on other critical benchmarks such as RewardBench, RM-Bench, JudgeBench, and FollowBenchEval, demonstrating robust generalization across verifiable and subjective tasks. These improvements are significant rather than marginal, considering the limited training data used in J1 compared to the expansive datasets in other models.
Several Key Takeaways from the Research on J1:
J1 is trained using 22,000 synthetic preference pairs, including 17K from WildChat and 5K from MATH tasks.
The training uses GRPO, which streamlines RL by avoiding the need for separate critic models.
It introduces position-agnostic learning, reducing position bias through consistency-based rewards.
Two main model variants, J1-Llama-8B and J1-Llama-70B, were trained on modest data but outperformed large-scale models.
J1-Llama-70B scored 69.6% on PPE, exceeding DeepSeek-GRM-27B (67.2%) and EvalPlanner-Llama-70B (65.6%).
J1 supports multiple judgment formats: pairwise with verdicts, pairwise with scores, and pointwise scores.
It surpasses models distilled from DeepSeek-R1 and OpenAI’s o1-mini on several tasks.
It demonstrates that reasoning quality, not just dataset size, is critical for accurate judgments.
J1’s framework makes it a generalist judge applicable to verifiable and non-verifiable tasks.
In conclusion, the J1 approach changes how judgment models are trained and evaluated. Synthetic data and reinforcement learning bypass the traditional need for costly annotations while promoting fair, logical, and consistent evaluations. This work illustrates that reasoning-driven judging can outperform larger models that rely heavily on data volume and static alignment techniques. It also supports the notion that judgment models should be thinkers first, and scorers second. With performance that rivals and often surpasses state-of-the-art systems, J1 sets a new benchmark in training LLM-as-a-Judge systems. Check out the Paper.
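The critic-free property of GRPO noted in the takeaways comes from scoring each sampled output relative to the other outputs sampled for the same prompt, rather than against a learned value estimate. Below is a rough sketch of that group-relative advantage, assuming the commonly described mean and standard-deviation normalization. The function name, the `eps` term, and the use of the population standard deviation are illustrative choices, not details taken from the J1 work.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against its own sampling group.

    `rewards` holds one scalar reward per judgment sampled for the same
    prompt. No critic network is involved: the group mean serves as the
    baseline. `eps` (an assumption) avoids division by zero when every
    sample in the group received the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled judgments for one prompt; two earned the reward.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
    # A group where every sample earned the same reward carries no signal.
    print(group_relative_advantages([1.0, 1.0, 1.0]))
```

Samples that beat their group average get a positive advantage and the rest get a negative one. When all samples in a group receive the same reward, every advantage is zero and that prompt contributes no learning signal.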
All credit for this research goes to the researchers of this project. Author: Asif Razzaq, CEO of Marktechpost Media Inc.

WWW.MARKTECHPOST.COM

Meta Researchers Introduced J1: A Reinforcement Learning Framework That Trains Language Models to Judge With Reasoned Consistency and Minimal Data


