• Alibaba Qwen Team Releases Qwen3-Embedding and Qwen3-Reranker Series – Redefining Multilingual Embedding and Ranking Standards

    Text embedding and reranking are foundational to modern information retrieval systems, powering applications such as semantic search, recommendation systems, and retrieval-augmented generation (RAG). However, current approaches often face key challenges—particularly in achieving both high multilingual fidelity and task adaptability without relying on proprietary APIs. Existing models frequently fall short in scenarios requiring nuanced semantic understanding across multiple languages or domain-specific tasks like code retrieval and instruction following. Moreover, most open-source models either lack scale or flexibility, while commercial APIs remain costly and closed.
    Qwen3-Embedding and Qwen3-Reranker: A New Standard for Open-Source Embedding
    Alibaba’s Qwen Team has unveiled the Qwen3-Embedding and Qwen3-Reranker Series—models that set a new benchmark in multilingual text embedding and relevance ranking. Built on the Qwen3 foundation models, the series includes variants in 0.6B, 4B, and 8B parameter sizes and supports a wide range of languages (119 in total), making it one of the most versatile and performant open-source offerings to date. These models are now open-sourced under the Apache 2.0 license on Hugging Face, GitHub, and ModelScope, and are also accessible via Alibaba Cloud APIs.
    These models are optimized for use cases such as semantic retrieval, classification, RAG, sentiment analysis, and code search—providing a strong alternative to existing solutions like Gemini Embedding and OpenAI’s embedding APIs.

    Technical Architecture
    Qwen3-Embedding models adopt a dense transformer-based architecture with causal attention, producing embeddings by extracting the hidden state corresponding to the [EOS] token. Instruction-awareness is a key feature: input queries are formatted as {instruction} {query}<|endoftext|>, enabling task-conditioned embeddings. The reranker models are trained with a binary classification format, judging document-query relevance in an instruction-guided manner using a token likelihood-based scoring function.
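    Conceptually, producing such an instruction-conditioned embedding looks like the sketch below. The model ID, prompt template, and last-token pooling are assumptions to be checked against the official model cards, and a recent transformers release with Qwen3 support is assumed.

    import torch
    import torch.nn.functional as F
    from transformers import AutoTokenizer, AutoModel

    model_id = "Qwen/Qwen3-Embedding-0.6B"  # assumed checkpoint name; verify on Hugging Face
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)

    def embed(instruction: str, query: str) -> torch.Tensor:
        # Task-conditioned input: "{instruction} {query}<|endoftext|>"
        text = f"{instruction} {query}{tokenizer.eos_token}"
        inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False)
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state
        # The embedding is the hidden state at the final ([EOS]) position, L2-normalized.
        return F.normalize(hidden[0, -1], dim=-1)

    q = embed("Given a web search query, retrieve relevant passages.", "What is model merging?")
    d = embed("Represent this passage for retrieval.", "SLERP blends fine-tuned checkpoints.")
    print(float(q @ d))  # cosine similarity of the two normalized vectors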

    The models are trained using a robust multi-stage training pipeline:

    Large-scale weak supervision: 150M synthetic training pairs generated using Qwen3-32B, covering retrieval, classification, STS, and bitext mining across languages and tasks.
    Supervised fine-tuning: 12M high-quality data pairs, filtered by cosine similarity (>0.7), are used to fine-tune performance on downstream applications.
    Model merging: Spherical linear interpolation (SLERP) of multiple fine-tuned checkpoints ensures robustness and generalization.
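    To make the model-merging step concrete, here is a minimal spherical linear interpolation over two checkpoints' parameters; the actual recipe (number of checkpoints, interpolation weights) is not disclosed, so the values below are placeholders.

    import torch

    def slerp(w0: torch.Tensor, w1: torch.Tensor, t: float, eps: float = 1e-8) -> torch.Tensor:
        # Spherical linear interpolation between two flattened weight tensors.
        omega = torch.acos(torch.clamp(
            torch.dot(w0 / (w0.norm() + eps), w1 / (w1.norm() + eps)), -1.0, 1.0))
        so = torch.sin(omega)
        if so.abs() < eps:  # nearly parallel weights: fall back to linear interpolation
            return (1.0 - t) * w0 + t * w1
        return (torch.sin((1.0 - t) * omega) / so) * w0 + (torch.sin(t * omega) / so) * w1

    # Merge two hypothetical fine-tuned checkpoints parameter by parameter.
    ckpt_a = {"layer.weight": torch.randn(4, 4)}
    ckpt_b = {"layer.weight": torch.randn(4, 4)}
    merged = {name: slerp(ckpt_a[name].flatten(), ckpt_b[name].flatten(), t=0.5).reshape(ckpt_a[name].shape)
              for name in ckpt_a}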

    This synthetic data generation pipeline enables control over data quality, language diversity, task difficulty, and more—resulting in a high degree of coverage and relevance in low-resource settings.
    Performance Benchmarks and Insights
    The Qwen3-Embedding and Qwen3-Reranker series demonstrate strong empirical performance across several multilingual benchmarks.

    On MMTEB (216 tasks spanning 250+ languages), Qwen3-Embedding-8B achieves a mean task score of 70.58, surpassing the Gemini and GTE-Qwen2 series.
    On MTEB (English v2), Qwen3-Embedding-8B reaches 75.22, outperforming other open models including NV-Embed-v2 and GritLM-7B.
    On MTEB-Code, Qwen3-Embedding-8B leads with 80.68, excelling in applications like code retrieval and Stack Overflow QA.

    For reranking:

    Qwen3-Reranker-0.6B already outperforms Jina and BGE rerankers.
    Qwen3-Reranker-8B achieves 81.22 on MTEB-Code and 72.94 on MMTEB-R, marking state-of-the-art performance.

    Ablation studies confirm the necessity of each training stage. Removing synthetic pretraining or model merging led to significant performance drops (up to 6 points on MMTEB), emphasizing their contributions.
    Conclusion
    Alibaba’s Qwen3-Embedding and Qwen3-Reranker Series present a robust, open, and scalable solution to multilingual and instruction-aware semantic representation. With strong empirical results across MTEB, MMTEB, and MTEB-Code, these models bridge the gap between proprietary APIs and open-source accessibility. Their thoughtful training design—leveraging high-quality synthetic data, instruction-tuning, and model merging—positions them as ideal candidates for enterprise applications in search, retrieval, and RAG pipelines. By open-sourcing these models, the Qwen team not only pushes the boundaries of language understanding but also empowers the broader community to innovate on top of a solid foundation.

    Check out the Paper, Technical details, Qwen3-Embedding, and Qwen3-Reranker. All credit for this research goes to the researchers of this project.
  • Qwen Researchers Propose QwenLong-L1: A Reinforcement Learning Framework for Long-Context Reasoning in Large Language Models

    While large reasoning models (LRMs) have shown impressive capabilities in short-context reasoning through reinforcement learning (RL), these gains do not generalize well to long-context scenarios. Applications such as multi-document QA, research synthesis, and legal or financial analysis require models to process and reason over sequences exceeding 100K tokens. However, RL optimization in such regimes is plagued by slower reward convergence, unstable policy updates due to KL divergence fluctuations, and reduced exploration resulting from entropy collapse. These bottlenecks reveal a fundamental gap in transitioning LRMs from short-context proficiency to long-context generalization.
    QwenLong-L1: A Structured RL Framework for Long-Context Adaptation
    To address these limitations, the Qwen Research team introduces QwenLong-L1, a novel RL framework designed to adapt LRMs to long-context reasoning tasks. The framework is structured into three key stages:

    Warm-up Supervised Fine-Tuning (SFT): Provides a stable initialization for the policy model by training on curated question-context-answer triplets, ensuring basic competence in contextual comprehension and answer extraction.
    Curriculum-Guided Phased Reinforcement Learning: Introduces a staged training process with gradually increasing context lengths. This progression enables the model to incrementally acquire long-context reasoning behaviors without destabilizing policy updates.
    Difficulty-Aware Retrospective Sampling: Enhances exploration by maintaining and reusing hard examples from previous phases, weighted by their difficulty, to encourage deeper reasoning and robustness across diverse inputs.

    These stages are complemented by hybrid reward mechanisms—combining rule-based exact match verification with semantic evaluation by a lightweight LLM—ensuring both precision and recall during policy training.

    Technical Design and Methodological Advantages
    QwenLong-L1 integrates recent advances in group-relative RL optimization, specifically GRPO and DAPO, to mitigate the computational overhead associated with long-context value estimation:

    GRPO estimates advantage by normalizing rewards within sampled groups, eliminating the need for a separate value network and encouraging diverse generation patterns (see the sketch after this list).
    DAPO incorporates mechanisms such as dynamic sampling, overlength penalty shaping, and asymmetric clipping thresholds to prevent entropy collapse and mitigate length biases during training.
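    As a rough illustration of the group-relative idea behind GRPO, the advantage of each sampled response can be computed by normalizing its reward against the other responses drawn for the same prompt; this sketch omits the DAPO refinements and is not the paper's exact formulation.

    import torch

    def group_relative_advantage(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
        # Rewards for a group of responses sampled from the same prompt are
        # normalized within the group, so no separate value network is needed.
        return (rewards - rewards.mean()) / (rewards.std() + eps)

    # Example: rewards for 8 sampled answers to one long-context question.
    print(group_relative_advantage(torch.tensor([0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.5, 1.0])))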

    The reward function is defined as the maximum of two signals: a deterministic rule-based match and a semantic judgment from a compact evaluator model (e.g., Qwen2.5-1.5B). This hybrid approach avoids overfitting to rigid formats while maintaining answer correctness across varied notations and phrasings.
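    A minimal sketch of this max-of-two-signals reward is shown below; llm_judge stands in for the compact evaluator model, and its scoring interface is an assumption rather than the actual implementation.

    def rule_based_reward(prediction: str, reference: str) -> float:
        # Deterministic exact-match check after light normalization.
        return 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0

    def hybrid_reward(prediction: str, reference: str, llm_judge) -> float:
        # Hybrid reward: take the maximum of the rule-based and LLM-judged signals.
        rule_score = rule_based_reward(prediction, reference)
        judge_score = llm_judge(prediction, reference)  # assumed to return a score in [0, 1]
        return max(rule_score, judge_score)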
    Moreover, the framework is optimized via progressive context scaling, where the RL process transitions from 20K-token to 60K-token input lengths in controlled phases, stabilizing training dynamics and facilitating policy generalization.
    Experimental Results and Benchmark Performance
    QwenLong-L1 was evaluated on seven long-context document QA benchmarks, including DocMath, Frames, 2WikiMultihopQA, HotpotQA, Musique, NarrativeQA, and Qasper. The 32B variant, QwenLong-L1-32B, demonstrated strong empirical performance:

    It outperformed baseline models such as R1-Distill-Qwen-32B by 5.1 points and exceeded leading proprietary systems like OpenAI-o3-mini and Qwen3-235B-A22B.
    Its performance was comparable to Claude-3.7-Sonnet-Thinking, indicating competitive reasoning capabilities under extreme context lengths.
    Pass@K analysis revealed consistent improvements with increased sampling, achieving a Pass@2 average of 73.7, surpassing DeepSeek-R1 and OpenAI-o1-preview, even at low sampling rates.

    Ablation studies further validated the individual contributions of SFT, phased RL, and retrospective sampling. Notably, RL played a decisive role in enabling emergent reasoning behaviors such as grounding, subgoal setting, verification, and backtracking—traits not effectively induced by supervised fine-tuning alone.
    Conclusion
    QwenLong-L1 represents a systematic approach to equipping LRMs with robust long-context reasoning capabilities through reinforcement learning. Its design effectively bridges the gap between short-context expertise and the demands of information-dense environments by combining supervised initialization, curriculum-driven context scaling, and hybrid evaluation strategies. The framework not only achieves state-of-the-art results across long-context benchmarks but also demonstrates the emergence of interpretable reasoning patterns during training.

    Check out the Paper, Model on Hugging Face and GitHub Page. All credit for this research goes to the researchers of this project.
  • NVIDIA Releases Llama Nemotron Nano 4B: An Efficient Open Reasoning Model Optimized for Edge AI and Scientific Tasks

    NVIDIA has released Llama Nemotron Nano 4B, an open-source reasoning model designed to deliver strong performance and efficiency across scientific tasks, programming, symbolic math, function calling, and instruction following—while being compact enough for edge deployment. With just 4 billion parameters, it achieves higher accuracy and up to 50% greater throughput than comparable open models with up to 8 billion parameters, according to internal benchmarks.
    The model is positioned as a practical foundation for deploying language-based AI agents in resource-constrained environments. By focusing on inference efficiency, Llama Nemotron Nano 4B addresses a growing demand for compact models capable of supporting hybrid reasoning and instruction-following tasks outside traditional cloud settings.
    Model Architecture and Training Stack
    Nemotron Nano 4B builds upon the Llama 3.1 architecture and shares lineage with NVIDIA’s earlier “Minitron” family. The architecture follows a dense, decoder-only transformer design. The model has been optimized for performance in reasoning-intensive workloads while maintaining a lightweight parameter count.
    The post-training stack for the model includes multi-stage supervised fine-tuning on curated datasets for mathematics, coding, reasoning tasks, and function calling. In addition to traditional supervised learning, Nemotron Nano 4B has undergone reinforcement learning optimization using Reward-aware Preference Optimization (RPO), a method intended to enhance the model’s utility in chat-based and instruction-following environments.
    This combination of instruction tuning and reward modeling helps align the model’s outputs more closely with user intent, particularly in multi-turn reasoning scenarios. The training approach reflects NVIDIA’s emphasis on aligning smaller models to practical usage tasks that traditionally require significantly larger parameter sizes.

    Performance Benchmarks
    Despite its compact footprint, Nemotron Nano 4B exhibits robust performance in both single-turn and multi-turn reasoning tasks. According to NVIDIA, it provides 50% higher inference throughput compared to similar open-weight models within the 8B parameter range. The model supports a context window of up to 128,000 tokens, which is particularly useful for tasks involving long documents, nested function calls, or multi-hop reasoning chains.
    While NVIDIA has not disclosed full benchmark tables in the Hugging Face documentation, the model reportedly outperforms other open alternatives in benchmarks across math, code generation, and function calling precision. Its throughput advantage suggests it can serve as a viable default for developers targeting efficient inference pipelines with moderately complex workloads.
    Edge-Ready Deployment
    One of the core differentiators of Nemotron Nano 4B is its focus on edge deployment. The model has been explicitly tested and optimized to run efficiently on NVIDIA Jetson platforms and NVIDIA RTX GPUs. This enables real-time reasoning capabilities on low-power embedded devices, including robotics systems, autonomous edge agents, or local developer workstations.
    For enterprises and research teams concerned with privacy and deployment control, the ability to run advanced reasoning models locally—without relying on cloud inference APIs—can provide both cost savings and greater flexibility.
    Licensing and Access
    The model is released under the NVIDIA Open Model License, which permits commercial usage. It is available through Hugging Face at huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1, with all relevant model weights, configuration files, and tokenizer artifacts openly accessible. The license structure aligns with NVIDIA’s broader strategy of supporting developer ecosystems around its open models.
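    For orientation, loading the checkpoint with the standard Hugging Face transformers API would look roughly like the sketch below; the generation settings are illustrative, and the chat template is assumed to ship with the tokenizer.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1"  # repository referenced above
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

    messages = [{"role": "user", "content": "Explain, in two sentences, why edge deployment matters."}]
    inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
    output = model.generate(inputs, max_new_tokens=128)
    print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))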
    Conclusion
    Nemotron Nano 4B represents NVIDIA’s continued investment in bringing scalable, practical AI models to a broader development audience—especially those targeting edge or cost-sensitive deployment scenarios. While the field continues to see rapid progress in ultra-large models, compact and efficient models like Nemotron Nano 4B provide a counterbalance, enabling deployment flexibility without compromising too heavily on performance.

    Check out the Model on Hugging Face. All credit for this research goes to the researchers of this project.
  • Step-by-Step Guide to Build a Customizable Multi-Tool AI Agent with LangGraph and Claude for Dynamic Agent Creation

    In this comprehensive tutorial, we guide users through creating a powerful multi-tool AI agent using LangGraph and Claude, optimized for diverse tasks including mathematical computations, web searches, weather inquiries, text analysis, and real-time information retrieval. It begins by simplifying dependency installation to ensure effortless setup, even for beginners. Users are then introduced to structured implementations of specialized tools, such as a safe calculator, an efficient web-search utility leveraging DuckDuckGo, a mock weather information provider, a detailed text analyzer, and a time-fetching function. The tutorial also clearly delineates how these tools are integrated into a sophisticated agent architecture built with LangGraph, illustrating practical usage through interactive examples and clear explanations, so that both beginners and advanced developers can rapidly deploy custom multi-functional AI agents.
    import subprocess
    import sys

    def install_packages():
        # Package list inferred from the imports used later in this tutorial.
        packages = ["langgraph", "langchain", "langchain-core", "langchain-anthropic",
                    "duckduckgo-search", "requests", "python-dotenv"]
        for package in packages:
            try:
                subprocess.check_call([sys.executable, "-m", "pip", "install", package, "-q"])
                print(f"Installed {package}")
            except subprocess.CalledProcessError:
                print(f"Failed to install {package}")

    install_packages()
    print("Setup complete.")

    We automate the installation of essential Python packages required for building a LangGraph-based multi-tool AI agent. It leverages a subprocess to run pip commands silently and ensures each package, ranging from LangChain components to web search and environment handling tools, is installed successfully. This setup streamlines the environment preparation process, making the notebook portable and beginner-friendly.
    import os
    import json
    import math
    import requests
    from typing import Dict, List, Any, Annotated, TypedDict
    from datetime import datetime
    import operator

    from langchain_core.messages import BaseMessage, HumanMessage, AIMessage, ToolMessage
    from langchain_core.tools import tool
    from langchain_anthropic import ChatAnthropic
    from langgraph.graph import StateGraph, START, END
    from langgraph.prebuilt import ToolNode
    from langgraph.checkpoint.memory import MemorySaver
    from duckduckgo_search import DDGS
    We import all the necessary libraries and modules for constructing the multi-tool AI agent. It includes Python standard libraries such as os, json, math, and datetime for general-purpose functionality and external libraries like requests for HTTP calls and duckduckgo_search for implementing web search. The LangChain and LangGraph ecosystems bring in message types, tool decorators, state graph components, and checkpointing utilities, while ChatAnthropic enables integration with the Claude model for conversational intelligence. These imports form the foundational building blocks for defining tools, agent workflows, and interactions.
    os.environ["ANTHROPIC_API_KEY"] = "Use Your API Key Here"

    ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY")
    We set and retrieve the Anthropic API key required to authenticate and interact with Claude models. The os.environ line assigns your API key (replace the placeholder with a valid key), while os.getenv securely retrieves it for later use in model initialization. This approach ensures the key is accessible throughout the script without hardcoding it multiple times.
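    Because python-dotenv is among the installed packages, a common alternative is to keep the key in a local .env file rather than in the notebook itself. The snippet below is a minimal sketch of that approach; it assumes a .env file in the working directory containing a line such as ANTHROPIC_API_KEY=your-key.
    # Optional: load the key from a local .env file instead of hardcoding it.
    # Assumes .env contains: ANTHROPIC_API_KEY=<your key>
    from dotenv import load_dotenv

    load_dotenv()  # reads .env and populates os.environ
    ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY")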
    from typing import TypedDict

    class AgentState(TypedDict):
        messages: Annotated[List[BaseMessage], operator.add]

    @tool
    def calculator(expression: str) -> str:
        """
        Perform mathematical calculations. Supports basic arithmetic, trigonometry, and more.

        Args:
        expression: Mathematical expression as a string (e.g., "2 + 3 * 4", "sin(3.14159/2)")

        Returns:
        Result of the calculation as a string
        """
        try:
            allowed_names = {
                'abs': abs, 'round': round, 'min': min, 'max': max,
                'sum': sum, 'pow': pow, 'sqrt': math.sqrt,
                'sin': math.sin, 'cos': math.cos, 'tan': math.tan,
                'log': math.log, 'log10': math.log10, 'exp': math.exp,
                'pi': math.pi, 'e': math.e
            }

            # Allow ^ as a familiar alias for exponentiation, then evaluate with a restricted namespace
            expression = expression.replace('^', '**')
            result = eval(expression, {"__builtins__": {}}, allowed_names)
            return f"Result: {result}"
        except Exception as e:
            return f"Error in calculation: {str(e)}"
    We define the agent’s internal state and implement a robust calculator tool. The AgentState class uses TypedDict to structure agent memory, specifically tracking messages exchanged during the conversation. The calculator function, decorated with @tool to register it as an AI-usable utility, securely evaluates mathematical expressions. It allows for safe computation by limiting available functions to a predefined set from the math module and replacing common syntax like ^ with Python’s exponentiation operator. This ensures the tool can handle simple arithmetic and advanced functions like trigonometry or logarithms while preventing unsafe code execution.
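    As a quick sanity check, a @tool-decorated function can be invoked directly with a dictionary of arguments (standard LangChain tool behavior), independent of the agent; the expression below is just an illustrative input.
    # Quick manual test of the calculator tool outside the agent loop
    print(calculator.invoke({"expression": "sqrt(144) + 5 * 3"}))
    # Expected output along the lines of: Result: 27.0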
    @tool
    def web_search(query: str, num_results: int = 3) -> str:
        """
        Search the web for information using DuckDuckGo.

        Args:
        query: Search query string
        num_results: Number of results to return (default: 3, max: 10)

        Returns:
        Search results as formatted string
        """
        try:
            # Clamp the requested number of results to the 1–10 range
            num_results = min(max(num_results, 1), 10)

            with DDGS() as ddgs:
                results = list(ddgs.text(query, max_results=num_results))

            if not results:
                return f"No search results found for: {query}"

            formatted_results = f"Search results for '{query}':\n\n"
            for i, result in enumerate(results, 1):
                formatted_results += f"{i}. **{result['title']}**\n"
                formatted_results += f" {result['body']}\n"
                formatted_results += f" Source: {result['href']}\n\n"

            return formatted_results
        except Exception as e:
            return f"Error performing web search: {str(e)}"
    We define a web_search tool that enables the agent to fetch real-time information from the internet using the DuckDuckGo Search API via the duckduckgo_search Python package. The tool accepts a search query and an optional num_results parameter, ensuring that the number of results returned is between 1 and 10. It opens a DuckDuckGo search session, retrieves the results, and formats them neatly for user-friendly display. If no results are found or an error occurs, the function handles it gracefully by returning an informative message. This tool equips the agent with real-time search capabilities, enhancing responsiveness and utility.
    @tool
    def weather_info(city: str) -> str:
        """
        Get current weather information for a city using OpenWeatherMap API.
        Note: This is a mock implementation for demo purposes.

        Args:
        city: Name of the city

        Returns:
        Weather information as a string
        """
        mock_weather = {
            "new york": {"temp": 22, "condition": "Partly Cloudy", "humidity": 65},
            "london": {"temp": 15, "condition": "Rainy", "humidity": 80},
            "tokyo": {"temp": 28, "condition": "Sunny", "humidity": 70},
            "paris": {"temp": 18, "condition": "Overcast", "humidity": 75}
        }

        city_lower = city.lower()
        if city_lower in mock_weather:
            weather = mock_weather[city_lower]
            return f"Weather in {city}:\n" \
                   f"Temperature: {weather['temp']}°C\n" \
                   f"Condition: {weather['condition']}\n" \
                   f"Humidity: {weather['humidity']}%"
        else:
            return f"Weather data not available for {city}. (This is a demo with limited cities: New York, London, Tokyo, Paris)"
    We define a weather_info tool that simulates retrieving current weather data for a given city. While it does not connect to a live weather API, it uses a predefined dictionary of mock data for major cities like New York, London, Tokyo, and Paris. Upon receiving a city name, the function normalizes it to lowercase and checks for its presence in the mock dataset. It returns temperature, weather condition, and humidity in a readable format if found. Otherwise, it notifies the user that weather data is unavailable. This tool serves as a placeholder and can later be upgraded to fetch live data from an actual weather API.
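    If you later want the live behavior the docstring hints at, the sketch below shows one way an OpenWeatherMap-backed replacement might look. It assumes an OPENWEATHER_API_KEY environment variable and the service’s free current-weather endpoint, so treat it as illustrative rather than production-ready.
    @tool
    def weather_info_live(city: str) -> str:
        """Fetch current weather for a city from OpenWeatherMap (illustrative sketch)."""
        api_key = os.getenv("OPENWEATHER_API_KEY")  # assumed env var, not part of the tutorial
        if not api_key:
            return "OPENWEATHER_API_KEY is not set."
        url = "https://api.openweathermap.org/data/2.5/weather"
        resp = requests.get(url, params={"q": city, "appid": api_key, "units": "metric"}, timeout=10)
        if resp.status_code != 200:
            return f"Weather data not available for {city}."
        data = resp.json()
        return (f"Weather in {city}:\n"
                f"Temperature: {data['main']['temp']}°C\n"
                f"Condition: {data['weather'][0]['description']}\n"
                f"Humidity: {data['main']['humidity']}%")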
    @tool
    def text_analyzer(text: str) -> str:
        """
        Analyze text and provide statistics like word count, character count, etc.

        Args:
        text: Text to analyze

        Returns:
        Text analysis results
        """
        if not text.strip():
            return "Please provide text to analyze."

        words = text.split()
        sentences = text.split('.') + text.split('!') + text.split('?')
        sentences = [s.strip() for s in sentences if s.strip()]

        analysis = f"Text Analysis Results:\n"
        analysis += f"• Characters (with spaces): {len(text)}\n"
        analysis += f"• Characters (without spaces): {len(text.replace(' ', ''))}\n"
        analysis += f"• Words: {len(words)}\n"
        analysis += f"• Sentences: {len(sentences)}\n"
        analysis += f"• Average words per sentence: {len(words) / max(len(sentences), 1):.1f}\n"
        analysis += f"• Most common word: {max(set(words), key=words.count) if words else 'N/A'}"

        return analysis
    The text_analyzer tool provides a detailed statistical analysis of a given text input. It calculates metrics such as character count, word count, sentence count, and average words per sentence, and it identifies the most frequently occurring word. The tool handles empty input gracefully by prompting the user to provide valid text. It uses simple string operations and Python’s set and max functions to extract meaningful insights. It is a valuable utility for language analysis or content quality checks in the AI agent’s toolkit.
    @tool
    def current_time() -> str:
        """
        Get the current date and time.

        Returns:
        Current date and time as a formatted string
        """
        now = datetime.now()
        return f"Current date and time: {now.strftime('%Y-%m-%d %H:%M:%S')}"
    The current_time tool provides a straightforward way to retrieve the current system date and time in a human-readable format. Using Python’s datetime module, it captures the present moment and formats it as YYYY-MM-DD HH:MM:SS. This utility is particularly useful for time-stamping responses or answering user queries about the current date and time within the AI agent’s interaction flow.
    tools = [calculator, web_search, weather_info, text_analyzer, current_time]

    def create_llm():
        if ANTHROPIC_API_KEY:
            return ChatAnthropic(
                model="claude-3-haiku-20240307",
                temperature=0.1,
                max_tokens=1024
            )
        else:
            # Fallback mock model that routes to tools via keyword matching when no API key is set
            class MockLLM:
                def invoke(self, messages):
                    last_message = messages[-1].content if messages else ""

                    if any(word in last_message.lower() for word in ['calculate', 'math', '+', '-', '*', '/', 'sqrt', 'sin', 'cos']):
                        import re
                        numbers = re.findall(r'[\d\+\-\*/\.\(\)\s\w]+', last_message)
                        expr = numbers[0] if numbers else "2+2"
                        return AIMessage(content="I'll help you with that calculation.", tool_calls=[{"name": "calculator", "args": {"expression": expr.strip()}, "id": "calc1"}])
                    elif any(word in last_message.lower() for word in ['search', 'find', 'look up', 'information about']):
                        query = last_message.replace('search for', '').replace('find', '').replace('look up', '').strip()
                        if not query or len(query) < 3:
                            query = "python programming"
                        return AIMessage(content="I'll search for that information.", tool_calls=[{"name": "web_search", "args": {"query": query}, "id": "search1"}])
                    elif any(word in last_message.lower() for word in ['weather', 'temperature']):
                        city = "New York"
                        words = last_message.lower().split()
                        for i, word in enumerate(words):
                            if word == 'in' and i + 1 < len(words):
                                city = words[i + 1].title()
                                break
                        return AIMessage(content="I'll get the weather information.", tool_calls=[{"name": "weather_info", "args": {"city": city}, "id": "weather1"}])
                    elif any(word in last_message.lower() for word in ['time', 'date']):
                        return AIMessage(content="I'll get the current time.", tool_calls=[{"name": "current_time", "args": {}, "id": "time1"}])
                    elif any(word in last_message.lower() for word in ['analyze', 'analysis']):
                        text = last_message.replace('analyze this text:', '').replace('analyze', '').strip()
                        if not text:
                            text = "Sample text for analysis"
                        return AIMessage(content="I'll analyze that text for you.", tool_calls=[{"name": "text_analyzer", "args": {"text": text}, "id": "analyze1"}])
                    else:
                        return AIMessage(content="Hello! I'm a multi-tool agent powered by Claude. I can help with:\n• Mathematical calculations\n• Web searches\n• Weather information\n• Text analysis\n• Current time/date\n\nWhat would you like me to help you with?")

                def bind_tools(self, tools):
                    return self

            print("⚠️ Note: Using mock LLM for demo. Add your ANTHROPIC_API_KEY for full functionality.")
            return MockLLM()

    llm = create_llm()
    llm_with_tools = llm.bind_tools(tools)
    We initialize the language model that powers the AI agent. If a valid Anthropic API key is available, it uses the Claude 3 Haiku model for high-quality responses. Without an API key, a MockLLM is defined to simulate basic tool-routing behavior based on keyword matching, allowing the agent to function offline with limited capabilities. The bind_tools method links the defined tools to the model, enabling it to invoke them as needed.
    def agent_node(state: AgentState) -> Dict[str, Any]:
        """Main agent node that processes messages and decides on tool usage."""
        messages = state["messages"]
        response = llm_with_tools.invoke(messages)
        return {"messages": [response]}

    def should_continue(state: AgentState) -> str:
        """Determine whether to continue with tool calls or end."""
        last_message = state["messages"][-1]
        if hasattr(last_message, 'tool_calls') and last_message.tool_calls:
            return "tools"
        return END
    We define the agent’s core decision-making logic. The agent_node function handles incoming messages, invokes the language model, and returns the model’s response. The should_continue function then evaluates whether the model’s response includes tool calls. If so, it routes control to the tool execution node; otherwise, it directs the flow to end the interaction. These functions enable dynamic and conditional transitions within the agent’s workflow.
    def create_agent_graph():
        tool_node = ToolNode(tools)

        workflow = StateGraph(AgentState)
        workflow.add_node("agent", agent_node)
        workflow.add_node("tools", tool_node)
        workflow.add_edge(START, "agent")
        workflow.add_conditional_edges("agent", should_continue, {"tools": "tools", END: END})
        workflow.add_edge("tools", "agent")

        memory = MemorySaver()
        app = workflow.compile(checkpointer=memory)
        return app

    print("Creating LangGraph Multi-Tool Agent...")
    agent = create_agent_graph()
    print("✓ Agent created successfully!\n")
    We construct the LangGraph-powered workflow that defines the AI agent’s operational structure. It initializes a ToolNode to handle tool executions and uses a StateGraph to organize the flow between agent decisions and tool usage. Nodes and edges are added to manage transitions: starting with the agent, conditionally routing to tools, and looping back as needed. A MemorySaver is integrated for persistent state tracking across turns. The graph is compiled into an executable application, enabling a structured, memory-aware multi-tool agent ready for deployment.
    def test_agent():
        """Test the agent with various queries."""
        config = {"configurable": {"thread_id": "test-thread"}}

        test_queries = [
            "What's 15 * 7 + 23?",
            "Search for information about Python programming",
            "What's the weather like in Tokyo?",
            "What time is it?",
            "Analyze this text: 'LangGraph is an amazing framework for building AI agents.'"
        ]

        print("🧪 Testing the agent with sample queries...\n")
        for i, query in enumerate(test_queries, 1):
            print(f"Query {i}: {query}")
            print("-" * 50)
            try:
                response = agent.invoke(
                    {"messages": [HumanMessage(content=query)]},
                    config=config
                )
                last_message = response["messages"][-1]
                print(f"Response: {last_message.content}\n")
            except Exception as e:
                print(f"Error: {str(e)}\n")
    The test_agent function is a validation utility that checks that the LangGraph agent responds correctly across different use cases. It runs predefined queries covering arithmetic, web search, weather, time, and text analysis, and prints the agent’s responses. Using a consistent thread_id for configuration, it invokes the agent with each query and neatly displays the results, helping developers verify tool integration and conversational logic before moving to interactive or production use.
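    For debugging, it can also help to watch intermediate steps rather than only the final message; compiled LangGraph applications expose a stream method for this. A minimal sketch, assuming the agent compiled above:
    # Stream node-by-node updates instead of waiting for the final response
    config = {"configurable": {"thread_id": "debug-thread"}}
    for update in agent.stream(
        {"messages": [HumanMessage(content="What's 15 * 7 + 23?")]},
        config=config,
        stream_mode="updates",
    ):
        print(update)  # each update shows which node ran ("agent" or "tools") and what it returned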
    def chat_with_agent():
        """Interactive chat function."""
        config = {"configurable": {"thread_id": "interactive-thread"}}

        print("🤖 Multi-Tool Agent Chat")
        print("Available tools: Calculator, Web Search, Weather Info, Text Analyzer, Current Time")
        print("Type 'quit' to exit, 'help' for available commands\n")

        while True:
            try:
                user_input = input("You: ").strip()
                if user_input.lower() in ['quit', 'exit', 'q']:
                    print("Goodbye!")
                    break
                elif user_input.lower() == 'help':
                    print("\nAvailable commands:")
                    print("• Calculator: 'Calculate 15 * 7 + 23' or 'What's sin(pi/2)?'")
                    print("• Web Search: 'Search for Python tutorials' or 'Find information about AI'")
                    print("• Weather: 'Weather in Tokyo' or 'What's the temperature in London?'")
                    print("• Text Analysis: 'Analyze this text: [your text]'")
                    print("• Current Time: 'What time is it?' or 'Current date'")
                    print("• quit: Exit the chat\n")
                    continue
                elif not user_input:
                    continue

                response = agent.invoke(
                    {"messages": [HumanMessage(content=user_input)]},
                    config=config
                )
                last_message = response["messages"][-1]
                print(f"Agent: {last_message.content}\n")
            except KeyboardInterrupt:
                print("\nGoodbye!")
                break
            except Exception as e:
                print(f"Error: {str(e)}\n")
    The chat_with_agent function provides an interactive command-line interface for real-time conversations with the LangGraph multi-tool agent. It supports natural language queries and recognizes commands like “help” for usage guidance and “quit” to exit. Each user input is processed through the agent, which dynamically selects and invokes appropriate response tools. The function enhances user engagement by simulating a conversational experience and showcasing the agent’s capabilities in handling various queries, from math and web search to weather, text analysis, and time retrieval.
    if __name__ == "__main__":
        test_agent()
        print("=" * 60)
        print("🎉 LangGraph Multi-Tool Agent is ready!")
        print("=" * 60)
        chat_with_agent()

    def quick_demo():
        """Quick demonstration of agent capabilities."""
        config = {"configurable": {"thread_id": "demo"}}

        demos = [
            ("Math", "Calculate the square root of 144 plus 5 times 3"),
            ("Search", "Find recent news about artificial intelligence"),
            ("Time", "What's the current date and time?")
        ]

        print("🚀 Quick Demo of Agent Capabilities\n")
        for category, query in demos:
            print(f"[{category}] Query: {query}")
            try:
                response = agent.invoke(
                    {"messages": [HumanMessage(content=query)]},
                    config=config
                )
                print(f"Response: {response['messages'][-1].content}\n")
            except Exception as e:
                print(f"Error: {str(e)}\n")

    print("\n" + "=" * 60)
    print("🔧 Usage Instructions:")
    print("1. Add your ANTHROPIC_API_KEY to use Claude model")
    print("   os.environ['ANTHROPIC_API_KEY'] = 'your-anthropic-api-key'")
    print("2. Run quick_demo() for a quick demonstration")
    print("3. Run chat_with_agent() for interactive chat")
    print("4. The agent supports: calculations, web search, weather, text analysis, and time")
    print("5. Example: 'Calculate 15*7+23' or 'Search for Python tutorials'")
    print("=" * 60)
    Finally, we orchestrate the execution of the LangGraph multi-tool agent. If the script is run directly, it initiates test_agent() to validate functionality with sample queries, followed by launching the interactive chat_with_agent() mode for real-time interaction. The quick_demo() function also briefly showcases the agent’s capabilities in math, search, and time queries. Clear usage instructions are printed at the end, guiding users on configuring the API key, running demonstrations, and interacting with the agent. This provides a smooth onboarding experience for users to explore and extend the agent’s functionality.
    In conclusion, this step-by-step tutorial gives valuable insights into building an effective multi-tool AI agent leveraging LangGraph and Claude’s generative capabilities. With straightforward explanations and hands-on demonstrations, the guide empowers users to integrate diverse utilities into a cohesive and interactive system. The agent’s flexibility in performing tasks, from complex calculations to dynamic information retrieval, showcases the versatility of modern AI development frameworks. Also, the inclusion of user-friendly functions for both testing and interactive chat enhances practical understanding, enabling immediate application in various contexts. Developers can confidently extend and customize their AI agents with this foundational knowledge.

    Check out the Notebook on GitHub. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter.
  • Microsoft AI Introduces Magentic-UI: An Open-Source Agent Prototype that Works with People to Complete Complex Tasks that Require Multi-Step Planning and Browser Use

    Modern web usage spans many digital interactions, from filling out forms and managing accounts to executing data queries and navigating complex dashboards. Despite the web being deeply intertwined with productivity and work processes, many of these actions still demand repetitive human input. This is especially true in environments that require detailed instructions or decisions beyond simple searches. While artificial intelligence agents have emerged to support task automation, many prioritize complete autonomy, which frequently sidelines user control and leads to outcomes that diverge from user expectations. The next leap forward in productivity-enhancing AI involves agents designed not to replace users but to collaborate with them, blending automation with continuous, real-time human input for more accurate and trusted results.
    A key challenge in deploying AI agents for web-based tasks is the lack of visibility and intervention. Users often cannot see what steps the agent is planning, how it intends to execute them, or when it might go off track. In scenarios that involve complex decisions, like entering payment information, interpreting dynamic content, or running scripts, users need mechanisms to step in and redirect the process. Without these capabilities, systems risk making irreversible mistakes or misaligning with user goals. This highlights a significant limitation in current AI automation: the absence of structured human-in-the-loop design, where users dynamically guide and supervise agent behavior, without acting merely as spectators.
    Previous solutions approached web automation through rule-based scripts or general-purpose AI agents driven by language models. These systems interpret user commands and attempt to carry them out autonomously. However, they often execute plans without surfacing intermediate decisions or allowing meaningful user feedback. A few offer command-line-like interactions, which are inaccessible to the average user and rarely include layered safety mechanisms. Moreover, minimal support for task reuse or performance learning across sessions limits long-term value. These systems also tend to lack adaptability when the context changes mid-task or errors must be corrected collaboratively.
    Researchers at Microsoft introduced Magentic-UI, an open-source prototype that emphasizes collaborative human-AI interaction for web-based tasks. Unlike previous systems aiming for full independence, this tool promotes real-time co-planning, execution sharing, and step-by-step user oversight. Magentic-UI is built on Microsoft’s AutoGen framework and is tightly integrated with Azure AI Foundry Labs. It’s a direct evolution from the previously introduced Magentic-One system. With its launch, Microsoft Research aims to address fundamental questions about human oversight, safety mechanisms, and learning in agentic systems by offering an experimental platform for researchers and developers.
    Magentic-UI includes four core interactive features: co-planning, co-tasking, action guards, and plan learning. Co-planning lets users view and adjust the agent’s proposed steps before execution begins, offering full control over what the AI will do. Co-tasking enables real-time visibility during operation, letting users pause, edit, or take over specific actions. Action guards are customizable confirmations for high-risk activities like closing browser tabs or clicking “submit” on a form, actions that could have unintended consequences. Plan learning allows Magentic-UI to remember and refine steps for future tasks, improving over time through experience. These capabilities are supported by a modular team of agents: the Orchestrator leads planning and decision-making, WebSurfer handles browser interactions, Coder executes code in a sandbox, and FileSurfer interprets files and data.
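    To make the action-guard idea concrete, the sketch below shows how a confirmation gate might wrap a high-risk step. It is a minimal illustration under assumed names (run_with_action_guard, HIGH_RISK_ACTIONS), not the actual Magentic-UI code.

# Illustrative "action guard": high-risk actions are surfaced for user approval
# before they run. Names and structure are assumptions, not Magentic-UI's API.
HIGH_RISK_ACTIONS = {"click_submit", "close_tab", "enter_payment_details"}

def run_with_action_guard(action_name, action_fn, ask_user=input):
    """Execute action_fn only after user approval when the action is high-risk."""
    if action_name in HIGH_RISK_ACTIONS:
        answer = ask_user(f"Agent wants to perform '{action_name}'. Approve? [y/N] ")
        if answer.strip().lower() != "y":
            return {"status": "blocked", "action": action_name}
    return {"status": "done", "action": action_name, "result": action_fn()}

# Example: guard a stubbed form submission
print(run_with_action_guard("click_submit", lambda: "form submitted"))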

    Technically, when a user submits a request, the Orchestrator agent generates a step-by-step plan. Users can modify it through a graphical interface by editing, deleting, or regenerating steps. Once finalized, the plan is delegated across specialized agents. Each agent reports after performing its task, and the Orchestrator determines whether to proceed, repeat, or request user feedback. All actions are visible on the interface, and users can halt execution at any point. This architecture not only ensures transparency but also allows for adaptive task flows. For example, if a step fails due to a broken link, the Orchestrator can dynamically adjust the plan with user consent.
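    The sketch below illustrates that plan-execute-report loop in simplified form. The names (orchestrate, the report fields, the prompt text) are assumptions for illustration, not Microsoft's implementation.

def orchestrate(plan, agents, ask_user=input):
    """Delegate each plan step to a specialist agent and decide how to continue."""
    for step in plan:
        agent = agents[step["agent"]]               # e.g. "WebSurfer", "Coder", "FileSurfer"
        report = agent.run(step["instruction"])     # each agent reports back after its task
        if report.get("status") == "ok":
            continue                                # proceed to the next step
        choice = ask_user(
            f"Step '{step['instruction']}' failed ({report.get('error')}). "
            "Retry, skip, or stop? [r/s/x] "
        ).strip().lower()
        if choice == "r":
            agent.run(step["instruction"])          # repeat the step once
        elif choice == "x":
            break                                   # the user halts execution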
    In controlled evaluations using the GAIA benchmark, which includes complex tasks like navigating the web and interpreting documents, Magentic-UI’s performance was rigorously tested. GAIA consists of 162 tasks requiring multimodal understanding. When operating autonomously, Magentic-UI completed 30.3% of tasks successfully. However, when supported by a simulated user with access to additional task information, success jumped to 51.9%, a 71% improvement. Another configuration using a smarter simulated user improved the rate to 42.6%. Interestingly, Magentic-UI requested help in only 10% of the enhanced tasks and asked for final answers in 18%. In those cases, the system asked for help an average of just 1.1 times. This shows how minimal but well-timed human intervention significantly boosts task completion without high oversight costs.
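    For clarity, the 71% figure is the relative improvement over the autonomous baseline, as the quick check below shows.

# How the reported 71% improvement follows from the two success rates
autonomous, assisted = 0.303, 0.519
relative_gain = (assisted - autonomous) / autonomous
print(f"{relative_gain:.0%}")   # -> 71%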

    Magentic-UI also features a “Saved Plans” gallery that displays strategies reused from past tasks. Retrieval from this gallery is approximately three times faster than generating a new plan. A predictive mechanism surfaces these plans while users type, streamlining repeated tasks like flight searches or form submissions. Safety mechanisms are robust. Every browser or code action runs inside a Docker container, ensuring that no user credentials are exposed. Users can define allow-lists for site access, and every action can be gated behind approval prompts. A red-team evaluation further tested it against phishing attacks and prompt injections, where the system either sought user clarification or blocked execution, reinforcing its layered defense model.
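    As a rough illustration of the allow-list gating described above (the helper name and structure are assumptions, not the project's configuration format), a navigation check might look like this:

from urllib.parse import urlparse

# User-defined allow-list of hosts the agent may visit
ALLOWED_HOSTS = {"github.com", "docs.python.org"}

def can_navigate(url: str) -> bool:
    """Permit browser navigation only to explicitly allowed hosts."""
    return urlparse(url).hostname in ALLOWED_HOSTS

print(can_navigate("https://github.com/microsoft/autogen"))  # True
print(can_navigate("https://example.com/login"))             # False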

    Several Key Takeaways from the Research on Magentic-UI:

    With simple human input, Magentic-UI boosts task completion by 71% (from 30.3% to 51.9% task success).
    Requests user help in only 10% of enhanced tasks and averages 1.1 help requests per task.
    It features a co-planning UI that allows full user control before execution.
    Executes tasks via four modular agents: Orchestrator, WebSurfer, Coder, and FileSurfer.
    Stores and reuses plans, reducing repeat task latency by up to 3x.
    All actions are sandboxed via Docker containers; no user credentials are ever exposed.
    Passed red-team evaluations against phishing and injection threats.
    Supports fully user-configurable “action guards” for high-risk steps.
    Fully open-source and integrated with Azure AI Foundry Labs.

    In conclusion, Magentic-UI addresses a long-standing problem in AI automation: the lack of transparency and controllability. Rather than replacing users, it keeps them central to the process. The system performs well even with minimal help and learns to improve over time. The modular design, robust safeguards, and detailed interaction model create a strong foundation for future intelligent assistants.

    Check out the Technical details and GitHub Page.
  • Anthropic Releases Claude Opus 4 and Claude Sonnet 4: A Technical Leap in Reasoning, Coding, and AI Agent Design

    Anthropic has announced the release of its next-generation language models: Claude Opus 4 and Claude Sonnet 4. The update marks a significant technical refinement in the Claude model family, particularly in areas involving structured reasoning, software engineering, and autonomous agent behaviors.
    This release is not another reinvention but a focused improvement—bringing increased consistency, interpretability, and performance across complex reasoning tasks. With extended context handling, long-horizon planning, and more efficient coding capabilities, these models reflect a maturing shift toward functional generalist systems that can serve a range of high-complexity applications.
    Claude Opus 4: Scaling Advanced Reasoning and Multi-file Code Understanding
    Positioned as the flagship model, Claude Opus 4 has been benchmarked as Anthropic’s most capable model to date. Designed to handle intricate reasoning workflows and software development scenarios, Opus 4 has achieved:

    72.5% accuracy on the SWE-bench benchmark, which tests models against real-world GitHub issue resolution.
    43.2% on TerminalBench, which evaluates correctness in terminal-based code generation tasks requiring multi-step planning.

    A notable aspect of Claude Opus 4 is its agentic behavior in software environments. In practical testing, the model was able to autonomously sustain nearly seven hours of uninterrupted code generation and task execution. This is a marked improvement from Claude 3 Opus, which previously sustained such tasks for under an hour.
    These improvements are attributed to enhanced memory management, broader context retention, and a more robust internal planning loop. From a developer’s perspective, Opus 4 reduces the need for frequent interventions and exhibits stronger consistency in handling edge cases across software stacks.

    Claude Sonnet 4: A Balanced Model for General Reasoning and Code Tasks
    Claude Sonnet 4 replaces its predecessor, Claude 3.5 Sonnet, with a more stable and balanced architecture that brings improvements in both speed and quality without significantly increasing computational costs.
    Sonnet 4 is optimized for mid-scale deployments where cost-performance trade-offs are critical. While not matching Opus 4’s reasoning ceiling, it inherits many architectural upgrades—supporting multi-file code navigation, intermediate tool use, and structured text processing with improved latency.
    It serves as the new default model for free-tier users on Claude.ai and is also available via API. This makes Sonnet 4 a practical option for lightweight development tools, user-facing assistants, and analytical pipelines requiring consistent but less intensive model calls.
    Architectural Highlights: Hybrid Reasoning and Extended Thinking
    Both models incorporate hybrid reasoning capabilities, introducing two distinct response modes:

    Fast Mode for low-latency responses suitable for short prompts and conversational tasks.
    Extended Thinking Mode for computationally intensive tasks requiring deeper inference, longer memory chains, or multi-turn agentic behavior.

    This dual-mode reasoning strategy allows users to dynamically allocate compute and latency budgets based on task complexity. It is especially relevant in agent frameworks, where LLMs must balance fast reaction time with deliberative planning.
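    As a hedged sketch of how a developer might opt into the deeper mode through the Anthropic Python SDK (the model id and token budgets below are illustrative assumptions; check Anthropic's current documentation for the exact identifiers and limits):

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Extended-thinking request; the model id and budget values are assumptions.
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=4096,
    thinking={"type": "enabled", "budget_tokens": 2048},
    messages=[{"role": "user", "content": "Plan a migration from REST to gRPC."}],
)
print(response.content)  # thinking blocks followed by the final text blocks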
    Deployment and Integration
    Claude Opus 4 and Sonnet 4 are accessible through multiple cloud platforms:

    Anthropic’s Claude API
    Amazon Bedrock
    Google Cloud Vertex AI

    This cross-platform availability simplifies model deployment into diverse enterprise environments, supporting use cases ranging from autonomous agents to code analysis, decision support, and retrieval-augmented generation (RAG) pipelines.
    Conclusion
    The Claude 4 series does not introduce radical design changes but instead demonstrates measured improvements in reliability, interpretability, and task generalization. With Claude Opus 4, Anthropic positions itself firmly in the upper tier of AI model providers for reasoning and coding automation. Meanwhile, Claude Sonnet 4 offers a technically sound, cost-efficient entry point for developers and researchers working on mid-scale AI applications.
    For engineering teams evaluating LLMs for long-context planning, software agents, or structured data workflows, the Claude 4 models present a competitive, technically capable alternative.

    Check out the Technical details and Get started today on Claude, Claude Code, or the platform of your choice.
  • Technology Innovation Institute TII Releases Falcon-H1: Hybrid Transformer-SSM Language Models for Scalable, Multilingual, and Long-Context Understanding

    Addressing Architectural Trade-offs in Language Models
    As language models scale, balancing expressivity, efficiency, and adaptability becomes increasingly challenging. Transformer architectures dominate due to their strong performance across a wide range of tasks, but they are computationally expensive—particularly for long-context scenarios—due to the quadratic complexity of self-attention. On the other hand, Structured State Space Models (SSMs) offer improved efficiency and linear scaling, yet often lack the nuanced sequence modeling required for complex language understanding. A combined architecture that leverages the strengths of both approaches is needed to support diverse applications across environments.
    Introducing Falcon-H1: A Hybrid Architecture
    The Falcon-H1 series, released by the Technology Innovation Institute (TII), introduces a hybrid family of language models that combine Transformer attention mechanisms with Mamba2-based SSM components. This architecture is designed to improve computational efficiency while maintaining competitive performance across tasks requiring deep contextual understanding.
    Falcon-H1 covers a wide parameter range—from 0.5B to 34B—catering to use cases from resource-constrained deployments to large-scale distributed inference. The design aims to address common bottlenecks in LLM deployment: memory efficiency, scalability, multilingual support, and the ability to handle extended input sequences.

    Source: https://falcon-lm.github.io/blog/falcon-h1/
    Architectural Details and Design Objectives
    Falcon-H1 adopts a parallel structure where attention heads and Mamba2 SSMs operate side by side. This design allows each mechanism to independently contribute to sequence modeling: attention heads specialize in capturing token-level dependencies, while SSM components support efficient long-range information retention.
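    To make the "side by side" arrangement concrete, the conceptual PyTorch sketch below runs an attention branch and a simple recurrent stand-in for the SSM branch over the same normalized input and adds both outputs back to the residual stream. It only illustrates the parallel layout; it is not the Falcon-H1/Mamba2 implementation.

import torch
import torch.nn as nn

class ParallelHybridBlock(nn.Module):
    """Conceptual parallel block: attention and an SSM-like recurrence share the input."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Stand-in for an SSM branch: per-channel decay recurrence h_t = a*h_{t-1} + x_t
        self.a = nn.Parameter(torch.full((d_model,), 0.9))
        self.in_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):                      # x: (batch, seq, d_model)
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        u = self.in_proj(h)
        state = torch.zeros(x.size(0), x.size(2), device=x.device)
        states = []
        for t in range(x.size(1)):             # O(seq) scan with O(1) state
            state = self.a * state + u[:, t, :]
            states.append(state)
        ssm_out = self.out_proj(torch.stack(states, dim=1))
        return x + attn_out + ssm_out          # both branches feed the residual stream

block = ParallelHybridBlock()
print(block(torch.randn(2, 16, 256)).shape)    # torch.Size([2, 16, 256])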
    The series supports a context length of up to 256K tokens, which is particularly useful for applications in document summarization, retrieval-augmented generation, and multi-turn dialogue systems. Model training incorporates a customized microparameterization (μP) recipe and optimized data pipelines, allowing for stable and efficient training across model sizes.
    The models are trained with a focus on multilingual capabilities. The architecture is natively equipped to handle 18 languages, with coverage including English, Chinese, Arabic, Hindi, French, and others. The framework is extensible to over 100 languages, supporting localization and region-specific model adaptation.
    Empirical Results and Comparative Evaluation
    Despite relatively modest parameter counts, Falcon-H1 models demonstrate strong empirical performance:

    Falcon-H1-0.5B achieves results comparable to 7B-parameter models released in 2024.
    Falcon-H1-1.5B-Deep performs on par with leading 7B to 10B Transformer models.
    Falcon-H1-34B matches or exceeds the performance of models such as Qwen3-32B, Llama4-Scout-17B/109B, and Gemma3-27B across several benchmarks.

    Evaluations emphasize both general-purpose language understanding and multilingual benchmarks. Notably, the models achieve strong performance across both high-resource and low-resource languages without requiring excessive fine-tuning or additional adaptation layers.

    Source: https://falcon-lm.github.io/blog/falcon-h1/
    Deployment and inference are supported through integration with open-source tools such as Hugging Face Transformers. FlashAttention-2 compatibility further reduces memory usage during inference, offering an attractive efficiency-performance balance for enterprise use.
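    As an illustration, loading one of the released checkpoints through Hugging Face Transformers might look like the sketch below. The repository id is assumed from the release naming and should be verified on the Hub; a recent transformers version may also be required.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/Falcon-H1-0.5B-Instruct"   # assumed repository name; verify on the Hub

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = "Summarize the advantages of hybrid attention-SSM language models."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))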
    Conclusion
    Falcon-H1 represents a methodical effort to refine language model architecture by integrating complementary mechanisms—attention and SSMs—within a unified framework. By doing so, it addresses key limitations in both long-context processing and scaling efficiency. The model family provides a range of options for practitioners, from lightweight variants suitable for edge deployment to high-capacity configurations for server-side applications.
    Through its multilingual coverage, long-context capabilities, and architectural flexibility, Falcon-H1 offers a technically sound foundation for research and production use cases that demand performance without compromising on efficiency or accessibility.

    Check out the Official Release, Models on Hugging Face and GitHub Page.
    #technology #innovation #institute #tii #releases
    Technology Innovation Institute TII Releases Falcon-H1: Hybrid Transformer-SSM Language Models for Scalable, Multilingual, and Long-Context Understanding
    Addressing Architectural Trade-offs in Language Models As language models scale, balancing expressivity, efficiency, and adaptability becomes increasingly challenging. Transformer architectures dominate due to their strong performance across a wide range of tasks, but they are computationally expensive—particularly for long-context scenarios—due to the quadratic complexity of self-attention. On the other hand, Structured State Space Modelsoffer improved efficiency and linear scaling, yet often lack the nuanced sequence modeling required for complex language understanding. A combined architecture that leverages the strengths of both approaches is needed to support diverse applications across environments. Introducing Falcon-H1: A Hybrid Architecture The Falcon-H1 series, released by the Technology Innovation Institute, introduces a hybrid family of language models that combine Transformer attention mechanisms with Mamba2-based SSM components. This architecture is designed to improve computational efficiency while maintaining competitive performance across tasks requiring deep contextual understanding. Falcon-H1 covers a wide parameter range—from 0.5B to 34B—catering to use cases from resource-constrained deployments to large-scale distributed inference. The design aims to address common bottlenecks in LLM deployment: memory efficiency, scalability, multilingual support, and the ability to handle extended input sequences. Source: / Architectural Details and Design Objectives Falcon-H1 adopts a parallel structure where attention heads and Mamba2 SSMs operate side by side. This design allows each mechanism to independently contribute to sequence modeling: attention heads specialize in capturing token-level dependencies, while SSM components support efficient long-range information retention. The series supports a context length of up to 256K tokens, which is particularly useful for applications in document summarization, retrieval-augmented generation, and multi-turn dialogue systems. Model training incorporates a customized microparameterizationrecipe and optimized data pipelines, allowing for stable and efficient training across model sizes. The models are trained with a focus on multilingual capabilities. The architecture is natively equipped to handle 18 languages, with coverage including English, Chinese, Arabic, Hindi, French, and others. The framework is extensible to over 100 languages, supporting localization and region-specific model adaptation. Empirical Results and Comparative Evaluation Despite relatively modest parameter counts, Falcon-H1 models demonstrate strong empirical performance: Falcon-H1-0.5B achieves results comparable to 7B-parameter models released in 2024. Falcon-H1-1.5B-Deep performs on par with leading 7B to 10B Transformer models. Falcon-H1-34B matches or exceeds the performance of models such as Qwen3-32B, Llama4-Scout-17B/109B, and Gemma3-27B across several benchmarks. Evaluations emphasize both general-purpose language understanding and multilingual benchmarks. Notably, the models achieve strong performance across both high-resource and low-resource languages without requiring excessive fine-tuning or additional adaptation layers. Source: / Deployment and inference are supported through integration with open-source tools such as Hugging Face Transformers. FlashAttention-2 compatibility further reduces memory usage during inference, offering an attractive efficiency-performance balance for enterprise use. 
Conclusion Falcon-H1 represents a methodical effort to refine language model architecture by integrating complementary mechanisms—attention and SSMs—within a unified framework. By doing so, it addresses key limitations in both long-context processing and scaling efficiency. The model family provides a range of options for practitioners, from lightweight variants suitable for edge deployment to high-capacity configurations for server-side applications. Through its multilingual coverage, long-context capabilities, and architectural flexibility, Falcon-H1 offers a technically sound foundation for research and production use cases that demand performance without compromising on efficiency or accessibility. Check out the Official Release, Models on Hugging Face and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. Asif RazzaqWebsite |  + postsBioAsif Razzaq is the CEO of Marktechpost Media Inc.. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts of over 2 million monthly views, illustrating its popularity among audiences.Asif Razzaqhttps://www.marktechpost.com/author/6flvq/Google DeepMind Releases Gemma 3n: A Compact, High-Efficiency Multimodal AI Model for Real-Time On-Device UseAsif Razzaqhttps://www.marktechpost.com/author/6flvq/A Step-by-Step Implementation Tutorial for Building Modular AI Workflows Using Anthropic’s Claude Sonnet 3.7 through API and LangGraphAsif Razzaqhttps://www.marktechpost.com/author/6flvq/Meta Researchers Introduced J1: A Reinforcement Learning Framework That Trains Language Models to Judge With Reasoned Consistency and Minimal DataAsif Razzaqhttps://www.marktechpost.com/author/6flvq/Sampling Without Data is Now Scalable: Meta AI Releases Adjoint Sampling for Reward-Driven Generative Modeling #technology #innovation #institute #tii #releases
    WWW.MARKTECHPOST.COM
    Technology Innovation Institute TII Releases Falcon-H1: Hybrid Transformer-SSM Language Models for Scalable, Multilingual, and Long-Context Understanding
    Addressing Architectural Trade-offs in Language Models As language models scale, balancing expressivity, efficiency, and adaptability becomes increasingly challenging. Transformer architectures dominate due to their strong performance across a wide range of tasks, but they are computationally expensive—particularly for long-context scenarios—due to the quadratic complexity of self-attention. On the other hand, Structured State Space Models (SSMs) offer improved efficiency and linear scaling, yet often lack the nuanced sequence modeling required for complex language understanding. A combined architecture that leverages the strengths of both approaches is needed to support diverse applications across environments. Introducing Falcon-H1: A Hybrid Architecture The Falcon-H1 series, released by the Technology Innovation Institute (TII), introduces a hybrid family of language models that combine Transformer attention mechanisms with Mamba2-based SSM components. This architecture is designed to improve computational efficiency while maintaining competitive performance across tasks requiring deep contextual understanding. Falcon-H1 covers a wide parameter range—from 0.5B to 34B—catering to use cases from resource-constrained deployments to large-scale distributed inference. The design aims to address common bottlenecks in LLM deployment: memory efficiency, scalability, multilingual support, and the ability to handle extended input sequences. Source: https://falcon-lm.github.io/blog/falcon-h1/ Architectural Details and Design Objectives Falcon-H1 adopts a parallel structure where attention heads and Mamba2 SSMs operate side by side. This design allows each mechanism to independently contribute to sequence modeling: attention heads specialize in capturing token-level dependencies, while SSM components support efficient long-range information retention. The series supports a context length of up to 256K tokens, which is particularly useful for applications in document summarization, retrieval-augmented generation, and multi-turn dialogue systems. Model training incorporates a customized microparameterization (μP) recipe and optimized data pipelines, allowing for stable and efficient training across model sizes. The models are trained with a focus on multilingual capabilities. The architecture is natively equipped to handle 18 languages, with coverage including English, Chinese, Arabic, Hindi, French, and others. The framework is extensible to over 100 languages, supporting localization and region-specific model adaptation. Empirical Results and Comparative Evaluation Despite relatively modest parameter counts, Falcon-H1 models demonstrate strong empirical performance: Falcon-H1-0.5B achieves results comparable to 7B-parameter models released in 2024. Falcon-H1-1.5B-Deep performs on par with leading 7B to 10B Transformer models. Falcon-H1-34B matches or exceeds the performance of models such as Qwen3-32B, Llama4-Scout-17B/109B, and Gemma3-27B across several benchmarks. Evaluations emphasize both general-purpose language understanding and multilingual benchmarks. Notably, the models achieve strong performance across both high-resource and low-resource languages without requiring excessive fine-tuning or additional adaptation layers. Source: https://falcon-lm.github.io/blog/falcon-h1/ Deployment and inference are supported through integration with open-source tools such as Hugging Face Transformers. 
    FlashAttention-2 compatibility further reduces memory usage during inference, offering an attractive efficiency-performance balance for enterprise use.
    Conclusion
    Falcon-H1 represents a methodical effort to refine language model architecture by integrating complementary mechanisms—attention and SSMs—within a unified framework. By doing so, it addresses key limitations in both long-context processing and scaling efficiency. The model family provides a range of options for practitioners, from lightweight variants suitable for edge deployment to high-capacity configurations for server-side applications. Through its multilingual coverage, long-context capabilities, and architectural flexibility, Falcon-H1 offers a technically sound foundation for research and production use cases that demand performance without compromising on efficiency or accessibility.
    Check out the Official Release, Models on Hugging Face and GitHub Page.
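    Since the models are designed to plug into Hugging Face Transformers, loading a checkpoint should follow the standard causal-LM pattern. The sketch below is illustrative only: the repository name tiiuae/Falcon-H1-0.5B-Instruct is an assumption based on TII's usual naming, and it presumes a recent transformers release (with accelerate installed) that includes Falcon-H1/Mamba2 support.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Assumed checkpoint ID; confirm the exact name on TII's Hugging Face page.
    model_id = "tiiuae/Falcon-H1-0.5B-Instruct"

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

    prompt = "Explain, in two sentences, the trade-off between attention and state space models."
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=128)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))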
  • Google DeepMind Releases Gemma 3n: A Compact, High-Efficiency Multimodal AI Model for Real-Time On-Device Use

    Researchers are reimagining how models operate as demand skyrockets for faster, smarter, and more private AI on phones, tablets, and laptops. The next generation of AI isn’t just lighter and faster; it’s local. By embedding intelligence directly into devices, developers are unlocking near-instant responsiveness, slashing memory demands, and putting privacy back into users’ hands. With mobile hardware rapidly advancing, the race is on to build compact, lightning-fast models that are intelligent enough to redefine everyday digital experiences.
    A major concern is delivering high-quality, multimodal intelligence within the constrained environments of mobile devices. Unlike cloud-based systems that have access to extensive computational power, on-device models must perform under strict RAM and processing limits. Multimodal AI, capable of interpreting text, images, audio, and video, typically requires large models, which most mobile devices cannot handle efficiently. Also, cloud dependency introduces latency and privacy concerns, making it essential to design models that can run locally without sacrificing performance.
    Earlier models like Gemma 3 and Gemma 3 QAT attempted to bridge this gap by reducing size while maintaining performance. Designed for use on cloud or desktop GPUs, they significantly improved model efficiency. However, these models still required robust hardware and could not fully overcome mobile platforms’ memory and responsiveness constraints. Despite supporting advanced functions, they often involved compromises that limited their real-time usability on smartphones.
    Researchers from Google and Google DeepMind introduced Gemma 3n. The architecture behind Gemma 3n has been optimized for mobile-first deployment, targeting performance across Android and Chrome platforms. It also forms the underlying basis for the next version of Gemini Nano. The innovation represents a significant leap forward by supporting multimodal AI functionalities with a much lower memory footprint while maintaining real-time response capabilities. This marks the first open model built on this shared infrastructure and is made available to developers in preview, allowing immediate experimentation.

    The core innovation in Gemma 3n is the application of Per-Layer Embeddings (PLE), a method that drastically reduces RAM usage. While the raw model sizes include 5 billion and 8 billion parameters, they behave with memory footprints equivalent to 2 billion and 4 billion parameter models. The dynamic memory consumption is just 2GB for the 5B model and 3GB for the 8B version. Also, it uses a nested model configuration where a 4B active memory footprint model includes a 2B submodel trained through a technique known as MatFormer. This allows developers to dynamically switch performance modes without loading separate models. Further advancements include KV cache sharing and activation quantization, which reduce latency and increase response speed. For example, response time on mobile improved by 1.5x compared to Gemma 3 4B while maintaining better output quality.
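    To make the nested-submodel idea concrete, the toy sketch below illustrates MatFormer-style weight sharing in a single feed-forward block: the smaller path simply uses a prefix of the larger layer's hidden units, so two capacity levels live in one set of weights and switching between them requires no second checkpoint. This is a conceptual illustration of the general technique, not Gemma 3n's actual implementation, and all names and sizes here are made up.
    import numpy as np

    d_model, d_ff_large, d_ff_small = 8, 32, 16
    rng = np.random.default_rng(0)
    W_in = rng.normal(size=(d_model, d_ff_large))   # shared input projection
    W_out = rng.normal(size=(d_ff_large, d_model))  # shared output projection

    def ffn(x, width):
        # Run the feed-forward block using only the first `width` hidden units.
        h = np.maximum(x @ W_in[:, :width], 0.0)    # ReLU over the sliced hidden dimension
        return h @ W_out[:width, :]

    x = rng.normal(size=(1, d_model))
    full_out = ffn(x, d_ff_large)    # higher-capacity path
    nested_out = ffn(x, d_ff_small)  # nested submodel path, reusing the same weights
    print(full_out.shape, nested_out.shape)  # both (1, 8): same interface, different capacity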

    The performance metrics achieved by Gemma 3n reinforce its suitability for mobile deployment. It excels in automatic speech recognition and translation, allowing seamless speech conversion to translated text. On multilingual benchmarks like WMT24++ (ChrF), it scores 50.1%, highlighting its strength in Japanese, German, Korean, Spanish, and French. Its mix’n’match capability allows the creation of submodels optimized for various quality and latency combinations, offering developers further customization. The architecture supports interleaved inputs from different modalities (text, audio, images, and video), allowing more natural and context-rich interactions. It also performs offline, ensuring privacy and reliability even without network connectivity. Use cases include live visual and auditory feedback, context-aware content generation, and advanced voice-based applications.

    Several Key Takeaways from the Research on Gemma 3n include:

    Built using collaboration between Google, DeepMind, Qualcomm, MediaTek, and Samsung System LSI. Designed for mobile-first deployment.
    Raw model size of 5B and 8B parameters, with operational footprints of 2GB and 3GB, respectively, using Per-Layer Embeddings (PLE).
    1.5x faster response on mobile vs Gemma 3 4B. Multilingual benchmark score of 50.1% on WMT24++ (ChrF).
    Accepts and understands audio, text, image, and video, enabling complex multimodal processing and interleaved inputs.
    Supports dynamic trade-offs using MatFormer training with nested submodels and mix’n’match capabilities.
    Operates without an internet connection, ensuring privacy and reliability.
    Preview is available via Google AI Studio and Google AI Edge, with text and image processing capabilities.
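    For developers who want to try the preview mentioned above, access is through Google AI Studio (or on-device via Google AI Edge) rather than raw open checkpoints. A minimal sketch using the google-genai Python SDK follows; the model ID gemma-3n-e4b-it and the availability of Gemma 3n through this API are assumptions to verify against the current model list in AI Studio.
    import os
    from google import genai

    # Assumes an AI Studio API key is stored in the GEMINI_API_KEY environment variable.
    client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

    # Hypothetical model ID for the Gemma 3n preview; check AI Studio for the exact name.
    response = client.models.generate_content(
        model="gemma-3n-e4b-it",
        contents="In one paragraph, explain why on-device multimodal models need a small memory footprint.",
    )
    print(response.text)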

    In conclusion, this innovation provides a clear pathway for making high-performance AI portable and private. By tackling RAM constraints through innovative architecture and enhancing multilingual and multimodal capabilities, researchers offer a viable solution for bringing sophisticated AI directly into everyday devices. The flexible submodel switching, offline readiness, and fast response time mark a comprehensive approach to mobile-first AI. The research addresses the balance of computational efficiency, user privacy, and dynamic responsiveness. The result is a system capable of delivering real-time AI experiences without sacrificing capability or versatility, fundamentally expanding what users can expect from on-device intelligence.

    Check out the Technical details and Try it here.
  • A Step-by-Step Implementation Tutorial for Building Modular AI Workflows Using Anthropic’s Claude Sonnet 3.7 through API and LangGraph

    In this tutorial, we provide a practical guide for implementing LangGraph, a streamlined, graph-based AI orchestration framework, integrated seamlessly with Anthropic’s Claude API. Through detailed, executable code optimized for Google Colab, developers learn how to build and visualize AI workflows as interconnected nodes performing distinct tasks, such as generating concise answers, critically analyzing responses, and automatically composing technical blog content. The compact implementation highlights LangGraph’s intuitive node-graph architecture. It can manage complex sequences of Claude-powered natural language tasks, from basic question-answering scenarios to advanced content generation pipelines.
    from getpass import getpass
    import os

    anthropic_key = getpass("Enter your Anthropic API key: ")
    os.environ["ANTHROPIC_API_KEY"] = anthropic_key

    print("Key set:", "ANTHROPIC_API_KEY" in os.environ)
    We securely prompt users to input their Anthropic API key using Python’s getpass module, ensuring sensitive data isn’t displayed. It then sets this key as an environment variable (ANTHROPIC_API_KEY) and confirms successful storage.
    import os
    import json
    import requests
    from typing import Dict, List, Any, Callable, Optional, Union
    from dataclasses import dataclass, field
    import networkx as nx
    import matplotlib.pyplot as plt
    from IPython.display import display, HTML, clear_output
    We import essential libraries for building and visualizing structured AI workflows. It includes modules for handling data (json, requests, dataclasses), graph creation and visualization (networkx, matplotlib), interactive notebook display (IPython.display), and type annotations (typing) for clarity and maintainability.
    try:
        import anthropic
    except ImportError:
        print("Installing anthropic package...")
        !pip install -q anthropic
        import anthropic

    from anthropic import Anthropic
    We ensure the anthropic Python package is available for use. It attempts to import the module and, if not found, automatically installs it using pip in a Google Colab environment. After installation, it imports the Anthropic client, essential for interacting with Claude models via the Anthropic API.
    @dataclass
    class NodeConfig:
        name: str
        function: Callable
        inputs: List[str] = field(default_factory=list)
        outputs: List[str] = field(default_factory=list)
        config: Dict[str, Any] = field(default_factory=dict)
    This NodeConfig data class defines the structure of each node in the LangGraph workflow. Each node has a name, an executable function, optional inputs and outputs, and an optional config dictionary to store additional parameters. This setup allows for modular, reusable node definitions for graph-based AI tasks.
    class LangGraph:
        def __init__(self, api_key: Optional[str] = None):
            self.api_key = api_key or os.environ.get("ANTHROPIC_API_KEY")
            if not self.api_key:
                from google.colab import userdata
                try:
                    self.api_key = userdata.get('ANTHROPIC_API_KEY')
                    if not self.api_key:
                        raise ValueError("No API key found")
                except:
                    print("No Anthropic API key found in environment variables or Colab secrets.")
                    self.api_key = input("Please enter your Anthropic API key: ")
                    if not self.api_key:
                        raise ValueError("Please provide an Anthropic API key")
            self.client = Anthropic(api_key=self.api_key)
            self.graph = nx.DiGraph()
            self.nodes = {}
            self.state = {}

        def add_node(self, node_config: NodeConfig):
            self.nodes[node_config.name] = node_config
            self.graph.add_node(node_config.name)
            for input_node in node_config.inputs:
                if input_node in self.nodes:
                    self.graph.add_edge(input_node, node_config.name)
            return self

        def claude_node(self, name: str, prompt_template: str, model: str = "claude-3-7-sonnet-20250219",
                        inputs: List[str] = None, outputs: List[str] = None, system_prompt: str = None):
            """Convenience method to create a Claude API node"""
            inputs = inputs or []
            outputs = outputs or [name + "_response"]

            def claude_fn(state, **kwargs):
                prompt = prompt_template
                for k, v in state.items():
                    if isinstance(v, str):
                        prompt = prompt.replace(f"{{{k}}}", v)
                message_params = {
                    "model": model,
                    "max_tokens": 1000,
                    "messages": [{"role": "user", "content": prompt}]
                }
                if system_prompt:
                    message_params["system"] = system_prompt
                response = self.client.messages.create(**message_params)
                return response.content[0].text

            node_config = NodeConfig(name=name, function=claude_fn, inputs=inputs, outputs=outputs,
                                     config={"model": model, "prompt_template": prompt_template})
            return self.add_node(node_config)

        def transform_node(self, name: str, transform_fn: Callable, inputs: List[str] = None, outputs: List[str] = None):
            """Add a data transformation node"""
            inputs = inputs or []
            outputs = outputs or [name + "_output"]
            node_config = NodeConfig(name=name, function=transform_fn, inputs=inputs, outputs=outputs)
            return self.add_node(node_config)

        def visualize(self):
            """Visualize the graph"""
            plt.figure(figsize=(10, 6))
            pos = nx.spring_layout(self.graph)
            nx.draw(self.graph, pos, with_labels=True, node_color="lightblue",
                    node_size=1500, arrowsize=20, font_size=10)
            plt.title("LangGraph Flow")
            plt.tight_layout()
            plt.show()
            print("\nGraph Structure:")
            for node in self.graph.nodes():
                successors = list(self.graph.successors(node))
                if successors:
                    print(f"  {node} → {', '.join(successors)}")
                else:
                    print(f"  {node} (endpoint)")
            print()

        def _get_execution_order(self):
            """Determine execution order based on dependencies"""
            try:
                return list(nx.topological_sort(self.graph))
            except nx.NetworkXUnfeasible:
                raise ValueError("Graph contains a cycle")

        def execute(self, initial_state: Dict[str, Any] = None):
            """Execute the graph in topological order"""
            self.state = initial_state or {}
            execution_order = self._get_execution_order()
            print("Executing LangGraph flow:")
            for node_name in execution_order:
                print(f"- Running node: {node_name}")
                node = self.nodes[node_name]
                inputs = {k: self.state.get(k) for k in node.inputs if k in self.state}
                result = node.function(self.state, **inputs)
                if len(node.outputs) == 1:
                    self.state[node.outputs[0]] = result
                elif isinstance(result, (list, tuple)) and len(result) == len(node.outputs):
                    for i, output_name in enumerate(node.outputs):
                        self.state[output_name] = result[i]
            print("Execution completed!")
            return self.state

    def run_example(question="What are the key benefits of using a graph-based architecture for AI workflows?"):
        """Run an example LangGraph flow with a predefined question"""
        print(f"Running example with question: '{question}'")
        graph = LangGraph()

        def question_provider(state, **kwargs):
            return question

        graph.transform_node(name="question_provider", transform_fn=question_provider, outputs=["user_question"])
        graph.claude_node(name="question_answerer",
                          prompt_template="Answer this question clearly and concisely: {user_question}",
                          inputs=["user_question"], outputs=["answer"],
                          system_prompt="You are a helpful AI assistant.")
        graph.claude_node(name="answer_analyzer",
                          prompt_template="Analyze if this answer addresses the question well: Question: {user_question}\nAnswer: {answer}",
                          inputs=["user_question", "answer"], outputs=["analysis"],
                          system_prompt="You are a critical evaluator. Be brief but thorough.")
        graph.visualize()
        result = graph.execute()

        print("\n" + "=" * 50)
        print("EXECUTION RESULTS:")
        print("=" * 50)
        print(f"\n🔍 QUESTION:\n{result.get('user_question')}\n")
        print(f"📝 ANSWER:\n{result.get('answer')}\n")
        print(f"✅ ANALYSIS:\n{result.get('analysis')}")
        print("=" * 50 + "\n")
        return graph
    The LangGraph class implements a lightweight framework for constructing and executing graph-based AI workflows using Claude from Anthropic. It allows users to define modular nodes, either Claude-powered prompts or custom transformation functions, connect them via dependencies, visualize the entire pipeline, and execute them in topological order. The run_example function demonstrates this by building a simple question-answering and evaluation flow, showcasing the clarity and modularity of LangGraph’s architecture.
    def run_advanced_example():
        """Run a more advanced example with multiple nodes for content generation"""
        graph = LangGraph()

        def topic_selector(state, **kwargs):
            return "Graph-based AI systems"

        graph.transform_node(name="topic_selector", transform_fn=topic_selector, outputs=["topic"])
        graph.claude_node(name="outline_generator",
                          prompt_template="Create a brief outline for a technical blog post about {topic}. Include 3-4 main sections only.",
                          inputs=["topic"], outputs=["outline"],
                          system_prompt="You are a technical writer specializing in AI technologies.")
        graph.claude_node(name="intro_writer",
                          prompt_template="Write an engaging introduction for a blog post with this outline: {outline}\nTopic: {topic}",
                          inputs=["topic", "outline"], outputs=["introduction"],
                          system_prompt="You are a technical writer. Write in a clear, engaging style.")
        graph.claude_node(name="conclusion_writer",
                          prompt_template="Write a conclusion for a blog post with this outline: {outline}\nTopic: {topic}",
                          inputs=["topic", "outline"], outputs=["conclusion"],
                          system_prompt="You are a technical writer. Summarize key points and include a forward-looking statement.")

        def assembler(state, introduction, outline, conclusion, **kwargs):
            return f"# {state['topic']}\n\n{introduction}\n\n## Outline\n{outline}\n\n## Conclusion\n{conclusion}"

        graph.transform_node(name="content_assembler", transform_fn=assembler,
                             inputs=["topic", "introduction", "outline", "conclusion"],
                             outputs=["final_content"])
        graph.visualize()
        result = graph.execute()

        print("\n" + "=" * 50)
        print("BLOG POST GENERATED:")
        print("=" * 50 + "\n")
        print(result.get("final_content"))
        print("\n" + "=" * 50)
        return graph
    The run_advanced_example function showcases a more sophisticated use of LangGraph by orchestrating multiple Claude-powered nodes to generate a complete blog post. It starts by selecting a topic, then creates an outline, an introduction, and a conclusion, all using structured Claude prompts. Finally, a transformation node assembles the content into a formatted blog post. This example demonstrates how LangGraph can automate complex, multi-step content generation tasks using modular, connected nodes in a clear and executable flow.
    print("1. Running simple question-answering example")
    question = "What are the three main advantages of using graph-based AI architectures?"
    simple_graph = run_example(question)

    print("\n2. Running advanced blog post creation example")
    advanced_graph = run_advanced_example()
    Finally, we trigger the execution of both defined LangGraph workflows. First, it runs the simple question-answering example by passing a predefined question to the run_example() function. Then, it initiates the more advanced blog post generation workflow using run_advanced_example(). Together, these calls demonstrate the practical flexibility of LangGraph, from basic prompt-based interactions to multi-step content automation using Anthropic’s Claude API.
    In conclusion, we have implemented LangGraph integrated with Anthropic’s Claude API, which illustrates the ease of designing modular AI workflows that leverage powerful language models in structured, graph-based pipelines. Through visualizing task flows and separating responsibilities among nodes, such as question processing, analytical evaluation, content outlining, and assembly, developers gain practical experience in building maintainable, scalable AI systems. LangGraph’s clear node dependencies and Claude’s sophisticated language capabilities provide an efficient solution for orchestrating complex AI processes, especially for rapid prototyping and execution in environments like Google Colab.

    Check out the Colab Notebook.