• Step-by-Step Guide to Build a Customizable Multi-Tool AI Agent with LangGraph and Claude for Dynamic Agent Creation

    In this comprehensive tutorial, we guide users through creating a powerful multi-tool AI agent using LangGraph and Claude, optimized for diverse tasks including mathematical computations, web searches, weather inquiries, text analysis, and real-time information retrieval. The tutorial begins by simplifying dependency installation to ensure an effortless setup, even for beginners. Users are then introduced to structured implementations of specialized tools, such as a safe calculator, an efficient web-search utility leveraging DuckDuckGo, a mock weather information provider, a detailed text analyzer, and a time-fetching function. The tutorial also clearly delineates how these tools are integrated within a sophisticated agent architecture built using LangGraph, illustrating practical usage through interactive examples and clear explanations, so that both beginners and advanced developers can rapidly deploy custom multi-functional AI agents.
    import subprocess
    import sys

def install_packages():
    packages = [
        "langgraph", "langchain", "langchain-anthropic", "langchain-community",
        "requests", "python-dotenv", "duckduckgo-search"
    ]
    for package in packages:
        try:
            subprocess.check_call([sys.executable, "-m", "pip", "install", package, "-q"])
            print(f"✓ Installed {package}")
        except subprocess.CalledProcessError:
            print(f"✗ Failed to install {package}")

print("Installing required packages...")
install_packages()
print("Installation complete!\n")

We automate the installation of the essential Python packages required for building a LangGraph-based multi-tool AI agent. The function uses subprocess to run pip commands quietly and reports whether each package, ranging from the LangChain components to web search and environment handling tools, installed successfully. This setup streamlines environment preparation, making the notebook portable and beginner-friendly.
    import os
    import json
    import math
    import requests
    from typing import Dict, List, Any, Annotated, TypedDict
    from datetime import datetime
    import operator

    from langchain_core.messages import BaseMessage, HumanMessage, AIMessage, ToolMessage
    from langchain_core.tools import tool
    from langchain_anthropic import ChatAnthropic
    from langgraph.graph import StateGraph, START, END
    from langgraph.prebuilt import ToolNode
    from langgraph.checkpoint.memory import MemorySaver
    from duckduckgo_search import DDGS
    We import all the necessary libraries and modules for constructing the multi-tool AI agent. It includes Python standard libraries such as os, json, math, and datetime for general-purpose functionality and external libraries like requests for HTTP calls and duckduckgo_search for implementing web search. The LangChain and LangGraph ecosystems bring in message types, tool decorators, state graph components, and checkpointing utilities, while ChatAnthropic enables integration with the Claude model for conversational intelligence. These imports form the foundational building blocks for defining tools, agent workflows, and interactions.
os.environ["ANTHROPIC_API_KEY"] = "Use Your API Key Here"

ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY")

We set and retrieve the Anthropic API key required to authenticate and interact with Claude models. The os.environ line assigns your API key (replace the placeholder with a valid key), while os.getenv retrieves it for later use in model initialization. This approach ensures the key is accessible throughout the script without hardcoding it multiple times.
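Since python-dotenv is already among the installed packages, an alternative (and arguably safer) pattern is to keep the key in a local .env file and load it at runtime, so it never appears in the notebook itself. This is an optional sketch, assuming a .env file containing a line like ANTHROPIC_API_KEY=your-key:

# Optional: load the key from a local .env file instead of hardcoding it above.
from dotenv import load_dotenv

load_dotenv()  # picks up a .env file in the working directory, if one exists
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY")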
from typing import TypedDict

class AgentState(TypedDict):
    messages: Annotated[List[BaseMessage], operator.add]

@tool
def calculator(expression: str) -> str:
    """
    Perform mathematical calculations. Supports basic arithmetic, trigonometry, and more.

    Args:
        expression: Mathematical expression as a string (e.g., "2 + 3 * 4", "sin(3.14159/2)")

    Returns:
        Result of the calculation as a string
    """
    try:
        allowed_names = {
            'abs': abs, 'round': round, 'min': min, 'max': max,
            'sum': sum, 'pow': pow, 'sqrt': math.sqrt,
            'sin': math.sin, 'cos': math.cos, 'tan': math.tan,
            'log': math.log, 'log10': math.log10, 'exp': math.exp,
            'pi': math.pi, 'e': math.e
        }

        expression = expression.replace('^', '**')
        result = eval(expression, {"__builtins__": {}}, allowed_names)
        return f"Result: {result}"
    except Exception as e:
        return f"Error in calculation: {str(e)}"
    We define the agent’s internal state and implement a robust calculator tool. The AgentState class uses TypedDict to structure agent memory, specifically tracking messages exchanged during the conversation. The calculator function, decorated with @tool to register it as an AI-usable utility, securely evaluates mathematical expressions. It allows for safe computation by limiting available functions to a predefined set from the math module and replacing common syntax like ^ with Python’s exponentiation operator. This ensures the tool can handle simple arithmetic and advanced functions like trigonometry or logarithms while preventing unsafe code execution.
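Because @tool wraps the function as a LangChain tool, you can sanity-check it on its own before wiring it into the graph. A minimal check, assuming the calculator tool defined above:

# Quick standalone check of the calculator tool (LangChain tools accept a dict of args).
print(calculator.invoke({"expression": "sqrt(144) + 2^3"}))   # expected: "Result: 20.0"
print(calculator.invoke({"expression": "__import__('os')"}))  # blocked builtins -> error message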
@tool
def web_search(query: str, num_results: int = 3) -> str:
    """
    Search the web for information using DuckDuckGo.

    Args:
        query: Search query string
        num_results: Number of results to return (default: 3, max: 10)

    Returns:
        Search results as formatted string
    """
    try:
        num_results = min(max(num_results, 1), 10)

        with DDGS() as ddgs:
            results = list(ddgs.text(query, max_results=num_results))

        if not results:
            return f"No search results found for: {query}"

        formatted_results = f"Search results for '{query}':\n\n"
        for i, result in enumerate(results, 1):
            formatted_results += f"{i}. **{result['title']}**\n"
            formatted_results += f"   {result['body']}\n"
            formatted_results += f"   Source: {result['href']}\n\n"

        return formatted_results
    except Exception as e:
        return f"Error performing web search: {str(e)}"
    We define a web_search tool that enables the agent to fetch real-time information from the internet using the DuckDuckGo Search API via the duckduckgo_search Python package. The tool accepts a search query and an optional num_results parameter, ensuring that the number of results returned is between 1 and 10. It opens a DuckDuckGo search session, retrieves the results, and formats them neatly for user-friendly display. If no results are found or an error occurs, the function handles it gracefully by returning an informative message. This tool equips the agent with real-time search capabilities, enhancing responsiveness and utility.
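The same direct-invocation pattern works for spot-checking the search tool; note that this performs a live network request, so results will vary and the call may fail offline:

# Live sanity check (requires internet access; output changes over time).
print(web_search.invoke({"query": "LangGraph tutorial", "num_results": 2}))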
@tool
def weather_info(city: str) -> str:
    """
    Get current weather information for a city using OpenWeatherMap API.
    Note: This is a mock implementation for demo purposes.

    Args:
        city: Name of the city

    Returns:
        Weather information as a string
    """
    mock_weather = {
        "new york": {"temp": 22, "condition": "Partly Cloudy", "humidity": 65},
        "london": {"temp": 15, "condition": "Rainy", "humidity": 80},
        "tokyo": {"temp": 28, "condition": "Sunny", "humidity": 70},
        "paris": {"temp": 18, "condition": "Overcast", "humidity": 75}
    }

    city_lower = city.lower()
    if city_lower in mock_weather:
        weather = mock_weather[city_lower]
        return f"Weather in {city}:\n" \
               f"Temperature: {weather['temp']}°C\n" \
               f"Condition: {weather['condition']}\n" \
               f"Humidity: {weather['humidity']}%"
    else:
        return f"Weather data not available for {city}. (This is a demo with limited cities: New York, London, Tokyo, Paris)"
    We define a weather_info tool that simulates retrieving current weather data for a given city. While it does not connect to a live weather API, it uses a predefined dictionary of mock data for major cities like New York, London, Tokyo, and Paris. Upon receiving a city name, the function normalizes it to lowercase and checks for its presence in the mock dataset. It returns temperature, weather condition, and humidity in a readable format if found. Otherwise, it notifies the user that weather data is unavailable. This tool serves as a placeholder and can later be upgraded to fetch live data from an actual weather API.
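To upgrade the mock to live data later, one possible replacement (a sketch, not part of the original tutorial) calls the OpenWeatherMap current-weather endpoint with requests; the OPENWEATHER_API_KEY environment variable used here is an assumption you would need to provide yourself.

@tool
def weather_info_live(city: str) -> str:
    """Fetch current weather for a city from OpenWeatherMap (sketch; requires an API key)."""
    api_key = os.getenv("OPENWEATHER_API_KEY")  # assumed env var; set it yourself
    if not api_key:
        return "Set OPENWEATHER_API_KEY to use live weather data."
    url = "https://api.openweathermap.org/data/2.5/weather"
    try:
        resp = requests.get(url, params={"q": city, "appid": api_key, "units": "metric"}, timeout=10)
        resp.raise_for_status()
        data = resp.json()
        return (f"Weather in {city}:\n"
                f"Temperature: {data['main']['temp']}°C\n"
                f"Condition: {data['weather'][0]['description'].title()}\n"
                f"Humidity: {data['main']['humidity']}%")
    except Exception as e:
        return f"Error fetching weather for {city}: {str(e)}"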
@tool
def text_analyzer(text: str) -> str:
    """
    Analyze text and provide statistics like word count, character count, etc.

    Args:
        text: Text to analyze

    Returns:
        Text analysis results
    """
    if not text.strip():
        return "Please provide text to analyze."

    words = text.split()
    sentences = text.split('.') + text.split('!') + text.split('?')
    sentences = [s.strip() for s in sentences if s.strip()]

    analysis = f"Text Analysis Results:\n"
    analysis += f"• Characters (with spaces): {len(text)}\n"
    analysis += f"• Characters (without spaces): {len(text.replace(' ', ''))}\n"
    analysis += f"• Words: {len(words)}\n"
    analysis += f"• Sentences: {len(sentences)}\n"
    analysis += f"• Average words per sentence: {len(words) / max(len(sentences), 1):.1f}\n"
    analysis += f"• Most common word: {max(set(words), key=words.count) if words else 'N/A'}"

    return analysis
    The text_analyzer tool provides a detailed statistical analysis of a given text input. It calculates metrics such as character count, word count, sentence count, and average words per sentence, and it identifies the most frequently occurring word. The tool handles empty input gracefully by prompting the user to provide valid text. It uses simple string operations and Python’s set and max functions to extract meaningful insights. It is a valuable utility for language analysis or content quality checks in the AI agent’s toolkit.
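One caveat: max(set(words), key=words.count) rescans the word list for every candidate and treats "Word" and "word." as different tokens. If you need something sturdier, a small optional refinement (not in the original notebook) uses collections.Counter with light normalization:

from collections import Counter
import string

def most_common_word(text: str) -> str:
    # Lowercase and strip punctuation so "Word," and "word" count as the same token.
    words = [w.strip(string.punctuation).lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return "N/A"
    word, count = Counter(words).most_common(1)[0]
    return f"{word} ({count}x)"

print(most_common_word("LangGraph is great. LangGraph is flexible!"))  # langgraph (2x)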
@tool
def current_time() -> str:
    """
    Get the current date and time.

    Returns:
        Current date and time as a formatted string
    """
    now = datetime.now()
    return f"Current date and time: {now.strftime('%Y-%m-%d %H:%M:%S')}"
    The current_time tool provides a straightforward way to retrieve the current system date and time in a human-readable format. Using Python’s datetime module, it captures the present moment and formats it as YYYY-MM-DD HH:MM:SS. This utility is particularly useful for time-stamping responses or answering user queries about the current date and time within the AI agent’s interaction flow.
tools = [calculator, web_search, weather_info, text_analyzer, current_time]

def create_llm():
    if ANTHROPIC_API_KEY:
        return ChatAnthropic(
            model="claude-3-haiku-20240307",
            temperature=0.1,
            max_tokens=1024
        )
    else:
        class MockLLM:
            def invoke(self, messages):
                last_message = messages[-1].content if messages else ""

                if any(word in last_message.lower() for word in ['calculate', 'math', '+', '-', '*', '/', 'sqrt', 'sin', 'cos']):
                    import re
                    numbers = re.findall(r'[\d\+\-\*/\.\(\)\s\w]+', last_message)
                    expr = numbers[0] if numbers else "2+2"
                    return AIMessage(content="I'll help you with that calculation.",
                                     tool_calls=[{"name": "calculator", "args": {"expression": expr.strip()}, "id": "calc1"}])
                elif any(word in last_message.lower() for word in ['search', 'find', 'look up', 'information about']):
                    query = last_message.replace('search for', '').replace('find', '').replace('look up', '').strip()
                    if not query or len(query) < 3:
                        query = "python programming"
                    return AIMessage(content="I'll search for that information.",
                                     tool_calls=[{"name": "web_search", "args": {"query": query}, "id": "search1"}])
                elif any(word in last_message.lower() for word in ['weather', 'temperature']):
                    city = "New York"
                    words = last_message.lower().split()
                    for i, word in enumerate(words):
                        if word == 'in' and i + 1 < len(words):
                            city = words[i + 1].title()
                            break
                    return AIMessage(content="I'll get the weather information.",
                                     tool_calls=[{"name": "weather_info", "args": {"city": city}, "id": "weather1"}])
                elif any(word in last_message.lower() for word in ['time', 'date']):
                    return AIMessage(content="I'll get the current time.",
                                     tool_calls=[{"name": "current_time", "args": {}, "id": "time1"}])
                elif any(word in last_message.lower() for word in ['analyze', 'analysis']):
                    text = last_message.replace('analyze this text:', '').replace('analyze', '').strip()
                    if not text:
                        text = "Sample text for analysis"
                    return AIMessage(content="I'll analyze that text for you.",
                                     tool_calls=[{"name": "text_analyzer", "args": {"text": text}, "id": "analyze1"}])
                else:
                    return AIMessage(content="Hello! I'm a multi-tool agent powered by Claude. I can help with:\n• Mathematical calculations\n• Web searches\n• Weather information\n• Text analysis\n• Current time/date\n\nWhat would you like me to help you with?")

            def bind_tools(self, tools):
                return self

        print("⚠️ Note: Using mock LLM for demo. Add your ANTHROPIC_API_KEY for full functionality.")
        return MockLLM()

llm = create_llm()
llm_with_tools = llm.bind_tools(tools)

We initialize the language model that powers the AI agent. If a valid Anthropic API key is available, it uses the Claude 3 Haiku model for high-quality responses. Without an API key, a MockLLM is defined to simulate basic tool-routing behavior based on keyword matching, allowing the agent to function offline with limited capabilities. The bind_tools method links the defined tools to the model, enabling it to invoke them as needed.
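With a real API key, it can be reassuring to confirm that the tools are actually bound before building the graph, by invoking the wrapped model directly and inspecting the structured tool_calls it returns. This quick check is an addition to the original tutorial and is only meaningful with the real Claude model, since the MockLLM's bind_tools is a no-op.

# Optional sanity check (only meaningful when ANTHROPIC_API_KEY is set).
probe = llm_with_tools.invoke([HumanMessage(content="What is 15 * 7 + 23?")])
print(getattr(probe, "tool_calls", []))  # expect something like: [{'name': 'calculator', 'args': {...}, 'id': '...'}]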
def agent_node(state: AgentState) -> Dict[str, Any]:
    """Main agent node that processes messages and decides on tool usage."""
    messages = state["messages"]
    response = llm_with_tools.invoke(messages)
    return {"messages": [response]}

def should_continue(state: AgentState) -> str:
    """Determine whether to continue with tool calls or end."""
    last_message = state["messages"][-1]
    if hasattr(last_message, 'tool_calls') and last_message.tool_calls:
        return "tools"
    return END
    We define the agent’s core decision-making logic. The agent_node function handles incoming messages, invokes the language model, and returns the model’s response. The should_continue function then evaluates whether the model’s response includes tool calls. If so, it routes control to the tool execution node; otherwise, it directs the flow to end the interaction. These functions enable dynamic and conditional transitions within the agent’s workflow.
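To make the routing rule concrete, here is a small illustrative check (not part of the original notebook) using hand-built messages: a plain assistant reply ends the turn, while a reply carrying tool_calls is routed to the tool node.

plain = AIMessage(content="The answer is 128.")
with_tool = AIMessage(content="", tool_calls=[{"name": "calculator", "args": {"expression": "15*7+23"}, "id": "c1"}])

print(should_continue({"messages": [plain]}))      # END -> finish the turn
print(should_continue({"messages": [with_tool]}))  # "tools" -> execute the requested tool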
def create_agent_graph():
    tool_node = ToolNode(tools)

    workflow = StateGraph(AgentState)

    workflow.add_node("agent", agent_node)
    workflow.add_node("tools", tool_node)

    workflow.add_edge(START, "agent")
    workflow.add_conditional_edges("agent", should_continue, {"tools": "tools", END: END})
    workflow.add_edge("tools", "agent")

    memory = MemorySaver()
    app = workflow.compile(checkpointer=memory)

    return app

print("Creating LangGraph Multi-Tool Agent...")
agent = create_agent_graph()
print("✓ Agent created successfully!\n")

We construct the LangGraph-powered workflow that defines the AI agent’s operational structure. It initializes a ToolNode to handle tool executions and uses a StateGraph to organize the flow between agent decisions and tool usage. Nodes and edges are added to manage transitions: starting with the agent, conditionally routing to tools, and looping back as needed. A MemorySaver is integrated for persistent state tracking across turns. The graph is compiled into an executable application, enabling a structured, memory-aware multi-tool agent ready for deployment.
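If you want to double-check the wiring, recent LangGraph releases can render the compiled topology; a minimal sketch, assuming the installed version exposes get_graph() on the compiled app:

# Print a Mermaid diagram of the agent -> tools loop (assumes a recent langgraph version).
print(agent.get_graph().draw_mermaid())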
def test_agent():
    """Test the agent with various queries."""
    config = {"configurable": {"thread_id": "test-thread"}}

    test_queries = [
        "What's 15 * 7 + 23?",
        "Search for information about Python programming",
        "What's the weather like in Tokyo?",
        "What time is it?",
        "Analyze this text: 'LangGraph is an amazing framework for building AI agents.'"
    ]

    print("🧪 Testing the agent with sample queries...\n")

    for i, query in enumerate(test_queries, 1):
        print(f"Query {i}: {query}")
        print("-" * 50)

        try:
            response = agent.invoke(
                {"messages": [HumanMessage(content=query)]},
                config=config
            )

            last_message = response["messages"][-1]
            print(f"Response: {last_message.content}\n")
        except Exception as e:
            print(f"Error: {str(e)}\n")
    The test_agent function is a validation utility that ensures the LangGraph agent responds correctly across different use cases. It runs predefined queries covering arithmetic, web search, weather, time, and text analysis, and prints the agent’s responses. Using a consistent thread_id for configuration, it invokes the agent with each query. It neatly displays the results, helping developers verify tool integration and conversational logic before moving to interactive or production use.
def chat_with_agent():
    """Interactive chat function."""
    config = {"configurable": {"thread_id": "interactive-thread"}}

    print("🤖 Multi-Tool Agent Chat")
    print("Available tools: Calculator, Web Search, Weather Info, Text Analyzer, Current Time")
    print("Type 'quit' to exit, 'help' for available commands\n")

    while True:
        try:
            user_input = input("You: ").strip()

            if user_input.lower() in ['quit', 'exit', 'q']:
                print("Goodbye!")
                break
            elif user_input.lower() == 'help':
                print("\nAvailable commands:")
                print("• Calculator: 'Calculate 15 * 7 + 23' or 'What's sin(pi/2)?'")
                print("• Web Search: 'Search for Python tutorials' or 'Find information about AI'")
                print("• Weather: 'Weather in Tokyo' or 'What's the temperature in London?'")
                print("• Text Analysis: 'Analyze this text: [your text]'")
                print("• Current Time: 'What time is it?' or 'Current date'")
                print("• quit: Exit the chat\n")
                continue
            elif not user_input:
                continue

            response = agent.invoke(
                {"messages": [HumanMessage(content=user_input)]},
                config=config
            )

            last_message = response["messages"][-1]
            print(f"Agent: {last_message.content}\n")
        except KeyboardInterrupt:
            print("\nGoodbye!")
            break
        except Exception as e:
            print(f"Error: {str(e)}\n")
    The chat_with_agent function provides an interactive command-line interface for real-time conversations with the LangGraph multi-tool agent. It supports natural language queries and recognizes commands like “help” for usage guidance and “quit” to exit. Each user input is processed through the agent, which dynamically selects and invokes appropriate response tools. The function enhances user engagement by simulating a conversational experience and showcasing the agent’s capabilities in handling various queries, from math and web search to weather, text analysis, and time retrieval.
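For longer tool chains, it can be useful to surface intermediate steps (the model's tool requests and the tool results) as they happen, rather than only the final answer. The helper below is a sketch built on LangGraph's streaming interface; the stream_mode="values" argument is assumed to be available in the installed version.

def chat_with_agent_streaming(user_input: str, thread_id: str = "interactive-thread"):
    """Stream each state update (agent replies and tool results) as the graph runs."""
    config = {"configurable": {"thread_id": thread_id}}
    for step in agent.stream({"messages": [HumanMessage(content=user_input)]},
                             config=config, stream_mode="values"):
        last = step["messages"][-1]
        print(f"[{type(last).__name__}] {last.content}")

# Example: chat_with_agent_streaming("What's the weather in Paris and what time is it?")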
if __name__ == "__main__":
    test_agent()

    print("=" * 60)
    print("🎉 LangGraph Multi-Tool Agent is ready!")
    print("=" * 60)

    chat_with_agent()

def quick_demo():
    """Quick demonstration of agent capabilities."""
    config = {"configurable": {"thread_id": "demo"}}

    demos = [
        ("Math", "Calculate the square root of 144 plus 5 times 3"),
        ("Search", "Find recent news about artificial intelligence"),
        ("Time", "What's the current date and time?")
    ]

    print("🚀 Quick Demo of Agent Capabilities\n")

    for category, query in demos:
        print(f"[{category}] Query: {query}")
        try:
            response = agent.invoke(
                {"messages": [HumanMessage(content=query)]},
                config=config
            )
            print(f"Response: {response['messages'][-1].content}\n")
        except Exception as e:
            print(f"Error: {str(e)}\n")

print("\n" + "="*60)
print("🔧 Usage Instructions:")
print("1. Add your ANTHROPIC_API_KEY to use Claude model")
print("   os.environ['ANTHROPIC_API_KEY'] = 'your-anthropic-api-key'")
print("2. Run quick_demo() for a quick demonstration")
print("3. Run chat_with_agent() for interactive chat")
print("4. The agent supports: calculations, web search, weather, text analysis, and time")
print("5. Example: 'Calculate 15*7+23' or 'Search for Python tutorials'")
print("="*60)

Finally, we orchestrate the execution of the LangGraph multi-tool agent. If the script is run directly, it initiates test_agent() to validate functionality with sample queries, followed by launching the interactive chat_with_agent() mode for real-time interaction. The quick_demo() function also briefly showcases the agent’s capabilities in math, search, and time queries. Clear usage instructions are printed at the end, guiding users on configuring the API key, running demonstrations, and interacting with the agent. This provides a smooth onboarding experience for users to explore and extend the agent’s functionality.
    In conclusion, this step-by-step tutorial gives valuable insights into building an effective multi-tool AI agent leveraging LangGraph and Claude’s generative capabilities. With straightforward explanations and hands-on demonstrations, the guide empowers users to integrate diverse utilities into a cohesive and interactive system. The agent’s flexibility in performing tasks, from complex calculations to dynamic information retrieval, showcases the versatility of modern AI development frameworks. Also, the inclusion of user-friendly functions for both testing and interactive chat enhances practical understanding, enabling immediate application in various contexts. Developers can confidently extend and customize their AI agents with this foundational knowledge.

    #stepbystep #guide #build #customizable #multitool
    Step-by-Step Guide to Build a Customizable Multi-Tool AI Agent with LangGraph and Claude for Dynamic Agent Creation
    In this comprehensive tutorial, we guide users through creating a powerful multi-tool AI agent using LangGraph and Claude, optimized for diverse tasks including mathematical computations, web searches, weather inquiries, text analysis, and real-time information retrieval. It begins by simplifying dependency installations to ensure effortless setup, even for beginners. Users are then introduced to structured implementations of specialized tools, such as a safe calculator, an efficient web-search utility leveraging DuckDuckGo, a mock weather information provider, a detailed text analyzer, and a time-fetching function. The tutorial also clearly delineates the integration of these tools within a sophisticated agent architecture built using LangGraph, illustrating practical usage through interactive examples and clear explanations, facilitating both beginners and advanced developers to deploy custom multi-functional AI agents rapidly. import subprocess import sys def install_packages: packages =for package in packages: try: subprocess.check_callprintexcept subprocess.CalledProcessError: printprintinstall_packagesprintWe automate the installation of essential Python packages required for building a LangGraph-based multi-tool AI agent. It leverages a subprocess to run pip commands silently and ensures each package, ranging from long-chain components to web search and environment handling tools, is installed successfully. This setup streamlines the environment preparation process, making the notebook portable and beginner-friendly. import os import json import math import requests from typing import Dict, List, Any, Annotated, TypedDict from datetime import datetime import operator from langchain_core.messages import BaseMessage, HumanMessage, AIMessage, ToolMessage from langchain_core.tools import tool from langchain_anthropic import ChatAnthropic from langgraph.graph import StateGraph, START, END from langgraph.prebuilt import ToolNode from langgraph.checkpoint.memory import MemorySaver from duckduckgo_search import DDGS We import all the necessary libraries and modules for constructing the multi-tool AI agent. It includes Python standard libraries such as os, json, math, and datetime for general-purpose functionality and external libraries like requests for HTTP calls and duckduckgo_search for implementing web search. The LangChain and LangGraph ecosystems bring in message types, tool decorators, state graph components, and checkpointing utilities, while ChatAnthropic enables integration with the Claude model for conversational intelligence. These imports form the foundational building blocks for defining tools, agent workflows, and interactions. os.environ= "Use Your API Key Here" ANTHROPIC_API_KEY = os.getenvWe set and retrieve the Anthropic API key required to authenticate and interact with Claude models. The os.environ line assigns your API key, while os.getenv securely retrieves it for later use in model initialization. This approach ensures the key is accessible throughout the script without hardcoding it multiple times. from typing import TypedDict class AgentState: messages: Annotated, operator.add] @tool def calculator-> str: """ Perform mathematical calculations. Supports basic arithmetic, trigonometry, and more. 
Args: expression: Mathematical expression as a string") Returns: Result of the calculation as a string """ try: allowed_names = { 'abs': abs, 'round': round, 'min': min, 'max': max, 'sum': sum, 'pow': pow, 'sqrt': math.sqrt, 'sin': math.sin, 'cos': math.cos, 'tan': math.tan, 'log': math.log, 'log10': math.log10, 'exp': math.exp, 'pi': math.pi, 'e': math.e } expression = expression.replaceresult = evalreturn f"Result: {result}" except Exception as e: return f"Error in calculation: {str}" We define the agent’s internal state and implement a robust calculator tool. The AgentState class uses TypedDict to structure agent memory, specifically tracking messages exchanged during the conversation. The calculator function, decorated with @tool to register it as an AI-usable utility, securely evaluates mathematical expressions. It allows for safe computation by limiting available functions to a predefined set from the math module and replacing common syntax like ^ with Python’s exponentiation operator. This ensures the tool can handle simple arithmetic and advanced functions like trigonometry or logarithms while preventing unsafe code execution. @tool def web_search-> str: """ Search the web for information using DuckDuckGo. Args: query: Search query string num_results: Number of results to returnReturns: Search results as formatted string """ try: num_results = min, 10) with DDGSas ddgs: results = list) if not results: return f"No search results found for: {query}" formatted_results = f"Search results for '{query}':\n\n" for i, result in enumerate: formatted_results += f"{i}. **{result}**\n" formatted_results += f" {result}\n" formatted_results += f" Source: {result}\n\n" return formatted_results except Exception as e: return f"Error performing web search: {str}" We define a web_search tool that enables the agent to fetch real-time information from the internet using the DuckDuckGo Search API via the duckduckgo_search Python package. The tool accepts a search query and an optional num_results parameter, ensuring that the number of results returned is between 1 and 10. It opens a DuckDuckGo search session, retrieves the results, and formats them neatly for user-friendly display. If no results are found or an error occurs, the function handles it gracefully by returning an informative message. This tool equips the agent with real-time search capabilities, enhancing responsiveness and utility. @tool def weather_info-> str: """ Get current weather information for a city using OpenWeatherMap API. Note: This is a mock implementation for demo purposes. Args: city: Name of the city Returns: Weather information as a string """ mock_weather = { "new york": {"temp": 22, "condition": "Partly Cloudy", "humidity": 65}, "london": {"temp": 15, "condition": "Rainy", "humidity": 80}, "tokyo": {"temp": 28, "condition": "Sunny", "humidity": 70}, "paris": {"temp": 18, "condition": "Overcast", "humidity": 75} } city_lower = city.lowerif city_lower in mock_weather: weather = mock_weatherreturn f"Weather in {city}:\n" \ f"Temperature: {weather}°C\n" \ f"Condition: {weather}\n" \ f"Humidity: {weather}%" else: return f"Weather data not available for {city}." We define a weather_info tool that simulates retrieving current weather data for a given city. While it does not connect to a live weather API, it uses a predefined dictionary of mock data for major cities like New York, London, Tokyo, and Paris. Upon receiving a city name, the function normalizes it to lowercase and checks for its presence in the mock dataset. 
It returns temperature, weather condition, and humidity in a readable format if found. Otherwise, it notifies the user that weather data is unavailable. This tool serves as a placeholder and can later be upgraded to fetch live data from an actual weather API. @tool def text_analyzer-> str: """ Analyze text and provide statistics like word count, character count, etc. Args: text: Text to analyze Returns: Text analysis results """ if not text.strip: return "Please provide text to analyze." words = text.splitsentences = text.split+ text.split+ text.splitsentences =analysis = f"Text Analysis Results:\n" analysis += f"• Characters: {len}\n" analysis += f"• Characters: {len)}\n" analysis += f"• Words: {len}\n" analysis += f"• Sentences: {len}\n" analysis += f"• Average words per sentence: {len/ max, 1):.1f}\n" analysis += f"• Most common word: {max, key=words.count) if words else 'N/A'}" return analysis The text_analyzer tool provides a detailed statistical analysis of a given text input. It calculates metrics such as character count, word count, sentence count, and average words per sentence, and it identifies the most frequently occurring word. The tool handles empty input gracefully by prompting the user to provide valid text. It uses simple string operations and Python’s set and max functions to extract meaningful insights. It is a valuable utility for language analysis or content quality checks in the AI agent’s toolkit. @tool def current_time-> str: """ Get the current date and time. Returns: Current date and time as a formatted string """ now = datetime.nowreturn f"Current date and time: {now.strftime}" The current_time tool provides a straightforward way to retrieve the current system date and time in a human-readable format. Using Python’s datetime module, it captures the present moment and formats it as YYYY-MM-DD HH:MM:SS. This utility is particularly useful for time-stamping responses or answering user queries about the current date and time within the AI agent’s interaction flow. tools =def create_llm: if ANTHROPIC_API_KEY: return ChatAnthropicelse: class MockLLM: def invoke: last_message = messages.content if messages else "" if anyfor word in): import re numbers = re.findall\s\w]+', last_message) expr = numbersif numbers else "2+2" return AIMessage}, "id": "calc1"}]) elif anyfor word in): query = last_message.replace.replace.replace.stripif not query or len< 3: query = "python programming" return AIMessageelif anyfor word in): city = "New York" words = last_message.lower.splitfor i, word in enumerate: if word == 'in' and i + 1 < len: city = words.titlebreak return AIMessageelif anyfor word in): return AIMessageelif anyfor word in): text = last_message.replace.replace.stripif not text: text = "Sample text for analysis" return AIMessageelse: return AIMessagedef bind_tools: return self printreturn MockLLMllm = create_llmllm_with_tools = llm.bind_toolsWe initialize the language model that powers the AI agent. If a valid Anthropic API key is available, it uses the Claude 3 Haiku model for high-quality responses. Without an API key, a MockLLM is defined to simulate basic tool-routing behavior based on keyword matching, allowing the agent to function offline with limited capabilities. The bind_tools method links the defined tools to the model, enabling it to invoke them as needed. 
def agent_node-> Dict: """Main agent node that processes messages and decides on tool usage.""" messages = stateresponse = llm_with_tools.invokereturn {"messages":} def should_continue-> str: """Determine whether to continue with tool calls or end.""" last_message = stateif hasattrand last_message.tool_calls: return "tools" return END We define the agent’s core decision-making logic. The agent_node function handles incoming messages, invokes the language model, and returns the model’s response. The should_continue function then evaluates whether the model’s response includes tool calls. If so, it routes control to the tool execution node; otherwise, it directs the flow to end the interaction. These functions enable dynamic and conditional transitions within the agent’s workflow. def create_agent_graph: tool_node = ToolNodeworkflow = StateGraphworkflow.add_nodeworkflow.add_nodeworkflow.add_edgeworkflow.add_conditional_edgesworkflow.add_edgememory = MemorySaverapp = workflow.compilereturn app printagent = create_agent_graphprintWe construct the LangGraph-powered workflow that defines the AI agent’s operational structure. It initializes a ToolNode to handle tool executions and uses a StateGraph to organize the flow between agent decisions and tool usage. Nodes and edges are added to manage transitions: starting with the agent, conditionally routing to tools, and looping back as needed. A MemorySaver is integrated for persistent state tracking across turns. The graph is compiled into an executable application, enabling a structured, memory-aware multi-tool agent ready for deployment. def test_agent: """Test the agent with various queries.""" config = {"configurable": {"thread_id": "test-thread"}} test_queries =printfor i, query in enumerate: printprinttry: response = agent.invoke]}, config=config ) last_message = responseprintexcept Exception as e: print}\n") The test_agent function is a validation utility that ensures that the LangGraph agent responds correctly across different use cases. It runs predefined queries, arithmetic, web search, weather, time, and text analysis, and prints the agent’s responses. Using a consistent thread_id for configuration, it invokes the agent with each query. It neatly displays the results, helping developers verify tool integration and conversational logic before moving to interactive or production use. def chat_with_agent: """Interactive chat function.""" config = {"configurable": {"thread_id": "interactive-thread"}} printprintprintwhile True: try: user_input = input.stripif user_input.lowerin: printbreak elif user_input.lower== 'help': printprint?'") printprintprintprintprintcontinue elif not user_input: continue response = agent.invoke]}, config=config ) last_message = responseprintexcept KeyboardInterrupt: printbreak except Exception as e: print}\n") The chat_with_agent function provides an interactive command-line interface for real-time conversations with the LangGraph multi-tool agent. It supports natural language queries and recognizes commands like “help” for usage guidance and “quit” to exit. Each user input is processed through the agent, which dynamically selects and invokes appropriate response tools. The function enhances user engagement by simulating a conversational experience and showcasing the agent’s capabilities in handling various queries, from math and web search to weather, text analysis, and time retrieval. 
if __name__ == "__main__": test_agentprintprintprintchat_with_agentdef quick_demo: """Quick demonstration of agent capabilities.""" config = {"configurable": {"thread_id": "demo"}} demos =printfor category, query in demos: printtry: response = agent.invoke]}, config=config ) printexcept Exception as e: print}\n") printprintprintprintprintfor a quick demonstration") printfor interactive chat") printprintprintFinally, we orchestrate the execution of the LangGraph multi-tool agent. If the script is run directly, it initiates test_agentto validate functionality with sample queries, followed by launching the interactive chat_with_agentmode for real-time interaction. The quick_demofunction also briefly showcases the agent’s capabilities in math, search, and time queries. Clear usage instructions are printed at the end, guiding users on configuring the API key, running demonstrations, and interacting with the agent. This provides a smooth onboarding experience for users to explore and extend the agent’s functionality. In conclusion, this step-by-step tutorial gives valuable insights into building an effective multi-tool AI agent leveraging LangGraph and Claude’s generative capabilities. With straightforward explanations and hands-on demonstrations, the guide empowers users to integrate diverse utilities into a cohesive and interactive system. The agent’s flexibility in performing tasks, from complex calculations to dynamic information retrieval, showcases the versatility of modern AI development frameworks. Also, the inclusion of user-friendly functions for both testing and interactive chat enhances practical understanding, enabling immediate application in various contexts. Developers can confidently extend and customize their AI agents with this foundational knowledge. Check out the Notebook on GitHub. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. Asif RazzaqWebsite |  + postsBioAsif Razzaq is the CEO of Marktechpost Media Inc.. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts of over 2 million monthly views, illustrating its popularity among audiences.Asif Razzaqhttps://www.marktechpost.com/author/6flvq/A Comprehensive Coding Guide to Crafting Advanced Round-Robin Multi-Agent Workflows with Microsoft AutoGenAsif Razzaqhttps://www.marktechpost.com/author/6flvq/Microsoft AI Introduces Magentic-UI: An Open-Source Agent Prototype that Works with People to Complete Complex Tasks that Require Multi-Step Planning and Browser UseAsif Razzaqhttps://www.marktechpost.com/author/6flvq/Anthropic Releases Claude Opus 4 and Claude Sonnet 4: A Technical Leap in Reasoning, Coding, and AI Agent DesignAsif Razzaqhttps://www.marktechpost.com/author/6flvq/Technology Innovation Institute TII Releases Falcon-H1: Hybrid Transformer-SSM Language Models for Scalable, Multilingual, and Long-Context Understanding #stepbystep #guide #build #customizable #multitool
    Step-by-Step Guide to Build a Customizable Multi-Tool AI Agent with LangGraph and Claude for Dynamic Agent Creation
    www.marktechpost.com
    In this comprehensive tutorial, we guide users through creating a powerful multi-tool AI agent using LangGraph and Claude, optimized for diverse tasks including mathematical computations, web searches, weather inquiries, text analysis, and real-time information retrieval. It begins by simplifying dependency installations to ensure effortless setup, even for beginners. Users are then introduced to structured implementations of specialized tools, such as a safe calculator, an efficient web-search utility leveraging DuckDuckGo, a mock weather information provider, a detailed text analyzer, and a time-fetching function. The tutorial also clearly delineates the integration of these tools within a sophisticated agent architecture built using LangGraph, illustrating practical usage through interactive examples and clear explanations, facilitating both beginners and advanced developers to deploy custom multi-functional AI agents rapidly. import subprocess import sys def install_packages(): packages = [ "langgraph", "langchain", "langchain-anthropic", "langchain-community", "requests", "python-dotenv", "duckduckgo-search" ] for package in packages: try: subprocess.check_call([sys.executable, "-m", "pip", "install", package, "-q"]) print(f"✓ Installed {package}") except subprocess.CalledProcessError: print(f"✗ Failed to install {package}") print("Installing required packages...") install_packages() print("Installation complete!\n") We automate the installation of essential Python packages required for building a LangGraph-based multi-tool AI agent. It leverages a subprocess to run pip commands silently and ensures each package, ranging from long-chain components to web search and environment handling tools, is installed successfully. This setup streamlines the environment preparation process, making the notebook portable and beginner-friendly. import os import json import math import requests from typing import Dict, List, Any, Annotated, TypedDict from datetime import datetime import operator from langchain_core.messages import BaseMessage, HumanMessage, AIMessage, ToolMessage from langchain_core.tools import tool from langchain_anthropic import ChatAnthropic from langgraph.graph import StateGraph, START, END from langgraph.prebuilt import ToolNode from langgraph.checkpoint.memory import MemorySaver from duckduckgo_search import DDGS We import all the necessary libraries and modules for constructing the multi-tool AI agent. It includes Python standard libraries such as os, json, math, and datetime for general-purpose functionality and external libraries like requests for HTTP calls and duckduckgo_search for implementing web search. The LangChain and LangGraph ecosystems bring in message types, tool decorators, state graph components, and checkpointing utilities, while ChatAnthropic enables integration with the Claude model for conversational intelligence. These imports form the foundational building blocks for defining tools, agent workflows, and interactions. os.environ["ANTHROPIC_API_KEY"] = "Use Your API Key Here" ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY") We set and retrieve the Anthropic API key required to authenticate and interact with Claude models. The os.environ line assigns your API key (which you should replace with a valid key), while os.getenv securely retrieves it for later use in model initialization. This approach ensures the key is accessible throughout the script without hardcoding it multiple times. 
from typing import TypedDict class AgentState(TypedDict): messages: Annotated[List[BaseMessage], operator.add] @tool def calculator(expression: str) -> str: """ Perform mathematical calculations. Supports basic arithmetic, trigonometry, and more. Args: expression: Mathematical expression as a string (e.g., "2 + 3 * 4", "sin(3.14159/2)") Returns: Result of the calculation as a string """ try: allowed_names = { 'abs': abs, 'round': round, 'min': min, 'max': max, 'sum': sum, 'pow': pow, 'sqrt': math.sqrt, 'sin': math.sin, 'cos': math.cos, 'tan': math.tan, 'log': math.log, 'log10': math.log10, 'exp': math.exp, 'pi': math.pi, 'e': math.e } expression = expression.replace('^', '**') result = eval(expression, {"__builtins__": {}}, allowed_names) return f"Result: {result}" except Exception as e: return f"Error in calculation: {str(e)}" We define the agent’s internal state and implement a robust calculator tool. The AgentState class uses TypedDict to structure agent memory, specifically tracking messages exchanged during the conversation. The calculator function, decorated with @tool to register it as an AI-usable utility, securely evaluates mathematical expressions. It allows for safe computation by limiting available functions to a predefined set from the math module and replacing common syntax like ^ with Python’s exponentiation operator. This ensures the tool can handle simple arithmetic and advanced functions like trigonometry or logarithms while preventing unsafe code execution. @tool def web_search(query: str, num_results: int = 3) -> str: """ Search the web for information using DuckDuckGo. Args: query: Search query string num_results: Number of results to return (default: 3, max: 10) Returns: Search results as formatted string """ try: num_results = min(max(num_results, 1), 10) with DDGS() as ddgs: results = list(ddgs.text(query, max_results=num_results)) if not results: return f"No search results found for: {query}" formatted_results = f"Search results for '{query}':\n\n" for i, result in enumerate(results, 1): formatted_results += f"{i}. **{result['title']}**\n" formatted_results += f" {result['body']}\n" formatted_results += f" Source: {result['href']}\n\n" return formatted_results except Exception as e: return f"Error performing web search: {str(e)}" We define a web_search tool that enables the agent to fetch real-time information from the internet using the DuckDuckGo Search API via the duckduckgo_search Python package. The tool accepts a search query and an optional num_results parameter, ensuring that the number of results returned is between 1 and 10. It opens a DuckDuckGo search session, retrieves the results, and formats them neatly for user-friendly display. If no results are found or an error occurs, the function handles it gracefully by returning an informative message. This tool equips the agent with real-time search capabilities, enhancing responsiveness and utility. @tool def weather_info(city: str) -> str: """ Get current weather information for a city using OpenWeatherMap API. Note: This is a mock implementation for demo purposes. 
Args: city: Name of the city Returns: Weather information as a string """ mock_weather = { "new york": {"temp": 22, "condition": "Partly Cloudy", "humidity": 65}, "london": {"temp": 15, "condition": "Rainy", "humidity": 80}, "tokyo": {"temp": 28, "condition": "Sunny", "humidity": 70}, "paris": {"temp": 18, "condition": "Overcast", "humidity": 75} } city_lower = city.lower() if city_lower in mock_weather: weather = mock_weather[city_lower] return f"Weather in {city}:\n" \ f"Temperature: {weather['temp']}°C\n" \ f"Condition: {weather['condition']}\n" \ f"Humidity: {weather['humidity']}%" else: return f"Weather data not available for {city}. (This is a demo with limited cities: New York, London, Tokyo, Paris)" We define a weather_info tool that simulates retrieving current weather data for a given city. While it does not connect to a live weather API, it uses a predefined dictionary of mock data for major cities like New York, London, Tokyo, and Paris. Upon receiving a city name, the function normalizes it to lowercase and checks for its presence in the mock dataset. It returns temperature, weather condition, and humidity in a readable format if found. Otherwise, it notifies the user that weather data is unavailable. This tool serves as a placeholder and can later be upgraded to fetch live data from an actual weather API. @tool def text_analyzer(text: str) -> str: """ Analyze text and provide statistics like word count, character count, etc. Args: text: Text to analyze Returns: Text analysis results """ if not text.strip(): return "Please provide text to analyze." words = text.split() sentences = text.split('.') + text.split('!') + text.split('?') sentences = [s.strip() for s in sentences if s.strip()] analysis = f"Text Analysis Results:\n" analysis += f"• Characters (with spaces): {len(text)}\n" analysis += f"• Characters (without spaces): {len(text.replace(' ', ''))}\n" analysis += f"• Words: {len(words)}\n" analysis += f"• Sentences: {len(sentences)}\n" analysis += f"• Average words per sentence: {len(words) / max(len(sentences), 1):.1f}\n" analysis += f"• Most common word: {max(set(words), key=words.count) if words else 'N/A'}" return analysis The text_analyzer tool provides a detailed statistical analysis of a given text input. It calculates metrics such as character count (with and without spaces), word count, sentence count, and average words per sentence, and it identifies the most frequently occurring word. The tool handles empty input gracefully by prompting the user to provide valid text. It uses simple string operations and Python’s set and max functions to extract meaningful insights. It is a valuable utility for language analysis or content quality checks in the AI agent’s toolkit. @tool def current_time() -> str: """ Get the current date and time. Returns: Current date and time as a formatted string """ now = datetime.now() return f"Current date and time: {now.strftime('%Y-%m-%d %H:%M:%S')}" The current_time tool provides a straightforward way to retrieve the current system date and time in a human-readable format. Using Python’s datetime module, it captures the present moment and formats it as YYYY-MM-DD HH:MM:SS. This utility is particularly useful for time-stamping responses or answering user queries about the current date and time within the AI agent’s interaction flow. 
tools = [calculator, web_search, weather_info, text_analyzer, current_time] def create_llm(): if ANTHROPIC_API_KEY: return ChatAnthropic( model="claude-3-haiku-20240307", temperature=0.1, max_tokens=1024 ) else: class MockLLM: def invoke(self, messages): last_message = messages[-1].content if messages else "" if any(word in last_message.lower() for word in ['calculate', 'math', '+', '-', '*', '/', 'sqrt', 'sin', 'cos']): import re numbers = re.findall(r'[\d\+\-\*/\.\(\)\s\w]+', last_message) expr = numbers[0] if numbers else "2+2" return AIMessage(content="I'll help you with that calculation.", tool_calls=[{"name": "calculator", "args": {"expression": expr.strip()}, "id": "calc1"}]) elif any(word in last_message.lower() for word in ['search', 'find', 'look up', 'information about']): query = last_message.replace('search for', '').replace('find', '').replace('look up', '').strip() if not query or len(query) < 3: query = "python programming" return AIMessage(content="I'll search for that information.", tool_calls=[{"name": "web_search", "args": {"query": query}, "id": "search1"}]) elif any(word in last_message.lower() for word in ['weather', 'temperature']): city = "New York" words = last_message.lower().split() for i, word in enumerate(words): if word == 'in' and i + 1 < len(words): city = words[i + 1].title() break return AIMessage(content="I'll get the weather information.", tool_calls=[{"name": "weather_info", "args": {"city": city}, "id": "weather1"}]) elif any(word in last_message.lower() for word in ['time', 'date']): return AIMessage(content="I'll get the current time.", tool_calls=[{"name": "current_time", "args": {}, "id": "time1"}]) elif any(word in last_message.lower() for word in ['analyze', 'analysis']): text = last_message.replace('analyze this text:', '').replace('analyze', '').strip() if not text: text = "Sample text for analysis" return AIMessage(content="I'll analyze that text for you.", tool_calls=[{"name": "text_analyzer", "args": {"text": text}, "id": "analyze1"}]) else: return AIMessage(content="Hello! I'm a multi-tool agent powered by Claude. I can help with:\n• Mathematical calculations\n• Web searches\n• Weather information\n• Text analysis\n• Current time/date\n\nWhat would you like me to help you with?") def bind_tools(self, tools): return self print("⚠️ Note: Using mock LLM for demo. Add your ANTHROPIC_API_KEY for full functionality.") return MockLLM() llm = create_llm() llm_with_tools = llm.bind_tools(tools) We initialize the language model that powers the AI agent. If a valid Anthropic API key is available, it uses the Claude 3 Haiku model for high-quality responses. Without an API key, a MockLLM is defined to simulate basic tool-routing behavior based on keyword matching, allowing the agent to function offline with limited capabilities. The bind_tools method links the defined tools to the model, enabling it to invoke them as needed. def agent_node(state: AgentState) -> Dict[str, Any]: """Main agent node that processes messages and decides on tool usage.""" messages = state["messages"] response = llm_with_tools.invoke(messages) return {"messages": [response]} def should_continue(state: AgentState) -> str: """Determine whether to continue with tool calls or end.""" last_message = state["messages"][-1] if hasattr(last_message, 'tool_calls') and last_message.tool_calls: return "tools" return END We define the agent’s core decision-making logic. 
The agent_node function handles incoming messages, invokes the language model (with tools), and returns the model’s response. The should_continue function then evaluates whether the model’s response includes tool calls. If so, it routes control to the tool execution node; otherwise, it directs the flow to end the interaction. These functions enable dynamic and conditional transitions within the agent’s workflow. def create_agent_graph(): tool_node = ToolNode(tools) workflow = StateGraph(AgentState) workflow.add_node("agent", agent_node) workflow.add_node("tools", tool_node) workflow.add_edge(START, "agent") workflow.add_conditional_edges("agent", should_continue, {"tools": "tools", END: END}) workflow.add_edge("tools", "agent") memory = MemorySaver() app = workflow.compile(checkpointer=memory) return app print("Creating LangGraph Multi-Tool Agent...") agent = create_agent_graph() print("✓ Agent created successfully!\n") We construct the LangGraph-powered workflow that defines the AI agent’s operational structure. It initializes a ToolNode to handle tool executions and uses a StateGraph to organize the flow between agent decisions and tool usage. Nodes and edges are added to manage transitions: starting with the agent, conditionally routing to tools, and looping back as needed. A MemorySaver is integrated for persistent state tracking across turns. The graph is compiled into an executable application (app), enabling a structured, memory-aware multi-tool agent ready for deployment. def test_agent(): """Test the agent with various queries.""" config = {"configurable": {"thread_id": "test-thread"}} test_queries = [ "What's 15 * 7 + 23?", "Search for information about Python programming", "What's the weather like in Tokyo?", "What time is it?", "Analyze this text: 'LangGraph is an amazing framework for building AI agents.'" ] print("🧪 Testing the agent with sample queries...\n") for i, query in enumerate(test_queries, 1): print(f"Query {i}: {query}") print("-" * 50) try: response = agent.invoke( {"messages": [HumanMessage(content=query)]}, config=config ) last_message = response["messages"][-1] print(f"Response: {last_message.content}\n") except Exception as e: print(f"Error: {str(e)}\n") The test_agent function is a validation utility that ensures that the LangGraph agent responds correctly across different use cases. It runs predefined queries, arithmetic, web search, weather, time, and text analysis, and prints the agent’s responses. Using a consistent thread_id for configuration, it invokes the agent with each query. It neatly displays the results, helping developers verify tool integration and conversational logic before moving to interactive or production use. def chat_with_agent(): """Interactive chat function.""" config = {"configurable": {"thread_id": "interactive-thread"}} print("🤖 Multi-Tool Agent Chat") print("Available tools: Calculator, Web Search, Weather Info, Text Analyzer, Current Time") print("Type 'quit' to exit, 'help' for available commands\n") while True: try: user_input = input("You: ").strip() if user_input.lower() in ['quit', 'exit', 'q']: print("Goodbye!") break elif user_input.lower() == 'help': print("\nAvailable commands:") print("• Calculator: 'Calculate 15 * 7 + 23' or 'What's sin(pi/2)?'") print("• Web Search: 'Search for Python tutorials' or 'Find information about AI'") print("• Weather: 'Weather in Tokyo' or 'What's the temperature in London?'") print("• Text Analysis: 'Analyze this text: [your text]'") print("• Current Time: 'What time is it?' 
or 'Current date'") print("• quit: Exit the chat\n") continue elif not user_input: continue response = agent.invoke( {"messages": [HumanMessage(content=user_input)]}, config=config ) last_message = response["messages"][-1] print(f"Agent: {last_message.content}\n") except KeyboardInterrupt: print("\nGoodbye!") break except Exception as e: print(f"Error: {str(e)}\n") The chat_with_agent function provides an interactive command-line interface for real-time conversations with the LangGraph multi-tool agent. It supports natural language queries and recognizes commands like “help” for usage guidance and “quit” to exit. Each user input is processed through the agent, which dynamically selects and invokes appropriate response tools. The function enhances user engagement by simulating a conversational experience and showcasing the agent’s capabilities in handling various queries, from math and web search to weather, text analysis, and time retrieval. if __name__ == "__main__": test_agent() print("=" * 60) print("🎉 LangGraph Multi-Tool Agent is ready!") print("=" * 60) chat_with_agent() def quick_demo(): """Quick demonstration of agent capabilities.""" config = {"configurable": {"thread_id": "demo"}} demos = [ ("Math", "Calculate the square root of 144 plus 5 times 3"), ("Search", "Find recent news about artificial intelligence"), ("Time", "What's the current date and time?") ] print("🚀 Quick Demo of Agent Capabilities\n") for category, query in demos: print(f"[{category}] Query: {query}") try: response = agent.invoke( {"messages": [HumanMessage(content=query)]}, config=config ) print(f"Response: {response['messages'][-1].content}\n") except Exception as e: print(f"Error: {str(e)}\n") print("\n" + "="*60) print("🔧 Usage Instructions:") print("1. Add your ANTHROPIC_API_KEY to use Claude model") print(" os.environ['ANTHROPIC_API_KEY'] = 'your-anthropic-api-key'") print("2. Run quick_demo() for a quick demonstration") print("3. Run chat_with_agent() for interactive chat") print("4. The agent supports: calculations, web search, weather, text analysis, and time") print("5. Example: 'Calculate 15*7+23' or 'Search for Python tutorials'") print("="*60) Finally, we orchestrate the execution of the LangGraph multi-tool agent. If the script is run directly, it initiates test_agent() to validate functionality with sample queries, followed by launching the interactive chat_with_agent() mode for real-time interaction. The quick_demo() function also briefly showcases the agent’s capabilities in math, search, and time queries. Clear usage instructions are printed at the end, guiding users on configuring the API key, running demonstrations, and interacting with the agent. This provides a smooth onboarding experience for users to explore and extend the agent’s functionality. In conclusion, this step-by-step tutorial gives valuable insights into building an effective multi-tool AI agent leveraging LangGraph and Claude’s generative capabilities. With straightforward explanations and hands-on demonstrations, the guide empowers users to integrate diverse utilities into a cohesive and interactive system. The agent’s flexibility in performing tasks, from complex calculations to dynamic information retrieval, showcases the versatility of modern AI development frameworks. Also, the inclusion of user-friendly functions for both testing and interactive chat enhances practical understanding, enabling immediate application in various contexts. 
Developers can confidently extend and customize their AI agents with this foundational knowledge. Check out the Notebook on GitHub. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter. Asif RazzaqWebsite |  + postsBioAsif Razzaq is the CEO of Marktechpost Media Inc.. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts of over 2 million monthly views, illustrating its popularity among audiences.Asif Razzaqhttps://www.marktechpost.com/author/6flvq/A Comprehensive Coding Guide to Crafting Advanced Round-Robin Multi-Agent Workflows with Microsoft AutoGenAsif Razzaqhttps://www.marktechpost.com/author/6flvq/Microsoft AI Introduces Magentic-UI: An Open-Source Agent Prototype that Works with People to Complete Complex Tasks that Require Multi-Step Planning and Browser UseAsif Razzaqhttps://www.marktechpost.com/author/6flvq/Anthropic Releases Claude Opus 4 and Claude Sonnet 4: A Technical Leap in Reasoning, Coding, and AI Agent DesignAsif Razzaqhttps://www.marktechpost.com/author/6flvq/Technology Innovation Institute TII Releases Falcon-H1: Hybrid Transformer-SSM Language Models for Scalable, Multilingual, and Long-Context Understanding
  • New to LLMs? Start Here 

    A guide to Agents, LLMs, RAG, Fine-tuning, LangChain with practical examples to start building
    The post New to LLMs? Start Here  appeared first on Towards Data Science.
  • Building AI Applications in Ruby

    This is the second in a multi-part series on creating web applications with generative AI integration. Part 1 focused on explaining the AI stack and why the application layer is the best place in the stack to be. Check it out here.

    Table of Contents

    Introduction

    I thought spas were supposed to be relaxing?

    Microservices are for Macrocompanies

    Ruby and Python: Two Sides of the Same Coin

    Recent AI Based Gems

    Summary

    Introduction

    It’s not often that you hear the Ruby language mentioned when discussing AI.

    Python, of course, is the king in this world, and for good reason. The community has coalesced around the language. Most model training is done in PyTorch or TensorFlow these days. Scikit-learn and Keras are also very popular. RAG frameworks such as LangChain and LlamaIndex cater primarily to Python.

    However, when it comes to building web applications with AI integration, I believe Ruby is the better language.

    As the co-founder of an agency dedicated to building MVPs with generative AI integration, I frequently hear potential clients complaining about two things:

    Applications take too long to build

    Developers are quoting insane prices to build custom web apps

    These complaints have a common source: complexity. Modern web apps have a lot more complexity in them than in the good ol’ days. But why is this? Are the benefits brought by complexity worth the cost?

    I thought spas were supposed to be relaxing?

    One big piece of the puzzle is the recent rise of single-page applications (SPAs). The most popular stack used today in building modern SPAs is MERN (MongoDB, Express.js, React.js, Node.js). The stack is popular for a few reasons:

    It is a JavaScript-only stack, across both front-end and back-end. Having to code in only one language is pretty nice!

    SPAs can offer dynamic designs and a “smooth” user experience. Smooth here means that when some piece of data changes, only a part of the site is updated, as opposed to having to reload the whole page. Of course, if you don’t have a modern smartphone, SPAs won’t feel so smooth, as they tend to be pretty heavy. All that JavaScript starts to drag down the performance.

    There is a large ecosystem of libraries and developers with experience in this stack. This is pretty circular logic: is the stack popular because of the ecosystem, or is there an ecosystem because of the popularity? Either way, this point stands.

    React was created by Meta. Lots of money and effort have been thrown at the library, helping to polish and promote the product.

    Unfortunately, there are some downsides of working in the MERN stack, the most critical being the sheer complexity.

    Traditional web development was done using the Model-View-Controller (MVC) paradigm. In MVC, all of the logic managing a user’s session is handled in the backend, on the server. Something like fetching a user’s data was done via function calls and SQL statements in the backend. The backend then serves fully built HTML and CSS to the browser, which just has to display it. Hence the name “server”.
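
    To make the MVC flow concrete, here is a minimal Rails-style sketch; the UsersController and User model are hypothetical stand-ins rather than code from any particular app.

    # app/controllers/users_controller.rb (hypothetical example)
    class UsersController < ApplicationController
      # GET /users/:id
      def show
        # A plain method call; Active Record issues the SQL behind the scenes.
        @user = User.find(params[:id])
        # Rails then renders app/views/users/show.html.erb and sends fully built HTML to the browser.
      end
    end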

    In a SPA, this logic is handled on the user’s browser, in the frontend. SPAs have to handle UI state, application state, and sometimes even server state all in the browser. API calls have to be made to the backend to fetch user data. There is still quite a bit of logic on the backend, mainly exposing data and functionality through APIs.

    To illustrate the difference, let me use the analogy of a commercial kitchen. The customer will be the frontend and the kitchen will be the backend.

    MVCs vs. SPAs. Image generated by ChatGPT.

    Traditional MVC apps are like dining at a full-service restaurant. Yes, there is a lot of complexity (and yelling, if The Bear is to be believed) in the backend. But the frontend experience is simple and satisfying: all the customer has to do is pick up a fork and eat their food.

    SPAs are like eating at a buffet-style dining restaurant. There is still quite a bit of complexity in the kitchen. But now the customer also has to decide what food to grab, how to combine them, how to arrange them on the plate, where to put the plate when finished, etc.

    Andrej Karpathy had a tweet recently discussing his frustration with attempting to build web apps in 2025. It can be overwhelming for those new to the space.

    The reality of building web apps in 2025 is that it's a bit like assembling IKEA furniture. There's no "full-stack" product with batteries included, you have to piece together and configure many individual services: – frontend / backend (e.g. React, Next.js, APIs) – hosting… — Andrej Karpathy (@karpathy) March 27, 2025

    In order to build MVPs with AI integration rapidly, our agency has decided to forgo the SPA and instead go with the traditional MVC approach. In particular, we have found Ruby on Rails (often denoted as Rails) to be the framework best suited to quickly developing and deploying quality apps with AI integration. Ruby on Rails was developed by David Heinemeier Hansson in 2004 and has long been known as a great web framework, but I would argue it has recently made leaps in its ability to incorporate AI into apps, as we will see.

    Django is the most popular Python web framework, and also has a more traditional pattern of development. Unfortunately, in our testing we found Django was simply not as full-featured or “batteries included” as Rails is. As a simple example, Django has no built-in background job system. Nearly all of our apps incorporate background jobs, so to not include this was disappointing. We also prefer how Rails emphasizes simplicity, with Rails 8 encouraging developers to easily self-host their apps instead of going through a provider like Heroku. They also recently released a stack of tools meant to replace external services like Redis.
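
    For reference, background work in Rails goes through Active Job, which ships with the framework. The sketch below is only illustrative; SummarizeArticleJob and the Summarizer helper are hypothetical names.

    # app/jobs/summarize_article_job.rb (hypothetical example)
    class SummarizeArticleJob < ApplicationJob
      queue_as :default

      def perform(article_id)
        article = Article.find(article_id)
        # Summarizer is a hypothetical service object wrapping an LLM call.
        article.update!(summary: Summarizer.call(article.body))
      end
    end

    # Enqueue from anywhere in the app, for example a controller action:
    SummarizeArticleJob.perform_later(article.id)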

    “But what about the smooth user experience?” you might ask. The truth is that modern Rails includes several ways of crafting SPA-like experiences without all of the heavy JavaScript. The primary tool is Hotwire, which bundles tools like Turbo and Stimulus. Turbo lets you dynamically change pieces of HTML on your webpage without writing custom JavaScript. For the times where you do need to include custom JavaScript, Stimulus is a minimal JavaScript framework that lets you do just that. Even if you want to use React, you can do so with the react-rails gem. So you can have your cake, and eat it too!
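
    As a rough illustration of how Turbo avoids custom JavaScript, a controller can answer a form submission with a Turbo Stream that swaps a rendered partial into the page. The MessagesController below is a hypothetical sketch, not a drop-in implementation.

    # app/controllers/messages_controller.rb (hypothetical example)
    class MessagesController < ApplicationController
      def create
        @message = Message.create!(params.require(:message).permit(:body))
        respond_to do |format|
          # Turbo intercepts the submit and appends just the new message to the
          # element with id "messages"; no hand-written JavaScript is needed.
          format.turbo_stream do
            render turbo_stream: turbo_stream.append("messages", partial: "messages/message", locals: { message: @message })
          end
          format.html { redirect_to messages_path }
        end
      end
    end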

    SPAs are not the only reason for the increase in complexity, however. Another has to do with the advent of the microservices architecture.

    Microservices are for Macrocompanies

    Once again, we find ourselves comparing the simple past with the complexity of today.

    In the past, software was primarily developed as monoliths. A monolithic application means that all the different parts of your app — such as the user interface, business logic, and data handling — are developed, tested, and deployed as one single unit. The code is all typically housed in a single repo.

    Working with a monolith is simple and satisfying. Running a development setup for testing purposes is easy. You are working with a single database schema containing all of your tables, making queries and joins straightforward. Deployment is simple, since you just have one container to look at and modify.

    However, once your company scales to the size of a Google or Amazon, real problems begin to emerge. With hundreds or thousands of developers contributing simultaneously to a single codebase, coordinating changes and managing merge conflicts becomes increasingly difficult. Deployments also become more complex and risky, since even minor changes can blow up the entire application!

    To manage these issues, large companies began to coalesce around the microservices architecture. This is a style of programming where you design your codebase as a set of small, autonomous services. Each service owns its own codebase, data storage, and deployment pipelines. As a simple example, instead of stuffing all of your logic regarding an OpenAI client into your main app, you can move that logic into its own service. To call that service, you would then typically make REST calls, as opposed to function calls. This ups the complexity, but resolves the merge conflict and deployment issues, since each team in the organization gets to work on their own island of code.
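
    To see what that overhead looks like in practice, compare the two call styles below; the summarization endpoint and payload are hypothetical, sketched with Ruby's standard Net::HTTP.

    require 'net/http'
    require 'json'
    require 'uri'

    article_text = "Ruby on Rails was released in 2004..."

    # Monolith: the AI logic is just another method call in the same process,
    # e.g. summary = Summarizer.call(article_text) for some hypothetical service object.

    # Microservice: the same capability now sits behind HTTP, adding serialization,
    # routing, and error handling on top of what used to be a function call.
    response = Net::HTTP.post(
      URI("http://ai-service.internal/v1/summarize"),  # hypothetical internal endpoint
      { text: article_text }.to_json,
      "Content-Type" => "application/json"
    )
    summary = JSON.parse(response.body)["summary"]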

    Another benefit to using microservices is that they allow for a polyglot tech stack. This means that each team can code up their service using whatever language they prefer. If one team prefers JavaScript while another likes Python, this is no issue. When we first began our agency, this idea of a polyglot stack pushed us to use a microservices architecture. Not because we had a large team, but because we each wanted to use the “best” language for each functionality. This meant:

    Using Ruby on Rails for web development. It’s been battle-tested in this area for decades.

    Using Python for the AI integration, perhaps deployed with something like FastAPI. Serious AI work requires Python, I was led to believe.

    Two different languages, each focused on its area of specialty. What could go wrong?

    Unfortunately, we found the process of development frustrating. Just setting up our dev environment was time-consuming. Having to wrangle Docker compose files and manage inter-service communication made us wish we could go back to the beauty and simplicity of the monolith. Having to make a REST call and set up the appropriate routing in FastAPI instead of making a simple function call sucked.

    “Surely we can’t develop AI apps in pure Ruby,” I thought. And then I gave it a try.

    And I’m glad I did.

    I found the process of developing an MVP with AI integration in Ruby very satisfying. We were able to sprint where before we were jogging. I loved the emphasis on beauty, simplicity, and developer happiness in the Ruby community. And I found the state of the AI ecosystem in Ruby to be surprisingly mature and getting better every day.

    If you are a Python programmer and are scared off by learning a new language like I was, let me comfort you by discussing the similarities between the Ruby and Python languages.

    Ruby and Python: Two Sides of the Same Coin

    I consider Python and Ruby to be like cousins. Both languages incorporate:

    High-level Interpretation: This means they abstract away a lot of the complexity of low-level programming details, such as memory management.

    Dynamic Typing: Neither language requires you to specify if a variable is an int, float, string, etc. The types are checked at runtime.

    Object-Oriented Programming: Both languages are object-oriented. Both support classes, inheritance, polymorphism, etc. Ruby is more “pure”, in the sense that literally everything is an object, whereas in Python a few things (such as if and for statements) are not objects.

    Readable and Concise Syntax: Both are considered easy to learn. Either is great for a first-time learner.

    Wide Ecosystem of Packages: Packages to do all sorts of cool things are available in both languages. In Python they are called libraries, and in Ruby they are called gems.

    The primary difference between the two languages lies in their philosophy and design principles. Python’s core philosophy can be described as:

    There should be one — and preferably only one — obvious way to do something.

    In theory, this should emphasize simplicity, readability, and clarity. Ruby’s philosophy can be described as:

    There’s always more than one way to do something. Maximize developer happiness.

    This was a shock to me when I switched over from Python. Check out this simple example emphasizing this philosophical difference:

    # A fight over philosophy: iterating over an array
    # Pythonic way
    for i in range(1, 6):
        print(i)

    # Ruby way, option 1
    (1..5).each do |i|
    puts i
    end

    # Ruby way, option 2
    for i in 1..5
    puts i
    end

    # Ruby way, option 3
    5.times do |i|
    puts i + 1
    end

    # Ruby way, option 4
    (1..5).each { |i| puts i }

    Another difference between the two is syntax style. Python primarily uses indentation to denote code blocks, while Ruby uses do…end or {…} blocks. Most developers include indentation inside Ruby blocks, but this is entirely optional. Examples of these syntactic differences can be seen in the code shown above.

    There are a lot of other little differences to learn. For example, in Python string interpolation is done using f-strings: f"Hello, {name}!", while in Ruby they are done using hashtags: "Hello, #{name}!". Within a few months, I think any competent Python programmer can transfer their proficiency over to Ruby.

    Recent AI-based Gems

    Despite not being in the conversation when discussing AI, Ruby has had some recent advancements in the world of gems. I will highlight some of the most impressive recent releases that we have been using in our agency to build AI apps:

    RubyLLM — Any GitHub repo that gets more than 2k stars within a few weeks of release deserves a mention, and RubyLLM is definitely worthy. I have used many clunky implementations of LLM providers from libraries like LangChain and LlamaIndex, so using RubyLLM was like a breath of fresh air. As a simple example, let’s take a look at a tutorial demonstrating multi-turn conversations:

    require 'ruby_llm'

    # Create a model and give it instructions
    chat = RubyLLM.chat
    chat.with_instructions "You are a friendly Ruby expert who loves to help beginners."

    # Multi-turn conversation
    chat.ask "Hi! What does attr_reader do in Ruby?"
    # => "Ruby creates a getter method for each symbol...

    # Stream responses in real time
    chat.ask "Could you give me a short example?" do |chunk|
    print chunk.content
    end
    # => "Sure!
    # ```ruby
    # class Person
    # attr...

    Simply amazing. Multi-turn conversations are handled automatically for you. Streaming is a breeze. Compare this to a similar implementation in LangChain:

    from langchain_openai import ChatOpenAI
    from langchain_core.messages import SystemMessage, HumanMessage, AIMessage
    from langchain_core.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

    SYSTEM_PROMPT = "You are a friendly Ruby expert who loves to help beginners."
    chat = ChatOpenAI(streaming=True, callbacks=[StreamingStdOutCallbackHandler()])

    history = [SystemMessage(content=SYSTEM_PROMPT)]

    def ask(user_text: str) -> None:
        """Stream the answer token-by-token and keep the context in memory."""
        history.append(HumanMessage(content=user_text))
        # .stream yields message chunks as they arrive
        for chunk in chat.stream(history):
            print(chunk.content, end="", flush=True)
        print()  # newline after the answer
        # the final chunk has the full message content
        history.append(AIMessage(content=chunk.content))

    ask("Hi! What does attr_reader do in Ruby?")
    ask("Great - could you show a short example with attr_accessor?")

    Yikes. And it’s important to note that this is a grug implementation. Want to know how LangChain really expects you to manage memory? Check out these links, but grab a bucket first; you may get sick.

    Neighbors — This is an excellent library to use for nearest-neighbors search in a Rails application. Very useful in a RAG setup. It integrates with Postgres, SQLite, MySQL, MariaDB, and more. It was written by Andrew Kane, the same guy who wrote the pgvector extension that allows Postgres to behave as a vector database.
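
    To give a feel for the RAG use case, here is a rough sketch following the gem's documented pattern; the Document model, its embedding vector column, and the embed helper are hypothetical pieces of an app.

    # app/models/document.rb (hypothetical example)
    class Document < ApplicationRecord
      has_neighbors :embedding  # vector column, e.g. backed by pgvector
    end

    # Pull the five chunks closest to the query embedding to use as RAG context.
    query_embedding = embed("How do I stream responses from an LLM?")  # embed is a hypothetical helper
    context_chunks = Document.nearest_neighbors(:embedding, query_embedding, distance: "cosine").first(5)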

    Async — This gem had its first official release back in December 2024, and it has been making waves in the Ruby community. Async is a fiber-based framework for Ruby that runs non-blocking I/O tasks concurrently while letting you write simple, sequential code. Fibers are like mini-threads that each have their own mini call stack. While not strictly a gem for AI, it has helped us create features like web scrapers that run blazingly fast across thousands of pages. We have also used it to handle streaming of chunks from LLMs.
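
    A minimal sketch of the pattern we lean on, assuming a hypothetical fetch_page helper that performs blocking I/O; each call runs in its own fiber, so the requests overlap instead of running one after another.

    require 'async'

    # fetch_page is a hypothetical I/O-bound helper (an HTTP request, an LLM stream, etc.).
    def fetch_all(urls)
      Async do |task|
        urls.map { |url| task.async { fetch_page(url) } }  # schedule one fiber per URL
            .map(&:wait)                                   # gather results as they complete
      end.wait
    end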

    Torch.rb — If you are interested in training deep learning models, then surely you have heard of PyTorch. Well, PyTorch is built on LibTorch, which essentially has a lot of C/C++ code under the hood to perform ML operations quickly. Andrew Kane took LibTorch and made a Ruby adapter over it to create Torch.rb, essentially a Ruby version of PyTorch. Andrew Kane has been a hero in the Ruby AI world, authoring dozens of ML gems for Ruby.
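
    To give a flavor, the API tracks PyTorch closely; this toy autograd snippet follows the patterns shown in the gem's README and is meant only as a sketch.

    require 'torch'

    # Tensors and autograd mirror their PyTorch counterparts.
    x = Torch.ones(2, 2, requires_grad: true)
    y = ((x + 2) * 3).sum
    y.backward
    puts x.grad  # gradient of y with respect to x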

    Summary

    In short: building a web application with AI integration quickly and cheaply requires a monolithic architecture. A monolith demands a monolingual application, which is necessary if your end goal is quality apps delivered with speed. Your main options are either Python or Ruby. If you go with Python, you will probably use Django for your web framework. If you go with Ruby, you will be using Ruby on Rails. At our agency, we found Django’s lack of features disappointing. Rails has impressed us with its feature set and emphasis on simplicity. We were thrilled to find almost no issues on the AI side.

    Of course, there are times where you will not want to use Ruby. If you are conducting research in AI or training machine learning models from scratch, then you will likely want to stick with Python. Research almost never involves building web applications. At most you’ll build a simple interface or dashboard in a notebook, but nothing production-ready. You’ll likely want the latest PyTorch updates to ensure your training runs quickly. You may even dive into low-level C/C++ programming to squeeze as much performance as you can out of your hardware. Maybe you’ll even try your hand at Mojo.

    But if your goal is to integrate the latest LLMs — either open or closed source — into web applications, then we believe Ruby to be the far superior option. Give it a shot yourselves!

    In part three of this series, I will dive into a fun experiment: just how simple can we make a web application with AI integration? Stay tuned.

     If you’d like a custom web application with generative AI integration, visit losangelesaiapps.com

    The post Building AI Applications in Ruby appeared first on Towards Data Science.
  • Google AI Releases MedGemma: An Open Suite of Models Trained for Performance on Medical Text and Image Comprehension

    At Google I/O 2025, Google introduced MedGemma, an open suite of models designed for multimodal medical text and image comprehension. Built on the Gemma 3 architecture, MedGemma aims to provide developers with a robust foundation for creating healthcare applications that require integrated analysis of medical images and textual data.
    Model Variants and Architecture
    MedGemma is available in two configurations:

    MedGemma 4B: A 4-billion parameter multimodal model capable of processing both medical images and text. It employs a SigLIP image encoder pre-trained on de-identified medical datasets, including chest X-rays, dermatology images, ophthalmology images, and histopathology slides. The language model component is trained on diverse medical data to facilitate comprehensive understanding.
    MedGemma 27B: A 27-billion parameter text-only model optimized for tasks requiring deep medical text comprehension and clinical reasoning. This variant is exclusively instruction-tuned and is designed for applications that demand advanced textual analysis.

    Deployment and Accessibility
    Developers can access MedGemma models through Hugging Face, subject to agreeing to the Health AI Developer Foundations terms of use. The models can be run locally for experimentation or deployed as scalable HTTPS endpoints via Google Cloud’s Vertex AI for production-grade applications. Google provides resources, including Colab notebooks, to facilitate fine-tuning and integration into various workflows.
    Applications and Use Cases
    MedGemma serves as a foundational model for several healthcare-related applications:

    Medical Image Classification: The 4B model’s pre-training makes it suitable for classifying various medical images, such as radiology scans and dermatological images.
    Medical Image Interpretation: It can generate reports or answer questions related to medical images, aiding in diagnostic processes.
    Clinical Text Analysis: The 27B model excels in understanding and summarizing clinical notes, supporting tasks like patient triaging and decision support.

    Adaptation and Fine-Tuning
    While MedGemma provides strong baseline performance, developers are encouraged to validate and fine-tune the models for their specific use cases. Techniques such as prompt engineering, in-context learning, and parameter-efficient fine-tuning methods like LoRA can be employed to enhance performance. Google offers guidance and tools to support these adaptation processes.
    Conclusion
    MedGemma represents a significant step in providing accessible, open-source tools for medical AI development. By combining multimodal capabilities with scalability and adaptability, it offers a valuable resource for developers aiming to build applications that integrate medical image and text analysis.

    Check out the Models on Hugging Face and Project Page. All credit for this research goes to the researchers of this project.
  • NVIDIA Releases Cosmos-Reason1: A Suite of AI Models Advancing Physical Common Sense and Embodied Reasoning in Real-World Environments

    AI has advanced in language processing, mathematics, and code generation, but extending these capabilities to physical environments remains challenging. Physical AI seeks to close this gap by developing systems that perceive, understand, and act in dynamic, real-world settings. Unlike conventional AI that processes text or symbols, Physical AI engages with sensory inputs, especially video, and generates responses grounded in real-world physics. These systems are designed for navigation, manipulation, and interaction, relying on common-sense reasoning and an embodied understanding of space, time, and physical laws. Applications span robotics, autonomous vehicles, and human-machine collaboration, where adaptability to real-time perception is crucial.
    A major limitation of current AI models is their weak connection to real-world physics. While they perform well on abstract tasks, they often fail to predict physical consequences or respond appropriately to sensory data. Concepts like gravity or spatial relationships are not intuitively understood, making them unreliable for embodied tasks. Training directly in the physical world is costly and risky, which hampers development and iteration. This lack of physical grounding and embodied understanding is a significant barrier to deploying AI effectively in real-world applications.
    Previously, tools for physical reasoning in AI were fragmented. Vision-language models linked visual and textual data but lacked depth in reasoning. Rule-based systems were rigid and failed in novel scenarios. Simulations and synthetic data often miss the nuances of real-world physics. Critically, there was no standardized framework to define or evaluate physical common sense or embodied reasoning. Inconsistent methodologies and benchmarks made progress difficult to quantify. Reinforcement learning approaches lacked task-specific reward structures, leading to models that struggled with cause-and-effect reasoning and physical feasibility.
    Researchers from NVIDIA introduced Cosmos-Reason1, a suite of multimodal large language models. These models, Cosmos-Reason1-7B and Cosmos-Reason1-56B, were designed specifically for physical reasoning tasks. Each model is trained in two major phases: Physical AI Supervised Fine-Tuning (SFT) and Physical AI Reinforcement Learning (RL). What differentiates this approach is the introduction of a dual-ontology system. One hierarchical ontology organizes physical common sense into three main categories (Space, Time, and Fundamental Physics), divided further into 16 subcategories. The second ontology is two-dimensional and maps reasoning capabilities across five embodied agents, including humans, robot arms, humanoid robots, and autonomous vehicles. These ontologies serve as both training guides and evaluation tools for benchmarking AI’s physical reasoning.

    The architecture of Cosmos-Reason1 uses a decoder-only LLM augmented with a vision encoder. Videos are processed to extract visual features, which are then projected into a shared space with language tokens. This integration enables the model to reason over textual and visual data simultaneously. The researchers curated a massive dataset comprising around 4 million annotated video-text pairs for training. These include action descriptions, multiple choice questions, and long chain-of-thought reasoning traces. The reinforcement learning stage is driven by rule-based, verifiable rewards derived from human-labeled multiple-choice questions and self-supervised video tasks. These tasks include predicting the temporal direction of videos and solving puzzles with spatiotemporal patches, making the training deeply tied to real-world physical logic.
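The article does not include NVIDIA's implementation details, but the general pattern it describes, projecting vision-encoder features into the LLM's token-embedding space so the decoder can attend over both, can be sketched in a few lines of PyTorch. Everything below (class name, dimensions) is a toy illustration under assumed sizes, not Cosmos-Reason1 code.

    import torch
    import torch.nn as nn

    class ToyVideoLLMFusion(nn.Module):
        """Toy illustration: project per-frame vision features into the LLM's
        embedding space and prepend them to the text token embeddings."""

        def __init__(self, vision_dim=1024, llm_dim=4096, vocab_size=32000):
            super().__init__()
            self.projector = nn.Linear(vision_dim, llm_dim)    # shared-space projection
            self.token_embed = nn.Embedding(vocab_size, llm_dim)

        def forward(self, frame_features, input_ids):
            # frame_features: (batch, num_frames, vision_dim) from a frozen video/vision encoder
            # input_ids:      (batch, seq_len) text token ids
            vision_tokens = self.projector(frame_features)     # (batch, num_frames, llm_dim)
            text_tokens = self.token_embed(input_ids)          # (batch, seq_len, llm_dim)
            # A decoder-only LLM would then attend over this combined sequence.
            return torch.cat([vision_tokens, text_tokens], dim=1)

    fused = ToyVideoLLMFusion()(torch.randn(1, 8, 1024), torch.randint(0, 32000, (1, 16)))
    print(fused.shape)  # torch.Size([1, 24, 4096])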

The team constructed three benchmarks for physical common sense (Space, Time, and Fundamental Physics), containing 604 questions from 426 videos. Six benchmarks were built for embodied reasoning with 610 questions from 600 videos, covering a wide range of tasks. The Cosmos-Reason1 models outperformed previous baselines, especially after the RL phase. Notably, they improved in task completion verification, predicting next plausible actions, and assessing the physical feasibility of actions. These gains were observed in both model sizes, with Cosmos-Reason1-56B showing stronger performance across most metrics. This performance improvement underscores the effectiveness of using structured ontologies and multimodal data to enhance physical reasoning in AI.

    Several Key Takeaways from the Research on Cosmos-Reason1:

    Two models introduced: Cosmos-Reason1-7B and Cosmos-Reason1-56B, trained specifically for physical reasoning tasks.
    The models were trained in two phases: Physical AI Supervised Fine-Tuning (SFT) and Physical AI Reinforcement Learning (RL).
    The training dataset includes approximately 4 million annotated video-text pairs curated for physical reasoning.
    Reinforcement learning uses rule-based and verifiable rewards, derived from human annotations and video-based tasks (see the sketch after this list).
    The team relied on two ontologies: a hierarchical one with three categories and 16 subcategories, and a two-dimensional one mapping agent capabilities.
    Benchmarks: 604 questions from 426 videos for physical common sense, and 610 from 600 videos for embodied reasoning.
    Performance gains were observed across all benchmarks after RL training, particularly in predicting next actions and verifying task completion.
    Real-world applicability for robots, vehicles, and other embodied agents across diverse environments.
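    The rule-based, verifiable rewards mentioned in the list above can be illustrated with a toy function: assume the model ends its reasoning with a line such as "Answer: B", and the reward is simply whether that choice matches the human-labeled key. This is a generic sketch of the idea under that assumed output format, not the paper's actual reward implementation.

        def verifiable_mcq_reward(model_answer: str, gold_choice: str) -> float:
            """Rule-based reward: 1.0 if the model's extracted choice matches the
            human-labeled answer key, 0.0 otherwise. No learned reward model is
            needed, so the signal is cheap to compute and hard to game."""
            # Scan from the end for a line like "Answer: B" and compare the letter.
            for line in reversed(model_answer.strip().splitlines()):
                if line.lower().startswith("answer:"):
                    return 1.0 if line.split(":", 1)[1].strip().upper() == gold_choice.upper() else 0.0
            return 0.0  # malformed output earns no reward

        print(verifiable_mcq_reward("The ball keeps rolling.\nAnswer: B", "B"))  # 1.0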

    In conclusion, the Cosmos-Reason1 initiative demonstrates how AI can be better equipped for the physical world. It addresses key limitations in perception, reasoning, and decision-making that have hindered progress in deploying AI in embodied scenarios. The structured training pipeline, grounded in real-world data and ontological frameworks, ensures that the models are accurate and adaptable. These advancements signal a major step forward in bridging the gap between abstract AI reasoning and the needs of systems that must operate in unpredictable, real-world environments.

    Check out the Paper, Project Page, Models on Hugging Face and GitHub Page. All credit for this research goes to the researchers of this project.