Agentic AI 102: Guardrails and Agent Evaluation
Introduction
In the first post of this series, we talked about the fundamentals of creating AI Agents and introduced concepts like reasoning, memory, and tools.
Of course, that first post touched only the surface of this new area of the data industry. There is so much more that can be done, and we are going to learn more along the way in this series.
So, it is time to take one step further.
In this post, we will cover three topics:
Guardrails: safety blocks that prevent a Large Language Model (LLM) from responding about certain topics.
Agent Evaluation: Have you ever wondered how accurate an LLM's responses really are? I bet you have. We will look at the main ways to measure that.
Monitoring: We will also learn about the built-in monitoring app in Agno’s framework.
We shall begin now.
Guardrails
Our first topic is the simplest, in my opinion. Guardrails are rules that will keep an AI agent from responding to a given topic or list of topics.
There is a good chance you have asked ChatGPT or Gemini something and received a response like “I can’t talk about this topic” or “Please consult a professional specialist”. Usually, that occurs with sensitive topics like health advice, psychological conditions, or financial advice.
Those blocks are safeguards to keep people from harming themselves, their health, or their finances. As we know, LLMs are trained on massive amounts of text and inherit plenty of bad content along with it, which could easily translate into bad advice in those areas. And I haven’t even mentioned hallucinations!
Think about how many stories there are of people who lost money by following investment tips from online forums. Or how many people took the wrong medicine because they read about it on the internet.
Well, I guess you got the point. We must prevent our agents from talking about certain topics or taking certain actions. For that, we will use guardrails.
The best framework I found to impose those blocks is Guardrails AI. There, you will see a hub full of predefined rules that a response must follow in order to pass and be displayed to the user.
To get started quickly, first go to the Guardrails AI website and get an API key. Then install the package and run the guardrails configure command. It will ask a couple of questions that you can answer with n, and it will prompt you for the API key you generated.
pip install guardrails-ai
guardrails configure
Once that is completed, go to the Guardrails AI Hub and choose the guardrail you need. Every guardrail has instructions on how to implement it. Basically, you install it via the command line and then use it like a module in Python.
For this example, we’re choosing one called Restrict to Topic, which, as its name says, lets the user talk only about what’s in the list. So, go back to the terminal and install it using the code below.
guardrails hub install hub://tryolabs/restricttotopic
Next, let’s open our Python script and import some modules.
# Imports
from agno.agent import Agent
from agno.models.google import Gemini
import os
# Import Guard and Validator
from guardrails import Guard
from guardrails.hub import RestrictToTopic
Next, we create the guard. We will allow our agent to talk only about sports or the weather, and we will block it from talking about stocks.
# Setup Guard (the topic lists are reconstructed from the surrounding text)
guard = Guard().use(
    RestrictToTopic(valid_topics=["sports", "weather"], invalid_topics=["stocks"])
)
Now we can run the agent and the guard.
# Create agent (the Gemini model id and instructions are placeholders; the originals were lost in extraction)
agent = Agent(
    model=Gemini(id="gemini-2.0-flash"),
    description="An assistant agent",
    instructions=["Answer the user's question concisely."],
    markdown=True
)
# Run the agent (hypothetical prompt about a stock symbol, per the text below)
response = agent.run("What is the ticker symbol for Apple?").content
# Run agent with validation
validation_step = guard.validate(response)
# Print validated response
if validation_step.validation_passed:
    print(response)
else:
    print("Validation Failed", validation_step.error)  # error field assumed to hold the failure reason
This is the response when we ask about a stock symbol.
Validation Failed Invalid topics found:
If I ask about a topic that is not on the valid_topics list, I will also see a block.
"What's the number one soda drink?"
Validation Failed No valid topic was found.
Finally, let’s ask about sports.
"Who is Michael Jordan?"
Michael Jordan is a former professional basketball player widely considered one of
the greatest of all time. He won six NBA championships with the Chicago Bulls.
And we saw a response this time, as it is a valid topic.
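In a real application, you usually want to return a graceful fallback instead of surfacing the raw validation error to the user. Here is a minimal sketch reusing the guard and agent defined above; the safe_answer helper and its fallback message are my own additions, not part of Guardrails AI.
# Hypothetical helper: answer only when the guard lets the response through
def safe_answer(agent, guard, prompt: str) -> str:
    response = agent.run(prompt).content
    outcome = guard.validate(response)
    if outcome.validation_passed:
        return response
    # Fallback text is arbitrary; adjust it to your use case
    return "Sorry, I can only help with sports and weather questions."

print(safe_answer(agent, guard, "Who is Michael Jordan?"))    # passes the guard
print(safe_answer(agent, guard, "Should I buy NVDA stock?"))  # blocked topic, returns the fallback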
Let’s move on to the evaluation of agents now.
Agent Evaluation
Since I started studying LLMs and Agentic AI, one of my main questions has been about model evaluation. Unlike traditional Data Science modeling, where you have well-defined metrics suited to each case, evaluation for AI Agents is much blurrier.
Fortunately, the developer community is pretty quick to find solutions for almost everything, and they created a nice package for LLM evaluation: deepeval.
DeepEval is a library created by Confident AI that gathers many methods to evaluate LLMs and AI Agents. In this section, let’s learn a couple of the main methods, just so we can build some intuition on the subject, especially since the library is quite extensive.
The first evaluation is the most basic we can use, and it is called G-Eval. As AI tools like ChatGPT become more common in everyday tasks, we have to make sure they’re giving helpful and accurate responses. That’s where G-Eval from the DeepEval Python package comes in.
G-Eval is like a smart reviewer that uses another AI model to evaluate how well a chatbot or AI assistant is performing. For example, my agent runs on Gemini, and I am using OpenAI to assess it. This method takes a more scalable approach than human review by asking an AI to “grade” another AI’s answers based on things like relevance, correctness, and clarity.
It’s a nice way to test and improve generative AI systems in a more scalable way. Let’s quickly code an example. We will import the modules, create a prompt and a simple chat agent, and ask it for a description of the weather in NYC for the month of May.
# Imports
from agno.agent import Agent
from agno.models.google import Gemini
import os
# Evaluation Modules
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import GEval
# Prompt
prompt = "Describe the weather in NYC for May"
# Create agent (the Gemini model id and instructions are placeholders; the originals were lost in extraction)
agent = Agent(
    model=Gemini(id="gemini-2.0-flash"),
    description="An assistant agent",
    instructions=["Answer the user's question concisely."],
    markdown=True,
    monitoring=True
)
# Run agent
response = agent.run(prompt)
# Print response
print(response.content)
It responds: “Mild, with average highs in the 60s°F and lows in the 50s°F. Expect some rain.”
Nice. Seems pretty good to me.
But how can we put a number on it and show a potential manager or client how our agent is doing?
Here is how:
Create a test case passing the prompt and the response to the LLMTestCase class.
Create a metric. We will use the GEval method, prompt the model to test for coherence, and then define what coherence means to me.
Pass the actual output as evaluation_params.
Run the measure method and get the score and reason from it.
# Test Case
test_case = LLMTestCase(input=prompt, actual_output=response.content)
# Setup the Metric (the criteria string is the one shown in the verbose logs below)
coherence_metric = GEval(
    name="Coherence",
    criteria="Coherence. The agent can answer the prompt and the response makes sense.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT]
)
# Run the metric
coherence_metric.measure(test_case)
print(coherence_metric.score)
print(coherence_metric.reason)
The output looks like this.
0.9
The response directly addresses the prompt about NYC weather in May,
maintains logical consistency, flows naturally, and uses clear language.
However, it could be slightly more detailed.
0.9 seems pretty good, given that the default threshold is 0.5.
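If you need a hard pass/fail signal, for example in a CI pipeline, you can tighten the threshold when building the metric. Below is a small sketch reusing the same test case; the 0.8 cutoff is an arbitrary choice of mine, not a DeepEval default.
# Stricter coherence check: only scores >= 0.8 count as a pass (threshold is arbitrary)
strict_coherence = GEval(
    name="Coherence (strict)",
    criteria="Coherence. The agent can answer the prompt and the response makes sense.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.8,
)
strict_coherence.measure(test_case)
print(strict_coherence.is_successful())  # True only if the score clears the threshold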
If you want to check the logs, use this next snippet.
# Check the logs (verbose_logs is assumed to hold the criteria and generated evaluation steps; see the DeepEval docs)
print(coherence_metric.verbose_logs)
Here’s the response.
Criteria:
Coherence. The agent can answer the prompt and the response makes sense.
Evaluation Steps:
Very nice. Now let us learn about another interesting use case: evaluating task completion for AI Agents. In other words, how well does our agent perform when asked to carry out a task, and how much of it can it actually deliver?
First, we are creating a simple agent that can access Wikipedia and summarize the topic of the query.
# Imports
from agno.agent import Agent
from agno.models.google import Gemini
from agno.tools.wikipedia import WikipediaTools
import os
from deepeval.test_case import LLMTestCase, ToolCall
from deepeval.metrics import TaskCompletionMetric
from deepeval import evaluate
# Prompt
prompt = "Search wikipedia for 'Time series analysis' and summarize the 3 main points"
# Create agent (the Gemini model id is a placeholder; the original was lost in extraction)
agent = Agent(
    model=Gemini(id="gemini-2.0-flash"),
    description="You are a researcher specialized in searching the wikipedia.",
    tools=[WikipediaTools()],
    show_tool_calls=True,
    markdown=True,
    read_tool_call_history=True
)
# Run agent
response = agent.run(prompt)
# Print response
print(response.content)
The result looks very good. Let’s evaluate it using the TaskCompletionMetric class.
# Create a Metric (the judge model and threshold are assumptions; adjust to your setup)
metric = TaskCompletionMetric(threshold=0.7, model="gpt-4o-mini", include_reason=True)
# Test Case (the tool name passed to ToolCall is a placeholder)
test_case = LLMTestCase(
    input=prompt,
    actual_output=response.content,
    tools_called=[ToolCall(name="wikipedia")]
)
# Evaluate
evaluate(test_cases=[test_case], metrics=[metric])
Output, including the agent’s response.
======================================================================
Metrics Summary
- Task Completion
For test case:
- input: Search wikipedia for 'Time series analysis' and summarize the 3 main points
- actual output: Here are the 3 main points about Time series analysis based on the
Wikipedia search:
1. **Definition:** A time series is a sequence of data points indexed in time order,
often taken at successive, equally spaced points in time.
2. **Applications:** Time series analysis is used in various fields like statistics,
signal processing, econometrics, weather forecasting, and more, wherever temporal
measurements are involved.
3. **Purpose:** Time series analysis involves methods for extracting meaningful
statistics and characteristics from time series data, and time series forecasting
uses models to predict future values based on past observations.
- expected output: None
- context: None
- retrieval context: None
======================================================================
Overall Metric Pass Rates
Task Completion: 100.00% pass rate
======================================================================
✓ Tests finished ! Run 'deepeval login' to save and analyze evaluation results
on Confident AI.
Our agent passed the test with honors: 100%!
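If you want to grade several dimensions at once, DeepEval accepts multiple metrics in a single evaluate call. A minimal sketch, assuming the task-completion test case and the coherence metric from the previous example live in the same script:
# Score the same test case against both metrics in one evaluation pass
evaluate(
    test_cases=[test_case],
    metrics=[metric, coherence_metric],
)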
You can learn much more about the DeepEval library in this link.
Finally, in the next section, we will learn the capabilities of Agno’s library for monitoring agents.
Agent Monitoring
Like I told you in my previous post, I chose Agno to learn more about Agentic AI. Just to be clear, this is not a sponsored post. It is just that I think this is the best option for those starting their journey learning about this topic.
So, one of the cool things we can take advantage of using Agno’s framework is the app they make available for model monitoring.
Take this agent that can search the internet and write Instagram posts, for example.
# Imports
import os
from agno.agent import Agent
from agno.models.google import Gemini
from agno.tools.file import FileTools
from agno.tools.googlesearch import GoogleSearchTools
# Topic
topic = "Healthy Eating"
# Create agent (the Gemini model id and the final prompt are placeholders; the originals were lost in extraction)
agent = Agent(
    model=Gemini(id="gemini-2.0-flash"),
    description=f"""You are a social media marketer specialized in creating engaging content.
    Search the internet for 'trending topics about {topic}' and use them to create a post.""",
    tools=[FileTools(), GoogleSearchTools()],
    expected_output="""A short post for instagram and a prompt for a picture related to the content of the post.
    Don't use emojis or special characters in the post. If you find an error in the character encoding, remove the character before saving the file.
    Use the template:
    - Post
    - Prompt for the picture
    Save the post to a file named 'post.txt'.""",
    show_tool_calls=True,
    monitoring=True
)
# Writing and saving a file
agent.print_response(f"Write a post about {topic}")
To monitor its performance, follow these steps:
Go to Agno’s platform and get an API Key.
Open a terminal and type ag setup.
If it is the first time, it might ask for the API Key. Copy and paste it into the terminal prompt.
You will see the Dashboard tab open in your browser.
If you want to monitor your agent, add the argument monitoring=True.
Run your agent.
Go to the Dashboard on the web browser.
Click on Sessions. As it is a single agent, you will see it under the Agents tab at the top of the page.
Agno Dashboard after running the agent. Image by the author.
The cool features we can see there are:
Info about the model
The response
Tools used
Tokens consumed
This is the resulting token consumption while saving the file. Image by the author.
Pretty neat, huh?
This is useful for understanding where the agent is spending more (or fewer) tokens and where it takes more time to perform a task, for example.
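You can also inspect the same numbers locally, without opening the dashboard. This is a sketch based on my assumption that Agno’s run response exposes a metrics attribute; the exact field names may differ between versions, so check the documentation for the release you are using.
# Run the agent and print token/timing metrics locally (attribute structure may vary by Agno version)
run = agent.run(f"Write a short post about {topic}")
print(run.metrics)  # typically includes input_tokens, output_tokens, and timing information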
Well, let’s wrap up then.
Before You Go
We have learned a lot in this second round. In this post, we covered:
Guardrails for AI are essential safety measures and ethical guidelines implemented to prevent unintended harmful outputs and ensure responsible AI behavior.
Model evaluation, exemplified by GEval for broad assessment and TaskCompletionMetric for agents’ output quality in DeepEval, is crucial for understanding AI capabilities and limitations.
Model monitoring with Agno’s app, including tracking token usage and response time, is vital for managing costs, ensuring performance, and identifying potential issues in deployed AI systems.
Contact & Follow Me
If you liked this content, find more of my work on my website.
GitHub Repository