Agentic AI 102: Guardrails and Agent Evaluation
Introduction
In the first post of this series, we talked about the fundamentals of creating AI Agents and introduced concepts like reasoning, memory, and tools.
Of course, that first post touched only the surface of this new area of the data industry. There is so much more that can be done, and we are going to learn more along the way in this series.
So, it is time to take one step further.
In this post, we will cover three topics:
Guardrails: safety blocks that prevent a Large Language Model (LLM) from responding about certain topics.
Agent Evaluation: Have you ever wondered how accurate an LLM's responses really are? I bet you have. We will look at the main ways to measure that.
Monitoring: We will also learn about the built-in monitoring app in Agno’s framework.
We shall begin now.
Guardrails
Our first topic is the simplest, in my opinion. Guardrails are rules that will keep an AI agent from responding to a given topic or list of topics.
There is a good chance you have asked ChatGPT or Gemini something and received a response like “I can’t talk about this topic” or “Please consult a professional specialist”. Usually, that occurs with sensitive topics like health advice, psychological conditions, or financial advice.
Those blocks are safeguards to keep people from harming themselves, their health, or their finances. As we know, LLMs are trained on massive amounts of text and inherit plenty of bad content along with it, which could easily translate into bad advice in those areas. And I haven’t even mentioned hallucinations!
Think about how many stories there are of people who lost money by following investment tips from online forums. Or how many people took the wrong medicine because they read about it on the internet.
Well, I guess you got the point. We must prevent our agents from talking about certain topics or taking certain actions. For that, we will use guardrails.
The best framework I found to impose those blocks is Guardrails AI. There, you will see a hub full of predefined rules that a response must follow in order to pass and be displayed to the user.
To get started quickly, first go to the Guardrails AI website and get an API key. Then install the package and run the guardrails configure command. It will ask a couple of questions that you can answer with n, and it will prompt you for the API key you generated.
pip install guardrails-ai
guardrails configure
Once that is completed, go to the Guardrails AI Hub and choose the guardrail you need. Every guardrail has instructions on how to implement it. Basically, you install it via the command line and then use it like a module in Python.
For this example, we’re choosing one called Restrict to Topic, which, as its name says, lets the user talk only about what’s in the list. So, go back to the terminal and install it using the code below.
guardrails hub install hub://tryolabs/restricttotopic
Next, let’s open our Python script and import some modules.
# Imports
from agno.agent import Agent
from agno.models.google import Gemini
import os
# Import Guard and Validator
from guardrails import Guard
from guardrails.hub import RestrictToTopic
Next, we create the guard. We will allow our agent to talk only about sports or the weather, and we will block it from talking about stocks.
# Setup Guard (the topic lists are reconstructed from the surrounding text)
guard = Guard().use(
    RestrictToTopic(valid_topics=["sports", "weather"], invalid_topics=["stocks"])
)
Now we can run the agent and the guard.
# Create agent (the Gemini model id and instructions are placeholders; the originals were lost in extraction)
agent = Agent(
    model=Gemini(id="gemini-2.0-flash"),
    description="An assistant agent",
    instructions=["Answer the user's question concisely."],
    markdown=True
)
# Run the agent (hypothetical prompt about a stock symbol, per the text below)
response = agent.run("What is the ticker symbol for Apple?").content
# Run agent with validation
validation_step = guard.validate(response)
# Print validated response
if validation_step.validation_passed:
    print(response)
else:
    print("Validation Failed", validation_step.error)  # error field assumed to hold the failure reason
This is the response when we ask about a stock symbol.
Validation Failed Invalid topics found:
If I ask about a topic that is not on the valid_topics list, I will also see a block.
"What's the number one soda drink?"
Validation Failed No valid topic was found.
Finally, let’s ask about sports.
"Who is Michael Jordan?"
Michael Jordan is a former professional basketball player widely considered one of
the greatest of all time. He won six NBA championships with the Chicago Bulls.
And we saw a response this time, as it is a valid topic.
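In a real application, you usually want to return a graceful fallback instead of surfacing the raw validation error to the user. Here is a minimal sketch reusing the guard and agent defined above; the safe_answer helper and its fallback message are my own additions, not part of Guardrails AI.
# Hypothetical helper: answer only when the guard lets the response through
def safe_answer(agent, guard, prompt: str) -> str:
    response = agent.run(prompt).content
    outcome = guard.validate(response)
    if outcome.validation_passed:
        return response
    # Fallback text is arbitrary; adjust it to your use case
    return "Sorry, I can only help with sports and weather questions."

print(safe_answer(agent, guard, "Who is Michael Jordan?"))    # passes the guard
print(safe_answer(agent, guard, "Should I buy NVDA stock?"))  # blocked topic, returns the fallback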
Let’s move on to the evaluation of agents now.
Agent Evaluation
Since I started studying LLMs and Agentic AI, one of my main questions has been about model evaluation. Unlike traditional Data Science modeling, where you have well-defined metrics suited to each case, evaluation for AI Agents is much blurrier.
Fortunately, the developer community is pretty quick to find solutions for almost everything, and they created a nice package for LLM evaluation: deepeval.
DeepEval is a library created by Confident AI that gathers many methods to evaluate LLMs and AI Agents. In this section, let’s learn a couple of the main methods, just so we can build some intuition on the subject, especially since the library is quite extensive.
The first evaluation is the most basic we can use, and it is called G-Eval. As AI tools like ChatGPT become more common in everyday tasks, we have to make sure they’re giving helpful and accurate responses. That’s where G-Eval from the DeepEval Python package comes in.
G-Eval is like a smart reviewer that uses another AI model to evaluate how well a chatbot or AI assistant is performing. For example, my agent runs on Gemini, and I am using OpenAI to assess it. This method takes a more scalable approach than human review by asking an AI to “grade” another AI’s answers based on things like relevance, correctness, and clarity.
It’s a nice way to test and improve generative AI systems in a more scalable way. Let’s quickly code an example. We will import the modules, create a prompt and a simple chat agent, and ask it for a description of the weather in NYC for the month of May.
# Imports
from agno.agent import Agent
from agno.models.google import Gemini
import os
# Evaluation Modules
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import GEval
# Prompt
prompt = "Describe the weather in NYC for May"
# Create agent (the Gemini model id and instructions are placeholders; the originals were lost in extraction)
agent = Agent(
    model=Gemini(id="gemini-2.0-flash"),
    description="An assistant agent",
    instructions=["Answer the user's question concisely."],
    markdown=True,
    monitoring=True
)
# Run agent
response = agent.run(prompt)
# Print response
print(response.content)
It responds: “Mild, with average highs in the 60s°F and lows in the 50s°F. Expect some rain.”
Nice. Seems pretty good to me.
But how can we put a number on it and show a potential manager or client how our agent is doing?
Here is how:
Create a test case passing the prompt and the response to the LLMTestCase class.
Create a metric. We will use the GEval method, prompt the model to test for coherence, and then define what coherence means to me.
Pass the actual output as evaluation_params.
Run the measure method and get the score and reason from it.
# Test Case
test_case = LLMTestCase(input=prompt, actual_output=response.content)
# Setup the Metric (the criteria string is the one shown in the verbose logs below)
coherence_metric = GEval(
    name="Coherence",
    criteria="Coherence. The agent can answer the prompt and the response makes sense.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT]
)
# Run the metric
coherence_metric.measure(test_case)
print(coherence_metric.score)
print(coherence_metric.reason)
The output looks like this.
0.9
The response directly addresses the prompt about NYC weather in May,
maintains logical consistency, flows naturally, and uses clear language.
However, it could be slightly more detailed.
0.9 seems pretty good, given that the default threshold is 0.5.
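If you need a hard pass/fail signal, for example in a CI pipeline, you can tighten the threshold when building the metric. Below is a small sketch reusing the same test case; the 0.8 cutoff is an arbitrary choice of mine, not a DeepEval default.
# Stricter coherence check: only scores >= 0.8 count as a pass (threshold is arbitrary)
strict_coherence = GEval(
    name="Coherence (strict)",
    criteria="Coherence. The agent can answer the prompt and the response makes sense.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.8,
)
strict_coherence.measure(test_case)
print(strict_coherence.is_successful())  # True only if the score clears the threshold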
If you want to check the logs, use this next snippet.
# Check the logs (verbose_logs is assumed to hold the criteria and generated evaluation steps; see the DeepEval docs)
print(coherence_metric.verbose_logs)
Here’s the response.
Criteria:
Coherence. The agent can answer the prompt and the response makes sense.
Evaluation Steps:
Very nice. Now let us learn about another interesting use case: evaluating task completion for AI Agents. In other words, how well does our agent perform when asked to carry out a task, and how much of it can it actually deliver?
First, we are creating a simple agent that can access Wikipedia and summarize the topic of the query.
# Imports
from agno.agent import Agent
from agno.models.google import Gemini
from agno.tools.wikipedia import WikipediaTools
import os
from deepeval.test_case import LLMTestCase, ToolCall
from deepeval.metrics import TaskCompletionMetric
from deepeval import evaluate
# Prompt
prompt = "Search wikipedia for 'Time series analysis' and summarize the 3 main points"
# Create agent (the Gemini model id is a placeholder; the original was lost in extraction)
agent = Agent(
    model=Gemini(id="gemini-2.0-flash"),
    description="You are a researcher specialized in searching the wikipedia.",
    tools=[WikipediaTools()],
    show_tool_calls=True,
    markdown=True,
    read_tool_call_history=True
)
# Run agent
response = agent.run(prompt)
# Print response
print(response.content)
The result looks very good. Let’s evaluate it using the TaskCompletionMetric class.
# Create a Metric (the judge model and threshold are assumptions; adjust to your setup)
metric = TaskCompletionMetric(threshold=0.7, model="gpt-4o-mini", include_reason=True)
# Test Case (the tool name passed to ToolCall is a placeholder)
test_case = LLMTestCase(
    input=prompt,
    actual_output=response.content,
    tools_called=[ToolCall(name="wikipedia")]
)
# Evaluate
evaluate(test_cases=[test_case], metrics=[metric])
Output, including the agent’s response.
======================================================================
Metrics Summary
- Task Completion
For test case:
- input: Search wikipedia for 'Time series analysis' and summarize the 3 main points
- actual output: Here are the 3 main points about Time series analysis based on the
Wikipedia search:
1. **Definition:** A time series is a sequence of data points indexed in time order,
often taken at successive, equally spaced points in time.
2. **Applications:** Time series analysis is used in various fields like statistics,
signal processing, econometrics, weather forecasting, and more, wherever temporal
measurements are involved.
3. **Purpose:** Time series analysis involves methods for extracting meaningful
statistics and characteristics from time series data, and time series forecasting
uses models to predict future values based on past observations.
- expected output: None
- context: None
- retrieval context: None
======================================================================
Overall Metric Pass Rates
Task Completion: 100.00% pass rate
======================================================================
✓ Tests finished ! Run 'deepeval login' to save and analyze evaluation results
on Confident AI.
Our agent passed the test with honors: 100%!
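If you want to grade several dimensions at once, DeepEval accepts multiple metrics in a single evaluate call. A minimal sketch, assuming the task-completion test case and the coherence metric from the previous example live in the same script:
# Score the same test case against both metrics in one evaluation pass
evaluate(
    test_cases=[test_case],
    metrics=[metric, coherence_metric],
)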
You can learn much more about the DeepEval library in this link.
Finally, in the next section, we will learn the capabilities of Agno’s library for monitoring agents.
Agent Monitoring
Like I told you in my previous post, I chose Agno to learn more about Agentic AI. Just to be clear, this is not a sponsored post. It is just that I think this is the best option for those starting their journey learning about this topic.
So, one of the cool things we can take advantage of using Agno’s framework is the app they make available for model monitoring.
Take this agent that can search the internet and write Instagram posts, for example.
# Imports
import os
from agno.agent import Agent
from agno.models.google import Gemini
from agno.tools.file import FileTools
from agno.tools.googlesearch import GoogleSearchTools
# Topic
topic = "Healthy Eating"
# Create agent (the Gemini model id and the final prompt are placeholders; the originals were lost in extraction)
agent = Agent(
    model=Gemini(id="gemini-2.0-flash"),
    description=f"""You are a social media marketer specialized in creating engaging content.
    Search the internet for 'trending topics about {topic}' and use them to create a post.""",
    tools=[FileTools(), GoogleSearchTools()],
    expected_output="""A short post for instagram and a prompt for a picture related to the content of the post.
    Don't use emojis or special characters in the post. If you find an error in the character encoding, remove the character before saving the file.
    Use the template:
    - Post
    - Prompt for the picture
    Save the post to a file named 'post.txt'.""",
    show_tool_calls=True,
    monitoring=True
)
# Writing and saving a file
agent.print_response(f"Write a post about {topic}")
To monitor its performance, follow these steps:
Go to Agno’s platform and get an API Key.
Open a terminal and type ag setup.
If it is the first time, it might ask for the API Key. Copy and paste it into the terminal prompt.
You will see the Dashboard tab open in your browser.
If you want to monitor your agent, add the argument monitoring=True.
Run your agent.
Go to the Dashboard on the web browser.
Click on Sessions. As it is a single agent, you will see it under the Agents tab at the top of the page.
Agno Dashboard after running the agent. Image by the author.
The cool features we can see there are:
Info about the model
The response
Tools used
Tokens consumed
This is the resulting token consumption while saving the file. Image by the author.
Pretty neat, huh?
This is useful for understanding where the agent is spending more (or fewer) tokens and where it takes more time to perform a task, for example.
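You can also inspect the same numbers locally, without opening the dashboard. This is a sketch based on my assumption that Agno’s run response exposes a metrics attribute; the exact field names may differ between versions, so check the documentation for the release you are using.
# Run the agent and print token/timing metrics locally (attribute structure may vary by Agno version)
run = agent.run(f"Write a short post about {topic}")
print(run.metrics)  # typically includes input_tokens, output_tokens, and timing information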
Well, let’s wrap up then.
Before You Go
We have learned a lot in this second round. In this post, we covered:
Guardrails for AI are essential safety measures and ethical guidelines implemented to prevent unintended harmful outputs and ensure responsible AI behavior.
Model evaluation, exemplified by GEval for broad assessment and TaskCompletionMetric for agents’ output quality in DeepEval, is crucial for understanding AI capabilities and limitations.
Model monitoring with Agno’s app, including tracking token usage and response time, is vital for managing costs, ensuring performance, and identifying potential issues in deployed AI systems.
Contact & Follow Me
If you liked this content, find more of my work on my website.
GitHub Repository