TOWARDSDATASCIENCE.COM
AI Agents Processing Time Series and Large Dataframes
Intro
Agents are AI systems, powered by LLMs, that can reason about their objectives and take actions to achieve a final goal. They are designed not just to respond to queries, but to orchestrate a sequence of operations, including processing data (e.g. dataframes and time series). This ability unlocks numerous real-world applications that democratize access to data analysis, such as automated reporting, no-code queries, and support for data cleaning and manipulation.
Agents can interact with dataframes in two different ways:
with natural language — the LLM reads the table as a string and tries to make sense of it based on its knowledge base
by generating and executing code — the Agent activates tools to process the dataset as an object.
So, by combining the power of NLP with the precision of code execution, AI Agents enable a broader range of users to interact with complex datasets and derive insights.
In this tutorial, I’m going to show how to process dataframes and time series with AI Agents. I will present some useful Python code that can be easily applied in other similar cases (just copy, paste, run) and walk through every line of code with comments so that you can replicate this example (link to full code at the end of the article).
Setup
Let’s start by setting up Ollama (pip install ollama==0.4.7), a library that allows users to run open-source LLMs locally, without needing cloud-based services, giving more control over data privacy and performance. Since it runs locally, no conversation data leaves your machine.
First of all, you need to download Ollama from the website.
Then, in your terminal, use the Ollama command to download the selected LLM (for example ollama pull qwen2.5, assuming the model tag matches the one used in the code below). I’m going with Alibaba’s Qwen, as it’s both smart and light.
After the download is completed, you can move on to Python and start writing code.
import ollama
llm = "qwen2.5"
Let’s test the LLM:
stream = ollama.generate(model=llm, prompt='''what time is it?''', stream=True)
for chunk in stream:
    print(chunk['response'], end='', flush=True)
Time Series
A time series is a sequence of data points measured over time, often used for analysis and forecasting. It allows us to see how variables change over time, and it’s used to identify trends and seasonal patterns.
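To make the idea of a trend concrete, here is a minimal sketch of a moving average in plain Python. The values and the window size are made up for illustration and are not part of the dataset used in this tutorial.

```python
def moving_average(values, window=3):
    # average each run of `window` consecutive points to smooth out noise
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

values = [3, 5, 4, 8, 9, 7]   # hypothetical monthly observations
trend = moving_average(values, window=3)
print(trend)
```

A rising smoothed series suggests an upward trend even when individual months go up and down.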
I’m going to generate a fake time series dataset to use as an example.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
## create data
np.random.seed(1) #<--for reproducibility
length = 30
ts = pd.DataFrame(data=np.random.randint(low=0, high=15, size=length),
                  columns=['y'],
                  index=pd.date_range(start='2023-01-01', freq='MS', periods=length).strftime('%Y-%m'))
## plot
ts.plot(kind="bar", figsize=(10,3), legend=False, color="black").grid(axis='y')
Usually, time series datasets have a really simple structure with the main variable as a column and the time as the index.
Before transforming it into a string, I want to make sure that everything is placed under a column, so that we don’t lose any piece of information.
dtf = ts.reset_index().rename(columns={"index":"date"})
dtf.head()
Then, I convert the dataframe into a list of dictionaries, one per row.
data = dtf.to_dict(orient='records')
data[0:5]
Finally, from the list of dictionaries to a string.
str_data = "\n".join([str(row) for row in data])
str_data
Now that we have a string, it can be included in a prompt that any language model is able to process. When you paste a dataset into a prompt, the LLM reads the data as plain text, but can still understand the structure and meaning based on patterns seen during training.
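The string of dictionaries repeats every column name on every row. As a possible alternative (not the approach used in this tutorial), the same records could be serialized as CSV, which states the column names once and is usually shorter. This is a minimal sketch with a few hypothetical records; records_to_csv is a helper name introduced here for illustration.

```python
import csv
import io

# hypothetical records, shaped like the output of to_dict(orient='records')
records = [{"date": "2023-01", "y": 5},
           {"date": "2023-02", "y": 11},
           {"date": "2023-03", "y": 12}]

def records_to_csv(rows):
    # write a header once, then one comma-separated line per record
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue().strip()

dict_str = "\n".join(str(row) for row in records)
csv_str = records_to_csv(records)
print(csv_str)
print(len(dict_str), "vs", len(csv_str), "characters")
```

Fewer characters generally means fewer tokens, which matters more as the dataset grows.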
prompt = f'''
Analyze this dataset, it contains monthly sales data of an online retail product:
{str_data}
'''
We can easily start a chat with the LLM. Please note that, right now, this is not an Agent because it doesn’t have any Tools; we’re just using the language model. While it doesn’t process numbers like a computer, the LLM can recognize column names, time-based patterns, trends, and outliers, especially in smaller datasets. It can simulate analysis and explain findings, but it won’t perform precise calculations on its own, since it isn’t executing code as an Agent would.
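Since the plain LLM can’t be trusted with exact arithmetic, one possible workaround (an assumption on my side, not something the chat below does) is to compute the key figures in Python first and paste them into the prompt as facts. A minimal sketch with hypothetical values:

```python
import statistics

values = [5, 11, 12, 8, 9, 11]   # hypothetical monthly sales

# exact numbers computed by Python, not estimated by the model
summary = {"count": len(values),
           "mean": round(statistics.mean(values), 2),
           "min": min(values),
           "max": max(values)}

facts = "\n".join(f"{k}: {v}" for k, v in summary.items())
prompt_with_facts = f"Monthly sales summary (computed in Python):\n{facts}"
print(prompt_with_facts)
```

The model can then explain these numbers in natural language instead of guessing them.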
messages = [{"role":"system", "content":prompt}]
while True:
    ## User
    q = input(' >')
    if q == "quit":
        break
    messages.append( {"role":"user", "content":q} )
    ## Model
    agent_res = ollama.chat(model=llm, messages=messages, tools=[])
    res = agent_res["message"]["content"]
    ## Response
    print(" >", f"\x1b[1;30m{res}\x1b[0m")
    messages.append( {"role":"assistant", "content":res} )
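Because every turn appends to messages, a long chat keeps growing. One possible safeguard, sketched here as an assumption rather than part of the original code, is to keep the system prompt (which holds the dataset) and only the most recent messages. The name trim_history and the max_turns parameter are hypothetical.

```python
def trim_history(messages, max_turns=4):
    # keep every system message, plus only the most recent `max_turns` other messages
    system = [m for m in messages if m["role"] == "system"]
    others = [m for m in messages if m["role"] != "system"]
    recent = others[-max_turns:] if max_turns > 0 else []
    return system + recent

# simulate a chat with 5 question/answer pairs
history = [{"role": "system", "content": "dataset goes here"}]
for i in range(5):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

trimmed = trim_history(history, max_turns=4)
print([m["content"] for m in trimmed])
```

The trade-off is that the model forgets older turns, so the window size should match how much back-reference the conversation needs.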
The LLM recognizes numbers and understands the general context, the same way it might understand a recipe or a line of code.
As you can see, using LLMs to analyze time series is great for quick and conversational insights.
Agent
LLMs are good for brainstorming and light exploration, whereas an Agent can run code. Therefore, an Agent can handle more complex tasks like plotting, forecasting, and anomaly detection. So, let’s create the Tools.
Sometimes, it can be more effective to treat the “final answer” as a Tool. For example, if the Agent does multiple actions to generate intermediate results, the final answer can be thought of as the Tool that integrates all of this information into a cohesive response. By designing it this way, you have more customization and control over the results.
def final_answer(text:str) -> str:
    return text
tool_final_answer = {'type':'function', 'function':{
  'name': 'final_answer',
  'description': 'Returns a natural language response to the user',
  'parameters': {'type': 'object',
                 'required': ['text'],
                 'properties': {'text': {'type':'string', 'description':'natural language response'}}  #<--JSON Schema calls this type 'string'
}}}
final_answer(text="hi")
Then, the coding Tool.
import io
import contextlib
def code_exec(code:str) -> str:
    output = io.StringIO()
    with contextlib.redirect_stdout(output):
        try:
            exec(code)
        except Exception as e:
            print(f"Error: {e}")
    return output.getvalue()
tool_code_exec = {'type':'function', 'function':{
  'name': 'code_exec',
  'description': 'Execute python code. Always use the function print() to get the output.',
  'parameters': {'type': 'object',
                 'required': ['code'],
                 'properties': {
                    'code': {'type':'string', 'description':'code to execute'},  #<--JSON Schema calls this type 'string'
}}}}
code_exec("from datetime import datetime; print(datetime.now().strftime('%H:%M'))")
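Keep in mind that exec() runs whatever the model writes with the full permissions of your Python process, so this Tool is not a sandbox. As a small variation (my own sketch, not the function used in the rest of this tutorial), you could pass an explicit namespace to exec(), which makes it clear which variables the generated code can see and lets results persist between calls. The names code_exec_ns and shared are hypothetical.

```python
import io
import contextlib

def code_exec_ns(code: str, namespace: dict) -> str:
    # run the code with `namespace` as its globals and capture anything printed
    output = io.StringIO()
    with contextlib.redirect_stdout(output):
        try:
            exec(code, namespace)
        except Exception as e:
            print(f"Error: {e}")
    return output.getvalue()

shared = {"values": [3, 1, 2]}   # only these names (plus builtins) are visible
out1 = code_exec_ns("total = sum(values); print(total)", shared)
out2 = code_exec_ns("print(total * 2)", shared)        # `total` persists in `shared`
out3 = code_exec_ns("print(undefined_name)", shared)   # errors are returned as text
print(out1, out2, out3)
```

This still doesn’t restrict what the code can import or do, so it organizes state rather than adding security.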
Moreover, I shall add a couple of utility functions to handle Tool usage and to run the Agent.
dic_tools = {"final_answer":final_answer, "code_exec":code_exec}
# Utils
def use_tool(agent_res:dict, dic_tools:dict) -> dict:
    res, t_name, t_inputs = '', '', ''  #<--defaults, so the return never hits an undefined name
    ## use tool
    if agent_res["message"].get("tool_calls"):
        for tool in agent_res["message"]["tool_calls"]:
            t_name, t_inputs = tool["function"]["name"], tool["function"]["arguments"]
            if f := dic_tools.get(t_name):
                ### calling tool
                print(' >', f"\x1b[1;31m{t_name} -> Inputs: {t_inputs}\x1b[0m")
                ### tool output
                t_output = f(**t_inputs)
                print(t_output)
                ### final res
                res = t_output
            else:
                print(' >', f"\x1b[1;31m{t_name} -> NotFound\x1b[0m")
    ## don't use tool
    if agent_res["message"].get("content"):
        res = agent_res["message"]["content"]
        t_name, t_inputs = '', ''
    return {'res':res, 'tool_used':t_name, 'inputs_used':t_inputs}
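To see what the function above is parsing, here is a minimal sketch that dispatches Tool calls from a hand-written stand-in for the model response. The nested structure mirrors the keys accessed in use_tool, but no LLM is called, and the add Tool is a hypothetical example introduced only for this demo.

```python
def final_answer(text: str) -> str:
    return text

def add(a: int, b: int) -> int:   # hypothetical extra tool, only for this demo
    return a + b

tools = {"final_answer": final_answer, "add": add}

# hand-written stand-in for a model response; no LLM is called here
fake_res = {"message": {"content": "", "tool_calls": [
    {"function": {"name": "add", "arguments": {"a": 2, "b": 3}}},
    {"function": {"name": "unknown_tool", "arguments": {}}},
]}}

results = []
for call in fake_res["message"].get("tool_calls") or []:
    name, args = call["function"]["name"], call["function"]["arguments"]
    func = tools.get(name)
    # unpack the arguments dictionary as keyword arguments of the matching function
    results.append((name, func(**args) if func else "NotFound"))
print(results)
```

Testing the dispatch logic with fake responses like this is a cheap way to debug it before involving the model.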
When the Agent is trying to solve a task, I want it to keep track of the Tools that have been used, the inputs that it tried, and the results it gets. The iteration should stop only when the model is ready to give the final answer.
def run_agent(llm, messages, available_tools):
    tool_used, local_memory = '', ''
    while tool_used != 'final_answer':
        ### use tools
        try:
            agent_res = ollama.chat(model=llm,
                                    messages=messages, tools=[v for v in available_tools.values()])
            dic_res = use_tool(agent_res, dic_tools)
            res, tool_used, inputs_used = dic_res["res"], dic_res["tool_used"], dic_res["inputs_used"]
        ### error
        except Exception as e:
            print(" >", e)
            res = f"I tried to use {tool_used} but didn't work. I will try something else."
            print(" >", f"\x1b[1;30m{res}\x1b[0m")
            messages.append( {"role":"assistant", "content":res} )
        ### update memory
        if tool_used not in ['','final_answer']:
            local_memory += f"\nTool used: {tool_used}.\nInput used: {inputs_used}.\nOutput: {res}"
            messages.append( {"role":"assistant", "content":local_memory} )
            available_tools.pop(tool_used, None)  #<--no KeyError if the tool was already removed
            if len(available_tools) == 1:
                messages.append( {"role":"user", "content":"now activate the tool final_answer."} )
        ### tools not used
        if tool_used == '':
            break
    return res
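The loop above relies on the model eventually calling final_answer. As an extra safeguard (an assumption of mine, not part of the original function), a cap on the number of steps guarantees termination. The sketch below replaces the LLM with a scripted stand-in so it can run without a model; run_with_cap and next_action are hypothetical names.

```python
def run_with_cap(next_action, max_steps=5):
    # next_action: hypothetical callable returning (tool_name, output) for each step
    history = []
    for _ in range(max_steps):
        tool_name, output = next_action(history)
        history.append((tool_name, output))
        # stop when the final answer is given or when no tool was used
        if tool_name in ("final_answer", ""):
            return output, history
    return "Stopped: step limit reached.", history

# scripted stand-in for the model: one coding step, then the final answer
script = iter([("code_exec", "42\n"), ("final_answer", "The result is 42.")])
answer, history = run_with_cap(lambda h: next(script))
print(answer)

# a model that never gives a final answer gets cut off by the cap
stuck_answer, stuck_history = run_with_cap(lambda h: ("code_exec", "x"), max_steps=3)
print(stuck_answer)
```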
As for the coding Tool, I’ve noticed that Agents tend to recreate the dataframe at every step. So I will use a memory reinforcement to remind the model that the dataset already exists, a trick commonly used to get the desired behaviour. Ultimately, memory reinforcements help you get more meaningful and effective interactions.
# Start a chat
messages = [{"role":"system", "content":prompt}]
memory = '''
The dataset already exists and it's called 'dtf', don't create a new one.
'''
while True:
    ## User
    q = input(' >')
    if q == "quit":
        break
    messages.append( {"role":"user", "content":q} )
    ## Memory
    messages.append( {"role":"user", "content":memory} )
    ## Model
    available_tools = {"final_answer":tool_final_answer, "code_exec":tool_code_exec}
    res = run_agent(llm, messages, available_tools)
    ## Response
    print(" >", f"\x1b[1;30m{res}\x1b[0m")
    messages.append( {"role":"assistant", "content":res} )
Creating a plot is something that the LLM alone can’t do. But keep in mind that even if Agents can create images, they can’t see them, because after all, the engine is still a language model. So the user is the only one who visualises the plot.
The Agent is using the library statsmodels to train a model and forecast the time series.
Large Dataframes
LLMs have a limited context window, which restricts how much information they can process at once. Even the most advanced models have token limits (roughly a few hundred pages of text). Additionally, LLMs don’t retain memory across sessions unless a retrieval system is integrated. In practice, to work effectively with large dataframes, developers often use strategies like chunking, RAG, vector databases, and summarizing content before feeding it into the model.
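As an illustration of the chunking strategy mentioned above (which this tutorial doesn’t use later), here is a minimal sketch that groups rows into text chunks under a character budget. Characters are only a rough proxy for tokens, and chunk_rows, max_chars, and the sample rows are all hypothetical.

```python
def chunk_rows(rows, max_chars=200):
    # group rows into newline-joined chunks that stay under `max_chars`
    chunks, current, size = [], [], 0
    for row in rows:
        line = str(row)
        if current and size + len(line) + 1 > max_chars:
            chunks.append("\n".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line) + 1   # +1 accounts for the newline separator
    if current:
        chunks.append("\n".join(current))
    return chunks

# hypothetical records, shaped like the output of to_dict(orient='records')
rows = [{"Id": f"id{i:03d}", "Age": 20 + i, "Score": 50.0 + i} for i in range(10)]
chunks = chunk_rows(rows, max_chars=200)
print(len(chunks), "chunks")
```

Each chunk could then be sent to the model separately, with the partial answers combined in a final step.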
Let’s create a big dataset to play with.
import random
import string
length = 1000
dtf = pd.DataFrame(data={
    'Id': [''.join(random.choices(string.ascii_letters, k=5)) for _ in range(length)],
    'Age': np.random.randint(low=18, high=80, size=length),
    'Score': np.random.uniform(low=50, high=100, size=length).round(1),
    'Status': np.random.choice(['Active','Inactive','Pending'], size=length)
})
dtf.tail()
I’ll add a web-searching Tool so that, with the ability to both execute Python code and search the internet, a general-purpose Agent can look up external information and make data-driven decisions.
In Python, the easiest way to create a web-searching Tool is with the well-known privacy-focused search engine DuckDuckGo (pip install duckduckgo-search==6.3.5). You can directly use the original library or import the LangChain wrapper (pip install langchain-community==0.3.17).
from langchain_community.tools import DuckDuckGoSearchResults
def search_web(query:str) -> str:
    return DuckDuckGoSearchResults(backend="news").run(query)
tool_search_web = {'type':'function', 'function':{
  'name': 'search_web',
  'description': 'Search the web',
  'parameters': {'type': 'object',
                 'required': ['query'],
                 'properties': {
                    'query': {'type':'string', 'description':'the topic or subject to search on the web'},  #<--JSON Schema calls this type 'string'
}}}}
search_web(query="nvidia")
In total, the Agent now has 3 tools.
dic_tools = {'final_answer':final_answer,
'search_web':search_web,
'code_exec':code_exec}
Since I can’t add the full dataframe in the prompt, I shall feed only the first 10 rows so that the LLM can understand the general context of the dataset. Additionally, I will specify where to find the full dataset.
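Besides the first rows, a compact description of the whole table can give the model more context at a low token cost. This is an optional sketch under my own assumptions (the prompt used below only includes the first rows); it rebuilds a similar random dataset so that it can run on its own, and the variable names schema, stats, and context are hypothetical.

```python
import numpy as np
import pandas as pd

# stand-in dataset with the same kind of columns as the one above
rng = np.random.default_rng(0)
dtf = pd.DataFrame({
    "Age": rng.integers(18, 80, size=1000),
    "Score": rng.uniform(50, 100, size=1000).round(1),
    "Status": rng.choice(["Active", "Inactive", "Pending"], size=1000),
})

# column names with their data types
schema = "\n".join(f"- {col}: {dtype}" for col, dtype in dtf.dtypes.items())
# summary statistics of the numeric columns
stats = dtf.describe().round(1).to_string()

context = f"Rows: {len(dtf)}\nColumns:\n{schema}\nNumeric summary:\n{stats}"
print(context)
```

The resulting string could be appended to the prompt next to the sample rows.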
str_data = "\n".join([str(row) for row in dtf.head(10).to_dict(orient='records')])
prompt = f'''
You are a Data Analyst, you will be given a task to solve as best you can.
You have access to the following tools:
- tool 'final_answer' to return a text response.
- tool 'code_exec' to execute Python code.
- tool 'search_web' to search for information on the internet.
If you use the 'code_exec' tool, remember to always use the function print() to get the output.
The dataset already exists and it's called 'dtf', don't create a new one.
This dataset contains credit score for each customer of the bank. Here's the first rows:
{str_data}
'''
Finally, we can run the Agent.
messages = [{"role":"system", "content":prompt}]
memory = '''
The dataset already exists and it's called 'dtf', don't create a new one.
'''
while True:
    ## User
    q = input(' >')
    if q == "quit":
        break
    messages.append( {"role":"user", "content":q} )
    ## Memory
    messages.append( {"role":"user", "content":memory} )
    ## Model
    available_tools = {"final_answer":tool_final_answer, "code_exec":tool_code_exec, "search_web":tool_search_web}
    res = run_agent(llm, messages, available_tools)
    ## Response
    print(" >", f"\x1b[1;30m{res}\x1b[0m")
    messages.append( {"role":"assistant", "content":res} )
In this interaction, the Agent used the coding Tool properly. Now, I want to make it utilize the other tool as well.
At last, I need the Agent to put together all the pieces of information obtained so far from this chat.
Conclusion
This article has been a tutorial to demonstrate how to build from scratch Agents that process time series and large dataframes. We covered both ways that models can interact with the data: through natural language, where the LLM interprets the table as a string using its knowledge base, and by generating and executing code, leveraging tools to process the dataset as an object.
Full code for this article: GitHub
I hope you enjoyed it! Feel free to contact me for questions and feedback, or just to share your interesting projects.
Let’s Connect
The post AI Agents Processing Time Series and Large Dataframes appeared first on Towards Data Science.