# Step-by-Step Guide to Build and Deploy an LLM-Powered Chat with Memory in Streamlit
In this post, I'll show you, step by step, how to build and deploy an LLM-powered chat (using Gemini) in Streamlit, and how to monitor API usage on the Google Cloud Console. Streamlit is a Python framework that makes it super easy to turn your Python scripts into interactive web apps, with almost no front-end work. I recently built bordAI, a chat assistant powered by an LLM and integrated with tools I developed to support embroidery projects. After that, I decided to start this series of posts to share tips I learned along the way.

Here's a quick summary of the post:

- Parts 1 to 6: project setup
- Parts 7 to 13: building the chat
- Parts 14 to 15: deploying and monitoring the app

## 1. Create a New GitHub repository

Go to GitHub and create a new repository.

## 2. Clone the repository locally

→ Execute this command in your terminal to clone it:

```
git clone <your-repository-url>
```

## 3. Set Up a Virtual Environment (optional)

A virtual environment is a separate space on your computer where you can install a specific Python version and libraries without affecting the rest of your system. This is useful because different projects might need different versions of the same libraries.

→ To create a virtual environment:

```
pyenv virtualenv 3.9.14 chat-streamlit-tutorial
```

→ To activate it:

```
pyenv activate chat-streamlit-tutorial
```

## 4. Project Structure

A project structure is simply the way you organize all the files and folders of your project. Ours will look like this:

```
chat-streamlit-tutorial/
│
├── .env
├── .gitignore
├── app.py
├── functions.py
├── requirements.txt
└── README.md
```

- `.env` → file where you store your API key (not pushed to GitHub)
- `.gitignore` → file listing the files and folders git should ignore
- `app.py` → main Streamlit app
- `functions.py` → custom functions to keep the code organized
- `requirements.txt` → list of libraries your project needs
- `README.md` → file that explains what your project is about

→ Execute this inside your project folder to create these files:

```
touch .env .gitignore app.py functions.py requirements.txt
```

→ Inside the `.gitignore` file, add:

```
.env
__pycache__/
```

→ Add this to `requirements.txt`:

```
streamlit
google-generativeai
python-dotenv
```

→ Install the dependencies:

```
pip install -r requirements.txt
```

## 5. Get API Key

An API key is like a password that tells a service you have permission to use it. In this project, we'll use the Gemini API because it has a free tier, so you can experiment without spending money.

- Go to https://aistudio.google.com/
- Create or log in to your account.
- Click "Create API Key", create it, and copy it.

Don't set up billing if you just want to use the free tier. It should say "Free" under "Plan", just like here:

*Image by the author*

We'll use gemini-2.0-flash in this project. It offers a free tier, as you can see in the table below:

*Screenshot by the author from https://aistudio.google.com/plan_information*

- 15 RPM = 15 requests per minute
- 1,000,000 TPM = 1 million tokens per minute
- 1,500 RPD = 1,500 requests per day

Note: these limits are accurate as of April 2025 and may change over time (a small client-side throttle sketch at the end of this section shows one way to stay under the request limit).

Just a heads up: if you are using the free tier, Google may use your prompts to improve its products, including through human review, so it's not recommended to send sensitive information. If you want to read more about this, check this link.
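The 15 RPM limit means a busy public app can start hitting rate-limit errors quickly. One simple safeguard is a client-side throttle that spaces out requests. Below is a minimal sketch, not part of the tutorial code: it uses `st.session_state` (explained in part 8) to remember when the last request was sent, and the `MAX_RPM` value is just the free-tier figure quoted above, so adjust it if the quota changes. Note that it only throttles one user's session, not all users of your app combined.

```python
import time

import streamlit as st

# Assumed free-tier limit (15 requests per minute as of April 2025); adjust if quotas change.
MAX_RPM = 15
MIN_INTERVAL = 60.0 / MAX_RPM  # minimum seconds between requests


def throttle():
    """Sleep just long enough to keep this session under MAX_RPM requests per minute."""
    last = st.session_state.get("last_request_time", 0.0)
    elapsed = time.time() - last
    if elapsed < MIN_INTERVAL:
        time.sleep(MIN_INTERVAL - elapsed)
    st.session_state["last_request_time"] = time.time()

# Usage: call throttle() right before each model.generate_content(...) call.
```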
## 6. Store your API Key

We'll store the API key in a `.env` file. A `.env` file is a simple text file where you keep secret information, so you don't write it directly in your code. We don't want it pushed to GitHub, so we have to add it to `.gitignore`, the file that tells git which files to ignore when you push your changes to the repository. I already mentioned this in part 4, "Project Structure", but just in case you missed it, I'm repeating it here. This step is really important, don't forget it!

→ Add this to `.gitignore`:

```
.env
__pycache__/
```

→ Add the API key to `.env`:

```
API_KEY="your-api-key"
```

If you're running locally, `.env` works fine. However, if you deploy to Streamlit later, you'll have to use `st.secrets`. Here is a function that works in both scenarios.

→ Add this function to your `functions.py`:

```python
import os

import streamlit as st
from dotenv import load_dotenv


def get_secret(key):
    """
    Get a secret from Streamlit or fall back to .env for local development.
    This allows the app to run both on Streamlit Cloud and locally.
    """
    try:
        return st.secrets[key]
    except Exception:
        load_dotenv()
        return os.getenv(key)
```

→ Add this to your `app.py`:

```python
import streamlit as st
import google.generativeai as genai
from functions import get_secret

api_key = get_secret("API_KEY")
```

## 7. Choose the model

I chose gemini-2.0-flash for this project because I think it's a great model with a generous free tier. However, you can explore other models that also offer free tiers and pick your preferred one.

*Screenshot by the author from https://aistudio.google.com/plan_information*

- **Pro**: models designed for high-quality outputs, including reasoning and creativity. Generally used for complex tasks, problem-solving, and content generation. They are multimodal, meaning they can process text, image, video, and audio for both input and output.
- **Flash**: models designed for speed and cost efficiency. They can give lower-quality answers than Pro models on complex tasks. Generally used for chatbots, assistants, and real-time applications like automatic phrase completion. They are multimodal for input; output is currently text only, with other features in development.
- **Lite**: even faster and cheaper than Flash, but with some reduced capabilities; it is multimodal for input only and text-only for output. Its main characteristic is being more economical than Flash, which makes it ideal for generating large amounts of text under cost restrictions.

This link has plenty of details about the models and their differences.

Here we set up the model. Just replace "gemini-2.0-flash" with the model you've chosen.

→ Add this to your `app.py`:

```python
genai.configure(api_key=api_key)
model = genai.GenerativeModel("gemini-2.0-flash")
```

## 8. Build the chat

First, let's go over the key concepts we'll use:

- `st.session_state`: works like a memory for your app. Streamlit reruns your script from top to bottom every time something changes (when you send a message or click a button), so normally all variables would be reset. `st.session_state` lets Streamlit remember values between reruns. However, if you refresh the web page, the session state is lost. (A tiny standalone example follows this list.)
- `st.chat_message(name, avatar)`: creates a chat bubble for a message in the interface. The first parameter is the name of the message author, which can be "user", "human", "assistant", "ai", or any string. If you use user/human or assistant/ai, it comes with default user and bot avatars, which you can change if you want. Check the documentation for more details.
- `st.chat_input(placeholder)`: displays an input box at the bottom of the page for the user to type messages. It has many parameters, so I recommend checking the documentation.
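To see `st.session_state` in action on its own, here is a tiny standalone sketch, separate from the chat app (the variable name `clicks` is just illustrative): a counter that survives reruns because its value lives in session state.

```python
import streamlit as st

# Initialize the counter once; it is preserved across reruns within the same session.
if "clicks" not in st.session_state:
    st.session_state.clicks = 0

# Each button press triggers a rerun of the whole script,
# but the value stored in session_state survives.
if st.button("Click me"):
    st.session_state.clicks += 1

st.write(f"Button clicked {st.session_state.clicks} times")
```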
First, I'll explain each part of the code separately, and then I'll show you the whole thing together.

This initial step initializes the session state, the app's "memory", to keep all the messages within one session:

```python
if "chat_history" not in st.session_state:
    st.session_state.chat_history = []
```

Next, we set a first default message. This is optional, but I like to add it. You could include some initial instructions if that suits your context. Every time Streamlit runs the page and `st.session_state.chat_history` is empty, it appends this message to the history with the role "assistant":

```python
if not st.session_state.chat_history:
    st.session_state.chat_history.append(("assistant", "Hi! How can I help you?"))
```

In my app bordAI, I added an initial message giving context and instructions for my app:

*Image by the author*

For the user part, the first line creates the input box. If `user_message` contains content, we write it to the interface and then append it to `chat_history`:

```python
user_message = st.chat_input("Type your message...")

if user_message:
    st.chat_message("user").write(user_message)
    st.session_state.chat_history.append(("user", user_message))
```

Now let's add the assistant part:

- `system_prompt` is the prompt sent to the model. You could just send `user_message` in place of `full_input` (see the code below), but the output might not be as precise. A prompt provides context and instructions about how you want the model to behave, not just what you want it to answer. A good prompt makes the model's responses more accurate, consistent, and aligned with your goals. In addition, without telling the model how it should behave, it is vulnerable to prompt injection. Prompt injection is when someone tries to manipulate the model's prompt in order to alter its behavior. One way to mitigate this is to structure prompts clearly and delimit the user's message within triple quotes. We'll start with a simple, vague `system_prompt` and improve it in the next section to compare the difference.
- `full_input`: here we organize the input, delimiting the user message with triple quotes ("""). This doesn't prevent all prompt injections, but it's one way to create better and more reliable interactions.
- `response`: sends a request to the API and stores the output.
- `assistant_reply`: extracts the text from the response.

Finally, we use `st.chat_message()` combined with `write()` to display the assistant reply and append it to `st.session_state.chat_history`, just like we did with the user message.

```python
if user_message:
    st.chat_message("user").write(user_message)
    st.session_state.chat_history.append(("user", user_message))

    system_prompt = f"""
    You are an assistant.
    Be nice and kind in all your responses.
    """

    full_input = f"{system_prompt}\n\nUser message:\n\"\"\"{user_message}\"\"\""

    response = model.generate_content(full_input)
    assistant_reply = response.text

    st.chat_message("assistant").write(assistant_reply)
    st.session_state.chat_history.append(("assistant", assistant_reply))
```

Now let's see everything together!
How can I help you?")) user_message = st.chat_input("Type your message...") if user_message: st.chat_message("user").write(user_message) st.session_state.chat_history.append(("user", user_message)) system_prompt = f""" You are an assistant. Be nice and kind in all your responses. """ full_input = f"{system_prompt}\n\nUser message:\n\"\"\"{user_message}\"\"\"" response = model.generate_content(full_input) assistant_reply = response.text st.chat_message("assistant").write(assistant_reply) st.session_state.chat_history.append(("assistant", assistant_reply)) To run and test your app locally, first navigate to the project folder, then execute the following command. → Execute in your terminal: cd chat-streamlit-tutorial streamlit run app.py Yay! You now have a chat running in Streamlit! 9. Prompt Engineering  Prompt Engineering is a process of writing instructions to get the best possible output from an AI model.  There are plenty of techniques for prompt engineering. Here are 5 tips: Write clear and specific instructions. Define a role, expected behavior, and rules for the assistant. Give the right amount of context. Use the delimiters to indicate user input (as I explained in part 8). Ask for the output in a specified format. These tips can be applied to the system_prompt or when you’re writing a prompt to interact with the chat assistant. Our current system prompt is: system_prompt = f""" You are an assistant. Be nice and kind in all your responses. """ It is super vague and provides no guidance to the model.  No clear direction for the assistant, what kind of help it should provide No specification of the role or what is the topic of the assistance No guidelines for structuring the output No context on whether it should be technical or casual Lack of boundaries  We can improve our prompt based on the tips above. Here’s an example. → Change the system_prompt in the app.py:  system_prompt = f""" You are a friendly and a programming tutor. Always explain concepts in a simple and clear way, using examples when possible. If the user asks something unrelated to programming, politely bring the conversation back to programming topics. """ full_input = f"{system_prompt}\n\nUser message:\n\"\"\"{user_message}\"\"\"" If we ask “What is python?” to the old prompt, it just gives a generic short answer: Image by the author With the new prompt, it provides a more detailed response with examples: Image by the author Image by the author Try changing the system_prompt yourself to see the difference in the model outputs and craft the ideal prompt for your context! 10. Choose Generate Content Parameters There are many parameters you can configure when generating content. Here I’ll demonstrate how temperature and maxOutputTokens work. Check the documentation for more details. temperature: controls the randomness of the output, ranging from 0 to 2. The default is 1. Lower values produce more deterministic outputs, while higher values produce more creative ones. maxOutputTokens: the maximum number of tokens that can be generated in the output. A token is approximately four characters.  To change the temperature dynamically and test it, you can create a sidebar slider to control this parameter. 
To change the temperature dynamically and test it, you can create a sidebar slider to control this parameter.

→ Add this to `app.py`:

```python
temperature = st.sidebar.slider(
    label="Select the temperature",
    min_value=0.0,
    max_value=2.0,
    value=1.0
)
```

→ Change the `response` variable to:

```python
response = model.generate_content(
    full_input,
    generation_config={
        "temperature": temperature,
        "max_output_tokens": 1000
    }
)
```

The sidebar will look like this:

*Image by the author*

Try adjusting the temperature to see how the output changes!

## 11. Display chat history

This step keeps all the exchanged messages visible in the chat so you can see the full history. Without it, you would only see the latest user and assistant messages each time you send something. This code reads everything appended to `chat_history` and displays it in the interface.

→ Add this before the `if user_message` block in `app.py`:

```python
for role, message in st.session_state.chat_history:
    st.chat_message(role).write(message)
```

Now all the messages within one session stay visible in the interface:

*Image by the author*

Note: I asked a non-programming question, and the assistant steered the conversation back to programming. Our prompt is working!

## 12. Chat with memory

Even though the messages are stored in `chat_history`, the model isn't aware of the context of our conversation. The API is stateless: each request is independent.

*Image by the author*

To solve this, we pass the conversation history inside the request so the model can reference previous messages. We create `context`, a list containing all the messages exchanged so far, and append the most recent user message at the end so it doesn't get lost in the context.

```python
system_prompt = f"""
You are a friendly and knowledgeable programming tutor.
Always explain concepts in a simple and clear way, using examples when possible.
If the user asks something unrelated to programming, politely bring the conversation back to programming topics.
"""

full_input = f"{system_prompt}\n\nUser message:\n\"\"\"{user_message}\"\"\""

context = [
    *[
        {"role": role, "parts": [{"text": msg}]}
        for role, msg in st.session_state.chat_history
    ],
    {"role": "user", "parts": [{"text": full_input}]}
]

response = model.generate_content(
    context,
    generation_config={
        "temperature": temperature,
        "max_output_tokens": 1000
    }
)
```

Now, I told the assistant I was working on a project to analyze weather data. When I then asked what the theme of my project was, it correctly answered "weather data analysis", because it now has the context of the previous messages.

*Image by the author*

If your context gets too long, consider summarizing it to save costs, since the more tokens you send to the API, the more you pay.
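One simple way to keep the context from growing without bound is to send only the most recent messages verbatim and fold everything older into a short summary. The sketch below is one possible approach, not part of the tutorial code; `MAX_TURNS` and the summarization prompt are illustrative choices you would tune for your app.

```python
MAX_TURNS = 10  # illustrative: how many recent messages to keep verbatim


def build_context(chat_history, full_input, model):
    """Build the request context, summarizing older messages to save tokens."""
    recent = chat_history[-MAX_TURNS:]
    older = chat_history[:-MAX_TURNS]

    context = []
    if older:
        # Summarize the older part of the conversation with one extra API call.
        transcript = "\n".join(f"{role}: {msg}" for role, msg in older)
        summary = model.generate_content(
            f"Summarize this conversation in a few sentences:\n{transcript}"
        ).text
        context.append(
            {"role": "user", "parts": [{"text": f"Conversation summary: {summary}"}]}
        )

    # Keep the most recent messages as-is, then add the current user input.
    context.extend({"role": role, "parts": [{"text": msg}]} for role, msg in recent)
    context.append({"role": "user", "parts": [{"text": full_input}]})
    return context
```

Note that the summarization itself costs one extra request, so it only pays off for long conversations; you could also cache the summary in `st.session_state` instead of recomputing it every turn.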
""" st.session_state.chat_history = [] st.session_state.example = False # Add others if needed → And if you want it in the sidebar, add this to app.py: from functions import get_secret, reset_chat if st.sidebar.button("Reset chat"): reset_chat() It will look like this: Image by the author Everything together: import streamlit as st import google.generativeai as genai from functions import get_secret, reset_chat api_key = get_secret("API_KEY") genai.configure(api_key=api_key) model = genai.GenerativeModel("gemini-2.0-flash") temperature = st.sidebar.slider( label="Select the temperature", min_value=0.0, max_value=2.0, value=1.0 ) if st.sidebar.button("Reset chat"): reset_chat() if "chat_history" not in st.session_state: st.session_state.chat_history = [] if not st.session_state.chat_history: st.session_state.chat_history.append(("assistant", "Hi! How can I help you?")) for role, message in st.session_state.chat_history: st.chat_message(role).write(message) user_message = st.chat_input("Type your message...") if user_message: st.chat_message("user").write(user_message) st.session_state.chat_history.append(("user", user_message)) system_prompt = f""" You are a friendly and a programming tutor. Always explain concepts in a simple and clear way, using examples when possible. If the user asks something unrelated to programming, politely bring the conversation back to programming topics. """ full_input = f"{system_prompt}\n\nUser message:\n\"\"\"{user_message}\"\"\"" context = [ *[ {"role": role, "parts": [{"text": msg}]} for role, msg in st.session_state.chat_history ], {"role": "user", "parts": [{"text": full_input}]} ] response = model.generate_content( context, generation_config={ "temperature": temperature, "max_output_tokens": 1000 } ) assistant_reply = response.text st.chat_message("assistant").write(assistant_reply) st.session_state.chat_history.append(("assistant", assistant_reply)) 14. Deploy If your repository is public, you can deploy with Streamlit for free.  MAKE SURE YOU DO NOT HAVE API KEYS ON YOUR PUBLIC REPOSITORY. First, save and push your code to the repository. → Execute in your terminal: git add . git commit -m "tutorial chat streamlit" git push origin main Pushing directly into the main isn’t a best practice, but since it’s just a simple tutorial, we’ll do it for convenience.  Go to your streamlit app that is running locally. Click on “Deploy” at the top right. In Streamlit Community Cloud, click “Deploy now”. Fill out the information. Image by the author 5. Click on “Advanced settings” and write API_KEY="your-api-key", just like you did with the .env file.  6. Click “Deploy”. All done! If you’d like, check out my app here! 15. Monitor API usage on Google Console  The last part of this post shows you how to monitor API usage on the Google Cloud Console. This is important if you deploy your app publicly, so you don’t have any surprises. Access Google Cloud Console. Go to “APIs and services”. Click on “Generative Language API”. Image by the author Requests: how many times your API was called. In our case, the API is called each time we run model.generate_content(context). Error (%): the percentage of requests that failed. Errors can have the code 4xx which is usually the user’s/requester’s fault — for instance, 400 for bad input, and 429 means you’re hitting the API too frequently. In addition, errors with the code 5xx are usually the system’s/server’s fault and are less common. Google typically retries internally or recommends retrying after a few seconds — e.g. 
Continuing on the "Metrics" page, you'll find graphs and, at the bottom, the methods called by the API. Under "Quotas & System Limits", you can monitor your API usage against the free tier limits.

*Image by the author*

Click "Show usage chart" to compare usage day by day.

*Image by the author*

I hope you enjoyed this tutorial. You can find all the code for this project on my GitHub. I'd love to hear your thoughts! Let me know in the comments what you think.

Follow me on: LinkedIn | GitHub | YouTube