
A Step-By-Step Guide To Powering Your Application With LLMs
You might be wondering whether GenAI is just hype, or noise you can safely tune out. I thought so too, and figured I could sit this one out until the dust settled. Oh boy, was I wrong. GenAI has real-world applications, and it generates revenue for companies, so we can expect them to keep investing heavily in research. Every time a technology disrupts something, the reaction generally moves through the same phases: denial, anger, and acceptance. The same thing happened when computers were introduced. If you work in software or hardware, chances are you will need to use GenAI at some point.
In this article, I cover how to power your application with Large Language Models (LLMs) and discuss the challenges I faced while setting them up. Let’s get started.
1. Start by defining your use case clearly
Before jumping into LLMs, we should ask ourselves a few questions:
a. What problem will my LLM solve?
b. Can my application do without an LLM?
c. Do I have enough resources and compute power to develop and deploy this application?
Narrow down your use case and document it. In my case, I was working on a data platform as a service. We had tons of information spread across wikis, Slack, team channels, and so on. We wanted a chatbot that could read this information and answer customer questions and requests on our behalf; if a customer was still unhappy, they would be routed to an engineer.
2. Choose your model
You have two options: train your own model from scratch, or use a pre-trained model and build on top of it. The latter works in most cases unless you have a very particular use case, because training a model from scratch requires massive computing power, significant engineering effort, and cost, among other things. The next question is: which pre-trained model should I choose? Select a model based on your use case. A 1B-parameter model has basic knowledge and pattern matching, suitable for narrow tasks such as classifying restaurant reviews. A 10B-parameter model has good knowledge and can follow instructions, for example as a food-ordering chatbot. A 100B+ parameter model has rich world knowledge and can do complex reasoning, so it can serve as a brainstorming partner. There are many models available, such as Llama and ChatGPT.
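As a quick illustration of starting from a pre-trained model, here is a minimal sketch using the Hugging Face transformers library. The model name "gpt2" is just a small example choice, not a recommendation; swap in whatever fits your use case and hardware.
# pip install transformers torch
from transformers import pipeline

# Load a small pre-trained model locally ("gpt2" is only an example choice).
generator = pipeline("text-generation", model="gpt2")

# Generate a continuation to confirm the model works before building on top of it.
output = generator("A 10B-parameter model can follow instructions such as", max_new_tokens=30)
print(output[0]["generated_text"])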
3. Enhance the model as per your data
Once you have a model in place, you can build on top of it. The base LLM is trained on generally available data, but we want it to answer questions about our data, so it needs more context. Let’s assume we want to build a restaurant chatbot that answers customer questions. The model knows nothing specific to your restaurant, so we have to provide that context ourselves. There are several ways to do this; let’s dive into some of them.
Prompt Engineering
Prompt engineering involves augmenting the input prompt with additional context at inference time: you provide the context in the prompt itself. This is the easiest approach and requires no changes to the model, but it comes with disadvantages. You cannot fit a large amount of context into the prompt, because the context window is limited, and you cannot expect users to always supply the full context themselves, which may be extensive. It is a quick and easy solution, but a limited one. Here is a sample few-shot prompt:
Classify this review: I love the movie.
Sentiment: Positive

Classify this review: I hated the movie.
Sentiment: Negative

Classify this review: The ending was exciting.
Sentiment:
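To make this concrete, here is a minimal sketch that sends the same few-shot prompt through langchain’s ChatOpenAI wrapper (the same older langchain API used in the RAG example later in the article); it assumes your OPENAI_API_KEY is already set.
from langchain.chat_models import ChatOpenAI

# Few-shot prompt: the labeled examples are the context we provide at inference time.
few_shot_prompt = """Classify this review: I love the movie.
Sentiment: Positive

Classify this review: I hated the movie.
Sentiment: Negative

Classify this review: The ending was exciting.
Sentiment:"""

llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo")
print(llm.predict(few_shot_prompt))  # Expected to respond with "Positive"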
Reinforcement Learning from Human Feedback (RLHF)
RLHF Model
RLHF is one of the most widely used methods for adapting an LLM to an application. You collect human feedback on the model’s outputs and use it to train a reward model that the LLM then learns from. The flow is as follows: the model takes an action from the action space and observes how the state of the environment changes as a result. The reward model scores the output and produces a reward. The model updates its weights to maximize that reward and learns iteratively. In the LLM setting, the action is the next token the model generates, and the action space is the vocabulary of all possible tokens. The environment is the text context, and the state is the current text in the context window.
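To make the loop concrete, here is a toy, self-contained sketch of the idea. The vocabulary, reward function, and update rule are all made up for illustration; a real RLHF setup would use an actual LLM, a reward model trained on human preferences, and a PPO-style optimizer (for example via the trl library).
import random

# Toy illustration of the RLHF loop (all names here are made up, not a real library).
VOCAB = ["good", "great", "okay", "bad", "terrible"]  # the "action space"

class ToyPolicy:
    """Stand-in for the LLM being tuned: picks the next word and updates its weights."""
    def __init__(self):
        self.weights = {word: 1.0 for word in VOCAB}

    def next_token(self, state):
        # Action: sample the next word, favoring words with higher weight.
        return random.choices(VOCAB, weights=[self.weights[w] for w in VOCAB])[0]

    def update(self, token, reward):
        # Nudge the policy toward rewarded actions (kept positive for sampling).
        self.weights[token] = max(0.1, self.weights[token] + reward)

def toy_reward_model(text):
    # Stand-in for a reward model trained on human feedback: prefers positive words.
    return 1.0 if ("good" in text or "great" in text) else -0.5

policy = ToyPolicy()
for _ in range(200):
    state = "The movie was"                # current text in the context window
    action = policy.next_token(state)      # action drawn from the vocabulary
    reward = toy_reward_model(state + " " + action)
    policy.update(action, reward)          # learn iteratively to maximize reward

print(policy.weights)  # "good" and "great" should end up with the largest weights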
Retrieval-Augmented Generation (RAG)
So much for theory; let’s look at a real-life example. Say you want your chatbot to answer questions about your wiki documents. You choose a pre-trained model such as ChatGPT, and your wikis become your context data. A practical way to supply that context is Retrieval-Augmented Generation (RAG): at query time, the most relevant chunks of your documents are retrieved and passed to the model alongside the question. You can leverage the langchain library to perform RAG. Here is sample code in Python:
from langchain.document_loaders import WikipediaLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
import os
# Set your OpenAI API key
os.environ["OPENAI_API_KEY"] = "your-openai-key-here"
# Step 1: Load Wikipedia documents
query = "Alan Turing"
wiki_loader = WikipediaLoader(query=query, load_max_docs=3)
wiki_docs = wiki_loader.load()
# Step 2: Split the text into manageable chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
split_docs = splitter.split_documents(wiki_docs)
# Step 3: Embed the chunks into vectors
embeddings = OpenAIEmbeddings()
vector_store = FAISS.from_documents(split_docs, embeddings)
# Step 4: Create a retriever
retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 3})
# Step 5: Create a RetrievalQA chain
llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo")
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # You can also try "map_reduce" or "refine"
    retriever=retriever,
    return_source_documents=True,
)
# Step 6: Ask a question
question = "What did Alan Turing contribute to computer science?"
response = qa_chain(question)
# Print the answer
print("Answer:", response["result"])
print("\n--- Sources ---")
for doc in response["source_documents"]:
    print(doc.metadata)
4. Evaluate your model
Now you have added RAG to your model. How do you check whether it is behaving correctly? This is not like regular code, where a given input produces a fixed output you can test against. Because the interaction is language-based, there can be multiple correct answers. What you can often tell for sure, though, is when an answer is incorrect. There are several metrics you can test your model against.
Evaluate manually
You can continually evaluate your model manually. For instance, we integrated a Slack chatbot enhanced with RAG over our wikis and Jira. After adding the chatbot to the Slack channel, we initially shadowed its responses so clients could not see them, and we evaluated the answers by hand. Once we gained confidence, we made the chatbot visible to clients. Manual testing is quick but subjective, and it does not give you much confidence on its own, so the next step is to test against a benchmark such as ROUGE.
Evaluate with the ROUGE score
ROUGE metrics were designed for evaluating text summarization. They compare generated text against reference text and report recall, precision, and F1 scores. ROUGE comes in several variants, and a poor completion can still score well on one of them, which is why we look at multiple ROUGE metrics together. For context: a unigram is a single word, a bigram is two consecutive words, and an n-gram is n consecutive words.
ROUGE-1 Recall = unigram matches / unigrams in reference
ROUGE-1 Precision = unigram matches / unigrams in generated output
ROUGE-1 F1 = 2 * (Recall * Precision) / (Recall + Precision)
ROUGE-2 Recall = bigram matches / bigrams in reference
ROUGE-2 Precision = bigram matches / bigrams in generated output
ROUGE-2 F1 = 2 * (Recall * Precision) / (Recall + Precision)
ROUGE-L Recall = longest common subsequence length / unigrams in reference
ROUGE-L Precision = longest common subsequence length / unigrams in generated output
ROUGE-L F1 = 2 * (Recall * Precision) / (Recall + Precision)
For example,
Reference: “It is cold outside.”
Generated output: “It is very cold outside.”
ROUGE-1 Recall = 4/4 = 1.0
ROUGE-1 Precision = 4/5 = 0.8
ROUGE-1 F1 = 2 * 0.8/1.8 = 0.89
ROUGE-2 Recall = 2/3 = 0.67
ROUGE-2 Precision = 2/4 = 0.5
ROUGE-2 F1 = 2 * 0.335/1.17 = 0.57
ROUGE-L Recall = 4/4 = 1.0 (the longest common subsequence is “It is … cold outside”, length 4)
ROUGE-L Precision = 4/5 = 0.8
ROUGE-L F1 = 2 * 0.8/1.8 = 0.89
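In practice you would not compute these by hand. Here is a minimal sketch using the rouge-score package (pip install rouge-score), which reproduces the example above.
# pip install rouge-score
from rouge_score import rouge_scorer

reference = "It is cold outside."
generated = "It is very cold outside."

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, generated)  # signature: score(target, prediction)

for name, score in scores.items():
    print(f"{name}: precision={score.precision:.2f}, recall={score.recall:.2f}, f1={score.fmeasure:.2f}")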
Reduce the hassle with external benchmarks
We used the ROUGE score above to show how model evaluation works, and other metrics exist, such as the BLEU score. In practice, though, it is rarely feasible to build your own evaluation dataset, so you can lean on external benchmarks instead. The most commonly used are the GLUE and SuperGLUE benchmarks.
5. Optimize and deploy your model
This step might not be strictly necessary, but reducing compute costs and getting faster responses is always good. Once your model is ready, you can optimize it to improve performance and reduce memory requirements. The concepts below require additional engineering effort, knowledge, time, and budget, but they will get you acquainted with the main techniques.
Quantization of the weights
Models have parameters: internal variables learned from data during training whose values determine how the model makes predictions. A weight stored in 32-bit precision takes 4 bytes, and once you add the optimizer states, gradients, and activations needed for training, each parameter typically requires around 24 bytes of memory, so a 1B-parameter model needs roughly 24 GB to train. Quantization converts model weights from higher-precision to lower-precision number formats for more efficient storage, which significantly reduces the bytes needed to store a single weight: FP32 uses 4 bytes per value, FP16 and BF16 use 2 bytes, and INT8 uses 1 byte.
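As a small sketch of what lower precision buys you, you can compare the bytes per value of different PyTorch dtypes; the half-precision model-loading line is included only as a commented example of the transformers API.
import torch

# Bytes required to store a single weight at different precisions.
for dtype in (torch.float32, torch.float16, torch.bfloat16, torch.int8):
    bytes_per_value = torch.tensor([0], dtype=dtype).element_size()
    print(f"{dtype}: {bytes_per_value} byte(s) per value")

# Loading a model directly in half precision (transformers API, shown as an example):
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype=torch.float16)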
Pruning
Pruning removes weights that are less important and have little impact on the output, such as weights equal to or close to zero. Some pruning approaches are:
a. Full model retraining
b. PEFT methods such as LoRA
c. Post-training pruning
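For intuition, here is a minimal sketch of magnitude pruning on a single linear layer using PyTorch’s built-in pruning utilities; pruning a full LLM works on the same principle but at a much larger scale and usually with retraining.
import torch
import torch.nn.utils.prune as prune

layer = torch.nn.Linear(8, 4)

# Zero out the 50% of weights with the smallest absolute value (magnitude pruning).
prune.l1_unstructured(layer, name="weight", amount=0.5)
prune.remove(layer, "weight")  # make the pruned weights permanent

sparsity = (layer.weight == 0).float().mean().item()
print(f"Fraction of weights pruned to zero: {sparsity:.2f}")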
Conclusion
To conclude, you can choose a pre-trained model, such as ChatGPT or FLAN-T5, and build on top of it; building your own model from scratch requires expertise, resources, time, and budget. You can fine-tune it for your use case if needed, and tailor it to your application with techniques like RAG. You can evaluate your model against benchmarks to check that it behaves correctly, and then optimize and deploy it.