Ways to Deal With Hallucinations in LLM

December 16, 2024
Last Updated on December 17, 2024 by Editorial Team
Author(s): Igor Novikov
Originally published on Towards AI.

Image by the author

One of the major challenges in using LLMs in business is that LLMs hallucinate. How can you entrust your clients to a chatbot that can go mad and tell them something inappropriate at any moment? Or how can you trust your corporate AI assistant if it makes things up randomly?

That's a problem, especially given that an LLM can't be fired or held accountable.

That's the thing with AI systems: they don't benefit from lying to you in any way, but at the same time, despite sounding intelligent, they are not a person, so they can't be blamed either.

Some tout RAG as a cure-all, but in reality it only solves one particular cause and doesn't help with the others. Only a combination of several methods can help.

Not all hope is lost, though. There are ways to work with it, so let's look at them.

So as not to get too philosophical about what a hallucination is, let's define the most important cases:

- The model understands the question but gives an incorrect answer
- The model didn't understand the question and thus gave an incorrect answer
- There is no right or wrong answer, so disagreeing with the model doesn't make it incorrect. If you ask "Apple vs Android", whatever it answers is technically just an opinion

Let's start with the second case. These are the reasons why a model can misunderstand a question:

- The question is crap (ambiguous, not clear, etc.), and therefore the answer is crap.
  Not the model's fault; ask better questions
- The model does not have context
- Language: the model does not understand the language you are using
- Bad luck or, in other words, the stochastic distribution led the reasoning in a weird direction

Now let's look at the first case: why would a model lie, that is, give factually and verifiably incorrect information, if it understands the question?

- It didn't follow all the logical steps to arrive at a conclusion
- It didn't have enough context
- The information (context) it was given is incorrect as well
- It has the right information but got confused
- It was trained to give incorrect answers (for political and similar reasons)
- Bad luck: the stochastic distribution led the reasoning in a weird direction
- It was configured so that it is allowed to fantasize (which can sometimes be desirable)
- Overfitting and underfitting: the model was trained in a specific field and tries to apply its logic to a different field, leading to incorrect deduction or induction when answering
- The model is overwhelmed with data and starts to lose context

I'm not going to discuss things that are not a model problem, like bad questions or questions with no right answers. Let's concentrate on what we can try to solve, one by one.

The model does not have enough context or information, or the information that was provided to it is not correct or full

This is where RAG comes into play. RAG, when correctly implemented, should provide the model with the necessary context so it can answer. Here is the article on how to do RAG properly.

It is important to do it right, with all the required metadata about the information structure and attributes. It is desirable to use something like GraphRAG, and reranking in the retrieval phase, so that the model is given only relevant context; otherwise, the model can get confused.

It is also extremely important to keep the data you provide to the model up to date and continuously update it, taking versioning into account.
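One way to take versioning into account at retrieval time is to drop any retrieved chunk that comes from an outdated version of its document. The sketch below is only an illustration of that idea: the `Chunk` class and its fields (`doc_id`, `version`, `text`, `score`) are hypothetical metadata names, not part of any particular RAG library, and the sample data is made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str    # hypothetical: stable identifier of the source document
    version: int   # hypothetical: increases every time the document is updated
    text: str
    score: float   # retrieval relevance score, higher is better


def keep_latest_versions(chunks: list[Chunk]) -> list[Chunk]:
    """Drop chunks that come from an outdated version of their document."""
    latest: dict[str, int] = {}
    for c in chunks:
        latest[c.doc_id] = max(latest.get(c.doc_id, c.version), c.version)
    fresh = [c for c in chunks if c.version == latest[c.doc_id]]
    # Most relevant first, so the best context sits at the top of the prompt.
    return sorted(fresh, key=lambda c: c.score, reverse=True)


# Made-up retrieval results: two versions of the same pricing document.
retrieved = [
    Chunk("pricing", 1, "The Pro plan costs $10.", 0.92),
    Chunk("pricing", 2, "The Pro plan costs $12.", 0.90),
    Chunk("refunds", 1, "Refunds are issued within 14 days.", 0.75),
]
context = keep_latest_versions(retrieved)
```

Note that this filter only compares versions among the chunks that were actually retrieved. It is a safety net, not a substitute for removing stale versions from the index itself.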
If you have data conflicts, which is not uncommon, the model will start generating conflicting answers as well. There are methods, such as the Maximum Marginal Relevance (MMR) algorithm, which considers both the relevance and the novelty of information when filtering and reordering. However, this is not a panacea, and it is best to address this issue at the data storage stage.

Language

Not all models understand all languages equally well. It is generally preferable to use English for prompts, as it works best for most models. If you have to use a specific language, you may have to use a model built for it, like Qwen for Chinese.

A model does not follow all the logical steps to arrive at a conclusion

You can force the model to follow a particular thinking process with techniques like Self-RAG, Chain of Thought, or SelfCheckGPT. Here is an article about these techniques.

The general idea is to ask the model to think in steps and explain/validate its conclusions and intermediate steps, so it can catch its errors.

Alternatively, you can use the agents model, where several LLM agents communicate with each other and verify each other's outputs at each step.

A model got confused with the information it had and bad luck

These two are actually caused by the same thing, and this is a tricky one. Models work by stochastically predicting the next token in a sentence. The process is somewhat random, so it is possible that the model will pick a less probable route and go off course. This is built into the model and the way it works.

There are several ways to handle this:

- MultiQuery: run several queries for the same question and pick the best answer using a relevance score, such as one from a cross-encoder. If you get three very similar answers and one very different one, it is likely that the different one was a random hallucination.
  It adds some overhead, so you pay a price, but it is a very good way to make sure you don't randomly get a bad answer
- Set the model temperature to a lower value to discourage it from going in less probable directions (i.e., fantasizing)

There is one more cause, which is harder to fix. The model keeps semantically similar ideas close together in the vector space. When it is asked about facts that sit near other facts that are close but not actually related, the model will take the path of least resistance. The model has associative memory, so to speak, so it thinks in associations, and that mode of thinking is not suitable for tasks like playing chess or doing math. The model has a fast-thinking brain, per Kahneman's description, but lacks a slow one.

For example, you ask a model what 3 + 7 is and it answers 37. Why?

It makes sense if you look at 3 and 7 in vector space: the closest vector to them is 37. Here the mistake is obvious, but it may be much more subtle.

Example:

Image by the author

The answer is incorrect.

- Afonso II was the third king of Portugal. Not Alfonso. There was no Alfonso II as the king of Portugal.
- The mother of Afonso II was Dulce of Aragon, not Urraca of Castile.

From the LLM's perspective, "Alfonso" is basically the same as "Afonso" and "mother" is a direct match. Therefore, if there is no "mother" close to "Afonso", the LLM will choose the Alfonso/mother combination.

Here is an article explaining this in detail, along with potential ways to fix it. Also, in general, fine-tuning the model on data from your domain will make this less likely to happen, as the model will be less confused by similar facts in edge cases.

The model was configured so it is allowed to fantasize

This can be done either through a master prompt or by setting the model temperature too high.
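To see why temperature has this effect, recall that sampling temperature divides the model's logits before the softmax that turns them into next-token probabilities. The sketch below uses made-up logits for three candidate tokens (illustrative numbers, not taken from a real model) and prints how the distribution changes with temperature.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities after scaling them by 1 / temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three candidate next tokens.
logits = [4.0, 2.0, 1.0]

for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: " + ", ".join(f"{p:.3f}" for p in probs))
```

At a low temperature almost all of the probability mass sits on the most likely token. At a high temperature the less likely tokens are sampled noticeably more often, which is where the "fantasizing" comes from.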
So basically you need to:

- Instruct the model not to give an answer if it is not sure or doesn't have the information
- Ensure nothing in the prompt instructs the model to make up facts and, in general, make the instructions very clear
- Set the temperature lower

Overfitting and underfitting

If you use a model that was trained in the healthcare space to solve programming tasks, it will hallucinate or, in other words, try to fit square pegs into round holes, because that is all it knows how to do. That's kind of obvious. The same happens if you use a generic model, trained on generic data from the internet, to solve industry-specific tasks.

The solution is to use a proper model for your industry and fine-tune/train it in that area. That will improve correctness dramatically in certain cases. I'm not saying you always have to do that, but you might have to.

Another case of this is using a model that is too small (in terms of parameters) for your tasks. Yes, certain tasks may not require a large model, but some certainly do, and you should not use a model smaller than appropriate. Using a model that is too big will cost you, but at least it will work correctly.

The model is overwhelmed with data and starts to lose context

You may think that the more data you have the better, but that is not the case at all!

The model's context window and attention span are limited. Even recent models with context windows of millions of tokens do not handle that much context well. They start to forget things, ignore things in the middle, and so on.

The solution here is to use RAG with proper context size management. You have to pre-select only the relevant data, rerank it, and feed it to the LLM.

Here is my article that overviews some of the techniques to do that.

Also, some models do not handle long context at all, and at a certain point the quality of answers starts to degrade with increasing context size, see below:

Here is a research paper on that.

Other general techniques

Human in the loop

You can always have someone in the loop to fact-check LLM outputs.
For example, if you use an LLM for data annotation (which is a great idea), you will need to use it in conjunction with real humans to validate the results. Or use your system in co-pilot mode, where humans make the final decision. This doesn't scale well, though.

Oracles

Alternatively, you can use an automated oracle to fact-check the system's results, if that option is available.

External tools

Certain things, like calculations and math, should be done outside of the LLM, using tools that are provided to the LLM. For example, you can use the LLM to generate a query for an SQL database or Elasticsearch, execute it, and then use the results to generate the final answer.

What to read next:

- RAG architecture guide
- Advanced RAG guide

Peace!

Published via Towards AI