WWW.MARKTECHPOST.COM
RAG-Check: A Novel AI Framework for Hallucination Detection in Multi-Modal Retrieval-Augmented Generation Systems
Large Language Models (LLMs) have revolutionized generative AI, showing remarkable capabilities in producing human-like responses. However, these models face a critical challenge known as hallucination: the tendency to generate incorrect or irrelevant information. This issue poses significant risks in high-stakes applications such as medical evaluation, insurance claim processing, and autonomous decision-making, where accuracy is paramount. The hallucination problem extends beyond text-based models to vision-language models (VLMs) that process images alongside text queries. Despite the development of robust VLMs such as LLaVA, InstructBLIP, and VILA, these systems still struggle to generate accurate responses grounded in image inputs and user queries.

Existing research has introduced several methods to address hallucination in language models. For text-based systems, FactScore improved accuracy by breaking long statements into atomic units for better verification. Lookback Lens developed an attention-score analysis approach to detect context hallucination, while MARS implemented a weighted system focusing on crucial statement components. For RAG systems specifically, RAGAS and LlamaIndex emerged as evaluation tools: RAGAS focuses on response accuracy and relevance using human evaluators, while LlamaIndex employs GPT-4 for faithfulness assessment. However, no existing work provides hallucination scores specifically for multi-modal RAG systems, where the context includes multiple pieces of multi-modal data.

Researchers from the University of Maryland, College Park, and NEC Laboratories America, Princeton, have proposed RAG-check, a comprehensive method to evaluate multi-modal RAG systems. It consists of three key components designed to assess both relevance and accuracy. The first component is a neural network that evaluates the relevancy of each retrieved piece of data to the user query.
The second component implements an algorithm that segments the RAG output and categorizes each span as scorable (objective) or non-scorable (subjective). The third component uses another neural network to evaluate the correctness of the objective spans against the raw context, which can include both text and images converted to text through VLMs.

The RAG-check architecture uses two primary evaluation metrics, the Relevancy Score (RS) and the Correctness Score (CS), to assess different aspects of RAG system performance. To evaluate selection mechanisms, the system analyzes the relevancy scores of the top-5 retrieved images across a test set of 1,000 questions, providing insight into the effectiveness of different retrieval methods. For context generation, the architecture allows flexible integration of various model combinations: either separate VLMs (such as LLaVA or GPT-4) and LLMs (such as LLaMA or GPT-3.5), or unified MLLMs such as GPT-4. This flexibility enables a comprehensive evaluation of different model architectures and their impact on response-generation quality.

The evaluation results show significant performance variation across RAG system configurations. When CLIP models are used as vision encoders with cosine similarity for image selection, average relevancy scores range from 30% to 41%. Implementing the RS model for query-image pair evaluation dramatically improves relevancy scores to between 71% and 89.5%, though at the cost of a 35-fold increase in computation on an A100 GPU. GPT-4o emerges as the best configuration for context generation and error rates, outperforming other setups by 20%.
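The CLIP-based selection baseline described above works by embedding the query and each candidate image and ranking candidates by cosine similarity. A minimal sketch of that ranking step, using random vectors in place of real CLIP embeddings (`top_k_images` is an illustrative helper name, not from the paper):

```python
import numpy as np

def top_k_images(query_emb, image_embs, k=5):
    """Rank candidate images by cosine similarity to the query embedding.

    query_emb: (d,) text-query embedding (e.g. from a CLIP text encoder)
    image_embs: (n, d) candidate-image embeddings (e.g. from a CLIP image encoder)
    Returns the indices of the k most similar images, best first.
    """
    q = query_emb / np.linalg.norm(query_emb)
    imgs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = imgs @ q                     # cosine similarity per candidate image
    return np.argsort(-sims)[:k]       # highest-similarity indices first

# Toy example: random vectors stand in for real CLIP embeddings.
rng = np.random.default_rng(0)
query = rng.normal(size=64)
candidates = rng.normal(size=(100, 64))
print(top_k_images(query, candidates, k=5))
```

In the paper's evaluation, this cheap similarity ranking is the baseline that the learned RS model then improves on, at a roughly 35x compute cost.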
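The second component's span categorization can likewise be illustrated. The paper uses a neural model for this step; purely to show the interface, here is a keyword-heuristic toy stand-in (the marker list and function name are invented for illustration, not from the paper):

```python
import re

# Illustrative markers of subjective language; RAG-check itself uses a
# trained classifier, not a keyword list.
SUBJECTIVE_MARKERS = {"i think", "probably", "seems", "lovely", "beautiful"}

def categorize_spans(response: str):
    """Split a RAG response into sentence spans and tag each as 'objective'
    (scorable against the retrieved context) or 'subjective' (not scored)."""
    spans = [s.strip() for s in re.split(r"(?<=[.!?])\s+", response) if s.strip()]
    tagged = []
    for span in spans:
        low = span.lower()
        label = "subjective" if any(m in low for m in SUBJECTIVE_MARKERS) else "objective"
        tagged.append((span, label))
    return tagged

example = "The tower is 330 meters tall. It seems like a lovely place to visit."
for span, label in categorize_spans(example):
    print(label, "->", span)
```

Only the spans tagged objective would then be passed to the third component for correctness scoring against the retrieved context.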
The remaining RAG configurations show comparable performance, with accuracy rates between 60% and 68%.

In conclusion, the researchers introduced RAG-check, a novel evaluation framework for multi-modal RAG systems that addresses the critical challenge of hallucination detection across multiple images and text inputs. The framework's three-component architecture, comprising relevancy scoring, span categorization, and correctness assessment, shows significant improvements in performance evaluation. The results reveal that while the RS model substantially enhances relevancy scores, from 41% to 89.5%, it comes with increased computational costs. Among the configurations tested, GPT-4o emerged as the most effective model for context generation, highlighting the potential of unified multi-modal language models in improving RAG system accuracy and reliability.

Check out the Paper. All credit for this research goes to the researchers of this project.

Sajjad Ansari is a final-year undergraduate at IIT Kharagpur. As a tech enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.