Visual Grounding for Advanced RAG Frameworks
Author(s): Felix Pappe
Originally published on Towards AI.
Image created by the author using gpt-image-1
AI chatbots and advanced Retrieval-Augmented Generation (RAG) systems are increasingly adept at providing up-to-date, context-aware answers based on previously retrieved text chunks.
However, despite their seemingly reliable responses, a significant issue remains: users often lack a clear way to verify the source of these answers without returning to the original, often lengthy, documents themselves.
This is particularly cumbersome when the source is a multi-page academic paper, technical manual, or book.
Even when a link is provided, users are left searching through dozens of pages, trying to locate the exact section that the chatbot used for the generation of the final response.
This manual cross-checking is not only time-consuming; it also undermines trust in the LLM’s output. As a result, AI-generated answers may appear correct at first glance but remain opaque, leaving users uncertain about their reliability.
This lack of verifiability can lead to misinformation and misinterpretation.
In this blog post, I would like to offer a solution to this problem with a visual grounding approach using the Docling parsing tool, Qdrant vector store, and LangChain.
This RAG framework doesn’t just retrieve relevant text.
It also highlights the exact location of the extracted text directly on the page of the source document. By connecting answers to their visual origin, it lets users verify the LLM’s output at a glance. The result, built step by step in this blog post, is a transparent, verifiable, and user-friendly RAG framework that builds trust while maintaining accuracy.
Docling
The foundation of the visual grounding approach introduced in this blog post is the Docling document processing pipeline.
Docling is an open-source tool for layout-aware document parsing and grounding, achieving results comparable to paid solutions like Mistral OCR. Moreover, Docling provides an additional key feature for visual grounding that other document-to-markdown solutions don’t have.
This feature is the decomposition of the input document into smaller sub-elements, including headings, text chunks, formulas, and tables, using different models in a sophisticated processing pipeline, which is presented in the following image.

Docling pipeline, from the Docling paper
The output of this processing pipeline is not a markdown file but a DoclingDocument, which consists of all the detected and extracted elements from the input document.
This intermediate DoclingDocument class, enhanced with metadata from the extracted elements, allows the original document to be transformed into various file types and supports the visual grounding discussed in this blog post.

RAG framework
Like all RAG frameworks, this one consists of two phases that can be divided into two scripts.
In the offline indexing phase, input documents are split into chunks and encoded into vector representations.
These vectors are stored in a specialized vector database for later retrieval. In the online retrieval and generation phase, text chunks related to a user’s input are retrieved and passed to the LLM to generate a final response. The following image illustrates the final visual grounding result produced by the two scripts explained in this blog post.
In the first Python script, the Docling paper is uploaded to a Qdrant vector store during the offline indexing phase.
In the second script, relevant passages are retrieved based on a given question and are then highlighted directly on the document.

Image created by the author and designed in Canva
Indexing phase
The indexing script handles data preprocessing and stores the embedded text chunks in a vector database for the second online retrieval phase.
I split the script into several parts and explain in detail what happens in each, providing a deeper understanding of the entire code.
Imports
Let’s start very gently with the import of all required libraries and packages, including docling, langchain, and qdrant.
import os
from pathlib import Path
from uuid import uuid4

from dotenv import load_dotenv
from transformers import AutoTokenizer
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams
from langchain_qdrant import QdrantVectorStore
from langchain_huggingface.embeddings import HuggingFaceEmbeddings
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.chunking import HybridChunker
from langchain_docling.loader import DoclingLoader, ExportType
from docling_core.types.doc import ImageRefMode
Configuration and environment variables
In the next step, the environment variables are loaded from the .env file and read with getenv().
Inside the .env file, your HF_TOKEN for the embedding model and your MISTRAL_API_KEY for the LLM must be included.
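For illustration, such a .env file could look like the following sketch; the values shown are placeholders, not real keys:

# .env (placeholder values only, replace with your own keys)
HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxx
MISTRAL_API_KEY=xxxxxxxxxxxxxxxxxxxxxxxx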
Of course, you can also adjust the code to your needs and choose any other embedding model or LLM.
But if you change the embedding model, also change the DIM variable, which refers to the final vector dimensions of the embedded chunks.

Afterwards, the Docling processing pipeline is defined, enabling all functionalities to generate a complete DoclingDocument.
This pipeline includes the detection and extraction of code blocks, formulas, tables, pages, and page images of the document.
Moreover, the extracted images are scaled by 2.0 for a higher resolution.
load_dotenv()

HF_TOKEN = os.getenv("HF_TOKEN")
MISTRAL_API_KEY = os.getenv("MISTRAL_API_KEY")

SOURCES = ["attention-04.pdf"]
OUTPUT_DIR = Path("output")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

EMBED_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
COLLECTION_NAME = "demo_collection"
DIM = 384

pipeline_options = PdfPipelineOptions(
    do_code_enrichment=True,
    do_formula_enrichment=True,
    do_table_structure=True,
    generate_picture_images=True,
    generate_page_images=True,
    images_scale=2.0,
)
pipeline_options.table_structure_options.do_cell_matching = True
Setting up document converter
In the next step, the DocumentConverter from Docling is configured using the previously defined pipeline options.
Later in the code, this converter will be used to transform an input PDF into a DoclingDocument, which consists of modular components such as headings, text paragraphs, images, and tables.
Furthermore, the embedding model is initialized, which is necessary to embed the extracted sentences from the document into a vector representation.
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
embeddings = HuggingFaceEmbeddings(model_name=EMBED_MODEL)
Initialize Qdrant collection
A local Qdrant vector store is then set up, allowing you to experiment with the script on your device.
The script either creates a new collection if none with the configured name exists or connects to the existing one.
client = QdrantClient(path="langchain_qdrant")

collections = [col.name for col in client.get_collections().collections]
if COLLECTION_NAME not in collections:
    client.create_collection(
        collection_name=COLLECTION_NAME,
        vectors_config=VectorParams(size=DIM, distance=Distance.COSINE),
    )
    print(f"Created new collection '{COLLECTION_NAME}'.")
else:
    print(f"Using existing collection '{COLLECTION_NAME}'.")

vector_store = QdrantVectorStore(
    client=client,
    collection_name=COLLECTION_NAME,
    embedding=embeddings,
)
Convert PDF to DoclingDocument and save outputs
Subsequently, all PDF files listed in SOURCES are converted into DoclingDocuments and stored in dl_doc.
These documents are then transformed into JSON and markdown files, which are stored on your local device.
The markdown file is used to assess the quality of the file transformation process, while the JSON file is necessary for the subsequent visual grounding process in the second script.
for source in SOURCES:
    dl_doc = converter.convert(source=source).document

    # JSON export
    out_json = OUTPUT_DIR / f"{dl_doc.origin.binary_hash}.json"
    dl_doc.save_as_json(out_json)

    # Markdown export with embedded images
    out_md = OUTPUT_DIR / f"{dl_doc.origin.binary_hash}.md"
    dl_doc.save_as_markdown(out_md, image_mode=ImageRefMode.EMBEDDED)
Chunking the document
Finally, we come to the main part of the script: chunking the document into smaller text passages.
This is achieved using the HybridChunker provided by docling.
The best feature of this HybridChunker is that it tries to keep related passages together based on the markdown formatting and merges passages with each other if they are too small.
A small max_tokens size has been selected to implement a small-to-big retrieval approach.
This means that initially, a small chunk of text that closely matches the user's query is retrieved.
Following this, a larger context chunk that surrounds the retrieved section is additionally retrieved and provided to the language model for generating the final answer.
In this case, the larger context chunk refers to the paragraph containing the smaller chunk.
chunker = HybridChunker(
    tokenizer=EMBED_MODEL,
    max_tokens=64,
    merge_peers=True,
)
loader = DoclingLoader(
    file_path=SOURCES,
    converter=converter,
    export_type=ExportType.DOC_CHUNKS,
    chunker=chunker,
)
docs = loader.load()
Embedding the document
In the final step, the generated chunks are stored in the vector database with a unique identifier, from which they can be retrieved during the online phase.
ids = [str(uuid4()) for _ in docs]
vector_store.add_documents(documents=docs, ids=ids)
print("Documents have been embedded into the vector store.")
Retrieval and generation part
Now, the online retrieval and generation part leverages the previously embedded text chunks to generate an answer for the user’s input based on the knowledge in the vector store.
Imports
As in the previous indexing script, all the necessary packages are included first.
import os
import re
from pathlib import Path

from dotenv import load_dotenv
from PIL import ImageDraw, Image
from pydantic import BaseModel, Field
from qdrant_client import QdrantClient
from langchain_core.output_parsers import PydanticOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_huggingface.embeddings import HuggingFaceEmbeddings
from langchain_mistralai import ChatMistralAI
from langchain_qdrant import QdrantVectorStore
from docling.chunking import DocMeta
from docling.datamodel.document import DoclingDocument
Configuration and environment variables
Then, the necessary environment variables are loaded and configuration variables are set.
load_dotenv()

MISTRAL_API_KEY = os.environ["MISTRAL_API_KEY"]

OUTPUT_DIR = Path("output")
EMBED_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
COLLECTION_NAME = "demo_collection"
QUESTION = "How is attention computed in the transformer architecture?"
TOP_K = 3

OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
Build JSON lookup table
In the next step, all JSON files in the output directory are detected.
For each file, its base name is extracted via the stem attribute of its path.
It is then verified that the name consists only of digits, representing the binary hash of the document, before the numeric name is added as a key and the file’s path as the value to the doc_store dictionary.
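The corresponding code is not shown as its own snippet in this post; a minimal sketch, assuming the JSON exports are named after the numeric binary hash exactly as in the indexing script, could look like this:

# Hypothetical reconstruction of the doc_store lookup table described above:
# map each document's binary hash (the numeric file name) to its JSON export.
doc_store: dict[int, Path] = {}
for json_path in OUTPUT_DIR.glob("*.json"):
    if json_path.stem.isdigit():
        doc_store[int(json_path.stem)] = json_path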
Initializing the embedding model, vector store, and LLM
After that, the embedding model and the LLM are initialized.
It is important to use the same embedding model and Qdrant collection name as in the previous indexing script.
Optionally, you may want to define these names in a separate configuration script and import this script into both the indexing and retrieval scripts.
For the LLM, I selected Mistral, but you can also choose GPT-4, Gemini, Llama, or any other model you prefer.
embeddings = HuggingFaceEmbeddings(model_name=EMBED_MODEL)
client = QdrantClient(path="langchain_qdrant")

try:
    client.get_collection(collection_name=COLLECTION_NAME)
except Exception:
    print(
        f"Collection {COLLECTION_NAME} not found; create it before indexing."
    )

vector_store = QdrantVectorStore(
    client=client,
    collection_name=COLLECTION_NAME,
    embedding=embeddings,
)

llm = ChatMistralAI(
    api_key=MISTRAL_API_KEY,
    model="mistral-large-latest",
    temperature=0,
    max_retries=2,
)
Preparing the chain
Thereafter, all components of the RAG chain are set up, beginning with the Answer Pydantic schema, which consists of two fields.
The first field, answerable, is a boolean value that indicates whether the question can be answered using the retrieved knowledge.
The second field, answer, provides the actual response.
This structure is employed to address the hallucination issue that many LLMs exhibit, helping to prevent the dissemination of incorrect information to the user.

Following that, the PydanticOutputParser and its get_format_instructions() method generate the format instructions for the LLM.
Moreover, the parser is employed again as the final Runnable of the LangChain execution chain, with its elements connected by the | operator.
The prompt created for the PromptTemplate is basic and can be further enhanced by the introduction of few-shot examples or other advanced prompt engineering techniques.
class Answer(BaseModel):
    answerable: bool = Field(
        ..., description="Whether the question can be answered"
    )
    answer: str = Field(
        ..., description="The answer based on the provided knowledge"
    )


parser = PydanticOutputParser(pydantic_object=Answer)
format_instructions = parser.get_format_instructions()

prompt = PromptTemplate(
    input_variables=["knowledge", "topic"],
    partial_variables={"format_instructions": format_instructions},
    template=(
        "You are given the following grounding material:\n\n"
        "{knowledge}\n\n"
        "Question:\n"
        "{topic}\n\n"
        "Please provide a concise answer in full sentences "
        "based solely on the information above.\n"
        "If the answer is not contained within the provided material, "
        "reply with:\n"
        "“There is no answer to this question in the provided material.”\n\n"
        "{format_instructions}"
    ),
)

rag_chain = prompt | llm | parser
Run similarity search
Once the chain has been set up, the text chunks semantically related to the user’s input query are retrieved.
results = vector_store.similarity_search_with_score(
    k=TOP_K, query=QUESTION
)
Load and assemble grounding texts
Based on the retrieved text chunks from the vector database, the entire paragraph containing each chunk is loaded.
This approach is inspired by the small-to-big retrieval technique introduced earlier.
Image created by the author, designed in Canva, with graphics generated using gpt-image-1
The relevant paragraph is pulled from the JSON file associated with the document from which the chunk originates.
The correct JSON file is identified by the document’s hash value, which appears both in the text chunk’s metadata and in the JSON filename.
Once the correct JSON file is identified, it’s loaded via the load_from_json method of the DoclingDocument.
The original text item referenced in the current result chunk’s metadata is then extracted from the JSON file using a regular expression.
If the referenced text item is found, the full text passage is retrieved to generate the final result.
This example focuses exclusively on text grounding.
However, Docling also provides references to previously identified images and tables.
grounding_texts: list[str] = []

for res, score in results:
    # Identify the source document of this chunk via its binary hash.
    meta = DocMeta.model_validate(res.metadata["dl_meta"])
    h = meta.origin.binary_hash
    json_file = doc_store.get(h)
    if not json_file:
        continue

    dl_doc = DoclingDocument.load_from_json(json_file)

    # Resolve each referenced text item ("#/texts/<idx>") to its full passage.
    for item in meta.doc_items:
        if not item.prov:
            continue
        match = re.search(r"^#/texts/(\d+)$", item.self_ref)
        if not match:
            continue
        idx = int(match.group(1))
        grounding_texts.append(dl_doc.texts[idx].text.strip())

knowledge = "\n\n".join(grounding_texts)
print("Assembled grounding material:\n", knowledge)
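As a hedged illustration of that point, the loop above could be extended to resolve table references as well. The helper below is hypothetical, and the export_to_markdown call with its doc argument is an assumption about the docling_core API that may differ between versions:

def append_table_grounding(item, dl_doc, grounding_texts):
    # Hypothetical extension: resolve a "#/tables/<idx>" self_ref to markdown
    # text. export_to_markdown(doc=...) is assumed from the docling_core API.
    match = re.search(r"^#/tables/(\d+)$", item.self_ref)
    if match:
        idx = int(match.group(1))
        grounding_texts.append(dl_doc.tables[idx].export_to_markdown(doc=dl_doc))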
Invoke the LLM
Finally, the LLM can be invoked using the previously defined chain.
Inside the invoke method, the retrieved knowledge passages from the original JSON file and the user’s input question are passed in.
The returned value conforms to the defined Answer Pydantic schema, enabling a structured evaluation of the results.
answer_obj = rag_chain.invoke(
    {"knowledge": knowledge, "topic": QUESTION}
)
print("Answerable?", answer_obj.answerable)
print("Answer:    ", answer_obj.answer)
Visual grounding
The next part of the script visually grounds the used chunks in the corresponding documents, but only if the question can be answered based on the provided content, i.e., if the LLM returned answerable as true.
This grounding process is approached similarly to how original text passages are extracted from documents.
However, this time, the focus is on the bounding boxes surrounding these text passages instead of the raw text itself.
These bounding boxes are another outcome of the document analysis and processing performed by the Docling pipeline in the indexing script.
Now, these boxes can be utilized to anchor the generated answers to the original pages of the document.
The coordinates of these bounding boxes also come from the metadata of each chunk.
Moreover, the metadata of a chunk includes the page number, which is required to extract the screenshot of the correct page from the corresponding file.
This screenshot of the page is the foundation for the visual grounding process, as bounding boxes are added on top of it.
In the end, these bounding boxes are drawn on top of the rendered image of the correct page, allowing the user to precisely identify from which part of the document the information was retrieved.
The final annotated images are then stored in the same output directory as the JSON and markdown exports and are ready for visual inspection.
if answer_obj.answerable:
    for i, (res, score) in enumerate(results, start=1):
        meta = DocMeta.model_validate(res.metadata["dl_meta"])
        h = meta.origin.binary_hash
        json_file = doc_store.get(h)
        if not json_file:
            continue

        dl_doc = DoclingDocument.load_from_json(json_file)
        image_by_page: dict[int, "Image.Image"] = {}

        for item in meta.doc_items:
            if not item.prov:
                continue
            prov = item.prov[0]
            p = prov.page_no

            # Render each referenced page only once and reuse it.
            if p not in image_by_page:
                image_by_page[p] = dl_doc.pages[p].image.pil_image.copy()
            img = image_by_page[p]

            # Convert the bounding box to a top-left origin and normalize it
            # to the page size before scaling to pixel coordinates.
            bbox = prov.bbox.to_top_left_origin(
                page_height=dl_doc.pages[p].size.height
            ).normalized(dl_doc.pages[p].size)

            left = round(bbox.l * img.width) - 2
            top = round(bbox.t * img.height) - 2
            right = round(bbox.r * img.width) + 2
            bottom = round(bbox.b * img.height) + 2

            draw = ImageDraw.Draw(img)
            draw.rectangle([left, top, right, bottom], outline="blue", width=2)

        # Save one annotated image per referenced page.
        for p, img in image_by_page.items():
            out_png = OUTPUT_DIR / f"source_{i}_page_{p}.png"
            img.save(out_png)
            print(f"Saved annotated page {p} → {out_png}")
Limitations
However, no solution is without flaws, and this one is no exception.
The first point to consider is the heavy reliance on Docling.
This tool is deeply integrated into the RAG framework, as it is used in almost every part of the system, including document parsing, text chunking, and grounding the final answer, which depends on the DoclingDocument class.
Another point is the processing speed and resource requirements.
While the advantage of Docling is that it can be deployed entirely on your local device, its processing speed and the maximum manageable file size depend significantly on your hardware.
If your hardware is not powerful enough, you may only be able to parse one page of a document at a time.
Additionally, the example provided in this blog post focuses solely on text.
It ignores images and tables, which also contain rich information, leaving room for further enhancements.
Conclusion
Despite these limitations, Docling is a neat way to quickly enhance advanced RAG frameworks with additional visual grounding capabilities.
By connecting retrieved answers directly to their visual origin, this approach not only boosts transparency but also helps users build trust in the system’s output.
What is your opinion about Docling and this introduced RAG framework?
Have you already experimented with other grounding approaches for more explainable AI?

Sources
Docling Paper: https://arxiv.org/abs/2501.17887
Docling Documentation: https://docling-project.github.io/docling/
Qdrant local quickstart: https://qdrant.tech/documentation/quickstart/
LangChain ChatMistralAI: https://python.langchain.com/docs/integrations/chat/mistralai/
LangChain structured outputs: https://python.langchain.com/docs/concepts/structured_outputs/
Published via Towards AI
Source: https://towardsai.net/p/l/visual-grounding-for-advanced-rag-frameworks