Easy Late-Chunking With Chonkie
Author(s): Michael Ryaboy. Originally published on Towards AI.

Image Source: https://github.com/chonkie-ai/

Late Chunking has just been released in Chonkie, a lean chunking library that already boasts over 2,000 stars on GitHub. This is a welcome update for anyone looking to integrate late chunking into their retrieval pipelines, since implementing it from the ground up can be conceptually tricky and prone to mistakes. This article breaks down what Late Chunking is, why it's essential for embedding larger or more intricate documents, and how to build it into your search pipeline using Chonkie and KDB.AI as the vector store.

What is Late Chunking?

When a document spans thousands of words, encoding it into a single embedding is often not optimal. In many scenarios you need to retrieve smaller segments of text, and dense-vector retrieval tends to perform better when those text segments (chunks) are smaller, partly because embedding a whole, massive document over-compresses its semantics into a single vector.

Retrieval-Augmented Generation (RAG) is a prime example that benefits from splitting documents into smaller text chunks, often around 512 tokens each. In RAG, you store these chunks in a vector database and encode them with a text embedding model.

The Lost Context Problem

The typical RAG pipeline of chunk → embed → retrieve → generate is far from perfect. Splitting text naively can inadvertently break longer contextual relationships. If crucial information is spread across multiple chunks, or a chunk requires context from the wider document, retrieving one chunk alone might not provide enough context to answer a query accurately. Our chunk embeddings also fail to represent each chunk's full meaning, which means the correct chunks might not be retrieved at all.

Take, for instance, a query like: "What is the population of Berlin?"

If an article is split sentence by sentence, one chunk might mention Berlin while another mentions the population figure without restating the city name. Without context from the entire document, these fragments can't answer the query effectively, especially when resolving references like "it" or "the city." This example by Jina AI demonstrates the problem further.

Late Chunking Solution

Instead of passing each chunk individually to an embedding model, Late Chunking works in two steps:

1. The entire text (or as much of it as fits in the model's context window) is processed by the transformer layers of your embedding model, generating token embeddings that reflect global context.
2. The text is then split into chunks, and mean pooling is applied to the token embeddings within each chunk, producing chunk embeddings that are informed by the whole document.

This preserves document context in every chunk, ensuring each embedding captures more than just the local semantics of the individual chunk. Of course, it doesn't solve the problem of the chunk text itself lacking context. To address that, check out my article comparing Late Chunking to Contextual Retrieval, a method popularized by Anthropic that uses LLMs to add context to chunks: https://medium.com/kx-systems/late-chunking-vs-contextual-retrieval-the-math-behind-rags-context-problem-d5a26b9bbd38.

In practice, late chunking reduces the number of failed retrievals and clusters chunk embeddings around the document they came from.
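To make the two steps concrete, here is a minimal, illustrative sketch of the idea written directly against Hugging Face transformers. This is not Chonkie's internal implementation; the model name, the toy document, and the hard-coded sentence chunks are assumptions for the example, and a real pipeline would also have to handle documents longer than the model's context window.

```python
# Illustrative late chunking: contextualize the WHOLE document once,
# then mean-pool the token embeddings that belong to each chunk.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "sentence-transformers/all-MiniLM-L6-v2"  # assumption for the example
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

document = (
    "Berlin is the capital of Germany. "
    "The city has a population of about 3.7 million people."
)
# Pretend these are our chunks: here, one chunk per sentence.
chunks = [
    "Berlin is the capital of Germany.",
    "The city has a population of about 3.7 million people.",
]

# Step 1: run the full document through the model once, so every token
# embedding is conditioned on the entire text.
encoded = tokenizer(document, return_tensors="pt", return_offsets_mapping=True)
offsets = encoded.pop("offset_mapping")[0]  # (seq_len, 2) character span per token
with torch.no_grad():
    token_embeddings = model(**encoded).last_hidden_state[0]  # (seq_len, hidden)

# Step 2: mean-pool the token embeddings whose character spans fall inside each chunk.
chunk_embeddings = []
cursor = 0
for chunk in chunks:
    start = document.index(chunk, cursor)
    end = start + len(chunk)
    cursor = end
    in_chunk = (
        (offsets[:, 0] >= start)
        & (offsets[:, 1] <= end)
        & (offsets[:, 1] > offsets[:, 0])  # drop special tokens with (0, 0) offsets
    )
    chunk_embeddings.append(token_embeddings[in_chunk].mean(dim=0))

print(len(chunk_embeddings), chunk_embeddings[0].shape)  # 2 chunks, 384-dim each
```

Because both chunk vectors are pooled from globally contextualized token embeddings, the second chunk's embedding "knows" that "the city" refers to Berlin. Chonkie's LateChunker wraps this whole procedure for you.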
Naive vs Late Chunking Comparison

Late Embedding Process. Image By Author.

In a naive approach, each chunk is encoded independently, producing embeddings that lack context from other chunks. Late Chunking, on the other hand, creates chunk embeddings conditioned on the global context. Late chunking has been shown to improve retrieval performance, which in turn means it can reduce hallucinations and failed responses in RAG systems.

Implementation with Chonkie and KDB.AI

Image Source: KDB.AI

Here's how you can implement Late Chunking using KDB.AI as the vector store. (Disclaimer: I'm a Developer Advocate for KDB.AI and a contributor to Chonkie.)

1. Install Dependencies and Set Up LateChunker

```python
!pip install "chonkie[st]" kdbai-client sentence-transformers
```

```python
from chonkie import LateChunker
import kdbai_client as kdbai
import pandas as pd

# Initialize Late Chunker
chunker = LateChunker(
    embedding_model="all-MiniLM-L6-v2",
    mode="sentence",
    chunk_size=512,
    min_sentences_per_chunk=1,
    min_characters_per_sentence=12,
)
```

2. Set Up the Vector Database

You can sign up for a free-tier KDB.AI instance at kdb.ai, which offers up to 4 GB memory and 32 GB storage. This is more than enough for most use cases if embeddings are stored efficiently.

```python
# Initialize KDB.AI session
session = kdbai.Session(
    api_key="your_api_key",
    endpoint="your_endpoint"
)

# Create database and define schema
db = session.create_database("documents")
schema = [
    {"name": "sentences", "type": "str"},
    {"name": "vectors", "type": "float64s"},
]

# Configure HNSW index for fast similarity search
indexes = [{
    "type": "hnsw",
    "name": "hnsw_index",
    "column": "vectors",
    "params": {"dims": 384, "metric": "L2"},
}]

# Create table
table = db.create_table(
    table="chunks",
    schema=schema,
    indexes=indexes
)
```

3. Chunk and Embed

Here's an example using Paul Graham's essays in Markdown format. We'll generate late chunks and store them in the vector database.

```python
import requests

urls = ["www.paulgraham.com/wealth.html", "www.paulgraham.com/start.html"]
texts = [requests.get("http://r.jina.ai/" + url).text for url in urls]

batch_chunks = chunker(texts)
chunks = [chunk for batch in batch_chunks for chunk in batch]

# Collect chunk texts and embeddings into a DataFrame
embeddings_df = pd.DataFrame({
    "vectors": [chunk.embedding.tolist() for chunk in chunks],
    "sentences": [chunk.text for chunk in chunks]
})
embeddings_df.head()

# Store in KDB.AI
table.insert(embeddings_df)
```

4. Query the Vector Store

Let's test the retrieval pipeline by embedding a search query and finding the most relevant chunks.

```python
from sentence_transformers import SentenceTransformer

search_query = "to get rich do this"
search_embedding = SentenceTransformer("all-MiniLM-L6-v2").encode(search_query)

# Search for the most similar chunks
table.search(vectors={"hnsw_index": [search_embedding]}, n=3)[0]["sentences"]
```

And we are able to get some results! The results aren't ideal: the dataset is tiny, we are using a weak embedding model, and we aren't applying any reranking. But as the dataset scales, late chunking can give a very significant boost in accuracy.
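If you do want to add reranking on top of this pipeline, one common approach is to over-fetch candidates from the vector index and let a cross-encoder reorder them. The sketch below is only an illustration, not part of the article's pipeline: it reuses the `table` created in step 2, and the reranker model, the candidate count of 20, and the use of sentence-transformers' CrossEncoder are assumptions.

```python
# Hedged sketch: rerank KDB.AI candidates with a cross-encoder.
from sentence_transformers import CrossEncoder, SentenceTransformer

query = "to get rich do this"
query_vector = SentenceTransformer("all-MiniLM-L6-v2").encode(query)

# Over-fetch candidates from the HNSW index, then rerank them.
results = table.search(vectors={"hnsw_index": [query_vector]}, n=20)[0]
candidates = results["sentences"].tolist()

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # assumed model choice
scores = reranker.predict([(query, passage) for passage in candidates])

# Keep the top 3 passages by reranker score.
top = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)[:3]
for score, passage in top:
    print(f"{score:.3f}  {passage[:80]}")
```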
5. Clean Up

Remember to drop the database to free resources:

```python
db.drop()
```

Conclusion

Late Chunking solves the critical issue of preserving long-distance context in retrieval pipelines. When paired with KDB.AI, you get:

- Context-aware embeddings: every chunk's embedding reflects the entire document.
- Sub-100 ms latency: KDB.AI's HNSW index ensures fast retrieval.
- Scalability: capable of handling large-scale datasets in production.

Chonkie makes adding Late Chunking to your pipeline extremely simple. If you've struggled with building this from scratch before (like I have), this library will save you a lot of time and headaches.

For more insights into advanced AI techniques, vector search, and Retrieval-Augmented Generation, follow me on LinkedIn!

Published via Towards AI