Language models are transfer learners: using BERT to solve Multi-Hop RAG
Author(s): Anuar Sharafudinov

Originally published on Towards AI.

Credits: GPT-4.1

Introduction

In a previous article, we addressed a critical limitation of today's Retrieval-Augmented Generation (RAG) systems: missing contextual information due to independent chunking. However, this is just one of RAG's shortcomings. Another significant challenge is multi-hop question answering (QA), which requires integrating evidence spread across multiple chunks to derive a complete and accurate response. The system must gather and reason over information distributed throughout the document, often across non-contiguous sections, to arrive at the correct answer.

The Challenge of Multi-Hop QA

Multi-hop QA tasks demand that multiple pieces of evidence from different sections of a document (or even multiple documents) be combined to answer a question. Some examples are straightforward. For instance: "Compare iPhone 14 and iPhone 16 characteristics." This can be handled with two independent retrievals: one for "iPhone 14" and one for "iPhone 16." Others are more complex, such as chained queries: "Give me a detailed comparison of the latest macOS versions." Here, a model must first determine what the latest macOS versions are (e.g., Sonoma, Catalina), and then perform separate retrievals or queries for each version before aggregating and comparing the information. This multi-step querying process is computationally expensive and slow.

A Simpler Alternative: Single-Model Chunk Scoring

What if we could avoid query generation altogether and directly score all document chunks against the original question using a single machine learning model? This approach addresses three core RAG problems:

- It eliminates the need for query generation.
- It mitigates the downsides of independent chunking.
- It enables multi-hop reasoning in a single pass.

The Nuance: Chunk Embeddings Instead of Token Embeddings

A common objection: wouldn't scoring all chunks at once require the large context windows RAG was meant to avoid? Here's the twist: we replace individual token embeddings with chunk embeddings, each chunk being up to 200 tokens long. With a sequence length of 500, this setup effectively allows us to encode 100,000 tokens' worth of content.

Credits: GPT-4.1

Training Setup

Training was conducted on a desktop PC with an 8GB GPU for a few hours. The setup was as follows:

- Embedding model: jinaai/jina-embeddings-v3
- Base model for fine-tuning: google-bert/bert-base-uncased
- Input format: [question, chunk_i_1, chunk_i_2, ..., chunk_j_1, chunk_j_2, ...], where chunk_i_1 is the first chunk embedding of document_i
- Labels: [0, label_i_1, label_i_2, ..., label_j_2, ...], where label_i_x = 1 if chunk_i_x contributes to the answer, else 0 (the leading 0 is the label for the question position)
- Chunks are shuffled across documents for each mini-batch, but order within a document is preserved

Training parameters:

- Batch size: 16
- Precision: FP32
- Learning rate: 1e-5
- Epochs: 100
- Average sequence length: 511
- Loss function: binary cross-entropy

Dataset

A hybrid dataset was assembled for training, composed of several sources:

- MultiHop RAG (2,000 rows)
- Microsoft MS MARCO (2,000 rows)
- FinanceBench PDFs (700 rows)
- HotpotQA (40,000 rows)

Test set: 100 samples from HotpotQA

Results

Recall@10 was used as the performance metric: how many of the relevant chunks appear among the top 10 candidates.
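To make the metric concrete, here is a minimal sketch of how Recall@10 can be computed from per-chunk relevance scores. The function name and tensor shapes are illustrative assumptions and are not taken from the released code.

```python
# Minimal sketch of the Recall@10 metric (illustrative, not the released code).
import torch

def recall_at_k(scores: torch.Tensor, labels: torch.Tensor, k: int = 10) -> float:
    """scores: (S,) relevance score per chunk; labels: (S,) 1 if the chunk is relevant, else 0."""
    top_k = torch.topk(scores, k=min(k, scores.numel())).indices  # indices of the k best-scoring chunks
    hits = labels[top_k].sum().item()                             # relevant chunks found in the top k
    total = labels.sum().item()                                   # relevant chunks overall
    return hits / max(total, 1)

# In the sample output below, the relevant chunks 1, 312, and 313 all appear
# in the top-10 candidates, so Recall@10 = 3/3 = 1.0.
```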
Baseline comparisons:

- Raw cosine similarity using jinaai embeddings (a minimal sketch of this baseline appears after the code snippets below)
- Cohere Rerank API, which evaluates relevance using a learned scoring model rather than relying solely on cosine similarity

Cosine similarity, while efficient, often fails to capture semantic nuances between a question and a passage. To address this, Cohere and similar companies have developed reranking models that assign a more accurate relevance score based on a contextual understanding of both the query and each retrieved chunk (scored one by one).

Sample output:

Relevant chunks: [1, 312, 313]
Top 10 candidates: [1, 313, 312, 2, 188, 46, 183, 325, 91, 149]
Recall@10: 3/3

Final results

Open-Sourcing the Code

Want to try this yourself? We've open-sourced the code with training and testing scripts, including integrations with the Cohere Rerank and cosine-similarity baselines.

Future Work

This experiment used bert-base-uncased, which has a limited input size of 512 tokens. Next steps:

- Experiment with larger models: bert-large, DeBERTa, Longformer, etc.
- Jointly fine-tune the embedding and scoring models to enable end-to-end optimization. This allows the embedding space to evolve in a way that is directly aligned with the downstream scoring task, potentially improving both chunk relevance and overall retrieval accuracy.
- Move beyond chunk scoring: fine-tune a large language model such as LLaMA 3 to reason directly over chunk embeddings and generate answers end-to-end. This would let the model not only retrieve relevant content but also synthesize and articulate coherent multi-hop responses in a single step.

Final Thoughts

Transformer-based language models continue to impress. Across recent experiments, we have seen them adapt quickly to new input modalities, such as character probabilities or chunk embeddings, and their adaptability to complex tasks like multi-hop reasoning is fascinating.
Code Snippets

Chunk Embedding Generation (JinaAI)

```python
# Example code for generating question and chunk embeddings
from transformers import AutoModel

emb_cache = {}  # maps hashf(text) -> embedding, so each unique string is encoded only once
embedding_model = AutoModel.from_pretrained("jinaai/jina-embeddings-v3", trust_remote_code=True)
embedding_model.eval()
embedding_model.cuda()

for step, (question, chunks_list, labels_list) in enumerate(dataset):
    print(f"\rEmb: {step}/10k", end="", flush=True)
    h = hashf(question)  # hashf: helper (defined elsewhere) that hashes a string into a cache key
    if h not in emb_cache:
        emb_cache[h] = embedding_model.encode([question], task="retrieval.query")[0]
    for chunks in chunks_list:
        for chunk in chunks:
            h = hashf(chunk)
            if h not in emb_cache:
                emb_cache[h] = embedding_model.encode([chunk], task="retrieval.passage")[0]
```

Cohere Rerank API

```python
# Example usage of the Cohere Rerank API
import cohere

co = cohere.Client("your-api-key")
documents_reranked = co.rerank(model="rerank-v3.5", query=question, documents=documents, top_n=10)
top_indices = [r.index for r in documents_reranked.results]
```

Data Collator (Batch Creator)

```python
# Example code for a custom data collator
import random
from typing import Dict

import torch
from torch.nn.utils.rnn import pad_sequence


class DataCollator:
    def shuffle_lists(self, a, b):
        combined = list(zip(a, b))  # Pair corresponding elements
        random.shuffle(combined)    # Shuffle the pairs
        a_shuffled, b_shuffled = zip(*combined)  # Unzip after shuffling
        return list(a_shuffled), list(b_shuffled)

    def __call__(self, features) -> Dict[str, torch.Tensor]:
        batch = {"input_values": [], "labels": [], "position_ids": []}
        for x in features:
            question, labels_list, chunks_list = x["question"], x["labels_list"], x["chunks_list"]
            question_emb, chunks_emb = emb_cache[hashf(question)], []
            for chunks in chunks_list:
                chunks_emb.append([emb_cache[hashf(chunk)] for chunk in chunks])
            # Shuffle documents within the sample while preserving chunk order inside each document
            chunks_emb, labels_list = self.shuffle_lists(chunks_emb, labels_list)
            # The sequence starts with the question embedding (label 0, position 0)
            input_values, labels, position_ids = [question_emb], [0], [0]
            for i, embs in enumerate(chunks_emb):
                for idx, emb in enumerate(embs):
                    input_values.append(emb)
                    position_ids.append(idx + 1)
                labels += labels_list[i]
            input_values = torch.tensor(input_values)
            labels = torch.tensor(labels)
            position_ids = torch.tensor(position_ids, dtype=torch.long)
            batch["input_values"].append(input_values)
            batch["labels"].append(labels)
            batch["position_ids"].append(position_ids)
        batch["input_values"] = pad_sequence(batch["input_values"], batch_first=True, padding_value=0)  # B, S, C
        batch["labels"] = pad_sequence(batch["labels"], batch_first=True, padding_value=0)
        batch["position_ids"] = pad_sequence(batch["position_ids"], batch_first=True, padding_value=0)
        return batch
```
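To show how these collated batches might be consumed, here is a minimal sketch of a chunk-level scorer built on bert-base-uncased. The class name ChunkScorer, the linear projection from the embedding dimension to BERT's hidden size, and the per-position classification head are illustrative assumptions rather than the released implementation; the sketch feeds chunk embeddings in through inputs_embeds and trains each position with binary cross-entropy, as described in the training setup.

```python
# Minimal sketch of a chunk-level scorer on top of bert-base-uncased.
# Assumptions (not from the released code): the projection layer, the head, and all names below.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import BertModel


class ChunkScorer(nn.Module):
    def __init__(self, emb_dim: int = 1024, model_name: str = "bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        hidden = self.bert.config.hidden_size   # 768 for bert-base
        self.proj = nn.Linear(emb_dim, hidden)  # map chunk embeddings into BERT's input space
        self.head = nn.Linear(hidden, 1)        # one relevance logit per sequence position

    def forward(self, input_values, position_ids, labels=None):
        # input_values: (B, S, emb_dim) = question embedding followed by chunk embeddings
        inputs_embeds = self.proj(input_values)
        # For brevity this sketch passes no attention mask and does not exclude padded positions
        out = self.bert(inputs_embeds=inputs_embeds, position_ids=position_ids)
        logits = self.head(out.last_hidden_state).squeeze(-1)  # (B, S)
        if labels is not None:
            # Binary cross-entropy over positions, as in the training setup above
            loss = F.binary_cross_entropy_with_logits(logits, labels.float())
            return loss, logits
        return logits
```

At inference time, the per-position logits can simply be sorted to produce the top-10 candidates evaluated with Recall@10.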
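For comparison, the raw cosine-similarity baseline mentioned in the results can be sketched as follows, assuming the same jina question and passage embeddings; the function name and shapes are illustrative.

```python
# Minimal sketch of the raw cosine-similarity baseline (assumes the jina embeddings above).
import numpy as np

def cosine_top_k(question_emb: np.ndarray, chunk_embs: np.ndarray, k: int = 10) -> np.ndarray:
    """question_emb: (D,), chunk_embs: (N, D). Returns indices of the k most similar chunks."""
    q = question_emb / np.linalg.norm(question_emb)
    c = chunk_embs / np.linalg.norm(chunk_embs, axis=1, keepdims=True)
    sims = c @ q                   # cosine similarity of each chunk to the question
    return np.argsort(-sims)[:k]   # top-k chunk indices, most similar first
```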