Поиск

Marktechpost AI поделился ссылкой

2025-05-17 13:29:36 ·

LLMs Struggle with Real Conversations: Microsoft and Salesforce Researchers Reveal a 39% Performance Drop in Multi-Turn Underspecified Tasks

Conversational artificial intelligence is centered on enabling large language modelsto engage in dynamic interactions where user needs are revealed progressively. These systems are widely deployed in tools that assist with coding, writing, and research by interpreting and responding to natural language instructions. The aspiration is for these models to flexibly adjust to changing user inputs over multiple turns, adapting their understanding with each new piece of information. This contrasts with static, single-turn responses and highlights a major design goal: sustaining contextual coherence and delivering accurate outcomes in extended dialogues.
A persistent problem in conversational AI is the model’s inability to handle user instructions distributed across multiple conversation turns. Rather than receiving all necessary information simultaneously, LLMs must extract and integrate key details incrementally. However, when the task is not specified upfront, models tend to make early assumptions about what is being asked and attempt final solutions prematurely. This leads to errors that persist through the conversation, as the models often stick to their earlier interpretations. The result is that once an LLM makes a misstep in understanding, it struggles to recover, resulting in incomplete or misguided answers.

Most current tools evaluate LLMs using single-turn, fully-specified prompts, where all task requirements are presented in one go. Even in research claiming multi-turn analysis, the conversations are typically episodic, treated as isolated subtasks rather than an evolving flow. These evaluations fail to account for how models behave when the information is fragmented and context must be actively constructed from multiple exchanges. Consequently, evaluations often miss the core difficulty models face: integrating underspecified inputs over several conversational turns without explicit direction.
Researchers from Microsoft Research and Salesforce Research introduced a simulation setup that mimics how users reveal information in real conversations. Their “sharded simulation” method takes complete instructions from high-quality benchmarks and splits them into smaller, logically connected parts or “shards.” Each shard delivers a single element of the original instruction, which is then revealed sequentially over multiple turns. This simulates the progressive disclosure of information that happens in practice. The setup includes a simulated user powered by an LLM that decides which shard to reveal next and reformulates it naturally to fit the ongoing context. This setup also uses classification mechanisms to evaluate whether the assistant’s responses attempt a solution or require clarification, further refining the simulation of genuine interaction.

The technology developed simulates five types of conversations, including single-turn full instructions and multiple multi-turn setups. In SHARDED simulations, LLMs received instructions one shard at a time, forcing them to wait before proposing a complete answer. This setup evaluated 15 LLMs across six generation tasks: coding, SQL queries, API actions, math problems, data-to-text descriptions, and document summaries. Each task drew from established datasets such as GSM8K, Spider, and ToTTo. For every LLM and instruction, 10 simulations were conducted, totaling over 200,000 simulations. Aptitude, unreliability, and average performance were computed using a percentile-based scoring system, allowing direct comparison of best and worst-case outcomes per model.
Across all tasks and models, a consistent decline in performance was observed in the SHARDED setting. On average, performance dropped from 90% in single-turn to 65% in multi-turn scenarios—a 25-point decline. The main cause was not reduced capability but a dramatic rise in unreliability. While aptitude dropped by 16%, unreliability increased by 112%, revealing that models varied wildly in how they performed when information was presented gradually. For example, even top-performing models like GPT-4.1 and Gemini 2.5 Pro exhibited 30-40% average degradations. Additional compute at generation time or lowering randomnessoffered only minor improvements in consistency.

This research clarifies that even state-of-the-art LLMs are not yet equipped to manage complex conversations where task requirements unfold gradually. The sharded simulation methodology effectively exposes how models falter in adapting to evolving instructions, highlighting the urgent need to improve reliability in multi-turn settings. Enhancing the ability of LLMs to process incomplete instructions over time is essential for real-world applications where conversations are naturally unstructured and incremental.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 90k+ ML SubReddit.
NikhilNikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Material Science, he is exploring new advancements and creating opportunities to contribute.Nikhilhttps://www.marktechpost.com/author/nikhil0980/Salesforce AI Releases BLIP3-o: A Fully Open-Source Unified Multimodal Model Built with CLIP Embeddings and Flow Matching for Image Understanding and GenerationNikhilhttps://www.marktechpost.com/author/nikhil0980/Georgia Tech and Stanford Researchers Introduce MLE-Dojo: A Gym-Style Framework Designed for Training, Evaluating, and Benchmarking Autonomous Machine Learning EngineeringAgentsNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Investigates Test-Time Scaling of English-Centric RLMs for Enhanced Multilingual Reasoning and Domain GeneralizationNikhilhttps://www.marktechpost.com/author/nikhil0980/PwC Releases Executive Guide on Agentic AI: A Strategic Blueprint for Deploying Autonomous Multi-Agent Systems in the Enterprise
#llms #struggle #with #real #conversations

LLMs Struggle with Real Conversations: Microsoft and Salesforce Researchers Reveal a 39% Performance Drop in Multi-Turn Underspecified Tasks
Conversational artificial intelligence is centered on enabling large language modelsto engage in dynamic interactions where user needs are revealed progressively. These systems are widely deployed in tools that assist with coding, writing, and research by interpreting and responding to natural language instructions. The aspiration is for these models to flexibly adjust to changing user inputs over multiple turns, adapting their understanding with each new piece of information. This contrasts with static, single-turn responses and highlights a major design goal: sustaining contextual coherence and delivering accurate outcomes in extended dialogues. A persistent problem in conversational AI is the model’s inability to handle user instructions distributed across multiple conversation turns. Rather than receiving all necessary information simultaneously, LLMs must extract and integrate key details incrementally. However, when the task is not specified upfront, models tend to make early assumptions about what is being asked and attempt final solutions prematurely. This leads to errors that persist through the conversation, as the models often stick to their earlier interpretations. The result is that once an LLM makes a misstep in understanding, it struggles to recover, resulting in incomplete or misguided answers. Most current tools evaluate LLMs using single-turn, fully-specified prompts, where all task requirements are presented in one go. Even in research claiming multi-turn analysis, the conversations are typically episodic, treated as isolated subtasks rather than an evolving flow. These evaluations fail to account for how models behave when the information is fragmented and context must be actively constructed from multiple exchanges. Consequently, evaluations often miss the core difficulty models face: integrating underspecified inputs over several conversational turns without explicit direction. Researchers from Microsoft Research and Salesforce Research introduced a simulation setup that mimics how users reveal information in real conversations. Their “sharded simulation” method takes complete instructions from high-quality benchmarks and splits them into smaller, logically connected parts or “shards.” Each shard delivers a single element of the original instruction, which is then revealed sequentially over multiple turns. This simulates the progressive disclosure of information that happens in practice. The setup includes a simulated user powered by an LLM that decides which shard to reveal next and reformulates it naturally to fit the ongoing context. This setup also uses classification mechanisms to evaluate whether the assistant’s responses attempt a solution or require clarification, further refining the simulation of genuine interaction. The technology developed simulates five types of conversations, including single-turn full instructions and multiple multi-turn setups. In SHARDED simulations, LLMs received instructions one shard at a time, forcing them to wait before proposing a complete answer. This setup evaluated 15 LLMs across six generation tasks: coding, SQL queries, API actions, math problems, data-to-text descriptions, and document summaries. Each task drew from established datasets such as GSM8K, Spider, and ToTTo. For every LLM and instruction, 10 simulations were conducted, totaling over 200,000 simulations. Aptitude, unreliability, and average performance were computed using a percentile-based scoring system, allowing direct comparison of best and worst-case outcomes per model. Across all tasks and models, a consistent decline in performance was observed in the SHARDED setting. On average, performance dropped from 90% in single-turn to 65% in multi-turn scenarios—a 25-point decline. The main cause was not reduced capability but a dramatic rise in unreliability. While aptitude dropped by 16%, unreliability increased by 112%, revealing that models varied wildly in how they performed when information was presented gradually. For example, even top-performing models like GPT-4.1 and Gemini 2.5 Pro exhibited 30-40% average degradations. Additional compute at generation time or lowering randomnessoffered only minor improvements in consistency. This research clarifies that even state-of-the-art LLMs are not yet equipped to manage complex conversations where task requirements unfold gradually. The sharded simulation methodology effectively exposes how models falter in adapting to evolving instructions, highlighting the urgent need to improve reliability in multi-turn settings. Enhancing the ability of LLMs to process incomplete instructions over time is essential for real-world applications where conversations are naturally unstructured and incremental. Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 90k+ ML SubReddit. NikhilNikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Material Science, he is exploring new advancements and creating opportunities to contribute.Nikhilhttps://www.marktechpost.com/author/nikhil0980/Salesforce AI Releases BLIP3-o: A Fully Open-Source Unified Multimodal Model Built with CLIP Embeddings and Flow Matching for Image Understanding and GenerationNikhilhttps://www.marktechpost.com/author/nikhil0980/Georgia Tech and Stanford Researchers Introduce MLE-Dojo: A Gym-Style Framework Designed for Training, Evaluating, and Benchmarking Autonomous Machine Learning EngineeringAgentsNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Investigates Test-Time Scaling of English-Centric RLMs for Enhanced Multilingual Reasoning and Domain GeneralizationNikhilhttps://www.marktechpost.com/author/nikhil0980/PwC Releases Executive Guide on Agentic AI: A Strategic Blueprint for Deploying Autonomous Multi-Agent Systems in the Enterprise #llms #struggle #with #real #conversations

WWW.MARKTECHPOST.COM

LLMs Struggle with Real Conversations: Microsoft and Salesforce Researchers Reveal a 39% Performance Drop in Multi-Turn Underspecified Tasks

Conversational artificial intelligence is centered on enabling large language models (LLMs) to engage in dynamic interactions where user needs are revealed progressively. These systems are widely deployed in tools that assist with coding, writing, and research by interpreting and responding to natural language instructions. The aspiration is for these models to flexibly adjust to changing user inputs over multiple turns, adapting their understanding with each new piece of information. This contrasts with static, single-turn responses and highlights a major design goal: sustaining contextual coherence and delivering accurate outcomes in extended dialogues. A persistent problem in conversational AI is the model’s inability to handle user instructions distributed across multiple conversation turns. Rather than receiving all necessary information simultaneously, LLMs must extract and integrate key details incrementally. However, when the task is not specified upfront, models tend to make early assumptions about what is being asked and attempt final solutions prematurely. This leads to errors that persist through the conversation, as the models often stick to their earlier interpretations. The result is that once an LLM makes a misstep in understanding, it struggles to recover, resulting in incomplete or misguided answers. Most current tools evaluate LLMs using single-turn, fully-specified prompts, where all task requirements are presented in one go. Even in research claiming multi-turn analysis, the conversations are typically episodic, treated as isolated subtasks rather than an evolving flow. These evaluations fail to account for how models behave when the information is fragmented and context must be actively constructed from multiple exchanges. Consequently, evaluations often miss the core difficulty models face: integrating underspecified inputs over several conversational turns without explicit direction. Researchers from Microsoft Research and Salesforce Research introduced a simulation setup that mimics how users reveal information in real conversations. Their “sharded simulation” method takes complete instructions from high-quality benchmarks and splits them into smaller, logically connected parts or “shards.” Each shard delivers a single element of the original instruction, which is then revealed sequentially over multiple turns. This simulates the progressive disclosure of information that happens in practice. The setup includes a simulated user powered by an LLM that decides which shard to reveal next and reformulates it naturally to fit the ongoing context. This setup also uses classification mechanisms to evaluate whether the assistant’s responses attempt a solution or require clarification, further refining the simulation of genuine interaction. The technology developed simulates five types of conversations, including single-turn full instructions and multiple multi-turn setups. In SHARDED simulations, LLMs received instructions one shard at a time, forcing them to wait before proposing a complete answer. This setup evaluated 15 LLMs across six generation tasks: coding, SQL queries, API actions, math problems, data-to-text descriptions, and document summaries. Each task drew from established datasets such as GSM8K, Spider, and ToTTo. For every LLM and instruction, 10 simulations were conducted, totaling over 200,000 simulations. Aptitude, unreliability, and average performance were computed using a percentile-based scoring system, allowing direct comparison of best and worst-case outcomes per model. Across all tasks and models, a consistent decline in performance was observed in the SHARDED setting. On average, performance dropped from 90% in single-turn to 65% in multi-turn scenarios—a 25-point decline. The main cause was not reduced capability but a dramatic rise in unreliability. While aptitude dropped by 16%, unreliability increased by 112%, revealing that models varied wildly in how they performed when information was presented gradually. For example, even top-performing models like GPT-4.1 and Gemini 2.5 Pro exhibited 30-40% average degradations. Additional compute at generation time or lowering randomness (temperature settings) offered only minor improvements in consistency. This research clarifies that even state-of-the-art LLMs are not yet equipped to manage complex conversations where task requirements unfold gradually. The sharded simulation methodology effectively exposes how models falter in adapting to evolving instructions, highlighting the urgent need to improve reliability in multi-turn settings. Enhancing the ability of LLMs to process incomplete instructions over time is essential for real-world applications where conversations are naturally unstructured and incremental. Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 90k+ ML SubReddit. NikhilNikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Material Science, he is exploring new advancements and creating opportunities to contribute.Nikhilhttps://www.marktechpost.com/author/nikhil0980/Salesforce AI Releases BLIP3-o: A Fully Open-Source Unified Multimodal Model Built with CLIP Embeddings and Flow Matching for Image Understanding and GenerationNikhilhttps://www.marktechpost.com/author/nikhil0980/Georgia Tech and Stanford Researchers Introduce MLE-Dojo: A Gym-Style Framework Designed for Training, Evaluating, and Benchmarking Autonomous Machine Learning Engineering (MLE) AgentsNikhilhttps://www.marktechpost.com/author/nikhil0980/This AI Paper Investigates Test-Time Scaling of English-Centric RLMs for Enhanced Multilingual Reasoning and Domain GeneralizationNikhilhttps://www.marktechpost.com/author/nikhil0980/PwC Releases Executive Guide on Agentic AI: A Strategic Blueprint for Deploying Autonomous Multi-Agent Systems in the Enterprise

·126 Просмотры

Войдите, чтобы отмечать, делиться и комментировать!
Computerworld UK поделился ссылкой

2025-05-16 01:08:55 ·

Relying on file storage heritage, Box pivots to AI

AI agents will fundamentally change the value of content and the way people work with data and files, said Aaron Levie, CEO of Box, during a Wednesday webcast that was part of the company’s Content+AI Virtual Summit.

The traditional approach to managing content is fundamentally changing with AI, Levie said.

Over the last two decades, Box has changed its focus from file storage to extracting value from content in those files. The company provides collaborative tools and integrates popular apps for users to work with those files and data.

Extracting value from the content in stored files will take on new meaning with AI. The amount of unstructured data within files has grown at an exponential pace over the last few decades, but most of the data is underutilized, Levie said.

AI will provide instant answers from unstructured data, fundamentally changing the value of content stored in systems. It will also automate workflows and introduce an AI-first culture, Levie said.

“We have the opportunity to drive an incredible amount of new experiences,” Levie said. “We need a better and modern way to manage information.”

Box has kept up with various stages in AI evolution, with integration of AI models and more recently AI agents. As AI advances, the vendor is allowing its clients to do more advanced data extraction, multi-step reasoning, and more complex task planning.

“We didn’t think about AI as an add-on capability on the side,” but as central to the Box platform, Levie said.

For example, the Box interface has an agent that can answer queries and work with content within specific folders. It provides multimedia responses to queries, linking to videos, charts, and citations within files stored in the system.

The company on Wednesday launched a series of new AI agents that serve different objectives, including a “Search” agent for basic results from files and a “Deep Research” agent that digs deeper into information for more comprehensive answers.

The Deep Research agent works with large volumes of enterprise content in files, summarizes findings, and provides links to relevant files.

Box is also integrating a new AI agent for Microsoft 365 Copilot, allowing users of Word, PowerPoint, and other software to work with data stored in Box systems.

In the coming years, Microsoft plans to integrate thousands of agents from third parties such as Box and Adobe to improve user productivity.

Many of Box’s competitors are also offering their own AI technology, though agent-to-agent integrations between vendors are growing.

Forrester Research has a roadmap for broad adoption of various agent technologies in coming years. It shows helper-style agents like the ones announced by Box taking off gradually in 2026.

Helper-style AI agents will see broad adoption by next year, according to Forrester. More complex agents capable of executive decision-making will take a little longer.
Forrester Research

But as AI agents advance to solving problems and making executive decisions, things will get more complex, said Craig Le Clair, vice president and principal analyst at Forrester.

There’s a huge gap between vendors of AI agents and the actual adoption profile of users and buyers, Le Clair said. “The gap that exists there is astounding,” he said.

Companies are advancing with AI agents, but it’s much more complex when it comes to sophisticated solvers and managing agents, Le Clair said.

“If you’re in a financial institution, you’ve built all these layers of process, control, risk mitigation, and reporting around the technology underneath. And you don’t change that easily, because it takes legal, compliance, meetings, security. It takes longer to change the processes that sit above the technology,” Le Clair said.
#relying #file #storage #heritage #box

Relying on file storage heritage, Box pivots to AI
AI agents will fundamentally change the value of content and the way people work with data and files, said Aaron Levie, CEO of Box, during a Wednesday webcast that was part of the company’s Content+AI Virtual Summit. The traditional approach to managing content is fundamentally changing with AI, Levie said. Over the last two decades, Box has changed its focus from file storage to extracting value from content in those files. The company provides collaborative tools and integrates popular apps for users to work with those files and data. Extracting value from the content in stored files will take on new meaning with AI. The amount of unstructured data within files has grown at an exponential pace over the last few decades, but most of the data is underutilized, Levie said. AI will provide instant answers from unstructured data, fundamentally changing the value of content stored in systems. It will also automate workflows and introduce an AI-first culture, Levie said. “We have the opportunity to drive an incredible amount of new experiences,” Levie said. “We need a better and modern way to manage information.” Box has kept up with various stages in AI evolution, with integration of AI models and more recently AI agents. As AI advances, the vendor is allowing its clients to do more advanced data extraction, multi-step reasoning, and more complex task planning. “We didn’t think about AI as an add-on capability on the side,” but as central to the Box platform, Levie said. For example, the Box interface has an agent that can answer queries and work with content within specific folders. It provides multimedia responses to queries, linking to videos, charts, and citations within files stored in the system. The company on Wednesday launched a series of new AI agents that serve different objectives, including a “Search” agent for basic results from files and a “Deep Research” agent that digs deeper into information for more comprehensive answers. The Deep Research agent works with large volumes of enterprise content in files, summarizes findings, and provides links to relevant files. Box is also integrating a new AI agent for Microsoft 365 Copilot, allowing users of Word, PowerPoint, and other software to work with data stored in Box systems. In the coming years, Microsoft plans to integrate thousands of agents from third parties such as Box and Adobe to improve user productivity. Many of Box’s competitors are also offering their own AI technology, though agent-to-agent integrations between vendors are growing. Forrester Research has a roadmap for broad adoption of various agent technologies in coming years. It shows helper-style agents like the ones announced by Box taking off gradually in 2026. Helper-style AI agents will see broad adoption by next year, according to Forrester. More complex agents capable of executive decision-making will take a little longer. Forrester Research But as AI agents advance to solving problems and making executive decisions, things will get more complex, said Craig Le Clair, vice president and principal analyst at Forrester. There’s a huge gap between vendors of AI agents and the actual adoption profile of users and buyers, Le Clair said. “The gap that exists there is astounding,” he said. Companies are advancing with AI agents, but it’s much more complex when it comes to sophisticated solvers and managing agents, Le Clair said. “If you’re in a financial institution, you’ve built all these layers of process, control, risk mitigation, and reporting around the technology underneath. And you don’t change that easily, because it takes legal, compliance, meetings, security. It takes longer to change the processes that sit above the technology,” Le Clair said. #relying #file #storage #heritage #box

WWW.COMPUTERWORLD.COM

Relying on file storage heritage, Box pivots to AI

AI agents will fundamentally change the value of content and the way people work with data and files, said Aaron Levie, CEO of Box, during a Wednesday webcast that was part of the company’s Content+AI Virtual Summit. The traditional approach to managing content is fundamentally changing with AI, Levie said. Over the last two decades, Box has changed its focus from file storage to extracting value from content in those files. The company provides collaborative tools and integrates popular apps for users to work with those files and data. Extracting value from the content in stored files will take on new meaning with AI. The amount of unstructured data within files has grown at an exponential pace over the last few decades, but most of the data is underutilized, Levie said. AI will provide instant answers from unstructured data, fundamentally changing the value of content stored in systems. It will also automate workflows and introduce an AI-first culture, Levie said. “We have the opportunity to drive an incredible amount of new experiences,” Levie said. “We need a better and modern way to manage information.” Box has kept up with various stages in AI evolution, with integration of AI models and more recently AI agents. As AI advances, the vendor is allowing its clients to do more advanced data extraction, multi-step reasoning, and more complex task planning. “We didn’t think about AI as an add-on capability on the side,” but as central to the Box platform, Levie said. For example, the Box interface has an agent that can answer queries and work with content within specific folders. It provides multimedia responses to queries, linking to videos, charts, and citations within files stored in the system. The company on Wednesday launched a series of new AI agents that serve different objectives, including a “Search” agent for basic results from files and a “Deep Research” agent that digs deeper into information for more comprehensive answers. The Deep Research agent works with large volumes of enterprise content in files, summarizes findings, and provides links to relevant files. Box is also integrating a new AI agent for Microsoft 365 Copilot, allowing users of Word, PowerPoint, and other software to work with data stored in Box systems. In the coming years, Microsoft plans to integrate thousands of agents from third parties such as Box and Adobe to improve user productivity. Many of Box’s competitors are also offering their own AI technology, though agent-to-agent integrations between vendors are growing. Forrester Research has a roadmap for broad adoption of various agent technologies in coming years. It shows helper-style agents like the ones announced by Box taking off gradually in 2026. Helper-style AI agents will see broad adoption by next year, according to Forrester. More complex agents capable of executive decision-making will take a little longer. Forrester Research But as AI agents advance to solving problems and making executive decisions, things will get more complex, said Craig Le Clair, vice president and principal analyst at Forrester. There’s a huge gap between vendors of AI agents and the actual adoption profile of users and buyers, Le Clair said. “The gap that exists there is astounding,” he said. Companies are advancing with AI agents, but it’s much more complex when it comes to sophisticated solvers and managing agents, Le Clair said. “If you’re in a financial institution, you’ve built all these layers of process, control, risk mitigation, and reporting around the technology underneath. And you don’t change that easily, because it takes legal, compliance, meetings, security. It takes longer to change the processes that sit above the technology,” Le Clair said.

·230 Просмотры

Войдите, чтобы отмечать, делиться и комментировать!
Towards AI поделился ссылкой

2025-05-15 04:16:37 ·

Extracting Data from Unstructured Documents

Author: Felix Pappe

Originally published on Towards AI.

Image created by the author using gpt-image-1
Introduction
In the past, extracting specific information from documents or images using traditional methods could have become quickly cumbersome and frustrating, especially when the final results stray far from what you intended. The reasons for this can be diverse, ranging from overly complex document layouts to improperly formatted files, or an avalanche of visual elements, which machines struggle to interpret.
However, vision-enabled languagemodels have come to the rescue. Over the past months and years, these models have gained ever-greater capabilities, from rough image descriptions to detailed text extraction. Notably, the extraction of complex textual information from images has seen astonishing progress. This allows for rapid knowledge extraction from diverse document types without brittle, rule-based systems that break as soon as the document structure changes — and without the time-, data-, and cost-intensive specialized training of custom models.
However, there is one flaw: vLMs, like their text-only counterparts, tend to produce verbose output around the information you actually want. Phrases such as “Of course, here is the information you requested” or “This is the extracted information about XYZ” commonly surround the essential content.
You could use regular expressions together with advanced prompt engineering to constrain the vLM to output only the requested information. However, crafting the perfect prompt and matching regex for a given task is difficult and requires much trial and error. In this blog post, I’d like to introduce a simpler approach: combining the rich capabilities of vLLMs with the strict validation offered by Pydantic classes to extract exactly the desired information for your document processing pipeline.
Description of post tackled issue
The example in this blog post describes a situation that every job applicant has likely experienced many times. I am sure of it.
After you have carefully and thoroughly created your CV, thinking about every word and maybe even every letter, you upload the file to a job portal. But after successfully uploading the file, including all the requested information, you are asked once again to fill out the same details in standard HTML forms by copying and pasting the information from your CV into the correct fields.Some companies attempt to autofill these fields based on the information extracted from your CV, but the results are often far from accurate or complete.In the following code, I combine Pixtral, LangChain, and Pydantic to provide a simple solution.
The code extracts the first name, last name, phone number, email, and birthday from the CV if they exist. This helps keep the example simple and focuses on the technical aspects.The code can be easily adapted for other use cases or extended to extract all required information from a CV.So let us dive into the code.
Code walkthrough
Importing required libraries
In the first step, the required libraries are imported, including:

os, pathlib, and typing for standard Python modules providing filesystem access and type annotations
base64
dontenv.env file into os.environ
pydanticLLM output
ChatMistralAILLM interface
PIL

import osimport base64from pathlib import Pathfrom typing import Optionalfrom dotenv import load_dotenvfrom pydantic import BaseModel, Fieldfrom langchain_mistralai.chat_models import ChatMistralAIfrom langchain_core.messages import HumanMessagefrom PIL import Image
Loading environment variables
Subsequently, the environment variables are loaded using load_dotenv, and the MISTRAL_API_KEY is retrieved.
load_dotenvMISTRAL_API_KEY = os.getenvif not MISTRAL_API_KEY: raise ValueErrorDefining the output schema with pydantic
Following that, the output schema is defined using Pydantic. Pydantic is a Python library for data parsing and validation based on Python type hints. At its core, Pydantic’s BaseModel offers various useful features, such as the declaration of data typesand automatic coercion of incoming data into the required types when possible.
Moreover, it validates whether the incoming data matches the predefined schema and raises an error if it does not. Thanks to these clearly defined schemas, the data can be quickly serialized into other formats such as JSON. Likewise, Pydantic also allows the creation of document fields with metadata that tools such as LLMs can inspect and utilize. The next code block defines the structure of the expected output using Pydantic. These are the data points that the model should extract from the CV image.class BasicCV: first_name: Optional= Fieldlast_name: Optional= Fieldphone: Optional= Fieldemail: Optional= Fieldbirthday: Optional= Field")
Converting images to base64
Subsequently, the first function is defined for the script. The function encode_image_to_base64does exactly what its name suggests. It loads an image and converts it into a base64 string, which is passed into the vLM later.
Moreover, an upscaling factor has been integrated. Although no additional information is gained by simply increasing the height and width, in my experience, the results tend to improve, especially in situations where the original resolution of the image is low.
def encode_image_to_base64-> str: with Image.openas img: if upscale_factor != 1.0: new_size =, int) img = img.resizefrom io import BytesIO buffer = BytesIOimg.saveimage_bytes = buffer.getvaluereturn base64.b64encode.decodeProcessing the CV with a vision language model
Now, let’s move on to the main function of this script. The process_cvfunction begins by initializing the Mistral interface using a previously generated API key. This model is then wrapped using the .with_structured_outputfunction, in which the Pydantic model defined above is passed as input. If you are using a different vLM, make sure that it supports structured output, as not all vLMs do.
Afterwards, the input image is converted into a base64string, which is then transformed into a Uniform Resource Identifierby attaching a metadata string in front of the b64 string.
Next, a simple system prompt is defined, which leaves room for improvement in more complex extraction tasks but works perfectly for this scenario.
Finally, the URI and system prompt are combined into a LangChain HumanMessage, which is passed to the structured vLM. The model then returns the requested information in the previously defined Pydantic format.
def process_cv-> BasicCV: image_path: Path, api_key: Optional= None llm = ChatMistralAIstructured_llm = llm.with_structured_outputimage_b64 = encode_image_to_base64data_uri = f"data:image/png;base64,{image_b64}" system_text =message = HumanMessageresult: BasicCV = structured_llm.invokereturn result
Running the script
This function is executed by the main, where the path is defined and the final information is printed out.
if __name__ == "__main__": image_file = Pathcv_data = process_cvprintprintprintprintprintConclusion
This simple Python script provides only a first impression of how powerful and flexible vLMs have become. In combination with Pydantic and with the support of the powerful LangChain framework, vLMs can be turned into a meaningful solution for many document processing workflows, such as application processing or invoice handling.
What experience have you had with vision Large Language Models? Do you have other fields in mind where such a workflow might be beneficial?
Source
Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor.

Published via Towards AI
#extracting #data #unstructured #documents

Extracting Data from Unstructured Documents
Author: Felix Pappe Originally published on Towards AI. Image created by the author using gpt-image-1 Introduction In the past, extracting specific information from documents or images using traditional methods could have become quickly cumbersome and frustrating, especially when the final results stray far from what you intended. The reasons for this can be diverse, ranging from overly complex document layouts to improperly formatted files, or an avalanche of visual elements, which machines struggle to interpret. However, vision-enabled languagemodels have come to the rescue. Over the past months and years, these models have gained ever-greater capabilities, from rough image descriptions to detailed text extraction. Notably, the extraction of complex textual information from images has seen astonishing progress. This allows for rapid knowledge extraction from diverse document types without brittle, rule-based systems that break as soon as the document structure changes — and without the time-, data-, and cost-intensive specialized training of custom models. However, there is one flaw: vLMs, like their text-only counterparts, tend to produce verbose output around the information you actually want. Phrases such as “Of course, here is the information you requested” or “This is the extracted information about XYZ” commonly surround the essential content. You could use regular expressions together with advanced prompt engineering to constrain the vLM to output only the requested information. However, crafting the perfect prompt and matching regex for a given task is difficult and requires much trial and error. In this blog post, I’d like to introduce a simpler approach: combining the rich capabilities of vLLMs with the strict validation offered by Pydantic classes to extract exactly the desired information for your document processing pipeline. Description of post tackled issue The example in this blog post describes a situation that every job applicant has likely experienced many times. I am sure of it. After you have carefully and thoroughly created your CV, thinking about every word and maybe even every letter, you upload the file to a job portal. But after successfully uploading the file, including all the requested information, you are asked once again to fill out the same details in standard HTML forms by copying and pasting the information from your CV into the correct fields.Some companies attempt to autofill these fields based on the information extracted from your CV, but the results are often far from accurate or complete.In the following code, I combine Pixtral, LangChain, and Pydantic to provide a simple solution. The code extracts the first name, last name, phone number, email, and birthday from the CV if they exist. This helps keep the example simple and focuses on the technical aspects.The code can be easily adapted for other use cases or extended to extract all required information from a CV.So let us dive into the code. Code walkthrough Importing required libraries In the first step, the required libraries are imported, including: os, pathlib, and typing for standard Python modules providing filesystem access and type annotations base64 dontenv.env file into os.environ pydanticLLM output ChatMistralAILLM interface PIL import osimport base64from pathlib import Pathfrom typing import Optionalfrom dotenv import load_dotenvfrom pydantic import BaseModel, Fieldfrom langchain_mistralai.chat_models import ChatMistralAIfrom langchain_core.messages import HumanMessagefrom PIL import Image Loading environment variables Subsequently, the environment variables are loaded using load_dotenv, and the MISTRAL_API_KEY is retrieved. load_dotenvMISTRAL_API_KEY = os.getenvif not MISTRAL_API_KEY: raise ValueErrorDefining the output schema with pydantic Following that, the output schema is defined using Pydantic. Pydantic is a Python library for data parsing and validation based on Python type hints. At its core, Pydantic’s BaseModel offers various useful features, such as the declaration of data typesand automatic coercion of incoming data into the required types when possible. Moreover, it validates whether the incoming data matches the predefined schema and raises an error if it does not. Thanks to these clearly defined schemas, the data can be quickly serialized into other formats such as JSON. Likewise, Pydantic also allows the creation of document fields with metadata that tools such as LLMs can inspect and utilize. The next code block defines the structure of the expected output using Pydantic. These are the data points that the model should extract from the CV image.class BasicCV: first_name: Optional= Fieldlast_name: Optional= Fieldphone: Optional= Fieldemail: Optional= Fieldbirthday: Optional= Field") Converting images to base64 Subsequently, the first function is defined for the script. The function encode_image_to_base64does exactly what its name suggests. It loads an image and converts it into a base64 string, which is passed into the vLM later. Moreover, an upscaling factor has been integrated. Although no additional information is gained by simply increasing the height and width, in my experience, the results tend to improve, especially in situations where the original resolution of the image is low. def encode_image_to_base64-> str: with Image.openas img: if upscale_factor != 1.0: new_size =, int) img = img.resizefrom io import BytesIO buffer = BytesIOimg.saveimage_bytes = buffer.getvaluereturn base64.b64encode.decodeProcessing the CV with a vision language model Now, let’s move on to the main function of this script. The process_cvfunction begins by initializing the Mistral interface using a previously generated API key. This model is then wrapped using the .with_structured_outputfunction, in which the Pydantic model defined above is passed as input. If you are using a different vLM, make sure that it supports structured output, as not all vLMs do. Afterwards, the input image is converted into a base64string, which is then transformed into a Uniform Resource Identifierby attaching a metadata string in front of the b64 string. Next, a simple system prompt is defined, which leaves room for improvement in more complex extraction tasks but works perfectly for this scenario. Finally, the URI and system prompt are combined into a LangChain HumanMessage, which is passed to the structured vLM. The model then returns the requested information in the previously defined Pydantic format. def process_cv-> BasicCV: image_path: Path, api_key: Optional= None llm = ChatMistralAIstructured_llm = llm.with_structured_outputimage_b64 = encode_image_to_base64data_uri = f"data:image/png;base64,{image_b64}" system_text =message = HumanMessageresult: BasicCV = structured_llm.invokereturn result Running the script This function is executed by the main, where the path is defined and the final information is printed out. if __name__ == "__main__": image_file = Pathcv_data = process_cvprintprintprintprintprintConclusion This simple Python script provides only a first impression of how powerful and flexible vLMs have become. In combination with Pydantic and with the support of the powerful LangChain framework, vLMs can be turned into a meaningful solution for many document processing workflows, such as application processing or invoice handling. What experience have you had with vision Large Language Models? Do you have other fields in mind where such a workflow might be beneficial? Source Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor. Published via Towards AI #extracting #data #unstructured #documents

TOWARDSAI.NET

Extracting Data from Unstructured Documents

Author(s): Felix Pappe Originally published on Towards AI. Image created by the author using gpt-image-1 Introduction In the past, extracting specific information from documents or images using traditional methods could have become quickly cumbersome and frustrating, especially when the final results stray far from what you intended. The reasons for this can be diverse, ranging from overly complex document layouts to improperly formatted files, or an avalanche of visual elements, which machines struggle to interpret. However, vision-enabled language (vLMs)models have come to the rescue. Over the past months and years, these models have gained ever-greater capabilities, from rough image descriptions to detailed text extraction. Notably, the extraction of complex textual information from images has seen astonishing progress. This allows for rapid knowledge extraction from diverse document types without brittle, rule-based systems that break as soon as the document structure changes — and without the time-, data-, and cost-intensive specialized training of custom models. However, there is one flaw: vLMs, like their text-only counterparts, tend to produce verbose output around the information you actually want. Phrases such as “Of course, here is the information you requested” or “This is the extracted information about XYZ” commonly surround the essential content. You could use regular expressions together with advanced prompt engineering to constrain the vLM to output only the requested information. However, crafting the perfect prompt and matching regex for a given task is difficult and requires much trial and error. In this blog post, I’d like to introduce a simpler approach: combining the rich capabilities of vLLMs with the strict validation offered by Pydantic classes to extract exactly the desired information for your document processing pipeline. Description of post tackled issue The example in this blog post describes a situation that every job applicant has likely experienced many times. I am sure of it. After you have carefully and thoroughly created your CV, thinking about every word and maybe even every letter, you upload the file to a job portal. But after successfully uploading the file, including all the requested information, you are asked once again to fill out the same details in standard HTML forms by copying and pasting the information from your CV into the correct fields.Some companies attempt to autofill these fields based on the information extracted from your CV, but the results are often far from accurate or complete.In the following code, I combine Pixtral, LangChain, and Pydantic to provide a simple solution. The code extracts the first name, last name, phone number, email, and birthday from the CV if they exist. This helps keep the example simple and focuses on the technical aspects.The code can be easily adapted for other use cases or extended to extract all required information from a CV.So let us dive into the code. Code walkthrough Importing required libraries In the first step, the required libraries are imported, including: os, pathlib, and typing for standard Python modules providing filesystem access and type annotations base64 dontenv.env file into os.environ pydanticLLM output ChatMistralAILLM interface PIL import osimport base64from pathlib import Pathfrom typing import Optionalfrom dotenv import load_dotenvfrom pydantic import BaseModel, Fieldfrom langchain_mistralai.chat_models import ChatMistralAIfrom langchain_core.messages import HumanMessagefrom PIL import Image Loading environment variables Subsequently, the environment variables are loaded using load_dotenv(), and the MISTRAL_API_KEY is retrieved. load_dotenv()MISTRAL_API_KEY = os.getenv("MISTRAL_API_KEY")if not MISTRAL_API_KEY: raise ValueError("MISTRAL_API_KEY not set in environment") Defining the output schema with pydantic Following that, the output schema is defined using Pydantic. Pydantic is a Python library for data parsing and validation based on Python type hints. At its core, Pydantic’s BaseModel offers various useful features, such as the declaration of data types (e.g. str, int, List[str], nested models, etc.) and automatic coercion of incoming data into the required types when possible (e.g., converting "102" into 102). Moreover, it validates whether the incoming data matches the predefined schema and raises an error if it does not. Thanks to these clearly defined schemas, the data can be quickly serialized into other formats such as JSON. Likewise, Pydantic also allows the creation of document fields with metadata that tools such as LLMs can inspect and utilize. The next code block defines the structure of the expected output using Pydantic. These are the data points that the model should extract from the CV image.class BasicCV(BaseModel): first_name: Optional[str] = Field(None, description="first name") last_name: Optional[str] = Field(None, description="last name") phone: Optional[str] = Field(None, description="Telephone number") email: Optional[str] = Field(None, description="Email address") birthday: Optional[str] = Field(None, description="Date of birth (e.g., YYYY-MM-DD)") Converting images to base64 Subsequently, the first function is defined for the script. The function encode_image_to_base64() does exactly what its name suggests. It loads an image and converts it into a base64 string, which is passed into the vLM later. Moreover, an upscaling factor has been integrated. Although no additional information is gained by simply increasing the height and width, in my experience, the results tend to improve, especially in situations where the original resolution of the image is low. def encode_image_to_base64(image_path: Path, upscale_factor: float = 1.0) -> str: with Image.open(image_path) as img: if upscale_factor != 1.0: new_size = (int(img.width * upscale_factor), int(img.height * upscale_factor)) img = img.resize(new_size, Image.LANCZOS) from io import BytesIO buffer = BytesIO() img.save(buffer, format="PNG") image_bytes = buffer.getvalue() return base64.b64encode(image_bytes).decode() Processing the CV with a vision language model Now, let’s move on to the main function of this script. The process_cv() function begins by initializing the Mistral interface using a previously generated API key. This model is then wrapped using the .with_structured_output(BasicCV) function, in which the Pydantic model defined above is passed as input. If you are using a different vLM, make sure that it supports structured output, as not all vLMs do. Afterwards, the input image is converted into a base64 (b64) string, which is then transformed into a Uniform Resource Identifier (URI) by attaching a metadata string in front of the b64 string. Next, a simple system prompt is defined, which leaves room for improvement in more complex extraction tasks but works perfectly for this scenario. Finally, the URI and system prompt are combined into a LangChain HumanMessage, which is passed to the structured vLM. The model then returns the requested information in the previously defined Pydantic format. def process_cv() -> BasicCV: image_path: Path, api_key: Optional[str] = None llm = ChatMistralAI( model="pixtral-12b-latest", mistral_api_key=api_key or MISTRAL_API_KEY, ) structured_llm = llm.with_structured_output(BasicCV) image_b64 = encode_image_to_base64(image_path) data_uri = f"data:image/png;base64,{image_b64}" system_text = ( "Extract only the following fields from this CV: first name, last name, " "telephone number, email address, and birthday. Return JSON matching the schema." ) message = HumanMessage( content=[ {"type": "text", "text": system_text}, {"type": "image_url", "image_url": data_uri}, ] ) result: BasicCV = structured_llm.invoke([message]) return result Running the script This function is executed by the main, where the path is defined and the final information is printed out. if __name__ == "__main__": image_file = Path("cv-test.png") cv_data = process_cv(image_file) print(f"First Name: {cv_data.first_name}") print(f"Last Name: {cv_data.last_name}") print(f"Phone: {cv_data.phone}") print(f"Email: {cv_data.email}") print(f"Birthday: {cv_data.birthday}") Conclusion This simple Python script provides only a first impression of how powerful and flexible vLMs have become. In combination with Pydantic and with the support of the powerful LangChain framework, vLMs can be turned into a meaningful solution for many document processing workflows, such as application processing or invoice handling. What experience have you had with vision Large Language Models? Do you have other fields in mind where such a workflow might be beneficial? Source Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor. Published via Towards AI

·382 Просмотры

Войдите, чтобы отмечать, делиться и комментировать!
Marktechpost AI поделился ссылкой

2025-05-14 07:32:30 ·

A Step-by-Step Guide to Build a Fast Semantic Search and RAG QA Engine on Web-Scraped Data Using Together AI Embeddings, FAISS Retrieval, and LangChain

In this tutorial, we lean hard on Together AI’s growing ecosystem to show how quickly we can turn unstructured text into a question-answering service that cites its sources.
We’ll scrape a handful of live web pages, slice them into coherent chunks, and feed those chunks to the togethercomputer/m2-bert-80M-8k-retrieval embedding model.
Those vectors land in a FAISS index for millisecond similarity search, after which a lightweight ChatTogether model drafts answers that stay grounded in the retrieved passages.
Because Together AI handles embeddings and chat behind a single API key, we avoid juggling multiple providers, quotas, or SDK dialects.
!pip -q install --upgrade langchain-core langchain-community langchain-together
faiss-cpu tiktoken beautifulsoup4 html2text
This quiet (-q) pip command upgrades and installs everything the Colab RAG needs.
It pulls core LangChain libraries plus the Together AI integration, FAISS for vector search, token-handling with tiktoken, and lightweight HTML parsing via beautifulsoup4 and html2text, ensuring the notebook runs end-to-end without additional setup.
import os, getpass, warnings, textwrap, json
if "TOGETHER_API_KEY" not in os.environ:
os.environ["TOGETHER_API_KEY"] = getpass.getpass(" Enter your Together API key: ")
We check whether the TOGETHER_API_KEY environment variable is already set; if not, it securely prompts us for the key with getpass and stores it in os.environ.
The rest of the notebook can call Together AI’s API without hard‑coding secrets or exposing them in plain text by capturing the credentials once per runtime.
from langchain_community.document_loaders import WebBaseLoader
URLS = [
"https://python.langchain.com/docs/integrations/text_embedding/together/"," style="color: #0066cc;">https://python.langchain.com/docs/integrations/text_embedding/together/",
"https://api.together.xyz/"," style="color: #0066cc;">https://api.together.xyz/",
"https://together.ai/blog"" style="color: #0066cc;">https://together.ai/blog"
]
raw_docs = WebBaseLoader(URLS).load()
WebBaseLoader fetches each URL, strips boilerplate, and returns LangChain Document objects containing the clean page text plus metadata.
By passing a list of Together-related links, we immediately collect live documentation and blog content that will later be chunked and embedded for semantic search.
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100)
docs = splitter.split_documents(raw_docs)
print(f"Loaded {len(raw_docs)} pages → {len(docs)} chunks after splitting.")
RecursiveCharacterTextSplitter slices every fetched page into ~800-character segments with a 100-character overlap so contextual clues aren’t lost at chunk boundaries.
The resulting list docs holds these bite-sized LangChain Document objects, and the printout shows how many chunks were produced from the original pages, essential prep for high-quality embedding.
from langchain_together.embeddings import TogetherEmbeddings
embeddings = TogetherEmbeddings(
model="togethercomputer/m2-bert-80M-8k-retrieval"
)
from langchain_community.vectorstores import FAISS
vector_store = FAISS.from_documents(docs, embeddings)
Here we instantiate Together AI’s 80 M-parameter m2-bert retrieval model as a drop-in LangChain embedder, then feed every text chunk into it while FAISS.from_documents builds an in-memory vector index.
The resulting vector store supports millisecond-level cosine searches, turning our scraped pages into a searchable semantic database.
from langchain_together.chat_models import ChatTogether
llm = ChatTogether(
model="mistralai/Mistral-7B-Instruct-v0.3",
temperature=0.2,
max_tokens=512,
)
ChatTogether wraps a chat-tuned model hosted on Together AI, Mistral-7B-Instruct-v0.3 to be used like any other LangChain LLM.
A low temperature of 0.2 keeps answers grounded and repeatable, while max_tokens=512 leaves room for detailed, multi-paragraph responses without runaway cost.
from langchain.chains import RetrievalQA
qa_chain = RetrievalQA.from_chain_type(
llm=llm,
chain_type="stuff",
retriever=vector_store.as_retriever(search_kwargs={"k": 4}),
return_source_documents=True,
)
RetrievalQA stitches the pieces together: it takes our FAISS retriever (returning the top 4 similar chunks) and feeds those snippets into the llm using the simple “stuff” prompt template.
Setting return_source_documents=True means each answer will return with the exact passages it relied on, giving us instant, citation-ready Q-and-A.
QUESTION = "How do I use TogetherEmbeddings inside LangChain, and what model name should I pass?"
result = qa_chain(QUESTION)
print("n Answer:n", textwrap.fill(result['result'], 100))
print("n Sources:")
for doc in result['source_documents']:
print(" •", doc.metadata['source'])
Finally, we send a natural-language query through the qa_chain, which retrieves the four most relevant chunks, feeds them to the ChatTogether model, and returns a concise answer.
It then prints the formatted response, followed by a list of source URLs, giving us both the synthesized explanation and transparent citations in one shot.
Output from the Final Cell
In conclusion, in roughly fifty lines of code, we built a complete RAG loop powered end-to-end by Together AI: ingest, embed, store, retrieve, and converse.
The approach is deliberately modular, swap FAISS for Chroma, trade the 80 M-parameter embedder for Together’s larger multilingual model, or plug in a reranker without touching the rest of the pipeline.
What remains constant is the convenience of a unified Together AI backend: fast, affordable embeddings, chat models tuned for instruction following, and a generous free tier that makes experimentation painless.
Use this template to bootstrap an internal knowledge assistant, a documentation bot for customers, or a personal research aide.
Check out the Colab Notebook here. Also, feel free to follow us on Twitter and don’t forget to join our 90k+ ML SubReddit.
Asif RazzaqWebsite | + postsBioAsif Razzaq is the CEO of Marktechpost Media Inc..
As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good.
His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience.
The platform boasts of over 2 million monthly views, illustrating its popularity among audiences.Asif Razzaqhttps://www.marktechpost.com/author/6flvq/Agent-Based" style="color: #0066cc;">https://www.marktechpost.com/author/6flvq/Agent-Based Debugging Gets a Cost-Effective Alternative: Salesforce AI Presents SWERank for Accurate and Scalable Software Issue LocalizationAsif Razzaqhttps://www.marktechpost.com/author/6flvq/A" style="color: #0066cc;">https://www.marktechpost.com/author/6flvq/A Step-by-Step Guide to Deploy a Fully Integrated Firecrawl-Powered MCP Server on Claude Desktop with Smithery and VeryaXAsif Razzaqhttps://www.marktechpost.com/author/6flvq/OpenAI" style="color: #0066cc;">https://www.marktechpost.com/author/6flvq/OpenAI Releases HealthBench: An Open-Source Benchmark for Measuring the Performance and Safety of Large Language Models in HealthcareAsif Razzaqhttps://www.marktechpost.com/author/6flvq/PrimeIntellect" style="color: #0066cc;">https://www.marktechpost.com/author/6flvq/PrimeIntellect Releases INTELLECT-2: A 32B Reasoning Model Trained via Distributed Asynchronous Reinforcement Learning

Source: https://www.marktechpost.com/2025/05/14/step-by-step-guide-to-build-a-fast-semantic-search-and-rag-qa-engine-on-web-scraped-data-using-together-ai-embeddings-faiss-retrieval-and-langchain/" style="color: #0066cc;">https://www.marktechpost.com/2025/05/14/step-by-step-guide-to-build-a-fast-semantic-search-and-rag-qa-engine-on-web-scraped-data-using-together-ai-embeddings-faiss-retrieval-and-langchain/
#stepbystep #guide #build #fast #semantic #search #and #rag #engine #webscraped #data #using #together #embeddings #faiss #retrieval #langchain

A Step-by-Step Guide to Build a Fast Semantic Search and RAG QA Engine on Web-Scraped Data Using Together AI Embeddings, FAISS Retrieval, and LangChain
In this tutorial, we lean hard on Together AI’s growing ecosystem to show how quickly we can turn unstructured text into a question-answering service that cites its sources. We’ll scrape a handful of live web pages, slice them into coherent chunks, and feed those chunks to the togethercomputer/m2-bert-80M-8k-retrieval embedding model. Those vectors land in a FAISS index for millisecond similarity search, after which a lightweight ChatTogether model drafts answers that stay grounded in the retrieved passages. Because Together AI handles embeddings and chat behind a single API key, we avoid juggling multiple providers, quotas, or SDK dialects. !pip -q install --upgrade langchain-core langchain-community langchain-together faiss-cpu tiktoken beautifulsoup4 html2text This quiet (-q) pip command upgrades and installs everything the Colab RAG needs. It pulls core LangChain libraries plus the Together AI integration, FAISS for vector search, token-handling with tiktoken, and lightweight HTML parsing via beautifulsoup4 and html2text, ensuring the notebook runs end-to-end without additional setup. import os, getpass, warnings, textwrap, json if "TOGETHER_API_KEY" not in os.environ: os.environ["TOGETHER_API_KEY"] = getpass.getpass("🔑 Enter your Together API key: ") We check whether the TOGETHER_API_KEY environment variable is already set; if not, it securely prompts us for the key with getpass and stores it in os.environ. The rest of the notebook can call Together AI’s API without hard‑coding secrets or exposing them in plain text by capturing the credentials once per runtime. from langchain_community.document_loaders import WebBaseLoader URLS = [ "https://python.langchain.com/docs/integrations/text_embedding/together/", "https://api.together.xyz/", "https://together.ai/blog" ] raw_docs = WebBaseLoader(URLS).load() WebBaseLoader fetches each URL, strips boilerplate, and returns LangChain Document objects containing the clean page text plus metadata. By passing a list of Together-related links, we immediately collect live documentation and blog content that will later be chunked and embedded for semantic search. from langchain.text_splitter import RecursiveCharacterTextSplitter splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100) docs = splitter.split_documents(raw_docs) print(f"Loaded {len(raw_docs)} pages → {len(docs)} chunks after splitting.") RecursiveCharacterTextSplitter slices every fetched page into ~800-character segments with a 100-character overlap so contextual clues aren’t lost at chunk boundaries. The resulting list docs holds these bite-sized LangChain Document objects, and the printout shows how many chunks were produced from the original pages, essential prep for high-quality embedding. from langchain_together.embeddings import TogetherEmbeddings embeddings = TogetherEmbeddings( model="togethercomputer/m2-bert-80M-8k-retrieval" ) from langchain_community.vectorstores import FAISS vector_store = FAISS.from_documents(docs, embeddings) Here we instantiate Together AI’s 80 M-parameter m2-bert retrieval model as a drop-in LangChain embedder, then feed every text chunk into it while FAISS.from_documents builds an in-memory vector index. The resulting vector store supports millisecond-level cosine searches, turning our scraped pages into a searchable semantic database. from langchain_together.chat_models import ChatTogether llm = ChatTogether( model="mistralai/Mistral-7B-Instruct-v0.3", temperature=0.2, max_tokens=512, ) ChatTogether wraps a chat-tuned model hosted on Together AI, Mistral-7B-Instruct-v0.3 to be used like any other LangChain LLM. A low temperature of 0.2 keeps answers grounded and repeatable, while max_tokens=512 leaves room for detailed, multi-paragraph responses without runaway cost. from langchain.chains import RetrievalQA qa_chain = RetrievalQA.from_chain_type( llm=llm, chain_type="stuff", retriever=vector_store.as_retriever(search_kwargs={"k": 4}), return_source_documents=True, ) RetrievalQA stitches the pieces together: it takes our FAISS retriever (returning the top 4 similar chunks) and feeds those snippets into the llm using the simple “stuff” prompt template. Setting return_source_documents=True means each answer will return with the exact passages it relied on, giving us instant, citation-ready Q-and-A. QUESTION = "How do I use TogetherEmbeddings inside LangChain, and what model name should I pass?" result = qa_chain(QUESTION) print("n🤖 Answer:n", textwrap.fill(result['result'], 100)) print("n📄 Sources:") for doc in result['source_documents']: print(" •", doc.metadata['source']) Finally, we send a natural-language query through the qa_chain, which retrieves the four most relevant chunks, feeds them to the ChatTogether model, and returns a concise answer. It then prints the formatted response, followed by a list of source URLs, giving us both the synthesized explanation and transparent citations in one shot. Output from the Final Cell In conclusion, in roughly fifty lines of code, we built a complete RAG loop powered end-to-end by Together AI: ingest, embed, store, retrieve, and converse. The approach is deliberately modular, swap FAISS for Chroma, trade the 80 M-parameter embedder for Together’s larger multilingual model, or plug in a reranker without touching the rest of the pipeline. What remains constant is the convenience of a unified Together AI backend: fast, affordable embeddings, chat models tuned for instruction following, and a generous free tier that makes experimentation painless. Use this template to bootstrap an internal knowledge assistant, a documentation bot for customers, or a personal research aide. Check out the Colab Notebook here. Also, feel free to follow us on Twitter and don’t forget to join our 90k+ ML SubReddit. Asif RazzaqWebsite | + postsBioAsif Razzaq is the CEO of Marktechpost Media Inc.. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts of over 2 million monthly views, illustrating its popularity among audiences.Asif Razzaqhttps://www.marktechpost.com/author/6flvq/Agent-Based Debugging Gets a Cost-Effective Alternative: Salesforce AI Presents SWERank for Accurate and Scalable Software Issue LocalizationAsif Razzaqhttps://www.marktechpost.com/author/6flvq/A Step-by-Step Guide to Deploy a Fully Integrated Firecrawl-Powered MCP Server on Claude Desktop with Smithery and VeryaXAsif Razzaqhttps://www.marktechpost.com/author/6flvq/OpenAI Releases HealthBench: An Open-Source Benchmark for Measuring the Performance and Safety of Large Language Models in HealthcareAsif Razzaqhttps://www.marktechpost.com/author/6flvq/PrimeIntellect Releases INTELLECT-2: A 32B Reasoning Model Trained via Distributed Asynchronous Reinforcement Learning Source: https://www.marktechpost.com/2025/05/14/step-by-step-guide-to-build-a-fast-semantic-search-and-rag-qa-engine-on-web-scraped-data-using-together-ai-embeddings-faiss-retrieval-and-langchain/ #stepbystep #guide #build #fast #semantic #search #and #rag #engine #webscraped #data #using #together #embeddings #faiss #retrieval #langchain

WWW.MARKTECHPOST.COM

A Step-by-Step Guide to Build a Fast Semantic Search and RAG QA Engine on Web-Scraped Data Using Together AI Embeddings, FAISS Retrieval, and LangChain

In this tutorial, we lean hard on Together AI’s growing ecosystem to show how quickly we can turn unstructured text into a question-answering service that cites its sources. We’ll scrape a handful of live web pages, slice them into coherent chunks, and feed those chunks to the togethercomputer/m2-bert-80M-8k-retrieval embedding model. Those vectors land in a FAISS index for millisecond similarity search, after which a lightweight ChatTogether model drafts answers that stay grounded in the retrieved passages. Because Together AI handles embeddings and chat behind a single API key, we avoid juggling multiple providers, quotas, or SDK dialects. !pip -q install --upgrade langchain-core langchain-community langchain-together faiss-cpu tiktoken beautifulsoup4 html2text This quiet (-q) pip command upgrades and installs everything the Colab RAG needs. It pulls core LangChain libraries plus the Together AI integration, FAISS for vector search, token-handling with tiktoken, and lightweight HTML parsing via beautifulsoup4 and html2text, ensuring the notebook runs end-to-end without additional setup. import os, getpass, warnings, textwrap, json if "TOGETHER_API_KEY" not in os.environ: os.environ["TOGETHER_API_KEY"] = getpass.getpass("🔑 Enter your Together API key: ") We check whether the TOGETHER_API_KEY environment variable is already set; if not, it securely prompts us for the key with getpass and stores it in os.environ. The rest of the notebook can call Together AI’s API without hard‑coding secrets or exposing them in plain text by capturing the credentials once per runtime. from langchain_community.document_loaders import WebBaseLoader URLS = [ "https://python.langchain.com/docs/integrations/text_embedding/together/", "https://api.together.xyz/", "https://together.ai/blog" ] raw_docs = WebBaseLoader(URLS).load() WebBaseLoader fetches each URL, strips boilerplate, and returns LangChain Document objects containing the clean page text plus metadata. By passing a list of Together-related links, we immediately collect live documentation and blog content that will later be chunked and embedded for semantic search. from langchain.text_splitter import RecursiveCharacterTextSplitter splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100) docs = splitter.split_documents(raw_docs) print(f"Loaded {len(raw_docs)} pages → {len(docs)} chunks after splitting.") RecursiveCharacterTextSplitter slices every fetched page into ~800-character segments with a 100-character overlap so contextual clues aren’t lost at chunk boundaries. The resulting list docs holds these bite-sized LangChain Document objects, and the printout shows how many chunks were produced from the original pages, essential prep for high-quality embedding. from langchain_together.embeddings import TogetherEmbeddings embeddings = TogetherEmbeddings( model="togethercomputer/m2-bert-80M-8k-retrieval" ) from langchain_community.vectorstores import FAISS vector_store = FAISS.from_documents(docs, embeddings) Here we instantiate Together AI’s 80 M-parameter m2-bert retrieval model as a drop-in LangChain embedder, then feed every text chunk into it while FAISS.from_documents builds an in-memory vector index. The resulting vector store supports millisecond-level cosine searches, turning our scraped pages into a searchable semantic database. from langchain_together.chat_models import ChatTogether llm = ChatTogether( model="mistralai/Mistral-7B-Instruct-v0.3", temperature=0.2, max_tokens=512, ) ChatTogether wraps a chat-tuned model hosted on Together AI, Mistral-7B-Instruct-v0.3 to be used like any other LangChain LLM. A low temperature of 0.2 keeps answers grounded and repeatable, while max_tokens=512 leaves room for detailed, multi-paragraph responses without runaway cost. from langchain.chains import RetrievalQA qa_chain = RetrievalQA.from_chain_type( llm=llm, chain_type="stuff", retriever=vector_store.as_retriever(search_kwargs={"k": 4}), return_source_documents=True, ) RetrievalQA stitches the pieces together: it takes our FAISS retriever (returning the top 4 similar chunks) and feeds those snippets into the llm using the simple “stuff” prompt template. Setting return_source_documents=True means each answer will return with the exact passages it relied on, giving us instant, citation-ready Q-and-A. QUESTION = "How do I use TogetherEmbeddings inside LangChain, and what model name should I pass?" result = qa_chain(QUESTION) print("n🤖 Answer:n", textwrap.fill(result['result'], 100)) print("n📄 Sources:") for doc in result['source_documents']: print(" •", doc.metadata['source']) Finally, we send a natural-language query through the qa_chain, which retrieves the four most relevant chunks, feeds them to the ChatTogether model, and returns a concise answer. It then prints the formatted response, followed by a list of source URLs, giving us both the synthesized explanation and transparent citations in one shot. Output from the Final Cell In conclusion, in roughly fifty lines of code, we built a complete RAG loop powered end-to-end by Together AI: ingest, embed, store, retrieve, and converse. The approach is deliberately modular, swap FAISS for Chroma, trade the 80 M-parameter embedder for Together’s larger multilingual model, or plug in a reranker without touching the rest of the pipeline. What remains constant is the convenience of a unified Together AI backend: fast, affordable embeddings, chat models tuned for instruction following, and a generous free tier that makes experimentation painless. Use this template to bootstrap an internal knowledge assistant, a documentation bot for customers, or a personal research aide. Check out the Colab Notebook here. Also, feel free to follow us on Twitter and don’t forget to join our 90k+ ML SubReddit. Asif RazzaqWebsite | + postsBioAsif Razzaq is the CEO of Marktechpost Media Inc.. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts of over 2 million monthly views, illustrating its popularity among audiences.Asif Razzaqhttps://www.marktechpost.com/author/6flvq/Agent-Based Debugging Gets a Cost-Effective Alternative: Salesforce AI Presents SWERank for Accurate and Scalable Software Issue LocalizationAsif Razzaqhttps://www.marktechpost.com/author/6flvq/A Step-by-Step Guide to Deploy a Fully Integrated Firecrawl-Powered MCP Server on Claude Desktop with Smithery and VeryaXAsif Razzaqhttps://www.marktechpost.com/author/6flvq/OpenAI Releases HealthBench: An Open-Source Benchmark for Measuring the Performance and Safety of Large Language Models in HealthcareAsif Razzaqhttps://www.marktechpost.com/author/6flvq/PrimeIntellect Releases INTELLECT-2: A 32B Reasoning Model Trained via Distributed Asynchronous Reinforcement Learning

·475 Просмотры

Войдите, чтобы отмечать, делиться и комментировать!

Вступить

Языки

LLMs Struggle with Real Conversations: Microsoft and Salesforce Researchers Reveal a 39% Performance Drop in Multi-Turn Underspecified Tasks

Relying on file storage heritage, Box pivots to AI

Extracting Data from Unstructured Documents

A Step-by-Step Guide to Build a Fast Semantic Search and RAG QA Engine on Web-Scraped Data Using Together AI Embeddings, FAISS Retrieval, and LangChain