Extracting Data from Unstructured Documents
Author: Felix Pappe
Originally published on Towards AI.
Image created by the author using gpt-image-1
Introduction
In the past, extracting specific information from documents or images with traditional methods could quickly become cumbersome and frustrating, especially when the final results strayed far from what you intended. The reasons are diverse, ranging from overly complex document layouts to improperly formatted files or an avalanche of visual elements that machines struggle to interpret.
However, vision-enabled language models (vLMs) have come to the rescue. Over the past months and years, these models have gained ever-greater capabilities, from rough image descriptions to detailed text extraction. Notably, the extraction of complex textual information from images has seen astonishing progress. This allows for rapid knowledge extraction from diverse document types without brittle, rule-based systems that break as soon as the document structure changes, and without the time-, data-, and cost-intensive training of specialized custom models.
However, there is one flaw: vLMs, like their text-only counterparts, tend to produce verbose output around the information you actually want. Phrases such as “Of course, here is the information you requested” or “This is the extracted information about XYZ” commonly surround the essential content.
You could use regular expressions together with advanced prompt engineering to constrain the vLM to output only the requested information. However, crafting the perfect prompt and matching regex for a given task is difficult and requires much trial and error. In this blog post, I’d like to introduce a simpler approach: combining the rich capabilities of vLMs with the strict validation offered by Pydantic classes to extract exactly the desired information for your document processing pipeline.
Description of the issue tackled in this post
The example in this blog post describes a situation that every job applicant has likely experienced many times. I am sure of it.
After you have carefully and thoroughly created your CV, thinking about every word and maybe even every letter, you upload the file to a job portal. But after successfully uploading the file, including all the requested information, you are asked once again to fill out the same details in standard HTML forms by copying and pasting the information from your CV into the correct fields. Some companies attempt to autofill these fields based on the information extracted from your CV, but the results are often far from accurate or complete. In the following code, I combine Pixtral, LangChain, and Pydantic to provide a simple solution.
The code extracts the first name, last name, phone number, email, and birthday from the CV if they exist. This helps keep the example simple and focuses on the technical aspects. The code can be easily adapted for other use cases or extended to extract all required information from a CV. So let us dive into the code.
Code walkthrough
Importing required libraries
In the first step, the required libraries are imported, including:
os, pathlib, and typing: standard Python modules for filesystem access, path handling, and type annotations
base64: encodes the image bytes into a base64 string for the vLM
dotenv: loads variables from a .env file into os.environ
pydantic: provides BaseModel and Field for defining and validating the structured LLM output
ChatMistralAI and HumanMessage: the LangChain interface to the Mistral vLM
PIL: opens and optionally upscales the input image
import os
import base64
from pathlib import Path
from typing import Optional
from dotenv import load_dotenv
from pydantic import BaseModel, Field
from langchain_mistralai.chat_models import ChatMistralAI
from langchain_core.messages import HumanMessage
from PIL import Image
Loading environment variables
Subsequently, the environment variables are loaded using load_dotenv, and the MISTRAL_API_KEY is retrieved.
load_dotenv()
MISTRAL_API_KEY = os.getenv("MISTRAL_API_KEY")  # read the API key from the environment
if not MISTRAL_API_KEY:
    raise ValueError("MISTRAL_API_KEY is not set")

Defining the output schema with Pydantic
Following that, the output schema is defined using Pydantic. Pydantic is a Python library for data parsing and validation based on Python type hints. At its core, Pydantic’s BaseModel offers various useful features, such as the declaration of data types and the automatic coercion of incoming data into the required types when possible.
Moreover, it validates whether the incoming data matches the predefined schema and raises an error if it does not. Thanks to these clearly defined schemas, the data can be quickly serialized into other formats such as JSON. Likewise, Pydantic also allows fields to carry metadata, such as descriptions, that tools like LLMs can inspect and utilize. The next code block defines the structure of the expected output using Pydantic. These are the data points that the model should extract from the CV image.

class BasicCV(BaseModel):
    # All fields are optional; missing values default to None.
    # The description texts are illustrative and guide the LLM during extraction.
    first_name: Optional[str] = Field(None, description="First name of the applicant")
    last_name: Optional[str] = Field(None, description="Last name of the applicant")
    phone: Optional[str] = Field(None, description="Phone number")
    email: Optional[str] = Field(None, description="Email address")
    birthday: Optional[str] = Field(None, description="Birthday of the applicant")
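To make these features concrete, here is a minimal sketch, assuming Pydantic v2, of how the BasicCV schema behaves when fed data: omitted fields fall back to None, the record serializes straight to JSON, and a value of the wrong type raises a validation error. The sample values ("Jane", the email address) are made up for illustration.

from pydantic import ValidationError

record = BasicCV(first_name="Jane", email="jane@example.com")  # omitted fields default to None
print(record.model_dump_json())
# {"first_name":"Jane","last_name":null,"phone":null,"email":"jane@example.com","birthday":null}

try:
    BasicCV(first_name=["not", "a", "string"])  # wrong type is rejected by the schema
except ValidationError as err:
    print(err)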
Converting images to base64
Subsequently, the first function of the script is defined. The function encode_image_to_base64 does exactly what its name suggests. It loads an image and converts it into a base64 string, which is passed to the vLM later.
Moreover, an upscaling factor has been integrated. Although no additional information is gained by simply increasing the height and width, in my experience, the results tend to improve, especially in situations where the original resolution of the image is low.
def encode_image_to_base64(image_path: Path, upscale_factor: float = 1.0) -> str:
    with Image.open(image_path) as img:
        if upscale_factor != 1.0:
            # Optionally upscale the image; this tends to help with low-resolution originals
            new_size = (int(img.width * upscale_factor), int(img.height * upscale_factor))
            img = img.resize(new_size)
        from io import BytesIO
        buffer = BytesIO()
        img.save(buffer, format="PNG")
        image_bytes = buffer.getvalue()
    return base64.b64encode(image_bytes).decode("utf-8")

Processing the CV with a vision language model
Now, let’s move on to the main function of this script. The process_cv function begins by initializing the Mistral interface using a previously generated API key. This model is then wrapped using the .with_structured_output function, to which the Pydantic model defined above is passed. If you are using a different vLM, make sure that it supports structured output, as not all vLMs do.
Afterwards, the input image is converted into a base64 string, which is then transformed into a data URI (Uniform Resource Identifier) by attaching a metadata prefix in front of the base64 string.
Next, a simple system prompt is defined, which leaves room for improvement in more complex extraction tasks but works perfectly for this scenario.
Finally, the URI and system prompt are combined into a LangChain HumanMessage, which is passed to the structured vLM. The model then returns the requested information in the previously defined Pydantic format.
def process_cv(image_path: Path, api_key: Optional[str] = None) -> BasicCV:
    # The model name is assumed here; use whichever Pixtral vision model your Mistral account offers
    llm = ChatMistralAI(model="pixtral-12b-2409", api_key=api_key or MISTRAL_API_KEY)
    structured_llm = llm.with_structured_output(BasicCV)

    image_b64 = encode_image_to_base64(image_path)
    data_uri = f"data:image/png;base64,{image_b64}"

    # Simple instruction prompt (exact wording illustrative)
    system_text = (
        "Extract the first name, last name, phone number, email address, and "
        "birthday from this CV. If a field is not present, leave it empty."
    )
    message = HumanMessage(content=[
        {"type": "text", "text": system_text},
        {"type": "image_url", "image_url": {"url": data_uri}},
    ])

    result: BasicCV = structured_llm.invoke([message])
    return result
Running the script
This function is executed in the __main__ block, where the image path is defined and the extracted information is printed out.
if __name__ == "__main__":
    image_file = Path("cv.png")  # path to the CV image; the filename is illustrative
    cv_data = process_cv(image_file)
    print(cv_data.first_name)
    print(cv_data.last_name)
    print(cv_data.phone)
    print(cv_data.email)
    print(cv_data.birthday)

Conclusion
This simple Python script provides only a first impression of how powerful and flexible vLMs have become. In combination with Pydantic and with the support of the powerful LangChain framework, vLMs can be turned into a meaningful solution for many document processing workflows, such as application processing or invoice handling.
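As a rough illustration of that flexibility, the sketch below shows how the same pattern might be adapted to invoice extraction. The InvoiceData class, its fields, and the reused model name are hypothetical and not part of the original script; only the schema changes, the rest of the pipeline stays the same.

class InvoiceData(BaseModel):
    # Hypothetical invoice schema; adjust fields to the documents you actually process
    invoice_number: Optional[str] = Field(None, description="Invoice number")
    issue_date: Optional[str] = Field(None, description="Date the invoice was issued")
    vendor_name: Optional[str] = Field(None, description="Name of the issuing vendor")
    total_amount: Optional[str] = Field(None, description="Total amount including tax")

# Reuse the same pipeline, swapping only the schema:
# structured_llm = ChatMistralAI(model="pixtral-12b-2409", api_key=MISTRAL_API_KEY).with_structured_output(InvoiceData)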
What experience have you had with vision Large Language Models? Do you have other fields in mind where such a workflow might be beneficial?
Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor.
Published via Towards AI