Step-by-Step Guide to Creating Synthetic Data Using the Synthetic Data Vault (SDV)
Real-world data is often costly, messy, and limited by privacy rules. Synthetic data offers a solution—and it’s already widely used:
LLMs train on AI-generated text
Fraud systems simulate edge cases
Vision models pretrain on fake images
SDV is an open-source Python library that generates realistic tabular data using machine learning. It learns patterns from real data and creates high-quality synthetic data for safe sharing, testing, and model training.
In this tutorial, we’ll use SDV to generate synthetic data step by step.
We will first install the sdv library:
pip install sdv
from sdv.io.local import CSVHandler
connector = CSVHandler()
FOLDER_NAME = '.'  # If the data is in the same directory

data = connector.read(folder_name=FOLDER_NAME)
salesDf = data['data']  # dictionary key assumed to match the CSV file name; adjust it to your file
Next, we import the necessary module and connect to our local folder containing the dataset files. This reads the CSV files from the specified folder and stores them as pandas DataFrames, keyed by table name. In this case, we access the main dataset from data and store it in salesDf (the exact key depends on your CSV file name).
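To sanity-check what was loaded before going further, you can inspect the returned dictionary and the DataFrame; a minimal sketch (the printed key shown in the comment is an assumption that depends on your CSV file name):
# The handler returns a dictionary mapping table names to DataFrames
print(data.keys())      # e.g., dict_keys(['data']) if the file is named data.csv (assumed)
print(salesDf.head())   # preview the first few rows of the main table
print(salesDf.dtypes)   # check the column types pandas inferred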
from sdv.metadata import Metadata
metadata = Metadata.load_from_json('metadata.json')  # file path assumed; point this at your metadata file
We now import the metadata for our dataset. This metadata is stored in a JSON file and tells SDV how to interpret your data. It includes:
The table name
The primary key
The data type of each column
Optional column formats like datetime patterns or ID patterns
Table relationships
Here is a sample metadata.json format:
{
  "METADATA_SPEC_VERSION": "V1",
  "tables": {
    "your_table_name": {
      "primary_key": "your_primary_key_column",
      "columns": {
        "your_primary_key_column": { "sdtype": "id", "regex_format": "T{6}" },
        "date_column": { "sdtype": "datetime", "datetime_format": "%d-%m-%Y" },
        "category_column": { "sdtype": "categorical" },
        "numeric_column": { "sdtype": "numerical" }
      },
      "column_relationships": []
    }
  }
}
from sdv.metadata import Metadata
metadata = Metadata.detect_from_dataframes(data)
Alternatively, we can use the SDV library to automatically infer the metadata. However, the results may not always be accurate or complete, so you might need to review and update it if there are any discrepancies.
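If the detected types look off, a minimal sketch of how you might review and correct them is shown below; the column name is purely illustrative and the exact method signatures may vary slightly across SDV versions:
# Inspect what SDV detected
print(metadata.to_dict())

# Hypothetical fix: force a column to be treated as a datetime with a specific format
metadata.update_column(
    column_name='date_column',   # assumed column name, for illustration only
    sdtype='datetime',
    datetime_format='%d-%m-%Y'
)

# Check that the resulting metadata is internally consistent
metadata.validate()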
from sdv.single_table import GaussianCopulaSynthesizer
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(data=salesDf)
synthetic_data = synthesizer.sample(num_rows=10000)  # row count shown here is illustrative
With the metadata and original dataset ready, we can now use SDV to train a model and generate synthetic data. The model learns the structure and patterns in your real dataset and uses that knowledge to create synthetic records.
You can control how many rows to generate using the num_rows argument.
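As a side note, once the synthesizer is fitted you can persist it and reload it later instead of retraining; a minimal sketch, with a placeholder file name:
# Save the fitted synthesizer to disk (file name is a placeholder)
synthesizer.save('sales_synthesizer.pkl')

# ...later, in another session...
from sdv.single_table import GaussianCopulaSynthesizer

loaded = GaussianCopulaSynthesizer.load('sales_synthesizer.pkl')
more_rows = loaded.sample(num_rows=500)  # generate as many additional rows as you need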
from sdv.evaluation.single_table import evaluate_quality
quality_report = evaluate_quality(
    real_data=salesDf,
    synthetic_data=synthetic_data,
    metadata=metadata
)
The SDV library also provides tools to evaluate the quality of your synthetic data by comparing it to the original dataset. A great place to start is by generating a quality report.
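Beyond the overall score printed while the report runs, the report object exposes a few helper methods you can use to drill down; a short sketch:
# Overall quality score between 0 and 1
print(quality_report.get_score())

# Scores broken down by property (e.g., Column Shapes, Column Pair Trends)
print(quality_report.get_properties())

# Column-level details for a single property
print(quality_report.get_details(property_name='Column Shapes'))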
You can also visualize how the synthetic data compares to the real data using SDV’s built-in plotting tools. For example, import get_column_plot from sdv.evaluation.single_table to create comparison plots for specific columns:
from sdv.evaluation.single_table import get_column_plot
fig = get_column_plot(
    real_data=salesDf,
    synthetic_data=synthetic_data,
    column_name='Sales',
    metadata=metadata
)
fig.show()
We can observe that the distribution of the 'Sales' column in the real and synthetic data is very similar. To explore further, we can use matplotlib to create more detailed comparisons, such as visualizing the average monthly sales trends across both datasets.
import pandas as pd
import matplotlib.pyplot as plt
# Ensure 'Date' columns are datetime
salesDf['Date'] = pd.to_datetime(salesDf['Date'])
synthetic_data['Date'] = pd.to_datetime(synthetic_data['Date'])

# Extract 'Month' as year-month string
salesDf['Month'] = salesDf['Date'].dt.to_period('M').astype(str)
synthetic_data['Month'] = synthetic_data['Date'].dt.to_period('M').astype(str)

# Group by 'Month' and calculate average sales
actual_avg_monthly = salesDf.groupby('Month')['Sales'].mean().rename('Actual Average Sales')
synthetic_avg_monthly = synthetic_data.groupby('Month')['Sales'].mean().rename('Synthetic Average Sales')

# Merge the two series into a DataFrame
avg_monthly_comparison = pd.concat([actual_avg_monthly, synthetic_avg_monthly], axis=1).fillna(0)

# Plot (figure size and label text are illustrative)
plt.figure(figsize=(12, 6))
plt.plot(avg_monthly_comparison.index, avg_monthly_comparison['Actual Average Sales'], marker='o', label='Actual Average Sales')
plt.plot(avg_monthly_comparison.index, avg_monthly_comparison['Synthetic Average Sales'], marker='o', label='Synthetic Average Sales')
plt.title('Average Monthly Sales: Actual vs. Synthetic')
plt.xlabel('Month')
plt.ylabel('Average Sales')
plt.xticks(rotation=45)
plt.grid(True)
plt.legend()
plt.ylim(bottom=0)  # y-axis starts at 0
plt.tight_layout()
plt.show()
This chart also shows that the average monthly sales in both datasets are very similar, with only minimal differences.
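For a numeric cross-check alongside the plots, you could also compare summary statistics of the 'Sales' column directly with pandas; a minimal sketch:
# Side-by-side summary statistics for the 'Sales' column
sales_summary = pd.concat(
    [
        salesDf['Sales'].describe().rename('Actual'),
        synthetic_data['Sales'].describe().rename('Synthetic'),
    ],
    axis=1,
)
print(sales_summary)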
In this tutorial, we demonstrated how to prepare your data and metadata for synthetic data generation using the SDV library. By training a model on your original dataset, SDV can create high-quality synthetic data that closely mirrors the real data’s patterns and distributions. We also explored how to evaluate and visualize the synthetic data, confirming that key metrics like sales distributions and monthly trends remain consistent. Synthetic data offers a powerful way to overcome privacy and availability challenges while enabling robust data analysis and machine learning workflows.