How To Build a Benchmark for Your Models
I’ve been working as a data science consultant for the past three years, and I’ve had the opportunity to work on multiple projects across various industries. Yet, I noticed one common denominator among most of the clients I worked with:
They rarely have a clear idea of the project objective.
This is one of the main obstacles data scientists face, especially now that Gen AI is taking over every domain.
But let’s suppose that, after some back and forth, the objective becomes clear and we manage to pin down a specific question to answer. For example:
I want to classify my customers into two groups according to their probability of churning: “high likelihood to churn” and “low likelihood to churn”
Well, now what? Easy, let’s start building some models!
Wrong!
If having a clear objective is rare, having a reliable benchmark is even rarer.
In my opinion, one of the most important steps in delivering a data science project is defining and agreeing on a set of benchmarks with the client.
In this blog post, I’ll explain:
What a benchmark is,
Why it is important to have a benchmark,
How I would build one using an example scenario and
Some potential drawbacks to keep in mind
What is a benchmark?
A benchmark is a standardized way to evaluate the performance of a model. It provides a reference point against which new models can be compared.
A benchmark needs two key components to be considered complete:
A set of metrics to evaluate the performance
A set of simple models to use as baselines
The concept at its core is simple: every time I develop a new model I compare it against both previous versions and the baseline models. This ensures improvements are real and tracked.
It is essential to understand that this baseline shouldn’t be model- or dataset-specific, but rather business-case-specific. It should be a general benchmark for a given business case.
If I encounter a new dataset, with the same business objective, this benchmark should be a reliable reference point.
Why building a benchmark is important
Now that we’ve defined what a benchmark is, let’s dive into why I believe it’s worth spending an extra project week on the development of a strong benchmark.
Without a Benchmark you’re aiming for perfection — If you are working without a clear reference point, any result loses meaning. “My model has an MAE of 30,000.” Is that good? I don’t know! Maybe with a simple mean you would get an MAE of 25,000. By comparing your model to a baseline, you can measure both performance and improvement.
Improves Communication with Clients — Clients and business teams might not immediately understand the standard output of a model. However, by engaging them with simple baselines from the start, it becomes easier to demonstrate improvements later. In many cases, benchmarks come directly from the business in different shapes or forms.
Helps in Model Selection — A benchmark gives a starting point to compare multiple models fairly. Without it, you might waste time testing models that aren’t worth considering.
Model Drift Detection and Monitoring — Models can degrade over time. By having a benchmark you might be able to intercept drifts early by comparing new model outputs against past benchmarks and baselines.
Consistency Between Different Datasets — Datasets evolve. By having a fixed set of metrics and models you ensure that performance comparisons remain valid over time.
With a clear benchmark, every step in the model development will provide immediate feedback, making the whole process more intentional and data-driven.
How I would build a benchmark
I hope I’ve convinced you of the importance of having a benchmark. Now, let’s actually build one.
Let’s start from the business question we presented at the very beginning of this blog post:
I want to classify my customers into two groups according to their probability of churning: “high likelihood to churn” and “low likelihood to churn”
For simplicity, I’ll assume no additional business constraints, but in real-world scenarios, constraints often exist.
For this example, I am using this dataset. The data contains some attributes from a company’s customer base along with their churn status.
Now that we have something to work on, let’s build the benchmark:
1. Defining the metrics
We are dealing with a churn use case; in particular, this is a binary classification problem. Thus, the main metrics we could use are:
Precision — Percentage of correctly predicted churners among all predicted churners
Recall — Percentage of actual churners correctly identified
F1 score — Balances precision and recall
True Positives, False Positives, True Negatives, and False Negatives
These are some of the “simple” metrics that could be used to evaluate the output of a model.
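To make these concrete, here is a minimal sketch of how the standard metrics can be computed with scikit-learn; the toy y_true and y_pred arrays are purely illustrative and not taken from the dataset:

import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix

# Toy labels for illustration: y_true are the actual churn labels, y_pred the predicted ones
y_true = np.array([1, 0, 1, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0])

precision = precision_score(y_true, y_pred)  # correctly predicted churners / predicted churners
recall = recall_score(y_true, y_pred)        # correctly identified churners / actual churners
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()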
However, this is not an exhaustive list, and standard metrics aren’t always enough. In many use cases, it might be useful to build custom metrics.
Let’s assume that in our business case the customers labeled as “high likelihood to churn” are offered a discount. This creates:
A cost when offering the discount to a non-churning customer
A profit when retaining a churning customer
Following this definition, we can build a custom metric that will be crucial in our scenario:
import numpy as np

# Defining the business case-specific reference metric
def financial_gain(y_true, y_pred):
    # Cost of 250 when the discount goes to a non-churning customer (false positive)
    loss_from_fp = np.sum((y_pred == 1) & (y_true == 0)) * 250
    # Gain of 1000 when a churning customer is retained (true positive)
    gain_from_tp = np.sum((y_pred == 1) & (y_true == 1)) * 1000
    return gain_from_tp - loss_from_fp
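As a quick sanity check (toy arrays, not values from the dataset), the metric rewards retained churners and penalizes wasted discounts:

y_true = np.array([1, 1, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1])
print(financial_gain(y_true, y_pred))  # 2 true positives and 1 false positive: 2 * 1000 - 1 * 250 = 1750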
Business-driven metrics like this are usually the most relevant ones when you can define them. Such metrics can take any shape or form: financial goals, minimum requirements, percentage of coverage, and more.
2. Defining the benchmarks
Now that we’ve defined our metrics, we can define a set of baseline models to be used as a reference.
In this phase, you should define a list of simple-to-implement models in their simplest possible setup. There is no reason at this stage to spend time and resources on optimizing these models; my mindset is:
If I had 15 minutes, how would I implement this model?
You can always add more baseline models as the project proceeds.
In this case, I will use the following models:
Random Model — Assigns labels randomly
Majority Model — Always predicts the most frequent class
Simple XGB
Simple KNN
import numpy as np
import xgboost as xgb
from sklearn.neighbors import KNeighborsClassifier

# Each baseline exposes the same interface: run_benchmark(df_train, df_test) returns
# a vector of predicted labels; 'churn' is the label column (name assumed here).

class BinaryMean:
    @staticmethod
    def run_benchmark(df_train, df_test):
        # Random labels, weighted by the churn rate observed in training
        np.random.seed(0)  # fixed seed for reproducibility (value assumed)
        return np.random.choice([1, 0], size=len(df_test),
                                p=[df_train['churn'].mean(), 1 - df_train['churn'].mean()])

class SimpleXbg:
    @staticmethod
    def run_benchmark(df_train, df_test):
        # XGBoost with default hyperparameters
        model = xgb.XGBClassifier()
        model.fit(df_train.drop(columns='churn'), df_train['churn'])
        return model.predict(df_test.drop(columns='churn'))

class MajorityClass:
    @staticmethod
    def run_benchmark(df_train, df_test):
        # Always predicts the most frequent class in the training set
        majority_class = df_train['churn'].mode()[0]
        return np.full(len(df_test), majority_class)

class SimpleKNN:
    @staticmethod
    def run_benchmark(df_train, df_test):
        # k-nearest neighbours with default hyperparameters
        model = KNeighborsClassifier()
        model.fit(df_train.drop(columns='churn'), df_train['churn'])
        return model.predict(df_test.drop(columns='churn'))
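Because every baseline exposes the same run_benchmark(df_train, df_test) interface, they can be looped over and scored uniformly. A minimal usage sketch, assuming df_train and df_test are the train/test splits and 'churn' is the label column:

# Score every baseline with the business metric defined earlier
for baseline in [BinaryMean, MajorityClass, SimpleXbg, SimpleKNN]:
    preds = baseline.run_benchmark(df_train, df_test)
    print(baseline.__name__, financial_gain(df_test['churn'], preds))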
Again, as in the case of the metrics, we can build custom benchmarks.
Let’s assume that in our business case the marketing team contacts every client who is:
Over 50 years old, and
No longer active
Following this rule, we can build this model:
# Defining the business case-specific benchmark
# Column names ('age', 'is_active') are assumed for illustration
class BusinessBenchmark:
    @staticmethod
    def run_benchmark(df_train, df_test):
        df = df_test.copy()
        df.loc[:, 'pred'] = 0
        # Contact rule: clients over 50 who are no longer active
        df.loc[(df['age'] > 50) & (df['is_active'] == 0), 'pred'] = 1
        return df['pred']

3. Running the benchmark
To run the benchmark, I will use the following class. The entry point is the method compare_pred_with_benchmark, which, given a prediction, runs all the benchmark models and calculates all the metrics.
import numpy as np

class ChurnBinaryBenchmark:
    def __init__(self, metrics: list, benchmark_models: list):
        self.metrics = metrics
        self.benchmark_models = benchmark_models

    def compare_pred_with_benchmark(self, df_train, df_test, my_predictions):
        # Metrics for the model's own predictions ('churn' label column name assumed)
        output_metrics = {
            'Prediction': self._calculate_metrics(df_test['churn'], my_predictions)
        }
        # Metrics for every baseline model
        dct_benchmarks = {}
        for model in self.benchmark_models:
            dct_benchmarks[model.__name__] = model.run_benchmark(df_train, df_test)
            output_metrics[model.__name__] = self._calculate_metrics(
                df_test['churn'], dct_benchmarks[model.__name__])
        return output_metrics

    def _calculate_metrics(self, y_true, y_pred):
        return {getattr(func, '__name__', str(func)): func(y_true, y_pred)
                for func in self.metrics}
Now all we need is a prediction. For this example, I did some quick feature engineering and hyperparameter tuning.
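The exact modeling pipeline is not the focus of this post, but purely as an illustration (not the author's actual setup), the prediction could come from a lightly tuned XGBoost model; df_train, df_test, the 'churn' column, and model_pred are assumed names carried over into the snippet below:

import xgboost as xgb
from sklearn.model_selection import GridSearchCV

# Hypothetical tuning step: a small grid search over a couple of XGBoost parameters
X_train, y_train = df_train.drop(columns='churn'), df_train['churn']
X_test = df_test.drop(columns='churn')

grid = GridSearchCV(
    estimator=xgb.XGBClassifier(),
    param_grid={'max_depth': [3, 5, 7], 'n_estimators': [100, 300]},
    scoring='f1',
    cv=3,
)
grid.fit(X_train, y_train)
model_pred = grid.best_estimator_.predict(X_test)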
The last step is just to run the benchmark:
import pandas as pd

# lst_metrics and lst_benchmark_models are assumed to hold the metric functions
# and baseline classes defined above
binary_benchmark = ChurnBinaryBenchmark(metrics=lst_metrics, benchmark_models=lst_benchmark_models)
res = binary_benchmark.compare_pred_with_benchmark(df_train, df_test, model_pred)
pd.DataFrame(res)

Benchmark metrics comparison | Image by Author
This generates a comparison table of all models across all metrics. Using this table, it is possible to draw concrete conclusions on the model’s predictions and make informed decisions on the following steps of the process.
Some drawbacks
As we’ve seen, there are plenty of good reasons to build a benchmark. Still, there are some pitfalls to watch out for:
Non-Informative Benchmark — When the metrics or models are poorly defined, the marginal impact of having a benchmark decreases. Always define meaningful baselines.
Misinterpretation by Stakeholders — Communication with the client is essential: state clearly what each metric is measuring. The best model might not be the best on every defined metric.
Overfitting to the Benchmark — You might end up engineering features that are too specific: they may beat the benchmark but fail to generalize at prediction time. Don’t focus on beating the benchmark; focus on creating the best possible solution to the problem.
Change of Objective — The defined objectives might change due to miscommunication or shifts in plans. Keep your benchmark flexible so it can adapt when needed.
Final thoughts
Benchmarks provide clarity, ensure improvements are measurable, and create a shared reference point between data scientists and clients. They help avoid the trap of assuming a model is performing well without proof and ensure that every iteration brings real value.
They also act as a communication tool, making it easier to explain progress to clients. Instead of just presenting numbers, you can show clear comparisons that highlight improvements.
Here you can find a notebook with a full implementation from this blog post.