Поиск | CGShares

Поиск

Slashdot @Slashdot поделился ссылкой
2025-06-15 07:29:43 ·

Python Creator Guido van Rossum Asks: Is 'Worse is Better' Still True for Programming Languages?

In 1989 a computer scientist argued that more functionality in software actually lowers usability and practicality — leading to the counterintuitive proposition that "worse is better". But is that still true?

Python's original creator Guido van Rossum addressed the question last month in a lightning talk at the annual Python Language Summit 2025.

Guido started by recounting earlier periods of Python development from 35 years ago, where he used UNIX "almost exclusively" and thus "Python was greatly influenced by UNIX's 'worse is better' philosophy"... "The fact thatwasn't perfect encouraged many people to start contributing. All of the code was straightforward, there were no thoughts of optimization... These early contributors also now had a stake in the language;was also their baby"...

Guido contrasted early development to how Python is developed now: "features that take years to produce from teams of software developers paid by big tech companies. The static type system requires an academic-level understanding of esoteric type system features." And this isn't just Python the language, "third-party projects like numpy are maintained by folks who are paid full-time to do so.... Now we have a huge community, but very few people, relatively speaking, are contributing meaningfully."
Guido asked whether the expectation for Python contributors going forward would be that "you had to write a perfect PEP or create a perfect prototype that can be turned into production-ready code?" Guido pined for the "old days" where feature development could skip performance or feature-completion to get something into the hands of the community to "start kicking the tires". "Do we have to abandon 'worse is better' as a philosophy and try to make everything as perfect as possible?" Guido thought doing so "would be a shame", but that he "wasn't sure how to change it", acknowledging that core developers wouldn't want to create features and then break users with future releases.
Guido referenced David Hewitt's PyO3 talk about Rust and Python, and that development "was using worse is better," where there is a core feature set that works, and plenty of work to be done and open questions. "That sounds a lot more fun than working on core CPython", Guido paused, "...not that I'd ever personally learn Rust. Maybe I should give it a try after," which garnered laughter from core developers.

"Maybe we should do more of that: allowing contributors in the community to have a stake and care".

of this story at Slashdot.
#python #creator #guido #van #rossum

Python Creator Guido van Rossum Asks: Is 'Worse is Better' Still True for Programming Languages?
In 1989 a computer scientist argued that more functionality in software actually lowers usability and practicality — leading to the counterintuitive proposition that "worse is better". But is that still true? Python's original creator Guido van Rossum addressed the question last month in a lightning talk at the annual Python Language Summit 2025. Guido started by recounting earlier periods of Python development from 35 years ago, where he used UNIX "almost exclusively" and thus "Python was greatly influenced by UNIX's 'worse is better' philosophy"... "The fact thatwasn't perfect encouraged many people to start contributing. All of the code was straightforward, there were no thoughts of optimization... These early contributors also now had a stake in the language;was also their baby"... Guido contrasted early development to how Python is developed now: "features that take years to produce from teams of software developers paid by big tech companies. The static type system requires an academic-level understanding of esoteric type system features." And this isn't just Python the language, "third-party projects like numpy are maintained by folks who are paid full-time to do so.... Now we have a huge community, but very few people, relatively speaking, are contributing meaningfully." Guido asked whether the expectation for Python contributors going forward would be that "you had to write a perfect PEP or create a perfect prototype that can be turned into production-ready code?" Guido pined for the "old days" where feature development could skip performance or feature-completion to get something into the hands of the community to "start kicking the tires". "Do we have to abandon 'worse is better' as a philosophy and try to make everything as perfect as possible?" Guido thought doing so "would be a shame", but that he "wasn't sure how to change it", acknowledging that core developers wouldn't want to create features and then break users with future releases. Guido referenced David Hewitt's PyO3 talk about Rust and Python, and that development "was using worse is better," where there is a core feature set that works, and plenty of work to be done and open questions. "That sounds a lot more fun than working on core CPython", Guido paused, "...not that I'd ever personally learn Rust. Maybe I should give it a try after," which garnered laughter from core developers. "Maybe we should do more of that: allowing contributors in the community to have a stake and care". of this story at Slashdot. #python #creator #guido #van #rossum

Python Creator Guido van Rossum Asks: Is 'Worse is Better' Still True for Programming Languages?

developers.slashdot.org
In 1989 a computer scientist argued that more functionality in software actually lowers usability and practicality — leading to the counterintuitive proposition that "worse is better". But is that still true? Python's original creator Guido van Rossum addressed the question last month in a lightning talk at the annual Python Language Summit 2025. Guido started by recounting earlier periods of Python development from 35 years ago, where he used UNIX "almost exclusively" and thus "Python was greatly influenced by UNIX's 'worse is better' philosophy"... "The fact that [Python] wasn't perfect encouraged many people to start contributing. All of the code was straightforward, there were no thoughts of optimization... These early contributors also now had a stake in the language; [Python] was also their baby"... Guido contrasted early development to how Python is developed now: "features that take years to produce from teams of software developers paid by big tech companies. The static type system requires an academic-level understanding of esoteric type system features." And this isn't just Python the language, "third-party projects like numpy are maintained by folks who are paid full-time to do so.... Now we have a huge community, but very few people, relatively speaking, are contributing meaningfully." Guido asked whether the expectation for Python contributors going forward would be that "you had to write a perfect PEP or create a perfect prototype that can be turned into production-ready code?" Guido pined for the "old days" where feature development could skip performance or feature-completion to get something into the hands of the community to "start kicking the tires". "Do we have to abandon 'worse is better' as a philosophy and try to make everything as perfect as possible?" Guido thought doing so "would be a shame", but that he "wasn't sure how to change it", acknowledging that core developers wouldn't want to create features and then break users with future releases. Guido referenced David Hewitt's PyO3 talk about Rust and Python, and that development "was using worse is better," where there is a core feature set that works, and plenty of work to be done and open questions. "That sounds a lot more fun than working on core CPython", Guido paused, "...not that I'd ever personally learn Rust. Maybe I should give it a try after," which garnered laughter from core developers. "Maybe we should do more of that: allowing contributors in the community to have a stake and care". Read more of this story at Slashdot.

0 Комментарии ·0 Поделились ·0 предпросмотр

Войдите, чтобы отмечать, делиться и комментировать!
Towards Data Science @TowardsDataScience поделился ссылкой
2025-05-23 22:09:11 ·

Do More with NumPy Array Type Hints: Annotate & Validate Shape & Dtype

Improve static analysis and run-time validation with full generic specification
The post Do More with NumPy Array Type Hints: Annotate & Validate Shape & Dtype appeared first on Towards Data Science.
#more #with #numpy #array #type

Do More with NumPy Array Type Hints: Annotate & Validate Shape & Dtype
Improve static analysis and run-time validation with full generic specification The post Do More with NumPy Array Type Hints: Annotate & Validate Shape & Dtype appeared first on Towards Data Science. #more #with #numpy #array #type

towardsdatascience.com
Improve static analysis and run-time validation with full generic specification The post Do More with NumPy Array Type Hints: Annotate & Validate Shape & Dtype appeared first on Towards Data Science.

0 Комментарии ·0 Поделились ·0 предпросмотр

Войдите, чтобы отмечать, делиться и комментировать!
Towards Data Science @TowardsDataScience поделился ссылкой
2025-05-22 18:06:37 ·

What Statistics Can Tell Us About NBA Coaches

Who gets hired as an NBA coach? How long does a typical coach last? And does their coaching background play any part in predicting success?

This analysis was inspired by several key theories. First, there has been a common criticism among casual NBA fans that teams overly prefer hiring candidates with previous NBA head coaches experience.

Consequently, this analysis aims to answer two related questions. First, is it true that NBA teams frequently re-hire candidates with previous head coaching experience? And second, is there any evidence that these candidates under-perform relative to other candidates?

The second theory is that internal candidatesare often more successful than external candidates. This theory was derived from a pair of anecdotes. Two of the most successful coaches in NBA history, Gregg Popovich of San Antonio and Erik Spoelstra of Miami, were both internal hires. However, rigorous quantitative evidence is needed to test if this relationship holds over a larger sample.

This analysis aims to explore these questions, and provide the code to reproduce the analysis in Python.

The Data

The codeand dataset for this project are available on Github here. The analysis was performed using Python in Google Colaboratory.

A prerequisite to this analysis was determining a way to measure coaching success quantitatively. I decided on a simple idea: the success of a coach would be best measured by the length of their tenure in that job. Tenure best represents the differing expectations that might be placed on a coach. A coach hired to a contending team would be expected to win games and generate deep playoff runs. A coach hired to a rebuilding team might be judged on the development of younger players and their ability to build a strong culture. If a coach meets expectations, the team will keep them around.

Since there was no existing dataset with all of the required data, I collected the data myself from Wikipedia. I recorded every off-season coaching change from 1990 through 2021. Since the primary outcome variable is tenure, in-season coaching changes were excluded since these coaches often carried an “interim” tag—meaning they were intended to be temporary until a permanent replacement could be found.

In addition, the following variables were collected:

VariableDefinitionTeamThe NBA team the coach was hired forYearThe year the coach was hiredCoachThe name of the coachInternal?An indicator if the coach was internal or not—meaning they worked for the organization in some capacity immediately prior to being hired as head coachTypeThe background of the coach. Categories are Previous HC, Previous AC, College, Player, Management, and Foreign.YearsThe number of years a coach was employed in the role. For coaches fired mid-season, the value was counted as 0.5.

First, the dataset is imported from its location in Google Drive. I also convert ‘Internal?’ into a dummy variable, replacing “Yes” with 1 and “No” with 0.

from google.colab import drive
drive.mountimport pandas as pd
pd.set_option#Bring in the dataset
coach = pd.read_csv.iloccoach= coach.map)
coach

This prints a preview of what the dataset looks like:

In total, the dataset contains 221 coaching hires over this time.

Descriptive Statistics

First, basic summary Statistics are calculated and visualized to determine the backgrounds of NBA head coaches.

#Create chart of coaching background
import matplotlib.pyplot as plt

#Count number of coaches per category
counts = coach.value_counts#Create chart
plt.barplt.titleplt.figtextplt.xticksplt.ylabelplt.gca.spines.set_visibleplt.gca.spines.set_visiblefor i, value in enumerate:
plt.text)*100,1)) + '%' + '+ ')', ha='center', fontsize=9)
plt.savefigprint.sum/len)*100,1)) + " percent of coaches are internal.")

Over half of coaching hires previously served as an NBA head coach, and nearly 90% had NBA coaching experience of some kind. This answers the first question posed—NBA teams show a strong preference for experienced head coaches. If you get hired once as an NBA coach, your odds of being hired again are much higher. Additionally, 13.6% of hires are internal, confirming that teams do not frequently hire from their own ranks.

Second, I will explore the typical tenure of an NBA head coach. This can be visualized using a histogram.

#Create histogram
plt.histplt.titleplt.figtextplt.annotate', xy=, xytext=,
arrowprops=dict, fontsize=9, color='black')
plt.gca.spines.set_visibleplt.gca.spines.set_visibleplt.savefigplt.showcoach.sort_values#Calculate some stats with the data
import numpy as np

print) + " years is the median coaching tenure length.")
print.sum/len)*100,1)) + " percent of coaches last five years or less.")
print.sum/len*100,1)) + " percent of coaches last a year or less.")

Using tenure as an indicator of success, the the data clearly shows that the large majority of coaches are unsuccessful. The median tenure is just 2.5 seasons. 18.1% of coaches last a single season or less, and barely 10% of coaches last more than 5 seasons.

This can also be viewed as a survival analysis plot to see the drop-off at various points in time:

#Survival analysis
import matplotlib.ticker as mtick

lst = np.arangesurv = pd.DataFramesurv= np.nan

for i in range):
surv.iloc=.sum/lenplt.stepplt.titleplt.xlabel')
plt.figtextplt.gca.yaxis.set_major_formatter)
plt.gca.spines.set_visibleplt.gca.spines.set_visibleplt.savefigplt.show

Lastly, a box plot can be generated to see if there are any obvious differences in tenure based on coaching type. Boxplots also display outliers for each group.

#Create a boxplot
import seaborn as sns

sns.boxplotplt.titleplt.gca.spines.set_visibleplt.gca.spines.set_visibleplt.xlabelplt.xticksplt.figtextplt.savefigplt.show

There are some differences between the groups. Aside from management hires, previous head coaches have the longest average tenure at 3.3 years. However, since many of the groups have small sample sizes, we need to use more advanced techniques to test if the differences are statistically significant.

Statistical Analysis

First, to test if either Type or Internal has a statistically significant difference among the group means, we can use ANOVA:

#ANOVA
import statsmodels.api as sm
from statsmodels.formula.api import ols

am = ols+ C', data=coach).fitanova_table = sm.stats.anova_lmprintThe results show high p-values and low F-stats—indicating no evidence of statistically significant difference in means. Thus, the initial conclusion is that there is no evidence NBA teams are under-valuing internal candidates or over-valuing previous head coaching experience as initially hypothesized.

However, there is a possible distortion when comparing group averages. NBA coaches are signed to contracts that typically run between three and five years. Teams typically have to pay out the remainder of the contract even if coaches are dismissed early for poor performance. A coach that lasts two years may be no worse than one that lasts three or four years—the difference could simply be attributable to the length and terms of the initial contract, which is in turn impacted by the desirability of the coach in the job market. Since coaches with prior experience are highly coveted, they may use that leverage to negotiate longer contracts and/or higher salaries, both of which could deter teams from terminating their employment too early.

To account for this possibility, the outcome can be treated as binary rather than continuous. If a coach lasted more than 5 seasons, it is highly likely they completed at least their initial contract term and the team chose to extend or re-sign them. These coaches will be treated as successes, with those having a tenure of five years or less categorized as unsuccessful. To run this analysis, all coaching hires from 2020 and 2021 must be excluded, since they have not yet been able to eclipse 5 seasons.

With a binary dependent variable, a logistic regression can be used to test if any of the variables predict coaching success. Internal and Type are both converted to dummy variables. Since previous head coaches represent the most common coaching hires, I set this as the “reference” category against which the others will be measured against. Additionally, the dataset contains just one foreign-hired coachso this observation is dropped from the analysis.

#Logistic regression
coach3 = coach<2020]

coach3.loc= np.wherecoach_type_dummies = pd.get_dummies.astypecoach_type_dummies.dropcoach3 = pd.concat#Drop foreign category / David Blatt since n = 1
coach3 = coach3.dropcoach3 = coach3.loc!= "David Blatt"]

print)

x = coach3]
x = sm.add_constanty = coach3logm = sm.Logitlogm.r = logm.fitprint)

#Convert coefficients to odds ratio
print) + "is the odds ratio for internal.") #Internal coefficient
print) #Management
print) #Player
print) #Previous AC
print) #College

Consistent with ANOVA results, none of the variables are statistically significant under any conventional threshold. However, closer examination of the coefficients tells an interesting story.

The beta coefficients represent the change in the log-odds of the outcome. Since this is unintuitive to interpret, the coefficients can be converted to an Odds Ratio as follows:

Internal has an odds ratio of 0.23—indicating that internal candidates are 77% less likely to be successful compared to external candidates. Management has an odds ratio of 2.725, indicating these candidates are 172.5% more likely to be successful. The odds ratios for players is effectively zero, 0.696 for previous assistant coaches, and 0.5 for college coaches. Since three out of four coaching type dummy variables have an odds ratio under one, this indicates that only management hires were more likely to be successful than previous head coaches.

From a practical standpoint, these are large effect sizes. So why are the variables statistically insignificant?

The cause is a limited sample size of successful coaches. Out of 202 coaches remaining in the sample, just 23were successful. Regardless of the coach’s background, odds are low they last more than a few seasons. If we look at the one category able to outperform previous head coachesspecifically:

# Filter to management

manage = coach3== 1]
print)
printThe filtered dataset contains just 6 hires—of which just oneis classified as a success. In other words, the entire effect was driven by a single successful observation. Thus, it would take a considerably larger sample size to be confident if differences exist.

With a p-value of 0.202, the Internal variable comes the closest to statistical significance. Notably, however, the direction of the effect is actually the opposite of what was hypothesized—internal hires are less likely to be successful than external hires. Out of 26 internal hires, just onemet the criteria for success.

Conclusion

In conclusion, this analysis was able to draw several key conclusions:

Regardless of background, being an NBA coach is typically a short-lived job. It’s rare for a coach to last more than a few seasons.

The common wisdom that NBA teams strongly prefer to hire previous head coaches holds true. More than half of hires already had NBA head coaching experience.

If teams don’t hire an experienced head coach, they’re likely to hire an NBA assistant coach. Hires outside of these two categories are especially uncommon.

Though they are frequently hired, there is no evidence to suggest NBA teams overly prioritize previous head coaches. To the contrary, previous head coaches stay in the job longer on average and are more likely to outlast their initial contract term—though neither of these differences are statistically significant.

Despite high-profile anecdotes, there is no evidence to suggest that internal hires are more successful than external hires either.

Note: All images were created by the author unless otherwise credited.
The post What Statistics Can Tell Us About NBA Coaches appeared first on Towards Data Science.
#what #statistics #can #tell #about

What Statistics Can Tell Us About NBA Coaches
Who gets hired as an NBA coach? How long does a typical coach last? And does their coaching background play any part in predicting success? This analysis was inspired by several key theories. First, there has been a common criticism among casual NBA fans that teams overly prefer hiring candidates with previous NBA head coaches experience. Consequently, this analysis aims to answer two related questions. First, is it true that NBA teams frequently re-hire candidates with previous head coaching experience? And second, is there any evidence that these candidates under-perform relative to other candidates? The second theory is that internal candidatesare often more successful than external candidates. This theory was derived from a pair of anecdotes. Two of the most successful coaches in NBA history, Gregg Popovich of San Antonio and Erik Spoelstra of Miami, were both internal hires. However, rigorous quantitative evidence is needed to test if this relationship holds over a larger sample. This analysis aims to explore these questions, and provide the code to reproduce the analysis in Python. The Data The codeand dataset for this project are available on Github here. The analysis was performed using Python in Google Colaboratory. A prerequisite to this analysis was determining a way to measure coaching success quantitatively. I decided on a simple idea: the success of a coach would be best measured by the length of their tenure in that job. Tenure best represents the differing expectations that might be placed on a coach. A coach hired to a contending team would be expected to win games and generate deep playoff runs. A coach hired to a rebuilding team might be judged on the development of younger players and their ability to build a strong culture. If a coach meets expectations, the team will keep them around. Since there was no existing dataset with all of the required data, I collected the data myself from Wikipedia. I recorded every off-season coaching change from 1990 through 2021. Since the primary outcome variable is tenure, in-season coaching changes were excluded since these coaches often carried an “interim” tag—meaning they were intended to be temporary until a permanent replacement could be found. In addition, the following variables were collected: VariableDefinitionTeamThe NBA team the coach was hired forYearThe year the coach was hiredCoachThe name of the coachInternal?An indicator if the coach was internal or not—meaning they worked for the organization in some capacity immediately prior to being hired as head coachTypeThe background of the coach. Categories are Previous HC, Previous AC, College, Player, Management, and Foreign.YearsThe number of years a coach was employed in the role. For coaches fired mid-season, the value was counted as 0.5. First, the dataset is imported from its location in Google Drive. I also convert ‘Internal?’ into a dummy variable, replacing “Yes” with 1 and “No” with 0. from google.colab import drive drive.mountimport pandas as pd pd.set_option#Bring in the dataset coach = pd.read_csv.iloccoach= coach.map) coach This prints a preview of what the dataset looks like: In total, the dataset contains 221 coaching hires over this time. Descriptive Statistics First, basic summary Statistics are calculated and visualized to determine the backgrounds of NBA head coaches. #Create chart of coaching background import matplotlib.pyplot as plt #Count number of coaches per category counts = coach.value_counts#Create chart plt.barplt.titleplt.figtextplt.xticksplt.ylabelplt.gca.spines.set_visibleplt.gca.spines.set_visiblefor i, value in enumerate: plt.text)*100,1)) + '%' + '+ ')', ha='center', fontsize=9) plt.savefigprint.sum/len)*100,1)) + " percent of coaches are internal.") Over half of coaching hires previously served as an NBA head coach, and nearly 90% had NBA coaching experience of some kind. This answers the first question posed—NBA teams show a strong preference for experienced head coaches. If you get hired once as an NBA coach, your odds of being hired again are much higher. Additionally, 13.6% of hires are internal, confirming that teams do not frequently hire from their own ranks. Second, I will explore the typical tenure of an NBA head coach. This can be visualized using a histogram. #Create histogram plt.histplt.titleplt.figtextplt.annotate', xy=, xytext=, arrowprops=dict, fontsize=9, color='black') plt.gca.spines.set_visibleplt.gca.spines.set_visibleplt.savefigplt.showcoach.sort_values#Calculate some stats with the data import numpy as np print) + " years is the median coaching tenure length.") print.sum/len)*100,1)) + " percent of coaches last five years or less.") print.sum/len*100,1)) + " percent of coaches last a year or less.") Using tenure as an indicator of success, the the data clearly shows that the large majority of coaches are unsuccessful. The median tenure is just 2.5 seasons. 18.1% of coaches last a single season or less, and barely 10% of coaches last more than 5 seasons. This can also be viewed as a survival analysis plot to see the drop-off at various points in time: #Survival analysis import matplotlib.ticker as mtick lst = np.arangesurv = pd.DataFramesurv= np.nan for i in range): surv.iloc=.sum/lenplt.stepplt.titleplt.xlabel') plt.figtextplt.gca.yaxis.set_major_formatter) plt.gca.spines.set_visibleplt.gca.spines.set_visibleplt.savefigplt.show Lastly, a box plot can be generated to see if there are any obvious differences in tenure based on coaching type. Boxplots also display outliers for each group. #Create a boxplot import seaborn as sns sns.boxplotplt.titleplt.gca.spines.set_visibleplt.gca.spines.set_visibleplt.xlabelplt.xticksplt.figtextplt.savefigplt.show There are some differences between the groups. Aside from management hires, previous head coaches have the longest average tenure at 3.3 years. However, since many of the groups have small sample sizes, we need to use more advanced techniques to test if the differences are statistically significant. Statistical Analysis First, to test if either Type or Internal has a statistically significant difference among the group means, we can use ANOVA: #ANOVA import statsmodels.api as sm from statsmodels.formula.api import ols am = ols+ C', data=coach).fitanova_table = sm.stats.anova_lmprintThe results show high p-values and low F-stats—indicating no evidence of statistically significant difference in means. Thus, the initial conclusion is that there is no evidence NBA teams are under-valuing internal candidates or over-valuing previous head coaching experience as initially hypothesized. However, there is a possible distortion when comparing group averages. NBA coaches are signed to contracts that typically run between three and five years. Teams typically have to pay out the remainder of the contract even if coaches are dismissed early for poor performance. A coach that lasts two years may be no worse than one that lasts three or four years—the difference could simply be attributable to the length and terms of the initial contract, which is in turn impacted by the desirability of the coach in the job market. Since coaches with prior experience are highly coveted, they may use that leverage to negotiate longer contracts and/or higher salaries, both of which could deter teams from terminating their employment too early. To account for this possibility, the outcome can be treated as binary rather than continuous. If a coach lasted more than 5 seasons, it is highly likely they completed at least their initial contract term and the team chose to extend or re-sign them. These coaches will be treated as successes, with those having a tenure of five years or less categorized as unsuccessful. To run this analysis, all coaching hires from 2020 and 2021 must be excluded, since they have not yet been able to eclipse 5 seasons. With a binary dependent variable, a logistic regression can be used to test if any of the variables predict coaching success. Internal and Type are both converted to dummy variables. Since previous head coaches represent the most common coaching hires, I set this as the “reference” category against which the others will be measured against. Additionally, the dataset contains just one foreign-hired coachso this observation is dropped from the analysis. #Logistic regression coach3 = coach<2020] coach3.loc= np.wherecoach_type_dummies = pd.get_dummies.astypecoach_type_dummies.dropcoach3 = pd.concat#Drop foreign category / David Blatt since n = 1 coach3 = coach3.dropcoach3 = coach3.loc!= "David Blatt"] print) x = coach3] x = sm.add_constanty = coach3logm = sm.Logitlogm.r = logm.fitprint) #Convert coefficients to odds ratio print) + "is the odds ratio for internal.") #Internal coefficient print) #Management print) #Player print) #Previous AC print) #College Consistent with ANOVA results, none of the variables are statistically significant under any conventional threshold. However, closer examination of the coefficients tells an interesting story. The beta coefficients represent the change in the log-odds of the outcome. Since this is unintuitive to interpret, the coefficients can be converted to an Odds Ratio as follows: Internal has an odds ratio of 0.23—indicating that internal candidates are 77% less likely to be successful compared to external candidates. Management has an odds ratio of 2.725, indicating these candidates are 172.5% more likely to be successful. The odds ratios for players is effectively zero, 0.696 for previous assistant coaches, and 0.5 for college coaches. Since three out of four coaching type dummy variables have an odds ratio under one, this indicates that only management hires were more likely to be successful than previous head coaches. From a practical standpoint, these are large effect sizes. So why are the variables statistically insignificant? The cause is a limited sample size of successful coaches. Out of 202 coaches remaining in the sample, just 23were successful. Regardless of the coach’s background, odds are low they last more than a few seasons. If we look at the one category able to outperform previous head coachesspecifically: # Filter to management manage = coach3== 1] print) printThe filtered dataset contains just 6 hires—of which just oneis classified as a success. In other words, the entire effect was driven by a single successful observation. Thus, it would take a considerably larger sample size to be confident if differences exist. With a p-value of 0.202, the Internal variable comes the closest to statistical significance. Notably, however, the direction of the effect is actually the opposite of what was hypothesized—internal hires are less likely to be successful than external hires. Out of 26 internal hires, just onemet the criteria for success. Conclusion In conclusion, this analysis was able to draw several key conclusions: Regardless of background, being an NBA coach is typically a short-lived job. It’s rare for a coach to last more than a few seasons. The common wisdom that NBA teams strongly prefer to hire previous head coaches holds true. More than half of hires already had NBA head coaching experience. If teams don’t hire an experienced head coach, they’re likely to hire an NBA assistant coach. Hires outside of these two categories are especially uncommon. Though they are frequently hired, there is no evidence to suggest NBA teams overly prioritize previous head coaches. To the contrary, previous head coaches stay in the job longer on average and are more likely to outlast their initial contract term—though neither of these differences are statistically significant. Despite high-profile anecdotes, there is no evidence to suggest that internal hires are more successful than external hires either. Note: All images were created by the author unless otherwise credited. The post What Statistics Can Tell Us About NBA Coaches appeared first on Towards Data Science. #what #statistics #can #tell #about

What Statistics Can Tell Us About NBA Coaches

towardsdatascience.com
Who gets hired as an NBA coach? How long does a typical coach last? And does their coaching background play any part in predicting success? This analysis was inspired by several key theories. First, there has been a common criticism among casual NBA fans that teams overly prefer hiring candidates with previous NBA head coaches experience. Consequently, this analysis aims to answer two related questions. First, is it true that NBA teams frequently re-hire candidates with previous head coaching experience? And second, is there any evidence that these candidates under-perform relative to other candidates? The second theory is that internal candidates (though infrequently hired) are often more successful than external candidates. This theory was derived from a pair of anecdotes. Two of the most successful coaches in NBA history, Gregg Popovich of San Antonio and Erik Spoelstra of Miami, were both internal hires. However, rigorous quantitative evidence is needed to test if this relationship holds over a larger sample. This analysis aims to explore these questions, and provide the code to reproduce the analysis in Python. The Data The code (contained in a Jupyter notebook) and dataset for this project are available on Github here. The analysis was performed using Python in Google Colaboratory. A prerequisite to this analysis was determining a way to measure coaching success quantitatively. I decided on a simple idea: the success of a coach would be best measured by the length of their tenure in that job. Tenure best represents the differing expectations that might be placed on a coach. A coach hired to a contending team would be expected to win games and generate deep playoff runs. A coach hired to a rebuilding team might be judged on the development of younger players and their ability to build a strong culture. If a coach meets expectations (whatever those may be), the team will keep them around. Since there was no existing dataset with all of the required data, I collected the data myself from Wikipedia. I recorded every off-season coaching change from 1990 through 2021. Since the primary outcome variable is tenure, in-season coaching changes were excluded since these coaches often carried an “interim” tag—meaning they were intended to be temporary until a permanent replacement could be found. In addition, the following variables were collected: VariableDefinitionTeamThe NBA team the coach was hired forYearThe year the coach was hiredCoachThe name of the coachInternal?An indicator if the coach was internal or not—meaning they worked for the organization in some capacity immediately prior to being hired as head coachTypeThe background of the coach. Categories are Previous HC (prior NBA head coaching experience), Previous AC (prior NBA assistant coaching experience, but no head coaching experience), College (head coach of a college team), Player (a former NBA player with no coaching experience), Management (someone with front office experience but no coaching experience), and Foreign (someone coaching outside of North America with no NBA coaching experience).YearsThe number of years a coach was employed in the role. For coaches fired mid-season, the value was counted as 0.5. First, the dataset is imported from its location in Google Drive. I also convert ‘Internal?’ into a dummy variable, replacing “Yes” with 1 and “No” with 0. from google.colab import drive drive.mount('/content/drive') import pandas as pd pd.set_option('display.max_columns', None) #Bring in the dataset coach = pd.read_csv('/content/drive/MyDrive/Python_Files/Coaches.csv', on_bad_lines = 'skip').iloc[:,0:6] coach['Internal'] = coach['Internal?'].map(dict(Yes=1, No=0)) coach This prints a preview of what the dataset looks like: In total, the dataset contains 221 coaching hires over this time. Descriptive Statistics First, basic summary Statistics are calculated and visualized to determine the backgrounds of NBA head coaches. #Create chart of coaching background import matplotlib.pyplot as plt #Count number of coaches per category counts = coach['Type'].value_counts() #Create chart plt.bar(counts.index, counts.values, color = 'blue', edgecolor = 'black') plt.title('Where Do NBA Coaches Come From?') plt.figtext(0.76, -0.1, "Made by Brayden Gerrard", ha="center") plt.xticks(rotation = 45) plt.ylabel('Number of Coaches') plt.gca().spines['top'].set_visible(False) plt.gca().spines['right'].set_visible(False) for i, value in enumerate(counts.values): plt.text(i, value + 1, str(round((value/sum(counts.values))*100,1)) + '%' + ' (' + str(value) + ')', ha='center', fontsize=9) plt.savefig('coachtype.png', bbox_inches = 'tight') print(str(round(((coach['Internal'] == 1).sum()/len(coach))*100,1)) + " percent of coaches are internal.") Over half of coaching hires previously served as an NBA head coach, and nearly 90% had NBA coaching experience of some kind. This answers the first question posed—NBA teams show a strong preference for experienced head coaches. If you get hired once as an NBA coach, your odds of being hired again are much higher. Additionally, 13.6% of hires are internal, confirming that teams do not frequently hire from their own ranks. Second, I will explore the typical tenure of an NBA head coach. This can be visualized using a histogram. #Create histogram plt.hist(coach['Years'], bins =12, edgecolor = 'black', color = 'blue') plt.title('Distribution of Coaching Tenure') plt.figtext(0.76, 0, "Made by Brayden Gerrard", ha="center") plt.annotate('Erik Spoelstra (MIA)', xy=(16.4, 2), xytext=(14 + 1, 15), arrowprops=dict(facecolor='black', shrink=0.1), fontsize=9, color='black') plt.gca().spines['top'].set_visible(False) plt.gca().spines['right'].set_visible(False) plt.savefig('tenurehist.png', bbox_inches = 'tight') plt.show() coach.sort_values('Years', ascending = False) #Calculate some stats with the data import numpy as np print(str(np.median(coach['Years'])) + " years is the median coaching tenure length.") print(str(round(((coach['Years'] <= 5).sum()/len(coach))*100,1)) + " percent of coaches last five years or less.") print(str(round((coach['Years'] <= 1).sum()/len(coach)*100,1)) + " percent of coaches last a year or less.") Using tenure as an indicator of success, the the data clearly shows that the large majority of coaches are unsuccessful. The median tenure is just 2.5 seasons. 18.1% of coaches last a single season or less, and barely 10% of coaches last more than 5 seasons. This can also be viewed as a survival analysis plot to see the drop-off at various points in time: #Survival analysis import matplotlib.ticker as mtick lst = np.arange(0,18,0.5) surv = pd.DataFrame(lst, columns = ['Period']) surv['Number'] = np.nan for i in range(0,len(surv)): surv.iloc[i,1] = (coach['Years'] >= surv.iloc[i,0]).sum()/len(coach) plt.step(surv['Period'],surv['Number']) plt.title('NBA Coach Survival Rate') plt.xlabel('Coaching Tenure (Years)') plt.figtext(0.76, -0.05, "Made by Brayden Gerrard", ha="center") plt.gca().yaxis.set_major_formatter(mtick.PercentFormatter(1)) plt.gca().spines['top'].set_visible(False) plt.gca().spines['right'].set_visible(False) plt.savefig('coachsurvival.png', bbox_inches = 'tight') plt.show Lastly, a box plot can be generated to see if there are any obvious differences in tenure based on coaching type. Boxplots also display outliers for each group. #Create a boxplot import seaborn as sns sns.boxplot(data=coach, x='Type', y='Years') plt.title('Coaching Tenure by Coach Type') plt.gca().spines['top'].set_visible(False) plt.gca().spines['right'].set_visible(False) plt.xlabel('') plt.xticks(rotation = 30, ha = 'right') plt.figtext(0.76, -0.1, "Made by Brayden Gerrard", ha="center") plt.savefig('coachtypeboxplot.png', bbox_inches = 'tight') plt.show There are some differences between the groups. Aside from management hires (which have a sample of just six), previous head coaches have the longest average tenure at 3.3 years. However, since many of the groups have small sample sizes, we need to use more advanced techniques to test if the differences are statistically significant. Statistical Analysis First, to test if either Type or Internal has a statistically significant difference among the group means, we can use ANOVA: #ANOVA import statsmodels.api as sm from statsmodels.formula.api import ols am = ols('Years ~ C(Type) + C(Internal)', data=coach).fit() anova_table = sm.stats.anova_lm(am, typ=2) print(anova_table) The results show high p-values and low F-stats—indicating no evidence of statistically significant difference in means. Thus, the initial conclusion is that there is no evidence NBA teams are under-valuing internal candidates or over-valuing previous head coaching experience as initially hypothesized. However, there is a possible distortion when comparing group averages. NBA coaches are signed to contracts that typically run between three and five years. Teams typically have to pay out the remainder of the contract even if coaches are dismissed early for poor performance. A coach that lasts two years may be no worse than one that lasts three or four years—the difference could simply be attributable to the length and terms of the initial contract, which is in turn impacted by the desirability of the coach in the job market. Since coaches with prior experience are highly coveted, they may use that leverage to negotiate longer contracts and/or higher salaries, both of which could deter teams from terminating their employment too early. To account for this possibility, the outcome can be treated as binary rather than continuous. If a coach lasted more than 5 seasons, it is highly likely they completed at least their initial contract term and the team chose to extend or re-sign them. These coaches will be treated as successes, with those having a tenure of five years or less categorized as unsuccessful. To run this analysis, all coaching hires from 2020 and 2021 must be excluded, since they have not yet been able to eclipse 5 seasons. With a binary dependent variable, a logistic regression can be used to test if any of the variables predict coaching success. Internal and Type are both converted to dummy variables. Since previous head coaches represent the most common coaching hires, I set this as the “reference” category against which the others will be measured against. Additionally, the dataset contains just one foreign-hired coach (David Blatt) so this observation is dropped from the analysis. #Logistic regression coach3 = coach[coach['Year']<2020] coach3.loc[:, 'Success'] = np.where(coach3['Years'] > 5, 1, 0) coach_type_dummies = pd.get_dummies(coach3['Type'], prefix = 'Type').astype(int) coach_type_dummies.drop(columns=['Type_Previous HC'], inplace=True) coach3 = pd.concat([coach3, coach_type_dummies], axis = 1) #Drop foreign category / David Blatt since n = 1 coach3 = coach3.drop(columns=['Type_Foreign']) coach3 = coach3.loc[coach3['Coach'] != "David Blatt"] print(coach3['Success'].value_counts()) x = coach3[['Internal','Type_Management','Type_Player','Type_Previous AC', 'Type_College']] x = sm.add_constant(x) y = coach3['Success'] logm = sm.Logit(y,x) logm.r = logm.fit(maxiter=1000) print(logm.r.summary()) #Convert coefficients to odds ratio print(str(np.exp(-1.4715)) + "is the odds ratio for internal.") #Internal coefficient print(np.exp(1.0025)) #Management print(np.exp(-39.6956)) #Player print(np.exp(-0.3626)) #Previous AC print(np.exp(-0.6901)) #College Consistent with ANOVA results, none of the variables are statistically significant under any conventional threshold. However, closer examination of the coefficients tells an interesting story. The beta coefficients represent the change in the log-odds of the outcome. Since this is unintuitive to interpret, the coefficients can be converted to an Odds Ratio as follows: Internal has an odds ratio of 0.23—indicating that internal candidates are 77% less likely to be successful compared to external candidates. Management has an odds ratio of 2.725, indicating these candidates are 172.5% more likely to be successful. The odds ratios for players is effectively zero, 0.696 for previous assistant coaches, and 0.5 for college coaches. Since three out of four coaching type dummy variables have an odds ratio under one, this indicates that only management hires were more likely to be successful than previous head coaches. From a practical standpoint, these are large effect sizes. So why are the variables statistically insignificant? The cause is a limited sample size of successful coaches. Out of 202 coaches remaining in the sample, just 23 (11.4%) were successful. Regardless of the coach’s background, odds are low they last more than a few seasons. If we look at the one category able to outperform previous head coaches (management hires) specifically: # Filter to management manage = coach3[coach3['Type_Management'] == 1] print(manage['Success'].value_counts()) print(manage) The filtered dataset contains just 6 hires—of which just one (Steve Kerr with Golden State) is classified as a success. In other words, the entire effect was driven by a single successful observation. Thus, it would take a considerably larger sample size to be confident if differences exist. With a p-value of 0.202, the Internal variable comes the closest to statistical significance (though it still falls well short of a typical alpha of 0.05). Notably, however, the direction of the effect is actually the opposite of what was hypothesized—internal hires are less likely to be successful than external hires. Out of 26 internal hires, just one (Erik Spoelstra of Miami) met the criteria for success. Conclusion In conclusion, this analysis was able to draw several key conclusions: Regardless of background, being an NBA coach is typically a short-lived job. It’s rare for a coach to last more than a few seasons. The common wisdom that NBA teams strongly prefer to hire previous head coaches holds true. More than half of hires already had NBA head coaching experience. If teams don’t hire an experienced head coach, they’re likely to hire an NBA assistant coach. Hires outside of these two categories are especially uncommon. Though they are frequently hired, there is no evidence to suggest NBA teams overly prioritize previous head coaches. To the contrary, previous head coaches stay in the job longer on average and are more likely to outlast their initial contract term—though neither of these differences are statistically significant. Despite high-profile anecdotes, there is no evidence to suggest that internal hires are more successful than external hires either. Note: All images were created by the author unless otherwise credited. The post What Statistics Can Tell Us About NBA Coaches appeared first on Towards Data Science.

0 Комментарии ·0 Поделились ·0 предпросмотр

Войдите, чтобы отмечать, делиться и комментировать!
Towards Data Science @TowardsDataScience поделился ссылкой
2025-05-21 21:52:31 ·

Use PyTorch to Easily Access Your GPU

Let’s say you are lucky enough to have access to a system with an Nvidia Graphical Processing Unit. Did you know there is an absurdly easy method to use your GPU’s capabilities using a Python library intended and predominantly used for machine learningapplications?

Don’t worry if you’re not up to speed on the ins and outs of ML, since we won’t be using it in this article. Instead, I’ll show you how to use the PyTorch library to access and use the capabilities of your GPU. We’ll compare the run times of Python programs using the popular numerical library NumPy, running on the CPU, with equivalent code using PyTorch on the GPU.

Before continuing, let’s quickly recap what a GPU and Pytorch are.

What is a GPU?

A GPU is a specialised electronic chip initially designed to rapidly manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display device. Its utility as a rapid image manipulation device was based on its ability to perform many calculations simultaneously, and it’s still used for that purpose.

However, GPUs have recently become invaluable in machine learning, large language model training and development. Their inherent ability to perform highly parallelizable computations makes them ideal workhorses in these fields, as they employ complex mathematical models and simulations.

What is PyTorch?

PyTorch is an open-source machine learning library developed by Facebook’s AI Research Lab. It’s widely used for natural language processing and computer vision applications. Two of the main reasons that Pytorch can be used for GPU operations are,

One of PyTorch’s core data structures is the Tensor. Tensors are similar to arrays and matrices in other programming languages, but are optimised for running on a GPU.

Pytorch has CUDA support. PyTorch seamlessly integrates with CUDA, a parallel computing platform and programming model developed by NVIDIA for general computing on its GPUS. This allows PyTorch to access the GPU hardware directly, accelerating numerical computations. CUDA will enable developers to use PyTorch to write software that fully utilises GPU acceleration.

In summary, PyTorch’s support for GPU operations through CUDA and its efficient tensor manipulation capabilities make it an excellent tool for developing GPU-accelerated Python functions with high computational demands.

As we’ll show later on, you don’t have to use PyTorch to develop machine learning models or train large language models.

In the rest of this article, we’ll set up our development environment, install PyTorch and run through a few examples where we’ll compare some computationally heavy PyTorch implementations with the equivalent numpy implementation and see what, if any, performance differences we find.

Pre-requisites

An Nvidia GPU

You need an Nvidia GPU on your system. To check your GPU, issue the following command at your system prompt. I’m using the Windows Subsystem for Linux.

$ nvidia-smi

>>PS C:\Users\thoma> nvidia-smi
Fri Mar 22 11:41:34 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 551.61 Driver Version: 551.61 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4070 Ti WDDM | 00000000:01:00.0 On | N/A |
| 32% 24C P8 9W / 285W | 843MiB / 12282MiB | 1% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 1268 C+G ...tility\HPSystemEventUtilityHost.exe N/A |
| 0 N/A N/A 2204 C+G ...ekyb3d8bbwe\PhoneExperienceHost.exe N/A |
| 0 N/A N/A 3904 C+G ...cal\Microsoft\OneDrive\OneDrive.exe N/A |
| 0 N/A N/A 7068 C+G ...CBS_cw5n
etc ..

If that command isn’t recognised and you’re sure you have a GPU, it probably means you’re missing an NVIDIA driver. Just follow the rest of the instructions in this article, and it should be installed as part of that process.

Nvidia GPU drivers

While PyTorch installation packages can include CUDA libraries, your system must still install the appropriate NVIDIA GPU drivers. These drivers are necessary for your operating system to communicate with the graphics processing unithardware. The CUDA toolkit includes drivers, but if you’re using PyTorch’s bundled CUDA, you only need to ensure that your GPU drivers are current.

Click this link to go to the NVIDIA website and install the latest drivers compatible with your system and GPU specifications.

Setting up our development environment

As a best practice, we should set up a separate development environment for each project. I use conda, but use whatever method suits you.

If you want to go down the conda route and don’t already have it, you must install Minicondaor Anaconda first.

Please note that, at the time of writing, PyTorch currently only officially supports Python versions 3.8 to 3.11.

#create our test environment$ conda create -n pytorch_test python=3.11 -y

Now activate your new environment.$ conda activate pytorch_test

We now need to get the appropriate conda install command for PyTorch. This will depend on your operating system, chosen programming language, preferred package manager, and CUDA version.

Luckily, Pytorch provides a useful web interface that makes this easy to set up. So, to get started, head over to the Pytorch website at…

Click on the Get Started link near the top of the screen. From there, scroll down a little until you see this,

Image from Pytorch website

Click on each box in the appropriate position for your system and specs. As you do, you’ll see that the command in the Run this Command output field changes dynamically. When you’re done making your choices, copy the final command text shown and type it into your command window prompt.

For me, this was:-$ conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia -y

We’ll install Jupyter, Pandas, and Matplotlib to enable us to run our Python code in a notebook with our example code.$ conda install pandas matplotlib jupyter -y

Now type in jupyter notebook into your command prompt. You should see a jupyter notebook open in your browser. If that doesn’t happen automatically, you’ll likely see a screenful of information after the jupyter notebook command.

Near the bottom, there will be a URL that you should copy and paste into your browser to initiate the Jupyter Notebook.

Your URL will be different to mine, but it should look something like this:-

Testing our setup

The first thing we’ll do is test our setup. Please enter the following into a Jupyter cell and run it.

import torch
x = torch.randprintYou should see a similar output to the following.

tensorAdditionally, to check if your GPU driver and CUDA are enabled and accessible by PyTorch, run the following commands:

import torch
torch.cuda.is_availableThis should output True if all is OK.

If everything is okay, we can proceed to our examples. If not, go back and check your installation processes.

NB In the timings below, I ran each of the Numpy and PyTorch processes several times in succession and took the best time for each. This does favour the PyTorch runs somewhat as there is a small overhead on the very first invocation of each PyTorch run but, overall, I think it’s a fairer comparison.

Example 1 — A simple array math operation.

In this example, we set up two large, identical one-dimensional arrays and perform a simple addition to each array element.

import numpy as np
import torch as pt

from timeit import default_timer as timer

#func1 will run on the CPU
def func1:
a+= 1

#func2 will run on the GPU
def func2:
a+= 2

if __name__=="__main__":
n1 = 300000000
a1 = np.ones# had to make this array much smaller than
# the others due to slow loop processing on the GPU
n2 = 300000000
a2 = pt.onesstart = timerfunc1print-start)

start = timerfunc2#wait for all calcs on the GPU to complete
pt.cuda.synchronizeprint-start)
printprintprintTiming with CPU:numpy 0.1334826999955112
Timing with GPU:pytorch 0.10177790001034737

a1 =a2 = tensorWe see a slight improvement when using PyTorch over Numpy, but we missed one crucial point. We haven’t used the GPU because our PyTorch tensor data is still in CPU memory.

To move the data to the GPU memory, we need to add the device='cuda' directive when creating the tensor. Let’s do that and see if it makes a difference.

# Same code as above except
# to get the array data onto the GPU memory
# we changed

a2 = pt.ones# to

a2 = pt.onesAfter re-running with the changes we get,

Timing with CPU:numpy 0.12852740001108032
Timing with GPU:pytorch 0.011292399998637848

a1 =a2 = tensorThat’s more like it, a greater than 10x speed up.

Example 2—A slightly more complex array operation.

For this example, we’ll multiply multi-dimensional matrices using the built-in matmul operations available in the PyTorch and Numpy libraries. Each array will be 10000 x 10000 and contain random floating-point numbers between 1 and 100.

# NUMPY first
import numpy as np
from timeit import default_timer as timer

# Set the seed for reproducibility
np.random.seed# Generate two 10000x10000 arrays of random floating point numbers between 1 and 100
A = np.random.uniform).astypeB = np.random.uniform).astype# Perform matrix multiplication
start = timerC = np.matmul# Due to the large size of the matrices, it's not practical to print them entirely.
# Instead, we print a small portion to verify.
printprint-start)

A small portion of the result matrix:]

Without GPU: 1.4450852000009036

Now for the PyTorch version.

import torch
from timeit import default_timer as timer

# Set the seed for reproducibility
torch.manual_seed# Use the GPU
device = 'cuda'

# Generate two 10000x10000 tensors of random floating point
# numbers between 1 and 100 and move them to the GPU
#
A = torch.FloatTensor.uniform_.toB = torch.FloatTensor.uniform_.to# Perform matrix multiplication
start = timerC = torch.matmul# Wait for all current GPU operations to completetorch.cuda.synchronize# Due to the large size of the matrices, it's not practical to print them entirely.
# Instead, we print a small portion to verify.
printprint- start)

A small portion of the result matrix:]

With GPU: 0.07081239999388345

The PyTorch run was 20 times better this time than the NumPy run. Great stuff.

Example 3 — Combining CPU and GPU code.

Sometimes, not all of your processing can be done on a GPU. An everyday use case for this is graphing data. Sure, you can manipulate your data using the GPU, but often the next step is to see what your final dataset looks like using a plot.

You can’t plot data if it resides in the GPU memory, so you must move it back to CPU memory before calling your plotting functions. Is it worth the overhead of moving large chunks of data from the GPU to the CPU? Let’s find out.

In this example, we will solve this polar equation for values of θ between 0 and 2π incoordinate terms and then plot out the resulting graph.

Don’t get too hung up on the math. It’s just an equation that, when converted to use the x, y coordinate system and solved, looks nice when plotted.

For even a few million values of x and y, Numpy can solve this in milliseconds, so to make it a bit more interesting, we’ll use 100 millioncoordinates.

Here is the numpy code first.

%%time
import numpy as np
import matplotlib.pyplot as plt
from time import time as timer

start = timer# create an array of 100M thetas between 0 and 2pi
theta = np.linspace# our original polar formula
r = 1 + 3/4 * np.sin# calculate the equivalent x and y's coordinates
# for each theta
x = r * np.cosy = r * np.sin# see how long the calc part took
print-start)

# Now plot out the data
start = timerplt.plot# see how long the plotting part took
print-start)

Here is the output. Would you have guessed beforehand that it would look like this? I sure wouldn’t have!

Now, let’s see what the equivalent PyTorch implementation looks like and how much of a speed-up we get.

%%time
import torch as pt
import matplotlib.pyplot as plt
from time import time as timer

# Make sure PyTorch is using the GPU
device = 'cuda'

# Start the timer
start = timer# Creating the theta tensor on the GPU
theta = pt.linspace# Calculating r, x, and y using PyTorch operations on the GPU
r = 1 + 3/4 * pt.sinx = r * pt.cosy = r * pt.sin# Moving the result back to CPU for plotting
x_cpu = x.cpu.numpyy_cpu = y.cpu.numpypt.cuda.synchronizeprint- start)

# Plotting
start = timerplt.plotplt.showprint- start)

And our output again.

The calculation part was about 10 times more than the numpy calculation. The data plotting took around the same time using both the PyTorch and NumPy versions, which was expected since the data was still in CPU memory then, and the GPU played no further part in the processing.

But, overall, we shaved about 40% off the total run-time, which is excellent.

Summary

This article has demonstrated how to leverage an NVIDIA GPU using PyTorch—a machine learning library typically used for AI applications—to accelerate non-ML numerical Python code. It compares standard NumPyimplementations with GPU-accelerated PyTorch equivalents to show the performance benefits of running tensor-based operations on a GPU.

You don’t need to be doing machine learning to benefit from PyTorch. If you can access an NVIDIA GPU, PyTorch provides a simple and effective way to significantly speed up computationally intensive numerical operations—even in general-purpose Python code.
The post Use PyTorch to Easily Access Your GPU appeared first on Towards Data Science.
#use #pytorch #easily #access #your

Use PyTorch to Easily Access Your GPU
Let’s say you are lucky enough to have access to a system with an Nvidia Graphical Processing Unit. Did you know there is an absurdly easy method to use your GPU’s capabilities using a Python library intended and predominantly used for machine learningapplications? Don’t worry if you’re not up to speed on the ins and outs of ML, since we won’t be using it in this article. Instead, I’ll show you how to use the PyTorch library to access and use the capabilities of your GPU. We’ll compare the run times of Python programs using the popular numerical library NumPy, running on the CPU, with equivalent code using PyTorch on the GPU. Before continuing, let’s quickly recap what a GPU and Pytorch are. What is a GPU? A GPU is a specialised electronic chip initially designed to rapidly manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display device. Its utility as a rapid image manipulation device was based on its ability to perform many calculations simultaneously, and it’s still used for that purpose. However, GPUs have recently become invaluable in machine learning, large language model training and development. Their inherent ability to perform highly parallelizable computations makes them ideal workhorses in these fields, as they employ complex mathematical models and simulations. What is PyTorch? PyTorch is an open-source machine learning library developed by Facebook’s AI Research Lab. It’s widely used for natural language processing and computer vision applications. Two of the main reasons that Pytorch can be used for GPU operations are, One of PyTorch’s core data structures is the Tensor. Tensors are similar to arrays and matrices in other programming languages, but are optimised for running on a GPU. Pytorch has CUDA support. PyTorch seamlessly integrates with CUDA, a parallel computing platform and programming model developed by NVIDIA for general computing on its GPUS. This allows PyTorch to access the GPU hardware directly, accelerating numerical computations. CUDA will enable developers to use PyTorch to write software that fully utilises GPU acceleration. In summary, PyTorch’s support for GPU operations through CUDA and its efficient tensor manipulation capabilities make it an excellent tool for developing GPU-accelerated Python functions with high computational demands. As we’ll show later on, you don’t have to use PyTorch to develop machine learning models or train large language models. In the rest of this article, we’ll set up our development environment, install PyTorch and run through a few examples where we’ll compare some computationally heavy PyTorch implementations with the equivalent numpy implementation and see what, if any, performance differences we find. Pre-requisites An Nvidia GPU You need an Nvidia GPU on your system. To check your GPU, issue the following command at your system prompt. I’m using the Windows Subsystem for Linux. $ nvidia-smi >>PS C:\Users\thoma> nvidia-smi Fri Mar 22 11:41:34 2024 +-----------------------------------------------------------------------------------------+ | NVIDIA-SMI 551.61 Driver Version: 551.61 CUDA Version: 12.4 | |-----------------------------------------+------------------------+----------------------+ | GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC | | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | | | | MIG M. | |=========================================+========================+======================| | 0 NVIDIA GeForce RTX 4070 Ti WDDM | 00000000:01:00.0 On | N/A | | 32% 24C P8 9W / 285W | 843MiB / 12282MiB | 1% Default | | | | N/A | +-----------------------------------------+------------------------+----------------------+ +-----------------------------------------------------------------------------------------+ | Processes: | | GPU GI CI PID Type Process name GPU Memory | | ID ID Usage | |=========================================================================================| | 0 N/A N/A 1268 C+G ...tility\HPSystemEventUtilityHost.exe N/A | | 0 N/A N/A 2204 C+G ...ekyb3d8bbwe\PhoneExperienceHost.exe N/A | | 0 N/A N/A 3904 C+G ...cal\Microsoft\OneDrive\OneDrive.exe N/A | | 0 N/A N/A 7068 C+G ...CBS_cw5n etc .. If that command isn’t recognised and you’re sure you have a GPU, it probably means you’re missing an NVIDIA driver. Just follow the rest of the instructions in this article, and it should be installed as part of that process. Nvidia GPU drivers While PyTorch installation packages can include CUDA libraries, your system must still install the appropriate NVIDIA GPU drivers. These drivers are necessary for your operating system to communicate with the graphics processing unithardware. The CUDA toolkit includes drivers, but if you’re using PyTorch’s bundled CUDA, you only need to ensure that your GPU drivers are current. Click this link to go to the NVIDIA website and install the latest drivers compatible with your system and GPU specifications. Setting up our development environment As a best practice, we should set up a separate development environment for each project. I use conda, but use whatever method suits you. If you want to go down the conda route and don’t already have it, you must install Minicondaor Anaconda first. Please note that, at the time of writing, PyTorch currently only officially supports Python versions 3.8 to 3.11. #create our test environment$ conda create -n pytorch_test python=3.11 -y Now activate your new environment.$ conda activate pytorch_test We now need to get the appropriate conda install command for PyTorch. This will depend on your operating system, chosen programming language, preferred package manager, and CUDA version. Luckily, Pytorch provides a useful web interface that makes this easy to set up. So, to get started, head over to the Pytorch website at… Click on the Get Started link near the top of the screen. From there, scroll down a little until you see this, Image from Pytorch website Click on each box in the appropriate position for your system and specs. As you do, you’ll see that the command in the Run this Command output field changes dynamically. When you’re done making your choices, copy the final command text shown and type it into your command window prompt. For me, this was:-$ conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia -y We’ll install Jupyter, Pandas, and Matplotlib to enable us to run our Python code in a notebook with our example code.$ conda install pandas matplotlib jupyter -y Now type in jupyter notebook into your command prompt. You should see a jupyter notebook open in your browser. If that doesn’t happen automatically, you’ll likely see a screenful of information after the jupyter notebook command. Near the bottom, there will be a URL that you should copy and paste into your browser to initiate the Jupyter Notebook. Your URL will be different to mine, but it should look something like this:- Testing our setup The first thing we’ll do is test our setup. Please enter the following into a Jupyter cell and run it. import torch x = torch.randprintYou should see a similar output to the following. tensorAdditionally, to check if your GPU driver and CUDA are enabled and accessible by PyTorch, run the following commands: import torch torch.cuda.is_availableThis should output True if all is OK. If everything is okay, we can proceed to our examples. If not, go back and check your installation processes. NB In the timings below, I ran each of the Numpy and PyTorch processes several times in succession and took the best time for each. This does favour the PyTorch runs somewhat as there is a small overhead on the very first invocation of each PyTorch run but, overall, I think it’s a fairer comparison. Example 1 — A simple array math operation. In this example, we set up two large, identical one-dimensional arrays and perform a simple addition to each array element. import numpy as np import torch as pt from timeit import default_timer as timer #func1 will run on the CPU def func1: a+= 1 #func2 will run on the GPU def func2: a+= 2 if __name__=="__main__": n1 = 300000000 a1 = np.ones# had to make this array much smaller than # the others due to slow loop processing on the GPU n2 = 300000000 a2 = pt.onesstart = timerfunc1print-start) start = timerfunc2#wait for all calcs on the GPU to complete pt.cuda.synchronizeprint-start) printprintprintTiming with CPU:numpy 0.1334826999955112 Timing with GPU:pytorch 0.10177790001034737 a1 =a2 = tensorWe see a slight improvement when using PyTorch over Numpy, but we missed one crucial point. We haven’t used the GPU because our PyTorch tensor data is still in CPU memory. To move the data to the GPU memory, we need to add the device='cuda' directive when creating the tensor. Let’s do that and see if it makes a difference. # Same code as above except # to get the array data onto the GPU memory # we changed a2 = pt.ones# to a2 = pt.onesAfter re-running with the changes we get, Timing with CPU:numpy 0.12852740001108032 Timing with GPU:pytorch 0.011292399998637848 a1 =a2 = tensorThat’s more like it, a greater than 10x speed up. Example 2—A slightly more complex array operation. For this example, we’ll multiply multi-dimensional matrices using the built-in matmul operations available in the PyTorch and Numpy libraries. Each array will be 10000 x 10000 and contain random floating-point numbers between 1 and 100. # NUMPY first import numpy as np from timeit import default_timer as timer # Set the seed for reproducibility np.random.seed# Generate two 10000x10000 arrays of random floating point numbers between 1 and 100 A = np.random.uniform).astypeB = np.random.uniform).astype# Perform matrix multiplication start = timerC = np.matmul# Due to the large size of the matrices, it's not practical to print them entirely. # Instead, we print a small portion to verify. printprint-start) A small portion of the result matrix:] Without GPU: 1.4450852000009036 Now for the PyTorch version. import torch from timeit import default_timer as timer # Set the seed for reproducibility torch.manual_seed# Use the GPU device = 'cuda' # Generate two 10000x10000 tensors of random floating point # numbers between 1 and 100 and move them to the GPU # A = torch.FloatTensor.uniform_.toB = torch.FloatTensor.uniform_.to# Perform matrix multiplication start = timerC = torch.matmul# Wait for all current GPU operations to completetorch.cuda.synchronize# Due to the large size of the matrices, it's not practical to print them entirely. # Instead, we print a small portion to verify. printprint- start) A small portion of the result matrix:] With GPU: 0.07081239999388345 The PyTorch run was 20 times better this time than the NumPy run. Great stuff. Example 3 — Combining CPU and GPU code. Sometimes, not all of your processing can be done on a GPU. An everyday use case for this is graphing data. Sure, you can manipulate your data using the GPU, but often the next step is to see what your final dataset looks like using a plot. You can’t plot data if it resides in the GPU memory, so you must move it back to CPU memory before calling your plotting functions. Is it worth the overhead of moving large chunks of data from the GPU to the CPU? Let’s find out. In this example, we will solve this polar equation for values of θ between 0 and 2π incoordinate terms and then plot out the resulting graph. Don’t get too hung up on the math. It’s just an equation that, when converted to use the x, y coordinate system and solved, looks nice when plotted. For even a few million values of x and y, Numpy can solve this in milliseconds, so to make it a bit more interesting, we’ll use 100 millioncoordinates. Here is the numpy code first. %%time import numpy as np import matplotlib.pyplot as plt from time import time as timer start = timer# create an array of 100M thetas between 0 and 2pi theta = np.linspace# our original polar formula r = 1 + 3/4 * np.sin# calculate the equivalent x and y's coordinates # for each theta x = r * np.cosy = r * np.sin# see how long the calc part took print-start) # Now plot out the data start = timerplt.plot# see how long the plotting part took print-start) Here is the output. Would you have guessed beforehand that it would look like this? I sure wouldn’t have! Now, let’s see what the equivalent PyTorch implementation looks like and how much of a speed-up we get. %%time import torch as pt import matplotlib.pyplot as plt from time import time as timer # Make sure PyTorch is using the GPU device = 'cuda' # Start the timer start = timer# Creating the theta tensor on the GPU theta = pt.linspace# Calculating r, x, and y using PyTorch operations on the GPU r = 1 + 3/4 * pt.sinx = r * pt.cosy = r * pt.sin# Moving the result back to CPU for plotting x_cpu = x.cpu.numpyy_cpu = y.cpu.numpypt.cuda.synchronizeprint- start) # Plotting start = timerplt.plotplt.showprint- start) And our output again. The calculation part was about 10 times more than the numpy calculation. The data plotting took around the same time using both the PyTorch and NumPy versions, which was expected since the data was still in CPU memory then, and the GPU played no further part in the processing. But, overall, we shaved about 40% off the total run-time, which is excellent. Summary This article has demonstrated how to leverage an NVIDIA GPU using PyTorch—a machine learning library typically used for AI applications—to accelerate non-ML numerical Python code. It compares standard NumPyimplementations with GPU-accelerated PyTorch equivalents to show the performance benefits of running tensor-based operations on a GPU. You don’t need to be doing machine learning to benefit from PyTorch. If you can access an NVIDIA GPU, PyTorch provides a simple and effective way to significantly speed up computationally intensive numerical operations—even in general-purpose Python code. The post Use PyTorch to Easily Access Your GPU appeared first on Towards Data Science. #use #pytorch #easily #access #your

Use PyTorch to Easily Access Your GPU

towardsdatascience.com
Let’s say you are lucky enough to have access to a system with an Nvidia Graphical Processing Unit (Gpu). Did you know there is an absurdly easy method to use your GPU’s capabilities using a Python library intended and predominantly used for machine learning (ML) applications? Don’t worry if you’re not up to speed on the ins and outs of ML, since we won’t be using it in this article. Instead, I’ll show you how to use the PyTorch library to access and use the capabilities of your GPU. We’ll compare the run times of Python programs using the popular numerical library NumPy, running on the CPU, with equivalent code using PyTorch on the GPU. Before continuing, let’s quickly recap what a GPU and Pytorch are. What is a GPU? A GPU is a specialised electronic chip initially designed to rapidly manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display device. Its utility as a rapid image manipulation device was based on its ability to perform many calculations simultaneously, and it’s still used for that purpose. However, GPUs have recently become invaluable in machine learning, large language model training and development. Their inherent ability to perform highly parallelizable computations makes them ideal workhorses in these fields, as they employ complex mathematical models and simulations. What is PyTorch? PyTorch is an open-source machine learning library developed by Facebook’s AI Research Lab (FAIR). It’s widely used for natural language processing and computer vision applications. Two of the main reasons that Pytorch can be used for GPU operations are, One of PyTorch’s core data structures is the Tensor. Tensors are similar to arrays and matrices in other programming languages, but are optimised for running on a GPU. Pytorch has CUDA support. PyTorch seamlessly integrates with CUDA, a parallel computing platform and programming model developed by NVIDIA for general computing on its GPUS. This allows PyTorch to access the GPU hardware directly, accelerating numerical computations. CUDA will enable developers to use PyTorch to write software that fully utilises GPU acceleration. In summary, PyTorch’s support for GPU operations through CUDA and its efficient tensor manipulation capabilities make it an excellent tool for developing GPU-accelerated Python functions with high computational demands. As we’ll show later on, you don’t have to use PyTorch to develop machine learning models or train large language models. In the rest of this article, we’ll set up our development environment, install PyTorch and run through a few examples where we’ll compare some computationally heavy PyTorch implementations with the equivalent numpy implementation and see what, if any, performance differences we find. Pre-requisites An Nvidia GPU You need an Nvidia GPU on your system. To check your GPU, issue the following command at your system prompt. I’m using the Windows Subsystem for Linux (WSL). $ nvidia-smi >> (base) PS C:\Users\thoma> nvidia-smi Fri Mar 22 11:41:34 2024 +-----------------------------------------------------------------------------------------+ | NVIDIA-SMI 551.61 Driver Version: 551.61 CUDA Version: 12.4 | |-----------------------------------------+------------------------+----------------------+ | GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC | | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | | | | MIG M. | |=========================================+========================+======================| | 0 NVIDIA GeForce RTX 4070 Ti WDDM | 00000000:01:00.0 On | N/A | | 32% 24C P8 9W / 285W | 843MiB / 12282MiB | 1% Default | | | | N/A | +-----------------------------------------+------------------------+----------------------+ +-----------------------------------------------------------------------------------------+ | Processes: | | GPU GI CI PID Type Process name GPU Memory | | ID ID Usage | |=========================================================================================| | 0 N/A N/A 1268 C+G ...tility\HPSystemEventUtilityHost.exe N/A | | 0 N/A N/A 2204 C+G ...ekyb3d8bbwe\PhoneExperienceHost.exe N/A | | 0 N/A N/A 3904 C+G ...cal\Microsoft\OneDrive\OneDrive.exe N/A | | 0 N/A N/A 7068 C+G ...CBS_cw5n etc .. If that command isn’t recognised and you’re sure you have a GPU, it probably means you’re missing an NVIDIA driver. Just follow the rest of the instructions in this article, and it should be installed as part of that process. Nvidia GPU drivers While PyTorch installation packages can include CUDA libraries, your system must still install the appropriate NVIDIA GPU drivers. These drivers are necessary for your operating system to communicate with the graphics processing unit (GPU) hardware. The CUDA toolkit includes drivers, but if you’re using PyTorch’s bundled CUDA, you only need to ensure that your GPU drivers are current. Click this link to go to the NVIDIA website and install the latest drivers compatible with your system and GPU specifications. Setting up our development environment As a best practice, we should set up a separate development environment for each project. I use conda, but use whatever method suits you. If you want to go down the conda route and don’t already have it, you must install Miniconda (recommended) or Anaconda first. Please note that, at the time of writing, PyTorch currently only officially supports Python versions 3.8 to 3.11. #create our test environment (base) $ conda create -n pytorch_test python=3.11 -y Now activate your new environment. (base) $ conda activate pytorch_test We now need to get the appropriate conda install command for PyTorch. This will depend on your operating system, chosen programming language, preferred package manager, and CUDA version. Luckily, Pytorch provides a useful web interface that makes this easy to set up. So, to get started, head over to the Pytorch website at… https://pytorch.org Click on the Get Started link near the top of the screen. From there, scroll down a little until you see this, Image from Pytorch website Click on each box in the appropriate position for your system and specs. As you do, you’ll see that the command in the Run this Command output field changes dynamically. When you’re done making your choices, copy the final command text shown and type it into your command window prompt. For me, this was:- (pytorch_test) $ conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia -y We’ll install Jupyter, Pandas, and Matplotlib to enable us to run our Python code in a notebook with our example code. (pytroch_test) $ conda install pandas matplotlib jupyter -y Now type in jupyter notebook into your command prompt. You should see a jupyter notebook open in your browser. If that doesn’t happen automatically, you’ll likely see a screenful of information after the jupyter notebook command. Near the bottom, there will be a URL that you should copy and paste into your browser to initiate the Jupyter Notebook. Your URL will be different to mine, but it should look something like this:- http://127.0.0.1:8888/tree?token=3b9f7bd07b6966b41b68e2350721b2d0b6f388d248cc69da Testing our setup The first thing we’ll do is test our setup. Please enter the following into a Jupyter cell and run it. import torch x = torch.rand(5, 3) print(x) You should see a similar output to the following. tensor([[0.3715, 0.5503, 0.5783], [0.8638, 0.5206, 0.8439], [0.4664, 0.0557, 0.6280], [0.5704, 0.0322, 0.6053], [0.3416, 0.4090, 0.6366]]) Additionally, to check if your GPU driver and CUDA are enabled and accessible by PyTorch, run the following commands: import torch torch.cuda.is_available() This should output True if all is OK. If everything is okay, we can proceed to our examples. If not, go back and check your installation processes. NB In the timings below, I ran each of the Numpy and PyTorch processes several times in succession and took the best time for each. This does favour the PyTorch runs somewhat as there is a small overhead on the very first invocation of each PyTorch run but, overall, I think it’s a fairer comparison. Example 1 — A simple array math operation. In this example, we set up two large, identical one-dimensional arrays and perform a simple addition to each array element. import numpy as np import torch as pt from timeit import default_timer as timer #func1 will run on the CPU def func1(a): a+= 1 #func2 will run on the GPU def func2(a): a+= 2 if __name__=="__main__": n1 = 300000000 a1 = np.ones(n1, dtype = np.float64) # had to make this array much smaller than # the others due to slow loop processing on the GPU n2 = 300000000 a2 = pt.ones(n2,dtype=pt.float64) start = timer() func1(a1) print("Timing with CPU:numpy", timer()-start) start = timer() func2(a2) #wait for all calcs on the GPU to complete pt.cuda.synchronize() print("Timing with GPU:pytorch", timer()-start) print() print("a1 = ",a1) print("a2 = ",a2) Timing with CPU:numpy 0.1334826999955112 Timing with GPU:pytorch 0.10177790001034737 a1 = [2. 2. 2. ... 2. 2. 2.] a2 = tensor([3., 3., 3., ..., 3., 3., 3.], dtype=torch.float64) We see a slight improvement when using PyTorch over Numpy, but we missed one crucial point. We haven’t used the GPU because our PyTorch tensor data is still in CPU memory. To move the data to the GPU memory, we need to add the device='cuda' directive when creating the tensor. Let’s do that and see if it makes a difference. # Same code as above except # to get the array data onto the GPU memory # we changed a2 = pt.ones(n2,dtype=pt.float64) # to a2 = pt.ones(n2,dtype=pt.float64,device='cuda') After re-running with the changes we get, Timing with CPU:numpy 0.12852740001108032 Timing with GPU:pytorch 0.011292399998637848 a1 = [2. 2. 2. ... 2. 2. 2.] a2 = tensor([3., 3., 3., ..., 3., 3., 3.], device='cuda:0', dtype=torch.float64) That’s more like it, a greater than 10x speed up. Example 2—A slightly more complex array operation. For this example, we’ll multiply multi-dimensional matrices using the built-in matmul operations available in the PyTorch and Numpy libraries. Each array will be 10000 x 10000 and contain random floating-point numbers between 1 and 100. # NUMPY first import numpy as np from timeit import default_timer as timer # Set the seed for reproducibility np.random.seed(0) # Generate two 10000x10000 arrays of random floating point numbers between 1 and 100 A = np.random.uniform(low=1.0, high=100.0, size=(10000, 10000)).astype(np.float32) B = np.random.uniform(low=1.0, high=100.0, size=(10000, 10000)).astype(np.float32) # Perform matrix multiplication start = timer() C = np.matmul(A, B) # Due to the large size of the matrices, it's not practical to print them entirely. # Instead, we print a small portion to verify. print("A small portion of the result matrix:\n", C[:5, :5]) print("Without GPU:", timer()-start) A small portion of the result matrix: [[25461280. 25168352. 25212526. 25303304. 25277884.] [25114760. 25197558. 25340074. 25341850. 25373122.] [25381820. 25326522. 25438612. 25596932. 25538602.] [25317282. 25223540. 25272242. 25551428. 25467986.] [25327290. 25527838. 25499606. 25657218. 25527856.]] Without GPU: 1.4450852000009036 Now for the PyTorch version. import torch from timeit import default_timer as timer # Set the seed for reproducibility torch.manual_seed(0) # Use the GPU device = 'cuda' # Generate two 10000x10000 tensors of random floating point # numbers between 1 and 100 and move them to the GPU # A = torch.FloatTensor(10000, 10000).uniform_(1, 100).to(device) B = torch.FloatTensor(10000, 10000).uniform_(1, 100).to(device) # Perform matrix multiplication start = timer() C = torch.matmul(A, B) # Wait for all current GPU operations to complete (synchronize) torch.cuda.synchronize() # Due to the large size of the matrices, it's not practical to print them entirely. # Instead, we print a small portion to verify. print("A small portion of the result matrix:\n", C[:5, :5]) print("With GPU:", timer() - start) A small portion of the result matrix: [[25145748. 25495480. 25376196. 25446946. 25646938.] [25357524. 25678558. 25675806. 25459324. 25619908.] [25533988. 25632858. 25657696. 25616978. 25901294.] [25159630. 25230138. 25450480. 25221246. 25589418.] [24800246. 25145700. 25103040. 25012414. 25465890.]] With GPU: 0.07081239999388345 The PyTorch run was 20 times better this time than the NumPy run. Great stuff. Example 3 — Combining CPU and GPU code. Sometimes, not all of your processing can be done on a GPU. An everyday use case for this is graphing data. Sure, you can manipulate your data using the GPU, but often the next step is to see what your final dataset looks like using a plot. You can’t plot data if it resides in the GPU memory, so you must move it back to CPU memory before calling your plotting functions. Is it worth the overhead of moving large chunks of data from the GPU to the CPU? Let’s find out. In this example, we will solve this polar equation for values of θ between 0 and 2π in (x, y) coordinate terms and then plot out the resulting graph. Don’t get too hung up on the math. It’s just an equation that, when converted to use the x, y coordinate system and solved, looks nice when plotted. For even a few million values of x and y, Numpy can solve this in milliseconds, so to make it a bit more interesting, we’ll use 100 million (x, y) coordinates. Here is the numpy code first. %%time import numpy as np import matplotlib.pyplot as plt from time import time as timer start = timer() # create an array of 100M thetas between 0 and 2pi theta = np.linspace(0, 2*np.pi, 100000000) # our original polar formula r = 1 + 3/4 * np.sin(3*theta) # calculate the equivalent x and y's coordinates # for each theta x = r * np.cos(theta) y = r * np.sin(theta) # see how long the calc part took print("Finished with calcs ", timer()-start) # Now plot out the data start = timer() plt.plot(x,y) # see how long the plotting part took print("Finished with plot ", timer()-start) Here is the output. Would you have guessed beforehand that it would look like this? I sure wouldn’t have! Now, let’s see what the equivalent PyTorch implementation looks like and how much of a speed-up we get. %%time import torch as pt import matplotlib.pyplot as plt from time import time as timer # Make sure PyTorch is using the GPU device = 'cuda' # Start the timer start = timer() # Creating the theta tensor on the GPU theta = pt.linspace(0, 2 * pt.pi, 100000000, device=device) # Calculating r, x, and y using PyTorch operations on the GPU r = 1 + 3/4 * pt.sin(3 * theta) x = r * pt.cos(theta) y = r * pt.sin(theta) # Moving the result back to CPU for plotting x_cpu = x.cpu().numpy() y_cpu = y.cpu().numpy() pt.cuda.synchronize() print("Finished with calcs", timer() - start) # Plotting start = timer() plt.plot(x_cpu, y_cpu) plt.show() print("Finished with plot", timer() - start) And our output again. The calculation part was about 10 times more than the numpy calculation. The data plotting took around the same time using both the PyTorch and NumPy versions, which was expected since the data was still in CPU memory then, and the GPU played no further part in the processing. But, overall, we shaved about 40% off the total run-time, which is excellent. Summary This article has demonstrated how to leverage an NVIDIA GPU using PyTorch—a machine learning library typically used for AI applications—to accelerate non-ML numerical Python code. It compares standard NumPy (CPU-based) implementations with GPU-accelerated PyTorch equivalents to show the performance benefits of running tensor-based operations on a GPU. You don’t need to be doing machine learning to benefit from PyTorch. If you can access an NVIDIA GPU, PyTorch provides a simple and effective way to significantly speed up computationally intensive numerical operations—even in general-purpose Python code. The post Use PyTorch to Easily Access Your GPU appeared first on Towards Data Science.

0 Комментарии ·0 Поделились ·0 предпросмотр

Войдите, чтобы отмечать, делиться и комментировать!
Towards Data Science @TowardsDataScience поделился ссылкой
2025-05-20 23:30:41 ·

Optimizing Multi-Objective Problems with Desirability Functions

When working in Data Science, it is not uncommon to encounter problems with competing objectives. Whether designing products, tuning algorithms or optimizing portfolios, we often need to balance several metrics to get the best possible outcome. Sometimes, maximizing one metrics comes at the expense of another, making it hard to have an overall optimized solution.

While several solutions exist to solve multi-objective Optimization problems, I found desirability function to be both elegant and easy to explain to non-technical audience. Which makes them an interesting option to consider. Desirability functions will combine several metrics into a standardized score, allowing for a holistic optimization.

In this article, we’ll explore:

The mathematical foundation of desirability functions

How to implement these functions in Python

How to optimize a multi-objective problem with desirability functions

Visualization for interpretation and explanation of the results

To ground these concepts in a real example, we’ll apply desirability functions to optimize a bread baking: a toy problem with a few, interconnected parameters and competing quality objectives that will allow us to explore several optimization choices.

By the end of this article, you’ll have a powerful new tool in your data science toolkit for tackling multi-objective optimization problems across numerous domains, as well as a fully functional code available here on GitHub.

What are Desirability Functions?

Desirability functions were first formalized by Harringtonand later extended by Derringer and Suich. The idea is to:

Transform each response into a performance score between 0and 1Combine all scores into a single metric to maximize

Let’s explore the types of desirability functions and then how we can combine all the scores.

The different types of desirability functions

There are three different desirability functions, that would allow to handle many situations.

Smaller-is-better: Used when minimizing a response is desirable

def desirability_smaller_is_better-> float:
"""Calculate desirability function value where smaller values are better.

Args:
x: Input parameter value
x_min: Minimum acceptable value
x_max: Maximum acceptable value

Returns:
Desirability score between 0 and 1
"""
if x <= x_min:
return 1.0
elif x >= x_max:
return 0.0
else:
return/Larger-is-better: Used when maximizing a response is desirable

def desirability_larger_is_better-> float:
"""Calculate desirability function value where larger values are better.

Args:
x: Input parameter value
x_min: Minimum acceptable value
x_max: Maximum acceptable value

Returns:
Desirability score between 0 and 1
"""
if x <= x_min:
return 0.0
elif x >= x_max:
return 1.0
else:
return/Target-is-best: Used when a specific target value is optimal

def desirability_target_is_best-> float:
"""Calculate two-sided desirability function value with target value.

Args:
x: Input parameter value
x_min: Minimum acceptable value
x_target: Targetvalue
x_max: Maximum acceptable value

Returns:
Desirability score between 0 and 1
"""
if x_min <= x <= x_target:
return/elif x_target < x <= x_max:
return/else:
return 0.0

Every input parameter can be parameterized with one of these three desirability functions, before combining them into a single desirability score.

Combining Desirability Scores

Once individual metrics are transformed into desirability scores, they need to be combined into an overall desirability. The most common approach is the geometric mean:

Where di are individual desirability values and wi are weights reflecting the relative importance of each metric.

The geometric mean has an important property: if any single desirability is 0, the overall desirability is also 0, regardless of other values. This enforces that all requirements must be met to some extent.

def overall_desirability:
"""Compute overall desirability using geometric mean

Parameters:
-----------
desirabilities : list
Individual desirability scores
weights : list
Weights for each desirability

Returns:
--------
float
Overall desirability score
"""
if weights is None:
weights =* len# Convert to numpy arrays
d = np.arrayw = np.array# Calculate geometric mean
return np.prod**)

The weights are hyperparameters that give leverage on the final outcome and give room for customization.

A Practical Optimization Example: Bread Baking

To demonstrate desirability functions in action, let’s apply them to a toy problem: a bread baking optimization problem.

The Parameters and Quality Metrics

Let’s play with the following parameters:

Fermentation TimeFermentation TemperatureHydration LevelKneading TimeBaking TemperatureAnd let’s try to optimize these metrics:

Texture Quality: The texture of the bread

Flavor Profile: The flavor of the bread

Practicality: The practicality of the whole process

Of course, each of these metrics depends on more than one parameter. So here comes one of the most critical steps: mapping parameters to quality metrics.

For each quality metric, we need to define how parameters influence it:

def compute_flavor_profile-> float:
"""Compute flavor profile score based on input parameters.

Args:
params: List of parameter valuesReturns:
Weighted flavor profile score between 0 and 1
"""
# Flavor mainly affected by fermentation parameters
fermentation_d = desirability_larger_is_betterferment_temp_d = desirability_target_is_besthydration_d = desirability_target_is_best# Baking temperature has minimal effect on flavor
weights =return np.averageHere for example, the flavor is influenced by the following:

The fermentation time, with a minimum desirability below 30 minutes and a maximum desirability above 180 minutes

The fermentation temperature, with a maximum desirability peaking at 24 degrees Celsius

The hydration, with a maximum desirability peaking at 75% humidity

These computed parameters are then weighted averaged to return the flavor desirability. Similar computations and made for the texture quality and practicality.

The Objective Function

Following the desirability function approach, we’ll use the overall desirability as our objective function. The goal is to maximize this overall score, which means finding parameters that best satisfy all our three requirements simultaneously:

def objective_function-> float:
"""Compute overall desirability score based on individual quality metrics.

Args:
params: List of parameter values
weights: Weights for texture, flavor and practicality scores

Returns:
Negative overall desirability score"""
# Compute individual desirability scores
texture = compute_texture_qualityflavor = compute_flavor_profilepracticality = compute_practicality# Ensure weights sum up to one
weights = np.array/ np.sum# Calculate overall desirability using geometric mean
overall_d = overall_desirability# Return negative value since we want to maximize desirability
# but optimization functions typically minimize
return -overall_d

After computing the individual desirabilities for texture, flavor and practicality; the overall desirability is simply computed with a weighted geometric mean. It finally returns the negative overall desirability, so that it can be minimized.

Optimization with SciPy

We finally use SciPy’s minimize function to find optimal parameters. Since we returned the negative overall desirability as the objective function, minimizing it would maximize the overall desirability:

def optimize-> list:
# Define parameter bounds
bounds = {
'fermentation_time':,
'fermentation_temp':,
'hydration_level':,
'kneading_time':,
'baking_temp':}

# Initial guessx0 =# Run optimization
result = minimize,
bounds=list),
method='SLSQP'
)

return result.x

In this function, after defining the bounds for each parameter, the initial guess is computed as the middle of bounds, and then given as input to the minimize function of SciPy. The result is finally returned.

The weights are given as input to the optimizer too, and are a good way to customize the output. For example, with a larger weight on practicality, the optimized solution will focus on practicality over flavor and texture.

Let’s now visualize the results for a few sets of weights.

Visualization of Results

Let’s see how the optimizer handles different preference profiles, demonstrating the flexibility of desirability functions, given various input weights.

Let’s have a look at the results in case of weights favoring practicality:

Optimized parameters with weights favoring practicality. Image by author.

With weights largely in favor of practicality, the achieved overall desirability is 0.69, with a short kneading time of 5 minutes, since a high value impacts negatively the practicality.

Now, if we optimize with an emphasis on texture, we have slightly different results:

Optimized parameters with weights favoring texture. Image by author.

In this case, the achieved overall desirability is 0.85, significantly higher. The kneading time is this time 12 minutes, as a higher value impacts positively the texture and is not penalized so much because of practicality.

Conclusion: Practical Applications of Desirability Functions

While we focused on bread baking as our example, the same approach can be applied to various domains, such as product formulation in cosmetics or resource allocation in portfolio optimization.

Desirability functions provide a powerful mathematical framework for tackling multi-objective optimization problems across numerous data science applications. By transforming raw metrics into standardized desirability scores, we can effectively combine and optimize disparate objectives.

The key advantages of this approach include:

Standardized scales that make different metrics comparable and easy to combine into a single target

Flexibility to handle different types of objectives: minimize, maximize, target

Clear communication of preferences through mathematical functions

The code presented here provides a starting point for your own experimentation. Whether you’re optimizing industrial processes, machine learning models, or product formulations, hopefully desirability functions offer a systematic approach to finding the best compromise among competing objectives.
The post Optimizing Multi-Objective Problems with Desirability Functions appeared first on Towards Data Science.
#optimizing #multiobjective #problems #with #desirability

Optimizing Multi-Objective Problems with Desirability Functions
When working in Data Science, it is not uncommon to encounter problems with competing objectives. Whether designing products, tuning algorithms or optimizing portfolios, we often need to balance several metrics to get the best possible outcome. Sometimes, maximizing one metrics comes at the expense of another, making it hard to have an overall optimized solution. While several solutions exist to solve multi-objective Optimization problems, I found desirability function to be both elegant and easy to explain to non-technical audience. Which makes them an interesting option to consider. Desirability functions will combine several metrics into a standardized score, allowing for a holistic optimization. In this article, we’ll explore: The mathematical foundation of desirability functions How to implement these functions in Python How to optimize a multi-objective problem with desirability functions Visualization for interpretation and explanation of the results To ground these concepts in a real example, we’ll apply desirability functions to optimize a bread baking: a toy problem with a few, interconnected parameters and competing quality objectives that will allow us to explore several optimization choices. By the end of this article, you’ll have a powerful new tool in your data science toolkit for tackling multi-objective optimization problems across numerous domains, as well as a fully functional code available here on GitHub. What are Desirability Functions? Desirability functions were first formalized by Harringtonand later extended by Derringer and Suich. The idea is to: Transform each response into a performance score between 0and 1Combine all scores into a single metric to maximize Let’s explore the types of desirability functions and then how we can combine all the scores. The different types of desirability functions There are three different desirability functions, that would allow to handle many situations. Smaller-is-better: Used when minimizing a response is desirable def desirability_smaller_is_better-> float: """Calculate desirability function value where smaller values are better. Args: x: Input parameter value x_min: Minimum acceptable value x_max: Maximum acceptable value Returns: Desirability score between 0 and 1 """ if x <= x_min: return 1.0 elif x >= x_max: return 0.0 else: return/Larger-is-better: Used when maximizing a response is desirable def desirability_larger_is_better-> float: """Calculate desirability function value where larger values are better. Args: x: Input parameter value x_min: Minimum acceptable value x_max: Maximum acceptable value Returns: Desirability score between 0 and 1 """ if x <= x_min: return 0.0 elif x >= x_max: return 1.0 else: return/Target-is-best: Used when a specific target value is optimal def desirability_target_is_best-> float: """Calculate two-sided desirability function value with target value. Args: x: Input parameter value x_min: Minimum acceptable value x_target: Targetvalue x_max: Maximum acceptable value Returns: Desirability score between 0 and 1 """ if x_min <= x <= x_target: return/elif x_target < x <= x_max: return/else: return 0.0 Every input parameter can be parameterized with one of these three desirability functions, before combining them into a single desirability score. Combining Desirability Scores Once individual metrics are transformed into desirability scores, they need to be combined into an overall desirability. The most common approach is the geometric mean: Where di are individual desirability values and wi are weights reflecting the relative importance of each metric. The geometric mean has an important property: if any single desirability is 0, the overall desirability is also 0, regardless of other values. This enforces that all requirements must be met to some extent. def overall_desirability: """Compute overall desirability using geometric mean Parameters: ----------- desirabilities : list Individual desirability scores weights : list Weights for each desirability Returns: -------- float Overall desirability score """ if weights is None: weights =* len# Convert to numpy arrays d = np.arrayw = np.array# Calculate geometric mean return np.prod**) The weights are hyperparameters that give leverage on the final outcome and give room for customization. A Practical Optimization Example: Bread Baking To demonstrate desirability functions in action, let’s apply them to a toy problem: a bread baking optimization problem. The Parameters and Quality Metrics Let’s play with the following parameters: Fermentation TimeFermentation TemperatureHydration LevelKneading TimeBaking TemperatureAnd let’s try to optimize these metrics: Texture Quality: The texture of the bread Flavor Profile: The flavor of the bread Practicality: The practicality of the whole process Of course, each of these metrics depends on more than one parameter. So here comes one of the most critical steps: mapping parameters to quality metrics. For each quality metric, we need to define how parameters influence it: def compute_flavor_profile-> float: """Compute flavor profile score based on input parameters. Args: params: List of parameter valuesReturns: Weighted flavor profile score between 0 and 1 """ # Flavor mainly affected by fermentation parameters fermentation_d = desirability_larger_is_betterferment_temp_d = desirability_target_is_besthydration_d = desirability_target_is_best# Baking temperature has minimal effect on flavor weights =return np.averageHere for example, the flavor is influenced by the following: The fermentation time, with a minimum desirability below 30 minutes and a maximum desirability above 180 minutes The fermentation temperature, with a maximum desirability peaking at 24 degrees Celsius The hydration, with a maximum desirability peaking at 75% humidity These computed parameters are then weighted averaged to return the flavor desirability. Similar computations and made for the texture quality and practicality. The Objective Function Following the desirability function approach, we’ll use the overall desirability as our objective function. The goal is to maximize this overall score, which means finding parameters that best satisfy all our three requirements simultaneously: def objective_function-> float: """Compute overall desirability score based on individual quality metrics. Args: params: List of parameter values weights: Weights for texture, flavor and practicality scores Returns: Negative overall desirability score""" # Compute individual desirability scores texture = compute_texture_qualityflavor = compute_flavor_profilepracticality = compute_practicality# Ensure weights sum up to one weights = np.array/ np.sum# Calculate overall desirability using geometric mean overall_d = overall_desirability# Return negative value since we want to maximize desirability # but optimization functions typically minimize return -overall_d After computing the individual desirabilities for texture, flavor and practicality; the overall desirability is simply computed with a weighted geometric mean. It finally returns the negative overall desirability, so that it can be minimized. Optimization with SciPy We finally use SciPy’s minimize function to find optimal parameters. Since we returned the negative overall desirability as the objective function, minimizing it would maximize the overall desirability: def optimize-> list: # Define parameter bounds bounds = { 'fermentation_time':, 'fermentation_temp':, 'hydration_level':, 'kneading_time':, 'baking_temp':} # Initial guessx0 =# Run optimization result = minimize, bounds=list), method='SLSQP' ) return result.x In this function, after defining the bounds for each parameter, the initial guess is computed as the middle of bounds, and then given as input to the minimize function of SciPy. The result is finally returned. The weights are given as input to the optimizer too, and are a good way to customize the output. For example, with a larger weight on practicality, the optimized solution will focus on practicality over flavor and texture. Let’s now visualize the results for a few sets of weights. Visualization of Results Let’s see how the optimizer handles different preference profiles, demonstrating the flexibility of desirability functions, given various input weights. Let’s have a look at the results in case of weights favoring practicality: Optimized parameters with weights favoring practicality. Image by author. With weights largely in favor of practicality, the achieved overall desirability is 0.69, with a short kneading time of 5 minutes, since a high value impacts negatively the practicality. Now, if we optimize with an emphasis on texture, we have slightly different results: Optimized parameters with weights favoring texture. Image by author. In this case, the achieved overall desirability is 0.85, significantly higher. The kneading time is this time 12 minutes, as a higher value impacts positively the texture and is not penalized so much because of practicality. Conclusion: Practical Applications of Desirability Functions While we focused on bread baking as our example, the same approach can be applied to various domains, such as product formulation in cosmetics or resource allocation in portfolio optimization. Desirability functions provide a powerful mathematical framework for tackling multi-objective optimization problems across numerous data science applications. By transforming raw metrics into standardized desirability scores, we can effectively combine and optimize disparate objectives. The key advantages of this approach include: Standardized scales that make different metrics comparable and easy to combine into a single target Flexibility to handle different types of objectives: minimize, maximize, target Clear communication of preferences through mathematical functions The code presented here provides a starting point for your own experimentation. Whether you’re optimizing industrial processes, machine learning models, or product formulations, hopefully desirability functions offer a systematic approach to finding the best compromise among competing objectives. The post Optimizing Multi-Objective Problems with Desirability Functions appeared first on Towards Data Science. #optimizing #multiobjective #problems #with #desirability

Optimizing Multi-Objective Problems with Desirability Functions

towardsdatascience.com
When working in Data Science, it is not uncommon to encounter problems with competing objectives. Whether designing products, tuning algorithms or optimizing portfolios, we often need to balance several metrics to get the best possible outcome. Sometimes, maximizing one metrics comes at the expense of another, making it hard to have an overall optimized solution. While several solutions exist to solve multi-objective Optimization problems, I found desirability function to be both elegant and easy to explain to non-technical audience. Which makes them an interesting option to consider. Desirability functions will combine several metrics into a standardized score, allowing for a holistic optimization. In this article, we’ll explore: The mathematical foundation of desirability functions How to implement these functions in Python How to optimize a multi-objective problem with desirability functions Visualization for interpretation and explanation of the results To ground these concepts in a real example, we’ll apply desirability functions to optimize a bread baking: a toy problem with a few, interconnected parameters and competing quality objectives that will allow us to explore several optimization choices. By the end of this article, you’ll have a powerful new tool in your data science toolkit for tackling multi-objective optimization problems across numerous domains, as well as a fully functional code available here on GitHub. What are Desirability Functions? Desirability functions were first formalized by Harrington (1965) and later extended by Derringer and Suich (1980). The idea is to: Transform each response into a performance score between 0 (absolutely unacceptable) and 1 (the ideal value) Combine all scores into a single metric to maximize Let’s explore the types of desirability functions and then how we can combine all the scores. The different types of desirability functions There are three different desirability functions, that would allow to handle many situations. Smaller-is-better: Used when minimizing a response is desirable def desirability_smaller_is_better(x: float, x_min: float, x_max: float) -> float: """Calculate desirability function value where smaller values are better. Args: x: Input parameter value x_min: Minimum acceptable value x_max: Maximum acceptable value Returns: Desirability score between 0 and 1 """ if x <= x_min: return 1.0 elif x >= x_max: return 0.0 else: return (x_max - x) / (x_max - x_min) Larger-is-better: Used when maximizing a response is desirable def desirability_larger_is_better(x: float, x_min: float, x_max: float) -> float: """Calculate desirability function value where larger values are better. Args: x: Input parameter value x_min: Minimum acceptable value x_max: Maximum acceptable value Returns: Desirability score between 0 and 1 """ if x <= x_min: return 0.0 elif x >= x_max: return 1.0 else: return (x - x_min) / (x_max - x_min) Target-is-best: Used when a specific target value is optimal def desirability_target_is_best(x: float, x_min: float, x_target: float, x_max: float) -> float: """Calculate two-sided desirability function value with target value. Args: x: Input parameter value x_min: Minimum acceptable value x_target: Target (optimal) value x_max: Maximum acceptable value Returns: Desirability score between 0 and 1 """ if x_min <= x <= x_target: return (x - x_min) / (x_target - x_min) elif x_target < x <= x_max: return (x_max - x) / (x_max - x_target) else: return 0.0 Every input parameter can be parameterized with one of these three desirability functions, before combining them into a single desirability score. Combining Desirability Scores Once individual metrics are transformed into desirability scores, they need to be combined into an overall desirability. The most common approach is the geometric mean: Where di are individual desirability values and wi are weights reflecting the relative importance of each metric. The geometric mean has an important property: if any single desirability is 0 (i.e. completely unacceptable), the overall desirability is also 0, regardless of other values. This enforces that all requirements must be met to some extent. def overall_desirability(desirabilities, weights=None): """Compute overall desirability using geometric mean Parameters: ----------- desirabilities : list Individual desirability scores weights : list Weights for each desirability Returns: -------- float Overall desirability score """ if weights is None: weights = [1] * len(desirabilities) # Convert to numpy arrays d = np.array(desirabilities) w = np.array(weights) # Calculate geometric mean return np.prod(d ** w) ** (1 / np.sum(w)) The weights are hyperparameters that give leverage on the final outcome and give room for customization. A Practical Optimization Example: Bread Baking To demonstrate desirability functions in action, let’s apply them to a toy problem: a bread baking optimization problem. The Parameters and Quality Metrics Let’s play with the following parameters: Fermentation Time (30–180 minutes) Fermentation Temperature (20–30°C) Hydration Level (60–85%) Kneading Time (0–20 minutes) Baking Temperature (180–250°C) And let’s try to optimize these metrics: Texture Quality: The texture of the bread Flavor Profile: The flavor of the bread Practicality: The practicality of the whole process Of course, each of these metrics depends on more than one parameter. So here comes one of the most critical steps: mapping parameters to quality metrics. For each quality metric, we need to define how parameters influence it: def compute_flavor_profile(params: List[float]) -> float: """Compute flavor profile score based on input parameters. Args: params: List of parameter values [fermentation_time, ferment_temp, hydration, kneading_time, baking_temp] Returns: Weighted flavor profile score between 0 and 1 """ # Flavor mainly affected by fermentation parameters fermentation_d = desirability_larger_is_better(params[0], 30, 180) ferment_temp_d = desirability_target_is_best(params[1], 20, 24, 28) hydration_d = desirability_target_is_best(params[2], 65, 75, 85) # Baking temperature has minimal effect on flavor weights = [0.5, 0.3, 0.2] return np.average([fermentation_d, ferment_temp_d, hydration_d], weights=weights) Here for example, the flavor is influenced by the following: The fermentation time, with a minimum desirability below 30 minutes and a maximum desirability above 180 minutes The fermentation temperature, with a maximum desirability peaking at 24 degrees Celsius The hydration, with a maximum desirability peaking at 75% humidity These computed parameters are then weighted averaged to return the flavor desirability. Similar computations and made for the texture quality and practicality. The Objective Function Following the desirability function approach, we’ll use the overall desirability as our objective function. The goal is to maximize this overall score, which means finding parameters that best satisfy all our three requirements simultaneously: def objective_function(params: List[float], weights: List[float]) -> float: """Compute overall desirability score based on individual quality metrics. Args: params: List of parameter values weights: Weights for texture, flavor and practicality scores Returns: Negative overall desirability score (for minimization) """ # Compute individual desirability scores texture = compute_texture_quality(params) flavor = compute_flavor_profile(params) practicality = compute_practicality(params) # Ensure weights sum up to one weights = np.array(weights) / np.sum(weights) # Calculate overall desirability using geometric mean overall_d = overall_desirability([texture, flavor, practicality], weights) # Return negative value since we want to maximize desirability # but optimization functions typically minimize return -overall_d After computing the individual desirabilities for texture, flavor and practicality; the overall desirability is simply computed with a weighted geometric mean. It finally returns the negative overall desirability, so that it can be minimized. Optimization with SciPy We finally use SciPy’s minimize function to find optimal parameters. Since we returned the negative overall desirability as the objective function, minimizing it would maximize the overall desirability: def optimize(weights: list[float]) -> list[float]: # Define parameter bounds bounds = { 'fermentation_time': (1, 24), 'fermentation_temp': (20, 30), 'hydration_level': (60, 85), 'kneading_time': (0, 20), 'baking_temp': (180, 250) } # Initial guess (middle of bounds) x0 = [(b[0] + b[1]) / 2 for b in bounds.values()] # Run optimization result = minimize( objective_function, x0, args=(weights,), bounds=list(bounds.values()), method='SLSQP' ) return result.x In this function, after defining the bounds for each parameter, the initial guess is computed as the middle of bounds, and then given as input to the minimize function of SciPy. The result is finally returned. The weights are given as input to the optimizer too, and are a good way to customize the output. For example, with a larger weight on practicality, the optimized solution will focus on practicality over flavor and texture. Let’s now visualize the results for a few sets of weights. Visualization of Results Let’s see how the optimizer handles different preference profiles, demonstrating the flexibility of desirability functions, given various input weights. Let’s have a look at the results in case of weights favoring practicality: Optimized parameters with weights favoring practicality. Image by author. With weights largely in favor of practicality, the achieved overall desirability is 0.69, with a short kneading time of 5 minutes, since a high value impacts negatively the practicality. Now, if we optimize with an emphasis on texture, we have slightly different results: Optimized parameters with weights favoring texture. Image by author. In this case, the achieved overall desirability is 0.85, significantly higher. The kneading time is this time 12 minutes, as a higher value impacts positively the texture and is not penalized so much because of practicality. Conclusion: Practical Applications of Desirability Functions While we focused on bread baking as our example, the same approach can be applied to various domains, such as product formulation in cosmetics or resource allocation in portfolio optimization. Desirability functions provide a powerful mathematical framework for tackling multi-objective optimization problems across numerous data science applications. By transforming raw metrics into standardized desirability scores, we can effectively combine and optimize disparate objectives. The key advantages of this approach include: Standardized scales that make different metrics comparable and easy to combine into a single target Flexibility to handle different types of objectives: minimize, maximize, target Clear communication of preferences through mathematical functions The code presented here provides a starting point for your own experimentation. Whether you’re optimizing industrial processes, machine learning models, or product formulations, hopefully desirability functions offer a systematic approach to finding the best compromise among competing objectives. The post Optimizing Multi-Objective Problems with Desirability Functions appeared first on Towards Data Science.

0 Комментарии ·0 Поделились ·0 предпросмотр

Войдите, чтобы отмечать, делиться и комментировать!
Marktechpost AI @MarktechpostAI поделился ссылкой
2025-05-18 13:30:34 ·

How to Build a Powerful and Intelligent Question-Answering System by Using Tavily Search API, Chroma, Google Gemini LLMs, and the LangChain Framework

In this tutorial, we demonstrate how to build a powerful and intelligent question-answering system by combining the strengths of Tavily Search API, Chroma, Google Gemini LLMs, and the LangChain framework. The pipeline leverages real-time web search using Tavily, semantic document caching with Chroma vector store, and contextual response generation through the Gemini model. These tools are integrated through LangChain’s modular components, such as RunnableLambda, ChatPromptTemplate, ConversationBufferMemory, and GoogleGenerativeAIEmbeddings. It goes beyond simple Q&A by introducing a hybrid retrieval mechanism that checks for cached embeddings before invoking fresh web searches. The retrieved documents are intelligently formatted, summarized, and passed through a structured LLM prompt, with attention to source attribution, user history, and confidence scoring. Key functions such as advanced prompt engineering, sentiment and entity analysis, and dynamic vector store updates make this pipeline suitable for advanced use cases like research assistance, domain-specific summarization, and intelligent agents.
!pip install -qU langchain-community tavily-python langchain-google-genai streamlit matplotlib pandas tiktoken chromadb langchain_core pydantic langchain
We install and upgrade a comprehensive set of libraries required to build an advanced AI search assistant. It includes tools for retrieval, LLM integration, data handling, visualization, and tokenization. These components form the core foundation for constructing a real-time, context-aware QA system.
import os
import getpass
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import json
import time
from typing import List, Dict, Any, Optional
from datetime import datetime
We import essential Python libraries used throughout the notebook. It includes standard libraries for environment variables, secure input, time tracking, and data types. Additionally, it brings in core data science tools like pandas, matplotlib, and numpy for data handling, visualization, and numerical computations, as well as json for parsing structured data.
if "TAVILY_API_KEY" not in os.environ:
os.environ= getpass.getpassif "GOOGLE_API_KEY" not in os.environ:
os.environ= getpass.getpassimport logging
logging.basicConfigs - %s - %s - %s')
logger = logging.getLoggerWe securely initialize API keys for Tavily and Google Gemini by prompting users only if they’re not already set in the environment, ensuring safe and repeatable access to external services. It also configures a standardized logging setup using Python’s logging module, which helps monitor execution flow and capture debug or error messages throughout the notebook.
from langchain_community.retrievers import TavilySearchAPIRetriever
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser, JsonOutputParser
from langchain_core.prompts import ChatPromptTemplate, SystemMessagePromptTemplate, HumanMessagePromptTemplate
from langchain_core.runnables import RunnablePassthrough, RunnableLambda
from langchain_google_genai import ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains.summarize import load_summarize_chain
from langchain.memory import ConversationBufferMemory
We import key components from the LangChain ecosystem and its integrations. It brings in the TavilySearchAPIRetriever for real-time web search, Chroma for vector storage, and GoogleGenerativeAI modules for chat and embedding models. Core LangChain modules like ChatPromptTemplate, RunnableLambda, ConversationBufferMemory, and output parsers enable flexible prompt construction, memory handling, and pipeline execution.
class SearchQueryError:
"""Exception raised for errors in the search query."""
pass

def format_docs:
formatted_content =for i, doc in enumerate:
metadata = doc.metadata
source = metadata.gettitle = metadata.getscore = metadata.getformatted_content.appendreturn "nn".joinWe define two essential components for search and document handling. The SearchQueryError class creates a custom exception to manage invalid or failed search queries gracefully. The format_docs function processes a list of retrieved documents by extracting metadata such as title, source, and relevance score and formatting them into a clean, readable string.
class SearchResultsParser:
def parse:
try:
if isinstance:
import re
import json
json_match = re.searchif json_match:
json_str = json_match.groupreturn json.loadsreturn {"answer": text, "sources":, "confidence": 0.5}
elif hasattr:
return {"answer": text.content, "sources":, "confidence": 0.5}
else:
return {"answer": str, "sources":, "confidence": 0.5}
except Exception as e:
logger.warningreturn {"answer": str, "sources":, "confidence": 0.5}
The SearchResultsParser class provides a robust method for extracting structured information from LLM responses. It attempts to parse a JSON-like string from the model output, returning to a plain text response format if parsing fails. It gracefully handles string outputs and message objects, ensuring consistent downstream processing. In case of errors, it logs a warning and returns a fallback response containing the raw answer, empty sources, and a default confidence score, enhancing the system’s fault tolerance.
class EnhancedTavilyRetriever:
def __init__:
self.api_key = api_key
self.max_results = max_results
self.search_depth = search_depth
self.include_domains = include_domains orself.exclude_domains = exclude_domains orself.retriever = self._create_retrieverself.previous_searches =def _create_retriever:
try:
return TavilySearchAPIRetrieverexcept Exception as e:
logger.errorraise

def invoke:
if not query or not query.strip:
raise SearchQueryErrortry:
start_time = time.timeresults = self.retriever.invokeend_time = time.timesearch_record = {
"timestamp": datetime.now.isoformat,
"query": query,
"num_results": len,
"response_time": end_time - start_time
}
self.previous_searches.appendreturn results
except Exception as e:
logger.errorraise SearchQueryError}")

def get_search_history:
return self.previous_searches
The EnhancedTavilyRetriever class is a custom wrapper around the TavilySearchAPIRetriever, adding greater flexibility, control, and traceability to search operations. It supports advanced features like limiting search depth, domain inclusion/exclusion filters, and configurable result counts. The invoke method performs web searches and tracks each query’s metadata, storing it for later analysis.
class SearchCache:
def __init__:
self.embedding_function = GoogleGenerativeAIEmbeddingsself.vector_store = None
self.text_splitter = RecursiveCharacterTextSplitterdef add_documents:
if not documents:
return

try:
if self.vector_store is None:
self.vector_store = Chroma.from_documentselse:
self.vector_store.add_documentsexcept Exception as e:
logger.errordef search:
if self.vector_store is None:
returntry:
return self.vector_store.similarity_searchexcept Exception as e:
logger.errorreturnThe SearchCache class implements a semantic caching layer that stores and retrieves documents using vector embeddings for efficient similarity search. It uses GoogleGenerativeAIEmbeddings to convert documents into dense vectors and stores them in a Chroma vector database. The add_documents method initializes or updates the vector store, while the search method enables fast retrieval of the most relevant cached documents based on semantic similarity. This reduces redundant API calls and improves response times for repeated or related queries, serving as a lightweight hybrid memory layer in the AI assistant pipeline.
search_cache = SearchCacheenhanced_retriever = EnhancedTavilyRetrievermemory = ConversationBufferMemorysystem_template = """You are a research assistant that provides accurate answers based on the search results provided.
Follow these guidelines:
1. Only use the context provided to answer the question
2. If the context doesn't contain the answer, say "I don't have sufficient information to answer this question."
3. Cite your sources by referencing the document numbers
4. Don't make up information
5. Keep the answer concise but complete

Context: {context}
Chat History: {chat_history}
"""

system_message = SystemMessagePromptTemplate.from_templatehuman_template = "Question: {question}"
human_message = HumanMessagePromptTemplate.from_templateprompt = ChatPromptTemplate.from_messagesWe initialize the core components of the AI assistant: a semantic SearchCache, the EnhancedTavilyRetriever for web-based querying, and a ConversationBufferMemory to retain chat history across turns. It also defines a structured prompt using ChatPromptTemplate, guiding the LLM to act as a research assistant. The prompt enforces strict rules for factual accuracy, context usage, source citation, and concise answering, ensuring reliable and grounded responses.
def get_llm:
try:
return ChatGoogleGenerativeAIexcept Exception as e:
logger.errorraise

output_parser = SearchResultsParserWe define the get_llm function, which initializes a Google Gemini language model with configurable parameters such as model name, temperature, and decoding settings. It ensures robustness with error handling for failed model initialization. An instance of SearchResultsParser is also created to standardize and structure the LLM’s raw responses, enabling consistent downstream processing of answers and metadata.
def plot_search_metrics:
if not search_history:
printreturn

df = pd.DataFrameplt.figure)
plt.subplotplt.plot), df, marker='o')
plt.titleplt.xlabelplt.ylabel')
plt.gridplt.subplotplt.bar), df)
plt.titleplt.xlabelplt.ylabelplt.gridplt.tight_layoutplt.showThe plot_search_metrics function visualizes performance trends from past queries using Matplotlib. It converts the search history into a DataFrame and plots two subgraphs: one showing response time per search and the other displaying the number of results returned. This aids in analyzing the system’s efficiency and search quality over time, helping developers fine-tune the retriever or identify bottlenecks in real-world usage.
def retrieve_with_fallback:
cached_results = search_cache.searchif cached_results:
logger.info} documents from cache")
return cached_results

logger.infosearch_results = enhanced_retriever.invokesearch_cache.add_documentsreturn search_results

def summarize_documents:
llm = get_llmsummarize_prompt = ChatPromptTemplate.from_templatechain =, "query": lambda _: query}
| summarize_prompt
| llm
| StrOutputParser)

return chain.invokeThese two functions enhance the assistant’s intelligence and efficiency. The retrieve_with_fallback function implements a hybrid retrieval mechanism: it first attempts to fetch semantically relevant documents from the local Chroma cache and, if unsuccessful, falls back to a real-time Tavily web search, caching the new results for future use. Meanwhile, summarize_documents leverages a Gemini LLM to generate concise summaries from retrieved documents, guided by a structured prompt that ensures relevance to the query. Together, they enable low-latency, informative, and context-aware responses.
def advanced_chain:
llm = get_llmif query_engine == "enhanced":
retriever = lambda query: retrieve_with_fallbackelse:
retriever = enhanced_retriever.invoke

def chain_with_history:
query = input_dictchat_history = memory.load_memory_variablesif include_history elsedocs = retrievercontext = format_docsresult = prompt.invokememory.save_contextreturn llm.invokereturn RunnableLambda| StrOutputParserThe advanced_chain function defines a modular, end-to-end reasoning workflow for answering user queries using cached or real-time search. It initializes the specified Gemini model, selects the retrieval strategy, constructs a response pipeline incorporating chat history, formats documents into context, and prompts the LLM using a system-guided template. The chain also logs the interaction in memory and returns the final answer, parsed into clean text. This design enables flexible experimentation with models and retrieval strategies while maintaining conversation coherence.
qa_chain = advanced_chaindef analyze_query:
llm = get_llmanalysis_prompt = ChatPromptTemplate.from_template3. Key entities mentioned
4. Query typeQuery: {query}

Return the analysis in JSON format with the following structure:
{{
"topic": "main topic",
"sentiment": "sentiment",
"entities":,
"type": "query type"
}}
"""
)

chain = analysis_prompt | llm | output_parser

return chain.invokeprintprintquery = "what year was breath of the wild released and what was its reception?"
printWe initialize the final components of the intelligent assistant. qa_chain is the assembled reasoning pipeline ready to process user queries using retrieval, memory, and Gemini-based response generation. The analyze_query function performs a lightweight semantic analysis on a query, extracting the main topic, sentiment, entities, and query type using the Gemini model and a structured JSON prompt. The example query, about Breath of the Wild’s release and reception, showcases how the assistant is triggered and prepared for full-stack inference and semantic interpretation. The printed heading marks the start of interactive execution.
try:
printanswer = qa_chain.invokeprintprintprinttry:
query_analysis = analyze_queryprintprint)
except Exception as e:
print: {e}")
except Exception as e:
printhistory = enhanced_retriever.get_search_historyprintfor i, h in enumerate:
printprintspecialized_retriever = EnhancedTavilyRetrievertry:
specialized_results = specialized_retriever.invokeprint} specialized results")

summary = summarize_documentsprintprintexcept Exception as e:
printprintplot_search_metricsWe demonstrate the complete pipeline in action. It performs a search using the qa_chain, displays the generated answer, and then analyzes the query for sentiment, topic, entities, and type. It also retrieves and prints each query’s search history, response time, and result count. Also, it runs a domain-filtered search focused on Nintendo-related sites, summarizes the results, and visualizes search performance using plot_search_metrics, offering a comprehensive view of the assistant’s capabilities in real-time use.
In conclusion, following this tutorial gives users a comprehensive blueprint for creating a highly capable, context-aware, and scalable RAG system that bridges real-time web intelligence with conversational AI. The Tavily Search API lets users directly pull fresh and relevant content from the web. The Gemini LLM adds robust reasoning and summarization capabilities, while LangChain’s abstraction layer allows seamless orchestration between memory, embeddings, and model outputs. The implementation includes advanced features such as domain-specific filtering, query analysis, and fallback strategies using a semantic vector cache built with Chroma and GoogleGenerativeAIEmbeddings. Also, structured logging, error handling, and analytics dashboards provide transparency and diagnostics for real-world deployment.

Check out the Colab Notebook. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 90k+ ML SubReddit.
Asif RazzaqWebsite | + postsBioAsif Razzaq is the CEO of Marktechpost Media Inc.. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts of over 2 million monthly views, illustrating its popularity among audiences.Asif Razzaqhttps://www.marktechpost.com/author/6flvq/AWS Open-Sources Strands Agents SDK to Simplify AI Agent DevelopmentAsif Razzaqhttps://www.marktechpost.com/author/6flvq/Windsurf Launches SWE-1: A Frontier AI Model Family for End-to-End Software EngineeringAsif Razzaqhttps://www.marktechpost.com/author/6flvq/AI Agents Now Write Code in Parallel: OpenAI Introduces Codex, a Cloud-Based Coding Agent Inside ChatGPTAsif Razzaqhttps://www.marktechpost.com/author/6flvq/A Step-by-Step Guide to Build an Automated Knowledge Graph Pipeline Using LangGraph and NetworkX

Build GenAI you can trust. ⭐️ Parlant is your open-source engine for controlled, compliant, and purposeful AI conversations — Star Parlant on GitHub!
#how #build #powerful #intelligent #questionanswering

How to Build a Powerful and Intelligent Question-Answering System by Using Tavily Search API, Chroma, Google Gemini LLMs, and the LangChain Framework
In this tutorial, we demonstrate how to build a powerful and intelligent question-answering system by combining the strengths of Tavily Search API, Chroma, Google Gemini LLMs, and the LangChain framework. The pipeline leverages real-time web search using Tavily, semantic document caching with Chroma vector store, and contextual response generation through the Gemini model. These tools are integrated through LangChain’s modular components, such as RunnableLambda, ChatPromptTemplate, ConversationBufferMemory, and GoogleGenerativeAIEmbeddings. It goes beyond simple Q&A by introducing a hybrid retrieval mechanism that checks for cached embeddings before invoking fresh web searches. The retrieved documents are intelligently formatted, summarized, and passed through a structured LLM prompt, with attention to source attribution, user history, and confidence scoring. Key functions such as advanced prompt engineering, sentiment and entity analysis, and dynamic vector store updates make this pipeline suitable for advanced use cases like research assistance, domain-specific summarization, and intelligent agents. !pip install -qU langchain-community tavily-python langchain-google-genai streamlit matplotlib pandas tiktoken chromadb langchain_core pydantic langchain We install and upgrade a comprehensive set of libraries required to build an advanced AI search assistant. It includes tools for retrieval, LLM integration, data handling, visualization, and tokenization. These components form the core foundation for constructing a real-time, context-aware QA system. import os import getpass import pandas as pd import matplotlib.pyplot as plt import numpy as np import json import time from typing import List, Dict, Any, Optional from datetime import datetime We import essential Python libraries used throughout the notebook. It includes standard libraries for environment variables, secure input, time tracking, and data types. Additionally, it brings in core data science tools like pandas, matplotlib, and numpy for data handling, visualization, and numerical computations, as well as json for parsing structured data. if "TAVILY_API_KEY" not in os.environ: os.environ= getpass.getpassif "GOOGLE_API_KEY" not in os.environ: os.environ= getpass.getpassimport logging logging.basicConfigs - %s - %s - %s') logger = logging.getLoggerWe securely initialize API keys for Tavily and Google Gemini by prompting users only if they’re not already set in the environment, ensuring safe and repeatable access to external services. It also configures a standardized logging setup using Python’s logging module, which helps monitor execution flow and capture debug or error messages throughout the notebook. from langchain_community.retrievers import TavilySearchAPIRetriever from langchain_community.vectorstores import Chroma from langchain_core.documents import Document from langchain_core.output_parsers import StrOutputParser, JsonOutputParser from langchain_core.prompts import ChatPromptTemplate, SystemMessagePromptTemplate, HumanMessagePromptTemplate from langchain_core.runnables import RunnablePassthrough, RunnableLambda from langchain_google_genai import ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings from langchain.text_splitter import RecursiveCharacterTextSplitter from langchain.chains.summarize import load_summarize_chain from langchain.memory import ConversationBufferMemory We import key components from the LangChain ecosystem and its integrations. It brings in the TavilySearchAPIRetriever for real-time web search, Chroma for vector storage, and GoogleGenerativeAI modules for chat and embedding models. Core LangChain modules like ChatPromptTemplate, RunnableLambda, ConversationBufferMemory, and output parsers enable flexible prompt construction, memory handling, and pipeline execution. class SearchQueryError: """Exception raised for errors in the search query.""" pass def format_docs: formatted_content =for i, doc in enumerate: metadata = doc.metadata source = metadata.gettitle = metadata.getscore = metadata.getformatted_content.appendreturn "nn".joinWe define two essential components for search and document handling. The SearchQueryError class creates a custom exception to manage invalid or failed search queries gracefully. The format_docs function processes a list of retrieved documents by extracting metadata such as title, source, and relevance score and formatting them into a clean, readable string. class SearchResultsParser: def parse: try: if isinstance: import re import json json_match = re.searchif json_match: json_str = json_match.groupreturn json.loadsreturn {"answer": text, "sources":, "confidence": 0.5} elif hasattr: return {"answer": text.content, "sources":, "confidence": 0.5} else: return {"answer": str, "sources":, "confidence": 0.5} except Exception as e: logger.warningreturn {"answer": str, "sources":, "confidence": 0.5} The SearchResultsParser class provides a robust method for extracting structured information from LLM responses. It attempts to parse a JSON-like string from the model output, returning to a plain text response format if parsing fails. It gracefully handles string outputs and message objects, ensuring consistent downstream processing. In case of errors, it logs a warning and returns a fallback response containing the raw answer, empty sources, and a default confidence score, enhancing the system’s fault tolerance. class EnhancedTavilyRetriever: def __init__: self.api_key = api_key self.max_results = max_results self.search_depth = search_depth self.include_domains = include_domains orself.exclude_domains = exclude_domains orself.retriever = self._create_retrieverself.previous_searches =def _create_retriever: try: return TavilySearchAPIRetrieverexcept Exception as e: logger.errorraise def invoke: if not query or not query.strip: raise SearchQueryErrortry: start_time = time.timeresults = self.retriever.invokeend_time = time.timesearch_record = { "timestamp": datetime.now.isoformat, "query": query, "num_results": len, "response_time": end_time - start_time } self.previous_searches.appendreturn results except Exception as e: logger.errorraise SearchQueryError}") def get_search_history: return self.previous_searches The EnhancedTavilyRetriever class is a custom wrapper around the TavilySearchAPIRetriever, adding greater flexibility, control, and traceability to search operations. It supports advanced features like limiting search depth, domain inclusion/exclusion filters, and configurable result counts. The invoke method performs web searches and tracks each query’s metadata, storing it for later analysis. class SearchCache: def __init__: self.embedding_function = GoogleGenerativeAIEmbeddingsself.vector_store = None self.text_splitter = RecursiveCharacterTextSplitterdef add_documents: if not documents: return try: if self.vector_store is None: self.vector_store = Chroma.from_documentselse: self.vector_store.add_documentsexcept Exception as e: logger.errordef search: if self.vector_store is None: returntry: return self.vector_store.similarity_searchexcept Exception as e: logger.errorreturnThe SearchCache class implements a semantic caching layer that stores and retrieves documents using vector embeddings for efficient similarity search. It uses GoogleGenerativeAIEmbeddings to convert documents into dense vectors and stores them in a Chroma vector database. The add_documents method initializes or updates the vector store, while the search method enables fast retrieval of the most relevant cached documents based on semantic similarity. This reduces redundant API calls and improves response times for repeated or related queries, serving as a lightweight hybrid memory layer in the AI assistant pipeline. search_cache = SearchCacheenhanced_retriever = EnhancedTavilyRetrievermemory = ConversationBufferMemorysystem_template = """You are a research assistant that provides accurate answers based on the search results provided. Follow these guidelines: 1. Only use the context provided to answer the question 2. If the context doesn't contain the answer, say "I don't have sufficient information to answer this question." 3. Cite your sources by referencing the document numbers 4. Don't make up information 5. Keep the answer concise but complete Context: {context} Chat History: {chat_history} """ system_message = SystemMessagePromptTemplate.from_templatehuman_template = "Question: {question}" human_message = HumanMessagePromptTemplate.from_templateprompt = ChatPromptTemplate.from_messagesWe initialize the core components of the AI assistant: a semantic SearchCache, the EnhancedTavilyRetriever for web-based querying, and a ConversationBufferMemory to retain chat history across turns. It also defines a structured prompt using ChatPromptTemplate, guiding the LLM to act as a research assistant. The prompt enforces strict rules for factual accuracy, context usage, source citation, and concise answering, ensuring reliable and grounded responses. def get_llm: try: return ChatGoogleGenerativeAIexcept Exception as e: logger.errorraise output_parser = SearchResultsParserWe define the get_llm function, which initializes a Google Gemini language model with configurable parameters such as model name, temperature, and decoding settings. It ensures robustness with error handling for failed model initialization. An instance of SearchResultsParser is also created to standardize and structure the LLM’s raw responses, enabling consistent downstream processing of answers and metadata. def plot_search_metrics: if not search_history: printreturn df = pd.DataFrameplt.figure) plt.subplotplt.plot), df, marker='o') plt.titleplt.xlabelplt.ylabel') plt.gridplt.subplotplt.bar), df) plt.titleplt.xlabelplt.ylabelplt.gridplt.tight_layoutplt.showThe plot_search_metrics function visualizes performance trends from past queries using Matplotlib. It converts the search history into a DataFrame and plots two subgraphs: one showing response time per search and the other displaying the number of results returned. This aids in analyzing the system’s efficiency and search quality over time, helping developers fine-tune the retriever or identify bottlenecks in real-world usage. def retrieve_with_fallback: cached_results = search_cache.searchif cached_results: logger.info} documents from cache") return cached_results logger.infosearch_results = enhanced_retriever.invokesearch_cache.add_documentsreturn search_results def summarize_documents: llm = get_llmsummarize_prompt = ChatPromptTemplate.from_templatechain =, "query": lambda _: query} | summarize_prompt | llm | StrOutputParser) return chain.invokeThese two functions enhance the assistant’s intelligence and efficiency. The retrieve_with_fallback function implements a hybrid retrieval mechanism: it first attempts to fetch semantically relevant documents from the local Chroma cache and, if unsuccessful, falls back to a real-time Tavily web search, caching the new results for future use. Meanwhile, summarize_documents leverages a Gemini LLM to generate concise summaries from retrieved documents, guided by a structured prompt that ensures relevance to the query. Together, they enable low-latency, informative, and context-aware responses. def advanced_chain: llm = get_llmif query_engine == "enhanced": retriever = lambda query: retrieve_with_fallbackelse: retriever = enhanced_retriever.invoke def chain_with_history: query = input_dictchat_history = memory.load_memory_variablesif include_history elsedocs = retrievercontext = format_docsresult = prompt.invokememory.save_contextreturn llm.invokereturn RunnableLambda| StrOutputParserThe advanced_chain function defines a modular, end-to-end reasoning workflow for answering user queries using cached or real-time search. It initializes the specified Gemini model, selects the retrieval strategy, constructs a response pipeline incorporating chat history, formats documents into context, and prompts the LLM using a system-guided template. The chain also logs the interaction in memory and returns the final answer, parsed into clean text. This design enables flexible experimentation with models and retrieval strategies while maintaining conversation coherence. qa_chain = advanced_chaindef analyze_query: llm = get_llmanalysis_prompt = ChatPromptTemplate.from_template3. Key entities mentioned 4. Query typeQuery: {query} Return the analysis in JSON format with the following structure: {{ "topic": "main topic", "sentiment": "sentiment", "entities":, "type": "query type" }} """ ) chain = analysis_prompt | llm | output_parser return chain.invokeprintprintquery = "what year was breath of the wild released and what was its reception?" printWe initialize the final components of the intelligent assistant. qa_chain is the assembled reasoning pipeline ready to process user queries using retrieval, memory, and Gemini-based response generation. The analyze_query function performs a lightweight semantic analysis on a query, extracting the main topic, sentiment, entities, and query type using the Gemini model and a structured JSON prompt. The example query, about Breath of the Wild’s release and reception, showcases how the assistant is triggered and prepared for full-stack inference and semantic interpretation. The printed heading marks the start of interactive execution. try: printanswer = qa_chain.invokeprintprintprinttry: query_analysis = analyze_queryprintprint) except Exception as e: print: {e}") except Exception as e: printhistory = enhanced_retriever.get_search_historyprintfor i, h in enumerate: printprintspecialized_retriever = EnhancedTavilyRetrievertry: specialized_results = specialized_retriever.invokeprint} specialized results") summary = summarize_documentsprintprintexcept Exception as e: printprintplot_search_metricsWe demonstrate the complete pipeline in action. It performs a search using the qa_chain, displays the generated answer, and then analyzes the query for sentiment, topic, entities, and type. It also retrieves and prints each query’s search history, response time, and result count. Also, it runs a domain-filtered search focused on Nintendo-related sites, summarizes the results, and visualizes search performance using plot_search_metrics, offering a comprehensive view of the assistant’s capabilities in real-time use. In conclusion, following this tutorial gives users a comprehensive blueprint for creating a highly capable, context-aware, and scalable RAG system that bridges real-time web intelligence with conversational AI. The Tavily Search API lets users directly pull fresh and relevant content from the web. The Gemini LLM adds robust reasoning and summarization capabilities, while LangChain’s abstraction layer allows seamless orchestration between memory, embeddings, and model outputs. The implementation includes advanced features such as domain-specific filtering, query analysis, and fallback strategies using a semantic vector cache built with Chroma and GoogleGenerativeAIEmbeddings. Also, structured logging, error handling, and analytics dashboards provide transparency and diagnostics for real-world deployment. Check out the Colab Notebook. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 90k+ ML SubReddit. Asif RazzaqWebsite | + postsBioAsif Razzaq is the CEO of Marktechpost Media Inc.. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts of over 2 million monthly views, illustrating its popularity among audiences.Asif Razzaqhttps://www.marktechpost.com/author/6flvq/AWS Open-Sources Strands Agents SDK to Simplify AI Agent DevelopmentAsif Razzaqhttps://www.marktechpost.com/author/6flvq/Windsurf Launches SWE-1: A Frontier AI Model Family for End-to-End Software EngineeringAsif Razzaqhttps://www.marktechpost.com/author/6flvq/AI Agents Now Write Code in Parallel: OpenAI Introduces Codex, a Cloud-Based Coding Agent Inside ChatGPTAsif Razzaqhttps://www.marktechpost.com/author/6flvq/A Step-by-Step Guide to Build an Automated Knowledge Graph Pipeline Using LangGraph and NetworkX 🚨 Build GenAI you can trust. ⭐️ Parlant is your open-source engine for controlled, compliant, and purposeful AI conversations — Star Parlant on GitHub! #how #build #powerful #intelligent #questionanswering

How to Build a Powerful and Intelligent Question-Answering System by Using Tavily Search API, Chroma, Google Gemini LLMs, and the LangChain Framework

www.marktechpost.com
In this tutorial, we demonstrate how to build a powerful and intelligent question-answering system by combining the strengths of Tavily Search API, Chroma, Google Gemini LLMs, and the LangChain framework. The pipeline leverages real-time web search using Tavily, semantic document caching with Chroma vector store, and contextual response generation through the Gemini model. These tools are integrated through LangChain’s modular components, such as RunnableLambda, ChatPromptTemplate, ConversationBufferMemory, and GoogleGenerativeAIEmbeddings. It goes beyond simple Q&A by introducing a hybrid retrieval mechanism that checks for cached embeddings before invoking fresh web searches. The retrieved documents are intelligently formatted, summarized, and passed through a structured LLM prompt, with attention to source attribution, user history, and confidence scoring. Key functions such as advanced prompt engineering, sentiment and entity analysis, and dynamic vector store updates make this pipeline suitable for advanced use cases like research assistance, domain-specific summarization, and intelligent agents. !pip install -qU langchain-community tavily-python langchain-google-genai streamlit matplotlib pandas tiktoken chromadb langchain_core pydantic langchain We install and upgrade a comprehensive set of libraries required to build an advanced AI search assistant. It includes tools for retrieval (tavily-python, chromadb), LLM integration (langchain-google-genai, langchain), data handling (pandas, pydantic), visualization (matplotlib, streamlit), and tokenization (tiktoken). These components form the core foundation for constructing a real-time, context-aware QA system. import os import getpass import pandas as pd import matplotlib.pyplot as plt import numpy as np import json import time from typing import List, Dict, Any, Optional from datetime import datetime We import essential Python libraries used throughout the notebook. It includes standard libraries for environment variables, secure input, time tracking, and data types (os, getpass, time, typing, datetime). Additionally, it brings in core data science tools like pandas, matplotlib, and numpy for data handling, visualization, and numerical computations, as well as json for parsing structured data. if "TAVILY_API_KEY" not in os.environ: os.environ["TAVILY_API_KEY"] = getpass.getpass("Enter Tavily API key: ") if "GOOGLE_API_KEY" not in os.environ: os.environ["GOOGLE_API_KEY"] = getpass.getpass("Enter Google API key: ") import logging logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s') logger = logging.getLogger(__name__) We securely initialize API keys for Tavily and Google Gemini by prompting users only if they’re not already set in the environment, ensuring safe and repeatable access to external services. It also configures a standardized logging setup using Python’s logging module, which helps monitor execution flow and capture debug or error messages throughout the notebook. from langchain_community.retrievers import TavilySearchAPIRetriever from langchain_community.vectorstores import Chroma from langchain_core.documents import Document from langchain_core.output_parsers import StrOutputParser, JsonOutputParser from langchain_core.prompts import ChatPromptTemplate, SystemMessagePromptTemplate, HumanMessagePromptTemplate from langchain_core.runnables import RunnablePassthrough, RunnableLambda from langchain_google_genai import ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings from langchain.text_splitter import RecursiveCharacterTextSplitter from langchain.chains.summarize import load_summarize_chain from langchain.memory import ConversationBufferMemory We import key components from the LangChain ecosystem and its integrations. It brings in the TavilySearchAPIRetriever for real-time web search, Chroma for vector storage, and GoogleGenerativeAI modules for chat and embedding models. Core LangChain modules like ChatPromptTemplate, RunnableLambda, ConversationBufferMemory, and output parsers enable flexible prompt construction, memory handling, and pipeline execution. class SearchQueryError(Exception): """Exception raised for errors in the search query.""" pass def format_docs(docs): formatted_content = [] for i, doc in enumerate(docs): metadata = doc.metadata source = metadata.get('source', 'Unknown source') title = metadata.get('title', 'Untitled') score = metadata.get('score', 0) formatted_content.append( f"Document {i+1} [Score: {score:.2f}]:n" f"Title: {title}n" f"Source: {source}n" f"Content: {doc.page_content}n" ) return "nn".join(formatted_content) We define two essential components for search and document handling. The SearchQueryError class creates a custom exception to manage invalid or failed search queries gracefully. The format_docs function processes a list of retrieved documents by extracting metadata such as title, source, and relevance score and formatting them into a clean, readable string. class SearchResultsParser: def parse(self, text): try: if isinstance(text, str): import re import json json_match = re.search(r'{.*}', text, re.DOTALL) if json_match: json_str = json_match.group(0) return json.loads(json_str) return {"answer": text, "sources": [], "confidence": 0.5} elif hasattr(text, 'content'): return {"answer": text.content, "sources": [], "confidence": 0.5} else: return {"answer": str(text), "sources": [], "confidence": 0.5} except Exception as e: logger.warning(f"Failed to parse JSON: {e}") return {"answer": str(text), "sources": [], "confidence": 0.5} The SearchResultsParser class provides a robust method for extracting structured information from LLM responses. It attempts to parse a JSON-like string from the model output, returning to a plain text response format if parsing fails. It gracefully handles string outputs and message objects, ensuring consistent downstream processing. In case of errors, it logs a warning and returns a fallback response containing the raw answer, empty sources, and a default confidence score, enhancing the system’s fault tolerance. class EnhancedTavilyRetriever: def __init__(self, api_key=None, max_results=5, search_depth="advanced", include_domains=None, exclude_domains=None): self.api_key = api_key self.max_results = max_results self.search_depth = search_depth self.include_domains = include_domains or [] self.exclude_domains = exclude_domains or [] self.retriever = self._create_retriever() self.previous_searches = [] def _create_retriever(self): try: return TavilySearchAPIRetriever( api_key=self.api_key, k=self.max_results, search_depth=self.search_depth, include_domains=self.include_domains, exclude_domains=self.exclude_domains ) except Exception as e: logger.error(f"Failed to create Tavily retriever: {e}") raise def invoke(self, query, **kwargs): if not query or not query.strip(): raise SearchQueryError("Empty search query") try: start_time = time.time() results = self.retriever.invoke(query, **kwargs) end_time = time.time() search_record = { "timestamp": datetime.now().isoformat(), "query": query, "num_results": len(results), "response_time": end_time - start_time } self.previous_searches.append(search_record) return results except Exception as e: logger.error(f"Search failed: {e}") raise SearchQueryError(f"Failed to perform search: {str(e)}") def get_search_history(self): return self.previous_searches The EnhancedTavilyRetriever class is a custom wrapper around the TavilySearchAPIRetriever, adding greater flexibility, control, and traceability to search operations. It supports advanced features like limiting search depth, domain inclusion/exclusion filters, and configurable result counts. The invoke method performs web searches and tracks each query’s metadata (timestamp, response time, and result count), storing it for later analysis. class SearchCache: def __init__(self): self.embedding_function = GoogleGenerativeAIEmbeddings(model="models/embedding-001") self.vector_store = None self.text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200) def add_documents(self, documents): if not documents: return try: if self.vector_store is None: self.vector_store = Chroma.from_documents( documents=documents, embedding=self.embedding_function ) else: self.vector_store.add_documents(documents) except Exception as e: logger.error(f"Failed to add documents to cache: {e}") def search(self, query, k=3): if self.vector_store is None: return [] try: return self.vector_store.similarity_search(query, k=k) except Exception as e: logger.error(f"Vector search failed: {e}") return [] The SearchCache class implements a semantic caching layer that stores and retrieves documents using vector embeddings for efficient similarity search. It uses GoogleGenerativeAIEmbeddings to convert documents into dense vectors and stores them in a Chroma vector database. The add_documents method initializes or updates the vector store, while the search method enables fast retrieval of the most relevant cached documents based on semantic similarity. This reduces redundant API calls and improves response times for repeated or related queries, serving as a lightweight hybrid memory layer in the AI assistant pipeline. search_cache = SearchCache() enhanced_retriever = EnhancedTavilyRetriever(max_results=5) memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True) system_template = """You are a research assistant that provides accurate answers based on the search results provided. Follow these guidelines: 1. Only use the context provided to answer the question 2. If the context doesn't contain the answer, say "I don't have sufficient information to answer this question." 3. Cite your sources by referencing the document numbers 4. Don't make up information 5. Keep the answer concise but complete Context: {context} Chat History: {chat_history} """ system_message = SystemMessagePromptTemplate.from_template(system_template) human_template = "Question: {question}" human_message = HumanMessagePromptTemplate.from_template(human_template) prompt = ChatPromptTemplate.from_messages([system_message, human_message]) We initialize the core components of the AI assistant: a semantic SearchCache, the EnhancedTavilyRetriever for web-based querying, and a ConversationBufferMemory to retain chat history across turns. It also defines a structured prompt using ChatPromptTemplate, guiding the LLM to act as a research assistant. The prompt enforces strict rules for factual accuracy, context usage, source citation, and concise answering, ensuring reliable and grounded responses. def get_llm(model_name="gemini-2.0-flash-lite", temperature=0.2, response_mode="json"): try: return ChatGoogleGenerativeAI( model=model_name, temperature=temperature, convert_system_message_to_human=True, top_p=0.95, top_k=40, max_output_tokens=2048 ) except Exception as e: logger.error(f"Failed to initialize LLM: {e}") raise output_parser = SearchResultsParser() We define the get_llm function, which initializes a Google Gemini language model with configurable parameters such as model name, temperature, and decoding settings (e.g., top_p, top_k, and max tokens). It ensures robustness with error handling for failed model initialization. An instance of SearchResultsParser is also created to standardize and structure the LLM’s raw responses, enabling consistent downstream processing of answers and metadata. def plot_search_metrics(search_history): if not search_history: print("No search history available") return df = pd.DataFrame(search_history) plt.figure(figsize=(12, 6)) plt.subplot(1, 2, 1) plt.plot(range(len(df)), df['response_time'], marker='o') plt.title('Search Response Times') plt.xlabel('Search Index') plt.ylabel('Time (seconds)') plt.grid(True) plt.subplot(1, 2, 2) plt.bar(range(len(df)), df['num_results']) plt.title('Number of Results per Search') plt.xlabel('Search Index') plt.ylabel('Number of Results') plt.grid(True) plt.tight_layout() plt.show() The plot_search_metrics function visualizes performance trends from past queries using Matplotlib. It converts the search history into a DataFrame and plots two subgraphs: one showing response time per search and the other displaying the number of results returned. This aids in analyzing the system’s efficiency and search quality over time, helping developers fine-tune the retriever or identify bottlenecks in real-world usage. def retrieve_with_fallback(query): cached_results = search_cache.search(query) if cached_results: logger.info(f"Retrieved {len(cached_results)} documents from cache") return cached_results logger.info("No cache hit, performing web search") search_results = enhanced_retriever.invoke(query) search_cache.add_documents(search_results) return search_results def summarize_documents(documents, query): llm = get_llm(temperature=0) summarize_prompt = ChatPromptTemplate.from_template( """Create a concise summary of the following documents related to this query: {query} {documents} Provide a comprehensive summary that addresses the key points relevant to the query. """ ) chain = ( {"documents": lambda docs: format_docs(docs), "query": lambda _: query} | summarize_prompt | llm | StrOutputParser() ) return chain.invoke(documents) These two functions enhance the assistant’s intelligence and efficiency. The retrieve_with_fallback function implements a hybrid retrieval mechanism: it first attempts to fetch semantically relevant documents from the local Chroma cache and, if unsuccessful, falls back to a real-time Tavily web search, caching the new results for future use. Meanwhile, summarize_documents leverages a Gemini LLM to generate concise summaries from retrieved documents, guided by a structured prompt that ensures relevance to the query. Together, they enable low-latency, informative, and context-aware responses. def advanced_chain(query_engine="enhanced", model="gemini-1.5-pro", include_history=True): llm = get_llm(model_name=model) if query_engine == "enhanced": retriever = lambda query: retrieve_with_fallback(query) else: retriever = enhanced_retriever.invoke def chain_with_history(input_dict): query = input_dict["question"] chat_history = memory.load_memory_variables({})["chat_history"] if include_history else [] docs = retriever(query) context = format_docs(docs) result = prompt.invoke({ "context": context, "question": query, "chat_history": chat_history }) memory.save_context({"input": query}, {"output": result.content}) return llm.invoke(result) return RunnableLambda(chain_with_history) | StrOutputParser() The advanced_chain function defines a modular, end-to-end reasoning workflow for answering user queries using cached or real-time search. It initializes the specified Gemini model, selects the retrieval strategy (cached fallback or direct search), constructs a response pipeline incorporating chat history (if enabled), formats documents into context, and prompts the LLM using a system-guided template. The chain also logs the interaction in memory and returns the final answer, parsed into clean text. This design enables flexible experimentation with models and retrieval strategies while maintaining conversation coherence. qa_chain = advanced_chain() def analyze_query(query): llm = get_llm(temperature=0) analysis_prompt = ChatPromptTemplate.from_template( """Analyze the following query and provide: 1. Main topic 2. Sentiment (positive, negative, neutral) 3. Key entities mentioned 4. Query type (factual, opinion, how-to, etc.) Query: {query} Return the analysis in JSON format with the following structure: {{ "topic": "main topic", "sentiment": "sentiment", "entities": ["entity1", "entity2"], "type": "query type" }} """ ) chain = analysis_prompt | llm | output_parser return chain.invoke({"query": query}) print("Advanced Tavily-Gemini Implementation") print("="*50) query = "what year was breath of the wild released and what was its reception?" print(f"Query: {query}") We initialize the final components of the intelligent assistant. qa_chain is the assembled reasoning pipeline ready to process user queries using retrieval, memory, and Gemini-based response generation. The analyze_query function performs a lightweight semantic analysis on a query, extracting the main topic, sentiment, entities, and query type using the Gemini model and a structured JSON prompt. The example query, about Breath of the Wild’s release and reception, showcases how the assistant is triggered and prepared for full-stack inference and semantic interpretation. The printed heading marks the start of interactive execution. try: print("nSearching for answer...") answer = qa_chain.invoke({"question": query}) print("nAnswer:") print(answer) print("nAnalyzing query...") try: query_analysis = analyze_query(query) print("nQuery Analysis:") print(json.dumps(query_analysis, indent=2)) except Exception as e: print(f"Query analysis error (non-critical): {e}") except Exception as e: print(f"Error in search: {e}") history = enhanced_retriever.get_search_history() print("nSearch History:") for i, h in enumerate(history): print(f"{i+1}. Query: {h['query']} - Results: {h['num_results']} - Time: {h['response_time']:.2f}s") print("nAdvanced search with domain filtering:") specialized_retriever = EnhancedTavilyRetriever( max_results=3, search_depth="advanced", include_domains=["nintendo.com", "zelda.com"], exclude_domains=["reddit.com", "twitter.com"] ) try: specialized_results = specialized_retriever.invoke("breath of the wild sales") print(f"Found {len(specialized_results)} specialized results") summary = summarize_documents(specialized_results, "breath of the wild sales") print("nSummary of specialized results:") print(summary) except Exception as e: print(f"Error in specialized search: {e}") print("nSearch Metrics:") plot_search_metrics(history) We demonstrate the complete pipeline in action. It performs a search using the qa_chain, displays the generated answer, and then analyzes the query for sentiment, topic, entities, and type. It also retrieves and prints each query’s search history, response time, and result count. Also, it runs a domain-filtered search focused on Nintendo-related sites, summarizes the results, and visualizes search performance using plot_search_metrics, offering a comprehensive view of the assistant’s capabilities in real-time use. In conclusion, following this tutorial gives users a comprehensive blueprint for creating a highly capable, context-aware, and scalable RAG system that bridges real-time web intelligence with conversational AI. The Tavily Search API lets users directly pull fresh and relevant content from the web. The Gemini LLM adds robust reasoning and summarization capabilities, while LangChain’s abstraction layer allows seamless orchestration between memory, embeddings, and model outputs. The implementation includes advanced features such as domain-specific filtering, query analysis (sentiment, topic, and entity extraction), and fallback strategies using a semantic vector cache built with Chroma and GoogleGenerativeAIEmbeddings. Also, structured logging, error handling, and analytics dashboards provide transparency and diagnostics for real-world deployment. Check out the Colab Notebook. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 90k+ ML SubReddit. Asif RazzaqWebsite | + postsBioAsif Razzaq is the CEO of Marktechpost Media Inc.. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts of over 2 million monthly views, illustrating its popularity among audiences.Asif Razzaqhttps://www.marktechpost.com/author/6flvq/AWS Open-Sources Strands Agents SDK to Simplify AI Agent DevelopmentAsif Razzaqhttps://www.marktechpost.com/author/6flvq/Windsurf Launches SWE-1: A Frontier AI Model Family for End-to-End Software EngineeringAsif Razzaqhttps://www.marktechpost.com/author/6flvq/AI Agents Now Write Code in Parallel: OpenAI Introduces Codex, a Cloud-Based Coding Agent Inside ChatGPTAsif Razzaqhttps://www.marktechpost.com/author/6flvq/A Step-by-Step Guide to Build an Automated Knowledge Graph Pipeline Using LangGraph and NetworkX 🚨 Build GenAI you can trust. ⭐️ Parlant is your open-source engine for controlled, compliant, and purposeful AI conversations — Star Parlant on GitHub! (Promoted)

0 Комментарии ·0 Поделились ·0 предпросмотр

Войдите, чтобы отмечать, делиться и комментировать!
Towards Data Science @TowardsDataScience поделился ссылкой
2025-05-16 17:40:54 ·

How To Build a Benchmark for Your Models

I’ve been working as a data science consultant for the past three years, and I’ve had the opportunity to work on multiple projects across various industries. Yet, I noticed one common denominator among most of the clients I worked with:

They rarely have a clear idea of the project objective.

This is one of the main obstacles data scientists face, especially now that Gen AI is taking over every domain.

But let’s suppose that after some back and forth, the objective becomes clear. We managed to pin down a specific question to answer. For example:

I want to classify my customers into two groups according to their probability to churn: “high likelihood to churn” and “low likelihood to churn”

Well, now what? Easy, let’s start building some models!

Wrong!

If having a clear objective is rare, having a reliable benchmark is even rarer.

In my opinion, one of the most important steps in delivering a data science project is defining and agreeing on a set of benchmarks with the client.

In this blog post, I’ll explain:

What a benchmark is,

Why it is important to have a benchmark,

How I would build one using an example scenario and

Some potential drawbacks to keep in mind

What is a benchmark?

A benchmark is a standardized way to evaluate the performance of a model. It provides a reference point against which new models can be compared.

A benchmark needs two key components to be considered complete:

A set of metrics to evaluate the performance

A set of simple models to use as baselines

The concept at its core is simple: every time I develop a new model I compare it against both previous versions and the baseline models. This ensures improvements are real and tracked.

It is essential to understand that this baseline shouldn’t be model or dataset-specific, but rather business-case-specific. It should be a general benchmark for a given business case.

If I encounter a new dataset, with the same business objective, this benchmark should be a reliable reference point.

Why building a benchmark is important

Now that we’ve defined what a benchmark is, let’s dive into why I believe it’s worth spending an extra project week on the development of a strong benchmark.

Without a Benchmark you’re aiming for perfection — If you are working without a clear reference point any result will lose meaning. “My model has a MAE of 30.000” Is that good? IDK! Maybe with a simple mean you would get a MAE of 25.000. By comparing your model to a baseline, you can measure both performance and improvement.

Improves Communicating with Clients — Clients and business teams might not immediately understand the standard output of a model. However, by engaging them with simple baselines from the start, it becomes easier to demonstrate improvements later. In many cases benchmarks could come directly from the business in different shapes or forms.

Helps in Model Selection — A benchmark gives a starting point to compare multiple models fairly. Without it, you might waste time testing models that aren’t worth considering.

Model Drift Detection and Monitoring — Models can degrade over time. By having a benchmark you might be able to intercept drifts early by comparing new model outputs against past benchmarks and baselines.

Consistency Between Different Datasets — Datasets evolve. By having a fixed set of metrics and models you ensure that performance comparisons remain valid over time.

With a clear benchmark, every step in the model development will provide immediate feedback, making the whole process more intentional and data-driven.

How I would build a benchmark

I hope I’ve convinced you of the importance of having a benchmark. Now, let’s actually build one.

Let’s start from the business question we presented at the very beginning of this blog post:

I want to classify my customers into two groups according to their probability to churn: “high likelihood to churn” and “low likelihood to churn”

For simplicity, I’ll assume no additional business constraints, but in real-world scenarios, constraints often exist.

For this example, I am using this dataset . The data contains some attributes from a company’s customer basealong with their churn status.

Now that we have something to work on let’s build the benchmark:

1. Defining the metrics

We are dealing with a churn use case, in particular, this is a binary classification problem. Thus the main metrics that we could use are:

Precision — Percentage of correctly predicted churners among all predicted churners

Recall — Percentage of actual churners correctly identified

F1 score — Balances precision and recall

True Positives, False Positives, True Negative and False Negatives

These are some of the “simple” metrics that could be used to evaluate the output of a model.

However, it is not an exhaustive list, standard metrics aren’t always enough. In many use cases, it might be useful to build custom metrics.

Let’s assume that in our business case the customers labeled as “high likelihood to churn” are offered a discount. This creates:

A cost when offering the discount to a non-churning customer

A profit when retaining a churning customer

Following on this definition we can build a custom metric that will be crucial in our scenario:

# Defining the business case-specific reference metric
def financial_gain:
loss_from_fp = np.sum) * 250
gain_from_tp = np.sum) * 1000
return gain_from_tp - loss_from_fp

When you are building business-driven metrics these are usually the most relevant. Such metrics could take any shape or form: Financial goals, minimum requirements, percentage of coverage and more.

2. Defining the benchmarks

Now that we’ve defined our metrics, we can define a set of baseline models to be used as a reference.

In this phase, you should define a list of simple-to-implement model in their simplest possible setup. There is no reason at this state to spend time and resources on the optimization of these models, my mindset is:

If I had 15 minutes, how would I implement this model?

In later phases of the model, you can add mode baseline models as the project proceeds.

In this case, I will use the following models:

Random Model — Assigns labels randomly

Majority Model — Always predicts the most frequent class

Simple XGB

Simple KNN

import numpy as np
import xgboost as xgb
from sklearn.neighbors import KNeighborsClassifier

class BinaryMean:
@staticmethod
def run_benchmark:
np.random.seedreturn np.random.choice, p=.mean, 1 - df_train.mean])

class SimpleXbg:
@staticmethod
def run_benchmark:
model = xgb.XGBClassifiermodel.fit.drop, df_train)
return model.predict.drop)

class MajorityClass:
@staticmethod
def run_benchmark:
majority_class = df_train.modereturn np.full, majority_class)

class SimpleKNN:
@staticmethod
def run_benchmark:
model = KNeighborsClassifiermodel.fit.drop, df_train)
return model.predict.drop)

Again, as in the case of the metrics, we can build custom benchmarks.

Let’s assume that in our business case the the marketing team contacts every client who’s:

Over 50 y/o and

That is not active anymore

Following this rule we can build this model:

# Defining the business case-specific benchmark
class BusinessBenchmark:
@staticmethod
def run_benchmark:
df = df_test.copydf.loc= 0
df.loc= 1
return dfRunning the benchmark

To run the benchmark I will use the following class. The entry point is the method compare_with_benchmark that, given a prediction, runs all the models and calculates all the metrics.

import numpy as np

class ChurnBinaryBenchmark:
def __init__:
self.metrics = metrics
self.benchmark_models = benchmark_models

def compare_pred_with_benchmark:

output_metrics = {
'Prediction': self._calculate_metrics}
dct_benchmarks = {}

for model in self.benchmark_models:
dct_benchmarks= model.run_benchmarkoutput_metrics= self._calculate_metricsreturn output_metrics

def _calculate_metrics:
return {getattr: funcfor func in self.metrics}

Now all we need is a prediction. For this example, I made a quick feature engineering and some hyperparameter tuning.

The last step is just to run the benchmark:

binary_benchmark = ChurnBinaryBenchmarkres = binary_benchmark.compare_pred_with_benchmarkpd.DataFrameBenchmark metrics comparison | Image by Author

This generates a comparison table of all models across all metrics. Using this table, it is possible to draw concrete conclusions on the model’s predictions and make informed decisions on the following steps of the process.

Some drawbacks

As we’ve seen there are plenty of reasons why it is useful to have a benchmark. However, even though benchmarks are incredibly useful, there are some pitfalls to watch out for:

Non-Informative Benchmark — When the metrics or models are poorly defined the marginal impact of having a benchmark decreases. Always define meaningful baselines.

Misinterpretation by Stakeholders — Communication with the client is essential, it is important to state clearly what the metrics are measuring. The best model might not be the best on all the defined metrics.

Overfitting to the Benchmark — You might end up trying to create features that are too specific, that might beat the benchmark, but do not generalize well in prediction. Don’t focus on beating the benchmark, but on creating the best solution possible to the problem.

Change of Objective — Objectives defined might change, due to miscommunication or changes in plans. Keep your benchmark flexible so it can adapt when needed.

Final thoughts

Benchmarks provide clarity, ensure improvements are measurable, and create a shared reference point between data scientists and clients. They help avoid the trap of assuming a model is performing well without proof and ensure that every iteration brings real value.

They also act as a communication tool, making it easier to explain progress to clients. Instead of just presenting numbers, you can show clear comparisons that highlight improvements.

Here you can find a notebook with a full implementation from this blog post.
The post How To Build a Benchmark for Your Models appeared first on Towards Data Science.
#how #build #benchmark #your #models

How To Build a Benchmark for Your Models
I’ve been working as a data science consultant for the past three years, and I’ve had the opportunity to work on multiple projects across various industries. Yet, I noticed one common denominator among most of the clients I worked with: They rarely have a clear idea of the project objective. This is one of the main obstacles data scientists face, especially now that Gen AI is taking over every domain. But let’s suppose that after some back and forth, the objective becomes clear. We managed to pin down a specific question to answer. For example: I want to classify my customers into two groups according to their probability to churn: “high likelihood to churn” and “low likelihood to churn” Well, now what? Easy, let’s start building some models! Wrong! If having a clear objective is rare, having a reliable benchmark is even rarer. In my opinion, one of the most important steps in delivering a data science project is defining and agreeing on a set of benchmarks with the client. In this blog post, I’ll explain: What a benchmark is, Why it is important to have a benchmark, How I would build one using an example scenario and Some potential drawbacks to keep in mind What is a benchmark? A benchmark is a standardized way to evaluate the performance of a model. It provides a reference point against which new models can be compared. A benchmark needs two key components to be considered complete: A set of metrics to evaluate the performance A set of simple models to use as baselines The concept at its core is simple: every time I develop a new model I compare it against both previous versions and the baseline models. This ensures improvements are real and tracked. It is essential to understand that this baseline shouldn’t be model or dataset-specific, but rather business-case-specific. It should be a general benchmark for a given business case. If I encounter a new dataset, with the same business objective, this benchmark should be a reliable reference point. Why building a benchmark is important Now that we’ve defined what a benchmark is, let’s dive into why I believe it’s worth spending an extra project week on the development of a strong benchmark. Without a Benchmark you’re aiming for perfection — If you are working without a clear reference point any result will lose meaning. “My model has a MAE of 30.000” Is that good? IDK! Maybe with a simple mean you would get a MAE of 25.000. By comparing your model to a baseline, you can measure both performance and improvement. Improves Communicating with Clients — Clients and business teams might not immediately understand the standard output of a model. However, by engaging them with simple baselines from the start, it becomes easier to demonstrate improvements later. In many cases benchmarks could come directly from the business in different shapes or forms. Helps in Model Selection — A benchmark gives a starting point to compare multiple models fairly. Without it, you might waste time testing models that aren’t worth considering. Model Drift Detection and Monitoring — Models can degrade over time. By having a benchmark you might be able to intercept drifts early by comparing new model outputs against past benchmarks and baselines. Consistency Between Different Datasets — Datasets evolve. By having a fixed set of metrics and models you ensure that performance comparisons remain valid over time. With a clear benchmark, every step in the model development will provide immediate feedback, making the whole process more intentional and data-driven. How I would build a benchmark I hope I’ve convinced you of the importance of having a benchmark. Now, let’s actually build one. Let’s start from the business question we presented at the very beginning of this blog post: I want to classify my customers into two groups according to their probability to churn: “high likelihood to churn” and “low likelihood to churn” For simplicity, I’ll assume no additional business constraints, but in real-world scenarios, constraints often exist. For this example, I am using this dataset . The data contains some attributes from a company’s customer basealong with their churn status. Now that we have something to work on let’s build the benchmark: 1. Defining the metrics We are dealing with a churn use case, in particular, this is a binary classification problem. Thus the main metrics that we could use are: Precision — Percentage of correctly predicted churners among all predicted churners Recall — Percentage of actual churners correctly identified F1 score — Balances precision and recall True Positives, False Positives, True Negative and False Negatives These are some of the “simple” metrics that could be used to evaluate the output of a model. However, it is not an exhaustive list, standard metrics aren’t always enough. In many use cases, it might be useful to build custom metrics. Let’s assume that in our business case the customers labeled as “high likelihood to churn” are offered a discount. This creates: A cost when offering the discount to a non-churning customer A profit when retaining a churning customer Following on this definition we can build a custom metric that will be crucial in our scenario: # Defining the business case-specific reference metric def financial_gain: loss_from_fp = np.sum) * 250 gain_from_tp = np.sum) * 1000 return gain_from_tp - loss_from_fp When you are building business-driven metrics these are usually the most relevant. Such metrics could take any shape or form: Financial goals, minimum requirements, percentage of coverage and more. 2. Defining the benchmarks Now that we’ve defined our metrics, we can define a set of baseline models to be used as a reference. In this phase, you should define a list of simple-to-implement model in their simplest possible setup. There is no reason at this state to spend time and resources on the optimization of these models, my mindset is: If I had 15 minutes, how would I implement this model? In later phases of the model, you can add mode baseline models as the project proceeds. In this case, I will use the following models: Random Model — Assigns labels randomly Majority Model — Always predicts the most frequent class Simple XGB Simple KNN import numpy as np import xgboost as xgb from sklearn.neighbors import KNeighborsClassifier class BinaryMean: @staticmethod def run_benchmark: np.random.seedreturn np.random.choice, p=.mean, 1 - df_train.mean]) class SimpleXbg: @staticmethod def run_benchmark: model = xgb.XGBClassifiermodel.fit.drop, df_train) return model.predict.drop) class MajorityClass: @staticmethod def run_benchmark: majority_class = df_train.modereturn np.full, majority_class) class SimpleKNN: @staticmethod def run_benchmark: model = KNeighborsClassifiermodel.fit.drop, df_train) return model.predict.drop) Again, as in the case of the metrics, we can build custom benchmarks. Let’s assume that in our business case the the marketing team contacts every client who’s: Over 50 y/o and That is not active anymore Following this rule we can build this model: # Defining the business case-specific benchmark class BusinessBenchmark: @staticmethod def run_benchmark: df = df_test.copydf.loc= 0 df.loc= 1 return dfRunning the benchmark To run the benchmark I will use the following class. The entry point is the method compare_with_benchmark that, given a prediction, runs all the models and calculates all the metrics. import numpy as np class ChurnBinaryBenchmark: def __init__: self.metrics = metrics self.benchmark_models = benchmark_models def compare_pred_with_benchmark: output_metrics = { 'Prediction': self._calculate_metrics} dct_benchmarks = {} for model in self.benchmark_models: dct_benchmarks= model.run_benchmarkoutput_metrics= self._calculate_metricsreturn output_metrics def _calculate_metrics: return {getattr: funcfor func in self.metrics} Now all we need is a prediction. For this example, I made a quick feature engineering and some hyperparameter tuning. The last step is just to run the benchmark: binary_benchmark = ChurnBinaryBenchmarkres = binary_benchmark.compare_pred_with_benchmarkpd.DataFrameBenchmark metrics comparison | Image by Author This generates a comparison table of all models across all metrics. Using this table, it is possible to draw concrete conclusions on the model’s predictions and make informed decisions on the following steps of the process. Some drawbacks As we’ve seen there are plenty of reasons why it is useful to have a benchmark. However, even though benchmarks are incredibly useful, there are some pitfalls to watch out for: Non-Informative Benchmark — When the metrics or models are poorly defined the marginal impact of having a benchmark decreases. Always define meaningful baselines. Misinterpretation by Stakeholders — Communication with the client is essential, it is important to state clearly what the metrics are measuring. The best model might not be the best on all the defined metrics. Overfitting to the Benchmark — You might end up trying to create features that are too specific, that might beat the benchmark, but do not generalize well in prediction. Don’t focus on beating the benchmark, but on creating the best solution possible to the problem. Change of Objective — Objectives defined might change, due to miscommunication or changes in plans. Keep your benchmark flexible so it can adapt when needed. Final thoughts Benchmarks provide clarity, ensure improvements are measurable, and create a shared reference point between data scientists and clients. They help avoid the trap of assuming a model is performing well without proof and ensure that every iteration brings real value. They also act as a communication tool, making it easier to explain progress to clients. Instead of just presenting numbers, you can show clear comparisons that highlight improvements. Here you can find a notebook with a full implementation from this blog post. The post How To Build a Benchmark for Your Models appeared first on Towards Data Science. #how #build #benchmark #your #models

How To Build a Benchmark for Your Models

towardsdatascience.com
I’ve been working as a data science consultant for the past three years, and I’ve had the opportunity to work on multiple projects across various industries. Yet, I noticed one common denominator among most of the clients I worked with: They rarely have a clear idea of the project objective. This is one of the main obstacles data scientists face, especially now that Gen AI is taking over every domain. But let’s suppose that after some back and forth, the objective becomes clear. We managed to pin down a specific question to answer. For example: I want to classify my customers into two groups according to their probability to churn: “high likelihood to churn” and “low likelihood to churn” Well, now what? Easy, let’s start building some models! Wrong! If having a clear objective is rare, having a reliable benchmark is even rarer. In my opinion, one of the most important steps in delivering a data science project is defining and agreeing on a set of benchmarks with the client. In this blog post, I’ll explain: What a benchmark is, Why it is important to have a benchmark, How I would build one using an example scenario and Some potential drawbacks to keep in mind What is a benchmark? A benchmark is a standardized way to evaluate the performance of a model. It provides a reference point against which new models can be compared. A benchmark needs two key components to be considered complete: A set of metrics to evaluate the performance A set of simple models to use as baselines The concept at its core is simple: every time I develop a new model I compare it against both previous versions and the baseline models. This ensures improvements are real and tracked. It is essential to understand that this baseline shouldn’t be model or dataset-specific, but rather business-case-specific. It should be a general benchmark for a given business case. If I encounter a new dataset, with the same business objective, this benchmark should be a reliable reference point. Why building a benchmark is important Now that we’ve defined what a benchmark is, let’s dive into why I believe it’s worth spending an extra project week on the development of a strong benchmark. Without a Benchmark you’re aiming for perfection — If you are working without a clear reference point any result will lose meaning. “My model has a MAE of 30.000” Is that good? IDK! Maybe with a simple mean you would get a MAE of 25.000. By comparing your model to a baseline, you can measure both performance and improvement. Improves Communicating with Clients — Clients and business teams might not immediately understand the standard output of a model. However, by engaging them with simple baselines from the start, it becomes easier to demonstrate improvements later. In many cases benchmarks could come directly from the business in different shapes or forms. Helps in Model Selection — A benchmark gives a starting point to compare multiple models fairly. Without it, you might waste time testing models that aren’t worth considering. Model Drift Detection and Monitoring — Models can degrade over time. By having a benchmark you might be able to intercept drifts early by comparing new model outputs against past benchmarks and baselines. Consistency Between Different Datasets — Datasets evolve. By having a fixed set of metrics and models you ensure that performance comparisons remain valid over time. With a clear benchmark, every step in the model development will provide immediate feedback, making the whole process more intentional and data-driven. How I would build a benchmark I hope I’ve convinced you of the importance of having a benchmark. Now, let’s actually build one. Let’s start from the business question we presented at the very beginning of this blog post: I want to classify my customers into two groups according to their probability to churn: “high likelihood to churn” and “low likelihood to churn” For simplicity, I’ll assume no additional business constraints, but in real-world scenarios, constraints often exist. For this example, I am using this dataset (CC0: Public Domain). The data contains some attributes from a company’s customer base (e.g., age, sex, number of products, …) along with their churn status. Now that we have something to work on let’s build the benchmark: 1. Defining the metrics We are dealing with a churn use case, in particular, this is a binary classification problem. Thus the main metrics that we could use are: Precision — Percentage of correctly predicted churners among all predicted churners Recall — Percentage of actual churners correctly identified F1 score — Balances precision and recall True Positives, False Positives, True Negative and False Negatives These are some of the “simple” metrics that could be used to evaluate the output of a model. However, it is not an exhaustive list, standard metrics aren’t always enough. In many use cases, it might be useful to build custom metrics. Let’s assume that in our business case the customers labeled as “high likelihood to churn” are offered a discount. This creates: A cost ($250) when offering the discount to a non-churning customer A profit ($1000) when retaining a churning customer Following on this definition we can build a custom metric that will be crucial in our scenario: # Defining the business case-specific reference metric def financial_gain(y_true, y_pred): loss_from_fp = np.sum(np.logical_and(y_pred == 1, y_true == 0)) * 250 gain_from_tp = np.sum(np.logical_and(y_pred == 1, y_true == 1)) * 1000 return gain_from_tp - loss_from_fp When you are building business-driven metrics these are usually the most relevant. Such metrics could take any shape or form: Financial goals, minimum requirements, percentage of coverage and more. 2. Defining the benchmarks Now that we’ve defined our metrics, we can define a set of baseline models to be used as a reference. In this phase, you should define a list of simple-to-implement model in their simplest possible setup. There is no reason at this state to spend time and resources on the optimization of these models, my mindset is: If I had 15 minutes, how would I implement this model? In later phases of the model, you can add mode baseline models as the project proceeds. In this case, I will use the following models: Random Model — Assigns labels randomly Majority Model — Always predicts the most frequent class Simple XGB Simple KNN import numpy as np import xgboost as xgb from sklearn.neighbors import KNeighborsClassifier class BinaryMean(): @staticmethod def run_benchmark(df_train, df_test): np.random.seed(21) return np.random.choice(a=[1, 0], size=len(df_test), p=[df_train['y'].mean(), 1 - df_train['y'].mean()]) class SimpleXbg(): @staticmethod def run_benchmark(df_train, df_test): model = xgb.XGBClassifier() model.fit(df_train.select_dtypes(include=np.number).drop(columns='y'), df_train['y']) return model.predict(df_test.select_dtypes(include=np.number).drop(columns='y')) class MajorityClass(): @staticmethod def run_benchmark(df_train, df_test): majority_class = df_train['y'].mode()[0] return np.full(len(df_test), majority_class) class SimpleKNN(): @staticmethod def run_benchmark(df_train, df_test): model = KNeighborsClassifier() model.fit(df_train.select_dtypes(include=np.number).drop(columns='y'), df_train['y']) return model.predict(df_test.select_dtypes(include=np.number).drop(columns='y')) Again, as in the case of the metrics, we can build custom benchmarks. Let’s assume that in our business case the the marketing team contacts every client who’s: Over 50 y/o and That is not active anymore Following this rule we can build this model: # Defining the business case-specific benchmark class BusinessBenchmark(): @staticmethod def run_benchmark(df_train, df_test): df = df_test.copy() df.loc[:,'y_hat'] = 0 df.loc[(df['IsActiveMember'] == 0) & (df['Age'] >= 50), 'y_hat'] = 1 return df['y_hat'] Running the benchmark To run the benchmark I will use the following class. The entry point is the method compare_with_benchmark() that, given a prediction, runs all the models and calculates all the metrics. import numpy as np class ChurnBinaryBenchmark(): def __init__( self, metrics = [], benchmark_models = [], ): self.metrics = metrics self.benchmark_models = benchmark_models def compare_pred_with_benchmark( self, df_train, df_test, my_predictions, ): output_metrics = { 'Prediction': self._calculate_metrics(df_test['y'], my_predictions) } dct_benchmarks = {} for model in self.benchmark_models: dct_benchmarks[model.__name__] = model.run_benchmark(df_train = df_train, df_test = df_test) output_metrics[f'Benchmark - {model.__name__}'] = self._calculate_metrics(df_test['y'], dct_benchmarks[model.__name__]) return output_metrics def _calculate_metrics(self, y_true, y_pred): return {getattr(func, '__name__', 'Unknown') : func(y_true = y_true, y_pred = y_pred) for func in self.metrics} Now all we need is a prediction. For this example, I made a quick feature engineering and some hyperparameter tuning. The last step is just to run the benchmark: binary_benchmark = ChurnBinaryBenchmark( metrics=[f1_score, precision_score, recall_score, tp, tn, fp, fn, financial_gain], benchmark_models=[BinaryMean, SimpleXbg, MajorityClass, SimpleKNN, BusinessBenchmark] ) res = binary_benchmark.compare_pred_with_benchmark( df_train=df_train, df_test=df_test, my_predictions=preds, ) pd.DataFrame(res) Benchmark metrics comparison | Image by Author This generates a comparison table of all models across all metrics. Using this table, it is possible to draw concrete conclusions on the model’s predictions and make informed decisions on the following steps of the process. Some drawbacks As we’ve seen there are plenty of reasons why it is useful to have a benchmark. However, even though benchmarks are incredibly useful, there are some pitfalls to watch out for: Non-Informative Benchmark — When the metrics or models are poorly defined the marginal impact of having a benchmark decreases. Always define meaningful baselines. Misinterpretation by Stakeholders — Communication with the client is essential, it is important to state clearly what the metrics are measuring. The best model might not be the best on all the defined metrics. Overfitting to the Benchmark — You might end up trying to create features that are too specific, that might beat the benchmark, but do not generalize well in prediction. Don’t focus on beating the benchmark, but on creating the best solution possible to the problem. Change of Objective — Objectives defined might change, due to miscommunication or changes in plans. Keep your benchmark flexible so it can adapt when needed. Final thoughts Benchmarks provide clarity, ensure improvements are measurable, and create a shared reference point between data scientists and clients. They help avoid the trap of assuming a model is performing well without proof and ensure that every iteration brings real value. They also act as a communication tool, making it easier to explain progress to clients. Instead of just presenting numbers, you can show clear comparisons that highlight improvements. Here you can find a notebook with a full implementation from this blog post. The post How To Build a Benchmark for Your Models appeared first on Towards Data Science.

0 Комментарии ·0 Поделились ·0 предпросмотр

Войдите, чтобы отмечать, делиться и комментировать!
Towards Data Science @TowardsDataScience поделился ссылкой
2025-05-16 05:44:44 ·

Understanding Random Forest using Python (scikit-learn)

Decision trees are a popular supervised learning algorithm with benefits that include being able to be used for both regression and classification as well as being easy to interpret. However, decision trees aren’t the most performant algorithm and are prone to overfitting due to small variations in the training data. This can result in a completely different tree. This is why people often turn to ensemble models like Bagged Trees and Random Forests. These consist of multiple decision trees trained on bootstrapped data and aggregated to achieve better predictive performance than any single tree could offer. This tutorial includes the following:

What is Bagging

What Makes Random Forests Different

Training and Tuning a Random Forest using Scikit-Learn

Calculating and Interpreting Feature Importance

Visualizing Individual Decision Trees in a Random Forest

As always, the code used in this tutorial is available on my GitHub. A video version of this tutorial is also available on my YouTube channel for those who prefer to follow along visually. With that, let’s get started!

What is BaggingBootstrap + aggregating = Bagging. Image by Michael Galarnyk.

Random forests can be categorized as bagging algorithms. Bagging consists of two steps:

1.) Bootstrap sampling: Create multiple training sets by randomly drawing samples with replacement from the original dataset. These new training sets, called bootstrapped datasets, typically contain the same number of rows as the original dataset, but individual rows may appear multiple times or not at all. On average, each bootstrapped dataset contains about 63.2% of the unique rows from the original data. The remaining ~36.8% of rows are left out and can be used for out-of-bagevaluation. For more on this concept, see my sampling with and without replacement blog post.

2.) Aggregating predictions: Each bootstrapped dataset is used to train a different decision tree model. The final prediction is made by combining the outputs of all individual trees. For classification, this is typically done through majority voting. For regression, predictions are averaged.

Training each tree on a different bootstrapped sample introduces variation across trees. While this doesn’t fully eliminate correlation—especially when certain features dominate—it helps reduce overfitting when combined with aggregation. Averaging the predictions of many such trees reduces the overall variance of the ensemble, improving generalization.

What Makes Random Forests Different

In contrast to some other bagged trees algorithms, for each decision tree in random forests, only a subset of features is randomly selected at each decision node and the best split feature from the subset is used. Image by Michael Galarnyk.

Suppose there’s a single strong feature in your dataset. In bagged trees, each tree may repeatedly split on that feature, leading to correlated trees and less benefit from aggregation. Random Forests reduce this issue by introducing further randomness. Specifically, they change how splits are selected during training:

1). Create N bootstrapped datasets. Note that while bootstrapping is commonly used in Random Forests, it is not strictly necessary because step 2introduces sufficient diversity among the trees.

2). For each tree, at each node, a random subset of features is selected as candidates, and the best split is chosen from that subset. In scikit-learn, this is controlled by the max_features parameter, which defaults to 'sqrt' for classifiers and 1 for regressors.

3). Aggregating predictions: vote for classification and average for regression.

Note: Random Forests use sampling with replacement for bootstrapped datasets and sampling without replacement for selecting a subset of features.

Sampling with replacement procedure. Image by Michael Galarnyk

Out-of-BagScore

Because ~36.8% of training data is excluded from any given tree, you can use this holdout portion to evaluate that tree’s predictions. Scikit-learn allows this via the oob_score=True parameter, providing an efficient way to estimate generalization error. You’ll see this parameter used in the training example later in the tutorial.

Training and Tuning a Random Forest in Scikit-Learn

Random Forests remain a strong baseline for tabular data thanks to their simplicity, interpretability, and ability to parallelize since each tree is trained independently. This section demonstrates how to load data, perform a train test split, train a baseline model, tune hyperparameters using grid search, and evaluate the final model on the test set.

Step 1: Train a Baseline Model

Before tuning, it’s good practice to train a baseline model using reasonable defaults. This gives you an initial sense of performance and lets you validate generalization using the out-of-bagscore, which is built into bagging-based models like Random Forests. This example uses the House Sales in King County dataset, which contains property sales from the Seattle area between May 2014 and May 2015. This approach allows us to reserve the test set for final evaluation after tuning.

Python"># Import libraries

# Some imports are only used later in the tutorial
import matplotlib.pyplot as plt

import numpy as np

import pandas as pd

# Dataset: Breast Cancer Wisconsin# Source: UCI Machine Learning Repository
# License: CC BY 4.0
from sklearn.datasets import load_breast_cancer

from sklearn.ensemble import RandomForestClassifier

from sklearn.ensemble import RandomForestRegressor

from sklearn.inspection import permutation_importance

from sklearn.model_selection import GridSearchCV, train_test_split

from sklearn import tree

# Load dataset
# Dataset: House Sales in King County# License CC0 1.0 Universal
url = ';

df = pd.read_csvcolumns =df = df# Define features and target

X = df.dropy = df# Train/test split

X_train, X_test, y_train, y_test = train_test_split# Train baseline Random Forest

reg = RandomForestRegressorreg.fit# Evaluate baseline performance using OOB score

printStep 2: Tune Hyperparameters with Grid Search

While the baseline model gives a strong starting point, performance can often be improved by tuning key hyperparameters. Grid search cross-validation, as implemented by GridSearchCV, systematically explores combinations of hyperparameters and uses cross-validation to evaluate each one, selecting the configuration with the highest validation performance.The most commonly tuned hyperparameters include:

n_estimators: The number of decision trees in the forest. More trees can improve accuracy but increase training time.

max_features: The number of features to consider when looking for the best split. Lower values reduce correlation between trees.

max_depth: The maximum depth of each tree. Shallower trees are faster but may underfit.

min_samples_split: The minimum number of samples required to split an internal node. Higher values can reduce overfitting.

min_samples_leaf: The minimum number of samples required to be at a leaf node. Helps control tree size.

bootstrap: Whether bootstrap samples are used when building trees. If False, the whole dataset is used.

param_grid = {

    'n_estimators':,

    'max_features':,

    'max_depth':,

    'min_samples_split':,

    'min_samples_leaf':}

# Initialize model

rf = RandomForestRegressorgrid_search = GridSearchCVgrid_search.fitprintprintStep 3: Evaluate Final Model on Test Set

Now that we’ve selected the best-performing model based on cross-validation, we can evaluate it on the held-out test set to estimate its generalization performance.

# Evaluate final model on test set

best_model = grid_search.best_estimator_

print: {best_model.score:.3f}")

Calculating Random Forest Feature Importance

One of the key advantages of Random Forests is their interpretability — something that large language modelsoften lack. While LLMs are powerful, they typically function as black boxes and can exhibit biases that are difficult to identify. In contrast, scikit-learn supports two main methods for measuring feature importance in Random Forests: Mean Decrease in Impurity and Permutation Importance.

1). Mean Decrease in Impurity: Also known as Gini importance, this method calculates the total reduction in impurity brought by each feature across all trees. This is fast and built into the model via reg.feature_importances_. However, impurity-based feature importances can be misleading, especially for features with high cardinality, as these features are more likely to be chosen simply because they provide more potential split points.

importances = reg.feature_importances_

feature_names = X.columns

sorted_idx = np.argsortfor i in sorted_idx:

    print2). Permutation Importance: This method assesses the decrease in model performance when a single feature’s values are randomly shuffled. Unlike MDI, it accounts for feature interactions and correlation. It is more reliable but also more computationally expensive.

# Perform permutation importance on the test set

perm_importance = permutation_importancesorted_idx = perm_importance.importances_mean.argsortfor i in sorted_idx:

    printIt is important to note that our geographic features lat and long are also useful for visualization as the plot below shows. It’s likely that companies like Zillow leverage location information extensively in their valuation models.

Housing Price percentile for King County. Image by Michael Galarnyk.

Visualizing Individual Decision Trees in a Random Forest

A Random Forest consists of multiple decision trees—one for each estimator specified via the n_estimators parameter. After training the model, you can access these individual trees through the .estimators_ attribute. Visualizing a few of these trees can help illustrate how differently each one splits the data due to bootstrapped training samples and random feature selection at each split. While the earlier example used a RandomForestRegressor, here we demonstrate this visualization using a RandomForestClassifier trained on the Breast Cancer Wisconsin datasetto highlight Random Forests’ versatility for both regression and classification tasks. This short video demonstrates what 100 trained estimators from this dataset look like.

Fit a Random Forest Model using Scikit-Learn

# Load the Breast CancerDataset

data = load_breast_cancerdf = pd.DataFramedf= data.target

# Arrange Data into Features Matrix and Target Vector

X = df.locy = df.loc.values

# Split the data into training and testing sets

X_train, X_test, Y_train, Y_test = train_test_split# Random Forests in `scikit-learn`rf = RandomForestClassifierrf.fitPlotting Individual Estimatorsfrom a Random Forest using Matplotlib

You can now view all the individual trees from the fitted model.

rf.estimators_

You can now visualize individual trees. The code below visualizes the first decision tree.

fn=data.feature_names

cn=data.target_names

fig, axes = plt.subplots, dpi=800)

tree.plot_tree;

fig.savefigAlthough plotting many trees can be difficult to interpret, you may wish to explore the variety across estimators. The following example shows how to visualize the first five decision trees in the forest:

# This may not the best way to view each estimator as it is small

fig, axes = plt.subplots, dpi=3000)

for index in range:

    tree.plot_tree    axes.set_titlefig.savefigConclusion

Random forests consist of multiple decision trees trained on bootstrapped data in order to achieve better predictive performance than could be obtained from any of the individual decision trees. If you have questions or thoughts on the tutorial, feel free to reach out through YouTube or X.
The post Understanding Random Forest using Pythonappeared first on Towards Data Science.
#understanding #random #forest #using #python

Understanding Random Forest using Python (scikit-learn)
Decision trees are a popular supervised learning algorithm with benefits that include being able to be used for both regression and classification as well as being easy to interpret. However, decision trees aren’t the most performant algorithm and are prone to overfitting due to small variations in the training data. This can result in a completely different tree. This is why people often turn to ensemble models like Bagged Trees and Random Forests. These consist of multiple decision trees trained on bootstrapped data and aggregated to achieve better predictive performance than any single tree could offer. This tutorial includes the following: What is Bagging What Makes Random Forests Different Training and Tuning a Random Forest using Scikit-Learn Calculating and Interpreting Feature Importance Visualizing Individual Decision Trees in a Random Forest As always, the code used in this tutorial is available on my GitHub. A video version of this tutorial is also available on my YouTube channel for those who prefer to follow along visually. With that, let’s get started! What is BaggingBootstrap + aggregating = Bagging. Image by Michael Galarnyk. Random forests can be categorized as bagging algorithms. Bagging consists of two steps: 1.) Bootstrap sampling: Create multiple training sets by randomly drawing samples with replacement from the original dataset. These new training sets, called bootstrapped datasets, typically contain the same number of rows as the original dataset, but individual rows may appear multiple times or not at all. On average, each bootstrapped dataset contains about 63.2% of the unique rows from the original data. The remaining ~36.8% of rows are left out and can be used for out-of-bagevaluation. For more on this concept, see my sampling with and without replacement blog post. 2.) Aggregating predictions: Each bootstrapped dataset is used to train a different decision tree model. The final prediction is made by combining the outputs of all individual trees. For classification, this is typically done through majority voting. For regression, predictions are averaged. Training each tree on a different bootstrapped sample introduces variation across trees. While this doesn’t fully eliminate correlation—especially when certain features dominate—it helps reduce overfitting when combined with aggregation. Averaging the predictions of many such trees reduces the overall variance of the ensemble, improving generalization. What Makes Random Forests Different In contrast to some other bagged trees algorithms, for each decision tree in random forests, only a subset of features is randomly selected at each decision node and the best split feature from the subset is used. Image by Michael Galarnyk. Suppose there’s a single strong feature in your dataset. In bagged trees, each tree may repeatedly split on that feature, leading to correlated trees and less benefit from aggregation. Random Forests reduce this issue by introducing further randomness. Specifically, they change how splits are selected during training: 1). Create N bootstrapped datasets. Note that while bootstrapping is commonly used in Random Forests, it is not strictly necessary because step 2introduces sufficient diversity among the trees. 2). For each tree, at each node, a random subset of features is selected as candidates, and the best split is chosen from that subset. In scikit-learn, this is controlled by the max_features parameter, which defaults to 'sqrt' for classifiers and 1 for regressors. 3). Aggregating predictions: vote for classification and average for regression. Note: Random Forests use sampling with replacement for bootstrapped datasets and sampling without replacement for selecting a subset of features. Sampling with replacement procedure. Image by Michael Galarnyk Out-of-BagScore Because ~36.8% of training data is excluded from any given tree, you can use this holdout portion to evaluate that tree’s predictions. Scikit-learn allows this via the oob_score=True parameter, providing an efficient way to estimate generalization error. You’ll see this parameter used in the training example later in the tutorial. Training and Tuning a Random Forest in Scikit-Learn Random Forests remain a strong baseline for tabular data thanks to their simplicity, interpretability, and ability to parallelize since each tree is trained independently. This section demonstrates how to load data, perform a train test split, train a baseline model, tune hyperparameters using grid search, and evaluate the final model on the test set. Step 1: Train a Baseline Model Before tuning, it’s good practice to train a baseline model using reasonable defaults. This gives you an initial sense of performance and lets you validate generalization using the out-of-bagscore, which is built into bagging-based models like Random Forests. This example uses the House Sales in King County dataset, which contains property sales from the Seattle area between May 2014 and May 2015. This approach allows us to reserve the test set for final evaluation after tuning. Python"># Import libraries # Some imports are only used later in the tutorial import matplotlib.pyplot as plt import numpy as np import pandas as pd # Dataset: Breast Cancer Wisconsin# Source: UCI Machine Learning Repository # License: CC BY 4.0 from sklearn.datasets import load_breast_cancer from sklearn.ensemble import RandomForestClassifier from sklearn.ensemble import RandomForestRegressor from sklearn.inspection import permutation_importance from sklearn.model_selection import GridSearchCV, train_test_split from sklearn import tree # Load dataset # Dataset: House Sales in King County# License CC0 1.0 Universal url = '; df = pd.read_csvcolumns =df = df# Define features and target X = df.dropy = df# Train/test split X_train, X_test, y_train, y_test = train_test_split# Train baseline Random Forest reg = RandomForestRegressorreg.fit# Evaluate baseline performance using OOB score printStep 2: Tune Hyperparameters with Grid Search While the baseline model gives a strong starting point, performance can often be improved by tuning key hyperparameters. Grid search cross-validation, as implemented by GridSearchCV, systematically explores combinations of hyperparameters and uses cross-validation to evaluate each one, selecting the configuration with the highest validation performance.The most commonly tuned hyperparameters include: n_estimators: The number of decision trees in the forest. More trees can improve accuracy but increase training time. max_features: The number of features to consider when looking for the best split. Lower values reduce correlation between trees. max_depth: The maximum depth of each tree. Shallower trees are faster but may underfit. min_samples_split: The minimum number of samples required to split an internal node. Higher values can reduce overfitting. min_samples_leaf: The minimum number of samples required to be at a leaf node. Helps control tree size. bootstrap: Whether bootstrap samples are used when building trees. If False, the whole dataset is used. param_grid = {     'n_estimators':,     'max_features':,     'max_depth':,     'min_samples_split':,     'min_samples_leaf':} # Initialize model rf = RandomForestRegressorgrid_search = GridSearchCVgrid_search.fitprintprintStep 3: Evaluate Final Model on Test Set Now that we’ve selected the best-performing model based on cross-validation, we can evaluate it on the held-out test set to estimate its generalization performance. # Evaluate final model on test set best_model = grid_search.best_estimator_ print: {best_model.score:.3f}") Calculating Random Forest Feature Importance One of the key advantages of Random Forests is their interpretability — something that large language modelsoften lack. While LLMs are powerful, they typically function as black boxes and can exhibit biases that are difficult to identify. In contrast, scikit-learn supports two main methods for measuring feature importance in Random Forests: Mean Decrease in Impurity and Permutation Importance. 1). Mean Decrease in Impurity: Also known as Gini importance, this method calculates the total reduction in impurity brought by each feature across all trees. This is fast and built into the model via reg.feature_importances_. However, impurity-based feature importances can be misleading, especially for features with high cardinality, as these features are more likely to be chosen simply because they provide more potential split points. importances = reg.feature_importances_ feature_names = X.columns sorted_idx = np.argsortfor i in sorted_idx:     print2). Permutation Importance: This method assesses the decrease in model performance when a single feature’s values are randomly shuffled. Unlike MDI, it accounts for feature interactions and correlation. It is more reliable but also more computationally expensive. # Perform permutation importance on the test set perm_importance = permutation_importancesorted_idx = perm_importance.importances_mean.argsortfor i in sorted_idx:     printIt is important to note that our geographic features lat and long are also useful for visualization as the plot below shows. It’s likely that companies like Zillow leverage location information extensively in their valuation models. Housing Price percentile for King County. Image by Michael Galarnyk. Visualizing Individual Decision Trees in a Random Forest A Random Forest consists of multiple decision trees—one for each estimator specified via the n_estimators parameter. After training the model, you can access these individual trees through the .estimators_ attribute. Visualizing a few of these trees can help illustrate how differently each one splits the data due to bootstrapped training samples and random feature selection at each split. While the earlier example used a RandomForestRegressor, here we demonstrate this visualization using a RandomForestClassifier trained on the Breast Cancer Wisconsin datasetto highlight Random Forests’ versatility for both regression and classification tasks. This short video demonstrates what 100 trained estimators from this dataset look like. Fit a Random Forest Model using Scikit-Learn # Load the Breast CancerDataset data = load_breast_cancerdf = pd.DataFramedf= data.target # Arrange Data into Features Matrix and Target Vector X = df.locy = df.loc.values # Split the data into training and testing sets X_train, X_test, Y_train, Y_test = train_test_split# Random Forests in `scikit-learn`rf = RandomForestClassifierrf.fitPlotting Individual Estimatorsfrom a Random Forest using Matplotlib You can now view all the individual trees from the fitted model. rf.estimators_ You can now visualize individual trees. The code below visualizes the first decision tree. fn=data.feature_names cn=data.target_names fig, axes = plt.subplots, dpi=800) tree.plot_tree; fig.savefigAlthough plotting many trees can be difficult to interpret, you may wish to explore the variety across estimators. The following example shows how to visualize the first five decision trees in the forest: # This may not the best way to view each estimator as it is small fig, axes = plt.subplots, dpi=3000) for index in range:     tree.plot_tree    axes.set_titlefig.savefigConclusion Random forests consist of multiple decision trees trained on bootstrapped data in order to achieve better predictive performance than could be obtained from any of the individual decision trees. If you have questions or thoughts on the tutorial, feel free to reach out through YouTube or X. The post Understanding Random Forest using Pythonappeared first on Towards Data Science. #understanding #random #forest #using #python

Understanding Random Forest using Python (scikit-learn)

towardsdatascience.com
Decision trees are a popular supervised learning algorithm with benefits that include being able to be used for both regression and classification as well as being easy to interpret. However, decision trees aren’t the most performant algorithm and are prone to overfitting due to small variations in the training data. This can result in a completely different tree. This is why people often turn to ensemble models like Bagged Trees and Random Forests. These consist of multiple decision trees trained on bootstrapped data and aggregated to achieve better predictive performance than any single tree could offer. This tutorial includes the following: What is Bagging What Makes Random Forests Different Training and Tuning a Random Forest using Scikit-Learn Calculating and Interpreting Feature Importance Visualizing Individual Decision Trees in a Random Forest As always, the code used in this tutorial is available on my GitHub. A video version of this tutorial is also available on my YouTube channel for those who prefer to follow along visually. With that, let’s get started! What is Bagging (Bootstrap Aggregating) Bootstrap + aggregating = Bagging. Image by Michael Galarnyk. Random forests can be categorized as bagging algorithms (bootstrap aggregating). Bagging consists of two steps: 1.) Bootstrap sampling: Create multiple training sets by randomly drawing samples with replacement from the original dataset. These new training sets, called bootstrapped datasets, typically contain the same number of rows as the original dataset, but individual rows may appear multiple times or not at all. On average, each bootstrapped dataset contains about 63.2% of the unique rows from the original data. The remaining ~36.8% of rows are left out and can be used for out-of-bag (OOB) evaluation. For more on this concept, see my sampling with and without replacement blog post. 2.) Aggregating predictions: Each bootstrapped dataset is used to train a different decision tree model. The final prediction is made by combining the outputs of all individual trees. For classification, this is typically done through majority voting. For regression, predictions are averaged. Training each tree on a different bootstrapped sample introduces variation across trees. While this doesn’t fully eliminate correlation—especially when certain features dominate—it helps reduce overfitting when combined with aggregation. Averaging the predictions of many such trees reduces the overall variance of the ensemble, improving generalization. What Makes Random Forests Different In contrast to some other bagged trees algorithms, for each decision tree in random forests, only a subset of features is randomly selected at each decision node and the best split feature from the subset is used. Image by Michael Galarnyk. Suppose there’s a single strong feature in your dataset. In bagged trees, each tree may repeatedly split on that feature, leading to correlated trees and less benefit from aggregation. Random Forests reduce this issue by introducing further randomness. Specifically, they change how splits are selected during training: 1). Create N bootstrapped datasets. Note that while bootstrapping is commonly used in Random Forests, it is not strictly necessary because step 2 (random feature selection) introduces sufficient diversity among the trees. 2). For each tree, at each node, a random subset of features is selected as candidates, and the best split is chosen from that subset. In scikit-learn, this is controlled by the max_features parameter, which defaults to 'sqrt' for classifiers and 1 for regressors (equivalent to bagged trees). 3). Aggregating predictions: vote for classification and average for regression. Note: Random Forests use sampling with replacement for bootstrapped datasets and sampling without replacement for selecting a subset of features. Sampling with replacement procedure. Image by Michael Galarnyk Out-of-Bag (OOB) Score Because ~36.8% of training data is excluded from any given tree, you can use this holdout portion to evaluate that tree’s predictions. Scikit-learn allows this via the oob_score=True parameter, providing an efficient way to estimate generalization error. You’ll see this parameter used in the training example later in the tutorial. Training and Tuning a Random Forest in Scikit-Learn Random Forests remain a strong baseline for tabular data thanks to their simplicity, interpretability, and ability to parallelize since each tree is trained independently. This section demonstrates how to load data, perform a train test split, train a baseline model, tune hyperparameters using grid search, and evaluate the final model on the test set. Step 1: Train a Baseline Model Before tuning, it’s good practice to train a baseline model using reasonable defaults. This gives you an initial sense of performance and lets you validate generalization using the out-of-bag (OOB) score, which is built into bagging-based models like Random Forests. This example uses the House Sales in King County dataset (CCO 1.0 Universal License), which contains property sales from the Seattle area between May 2014 and May 2015. This approach allows us to reserve the test set for final evaluation after tuning. Python"># Import libraries # Some imports are only used later in the tutorial import matplotlib.pyplot as plt import numpy as np import pandas as pd # Dataset: Breast Cancer Wisconsin (Diagnostic) # Source: UCI Machine Learning Repository # License: CC BY 4.0 from sklearn.datasets import load_breast_cancer from sklearn.ensemble import RandomForestClassifier from sklearn.ensemble import RandomForestRegressor from sklearn.inspection import permutation_importance from sklearn.model_selection import GridSearchCV, train_test_split from sklearn import tree # Load dataset # Dataset: House Sales in King County (May 2014–May 2015) # License CC0 1.0 Universal url = 'https://raw.githubusercontent.com/mGalarnyk/Tutorial_Data/master/King_County/kingCountyHouseData.csv' df = pd.read_csv(url) columns = ['bedrooms',             'bathrooms',             'sqft_living',             'sqft_lot',              'floors',              'waterfront',              'view',              'condition',              'grade',              'sqft_above',              'sqft_basement',              'yr_built',              'yr_renovated',              'lat',              'long',              'sqft_living15',              'sqft_lot15',              'price'] df = df[columns] # Define features and target X = df.drop(columns='price') y = df['price'] # Train/test split X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0) # Train baseline Random Forest reg = RandomForestRegressor(     n_estimators=100, # number of trees     max_features=1/3, # fraction of features considered at each split     oob_score=True, # enables out-of-bag evaluation     random_state=0 ) reg.fit(X_train, y_train) # Evaluate baseline performance using OOB score print(f"Baseline OOB score: {reg.oob_score_:.3f}") Step 2: Tune Hyperparameters with Grid Search While the baseline model gives a strong starting point, performance can often be improved by tuning key hyperparameters. Grid search cross-validation, as implemented by GridSearchCV, systematically explores combinations of hyperparameters and uses cross-validation to evaluate each one, selecting the configuration with the highest validation performance.The most commonly tuned hyperparameters include: n_estimators: The number of decision trees in the forest. More trees can improve accuracy but increase training time. max_features: The number of features to consider when looking for the best split. Lower values reduce correlation between trees. max_depth: The maximum depth of each tree. Shallower trees are faster but may underfit. min_samples_split: The minimum number of samples required to split an internal node. Higher values can reduce overfitting. min_samples_leaf: The minimum number of samples required to be at a leaf node. Helps control tree size. bootstrap: Whether bootstrap samples are used when building trees. If False, the whole dataset is used. param_grid = {     'n_estimators': [100],     'max_features': ['sqrt', 'log2', None],     'max_depth': [None, 5, 10, 20],     'min_samples_split': [2, 5],     'min_samples_leaf': [1, 2] } # Initialize model rf = RandomForestRegressor(random_state=0, oob_score=True) grid_search = GridSearchCV(     estimator=rf,     param_grid=param_grid,     cv=5, # 5-fold cross-validation     scoring='r2', # evaluation metric     n_jobs=-1 # use all available CPU cores ) grid_search.fit(X_train, y_train) print(f"Best parameters: {grid_search.best_params_}") print(f"Best R^2 score: {grid_search.best_score_:.3f}") Step 3: Evaluate Final Model on Test Set Now that we’ve selected the best-performing model based on cross-validation, we can evaluate it on the held-out test set to estimate its generalization performance. # Evaluate final model on test set best_model = grid_search.best_estimator_ print(f"Test R^2 score (final model): {best_model.score(X_test, y_test):.3f}") Calculating Random Forest Feature Importance One of the key advantages of Random Forests is their interpretability — something that large language models (LLMs) often lack. While LLMs are powerful, they typically function as black boxes and can exhibit biases that are difficult to identify. In contrast, scikit-learn supports two main methods for measuring feature importance in Random Forests: Mean Decrease in Impurity and Permutation Importance. 1). Mean Decrease in Impurity (MDI): Also known as Gini importance, this method calculates the total reduction in impurity brought by each feature across all trees. This is fast and built into the model via reg.feature_importances_. However, impurity-based feature importances can be misleading, especially for features with high cardinality (many unique values), as these features are more likely to be chosen simply because they provide more potential split points. importances = reg.feature_importances_ feature_names = X.columns sorted_idx = np.argsort(importances)[::-1] for i in sorted_idx:     print(f"{feature_names[i]}: {importances[i]:.3f}") 2). Permutation Importance: This method assesses the decrease in model performance when a single feature’s values are randomly shuffled. Unlike MDI, it accounts for feature interactions and correlation. It is more reliable but also more computationally expensive. # Perform permutation importance on the test set perm_importance = permutation_importance(reg, X_test, y_test, n_repeats=10, random_state=0) sorted_idx = perm_importance.importances_mean.argsort()[::-1] for i in sorted_idx:     print(f"{X.columns[i]}: {perm_importance.importances_mean[i]:.3f}") It is important to note that our geographic features lat and long are also useful for visualization as the plot below shows. It’s likely that companies like Zillow leverage location information extensively in their valuation models. Housing Price percentile for King County. Image by Michael Galarnyk. Visualizing Individual Decision Trees in a Random Forest A Random Forest consists of multiple decision trees—one for each estimator specified via the n_estimators parameter. After training the model, you can access these individual trees through the .estimators_ attribute. Visualizing a few of these trees can help illustrate how differently each one splits the data due to bootstrapped training samples and random feature selection at each split. While the earlier example used a RandomForestRegressor, here we demonstrate this visualization using a RandomForestClassifier trained on the Breast Cancer Wisconsin dataset (CC BY 4.0 license) to highlight Random Forests’ versatility for both regression and classification tasks. This short video demonstrates what 100 trained estimators from this dataset look like. Fit a Random Forest Model using Scikit-Learn # Load the Breast Cancer (Diagnostic) Dataset data = load_breast_cancer() df = pd.DataFrame(data.data, columns=data.feature_names) df['target'] = data.target # Arrange Data into Features Matrix and Target Vector X = df.loc[:, df.columns != 'target'] y = df.loc[:, 'target'].values # Split the data into training and testing sets X_train, X_test, Y_train, Y_test = train_test_split(X, y, random_state=0) # Random Forests in `scikit-learn` (with N = 100) rf = RandomForestClassifier(n_estimators=100,                             random_state=0) rf.fit(X_train, Y_train) Plotting Individual Estimators (decision trees) from a Random Forest using Matplotlib You can now view all the individual trees from the fitted model. rf.estimators_ You can now visualize individual trees. The code below visualizes the first decision tree. fn=data.feature_names cn=data.target_names fig, axes = plt.subplots(nrows = 1,ncols = 1,figsize = (4,4), dpi=800) tree.plot_tree(rf.estimators_[0],                feature_names = fn,                class_names=cn,                filled = True); fig.savefig('rf_individualtree.png') Although plotting many trees can be difficult to interpret, you may wish to explore the variety across estimators. The following example shows how to visualize the first five decision trees in the forest: # This may not the best way to view each estimator as it is small fig, axes = plt.subplots(nrows=1, ncols=5, figsize=(10, 2), dpi=3000) for index in range(5):     tree.plot_tree(rf.estimators_[index],                    feature_names=fn,                    class_names=cn,                    filled=True,                    ax=axes[index])     axes[index].set_title(f'Estimator: {index}', fontsize=11) fig.savefig('rf_5trees.png') Conclusion Random forests consist of multiple decision trees trained on bootstrapped data in order to achieve better predictive performance than could be obtained from any of the individual decision trees. If you have questions or thoughts on the tutorial, feel free to reach out through YouTube or X. The post Understanding Random Forest using Python (scikit-learn) appeared first on Towards Data Science.

0 Комментарии ·0 Поделились ·0 предпросмотр

Войдите, чтобы отмечать, делиться и комментировать!
Towards Data Science @TowardsDataScience поделился ссылкой
2025-05-15 20:53:46 ·

How To Build a Benchmark for Your Models

I’ve been working as a data science consultant for the past three years, and I’ve had the opportunity to work on multiple projects across various industries. Yet, I noticed one common denominator among most of the clients I worked with:

They rarely have a clear idea of the project objective.

This is one of the main obstacles data scientists face, especially now that Gen AI is taking over every domain.

But let’s suppose that after some back and forth, the objective becomes clear. We managed to pin down a specific question to answer. For example:

I want to classify my customers into two groups according to their probability to churn: “high likelihood to churn” and “low likelihood to churn”

Well, now what? Easy, let’s start building some models!

Wrong!

If having a clear objective is rare, having a reliable benchmark is even rarer.

In my opinion, one of the most important steps in delivering a data science project is defining and agreeing on a set of benchmarks with the client.

In this blog post, I’ll explain:

What a benchmark is,

Why it is important to have a benchmark,

How I would build one using an example scenario and

Some potential drawbacks to keep in mind

What is a benchmark?

A benchmark is a standardized way to evaluate the performance of a model. It provides a reference point against which new models can be compared.

A benchmark needs two key components to be considered complete:

A set of metrics to evaluate the performance

A set of simple models to use as baselines

The concept at its core is simple: every time I develop a new model I compare it against both previous versions and the baseline models. This ensures improvements are real and tracked.

It is essential to understand that this baseline shouldn’t be model or dataset-specific, but rather business-case-specific. It should be a general benchmark for a given business case.

If I encounter a new dataset, with the same business objective, this benchmark should be a reliable reference point.

Why building a benchmark is important

Now that we’ve defined what a benchmark is, let’s dive into why I believe it’s worth spending an extra project week on the development of a strong benchmark.

Without a Benchmark you’re aiming for perfection — If you are working without a clear reference point any result will lose meaning. “My model has a MAE of 30.000” Is that good? IDK! Maybe with a simple mean you would get a MAE of 25.000. By comparing your model to a baseline, you can measure both performance and improvement.

Improves Communicating with Clients — Clients and business teams might not immediately understand the standard output of a model. However, by engaging them with simple baselines from the start, it becomes easier to demonstrate improvements later. In many cases benchmarks could come directly from the business in different shapes or forms.

Helps in Model Selection — A benchmark gives a starting point to compare multiple models fairly. Without it, you might waste time testing models that aren’t worth considering.

Model Drift Detection and Monitoring — Models can degrade over time. By having a benchmark you might be able to intercept drifts early by comparing new model outputs against past benchmarks and baselines.

Consistency Between Different Datasets — Datasets evolve. By having a fixed set of metrics and models you ensure that performance comparisons remain valid over time.

With a clear benchmark, every step in the model development will provide immediate feedback, making the whole process more intentional and data-driven.

How I would build a benchmark

I hope I’ve convinced you of the importance of having a benchmark. Now, let’s actually build one.

Let’s start from the business question we presented at the very beginning of this blog post:

I want to classify my customers into two groups according to their probability to churn: “high likelihood to churn” and “low likelihood to churn”

For simplicity, I’ll assume no additional business constraints, but in real-world scenarios, constraints often exist.

For this example, I am using this dataset . The data contains some attributes from a company’s customer basealong with their churn status.

Now that we have something to work on let’s build the benchmark:

1. Defining the metrics

We are dealing with a churn use case, in particular, this is a binary classification problem. Thus the main metrics that we could use are:

Precision — Percentage of correctly predicted churners among all predicted churners

Recall — Percentage of actual churners correctly identified

F1 score — Balances precision and recall

True Positives, False Positives, True Negative and False Negatives

These are some of the “simple” metrics that could be used to evaluate the output of a model.

However, it is not an exhaustive list, standard metrics aren’t always enough. In many use cases, it might be useful to build custom metrics.

Let’s assume that in our business case the customers labeled as “high likelihood to churn” are offered a discount. This creates:

A cost when offering the discount to a non-churning customer

A profit when retaining a churning customer

Following on this definition we can build a custom metric that will be crucial in our scenario:

# Defining the business case-specific reference metric
def financial_gain:
loss_from_fp = np.sum) * 250
gain_from_tp = np.sum) * 1000
return gain_from_tp - loss_from_fp

When you are building business-driven metrics these are usually the most relevant. Such metrics could take any shape or form: Financial goals, minimum requirements, percentage of coverage and more.

2. Defining the benchmarks

Now that we’ve defined our metrics, we can define a set of baseline models to be used as a reference.

In this phase, you should define a list of simple-to-implement model in their simplest possible setup. There is no reason at this state to spend time and resources on the optimization of these models, my mindset is:

If I had 15 minutes, how would I implement this model?

In later phases of the model, you can add mode baseline models as the project proceeds.

In this case, I will use the following models:

Random Model — Assigns labels randomly

Majority Model — Always predicts the most frequent class

Simple XGB

Simple KNN

import numpy as np
import xgboost as xgb
from sklearn.neighbors import KNeighborsClassifier

class BinaryMean:
@staticmethod
def run_benchmark:
np.random.seedreturn np.random.choice, p=.mean, 1 - df_train.mean])

class SimpleXbg:
@staticmethod
def run_benchmark:
model = xgb.XGBClassifiermodel.fit.drop, df_train)
return model.predict.drop)

class MajorityClass:
@staticmethod
def run_benchmark:
majority_class = df_train.modereturn np.full, majority_class)

class SimpleKNN:
@staticmethod
def run_benchmark:
model = KNeighborsClassifiermodel.fit.drop, df_train)
return model.predict.drop)

Again, as in the case of the metrics, we can build custom benchmarks.

Let’s assume that in our business case the the marketing team contacts every client who’s:

Over 50 y/o and

That is not active anymore

Following this rule we can build this model:

# Defining the business case-specific benchmark
class BusinessBenchmark:
@staticmethod
def run_benchmark:
df = df_test.copydf.loc= 0
df.loc= 1
return dfRunning the benchmark

To run the benchmark I will use the following class. The entry point is the method compare_with_benchmark that, given a prediction, runs all the models and calculates all the metrics.

import numpy as np

class ChurnBinaryBenchmark:
def __init__:
self.metrics = metrics
self.benchmark_models = benchmark_models

def compare_pred_with_benchmark:

output_metrics = {
'Prediction': self._calculate_metrics}
dct_benchmarks = {}

for model in self.benchmark_models:
dct_benchmarks= model.run_benchmarkoutput_metrics= self._calculate_metricsreturn output_metrics

def _calculate_metrics:
return {getattr: funcfor func in self.metrics}

Now all we need is a prediction. For this example, I made a quick feature engineering and some hyperparameter tuning.

The last step is just to run the benchmark:

binary_benchmark = ChurnBinaryBenchmarkres = binary_benchmark.compare_pred_with_benchmarkpd.DataFrameBenchmark metrics comparison | Image by Author

This generates a comparison table of all models across all metrics. Using this table, it is possible to draw concrete conclusions on the model’s predictions and make informed decisions on the following steps of the process.

Some drawbacks

As we’ve seen there are plenty of reasons why it is useful to have a benchmark. However, even though benchmarks are incredibly useful, there are some pitfalls to watch out for:

Non-Informative Benchmark — When the metrics or models are poorly defined the marginal impact of having a benchmark decreases. Always define meaningful baselines.

Misinterpretation by Stakeholders — Communication with the client is essential, it is important to state clearly what the metrics are measuring. The best model might not be the best on all the defined metrics.

Overfitting to the Benchmark — You might end up trying to create features that are too specific, that might beat the benchmark, but do not generalize well in prediction. Don’t focus on beating the benchmark, but on creating the best solution possible to the problem.

Change of Objective — Objectives defined might change, due to miscommunication or changes in plans. Keep your benchmark flexible so it can adapt when needed.

Final thoughts

Benchmarks provide clarity, ensure improvements are measurable, and create a shared reference point between data scientists and clients. They help avoid the trap of assuming a model is performing well without proof and ensure that every iteration brings real value.

They also act as a communication tool, making it easier to explain progress to clients. Instead of just presenting numbers, you can show clear comparisons that highlight improvements.

Here you can find a notebook with a full implementation from this blog post.
The post How To Build a Benchmark for Your Models appeared first on Towards Data Science.
#how #build #benchmark #your #models

How To Build a Benchmark for Your Models
I’ve been working as a data science consultant for the past three years, and I’ve had the opportunity to work on multiple projects across various industries. Yet, I noticed one common denominator among most of the clients I worked with: They rarely have a clear idea of the project objective. This is one of the main obstacles data scientists face, especially now that Gen AI is taking over every domain. But let’s suppose that after some back and forth, the objective becomes clear. We managed to pin down a specific question to answer. For example: I want to classify my customers into two groups according to their probability to churn: “high likelihood to churn” and “low likelihood to churn” Well, now what? Easy, let’s start building some models! Wrong! If having a clear objective is rare, having a reliable benchmark is even rarer. In my opinion, one of the most important steps in delivering a data science project is defining and agreeing on a set of benchmarks with the client. In this blog post, I’ll explain: What a benchmark is, Why it is important to have a benchmark, How I would build one using an example scenario and Some potential drawbacks to keep in mind What is a benchmark? A benchmark is a standardized way to evaluate the performance of a model. It provides a reference point against which new models can be compared. A benchmark needs two key components to be considered complete: A set of metrics to evaluate the performance A set of simple models to use as baselines The concept at its core is simple: every time I develop a new model I compare it against both previous versions and the baseline models. This ensures improvements are real and tracked. It is essential to understand that this baseline shouldn’t be model or dataset-specific, but rather business-case-specific. It should be a general benchmark for a given business case. If I encounter a new dataset, with the same business objective, this benchmark should be a reliable reference point. Why building a benchmark is important Now that we’ve defined what a benchmark is, let’s dive into why I believe it’s worth spending an extra project week on the development of a strong benchmark. Without a Benchmark you’re aiming for perfection — If you are working without a clear reference point any result will lose meaning. “My model has a MAE of 30.000” Is that good? IDK! Maybe with a simple mean you would get a MAE of 25.000. By comparing your model to a baseline, you can measure both performance and improvement. Improves Communicating with Clients — Clients and business teams might not immediately understand the standard output of a model. However, by engaging them with simple baselines from the start, it becomes easier to demonstrate improvements later. In many cases benchmarks could come directly from the business in different shapes or forms. Helps in Model Selection — A benchmark gives a starting point to compare multiple models fairly. Without it, you might waste time testing models that aren’t worth considering. Model Drift Detection and Monitoring — Models can degrade over time. By having a benchmark you might be able to intercept drifts early by comparing new model outputs against past benchmarks and baselines. Consistency Between Different Datasets — Datasets evolve. By having a fixed set of metrics and models you ensure that performance comparisons remain valid over time. With a clear benchmark, every step in the model development will provide immediate feedback, making the whole process more intentional and data-driven. How I would build a benchmark I hope I’ve convinced you of the importance of having a benchmark. Now, let’s actually build one. Let’s start from the business question we presented at the very beginning of this blog post: I want to classify my customers into two groups according to their probability to churn: “high likelihood to churn” and “low likelihood to churn” For simplicity, I’ll assume no additional business constraints, but in real-world scenarios, constraints often exist. For this example, I am using this dataset . The data contains some attributes from a company’s customer basealong with their churn status. Now that we have something to work on let’s build the benchmark: 1. Defining the metrics We are dealing with a churn use case, in particular, this is a binary classification problem. Thus the main metrics that we could use are: Precision — Percentage of correctly predicted churners among all predicted churners Recall — Percentage of actual churners correctly identified F1 score — Balances precision and recall True Positives, False Positives, True Negative and False Negatives These are some of the “simple” metrics that could be used to evaluate the output of a model. However, it is not an exhaustive list, standard metrics aren’t always enough. In many use cases, it might be useful to build custom metrics. Let’s assume that in our business case the customers labeled as “high likelihood to churn” are offered a discount. This creates: A cost when offering the discount to a non-churning customer A profit when retaining a churning customer Following on this definition we can build a custom metric that will be crucial in our scenario: # Defining the business case-specific reference metric def financial_gain: loss_from_fp = np.sum) * 250 gain_from_tp = np.sum) * 1000 return gain_from_tp - loss_from_fp When you are building business-driven metrics these are usually the most relevant. Such metrics could take any shape or form: Financial goals, minimum requirements, percentage of coverage and more. 2. Defining the benchmarks Now that we’ve defined our metrics, we can define a set of baseline models to be used as a reference. In this phase, you should define a list of simple-to-implement model in their simplest possible setup. There is no reason at this state to spend time and resources on the optimization of these models, my mindset is: If I had 15 minutes, how would I implement this model? In later phases of the model, you can add mode baseline models as the project proceeds. In this case, I will use the following models: Random Model — Assigns labels randomly Majority Model — Always predicts the most frequent class Simple XGB Simple KNN import numpy as np import xgboost as xgb from sklearn.neighbors import KNeighborsClassifier class BinaryMean: @staticmethod def run_benchmark: np.random.seedreturn np.random.choice, p=.mean, 1 - df_train.mean]) class SimpleXbg: @staticmethod def run_benchmark: model = xgb.XGBClassifiermodel.fit.drop, df_train) return model.predict.drop) class MajorityClass: @staticmethod def run_benchmark: majority_class = df_train.modereturn np.full, majority_class) class SimpleKNN: @staticmethod def run_benchmark: model = KNeighborsClassifiermodel.fit.drop, df_train) return model.predict.drop) Again, as in the case of the metrics, we can build custom benchmarks. Let’s assume that in our business case the the marketing team contacts every client who’s: Over 50 y/o and That is not active anymore Following this rule we can build this model: # Defining the business case-specific benchmark class BusinessBenchmark: @staticmethod def run_benchmark: df = df_test.copydf.loc= 0 df.loc= 1 return dfRunning the benchmark To run the benchmark I will use the following class. The entry point is the method compare_with_benchmark that, given a prediction, runs all the models and calculates all the metrics. import numpy as np class ChurnBinaryBenchmark: def __init__: self.metrics = metrics self.benchmark_models = benchmark_models def compare_pred_with_benchmark: output_metrics = { 'Prediction': self._calculate_metrics} dct_benchmarks = {} for model in self.benchmark_models: dct_benchmarks= model.run_benchmarkoutput_metrics= self._calculate_metricsreturn output_metrics def _calculate_metrics: return {getattr: funcfor func in self.metrics} Now all we need is a prediction. For this example, I made a quick feature engineering and some hyperparameter tuning. The last step is just to run the benchmark: binary_benchmark = ChurnBinaryBenchmarkres = binary_benchmark.compare_pred_with_benchmarkpd.DataFrameBenchmark metrics comparison | Image by Author This generates a comparison table of all models across all metrics. Using this table, it is possible to draw concrete conclusions on the model’s predictions and make informed decisions on the following steps of the process. Some drawbacks As we’ve seen there are plenty of reasons why it is useful to have a benchmark. However, even though benchmarks are incredibly useful, there are some pitfalls to watch out for: Non-Informative Benchmark — When the metrics or models are poorly defined the marginal impact of having a benchmark decreases. Always define meaningful baselines. Misinterpretation by Stakeholders — Communication with the client is essential, it is important to state clearly what the metrics are measuring. The best model might not be the best on all the defined metrics. Overfitting to the Benchmark — You might end up trying to create features that are too specific, that might beat the benchmark, but do not generalize well in prediction. Don’t focus on beating the benchmark, but on creating the best solution possible to the problem. Change of Objective — Objectives defined might change, due to miscommunication or changes in plans. Keep your benchmark flexible so it can adapt when needed. Final thoughts Benchmarks provide clarity, ensure improvements are measurable, and create a shared reference point between data scientists and clients. They help avoid the trap of assuming a model is performing well without proof and ensure that every iteration brings real value. They also act as a communication tool, making it easier to explain progress to clients. Instead of just presenting numbers, you can show clear comparisons that highlight improvements. Here you can find a notebook with a full implementation from this blog post. The post How To Build a Benchmark for Your Models appeared first on Towards Data Science. #how #build #benchmark #your #models

How To Build a Benchmark for Your Models

towardsdatascience.com
I’ve been working as a data science consultant for the past three years, and I’ve had the opportunity to work on multiple projects across various industries. Yet, I noticed one common denominator among most of the clients I worked with: They rarely have a clear idea of the project objective. This is one of the main obstacles data scientists face, especially now that Gen AI is taking over every domain. But let’s suppose that after some back and forth, the objective becomes clear. We managed to pin down a specific question to answer. For example: I want to classify my customers into two groups according to their probability to churn: “high likelihood to churn” and “low likelihood to churn” Well, now what? Easy, let’s start building some models! Wrong! If having a clear objective is rare, having a reliable benchmark is even rarer. In my opinion, one of the most important steps in delivering a data science project is defining and agreeing on a set of benchmarks with the client. In this blog post, I’ll explain: What a benchmark is, Why it is important to have a benchmark, How I would build one using an example scenario and Some potential drawbacks to keep in mind What is a benchmark? A benchmark is a standardized way to evaluate the performance of a model. It provides a reference point against which new models can be compared. A benchmark needs two key components to be considered complete: A set of metrics to evaluate the performance A set of simple models to use as baselines The concept at its core is simple: every time I develop a new model I compare it against both previous versions and the baseline models. This ensures improvements are real and tracked. It is essential to understand that this baseline shouldn’t be model or dataset-specific, but rather business-case-specific. It should be a general benchmark for a given business case. If I encounter a new dataset, with the same business objective, this benchmark should be a reliable reference point. Why building a benchmark is important Now that we’ve defined what a benchmark is, let’s dive into why I believe it’s worth spending an extra project week on the development of a strong benchmark. Without a Benchmark you’re aiming for perfection — If you are working without a clear reference point any result will lose meaning. “My model has a MAE of 30.000” Is that good? IDK! Maybe with a simple mean you would get a MAE of 25.000. By comparing your model to a baseline, you can measure both performance and improvement. Improves Communicating with Clients — Clients and business teams might not immediately understand the standard output of a model. However, by engaging them with simple baselines from the start, it becomes easier to demonstrate improvements later. In many cases benchmarks could come directly from the business in different shapes or forms. Helps in Model Selection — A benchmark gives a starting point to compare multiple models fairly. Without it, you might waste time testing models that aren’t worth considering. Model Drift Detection and Monitoring — Models can degrade over time. By having a benchmark you might be able to intercept drifts early by comparing new model outputs against past benchmarks and baselines. Consistency Between Different Datasets — Datasets evolve. By having a fixed set of metrics and models you ensure that performance comparisons remain valid over time. With a clear benchmark, every step in the model development will provide immediate feedback, making the whole process more intentional and data-driven. How I would build a benchmark I hope I’ve convinced you of the importance of having a benchmark. Now, let’s actually build one. Let’s start from the business question we presented at the very beginning of this blog post: I want to classify my customers into two groups according to their probability to churn: “high likelihood to churn” and “low likelihood to churn” For simplicity, I’ll assume no additional business constraints, but in real-world scenarios, constraints often exist. For this example, I am using this dataset (CC0: Public Domain). The data contains some attributes from a company’s customer base (e.g., age, sex, number of products, …) along with their churn status. Now that we have something to work on let’s build the benchmark: 1. Defining the metrics We are dealing with a churn use case, in particular, this is a binary classification problem. Thus the main metrics that we could use are: Precision — Percentage of correctly predicted churners among all predicted churners Recall — Percentage of actual churners correctly identified F1 score — Balances precision and recall True Positives, False Positives, True Negative and False Negatives These are some of the “simple” metrics that could be used to evaluate the output of a model. However, it is not an exhaustive list, standard metrics aren’t always enough. In many use cases, it might be useful to build custom metrics. Let’s assume that in our business case the customers labeled as “high likelihood to churn” are offered a discount. This creates: A cost ($250) when offering the discount to a non-churning customer A profit ($1000) when retaining a churning customer Following on this definition we can build a custom metric that will be crucial in our scenario: # Defining the business case-specific reference metric def financial_gain(y_true, y_pred): loss_from_fp = np.sum(np.logical_and(y_pred == 1, y_true == 0)) * 250 gain_from_tp = np.sum(np.logical_and(y_pred == 1, y_true == 1)) * 1000 return gain_from_tp - loss_from_fp When you are building business-driven metrics these are usually the most relevant. Such metrics could take any shape or form: Financial goals, minimum requirements, percentage of coverage and more. 2. Defining the benchmarks Now that we’ve defined our metrics, we can define a set of baseline models to be used as a reference. In this phase, you should define a list of simple-to-implement model in their simplest possible setup. There is no reason at this state to spend time and resources on the optimization of these models, my mindset is: If I had 15 minutes, how would I implement this model? In later phases of the model, you can add mode baseline models as the project proceeds. In this case, I will use the following models: Random Model — Assigns labels randomly Majority Model — Always predicts the most frequent class Simple XGB Simple KNN import numpy as np import xgboost as xgb from sklearn.neighbors import KNeighborsClassifier class BinaryMean(): @staticmethod def run_benchmark(df_train, df_test): np.random.seed(21) return np.random.choice(a=[1, 0], size=len(df_test), p=[df_train['y'].mean(), 1 - df_train['y'].mean()]) class SimpleXbg(): @staticmethod def run_benchmark(df_train, df_test): model = xgb.XGBClassifier() model.fit(df_train.select_dtypes(include=np.number).drop(columns='y'), df_train['y']) return model.predict(df_test.select_dtypes(include=np.number).drop(columns='y')) class MajorityClass(): @staticmethod def run_benchmark(df_train, df_test): majority_class = df_train['y'].mode()[0] return np.full(len(df_test), majority_class) class SimpleKNN(): @staticmethod def run_benchmark(df_train, df_test): model = KNeighborsClassifier() model.fit(df_train.select_dtypes(include=np.number).drop(columns='y'), df_train['y']) return model.predict(df_test.select_dtypes(include=np.number).drop(columns='y')) Again, as in the case of the metrics, we can build custom benchmarks. Let’s assume that in our business case the the marketing team contacts every client who’s: Over 50 y/o and That is not active anymore Following this rule we can build this model: # Defining the business case-specific benchmark class BusinessBenchmark(): @staticmethod def run_benchmark(df_train, df_test): df = df_test.copy() df.loc[:,'y_hat'] = 0 df.loc[(df['IsActiveMember'] == 0) & (df['Age'] >= 50), 'y_hat'] = 1 return df['y_hat'] Running the benchmark To run the benchmark I will use the following class. The entry point is the method compare_with_benchmark() that, given a prediction, runs all the models and calculates all the metrics. import numpy as np class ChurnBinaryBenchmark(): def __init__( self, metrics = [], benchmark_models = [], ): self.metrics = metrics self.benchmark_models = benchmark_models def compare_pred_with_benchmark( self, df_train, df_test, my_predictions, ): output_metrics = { 'Prediction': self._calculate_metrics(df_test['y'], my_predictions) } dct_benchmarks = {} for model in self.benchmark_models: dct_benchmarks[model.__name__] = model.run_benchmark(df_train = df_train, df_test = df_test) output_metrics[f'Benchmark - {model.__name__}'] = self._calculate_metrics(df_test['y'], dct_benchmarks[model.__name__]) return output_metrics def _calculate_metrics(self, y_true, y_pred): return {getattr(func, '__name__', 'Unknown') : func(y_true = y_true, y_pred = y_pred) for func in self.metrics} Now all we need is a prediction. For this example, I made a quick feature engineering and some hyperparameter tuning. The last step is just to run the benchmark: binary_benchmark = ChurnBinaryBenchmark( metrics=[f1_score, precision_score, recall_score, tp, tn, fp, fn, financial_gain], benchmark_models=[BinaryMean, SimpleXbg, MajorityClass, SimpleKNN, BusinessBenchmark] ) res = binary_benchmark.compare_pred_with_benchmark( df_train=df_train, df_test=df_test, my_predictions=preds, ) pd.DataFrame(res) Benchmark metrics comparison | Image by Author This generates a comparison table of all models across all metrics. Using this table, it is possible to draw concrete conclusions on the model’s predictions and make informed decisions on the following steps of the process. Some drawbacks As we’ve seen there are plenty of reasons why it is useful to have a benchmark. However, even though benchmarks are incredibly useful, there are some pitfalls to watch out for: Non-Informative Benchmark — When the metrics or models are poorly defined the marginal impact of having a benchmark decreases. Always define meaningful baselines. Misinterpretation by Stakeholders — Communication with the client is essential, it is important to state clearly what the metrics are measuring. The best model might not be the best on all the defined metrics. Overfitting to the Benchmark — You might end up trying to create features that are too specific, that might beat the benchmark, but do not generalize well in prediction. Don’t focus on beating the benchmark, but on creating the best solution possible to the problem. Change of Objective — Objectives defined might change, due to miscommunication or changes in plans. Keep your benchmark flexible so it can adapt when needed. Final thoughts Benchmarks provide clarity, ensure improvements are measurable, and create a shared reference point between data scientists and clients. They help avoid the trap of assuming a model is performing well without proof and ensure that every iteration brings real value. They also act as a communication tool, making it easier to explain progress to clients. Instead of just presenting numbers, you can show clear comparisons that highlight improvements. Here you can find a notebook with a full implementation from this blog post. The post How To Build a Benchmark for Your Models appeared first on Towards Data Science.

0 Комментарии ·0 Поделились ·0 предпросмотр

Войдите, чтобы отмечать, делиться и комментировать!

Обновить до Про