Understanding Random Forest using Python (scikit-learn)
Decision trees are a popular supervised learning algorithm: they can be used for both regression and classification, and they are easy to interpret. However, decision trees aren’t the most performant algorithm and are prone to overfitting, since small variations in the training data can produce a completely different tree. This is why people often turn to ensemble models like Bagged Trees and Random Forests. These consist of multiple decision trees trained on bootstrapped data and aggregated to achieve better predictive performance than any single tree could offer. This tutorial covers the following:
What is Bagging
What Makes Random Forests Different
Training and Tuning a Random Forest using Scikit-Learn
Calculating and Interpreting Feature Importance
Visualizing Individual Decision Trees in a Random Forest
As always, the code used in this tutorial is available on my GitHub. A video version of this tutorial is also available on my YouTube channel for those who prefer to follow along visually. With that, let’s get started!
What is Bagging
Bootstrap + aggregating = Bagging. Image by Michael Galarnyk.
Random forests can be categorized as bagging algorithms. Bagging consists of two steps:
1.) Bootstrap sampling: Create multiple training sets by randomly drawing samples with replacement from the original dataset. These new training sets, called bootstrapped datasets, typically contain the same number of rows as the original dataset, but individual rows may appear multiple times or not at all. On average, each bootstrapped dataset contains about 63.2% of the unique rows from the original data. The remaining ~36.8% of rows are left out and can be used for out-of-bag (OOB) evaluation. For more on this concept, see my sampling with and without replacement blog post.
2.) Aggregating predictions: Each bootstrapped dataset is used to train a different decision tree model. The final prediction is made by combining the outputs of all individual trees. For classification, this is typically done through majority voting. For regression, predictions are averaged.
Training each tree on a different bootstrapped sample introduces variation across trees. While this doesn’t fully eliminate correlation—especially when certain features dominate—it helps reduce overfitting when combined with aggregation. Averaging the predictions of many such trees reduces the overall variance of the ensemble, improving generalization.
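To make these two steps concrete, here is a minimal, hand-rolled sketch of bagging built from scikit-learn decision trees. The synthetic dataset, number of trees, and random seeds are illustrative assumptions rather than values from this tutorial; the sketch also checks the ~63.2% unique-row figure empirically.

# A minimal bagging sketch (synthetic data; values are illustrative assumptions)
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
rng = np.random.default_rng(0)
n_trees = 25
trees, unique_fractions = [], []

for _ in range(n_trees):
    # Step 1: bootstrap sampling -- draw n rows with replacement
    idx = rng.integers(0, len(X), size=len(X))
    unique_fractions.append(len(np.unique(idx)) / len(X))
    trees.append(DecisionTreeRegressor(random_state=0).fit(X[idx], y[idx]))

# Each bootstrap keeps ~63.2% of unique rows: P(row never drawn) = (1 - 1/n)^n -> e^-1 ~ 0.368
print(f"Mean fraction of unique rows per bootstrap: {np.mean(unique_fractions):.3f}")

# Step 2: aggregating -- average the trees' predictions (majority vote for classification)
bagged_pred = np.mean([t.predict(X) for t in trees], axis=0)

Scikit-learn packages this same pattern as BaggingRegressor and BaggingClassifier, so in practice you rarely need to write the loop yourself.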
What Makes Random Forests Different
In contrast to some other bagged trees algorithms, for each decision tree in random forests, only a subset of features is randomly selected at each decision node and the best split feature from the subset is used. Image by Michael Galarnyk.
Suppose there’s a single strong feature in your dataset. In bagged trees, each tree may repeatedly split on that feature, leading to correlated trees and less benefit from aggregation. Random Forests reduce this issue by introducing further randomness. Specifically, they change how splits are selected during training:
1). Create N bootstrapped datasets. Note that while bootstrapping is commonly used in Random Forests, it is not strictly necessary because step 2 introduces sufficient diversity among the trees.
2). For each tree, at each node, a random subset of features is selected as candidates, and the best split is chosen from that subset. In scikit-learn, this is controlled by the max_features parameter, which defaults to 'sqrt' for classifiers and 1.0 (i.e., all features) for regressors. A short sketch below compares the two behaviors.
3). Aggregating predictions: vote for classification and average for regression.
Note: Random Forests use sampling with replacement for bootstrapped datasets and sampling without replacement for selecting a subset of features.
Sampling with replacement procedure. Image by Michael Galarnyk
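To see the effect of per-split feature subsetting in code, the sketch below compares max_features=None (every split considers all features, i.e., plain bagged trees) with the classifier default max_features='sqrt' on a synthetic classification problem. The dataset, tree count, and seeds are assumptions made purely for illustration.

# Illustrative comparison of bagged trees vs. a random forest (synthetic data, assumed values)
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5, random_state=0)

# max_features=None: every split considers all features (behaves like plain bagged trees)
bagged = RandomForestClassifier(n_estimators=200, max_features=None, random_state=0)

# max_features='sqrt': each split considers a random subset of features
forest = RandomForestClassifier(n_estimators=200, max_features='sqrt', random_state=0)

print("Bagged trees CV accuracy:", cross_val_score(bagged, X, y, cv=5).mean())
print("Random forest CV accuracy:", cross_val_score(forest, X, y, cv=5).mean())

Results will vary by dataset; the point is that max_features is the knob that separates a Random Forest from plain bagged trees.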
Out-of-Bag (OOB) Score
Because ~36.8% of training data is excluded from any given tree, you can use this holdout portion to evaluate that tree’s predictions. Scikit-learn allows this via the oob_score=True parameter, providing an efficient way to estimate generalization error. You’ll see this parameter used in the training example later in the tutorial.
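As a minimal standalone illustration before the full example, the sketch below fits a classifier with oob_score=True on the Breast Cancer Wisconsin dataset (also used later in this tutorial); the number of trees and the seed are arbitrary choices.

# Minimal OOB illustration (tree count and seed are arbitrary)
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
clf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
clf.fit(X, y)

# oob_score_ is the accuracy computed only from samples each tree never saw during training
print(f"OOB accuracy: {clf.oob_score_:.3f}")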
Training and Tuning a Random Forest in Scikit-Learn
Random Forests remain a strong baseline for tabular data thanks to their simplicity, interpretability, and ability to parallelize since each tree is trained independently. This section demonstrates how to load data, perform a train test split, train a baseline model, tune hyperparameters using grid search, and evaluate the final model on the test set.
Step 1: Train a Baseline Model
Before tuning, it’s good practice to train a baseline model using reasonable defaults. This gives you an initial sense of performance and lets you validate generalization using the out-of-bag (OOB) score, which is built into bagging-based models like Random Forests. This example uses the House Sales in King County dataset, which contains property sales from the Seattle area between May 2014 and May 2015. This approach allows us to reserve the test set for final evaluation after tuning.
# Import libraries
# Some imports are only used later in the tutorial
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Dataset: Breast Cancer Wisconsin
# Source: UCI Machine Learning Repository
# License: CC BY 4.0
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn import tree

# Load dataset
# Dataset: House Sales in King County
# License: CC0 1.0 Universal
url = '...'  # path or URL to the King County house sales CSV
df = pd.read_csv(url)
columns = [...]  # feature columns plus the target column 'price'
df = df[columns]

# Define features and target
X = df.drop(columns=['price'])
y = df['price']

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Train baseline Random Forest
reg = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=0)
reg.fit(X_train, y_train)

# Evaluate baseline performance using OOB score
print(f"Baseline OOB score: {reg.oob_score_:.3f}")

Step 2: Tune Hyperparameters with Grid Search
While the baseline model gives a strong starting point, performance can often be improved by tuning key hyperparameters. Grid search cross-validation, as implemented by GridSearchCV, systematically explores combinations of hyperparameters and uses cross-validation to evaluate each one, selecting the configuration with the highest validation performance. The most commonly tuned hyperparameters include:
n_estimators: The number of decision trees in the forest. More trees can improve accuracy but increase training time.
max_features: The number of features to consider when looking for the best split. Lower values reduce correlation between trees.
max_depth: The maximum depth of each tree. Shallower trees are faster but may underfit.
min_samples_split: The minimum number of samples required to split an internal node. Higher values can reduce overfitting.
min_samples_leaf: The minimum number of samples required to be at a leaf node. Helps control tree size.
bootstrap: Whether bootstrap samples are used when building trees. If False, the whole dataset is used.
# Hyperparameter grid (representative candidate values; adjust for your dataset)
param_grid = {
    'n_estimators': [100, 200, 500],
    'max_features': ['sqrt', 'log2', None],
    'max_depth': [None, 5, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]}

# Initialize model and run grid search with cross-validation
rf = RandomForestRegressor(random_state=0)
grid_search = GridSearchCV(rf, param_grid, cv=3, scoring='r2', n_jobs=-1)
grid_search.fit(X_train, y_train)
print("Best hyperparameters:", grid_search.best_params_)
print("Best cross-validated R^2:", grid_search.best_score_)

Step 3: Evaluate Final Model on Test Set
Now that we’ve selected the best-performing model based on cross-validation, we can evaluate it on the held-out test set to estimate its generalization performance.
# Evaluate final model on test set
best_model = grid_search.best_estimator_
print(f"Test R^2 score: {best_model.score(X_test, y_test):.3f}")
Calculating Random Forest Feature Importance
One of the key advantages of Random Forests is their interpretability — something that large language models (LLMs) often lack. While LLMs are powerful, they typically function as black boxes and can exhibit biases that are difficult to identify. In contrast, scikit-learn supports two main methods for measuring feature importance in Random Forests: Mean Decrease in Impurity and Permutation Importance.
1). Mean Decrease in Impurity: Also known as Gini importance, this method calculates the total reduction in impurity brought by each feature across all trees. This is fast and built into the model via reg.feature_importances_. However, impurity-based feature importances can be misleading, especially for features with high cardinality, as these features are more likely to be chosen simply because they provide more potential split points.
importances = reg.feature_importances_
feature_names = X.columns
sorted_idx = np.argsort(importances)[::-1]  # most important features first
for i in sorted_idx:
    print(f"{feature_names[i]}: {importances[i]:.3f}")

2). Permutation Importance: This method assesses the decrease in model performance when a single feature’s values are randomly shuffled. Unlike MDI, it accounts for feature interactions and correlation. It is more reliable but also more computationally expensive.
# Perform permutation importance on the test set
perm_importance = permutation_importance(reg, X_test, y_test, n_repeats=10, random_state=0)
sorted_idx = perm_importance.importances_mean.argsort()[::-1]
for i in sorted_idx:
    print(f"{feature_names[i]}: {perm_importance.importances_mean[i]:.3f}")

It is important to note that the geographic features lat and long are also useful for visualization, as the plot below shows. It’s likely that companies like Zillow leverage location information extensively in their valuation models.
Housing Price percentile for King County. Image by Michael Galarnyk.
Visualizing Individual Decision Trees in a Random Forest
A Random Forest consists of multiple decision trees—one for each estimator specified via the n_estimators parameter. After training the model, you can access these individual trees through the .estimators_ attribute. Visualizing a few of these trees can help illustrate how differently each one splits the data due to bootstrapped training samples and random feature selection at each split. While the earlier example used a RandomForestRegressor, here we demonstrate this visualization using a RandomForestClassifier trained on the Breast Cancer Wisconsin dataset to highlight Random Forests’ versatility for both regression and classification tasks. This short video demonstrates what 100 trained estimators from this dataset look like.
Fit a Random Forest Model using Scikit-Learn
# Load the Breast Cancer Dataset
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Arrange Data into Features Matrix and Target Vector
X = df.loc[:, df.columns != 'target']
y = df.loc[:, 'target'].values

# Split the data into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, y, random_state=0)

# Random Forests in `scikit-learn`
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, Y_train)

Plotting Individual Estimators from a Random Forest using Matplotlib
You can now view all the individual trees from the fitted model.
rf.estimators_
You can now visualize individual trees. The code below visualizes the first decision tree.
fn=data.feature_names
cn=data.target_names
fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(4, 4), dpi=800)
tree.plot_tree(rf.estimators_[0], feature_names=fn, class_names=cn, filled=True);
fig.savefig('rf_individualtree.png')

Although plotting many trees can be difficult to interpret, you may wish to explore the variety across estimators. The following example shows how to visualize the first five decision trees in the forest:
# This may not be the best way to view each estimator, as each plot is small
fig, axes = plt.subplots(nrows=1, ncols=5, figsize=(10, 2), dpi=3000)
for index in range(5):
    tree.plot_tree(rf.estimators_[index], feature_names=fn, class_names=cn, filled=True, ax=axes[index])
    axes[index].set_title('Estimator: ' + str(index), fontsize=11)
fig.savefig('rf_5trees.png')

Conclusion
Random forests consist of multiple decision trees trained on bootstrapped data in order to achieve better predictive performance than could be obtained from any of the individual decision trees. If you have questions or thoughts on the tutorial, feel free to reach out through YouTube or X.