How to Set the Number of Trees in Random Forest
Scientific publication
T. M. Lange, M. Gültas, A. O. Schmitt & F. Heinrich. optRF: Optimising random forest stability by determining the optimal number of trees. BMC bioinformatics, 26, 95.Follow this LINK to the original publication.
Random Forest — A Powerful Tool for Anyone Working With Data
What is Random Forest?
Have you ever wished you could make better decisions using data — like predicting the risk of diseases, crop yields, or spotting patterns in customer behavior? That’s where machine learning comes in and one of the most accessible and powerful tools in this field is something called Random Forest.
So why is random forest so popular? For one, it’s incredibly flexible. It works well with many types of data whether numbers, categories, or both. It’s also widely used in many fields — from predicting patient outcomes in healthcare to detecting fraud in finance, from improving shopping experiences online to optimising agricultural practices.
Despite the name, random forest has nothing to do with trees in a forest — but it does use something called Decision Trees to make smart predictions. You can think of a decision tree as a flowchart that guides a series of yes/no questions based on the data you give it. A random forest creates a whole bunch of these trees, each slightly different, and then combines their results to make one final decision. It’s a bit like asking a group of experts for their opinion and then going with the majority vote.
But until recently, one question was unanswered: How many decision trees do I actually need? If each decision tree can lead to different results, averaging many trees would lead to better and more reliable results. But how many are enough? Luckily, the optRF package answers this question!
So let’s have a look at how to optimise Random Forest for predictions and variable selection!
Making Predictions with Random Forests
To optimise and to use random forest for making predictions, we can use the open-source statistics programme R. Once we open R, we have to install the two R packages “ranger” which allows to use random forests in R and “optRF” to optimise random forests. Both packages are open-source and available via the official R repository CRAN. In order to install and load these packages, the following lines of R code can be run:
> install.packages> install.packages> library> libraryNow that the packages are installed and loaded into the library, we can use the functions that these packages contain. Furthermore, we can also use the data set included in the optRF package which is free to use under the GPL license. This data set called SNPdata contains in the first column the yield of 250 wheat plants as well as 5000 genomic markersthat can contain either the value 0 or 2.
> SNPdataYield SNP_0001 SNP_0002 SNP_0003 SNP_0004
ID_001 670.7588 0 0 0 0
ID_002 542.5611 0 2 0 0
ID_003 591.6631 2 2 0 2
ID_004 476.3727 0 0 0 0
ID_005 635.9814 2 2 0 2
This data set is an example for genomic data and can be used for genomic prediction which is a very important tool for breeding high-yielding crops and, thus, to fight world hunger. The idea is to predict the yield of crops using genomic markers. And exactly for this purpose, random forest can be used! That means that a random forest model is used to describe the relationship between the yield and the genomic markers. Afterwards, we can predict the yield of wheat plants where we only have genomic markers.
Therefore, let’s imagine that we have 200 wheat plants where we know the yield and the genomic markers. This is the so-called training data set. Let’s further assume that we have 50 wheat plants where we know the genomic markers but not their yield. This is the so-called test data set. Thus, we separate the data frame SNPdata so that the first 200 rows are saved as training and the last 50 rows without their yield are saved as test data:
> Training = SNPdata> Test = SNPdataWith these data sets, we can now have a look at how to make predictions using random forests!
First, we got to calculate the optimal number of trees for random forest. Since we want to make predictions, we use the function opt_prediction from the optRF package. Into this function we have to insert the response from the training data set, the predictors from the training data set, and the predictors from the test data set. Before we run this function, we can use the set.seed function to ensure reproducibility even though this is not necessary:
> set.seed> optRF_result = opt_predictionRecommended number of trees: 19000
All the results from the opt_prediction function are now saved in the object optRF_result, however, the most important information was already printed in the console: For this data set, we should use 19,000 trees.
With this information, we can now use random forest to make predictions. Therefore, we use the ranger function to derive a random forest model that describes the relationship between the genomic markers and the yield in the training data set. Also here, we have to insert the response in the y argument and the predictors in the x argument. Furthermore, we can set the write.forest argument to be TRUE and we can insert the optimal number of trees in the num.trees argument:
> RF_model = rangerAnd that’s it! The object RF_model contains the random forest model that describes the relationship between the genomic markers and the yield. With this model, we can now predict the yield for the 50 plants in the test data set where we have the genomic markers but we don’t know the yield:
> predictions = predict$predictions
> predicted_Test = data.frame, predicted_yield = predictions)
The data frame predicted_Test now contains the IDs of the wheat plants together with their predicted yield:
> headID predicted_yield
ID_201 593.6063
ID_202 596.8615
ID_203 591.3695
ID_204 589.3909
ID_205 599.5155
ID_206 608.1031
Variable Selection with Random Forests
A different approach to analysing such a data set would be to find out which variables are most important to predict the response. In this case, the question would be which genomic markers are most important to predict the yield. Also this can be done with random forests!
If we tackle such a task, we don’t need a training and a test data set. We can simply use the entire data set SNPdata and see which of the variables are the most important ones. But before we do that, we should again determine the optimal number of trees using the optRF package. Since we are insterested in calculating the variable importance, we use the function opt_importance:
> set.seed> optRF_result = opt_importanceRecommended number of trees: 40000
One can see that the optimal number of trees is now higher than it was for predictions. This is actually often the case. However, with this number of trees, we can now use the ranger function to calculate the importance of the variables. Therefore, we use the ranger function as before but we change the number of trees in the num.trees argument to 40,000 and we set the importance argument to “permutation”.
> set.seed> RF_model = ranger> D_VI = data.frame,
+ importance = RF_model$variable.importance)
> D_VI = D_VIThe data frame D_VI now contains all the variables, thus, all the genomic markers, and next to it, their importance. Also, we have directly ordered this data frame so that the most important markers are on the top and the least important markers are at the bottom of this data frame. Which means that we can have a look at the most important variables using the head function:
> headvariable importance
SNP_0020 45.75302
SNP_0004 38.65594
SNP_0019 36.81254
SNP_0050 34.56292
SNP_0033 30.47347
SNP_0043 28.54312
And that’s it! We have used random forest to make predictions and to estimate the most important variables in a data set. Furthermore, we have optimised random forest using the optRF package!
Why Do We Need Optimisation?
Now that we’ve seen how easy it is to use random forest and how quickly it can be optimised, it’s time to take a closer look at what’s happening behind the scenes. Specifically, we’ll explore how random forest works and why the results might change from one run to another.
To do this, we’ll use random forest to calculate the importance of each genomic marker but instead of optimising the number of trees beforehand, we’ll stick with the default settings in the ranger function. By default, ranger uses 500 decision trees. Let’s try it out:
> set.seed> RF_model = ranger> D_VI = data.frame,
+ importance = RF_model$variable.importance)
> D_VI = D_VI> headvariable importance
SNP_0020 80.22909
SNP_0019 60.37387
SNP_0043 50.52367
SNP_0005 43.47999
SNP_0034 38.52494
SNP_0015 34.88654
As expected, everything runs smoothly — and quickly! In fact, this run was significantly faster than when we previously used 40,000 trees. But what happens if we run the exact same code again but this time with a different seed?
> set.seed> RF_model2 = ranger> D_VI2 = data.frame,
+ importance = RF_model2$variable.importance)
> D_VI2 = D_VI2> headvariable importance
SNP_0050 60.64051
SNP_0043 58.59175
SNP_0033 52.15701
SNP_0020 51.10561
SNP_0015 34.86162
SNP_0019 34.21317
Once again, everything appears to work fine but take a closer look at the results. In the first run, SNP_0020 had the highest importance score at 80.23, but in the second run, SNP_0050 takes the top spot and SNP_0020 drops to the fourth place with a much lower importance score of 51.11. That’s a significant shift! So what changed?
The answer lies in something called non-determinism. Random forest, as the name suggests, involves a lot of randomness: it randomly selects data samples and subsets of variables at various points during training. This randomness helps prevent overfitting but it also means that results can vary slightly each time you run the algorithm — even with the exact same data set. That’s where the set.seedfunction comes in. It acts like a bookmark in a shuffled deck of cards. By setting the same seed, you ensure that the random choices made by the algorithm follow the same sequence every time you run the code. But when you change the seed, you’re effectively changing the random path the algorithm follows. That’s why, in our example, the most important genomic markers came out differently in each run. This behavior — where the same process can yield different results due to internal randomness — is a classic example of non-determinism in machine learning.
Taming the Randomness in Random Forests
As we just saw, random forest models can produce slightly different results every time you run them even when using the same data due to the algorithm’s built-in randomness. So, how can we reduce this randomness and make our results more stable?
One of the simplest and most effective ways is to increase the number of trees. Each tree in a random forest is trained on a random subset of the data and variables, so the more trees we add, the better the model can “average out” the noise caused by individual trees. Think of it like asking 10 people for their opinion versus asking 1,000 — you’re more likely to get a reliable answer from the larger group.
With more trees, the model’s predictions and variable importance rankings tend to become more stable and reproducible even without setting a specific seed. In other words, adding more trees helps to tame the randomness. However, there’s a catch. More trees also mean more computation time. Training a random forest with 500 trees might take a few seconds but training one with 40,000 trees could take several minutes or more, depending on the size of your data set and your computer’s performance.
However, the relationship between the stability and the computation time of random forest is non-linear. While going from 500 to 1,000 trees can significantly improve stability, going from 5,000 to 10,000 trees might only provide a tiny improvement in stability while doubling the computation time. At some point, you hit a plateau where adding more trees gives diminishing returns — you pay more in computation time but gain very little in stability. That’s why it’s essential to find the right balance: Enough trees to ensure stable results but not so many that your analysis becomes unnecessarily slow.
And this is exactly what the optRF package does: it analyses the relationship between the stability and the number of trees in random forests and uses this relationship to determine the optimal number of trees that leads to stable results and beyond which adding more trees would unnecessarily increase the computation time.
Above, we have already used the opt_importance function and saved the results as optRF_result. This object contains the information about the optimal number of trees but it also contains information about the relationship between the stability and the number of trees. Using the plot_stability function, we can visualise this relationship. Therefore, we have to insert the name of the optRF object, which measure we are interested in, the interval we want to visualise on the X axis, and if the recommended number of trees should be added:
> plot_stabilityThe output of the plot_stability function visualises the stability of random forest depending on the number of decision trees
This plot clearly shows the non-linear relationship between stability and the number of trees. With 500 trees, random forest only leads to a stability of around 0.2 which explains why the results changed drastically when repeating random forest after setting a different seed. With the recommended 40,000 trees, however, the stability is near 1. Adding more than 40,000 trees would get the stability further to 1 but this increase would be only very small while the computation time would further increase. That is why 40,000 trees indicate the optimal number of trees for this data set.
The Takeaway: Optimise Random Forest to Get the Most of It
Random forest is a powerful ally for anyone working with data — whether you’re a researcher, analyst, student, or data scientist. It’s easy to use, remarkably flexible, and highly effective across a wide range of applications. But like any tool, using it well means understanding what’s happening under the hood. In this post, we’ve uncovered one of its hidden quirks: The randomness that makes it strong can also make it unstable if not carefully managed. Fortunately, with the optRF package, we can strike the perfect balance between stability and performance, ensuring we get reliable results without wasting computational resources. Whether you’re working in genomics, medicine, economics, agriculture, or any other data-rich field, mastering this balance will help you make smarter, more confident decisions based on your data.
The post How to Set the Number of Trees in Random Forest appeared first on Towards Data Science.
#how #set #number #trees #random
How to Set the Number of Trees in Random Forest
Scientific publication
T. M. Lange, M. Gültas, A. O. Schmitt & F. Heinrich. optRF: Optimising random forest stability by determining the optimal number of trees. BMC bioinformatics, 26, 95.Follow this LINK to the original publication.
Random Forest — A Powerful Tool for Anyone Working With Data
What is Random Forest?
Have you ever wished you could make better decisions using data — like predicting the risk of diseases, crop yields, or spotting patterns in customer behavior? That’s where machine learning comes in and one of the most accessible and powerful tools in this field is something called Random Forest.
So why is random forest so popular? For one, it’s incredibly flexible. It works well with many types of data whether numbers, categories, or both. It’s also widely used in many fields — from predicting patient outcomes in healthcare to detecting fraud in finance, from improving shopping experiences online to optimising agricultural practices.
Despite the name, random forest has nothing to do with trees in a forest — but it does use something called Decision Trees to make smart predictions. You can think of a decision tree as a flowchart that guides a series of yes/no questions based on the data you give it. A random forest creates a whole bunch of these trees, each slightly different, and then combines their results to make one final decision. It’s a bit like asking a group of experts for their opinion and then going with the majority vote.
But until recently, one question was unanswered: How many decision trees do I actually need? If each decision tree can lead to different results, averaging many trees would lead to better and more reliable results. But how many are enough? Luckily, the optRF package answers this question!
So let’s have a look at how to optimise Random Forest for predictions and variable selection!
Making Predictions with Random Forests
To optimise and to use random forest for making predictions, we can use the open-source statistics programme R. Once we open R, we have to install the two R packages “ranger” which allows to use random forests in R and “optRF” to optimise random forests. Both packages are open-source and available via the official R repository CRAN. In order to install and load these packages, the following lines of R code can be run:
> install.packages> install.packages> library> libraryNow that the packages are installed and loaded into the library, we can use the functions that these packages contain. Furthermore, we can also use the data set included in the optRF package which is free to use under the GPL license. This data set called SNPdata contains in the first column the yield of 250 wheat plants as well as 5000 genomic markersthat can contain either the value 0 or 2.
> SNPdataYield SNP_0001 SNP_0002 SNP_0003 SNP_0004
ID_001 670.7588 0 0 0 0
ID_002 542.5611 0 2 0 0
ID_003 591.6631 2 2 0 2
ID_004 476.3727 0 0 0 0
ID_005 635.9814 2 2 0 2
This data set is an example for genomic data and can be used for genomic prediction which is a very important tool for breeding high-yielding crops and, thus, to fight world hunger. The idea is to predict the yield of crops using genomic markers. And exactly for this purpose, random forest can be used! That means that a random forest model is used to describe the relationship between the yield and the genomic markers. Afterwards, we can predict the yield of wheat plants where we only have genomic markers.
Therefore, let’s imagine that we have 200 wheat plants where we know the yield and the genomic markers. This is the so-called training data set. Let’s further assume that we have 50 wheat plants where we know the genomic markers but not their yield. This is the so-called test data set. Thus, we separate the data frame SNPdata so that the first 200 rows are saved as training and the last 50 rows without their yield are saved as test data:
> Training = SNPdata> Test = SNPdataWith these data sets, we can now have a look at how to make predictions using random forests!
First, we got to calculate the optimal number of trees for random forest. Since we want to make predictions, we use the function opt_prediction from the optRF package. Into this function we have to insert the response from the training data set, the predictors from the training data set, and the predictors from the test data set. Before we run this function, we can use the set.seed function to ensure reproducibility even though this is not necessary:
> set.seed> optRF_result = opt_predictionRecommended number of trees: 19000
All the results from the opt_prediction function are now saved in the object optRF_result, however, the most important information was already printed in the console: For this data set, we should use 19,000 trees.
With this information, we can now use random forest to make predictions. Therefore, we use the ranger function to derive a random forest model that describes the relationship between the genomic markers and the yield in the training data set. Also here, we have to insert the response in the y argument and the predictors in the x argument. Furthermore, we can set the write.forest argument to be TRUE and we can insert the optimal number of trees in the num.trees argument:
> RF_model = rangerAnd that’s it! The object RF_model contains the random forest model that describes the relationship between the genomic markers and the yield. With this model, we can now predict the yield for the 50 plants in the test data set where we have the genomic markers but we don’t know the yield:
> predictions = predict$predictions
> predicted_Test = data.frame, predicted_yield = predictions)
The data frame predicted_Test now contains the IDs of the wheat plants together with their predicted yield:
> headID predicted_yield
ID_201 593.6063
ID_202 596.8615
ID_203 591.3695
ID_204 589.3909
ID_205 599.5155
ID_206 608.1031
Variable Selection with Random Forests
A different approach to analysing such a data set would be to find out which variables are most important to predict the response. In this case, the question would be which genomic markers are most important to predict the yield. Also this can be done with random forests!
If we tackle such a task, we don’t need a training and a test data set. We can simply use the entire data set SNPdata and see which of the variables are the most important ones. But before we do that, we should again determine the optimal number of trees using the optRF package. Since we are insterested in calculating the variable importance, we use the function opt_importance:
> set.seed> optRF_result = opt_importanceRecommended number of trees: 40000
One can see that the optimal number of trees is now higher than it was for predictions. This is actually often the case. However, with this number of trees, we can now use the ranger function to calculate the importance of the variables. Therefore, we use the ranger function as before but we change the number of trees in the num.trees argument to 40,000 and we set the importance argument to “permutation”.
> set.seed> RF_model = ranger> D_VI = data.frame,
+ importance = RF_model$variable.importance)
> D_VI = D_VIThe data frame D_VI now contains all the variables, thus, all the genomic markers, and next to it, their importance. Also, we have directly ordered this data frame so that the most important markers are on the top and the least important markers are at the bottom of this data frame. Which means that we can have a look at the most important variables using the head function:
> headvariable importance
SNP_0020 45.75302
SNP_0004 38.65594
SNP_0019 36.81254
SNP_0050 34.56292
SNP_0033 30.47347
SNP_0043 28.54312
And that’s it! We have used random forest to make predictions and to estimate the most important variables in a data set. Furthermore, we have optimised random forest using the optRF package!
Why Do We Need Optimisation?
Now that we’ve seen how easy it is to use random forest and how quickly it can be optimised, it’s time to take a closer look at what’s happening behind the scenes. Specifically, we’ll explore how random forest works and why the results might change from one run to another.
To do this, we’ll use random forest to calculate the importance of each genomic marker but instead of optimising the number of trees beforehand, we’ll stick with the default settings in the ranger function. By default, ranger uses 500 decision trees. Let’s try it out:
> set.seed> RF_model = ranger> D_VI = data.frame,
+ importance = RF_model$variable.importance)
> D_VI = D_VI> headvariable importance
SNP_0020 80.22909
SNP_0019 60.37387
SNP_0043 50.52367
SNP_0005 43.47999
SNP_0034 38.52494
SNP_0015 34.88654
As expected, everything runs smoothly — and quickly! In fact, this run was significantly faster than when we previously used 40,000 trees. But what happens if we run the exact same code again but this time with a different seed?
> set.seed> RF_model2 = ranger> D_VI2 = data.frame,
+ importance = RF_model2$variable.importance)
> D_VI2 = D_VI2> headvariable importance
SNP_0050 60.64051
SNP_0043 58.59175
SNP_0033 52.15701
SNP_0020 51.10561
SNP_0015 34.86162
SNP_0019 34.21317
Once again, everything appears to work fine but take a closer look at the results. In the first run, SNP_0020 had the highest importance score at 80.23, but in the second run, SNP_0050 takes the top spot and SNP_0020 drops to the fourth place with a much lower importance score of 51.11. That’s a significant shift! So what changed?
The answer lies in something called non-determinism. Random forest, as the name suggests, involves a lot of randomness: it randomly selects data samples and subsets of variables at various points during training. This randomness helps prevent overfitting but it also means that results can vary slightly each time you run the algorithm — even with the exact same data set. That’s where the set.seedfunction comes in. It acts like a bookmark in a shuffled deck of cards. By setting the same seed, you ensure that the random choices made by the algorithm follow the same sequence every time you run the code. But when you change the seed, you’re effectively changing the random path the algorithm follows. That’s why, in our example, the most important genomic markers came out differently in each run. This behavior — where the same process can yield different results due to internal randomness — is a classic example of non-determinism in machine learning.
Taming the Randomness in Random Forests
As we just saw, random forest models can produce slightly different results every time you run them even when using the same data due to the algorithm’s built-in randomness. So, how can we reduce this randomness and make our results more stable?
One of the simplest and most effective ways is to increase the number of trees. Each tree in a random forest is trained on a random subset of the data and variables, so the more trees we add, the better the model can “average out” the noise caused by individual trees. Think of it like asking 10 people for their opinion versus asking 1,000 — you’re more likely to get a reliable answer from the larger group.
With more trees, the model’s predictions and variable importance rankings tend to become more stable and reproducible even without setting a specific seed. In other words, adding more trees helps to tame the randomness. However, there’s a catch. More trees also mean more computation time. Training a random forest with 500 trees might take a few seconds but training one with 40,000 trees could take several minutes or more, depending on the size of your data set and your computer’s performance.
However, the relationship between the stability and the computation time of random forest is non-linear. While going from 500 to 1,000 trees can significantly improve stability, going from 5,000 to 10,000 trees might only provide a tiny improvement in stability while doubling the computation time. At some point, you hit a plateau where adding more trees gives diminishing returns — you pay more in computation time but gain very little in stability. That’s why it’s essential to find the right balance: Enough trees to ensure stable results but not so many that your analysis becomes unnecessarily slow.
And this is exactly what the optRF package does: it analyses the relationship between the stability and the number of trees in random forests and uses this relationship to determine the optimal number of trees that leads to stable results and beyond which adding more trees would unnecessarily increase the computation time.
Above, we have already used the opt_importance function and saved the results as optRF_result. This object contains the information about the optimal number of trees but it also contains information about the relationship between the stability and the number of trees. Using the plot_stability function, we can visualise this relationship. Therefore, we have to insert the name of the optRF object, which measure we are interested in, the interval we want to visualise on the X axis, and if the recommended number of trees should be added:
> plot_stabilityThe output of the plot_stability function visualises the stability of random forest depending on the number of decision trees
This plot clearly shows the non-linear relationship between stability and the number of trees. With 500 trees, random forest only leads to a stability of around 0.2 which explains why the results changed drastically when repeating random forest after setting a different seed. With the recommended 40,000 trees, however, the stability is near 1. Adding more than 40,000 trees would get the stability further to 1 but this increase would be only very small while the computation time would further increase. That is why 40,000 trees indicate the optimal number of trees for this data set.
The Takeaway: Optimise Random Forest to Get the Most of It
Random forest is a powerful ally for anyone working with data — whether you’re a researcher, analyst, student, or data scientist. It’s easy to use, remarkably flexible, and highly effective across a wide range of applications. But like any tool, using it well means understanding what’s happening under the hood. In this post, we’ve uncovered one of its hidden quirks: The randomness that makes it strong can also make it unstable if not carefully managed. Fortunately, with the optRF package, we can strike the perfect balance between stability and performance, ensuring we get reliable results without wasting computational resources. Whether you’re working in genomics, medicine, economics, agriculture, or any other data-rich field, mastering this balance will help you make smarter, more confident decisions based on your data.
The post How to Set the Number of Trees in Random Forest appeared first on Towards Data Science.
#how #set #number #trees #random
·87 Views