Predicting the NBA Champion with Machine Learning
Every NBA season, 30 teams compete for something only one will achieve: the legacy of a championship. From power rankings to trade-deadline chaos and injuries, fans and analysts alike speculate endlessly about who will raise the Larry O’Brien Trophy. But what if we could go beyond the hot takes and, at the end of the regular season, use data and Machine Learning to forecast the NBA Champion?

In this article, I’ll walk through this process — from gathering and preparing the data, to training and evaluating the model, and finally using it to make predictions for the upcoming 2024–25 Playoffs. Along the way, I’ll highlight some of the most surprising insights that emerged from the analysis. All the code and data used are available on GitHub.

Understanding the problem

Before diving into model training, the most important step in any machine learning project is understanding the problem: what question are we trying to answer, and what data (and model) can help us get there? In this case, the question is simple: who is going to be the NBA Champion?

A natural first idea is to frame this as a classification problem: each team in each season is labeled as either Champion or Not Champion. But there’s a catch. There’s only one champion per year (obviously). So if we pull data from the last 40 seasons, we’d have 40 positive examples… and hundreds of negative ones. That lack of positive samples makes it extremely hard for a model to learn meaningful patterns, especially considering that winning an NBA title is such a rare event that we simply don’t have enough historical data — we’re not working with 20,000 seasons. That scarcity makes it extremely difficult for any classification model to truly understand what separates champions from the rest. We need a smarter way to frame the problem.

To help the model understand what makes a champion, it’s useful to also teach it what makes an almost champion — and how that differs from a team that was knocked out in the first round. In other words, we want the model to learn degrees of success in the playoffs, rather than a simple yes/no outcome.

This led me to the concept of Champion Share — the proportion of playoff wins a team achieved out of the total needed to win the title. From 2003 onward, it takes 16 wins to become an NBA Champion. Between 1984 and 2002, however, the first round was a best-of-five series, so during that period the total required was 15 wins. A team that loses in the first round might have 0 or 1 win (Champion Share = 1/16), while a team that makes the Finals but loses might have 14 wins (Champion Share = 14/16). The Champion has a full share of 1.0.

[Embedded tweet from @NBA, June 17, 2022 (pic.twitter.com/IHU72Kr8AN): the final bracket of the 2021–22 #NBAPlayoffs, won by the Warriors, shown as an example of a full playoff bracket.]

This reframes the task as a regression problem, where the model predicts a continuous value between 0 and 1 — representing how close each team came to winning it all. In this setup, the team with the highest predicted value is our model’s pick for the NBA Champion. This is a similar approach to the MVP prediction from my previous article, Predicting the NBA MVP with Machine Learning.
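To make the target concrete, here is a minimal sketch of how Champion Share could be computed from a team’s playoff win total. The function name and the example values are illustrative assumptions, not the project’s actual code.

```python
def champion_share(playoff_wins: int, season_end_year: int) -> float:
    """Champion Share: playoff wins divided by the wins needed for the title.

    From 2003 onward a title requires 16 playoff wins; between 1984 and 2002
    the first round was best-of-five, so the total was 15.
    """
    wins_needed = 16 if season_end_year >= 2003 else 15
    return playoff_wins / wins_needed

# Illustrative values: a first-round exit, a Finals loss, and the champion
print(champion_share(1, 2024))   # 0.0625
print(champion_share(14, 2024))  # 0.875
print(champion_share(16, 2024))  # 1.0
```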
Data

Basketball — and the NBA in particular — is one of the most exciting sports to work with in data science, thanks to the volume of freely available statistics.

For this project, I gathered data from Basketball Reference using my Python package BRScraper, which allows easy access to player and team data. All data collection was done in accordance with the website’s guidelines and rate limits. The data used includes team-level statistics and final regular-season standings (e.g., win percentage, seeding), as well as player-level statistics for each team (limited to players who appeared in at least 30 games) and historical playoff performance indicators.

However, it’s important to be cautious when working with raw, absolute values. For example, the average points per game (PPG) in the 2023–24 season was 114.2, while in 2000–01 it was 94.8 — an increase of nearly 20%. This is due to a series of factors, but the fact is that the game has changed significantly over the years, and so have the metrics derived from it.

Evolution of some per-game NBA statistics (Image by Author)

To account for this shift, the approach here avoids using absolute statistics directly, opting instead for normalized, relative metrics. For example: instead of a team’s PPG, you can use their ranking in that season; instead of counting how many players average 20+ PPG, you can consider how many are in the top 10 in scoring, and so on. This enables the model to capture relative dominance within each era, making comparisons across decades more meaningful and thus permitting the inclusion of older seasons to enrich the dataset. Data from the 1984 to 2024 seasons were used to train and test the model, totaling 40 seasons and 70 variables.

Before diving into the model itself, some interesting patterns emerge from an exploratory analysis comparing championship teams to playoff teams as a whole:

Comparison of teams: Champions vs Rest of Playoff teams (Image by Author)

Champions tend to come from the top seeds and with higher winning percentages, unsurprisingly. The team with the worst regular-season record to win it all in this period was the 1994–95 Houston Rockets, led by Hakeem Olajuwon, who finished 47–35 (.573) and entered the playoffs as only the 10th-best overall team (6th in the West). Another notable trend is that champions tend to have a slightly higher average age, suggesting that experience plays a crucial role once the playoffs begin. The youngest championship team in the database is the 1990–91 Chicago Bulls, with an average age of 26.6 years, and the oldest is the 1997–98 Chicago Bulls, at 31.2 years — the first and last titles of the Michael Jordan dynasty. Similarly, teams whose coaches have been with the franchise longer also tend to find more success in the postseason.

Modeling

The model used was LightGBM, a tree-based algorithm widely recognized as one of the most effective methods for tabular data, alongside others like XGBoost. A grid search was performed to identify the best hyperparameters for this specific problem. Model performance was evaluated using the root mean squared error (RMSE) and the coefficient of determination (R²); you can find the formula and explanation of each metric in my previous MVP article.

The seasons used for training and testing were randomly selected, with the constraint of reserving the last three seasons for the test set in order to better assess the model’s performance on more recent data. Importantly, all teams were included in the dataset — not just those that qualified for the playoffs — allowing the model to learn patterns without relying on prior knowledge of postseason qualification. A minimal sketch of this training setup is shown below.
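The sketch below assumes a table with one row per team-season, a season column, and a champion_share target; the file name, column names, and hyperparameter grid are illustrative assumptions rather than the project’s exact code.

```python
import numpy as np
import pandas as pd
from lightgbm import LGBMRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical dataset: one row per team-season with ~70 relative/ranked features
df = pd.read_csv("team_seasons.csv")
features = [c for c in df.columns if c not in ("team", "season", "champion_share")]

# Keep the last three seasons for testing (this sketch applies only the
# recency constraint; the full project also samples earlier seasons randomly)
test_mask = df["season"] >= df["season"].max() - 2
X_train, y_train = df.loc[~test_mask, features], df.loc[~test_mask, "champion_share"]
X_test, y_test = df.loc[test_mask, features], df.loc[test_mask, "champion_share"]

# Small illustrative hyperparameter grid for the LightGBM regressor
grid = GridSearchCV(
    LGBMRegressor(random_state=42),
    param_grid={
        "n_estimators": [200, 500],
        "learning_rate": [0.01, 0.05],
        "num_leaves": [15, 31],
    },
    scoring="neg_root_mean_squared_error",
    cv=5,
)
grid.fit(X_train, y_train)

# Evaluate with the same metrics used in the article: RMSE and R²
pred = grid.best_estimator_.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, pred))
print(f"RMSE: {rmse:.3f} | R²: {r2_score(y_test, pred):.3f}")

# The predicted champion for a given season is simply the team with the
# highest predicted Champion Share
latest = df[df["season"] == df["season"].max()].copy()
latest["pred"] = grid.best_estimator_.predict(latest[features])
print(latest.sort_values("pred", ascending=False)[["team", "pred"]].head())
```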
Results

Here we can see a comparison between the “distributions” of the predictions and the real values. While it’s technically a histogram — since we’re dealing with a regression problem — it still works as a visual distribution because the target values range from 0 to 1. We also display the distribution of the residual error for each prediction.

(Image by Author)

As we can see, the predictions and the real values follow a similar pattern, both concentrated near zero — as most teams do not achieve high playoff success. This is further supported by the distribution of the residual errors, which is centered around zero and resembles a normal distribution. This suggests that the model is able to capture and reproduce the underlying patterns present in the data. In terms of performance metrics, the best model achieved an RMSE of 0.184 and an R² score of 0.537 on the test dataset.

An effective approach for visualizing the key variables influencing the model’s predictions is SHAP values, a technique that provides a reasonable explanation of how each feature impacts the model’s predictions (a short code sketch for computing them appears at the end of this section). Again, a deeper explanation of SHAP and how to interpret its chart can be found in Predicting the NBA MVP with Machine Learning.

SHAP chart (Image by Author)

From the SHAP chart, several important insights emerge:

Seed and W/L% rank among the top three most impactful features, highlighting the importance of team performance in the regular season.

Team-level stats such as Net Rating (NRtg), Opponent Points Per Game (PA/G), Margin of Victory (MOV) and Adjusted Offensive Rating (ORtg/A) also play a significant role in shaping playoff success.

On the player side, advanced metrics stand out: the number of players in the top 30 for Box Plus/Minus (BPM) and in the top 3 for Win Shares per 48 Minutes (WS/48) are among the most influential.

Interestingly, the model also captures broader trends — teams with a higher average age tend to perform better in the playoffs, and a strong showing in the previous postseason often correlates with future success. Both patterns point again to experience as a valuable asset in the pursuit of a championship.

Let’s now take a closer look at how the model performed in predicting the last three NBA champions:

Predictions for the last three years (Image by Author)

The model correctly predicted two of the last three NBA champions. The only miss was in 2023, when it favored the Milwaukee Bucks. That season, Milwaukee had the best regular-season record at 58–24 (.707), but an injury to Giannis Antetokounmpo hurt their playoff run. The Bucks were eliminated 4–1 in the first round by the Miami Heat, who went on to reach the Finals — a surprising and disappointing postseason exit for Milwaukee, who had claimed the championship just two years earlier.

2025 Playoffs Predictions

For the upcoming 2025 playoffs, the model is predicting the Boston Celtics to go back-to-back, with OKC and Cleveland close behind. Given their strong regular season (61–21, 2nd seed in the East) and the fact that they’re the reigning champions, I tend to agree: they combine current performance with recent playoff success. Still, as we all know, anything can happen in sports — and we’ll only get the real answer by the end of June.

(Photo by Richard Burlton on Unsplash)
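As referenced in the Results section, here is a minimal sketch of how the SHAP values behind the chart could be computed for the trained LightGBM model; the model and X_train objects are assumed to come from the training sketch above.

```python
import shap

# TreeExplainer is designed for tree ensembles such as LightGBM
explainer = shap.TreeExplainer(model)         # model: the fitted LGBMRegressor
shap_values = explainer.shap_values(X_train)  # one contribution per feature per team-season

# Beeswarm summary plot: features ranked by their overall impact on the
# predicted Champion Share
shap.summary_plot(shap_values, X_train)
```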
Conclusions

This project demonstrates how machine learning can be applied to complex, dynamic environments like sports. Using a dataset spanning four decades of basketball history, the model was able to uncover meaningful patterns in what drives playoff success. Beyond prediction, tools like SHAP allowed us to interpret the model’s decisions and better understand the factors that contribute to postseason success.

One of the biggest challenges in this problem is accounting for injuries. They can completely reshape the playoff landscape — particularly when they affect star players during the playoffs or late in the regular season. Ideally, we could incorporate injury histories and availability data to better account for this. Unfortunately, consistent and structured open data on this matter — especially at the granularity needed for modeling — is hard to come by. As a result, this remains one of the model’s blind spots: it treats all teams as if they were at full strength, which is often not the case.

While no model can perfectly predict the chaos and unpredictability of sports, this analysis shows that data-driven approaches can get close. As the 2025 playoffs unfold, it will be exciting to see how the predictions hold up — and what surprises the game still has in store.

(Photo by Tim Hart on Unsplash)

I’m always available on my channels (LinkedIn and GitHub). Thanks for your attention!

Gabriel Speranza Pastorello