nba-prediction

NBA Game Predictor Project



Project Repository: https://github.com/cmunch1/nba-prediction

NOTE: As of October 2024, I am temporarily removing Hopsworks feature store and model registry from this project until it becomes more stable.


Introduction

This project demonstrates my ability to quickly learn, develop, and deploy end-to-end machine learning technologies. I am currently seeking to change careers into Machine Learning / Data Science. (This is still a work in progress: I wanted to get the end-to-end process set up first and then go back and iterate on improvements, which I am now doing continually.)

I chose to predict the winner of NBA games because:

I am not a big NBA fan, but I have watched a few games and have basic knowledge of the sport. I have never done any sports betting either, but I have always loved exploration and discovery; the possibility of finding something that somebody else has “missed” is very appealing to me, especially in terms of competition and of making money.

Problem: Increase the profitability of betting on NBA games

Initial Step: Predict the probability that the home team will win each game

Machine learning classification models will be used to predict the probability that the home team wins each game, based upon historical data. This is a first step in developing a betting strategy that will increase the profitability of betting on NBA games.

Disclaimer

In reality, a betting strategy is a rather complex problem with many elements beyond simply picking the winner of each game. Huge amounts of manpower and money have been invested in developing such strategies, and it is not likely that a learning project will be able to compete very well with such efforts. However, it may provide an extra element of insight that could be used to improve the profitability of an existing betting strategy.

Plan

Overview

Initial Modeling Development Cycle

Initial Data Setup

Production Pipeline

Tools Used:


Future Possibilities

Continual improvements might include:

Structure

Jupyter Notebooks were used for initial development and testing and are labeled 01 through 10 in the main directory. Notebooks 01 through 06 are primarily historical records and notes from the development process.

Key functions were moved to .py files in the src directory once the functions were stable.

Notebooks 07, 09, and 10 are used in production. I chose to keep the notebooks instead of full conversion to scripts because:

Data

Data from the 2013 through 2021 seasons has been archived on Kaggle. New data is scraped from the NBA website.

Currently available data includes:

NOTES

New Data

New data is scraped from https://www.nba.com/stats/teams/boxscores

Data Leakage

The data for each game are stats for the completed game. We want to predict the winner before the game is played, not after. The model should only use data that would be available before the game is played. Our model features will primarily be rolling stats for the previous games (e.g. average assists for previous 5 games) while excluding the current game.

I mention this because I did see several similar projects online that failed to take this into account. If the goal is simply to predict which stats are important for winning games, then the model can be trained on the entire dataset. However, if the goal is to predict the winner of a game like we are trying to do, then the model must be trained on data that would only be available before the game is played.
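A minimal sketch of leakage-free rolling features, assuming pandas and a table with one row per team per game. The column names (`team`, `game_date`, `ast`), the helper `add_rolling_mean`, and the numbers are made up for illustration and are not taken from the project code.

```python
import pandas as pd

# Toy box-score rows (made-up numbers): one row per team per game.
games = pd.DataFrame({
    "team": ["BOS", "BOS", "BOS", "BOS", "LAL", "LAL", "LAL"],
    "game_date": pd.to_datetime([
        "2023-01-01", "2023-01-03", "2023-01-05", "2023-01-07",
        "2023-01-02", "2023-01-04", "2023-01-06",
    ]),
    "ast": [20, 30, 25, 35, 22, 18, 26],
})


def add_rolling_mean(df, stat, window):
    """Add the mean of `stat` over each team's previous `window` games."""
    df = df.sort_values(["team", "game_date"]).copy()
    # shift(1) pushes every value down one game, so the window for a row
    # ends at the previous game and never includes the current one.
    df[f"{stat}_avg_prev_{window}"] = (
        df.groupby("team")[stat]
        .transform(lambda s: s.shift(1).rolling(window, min_periods=1).mean())
    )
    return df


features = add_rolling_mean(games, "ast", window=3)
print(features)
```

Each team's first game has no history, so its rolling value is NaN; how to handle those rows (drop them, or fill from the prior season) is a separate design choice.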

EDA and Data Processing

Exploratory Data Analysis (EDA) and Data Processing are summarized and detailed in the notebooks. Some examples include:

Histograms of various features

Correlations between features

Train / Test / Validation Split

Baseline Models

Simple If-Then Models

ML Models

Feature Engineering

Model Training/Testing

Models

The native Python API (rather than the Scikit-learn wrapper) is used for initial testing of both models because it makes built-in Shapley values easy to obtain. These are used for feature importance analysis and for adversarial validation: since Shapley values are computed separately for each dataset, they can show whether the train and test datasets have the same feature importances. If they do not, the model may not generalize well.
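One way such a train/test comparison could look, as a rough sketch with NumPy. It assumes the per-row Shapley contribution matrices have already been produced by the model library (any trailing bias column would need to be dropped first). The function names, feature names, and numbers are hypothetical.

```python
import numpy as np


def mean_abs_importance(contribs):
    """Mean absolute Shapley value per feature, normalized to sum to 1."""
    imp = np.abs(contribs).mean(axis=0)
    return imp / imp.sum()


def importance_shift(train_contribs, test_contribs, feature_names):
    """List features by how much their share of importance changes
    between the train set and the test set (largest change first)."""
    diff = mean_abs_importance(test_contribs) - mean_abs_importance(train_contribs)
    order = np.argsort(-np.abs(diff))
    return [(feature_names[i], float(diff[i])) for i in order]


# Toy contribution matrices (rows = games, columns = features), made up.
train_contribs = np.array([[0.5, -0.3, 0.2], [-0.5, 0.3, -0.2]])
test_contribs = np.array([[0.2, -0.2, 0.6], [-0.2, 0.2, -0.6]])
names = ["win_pct_prev_10", "ast_avg_prev_5", "pts_avg_prev_5"]

shift = importance_shift(train_contribs, test_contribs, names)
for name, delta in shift:
    print(f"{name}: {delta:+.2f}")
```

A feature whose share moves a lot between the two sets is a candidate for closer inspection; what counts as "a lot" is a judgment call rather than a fixed threshold.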

The Scikit-learn wrapper is used later in production because it allows for easier probability calibration using sklearn’s CalibratedClassifierCV.

Evaluation

Experiment Tracking

Notebook 07 integrates Neptune.ai for experiment tracking and Optuna for hyperparameter tuning.

Experiment tracking logs can be viewed here: https://app.neptune.ai/cmunch1/nba-prediction/experiments?split=tbl&dash=charts&viewId=979e20ed-e172-4c33-8aae-0b1aa1af3602

Probability Calibration

Scikit-learn’s CalibratedClassifierCV is used to ensure that the model probabilities are calibrated against the true probability distribution. The Brier score loss is used by the software to automatically select the best calibration method (sigmoid, isotonic, or none).
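The selection step might look roughly like the sketch below. It is an illustration, not the project's code: it uses synthetic data and scikit-learn's own GradientBoostingClassifier as a stand-in for the production model, and `pick_calibration` is a hypothetical helper name.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the real project uses engineered game features.
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)


def make_model():
    # Stand-in classifier (assumption): any estimator with predict_proba works.
    return GradientBoostingClassifier(n_estimators=50, random_state=0)


def pick_calibration(X_train, y_train, X_val, y_val):
    """Return the calibration option with the lowest Brier score loss."""
    scores = {}
    # "none": the uncalibrated model's probabilities.
    base = make_model().fit(X_train, y_train)
    scores["none"] = brier_score_loss(y_val, base.predict_proba(X_val)[:, 1])
    for method in ("sigmoid", "isotonic"):
        calibrated = CalibratedClassifierCV(make_model(), method=method, cv=5)
        calibrated.fit(X_train, y_train)
        probs = calibrated.predict_proba(X_val)[:, 1]
        scores[method] = brier_score_loss(y_val, probs)
    best = min(scores, key=scores.get)
    return best, scores


best, scores = pick_calibration(X_train, y_train, X_val, y_val)
print(best, scores)
```

Lower Brier score loss is better, so the option with the minimum score is kept. With time-ordered game data, the default random folds inside CalibratedClassifierCV may not be the right choice, and a time-aware split could be passed as `cv` instead.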

Production Features Pipeline

Notebook 09 is run by a GitHub Actions workflow every morning.

A variable can be set to either use Selenium or ScrapingAnt for scraping the data. ScrapingAnt is used in production because of its built-in proxy server.

Model Training Pipeline

Notebook 10 retrieves the most current data, executes Notebook 07 to handle hyperparameter tuning, model training, and calibration, and then adds the model to the Model Registry. The time periods used for the train set and test set can be adjusted so that the model can be tested only on the most current games.
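An adjustable, date-based train/test split could be sketched as follows, assuming pandas and a `game_date` column. The helper `split_by_date`, the column names, and the rows are illustrative assumptions rather than the notebook's actual code.

```python
import pandas as pd


def split_by_date(df, date_col, test_start, test_end=None):
    """Train on games before `test_start`; test on games from then on,
    optionally capped at `test_end` (inclusive)."""
    test_start = pd.Timestamp(test_start)
    train = df[df[date_col] < test_start]
    test = df[df[date_col] >= test_start]
    if test_end is not None:
        test = test[test[date_col] <= pd.Timestamp(test_end)]
    return train, test


# Toy rows (made up): 1 = home team won.
games = pd.DataFrame({
    "game_date": pd.to_datetime(
        ["2022-04-10", "2022-10-18", "2022-12-25", "2023-04-09"]
    ),
    "home_win": [1, 0, 1, 1],
})

train, test = split_by_date(games, "game_date", test_start="2022-10-18")
print(len(train), len(test))
```

Splitting on a date rather than at random keeps every test game later than every training game, which matches how the model is used in practice.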

Streamlit App

The Streamlit app is deployed at streamlit.io and can be accessed here: https://cmunch1-nba-prediction.streamlit.app/

It uses the model in the Model Registry to predict the win probability of the home team for the current day’s upcoming games.

Model Performance

The current model was tested over the completed 2022-2023 regular season (not playoffs) and had an accuracy of 0.615.

Baseline performance of “home team always wins” is 0.58 for this same time period.

One of the top public prediction models had an accuracy of 0.656 for this same time period.

Overall, the performance for the regular season is not bad, but there is room for improvement.
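For reference, the accuracy and the “home team always wins” baseline can be computed along these lines. The outcomes and probabilities below are toy values invented for illustration, not results from this project.

```python
def accuracy(predictions, outcomes):
    """Share of games where the predicted winner matched the actual winner."""
    correct = sum(p == o for p, o in zip(predictions, outcomes))
    return correct / len(outcomes)


# Toy data (made up): 1 = home team won, 0 = away team won.
home_win = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
# Hypothetical model probabilities that the home team wins.
home_win_prob = [0.70, 0.40, 0.55, 0.80, 0.60, 0.65, 0.30, 0.45, 0.75, 0.35]

# Baseline: always pick the home team.
baseline_preds = [1] * len(home_win)
# Model: pick the home team when its win probability is at least 0.5.
model_preds = [int(p >= 0.5) for p in home_win_prob]

baseline_acc = accuracy(baseline_preds, home_win)
model_acc = accuracy(model_preds, home_win)
print(f"baseline: {baseline_acc:.3f}, model: {model_acc:.3f}")
```

Comparing against the baseline on the same games matters because home-court advantage alone already gives an accuracy well above 0.5.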

Feedback

Thanks for taking the time to read about my project. This is my primary “portfolio” project for my quest to change careers and find an entry-level position in Machine Learning / Data Science. I appreciate any feedback.

Project Repository: https://github.com/cmunch1/nba-prediction

My LinkedIn profile: https://www.linkedin.com/in/chris-munch/

Twitter: https://twitter.com/curiovana

Acknowledgements

Pau Labarto Bajo mentored me on this project, providing valuable feedback and insights. He provides online tutorials, training courses, and a blog at his website: https://datamachines.xyz/