Accelerating NBA Play-by-Play Processing with Python and Parallel Computing
- smileytr
- Mar 31
- 2 min read
Updated: Apr 4
As a data scientist and basketball enthusiast, I set out to build a realistic NBA game simulation engine from scratch in Python. What began as a class project for CMSE 401: Parallel Computing at Michigan State University quickly evolved into a broader exploration of high-performance data processing and simulation design. My initial goal was to implement a game outcome model using team-level Elo ratings, but I soon saw an opportunity to increase the model's granularity by incorporating play-by-play data spanning nearly three decades of NBA seasons.
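For context, the standard team-level Elo model boils down to two small formulas: an expected win probability and a post-game rating update. The sketch below shows that logic; the K-factor and any base rating are illustrative defaults, not the values used in this project.

```python
def elo_expected(rating_a: float, rating_b: float) -> float:
    """Probability that team A beats team B under the standard Elo formula."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 20.0):
    """Return updated ratings for both teams after a single game."""
    expected_a = elo_expected(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b
```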
To achieve this, I pulled and processed 27 seasons of NBA play-by-play data (1997–2023), sourced from Kaggle. The combined dataset was roughly 2 GB, containing detailed in-game events across millions of possessions. The processing workload included filling in missing data, inferring play types and results, and cleaning each game's record into a usable structure. To handle this scale efficiently, I built a parallel data processing pipeline using Python's ProcessPoolExecutor and ThreadPoolExecutor to distribute the work across both seasons and games. I also leveraged the Numba library's @njit decorator to compile computationally expensive row-level logic, accelerating numeric corrections and cumulative stat tracking.
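The sketch below illustrates that two-level structure under simplifying assumptions: one CSV file per season, a process per season file, threads across the games within each season, and an @njit-compiled loop for the running-score correction. The file layout and names (data/pbp_*.csv, process_season, clean_game, fill_cumulative_scores) are hypothetical stand-ins, not the project's actual implementation.

```python
import glob
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

import numpy as np
import pandas as pd
from numba import njit


@njit
def fill_cumulative_scores(home_pts, away_pts):
    """Compiled row-level loop: rebuild running score columns from per-event points."""
    home_running = np.empty(home_pts.shape[0], dtype=np.int64)
    away_running = np.empty(away_pts.shape[0], dtype=np.int64)
    home_total, away_total = 0, 0
    for i in range(home_pts.shape[0]):
        home_total += home_pts[i]
        away_total += away_pts[i]
        home_running[i] = home_total
        away_running[i] = away_total
    return home_running, away_running


def clean_game(game_df):
    """Clean a single game's play-by-play records (illustrative subset of the work)."""
    game_df = game_df.sort_values("event_num").reset_index(drop=True)
    home, away = fill_cumulative_scores(
        game_df["home_points"].to_numpy(np.int64),
        game_df["away_points"].to_numpy(np.int64),
    )
    game_df["home_score"], game_df["away_score"] = home, away
    return game_df


def process_season(path):
    """Process worker: load one season file and clean its games on a thread pool."""
    season = pd.read_csv(path)
    games = [g for _, g in season.groupby("game_id")]
    with ThreadPoolExecutor(max_workers=8) as pool:
        cleaned = list(pool.map(clean_game, games))
    return pd.concat(cleaned, ignore_index=True)


if __name__ == "__main__":
    season_files = sorted(glob.glob("data/pbp_*.csv"))  # one file per season (assumed layout)
    with ProcessPoolExecutor() as pool:                  # one process per season file
        seasons = list(pool.map(process_season, season_files))
    full_pbp = pd.concat(seasons, ignore_index=True)
    full_pbp.to_csv("pbp_1997_2023_clean.csv", index=False)
```

The split mirrors the description above: processes across seasons sidestep the GIL for the heavy per-season work, while the thread pool inside each worker adds lighter-weight concurrency over individual games, and the @njit loop keeps the per-row corrections out of interpreted Python.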
The final result was a merged and cleaned play-by-play dataset with over 13 million rows, processed end-to-end in just 22 seconds on Michigan State's High Performance Computing Cluster. This project not only demonstrated the power of Python's parallel and compiled capabilities, but also laid the foundation for a more realistic and statistically informed NBA simulation model that could eventually factor in player-level impact, coaching decisions, and advanced tempo-based pacing. You can view the full preprocessing implementation in the code above or on my GitHub Gist.