Abstract
We address the automatic extraction of patterns from sequences of events in basketball games and the construction of statistical models that generate plausible simulations of a match between two distinct teams. We present a method for automatically constructing an attribute space that requires very little expert knowledge. The attributes are defined as the ratio between the number of entries into and exits from higher-level concepts, which are identified as groups of similar in-game events. The similarity between two events is determined by the similarity between the probability distributions describing the preceding and the following events in the observed sequences of game progression. The methodology is general and applies to any sport that can be modelled as a random walk through a state space. Experiments on basketball show that automatically generated attributes are as informative as those derived using expert knowledge. Furthermore, the obtained simulations are in line with empirical data.
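The notion of event similarity described above can be illustrated with a minimal sketch. It estimates, for each event type, the distribution of the events that directly precede and follow it, and then compares two event types through those distributions. The Jensen-Shannon divergence, the equal weighting of the two contexts, the dummy markers `START` and `END`, and all function names are assumptions made for this illustration; the abstract does not state which distance measure is used.

```python
from collections import Counter, defaultdict
from math import log2

# Hypothetical dummy markers for the start and end of a sequence.
START, END = "<start>", "<end>"


def context_distributions(sequences):
    """For each event type, estimate the relative frequencies of the
    events directly preceding it and directly following it."""
    prev_counts = defaultdict(Counter)
    next_counts = defaultdict(Counter)
    for seq in sequences:
        padded = [START, *seq, END]
        for before, event, after in zip(padded, padded[1:], padded[2:]):
            prev_counts[event][before] += 1
            next_counts[event][after] += 1

    def normalise(counts):
        return {
            event: {k: v / sum(c.values()) for k, v in c.items()}
            for event, c in counts.items()
        }

    return normalise(prev_counts), normalise(next_counts)


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete
    distributions given as dicts; the result lies in [0, 1]."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl_to_mixture(a):
        return sum(a[k] * log2(a[k] / m[k]) for k in a if a[k] > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)


def event_distance(a, b, prev_dist, next_dist):
    """Assumed distance: mean divergence of the preceding-event and
    following-event distributions of event types a and b."""
    return 0.5 * (
        js_divergence(prev_dist[a], prev_dist[b])
        + js_divergence(next_dist[a], next_dist[b])
    )
```

A pairwise distance matrix built with `event_distance` could then be passed to a hierarchical clustering routine to form groups of similar events.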
Notes
The computation of silhouette coefficients is based on the maximal distance between items instead of average distance. The goal is a more conservative cut of the dendrogram that results in smaller and more cohesive clusters.
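One possible reading of this modified silhouette coefficient is sketched below: the usual average distances in Rousseeuw's definition are replaced by maximal distances, both within an item's own cluster and to each other cluster. The function name, the input layout, and the score of 0 for singleton clusters are assumptions for illustration, not details taken from the note.

```python
def max_distance_silhouette(dist, labels):
    """Silhouette-like score per item using maximal distances.

    dist   -- symmetric distance matrix as a list of lists
    labels -- cluster label of each item, in the same order as dist
    """
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        others = [m for other, m in clusters.items() if other != label]
        if not own or not others:
            # Assumption: singletons (or a single cluster) score 0.
            scores.append(0.0)
            continue
        # a: distance to the farthest member of the item's own cluster.
        a = max(dist[i][j] for j in own)
        # b: smallest "farthest member" distance over all other clusters.
        b = min(max(dist[i][j] for j in members) for members in others)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores
```

Because a single distant member raises `a` for every item in its cluster, a cut chosen with this score tends to favour smaller, tighter clusters, which matches the stated goal of a more conservative cut.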
The data were obtained from https://stats.nba.com.
In sports, the two dummy events correspond to the start and the end of each period.
The impact of the event type ADREB will be represented in the selection, albeit indirectly, even though it will not be explicitly present in the numerator of any of the selected attributes.
At first glance, it seems that only Dur is required, but then we lose the ability to begin the simulation at an arbitrary starting point.
The rules of the sport determine which event types can follow each other. Picking the attribute PrevEvt as the root node leads to a natural division of event types into disjoint subsets, depending on what can follow.
As we have seen before, not all the sequence elements correspond to explicit feeds in the play-by-play data. In this particular example, the elements from category 0 represent immediate change in possession after made baskets.
Theoretically, there should be 24600 simulations, but games from the beginning of the season cannot be simulated because the teams’ skills cannot be estimated.
We excluded matches that went to overtime. We also excluded each team’s first home and away game because, at that point, the teams’ skills had not yet been estimated.
The predicted score margin for a game is calculated as the average score margin in the generated simulations of that game.
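As a minimal sketch of this calculation, the function below averages the margin over the simulated final scores of one game. The input layout and the sign convention (home points minus away points) are assumptions for illustration.

```python
from statistics import mean


def predicted_margin(simulated_scores):
    """Average score margin over the simulations of a single game.

    simulated_scores -- iterable of (home_points, away_points) pairs,
                        one pair per generated simulation (assumed layout)
    """
    return mean(home - away for home, away in simulated_scores)
```

For example, three simulations ending 100:95, 98:101 and 110:100 have margins of 5, -3 and 10, so the predicted margin is 4.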
Cite this article
Vračar, P., Štrumbelj, E. & Kononenko, I. Automatic attribute construction for basketball modelling. Knowl Inf Syst 62, 541–570 (2020). https://doi.org/10.1007/s10115-019-01361-2