Enhancing Basketball Game Outcome Prediction through Fused Graph Convolutional Networks and Random Forest Algorithm

Basketball is a popular sport worldwide, and many researchers have utilized various machine learning models to predict the outcome of basketball games. However, prior research has primarily focused on traditional machine learning models. Furthermore, models that rely on vector inputs tend to ignore the intricate interactions between teams and the spatial structure of the league. Therefore, this study aimed to apply graph neural networks to basketball game outcome prediction, by transforming structured data into unstructured graphs, to represent the interactions between teams in the 2012–2018 NBA season dataset. Initially, the study used a homogeneous network and undirected graph to build a team representation graph. The constructed graph was fed into a graph convolutional network, which yielded an average success rate of 66.90% in predicting the outcome of games. To improve the prediction success rate, feature extraction based on the random forest algorithm was combined with the model. The fused model yielded the best results, and the prediction accuracy was improved to 71.54%. Additionally, the study compared the results of the developed model with previous studies and the baseline model. Our proposed method considers the spatial structure of teams and the interaction between teams, resulting in superior performance in basketball game outcome prediction. The results of this study provide valuable insights for basketball performance prediction research.


Introduction
Machine learning (ML) is an interdisciplinary field that combines computer science, statistics, and other disciplines to develop predictive models that imitate certain aspects of human thinking. Accurate prediction is essential for various industries, including policy-making, risk prevention, resource management, and economic and social development. In the sports industry, prediction models are increasingly used by coaches, players, and companies to improve competitiveness and profits. For example, accurate predictions can inform sales planning, investment decisions, training programs, tactical choices, and injury prevention strategies [1][2][3].
Sports industries have grown rapidly in recent decades, driven by economic, technological, and social developments. Sports markets generate significant value and revenue, such as through sports betting, venue management, and broadcast management, as exemplified by the 2022 FIFA World Cup in Qatar. With the rise in popularity of sports, there has been growing interest in predicting sports outcomes [4][5][6]. Among all sports, the National Basketball Association (NBA) in the United States is one of the most influential basketball leagues, generating billions of dollars in revenue. Winning games, gaining advantages, and maintaining team performance are crucial goals for competitive organizations such as the NBA. To achieve these goals, coaches and team administrators analyze and predict future team and player performance, and adjust team lineups and tactics accordingly.

Related Work
It is important to note that this paper proposes a graph convolutional network (GCN) prediction model for basketball games, which is specifically applied to the NBA dataset. Therefore, the relevant literature for this study covers two areas: basketball game outcome prediction, and the GNN methodology as applied to sports outcome prediction.

Basketball Game Outcome Prediction
In this section, we examine the datasets used in previous studies for predicting basketball game outcomes, the number of features used in each study, the most successful algorithm employed, and the corresponding success rates achieved.
In recent years, several studies have been conducted to predict the outcomes of basketball games using machine learning algorithms. Loeffelholz et al. [22] used neural networks to model the NBA 2007-2008 season and found that the most effective method was the feed-forward neural network (FFNN), with a success rate of 74.33%. Similarly, Zdravevski and Kulakov [23] predicted two consecutive NBA seasons using algorithms in Weka, with logistic regression being the most effective method, with a success rate of 72.78%. On the other hand, Miljkovic et al. [24] predicted the NBA 2009-2010 season games using data mining and found that the most efficient method was naive Bayes, with a success rate of 67%.
Cao [25] used machine learning algorithms to build a model for predicting NBA game outcomes and found that simple logistic regression was the most effective method, with a success rate of 69.67%. Lin, J. et al. [26] attempted to determine the winners of NBA 1991-1998 season games using random forests and achieved a 65% success rate. Tran, T. [27] predicted NBA games using matrix factorization, with an accuracy of 72.1%.
Li, Y. et al. [13] attempted to predict the games for the NBA 2011-2016 seasons using a data envelopment analysis methodology and tested with the 2015-2016 season, with a 73.95% accuracy rate. Horvat, T. et al. [17] used seven different machine learning models to predict basketball game outcomes for the NBA 2009-2018 season and found that k-nearest neighbors was the best method, with an accuracy of 60.01%. Li [28] conducted a study on modeling the NBA 2012-2018 season using machine learning classifiers. The study employed three different classifiers and found that linear regression following use of a least absolute shrinkage and selection operator (LASSO) was the most effective method, achieving a success rate of 67.24%. In another study conducted by Ozkan I A. [18], the outcomes of games from the 2015-2016 season of the Turkish Basketball Super League were estimated using a concurrent neuro-fuzzy system, with a 79.2% success rate. ÇENE E. [29] explored the performance of seven different algorithms in predicting EuroLeague games for the 2016-2017 and 2020-2021 seasons. Their findings revealed that logistic regression, support vector machines (SVM), and artificial neural networks (ANN) were the most effective models, with an overall accuracy of approximately 84%.
Another study by Osken C and Onay C. [30] focused on identifying player types using k-means and c-means clustering, and using cluster memberships to train prediction models. Their approach achieved a prediction accuracy of 76% over a period of five NBA seasons. Table 1 presents a comparison of previous studies that have attempted to predict basketball game outcomes. The table displays the datasets used in each study, the number of data points and features, the most successful algorithm used, and the corresponding success rates achieved.

GNN Methodology and Sport Outcome Prediction
The second category of relevant research focuses on GNNs, a cutting-edge method for sports outcome prediction. The technique can be traced back to Aleksandra P. (2021) [31], who used a GNN model to predict the outcomes of soccer games. In that study, the author employed individual teams as nodes and past games as edges, and assigned different weights to the edges to reflect the greater impact of recent games on team performance. The model was trained on a league soccer dataset, with an accuracy of 52.31%. Similarly, Xenopoulos P. and Silva C. (2021) [32] used a graph-based representation of the game state as input to a GNN to predict the outcome of NFL and CSGO games, reducing test set losses by 9% and 20%, respectively. Additionally, Mirzaei A. (2022) [33] utilized dynamic graph (specifically spatiotemporal graph) representation learning to predict soccer games, using only the names of teams and players, achieving an estimation accuracy of 50.36%. Bisberg A J and Ferrara E. (2022) [34] modeled LoL (League of Legends) games with a GCN, achieving an estimation accuracy of 61.9%; the training, validation, and test networks were the LPL (League of Legends Pro League), LCK (LoL Champions Korea), and LCS (League of Legends Championship Series), respectively. Moreover, combining GCN with RF has proven successful in various relevant cross-domain areas. For instance, Chen et al. [35] proposed a model that integrates random forest and Graph WaveNet to capture spatial dependencies and extract long-term dependencies from spatiotemporal data; its efficacy was demonstrated on real-world datasets for traffic flow and groundwater level prediction, resulting in improved performance. Similarly, in the field of activity recognition, Hu et al. [36] proposed a correlation coefficient-based method to generate a graph from motion signals, followed by random forest classification, achieving significantly high accuracy. Table 2 compares previous studies' use of GNN to predict game outcomes, including the datasets, amount of data and features, the most successful models, and success rates. Several researchers have developed GNN prediction models for various sports, but these models generally suffer from low accuracy. While some methods for predicting NBA game outcomes exist, few studies have applied GNN to basketball games. Thus, this paper proposes a new basketball game outcome prediction model and explores its application in the NBA.
In contrast to previous research, this study divides NBA games played between the 2012 and 2018 seasons into six datasets and extracts features using principal component analysis (PCA), least absolute shrinkage and selection operator (LASSO), and random forest (RF) models. The results are then predicted using GCN.

Methodology
Figure 1 outlines the overall process of this study. In the subsequent subsections, we provide a more detailed explanation of each step.

Graph Networks Construction
The objective of this study was to develop a team structure in the NBA that can be used to predict game outcomes. To achieve this objective, we needed to create a graph that defines the nodes and their relationships. In our research, we employed a homogeneous graph that has only one node type and one relationship type. This type of graph provides a simplified representation of the graph data and can be used in other basketball leagues.
An example of a homogeneous graph is presented in Figure 2. In this graph, each node represents a team and contains information about 44 features. Nodes are labeled as wins or losses for games, and the edges between the nodes represent recent games related to the teams. Specifically, at time t, team A is directly connected to opponent team B, at time t + 1, team A is directly connected to team C, and team B is directly connected to team D, and at time t − 1, team A is directly connected to team E, and team B is directly connected to team F. Moreover, each team is also connected to itself from its last game.
Since the structure of the graph is symmetric, the games are arranged in ascending chronological order for each team. Thus, one team in the graph is connected to its own last and next games. This structure enables us to consider the impact of recent games on the prediction of game outcomes. Additionally, the graph convolutional network (GCN) is able to learn the structure of the network built. Therefore, the prediction task can also be interpreted as a node classification task.
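The construction described above can be sketched in code. The following is a minimal illustration (not the authors' implementation), assuming games arrive as an ordered list of (home, away) pairs; each game contributes one node per team, nodes of the same game are connected by an opponent edge, and each team's node is linked to that team's previous game, yielding a symmetric adjacency matrix:

```python
import numpy as np

def build_game_graph(games):
    """Build a symmetric adjacency matrix for team-game nodes.

    `games` is a list of (home, away) team names in chronological order.
    Each game contributes two nodes (one per team). Edges connect:
      - the two teams of the same game (opponent edge), and
      - each team's node to that team's previous game node (temporal edge).
    """
    n = 2 * len(games)
    adj = np.zeros((n, n), dtype=int)
    last_node = {}  # team -> node index of its most recent game
    for g, (home, away) in enumerate(games):
        h, a = 2 * g, 2 * g + 1
        adj[h, a] = adj[a, h] = 1  # opponent edge
        for team, node in ((home, h), (away, a)):
            if team in last_node:  # temporal edge to this team's last game
                prev = last_node[team]
                adj[node, prev] = adj[prev, node] = 1
            last_node[team] = node
    return adj

# Hypothetical three-game schedule for illustration
games = [("ATL", "BOS"), ("ATL", "CHI"), ("BOS", "CHI")]
A = build_game_graph(games)
```

In this sketch, node features (the 44 per-game statistics) would be attached separately as a feature matrix aligned with the node indices.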

Principal Component Analysis
Principal component analysis (PCA) is a widely used statistical technique that aims to reduce the dimensionality of a dataset, while retaining the maximum amount of original variation. The underlying concept of PCA involves finding the directions, also known as "principal components", in which the data varies the most, and projecting the data onto these components.
Specifically, PCA identifies the first principal component as the direction of maximum variability in the data; each subsequent principal component captures the maximum remaining variability after the variance explained by the earlier components has been removed. The total number of principal components equals the number of variables in the original dataset.
Our first step in performing a PCA on the data was to standardize it by subtracting the mean of each variable from each data point and dividing it by the standard deviation of each variable. Next, we computed the covariance matrix, which provides information about the relationship between each pair of variables. Using the covariance matrix, we derived the eigenvalues and eigenvectors, which indicate the amount of variance explained by each principal component and the direction of the principal components, respectively. These principal components are created through linear combinations of the original variables and are sorted in descending order of their corresponding eigenvalues. Based on a predetermined number of components or the desired level of variance to be explained, we then selected the top k principal components that accounted for the most variance in the data. Finally, we projected the original data onto the principal components, to transform it into the new coordinate system defined by the principal components.
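The steps above can be condensed into a short numpy sketch on synthetic data (standardize, compute the covariance matrix, eigendecompose, sort by descending eigenvalue, keep the top k components, project):

```python
import numpy as np

def pca_fit_transform(X, k):
    """PCA via the covariance matrix, following the steps above."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize each variable
    cov = np.cov(Xs, rowvar=False)              # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)      # symmetric matrix -> eigh
    order = np.argsort(eigvals)[::-1]           # descending by variance
    components = eigvecs[:, order[:k]]          # top-k principal directions
    explained = eigvals[order] / eigvals.sum()  # variance ratio per component
    return Xs @ components, explained           # project onto components

# Synthetic data with one redundant (highly correlated) feature
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=200)
Z, ratios = pca_fit_transform(X, k=2)
```

Because two features are nearly identical, the first component absorbs their shared variance, which is exactly the redundancy-removing behavior exploited later in the paper.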
By selecting only the most important principal components, PCA reduces the number of variables required to describe the data, making the analysis more manageable and efficient. This can also lead to an improved accuracy and interpretability of the results.

Feature Extraction Based on the LASSO Algorithm
The LASSO algorithm is a widely used method for feature selection in machine learning. Its primary purpose is to extract a subset of important features from a large pool of available features. The algorithm operates by adding a penalty term to the cost function of a linear regression model. This penalty term encourages the coefficients of less important features to be reduced to zero. As a result, the LASSO algorithm produces a sparse model, where only the most significant features are retained.
The LASSO (least absolute shrinkage and selection operator) algorithm is a form of regularization technique, which is commonly used to reduce the complexity of a model by selecting the most relevant features, while setting the coefficients of less important features to zero. The LASSO algorithm operates by adding a penalty term to the linear regression objective function, which encourages sparsity in the coefficients. The penalty term is the sum of the absolute values of the coefficients multiplied by a hyperparameter, known as the regularization parameter. By increasing the regularization parameter, the LASSO algorithm sets more coefficients to zero, effectively removing the corresponding features from the model.
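The soft-thresholding mechanism described above can be made concrete with a numpy-only coordinate descent sketch on synthetic data (in practice one would use a library implementation, such as scikit-learn's `Lasso` as the authors do later):

```python
import numpy as np

def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Minimize (1/2n)||y - Xw||^2 + alpha*||w||_1 by coordinate descent.

    The soft-threshold step drives the coefficients of weakly relevant
    features exactly to zero, which is what makes the LASSO model sparse.
    """
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            residual = y - X @ w + X[:, j] * w[j]  # leave feature j out
            rho = X[:, j] @ residual / n           # correlation with residual
            z = (X[:, j] ** 2).sum() / n
            w[j] = np.sign(rho) * max(abs(rho) - alpha, 0.0) / z  # soft threshold
    return w

# Synthetic data: only feature 0 actually drives the response
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=300)
w = lasso_coordinate_descent(X, y, alpha=0.5)
```

With this regularization strength, the five noise features receive exactly zero coefficients, while the informative coefficient survives (shrunk toward zero by the penalty).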

Feature Extraction Based on the Random Forest Algorithm
Random forest is a widely used machine learning algorithm for classification and regression tasks, belonging to the ensemble learning category, where it aggregates predictions of multiple models to generate a final prediction. Random forest can also be employed for feature extraction, by identifying the most significant features within a dataset. The algorithm works by creating multiple decision trees, where each tree is trained on a random subset of data and features. The final prediction is then made by aggregating the results of all trees.
During the training of the model, random forest calculates the importance of each feature based on its contribution to the overall accuracy of the model. The features that have a higher contribution are assigned higher scores. By analyzing these scores, we can identify the most important features within the dataset. This process is beneficial when dealing with high-dimensional data or when we want to simplify the model without compromising its accuracy.
In conclusion, random forest is a valuable tool for feature extraction and can assist in identifying the most relevant features within a dataset, which can be used to improve machine learning models. Figure 3 depicts the process of feature extraction using random forest.
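The importance-ranking procedure described above can be sketched with scikit-learn on synthetic data (this is an illustration of the mean-decrease-impurity scores, not the authors' code; the data and parameters here are placeholders):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic classification data where only feature 0 determines the label
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + 0.2 * rng.normal(size=500) > 0).astype(int)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)

# Mean decrease in impurity per feature; scores sum to 1 in scikit-learn
importances = rf.feature_importances_
ranking = np.argsort(importances)[::-1]  # features sorted by importance
```

Ranking the features this way and discarding the low-scoring ones is the step that later reduces the NBA feature set from 44 features to a handful of key predictors.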

Graph Convolutional Network
Graph convolutional networks (GCN) are neural networks that process data represented by graph structures, making them an effective tool for analyzing and modeling complex data that cannot be captured by traditional Euclidean space models. Unlike traditional convolutional neural networks that operate on images, a GCN applies a convolution operator directly on the graph structure.
The formula for graph convolution is expressed as

$$h_i^{(l+1)} = \sigma\left(\sum_{j \in N_i} \frac{1}{c_{ij}} W^{(l)} h_j^{(l)}\right)$$

Here, $h_i^{(l)}$ denotes the feature vector of node $i$ at layer $l$, $W^{(l)}$ represents the weight matrix at layer $l$, $c_{ij}$ is a normalization constant, $\sigma$ is an activation function, and $N_i$ represents the set of neighboring nodes of node $i$.
The aforementioned formula serves as the fundamental operation in GCN. The output feature vector of a node is a weighted sum of its neighboring nodes' feature vectors, passed through a nonlinear activation function. This process is repeated for multiple layers, enabling GCN to learn hierarchical representations of the graph structure. More experimental details can be found in the following link: https://github.com/KaiZhao-Aike/nbagcn.git (accessed on 20 March 2023).
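A single propagation step of this formula can be sketched in numpy as follows. This illustration assumes mean normalization ($c_{ij} = |N_i|$, with self-loops counted as neighbors) and a ReLU activation; practical GCN implementations such as Kipf's use symmetric degree normalization and learned weight matrices:

```python
import numpy as np

def gcn_layer(adj, H, W):
    """One GCN step: h_i' = sigma(sum_{j in N_i} (1/c_ij) W h_j).

    Self-loops are added so each node also aggregates its own features,
    and c_ij = |N_i| gives a simple mean over the neighborhood.
    """
    A_hat = adj + np.eye(adj.shape[0])         # add self-loops
    degree = A_hat.sum(axis=1, keepdims=True)  # c_ij = |N_i|
    H_agg = (A_hat / degree) @ H               # normalized neighbor sum
    return np.maximum(H_agg @ W, 0.0)          # ReLU activation

# Tiny 3-node path graph with 4 input features and 2 output features
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=float)
H = np.arange(12, dtype=float).reshape(3, 4)
W = np.ones((4, 2)) * 0.1
H_next = gcn_layer(adj, H, W)
```

Stacking several such layers is what lets each node's representation absorb information from progressively larger neighborhoods of the game graph.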

Experiment and Results
In this study, our goal was to accurately predict the outcomes of NBA games. We elaborate on the dataset, experimental procedures, and results in the subsequent subsections.

Datasets
Acquiring sufficient relevant data is a fundamental requirement for building effective prediction models. With the era of big data and rapid technological advancements in the sports industry, obtaining statistical information on sports has become easier. In this study, we utilized a dataset containing NBA statistics from 2012 to 2018 obtained from Kaggle, a data modeling and analysis competition platform that enables companies and researchers to explore machine learning. The dataset, submitted by Paul Rosetti, is called "NBA Enhanced Box Score and Standings (2012-2018)" [37]. The file 2012-18_teamBoxScore.csv, included in this dataset, contains basic and advanced data for each of the 82 games played by 30 NBA teams in each season from the 2012-2013 to the 2017-2018 seasons. It covers only the regular games of the NBA seasons during this period. The dataset consists of two rows for each game, representing the home team and the away team, with a total of 14,758 rows (excluding one game in the 2012-2013 season). To predict game outcomes using GCN, we divided the statistics for each season into separate datasets, resulting in six datasets in total. Table 3 provides the definitions of the 44 features used in this study. All the features used in our prediction model were calculated per team and per game. They were calculated by aggregating the statistics of the players who played in the game for each team. We calculated features such as "Team2p%" and "Team3p%" by computing the percentage of 2-point and 3-point shots made by each team in that particular game. Other features such as "TeamFTA" and "TeamFTM" were calculated by counting the number of free throws attempted and made by each team in the game, respectively.

Feature Engineering
The primary objective of this paper was to utilize a GNN model to predict basketball game outcomes. To achieve this goal, feature extraction was critical for improving the accuracy of the model. The selection of appropriate features is crucial for accurate prediction, and it appears to be more important for the accuracy than the availability of a large number of games/instances (Bunker et al., 2022) [5]. The manual selection of features based on researchers' domain expertise can be challenging to interpret, so machine learning algorithms can be employed to output feature importance. The PCA, LASSO, and RF methods were used to extract key variables as input features, reduce the dimension of the input data, and thus improve the model's performance.
The feature extraction method applied to our GCN model consisted of two steps. The first step involved the correlation coefficient matrix, illustrated in Figure 4, which displays the correlation coefficients between the features. The heatmap represents the data graphically, using color to depict values: colors near 0 on the scale indicate that two features are uncorrelated, progressively lighter colors indicate a stronger positive correlation, and progressively darker colors indicate a stronger negative correlation. The overall coloring suggested that some of the 44 features were highly correlated, indicating redundancy in the feature information. Therefore, performing feature extraction was appropriate and meaningful.
The second step involved the application of three feature extraction methods proposed in this paper. We randomly split the dataset into training, validation, and test sets, using a 70-10-20 ratio. The training set was used to fit the feature extraction models, and then the validation and test sets were transformed using the same models. The first method was PCA, which is an unsupervised method that maximizes the variance, without using output information. PCA was utilized to reduce the dimension of the features and eliminate correlations among them. To determine the number of principal components n_components, we set this to a contribution rate of 0.95, resulting in the extraction of seven principal components. This step effectively reduced the model complexity, improved its running speed, and eliminated the influence of feature correlation.
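The split-then-extract procedure above can be sketched with scikit-learn on synthetic data (an illustration of the pattern, not the authors' code: 70-10-20 split, feature extraction fitted on the training set only, and `n_components=0.95` letting PCA keep just enough components for 95% of the variance):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data: features 5-9 are noisy copies of features 0-4
rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 10))
X[:, 5:] = X[:, :5] + 0.05 * rng.normal(size=(1000, 5))

# 70-10-20 split: carve off 20% test, then 1/8 of the rest (= 10% overall)
X_train, X_test = train_test_split(X, test_size=0.20, random_state=0)
X_train, X_val = train_test_split(X_train, test_size=0.125, random_state=0)

# Fit scaler and PCA on the training set only, then transform all splits
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=0.95).fit(scaler.transform(X_train))
Z_train = pca.transform(scaler.transform(X_train))
Z_val = pca.transform(scaler.transform(X_val))
```

Because half of the features are redundant, PCA discards roughly half of the dimensions while still retaining at least 95% of the variance, mirroring the reduction from 44 features to seven components reported in the paper.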
In subsequent experiments, we utilized a graph-based model, by inputting the seven principal components obtained earlier, in order to examine the model improvement effect.
The LASSO method is a powerful feature extraction technique that is able to reduce variance, select the most relevant features from a potentially large and multicollinear set of predictors, identify redundant predictors, and improve prediction accuracy. After applying LASSO, the coefficients of some features were reduced to zero, and these features were excluded from the model. The remaining features with nonzero coefficients were used to build the final model. In our experiment, we applied the scikit-learn implementation of the LASSO regression model to the training set to select the most important features, setting alpha to 0.1. To illustrate the application of LASSO, we used data from the NBA 2013-2014 season to predict game outcomes. Our results showed that, out of 44 features, only 12 were selected by LASSO for prediction. As shown in Table 4, these features included teamEDiff (0.361368), teamPF (−0.037388), teamFTA (0.028293), teamPPS (0.024126), teamDrtg (−0.017633), team3PA (−0.009556), teamBLK% (0.006646), team3P% (0.004803), teamSTL/TO (0.004417), teamDREB% (−0.004153), teamFIC (0.004135), and teamTO (−0.002860). Overall, our findings suggest that LASSO can effectively identify the most important predictors in complex regression models with many potential predictors, and thus improve prediction accuracy.
To identify the most important features and gain unique insights into their contributions to the prediction task, we employed the random forest (RF) method. In our study, we used the classification and regression tree (CART) algorithm for random forest and the mean decrease impurity method for calculating feature importance. By ranking the features according to their importance, low-importance features could be ignored without adversely affecting the accuracy of the model. This step helped to significantly reduce the noise and data redundancy in the analysis. The results of the RF method, presented in Table 5, provided clear indications of the feature importance, allowing us to reduce the number of features from 44 to 3. Among the top-ranked features, which consistently appeared as the most important across the six seasons, were teamEDiff, teamDrtg, and teamFIC. The extracted features were utilized to train the models, thereby enhancing their overall performance. To ensure the consistency of the feature importance results across the six datasets, we chose the 2013-2014 season as a representative sample. This season was selected because it demonstrated a consistent pattern of change in the feature importance results.

Experimental Results and Comparison
In this section, we present a quantitative analysis of our model, including an evaluation of its prediction accuracy and a comparison of its performance with other models. To construct the graph, we used the open-source fork of Thomas Kipf's original GCN model [38] available at [39], implemented in Python using PyTorch [40]. The adjacency matrix constructed in Section 3.1 was used to build the graph that was then fed into the GCN. We split the basketball dataset into 70% for training, 10% for validation, and 20% for testing. Our experiment involved training the GCN model for 500 epochs, with a learning rate of 0.01 and a hidden layer size of 64. To prevent overfitting and improve the model's generalization, we used a dropout rate of 0.5 during training. Tables 6-9 showcase the prediction accuracy of the various graph-based models in predicting basketball game outcomes using the 2012-2018 NBA season dataset. This study found that the graph architectures used in the models were highly effective in predicting the outcomes, with the graph-based models outperforming the previously studied prediction models.
This study also found that combining GCN models with feature extraction resulted in better outcomes than using GCN models alone. Feature extraction was identified as an important factor in achieving better results. We observed that the GCN + RF model achieved the highest average accuracy of 71.54% on all datasets, outperforming the widely used baseline model. The GCN + LASSO model also showed high accuracy levels and outperformed the original GCN model. However, the study found that GCN + PCA did not yield the desired outcomes. The PCA dimensionality reduction criterion selects principal components that maximize the variance of the original data on the new axis, potentially losing some important information. Moreover, features with small variances are not necessarily unimportant, and such a unique criterion may overlook crucial data. The results suggest that reducing the input space's dimensionality may have resulted in the loss of vital information necessary for predicting the game's outcomes.
The aim of this study was to determine the impact of graph-based models on the prediction of basketball game outcomes, and the findings demonstrated that the GCN + RF model achieved the best average performance over the six datasets. The study identified the most critical factors affecting the outcome of NBA games based on the evaluation metrics, providing valuable information for team administrators and coaches, who can utilize this information to improve team playing abilities based on the key factors that affect the winning or losing of a game.
Although adding more features should theoretically result in better prediction results, our study found that the prediction performance improved after extracting important features from the NBA dataset. This emphasizes the significance of appropriate feature extraction for effective prediction. Our results showed that, compared to the original GCN model, the GCN models derived from the RF and LASSO techniques demonstrated improved accuracy. While LASSO outperformed RF in certain seasons, on average, RF showed a slightly better accuracy. However, the GCN + PCA model exhibited a decrease in accuracy.
Furthermore, we compared the prediction results of our proposed model with three models commonly used in previous studies: the decision tree (DT) classifier, the support vector machine (SVM) classifier, and linear regression (LR). Table 6 displays the accuracy of our proposed model and the baseline models on the dataset without feature extraction. The results indicated that the GCN model had a predictive power similar to the DT, SVM, and LR models, and its predictive accuracy was even better when combined with a feature extraction method, as shown in Tables 7-9. Our proposed graph-based model was more effective in predicting basketball game outcomes than the other models when using the LASSO and RF extraction algorithms. It is worth noting that, although Li [28] utilized the same dataset as our study, our proposed model achieved a higher prediction accuracy. Table 10 presents a performance comparison of the two extraction algorithms, GCN + LASSO and our proposed method (GCN + RF), using 10-fold cross-validation. The data in the table show the accuracy of each method for different seasons, as well as the average accuracy across all seasons. The results indicate that our proposed method (GCN + RF) achieved a slightly better average accuracy than GCN + LASSO. Nevertheless, the performance of each method varied across seasons, indicating that neither method consistently outperformed the other in all situations.

Conclusions and Discussion
Predicting the outcome of a basketball game is an important but complex task, which relies on several factors, such as a team's status, the opponent's situation, and the internal and external environment. In this paper, we propose a graph-based model that takes into account the complex interactions among NBA teams to predict game outcomes. Our model constructs a graph, in which teams are nodes and are connected to their opponents, as well as their own past and future games. These nodes and edges form a message passing network that is trained using a semisupervised GCN model. To improve the prediction accuracy, we combined our model with feature extraction methods. We demonstrated the superiority of our proposed method by comparing it with other prediction studies.
This study used NBA regular season data from the 2012 to 2018 seasons, and the proposed model achieved the highest accuracy compared to the baseline models. We combined the GCN model with feature extraction methods, namely PCA, LASSO, and RF: the RF and LASSO methods improved the model's accuracy, while the PCA method decreased it. The experiment shows that one-hop neighbor aggregation in the GCN model gave the best results. Compared to other prediction studies, the proposed model considers the connections between NBA teams and the influence of opponents on the game, providing a new perspective for basketball sports management and performance prediction analysis.
The study has limitations that could be addressed in future research. The homogeneous graph constructed in the current model does not distinguish team nodes from opponent team nodes, and the relationship between nodes is singular. The use of spatiotemporal graphs to represent games on the same date could extend the spatial structure representation of NBA teams and better take into account time information. The category of nodes is a single team representation, and future research could consider adding information such as coaches and players, whose influence on game outcomes is crucial. Further research could apply the proposed model to other basketball leagues, use richer game data, find more suitable feature information and extraction methods, and adjust the number of layers and aggregation methods of the GNN, to improve the accuracy of the model in predicting game outcomes.
Funding: This research received no external funding.
Institutional Review Board Statement: Not applicable.

Data Availability Statement:
Publicly available datasets were analyzed in this study. The data can be found here: https://www.kaggle.com/datasets/pablote/nba-enhanced-stats, accessed on 9 September 2022.