CUDA-bigPSF: An optimized version of bigPSF accelerated with Graphics Processing Units

Accurate and fast short-term load forecasting is crucial for efficiently managing energy production and distribution. As such, many different algorithms have been proposed to address this topic, including hybrid models that combine clustering with other forecasting techniques. One of these algorithms is bigPSF, an algorithm that combines K-means clustering and a similarity search, optimized for use in distributed environments. The work presented in this paper aims to improve the time required to execute the algorithm with two main contributions. First, some of the issues of the original proposal that limited the number of cores used simultaneously are studied and highlighted. Second, a version of the algorithm optimized for Graphics Processing Units (GPUs) is proposed, solving the previously mentioned issues while taking into account the GPU architecture and memory structure. Experimentation was done with seven years of real-world electric demand data from Uruguay. Results show that the proposed algorithm consistently executed faster than the original version, achieving speedups of up to 500 times during the training phase.


Introduction
Since electricity was discovered, humanity has created a steadily growing number of devices that make use of it. Most of the time, people use electricity simultaneously for multiple applications, such as lighting, refrigeration, cooling, or heating, among others. The energy required for this is usually provided via an interconnected electricity network known as the "power grid".
However, many complex factors have to be taken into account in the management of the power grid, such as the use of renewable energy sources, whose output depends on weather conditions, or electricity transmission losses. Thus, it is common to use Artificial Intelligence (AI) systems to assist in the management of the power grid, particularly in the prediction of energy demand and renewable energy production (Bose, 2017).
Over the last two decades, technical advancements have led to the wider use of smart meters (Zheng et al., 2013), devices that measure in real time the electricity that consumers import from and export to the grid. These devices also periodically provide the energy provider with energy consumption data, which can be used to optimize energy production and distribution in entire regions. With the adoption of these devices and the increasing energy-consumption transparency of public entities and governments, researchers have a wide range of data available to study energy consumption. However, in many cases, using this type of data poses considerable challenges, as the sheer amount of data may significantly increase the computational power required to train these AI systems.
The relevance of energy in our current society has led to its study under many different scenarios. The algorithms used for this task (Kong et al., 2019) cover a wide range, from easy-to-understand and interpretable models, such as ARIMA, to highly accurate black-box models: neural networks, deep learning, and ensembles of different models, among others. Pattern Sequence-based Forecasting (PSF) (Martinez Alvarez et al., 2011) is an interesting middle-ground approach that has previously provided remarkable results in the energy field. This algorithm creates hybrid models that combine clustering with additional methods to extract patterns and make computations based on those patterns. PSF and many of its improved versions present interesting properties in big data scenarios; e.g., the clustering-based pattern extraction reduces the computational cost of the second step of the algorithm, the pattern sequence-based forecast. However, they still require intense computational power, as each prediction requires an independent clustering and pattern sequence-based forecast, which severely increases the time needed to train and predict with these models.
Parallel and distributed approaches are frequently used to reduce the time needed to train algorithms with high computational demands. An improved version of PSF specialized for Apache Spark clusters, called "bigPSF", was presented in 2020 (Pérez-Chacón et al., 2020). However, to date, no work has studied PSF algorithms on GPU architectures. In this paper, a new version of the bigPSF algorithm accelerated with Graphics Processing Units (GPUs) is proposed, hereafter referred to as "CUDA-bigPSF". This work provides two main contributions to this research field:
• The first GPU implementation of a pattern sequence-based algorithm is developed, significantly reducing the time required to train and use this model.
• Some of the issues of the original bigPSF proposal are highlighted, along with how they could be solved to obtain better performance when using a distributed environment.
This manuscript is structured as follows: Section 2 reviews relevant related papers on pattern sequence-based forecasting and GPU algorithms with a focus on big data energy problems. Section 3 describes the CUDA/GPU architecture and explains the CUDA-bigPSF algorithm. Section 4 studies the GPU implementation's accuracy, speedup, and scalability. Lastly, Section 5 draws the most relevant conclusions of this work and proposes future lines of research.

Related works.
This section is structured in two independent parts and reviews the most relevant related works in the field. In the first part, works related to the PSF algorithm are reported and discussed. In the second part, we review the use of the GPU in AI and, more specifically, in the energy field.
The PSF algorithm was published in 2011 (Martinez Alvarez et al., 2011). This algorithm starts by applying K-means clustering to transform the time series before the prediction date into a sequence of cluster identifiers (labels). Afterwards, the algorithm splits the labeled sequence using a sliding window of size W. In order to make the prediction, the algorithm looks for patterns similar to the last one created with the sliding window, i.e., the pattern of the W days before the prediction date. The final prediction is the average of the days following all the occurrences found, computed using the original time series.
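As an illustration, the pattern-sequence search described above can be sketched in Python (a simplified CPU sketch with hypothetical helper names; the K-means step that produces the labels is assumed to have already run):

```python
import numpy as np

def psf_predict(series, labels, w):
    """Illustrative sketch of the PSF similarity search.

    series: array of shape (n_days, n_values) with the original daily curves.
    labels: cluster identifier of each day, as produced by K-means.
    w: window size for the pattern sequence.
    """
    query = labels[-w:]                      # pattern of the last w days
    matches = []
    # look for earlier days whose preceding w labels equal the query
    for day in range(w, len(labels)):
        if np.array_equal(labels[day - w:day], query):
            matches.append(series[day])      # the day following the pattern
    if not matches:
        return None                          # PSF falls back to a smaller w
    return np.mean(matches, axis=0)          # plain average in the original PSF
```

In the original PSF, when no match exists for a window of size w, the search is retried with a smaller window, which the caller of this sketch would handle.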
PSF has shown excellent results when working with energy data and, as such, this is its primary use. Nevertheless, it has also been used to forecast energy prices (Jin et al., 2015), wind speed (Bokde et al., 2017), and solar power (Fujimoto & Hayashi, 2012), or even to impute missing data (Bokde et al., 2018). Several authors have proposed variants and improvements to overcome some of the original algorithm's limitations. In (Jin et al., 2015), the authors used the Self-Organizing Map (SOM) and neural networks to create a specialized version that preserves the input space's topological properties. Similarly, in (Martínez-Álvarez et al., 2019), the authors proposed a specialized version for functional data (funPSF) through the use of a functional clustering algorithm, funHDDC (Bouveyron & Jacques, 2011). They also created a version with specialized models for each day of the week (7-funPSF) that provided significantly better results than the previous one. In (Shen et al., 2013), the authors evaluated PSF with five different clustering methods (K-means, SOM, K-medoids, hierarchical clustering, and fuzzy C-means), both individually and in an ensemble. (Jin et al., 2014) introduced a weighted mean that gives more relevance to the most frequent patterns for each day of the week. Lastly, the algorithm our work is based on (Pérez-Chacón et al., 2020) adapts the original PSF algorithm for clusters with Apache Spark. Beyond the distributed approach, this algorithm also includes a weighted mean that gives more relevance to the matches closer to the prediction date and a grid search of hyperparameters to find the best solution at the expense of more computational power.
The rise of big data and of many other computationally expensive data science methodologies, such as AutoML, has led to a higher interest in parallel and distributed algorithms capable of providing similar results in less time. Researchers and companies have published open-source GPU-accelerated implementations of traditional machine learning algorithms. Facebook's FAISS library (Johnson et al., 2021) optimizes similarity search and clustering of dense vectors, providing fast K-means clustering and nearest neighbour search algorithms. ThunderSVM (Wen et al., 2018) provides a GPU adaptation of Support Vector Machines with the standard kernels used for classification and regression. Most gradient boosting machines provide GPU-accelerated implementations, such as XGBoost (Chen & Guestrin, 2016) or LightGBM (Ke et al., 2017). NVIDIA recently launched cuML (Raschka et al., 2020), a CUDA-specific open-source library that accelerates the algorithms included in the popular Python package scikit-learn.

Table 1
Summary of related works using the GPU in the energy field.

Neural network frameworks, such as TensorFlow (Abadi et al., 2016) or PyTorch (Paszke et al., 2019), provide GPU-accelerated implementations optimized for deep neural networks and represent the broadest use of GPUs in AI research nowadays.
The energy field is no different, and most relevant recently published works use GPU-accelerated neural network frameworks or parallelized metaheuristics to train neural networks. Table 1 presents a summary of the most relevant works in the energy field using the GPU.
Although PSF algorithms have previously shown excellent results in energy forecasting, to the best of our knowledge, the use of the GPU for PSF algorithms has yet to be studied. As such, the study and proposal of our GPU-accelerated algorithm, CUDA-bigPSF, is justified.

The CUDA architecture.
GPUs were initially conceived to accelerate graphical computation. However, the massively parallel architecture of the GPU was also of interest in many other fields that could use it to accelerate their applications and simulations, leading to an evolution of the GPU programming model towards the paradigm known nowadays as General-Purpose GPU (GPGPU). As part of this evolution, new user-friendly languages were created to avoid the complexity of writing general-purpose code through graphical APIs or assembly. An example of this is the Compute Unified Device Architecture (CUDA), a proprietary extension of C++ made to facilitate GPGPU programming on NVIDIA graphics cards.
In CUDA, the set of instructions to be executed by each GPU thread is written in special functions called kernels. The programmer specifies the kernel's total number of threads by dividing them into a grid of blocks. Each block always contains a fixed number of threads, and all threads within the same block can be synchronized and can access a special programmer-managed cache for fast collaboration. The grid indicates the total number of blocks required to execute the kernel. The number of threads in a given block can be specified in one, two, or three dimensions to overcome some limitations and to provide easier abstractions in algorithms involving complex structures such as matrices. The same applies to the dimensions of a grid.
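For illustration, the usual relationship between the number of work items, the threads per block, and the grid size can be sketched as follows (a hypothetical helper; in CUDA these values are passed in the kernel launch configuration):

```python
import math

def launch_config(n_items, threads_per_block=32):
    """Illustrative CUDA-style grid sizing: the number of blocks is the
    smallest integer such that blocks * threads_per_block >= n_items.
    Threads whose global index exceeds n_items simply do no work."""
    blocks = math.ceil(n_items / threads_per_block)
    return blocks, threads_per_block
```

For example, covering 100 work items with 32-thread blocks requires 4 blocks (128 threads, of which the last 28 are idle).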
A CUDA-capable GPU has one or more streaming multiprocessors, each containing a set of cores, registers, cache memory and a scheduler.

The bigPSF algorithm.
BigPSF provides an improved PSF algorithm for distributed environments. The training process of the algorithm finds the optimal hyperparameters (number of clusters and window size) through a grid search evaluated on a validation partition. The training and test processes are carried out sequentially over all the days in their corresponding partition, using the additional computational power to accelerate each prediction.
The bigPSF accelerated prediction algorithm (Fig. 2) starts by creating a distributed structure called an RDD (Resilient Distributed Dataset) from the original dataset samples before the prediction date. This RDD is shuffled into random partitions distributed over the nodes available in the cluster. The algorithm continues by applying K-means over the RDD. Centroids are initialized using the k-means++ algorithm (Arthur & Vassilvitskii, 2007).
Afterwards, each node finds the closest cluster for the partitions of the RDD available in the node and computes a partial centroid update. After each iteration, partial centroids are communicated to the primary node to obtain the final centroids of the iteration. K-means clustering ends after reaching a maximum number of iterations or convergence. The clustering process finishes with the creation of a new RDD, in which each sample is transformed into its closest cluster identifier. Each compute node does this last step independently, as synchronization is unnecessary. Then, the algorithm creates its most complex structure, the "pattern matrix", in a new RDD. Each row of this RDD contains a row identifier id, a sequence of W labels from the days between id and id + W − 1, and a data copy (hValue) of the day id + W of the original dataset. This structure is generated by grouping all the possible sequences of labels of length W from the previous RDD. Finally, the algorithm filters all rows in the pattern matrix that share the same pattern and day of the week as the prediction date. The prediction is the weighted average of the data copies sharing the same sequence of labels and day of the week. The weight for each match is calculated as:

\[ weight_i = \frac{id_i}{\sum_{j \in matches} id_j} \quad \text{(eq. 1)} \]

where id_i is the row identifier of the match and matches contains all pattern occurrences in the pattern matrix.
A small example of how bigPSF calculates a prediction is provided in Fig. 3, where the algorithm computes the prediction for the day with ID 100 with a window size of W = 3 and a number of clusters K = 5. The algorithm starts by applying K-means to all the data prior to the day to be predicted and labeling each day with its closest cluster (upper row of the figure). Then, making use of the labeled dataset and the window size, the pattern matrix is constructed. The last row of the pattern matrix indicates the pattern of the day to be predicted. All previous rows of the pattern matrix containing the same pattern are filtered, and the final prediction is the weighted average of the hValues of the selected rows (using the weights provided in eq. 1).
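The weighted average of eq. 1 can be sketched as follows (a simplified illustration with hypothetical names; in bigPSF the match identifiers and hValues come from the filtered pattern matrix):

```python
import numpy as np

def weighted_psf_average(match_ids, match_values):
    """Sketch of the bigPSF weighted mean (eq. 1): matches closer to the
    prediction date (larger row identifier id_i) receive larger weights."""
    ids = np.asarray(match_ids, dtype=float)
    values = np.asarray(match_values, dtype=float)
    weights = ids / ids.sum()            # weight_i = id_i / sum_j id_j
    return weights @ values              # weighted average of the hValues
```

For instance, a match with id 3 counts three times as much as a match with id 1, reflecting the preference for recent patterns.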

CUDA-bigPSF.
The bigPSF algorithm exhibits parallelism in two primary ways. In the first one (data parallelism), the computation for each sample in the dataset is done in parallel, as proposed in the original bigPSF algorithm. In the second one, each prediction is computed sequentially within an independent thread. There are several reasons why the second approach is better when using the GPU. First, to obtain a significant speedup, it is imperative to keep all GPU threads busy. However, if a data-parallelism strategy is used, there would be several instances in which some threads would have to wait until all the others finish for synchronization purposes. For example, after each K-means iteration, the algorithm would need synchronization to ensure centroids are updated before the next iteration begins. Second, if the number of days in the dataset is smaller than the number of cores available in the GPU, the second approach would keep more CUDA cores busy as long as three or more different numbers of clusters are being evaluated simultaneously. Last, the limited memory available in the GPU makes the second approach better for scalability, as memory accesses are local to the thread except for reading the dataset and writing the final result. Therefore, after each thread finishes its work, its local memory resources can be released to be used by another thread, significantly improving the scalability of the proposed approach.
As such, CUDA-bigPSF (Fig. 4) distributes the work among independent threads, each computing the prediction for a given date. All threads have access to the entire dataset and to the final output structure in global memory. The thread identifier determines up to which date of the input dataset each thread may access and where it must write its predictions in the output structure. The kernel (Algorithm 1) is launched using a bi-dimensional grid: one dimension spans the number of cluster values to be evaluated, and the other the minimum number of blocks needed to cover the validation (or test) partition. The algorithm each thread executes starts with a standard implementation of Lloyd's K-means algorithm, initializing the centroids with the K-means++ algorithm. The clustering process finishes after reaching a maximum number of iterations or convergence. The objective function of the K-means algorithm is to minimize the Within Set Sum of Squared Errors (WSSSE) of each cluster, defined as follows (eq. 2):

\[ WSSSE = \sum_{j=1}^{K} \sum_{x_i \in C_j} d(x_i, c_j)^2 \]

where d(x_i, c_j)^2 is the squared Euclidean distance between each sample x_i of the cluster C_j and the centroid of that cluster, c_j. The algorithm iterates over the entire dataset once per iteration, calculating the closest cluster to each sample, adding the sample to an array used to compute the centroids for the next iteration, and incrementing another structure used to count the number of samples in each cluster. The centroids for the next iteration are obtained by dividing these last two data structures (computing the mean).
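The per-thread Lloyd's iteration described above can be sketched on the CPU as follows (a simplified, hypothetical illustration; the actual kernel operates on GPU memory with the sum and count structures local to each thread):

```python
import numpy as np

def kmeans_iteration(samples, centroids):
    """One Lloyd's iteration as described above: assign each sample to its
    closest centroid, accumulate per-cluster sums and counts, then divide
    the two structures to obtain the next centroids."""
    k = len(centroids)
    sums = np.zeros_like(centroids, dtype=float)     # per-cluster sample sums
    counts = np.zeros(k, dtype=int)                  # per-cluster sample counts
    labels = np.empty(len(samples), dtype=int)
    for i, x in enumerate(samples):
        # squared Euclidean distance to every centroid
        d2 = ((centroids - x) ** 2).sum(axis=1)
        j = int(np.argmin(d2))
        labels[i] = j
        sums[j] += x
        counts[j] += 1
    # mean of each cluster (guarding against empty clusters)
    new_centroids = sums / np.maximum(counts, 1)[:, None]
    # WSSSE (eq. 2) against the updated centroids
    wssse = sum(((samples[labels == j] - new_centroids[j]) ** 2).sum()
                for j in range(k))
    return new_centroids, labels, wssse
```

Iterating this function until the centroids stop changing (or a maximum number of iterations is reached) reproduces the clustering step each thread performs.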
Next, the pattern sequence-based algorithm starts. First, the query is calculated, i.e., the labels (cluster identifiers) for the w samples before the prediction date. Then, the algorithm strides weekly over the days in the dataset that share the same day of the week as the prediction date. To evaluate a dataset sample i, the label of the sample w days before it is computed. A match is found for a window size of one if it shares the same label as the last position of the query. The same condition applies for any window size w, except that the previous window size w − 1 also needs to have produced a match. The computation for each w is done in ascending order to avoid unnecessary calculations.
Every match found indicates that the sample must be used in the weighted average for the current prediction date and window size. To use only one stride over the entire dataset, two data structures are required to compute the weighted average, similar to the procedure previously used for the K-means centroids. Since each bigPSF weight is a fraction whose denominator is the sum of all numerators, whenever a match is found the sample is partially weighted by multiplying it by its numerator and accumulated in one data structure, while an additional data structure accumulates the sum of all numerators (the denominator).
Lastly, the thread computes the division of the previous two data structures to obtain the prediction for a given date for all possible values of w. As the bigPSF algorithm specifies, the prediction obtained with a window of size w − 1 is used if there are no matches for a window size of w. Occasionally, the algorithm may fail even for a window size of one. In those scenarios, all samples before the prediction date are used, regardless of the day of the week. The kernel finishes by copying the local structure containing the predictions for all possible values of w to its corresponding place in global memory so that the CPU can access the results.
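The similarity search and weighted accumulation described in the previous paragraphs can be sketched as follows (a simplified CPU illustration with hypothetical names; the weekly stride and the day-of-week fallback of the real algorithm are omitted for brevity):

```python
import numpy as np

def similarity_search(series, labels, pred_idx, max_w):
    """Sketch of the per-thread search: for each candidate day, grow the
    matched window length in ascending order while labels keep agreeing,
    accumulating weighted sums (numerator) and weight totals (denominator)
    per window size; smaller windows back up larger ones with no matches."""
    num = np.zeros((max_w, series.shape[1]))   # per-w weighted sums
    den = np.zeros(max_w)                      # per-w sums of weights
    query = labels[pred_idx - max_w:pred_idx]  # last max_w labels, oldest first
    for i in range(max_w, pred_idx):           # candidate days to average
        w = 0
        # ascending w: a match at size w requires a match at size w - 1
        while w < max_w and labels[i - 1 - w] == query[max_w - 1 - w]:
            w += 1
            num[w - 1] += i * series[i]        # numerator weight is the id
            den[w - 1] += i
    preds = np.full((max_w, series.shape[1]), np.nan)
    for w in range(max_w):
        if den[w] > 0:
            preds[w] = num[w] / den[w]         # weighted average for this w
        elif w > 0:
            preds[w] = preds[w - 1]            # fall back to smaller window
    return preds
```

Note how a single pass over the candidate days fills the accumulators for every window size at once, which is the property the paper exploits to avoid materializing the pattern matrix.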
As a last note, different clustering algorithms could be used instead of K-means. Although a similar approach to the one proposed for K-means could be used for any clustering algorithm, the optimal GPU implementation will change significantly depending on the data structures and computations required by each algorithm. Nevertheless, using K-means provides several advantages that lead to substantially faster execution times than most clustering methods: only one hyperparameter has to be tuned (the number of clusters), and only a very small data structure per execution of K-means (the cluster centroids) has to be stored in memory, which will usually fit in the cache memory even when many predictions, and thus clustering processes, are being computed in parallel.

Experimental Setup
We have used the same dataset as the bigPSF paper to compare our results. This dataset contains electricity consumption data from Uruguay between 2007 and 2014, recorded hourly. The average demand observed is 1092.21 MW, with a minimum of 609.87 MW and a maximum of 1907.55 MW. Fig. 5 displays the energy consumption distribution by day of the week. We can observe from this figure that energy demand on weekends is lower than on weekdays, as expected (Raza & Khosravi, 2015). We did not need additional preprocessing, since the dataset did not present any missing observations or extreme outliers. The dataset was split into 70 % training and 30 % test, with the last 30 % of the training partition used as validation for the hyperparameter optimization, as specified in the bigPSF paper.
All experiments were done on a personal computer with an AMD Ryzen 5 2600X CPU running at 3.6 GHz, an NVIDIA GeForce RTX 3060 Ti 8 GB graphics card, and 32 GB of DDR4 RAM. The code was written using Python 3.11 and CUDA 11.8. CUDA experiments were repeated 30 times with seeds from 1996 to 2025. For the CUDA-bigPSF kernel, we used 32 threads per block, as it provided the fastest results.

Implementation accuracy.
In this section, we compare the accuracy of our implementation with the results provided in the original paper. Even though we implemented the same algorithm with a different approach, we cannot obtain exactly the same results as the original authors due to the randomness in the initialization of K-means and the fact that the original authors did not seed their experiments. As such, we can only evaluate whether we obtained reasonably similar results during training and test.
During the training phase, the Mean Absolute Percentage Error (MAPE) was used, as is done in bigPSF. This metric, defined as

\[ MAPE = \frac{100}{n} \sum_{i=1}^{n} \frac{\lvert y_i - \hat{y}_i \rvert}{\hat{y}_i}, \]

has the advantage of being scale-independent and easy to interpret, as it represents the average distance between forecasted and expected values as a percentage. For all equations, n represents the total number of samples, y_i the forecasted sample at index i, and ŷ_i the expected value.
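A minimal sketch of the metric, under the definitions above:

```python
import numpy as np

def mape(forecast, expected):
    """Mean Absolute Percentage Error, in percent."""
    forecast = np.asarray(forecast, dtype=float)
    expected = np.asarray(expected, dtype=float)
    return 100.0 * np.mean(np.abs(forecast - expected) / np.abs(expected))
```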
Table 2 displays the difference in MAPE during training between the average of 30 repetitions of CUDA-bigPSF and bigPSF (enclosed in parentheses). As we can observe, both algorithms provide relatively similar results considering the randomness of K-means initialization. The most significant difference in MAPE between the approaches is 0.59 % with k = 6 and w = 6. The best averaged MAPE found by CUDA-bigPSF was 4.51 % with k = 14 and w = 1, while the best MAPE for bigPSF was 4.52 % with k = 13 and w = 2. In one of the experiment's repetitions with k = 15, CUDA-bigPSF could not provide at least one prediction, even after removing the day-of-the-week constraint. Thus, we have excluded that seed (1998) from the average displayed in the table for k = 15. In 27 out of the 30 experiment repetitions, a window size of one provided the best results, questioning whether it is advantageous to study the use of a broader window size or whether we should limit the window size from the start to reduce the algorithm's computational complexity. In 18 out of the 30 experiment repetitions, a number of clusters of k = 15 provided the best results, followed by 5 repetitions with k = 13 and 4 repetitions with k = 14.
We applied a similar methodology to compare the results on test using the 30 seeds with their optimal hyperparameters. For test, two additional metrics are used: the Mean Absolute Error (MAE) and the Root Mean Squared Error (RMSE). The MAE (eq. 3) provides the average difference between the forecasted and expected values in the original scale of the data, while the RMSE (eq. 4) gives a higher penalization to large errors between forecasted and expected values.
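Both test metrics can be sketched as:

```python
import numpy as np

def mae(forecast, expected):
    """Mean Absolute Error, in the original scale of the data."""
    diff = np.asarray(forecast, dtype=float) - np.asarray(expected, dtype=float)
    return np.mean(np.abs(diff))

def rmse(forecast, expected):
    """Root Mean Squared Error; squaring penalizes large errors more."""
    diff = np.asarray(forecast, dtype=float) - np.asarray(expected, dtype=float)
    return np.sqrt(np.mean(diff ** 2))
```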
Table 3 summarizes the results of our 30 repetitions for CUDA-bigPSF and the results reported for bigPSF. Our implementation obtains similar results on average for MAPE and MAE, and the best experiment done with CUDA even improves the results reported for bigPSF substantially. However, there is an unexpected difference in the RMSE metric that we cannot explain. A comparison of the results provided by the bigPSF / CUDA-bigPSF algorithms with other forecasting algorithms, such as neural networks, ARIMA, and gradient boosting trees, can be found in (Pérez-Chacón et al., 2020).

Table 2
MAPE (%) for the grid search during the training phase for CUDA-bigPSF and bigPSF (enclosed in parentheses). Best values for each method in bold.

Implementation speedup and scalability.
Finally, we compare the execution times of the Spark version, the CUDA version, and a sequential CPU version used as a baseline. Table 4 reports the performance of each architecture with the original dataset and with synthetic datasets created by repeating the original dataset, as is done in the bigPSF paper.
First, it is important to note that even though the number of cores used for bigPSF seems small, the authors reported that using a higher number of cores does not improve the results but actually makes them slower. This happens because many steps of bigPSF, with its data distribution approach, require synchronization and node cooperation, unlike our GPU approach. As such, even though it takes almost 19 min to train the algorithm with Spark, our GPU version can train it (find the optimal number of clusters and window size) in under two seconds using the full potential of all its cores. Interestingly, our sequential implementation was slightly faster than the Spark version, training 2 min faster, although this is easily explained, as our CPU has a much higher clock speed and the Spark version only uses two cores. The evolution of training time for all approaches and the speedup obtained by bigPSF and CUDA-bigPSF are displayed in Fig. 6, where the speedup is calculated by dividing the sequential version's time by the accelerated version's time. The Spark approach struggles to obtain a significant speedup until using 28 years of data. Meanwhile, our GPU approach produces results over 500 times faster than both methods for seven years of data and still manages to produce results at least 300 times faster when using the largest amount of data evaluated in this paper (112 years).
From the previously discussed results, it is clear that using a CUDA device will produce faster results than the Spark approach in most situations. In fact, the Spark approach only uses a significant number of cores when training with an unreasonably large dataset. It is also important to note that, due to the weighting system used in bigPSF, older samples influence the prediction at a much lower rate. As such, at some point, adding more data will, at best, amount to no more than a rounding error in the final forecast. The only situation in which the CUDA version proposed in this paper should perform significantly worse than reported is with GPU devices that cannot store all the data structures in device memory. During the implementation and explanation of our algorithm, we have taken this into account and used local memory whenever possible, so that once a thread finishes its work, another thread can reuse that memory. As a last resort, the user can reduce the number of cluster values evaluated simultaneously to reduce the amount of local memory used per thread. Nevertheless, this algorithm should provide good results in most cases, even on low-end NVIDIA graphics cards.

Conclusion
The main objective of the work presented in this paper was to create a high-performance GPU implementation of bigPSF, a load forecasting algorithm designed for distributed environments. The proposed algorithm was evaluated with the same dataset of energy consumption from Uruguay used in bigPSF, allowing a direct comparison between both methods. The design of the GPU version took into account some of the limitations of the bigPSF algorithm through two main contributions. First, CUDA-bigPSF uses a completely different approach to distribute the work between the cores, removing almost all the need for synchronization and communication between nodes. Second, CUDA-bigPSF takes into account several factors to avoid unnecessary computations and removes one of the costly data structures used in bigPSF, the pattern matrix.
Results show that CUDA-bigPSF provides a correct implementation of bigPSF capable of achieving speedups of up to 500 times over the original bigPSF during the training phase. As such, the work presented in this paper makes bigPSF more accessible to researchers and practitioners, as GPU devices are more widespread and cheaper than access to a distributed cluster. Furthermore, many of the solutions proposed in this paper for the GPU can also be used to improve the distributed version of the algorithm.
There are several directions for future work on the algorithm presented in this paper. One possibility is to evaluate and optimize the use of different clustering methods, or ensembles of them, evaluating their training time and accuracy on different datasets. Additionally, it may be useful to develop versions of the algorithm for multivariate time series. Another possible direction is to combine this algorithm in an ensemble with other forecasting algorithms to potentially improve forecast accuracy.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Fig. 1. A schematic of the memory layout and multiprocessors of the GPU device used in this research.

Fig. 2. A general scheme of the steps done by the bigPSF algorithm for each prediction.

Fig. 3. An example of how the bigPSF algorithm calculates a prediction on a simulated dataset for K = 5 and W = 3.

Fig. 4. A flowchart of the work executed by each thread in CUDA-bigPSF.

Fig. 5. Box plot of the energy consumption for each day of the week.

Fig. 6. On the left, line plot of the time spent in training by each method. On the right, speedup obtained by the Spark and CUDA versions over a sequential implementation.

Table 3
Summary of results obtained by the algorithms in quality metrics for the test partition.

Table 4
Execution time per version of the algorithm in hh:mm:ss.