GPU accelerated modelling and forecasting for large time series

Modelling of large scale data series is of significant importance in fields such as astrophysics and finance. The continuous increase in available data requires new computational approaches such as the use of multicore processors and accelerators. Recently, a novel time series modelling and forecasting method was proposed, based on a recursively updated pseudoinverse matrix, which enhances parsimony by enabling assessment of basis functions before inclusion into the final model. Herewith, a novel GPU (Graphics Processing Unit) accelerated matrix based auto-regressive variant is presented, which utilizes lagged versions of a time series and interactions between them to form a model. The original approach is reviewed and a matrix multiplication based variant is proposed. The GPU accelerated and hybrid parallel versions are introduced, utilizing single and mixed precision arithmetic to increase GPU performance. Discussions around performance improvement and high order interactions are given. A block processing approach is also introduced to reduce memory requirements for the accelerator. Furthermore, the inclusion of constraints in the computation of weights, corresponding to the basis functions, with respect to the parallel implementation is discussed. The approach is assessed in a series of model problems and discussions are provided.


ICCS Camera Ready Version 2022
To cite this paper please use the final published version: DOI: 10.1007/978-3-031-08757-8_33

Introduction
Modelling and forecasting time series has several applications in a number of scientific fields including signal processing, computational finance and astrophysics. Modelling of time series can be performed with traditional approaches [21], such as Auto-Regressive Integrated Moving Average (ARIMA) models [3] and Exponential Smoothing (ES) [4,16], or machine learning methods such as Long Short-Term Memory Networks (LSTM) [15] and Support Vector Regression (SVR) [8]. Another important family of techniques is based on orthogonalization of a set of basis functions, such as Fast Orthogonal Search [17] or Matching Pursuit [22]. These techniques construct a model from a set of basis functions (linear or non-linear) using orthogonalization procedures such as Gram-Schmidt orthogonalization. Recently, a recursive Schur complement pseudoinverse approach for modelling time series was introduced [12]. This approach avoids orthogonalization, while enabling the use of preconditioned iterative methods for reducing memory requirements in the case of a large number of basis functions [10]. These methods enable the use of arbitrary basis functions including linear, trigonometric, auto-regressive or machine learning based [12,13,11].
The emergence of big data led to an increase in interest in the area of parallel computing in order to reduce processing times. Extensive research has been carried out in the parallelization of machine learning methods, especially neural networks, in multicore systems, distributed systems and accelerators such as GPUs [2,18,1,23]. Parallelization was used to reduce training times, especially for deep neural network architectures and for very large input datasets. GPU acceleration has also been utilized to reduce training and optimization times for Support Vector Machines [25,24,19]. In the majority of these approaches, the training operations are reformed to take advantage of matrix by matrix (BLAS3) kernels that can be efficiently parallelized in GPUs and modern multicore hardware. Another important modification is the use of mixed precision arithmetic, combining half precision, single precision and double precision arithmetic, which substantially improves performance and storage requirements [20,1]. In the literature, efforts have also been directed towards the parallelization of Matching Pursuit type methods [9,6] for GPUs. Despite the extensive literature and software available for parallelizing machine learning methods, literature around parallelization of techniques such as Fast Orthogonal Search is limited.
Herewith, we propose a novel parallel implementation of the recently proposed recursive Schur complement pseudoinverse matrix modelling based on auto-regressive basis functions. Initially, the method is recast to take advantage of BLAS3 operations during the basis search operation, while avoiding redundant computations that would increase computational work and memory requirements. Then, the parallel algorithm, which utilizes mixed precision arithmetic, is presented and discussed along with a pure GPU implementation and block mixed precision variants. Multiplicative interactions between auto-regressive basis functions are also discussed. Implications related to precision and memory transfers between CPU and GPU are analyzed. The proposed scheme is assessed by modelling and forecasting two time series with different characteristics and sets of candidate basis functions. Scalability results are also presented and discussed.
In Section 2, the recursive Schur complement based pseudoinverse matrix of basis functions is reviewed and insights on the basis functions selection, higher order basis interactions and termination criteria are given. In Section 3, the matrix based variant is proposed and the CPU/GPU and pure GPU implementations are presented. Moreover, a block variant is given along with discussions on memory requirements, data transfer overhead and the effects of mixed precision arithmetic. In Section 4, numerical results are presented depicting the applicability and accuracy of the proposed scheme along with discussions on scalability and implications of mixed precision arithmetic.

Recursive Schur complement based time-series modelling
The coefficients of a model, corresponding to the time series $y$, with linearly independent basis functions stored in the columns of a matrix $X$ can be computed as follows:

$$a = (X^T X)^{-1} X^T y,$$

where $a$ is a vector of length $n$ retaining the coefficients corresponding to the $n$ basis functions retained in $X$. The matrix $X^T X$ and its inverse are Symmetric Positive Definite [10]. In most cases the basis functions are not all known "a priori" or their contribution to error reduction is not significant. In order to avoid such issues a recursive pseudoinverse matrix approach has been proposed [12], based on a symmetric variant of the matrix preconditioning technique introduced in [10]. Following this approach and given an additional basis $F$, so that $X_{i+1} = [X_i \;\; F]$ with $1 \le i \le n$, we have:

$$(X_{i+1}^T X_{i+1})^{-1} = \begin{bmatrix} X_i^T X_i & X_i^T F \\ F^T X_i & F^T F \end{bmatrix}^{-1} = \begin{bmatrix} (X_i^T X_i)^{-1} + g_{i+1} s_{i+1}^{-1} g_{i+1}^T & g_{i+1} s_{i+1}^{-1} \\ s_{i+1}^{-1} g_{i+1}^T & s_{i+1}^{-1} \end{bmatrix},$$

or equivalently:

$$(X_{i+1}^T X_{i+1})^{-1} = G_{i+1} D_{i+1}^{-1} G_{i+1}^T = \begin{bmatrix} G_i & g_{i+1} \\ 0 & 1 \end{bmatrix} \begin{bmatrix} D_i^{-1} & 0 \\ 0 & s_{i+1}^{-1} \end{bmatrix} \begin{bmatrix} G_i^T & 0 \\ g_{i+1}^T & 1 \end{bmatrix},$$

where $g_{i+1} = -G_i D_i^{-1} G_i^T X_i^T F$ and $s_{i+1}$ denotes the Schur complement corresponding to the addition of basis function $F$. The initial conditions for the recursive formulation are $G_1 = 1$ and $D_1^{-1} = (X_1^T X_1)^{-1}$. The updated set of coefficients $a_{i+1}$ can be computed alternatively using the following equations:

$$a_{i+1} = \begin{bmatrix} a_i^* \\ b \end{bmatrix} = \begin{bmatrix} a_i + g_{i+1} b \\ b \end{bmatrix},$$

with:

$$b = s_{i+1}^{-1} (F^T + g_{i+1}^T X_i^T) y,$$

where $b$ denotes the coefficient corresponding to basis function $F$ and $a_i^*$ is the vector of updated coefficients after addition of basis function $F$. The vector $g_{i+1}$ corresponds to the $(i+1)$-th column of $G_{i+1}$ and $s_{i+1} = F^T (F + X_i g_{i+1})$ is the Schur complement. The modelling error $\rho_{i+1} = \|r_{i+1}\|_2^2$ corresponding to the addition of a basis function $F$ can be computed as:

$$\rho_{i+1} = \rho_i - b^T s_{i+1} b.$$

The quantity $e_{i+1} = b^T s_{i+1} b$ denotes the error reduction corresponding to the addition of a basis function $F$ [12]. In order to ensure positive definiteness of the dot product matrix $X_{i+1}^T X_{i+1}$, the quantity $e_{i+1}$ should be bounded by $0 \le e_{i+1} \le \|r_i\|_2^2$. It should be noted that $\rho_{i+1} = \|r_{i+1}\|_2^2 = T \cdot MSE$ is the Squared Error, where $T$ is the number of samples in time series $y$ and MSE denotes the Mean Squared Error. A detailed description and additional discussions regarding the method are given in [12,13,11].
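The recursion reviewed above can be illustrated with a minimal NumPy sketch. It assumes the update takes the form $g_{i+1} = -G_i D_i^{-1} G_i^T X_i^T F$, $s_{i+1} = F^T(F + X_i g_{i+1})$, $b = s_{i+1}^{-1}(F + X_i g_{i+1})^T y$ and $\rho_{i+1} = \rho_i - b\, s_{i+1} b$; the function names `init_model` and `add_basis` are hypothetical and not taken from the original implementation.

```python
import numpy as np

def init_model(F, y):
    """Start the recursion with a single basis function F (assumed form)."""
    s = F @ F                                  # X_1^T X_1 is a scalar
    b = (F @ y) / s                            # first coefficient
    rho = y @ y - b * s * b                    # squared residual norm
    return F[:, None], np.ones((1, 1)), np.array([1.0 / s]), np.array([b]), rho

def add_basis(X, G, d_inv, a, rho, F, y):
    """Append basis F and update X, G, diag(D^{-1}), coefficients and error."""
    g = -G @ (d_inv * (G.T @ (X.T @ F)))       # g = -G D^{-1} G^T X^T F
    w = F + X @ g                              # part of F orthogonal to X
    s = F @ w                                  # Schur complement (scalar)
    b = (w @ y) / s                            # coefficient of the new basis
    e = b * s * b                              # error reduction
    i = X.shape[1]
    G_new = np.zeros((i + 1, i + 1))
    G_new[:i, :i] = G
    G_new[:i, i] = g
    G_new[i, i] = 1.0
    return (np.column_stack([X, F]), G_new, np.append(d_inv, 1.0 / s),
            np.append(a + g * b, b), rho - e)
```

After adding all columns of a full-rank matrix, the coefficients agree with an ordinary least squares solve, which is a convenient sanity check for the sketch.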

Assessment and selection of basis functions
The explicit expression of error reduction can be used to select a subset of basis functions to form a model from a candidate set $U$ retaining $N$ basis functions. Trigonometric, exponential and linear functions have been considered for modelling in [12], [13], while a Non-Negative Adaptive Auto-Regression approach was followed in [11]. The procedure of selecting an appropriate basis requires computation of the potential error reduction for each member of the set $U$. This procedure is algorithmically described in Alg. 1. The procedure proceeds through all candidate basis functions in $U$ sequentially, storing the respective error reductions in vector $u$. Then, the algorithm selects the index of the basis function that leads to the maximum error reduction under the constraints that ensure positive definiteness.
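The sequential assessment of candidates can be sketched as follows. This is an illustrative sketch and not Alg. 1 itself: for brevity it uses a dense inverse of $X^T X$ as a stand-in for the factored pseudoinverse, and the name `sequential_search` is hypothetical.

```python
import numpy as np

def sequential_search(X, y, a, U):
    """Assess each candidate column of U in turn; return (best index, e)."""
    rho = np.sum((y - X @ a) ** 2)            # current modelling error
    A_inv = np.linalg.inv(X.T @ X)            # assumption: dense stand-in
    e = np.full(U.shape[1], -np.inf)          # -inf marks rejected candidates
    for j in range(U.shape[1]):
        F = U[:, j]
        w = F + X @ (-A_inv @ (X.T @ F))      # F + X g
        s = F @ w                             # Schur complement
        if s <= 0.0:
            continue                          # would break positive definiteness
        b = (w @ y) / s                       # potential coefficient
        ej = b * s * b                        # potential error reduction
        if 0.0 <= ej <= rho:
            e[j] = ej
    return int(np.argmax(e)), e
```

Each entry of `e` equals the drop in squared residual obtained by refitting the model with that candidate appended, so the arg-max identifies the most effective basis function.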
The set $U$ can host any type of basis functions, even Machine Learning models such as Support Vector Machines [8]. In the current manuscript we focus on lagged basis functions $y_{-i}$, i.e. the time series $y$ shifted by $i$ samples, with $y_0 = y$. In practice, the number of samples in the lagged time series $y_{-i}$ is $n - N$, since the latest sample is removed (retained in the responses $y$), along with the first $N - 1$ samples from each candidate lagged basis to ensure that there are no missing data. In case multiplicative interactions between basis functions are allowed [14], e.g. $y_i y_j \ldots$, the number of basis functions in the candidate set depends on $N$ and on $k$, where $k$ is the order of allowed interactions. In the case of $k = 1$, the number of candidate basis functions is equal to $N$. Additional constraints can be imposed during the selection of the basis function that leads to the maximum error reduction. These constraints can be imposed during step 10 of Alg. 1. Examples of constraints include non-negativity of the coefficients [11] or imposing a threshold on their magnitude.
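The construction of the lagged candidate set can be illustrated with a short NumPy sketch. It assumes that lag $i$ is the series shifted by $i$ samples and, for interactions, that products of lags with repetition (e.g. a lag multiplied by itself) are allowed; the latter is an assumption, and the name `lagged_candidates` is hypothetical.

```python
import numpy as np
from itertools import combinations_with_replacement

def lagged_candidates(y, N, order=1):
    """Return (responses, candidate matrix U, labels) for lags 1..N."""
    n = len(y)
    target = y[N:]                                   # n - N responses
    # Column i-1 holds lag i, aligned so that no sample is missing.
    lags = np.column_stack([y[N - i:n - i] for i in range(1, N + 1)])
    cols, labels = [], []
    for k in range(1, order + 1):
        # Assumption: interactions are products of k lags, repetition allowed.
        for idx in combinations_with_replacement(range(N), k):
            cols.append(np.prod(lags[:, list(idx)], axis=1))
            labels.append(tuple(i + 1 for i in idx))  # lags taking part
    return target, np.column_stack(cols), labels
```

With `order=1` the matrix has exactly $N$ columns, matching the additive case described above.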
After selection of an appropriate basis function or lag, the basis function has to be added to the model and the corresponding matrices have to be updated. Several basis functions can be fitted by executing the process described by Alg. 1 followed by the process of Alg. 2, iteratively. The fitting process is terminated based on criteria regarding the fitting error, e.g. [12]. Another approach, eq. (10), is to terminate the fitting process based on the magnitude of the coefficients, using $b$, the coefficient corresponding to the $(i+1)$-th added basis function, and $a_1$, the coefficient corresponding to the basis function added first. In both termination criteria $\epsilon$ is the prescribed tolerance. It should be noted that the second criterion is more appropriate in the case of lagged basis functions.
Algorithm 2 Basis addition (excerpt)
4: Check termination criterion and terminate if met

Performance optimization and parallelization
The procedure described by Alg. 1 proceeds through all candidate basis functions in $U$ sequentially. It can be performed in parallel by assigning a portion of the basis functions to each thread of a multicore processor. However, the most computationally intensive operations are matrix by vector and vector by vector operations, which are BLAS2 (Basic Linear Algebra Subroutines - Type 2) and BLAS1 operations, respectively [5]. These operations lead to decreased performance compared to operations between matrices, which are BLAS3 operations, since they require increased memory transfers, while cache tiling and data re-use are limited [7]. Thus, to increase performance, the most computationally intensive part, which is the basis search described by Alg. 1, has to be redesigned in order to compute the corresponding error for all candidate basis functions concurrently.

Let us consider a set of $N$ candidate basis functions, represented as vectors of length $n - N$, stored in the columns of matrix $U$ $((n - N) \times N)$. The columns $g_{i+1}^{(j)}$, $1 \le j \le N$, corresponding to each basis can be computed by the following matrix multiplication operations:

$$g_{i+1} = -G_i D_i^{-1} G_i^T X_i^T U.$$

The matrix $g_{i+1}$ $(i \times N)$ is formed by four dense matrix multiplications; however, matrix $D_i^{-1}$ is a diagonal matrix, thus it can be retained as a vector. Multiplying matrix $D_i^{-1}$ by another matrix is equivalent to multiplying the elements of each row $j$ with the corresponding element $d_{j,j}^{-1}$ of the vector retaining the elements of the diagonal matrix. For the remainder of the manuscript we will denote this operation as $(\star)$. This operation can also be used in the process described in Alg. 2. It reduces the operations required, as well as the storage requirements, since a matrix multiplication is avoided and the computation can be performed in place in memory.
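The $(\star)$ operation can be illustrated with NumPy broadcasting. The sketch below only demonstrates the equivalence between multiplying by a diagonal matrix and scaling rows by a vector; the function name `star` is hypothetical.

```python
import numpy as np

def star(d_inv, M):
    """Scale row j of M by d_inv[j]; equivalent to np.diag(d_inv) @ M."""
    return d_inv[:, None] * M

def star_inplace(d_inv, M):
    """Same operation without allocating a new matrix."""
    M *= d_inv[:, None]
    return M
```

The row scaling costs one multiplication per matrix entry and needs no storage for the diagonal matrix, whereas the dense product would allocate and multiply an $i \times i$ matrix.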
Following computation of the columns stored in matrix $g_{i+1}$, the Schur complements $s_{i+1}^{(j)}$, $1 \le j \le N$, corresponding to the candidate basis functions have to be computed. The formula $U^T (U + X_i g_{i+1})$ leads to the computation of the Schur complement of the block incorporation of the basis functions and not only of the individual Schur complements corresponding to the candidate basis functions, which are stored in the diagonal of the result. This Schur complement is a dense matrix of dimensions $N \times N$ and requires substantial computational effort. In order to avoid unnecessary operations each diagonal element can be computed as follows:

$$(s_{i+1})_j = (U^T)_{j,:} \left( (U)_{:,j} + X_i (g_{i+1})_{:,j} \right), \qquad (13)$$

where $(\cdot)_{i,j}$ denotes an element at position $(i,j)$ of a matrix and $(:)$ denotes all elements of a row or column of a matrix. In order to compute all elements of the diagonal concurrently, eq. (13) can be reformed as follows:

$$s_{i+1} = \left( U \odot (U + X_i g_{i+1}) \right)^T v,$$

where $\odot$ denotes the Hadamard product of two matrices and $v$ is a vector of the form $[1 \; 1 \; \ldots \; 1]^T$. The vector $s_{i+1}$ $(N \times 1)$ retains the Schur complements corresponding to the candidate basis functions in $U$. A dedicated (optimized) Hadamard product is not included in the standard BLAS collection; however, it is included in vendor versions or CUDA (Compute Unified Device Architecture). Following the same notation, the coefficients corresponding to the basis functions in set $U$ can be computed as:

$$b_{i+1} = \left( (U + X_i g_{i+1})^T y \right) \oslash s_{i+1},$$

with $\oslash$ denoting element-wise division, and the corresponding error reductions as:

$$e_{i+1} = b_{i+1} \odot s_{i+1} \odot b_{i+1}.$$

The most appropriate basis function is selected by finding the maximum error reduction in vector $e$. The matrix based basis selection procedure is algorithmically described in Alg. 3.
Algorithm 3 Matrix Based Basis Search
1: Let N denote the number of candidate basis functions in U.
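The matrix based search can be sketched in NumPy as shown below. This is an illustrative sketch under the assumption that the pseudoinverse is held as $G_i$ and the diagonal of $D_i^{-1}$; it is not a transcription of Alg. 3, and the name `matrix_basis_search` is hypothetical.

```python
import numpy as np

def matrix_basis_search(X, G, d_inv, y, rho, U):
    """Assess all candidate columns of U at once; return (best, s, b, e)."""
    T = G.T @ (X.T @ U)                     # i x N
    g = -G @ (d_inv[:, None] * T)           # row scaling replaces D^{-1} product
    W = U + X @ g                           # (n - N) x N
    s = np.sum(U * W, axis=0)               # (U ⊙ W)^T v = diag(U^T W)
    b = (W.T @ y) / s                       # candidate coefficients
    e = b * s * b                           # candidate error reductions
    valid = (s > 0) & (e >= 0) & (e <= rho) # positive definiteness constraints
    e = np.where(valid, e, -np.inf)
    return int(np.argmax(e)), s, b, e
```

Only the diagonal of $U^T(U + X_i g_{i+1})$ is formed, so the cost of the Schur complements is that of one element-wise product and a column sum rather than an $N \times N$ matrix product.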
A block variant of Alg. 3 can be utilized to process batches of candidate functions. This can be performed by splitting matrix $U$ into groups, processing them and accumulating the corresponding error reductions in vector $e$ before computing the index of the most effective basis function. Despite the advantages in terms of performance, the matrix and block matrix based approaches have increased memory requirements. The memory requirements are proportional to the number of candidate basis functions, since all of them have to be evaluated before assessment, while in the original approach each candidate basis is evaluated only immediately before its assessment. Thus, the matrix approach requires $O(N(n - N))$, the block approach $O(\max(\nu(n - N), b_s(n - N)))$ and the original approach $O(\nu(n - N))$ 64-bit words, where $\nu$ denotes the number of basis functions included in the model and $b_s$ the number of basis functions in each block.
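The block variant can be sketched by applying the matrix based search to column slices of $U$. The sketch below is an assumption about how the batching could be organised, with `block_size` playing the role of $b_s$; the name `block_basis_search` is hypothetical.

```python
import numpy as np

def block_basis_search(X, G, d_inv, y, U, block_size):
    """Assess candidates in batches of block_size columns; return (best, e)."""
    N = U.shape[1]
    e = np.empty(N)
    for start in range(0, N, block_size):
        Ub = U[:, start:start + block_size]            # current batch
        g = -G @ (d_inv[:, None] * (G.T @ (X.T @ Ub)))
        W = Ub + X @ g                                 # only b_s columns live
        s = np.sum(Ub * W, axis=0)
        b = (W.T @ y) / s
        e[start:start + block_size] = b * s * b        # accumulate reductions
    return int(np.argmax(e)), e
```

The temporaries `g` and `W` never exceed `block_size` columns, which is what lowers the working memory on the accelerator; the selected index is the same for any block size.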

Graphics Processing Unit Acceleration
The operations involved in the Matrix Based Basis Search, given in Alg. 3, can be efficiently accelerated in a Graphics Processing Unit (GPU). However, most GPUs deliver substantially lower double precision performance, e.g. a 32× reduction in the case of double precision arithmetic (Geforce RTX 20 series). In order to mitigate this issue, 32-bit floating point operations and 16-bit half precision floating point operations are utilized. This gives rise to mixed precision computations, where GPU related operations are performed in reduced precision, while CPU related ones are performed in double precision arithmetic. This approach enables acceleration while minimizing round off errors from reduced precision computation.
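The effect of performing the search in reduced precision can be explored on the CPU with a small sketch. Here NumPy `float32` arrays stand in for GPU arrays, which is an assumption made purely for illustration; no GPU library is involved, and the names `error_reductions` and `to_single` are hypothetical.

```python
import numpy as np

def error_reductions(X, G, d_inv, y, U):
    """Error reductions of all candidates, in the precision of the inputs."""
    g = -G @ (d_inv[:, None] * (G.T @ (X.T @ U)))
    W = U + X @ g
    s = np.sum(U * W, axis=0)
    b = (W.T @ y) / s
    return b * s * b

def to_single(*arrays):
    """Cast double precision arrays to single precision before the search."""
    return tuple(np.asarray(v, dtype=np.float32) for v in arrays)
```

Running the search on the cast arrays and keeping the model update in double precision mirrors the mixed precision split described above: the reduced precision result is only used to pick an index, so small rounding differences matter only when two candidates have nearly equal error reductions.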
The proposed scheme utilizes a similar approach in order to accelerate the most computationally intensive part of the process, which is the basis search. Computations in the GPU require data movement from the main memory to the GPU memory, which is a time consuming operation and thus should be limited. In the case of the Matrix Based Basis Search algorithm, data should be transferred to the GPU before processing. This includes the time series $y$, the matrix $G_i$ and vector $D_i$, the matrix of included basis functions $X_i$ and the previous modelling error $\rho_i$, in order to mark basis functions that could hinder positive definiteness. Before copying these arrays to the GPU memory, they should be cast to single precision arithmetic to ensure increased performance during computations. The matrix $U$, retaining the candidate basis functions, and the time series $y$ can be transferred to the GPU before starting the fitting process, since they are known "a priori", while all the other matrices should be updated after addition of a new basis function to the model. However, the update process includes only a small number of values to be transferred at each iteration.

Algorithm 4 GPU accelerated modelling
1: Let y denote the time series to be modelled, N the maximum lag, n the number of samples in y.
The process is described in Alg. 4. The matrices and vectors stored in the GPU memory are given in bold. The process terminates if the termination criterion of eq. (10) is met during the basis addition process or in line 12 of Alg. 4. It should be noted that the first basis function included removes the mean value of the time series $y$. This is performed in line 6, where the addbasis function is invoked. In practice, addition of the first basis function is performed using the initial conditions of the recursive formulation, where $F$ is substituted with a vector $((n - N) \times 1)$ with all its components set to unity.
The functions cpu2gpu and gpu2cpu are used to transfer data from CPU memory to GPU memory and from GPU memory to CPU memory, respectively. Conversion of matrices, vectors and variables from double precision to single precision arithmetic is performed with function single, and the function update is utilized to update matrices and vectors involved in the Matrix Based Basis Search performed in the GPU. At each iteration $2 + i + (n - N)$, $1 \le i \le \nu$, single precision floating point numbers need to be transferred to GPU memory, with $\nu$ denoting the number of basis functions included in the model.
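The per-iteration transfer volume follows directly from the count above. The short sketch below evaluates it in bytes, assuming 4 bytes per single precision value; the function name and the example sizes are hypothetical and are not taken from the experiments.

```python
def transfer_bytes_per_iteration(i, n, N, bytes_per_value=4):
    """Bytes sent to the GPU at iteration i: 2 + i + (n - N) float32 values."""
    return (2 + i + (n - N)) * bytes_per_value

# Hypothetical sizes: 100000 samples, maximum lag 200, 10th basis function.
example = transfer_bytes_per_iteration(i=10, n=100_000, N=200)
```

For these assumed sizes the update amounts to 399248 bytes, which is small compared with the one-off transfer of the candidate matrix $U$ of $(n - N) \times N$ values.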
A full GPU version of Alg. 4 can also be used, by forming and updating all matrices directly in the GPU memory. In this approach, the matrix of candidate basis functions $U$ and the time series $y$ need to be transferred to the GPU memory before computation commences. The value of the first coefficient should also be transferred to the CPU, since it is required by the termination criterion. Moreover, in every iteration the new coefficient has to be transferred to the CPU in order to assess model formation through the termination criterion. This approach uses solely single precision arithmetic and is expected to yield slightly different results due to rounding errors.

Numerical results
In this section the applicability, performance and accuracy of the proposed scheme are examined by applying the proposed technique to two time series.
The first time series is composed of a large number of samples, and lagged basis functions without multiplicative interactions are used as the candidate set. The scalability of the different approaches is assessed with respect to single precision, mixed precision and double precision arithmetic executed either on CPU or GPU. The second time series has a reduced number of samples; however, an extended set of lagged candidate basis functions, which includes second order interactions, is used. The characteristics of the time series are given in Tab. 1. The two time series were extracted from R studio. The error measures used to assess the forecasting error were Mean Absolute Percentage Error (MAPE), Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE):

$$MAPE = \frac{100}{T} \sum_{i=1}^{T} \left| \frac{y_i - \hat{y}_i}{y_i} \right|, \quad MAE = \frac{1}{T} \sum_{i=1}^{T} |y_i - \hat{y}_i|, \quad RMSE = \sqrt{\frac{1}{T} \sum_{i=1}^{T} (y_i - \hat{y}_i)^2},$$

where $y_i$ are the actual values, $\hat{y}_i$ the forecasted values and $T$ the length of the test set. All forecasts are performed out-of-sample using a multi-step approach without retraining. All experiments were executed on a system with an Intel Core i7 9700K 3.6-4.9 GHz CPU (8 cores) with 16 GBytes of RAM and an NVIDIA Geforce 2070 RTX (2304 CUDA Cores) with 8 GBytes of memory. All CPU computations were carried out in parallel using Intel MKL, while the GPU computations were carried out using NVIDIA CUDA libraries. Different notation is used for the variants of the proposed scheme:
CPU-DP: Matrix based CPU implementation using double precision arithmetic. This is the baseline implementation.
CPU-SP: Matrix based CPU implementation using single precision arithmetic.
CPU-DP-GPU: Matrix based CPU/GPU implementation using mixed precision arithmetic. The basis search is performed in the GPU using single precision arithmetic, while incorporation of the basis function is performed in double precision arithmetic.
CPU-DP-GPU-block($n_b$): Block matrix based CPU/GPU implementation using mixed precision arithmetic. The basis search is performed in the GPU using single precision arithmetic, while incorporation of the basis function is performed in double precision arithmetic. The parameter $n_b$ denotes the number of blocks. The block approach requires less GPU memory.
GPU: Pure matrix based GPU implementation using single precision arithmetic.
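The three error measures can be sketched in NumPy as follows, assuming their usual definitions with MAPE expressed in percent; the function names are hypothetical and the sample values are made up for illustration.

```python
import numpy as np

def mape(y, y_hat):
    """Mean Absolute Percentage Error, in percent (assumes no zero actuals)."""
    return 100.0 * np.mean(np.abs((y - y_hat) / y))

def mae(y, y_hat):
    """Mean Absolute Error."""
    return np.mean(np.abs(y - y_hat))

def rmse(y, y_hat):
    """Root Mean Squared Error."""
    return np.sqrt(np.mean((y - y_hat) ** 2))

# Made-up actuals and forecasts, only to exercise the functions.
actual = np.array([100.0, 200.0, 400.0])
forecast = np.array([110.0, 180.0, 400.0])
scores = (mape(actual, forecast), mae(actual, forecast), rmse(actual, forecast))
```

RMSE weighs large deviations more heavily than MAE, so the two measures can move in opposite directions when a model trades a few larger errors for many smaller ones.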

Time series 1 - Scalability and accuracy
The average value of the dataset is 192.079, the minimum value is 11 and the maximum value is 465. For this model the lagged candidate basis has been utilized, while higher degree interactions are not allowed, resulting in an additive model. The performance in seconds for all variants is given in Fig. 1, while speedups are presented in Fig. 2. The pure GPU and CPU-DP-GPU implementations have the best performance overall, leading to the best speedups. The pure GPU implementation has a speedup greater than 20× for more than 50 basis functions, with a maximum of approximately 27×, with respect to the baseline implementation. With respect to the CPU single precision arithmetic implementation, the pure GPU approach has a speedup of approximately 10× for more than 50 basis functions. The CPU-DP-GPU has a maximum speedup of approximately 22×, attained for 134 basis functions. After that point the speedup decreases because the double precision operations in the CPU do not scale at the same rate, reducing the overall speedup. It should be noted that even for a low number of basis functions, e.g. 6, the speedup of the pure GPU implementation is approximately 6× with respect to the CPU-DP and approximately 4× with respect to the CPU-SP implementation. The number of basis functions included in the model along with the forecasting errors are given in Tab. 2.
The number of basis functions as well as the errors are not substantially affected by the use of mixed or single precision arithmetic. More specifically, up to approximately 50 basis functions all variants produce almost identical results. However, above 50 basis functions there is a minor difference in the number of basis functions included in the model, which in turn slightly affects the error measures. The difference in the number of included basis functions is caused by rounding errors in the computation of the error reduction $e_{i+1}$. These stem from rounding errors in the formation of the column vectors $g_{i+1}$, involved in the computation of the respective Schur complements and potential basis coefficients.
An important observation is that the error measures regarding forecasts do not significantly reduce after the incorporation of approximately 130 basis functions. Thus, additional basis functions only increase the complexity of the model. In order to ensure sparsity of the underlying model, a different termination criterion can be used, since the termination criterion of eq. (10) depends on the magnitude of the entries of the basis functions and is more susceptible to numerical rounding errors. The new termination criterion, eq. (18), is based on the error reduction percentage and is expressed in terms of $e_{i+1}$, the potential error reduction that will be caused by the incorporation of the $(i+1)$-th basis function, and $\epsilon \in [0, 1] \subset \mathbb{R}$, the acceptable percentage of error reduction to include a basis function. This criterion will be used to model the second time series along with higher order interactions.

Time series 2 - Flexibility and higher order interactions
The average value of the dataset is 41.9808, the minimum value is 23 and the maximum value is 73. The candidate set is composed of lagged basis functions and second order interactions of the form $y_j y_k$. The termination criterion of eq. (18) was used with $\epsilon = 0.002$, with maximum lags equal to 50. In Fig. 3 the actuals along with the forecasted values computed with and without interactions are given. The inclusion of second order interactions results in capturing the nonlinear behavior of the time series in the forecasted values. The error measures without interactions were RMSE = 5.96, MAE = 5.11 and MAPE = 12.33, while with interactions the error measures were RMSE = 6.18, MAE = 4.93 and MAPE = 12.20. The inclusion of higher order interactions led to a reduction of MAE and MAPE, with a slightly higher RMSE, and showcases the flexibility of the approach, allowing for the inclusion of arbitrary order interactions in the candidate basis functions.
The execution times for CPU-SP, CPU-DP and GPU were 1.1621, 2.2076 and 0.3932, respectively. Thus, the speedup of the pure GPU variant was approximately 3× over the CPU-SP variant and 5.6× over the CPU-DP version. The pure GPU version is efficient even for time series with a small number of samples, under a sufficiently sized space of candidate basis functions.
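The reported speedups follow from the ratios of the execution times quoted above, as the short check below shows; the variable names are hypothetical.

```python
# Execution times quoted in the text for the second time series.
times = {"CPU-SP": 1.1621, "CPU-DP": 2.2076, "GPU": 0.3932}

# Speedup of the pure GPU variant over each CPU variant.
speedup = {name: t / times["GPU"] for name, t in times.items() if name != "GPU"}
rounded = {name: round(value, 1) for name, value in speedup.items()}
```

The ratios are roughly 2.96 and 5.61, consistent with the approximate figures of 3× and 5.6× given in the text.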

Conclusion
A matrix based parallel adaptive auto-regressive modelling technique has been proposed. The technique has been parallelized in multicore CPUs and GPUs and a block variant has also been proposed, based on a matrix (BLAS3) recast of the required operations. The pure GPU variant presented a speedup of up to 27× over the double precision arithmetic parallel CPU version and 10× over the parallel single precision CPU version for time series with a large number of samples. The use of single and mixed precision did not substantially affect the forecasting error, rendering the technique suitable for modelling and forecasting large time series. Implementation details and discussions on higher order interactions between basis functions have also been given. The applicability and effectiveness of the method were also discussed and a new termination criterion based on the potential error reduction of basis functions, which is invariant to scaling, has been given. Future work is directed towards the design of an improved basis search that will reduce the search space based on a tree approach. Moreover, backfitting procedures will be considered.

Fig. 1: Performance for all variants for different number of basis functions. The performance of the block variants degrades when increasing the number of blocks retaining the candidate basis functions, since they require more data

Fig. 3: Actual along with forecasted values with and without interactions.

Table 1: Model time series with description and selected splitting.