Performance Analysis of Long Short-Term Memory-Based Markovian Spectrum Prediction

Dynamic Spectrum Access (DSA) solutions equipped with spectrum prediction can enable proactive spectrum management and tackle the increasing demand for radio frequency (RF) bandwidth. Among various prediction techniques, Long Short-Term Memory (LSTM) is a deep learning model that has demonstrated high performance in forecasting spectrum characteristics. Although well-performing, the theoretical characterization of LSTM prediction performance has not been well developed in the literature. Therefore, in this article, we examine an LSTM based temporal spectrum prediction model and characterize its prediction performance through theoretical analysis. To this end, we analyze the LSTM prediction outputs over simulated Markov-model-based spectrum data and spectrum measurements data. Our results suggest that the predicted scores of the LSTM based system model can be described using mixtures of truncated Gaussian distributions. We also estimate the performance metrics using the mixture model and compare the results with the observed prediction performance over simulated and measured datasets.


I. INTRODUCTION A. BACKGROUND AND MOTIVATION
CURRENT regulatory measures for spatial and temporal allocation of radio frequency (RF) spectrum have inadequacies that result in the under-utilization of frequencies and the inability to manage the growing bandwidth demand due to an abundance of devices and applications [1], [2]. Dynamic Spectrum Access (DSA) solutions are widely adopted to ensure efficient spectrum utilization, as they allow the opportunistic usage of spectrum by unlicensed users when licensed users, known as Primary Users (PU), are inactive. Cognitive Radio (CR) is a technology that enables intelligent wireless communications and realizes DSA functionalities. Unlicensed users, known as Secondary Users (SU), adopt CRs to perform cognitive functions that are broadly related to radio environment perception, reconfigurability, and learning. Specifically, these functions include spectrum sensing, spectrum decision, spectrum sharing, and spectrum mobility The associate editor coordinating the review of this manuscript and approving it for publication was Ding Xu . and can help utilize resources more efficiently [3]. However, cognitive functions incur significant time delays, which can cause transmission collisions or even degrade the spectrum utilization efficiency [2], [4].
Predicting the spectrum usage characteristics is recognized as an important step towards managing spectrum more efficiently in DSA scenarios. Spectrum prediction techniques infer unmeasured characteristics such as channel occupancy status, duty cycle, and signal power level based on historical spectrum data [5], [6]. These techniques leverage the correlations in the data to make future predictions in the time, frequency, and spatial domains. The predicted results can be applied to identify spectrum holes and infer spectrum evolution. These techniques can also help minimize delays, reduce energy consumption, and improve the system throughput [4], [6].
Time-domain or temporal spectrum prediction typically involves techniques such as Auto-Regressive (AR) time series models, Markov chain based models, Bayesian inference, Machine Learning (ML) models, and Deep Learning (DL) models [6]. However, many conventional prediction methods require a priori information to make predictions and cannot effectively capture the long-term dependencies in the data [7]. In contrast, Long Short-Term Memory (LSTM) is a deep learning model that learns the long-term and shortterm dependencies in the input data. LSTM hidden units consist of memory cells that maintain the state variables and non-linear gating units that control the update mechanism of the hidden unit. It is a Recurrent Neural Network (RNN) variant that performs well with sequence data and has been adopted to perform modulation classification, interference classification, and spectrum prediction in various studies [8].
LSTMs have shown good performance in spectrum prediction applications related to LTE mobile traffic prediction, cognitive aerospace communications, vehicular communications, machine-type communications etc [9], [10]. LSTM has also shown improved prediction performance compared to other techniques such as Auto-Regressive Integrated Moving Average (ARIMA), Feed-Forward Neural Networks etc., in various spectrum prediction works [9], [11], [12]. Although various works establish the superior capabilities of LSTM models to leverage correlations in the observations and predict more accurately, the theoretical analysis and insights regarding predictions have not been well developed in the literature. These aspects motivate us in this article to extensively analyze the prediction performance of LSTM for the temporal prediction of the future spectrum occupancy status.

B. LITERATURE REVIEW
Recent years have seen significant progress in spectrum prediction with a variety of predictive solutions that adopt algorithms such as clustering analysis, Hidden Markov Models (HMM), RNNs, and so on [13]. Conventional spectrum prediction methods fail to capture long-term temporal correlations in the data and require a priori information about the PU activity distribution to make predictions [7]. However, deep learning models such as LSTMs are capable of learning longterm and short-term correlations in the data and do not require a priori information to make predictions [6]. LSTMs have also consistently shown very high prediction performance for various spectrum prediction applications such as frequency hopping sequence prediction, cognitive aerospace communications, vehicular communications, etc., with simulated and measured data [14]- [16]. For instance, in [11], an LSTM based framework predicts the Power Spectral Density (PSD) values over measured datasets from terrestrial and satellite networks. The classification accuracy of the LSTM model is found to be high and stable compared to that of a Multi-Layer Perceptron (MLP) network, over a wide range of frequencies. In [12], a 2D-LSTM based spectral-temporal model predicts the PSD values with better precision, recall, and F1-score compared to the baseline AR, Bayesian Inference, and 1D-LSTM models over different frequency ranges of a measured spectrum dataset. In [17], a spectrum access strategy integrated with an LSTM based time-frequency prediction model achieves higher system throughput and reduces energy consumption. The LSTM model proposed in this work shows higher prediction accuracy than the baseline HMM and MLP models. However, performance metrics such as Root Mean Square Error (RMSE) are not examined in this work.
The impact of parameters such as nature of the channel, input structure, number of future prediction steps, and model hyperparameters on LSTM prediction performance is also studied in many spectrum prediction works. The prediction accuracy of HMM, MLP, LSTM, and a proposed Reinforcement Learning (RL) based model with trained MLP sub-models are compared for bursty ISM band spectrum data in [18]. The proposed RL based MLP method adaptively takes action as per the bursty input. The results suggest that, on average, the proposed RLMLP model shows higher prediction accuracy than the LSTM model for bursty spectrum data. In [5], a Convolutional LSTM spectrum prediction model is developed to perform long-term joint spatial-temporal-spectral prediction of signal power level over measured UHF bands. Based on the results, the proposed model shows stable RMSE performance over different channels but shows degradation as the number of future time steps increases. A more sophisticated Convolutional LSTM based model, STS-PredNet, is proposed in [19], which shows a stable RMSE performance for multi-time step spatialtemporal-spectral prediction, with no degradation. In [20], cooperative spectrum prediction using LSTM local predictors is adopted for multi-time step prediction in a heterogeneous CR environment. The prediction performance of the local predictor is investigated in terms of the mean prediction error and the total probability of prediction error for a measured spectrum dataset over UHF bands. The relation between the performance metrics and parameters such as the window size and cooperative range are also studied. However, a detailed analysis of the total probability of error is not attempted.
Although high performing, LSTMs could pose challenges in terms of the computational burden and handling missing data. For spectrum situation prediction in a dense wireless ecosystem in [21], deep learning models such as Deep Neural Network (DNN), Convolutional Neural Network (CNN), and LSTM are trained with RF measurement traces of complicated coexistence scenarios. These models predict the type, form, and number of users in a frequency band for the test data. The results in [21] suggest that LSTMs show better prediction accuracy for large input window sizes but also consume more than 5x the training time of DNN and CNN models. Similarly, an LSTM based spectrum prediction model is implemented on a Field Programmable Gate Array (FPGA) for resilient military communication applications [22]. The models are trained and tested over measured IEEE 802.11p frames. The results suggest that the LSTM model outperforms the baseline ARIMA, Moving Average, and Naïve models even at low precision of number representation. Moreover, higher accuracy is demonstrated by floating-point implementations of the LSTM, but at the cost of higher latency and resource utilization. A similar trend is observed in [23], where an LSTM based spectrum sensing framework shows high detection performance and accuracy under a low Signal-to-Noise Ratio (SNR) regime, at the cost of increased training time.
To reduce the computational overhead associated with LSTM based models, different methods have been proposed in the literature. For instance, to reduce the computational cost and improve the scalability of an LSTM based model, the dimensionality of spectrum data is reduced using tensor decomposition in [7]. Their proposed method also achieves lower normalized prediction error than the baseline models such as AR, Support Vector Machine (SVM), CNN, and LSTM models. To reduce the computational overhead due to training, an improved initialization method that enables fast LSTM training convergence compared to traditional initialization methods is proposed for spectrum prediction over Markovian data in [24]. Finally, in regards to handling missing data, a tensor completion based method is proposed in [25] to recover missing spectrum data values from a corrupted measured spectrum dataset. The proposed method in tandem with an LSTM predictor shows the highest prediction accuracy over a recovered data sequence, compared to other models such as K-Nearest Neighbours (KNN), Seasonal ARIMA (SARIMA), and MLP.
As discussed in this section, several research papers have adopted LSTM models for spectrum prediction and established various aspects such as their higher computational overhead, superior prediction capabilities compared to other baseline models, etc. However, the theoretical characterization related to LSTM prediction have not yet been well developed. Such a theoretical analysis could provide insights into LSTM prediction performance under different scenarios such as cooperative prediction, and also help to build improved prediction models. Therefore, in this article, we present novel insights that extend the theoretical understanding of LSTM based spectrum prediction. We analyze the LSTM prediction outputs for a Markovian spectrum occupancy data and propose a mixture model based analytical form of prediction performance metrics such as the probability of error. We also verify the results over simulated and RF measurements data from a realistic environment.

C. CONTRIBUTIONS
In our work, we assume that a wireless sensor or CR senses a channel at a fixed location over time. The data captured by the sensor is generated using a two-state discrete-time Markov model with the states representing Idle (0) and Busy (1) spectrum status. We adopt an LSTM based system model with a sequence input layer, an LSTM layer, a dense layer, a softmax layer, and a classification layer to perform the prediction. To analyze the predicted results, we train the LSTM based model with multiple independent Markov input sequences and extract the histograms of the prediction scores at the softmax layer output. Based on the results, we propose a mixture of truncated Gaussian distributions to model the prediction score outputs and estimate the performance, such as the probability of error, using the proposed mixture distribution. Finally, we compare the theoretical and observed values of the performance metrics over simulated data and RF measurement data from a realistic environment. To sum up, our key contributions in this article are: • Modelling the prediction score of output classes: We conduct a statistical analysis of the predicted outputs of the LSTM based model for a Markovian input. We train an LSTM based system model with multiple independent inputs and obtain the histograms of the prediction scores of the model. We propose that the prediction scores of the LSTM based system model for the output classes can be modelled well using a mixture of truncated Gaussian distributions.
• Analysis of Prediction Performance Metrics: We propose an analytical expression for the probability of error of the LSTM model, based on the mixture model proposed. Further, we verify the proposed expression on simulated and measured spectrum datasets, and also compare with the observed mean prediction errors. We also estimate other prediction performance metrics using the proposed model and compare them with the observed values. The remainder of this article is organized as follows. Section II discusses the system model adopted. The modelling and error analysis are discussed in sections III and IV respectively. Section V presents the model verification and numerical results. Section VI concludes this article.

II. SYSTEM MODEL
In this work, we assume a SU equipped with a wireless sensor or a CR that senses a channel temporally at a fixed location. The sensed observations are in the form of a binary sequence representing the PU activity, i.e., the occupancy status of the channel by the PU. To perform spectrum prediction, we adopt an LSTM based model for the SU. This model learns the parameters for making forecasts from historical observations and predicts the spectrum occupancy status for a single future time step based on the current spectrum occupancy observation. Figure 1 shows the LSTM based model adopted by the SU for spectrum prediction. At time instant t, the model in fig. 1 processes an input, x t and makes a prediction of the output class,ŷ t . The input, x t , corresponds to the current spectrum occupancy observations at the SU, and the predicted output,ŷ t , represents the future occupancy status. The sequence input layer passes x t to the LSTM layer, which learns the temporal correlations and patterns in x t . The LSTM layer output is then mapped to the prediction results by a dense layer. A softmax layer computes the prediction scores for each output class based on the dense layer activations, and the classification layer computes the cross-entropy loss during training.
In sections II-A -II-D, we explain the PU activity model and the LSTM based system model in more detail.

A. INPUT DATA MODEL
Spectrum usage models are often adopted to represent PU activity and directly impact the SU system performance in DSA solutions [6]. These models are used to evaluate spectrum prediction techniques and are classified as time, frequency, and spatial domain models. Temporal spectrum usage models include Continuous-Time Markov Chains (CTMC) and Discrete-Time Markov Chains (DTMC) [6]. DTMC models are widely adopted spectrum usage models used to evaluate system performance metrics such as throughput and packet delay. These models can reproduce statistical properties associated with PU activity in realistic environments [6], [26]. Depending on the stationarity of the system, a stationary or non-stationary DTMC can be adopted [6].
A DTMC is fully characterized by the probability of the initial state and its transition probability matrix, P [27]. A two-state DTMC with states, x t [0], x t [1], and a transition probability matrix, P, as in (1), is diagrammatically shown in fig. 2.
Our study assumes that the PU spectrum occupancy follows a first-order, two-state DTMC process with a binary state space, S = {0, 1}. State 0 stands for x t [0] in fig. 2, and indicates an idle channel with no traffic. Similarly, state 1 stands for x t [1] in fig. 2, and indicates that the PU occupies the channel. Further, we assume the initial state to be 0, and we limit the scope of our work to cases where α and β are equal in (1). A similar study can be extended to α = β cases as well.

B. LSTM LAYER
Long Short-Term Memory (LSTM) is a RNN architecture that has become state-of-the-art for learning problems involving sequence data such as speech recognition, handwriting recognition, and language modelling [28]. The LSTM cell structure allows long-term retention of information and eliminates the vanishing gradients problem common to vanilla RNNs [8]. An LSTM cell or hidden unit consists of a memory cell and a few non-linear gating units, as shown in fig. 3. The memory cell stores the information about the context or history of the inputs as an internal cell state, C t , and the gates control the update mechanism of C t . The forget gate controls the removal or retention of information within the memory cell, as in (2).
The input gate activations and the cell candidate gate activations are used internally to compute the update values for C t , as in (3), (4).
The updated cell state is computed by combining the activations of the forget gate, input gate and cell candidate gate as in (5), where, * represents the element-wise product of matrices.
Finally, the output gate controls the flow of C t for the computation of the cell output, h t , as in (6), (7).
In (2) - (4) and (6), the weight parameter matrices of the forget gate, input gate, cell candidate gate and the output gate are denoted as w f , w i , w C , and w o , respectively. The bias parameter vectors of the gates are represented as b f , b i , b C , and b o , respectively. The gate activation vectors are represented by f t , i t ,C t , and o t . These gate activations are computed as the matrix-vector product (.) between the corresponding weight parameter matrix and a concatenated vector of the past cell output, h t−1 , with the current cell input, x t , denoted as [h t−1 , x t ]. The respective bias parameter vector is added to the computed product and the result is passed onto a gate activation function. The forget gate, input gate, and output gate have a sigmoid activation function represented by σ , whereas, the cell candidate gate has the tanh activation function. In our implementation, we adopt the LSTM layer with one LSTM cell.

C. DENSE LAYER
The LSTM layer activation vector, h t , is passed onto a dense or fully-connected layer, as in fig. 1. This layer consists of dense layer hidden units that compute the layer output, a t , as in (8), Here, the matrix-vector product of the weight parameter matrix, w d , with the input, h t , is computed and added with the bias parameter vector, b d . In our implementation, the dense layer has two hidden units. The rationale behind this choice is that the dense layer hidden units will map the LSTM activations to the two possible output classes, viz. idle and busy.

D. SOFTMAX LAYER
In the system model of fig. 1, the dense layer output, a t , is passed onto a softmax layer that applies the softmax activation function on a t . The softmax and classification layers are adopted for classification tasks, where the classification layer computes the cross-entropy loss during model training. In this study, we represent the softmax function output as y t , which are the scores predicted by the LSTM based model for the output classes, given the input, x t . The prediction score, y t , is given by, where a t (i) and y t (i) are the i th element of these vectors, respectively. In our implementation, y t contains two elements corresponding to each output class. To reflect the meaning of the output classes, we adopt the indexing convention as, i = 0, 1 and j = 0, 1. Additionally, due to the nature of the softmax function, y t has the properties given by,

III. MODELLING THE DISTRIBUTIONS OF LSTM OUTPUTS FOR SPECTRUM PREDICTION
As explained in section II, the system model of fig. 1 processes a stochastic input, x t by passing it through neural network layers. Each layer processes its inputs and produces a corresponding layer activation or output. To analyze the performance of the spectrum prediction model, we consider the testing phase of a trained LSTM based model. During this phase, the weight and bias parameters are assumed to be constants.
As the input x t is a stochastic input in our system model implementation, we can consider the LSTM layer output, h t , as a random variable. We assume that h t follows a probability density function (PDF), f H (h t ), denoted as, Similarly, the cumulative distribution function (CDF) of h t is denoted as F H (h t ). Following the LSTM layer, the dense layer computes its output using h t as given by (8). Using (8) and (12), the PDF and CDF of the dense layer output, a t , can be obtained as, Here, i can take values viz. 0, 1. Similarly, using (9), the CDF of the softmax output predicted scores can be expressed in terms of F H (h t ) as, where, w d is defined as, Similarly, b d is defined as, The definitions in (16) and (17) are adopted to concisely express the distribution in (15).
Considering (15) - (17), arriving at a closed-form expression for the distributions of h t , a t , and y t is challenging because the LSTM cell consists of non-linear operations with feedback, resulting in a higher-order system. As an analytical approach to determine the distribution of the LSTM outputs is non-trivial, we adopt a modelling approach as explained in later sections.

A. MODELLING METHODOLOGY OF LSTM OUTPUTS
In this study, spectrum prediction is performed as a classification problem in supervised learning. The system model in fig. 1 is trained with the input data generated as per the two-state DTMC model in section II-A. To train the LSTM model, a binary input sequence, S, is generated for T = 10000 time steps and reshaped into the training input matrix, D, of dimensions (T − 2) × 2. This reshaping is according to a sliding window of size, 2, and overlap, 1. The row entries of D corresponding to time step, t, are the past and present binary occupancies, viz. S(t − 1), and S(t). The training label corresponding to the input, D(t) is the categorical occupancy value, i.e., Busy or Idle, corresponding to S(t + 1).
In our system model of fig. 1, the LSTM layer consists of an LSTM cell, and the dense layer consists of two hidden units. The weight and bias parameters of the system model layers are initialized as per conventional initialization schemes [29], [30]. The optimizer algorithm is selected as ADAM optimizer during training. The model hyperparameters, such as the learning rate, number of epochs, etc., are systematically tuned to achieve a suitable reducing trend of the training cost curve. For instance, to select the learning rate, training was conducted for different values of this parameter. Initially, a learning rate value of 1 was adopted and systematically scaled down to a value of 0.005, while observing the training cost curve. Among the different values, a learning rate of 0.05 resulted in a suitable reducing trend of the training losses. The remaining hyper-parameters of the system were selected in a similar manner to achieve a suitable reducing trend of the training costs. Overall, these steps are hereafter referred to as a training run. The trained LSTM based system model is tested with an independent Markovian test input with the same α value. This testing procedure is referred to as a test run hereafter.
In order to characterize the prediction performance of the LSTM model, fig. 4 shows the framework adopted for modelling the LSTM outputs. As in fig. 4, N independent test runs are conducted for the LSTM system model, and the predicted output sequences, y t , are analyzed. Further, the average histograms of the predicted scores, y t , given the target spectrum occupancy, S t , are obtained from N = 1000 independent test runs for a given α. The averaged histogram can be treated as an estimate of the conditional density of y t , given S t . For instance, fig. 5 shows these averaged histograms of different output classes for α = 0.7. These indicate that the distributions of the predicted scores, conditioned on the training label, exhibit Gaussian-like features, with multiple peaks, bell-like symmetric curves, and having mean and variance parameters. Further details of the modelling methodology in fig. 4 are described in the following sections III-B and III-C.

B. PROPOSED MIXTURE MODEL
A mixture with appropriate component distributions is adopted to model the predicted scores, considering the features of the averaged histograms in fig. 5. The proposed mixture model of the predicted scores, given the target spectrum occupancy is represented as f Y (y t |S t ) and consists of Truncated Gaussian distributions (TGD), λ T (.), as in (18), (19), (19) where, for a given test run: t is the time step y t (i) is the predicted score for the output classes i takes values 0, 1 corresponding to output classes K , M are the number of mixture components in each model π, θ stand for the proportions of the mixture components λ T (.) stands for the PDF of a TGD component [a,b] is the truncation interval µ, ν stand for the means of the parent Gaussian component distribution (before truncation) σ , δ are the standard deviation of the parent Gaussian component distribution (before truncation) As mentioned in (18) and (19), λ T (.) is the PDF of a TGD component, that is defined in general as, where, In (20), φ(.) is the PDF of a standard normal variable, z, given by, and (.) is the CDF of z, given by, The CDF of a TGD is given by In (20) and (24), µ and σ represent the mean and standard deviation parameters of the parent normal distribution of the variable, x; [a, b] is the interval of truncation such that, VOLUME 9, 2021  a ≤ x ≤ b. The mean and variance parameters of a TGD is given by, As we know from (10) that y t takes values in the range [0, 1], the proposed model for f Y (y t |S t ) must also take nonzero values within the same range. Therefore, we have chosen the truncation interval, [a, b], for the TGD components in (18) and (19) as [0, 1]. Moreover, the sum of the component proportions in each model of the form of (18) and (19) is equal to 1. This ensures that the corresponding PDF normalizes to 1.

C. MODEL PARAMETER ESTIMATION
As shown in fig. 4 and explained in section III-A, the averaged PDF of predicted scores are estimated based on multiple independent test runs. Further, we have adopted the following steps to model the predicted scores data using the proposed mixture of TGD components, 1) Estimate the mixture model parameters.
2) Truncate the PDF of mixture components to the required range of y t , i.e., [0, 1]. 3) Recombine the TGD components to get the resulting mixture model. These steps also account for the property of the predicted scores as in (10). To estimate the mixture model parameters in step 1, we have adopted the Expectation Maximization (EM) algorithm. The EM algorithm computes the maximum likelihood estimates of statistical model parameters iteratively [31]. This algorithm is usually computationally less intensive, works well with missing data, and is an intuitive choice to fit mixture distributions. For each observation, the EM algorithm computes the posterior probability that the observation belongs to a particular Gaussian Mixture component. The algorithm then uses the computed posterior probabilities to estimate the model parameters by the Maximum Likelihood principle. The proposed mixture models obtained by steps (1) -(3) are superimposed over the corresponding averaged histograms in fig. 5. The corresponding CDFs of the predicted scores based on the proposed mixture model are shown in fig. 6. The model parameters of (18) and (19) estimated using EM algorithm are tabulated for different values of input model parameter, α as shown in tables 1 and 2. We have selected the number of mixture components, K = M = 4 in (18) and (19), as this performed well for our study.

D. MODEL VERIFICATION
The proposed mixture model as described in sections III-A -III-C is verified over a simulated data. The results for verification are shown in fig. 7 for a simulated dataset with α = 0.8, wherein the theoretical mixture model is compared with the averaged histograms of the predicted scores for simulated test data. From fig. 7, the proposed mixture model fits the averaged histograms of the predicted scores well.

IV. PERFORMANCE OF LSTM BASED SPECTRUM PREDICTION A. PROBABILITY OF ERROR OF LSTM BASED SPECTRUM PREDICTION
In this section, we present the discussion on the analytical expression for the probability of error of the LSTM based prediction model. To this end, we utilize the mixture model of the predicted scores, as described in section III-B.
We can define the probability of error, P e , for the LSTM based system model in fig. 1 as, (27) where, y t (0) and y t (1) are the predicted scores of the LSTM model for output classes, Idle and Busy, respectively. As per the Markovian model, we have, In our study, we consider the case of α = β for simplicity. As per eqns. (28) and (29), we have, In addition, as per (11), we have, Applying (30) and (31) on (27), we get a concise form for P e as As we model the conditional distributions of the predicted scores, y t (0) and y t (1), as mixtures of truncated Gaussian distributions, we utilize the CDF of such a distribution in (32). Applying (18), (19), and (24) on (32), we get the probability of error as, ; ν 0m , δ 0m , 0, 1) Based on our study, we observe that the error performance is also impacted by the amount of correlation in the data. To substantiate this, we present the autocovariance performance based on theoretical analysis and simulations in section IV-B.

B. AUTOCOVARIANCE FUNCTION
In this section, we present the theoretical form of the autocovariance function of a two-state DTMC, as shown in fig. 2. The eigen values of the transition probability matrix, P, can be derived as, The corresponding eigen vectors can be found as, Assuming that eigenvector decomposition can be performed, we can write P as follows, where, = diag(λ 1 , λ 2 ) and C is the matrix with v 1 and v 2 as its columns. The k-step transition probability matrix can be calculated as (39), shown at the bottom of the next page. The autocorrelation function can be expressed by definition as, x n x n+d P(x n+d |x n )P(x n ) (40) For the initial probabilities of states given by, the distribution of states after k steps is given by, Applying (42) in (40) and making relevant substitutions, we get, For simplicity, we consider the case where α = β. Then, (43) simplifies as, From (44), the autocovariance function (ACV) for the Markovian input can be derived as,   Figure 8 compares the ACV coefficients for lag values, d = 1, 2 using (44) and using simulations. The ACV coefficients for the predicted output is also shown in fig. 8.

D. MEAN ABSOLUTE ERROR (MAE)
In this section, we present the theoretical form of the Mean Absolute Error (MAE) of the LSTM model shown in fig. 1. We define the observed or simulated prediction error, P e , as in (46). We also define the MAE for each test run as, Here,ŷ n and S n are as defined in section IV-C. Further, the definition in (55) can be written as, Further expanding (56) using the definition of E(.), we get, Splitting (57) into two sum terms corresponding toŷ n = S n andŷ n = S n , we get, MAE = ŷ n =S n |ŷ n − S n |P(ŷ n , S n ) + ŷ n =S n |ŷ n − S n |P(ŷ n , S n ) (58) Asŷ n = S n ⇒ |ŷ n − S n | = 0 andŷ n = S n ⇒ |ŷ n − S n | = 1, we can simplify (58) as, MAE = ŷ n =S n P(ŷ n , S n ) = P(ŷ n = 1, S n = 0) + P(ŷ n = 0, S n = 1) (59) As the RHS of (59) is equivalent to the calculation of P e , we have,

V. MODEL VERIFICATION AND NUMERICAL RESULTS
In this section, we verify the proposed analytical form of P e discussed in section IV-A, for simulated and measured datasets. We also present the prediction performance results in terms of the mean observed prediction error, RMSE, and MAE, for different values of the parameter, α, over the simulated datasets.

A. VERIFICATION ON SIMULATED DATA
For the input model parameter α taking values as 0.5, 0.55,. . . ,0.95, the mean observed prediction error is estimated over 1000 independent simulated test runs, as described in section III-A. The probability of error, P e is also calculated as per (33), for the corresponding data. The theoretical and observed P e values are as shown in fig. 9. As can be seen from fig. 9, the probability of error reduces with increasing values of α and the graph is symmetric with respect to α = 0.5. Moreover, there is a linear relationship between P e and α, in addition to the fact that the modelling based approach gives results that match with the simulated P e averaged over multiple independent test runs.

B. VERIFICATION ON SPECTRUM MEASUREMENTS DATA
The probability of error of the LSTM based spectrum prediction model is estimated on the spectrum measurements data published by the RWTH Aachen University [32], [33]. The data used for verification is captured indoors in an office environment. The dataset comprises PSD measurements for 6000 sweeps and 8192 trace points, for the subband I, a centre frequency of 770 MHz and over a frequency range of 20 MHz to 1.52 GHz. The PSD data is converted to a binary spectrum occupancy data, , as per (61), where, PSD rx,t,i is the received PSD for time t, channel, i, and γ is the energy detection threshold, taken as −107 dBm/200kHz. The dataset, , is assumed to consist of channel data following a two-state DTMC process. For this study, the channel data with approximately equal values of α and β are selected for testing the trained LSTM models. For different values of α, the observed prediction error is calculated as per (46) and compared with the probability of error calculated using (33). The comparison results are as shown in fig. 9. These results indicate that theoretical expression for P e as in (33) applies well for LSTM based spectrum prediction over experimental data, for different cases of the input model parameter, α.

C. PREDICTION PERFORMANCE RESULTS
To analyze the prediction performance of the LSTM based model for a Markovian spectrum occupancy input in a statistical sense, 1000 independent test runs are carried out for a given value of α. The prediction error, P e is calculated for each test run using (46) and averaged over all test runs. Figure 10 shows a comparison between the mean prediction errors of the LSTM based prediction model and that of an HMM-based spectrum prediction model, which is a Bayesian-based technique adopted in [27]. The prediction errors are plotted for different values of input autocovariance, σ xx (1), where, σ xx (1) = 0 indicates the equally likely case of α = 0.5. From fig. 10, we observe that the mean prediction error is symmetric and linearly decreasing on both sides of the peak error at σ xx (1) = 0. Figure 11 shows the comparison of the RMSE calculated theoretically and in simulations for different values of α. For a given α, the RMSE is calculated for each simulation test  run using (47) and averaged over all test runs. Additionally, the theoretical value of RMSE is calculated using (54) and the model derived for P e in (33). From fig. 11, we observe that the theoretically calculated RMSE closely matches the RMSE obtained from simulations. Figure 11 also shows a symmetric reducing trend on both sides of the maximum case corresponding to σ xx (1) = 0. Therefore, the results indicate that the model exhibits the worst prediction performance for the truly random or equally likely case of the transition probability, α. The prediction performance improves steadily as the input occupancy data deviates from the equally likely case. Figure 12 shows the comparison of the MAE calculated theoretically and in simulations for different values of α. For a given α, the MAE is calculated for each simulation test run using (55) and averaged over all test runs. Additionally, the theoretical value of MAE is calculated using (60) and the model derived for P e in (33). From fig. 12, we observe that the theoretically calculated MAE closely matches the MAE calculated from simulations. Figure 12 also shows a symmetric reducing trend on both sides of the maximum case of α = 0.5, corresponding to σ xx (1) = 0. Therefore, the results indicate that the model exhibits the worst prediction performance for the truly random case of α and improves steadily in case of higher correlation in the input occupancy data.

VI. CONCLUSION
In this article, we analyzed and modelled the prediction performance of an LSTM based system model for spectrum prediction. We assumed the spectrum occupancy data to follow a two-state discrete-time Markov model. We then tested a trained LSTM system model with simulated and measured spectrum datasets and analyzed the model's prediction scores for each output class. Based on our results, we verified that the predicted scores of the model could be represented using mixtures of truncated Gaussian distributions. Further, we arrived at the probability of error of the prediction model based on the proposed mixture model and compared the results with the observed prediction error over simulated and measured data. We also presented the comparison of simulated and theoretically calculated RMSE and MAE for the LSTM system model. In the future, we aim to characterize the prediction performance of multi-step ahead spectrum prediction using LSTM models.