Data‐Driven Forecasting of Low‐Latitude Ionospheric Total Electron Content Using the Random Forest and LSTM Machine Learning Methods

In this research, we present data‐driven forecasting of ionospheric total electron content (TEC) using the Long‐Short Term Memory (LSTM) deep recurrent neural network method. The random forest machine learning method was used to perform a regression analysis and estimate the variable importance of the input parameters. The input data are obtained from satellite and ground based measurements characterizing the solar‐terrestrial environment. We estimate the relative importance of 34 different parameters, including the solar flux, solar wind density, and speed the three components of interplanetary magnetic field, Lyman‐alpha, the Kp, Dst, and Polar Cap (PC) indices. The TEC measurements are taken with 15‐s cadence from an equatorial GPS station located at Bogota, Columbia (4.7110° N, 74.0721° W). The 2008–2017 data set, including the top five parameters estimated using the random forest, is used for training the machine learning models, and the 2018 data set is used for independent testing of the LSTM forecasting. The LSTM method as applied to forecast the TEC up to 5 h ahead, with 30‐min cadence. The results indicate that very good forecasts with low root mean square (RMS) error (high correlation) can be made in the near future and the RMS errors increase as we forecast further into the future. The data sources are satellite and ground based measurements characterizing the solar‐terrestrial environment.

Consequently, estimating the present and future TEC in the ionosphere would contribute to understanding and mitigating these adverse space weather effects. Previous and current physics-based estimation of the TEC and other space weather parameters are usually complicated by the fact that the relative roles of the physical factors are different across different geographic regions and altitude ranges. Physical models are also not purely physical since some of the processes are parameterized based on simplifying assumptions and may not work globally and the empirical data may not be available for all geographic regions. Moreover, physics-based models for forecasting need observational data to be used as initial and/or boundary conditions at the model grids. Furthermore, the ionosphere is shaped by inputs from the Sun, the solar wind, the magnetosphere, the lower atmosphere and transport processes in the ionosphere itself. Often, we do not have a precise functional relationship between these parameters and TEC measurements or other space weather parameters. Most studies depend heavily on data assimilation techniques in modeling the ionospheric TEC (Bilitza, 2018;Hajj et al., 2004;Mandrake et al., 2005;Scherliess et al., 2009).
Machine learning methods are promising solutions for this kind of problems in which we do not either know the functional relationship between the input variables and the output parameter we want to estimate or the computational cost of the physical model is high. These methods are famous in learning from the data to extract the necessary information. They are particularly suitable for problems involving a suite of variables especially when linear techniques such as least squares regression is inadequate to describe a nonlinear system. Machine learning and deep learning methods are now quite popular in many industries and have achieved some impressive results (Camporeale, 2019).
Some machine learning methods have been applied successfully to ionospheric, magnetospheric and other space weather studies. Forecasting of geomagnetic indices such as the Kp and Dst (Tan et al., 2018;Wu & Lundstedt, 1996), coronal mass ejection propagation time (Bobra & Ilonidis, 2016), solar wind speed (Yang et al., 2018), relativistic electrons at geosynchronous orbits (Ling et al., 2010) have been achieved by applying machine learning methods. Other recent works by Bortnik et al. (2018), Z. , McGranaghan et al. (2018), and Gross and Cohen (2020) applied different machine learning methods for different space weather studies over the ionosphere, magnetosphere and the radiation belt. A non-linear regression analysis was used by Villalobos and Valladares (2020) to model TEC values over South and Central America. These authors found a non-linear Kp, solar flux, and day of year dependency for each pixel (0.5° × 0.5°) and each 30-min TEC map obtained between 2008 and 2010. However, their numerical model presented differences as large as 30% of the TEC measurements. A comprehensive review of the application of machine learning for space weather nowcasting and forecasting is given by (Camporeale, 2019).
The problem of forecasting TEC is largely a time domain problem, in that the future evolution is dependent not only on the present state, but on the past history of various solar-terrestrial parameters. To address this need, we have chosen to investigate the Long-Short Term Memory (LSTM) deep recurrent neural network. LSTM method is a powerful and well-known branch of artificial neural networks famously known for solving time sequence data (Goodfellow et al., 2016). LSTMs have been recently applied for forecasting different space weather parameters. For example, Tan et al. (2018) applied the LSTM method to forecast the Geomagnetic Kp index using historical solar wind, interplanetary magnetic field and the Kp index itself as input. Wei et al. (2018) were able to achieve one day lead time forecasting of high-energy electron integral flux at geostationary orbit using the LSTM deep neural network machine learning method and the Kp, Ap, Dst, solar wind speed, magnetopause subsolar distance and the 2-MeV electron integral flux itself as input. LSTMs are particularly popular in speech recognition, solar power forecasting, traffic prediction and others (Gensler et al., 2016;Graves et al., 2013;Zhao et al., 2017). A brief introduction about the LSTM deep recurrent neural network is presented §2.2.
Other advanced machine learning methods have been applied to forecast ionospheric TEC. For example, Huang and Yuan (2014) employed radial basis function neural network improved by Gaussian mixture model to forecast 30 min TEC. Huang and Yuan (2014) used day of year, local time, previous TEC, its temporal and differential variation as input variables and found results with root-mean-square error less than 5 TECU (1 TECU = 10 16 el/m 2 ). Habarulema et al. (2009) used solar geomagnetic indices, Day of year and hour number features in a simple neural network to model TEC. R.  employed deep neural networks to forecast global TEC maps. However, the application of machine learning and deep learning methods to study the ionosphere can be considered at its infancy state, especially at high and mid-latitudes, given the ubiquitously available space and ground based data sets and the recent popularity of machine learning methods (McGranaghan et al., 2018).
In this research we use advanced machine learning methods and various space and ground based parameters and machine learning to forecast the ionospheric total electron content at a temporal resolution of half an hour and a lead time of up to 5 h.
The paper is organized in the following manner. Section 2 gives a succinct description of the machine learning methodology and the solar-terrestrial parameters that are associated with the variability of the TEC over the Bogota, Columbia equatorial station. Also in Section 2 the data processing and procedures of the experiment are described. Section 3 describes the results of the machine learning algorithm and the implication of these results. Sections 4 provides the discussion and conclusion part.

Data Analysis
We use TEC values collected by one of the Low-latitude Ionospheric Sensor Network (LISN) stations that has been operated since 2001 in Bogota, Columbia (location: 4.7110° N, 74.0721° W). This station is of prime importance due to its location under the northern crest of the equatorial anomaly (see Figure 1). Here, the plasma density is a strong function of the equatorial zonal electric field, the meridional component of the thermospheric wind, and several processes acting in other layers (e.g., stratosphere, mesosphere, and thermosphere) that influence or modify these quantities. The TEC at this GPS station is calculated with 15-s cadence, collected for this study over one full cycle period from 2008 to 2018. We resample the TEC to a cadence of 30 min for efficient computational purposes.
The input features comprising the solar flux (F10.7), solar wind density and speed, the three components of interplanetary magnetic field, Lyman-alpha, the Kp, Dst and Polar Cap (PC) indices, the interplanetary magnetic field, a total of 34 parameters are extracted from NASA's OMNI (Operating Missions as a Node on the Internet) Space Physics Data Facility (SPDF). SPDF-OMNI contains multi-spacecraft measurements, account for the propagation estimated time from spacecraft to the magnetopause (McGranaghan et al., 2018). Table 1 presents the training features used in our machine learning methods. First, the random forest machine learning method is used to perform a preliminary regression analysis and variable importance estimation. Then the top 5 variables are selected and used into the LSTM deep recurrent neural network machine learning method to forecast TEC every 30 min for 5 h lead time.
Simple interpolation technique is applied to some of the parameters to match all the features to a 1 min time resolution before resampling the total data set to a 30 min cadence. All the data sets including the vertical TEC and the OMNI input features are cleaned for missing values. The presence of outliers in some features have been identified and removed as they can affect the machine learning output. For example, the solar flux has outlier measurements way greater than 140 in solar flux unit, other features such as the PC -index, Sigma Lat, Sigma Lon and other have outlier measurements which are removed before training our machine learning methods. Each feature and the vTEC has been scaled from 0 to 1 using the sklearn MIn- MaxScaler (Pedregosa et al., 2011) before training the machine learning methods to minimize the impact of large dynamical range.

Machine Learning
Machine learning is a mathematical approach in which computer systems "learn by example" and extract useful information from a large set of historical data, often very large amounts of data spanning as large number of parameters as possible. Recently, machine learning has been applied to various fields in geosciences and remote sensing, agriculture, banking, etc (Camps-Valls, 2009;Lary et al., 2016;Zhang et al., 2016), and prediction of atmospheric gases such as CO 2 (Gardner & Dorling, 1998) and ozone (Prybutok et al., 2000;Yi & Prybutok, 1996). Beyond geosciences it is used very widely for applications such as for spam filtering (Guzella & Caminhas, 2009), credit scores, fraud detection, image processing, etc. The availability of large space and ground based data sets makes machine learning suitable for ionospheric study. The need for adequate computational facility can be satisfied with the advent of the now commonplace resources such as high performance computing (McGranaghan et al., 2018).
Machine learning methods can learn the behavior of the system and retrieve the necessary information if they are provided with data spanning as many parameters as possible in the training. It can "learn" the behavior of the system even in the case the relation between the information and the parameters is non-linear and multivariate (Lary et al., 2016).
Some commonly used machine learning approaches include Neural Networks, Support Vector Machines, decision trees, and Random Forests (an ensemble of decision trees). The other powerful method for time series forecasting is recurrent neural networks. Although there are different types of machine learning algorithms currently used, there is no single method that always will perform better than the rest for all problems. The best machine learning method to apply depends on the problem we solve and the available training data (Kotsiantis, 2007

Random Forest
One of the popular machine learning methods known for its robust performance on regression and classification is the random forest method introduced by (Breiman, 2001). The random forest machine learning algorithm works based on random sampling of data to form ensemble of decision trees for both regression and classification problems. Each tree will provide its "vote" to make a decision in classification and a model functional estimate for regression. After a number of regression trees are grown using a randomly selected subset of training samples and variables, prediction will be made by aggregating (majority vote for classification or averaging for regression) the predictions of the ensemble. The basic idea here is ensemble learners from randomly resampled data set produce better model performance than developing a single regression tree from the total data set. In that way random forests decreases the variance of the model without increasing bias (Breiman, 2001;Friedman et al., 2001;Verikas et al., 2011).
The main advantage of random forests for the purpose of this study is that it provides a useful facility to rank the relative importance of the input variables (Genuer et al., 2010), allowing us to isolate the variables that are helpful for forecasting. The random forest estimates the variable importance of a feature by estimating the Gini impurity for classification and variance for regression when that feature is used as a splitting node for the regression tree (Han et al., 2016;Strobl et al., 2007). In this case, we are computing how much each feature contributes to decreasing the weighted impurity or variance. While a machine learning model such as LSTM can in principle learn which variables are important, the more refined is the data set, the less training data will be required to achieve good results. As such, narrowing down the variables to the important ones can shorten the training process for the LSTM.

Recurrent Neural Networks
Recurrent neural networks are a special type of neural networks known for time series forecasting and sequential analysis (Yu et al., 2019). Recurrent neural networks work on the principle of the cyclical connectivity and information flow of neurons in the human brain (Tan et al., 2018). They allow information to persist because they have loops across the hidden layers that connect the previous information to the present task as shown in Figure 2a. These cyclic connections present in recurrent neural networks make them more powerful than ordinary feed forward neural networks. Recurrent neural networks perform the same task for each component of the sequence with the output dependent on the previous computation of the sequence in the sense that the recurrent neural network has "memory" that encodes information for the previous computation.
The recently popular and robust deep recurrent neural network applied for solving scientific and commercial problems is the Long-Short Term Memory (LSTM) network. LSTMs, developed by Hochreiter and Schmidhuber (1997) and then refined by Graves (2012), are an improvement to recurrent neural networks in that LSTMs are known for removing the vanishing gradient problem (Hochreiter, 1998) that inhibits recurrent neural networks from learning. The vanishing gradient arises when the gradient of the loss function with respect to the weights is highly reduced during backpropagation time in a long sequence recurrent neural network (Bengio et al., 1994;Mikolov et al., 2014). LSTMs avoid the vanishing gradient problem of recurrent neural networks by remembering information for a long period of time through their special ZEWDIE ET AL.

Results
We first employed the random forest machine learning method to estimate the functional relationship between the various space weather parameters shown in Table 1 to the ionospheric vTEC measured at Bogota Columbia using the 2008 to 2013 data. The random forest machine learning method is also used to estimate and rank the contribution of each parameter ( Figure 6)  To depict the distribution of the vTEC for each bin, the 25th and 75th quartile plots are shown by, respectively, the red and black lines. The data is split into 20% test set and 80% training set. The random forest model is developed using the training data and forecasts are made for the training set ( Figure 3) and test set ( Figure 4). Figure 3 shows scatter plots of forecasts made using the training data and the target vTEC used to supervise the model. Similarly, Figure 4 presents scatter plots of the actual vTEC withheld for testing and forecasts made using the test set. It is no surprise that in the Figure 3 the scatter plots are less spread away from the diagonal (RMSE = 0.373) than the test data set as the they are result of forecasts using the same training data set that is used to develop the model. The scatter plots in Figure 4 are more spread away from the diagonal (RMSE = 0.968). The bin color scale shows that most of the TEC fall below about 25 TEC units.
Error distribution of estimates of the random forest forecast for the independent test data is depicted in Figure 5. In this case the error is the difference between the forecasted vTEC based on the independent test data and the actual vTEC withheld before training the random forest method. We clearly observe that the error is distributed with mean centered near zero and standard deviation close to 1 TEC unit showing robust performance of the random forest method.
The random forest can be applied to estimate the contribution of each feature for the model. This method contributes significantly by compli-ZEWDIE ET AL.
10.1029/2020SW002639 6 of 11   menting model interpretation and explanation issues that are common in other methods such as the blackbox nature of the neural network (Tzeng & Ma, 2005) method. The random forest calculates the feature importance based on the calculation of the Gini impurity (Han et al., 2016) described in §2.2.1. Other methods such as estimation of the mean squared error in the out-of-bag sample for all number of trees before and after the values of that parameter are permuted (Genuer et al., 2010) can also be applied to estimate the variable importance in the random forest regression and classification problems. Figure 6 presents the variable importance estimated and ordered in their respective rank for our TEC forecasting. The names and units of the features are listed in Table 1 The result of the application of the LSTM method to forecast vTEC 30 min ahead into the future is shown in Figure 7. In Figure 7, the black and red curves respectively show the actual and forecasted vTEC 30 min into the future using the LSTM method. Similarly the forecasted and actual GPS TEC 5 h into the future is shown in Figure 8. Visual inspection of the Figures 7 and 8, clearly illustrates the 30 min ahead forecast a more accurate result than the forecast 5 h ahead of time. The 5 h ahead forecast in Figure 8 fails to get the various peaks and fluctuations in the vTEC. Figures 7 and 8 show that we can effectively forecast the vTEC in the immediate future. And when we try to forecast further into the future, the model produces a less accurate forecast of the TEC. Further more, Figure 8 illustrates that the LSTM has relatively a hard time in forecasting the day time VTEC than the night time TEC. The LSTM method, especially, fails to capture fluctuations in the TEC during the day time compared to the nigh time. The night time vTEC doesn't fluctuate as much as the day time vTEC and therefore it is easy to forecast using the LSTM. In some situations the LSTM has a tendency to over estimate the day time peak vTEC while effectively capturing the night time trough.
The forecasting of GPS vTEC at the Bogota, Columbia station has been done in 30-min segments for up to 5 h into the future. Root mean square (RMS) error and correlation coefficient between the LSTM forecasted and actual vTEC for every half hour forecast has been calculated and shown, respectively, in Figures 9 and 10. For baseline performance comparison the persistence forecast is also included. For both LSTM and persistence forecast, the correlation coefficient decreases as we forecast further into the future (Figure 9). Similarly, the RMS error increases as we progress from a half hour forecast to the 5 h lead time forecast. However, even for a five-hour lead time forecast, the correlation coefficient remains fairly high, 0.88 and the persistence forecast falls significantly as we forecast further into the future.

Summary and Discussion
Forecasting the ionospheric space weather is important to mitigate its effect on global navigation and communication systems and power grids. However, developing a comprehensive and accurate ionospheric forecasting model is a challenge as the physical drivers are complex across the different regions and seasons (Mallika et al., 2018). Ionospheric TEC is a key parameter in describing the state of the ionosphere and accurate forecasting of it is crucial for ionospheric space weather characterization and mitigation.
The complex nature of the global ionosphere is shaped by solar and interplanetary activities as wells inputs from stratosphere, troposphere and mesosphere. Various long and short term physics and data assimilation based forecasts have been developed in the past (Amerian et al., 2013;Jakowski et al., 2011). However, physics based models hardly capture the complex structure of the ionosphere as the mathematical relationship between the solar, geomagnetic and lower atmospheric parameters across the various ionospheric geographic regions and different altitudes is not comprehensively and precisely known. Empirical models such as the IRI (Bilitza, 2001)     been commonly applied for ionospheric TEC forecasting are biased and produced inaccurate predictions. For example comparison of direct TEC derived from GPS and IRI model have shown large temporal discrepancies especially over the low latitude ionosphere (Karia et al., 2015).
Therefore, data driven forecasting based on advanced machine learning methods is an ideal candidate for effectively forecasting not only just the TEC but also other space weather events. Although the application of machine learning for forecasting is not necessarily new, it is at its golden time due to the availability of large data sets and the increase in computational power based on specialized processors (Camporeale, 2019). The near Earth space environment is particularly suitable to apply advanced machine learning methods due to the ubiquitously availability of data from different satellites, balloons, ground based magnetometers and GPS receivers, radars and imagers for about half a century.
In this paper, we used the random forest and LSTM machine learning methods to forecast vTEC using an equatorial GPS station at Bogota, Columbia. The random forest method was used to estimate the variable importance and rank them in order of their influence to the model. The LSTM deep recurrent neural network is then used to forecast the vTEC 5 h ahead every half hour. Only the top 5 parameters in the random ranking as well as the past history of the TEC are used as input for the LSTM forecasting. Using a selected number of the top predictors highly improves the computational necessity and training time. The method is able to forecast the TEC even in the drastic behavior of the equatorial ionosphere. And undoubtedly, it will perform much better at the mid-latitude ionosphere where drastic changes in the ionosphere are uncommon. Subsequent papers will present the application of the method at the polar region ionosphere. The method can also be applied to forecast other space weather parameters at all regions and altitudes.
A few important points should be indicated regarding the variables used in the forecasting and the machine learning methods. It has been widely known that the equatorial ionosphere is driven by solar radiation coming from the sun and the parameters which are known to influence the TEC are solar and magnetic indices, geographic position of the receiver and line of sight of the satellite (Habarulema et al., 2009;Hofmann-Wellenhof et al., 2012). As we clearly see in the variable importance ranking using the random forest machine learning (see Figure 6), the solar parameters: F10.7 solar flux, layman alpha, R sun spot and the magnetic indices: ASY_D, Dst and PC indices stood in the top 6. This is a testimony that the machine learning results agree well with the physics of ionospheric formation.
A few other recent researchers used the LSTM methods to forecast different space weather parameters and its application is on the rise. For example, Yang et al. (2018) used the LSTM method to forecast the occurrence of geomagnetic storms (kp v 5) using the solar wind and planetary magnetic field as well the kp index itself to the LSTM method. Yang et al. (2018) found that the LSTM method improved the the kp forecasting, compared to other methods, despite a lower mean absolute and mean squared errors. Our application of the LSTM method to forecast vTEC produced results with higher correlation coefficient (≈0.98) at short lead time than at long lead time. A recent work by Z.  used a hybrid LSTM and CNN (Convolutional Neural Network) for forecast extreme weather events bringing superior results than traditional numerical and statistical methods. Application of the hybrid LSTM-CNN method will help for ionospheric 2D and 3D forecasting and identifying electron density perturbations.

Data Availability Statement
All the TEC values corresponding to Bogota in Colombia can be found at the LISN server at the following address: lisn.igp.gob.pe. Other solar wind, magnetosphere, and ionosphere parameters are from the NASA Space Physics Data Facility (SPDF) web page (omniweb.gsfc.nasa.gov).
10.1029/2020SW002639 9 of 11 Figure 10. Root mean square error between real and forecasted total electron content, as a function of forecast horizon.