A deep learning approach to solar radio flux forecasting

The effect of atmospheric drag on spacecraft dynamics is considered one of the predominant sources of uncertainty in Low Earth Orbit. These effects are characterised in part by the atmospheric density, a quantity highly correlated to space weather. Current atmosphere models typically account for this through proxy indices such as the F10.7, but with variations in solar radio flux forecasts leading to significant orbit differences over just a few days, prediction of these quantities is a limiting factor in the accurate estimation of future drag conditions, and consequently orbital prediction. In this work, a novel deep residual architecture for univariate time series forecasting, N-BEATS, is employed for the prediction of the F10.7 solar proxy on the days-ahead timescales relevant to space operations. This untailored, pure deep learning approach has recently achieved state-of-the-art performance in time series forecasting competitions, outperforming well-established statistical, as well as statistical hybrid models, across a range of domains. The approach was found to be effective in single point forecasting up to 27-days ahead, and was additionally extended to produce forecast uncertainty estimates using deep ensembles. These forecasts were then compared to a persistence baseline and two operationally available forecasts: one statistical (provided by BGS, ESA), and one multi-flux neural network (by CLS, CNES). It was found that the N-BEATS model systematically outperformed the baseline and statistical approaches, and achieved an improved or similar performance to the multi-flux neural network approach despite only learning from a single variable.


A B S T R A C T
The effect of atmospheric drag on spacecraft dynamics is considered one of the predominant sources of uncertainty in Low Earth Orbit. These effects are characterised in part by the atmospheric density, a quantity highly correlated to space weather. Current atmosphere models typically account for this through proxy indices such as the F10.7, but with variations in solar radio flux forecasts leading to significant orbit differences over just a few days, prediction of these quantities is a limiting factor in the accurate estimation of future drag conditions, and consequently orbital prediction. In this work, a novel deep residual architecture for univariate time series forecasting, N-BEATS, is employed for the prediction of the F10.7 solar proxy on the days-ahead timescales relevant to space operations. This untailored, pure deep learning approach has recently achieved state-of-the-art performance in time series forecasting competitions, outperforming well-established statistical, as well as statistical hybrid models, across a range of domains. The approach was found to be effective in single point forecasting up to 27-days ahead, and was additionally extended to produce forecast uncertainty estimates using deep ensembles. These forecasts were then compared to a persistence baseline and two operationally available forecasts: one statistical (provided by BGS, ESA), and one multi-flux neural network (by CLS, CNES). It was found that the N-BEATS model systematically outperformed the baseline and statistical approaches, and achieved an improved or similar performance to the multi-flux neural network approach despite only learning from a single variable.

Introduction
The dynamics of space objects orbiting in Low Earth Orbit (LEO) strongly depend on the characterisation of the uncertainties on the initial state, physical properties of the objects themselves (such as mass and shape) and properties of the atmosphere, chiefly the density. These atmospheric properties are strongly influenced by both solar and geomagnetic activities, whose forecasting is therefore of paramount importance for space operations, and whose forecast uncertainties are fundamental to properly characterise the uncertainties on the orbital states of spacecraft and space debris. As such, the prediction of these quantities has fundamental implications both in the day-to-day management of operational spacecraft such as collision avoidance [1], and also in the longer term, in re-entry prediction [2].
Typical atmospheric density models, which are used to model the dynamics of space objects, capture the space weather conditions using two types of proxies, one for the solar activity and one for the geomagnetic activity. The atmospheric density is predominantly influenced by the solar activity, or the so-called Extreme Ultra Violet (EUV) irradiance [3]. The solar EUV is highly energetic and is absorbed by forecasts still relying on statistical techniques. Moreover, many of the considered approaches on the machine learning side focus on the application of classical approaches such as Support Vector Regression (SVR) [11], or single layer feedforward neural networks [12].
However, within the last year there have been significant advancements in the field of time series forecasting by way of deep learning. More specifically, in [14], Oreshkin et al. presented N-BEATS, a deep residual architecture for univariate time series forecasting, which, for the first time, succeeded in outperforming winning approaches of recent forecasting competitions across a range of domains, which were all previously based on either statistical or hybrid (statistical + ML) methods. Its success is due to a unique architecture that combines a deep stack of fully-connected layers, backward and forward residual links, aggregation of the partial forecasts in a hierarchical fashion, and ensembling. Being a pure deep learning approach implies that, unlike statistical approaches, there is no expert knowledge, or ad-hoc feature engineering, required on the data itself in order to train the model.
Given the promising performance of this state-of-the-art architecture across a range of domains, in this work we apply N-BEATS to the daily prediction of the F10.7 solar proxy and examine its feasibility over forecast horizons relevant to space operations, from 3 days for activities such as collision avoidance, up to 27 days for activities such as re-entry campaigns. To the best of our knowledge, this is the first time deep residual networks have been applied to the forecasting of solar proxies. Furthermore, we extend this approach with non-intrusive uncertainty quantification using deep ensembles [15,16]. Finally, we perform a systematic comparison of the forecasts generated using this pure deep learning approach to those generated using other data-driven approaches, both statistical and ML, and show that it can produce improved single point forecasts, as well as useful forecast uncertainty estimates.
The main contributions of this work can be briefly summarised as follows: • The use of a state-of-the-art deep neural network (N-BEATS) to forecast future values of the F10.7 using only its past history, with no additional variables and no requirement for domain-specific knowledge of the data. • The use of deep ensembles to improve the accuracy of the forecasts and to provide a measure of model uncertainty alongside the single point predictions. • A detailed systematic comparison with operationally available forecasts, which emphasises the strengths and weaknesses of this approach and paves the way for future work. To this end, the forecasts provided by our approach, along with the code to reproduce the experiments of this paper, are publicly available in a Github repository, 1 to enable further research and comparisons.
The paper is structured as follows. In Section 2 we provide backgrounds on the state of the field of time series forecasting, with a particular emphasis on its use in relation to space weather activities. In Section 3, the proposed approach is described, which includes not only the explanation of the deep learning architecture employed, N-BEATS, but also the way the data is extracted and passed to the model, the training and evaluation procedures, and the estimation of the prediction intervals through an ensemble of trained models. Section 4 makes a detailed comparison of the proposed approach with respect to current operationally available forecasts, comparing both the values of the predicted data points in the future, and the uncertainty intervals. Finally, in Section 5 we discuss the conclusions of the paper, and outline avenues for future research. 1 https://github.com/stardust-r/deep-learning-space-weather-forecasting.

Backgrounds on time series forecasting
The goal of time series forecasting is to predict the values of a set of future data points given a set of past observations. There are multiple types of forecasting, depending on different criteria: • The number of series to predict. The term univariate time series forecasting refers to making predictions on one single series, regardless the number of input variables used. On the other hand, multivariate time series forecasting refers to the prediction of several related series at once. • The number of time steps to predict, also known as the horizon ( ). In contrast to one-step-ahead predictions, multi-horizon forecasting predicts the variables of interest at multiple future time steps, thus providing decision makers with an estimate that can be used to optimise their course of action across an entire path of predictions. The number of time steps used to create the prediction is then known as the lookback. • The uncertainty estimation provided by the forecasting model.
Single point models focus on estimating, as precisely as possible, the future point values. However, in many scenarios [17], the provision of uncertainty intervals can be useful, if not critical, for risk management, by giving decision makers an indication of likely best and worst-base values that the target can take.
Although there have been recent attempts to create a meaningful distinction between forecasting methods [18], these can be roughly classified as being either of a statistical or machine learning nature. Statistical methods make use of statistics based on historical data to predict what will happen in the future. They are normally computationally efficient as they rely on linear processes to minimise the prediction error, and require expert knowledge about the trend and the seasonality of the data to model. Traditional and popular examples of these methods include ARIMA [19] and Exponential smoothing (ETS) [20] models. On the other hand, ML methods tackle the problem of forecasting as a supervised learning (auto)regression task, where the model is trained on pairs of past/future values from different slices of the time series. They are computationally more demanding and rely, in many cases, on non-linear training algorithms. In a ''pure'' ML method, the main advantage is that no time series specific engineering is needed to train the model.
Among the several ML methods that can be used for time series forecasting, neural networks, and more specifically, deep neural networks, are one of the most popular alternatives in the recent literature, due to the latest breakthroughs in Deep Learning (DL) [21]. A neural network is an artificial model that emulates how the human brain works, using an abstract (or simplified) mathematical model of a neuron. It consists of a series of such neurons connected to each other with a series of weights. These weights are learned from the training data, using a learning algorithm that updates them in order to minimise the loss (or error) of the network predictions summed over all training cases. The most common type of neural network is the feedforward neural network, where the information enters into the input units, and flows in one direction through the hidden layers until it reaches the output units. Although the universal approximation theorem [22] shows that any function can be well-approximated using a feedforward neural network with just one hidden layer of non-linear neurons, in practice, deeper architectures (with more than one hidden layer) have smaller matrices, making it possible to split the derivative of the loss function into pieces, meaning that the model can be trained more quickly and will take up less memory [23]. In addition, the layered structure of a deep neural network enables the automatic extraction of features from the data at different levels of abstraction, the later layers being the most specialised for the task at hand. In the field of time series forecasting, the most common deep learning architectures are those based on Recurrent Neural Networks (RNNs), whose units contain an internal memory state which acts as a compact summary of past information [24]. In recent years, the development of attention mechanisms and the Transformer architecture [25] has also lead to improvements in temporal dependency learning, from which time series forecasting has benefited [26].
Despite all of this, the use of ML and DL methods are far from being the standard for the task of time series forecasting. For instance, in the 2018 forecasting competition M4, 2 which challenges researchers to forecast time series data over multiple domains, 12 out of the 17 most accurate solutions were ensembles of classical statistical methods [27], and only six of the submissions were pure ML. Only in the last few months, when the 2020 M5 competitions 3 ,4 were concluded, could it finally be seen that ML methods were part of the top solutions of the leaderboard, representing a significant step forward in the implantation of ML for time series forecasting.

Forecasting the F10.7 proxy
Given the importance of the F10.7 proxy to atmospheric density modelling, many research works have been carried out to derive and test forecasting methods and approaches. Support Vector Regression (SVR) was used for short-term forecasting of F10.7 [11], with the authors showing that ''the proposed approach can perform well by using fewer training data points than the traditional neural network''. A simple linear forecasting model for the F10.7 proxy has been proposed by Warren et al. [8]. In this paper the authors also compared the linear forecasting approach of the F10.7 to forecasting using artificial neural networks, and preliminarily concluded that ''forecasting via sophisticate artificial neural networks is not any better than a simple linear forecasting approach''. Various empirical time series prediction techniques were compared in [12]. The authors selected a multi-wavelength, nonrecursive, analogue neural network, and found that ''the prediction of the 30 cm flux, and to a lesser extent that of the 10.7 cm flux, performs better than NOAA's present prediction of the 10.7 cm flux, especially during periods of high solar activity''. A linear multi-step forecasting model based on the correlation between different forecasting steps and the characteristic of heteroscedasticity is proposed in [9]. In the same paper, a variational Bayesian procedure to optimise the model is also introduced, and it is claimed that the proposed model improves the performance of multi-step F10.7 forecasting by considering correlation and heteroscedasticity. More recently, a thorough analysis of the power of statistical ARIMA models for the forecasting of this proxy was carried out in [10], proving that, as long as the order of the ARIMA model is optimally chosen, the model is not inferior to other techniques. The importance of benchmarking these models, especially with a view to use in atmospheric density models and operational applications such as collision avoidance, is now beginning to be recognised in this field [28].
To the best of our knowledge, the reception of deep neural networks to forecast the solar flux is limited, especially in terms of the F10.7 proxy. Most of the work in the intersection of solar activity and deep learning is focused on the early classification of solar flares and geomagnetic storms. As an example, Long Short Term Memory (LSTM) architectures (a subclass of RNNs) have been employed for the detection of geomagnetic storms based on the index in [29], and in [30], Convolutional Neural Networks (CNNs) are trained to classify flaring and nonflaring active regions using line-of-sight magnetograms. Only in the last few months, the task of forecasting future values of the GOES X-ray flux has been studied with different deep learning architectures [31], including N-BEATS, the architecture used as a basis for this work.

Applying deep learning to F10.7 forecasting
In this section, we present our approach to provide daily, univariate, multi-horizon forecasts with prediction intervals of the F10.7, using only past information of the proxy, with no additional input variables. It is an end-to-end deep learning approach based on the novel architecture N-BEATS. For the sake of reproducibility, the implementation of this approach is publicly available on Github. 5

Model architecture: N-BEATS
N-BEATS (Neural Basis Expansion Analysis for interpretable Time Series forecasting) [14] is a novel deep learning architecture for single point, univariate, multi-horizon forecasting that has been gaining traction in the field since it was proved to be the first pure deep learning method that outperforms the winning approaches of recent forecasting competitions. It does not need any specific expert knowledge on the data, and is thus applicable to a wide array of target domains without any feature engineering. An open source implementation of N-BEATS, 6 written with the deep learning library PyTorch, has been employed in this work.
N-BEATS belongs to the family of deep residual networks, which were introduced for computer vision tasks as a way to train very deep networks effectively [32]. More specifically, its topology is described as doubly residual stacking (see Fig. 1), where each stack consists of multiple residual blocks that produce two outputs: the block's estimation of the input (or lookback) data, called backcast, and the estimation of the future values across the desired horizon, known as forecast. The backcast is subtracted from the current's block input, forming a residual which then serves as input to the next block in the stack. The output of the network is the result of a hierarchical aggregation of the forecasts across stacks, i.e., first the partial forecasts of each block are aggregated at the stack level and then at the overall network level, providing the final global output. The iterative and residual nature of this architecture aims to encourage gradual signal reconstruction and forecasting.
Internally, each basic residual block within a stack consists of a multi-layer fully connected network with non-linear (ReLu) activation functions between each layer, although there are some extensions of N-BEATS that replace this basic building block with temporal aware structures such as RNNs [33]. The fully connected network outputs two vectors of basis expansion coefficients, normally referred to as , which are then accepted by two learnable basis functions to generate the final backcast and forecast of the block, respectively. The parameters of the basis functions can be constrained so that only a family of functions can be learnt (e.g., low-degree polynomials or Fourier series), forcing the model to decompose the forecast into distinct human interpretable outputs such as the trend and the seasonal components of the data. On the other hand, the parameters of these functions can be left unconstrained (or generic, as it is known in the N-BEATS paper), with the aim of using no domain knowledge in the modelling.

Data and model inputs
The European Space Agency (ESA) Space Weather Service Network 7 maintains a database containing both past, and forecast, values of solar and geomagnetic indices which are relevant to drag calculations. This data is compiled from a number of independent providers in one place, for the convenience of end users and space operators, as a part of its Space Surveillance and Tracking Service.
We use this service to extract the time series of the F10.7, measured in solar flux units (sfu), which has been measured continuously since E. Stevenson et al. 1947 by the Ottawa, and then Penticton Radio Observatories [4]. The F10.7 is available in either observed values, which vary throughout the year with the Sun-Earth separation, or adjusted values, where the observations are adjusted to 1 AU (Astronomical Unit). In this work, we train our forecasting model using the observed data, which is used by typical thermosphere models [5,12]. Daily values of the F10.7 were taken to be those measured at 20:00 where appropriate, and missing values, more prevalent early in the time series, linearly interpolated.
The data is then split into training and validation subsets. However, due to the correlated nature of time series data, the typical ML strategy of random splitting, that ensures that the underlying distribution in these subsets is the same, cannot be used. We must therefore ensure that the validation set contains a full solar cycle so that it is representative of the training data. To achieve this, we use an approximate 80% to 20% splitting strategy, with the training set covering the period from 01/01/1950 to 20/10/2006, and the validation set covering 20/10/2006 to 01/10/2020, covering Solar Cycle 24, as shown in Fig. 2.
As we are using deep learning, for which successive feature extraction through the layers of the architecture is implicit, this data does not require extensive pre-processing in order for the model to perform well. As such, the input data needs only to be normalised, to prevent exploding gradients and improve the numerical stability of the model, and subdivided into lookback-horizon windows. No explicit knowledge or analysis of time series features such as trend and seasonality is required in advance.

Model training and evaluation
The underlying architecture of a deep learning model is defined by a series of hyperparameters which can be pre-set by the user, and are not learnt by the model during training. These parameters constrain the complexity of the model and should be optimised, or tuned, to find the optimal configuration for a specific problem such that the model does not under or overfit the training data.
In the case of the N-BEATS architecture, as can be seen in Fig. 1, there are a large number of potential hyperparameters that can be tuned. However, one of the distinguishing aspects of this architecture is that it was specifically designed to be generally applicable across a wide variety of horizons and datasets, and should perform well without the need for extensive tuning.  As such, we investigated architectural hyperparameters (for example the number of stacks, number of blocks per stack, number of layers etc.) similar to those recommended by the authors [14], and also focused on basis choice and optimisation hyperparameters such as learning rate. Tuning was performed through a grid search, using the experiment-tracking tool Weights & Biases [34], and the chosen parameters are given in Table 1.
Notably, we found that constraining the bases to trend (polynomial) and seasonality (Fourier) components (in an interpretable approach, as described in Section 3.1) significantly hindered model performance compared to the generic approach, where the model has free reign to learn the preferred basis. By not constraining the Basis Type, the final model therefore does not rely on domain specific knowledge.
This analysis was performed using the Mean Squared Error (MSE), the nominal metric used to evaluate the performance of regression problems in machine learning, which is defined as follows, in which ⃗ are the set of predicted future values of a time series of length over a forecast horizon of length , and ⃗ are the set of true observed values over the horizon, For a more robust and systematic approach to model evaluation, several additional metrics will also be considered in this work, which capture different aspects of the model performance.
Firstly, we include the Mean Absolute Percentage Error (MAPE), and Mean Absolute Scaled Error (MASE), which are standard scale-free metrics used more specifically in the field of time series forecasting, and used by [14] to enable performance comparison over a range of different datasets, These are linear metrics, which means that unlike the squared MSE, they do not give as much weighting or importance to larger errors, which are typically associated with higher levels of solar activity.
The MASE is equivalent to the Mean Absolute Error (MAE), scaled by the average error of a naive baseline model whose forecast is simply a previously observed value periods in the past. If there is no prior knowledge of the seasonality of the time series, can be set to 1 and the naive model is that of the persistence. As we do not want our analysis to depend on any pre-existing domain knowledge, and as the solar flux encompasses multiple seasonalities, we consider the persistence as our baseline model here in the definition of the MASE, and later in Section 4.1.
Next, we consider the recommendations of [35], who proposed a standardised set of comparison metrics for benchmarking geomagnetic index prediction models. In this way, we hope to enable and encourage more transparent and systematic comparisons between pre-existing models by future authors.
We therefore also include the Pearson linear correlation coefficient and the Mean Error (ME), or bias, which gives an indication as to whether the model, on average, overpredicts (positive bias) or underpredicts (negative bias) the observed data. We also include the MAE and Root Mean Squared Error (RMSE), the square root of Eq. (1), to be consistent with the recommended guidelines, and to be comparable to other authors who may choose these metrics. Finally, we introduce the concept of the Relative metric, which is an extension of a metric suggested by Yaya et al. in [12]. In the case that the model is trained separately for each horizon, using the above metrics will result in a set of performance metrics for different horizons. However, in order to obtain a single metric over all horizons, the performances must be scaled to prevent higher horizons, with higher errors, dominating the final value. We therefore define the relative metric as the average, over all horizons, of the ratio of the model performance to that of the persistence, where can be any of the above metrics, and min , max are the minimum and maximum horizons of interest, which in our case are 3 and 27 days respectively.

Ensemble forecasting and uncertainty quantification
Ensemble forecasting is a technique that has long been used in terrestrial weather forecasting [36], and is also employed in all leading submissions in time series forecasting competitions [14], owing to its ability to not only improve accuracy, but also to improve the reliability of such forecasts by providing an inherent measure of model uncertainty [37]. This is achieved by averaging the predictions over a diverse set of models to create a single more-accurate model, with an associated uncertainty, that has several additional advantages. For example, by providing a range of possible outcomes, this approach can yield a better understanding of extreme events, such as solar storms. It can also be used to account for both uncertainty in the model inputs (aleatoric uncertainty), and the propagation of uncertainty inherent in the model itself (epistemic uncertainty, as considered in this work) in the resultant forecast uncertainty [36].
However, one of its greatest benefits when employing deep learning techniques, is its ability to improve the out-of-distribution robustness of the model [16]. Due to their large network complexity, deep learning models are particularly susceptible to overfitting, from which the model does not generalise well to new data. Regularisation techniques can be used to overcome this, and in the case of the N-BEATS architecture, ensembling was found to be more powerful than typical techniques such as drop out, or weight penalties [14].
We adopt an approach similar to that recommended in [14], building the ensemble from a set of models that have the same underlying architecture (which is chosen through hyperparameter tuning, see Section 3.3), but different higher level training parameters. For this we use three main sources of diversity: length of input window (lookback), choice of loss function (the error used internally during training, see Section 3.3), and choice of weight initialisation, as described in Table 2.
In this way, the resulting ensemble can account for trends in the data over different time scales, and account for model bias and variance arising from the training procedure. This procedure is iterative and stochastic, and therefore the initial values of the weights strongly determine which local optima is found. Varying the initialisations, and the search space itself by varying the loss function, therefore improves the performance of the ensemble as a whole by averaging out weaker solutions. Injecting randomness in the initialisation is also particularly important when using machine learning techniques for time series forecasting, as the sequential nature of the data prevents the usual practice of shuffling the training dataset prior to each epoch in order to provoke changes in the gradient estimate of the optimiser.   As shown in Table 2, for every forecast horizon, we generate 90 individual models, which are then combined using the mean as the ensemble aggregation function, to generate the final forecast. In this approach, we use the same random initialisations over all horizons, and do not use bootstrapping, 8 in order to ensure that each model is trained with as large a dataset as possible. Such an approach has been shown to perform well in practice compared to traditional bagging procedures [15]. It can be seen from Fig. 3, that using this ensemble approach improves the performance of the overall model, and that the number of individual models we consider (90) is sufficient.
An example of a 5-day N-BEATS ensemble forecast generated using the approach described in this section, which will simply be denoted as N-BEATS for the remainder of this paper, can then be seen in Fig. 4. Here, we show an example of the lookback period used for the prediction (10 days, 2 , as one of the lookbacks used in the ensemble), and the N-BEATS ensemble prediction over the forecast horizon. The 1 uncertainty band naturally arises from the distribution of forecasts over the ensemble (each of which is shown here in grey to illustrate the concept), and can be seen to encompass the true values of the F10.7 over the horizon.

Comparison with operationally available forecasts
In this section, the performance of the N-BEATS ensemble approach described in Section 3, is compared to operationally available forecasts that comprise both statistical and machine learning approaches. The models themselves are described in Section 4.1, with the results of the comparisons in terms of single point forecasting and uncertainty estimation discussed in Sections 4.2 and 4.3 respectively. 8 Bootstrapping is a resampling technique that draws samples times uniformly with replacement from a dataset with items.

Forecast model descriptions
Here, we describe the F10.7 forecasts used for the comparison with N-BEATS. They are publicly available, and updated on a daily basis.

Persistence (Baseline)
The persistence model forecasts always the last observed value, i.e.,̂+ = + −1 ∀ ∈ {1, … , }. It is a simple model that performs reasonably well, especially for low horizons, and thus is a common baseline to use for comparison in multiple related works [12]. Setting = 1 in the MASE metric (See Eq. (3)) can be thought of as comparing a certain set of predictions against the performance of the persistence model.

BGS (ESA)
In 1993, the Geomagnetism Group of the British Geological Survey (BGS) carried out work under contract to ESA to investigate forecasting techniques for predicting solar and geomagnetic activity [6]. As part of this work, they constructed a software for the forecasting of the F10.7 proxy up to 27 days ahead. This software uses an ARIMA model [19] with 60 coefficients, which are recalculated daily to reflect changing solar and geomagnetic conditions, using the preceding two years of data.

CLS (CNES)
The Collecte Localisation Satellites (CLS), a subsidiary of the French Space Agency (CNES), provides forecasts of the F10.7 using a model developed from their research, published in [12]. Although their method is similar to the approach proposed in this work in the sense that both are based on neural networks, there are two main differences: 1. Unlike the approach presented here, CLS use additional input variables aside from the F10.7 to compute the forecasts. More specifically, multiple wavelengths (8.2 cm, 10.7 cm, 15 cm and 30 cm) of the solar radio flux are included. 2. The architecture employed cannot be considered as a deep neural network, since it only has one hidden layer. This layer then relies on a logistic activation function, while N-BEATS relies on ReLu. Additionally, the architecture is based on a feedforward approach, which differs to the deep residual approach used by N-BEATS.

Comparison of single point forecasting
In this section, we present a comparison of the model performances in terms of single point forecasting. The analysis is comprised of two subsections.
First, we consider forecasts provided by the ESA Space Weather Service Network. 9 These forecasts, which include BGS, are only available since late 2016, and therefore this analysis is performed on a reduced validation set covering the period 26/11/2016 to 01/10/2020. The second section then contains an extended comparison, covering a full solar cycle, between N-BEATS and CLS, whose complete forecast archive is publicly available. 10 This covers the full validation set shown in Fig. 2, from 20/10/2006 to 01/10/2020.
In these sections, the N-BEATS, persistence and BGS single point forecasts are evaluated against the observation dataset described in Section 3.2. The CLS forecasts are evaluated against their own archive of observations, in order to be consistent with the data that they used during model training.

2016-2020 validation period
This analysis covers a period of relatively low solar activity, as seen in Fig. 2, but over which we can compare N-BEATS to all the models described in Section 4.1. During this period, a small number of BGS forecasts are missing, and thus these windows are also removed from the other models for a fair comparison.
In Fig. 5a, we show the evolution of different performance metrics, defined in Section 3.3, for different N-BEATS models with these available operational forecasts. RMSE and MAPE are error metrics, and so better performing models have lower errors. As expected, the errors increase with horizon, as we forecast further forward in time. The opposite is true of the Pearson correlation coefficient, , where higher values are desirable, with a value of 1 indicating a perfect linear correlation between the observed and predicted values of the models.
It can be seen that for the reduced validation set in Fig. 5a, the N-BEATS ensemble model gives consistently good results up to a forecast horizon of 27 days, outperforming the baseline persistence model, statistical BGS approach, and neural network CLS model in RMSE and , and showing a similar or improved performance over all models in all displayed metrics.
To infer the overall best performing model over this period, we use relative metrics, as defined in Eq. (6), to obtain a set of single performance metrics which are averaged over all forecast horizons. As described in Section 3.3, these are scaled against the persistence at each horizon before they are averaged to ensure that the final metrics are not weighted too heavily towards larger, more error prone horizons, and therefore a relative metric value of 1 means that the model exhibits the same performance as the persistence baseline. As such, models exhibiting better performance than the persistence in error metrics will have a relative metric less than 1, but those exhibiting better performance in will have a score greater than 1. The performance of the models using these metrics are given in Table 3.
Again, it can be concluded that N-BEATS consistently outperforms the persistence, as its relative error metrics are below, and correlation metric is above 1 respectively. However, more significantly, it can be seen that it systematically outperforms both the BGS statistical approach and neural network CLS model 11 in all metrics over this reduced validation period.
One perhaps surprising result, is the relative poor performance of the CLS model in MAPE when it outperforms the persistence and BGS in RMSE. One possible explanation for this, is that the CLS model used a variation of the RMSE as their loss function during training [12]. In this way, the model learns to minimise this specific metric and, as a result, forecasts generated with this model may have a bias towards it. This Table 4 Relative metric comparison of N-BEATS with operationally available forecasts for 2006-2020. Metrics are scaled against the persistence baseline, and averaged over forecast horizons. The best performing values in each metric are highlighted in bold.

Model
Relative illustrates the importance of considering multiple metrics for model performance, and multiple loss functions in the N-BEATS ensemble. It should be noted, however, that it is difficult to definitively conclude the relative performance of these models in any metric using only this reduced validation set. For this, the comparison should be performed over a full solar cycle, as models that are deficient for low solar activity, may perform better during periods of high solar activity.

2006-2020 validation period
The same analysis was therefore performed over the full validation set for available models, with the metric evolution and relative metrics shown in Fig. 5b and Table 4 respectively. It can be seen that, over a full solar cycle, the performance of N-BEATS is significantly closer to that of CLS in all metrics. As expected, both models outperform the persistence model, but in RMSE related metrics, N-BEATS fractionally underperforms compared to CLS, whereas it outperforms CLS in MAPE related metrics. This supports the hypothesis discussed in Section 4.2.1, that the CLS model may have a bias towards RMSE as it was used as its loss function during training. On the other hand, we use both the MSE and MAPE as loss functions during the ensemble approach (Section 3.4), which works to minimise bias in our final model. This leaves the correlation coefficient, , as the only independent comparison metric that was not used during training for either approach and, as can be seen in Table 4, the final metric is identical for both models.
This similarity in performance is an encouraging result, given that the CLS approach uses 4 different flux wavelengths during training, whereas N-BEATS learns only from a single one. This suggests that N-BEATS is a more powerful architecture, able to infer hidden patterns in the data that aid in its forecasting capabilities.
Note that, on comparing Table 4 to that obtained over the reduced validation set, Table 3, it can be seen that in general the relative metrics improve when considering a longer timescale with different levels of solar activity. This may be suggestive that the models improve their performance during periods of high solar activity, or indeed rather that the persistence baseline which is used as a scaling factor, has relatively worse performance during these periods. However, the same is not true of the relative value for N-BEATS, whose performance decreased, which again serves to justify the use of different metrics that capture different aspects of the model performance, but is also an indicator that N-BEATS may have a stronger performance during low solar activity.
To further investigate the strengths and deficiencies of the models over the course of the solar cycle, during different levels of solar activity, we consider the breakdown of RMSE over the validation set. This is shown in Fig. 6, alongside the observed F10.7 data during this period, to illustrate the high level of correlation between the model error and the level of solar activity itself.
From this breakdown of single point model error we can draw two main conclusions. First, that the performance of the two models is fairly comparable throughout the solar cycle, with CLS showing a slight tendency to perform better during increasing activity and for lower horizons, and N-BEATS performing better for lower activity and longer horizons. Second, that all models perform poorly in predicting the peak event in late 2017.
To understand this first point, we consider the Mean Error (ME), or bias, of the models over four year-long periods which are characteristic of increasing, high, decreasing and low levels of solar activity over solar E. Stevenson et al.  cycle 24. These are listed in Table 5, and were chosen, where possible, to overlap with those used for a similar analysis in [12]. It can be seen from Fig. 7, that CLS is more biased than N-BEATS for low solar activity, which explains its poor performance during these periods, as in Section 4.2.1. This bias is reduced during the periods of higher activity which can again be explained by its sole use of RMSElike squared metrics, which are weighted more heavily towards higher errors and therefore higher activities.
N-BEATS, on the other hand, has a tendency to have better performance for lower solar activities, and be more biased for high activity. For both increasing and decreasing activity, N-BEATS has a near-zero bias up to a horizon of 6-days before it begins to underpredict, but this deviation is most pronounced during the high period of solar activity.
This systematic underprediction of high fluxes with increasing forecast horizon by N-BEATS can be better seen in Fig. 8. For low horizons, the value between the observed and predicted fluxes is close to 1. However, as the forecast horizon increases, the model tends to underestimate high values of the F10.7, resulting in a negative bias.
This damping of high activity by the model then also leads us to the second point, the peak event in 2017. This corresponds to an intense storm period that occurred during September 2017 which produced the largest flares during Solar Cycle 24 [38]. This included 4 X-class flares, the highest flare class, which have the ability to disturb satellite trajectories, and are therefore important to capture for orbital prediction [39].
It can be seen from Fig. 6, that N-BEATS is unable to capture this event. This is a result of the overconfidence of deep neural networks with out-of-distribution data, by which rare events can be erroneously predicted as in-distribution values with high confidence [15]. The CLS neural network model appears to suffer similarly, although for CLS, the error evaluation here is performed against it's own provided archive of observations, which undergoes an anomaly screening [12]. The period of observations covering this event has been tagged as ''flare corrected'', and therefore may not provide a direct comparison to the N-BEATS forecast. The fact that, contrary to CLS, we are not training from or evaluating against systematically cleaned data, may also explain the apparent higher bias of N-BEATS. However, we chose to not to use cleaned data as we wanted to allow for flare capture by our model. This will be further investigated in future work.

Comparison of uncertainty estimation
In this section, we present an analysis of model performance in terms of uncertainty estimation. Of the available models for comparison with N-BEATS, only CLS provides a measure of prediction uncertainty, and we are therefore able to use the full validation set (20/10/2006 to 01/10/2020) for this analysis. Both models employ fundamentally different approaches to uncertainty quantification. For N-BEATS, we obtain the forecast uncertainty directly from the deep ensemble as the standard deviation over individual model runs, as described in Section 3.4. This differs to the approach used by CLS, for which the uncertainty estimation is not inherent to the model, but assigned after prediction using a linear fit between past model RMSE and solar flux [12]. Fig. 6 shows the 162-day-averaged level of forecast uncertainty for each model, 1 for N-BEATS and RMSE for CLS, for a 10-day forecast horizon. From this, it can be seen that the N-BEATS ensemble approach is able to correctly characterise the shape of the forecast uncertainty, exhibiting the same high level of correlation with solar  [12]. Bottom: Observed F10.7 over the validation period. activity and single point forecast RMSE as the CLS model. The main distinction between the models is that the N-BEATS estimation is much narrower than that of CLS, and it should be noted that the same relative behaviour is true of all horizons. In and of itself, this observation cannot be used to form conclusions about the quality of uncertainty estimation in either approach: a narrower uncertainty window is preferable, provided it is not underestimating the uncertainty.
To quantify the relative performance, we consider the proportion of true observed values, , that fall within the estimated uncertainty range,̂± , of each model. In order to perform a comparison, we then measure this performance against the expected fraction for a given confidence level, . For this, we make two main assumptions. First, we assume that the ensemble average is normally distributed according to the central limit theorem [40]. Second, given the low bias of the CLS model, as demonstrated in Fig. 7, we assume that it is an unbiased predictor, and therefore the provided RMSE can be approximated as a standard deviation. Under these assumptions, we can obtain the uncertainty interval for each model for a given confidence level using the probit function [40], which acts as a multiplier to the model standard deviation. The performance of each model for different confidence levels (a reliability diagram), is shown in Fig. 9.
From Fig. 9, it can be seen that the N-BEATS uncertainty estimation is overconfident, falling below the ideal black dashed line, and implying that the spread of ensemble predictions is too narrow. However, this result is not unexpected, as MSE-based deep ensembles have been found to routinely yield overoptimistic uncertainty estimates in the field of deep learning [15].
To account for this, we follow an approach developed for ensemble forecasting for climate predictions [40], which uses a correction, or calibration factor, to increase the width of the uncertainty. Following the principle that RMSE should be equal to the standard deviation for an unbiased predictor, this correction factor, , is derived as the ratio between the RMSE and ensemble standard deviation during the model training period: where ∈ {1, … , } are the time points for which predictions are made over the training set. If this ratio does not vary significantly with time, may be taken as the temporal average over the entire training set. However, we found this not to be the case, with the ratio following the same seasonal trend as the solar flux. As such, we expanded this method in order to obtain a correction factor as a function of predicted solar flux, (̂). For this, factors were calculated over 365-day sliding windows over the training set, and a linear relation found with the mean window flux. The resulting calibrated model, denoted as N-BEATS-C, comprises the same single point predictions (ensemble mean) as the original N-BEATS model, but with a corrected standard deviation, = (̂) . This correction factor was found to work well, with N-BEATS-C providing significantly improved uncertainty estimation over the original E. Stevenson et al. N-BEATS (Fig. 9). From this figure, it can be seen that not only does N-BEATS-C no longer significantly underpredict the uncertainty, but also that it lies closer to the ideal than CLS, which overpredicts the uncertainty. This can be explained from Fig. 6, in which the forecast uncertainties are compared to the single point model RMSE. For an uncertainty estimate to be realistic, it should contain the model RMSE (for legibility, here only that of N-BEATS is shown as it is sufficiently similar to CLS to enable this comparison). It can be seen that the forecast uncertainty of CLS is wider than that of N-BEATS-C, and that although it better encompasses the RMSE for some periods of high solar activity, it overestimates the uncertainty during all periods of low activity. On the other hand, N-BEATS-C has a better estimation of the uncertainty during low activity, and, as can be seen from Fig. 9, therefore better characterises the forecast uncertainty on average, with narrower windows that do not significantly underestimate it.
One caveat to this analysis concerns the validity of the central limit theorem. Each ensemble is formed of 90 members (Section 3.4), whose predictions are not necessarily independent (indeed time series in general often suffer from autocorrelation), nor identically distributed. However, although the reliability diagram shows some non-linearity, suggesting that the data is not normally distributed, there is no major deviation from the linear ideal, serving to provide some further justification for the approach. An alternative approach to correcting for overoptimistic estimates of deep ensembles, by updating the loss functions used during training to those that are able to also capture the quality of the predictive uncertainty of the model (the variance as well as the mean [15]), which do not require these assumptions, will be investigated in future work.

Conclusions and future work
This paper presents the use of the novel N-BEATS deep residual neural network for the daily prediction of the F10.7 solar proxy. This pure deep learning approach, which has provided a significant advancement in the field of time series forecasting within the last year, was found to be effective in this task up to a forecast horizon of 27days, without the need for any specific expert knowledge of the data or feature engineering. Forecasts generated using this deep, univariate approach were compared to a persistence baseline and two operationally available forecasts: BGS (a statistical approach) and CLS (a shallow neural network approach based on 4 flux wavelengths). For this comparison, several metrics were considered in order to obtain a comprehensive and unbiased evaluation of model performance. It was found that the N-BEATS model systematically outperformed the baseline and statistical approaches, and achieved an improved or similar performance to CLS in all metrics, despite only learning from a single flux wavelength. Therefore, not only was N-BEATS found to be a more powerful architecture for predicting the F10.7, in requiring less data to achieve the same level of performance, and consequently reducing the potential sources of uncertainty, but also that the use of different loss functions in the N-BEATS approach lead to reduced training biases.
To capture the uncertainty in the forecasting, the N-BEATS model was additionally extended in this work to provide integrated uncertainty quantification using deep ensembles. Although initially overoptimistic, as is typical for deep ensembles of this type, it was found that applying a correction factor resulted in a superior characterisation of the forecast uncertainty when compared to that of CLS, especially during periods of low solar activity.
In future work, we will work to augment our model with improved uncertainty estimators to further mitigate the propensity of deep ensembles to produce overoptimistic uncertainty estimates, and expand the approach to produce probabilistic forecasts. We will also consider further improvements in the accuracy of the model by including auxiliary variables, such as additional flux wavelengths, which may also aid in correcting the tendency of the model to underpredict the flux at high solar activity. In a further step, this approach could then be used to extend the forecasting to these other variables, for example the F30, in a multivariate approach.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.