Ultra-short-term prediction of LOD using LSTM neural networks

Earth orientation parameters (EOPs) are essential in geodesy, linking the terrestrial and celestial reference frames. Due to the time needed for data processing and combining different space geodetic techniques, EOPs of the highest quality suffer latencies from several days to several weeks. However, real-time EOPs are needed for multiple geodetic and geophysical applications. Predictions of EOPs in the ultra-short term can overcome the latency of EOP products to a certain extent. Traditionally, predictions are performed using statistical methods. With the rapid expansion of computing capacity and data volume, the application of deep learning in geodesy has become increasingly promising in recent years. In particular, the Long Short-Term Memory (LSTM) neural networks, one of the most popular Recurrent Neural Network varieties, are promising for geodetic time series prediction. In this study, we investigate the potential of using LSTM to predict daily length of day (LOD) variations up to ten days in advance, accounting for the contribution of effective angular momentum (EAM). The data are first preprocessed to obtain residuals by combining physical and statistical models. Then, we employ LSTM networks to predict the LOD residuals using both LOD and EAM residuals as input features. Our methods outperform all other state-of-the-art methods in the first eight days with an improvement of up to 43% under the first EOP Prediction Comparison Campaign conditions. In addition, we assess the performance of LOD predictions using more extended time series to consider the improvements of EOP products over the last decade. The results show that extending data volume significantly increases the performance of the methods.


Introduction
Earth orientation parameters (EOPs) describe the motion of the Earth-centered-Earth-fixed reference frame (International Terrestrial Reference Frame, ITRF) w.r.t. the barycentric and barycentric-fixed celestial reference frames (International Celestial Reference Frame, ICRF) (Petit and Luzum 2010). This relation is a function of time and consists of several parameters, including Universal Time (UT1), coordinates of the pole (x, y) of Earth's rotation axis, and celestial pole offsets (dX, dY). Of particular interest is UT1, which is not a uniform time but includes variations due to irregularities in the Earth's rotation period. Two important parameters that The fundamental requisite to understanding the changes in the EOPs is to apply the principle of conservation of angular momentum to the Earth system. Under this principle, two primary reasons for changes in Earth's rotation can be summarized: internal mass redistribution and the exchange of angular momentum between the solid Earth and fluid parts (Gross 2007). In addition, the applied external torques also have substantial impacts, known as tides (Dehant and Mathews 2015). Therefore, understanding the dynamics and redistribution of the fluid layers of the atmosphere, oceans, and the terrestrial hydrosphere is critical for understanding the non-tidal changes of EOPs, caused by the fluid layers of the Earth . The effective angular momentum functions (EAM), including the AAM (atmospheric angular momentum), OAM (oceanic angular momentum), HAM (hydrological angular momentum), and SLAM (sea-level angular momentum), describe the non-tidal geophysical excitations of EOPs Dill 2018, 2019). Among them, the zonal component (z-component) of the AAM dominates the non-tidal changes on a short timescale of LOD with some additional contributions from the ocean and hydrological system (Holme and De Viron 2013). In addition, the angular momentum is typically categorized into mass and motion terms, except SLAM, which only contains the mass term.
There have been many studies on the prediction of LOD time series, which can be divided into two categories, namely only based on LOD data or in a combination of LOD and its geophysical excitation. In the last few decades, most efforts focused on the first category. Many statistical forecasting methods have been applied, including the autoregressive moving average model (Hamdan and Sung 1996) and auto-covariance based procedure (Kosek et al. 1998). The combinations of individual methods were also realized, like the combination of fuzzy-inference system and wavelet filtering (Akyilmaz et al. 2011), which is superior to the purely fuzzy logic method for the ultra-short term. Xu et al. (2012) achieved accurate prediction with mean absolute error (MAE) up to 0.13 ms in the ultra-short term by combining least-squares (LS), autoregressive (AR) model and Kalman filter. Besides, many studies realized the potential application of machine learning algorithms for prediction of EOPs. Schuh et al. (2002) applied an artificial neural network (ANN) to predict polar motion and LOD, which showed comparable results to traditional statistical methods. Other studies using ANNs and their combination with stochastic methods can be found in (Kosek et al. 2005;Liao et al. 2012;Lei et al. 2017). In addition, Lei et al. (2015) applied another type of machine learning algorithm called extreme learning machine, which tried to reduce the computational complexity of the training phase. The first EOP Prediction Comparison Campaign (EOP PCC), organized by the Polish Academy of Sciences, took place from 2005 to 2009, including more than 20 methods (Kalarus et al. 2010). The primary conclusion of the first EOP PCC was that there was no best algorithm for all parameters in all prediction horizons. Besides, it also highlighted that more attention should be put on the contribution of EAM. Freedman et al. (1994) considered the contributions of meteorological AAM forecast data in their Kalman filter model. They concluded that the AAM information could be an important adjunct to geodetic measurements of Earth orientation. In recent years, more and more studies include forecasts of geophysical excitations (AAM or EAM) into EOP predictions (Niedzielski and Kosek 2008;Kosek 2012;Dill et al. 2019;Modiri et al. 2020;Kiani Shahvandi et al. 2022b). Among them, Modiri et al. (2020) combined the singular spectrum analysis (SSA) and Copula theory to predict LOD in the ultra-short term, which reached the state-of-theart results of day-5 to 10 with MAE of 0.06 ms to 0.08 ms, whereas the Kalman filter algorithm (Freedman et al. 1994;Gross et al. 1998) dominated the prediction of the first four days with MAE of 0.04 ms to 0.06 ms. Only a few studies considered the EAM contributions in machine learning algorithms. Wang et al. (2008) incorporated the LOD and AAM prediction series into an ANN model to predict LOD for five days and proved a significant improvement up to 27.6%.
With the availability of more data and computational resources, deep learning approaches have gained colossal attention in recent years because of their high modeling and predicting power on different datasets. The recent achievements, shortcomings and outlooks of machine learning and deep learning approaches in geoscience are discussed by Reichstein et al. (2019) and Bergen et al. (2019). The authors concluded that the application of machine learning in geosciences is very promising and highlight the combination of physical modeling and machine learning approaches as a main future research direction. Motivated by the success of time series forecasting with deep learning, we present the application of this approach in the context of LOD prediction. In particular, we focus on the so-called sequential deep learning models because of the time series nature of LOD data. More specifically, among the sequential models, we use the Long Short-Term Memory (LSTM, Hochreiter and Schmidhuber 1997;Gers et al. 2000) neural network architecture, which is one of the most popular varieties of a Recurrent Neural Network (RNN, Rumelhart et al. 1985), as it can solve the vanishing gradient problem and thus, capture long-term dependencies. LSTM has proven to be highly accurate in the prediction of time series data in geosciences, including land deformation analysis (Pu et al. 2018), fog forecasting (Miao et al. 2020), and ocean temperature prediction . The geodetic community also realizes the potential application of LSTM for predicting geodetic time series (Kitpracha et al. 2019;Kiani Shahvandi and Soja 2021;. Our study focuses on ultra-short-term prediction of LOD using a hybrid modeling approach, which combines physical, statistical and deep learning methods while considering geophysical excitations. We test four varieties of LSTM neural networks, considering both LOD and EAM (AAM+OAM+HAM) data as inputs for LOD prediction. In the preprocessing phase, we remove the secular trend, tidal and seasonal variations of LOD by combining the Savitzky-Golay (SG) filtering (Savitzky and Golay 1964), tidal corrections (Petit and Luzum 2010), and least-squares sinusoidal regression (Brockwell and Davis 2002) to generate the LOD residuals. We also use the sinusoidal regression to remove seasonal variations in the EAM time series (including 6-day and 10-day forecasts) to create the EAM residuals. Then, our networks consider the LOD and EAM residuals as input features to learn from them and predict the LOD residuals in the next ten days. Additionally, signal portions removed in the preprocessing phase (we call them deterministic in the following) are predicted to recover the full LOD signal. We first compare the performance of four LSTM variants with other studies of EOP PCC in a hindcast experiment (1996)(1997)(1998)(1999)(2000)(2001)(2002)(2003)(2004)(2005)(2006)(2007)(2008) under the same conditions. Then, extended data from 1985 to the end of 2020 are considered to study the impact of data volume. In addition, all the experiments are repeated once with AAM and corresponding forecasts instead of EAM to analyze the contributions of oceanic and hydrological systems.
The rest of this paper is organized as follows: Sect. 2 introduces the principles of applied methods. The data, the preprocessing, as well as the prediction strategy, are discussed in Sect. 3. The results and discussions are reported in Sect. 4. Finally, conclusions are stated in Sect. 5.

Savitzky-Golay filter
For the extraction and the modeling of interannual and decadal trend signals, we use a SG filter (Savitzky and Golay 1964), which belongs to the class of finite impulse response filters (Oppenheim and Schafer 2013;Schafer 2011). The SG filter is based on the concept of least-squares curve fitting in successive windows. It resembles the so-called weighted moving average method in which the weights of smoothing differ according to the location in the window. However, to derive the weights, the coefficients of the mentioned curve are determined using the least-squares approach. In mathematical representations, with m being an integer, let the (2m + 1) values in the window around the j th point of the time series y be y j−m , . . . , y j+m . The SG filter tries to minimize the following quadratic function The coefficients a k , k = 0, . . . , q are determined based on y j−m , . . . , y j+m and multiplied with the powers of i = −m, . . . , m to give a weighted moving average. Compared to other filtering methods, the SG filter obtains a good trade-off between waveform smoothing and preservation of important signals and thus, outperforms other filtering methods (Liu et al. 2016). The kernel of a SG filter is only needed to be computed once and the smoothing process is based on discrete convolution (Savitzky and Golay 1964; Schafer 2011), which results in a remarkable computational efficiency. Therefore, SG filters are broadly used by the geoscience community to separate additive noise from signals (Liu et al. 2016;Roy 2020).

LSTM
In this part, we give an overview of the LSTM network architecture (Hochreiter and Schmidhuber 1997). Motivated by resolving the issue in modeling the long-term dependencies in RNNs (Rumelhart et al. 1985), LSTM uses the forget mechanism to prevent overflow of information (Gers et al. 2000). As a result, the so-called problem of vanishing gradients, which is a drawback of RNNs, is avoided. This makes LSTM a powerful method for sequential data prediction, where the order of and dependencies between values are essential. In the following, we review the LSTM architecture.
Definition 1 Let the inputs and targets to the LSTM network be denoted by X = [x 1 , . . . , x n ] and Y = [y 1 , . . . , y n ], respectively, where x i ∈ R s× f and y i ∈ R o×1 . The number of samples, sequence length, number of features and output length are denoted by the integers n, s, f and o, respectively. The basic LSTM network uses 8 groups of weights and 4 groups of biases to map the input x i to the target y i . For this purpose, the temporal components of each input sample x i , which are denoted by x i,t with t = 1, . . . , s, are added recurrently. The mathematical representation of a single LSTM layer is as follows: in which I, F and O are the input, forget, and output gates, respectively, whereas C refers to cell state. h t denotes the so-called hidden states. In addition, the weights are denoted by U and W, biases by b, sigmoid function with σ , hyperbolic tangent function with tanh, and the Hadamard (element-wise) product with .
Using the initial value of h 0 = 0, the input, forget, output gates, and cell input are computed. Then, based on the initial value C 0 = 0, the cell state is updated for the current step. The cell state and output gates are then used for the update of the hidden state. The computed values for hidden and cell states are used for the next steps. The essential steps would have larger hidden state values so that the input, forget and output states can be mapped onto a larger value (closer to 1). The reason is that a value closer to 1 in the sigmoid function can keep the hidden states in the next cell since smaller values (closer to 0) would have value 0 in the hyperbolic tangent function. Therefore, LSTM can effectively capture time series dynamics.
Definition 2 Let the outputs of the neural networks be denoted byŶ. The optimization of the network by the loss function L (Ŷ, Y) refers to the following minimization problem arg min where θ denoting all the trainable parameters including the weights U, W and biases b.

Remark 1
We use the L 1 norm for the optimization, thus Eq.
(3) reads arg min Other choices for the loss function would be the L k , k = 2, 3, ... norms. However, for our problem of LOD prediction the L 1 norm has been proven to perform the best compared to our experiments using other loss functions. Furthermore, evaluation metrics will also be based on the L 1 norm, so it is logical to train the networks trying to minimize this norm. For the optimization process, the Adam optimizer is used to derive the optimal set of weights and biases (Kingma and Ba 2014).

Definition 3
The Encoder-Decoder LSTM (EDLSTM) architecture (Nayak and Ng 2020) is a variant of LSTM for sequence-to-sequence prediction problems where the number of input items differs from the number of output items. The encoder of the network is usually based on LSTM neurons, and the last hidden state of the encoder is considered as the context C. Then, the context C will be repeated o time to fit the dimensions of the outputs, allowing application of sequential modeling in the decoder. As a result, the sequential relationship in the outputs can also be considered.

Definition 4
The CNN-LSTM (Donahue et al. 2015), which was originally designed for visual recognition and description, combines the concepts of Convolutional Neural Network (CNN) and EDLSTM. It introduces a CNN layer before the LSTM layers to further extract patterns from features. In our case, we introduce one-dimensional convolutional layers before the LSTM layers to extract patterns from the input features. The outputs of the CNN layers will be fed into the LSTM layers, which have different features as the original inputs X.
Definition 5 Convolutional LSTM (ConvLSTM), proposed by Shi et al. (2015), is an architecture combining the convolution operator with the LSTM network. The difference between the LSTM and ConvLSTM lies in how the weights are multiplied with the inputs and hidden states. In Con-vLSTM weights are convolved with the inputs, while in LSTM weights are multiplied with the inputs by usual matrix operations. In contrast to CNN-LSTM, which first produces high-level feature maps and then, feeds them into the sequential modeling procedure, the ConvLSTM considers the spatial and temporal correlations jointly.

Evaluation metrics
In this study, we evaluate the performance of our predictions using two metrics, namely the mean absolute error (MAE) and absolute error (AE). The difference between the j-days ahead prediction of the i th sample of LOD is defined as: where Δ obs denotes the observed LOD, Δ pred denotes the predicted LOD, i is the index of date, and j is from 1 to 10 for the ultra-short term. Then, AE is defined as: which shows the absolute difference between the predictions and the observations. To assess the averaged performance, we calculate the MAE: where n p denotes the number of samples in the test set.

LOD time series
In this study, we use the LOD time series provided by the International Earth Rotation and Reference Systems Service (IERS). The EOPs are derived from the combination of observations from different space geodetic techniques with diurnal sampling rate. In the first hindcast experiment, we use the IERS EOP 05C04 time series (Bizouard and Gambis 2009) to compare the performance of our methods with the results of the first EOP PCC under the same conditions. The IERS EOP C04 time series was updated during the first EOP PCC from 97C04 to 05C04, which caused difficulties in the evaluation process. However, the impact on LOD quality was minor (Kalarus et al. 2010), so we do not consider it here. For the subsequent experiments with longer time series, we use IERS EOP 14C04 (Bizouard et al. 2019), which results from the combination of operational EOP series that are consistent with ITRF2014 (Altamimi et al. 2016).

EAM time series
One of the pivotal EAM analysis data sets are provided by the GFZ German Research Center for Geosciences (GFZ). GFZ provides effective angular momentum functions, including AAM, OAM, HAM, and SLAM from 1976 onward (Dobslaw and Dill 2018). In addition, they provide the corresponding 6-day forecast data of the individual components and 90-day prediction data of EAM (AAM+OAM+HAM). As we described in Sect. 1, the z-components have a major impact on LOD. Therefore, we consider the z-components (mass+motion) of EAM, denoted as EAMz in the following, and the 6-day forecast data of EAMz (denoted as EAMz forecasts) as additional features. Additionally, we implemented the same experiments with AAMz-only (including the forecasts) instead of EAMz to assess the contributions of oceanic and hydrological systems. In the following text, we will only describe the approaches focusing on EAMz, since all the processing steps are the same for EAMz and AAMz. A difficulty is that the 6-day forecast data are only archived back to 2016. Therefore, we have to simulate the forecast data for the time interval before 2016. To achieve this goal, we downgrade the observed EAMz data by adding white noise considering the root-mean-square deviation (RMSD) of the 6-day forecast data reported in (Dobslaw and Dill 2019, Figure 8.4) to generate pseudo-forecast data. To verify the quality of the synthetic data, we generated the pseudo-forecast data also for 2016 onward and compared them with the real forecast data. The noise levels agree with each other. Nevertheless, the actual errors are not purely white noise, which may harm the performance of our networks (Dill et al. 2021).
Since 2021, ETH Zurich has also investigated EAM forecasts based on GFZ EAM products to improve the accuracy and prediction horizon (Kiani Shahvandi et al. 2022a). The operational 14-day forecasts of the complete set of EAM components, based on first-order neural ordinary differential equations, are available from the Geodetic Prediction Center 1 at ETH Zurich (GPC, Soja et al. 2022). The details about these EAM forecasting algorithms can be found in Appendix A. Therefore, we also implemented the experiments with the ETH EAM forecasts as features instead of GFZ EAM forecasts, which also helps to analyze the impact of longer EAM forecasts. The 10-day forecasts of ETH EAMz and AAMz are used since the errors of the forecasts on the latter days increase steadily to a point where the longer forecasts have negative impacts on the LOD predictions. Here, we have a similar problem as for the GFZ forecasts: the real EAM forecasts are only archived back to 2021. Therefore, we generate the pseudo-forecasts following the same approach. All the used RMSD for generating pseudo-forecasts are reported in the supplementary information.

Splitting of data sets
For training the networks, we divide the whole dataset into training, validation, and test sets. Figure 1 shows the LOD datasets used in this study. For the first experiment, the training and validation data sets start on 1996-09-01 and ends on 2005-09-30, whereas the test set starts on 2005-10-01 and ends on 2008-10-31, as depicted in Fig. 1a. However, not all the daily data in this time range will be used for the test set because the submission of predictions for the first EOP PCC was only on Thursdays. Therefore, there are 161 test samples in total. The splitting of the training and validation sets is not based on the date but on the number of samples. Here, we have 3000 training samples and 248 validation samples. For the other experiments that rely on the extended data, the training and validation data sets start on the beginning of 1985 and the test set starts on 2016-01-01 and ends on 2020-12-31, as shown in Fig. 1b. In this experiment, the models predict the LOD every day. Therefore, we have 1827 test samples in total. The training and validation sets are also split based on number of samples. We have 9025 (80%) training and 2026 (20%) validation samples, respectively. Figure 2 shows the data processing workflow used in this study. Since we can treat some parts of the LOD-contributing signals as deterministic, we subtracted them prior to the application of LSTM and restored them afterward. Since the inputs to LSTM neural networks should be detrended, standardized, and normalized (Goodfellow et al. 2016), the preprocessing approach is necessary and beneficial. We first remove the secular trend and known signals from the LOD time series by combining the SG filter, tidal corrections and least-squares adjustment to generate the LOD residuals (Δ R ). Then, we use LSTM networks to predict Δ R .

Data processing
As described by Gross (2007), the LOD time series contains decadal, interannual, seasonal, and interseasonal signals. Besides, a secular trend of around 1.8 ms/century is observed. Since the tidal deformations (zonal tides) can explain a large part of the LOD signals, we first removed the tidal variations (Δ T ) following the IERS conventions 2010 (Petit and Luzum 2010) and obtained LOD without tidal variations Δ 1 . For computing Δ 2 , we removed the decadal trend of LOD (Δ S ) from Δ 1 by applying the SG filter with the polynomial order of 2 and the window length of 3653. The motivation for this choice is to optimally capture the long-term trend using a low-order nonlinear polynomial and obtain a smooth Δ S . However, in the first experiment using data from 1996 to 2005, the length of the LOD time series is shorter and the secular trend within around ten years is easier to model. Thus, we use a quadratic function to model the secular trend for EOP PCC instead of the SG filter. We further applied the least-squares sinusoidal regression to estimate the seasonal signals with annual, semiannual and terannual periods (Δ P ). By further removing Δ P from Δ 2 , we received the LOD residuals (Δ R ).
Before applying a similar strategy to the EAM observations and forecasts to obtain residuals, we should project them  to similar value ranges as the LOD data to reduce training difficulties. Since the used EAM functions from both sources are in the form of non-tidal normalized EAM Dill 2018, 2019), they are unitless and highly correlated with the LOD after removing tidal effects and secular trend. Therefore, we cannot apply geophysical equations to convert EAMz into LOD but should use a simple empirical relationship to convert the unit and obtain a similar value range: which agrees with the equation given by Gross (2007). Then, the least-square sinusoidal regression was applied to the EAMz time series to remove the seasonal signals with the same frequencies as for LOD. In the end, we obtained both LOD and EAMz residuals, which are in the interval of ±1 ms and highly correlated with each other. Both Δ R and EAMz R will later enter the LSTM neural networks as input features. To preprocess the test set, we always concatenated the 'new' data with the training and validation set and reran the preprocessing process, which caused edge effects as a consequence of the filtering step. To overcome this issue, we applied an additional linear offset adjustment as described in Appendix B.
In summary, Fig. 3 shows the whole preprocessing progress for LOD and corresponding modelled signals using the training and validation set of the experiments with extended data. The right column of Fig. 3 shows the modelled signals, and the left column shows the corresponding remaining signals before and after removing the modelled parts. The outputs of the preprocessing approach (left-bottom plot) will be used as the inputs of the LSTM neural networks. After the LSTM-prediction step, the predictions of the deterministic signals (right-hand-side of Fig. 3) are added toΔ R (Fig. 2) in order to produce the predictions of LOD.

Prediction of LOD residuals using LSTM neural networks
After preprocessing, we reshaped the whole time series into samples and divided them into training, validation, and test sets. In order to tune the hyperparameters and choose the best structure of networks, we applied the grid search strategy and tenfold cross-validation (Goodfellow et al. 2016). In this step, we tested different input sequence lengths for the prediction of ten days in future and concluded that the best results were obtained by considering 30 previous days as inputs. Another motivation for using 30 previous days as input is to model the monthly and biweekly LOD signals, which are not explicitly removed beforehand, using the networks. We reshaped the whole time series into three-dimensional tensors with the dimensions of n × s × f to fulfill the requirement of LSTM networks (Sect. 2.2), where n is the number of samples as reported in Sect. 3.1, s is the length of input sequence, and f is the number of features. The features of each input epoch are the LOD residuals, the observed EAMz residuals, and the forecasted EAMz residuals, respectively. The LSTM, EDL-STM, and CNN-LSTM networks take this three-dimensional matrix as input. However, this matrix has to be reshaped again to fulfill the requirement of the ConvLSTM network. In order to prevent overfitting, we applied an early-stopping strategy: we monitored the loss of the validation set and stopped the training phase when the loss stops decreasing or starts increasing. We noticed that the limited data volume under the conditions of the first EOP PCC is not enough to train our networks so that they can distinguish different prediction days. Therefore, we implemented ten separate networks to predict the ten different days in future to reduce this issue to a certain degree. In the second and third experiments, one network was sufficient to predict all ten days of LOD because a significantly larger amount of data was available. We used TensorFlow V2.4.0 (Abadi et al. 2015) to implement all the networks as well as the training, validation, and test processes. Figure 4 shows the structure of the EDLSTM network for the experiments with longer time series. The first LSTM layer is the encoder section, which is responsible for interpreting the input sequence. This layer takes the input tensor ∈ R 30× f and outputs a non-sequence context C ∈ R o with o denoting the output dimension. Then, the RepeatVector layer repeats the context so that the required length of the output is fulfilled, followed by the decoder section that consists of two LSTM layers. In the end, two Dense layers map the outputs of the decoder section with the outputs. Since the output of the decoder section is a sequence, the Dense layers have to be wrapped using the TimeDistributed wrapper. The structures of all the other networks are reported in the supplementary information.

Prediction of deterministic components
In addition to predicting the LOD residuals with the LSTM networks, we have to predict the deterministic components of LOD. During the preprocessing steps described in Sect. 3.2, we obtained functional forms to describe the tidal variations (IERS convention 2010, Petit and Luzum 2010) and the parameters of the seasonal signals (least-squares sinusoidal regression). Therefore, we could extrapolate these two components into the future to obtainΔ T andΔ P . Since only 30 previous days were considered as inputs in our networks, the trend function in this short time interval was usually linear, sometimes with a slight curvature. Therefore, we used the PCHIP (Piecewise Cubic Hermite Interpolating Polynomial) extrapolation (Fritsch and Carlson 1980) to precisely predict the secular trend.

Setups of experiments
In this study, we performed three experiments to evaluate the performance of our proposed methods comprehensively. The basic setups of the experiments are summarized in Table 1. First, we implemented a hindcast experiment to compare the performance of our methods to the other studies of the first EOP PCC. In an operational setting, we cannot use the IERS final products as our inputs due to the latency of one month and would have to resort to the rapid products which have a latency of two days for LOD. However, the historic rapid products are not fully archived once the final products are available. Therefore, we can only use the final products to train our networks and compute the predictions. The latency Jan. 2017-Dec. 2020 (n, 10) One network for 10 days of two days was still considered to provide the most realistic results. For the following experiments, we expanded the data volume to investigate the benefits of longer time series on the training process and the prediction accuracy that would be achievable nowadays. In the second experiment, we considered both the longer time series as well as the latency of input data. Therefore, the reported quality should be operationally achievable. In the final experiment, we studied the most optimistic accuracy by considering the optimal data availability without any latency to assess the theoretical limits of the proposed methods.

Results and discussion
We have generated predictions using the four LSTM variants in all three experiments. The results for the individual methods are reported in the supplementary information. We observed that the differences between the four variants are minor while the EDLSTM networks slightly outperform the others in most of the cases, with the lowest number of trainable parameters. Therefore, we conclude that the key to LOD prediction problem is the LSTM neuron rather than the different structures in encoders. We will thus only report the results using EDLSTM networks in this section and focus on the impacts of AAM and EAM and their forecasting horizons.

First experiment
In the first hindcast experiment, our study was carried out under the same conditions as for the first EOP PCC (Kalarus et al. 2010), resulting in a relatively fair comparison with others. The splitting of training, validation and test sets is reported in Sect. 3.1.3. We consider the latency of the IERS LOD rapid products, typically two days. Therefore, we generate LOD predictions for the next twelve days and take the latter ten predictions as our 10-day predictions. In this context, the 1-day delay of EAM forecasts has no impact. Figure 5 shows the MAE of our results and compares them to other studies of EOP PCC and the study by Modiri et al. (2020), which is the state-of-the-art study for days 5-10 (Modiri et al. 2020, Table 1 with Modiri et al. (2020), we note that the EAM forecasts are generated differently and based on different data sources. Moreover, the ways to handle the latencies of the underlying products are also different (Modiri, personal communication). This comparison proves the promising short-term performance of our method. Using the GFZ 6-day forecasts, we achieve the best results over the first four days. Afterward, the accuracy of our methods decreases more rapidly due to a lack of EAM forecasts, indicating the importance of the EAM forecasts for our purpose. Therefore, the ETH 10-day EAM forecasts significantly increase the quality of LOD predictions with an average improvement of 29% compared to using only 6-day EAM forecasts. As a result, our method outperforms other studies in the first eight days with a maximum improvement of 43% on day 3, whereas the method of Modiri et al. (2020) is still slightly better for days 9 and 10. Moreover, we observe steadily better performance if the networks using EAM instead of AAM with an average improvement of 10% for ETH 10-day forecasts, which proves the important contributions of the oceanic and hydrological systems to the LOD variations. Figure 6 shows the AEs of the individual predicted days of all the test samples between 2005 and 2008. Due to the small amount of data, the networks are not extensively trained, which leads to more noisy patterns starting from day 5. When

Second experiment
For this experiment, we use data from 1985 to 2016 to train our networks. In order to obtain results under operational conditions, we also consider a LOD latency of two days, which means we again generate 12-day-ahead predictions and consider the last 10 days as the predictions in the future. The networks benefit from the more extensive data volume and therefore achieve a better generalization. As a consequence, we are able to determine all twelve outputs from one network rather than twelve separate networks as in the first experiment. Figure 7 shows the MAEs of our results with the numerical values shown in the supplementary information. We note that the comparison with the results from the first EOP PCC is not entirely fair since the expanded data volume and improvements in the EOP time series (from EOP 05C04 to EOP 14C04) positively impact the results. The goal of plotting the results of other studies is to give an intuition about the performance of our methods. The more extended amount of data enables us to train one network with twelve outputs simultaneously, which results in more stable predictions. The contributions of more extended EAM forecasts are significant since the average MAE of the results from the network using 10-day EAM forecasts is about 32% lower than from the network using 6-day fore- casts. As a result, the network with 10-day EAM forecasts can provide high-quality 10-day LOD predictions with MAE of about 0.08 ms. However, the benefits of using EAM are not ensured for the latter days since we can also observe that the network using 10-day EAM forecasts can only give comparable or even worse results compared to the network using 10-day AAM predictions. A possible reason for this degradation is the higher forecasting errors of AAM, OAM and HAM. Since the EAM time series is the sum of the three terms, the forecasting errors of all three sources also propagate into the EAM forecasts. Therefore, the signal-to-noise ratios for the latter days are worse than the earlier days, and the network cannot effectively benefit from the meaningful signals.  Figure 8 shows the AEs of individual prediction days of all the samples in the test set of the second experiment. We observe outliers at the beginning of 2016 and 2020, which are likely caused by the biases in the GFZ 6-day forecasts (Dill et al. 2021). Due to the existence of these biases, the distribution of the generated pseudo-forecasts of EAM before 2016 differs from the distribution of the actual forecasts, which can cause a systematic error in our LOD predictions. In contrast, no noticeable irregularity is detected from the results using ETH 10-day predictions. However, this report could be slightly over-optimistic since no real EAM forecasts are available in this time interval and, therefore, the pseudo-predictions are used, which optimistically assumes the identical distributions of the training and test sets. In future work, we will continuously monitor the performance of our EAM forecasts to analyze the distributions of errors.

Third experiment
In the third experiment, we want to explore the achievable accuracy when using our methods under optimal conditions without the latency of the LOD and EAM observations. There are three motivations behind this experiment: First, it is crucial to fill in the missing values of LOD caused by the delay to receive the continuous time series, even though these missing values are not in the future. Secondly, we want to analyze the potential benefits if the processing time of LOD observations can be reduced. Furthermore, it is also more trivial to understand the correlation between EAM forecasts and LOD predictions since the time shift of two days does not exist anymore. Figure 9 shows the MAEs of the results with the numbers shown in the supplementary information. As the evidence in Fig. 9, the networks provide accurate LOD predictions with MAEs under 0.1 ms using GFZ 6-day EAM forecasts and under 0.06 ms using ETH 10-day EAM forecasts. We notice that the results shown in Fig. 9 are not fully equivalent to the ones shown in Fig. 7 shifted by two days. Because the networks only have ten outputs instead of twelve, the losses changed as well, see Eq. (4). The higher degradation of predictions based on 6-day EAM forecasts demonstrates the importance of longer EAM forecasts for our task. In contrast, MAEs of the predictions based on 10-day EAM forecasts increase more slowly due to the availability of EAM forecasts for the whole prediction horizon. The slightly higher gradient on latter days can be explained by the increasing uncertainties of the EAM forecasts on latter days. The impact of the uncertainties in EAM forecasts can also be observed by the convergence of MAEs using EAM and AAM. As we discussed in the second experiment, the superiority of EAM forecasts decreases with predicting horizon as the forecasting errors of EAM forecasts are higher than that of AAM forecasts. Figure 10 shows the optimistically achievable AE of individual prediction days of all the test samples between 2016 and the end of 2020 for the third experiment. We observe that most of the significant errors during the latter days are reduced in all four cases. However, the problems caused by the biases in the GFZ forecast are still observable at the beginning of 2016 and 2020. The results for ETH 10-day forecasts of AAM and EAM are similarly homogeneous.

Feature analysis
In this section, we discuss the impact of various features and different parts of the input sequences. The deep learning models have a black-box nature to a certain extent because they can, in principle, approximate any function, but it is difficult to understand the internal structure of the trained networks. Unlike other machine learning methods (e.g., random forests), there is no simple method to describe the importance of features in LSTM neural networks. To address this issue, we implemented empirical experiments, namely modifying the inputs by randomly adding or removing values equal to 30% of their magnitudes, followed by an investigation of the impact on the resulting MAE changes. The motivation for changing the magnitude of input values by 30% is that the changes are significant enough for us to learn the consequences. At the same time, the changes are also not too large to significantly impact the distribution of the features, which would hurt the performance of neural networks. We repeated this process 1000 times with different random seeds and computed the averaged MAE of these 1000 MAEs to obtain a general conclusion. Then, the relative changes of MAE are computed using the equation: We performed this analysis based on the networks in the third experiment. In this case, there is no temporal gap between the EAM forecasts and LOD observations, which simplifies the interpretation of the temporal relationship between the features and outputs.
First, we changed 30% of individual features on all the 30 input days to study the impact of different features. Figure 11 shows the resulting changes of MAE in percent on a logarithmic scale. As expected, the LOD observations are the most prominent features because they provide direct information related to our targets. Beside LOD, the analysis data of EAM are also important, especially for the first days, although they do not offer any horizon in the future. The critical contribution of the analysis data is to provide information about misalignments between the geophysical excitations and the LODs. Therefore, the EAM forecast data can contribute more to the LOD predictions by including the analysis data. Note the diagonal structure in Fig. 11, which outlines the impact of the EAM forecasts on the prediction quality, meaning the EAM forecasts of the day i have the most significant impact on the LOD predictions around day i, as expected. Nevertheless, the EAM forecast data on the following days also impact the LOD prediction of previous days. This effect can also be observed by comparing the results using 10-day and 6- Including EAM forecasts of latter days do not only improve the accuracy of the predictions after day six but also affect the predictions in the first six days positively. This finding is counterintuitive since there cannot exist a causal relationship between the EAM forecasts on latter days and LOD on earlier days. The reason is that the EAM forecasts are fed into the networks as separate features, while the LSTM networks do not consider the sequence between different features. Therefore, the networks benefit from the high correlation between input features. In view of causality, the observations of earlier days do not depend on the EAM of the next days; however, they are highly correlated, and the network can take advantage of this effect. Therefore, information from the future can also be helpful for the networks to predict the previous days. For a second run, we changed 30% of all features on individual input days and computed the relative change of MAE to study the impact of different input days, shown in Fig. 12 on a logarithmic scale. The features on the last input day dominate the prediction process since this input day contains the most recent information. The context is the last output of the encoder (Fig. 4), where the inputs on the last day only go through the nonlinear activation functions once. Therefore, the outputs are most sensitive to the last input epoch. Due to the existence of the forget gate in LSTM neurons, the high-frequency variations of inputs on previous epochs are more likely discarded, and only the major signals are restored. Therefore, we observe that the predictions are not sensitive to the 30% change in inputs from more than one week ago. However, there is still a visible impact of inputs from a week ago on the predictions of days 3-6, which may roughly indicate the known 14-day periods of the LOD time series.

Conclusions
In order to accommodate the need for real-time EOPs of the highest quality in geodetic applications, precise prediction of EOPs, especially in the ultra-short term, is necessary. In this study, we employed one of the most popular variants of RNN called LSTM to predict LOD for the next ten days, in accordance with the definition of ultra-short-term in the first EOP PCC. To obtain highly accurate predictions, we implemented a hybrid modeling approach, which combines physical, statistical, and deep learning modeling. First, we corrected the tidal effects and estimated the deterministic components (long-term trend, seasonal variations), and removed them from the LOD time series to generate LOD residuals. We also removed the seasonal signals from the geophysical excitations (including forecasts) to generate EAM and AAM residuals. Both LOD and EAM/AAM residual time series were considered as the inputs of the LSTM neural networks. Compared with the state-of-the-art studies, we achieved significant improvements up to 43% under the comparable conditions of the first EOP PCC and reduced the MAE further by including more data. Overall, accurate ultra-short-term LOD predictions with MAE below 0.08 ms are achieved in the realistic scenario. We investigated the predicting accuracy without considering the data latency. In this scenario, our method achieves MAE under 0.06 ms by day ten. Moreover, we discussed the contributions of atmospheric, oceanic and hydrological systems to LOD as well as the impact of their forecasting horizon. We conclude that the atmospheric contributions are the most dominant, while including the oceanic and hydrological systems also has clear benefits, with an additional MAE reduction of around 10%. The drawback of the EAM time series is its higher forecast-ing error level inherited from the three components (AAM, OAM, HAM), which impedes its contribution in the latter days. The feature analysis shows that the most crucial feature is LOD itself with clear contributions of EAM forecasts. An interesting fact is that the forecasts on latter days also impact predictions of previous days. With this finding, we can explain the reason for the better predictions on the first days while including longer EAM forecast data. Besides, this result also indicates the potential improvement by including more extended EAM forecast data, even after the 10th day. However, the rapidly decreasing quality of EAM forecast data hinders this approach.
In this study, we proved the advantages of LSTM neural networks for LOD prediction and the benefits of large data volumes for this approach. With the help of LSTM neural networks, we can provide reliable LOD predictions in the ultra-short term operationally, from which many realtime geodetic and geophysical applications can benefit. We conclude that more attention should be paid to including the geophysical excitations in the machine-learning-based LOD prediction framework. The higher errors caused by the summation of the atmospheric, oceanic and hydrological systems should be investigated further. This study also confirms the benefits of accurate EAM forecasts to LOD predictions, which should also apply to other EOPs, such as polar motion. We identified two needs in further improving EOP predictions: First, more attention should be paid to employing and testing other machine learning approaches for predicting EOPs. In this regard, suitable techniques may vary among different EOPs due to the different characteristics of the data. Second, improving the quality of rapid EOP products is also crucial, since the operational prediction must rely on them. In the end, more accurate EAM forecasts with a more extended forecasting horizon also have the potential to improve the quality of EOP predictions significantly.
Author Contributions Junyang Gou implemented the experiments, did the analysis and prepared the original draft. Mostafa Kiani Shahvandi provided the 10-day EAM forecasts, helped with the methodology and prepared the original draft. Roland Hohensinn and Benedikt Soja proposed the topic, supervised the study and revised the draft. All authors worked on the theoretical considerations, discussed the results, and contributed to the final manuscript.
Funding Open access funding provided by Swiss Federal Institute of Technology Zurich.

Data Availability
The used EOPs data are available at the IERS FTP server (ftp://ftp.iers.org/products/eop/). The used EAM data are available at the GFZ FTP server (ftp://esmdata.gfz-potsdam.de/EAM/). The 14-day EAM forecasts as well as the LOD predictions using the EDL-STM network provided by ETH Zurich can be found at the Geodetic Prediction Center (https://gpc.ethz.ch/). All the other datasets and mod-els generated and/or analyzed during the current study are available from the corresponding author on reasonable request. Declarations f t contains six components f t,1:6 corresponding to the 6-day forecasts. The RNN cell with parameters W 2 describes the differential equation. In the end, the resulting forecasting vector r, including the 14-day forecasts at time epochs t + 1 to t + 14, can be obtained by a linear transformation from the hidden states h t+1 with parameters b 3 , W 3 .
The operational quality control at GPC shows that ETH EAM forecasts are slightly more accurate than GFZ EAM forecasts in the first six days. On days 7 to 14, the predictions show good agreement with the corresponding observations, but the forecasting errors accumulate more rapidly. In our experiments, we observed that the EAM forecasts after day 10 negatively impacted the LOD predictions due to their higher noise level. As a result, we only use the first ten days of EAM forecasts in this study. The RMSDs used to generate the pseudo-forecasts are shown in Table 1 of the supplementary information.

Appendix B: Preprocessing using the Savitzky-Golay filter
In this study, we choose the SG filter to model and remove the secular trend in the LOD time series. For the time series from 1985 to 2015, we use the frame length of 3653 with order 2. However, in the first experiment using data from 1996 to 2005, the length of the LOD time series is shorter and the secular trend within around ten years is easier to model. Thus, we use a quadratic function to model the secular trend for EOP PCC instead of the SG filter.
To preprocess the test set, we always concatenate the 'new' data with the training and validation set, and rerun the preprocessing process. However, there is a boundary issue of the SG filter in the last windows since the number of data is smaller than the length of window, see Fig. 13. Therefore, the resulting LOD residuals sometimes differ from the EAMz residuals. To solve this issue, we estimate the difference of Δ 2 and EAMz within the input interval (30 days) linearly by applying one more least-squares adjustment. Then, we remove this difference from Δ 2 and add this in Δ S to generate Δ 2 and Δ S as the following: where b is the estimated linear offset between Δ 2 and EAMz. Then, we can estimate the seasonal signals based on Δ 2 for the test dataset and generate the corresponding Δ R . In the end, the LOD residuals and EAMz residuals are highly correlated. Figure 14 shows an example of the test set in the second experiment. The dashed blue line shows Δ R , which is the original LOD residuals without any linear offset correction.
Since we can detect a significant linear offset between Δ R and EAMz residuals, the LSTM neural networks cannot provide correct predictions relying on these two features. However, after removing the linear offset, we get Δ R (blue line), which is highly correlated with the EAMz residuals. The remaining differences are relatively small and mainly caused by data noise and the remaining low-frequency Earth's interior signals, which are potentially insufficiently modeled by the SG filter. Therefore, the LSTM neural networks can provide accurate predictions based on Δ R and EAMz residuals.