Nonlinear prediction of functional time series

We propose a nonlinear prediction (NOP) method for functional time series. Conventional methods for functional time series are mainly based on functional principal component analysis or functional regression models. These approaches rely on stationarity or linearity assumptions about the functional time series. However, real data sets are often nonstationary, and the temporal dependence between trajectories cannot be captured by linear models. Conventional methods are also difficult to extend to multivariate functional time series. To tackle these challenges, the NOP method employs a nonlinear mapping for functional data that can be directly applied to multivariate functions without any preprocessing step. The NOP method constructs the feature space with forecast information, hence it provides a better ground for predicting future trajectories. The NOP method avoids calculating covariance functions and enables online estimation and prediction. We examine the finite sample performance of the NOP method with simulation studies that consider linear, nonlinear and nonstationary functional time series. The NOP method shows superior prediction performance in comparison with the conventional methods. Three real applications demonstrate the advantages of the NOP method in predicting air quality, electricity price and mortality rate.


INTRODUCTION
Functional data analysis (FDA) is a growing statistical field for analyzing longitudinal trajectories, curves, images or any manifold objects. One unique feature is that FDA treats each random function observed over the whole domain as a sample element. Functional data are commonly found in many applications such as fitness data from wearable devices, air pollution, longitudinal studies, time-course gene expressions and brain images. Traditionally, functional data are assumed to be independent. When functional data are observed at a sequence of time points, they exhibit temporal correlation and are called a functional time series. In this article, we propose a nonlinear forecast model for functional time series.
One conventional framework for analyzing functional time series is based on functional principal component analysis (FPCA). FPCA is a popular method to perform dimension reduction for functional data. It provides low-dimensional feature representations for individual trajectories, and these features can be used to facilitate subsequent modeling tasks. FPCA finds major sources of variation in functional data, which are quantified as functional principal components (FPCs). These FPCs can span a feature space onto which individual functional data are projected. That is, individual trajectories are expressed by a linear combination of FPCs. The coefficients in the linear combinations are FPC scores, which are the projections of individual functional data onto the space spanned by FPCs. By selecting a fixed number of FPCs, we can obtain a low-dimensional space to project functional data and use the obtained FPC scores to represent the original functional data. Building subsequent models based on FPC scores is more efficient than using the original functional data. Textbooks by Ramsay and Silverman (2005) and Vieu and Ferraty (2006) give a good summary of both FDA and FPCA. The theoretical properties of FPCA have been well investigated by Dauxois et al. (1982), Hall and Hosseini-Nasab (2005), and Hall et al. (2006). Standard FPCA methods focus on finding the major modes of variation among functional data, and the resulting FPCs do not have easy interpretations. Chen and Lei (2015), Lin et al. (2015), and Nie and Cao (2020) imposed sparsity constraints to obtain more interpretable FPCs. Sang et al. (2017) introduced a parametric approach to estimate top FPCs to enhance their interpretability. Nie et al. (2018) proposed supervised FPCA methods by considering the response information. Shi et al. (2021) introduced the informatively missing FPCA method for applications where the longitudinal trajectories are subject to informative missingness.
Shi and Cao (2022) developed a robust method based on a new regression framework when the functional data have outliers. In addition, Dai and Müller (2018) considered functional data defined on some Riemannian manifolds where standard FPCA is not applicable. It is common in applications that individual curves are not fully observed. Nie et al. (2022) proposed a sparse orthonormal approximation method to recover the underlying trajectories from sparse and irregular functional data.
FPCA has been well studied for univariate functional data, but cannot be directly applied to multivariate trajectories. The standard approach for dealing with multivariate functional data was introduced by Ramsay and Silverman (2005). It suggests joining multivariate functions into a univariate one, that is, connecting the starting points and endpoints of trajectories from all functional variables. Standard FPCA can then be applied to the resulting univariate functions. Following this approach, Chiou et al. (2014) incorporated a normalization approach to balance the degrees of variation across different functional variables. In the meantime, Happ and Greven (2018) investigated the case where trajectories from different functional variables are observed over different domains. Berrendero et al. (2011) pursued another approach by applying principal component analysis to the observed multivariate trajectories at the available observation points, and then connecting the obtained principal components across observation points to form the FPCs.
Functional regression models are another class of methods used for modeling functional time series. Ramsay and Silverman (2005) presents a good summary of regression models dealing with scalar and functional responses, where the latter case suits our interest in functional time series. Morris (2015) reviewed the applications and developments of functional regression models. Regression models for functions can be simplified by projecting functional data onto smooth eigenfunctions. Yao et al. (2005) utilized such an approach for estimating functional linear models, and Müller and Stadtmüller (2005) extended the method to generalized functional linear models. Kong et al. (2018) introduced a robust estimation method for functional linear models. James et al. (2000) proposed principal component models for sparse functional data. Guo (2002) considered mixed effects models where both the fixed and random effects are functions. Scheipl et al. (2013) extended mixed effects models for multiple functional predictors. Furthermore, Fan et al. (2015) and Müller (2008) studied functional additive models for both scalar and functional responses.
Both FPCA and functional regression models have been applied to analyze functional time series and establish prediction models for future trajectories. Aue et al. (2015) and Hyndman and Ullah (2007) proposed to forecast future trajectories based on the predicted functional principal component scores. Hyndman and Shang (2009) and Kargin and Onatski (2008) used a regression model by treating the future trajectory as the response and the last observed one as the predictor, which is sometimes equivalent to applying an autoregressive operator to curves. Jiao et al. (2019) further refined the predictions by using partially observed future trajectories. These methods depend on the stationarity and linearity of the functional time series. Regarding the theory of functional time series, Bosq (2000) discussed theoretical properties of time series in function spaces. Later, Horvath et al. (2013) and Horvath et al. (2014) investigated the stationarity of functional time series and whether it has the same mean function over time. By framing the functional time series as an autoregressive model, Hörmann and Kokoszka (2010) illustrated how to define and check the length of the temporal dependence of the series.
There are some recent advances in the analysis and prediction of functional time series. Hörmann et al. (2015) proposed to use the spectral density operator and obtained dynamic FPC scores instead of static ones. Elayouty et al. (2022) analyzed nonstationary functional time series by considering a dynamic functional principal component analysis. In addition, Gao et al. (2019) addressed the scenario where the functional time series is multivariate or high-dimensional. Elias et al. (2022) used a nonparametric approach to predict future curves from the nearest observed ones. Furthermore, Alvarez-Liebana (2017) outlined a review of other approaches to analyzing functional time series.
The existing methods for predicting functional time series are mainly based on FPCA and functional regression models. There are several limitations to these types of approaches. First, these approaches often operate under the stationarity assumption of functional time series. For FPCA-based approaches, this assumption specifies a constant mean and covariance function for all the trajectories from the observed series and future ones. A prediction model is then established in the space defined by FPCs, and we can forecast future trajectories by their predicted FPC scores. One limitation of this approach is related to the validity of the stationarity assumption. Real data often do not follow a stationary process, hence the functional principal subspace and FPC scores cannot provide satisfactory predictions when the assumption is violated. Therefore, the prediction performance greatly depends on how well the stationarity assumption holds for the observed series and future time points. In fact, both the mean and covariance function may vary over time and be influenced by covariates that also depend on time. For example, the observed trajectories of electricity prices on a given day may have seasonal or weekend/weekday effects. The second limitation of FPCA-based approaches is whether the FPC scores are the optimal representations of functional data with respect to the prediction task. FPC scores only contain information about variations among functional covariates, and that information may not translate well to other models. The third limitation of FPCA-based approaches is that FPC scores are linear projections that cannot capture nonlinear structures in functional data. Consequently, we need to pursue other representation models defining nonlinear mappings rather than linear projections. A further limitation of regression-based approaches is also related to the nonlinearity in functional data.
In this case, we need to consider the nonlinear temporal dependence between trajectories. Regression-based approaches use a linear operator to specify the dependence and cannot address more complex relationships between trajectories.
In addition, both FPCA and regression-based approaches are limited to univariate functional time series. The aforementioned limitations become more influential in a multivariate setting because we have to consider linearity and stationarity across dimensions. One standard technique for dealing with multivariate functions is to connect the functions into a univariate one and apply FDA methods to the resulting univariate function. As a result, the predictions at the boundaries of each function are often unsatisfactory. This can become a significant drawback when we are predicting multivariate functional time series. Moreover, traditional prediction models are often fitted to a fixed data set, whereas we may need to consider the case where data are streamed or continuously updated. When the data size is large or new trajectories are continuously observed, re-estimating the model can be computationally burdensome if it includes the calculation of covariance functions. Therefore, predicting functional time series in an online setting is preferable, such that both the model and the predictions can be updated whenever a new observation becomes available.
In this article, we have selected two real environmental data sets that are challenging for the existing functional time series methodologies. The first one includes measurements of chemical pollution in daily air quality. It is a multivariate functional time series where the temporal dependencies between days exhibit nonstationary behaviors and nonlinear relationships. Furthermore, such data often include seasonal effects, which further violate the stationarity assumption required by standard FDA methods. In the application section, we will demonstrate that the proposed method, which is designed not to rely on the stationarity or linearity of the functional time series, outperforms existing methodologies in predicting future data. The second data set is the daily electricity price of northern Italy. This data set is more unstable than the air quality data: it can easily exhibit fluctuations or sudden drops due to human or environmental factors. For example, the price can drop abruptly on weekends or holidays, departing from the general trend of the previous days. Our method is able to consider external factors in addition to the intrinsic temporal dependencies of the functional time series, hence it can produce better predictions for such functional time series. These two examples demonstrate the importance of the proposed method for environmental data sets, which usually exhibit complex dependence and nonstationary structures due to factors internal and external to the system under study.
In this article, we propose a nonlinear prediction (NOP) model for functional time series. It is able to predict future trajectories that are either univariate or multivariate. The proposed model addresses the aforementioned limitations and has the following advantages over traditional FDA methods. First, the major advantage of our work is avoiding the calculation of covariance functions. The motivation is to construct a latent feature space that does not depend on the decomposition of covariance functions. Instead, we propose to use a direct nonlinear function for mapping from the data space (containing curves in the functional time series) to a latent feature space. Assuming that the feature space is low-dimensional, any observed trajectory is compressed into a vector that preserves as much information as possible. The nonlinear mapping is more flexible for capturing nonlinear structures in functional data. Moreover, we expect the latent feature space to capture the evolution of trajectories and to be less dependent on the stationarity of the functional time series. Because explicitly specifying such a nonlinear mapping can be exhausting, we employ neural networks as a powerful function approximation tool to obtain it. Furthermore, we allow the latent feature space to be constructed based on prediction information, whereas the functional principal subspace only focuses on variations among functional covariates. Hence, the obtained latent feature space can provide a better ground for predicting future trajectories.
One advantage of using a nonlinear mapping is that it can be directly applied to multivariate functions, which makes the proposed NOP model more suitable for multivariate functional time series. In addition to applying a nonlinear mapping to trajectories, we introduce a nonlinear function to model the temporal relationship between trajectories through their vector representations. This nonlinear function is similarly approximated by neural networks, and it is flexible when we need to adjust the length of temporal dependence or incorporate additional covariates into the prediction model. When we forecast the time series for a longer period in advance, the predictions will be more stable under some mild stationarity restrictions. The restrictions can be enforced with regularization in the estimation process. Regularization can also help prevent the prediction model from overfitting when the training sample size is small. The last advantage of our NOP model is the ability to perform online estimation and prediction. Because the NOP model does not need to estimate covariance functions, it is straightforward to update the nonlinear functions of our model given any new observation. For our proposed method and other neural network-based approaches, updating the current model is roughly equivalent to a single iteration of a stochastic gradient update given the new observations. On the other hand, approaches based on functional principal component analysis require the calculation of a covariance function, which uses all observations. Our model does not need to use all observations at once; it only requires a pair of adjacent observations for updating and training the model.
The remainder of the article is organized as follows. Section 2 presents the model with its neural network implementation and estimation process. It illustrates the design of the training objective which is able to use forecast information to construct a latent representation space. Section 3 summarizes simulation results for three types of functional time series: linear, nonlinear, and with a trend function. Section 4 presents the applications of the proposed model on electricity price, French mortality, and British Columbia air quality data sets. The first resembles a stationary functional time series, whereas the latter two appear to have a trend function and cyclic behaviors respectively. For each application, we will compare the proposed model with approaches based on FPCA and functional regression. Section 5 concludes the article.

Model specification
Let $f_{t_i}(x)$, $i = 1, \ldots, n$, be the functional time series observed at time points $t_1, \ldots, t_n$. At each time $t$, the function $f_t(x)$ is a square-integrable function over the interval $[a, b]$. For simplicity, we assume that all $n$ trajectories are evaluated at the same set of points $x_1, \ldots, x_J$. In this article, we will consider a one-step ahead prediction model as follows:
$$\hat{f}_{t+1}(x) = \Gamma(\Psi(\Phi(f_t(x)))), \qquad (1)$$
where $\Phi$, $\Psi$, and $\Gamma$ are some nonlinear functions to approximate. The prediction model (1) contains three components as nonlinear functions. First, the function $\Phi$ is responsible for mapping a random function $f_t(x)$ in the data space $\mathcal{F}$ to a latent feature space $\mathcal{Z} \subset \mathbb{R}^D$ for a chosen $D$. The vector representation $z_t = \Phi(f_t(x))$ is responsible for conveying and compressing the information in the corresponding trajectory. Building a prediction model and tracing out the evolution of the functional time series is easier using the $z_t$'s than the original functional data. If the number of observations $J$ in each function is large, then low-dimensional representations greatly reduce subsequent modeling efforts. Explicitly defining a nonlinear function is difficult and not flexible across different data sets. We will utilize neural networks to approximate the nonlinear function $\Phi$ and allow it to be data-dependent. In order to preserve the continuity of trajectories, we use a recurrent neural network to approximate $\Phi$, that is,
$$h_j = u_h(W h_{j-1} + U f_t(x_j)), \qquad z_t = u_z(V h_J), \qquad (2)$$
where $W$, $U$, and $V$ are parameter matrices of the neural network. Both $u_h$ and $u_z$ are activation functions of neural networks. The hidden unit $h_j$ is a $D$-dimensional vector in correspondence to the discrete observation $f_t(x_j)$ of any trajectory. The hidden unit $h_j$ processes and propagates information along a given trajectory. We assume that all the information in $f_t(x)$ can be conveyed by $h_J$. A nonlinear mapping on $f_t(x)$ is achieved by applying a nonlinear mapping on $h_J$, as shown in the second equation in Equation (2).
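To make the recurrent encoding concrete, the following NumPy sketch maps a discretized trajectory to a latent vector by sweeping a hidden state across the grid points, using the parameter matrices W, U, V named in the text. The tanh activations, the latent dimension D = 4, and all matrix shapes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def encode_curve(f_vals, W, U, V, u_h=np.tanh, u_z=np.tanh):
    """Map a discretized curve (f_t(x_1), ..., f_t(x_J)) to a latent vector z_t
    via the recurrence h_j = u_h(W h_{j-1} + U f_t(x_j)), then z_t = u_z(V h_J)."""
    D = W.shape[0]
    h = np.zeros(D)
    for f_j in f_vals:                       # sweep the hidden state along x_1..x_J
        h = u_h(W @ h + U @ np.atleast_1d(f_j))
    return u_z(V @ h)                        # z_t lives in R^D

rng = np.random.default_rng(0)
J, D = 24, 4                                 # e.g. 24 hourly points, 4-dim latent space
W = rng.normal(size=(D, D)) * 0.1
U = rng.normal(size=(D, 1))
V = rng.normal(size=(D, D))
f_t = np.sin(np.linspace(0.0, np.pi, J))     # a toy trajectory
z_t = encode_curve(f_t, W, U, V)
```

A deep-learning framework would normally supply this recurrent layer; the explicit loop is shown only to expose the recursion in Equation (2).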
After obtaining the latent representation $z_t$ of a trajectory $f_t(x)$, we propose to predict the future trajectory $f_{t+1}(x)$ through the evolution from $z_t$ to $z_{t+1}$. In this article, we consider a simple one-step ahead prediction model $\hat{z}_{t+1} = \Psi(z_t)$, and the nonlinear function $\Psi$ is estimated by a feed-forward neural network as follows:
$$\hat{z}_{t+1} = u(B z_t), \qquad (3)$$
where $u$ is the chosen activation function and $B$ is the matrix of parameters in the neural network. There is great flexibility in modifying the prediction model based on the latent representations $z_t$'s. First, we can include other scalar covariates in the prediction model. We will use $g_{t+1}$ to represent all the possible covariates as long as they are available before observing the trajectory $f_{t+1}(x)$. The covariates in $g_{t+1}$ can be easily incorporated into the prediction model as follows:
$$\hat{z}_{t+1} = u(B [z_t; g_{t+1}]), \qquad (4)$$
where we concatenate the two vectors $z_t$ and $g_{t+1}$ into a single one. For example, we can define $g_{t+1} \in \{0, 1\}$ according to whether the day is a weekday or a weekend. Another modification can be made on the length of temporal dependence or prediction.
In the specification (1), we consider the simplest prediction form $\hat{z}_{t+1} = \Psi(z_t)$. The temporal dependence can be extended as $\hat{z}_{t+1} = \Psi(z_{t-h:t})$. In addition, we can allow $\hat{z}_{t+1:t+h} = \Psi(\cdot)$ to produce predictions for a future period $[t+1, t+h]$ rather than at only one point $t+1$. The flexibility of $\Psi$ highlights the importance of using latent feature representations for forecasting functional time series.
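A minimal sketch of the feed-forward latent predictor, including the optional covariate concatenation described above. The parameter matrix `B`, the tanh activation, and the single weekend indicator are assumptions for illustration.

```python
import numpy as np

def predict_latent(z_t, B, g_next=None, u=np.tanh):
    """One-step latent prediction: u(B z_t), or u(B [z_t; g_{t+1}]) when
    covariates that are known before f_{t+1} is observed are supplied."""
    v = z_t if g_next is None else np.concatenate([z_t, np.atleast_1d(g_next)])
    return u(B @ v)

rng = np.random.default_rng(1)
D = 4
z_t = rng.normal(size=D)
B = rng.normal(size=(D, D + 1)) * 0.5        # one extra column for one covariate
z_hat = predict_latent(z_t, B, g_next=1.0)   # 1.0 could flag, say, a weekend day
```

Extending the temporal dependence to $z_{t-h:t}$ only widens the concatenated input vector, so the same structure applies.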
If $\Phi$ were a linear transformation or an explicit nonlinear function, it would be feasible to specify an inverse mapping from $\mathcal{Z}$ to $\mathcal{F}$. However, the nonlinear mapping $\Phi$ is approximated by neural networks, and we need to directly estimate an inverse mapping $\Gamma: \mathcal{Z} \to \mathcal{F}$. The mapping $\Gamma$ can be applied to the latent feature representations $z_{1:t}$ of observed trajectories or any future ones $z_{t+1:t+h}$. In order to produce reconstructions that preserve the continuity of curves, we will again use a recurrent neural network to approximate $\Gamma$ as follows:
$$\tilde{h}_j = \tilde{u}_h(M \tilde{h}_{j-1} + G z_t), \qquad \hat{f}_t(x_j) = u_f(L \tilde{h}_j), \qquad (5)$$
where $\tilde{u}_h$ and $u_f$ are the activation functions, and $M$, $G$, and $L$ are the parameter matrices of the neural network. The prediction process is also illustrated in Figure 1. The top row in Figure 1 illustrates the prediction routine. The prediction of the future trajectory at $t+1$ is obtained by using the last observed curve. The temporal dependence between adjacent trajectories is modeled through the nonlinear function $\Psi$ with latent representations. Once the predicted $\hat{z}_{t+1}$ becomes available, we can generate the predicted trajectory $\hat{f}_{t+1}(x)$ with $\Gamma(\hat{z}_{t+1})$.
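The reconstruction mapping can be sketched analogously: a second recurrent network unrolls a latent vector back into a curve on the grid. The matrices `M`, `G`, `L` follow the text, while the tanh hidden activation and the identity output layer are illustrative choices.

```python
import numpy as np

def decode_latent(z, J, M, G, L, u_h=np.tanh):
    """Unroll a latent vector z into a curve on J grid points via
    h_j = u_h(M h_{j-1} + G z), f_hat(x_j) = L h_j (identity output)."""
    D = M.shape[0]
    h = np.zeros(D)
    f_hat = np.empty(J)
    for j in range(J):
        h = u_h(M @ h + G @ z)
        f_hat[j] = (L @ h)[0]                # L maps the hidden state to a scalar
    return f_hat

rng = np.random.default_rng(2)
D, J = 4, 24
M = rng.normal(size=(D, D)) * 0.1
G = rng.normal(size=(D, D))
L = rng.normal(size=(1, D))
f_hat = decode_latent(rng.normal(size=D), J, M, G, L)
```

Feeding the same latent vector `z` at every step, rather than only at the start, is one design choice that keeps every grid point conditioned on the full representation.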

Model estimation
For estimating all the nonlinear functions $\Phi$, $\Psi$, and $\Gamma$, we consider the following objective function:
$$\mathcal{L} = \mathcal{L}_1 + \mathcal{L}_2 + \lambda \mathcal{P}, \qquad (6)$$
with
$$\mathcal{L}_1 = \sum_{t=1}^{n} l\big(f_t(x), \Gamma(\Phi(f_t(x)))\big), \qquad \mathcal{L}_2 = \sum_{t=1}^{n-1} l\big(f_{t+1}(x), \Gamma(\Psi(\Phi(f_t(x))))\big), \qquad \mathcal{P} = \sum_{t=2}^{n} p\big(\Phi(f_t(x)), \Psi(\Phi(f_{t-1}(x)))\big),$$
where $l$ is a chosen loss function and $p$ is a penalty function. For example, we will set $l$ to be a squared distance function evaluated on any pair of trajectories. The penalty function $p$ can be a squared distance function or a Kullback-Leibler divergence term for any two vectors of representations. The penalty parameter $\lambda$ controls the contribution of the penalty term $\mathcal{P}$ in the training objective. Let us take a closer look at each component in the training objective (6). First, the component $\mathcal{L}_1$ focuses on the estimation of $\Phi$ and $\Gamma$ by using trajectories that have been observed. We will let $l$ be the squared loss function and $\tilde{f}_t(x) = \Gamma(\Phi(f_t(x)))$ be the reconstruction of a given trajectory. The squared reconstruction error $\sum_{j=1}^{J} (f_t(x_j) - \tilde{f}_t(x_j))^2$ reflects the fitting of these two nonlinear functions. It also demonstrates how well the latent vector $z_t$ can conserve information and be used to recover the underlying trajectory from noisy and discrete observations. The second component $\mathcal{L}_2$ is evaluated on future trajectories when they become available for updating the model. This component is responsible for the estimation of the predictor $\Psi$ and reinforces the reconstruction function $\Gamma$.
The third component $\mathcal{P}$ is a penalty term to avoid overfitting the training data and to balance the contributions of $\Phi$ and $\Psi$. Excluding the first trajectory $f_1(x)$, each observed trajectory is used twice in calculating the training objective. That is, we obtain two latent representations for each $f_t(x)$, $t = 2, \ldots, n-1$: one $z_t$ is obtained by applying $\Phi$, and the other is produced by the prediction $\Psi(z_{t-1})$. For example, with a squared distance function $p(\cdot, \cdot)$, a single penalty term $p(\Phi(f_t(x)), \Psi(z_{t-1}))$ is minimized if the two representations are the same, and it increases as the prediction of $z_t$ deviates from the representation obtained after observing the trajectory. This keeps $\Psi$ from overfitting the future trajectories in the tuples $(f_t(x), f_{t+1}(x))$, and it only applies to $f_{t+1}(x)$ when it becomes available. The penalty term also forces $\Psi$ to use the feature space rather than solely predicting future trajectories; that is, $\Psi$ is subject to both $\mathcal{L}_2$ and $\mathcal{P}$ instead of a single prediction objective. As a result, the penalty $\mathcal{P}$ balances the contributions of both $\Phi$ and $\Psi$ and improves the future predictions by avoiding overfitting the training data.
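Putting the three components together, the objective evaluated on a single adjacent pair might look as follows. Here `encode`, `predict`, and `decode` are toy stand-ins for the three nonlinear mappings, and the squared losses and `lam` value are illustrative.

```python
import numpy as np

def pair_objective(f_t, f_next, encode, predict, decode, lam=0.1):
    """Objective on one adjacent pair: reconstruction (L1) + prediction (L2)
    + lam * latent-consistency penalty (P), all as mean squared losses."""
    z_t = encode(f_t)
    z_hat_next = predict(z_t)
    L1 = np.mean((f_t - decode(z_t)) ** 2)             # reconstruct f_t
    L2 = np.mean((f_next - decode(z_hat_next)) ** 2)   # predict f_{t+1}
    P = np.mean((encode(f_next) - z_hat_next) ** 2)    # predicted vs encoded z_{t+1}
    return L1 + L2 + lam * P

# toy stand-ins for the encoder, predictor, and reconstruction mappings
encode = lambda f: np.array([f.mean(), f.std()])
predict = lambda z: 0.9 * z
decode = lambda z: np.full(24, z[0])
grid = np.linspace(0.0, np.pi, 24)
loss = pair_objective(np.sin(grid), np.cos(grid), encode, predict, decode)
```

The penalty term is what couples the predictor to the encoder: both representations of the later curve appear in `P`, so gradients flow into both mappings.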
This penalty term not only prevents the model from overfitting the training data but also adjusts the long-term predictions. The long-term predictions in the one-step ahead model (1) are obtained iteratively from the first predicted trajectory $\hat{f}_{n+1}(x)$, such that $\hat{f}_{n+h}(x) = \Gamma(\Psi(\Phi(\hat{f}_{n+h-1}(x))))$. With the penalty term $\mathcal{P}$, we are able to enforce the stationarity assumption in proportion to the choice of $\lambda$. A larger penalty parameter $\lambda$ forces the predicted latent representations to approach the mappings of future trajectories. It suggests that the latent space should not change much given new observations, which equivalently puts a stationarity constraint on the functional time series, including unseen observations. If the functional time series is stationary, we show in the later application examples that the model with a larger penalty has better long-term predictions than one with a smaller penalty. On the other hand, if our interest is in short-term or only one-step ahead predictions, then we can use a small $\lambda$ to put the focus on the other two terms $\mathcal{L}_1$ and $\mathcal{L}_2$ in the training objective. The predictor $\Psi$ then has more flexibility in predicting future representations, and the model can produce predictions focusing on the upcoming time without relying on the long-term stationarity of the series.
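The iterative long-term forecast described above can be sketched as a loop that feeds each prediction back through the three mappings. The toy stand-ins below are assumptions for illustration only.

```python
import numpy as np

def forecast_h_steps(f_n, h, encode, predict, decode):
    """Iterate the one-step model: each forecast is re-encoded, advanced in
    the latent space, and decoded to produce the next forecast."""
    preds, f_cur = [], f_n
    for _ in range(h):
        f_cur = decode(predict(encode(f_cur)))
        preds.append(f_cur)
    return preds

# toy stand-ins; the contraction in the latent space makes forecasts level off
encode = lambda f: np.array([f.mean()])
predict = lambda z: 0.5 * z
decode = lambda z: np.full(24, z[0])
preds = forecast_h_steps(np.ones(24), h=3, encode=encode, predict=predict, decode=decode)
```

Because errors compound through the loop, the stationarity-style penalty on the latent space is what keeps these iterated forecasts from drifting.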
The estimates of $\Phi$, $\Psi$, and $\Gamma$ can be obtained by minimizing $\mathcal{L}$ with respect to the neural network parameters $\theta$, that is, by iterating the gradient update
$$\theta^{(k+1)} = \theta^{(k)} - \eta \nabla_{\theta}\mathcal{L}$$
for a learning rate $\eta$. Figure 2 presents the training setup (inputs and outputs of the networks) for estimating the network parameters.
For the observed functional data $\{f_t(x)\}_{t=1}^{n}$, we split them into a set of tuples $\{(f_t(x), f_{t+1}(x))\}_{t=1}^{n-1}$. Each tuple is then fed into the training setup for calculating the objective function and evaluating the gradient for backpropagating training errors. The training objective is designed to facilitate the optimization process with the stochastic gradient descent method. The objective (6) can be easily evaluated based on a pair of adjacent trajectories $(f_t(x), f_{t+1}(x))$. Furthermore, it is convenient to extend the proposed model to the online setting. After the model is fitted to a satisfactory degree, we can discard all observations except the last observed curve $f_n(x)$. We pair any new trajectory with the last observed one and update the network parameters with the calculated gradient. Then, the model can generate the predicted trajectory for the next time point.
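A hedged sketch of the online update for the latent predictor alone: one stochastic-gradient step on a single new pair of latent representations, with a hand-derived gradient for a tanh layer under squared loss. The learning rate, shapes, and isolating the predictor from the other networks are all illustrative simplifications.

```python
import numpy as np

def online_update_B(B, z_t, z_next, lr=0.05):
    """One stochastic-gradient step on the latent predictor for a single new
    pair (z_t, z_next), using mean squared loss on u(B z_t) with u = tanh."""
    y = np.tanh(B @ z_t)
    # d/dB of mean((y - z_next)^2): chain rule through tanh
    grad = np.outer((y - z_next) * (1 - y ** 2), z_t) * (2 / len(y))
    return B - lr * grad

rng = np.random.default_rng(3)
D = 4
B = rng.normal(size=(D, D)) * 0.1
z_t = rng.normal(size=D)
z_next = np.tanh(rng.normal(size=D))        # target representation in (-1, 1)
loss_before = np.mean((np.tanh(B @ z_t) - z_next) ** 2)
for _ in range(200):                        # repeated updates on the same pair
    B = online_update_B(B, z_t, z_next)
loss_after = np.mean((np.tanh(B @ z_t) - z_next) ** 2)
```

In the full model the same single-pair gradient step would be applied to all three networks at once, which is why only the newest pair of curves needs to be retained.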

Generalization bound on predictions
To assess the prediction ability of our proposed method, we investigate whether it generalizes to unseen data. For that matter, we use the generalization error, which corresponds to the difference between the expected loss (model error) and the empirical loss. By providing a bound on the generalization error, we can better understand the theoretical properties of our method with respect to prediction. A detailed proof is included in the Supplementary Material.

Definition 1 (Expected and empirical loss).
Let $\mathcal{D}$ be the distribution of any data pair $(f_t(\cdot), f_{t+1}(\cdot))$, and let $q$ denote the entire prediction model, $\hat{f}_{t+1} = q(f_t) = \Gamma(\Psi(\Phi(f_t)))$. Then, the expected loss of our prediction model with respect to $\mathcal{D}$ is
$$L(q) = \mathbb{E}_{(f_t, f_{t+1}) \sim \mathcal{D}}\big[l(f_{t+1}, q(f_t))\big],$$
and the empirical loss is
$$\hat{L}(q) = \frac{1}{n-1} \sum_{t=1}^{n-1} l\big(f_{t+1}, q(f_t)\big),$$
where the sample contains the trajectories $f_1, \ldots, f_n$ from an observed functional time series and $l$ is the loss or error function.
Theorem 1 (Generalization bound). For any $\epsilon, \delta > 0$, with probability at least $1 - \delta$ over a set of $n$ data pairs, the following bound holds.

SIMULATION STUDIES
First, we conduct three simulation studies to examine the finite sample performance of the proposed NOP method. We also compare the prediction performance of the NOP method with the methods proposed in Hyndman and Ullah (2007), Hyndman and Shang (2009), and Hörmann et al. (2015), denoted here by FTSA, FPLSR, and DFPCA, respectively. The three simulation settings are listed as follows:
• Setting 1: FAR(1), $f_{t+1}(x) = \int_a^b \Gamma(x, u) f_t(u)\,du + w_{t+1}(x)$
• Setting 2: FAR(1) with a trend function
• Setting 3: a nonlinear autoregressive model
where $\Gamma(x, u) = 0.7\exp(-0.5(x^2 + u^2))$ and $w_t(x)$ is an independent Brownian motion over $[a, b]$ with zero mean and variance $1/n$. The simulation settings thus cover a standard functional autoregressive model of order one (FAR(1)), a FAR(1) model with a trend function, and a nonlinear autoregressive model. We consider time series lengths of 100, 500, 1000, and 2000 for each simulation setting. For each simulated functional time series, we keep the last 20% of the trajectories as the testing sample.
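Under the stated kernel and noise, Setting 1 might be simulated by discretizing the FAR(1) integral operator on a grid, as in the sketch below. The grid size, the initial curve, and the exact scaling of the Brownian noise are assumptions consistent with the text, not the authors' exact setup.

```python
import numpy as np

def simulate_far1(n, J=50, a=0.0, b=1.0, seed=0):
    """Simulate a FAR(1) functional time series on a regular grid by
    discretizing f_{t+1}(x) = int_a^b Gamma(x, u) f_t(u) du + w_{t+1}(x)."""
    rng = np.random.default_rng(seed)
    x = np.linspace(a, b, J)
    K = 0.7 * np.exp(-0.5 * (x[:, None] ** 2 + x[None, :] ** 2))  # Gamma(x, u)
    du = (b - a) / (J - 1)
    F = np.zeros((n, J))
    F[0] = rng.normal(size=J) * 0.1                  # arbitrary initial curve
    for t in range(1, n):
        w = np.cumsum(rng.normal(size=J)) * np.sqrt(du / n)  # Brownian motion, var ~ 1/n
        F[t] = K @ F[t - 1] * du + w                 # quadrature of the integral
    return x, F

x, F = simulate_far1(n=100)
```

The chosen kernel has operator norm below one on $[0, 1]$, so the simulated series is stationary, matching the intent of Setting 1.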
In the simulation studies, we focus only on the one-step ahead prediction $\hat{f}_{n+1}$ obtained from the previous trajectories $f_1(x), \ldots, f_n(x)$. The one-step ahead predictions are obtained as follows for each trajectory in the testing sample. Let $n$ be the time index of the last observed curve; we obtain the one-step ahead prediction $\hat{f}_{n+1}$ and then increase the training sample by one. We repeat this prediction process until the last trajectory in the testing sample. We use the mean squared prediction error (MSPE) for trajectories in the testing sample to compare the prediction performances of the four methods. The MSPE at a future time point $t$ is defined as
$$\text{MSPE}_t = \frac{1}{J} \sum_{j=1}^{J} \big(f_t(x_j) - \hat{f}_t(x_j)\big)^2.$$
Each simulation setting is replicated 100 times. For each replication, we calculate the median of the MSPEs over all trajectories in the testing sample, denoted MSPE_m in the following comparisons. The prediction performances and comparisons are summarized in Table 1. In simulation setting I, the FAR(1) model, all four methods demonstrate similar performances in predicting future functional data in terms of MSPE_m. Our proposed NOP method slightly outperforms the others, while FTSA falls behind as the time series length increases. Even with a training sample of 80 trajectories, the nonlinear mappings of our method can still be well approximated and hence produce satisfactory predictions of future trajectories.
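The MSPE above, and the per-replication median MSPE_m, can be computed directly; the toy trajectories below are illustrative.

```python
import numpy as np

def mspe(f_true, f_pred):
    """MSPE_t = (1/J) * sum_j (f_t(x_j) - f_hat_t(x_j))^2 at one time point."""
    f_true, f_pred = np.asarray(f_true), np.asarray(f_pred)
    return float(np.mean((f_true - f_pred) ** 2))

# the median over the testing sample gives MSPE_m for one replication
test_true = [np.sin(np.linspace(0.0, 1.0, 24)) for _ in range(5)]
test_pred = [f + 0.1 for f in test_true]     # toy predictions, off by 0.1 everywhere
mspe_m = float(np.median([mspe(ft, fp) for ft, fp in zip(test_true, test_pred)]))
```

A constant offset of 0.1 at every grid point gives an MSPE of 0.01 for each curve, so the median over the testing sample is 0.01 as well.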
In simulation setting II, where the functional time series has a trend function, the performance of the linear and static methods deteriorates with larger training samples or longer time series, whereas our proposed NOP method gives the best prediction results. Both the FPLSR and FTSA methods are based on the stationarity assumption of the series and do not consider a trend function. The existing trend function violates the stationarity assumption of these two methods; as the training size becomes larger, the violation gets worse, leading to poor prediction performance. On the other hand, our method does not require the calculation of covariance functions, thus it depends less on the stationarity of the series.

Table 1: The mean and standard deviation of MSPE_m for one-step-ahead predictions in the testing sample over 100 replications under simulation settings I, II, and III (ordered from top to bottom).

Hence, we can see that the NOP method provides the most consistent and best predictions regardless of the size of the training sample or the structure of the functional time series. The DFPCA method is slightly worse than our method but much better than the method using static functional principal components.
In simulation setting III where the dependence between functions is nonlinear, the proposed method slightly outperforms the DFPCA method while the other two linear approaches FPLSR and FTSA become significantly worse compared to the case of simulation setting I. In summary, we can see that our NOP method is the most consistent method and always produces the best predictions regardless of the structure of the functional time series.

APPLICATIONS
In this article, we consider three applications: predicting air quality, electricity price, and mortality rate. The air quality data are composed of multivariate functions over time; the other two applications are univariate functional time series. The electricity price data resemble a stationary functional time series, whereas the mortality rate appears to have a decreasing trend function over the years. For all three applications, we employ the proposed NOP method and compare its prediction performance with that of the FTSA and FPLSR methods proposed in Hyndman and Ullah (2007) and Hyndman and Shang (2009), respectively. The detailed results are summarized in the following subsections.

Air quality
This application focuses on a multivariate functional time series of air quality. Each multivariate trajectory contains hourly recordings of the particle density of three chemicals: NO2, O3, and SO3. We collect data from 2015-2020 from the BC air quality data archive (https://www2.gov.bc.ca/gov/content/environment/air-land-water/air/air-quality/current-air-quality-data/bc-air-data-archive). Figure 3 displays the trajectories of the functional time series for the first 100 days and shows that the hourly recordings of particle density differ markedly across chemicals. The density of NO2 has multiple peaks throughout the day, and the trajectories are cyclic; the overall density increases from Day 1 and starts to fall off on Day 5. The density of O3 peaks in the afternoon and may rise around midnight. The SO3 density curves are more consistent: most have a peak around noon and diminish to zero afterward. Neither the FPLSR nor the FTSA method can be directly applied to multivariate functions. The standard technique is to concatenate the multivariate functions into univariate ones, so that the resulting functional time series is univariate. The NOP method, in contrast, can take multivariate inputs directly without any modification. We keep the remaining 20% of the series as the testing sample. Figure 4 illustrates the one-step-ahead predictions for the first 3 days in the testing sample. One disadvantage of concatenating multivariate functions is the difficulty in predicting trajectories around the connecting points: Figure 4 shows that the FPCA-based approaches produce worse predictions at the junctions of curves, for example, at 0 and 24 h on the O3 curves. The NOP method has no such disadvantage and consistently predicts future trajectories for all chemicals.
In addition, when we concatenate multivariate functions into univariate ones, the temporal dependence becomes more complicated to capture and may no longer follow a linear relationship. The predictions from the NOP method are accordingly significantly better than those from the linear models.
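The difference between the two input formats can be illustrated with a small array sketch; the dimensions and random values below are illustrative only.

```python
import numpy as np

# Each day yields a trivariate trajectory: 24 hourly readings for each of
# NO2, O3, and SO3 (here filled with random placeholder values).
n_days, n_channels, n_hours = 10, 3, 24
rng = np.random.default_rng(1)
multivariate = rng.random((n_days, n_channels, n_hours))

# FPCA-based methods (FPLSR, FTSA) take univariate curves, so the three
# channels must be concatenated into one 72-point curve per day.  The joins
# at hours 24 and 48 become artificial junction points where predictions
# tend to degrade.
concatenated = multivariate.reshape(n_days, n_channels * n_hours)

# The NOP method can consume the (channels x hours) array per day directly,
# so no concatenation (and no junction points) is needed.
assert concatenated.shape == (10, 72)
assert np.allclose(concatenated[0, 24:48], multivariate[0, 1])  # the O3 block
```

Because NumPy reshapes in row-major order, each day's concatenated curve is simply the three channel curves laid end to end, which is exactly the univariate preprocessing the FPCA-based methods require.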
We obtain the one-step-ahead prediction and calculate the MSPE for each trajectory in the testing sample. Because the functions are multivariate, we compare the MSPE for each dimension separately after obtaining the predictions. The boxplots of MSPEs in the testing sample are summarized in Figure 5, which shows that the MSPEs from the NOP method are the smallest among the three methods. This conclusion is consistent across all three chemicals. It is evident that our NOP method excels in dealing with multivariate functional data. The proposed NOP method is also easier to apply to real data sets because it does not need to calculate covariance functions.

Electricity price
We have the hourly electricity price of northern Italy from January 6th, 2015 to September 20th, 2020. The entire functional time series contains 2096 trajectories. Figure 6 displays some sample trajectories from the data set. The prediction performance is evaluated on the predictions for the last 50 days. For the proposed NOP method, we consider two prediction models. The first ignores the difference between weekdays and weekends; that is, the predicted representation z_{t+1} depends only on the previous latent representation z_t. This model is denoted as NOP in the following comparisons. The second model accounts for the weekend/weekday effect by adding an indicator g_{t+1}, where g_{t+1} = 1 if day t + 1 is a weekday and zero otherwise. This model is referred to as NOP_weekday in the following sections; its predicted representation z_{t+1} is obtained from (z_t, g_{t+1}).
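The indicator g_{t+1} can be computed directly from the calendar. The helper below is a hypothetical implementation for the electricity series, whose first day (January 6th, 2015) is a Tuesday.

```python
from datetime import date, timedelta

def weekday_indicator(d):
    # g = 1 on weekdays (Mon-Fri), 0 on weekends (Sat/Sun).
    return 1 if d.weekday() < 5 else 0

start = date(2015, 1, 6)  # first day of the electricity price series
g = [weekday_indicator(start + timedelta(days=t)) for t in range(7)]
# Tue, Wed, Thu, Fri, Sat, Sun, Mon -> [1, 1, 1, 1, 0, 0, 1]
```

Since the indicator for any future day is known in advance, g_{t+1} can always be supplied alongside z_t, even for multi-step-ahead forecasts.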

FIGURE 6 Hourly electricity prices recorded for northern Italy. We highlight two trajectories to showcase the data on weekends and weekdays: the blue and red curves correspond to the price on one weekday and one weekend, respectively.

Figure 7 displays the predicted hourly prices for the following week from the competing methods. The plus sign "+" indicates the number of days to predict after the last observation. Because Day+3 and Day+4 are a Saturday and a Sunday, the NOP method including the weekend effect provides significantly better predictions of electricity prices than the FPLSR and FTSA methods and the NOP model without weekend effects. The future and long-term predictions are obtained simultaneously. The proposed NOP model with weekend effects can adjust predictions when encountering weekends, hence it produces stable and better forecasts after weekends, as shown by the Day+5 and Day+6 prediction curves in Figure 7.
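The multi-day rollout behind the Day+1 through Day+6 forecasts can be sketched as follows. The mapping psi is a hypothetical stand-in for the learned nonlinear latent mapping, and decoding latent states back into price curves is omitted.

```python
import numpy as np

def psi(z, g):
    # Hypothetical stand-in for the learned one-step latent mapping; the
    # weekday indicator g (1 = weekday, 0 = weekend) shifts the dynamics.
    return np.tanh(z) + 0.1 * g

def rollout(z_last, weekday_of_last, horizon=6):
    # Roll the latent state forward over Day+1 ... Day+horizon, injecting
    # the known weekday/weekend indicator for each future day.
    preds, z = [], z_last
    for day in range(1, horizon + 1):
        g = 1 if (weekday_of_last + day) % 7 < 5 else 0
        z = psi(z, g)
        preds.append(z)  # in practice each z would be decoded into a curve
    return np.stack(preds)

# Last observation on a Wednesday (weekday index 2), so that Day+3 and
# Day+4 fall on the weekend, matching the scenario in Figure 7.
path = rollout(np.zeros(3), weekday_of_last=2)
```

Because each step conditions on the calendar indicator rather than only on the previously predicted state, a weekend day adjusts the forecast and the following weekday forecasts recover, which is the behavior seen in the Day+5 and Day+6 curves.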
We compare the MSPEs of the predictions for the following 20 days; the results are summarized in Figure 8. The proposed NOP model with the weekend effect outperforms the other methods when predicting prices for a longer period in advance. The reason is that the predictions are made iteratively rather than simultaneously: the weekend effect (or any other extrinsic effect) offers more information than only the previously predicted trajectory. Including the weekend effect keeps each prediction more reliable and closer to the true observation, which in turn helps in predicting subsequent trajectories by providing stable points in the iterative prediction process. We draw this conclusion because the MSPE of the NOP model with the weekend effect is the most stable among the competing methods. In general, our proposed NOP method consistently attains smaller MSPEs than the FPCA-based methods. Figure 9 compares the MSPEs of predicted daily electricity prices for the following 30 days under different values of the penalty parameter. Smaller penalty values give better predictions in the short term, that is, in the first few days after the last observed trajectory. However, the predictions can be unstable (the orange curve) because the importance of the established latent space is not enforced: the predicted representations are allowed to be too flexible, without regard to how they should behave in the current latent space. As the penalty increases, the prediction errors become more stable in the long run. A larger penalty implies that the future trajectories are more similar to the observed ones, suggesting that the established latent space is rich enough to accommodate unseen curves; a smaller penalty prioritizes predicting future curves without enforcing the latent feature space. In other words, larger penalty values adhere to the stationarity assumption of the functional time series.
Smaller penalty values can be used to focus on short-term predictions.

Mortality rate
The left panel in Figure 10 displays the log mortality rate curves in France in 1916-2016. The mortality rate data appear to have a decreasing trend over the years; that is, the most recent year has the lowest rate compared to previous years. This functional time series is similar to setting II in the simulation studies, where the series is nonstationary. We
keep the trajectories in 1996-2016 as the testing sample, and the prediction performance is investigated over this period. We predict the mortality rates for the last 20 years, 1996-2016, simultaneously with the NOP method. The right panel in Figure 10 shows that the predicted trajectories decrease over the years, indicating that our NOP method can detect the decreasing trend in the functional time series. We then compare the prediction performance of the NOP method with that of the FTSA and FPLSR methods. As illustrated in the simulation studies, when the functional time series has a trend function, both the FTSA and FPLSR methods are unsatisfactory in comparison with the NOP method, and we expect similar conclusions in this application. Figure 11 summarizes the MSPEs for the mortality rate curves in 1996-2016. Because both the FTSA and FPLSR methods require the functional time series to be stationary, their long-term predictions are not satisfactory; moreover, their predictions deteriorate quickly for curves farther away from the last observed curve. In contrast, the MSPE of our method is steady regardless of the gap between a future trajectory and the last observed one. This demonstrates that the NOP method can capture the trend even without explicitly specifying such temporal relationships.

FIGURE 8 Mean squared prediction errors (MSPEs) of five methods for the following 20 days after the last observation. The blue and purple curves are the predictions from the nonlinear prediction method (NOP) with and without the weekend effect. The green, orange, and red curves are the predictions from the functional partial least squares method (FPLSR), the functional time series analysis (FTSA), and the dynamic functional principal component (DFPCA) method, respectively.

CONCLUSION
In this article, we propose a nonlinear prediction (NOP) method for functional time series. The model components are nonlinear functions that are flexible for analyzing real data sets. Instead of using the linear latent space spanned by FPCs, we establish a more useful latent space for compressing the observed functional data. Because this space is constructed with forecast information, it provides a better ground for predicting future trajectories. Moreover, the NOP method handles multivariate functional time series far more readily than conventional methods, and it can include extrinsic effects to produce better long-term predictions. The NOP method can also be implemented with online estimation and prediction, since it requires only the last observed trajectory rather than all past observations. We demonstrate the advantages of the NOP method over FPCA-based methods with simulation studies and three applications.