Conformal Prediction Bands for Two-Dimensional Functional Time Series

Time-evolving surfaces can be modeled as two-dimensional functional time series, exploiting the tools of functional data analysis. Leveraging this approach, a forecasting framework for such complex data is developed. The main focus revolves around Conformal Prediction, a versatile nonparametric paradigm used to quantify uncertainty in prediction problems. Building upon recent variations of Conformal Prediction for functional time series, a probabilistic forecasting scheme for two-dimensional functional time series is presented, together with an extension of Functional Autoregressive Processes of order one to this setting. Estimation techniques for the latter process are introduced and their performances are compared in terms of the resulting prediction regions. Finally, the proposed forecasting procedure and the uncertainty quantification technique are applied to a real dataset collecting daily observations of Sea Level Anomalies of the Black Sea.


Introduction
Data observed on a two-dimensional domain arise naturally across several disciplines, motivating an increasing demand for dedicated analysis techniques. Functional data analysis (FDA) (Ramsay and Silverman 2005) is naturally apt to represent and model this kind of data, as it allows preserving their continuous nature and provides a rigorous mathematical framework. Among others, Zhou and Pan 2014 analyzed temperature surfaces, presenting two approaches for Functional Principal Component Analysis (FPCA) of functions defined on a non-rectangular domain; Porro-Muñoz et al. 2014 focused on image processing using FDA; and a novel regularization technique for Gaussian random fields on a rectangular domain was proposed by Rakêt 2010 and applied to 2D electrophoresis images. Another bivariate smoothing approach in a penalized regression framework was introduced by Ivanescu and Andrada 2013, allowing for the estimation of functional parameters of two-dimensional functional data. As shown by Gervini 2010, even mortality rates can be interpreted as two-dimensional functional data.
Whereas in all the reviewed works functions are assumed to be realizations of iid, or at least exchangeable, random objects, to the best of our knowledge there is no literature focusing on forecasting time-dependent two-dimensional functional data. In this work, we focus on time series of surfaces, representing them as two-dimensional Functional Time Series (FTS).
A two-dimensional Functional Time Series is an ordered sequence Y_1, …, Y_T of random variables with values in a functional Hilbert space H, characterized by some sort of temporal dependency. More formally, we consider a probability space (Ω, F, P), and define a random function at time t as Y_t : Ω → H, measurable with respect to the Borel σ-algebra B(H). In the rest of the article we consider functions belonging to H = L²([c, d] × [e, f]), with c, d, e, f ∈ R, c < d, e < f. We stress the fact that, from a theoretical point of view, our methodology can be applied to functions defined on a generic subset of R²; however, for simplicity and without loss of generality, we will only consider rectangular domains.
Given a realization of the stochastic process {Y_t}_{t=1,…,T}, we aim to forecast the next surface and quantify the uncertainty around the predicted function. Whereas uncertainty quantification in the context of univariate FTS forecasting has received great attention in the statistical community in recent decades, no attempts have been made to extend these techniques to functions defined on a bidimensional domain.
Most of the research tackling univariate FTS forecasting has focused on adaptations of the Bootstrap to the functional setting (see e.g. Hyndman and Shang 2009, Rossini and Canale 2018 and Hernández et al. 2021). However, the Bootstrap is a very computationally intensive procedure, especially in the infinite-dimensional context of functional data. In this work, we instead focus on Conformal Prediction (CP), a versatile nonparametric approach to prediction. The first appearance of this technique dates back to Gammerman et al. 1998, and it has later been presented in greater detail in Vovk et al. 2022 and Balasubramanian et al. 2006. An extensive and unified review of the theory of Conformal inference can be found in Fontana et al. 2023. The attractiveness of Conformal Prediction lies in its great versatility, which allows coupling it with any predictive algorithm in order to obtain distribution-free prediction sets. Throughout this work, we resort to Inductive Conformal Prediction, also known as Split Conformal Prediction (Papadopoulos et al. 2002). This modification of the original Transductive Conformal method is not only computationally efficient, but also necessary in high-dimensional frameworks like the functional one. It should be noted that the main drawback of the Full Conformal approach is the need to retrain the prediction algorithm for every possible candidate realization y. In practice, in multivariate problems, where y lies in R^p, one runs the above routine for several candidates y over a p-dimensional regular grid. While such an approach is prohibitive for high-dimensional spaces, since computational times grow exponentially with p, it becomes unfeasible in a functional setting, in which y lies in an infinite-dimensional space. Employing Split Conformal inference along with nonconformity scores tailored to functional data (Diquigiovanni et al. 2022) allows obtaining prediction sets in closed form.
Whereas CP theory was originally developed under the assumption of exchangeability, Chernozhukov et al. 2018 reframed the CP framework in the context of randomization inference, proving approximate validity of the resulting prediction sets under weak assumptions on the conformity scores and on the ergodicity of the time series. Later, Diquigiovanni et al. 2021 adapted this methodology to Functional Time Series in a Split Conformal setting. We extend their method to two-dimensional functional data.
Since we want to quantify prediction uncertainty, we also need to provide a forecasting technique for 2D Functional Time Series. We start by taking into account the literature on unidimensional FTS forecasting (Bosq 2000, Antoniadis et al. 2006, Horváth and Kokoszka 2012, Aue et al. 2012, Jiao et al. 2021), extending the theory of Functional Autoregressive Processes (FAR) to the bivariate setting, proposing estimation techniques for the FAR(1), and comparing them in order to assess how forecasting performance influences the width of the prediction bands.
The rest of this paper is organized as follows: we illustrate Conformal Prediction for two-dimensional functional data in Section 2, providing theoretical guarantees for the resulting prediction bands. In Section 3, we introduce forecasting algorithms for two-dimensional Functional Time Series, proposing an extension of the FAR(1) to two-dimensional functional data. These forecasting algorithms are then compared by means of the resulting prediction bands in a simulation study in Section 4. Finally, in Section 5 we employ the developed techniques to obtain forecasts and prediction bands for real data, predicting the Black Sea level day by day. Section 6 concludes.

Uncertainty Quantification: Conformal Prediction for 2D Functional Time Series
Consider a time series Z_1, …, Z_T of regression pairs Z_t = (X_t, Y_t), with t = 1, …, T. Let Y_t be a random variable with values in H, while X_t is a set of random covariates at time t belonging to a measurable space. Notice that X_t is a generic set of regressors, which may contain both exogenous and endogenous variables. Later in the manuscript, we will consider X_t to contain only the lagged version of the function Y_t, namely Y_{t−1}. Given a significance level α ∈ [0, 1], we aim to design a procedure that outputs a prediction set C_{T,1−α}(X_{T+1}) for Y_{T+1} based on Z_1, …, Z_T and X_{T+1}, with unconditional coverage probability close to 1 − α. More formally, we define C_{T,1−α}(X_{T+1}) to be a valid prediction set if

P(Y_{T+1} ∈ C_{T,1−α}(X_{T+1})) ≥ 1 − α.    (1)

We would like to construct a specific type of prediction set, commonly known as a prediction band, formally defined as

C_{T,1−α}(X_{T+1}) = {y ∈ H : y(u, v) ∈ B_T(u, v) for all (u, v) ∈ [c, d] × [e, f]},    (2)

where B_T(u, v) ⊆ R is an interval for each (u, v) ∈ [c, d] × [e, f]. The convenience of this type of prediction set in applications is extensively motivated in the literature (see e.g. López-Pintado and Romo 2009, Lei et al. 2015 and Diquigiovanni et al. 2022), since a prediction set of this kind can be easily visualized, a property that is not guaranteed if the prediction region is a generic subset of H. Let z_1, …, z_T be realizations of Z_1, …, Z_T. As the name suggests, Split Conformal inference is based on a random split of the data into two disjoint sets: let I_1, I_2 be a random partition of {1, …, T}, such that |I_1| = m, |I_2| = l, with m, l ∈ N, m, l > 0 and m + l = T. Historical observations z_1, …, z_T are divided into a training set {z_h, h ∈ I_1}, used for model estimation, and a calibration set {z_h, h ∈ I_2}, used in an out-of-sample context to measure the nonconformity of a new candidate function. The choice of the split ratio and of the type of split is non-trivial and has motivated discussion in the statistical community.
We fix the training-calibration ratio equal to 1 and perform a random split, and refer to Appendix A for a more extensive discussion on the topic. We then introduce a nonconformity measure A, which is a measurable function with values in R ∪ {+∞}. The role of A({z_h, h ∈ I_1}, z) is to quantify the nonconformity of a new datum z with respect to the training set {z_h, h ∈ I_1}. The choice of the nonconformity measure is crucial to obtain prediction bands (2) in closed form. We employ the following nonconformity score, introduced by Diquigiovanni et al. 2022 and extended here to two-dimensional functional data:

A({z_h, h ∈ I_1}, z) = sup_{(u,v) ∈ [c,d]×[e,f]} | (y(u, v) − g_{I_1}(x_{T+1})(u, v)) / s_{I_1}(u, v) |,    (3)

where z = (x_{T+1}, y), g_{I_1} is a point predictor built from the training set I_1, and s_{I_1} is a modulation function, i.e. a positive function depending on I_1 that allows for prediction bands with non-constant width along the domain. Section 4 and Appendix B discuss the estimation of g_{I_1}. The functional standard deviation is employed as modulation function s_{I_1}, allowing for wider bands in the parts of the domain where data show high variability and for narrower, more informative bands in those characterized by low variability. For an extensive discussion on the optimal choice of the modulation function, we refer to Diquigiovanni et al. 2022. Consider now a candidate function y ∈ H and define the augmented dataset as

Z^(y) = {Z_1, …, Z_T, (X_{T+1}, y)}.

The key idea of the methodology proposed by Chernozhukov et al. 2018 and extended by Diquigiovanni et al. 2021 is to generate several randomized versions of Z^(y) through a specifically tailored permutation scheme, and to compute nonconformity scores on each of them. We then decide whether to include y in the prediction region by comparing the nonconformity score of Z^(y) with those of its permuted replicas. In order to obtain such replicas, we define a family Π of index permutations π : {1, …, T+1} → {1, …, T+1} that leaves the training set indices I_1 unchanged, and modifies only {I_2, T+1}, namely the indices of the calibration set and of the next time step. We first introduce a function λ : {I_2, T+1} → {1, …, l+1} such that λ(t) returns the position of t in the ordered set {I_2, T+1}. Fix now a positive integer b ∈ {1, …, l+1} such that (l+1)/b ∈ N and define a family Π̃ of index permutations acting on the set {1, …, l+1}. Each π̃_i ∈ Π̃ is required to be a bijection π̃_i : {1, …, l+1} → {1, …, l+1}, for i = 1, …, (l+1)/b. We consider the non-overlapping blocking permutation scheme proposed by Chernozhukov et al. 2018, dividing the indices into blocks of size b, in such a way that each permutation is unique in Π̃:

π̃_i(j) = j + (i − 1)b   if j + (i − 1)b ≤ l + 1,
π̃_i(j) = j + (i − 1)b − (l + 1)   otherwise.    (6)

By definition, we have that |Π̃| = (l+1)/b, and Π̃ forms an algebraic group containing the identity transformation π̃_1. It is then straightforward to introduce the family Π of index permutations acting on {1, …, T+1}. Each π_i ∈ Π, with i = 1, …, (l+1)/b, is defined as

π_i(t) = t   if t ∈ I_1,
π_i(t) = λ^{−1}(π̃_i(λ(t)))   if t ∈ {I_2, T+1}.

Figure 1 reports an example of the families of permutations Π and Π̃. We refer to Z_π^(y) = {Z_{π(t)}}_{t=1}^{T+1} as the randomized version of Z^(y) = {Z_t}_{t=1}^{T+1}, and define the randomization p-value as

p(y) = (1/|Π|) Σ_{π ∈ Π} 1{ S(Z_π^(y)) ≥ S(Z^(y)) },    (7)

where the nonconformity scores S(Z^(y)) and S(Z_π^(y)) are defined as

S(Z^(y)) = A({z_h, h ∈ I_1}, (x_{T+1}, y)),    S(Z_π^(y)) = A({z_h, h ∈ I_1}, z_{π(T+1)}).

The idea is to apply permutations modifying the order of the observations in the calibration set while, at the same time, preserving the dependence between them, thanks to the block structure of Π. For each π, we compute the nonconformity score of Z_π^(y). The p-value (7) of a candidate value y is then the proportion of randomized versions Z_π^(y) with a nonconformity score greater than or equal to that of the original augmented dataset Z^(y). Notice that p(y) is a measure of the conformity of the candidate function y with respect to the permutation family Π. It is then natural to include in the prediction set only functions y with a high conformity level.
Given a significance level α ∈ [b/(l + 1), 1], the prediction region is hence obtained by test inversion:

C_{T,1−α}(x_{T+1}) = {y ∈ H : p(y) > α}.

As a remark, it should be noted that if α ∈ (0, b/(l + 1)) the resulting prediction set coincides with the entire space H (Diquigiovanni et al. 2022). The advantage of using the Split Conformal method along with the nonconformity measure (3) lies in the possibility of finding the prediction set in closed form. Defining k_s as the ⌈(|Π| + 1)(1 − α)⌉-th smallest value of {S(Z_π^(y)), π ∈ Π \ {π_1}}, we derive

C_{T,1−α}(x_{T+1}) = {y ∈ H : |y(u, v) − g_{I_1}(x_{T+1})(u, v)| ≤ k_s s_{I_1}(u, v) for all (u, v) ∈ [c, d] × [e, f]}.

The prediction band is therefore defined by the intervals

B_T(u, v) = [ g_{I_1}(x_{T+1})(u, v) − k_s s_{I_1}(u, v),  g_{I_1}(x_{T+1})(u, v) + k_s s_{I_1}(u, v) ].

If regression pairs are exchangeable, the proposed method retains exact, model-free validity (Chernozhukov et al. 2018, Theorem 1). When this assumption is not met, one can instead guarantee approximate validity of the proposed approach under weak assumptions on the nonconformity score and on the ergodicity of the time series. This result is illustrated in great detail by Theorem 2 of Chernozhukov et al. 2018 and Theorem 1 of Diquigiovanni et al. 2021. We report here the latter, with a slightly modified notation.
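With the identity-only blocking scheme (b = 1), the closed form above reduces to ranking the sup-norm scores of the calibration set. The following is a minimal numpy sketch of that computation; array names, shapes, and the small-variance guard are our own illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def conformal_band(pred_calib, y_calib, pred_next, alpha):
    """Closed-form split-conformal band on a grid (b = 1 case, sketch).

    pred_calib, y_calib: (l, n1, n2) point predictions and observed
    surfaces for the calibration set; pred_next: (n1, n2) forecast.
    """
    resid = y_calib - pred_calib                       # calibration residuals
    s = resid.std(axis=0, ddof=1)                      # functional std as modulation
    s = np.maximum(s, 1e-12)                           # guard against zero variability
    scores = np.abs(resid / s).max(axis=(1, 2))        # sup-norm nonconformity scores
    k = int(np.ceil((len(scores) + 1) * (1 - alpha)))  # rank of the score quantile
    k_s = np.sort(scores)[min(k, len(scores)) - 1]
    return pred_next - k_s * s, pred_next + k_s * s    # lower and upper surfaces
```

The returned pair of surfaces corresponds to the band B_T(u, v) evaluated on the grid.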
Theorem 1. If the following conditions hold:
• the empirical distribution F̂ of the randomized nonconformity scores is approximately ergodic;
• the nonconformity scores S(Z_π) approximate their oracle counterparts S*(Z_π), built from the true model, up to small errors bounded by δ_2m;
• with probability 1 − γ_2m the pdf of S*(Z_π) is bounded above by a constant D;
then the Conformal prediction set has approximate coverage 1 − α, with a coverage gap controlled by the quantities above. The first condition concerns the approximate ergodicity of F̂, a condition which holds for strongly mixing time series using the blocking permutations Π defined in (6) (Chernozhukov et al. 2018). The other conditions are requirements on the quality of the approximation of S*(Z_π) by S(Z_π). Intuitively, δ²_2m bounds the discrepancy between the nonconformity scores and their oracle counterparts. This condition is related to the quality of the point prediction and to the choice of the employed nonconformity measure.

Point Prediction: Functional Autoregressive Process of order one
In order to obtain CP bands with empirical coverage close to the nominal one, the choice of an accurate point predictor is important. As mentioned before, whereas in the typical i.i.d. case finite-sample unconditional coverage still holds when the model is heavily misspecified (Diquigiovanni et al. 2022), in the time series context a strong model misspecification may compromise not only the efficiency but also the coverage guarantees of the resulting prediction bands (Chernozhukov et al. 2018, Diquigiovanni et al. 2021). For this reason, it is important to consider models that are consistent with the functional nature of the observations and that can adequately deal with their infinite dimensionality. We build on top of the literature on functional autoregressive processes in Hilbert spaces, extending them for the first time to temporally evolving surfaces. We narrow the forecasting methodology to the FAR(p), with p = 1, because of its wide success in the literature (Hernández et al. 2021, Papadopoulos et al. 2002 and Aue et al. 2012). Whereas in the scalar context it is often beneficial to consider lags p greater than one, given the intrinsic high dimensionality of functional data we would rather fit a biased but simpler model than an unbiased but more complicated one. This issue is exacerbated in the two-dimensional context because of the extra dimension in the domain of the function, and for this reason we consider only the case p = 1. We introduce the Functional Autoregressive model of order 1 in Section 3.1 and propose estimation techniques in Section 3.2.

FAR(1) Model
The most popular statistical model used to capture temporal dependence between functional observations is the functional autoregressive process. The theory of functional autoregressive processes in Hilbert spaces is developed in the pioneering monograph of Bosq 2000 and a comprehensive collection of statistical advancements for the FAR model can be found in Horváth and Kokoszka 2012.
A sequence of mean-zero random functions {Y_t}_{t=1}^T ⊂ H follows a non-concurrent Functional Autoregressive Process of order 1 if

Y_t = Ψ(Y_{t−1}) + ε_t,    (13)

where {ε_t}_{t∈N} is a sequence of iid mean-zero innovation errors with values in H satisfying E[||ε_t||²] < +∞, and Ψ is a bounded linear operator from H to H. We consider Ψ to be a Hilbert-Schmidt operator with kernel ψ, in such a way that

(Ψ x)(u, v) = ∫_e^f ∫_c^d ψ(u, v, u′, v′) x(u′, v′) du′ dv′,   x ∈ H.

In order to ensure the existence of a stationary solution of (13), one has to require that ∃ j_0 ∈ N such that ||Ψ^{j_0}|| < 1 (Bosq 2000, Lemma 3.1).
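On a discretization grid, the action of the Hilbert-Schmidt operator Ψ can be approximated by a Riemann sum over the kernel evaluations. A minimal sketch, assuming a uniform grid with spacings du, dv; all names are illustrative:

```python
import numpy as np

def apply_kernel_operator(psi, y, du, dv):
    """Approximate (Psi y)(u, v) = double integral of psi(u, v, u', v') y(u', v')
    by a Riemann sum on a uniform grid (illustrative sketch).

    psi: (n1, n2, n1, n2) kernel evaluations; y: (n1, n2) surface.
    """
    # contract the primed grid axes of the kernel against the surface
    return np.tensordot(psi, y, axes=([2, 3], [0, 1])) * du * dv
```

For a constant kernel ψ ≡ 1 and a constant surface y ≡ 1 on the unit square, the sum recovers the exact integral, which equals 1 at every point.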

FAR(1) Estimation
Proceeding similarly to Horváth and Kokoszka 2012, and adopting an approach akin to Yule-Walker estimation in the scalar setting, we propose the following estimator of Ψ:

Ψ̂_M(x) = (1/(T−1)) Σ_{t=1}^{T−1} Σ_{i,j=1}^{M} λ̂_j^{−1} ⟨x, ξ̂_j⟩ ⟨Y_t, ξ̂_j⟩ ⟨Y_{t+1}, ξ̂_i⟩ ξ̂_i,    (15)

where ξ̂_1, …, ξ̂_M are the first M normalized functional principal components (FPCs), λ̂_1, …, λ̂_M are the corresponding eigenvalues, and ⟨x, ξ̂_1⟩, …, ⟨x, ξ̂_M⟩ are the scores of x along the FPCs. Appendix C illustrates two different estimation techniques for ξ̂_i and λ̂_i, one based on a discretization of the functions on a fine grid and the other built on an expansion of the data on a finite basis system. We further refer to Appendix B for more details on the derivation of estimator (15) and for an extensive discussion on how to adapt it to the Conformal Prediction setting, where Ψ is estimated from the training set I_1 only. Another forecasting procedure based on FPCs has been proposed by Aue et al. 2012 for one-dimensional functional data and is here extended to the two-dimensional setting. Calling once again ξ_1, …, ξ_M the first M functional principal components, we decompose the Functional Time Series as

Y_t(u, v) = Σ_{m=1}^M ⟨Y_t, ξ_m⟩ ξ_m(u, v) + e_t(u, v),    (16)

where the vector Y_t = (⟨Y_t, ξ_1⟩, …, ⟨Y_t, ξ_M⟩)^T collects the scores along the principal components, and e_t(u, v) is the approximation error due to the truncation of the expansion at the first M principal components. Neglecting the approximation error e_t, one can prove that the vector Y_t follows a multivariate autoregressive process of order 1 (VAR(1)). Plugging in the estimated FPCs ξ̂_1, …, ξ̂_M, we can estimate the parameters of the resulting VAR(1) model using standard techniques of multivariate statistics and forecast Ŷ_{T+1} based on the historical data Y_1, …, Y_T. The predicted function Ŷ_{T+1} is then reconstructed as

Ŷ_{T+1}(u, v) = Σ_{m=1}^M Ŷ_{T+1,m} ξ̂_m(u, v),    (17)

where Ŷ_{T+1,m} denotes the m-th component of the VAR(1) forecast. We finally introduce a model that may appear simplistic, since it does not exploit the possible time dependence between the function's values at different points of the domain, but that in practical applications provides satisfying results.
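As a rough illustration of the estimated-kernel approach (15) under the discretization strategy, the following Python sketch flattens the surfaces, extracts empirical FPCs via an SVD, and produces a one-step forecast on the score space. Function and variable names are our own, and details (centering, eigenvalue handling) are simplified assumptions rather than the paper's implementation.

```python
import numpy as np

def far1_ek_forecast(Y, M):
    """One-step FAR(1) forecast in the spirit of the EK estimator (sketch).

    Y: (T, n1, n2) array of observed surfaces; M: number of FPCs kept.
    """
    T, n1, n2 = Y.shape
    X = Y.reshape(T, -1)
    mu = X.mean(axis=0)
    Xc = X - mu
    # empirical FPCs via SVD of the centered data matrix
    _, sing, Vt = np.linalg.svd(Xc, full_matrices=False)
    xi = Vt[:M]                        # (M, n1*n2) discretized FPCs
    lam = sing[:M] ** 2 / T            # eigenvalues of the covariance operator
    scores = Xc @ xi.T                 # (T, M) FPC scores
    # lag-1 cross-covariance of scores, scaled by 1/lambda_j (Yule-Walker style)
    C1 = scores[:-1].T @ scores[1:] / (T - 1)
    B = C1 / lam[:, None]
    next_scores = scores[-1] @ B       # predicted scores at T+1
    return (mu + next_scores @ xi).reshape(n1, n2)
```

Here `scores[-1] @ B` computes, for each i, the sum over j of λ̂_j^{−1} ⟨Y_T, ξ̂_j⟩ Ĉ_1(j, i), mirroring the structure of (15).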
The prediction method assumes an autoregressive structure at each location (u, v) of the domain, ignoring the dependencies between different points. We call this model a concurrent FAR(1):

Y_t(u, v) = ψ_{u,v} Y_{t−1}(u, v) + ε_t(u, v),

where ψ_{u,v} ∈ R and t = 2, …, T. Supposing we have observed functional data y_1, …, y_T on a common two-dimensional grid, the coefficients ψ_{u,v} can be estimated separately at each grid point, e.g. by ordinary least squares.
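Pointwise estimation of the concurrent model amounts to one OLS slope per grid location. A minimal sketch, assuming a common observation grid (names are illustrative):

```python
import numpy as np

def concurrent_far1_forecast(Y):
    """Concurrent FAR(1): fit an independent AR(1) at every grid point.

    Y: (T, n1, n2); the slope at (u, v) is the OLS regression of
    Y_t(u, v) on Y_{t-1}(u, v) without intercept (illustrative sketch).
    """
    past, pres = Y[:-1], Y[1:]
    denom = (past ** 2).sum(axis=0)
    psi_hat = (past * pres).sum(axis=0) / np.where(denom > 0, denom, 1.0)
    return psi_hat * Y[-1]             # one-step-ahead forecast
```

For a noiseless series with Y_t = 0.5 Y_{t−1}, the pointwise slopes recover 0.5 exactly.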

Study Design
The goal of this section is twofold: we aim to assess the quality of the proposed CP bands and to evaluate different point predictors in terms of the resulting prediction regions. Since this work focuses on uncertainty quantification, we compare forecasting performances by means of the resulting Conformal Prediction bands. First and foremost, we estimate the unconditional coverage by computing the empirical unconditional coverage, in order to compare it with the nominal confidence level 1 − α. Secondly, we consider the size of the prediction bands, since a small prediction region is preferable, as it includes subregions of the sample space where the probability mass is highly concentrated (Lei et al. 2015) and it is typically more informative in practical applications.
We employ as data generating process a FAR(1) model, in order to evaluate the estimation routines presented in Section 3.2. To benchmark forecasting performances, we examine the forecasting methods against a naive one: Ŷ_{T+1} = Y_T. By including a forecasting algorithm that is not coherent with the data generating process, we can illustrate how the presented CP procedure performs when a good point predictor g_{I_1} is not available. Although, as reported in Section 3, a sufficiently accurate forecasting algorithm is necessary to guarantee asymptotic validity, we notice that in the simulations the CP bands remain valid even when this assumption is not met.
To obtain further insights, we include the performances obtained by assuming perfect knowledge of the operator Ψ. For ease of reference, we list here the forecasting algorithms, introducing some convenient notation.
• FAR(1)-EK (Estimated Kernel) denotes the first estimation procedure presented in Section 3.2, where we explicitly compute Ψ̂_M as prescribed by (15) and then set Ŷ_{T+1} = Ψ̂_M Y_T.
• FAR(1)-EK+ denotes the same procedure with the eigenvalue correction applied in the computation of Ψ̂_M.
• FAR(1)-VAR denotes the forecasting procedure (17), where we exploit the expansion on the estimated functional principal components and forecast Y_{T+1} using the underlying VAR(1) model.
• Naive: we simply set Ŷ_{T+1} = Y_T. This method does not attempt to model the temporal evolution; it is included only to assess how much can be gained by exploiting the autoregressive structure of the data.
• Oracle: we set Ŷ_{T+1} = Ψ Y_T, using the exact operator Ψ from which the data are simulated. This point predictor is clearly not available in practical applications, but it is interesting to include it in order to see whether poor predictions are due to poor estimation of Ψ.
When required (namely in FAR(1)-EK, FAR(1)-EK+ and FAR(1)-VAR), FPCA is performed using the discretization approach, as motivated in Appendix C, truncating the representation at the first 4 harmonics.
In Section 4.2, we fix the size b of the blocking scheme (6) equal to 1 and let the sample size T take values 19, 49, 99, 499. Secondly, in Section 4.3, we instead fix the sample size at 119 and repeat the simulations with b = 1, 3, 6. As usually done in the time series setting, the first observation is taken into account as a covariate only and enters neither the training set nor the calibration set. The training and calibration sets hence each contain half of the remaining observations: m = l = (T − 1)/2. For each value of T, we repeat the procedure over N = 1000 simulations. Simulations are implemented in the R Programming Language (R Core Team 2020).
In order to simulate a sequence of functions {Y_t}_{t=1,…,T} from a FAR(1), we assume that observations lie in a finite-dimensional subspace of the function space H. Without loss of generality, throughout this section we consider functions in H = L²([0,1]×[0,1]). The subspace is spanned by orthonormal basis functions φ_1, …, φ_M, with M ∈ N representing its dimension. Therefore, we have

Y_t(u, v) = Σ_{m=1}^M c_{t,m} φ_m(u, v),

where the coefficient vectors c_t = (c_{t,1}, …, c_{t,M})^T follow a VAR(1) process, c_t = Ψ c_{t−1} + ε_t. Functions are represented on a grid of 10⁴ points.
The basis system φ_1, …, φ_M is constructed as the tensor product of two cubic B-spline systems; for a discussion of tensor product basis systems, we refer to Appendix C.2. The matrix Ψ is defined as Ψ := 0.7 Ψ̃ / ||Ψ̃||_F, with Ψ̃ having diagonal entries equal to 0.8 and off-diagonal entries equal to 0.3. Innovation errors ε_t are independently sampled from a multivariate normal distribution with mean zero and covariance matrix Σ having diagonal entries equal to 0.5 and off-diagonal entries equal to 0.3. Figure 2 depicts an example of a simulated Functional Autoregressive Process of order one. A GIF of the time-evolving FAR(1) process can be found on GitHub.
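The coefficient dynamics of this data generating process can be sketched as a VAR(1) on the basis scores. The paper's simulations are in R; the following is only an illustrative Python translation of the stated setup (burn-in length and seed handling are our own assumptions):

```python
import numpy as np

def simulate_far1_coeffs(T, M, rho=0.7, burn=100, seed=0):
    """Simulate the coefficient VAR(1) underlying the FAR(1) DGP (sketch):
    Psi = rho * Psi_tilde / ||Psi_tilde||_F, Psi_tilde with 0.8 on the
    diagonal and 0.3 off it; innovations N(0, Sigma), Sigma with 0.5 on
    the diagonal and 0.3 off it.
    """
    rng = np.random.default_rng(seed)
    Pt = np.full((M, M), 0.3) + np.diag(np.full(M, 0.5))     # 0.8 diag, 0.3 off
    Psi = rho * Pt / np.linalg.norm(Pt)                       # Frobenius normalization
    Sigma = np.full((M, M), 0.3) + np.diag(np.full(M, 0.2))   # 0.5 diag, 0.3 off
    eps = rng.multivariate_normal(np.zeros(M), Sigma, size=burn + T)
    c = np.zeros(M)
    out = np.empty((T, M))
    for t in range(burn + T):
        c = Psi @ c + eps[t]           # VAR(1) recursion
        if t >= burn:
            out[t - burn] = c          # keep post-burn-in draws
    return out, Psi
```

The burn-in discards the transient so that the retained draws are approximately a realization of the stationary process.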
Notice that the simulations have been designed so as to generate a stationary process. This condition is important to guarantee the existence of a solution to the FAR(1) equation (13), and is here guaranteed by the normalization ||Ψ||_F = 0.7: one can indeed prove that, if relation (21) holds, then ||Ψ|| ≤ ||Ψ||_F < 1, which satisfies the sufficient condition for stationarity presented in Lemma 3.1 of Bosq 2000, where ||·|| is the usual operator norm and ||·||_F denotes the Frobenius norm. In this way, the FAR(1) estimation techniques are well-defined. As concerns the theoretical assumptions of the CP scheme, verifying the hypotheses of Theorem 1 is difficult, particularly in the context of functional data. To the best of our knowledge, no test for strongly mixing two-dimensional time series has been proposed, and testing the bounds on the oracle nonconformity scores is even more challenging. However, we aim to show here that the CP procedure can still be applied to obtain valid and efficient prediction bands.

Varying the Sample Size
We first fix the size b of the blocking scheme equal to 1 and let the sample size T take values 19, 49, 99, 499. We replicate the experiments with different significance levels, α = 0.1, α = 0.2 and α = 0.3, in order to assess how the confidence level influences the coverage and the width of the prediction bands. Figure 3 shows the empirical coverage, together with the related 99% confidence interval. The empirical coverage is computed as the fraction of the N = 1000 replications in which y_{T+1} belongs to C_{T,1−α}(x_{T+1}), and the confidence interval is reported to provide insights into the variability of the results, rather than to draw inferential conclusions about the unconditional coverage. Notice that different point predictors might intrinsically have dissimilar coverages; consequently, this analysis aims to compare the forecasting algorithms in terms of their predictive performance. We can appreciate that the 99% confidence interval for the empirical coverage almost always includes the nominal confidence level, regardless of the available sample size. The only exception occurs with α = 0.3 and T = 19. In this case, the method produces very narrow prediction bands (Figure 4c), which result in an empirical coverage smaller than the nominal one. This behavior however disappears as soon as the sample size T increases. It is also interesting to notice that, even when an accurate forecasting algorithm g_{I_1} is not available (namely with the Naive predictor), the proposed CP procedure still outputs prediction regions with high unconditional coverage.
Similarly to Diquigiovanni et al. 2022, we define the size of a two-dimensional prediction band as the volume between the upper and lower surfaces that delimit the band:

Size(C_{T,1−α}) = ∫_{[c,d]×[e,f]} ( B_T^u(u, v) − B_T^l(u, v) ) du dv,

where B_T^u and B_T^l denote the upper and lower surfaces of the band. By measuring the size of the corresponding prediction bands, we can compare the efficiency of the different forecasting routines. We stress the fact that distinct point predictors may guarantee potentially different coverage levels. For this reason, it is crucial to first evaluate the empirical coverage of the resulting prediction bands and only afterwards investigate their size. Figure 4 reports boxplots of the prediction bands' sizes for the N = 1000 simulations and for different values of α and T. Band sizes tend to decrease as the number of observations T increases, hence improving the efficiency of the prediction sets. Moreover, the size tends to decrease when the significance level α increases, i.e. when the confidence level 1 − α decreases. As expected, the Naive predictor provides larger prediction bands, particularly in the large sample settings. On the other hand, FAR(1)-EK and FAR(1)-EK+, both based on the estimation of the autoregressive operator Ψ, provide the tightest prediction bands, not only when numerous observations are available, but also in small sample scenarios. Notice also that the eigenvalue correction slightly improves the performance of FAR(1)-EK+ with respect to FAR(1)-EK, especially when few samples are available. We acknowledge that, when T = 19, FAR(1)-VAR performs remarkably worse than the other methods. We argue that this performance gap might be caused by the simultaneous OLS estimation of the underlying VAR(1) equations, which might provide biased estimates if the sample size is small. However, when the sample size increases, this forecasting algorithm performs comparably with the aforementioned FAR(1)-EK and FAR(1)-EK+.
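The size above can be approximated numerically by a Riemann sum over the evaluation grid; an illustrative helper, not from the paper's code:

```python
import numpy as np

def band_size(lower, upper, du, dv):
    """Size of a two-dimensional prediction band: the volume enclosed
    between its upper and lower surfaces, approximated by a Riemann sum
    on a uniform grid with spacings du, dv (illustrative sketch)."""
    return float(np.sum(upper - lower) * du * dv)
```

For a band of constant width 1 on the unit square, the sum recovers the exact volume 1.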
Finally, although the Conformal Prediction bands produced by the Oracle predictor are obviously the best performing, both FAR(1)-EK and FAR(1)-EK+ provide CP bands with coverage and size comparable to those of the theoretically perfect Oracle forecasting method.

Varying the Blocking Scheme Size
This time, we fix the sample size T and let the blocking scheme size b vary, in order to determine how the validity and efficiency of the resulting prediction bands are influenced by this parameter. Once again, we repeat the experiments for α = 0.1, α = 0.2, α = 0.3, whereas the value of T is fixed at 119, which provides a good balance between scenarios with small and large sample sizes. Analogous results have been found by letting the value of T vary. Results are reported in Figure 5 and Figure 6. Once again, in all circumstances the empirical coverage is close to the nominal one, confirming the validity of the CP bands even for higher values of b. Moreover, one can notice that, as already pointed out by Diquigiovanni et al. 2021, the band size tends to decrease when b decreases, thus providing more efficient prediction regions. We argue that this behavior is related to the inverse proportionality between the blocking scheme size b and the size of the permutation family |Π|. Finally, notice that larger prediction bands attain, as expected, a larger empirical coverage, and that larger values of α (smaller confidence levels 1 − α) result in smaller prediction bands. A comparison of the forecasting algorithms' performances validates the considerations in Section 4.2.

Dataset
In this section, we aim to illustrate the application potential of the proposed methodology on a proper case study. We analyze a dataset from the Copernicus Climate Change Service (C3S), a project operated by the European Centre for Medium-Range Weather Forecasts (ECMWF), collecting daily sea level anomalies of the Black Sea over the last twenty years (Mertz and Legeais 2018). Sea level anomalies are measured as the height of water over the mean sea surface at a given time and region. Specifically, altimetry instruments give access to the sea surface height (SSH) above the reference ellipsoid, which is calculated as the difference between the orbital altitude of the satellite and the measured altimetric distance of the satellite from the sea (see Figure 7a). Starting from this information, the Sea Level Anomaly (SLA) is defined as the anomaly of the signal around the Mean Sea Surface component (MSS), which is computed with respect to a 20-year reference period (1993-2012). Observations are collected on a spatial raster, with a resolution of 0.125° on both the longitude and the latitude axis. Since observations are collected on a geoid, the domain lies on a two-dimensional manifold; however, because both the longitude and latitude ranges are very small (14° and 7°, respectively), we assume data to be observed on a flat bidimensional grid. The resulting lattice can hence be considered as the Cartesian product of a grid on the longitude axis made of N_1 = 120 points and a latitude grid of N_2 = 56 points. We refer to (u_i, v_j), with i = 1, …, N_1 and j = 1, …, N_2, as the (i, j)-th point of this two-dimensional mesh. Since the Black Sea does not have a rectangular shape, we model the data as realizations of random surfaces defined on the rectangle circumscribing the perimeter of the sea, and identically equal to zero outside of it.
As a consequence, letting B denote the set of points internal to the perimeter of the Black Sea, we slightly redefine the nonconformity measure (3) so that only points in B contribute, by means of a restriction function R(u, v). Hereafter, we will consider the time series {SLA_t} without making explicit the dependence on the bivariate domain.

Data Preprocessing
Ideally, one would forecast the time series of Sea Level Anomalies {SLA_t} directly. However, in order to estimate the FAR(1) process, we would also like to guarantee stationarity of the time series at our disposal, and given the particular nature of the dataset, we expect observations to exhibit strong seasonality as well as a linear trend. Indeed, both tide gauge and altimetry observations show that sea level trends in the Black Sea vary over time (Avsar et al. 2016, Cazenave et al. 2001). Tsimplis and Baker 2000 estimated a rise in the mean sea level of 2.2 mm/year from 1960 to the early 1990s, while along-track altimetry data indicate that sea level rose at a rate of 13.4 ± 0.11 mm/year over 1993-2008 (Ginzburg et al. 2011).
In order to further investigate this issue, we should proceed by testing the functional time series {SLA_t}_t for stationarity. However, while for one-dimensional Functional Time Series one could resort to the tests proposed by Horváth et al. 2014 or Aue and van Delft 2020, to the best of our knowledge no stationarity test for two-dimensional functional time series has been developed. For this reason, and aware of the limits of this approach, we resort to analyzing the stationarity of the univariate time series SLA_t(u_i, v_j), where each (u_i, v_j) represents a grid point in the lattice. We stress that stationarity of the individual time series does not guarantee stationarity of the underlying functional process: it constitutes only a necessary, not a sufficient, condition. Therefore, the goal is not to derive inferential results on the stationarity of the process, but rather to describe its evolution by means of its individual components, and potentially to obtain a better-behaved time series to work with. We test each univariate time series for stationarity using the Augmented Dickey-Fuller (ADF) test. We report in Figure 8 a grid map of the p-values for {SLA_t}_t, {∆SLA_t}_t and {∆²SLA_t}_t, where ∆ and ∆² denote one and two differentiations respectively. Although we cannot draw inferential conclusions on the stationarity of the process from the individual tests, we can see that the original time series exhibits a markedly non-stationary behavior, and after one differentiation many locations remain non-stationary.
After two differentiations, all the individual time series can be confidently considered stationary. We argue that the Functional Time Series might exhibit a similar behavior in terms of stationarity, and hence proceed by differentiating the process twice, defining Y_t = ∆²SLA_t = SLA_t − 2SLA_{t−1} + SLA_{t−2}. We forecast the differentiated time series Y_t and obtain prediction bands for it using the methodology presented in Section 2. However, in order to provide better insight into the prediction problem, we need to retrieve results pertaining to the original time series SLA_t. Specifically, we apply the conformal machinery to the differentiated time series, calling Ŷ_{T+1} the forecasted function, and obtain the prediction band for Y_{T+1}. Exploiting the fact that SLA_{T+1} = Y_{T+1} + 2SLA_T − SLA_{T−1}, we define the prediction band for SLA_{T+1} by shifting the band for Y_{T+1} by the quantity 2SLA_T − SLA_{T−1}, which is known at time T. Finally, notice once again that testing the hypothesis underlying the CP procedure is not possible in this setting. We are nevertheless interested in applying the CP scheme together with the forecasting techniques in order to verify their empirical performance.
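The differencing step and the back-transformation of the band can be sketched as follows (a minimal numpy sketch; array names and the (T, N1, N2) layout are illustrative assumptions):

```python
import numpy as np

def difference_twice(sla):
    """Y_t = SLA_t - 2*SLA_{t-1} + SLA_{t-2}, along the time axis."""
    return np.diff(sla, n=2, axis=0)

def band_for_sla(y_hat, y_lower, y_upper, sla):
    """Map the forecast and band for Y_{T+1} back to the original scale.

    Since SLA_{T+1} = Y_{T+1} + 2*SLA_T - SLA_{T-1}, the band is shifted
    pointwise by the quantity 2*SLA_T - SLA_{T-1}, known at time T.
    """
    shift = 2 * sla[-1] - sla[-2]
    return y_hat + shift, y_lower + shift, y_upper + shift
```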

Study Design
The case study employs a rolling estimation framework which recalculates the model parameters on a daily basis and consequently shifts and recomputes the entire training, calibration and test windows by 24 hours, as shown in Figure 9. As before, we use a random split of data in the training and calibration sets, with split proportion equal to 50%. The significance level α is once again fixed equal to 0.1. For each of the 1000 days we aim to predict, we build the corresponding prediction band based on the information provided by the last 99 days, thereby fixing T = 99. Choosing this sample size provides accurate forecasts and thus small prediction bands, while maintaining reasonable computational times. The size of the blocking scheme is fixed to 1, since, as motivated in Section 4.3, this choice produces the narrowest prediction bands, preserving at the same time satisfactory performance in terms of empirical coverage. The rolling window is shifted 1000 times, thus iterating for almost three years the forecasting scheme. More specifically, and to allow for reproducibility of subsequent results, we consider a rolling window ranging from 01/01/2017 to 04/01/2020.
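The rolling scheme with a random 50%-50% training-calibration split can be sketched as follows (index conventions are illustrative, not the exact implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def rolling_splits(T=99, n_shifts=1000, split=0.5):
    """Yield (train, calibration, test) indices for each daily shift.

    For shift s, the T days s, ..., s+T-1 are randomly divided into
    training and calibration sets; day s+T is the one to predict.
    """
    for s in range(n_shifts):
        window = rng.permutation(np.arange(s, s + T))
        m = int(split * T)
        train, cal = np.sort(window[:m]), np.sort(window[m:])
        yield train, cal, s + T
```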
The point predictors used throughout this application are those described in Section 3. The number of Functional Principal Components is set equal to 8. For each shift of the rolling window and for each forecasting algorithm, we check whether SLA_{T+1} belongs to Ĉ_{T,1−α}(x_{T+1}), and save the size of the corresponding prediction band. After collecting the results, we calculate the average coverage and use it to compare the performance of the different point predictors in this scenario. Figure 10 shows the observed and forecasted surfaces obtained for one of the days in the rolling window, as well as the lower and upper bounds defining the prediction band. For the sake of simplicity, we display only results obtained using the FAR(1)-EK estimator (15), since, as discussed below, it provides on average the narrowest prediction bands. In order to allow for a more insightful analysis, and to further investigate the evolution of the surfaces, we implemented a dedicated Shiny App, available online, where results can be explored. We report in Figure 11a the average coverage of CP bands obtained across the 1000 predictions. As in Section 4, we pair this quantity with a 99% confidence interval. Notice that in this case the confidence interval may be biased, due to the inevitable correlation between the data used to construct it; however, we include it in order to assess the dispersion of the average coverage around the mean.

Results
Coherently with the results of the simulation study, we can appreciate that in all cases the prediction bands capture the observed surface y_{T+1} approximately 100(1 − α)% of the time, regardless of the forecasting algorithm used.
As for the size of the prediction bands, the Naive predictor produces by far the widest ones (see Figure 11b), and this does not translate into greater coverage compared to the other methods. On the other hand, autoregressive forecasting algorithms provide narrower prediction regions. Among these, the non-concurrent FAR(1) is the best performing one, regardless of the way in which it is estimated (namely with FAR(1)-EK, FAR(1)-EK+ or FAR(1)-VAR). Nevertheless, the concurrent FAR(1) model also provides very tight prediction bands, almost comparable with the ones produced by the non-concurrent prediction algorithm.
We are also interested in analyzing the pointwise properties of CP bands in this scenario. Therefore, we display in Figure 12 a map of the pointwise coverage of the prediction bands obtained using FAR(1)-EK. We can appreciate, as expected, a high empirical coverage across the entire domain, emphasizing once again the peculiarity of our approach, which guarantees global coverage of the prediction surfaces, reflected in a pointwise coverage well above the nominal one. We report in Figure 13a the average width of CP bands, which exhibits a peculiar pattern, likely caused by data collection routines. Indeed, we can see from Figure 13b that a behaviour similar to the one observed in the map of pointwise width can be found by plotting the standard deviation of the original data. This is coherent with the employed CP framework, since the size of the prediction bands depends on the amplitude of the functional standard deviation.
In conclusion, this case study confirms the validity of our procedure and shows how a FAR(1) model significantly improves predictive efficiency even in this more complex scenario.

Conclusions and Further Developments
In this work, we introduce a mathematical framework for probabilistic forecasting of two-dimensional Functional Time Series. Leveraging the CP scheme developed by Chernozhukov et al. 2018 and Diquigiovanni et al. 2021 and adapting it to the 2D-FTS setting, we propose a technique for quantifying uncertainty when predicting time-evolving surfaces. In order to provide point predictions of surfaces, we model functions through a Functional Autoregressive process of order one, extending the mathematical theory of autoregressive processes to bivariate functions. Estimation techniques for the FAR(1) are presented and compared. We test the benefits and limits of the proposed approach, first on synthetic data and then on a novel real time series dataset, collecting daily observations of sea level anomaly over the Black Sea. Empirical results prove the validity of the methodology on non-synthetic data. We acknowledge that in applying the proposed procedure to the case study, we had to introduce some simplifications due to the novelty of the subject and the limited amount of work on two-dimensional Functional Time Series. We hope that this work will encourage the development of novel analysis techniques for 2D functional data, such as stationarity tests and other forecasting tools.
Figure 11: Results of the forecasting procedure in a rolling window setting. (b) Size of CP bands. All the methods produce prediction bands with average coverages close to the nominal one. For what concerns predictive efficiency, the naive predictor produces the widest bands, whereas the other methods produce instead more efficient bands, with sizes similar to each other.
Finally, throughout the work, we limited the analysis to the FAR(p) with p = 1, as motivated in Section 3. One may consider a FAR(p) model with p > 1. Whereas it is not straightforward to extend the estimation of Ψ_M (15) to a FAR(p) model with p > 1, both the concurrent estimator (18) and the estimator based
on an expansion of FPC's may be easily adapted to the case p > 1. The problem can in fact be recast as a standard functional linear regression and solved using the many off-the-shelf methods available in the literature (see e.g. Chiou et al., 2016). Finally, notice that, thanks to the flexibility of our approach, a modification of this kind can be easily achieved by changing the point predictor g_{I_1} only, while preserving the desirable properties of the prediction bands.

Appendix A. Split ratio and type of split
The choice of the split ratio is non-trivial and has motivated discussion in the statistical community. Including more data in the training set improves the estimation of the point predictor ĝ_{I_1}. At the same time, having few data points in the calibration set produces a very rough p-value function (7), potentially resulting in greater actual coverage than the nominal one. This trade-off is heightened in the time series context, in which one would like both the training and the calibration set to be as large as possible, since asymptotic validity is guaranteed when both l and m go to infinity. Throughout this work, the training-calibration ratio is fixed equal to 50%-50%, as commonly suggested in the literature. Moreover, we stress that the split is random. This clearly introduces variability in the procedure, since results depend on the particular division of the data. We acknowledge a recent advancement in this direction, called Multi Split Conformal Prediction (Solari and Djordjilović 2021), which aggregates single-split CP intervals across multiple splits.
Another interesting question regarding the type of split comes up in the time series context. Due to the lack of exchangeability, two different subdivisions are possible in this framework. A first choice consists in a sequential division of the data, where the split point is no longer random but is determined by the training-calibration proportion (see A.14a). Wisniewski et al. 2020 applied this scheme in a rolling window fashion to forecast Market Makers' Net Positions. While this choice may seem more consistent in the presence of temporal dependence, since it does not split subsequent observations into two different sets, it may lead to very biased results if the training size m is very small or if the data present different trend or seasonal components in the training and calibration sets. The interested reader may refer to Kath and Ziel 2021 for a more comprehensive discussion on this topic. All in all, in order to make the model more robust, we preferred to split the data randomly (A.14b).

Extending the work of Horváth and Kokoszka 2012 to a different functional space, we can derive sample estimators Γ̂_0 and Γ̂_1 for such operators, which converge to Γ_0 and Γ_1 a.e.
In order to derive estimator (15) of the FAR(1), we first consider the operatorial equation Γ_1 = Ψ Γ_0. A natural idea may consist in computing the estimators Γ̂_0, Γ̂_1 from historical data and then defining Ψ̂ = Γ̂_1 Γ̂_0^{−1}. Unfortunately, the inverse operator Γ_0^{−1} is unbounded on H (Horváth and Kokoszka 2012); however, since Γ_0 is a symmetric, compact, positive-definite operator, one can exploit its spectral decomposition to introduce a pseudoinverse operator Γ_{0,M}^{−1}, defined as Γ_{0,M}^{−1}(x) = Σ_{j=1}^{M} λ_j^{−1} ⟨x, ξ_j⟩ ξ_j (B.5), where ξ_1, . . . , ξ_M are the first M normalized functional principal components (FPC's), λ_1, . . . , λ_M are the corresponding eigenvalues and ⟨x, ξ_1⟩, . . . , ⟨x, ξ_M⟩ are the scores of x along the FPC's. We formally define ξ_i and λ_i as the eigenfunctions and eigenvalues solving the functional equation Γ_0 ξ_i = λ_i ξ_i (B.6). We can now combine (B.5) and (B.6), plugging in estimated eigenfunctions and eigenvalues and calling Γ̂_{0,M}^{−1} the resulting estimator of Γ_{0,M}^{−1}, to finally derive Ψ̂_M = Γ̂_1 Γ̂_{0,M}^{−1} (B.9). Notice that the operator Γ̂_{0,M}^{−1} is bounded on H if the λ̂_j are strictly greater than zero for j = 1, . . . , M. Nevertheless, even if this condition is met, in practice one should cautiously select the number of principal components M, because very small eigenvalues result in very large reciprocals λ̂_j^{−1}, providing in practice unbounded estimates of Γ_{0,M}^{−1}. This observation motivated Didericksen et al. 2010 to add a positive baseline to the estimated eigenvalues λ̂_j. This small modification improves the estimation of the operator Ψ and, most importantly, contributes to weakening the dependency of Ψ̂_M on M.
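In the discretized setting, the regularized pseudoinverse with the eigenvalue baseline of Didericksen et al. 2010 can be sketched as follows (the baseline value and its scaling by the leading eigenvalue are illustrative choices):

```python
import numpy as np

def pseudoinverse(C0, M, baseline=1e-3):
    """Regularized pseudo-inverse of a sample covariance matrix C0.

    Keeps the M leading eigenpairs and adds a positive baseline to the
    eigenvalues before inverting, in the spirit of Didericksen et al. 2010.
    """
    lam, xi = np.linalg.eigh(C0)                   # ascending eigenvalues
    lam, xi = lam[::-1][:M], xi[:, ::-1][:, :M]    # leading M pairs
    lam_reg = lam + baseline * lam[0]              # baseline relative to lambda_1
    return xi @ np.diag(1.0 / lam_reg) @ xi.T
```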
We now aim to adapt the previous estimator (B.9) to the conformal inference setting. The goal is to accommodate the forecasting algorithm so as to estimate the point predictor g_{I_1} from the training set only. As a preliminary step, given that the FAR(1) model has been presented for mean-centered data, one has to estimate the mean function μ̂_{I_1} from the training set only and consequently center all the observations in the training and calibration sets around μ̂_{I_1}. If the sample size at disposal is sufficiently large and the stationarity assumption is fulfilled, it should make little difference whether the population function μ is estimated with the sample mean μ̂ = (1/T) Σ_{t=1}^{T} Y_t or with its restriction to the training set, μ̂_{I_1} = (1/m) Σ_{t∈I_1} Y_t. Another fundamental step is the estimation of the functional principal components. Since in the CP framework we are allowed to use only information from the training set to compute ξ̂_1, . . . , ξ̂_M, it is natural to employ only the training data {z_h : h ∈ I_1} in this estimation routine.
In order to obtain the estimator Ψ̂_M (15), one has to compute Γ̂_1 and Γ̂_{0,M}^{−1} from the training set. While the computation of the sample pseudo-inverse of the autocovariance estimator is straightforward, the CP counterpart of Γ̂_1 is more delicate and requires further discussion. Recall indeed that the classical estimator for the lag-1 autocovariance operator from Y_1, . . . , Y_T is Γ̂_1(x) = (1/(T−1)) Σ_{t=1}^{T−1} ⟨Y_t, x⟩ Y_{t+1}. In the CP setting, however, we could define three different estimators for Γ_1 (B.13), where x ∈ H and I_1[i : j] contains the indices from the i-th to the j-th element of I_1. Notice that the three estimators differ because, if t ∈ I_1, we have no assurance that {t − 1} ∈ I_1 or even {t + 1} ∈ I_1. We also stress that the third operator is well-defined only if we reserve a burn-in set of length 1 at the front of the time series, in such a way that, if {2} ∈ I_1, we can still compute the estimator. Among the three options, we prefer the third one (B.14), since it averages over a larger set. One may also argue that such estimators are not coherent with the CP setting, since they are inevitably based on data from the calibration set. However, as mentioned before, we are considering the time series of regression pairs Z_t = (X_t, Y_t), t = 1, . . . , T (Chernozhukov et al. 2018). The key consideration is that, according to the FAR(1) model, we use as regressors the lagged version of the time series, namely X_t = Y_{t−1}, so that the regression couples become Z_t = (Y_{t−1}, Y_t), for each t = 1, . . . , T. From this perspective, one can rephrase the definition of the sample covariance operator by making explicit its dependence on the regressor X_t instead of Y_{t−1}. It is then straightforward to derive the estimator Ψ̂_{M,I_1} and the resulting point predictor. In order to compute (B.16), it is first mandatory to project the calibration set onto the EFPC's, so as to compute the scores ⟨X_h, ξ̂_j⟩ for h ∈ I_2.
Notice that, in the non-conformal setting, this step comes for free when performing FPCA. In the CP context, however, EFPC's are computed from the training set only; therefore, in order to access the scores of the calibration set, one needs to explicitly add this projection step.
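On discretized (flattened) surfaces, the estimation of Ψ̂_{M,I_1} from the training pairs reduces to matrix operations; a minimal sketch (variable names and the regularization constant are illustrative):

```python
import numpy as np

def estimate_psi(X, Y, M, baseline=1e-3):
    """Estimate the FAR(1) operator from training pairs (X_h, Y_h).

    X, Y: arrays of shape (m, d) holding the (centered, discretized)
    regressors X_h = Y_{h-1} and responses Y_h for h in the training set.
    Returns the matrix representing Psi_hat on the grid.
    """
    m = X.shape[0]
    C0 = X.T @ X / m                     # sample covariance of the regressors
    C1 = Y.T @ X / m                     # lag-1 cross-covariance
    lam, xi = np.linalg.eigh(C0)
    lam, xi = lam[::-1][:M], xi[:, ::-1][:, :M]
    C0_pinv = xi @ np.diag(1.0 / (lam + baseline * lam[0])) @ xi.T
    return C1 @ C0_pinv                  # point predictor: Psi_hat @ x_new
```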

Appendix C. FPCA for Two-Dimensional Functional Data
A fundamental aspect in the design of many forecasting algorithms is the estimation of the functional principal components (FPC's) {ξ_i}, i ∈ ℕ. We define the FPC's as functions ξ_i ∈ L²([c, d] × [e, f]) solving the eigen-equation Γ_0 ξ_i = λ_i ξ_i, where Γ_0 denotes the covariance operator. From a theoretical point of view, we would like to guarantee that population eigenfunctions can be consistently estimated by empirical eigenfunctions even in the non-iid framework of Functional Time Series. We refer to Theorem 16.2 in Horváth and Kokoszka 2012, which provides asymptotic arguments addressing this question.
The following subsections are dedicated to the estimation of eigenfunctions and eigenvalues in the two-dimensional functional case. Extending the work of Ramsay and Silverman 2005, we present two different estimation procedures, based respectively on a discretization of the functions over a fine grid and on a linear expansion of the data on a finite set of basis functions. Between the two alternatives, we resort to function discretization. Indeed, this choice requires selecting neither a specific type of basis nor the number of basis functions to employ, both of which are non-trivial, problem-dependent questions. Moreover, notice that the discretization procedure can itself be seen as a particular case of the basis expansion, using as basis system indicator functions on the grid points. Furthermore, in Appendix C.3 we show with a simulation study that there is no significant evidence to prefer one method over the other in terms of estimation quality.
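A minimal sketch of the discretization approach on a regular grid (the quadrature weighting and function names are illustrative choices):

```python
import numpy as np

def fpca_discretized(Y, n_comp, du, dv):
    """FPCA of two-dimensional functional data observed on a regular grid.

    Y: array (T, N1, N2) of surfaces; du, dv: grid step sizes.
    Returns eigenvalues and eigenfunctions (n_comp, N1, N2), scaled so
    that each eigenfunction has unit L2 norm under the quadrature rule.
    """
    T, N1, N2 = Y.shape
    w = du * dv                                    # quadrature weight
    Yc = (Y - Y.mean(axis=0)).reshape(T, N1 * N2)
    C = Yc.T @ Yc / T * w                          # discretized covariance operator
    lam, vec = np.linalg.eigh(C)
    lam, vec = lam[::-1][:n_comp], vec[:, ::-1][:, :n_comp]
    xi = (vec / np.sqrt(w)).T.reshape(n_comp, N1, N2)
    return lam, xi
```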
We want to stress that our methodology for FPCA is general: it works for two-dimensional functional data regardless of the presence of temporal dependence between observations. Not modeling the serial dependence structure does not invalidate the PCA procedure, but we still have to require that the dynamics are stationary in order for the covariance estimation to make sense and thus provide meaningful estimates.

Appendix C.3. Comparison of Estimation Methods
In this section, we aim to compare the two proposed approaches for performing functional principal component analysis, namely the basis expansion and the discretization approach. Without loss of generality, we set the study in L²([0, 1] × [0, 1]).
In general, given a finite basis system {φ_k}_{k=1,...,K}, one can represent a Functional Time Series {Y_t}_{t=1}^{T} by means of representation (C.9), which we report here for ease of reference: Y_t = Σ_{k=1}^{K} c_{t,k} φ_k + δ_t (C.16). Starting from this decomposition, we derived in Appendix C.2 an estimator of the functional principal components ξ_j which, however, neglects the contribution of the error δ_t. For this reason, the choice of the basis system {φ_k}_{k=1}^{K} is crucial: if a meaningful option is not available, the approximation residual δ_t will be large and the procedure will inevitably provide biased estimates of ξ_j. On the other hand, the FPCA approach based on data discretization provides good results as long as the sampling grid is sufficiently dense.
To support this claim, we simulate a time series of functions {Y_t}_{t=1}^{T} ⊂ span{φ_1, . . . , φ_K} from a non-concurrent FAR(1) process with Gaussian errors. Data are simulated based on a basis expansion on {φ_1, . . . , φ_K}, which is constructed as the tensor product of two Fourier basis systems {g_i}_{i=1,...,K_1} and {h_i}_{i=1,...,K_2}, both defined on [0, 1]. The sample size is chosen equal to T = 50, and the number of basis functions in each of the one-dimensional systems is set equal to 5, so that the total number of basis functions is K = 25.
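The tensor-product Fourier system used in the simulation can be constructed, for instance, as follows (the grid evaluation and the ordering of the basis functions are illustrative choices):

```python
import numpy as np

def fourier_basis(x, K):
    """First K orthonormal Fourier basis functions on [0, 1]."""
    B = [np.ones_like(x)]
    k = 1
    while len(B) < K:
        B.append(np.sqrt(2) * np.sin(2 * np.pi * k * x))
        if len(B) < K:
            B.append(np.sqrt(2) * np.cos(2 * np.pi * k * x))
        k += 1
    return np.stack(B)                     # shape (K, len(x))

def tensor_basis(u, v, K1, K2):
    """Tensor product phi_{(i,j)}(u, v) = g_i(u) * h_j(v)."""
    G, H = fourier_basis(u, K1), fourier_basis(v, K2)
    # result: (K1*K2, len(u), len(v)), row-major in (i, j)
    return (G[:, None, :, None] * H[None, :, None, :]).reshape(
        K1 * K2, len(u), len(v))
```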
Since we know the space in which the functions are embedded, we can apply the estimation procedure of Appendix C.2 using as basis system the same one used in the simulation. Notice that in this case the approximation error δ_t in (C.16) is exactly zero, and one can derive optimal estimates of the functional principal components. Estimates of the first three functional principal components are represented in Figure C.15a. We repeat the same estimation procedure, this time modelling the functions with a basis built from the tensor product of two one-dimensional cubic B-spline basis systems with 5 basis functions each. In this case, the basis system does not coincide with the one from which the functions are simulated. Nevertheless, as reported in Figure C.15b, all the scaled eigenfunctions are very close to the optimal ones estimated before. Finally, we compare the aforementioned estimators with the ones coming from the discretization approach. Each of the one-dimensional grids is discretized using a step size equal to 0.02, resulting in a total of 2500 points. Also in this case (see Figure C.15c), the estimated harmonics are very close to the optimal ones in Figure C.15a.
To enable a better comparison, we report in Table C.1 the Mean Squared Error (MSE) (C.17) between the FPC's y estimated using full knowledge of the basis system from which the functions are simulated and the ones estimated using the other techniques (ŷ). This quantity is computed from the values of y and ŷ on a two-dimensional grid {(u_i, v_j)}_{i=1,...,N_1; j=1,...,N_2}.

MSE(y, ŷ) = (1/(N_1 N_2)) Σ_{i=1}^{N_1} Σ_{j=1}^{N_2} (y(u_i, v_j) − ŷ(u_i, v_j))². (C.17)
We observe very low values of MSE regardless of the technique used for FPCA, suggesting that both the basis expansion and the discretization approach are valid options from a practical point of view.