Spatiotemporal Multi-Resolution Approximations for Analyzing Global Environmental Data

Technological developments and open data policies have made large, global environmental datasets accessible to everyone. For analysing such datasets, including spatiotemporal correlations using traditional models based on Gaussian processes does not scale with data volume and requires strong assumptions about stationarity, separability, and distance measures of covariance functions that are often unrealistic for global data. Only very few modeling approaches suitably model spatiotemporal correlations while addressing both computational scalability as well as flexible covariance models. In this paper, we provide an extension to the multi-resolution approximation (MRA) approach for spatiotemporal modeling of global datasets. MRA has been shown to be computationally scalable in distributed computing environments and allows for integrating arbitrary user-defined covariance functions. Our extension adds a spatiotemporal partitioning, and fitting of complex covariance models including nonstationarity with kernel convolutions and spherical distances. We evaluate the effect of the MRA parameters on estimation and spatiotemporal prediction using simulated data, where computation times reduced around two orders of magnitude with an increase of the root-mean-square prediction error of around five percent. This allows for trading off computation times against prediction errors, and we derive a practical strategy for selecting the MRA parameters. We demonstrate how the approach can be practically used for analyzing daily sea surface temperature and precipitation data on global scale and compare models with different complexities in the covariance function.


Introduction
With the large amount of existing Earth observation programmes, measurements of many environmental phenomena are available at global scale. These include high-resolution optical imagery as well as atmospheric and meteorological variables. At the same time, technological developments such as Google Earth Engine (Gorelick et al., 2017), Earth observation data cubes (Mahecha et al., 2020;Appel and Pebesma, 2019;Lewis et al., 2017), and array databases (Stonebraker et al., 2013;Appel et al., 2018;Baumann et al., 1998) make these data accessible to scientists, such that studying changes, anomalies and interactions between different phenomena becomes easier. In many cases, the data are strongly autocorrelated in space and/or time and hence geostatistical models are of particular interest to draw conclusions from the data.
Continuous phenomena are typically modeled as a Gaussian process (GP) defined by a mean µ(s) and a covariance function Cov(s, s ) with s, s ∈ D and typically D ⊂ R d . Given a realization of a GP with observations at n locations s 1 , ..., s n , the joint distribution is multivariate normal with mean vector µ i = µ(s i ) and covariance matrix C ij = Cov(s i , s j ), i, j = 1, ..., n. The mean function typically integrates large scale trends or external covariates whereas the covariance function determines the degree and shape of spatial and/or temporal dependencies in the remainder process. Since covariance matrices must be positive definite, a few commonly used classes of covariance functions following a typical parameterization including a spatial range, (partial) sill, and a nugget effect are mostly applied. Inference typically concerns estimation of the mean and covariance function parameters and prediction of measurements at unobserved locations. Estimation and prediction are both affected by the big n problem (Lasinio et al., 2013), because the decomposition of the covariance matrix computationally scales with O(n 3 ) and has memory requirements O(n 2 ). For parameter estimation, naive moments-based experimental variogram estimation may lower the computational effort for simple stationary models.
As a result, most practical applications of approaches based on decomposing the full covariance matrix are limited to the order of 10 4 < n < 10 5 , whereas available global datasets such as satellite-based precipitation observations from the Global Precipitation Measurement (GPM) (Hou et al., 2014) mission have more than 10 6 observations per image, with new images being available every few hours or even more frequent. Furthermore, stationarity and separability assumptions in Geostatistical covariance models are hardly valid on global scale, and using Euclidean distances on projected coordinates or chordal distances on a sphere can be disadvantageous compared to using spherical (great circle) distances (Gneiting et al., 2013;Banerjee, 2005;Porcu et al., 2016).
The big n problem has been approached by a number of approximating methods in the last years. Common methods aim at reducing the rank of large covariance matrices (see Cressie and Johannesson, 2008, for example), making covariance or their inverse precision matrices sparse (see Furrer et al., 2006;Lindgren et al., 2011, for example), or partitioning the data into smaller subsets (see Katzfuss, 2017, for example). Heaton et al. (2019) give a comprehensive overview of available approaches and provide a comparative benchmark for spatial prediction, also including other algorithmic approaches such as gapfill (Gerber et al., 2018). The approaches considered have fundamentally different underlying concepts and parameterizations. Resulting differences in prediction performance as well as computation times still seem remarkable. Each of the available approaches offer advantages and disadvantages. Most important differences refer to the complexity of supported covariance models (e.g. nonstationarity), computational behavior (e.g. on distributed computing environments), whether the domain can be extended to space and time, and whether approximations are valid on the sphere.
Concentrating on modeling global environmental phenomena as recorded from remote sensing satellites, only some of the approaches have been successfully applied on large global spatiotemporal datasets. As such, Fixed Rank Kriging (FRK) (Cressie and Johannesson, 2008;Zammit-Mangion and Cressie, 2017) approximates a GP by a linear combination of r spatial basis functions with different resolutions. r is typically much smaller than n and inference only needs to factorize r × r matrices. The basis function representation implies nonstationary covariances and is integrated in a spatial random effects model to include external covariates in the mean function and smaller scale variations and/or measurement errors.
Similarly, LatticeKrig (Nychka et al., 2015) defines a multi-resolution representation where basis functions are compactly supported and their coefficients are modeled as a Markov random field (Rue and Held, 2005). As a result, computations make advantage of sparse matrix routines and a larger number of basis functions that also capture small scale variations can be used.
The stochastic partial differential equation (SPDE) approach (Lindgren et al., 2011) links Gaussian Markov Random Fields and GPs with Matérn covariance function to achieve sparse precision matrices by numerically solving stochastic partial differential equation on a triangulated area of interest. The approach has been applied in space and time (Cameletti et al., 2013), is valid on the sphere, and supports nonstationarity e.g. by using covariates in the dependence structure (Ingebrigtsen et al., 2014). Using integrated nested Laplace approximations (INLA) (Rue et al., 2009), the SPDE approach allows for fast Bayesian inference.
Recently, Zammit-Mangion and Rougier (2019) built on top of the SPDE approach a multiscale representation that allows for working with large datasets and nonstationary covariances. Results demonstrate that approaches need to consider flexible covariance models as well as computational scalability.
In terms of scalable computing by making use of distributed computing environments, especially the multi-resolution approximation (MRA) approach (Katzfuss, 2017) seems promising. MRA recursively partitions the domain of interest into smaller regions and assumes conditional independence between observations in disjoint regions at the same partitioning level. Huang et al. (2019) provides a distributed implementation with an application to remote sensing satellite data with n > 10 7 .
In this paper, we use the multi-resolution approximation (MRA) approach developed in Katzfuss (2017), extend it to spatiotemporal domains and covariance models, and evaluate the extent to which this can solve the aforementioned difficulties with global environmental datasets. The overall contribution of the paper is to make the MRA approach applicable and available to global spatiotemporal datasets, and to evaluate and discuss its performance and scalability, i.e., to what extent we can realistically apply it on today's datasets.
The remainder of the paper is organized as follows. Section 2 describes how we apply the MRA approach (Katzfuss, 2017) on spatiotemporal data. Sections 3 and 4 present a simulation study and real world examples, before Section 5 discusses the results, limitations, and gives an outlook on future research directions. Section 6 concludes the paper.

Multi-resolution Approximations in Space and Time
To approximate a mean zero Gaussian process, the MRA approach (Katzfuss, 2017) recursively partitions the area of interest and assumes conditional independence between different regions at the same partitioning level m ∈ {0, ..., M }. Within each region, MRA defines a predictive process (Banerjee et al., 2008) at a set of r knots, where larger regions (smaller m) capture large-scale spatial dependencies and smaller regions consider the remainder process of the previous levels and hence capture smaller scale spatial variations. As a result, approximations of covariances between any two locations are better if the locations share more regions. The basic idea is illustrated in Figure 1 with M = 3 partitioning levels and r = 16 (regularly placed) basis functions per region. Notice that for illustration purposes, the example uses identical basis functions at each level, although their shape is actually derived automatically with regard to a given covariance function.
Regions in the lowest level may or may not correspond to actual observations. In general, the total number of basis functions is larger than the number of observations, making it possible to account for both large and small scale variations. Advantages in computational complexity are due to replacing the decomposition of one big n × n matrix by the decomposition of many smaller matrices. The partitioning into conditionally independent regions allows for efficient inference on distributed computing environments. Implementations of the approach in Huang et al. (2019) and Jurek and Hammerling (2018) have demonstrated computational scalability up to n ≈ 10 7 . Below, we discuss how we use the MRA approach to approximate spatiotemporal Gaussian processes. For further details of the original approach and inference, the reader is referred to Katzfuss (2017).

Spatiotemporal Partitioning
Similar to the two-dimensional partitioning presented in Huang et al. (2019), we apply a regular spatiotemporal partitioning, where all three dimensions are divided into two equally sized parts per level. A spatiotemporal region is split into J = 8 smaller regions at each level, resulting in M i=0 8 i regions in total. Dimensions of spatiotemporal data can be rather uneven in size as there are often more values along a spatial dimension than points in time. At the lower partitioning levels a regular partitioning would lead to regions with all observations recorded at the same time. Due to the MRA assumption of conditional independence between different regions at the same partitioning level, this would prohibit estimations of temporal correlations. For this reason, we maintain lower bound parameters (LB x , LB y , LB t ) specifying the minimum size of regions allowed per dimension. For example, setting LB t = 10 days makes sure that the time dimension is not further partitioned if the temporal interval size of regions would fall below 10 days. As a result, the number of splits per region J may vary by m with J ∈ {2, 4, 8} and the partitioning may stop at a higher level than a user may have actually specified.
Given a number of basis functions per regions r in the levels before the last level (m < M ), we first placer = 3 √ r 3 knots at a regular spacetime grid within each region from which we afterwards randomly sample exactly r knots. Knots refer to the locations where the basis functions of the predictive process reach their maximum. To avoid duplicate knots in a region and its child regions, which would lead to positive definiteness issues, we assign a constant random spatiotemporal subpixel shift to all knots in a region. A final check at the end of the algorithm may furthermore remove duplicate knots in the unlikely event that there are still duplicated knots. At the lowest partitioning level (m = M ) we simply place knots at all available observation locations within a region.
The overall partitioning algorithm is illustrated in pseudo code below. if available, add prediction locationsŝ j lying within extent to the current region else 1. Create a regular grid atr = 3 √ r 3 , knots ( 3 √r per dimension) within the extent 2. Apply a small random sub-pixel shift per dimension to avoid identical knot locations at different m 3. Randomly sample exactly r of ther knots end if 2. Add region to the list of output regions 3. Partition the current region:

Input
1. Split all dimensions into two equally sized intervals if size of the current region is larger than 2 times the corresponding LB 2. if no dimension has been splitted: exit 3. Combine all dimension intervals as J = 2, 4, or 8 spatiotemporal regions with extents The recursive function starts with m = 0 and produces a list (containing a tree structure) of regions, their spatiotemporal extents, knot locations, and at the lowest level the observations. For spatiotemporal prediction, the corresponding locations are furthermore added as knots to regions in the lowest level.

Spatiotemporal Covariance Functions
While the selection of covariance models in space mostly concerns a few predefined functions, the selection of appropriate models in space and time is generally more difficult. A straightforward approach referred to as the metric model is to interpret spacetime as three-dimensional Euclidean space, scale the temporal coordinates by an anisotropy factor and use the same covariance functions as for purely spatial models. Other simple models may be simple products and sums of purely spatial, purely temporal, and metric models (see Grler et al., 2016, for a comprehensive discussion of frequently used models and parameterizations). For example the separable model is a simple product of a purely spatial and a purely temporal covariance functions with computational advantages in the decomposition of their covariance matrices using the Kronecker product.
One advantage of the MRA approach is that covariance functions can be completely defined by users. As such, users simply need to implement a function that creates a valid covariance matrix from two given sets of locations. As a result, simple models can be directly integrated into the MRA approach without any additional work. On a global scale, however, these models make unrealistic assumptions (see Section 1).
To make the approach applicable on spherical distances and add nonseparability, we have integrated some of the covariance functions suggested in Porcu et al. (2016). To incorporate nonstationarity, we have integrated a kernel convolution approach (Stein, 2005;Risser and Calder, 2015;Paciorek and Schervish, 2006), where the variance, smoothness, and anisotropy of the covariance model vary spatially. In our implementation (Section 2.4), we represent these processes as a linear combination of Gaussian radial basis functions centered at a few spatiotemporal locations. As an alternative, it would be possible to apply space and/or time deformations (Sampson and Guttorp, 1992) to yield nonstationary covariances in the MRA approach.

Parameter Estimation and Prediction
We use the original MRA formulation for likelihood-based inference from Katzfuss (2017) and omit details here. To estimate non-negative parameters of covariance functions, one can use bound constrained optimization algorithms such as L-BFGS-B (Byrd et al., 1995) or BOBYQA (Powell, 2009), or use general purpose optimization algorithms to estimate parameters on logarithmic scale.
Since prediction with MRA is known to produce artefacts at region boundaries, we furthermore implement a simple averaged prediction using multiple partitions. Splits in each dimension can be shifted by a fraction of the size of regions at the lowest partitioning level towards lower and upper values, yielding 9 different partitions. Computation times increase accordingly.

Simulation Study
We apply the described spacetime MRA approach on a simulated dataset and evaluate how the MRA parameters affect quality of predictions and parameter estimations as well as computation times. Using the R package RandomFields (Schlather et al., 2015), we simulated a threedimensional regular spacetime grid with size n = 50 × 50 × 50 = 125000, following a metric spatiotemporal model with an exponential covariance function (range = 0.2, partial sill = 0.05, spatiotemporal anisotropy = 0.02, nugget = 0.05). Figure A.1 shows the first 20 time slices of the dataset.
To assess the performance of the spatiotemporal MRA, we perform parameter estimation for different values of M and r and compare obtained values to the true values. For assessing prediction performance, we assume the true parameters to be known and predict 90% randomly selected pixels again for different values of M , and r. We calculate several diagnostic scores, including root-meansquare prediction error (RMSE), mean absolute error (MAE), the fraction of true values lying within a µ ± 2σ prediction interval (COV2SD), and measure computation times. Predictions for the same parameter values have been repeated three times each. Due to the randomness in the selection and shifting of knots (see Section 2), results are not exactly identical. To find the parameters with maximum MRA likelihood, we apply the L-BFGS-B optimization algorithm with starting values and bounds as presented in Table 1.

Parameter Estimation Results
Figure 2 shows parameter estimates against the true values. The worst estimates have been obtained for M = 4 and r = 8. In general, the quality of estimations seems relatively independent of both M and r, although for M = 4, estimates of the spatial range and partial sill tend to be slightly worse for low r.
We have also performed traditional experimental variogram estimation using the gstat R package (Grler et al., 2016;Pebesma, 2004). After using a large number of different initial parameter values (including the true values) for fitting the theoretical variogram, the blue horizontal dashed lines in Figure 2 show the estimates closest to the true values. Most MRA estimations are closer or at similar distance to the true values than the best solution of experimental variogram fitting, which also produced a large number of much worse estimates. For example, the median variogram estimate is 0.32 for the nugget effect and 0.46 for the spatial range, i.e., worse than the worst MRA estimates. The likelihood-based MRA fitting seems more robust and traditional variogram fitting furthermore may be impracticable for more complex nonstationary covariance models.
In terms of computation times, the MRA approach is naturally much slower than traditional variogram estimation. Single evaluations of the MRA likelihood took between 11 seconds (M = 4, r = 8) and 40 minutes (M = 2, r = 32), where the number of likelihood evaluations ranged from 95 to 864 until convergence. For more complex models with many more parameters, even more iterations might be needed. Interestingly, it was possible to achieve similar prediction errrors with M = 3 and M = 2 when using more basis functions for M = 3 with very similar computation times too. Figure  3 furthermore shows achieved R 2 values normalized by computation times as a simple indicator for the speed or efficiency of predictions, where M = 3 is generally fastest. This suggests a practical strategy how to select the MRA parameters by finding the most efficient M first (e.g. by starting at the M where regions in the lowest level contain ≈ 100 observations), and increasing r as computation times allow afterwards.

Prediction Results
Traditional Kriging interpolation using the R package gstat (Pebesma, 2004;Grler et al., 2016)  yield RMSE values that were 1.17% better than the best MRA result (M = 1, r = 128) and 6.64% better than the fastest (M = 3, r = 8) result, with computation times around 12 hours. In all cases, MRA predictions provide relative good estimations of the variances, such that around 95.45% of the true values lie within a µ ± 2σ prediction interval. As r increases these values tend to be more similar even for different values of M .

Applications to Real World Data
Below, we demonstrate how the spacetime MRA approach can be applied on two global spatiotemporal datasets and how more complex nonstationary covariance models with nonstationarity and spherical (Great circle) distances can be integrated.

Daily Sea Surface Temperatures
We apply the MRA approach on global daily sea surface temperature (SST) measurements from the Moderate-resolution Imaging Spectroradiometer (MODIS) 3 , recorded between 2018-04-10 and 2018-05-09. To keep reasonable computation times in this example, we averaged original pixels (with approximate size of 9km by 9km) to a 1 • ×1 • grid using the geospatial data abstraction library (GDAL) (Warmerdam, 2008). The resulting dataset contains gaps and sums to n = 637770 valid observations. Following Zammit-Mangion and Rougier (2019), we first fit a quadratic polynomial over latitude and apply MRA models on the residuals. Figure 4 illustrates mean residuals of the models over pixel time series and Figure A.2 in the Appendix shows the full dataset. After some initial experimentation, we choose M = 4, vary the number of basis functions, and apply the following three different covariance models, where θ is a vector of parameters, ∆ s , ∆ t represent spatial and temporal distances between any two locations (either Euclidean or spherical), and 1(·) is the indicator function:

M3:
A model similar to M1 that introduces nonstationarity in the spatial covariance function using the kernel convolution approach (Paciorek and Schervish, 2006;Risser and Calder, 2015), with standard deviation and geometric anisotropy (defined by length of the east-west and south-north axes) processes varying by latitude as a linear combination of Gaussian radial basis functions centered at (pseudo) latitudes ±110 • , ±82.5 • , ±55 • , ±27.5 • , and 0 • . Let s 1 = (λ 1 , φ 1 , t 1 ) T and s 2 = (λ 2 , φ 2 , t 2 ) represent latitude, longitude, and time coordinates of two points. The model can be defined as: −82.5, −55, −27.5, 0, 27.5, 55, 82.5, 110) For all models, parameters are first estimated using the same 100000 randomly selected observations before we use two different strategies to validate prediction performance. First, we use the  same 100000 observations to predict the remaining data and, at the same time, produce gap-free image time series. We refer to this validation strategy as random validation. Second, we leave out and predict three spatiotemporal regions of size 20 • × 20 • × 9 days in the lower, middle and higher latitudes to assess large scale prediction performance (referred to as block validation, see Figure  4). We calculate the root-mean-square prediction error (RMSE), mean absolute error (MAE), and count how many of the true observations lie within a µ ± 2σ prediction interval (COV2SD). Table 2 shows the obtained prediction scores and computation times. In general, the nonstationary model (M3) performs best in terms of RMSE and MAE. Differences are stronger for block validation where larger missing regions are predicted. The nonseparable spherical model and the simple separable model present very similar scores, with marginal advantages of the simple model for block validation. Interestingly, models with less basis functions in the upper partitioning levels here tend to perform slightly better. Figure 5 shows mean and variance predictions of the nonstationary model with r = 32 at three different days. Compared to the stationary models, uncertainty estimates of the nonstationary model tend to be higher, with strongest differences at the northern latitudes. Figure 6 furthermore illustrates the spatially varying standard deviation and anisotropy processes. For different values of r, the model shows rather similar curves.
Computation times of predictions vary considerably with model complexity. As such, random validation with the nonstationary models containing 30 parameters in the covariance function took approximately 5 to 7 times longer than with the simple model. However, these differences become smaller as r increases and also for block validation, where computations with the nonstationary models were even faster. For parameter estimation, the number of needed MRA likelihood evaluations generally varies for different initial values, M , r, and was also larger for the nonstationary model with more parameters. As a result, computation times of parameter estimation are more difficult to estimate although one may set limits on convergence criteria.

Daily Precipitation Measurements
As a more difficult scenario, we apply the same models to precipitation data from integrated multi-satellite retrievals (IMERG) (Huffman et al., 2015) of the Global Precipitation Measurement Mission (GPM) (Hou et al., 2014). The used dataset contains daily accumulated liquid precipitation recorded between 2018-08-01 and 2018-08-30 at latitudes in the range [60 • S, 60 • N ]. As in the sea surface temperature example, we average pixels to a size of 1 • × 1 • to keep reasonable computation times. The dataset contains 360 × 120 × 30 = 1.296e6 measurements.
Daily precipitation is highly non-Gaussian, i.e., nonnegative and zero-inflated, which is challenging for prediction based on Gaussian processes (see approaches in Beek et al., 1992;Kleiber et al., 2012). This example here does not try to overcome these issues but is rather an optimistic, somewhat naive, practical test of whether more complex covariance models may improve prediction of such data. As a first step, we apply a Box-Cox power transform with parameters λ 1 = 0.03 and λ 2 = 0.001 and afterwards subtract transformed values from the mean. We then again select 100000 observations randomly, which we use for model fitting and predicting other observations (random validation). However, we apply a different block validation here where we simply alternately leave out the left and right halves of images. We run each model again with different numbers of basis functions and evaluate prediction scores and computation times as in the SST example.   Table 3 shows the overall results. As expected from the non-Gaussian nature of the data, prediction scores tend to be rather poor. In general, the stationary nonseparable model on spherical distances with r = 64 tends to perform best, although for block validation, M1 with r = 32 provides slightly better uncertainty estimates and M3 with r = 16 has a slightly better MAE score. Noticeably, random validation performance of M3 changes strongly with the number of basis functions. At r = 64, results tend be similar to the best performing model, whereas at r = 16 scores are weakest. This can be explained by stronger differences in the estimation of the spatially varying standard deviation and anisotropy processes (see Figure 7). The computational behavior is largely in line with the sea surface temperature data and we omit a further discussion here.

Discussion
The presented experiments show that the MRA approach can be successfully applied to spatiotemporal datasets of a size prohibitive for traditional geostatistical approaches. As in Katzfuss (2017), the simulation study suggests that the MRA parameters M and r strongly affect computation times and prediction performance and can be used to trade off computational effort against accuracy. However, no strong effect of MRA parameters on parameter estimation results has been found except that results start to become worse for higher values of M . The practical examples have furthermore shown that it is possible to apply the approach to global datasets, with more complex nonstationary covariance models. Below, we discuss further results, limitations, and future research directions.

Scalability
The experiments showed that the order of n ≈ 10 6 is approachable on moderately powerful machines of individual researchers but larger problems still require distributed computing environments. Huang et al. (2019) provide a distributed implementation of the MRA approach and demonstrate scalability in an example with n > 10 7 . Considering available global spatiotemporal datasets, one might still need to reduce the spatial resolution or strongly subsample original data as in the examples. For specific practical applications, future research may concentrate on questions such as which spatial and temporal resolution is acceptable, or whether simple models on original data or complex models on subsampled data should be preferred.
From a practical point, the most time consuming part is parameter estimation, requiring repeated evaluations of the MRA likelihood in numerical optimization. More parameters as in the nonstationary models generally require more iterations. However, even for the same covariance models, the required number of iterations varied strongly e.g. for different numbers of basis functions used, or different initial values, which makes computation times for parameter estimation difficult to predict in advance.
Since the MRA approach replaces the decomposition of one huge covariance matrix by the decomposition of many smaller matrices, future work might also try to improve computation times on accelerated hardware such as graphics processing units (Henneböhl et al., 2011).

Choosing MRA Parameter Values
The simulated experiments demonstrate that the effect of the number of basis functions is stronger on prediction performance than on parameter estimation (see Figures 2 and 3). This suggests a practical strategy to select M and r, where M is selected first (e.g. starting with M such that approximately 100 observations are available in regions on the lowest partitioning levels), using a relatively small r for parameter estimation and increasing r for prediction depending on available computational resources, time, and needed accuracy. The presented approach uses a rather simple regular partitioning of space and time. Since the integration of more complex spatiotemporal partitioning strategies using hierarchical discrete global grids is possible, experiments on the effect of partitioning on model efficiency would be interesting future work.

Covariance Model Selection
Though the effect of different covariance models on prediction scores was relatively small in the real world data examples, there is wide consensus that isotropy, stationarity, and separability assumptions are not very realistic for global scale datasets (see for example Schmidt and Guttorp, 2020;Zammit-Mangion and Rougier, 2019).
The MRA approach is very flexible in the selection of covariance models, which can be completely defined by users. Though this opens up a lot of possibilities such as spatially and/or temporally varying anisotropy, nonseparability, nonstationarity and the use of spherical distances, the selection of appropriate spacetime covariance models is not straightforward in practical applications. An existing challenge, where alternative approaches such as the SPDE approach (Lindgren et al., 2011) or FRK (Cressie and Johannesson, 2008) have advantages is to combine nonstationarity with spherical covariance models. Extensions to the work in Porcu et al. (2016) and Porcu et al. (2018) adding nonstationarity are still needed. In the present form, one may directly make use of chordal distances, the kernel convolution approach, and a spatiotemporal metric model to yield a flexible class of models.

Comparison to Other Approaches
All in all, there are still only very few approaches that can be directly used to model spatiotemporal Gaussian processes on large global scale datasets. Compared to naive local approaches based on small neighborhoods of observations, the MRA approach may provide better predictions of large missing blocks of data and their uncertainties, and is able to infer parameters of global models, which may have physical interpretations. It combines advantages of maximum likelihood estimation and complex nonstationary spatiotemporal models, for which moments-based estimation may become impracticable. Similar to Nychka et al. (2015) and Zammit-Mangion and Rougier (2019), the MRA approach uses a basis function representation with generally more basis functions than available data to incorporate small and large scale spatiotemporal variations. At the same time, inference can be efficiently scaled in distributed computing environments (Katzfuss, 2017;Huang et al., 2019). The approach furthermore allows to experiment with arbitrary covariance functions, although it is still open how to add nonstationarity for spherical distances, which other approaches (Lindgren et al., 2011;Cressie and Johannesson, 2008) include implicitly. Furthermore, an extension of the approach to handle non-Gaussian data would be needed e.g. for modeling daily precipitation. As such, integration into Bayesian hierarchical models would be possible. A systematic comparison of related spatiotemporal approaches in the style of Heaton et al. (2019) would be very helpful to identify more advantages or disadvantages, to help users with the selection of an appropriate method, and to document, how their parameters allow to trade off computation times against accuracies in parameter estimation and spatiotemporal prediction. Furthermore, first attempts to integrate deep neural networks in spatial statistics e.g. to model space deformations for nonstationarity , or to learn the temporal dynamics of spatiotemporal processes with a convolutional neural network (CNN) seem very promising and may improve the modeling of complex spatiotemporal dependencies.

Conclusions
The paper demonstrated how the multi-resolution approximation approach (Katzfuss, 2017) can be successfully used for spatiotemporal modeling of large global environmental datasets. Especially for satellite-based measurements, making use of temporal correlations can help to predict frequent larger missing spatial regions (e.g. due to cloud-cover), or to include spatiotemporal correlations in more complex models, e.g., for analyzing interactions between different environmental variables. The selection of the MRA parameters allows to effectively trade off computation times against prediction performance while at the same time allowing for flexible choice of models that include nontstationary, spatiotemporally varying anisotropy, standard deviation and smoothness.

Code and Data Availability
R scripts and data to reproduce the experiments are available at https://github.com/ appelmar/stmra_supplement. The stmra R package is available at https://github.com/ appelmar/stmra.