Towards a complete picture of stationary covariance functions on spheres cross time

: With the advent of wide-spread global and continental-scale spatiotemporal datasets, increased attention has been given to covariance functions on spheres over time. This paper provides results for stationary covariance functions of random ﬁelds deﬁned over d -dimensional spheres cross time. Speciﬁcally, we provide a bridge between the characterization in Berg and Porcu (2017) for covariance functions on spheres cross time and Gneiting’s lemma (Gneiting, 2002) that deals with planar surfaces. We then prove that there is a valid class of covariance functions similar in form to the Gneiting class of space-time covariance functions (Gneiting, 2002) that replaces the squared Euclidean distance with the great circle distance. Notably, the provided class is shown to be positive deﬁnite on every d -dimensional sphere cross time, while the Gneiting class is positive deﬁnite over R d × R for ﬁxed d only. In this context, we illustrate the value of our adapted Gneiting class by comparing examples from this class to currently established nonseparable covariance classes using out-of-sample predictive criteria. These compar- isons are carried out on two climate reanalysis datasets from the National Centers for Environmental Prediction and National Center for Atmospheric Research. For these datasets, we show that examples from our covariance class have better predictive performance than competing models.


Introduction
In recent years, there has been a sharp increase in the prevalence of global or continental-scale spatiotemporal data due to satellite imaging, climate reanalyses, and wide-spread monitoring networks. Although Earth is not exactly spherical (it flattens at the pole), it is commonly believed that the Earth can be well approximated by a sphere (Gneiting, 2013;Castruccio and Stein, 2013). With the goal of modeling data over large spatial scales, while accounting for the geometry of the Earth, there has recently been fervent research on modeling and inference for random fields on spheres as well as on spheres cross time. Recent examples provide a comprehensive overview of these topics, including Gneiting (2013), Jeong and Jun (2015), Porcu, Bevilacqua and Genton (2016), Berg and Porcu (2017), and Porcu, Alegría and Furrer (2018).
Under Gaussianity, the covariance function is core to spatiotemporal modeling, inference, and prediction. Covariance functions are positive definite and showing that a candidate function is positive definite over spheres cross time often requires mathematical tools from harmonic analysis. Following the works of Schoenberg (1942) and Gneiting (2013) on spheres, the mathematical characterization of covariance functions on spheres cross time has been given by Berg and Porcu (2017). In addition, Porcu, Bevilacqua and Genton (2016) provide examples of covariance functions for practitioners. As a special case of these covariance classes, some have adapted these classes for temporal models on circles (one-dimensional spheres) to account for seasonal patterns in temporal autocorrelation (see Shirota and Gelfand, 2017;White and Porcu, 2019). Generalizations in the area of mathematical analysis include Peron (2016a,b, 2017) and Barbosa and Menegatto (2017).
For a random field on R d ×R with stationary covariance function C : R d ×R → R, Gneiting (2002) showed the following characterization: if C is continuous, bounded, symmetric, and integrable (over R d ), then C is a covariance function if and only if the function C ω : R → R, defined by is a covariance function for almost every ω ∈ R d . Here, i is the unit imaginary number and is the transpose operator. This characterization has been the crux of many important results in spatiotemporal covariance modeling. Examples include the Gneiting class (Gneiting, 2002), Schlather's generalized class (Schlather, 2010), component-wise anisotropic covariances (Porcu, Gregori and Mateu, 2006), multivariate geostatistical modeling (Apanasovich and Genton, 2010), quasi-arithmetic construction (Porcu, Mateu and Christakos, 2010) and nonstationary models (Porcu, Mateu and Bevilacqua, 2007). For a given positive integer d, (1) proves the validity of the Gneiting class of covariance functions: where · is the Euclidean norm. The function ϕ : [0, ∞) → R + is completely monotonic; that is, ϕ is infinitely differentiable on (0, ∞), satisfying (−1) n ϕ (n) (t) ≥ 0, n ∈ N. Here, ϕ (n) denotes nth derivative and we use ϕ (0) for ϕ, where ϕ(0) is required to be finite. The function ψ : [0, ∞) → R + is strictly positive with a completely monotonic derivative. Here and throughout, σ 2 is used to represent the spatiotemporal variance; that is, a scaling factor of a spatiotemporal correlation function. We also note that the function C in (2) is positive definite in R d × R for a given positive integer d, but it is not positive definite on every d-dimensional Euclidean space cross time.
Our paper focuses on two aspects of covariance modeling on d-dimensional spheres cross time. We first focus on Criterion (1) and its analogue on spheres over time. Our result provides an additional equivalence condition to those provided in Berg and Porcu (2017). We then provide an adaptation of the Gneiting class (2) to spherical domains, and show that it is positive definite over all ddimensional spheres (including the Hilbert sphere) cross time. Further, our proof is based on direct construction, allows us to avoid Fourier inversion, and does not require a convergence argument that was originally used by Gneiting (2002). Porcu, Bevilacqua and Genton (2016) considered a variant of this problem, modifying the Gneiting class based on temporal rescaling of the spatial component. This idea was also suggested by Gneiting (2002). In addition to a new Gneiting class for spheres over time, we adapt Heine's class of covariance functions (Heine, 1955), originally proposed over two-dimensional Euclidean spaces, to d-dimensional spheres cross time.
For estimation and prediction with the new covariance class, (5), we take a Bayesian approach using nearest neighbor Gaussian processes (NNGP) (Datta et al., 2016a,b). Bayesian models allow for simple and rigorous uncertainty quantification through a single probabilistic framework that does not rely on asymptotic assumptions. Because Gaussian process (GP) models for large datasets, as we have with globally sampled spatiotemporal data, are often computationally intractable, we use the NNGP as a surrogate. Modeling with the NNGP enables scalable model fitting, inference, and prediction for real-data examples. Our discussion here adds to application areas for NNGPs as they have not been used for global data in the literature.
For our data examples, we use daily near-surface temperature and cloud coverage from the first week of 2017 (Kalnay et al., 1996). We only use the first week to keep computation times short. To be clear, we do not claim that covariance functions from our new covariance classes are preferable for all datasets. Indeed, for some datasets that we tested, but that we do not present here, the new covariance functions in this manuscript showed little or no predictive advantage. However, we highlight these datasets because they show that our new Gneiting class yields practical predictive benefits in some cases.
We start by giving background (Section 2) for the theoretical results given in Section 3. In Section 3, we provide the analogue of (1) for covariance functions over spheres cross time. Additionally, we adapt the Gneiting class (2) to spheres cross time and show that, using a subclass of completely monotonic functions, the adapted Gneiting class can be used on all d-dimensional spheres cross time. Then, we provide an adaptation of the Heine covariance function, originally proposed in R × R, to spheres cross time. Proofs of the theoretical results are technical and are deferred to the Appendix A. We also provide a supplementary result in Appendix A related to our main result in Section 3. We then turn our attention to modeling data using covariance functions from our adapted Gneiting class in Section 4. In Section 5, we draw upon our modeling discussion for simulation studies and real data analyses. In our simulation studies, we explore parameter identifiability for examples from our adapted Gneiting covariance class and highlight some limitations. In our data examples, covariance functions from our adapted Gneiting covariance class have better out-of-sample predictive performance than covariance models currently in the literature, using mean absolute error, mean squared error, and continuous ranked probability scores as model comparison criteria. Finally, e provide concluding remarks in Section 6.

Background
Let d be a positive integer. Here, we consider stationary Gaussian random fields on d-dimensional unit spheres S d cross time (in R), where S d is defined to be {s ∈ R d+1 : s = 1}. We use the unit sphere without loss of generality. These random fields are denoted {Y (s, t), s ∈ S d , t ∈ R}. We assume Gaussianity in modeling (Section 4) which implies that finite dimensional distributions are completely specified by the mean and covariance function of the random field.
As a metric on S d , we use the great circle distance θ : We then consider covariance functions based on θ(s 1 , s 2 ) and time difference where we take θ as an abbreviation for θ(s 1 , s 2 ). Porcu, Alegría and Furrer (2018) refer to such covariance functions as spatially geodesically isotropic and temporally symmetric, and Berg and Porcu (2017) provide a mathematical characterization for these functions. Covariance functions are positive definite, meaning that for any collection It is worth noting that classes of positive definite functions on S d × R are nested, meaning that positive definiteness on S d ×R implies positive definiteness on S d × R for d < d, but the converse is not necessarily true. Porcu, Bevilacqua and Genton (2016) proposed the inverted Gneiting class and define it as with ϕ and ψ as defined in (2), and where ψ [0,π] denotes the restriction of ψ to the interval [0, π]. In contrast to (2) which scales Euclidean distance by a function of the temporal lag, (4) rescales the temporal lag by a function of the great circle distance. This was also mentioned in Gneiting (2002). It might be more intuitive to rescale space with time, as was done in (2), proposing a structure like where, in this case, we do not need to restrict any of the functions ϕ and ψ to the interval [0, π]. Also, one might note that the function ψ is not raised to the power d/2 as in (2). Showing this construction is valid is nontrivial and receives an exposition in Section 3.2. One choice of ϕ, used to construct covariance functions in (2) and (4) is the Matérn class ϕ(t) = M α,ν (t), t ≥ 0, α, ν > 0, defined as where K ν is the MacDonald function (Gradshteyn and Ryzhik, 2007). One appeal of this class is the parameter ν that governs the smoothness at the origin (Stein, 1999). Unfortunately, M α,ν (θ) is not positive definite on d-dimensional spheres, unless ν ∈ (0, 1/2] (Gneiting, 2013), which makes this function less appealing to model spatial processes that are sufficiently smooth.

The generalized Gneiting lemma on spheres cross time
We begin our discussion with a criterion for covariance functions defined over d-dimensional spheres cross time. Let G λ k be the kth normalized Gegenbauer polynomial of order λ > 0 (Dai and Xu, 2013). Gegenbauer polynomials form an orthonormal basis for the space of square-integrable functions, denoted as L 2 ([0, π], sin θ d−1 dθ).
Then, the following assertions are equivalent: is the covariance function of a random field on S d for almost every τ ∈ R; 3. for all k = 0, 1, 2, . . ., the functions b k,d : R → R, defined through (7), are continuous, positive definite on R, and k b k,d (0) < ∞.
Some comments are in order. Equivalence of 1 and 3 was shown by Berg and Porcu (2017). The result completes the picture that had been started by Gneiting (2013), Berg and Porcu (2017) and Porcu, Bevilacqua and Genton (2016). Equivalence of 1 and 2 provides the analogue of Gneiting's criterion in (1) for spheres cross time. Thus, Theorem 1 gives insight into relationships between covariance functions on spheres and covariance functions on Euclidean spaces. In fact, application of Theorem 1 provides a useful criterion (see Appendix A) relating spatiotemporal covariances on R×R with covariance functions on S 3 ×R.
The proof of Theorem 1 is technical, and we defer it to Appendix A to avoid mathematical obfuscation.

New classes of covariance functions on spheres cross time
We now detail our findings with two new classes of covariance functions on spheres over time. To do this, we need to introduce a new class of special functions. A function ϕ : [0, ∞) → R is called a Stieltjes function if where μ is a positive and bounded measure. We require throughout ϕ(0) = 1, which implies that ξ −1 μ(dξ) = 1. Let us call S the set of Stieltjes functions. It has been proved that S is a convex cone (Berg, 2008), with the inclusion relation S ⊂ C, where C is the set of completely monotone functions. The relation (9) shows that the function t → (1+t) −1 , t ≥ 0, is a Stieltjes function. Using the fact that ϕ ∈ S if and only if 1/ϕ is a completely Bernstein function (for a definition, see Porcu and Schilling, 2011) we can get a wealth of examples, as the book by Schilling, Song and Vondracek (2012) provides an entire catalogue of completely Bernstein functions. We finally note that completely Bernstein functions are infinitely differentiable over (0, ∞) and have a completely monotonic derivative.
We are now able to state the following result.
Theorem 2. Let C : [0, π] × R be the mapping defined through (5), where ϕ is a Stieltjes function on the positive real line, with ϕ(0) = 1, and ψ is strictly positive with a completely monotone derivative. Then, C is a covariance function on S d × R for all positive integers d.
Again, the proof is deferred to Appendix A. This result completes the adaptation of the Gneiting class (Gneiting, 2002) to spheres cross time. Theorem 2 allows C in (5) to be positive definite on every d-dimensional sphere under the condition that the function ϕ is a Stieltjes function. As already mentioned, the class is rich, and there is a whole catalogue available from the book by Schilling, Song and Vondracek (2012). In addition, our proof in Appendix A does not require any Fourier inversion techniques, nor convergence arguments as those used in Gneiting (2002).
We also note that the Matérn function M α,ν cannot be used for the purposes of Theorem 2. This is not surprising, as arguments in Gneiting (2013) show that the Matérn covariance function in (6) can only be used in Theorem 2 for 0 < ν ≤ 1/2. If one is interested in smoother realizations over spheres, then a common method involves using the Euclidean distance on spheres (Gneiting, 2013;Porcu, Alegría and Furrer, 2018), also called chordal distance, in (2). In this case, any choice for ν > 0 preserves positive definiteness. At the same time, using chordal distance has a collection of drawbacks that have inspired constructive criticism in Banerjee (2005), Gneiting (2013), Porcu, Alegría and Furrer (2018) and , to cite a few. We explore both possibilities and compare them in terms of predictive performance in Section 5.
To introduce another class of covariance functions, we define the complementary error function erfc as and erfc(t) = 2 − erfc(−t) when t is negative. We can show the following result.
Theorem 3. Let ψ [0,π] be the restriction to [0, π] of a positive function with a completely monotonic derivative. Then, The class presented in (10) is related to a covariance class on R×R considered by Heine (1955). Again, the proof is provided in Appendix A.

Hierarchical process modeling for spatiotemporal data
We illustrate the utility of one of our covariance classes (Theorem 2) using hierarchical NNGP models in a Bayesian setting. In spatial and spatiotemporal analyses, Bayesian models are often preferred for hierarchical modeling because they allow for simple and rigorous uncertainty quantification through a single probabilistic framework that does not rely on asymptotic assumptions (see, e.g., Gelman et al., 2014;Banerjee, Carlin and Gelfand, 2014;Cressie and Wikle, 2015).
Spatiotemporal random effects for point-referenced data are often specified through a functional prior using a Gaussian process (GP). Gaussian processes are natural choices for modeling data that vary in space and time. However, likelihood computations for hierarchical GP models require inverting a square matrix with dimension equal to the size of the data, making GP models intractable in "big-data" settings. Many have addressed this computational bottleneck using either low-rank or sparse matrix methods (see Heaton et al., 2018, for a review and comparison of some of these methods).
In contrast to low-rank methods, inducing sparsity in either the covariance matrix or the precision matrix can reduce the computational burden. Covariance tapering creates sparsity in the covariance matrix by using compactly supported covariance functions (see, e.g., Furrer, Genton and Nychka, 2006;Kaufman, Schervish and Nychka, 2008). These methods are generally effective for parameter estimation and interpolation; however, the allowable class of covariance functions is limited. On the other hand, inducing sparsity in the precision matrix has been leveraged to approximate GPs using Markov random fields (Lindgren, Rue and Lindstroem, 2011) or using conditional likelihoods (Vecchia, 1988;Stein, Chi and Welty, 2004). These approaches were extended to process modeling by Gramacy and Apley (2015) and Datta et al. (2016a). For discussion and further extension of these approaches, see Katzfuss and Guinness (2017). Unlike local approximate Gaussian processes (Gramacy and Apley, 2015), the NNGP is itself a GP (Datta et al., 2016a) and has good predictive performance relative to other "fast" GP methods (See Heaton et al., 2018).
To specify an NNGP, we begin with a parent GP over R d ×R or S d ×R. Nearest neighbor Gaussian process models induce sparsity in the precision matrix of the parent Gaussian model by assuming conditional independence given neighborhood sets constructed from directed acyclic graphs, yielding huge computational benefits (Datta et al., 2016a,b). Modeling, model fitting, and prediction details for NNGP models are given in Appendix B.

Examples of covariance functions
Here, we turn our attention to six nonseparable covariance functions used in simulation studies in Section 5.1 in our data analyses (Sections 5.2 and 5.3). For all examples, we parameterize the models with variance σ 2 and use c s and c t to represent the strictly positive spatial and temporal scale parameters, respectively.

Model comparison
To compare models that differ in terms of covariance specification, we propose the following criteria for comparing predictions to hold-out data y i : 90% predictive interval coverage, predictive mean square or absolute error (defined as where Y obs denotes observations. Besides these common criteria, we also use a strictly proper scoring rule (Gneiting and Raftery, 2007), the continuous ranked probability score (CRPS), defined as where Y i and Y i follow the predictive distribution F i (see Brown, 1974;Matheson and Winkler, 1976, for early discussion on CRPS).. An empirical estimate of the continuous ranked probability score, using M posterior predictive samples We average continuous ranked probability scores over all hold-out data to obtain a single value for comparison.

Simulation and data examples
Using only covariance functions (14) and (15), we provide a brief simulation study in Section 5.1 to explore the identifiability of covariance model parameters using an NNGP model. In Sections 5.2 and 5.3, we illustrate practical predictive advantages of the new Gneiting class using spherical distance (Theorem 2) using two climate reanalysis datasets from the National Centers for Environmental Prediction and National Center for Atmospheric Research (Kalnay et al., 1996).
For both analyses, we use the first week of the 2017 dataset.

Simulation study
Here, we present simulation studies to explore the empirical identifiability of covariance model parameters. To do this, we simulate many datasets that differ in their covariance specification, using either (14) and (15). For each covariance function, we fix parameters and generate 1,000 datasets from the following generative model: where observations lie on an evenly spaced grid of latitudes ranging from −90 to −60 (5 • spacing) and longitudes between −180 and 0 (5 • spacing). This grid is repeated from times 1 to 10, giving N = 4330. While we use a full GP specification for simulation and the dataset is not particularly large when fitting a single model, we fit these data using a hierarchical NNGP model with m = 25 neighbors because this mirrors modeling approach in Section 4 and because we fit 1,000 simulated datasets per simulation (6,000 in total). Neighbors are selected using simple rectangular neighborhood sets using great-circle (spherical) distance to define nearness (see Datta et al., 2016b). When simulating from (18) using (14) as the covariance function, we use with σ 2 κ = 4, c s = 0.2, c t = 2, α = 1/2, δ = 1/2, and τ 2 = 1 for the noise term (s, t). For τ 2 and σ 2 , we use inverse-gamma priors with 0.1 as both the shape and scale (corresponding to the rate of a gamma distribution) parameters. We assume c s ∼ Unif(0, π), c t ∼ Unif(0, 10), α ∼ Unif(0, 1], and δ ∼ Unif(0, 1], a priori. When using (15) in the generative model (18), we set σ 2 = 4, c s = 0.2, c t = 2, α = 1, β = 1/2, δ = 3/4, λ = 1, γ = 1/2, and τ 2 = 1 for the noise term (s, t). For simplicity, we constrain δ + β/2 = 1, γ = 1/2, and λ = 1. As before, we use inverse-gamma priors with 0.1 as both the shape and scale parameters for τ 2 and σ 2 . As before, we assume c s ∼ Unif(0, π), c t ∼ Unif(0, 10), α ∼ Unif(0, 2], β ∼ Unif(0, 1], and δ ∼ Unif(0, 1], a priori. We explore the effect of fixing some model parameters to examine how model identifiability is affected. Specifically, we consider fixing combinations of c s , c t , and α in (14) and (15) to the true value. In this setting, we re-fit the models described above, keeping all other specifications the same as described. We present the 90% empirical coverage rates for all parameters in Table 1. In this table, dashes indicate that parameters are fixed. For (14), we see improved coverage rates (i.e., closer to 90%) for σ 2 and α when range parameters c s and c t are fixed; however, δ shows significant under coverage when range parameters are fixed. The results for (15) are similar. When range parameters are fixed, coverage rates for σ 2 and β are closer to 90%. For this covariance model, α has slightly worse coverage when range parameters are fixed. These simulation studies highlight some of the limitations in estimating parameters of covariance functions from Theorem 2. For both (14) and (15), we note that there is limited parameter identifiability, particularly for spatial range parameter c s and spatiotemporal variance σ 2 . Although we do not provide identifiability proofs, this result seems similar to the limited identifiability of the Matérn class that is identifiable up to σ 2 c s −2ν (see Zhang, 2004). Apparently, this issue becomes even more complex under the space-time setting, and the lack of theoretical results for space-time asymptotics and equivalence of Gaussian measures make this problem very difficult. In this simulation, we find improved parameter identifiability when we fix spatial range parameters. We emphasize that the primary goal of our analyses is comparing predictive performance. However, if the unbiased estimation of covariance parameters is the primary goal, then a multi-stage fitting process can improve estimation (Mardia and Marshall, 1984).

Surface air temperature reanalysis data
For this section, we utilize the 2017 National Centers for Environmental Prediction/National Center for Atmospheric Research daily average 0.995 sigma level (near-surface) temperature reanalysis data (Kalnay et al., 1996). Air temperature at 0.995 sigma level is defined to be the temperature taken at an air pressure 0.995 × surface air pressure.
The foundations of global temperature change are well established (see, e.g., Folland et al., 2001;Hansen et al., 2006Hansen et al., , 2010. Furthermore, air temperature changes have, along with other changes in climate, a wide and deep impact on global biological systems (see Parmesan and Yohe, 2003;Thomas et al., 2004;Held and Soden, 2006). For these reasons, many climate models are dedicated to understanding past and predicting future temperature changes (see Simmons et al., 2004, for some discussion and comparisons about the various analyses of surface air temperature).
The daily near-surface temperature reanalysis data represent daily temperature averages over a global grid with 2.5 • spacing for latitude and longitude. We thin the data to 5 • spacing for latitude and longitude to carry out a model comparison on the hold-out data. In this dataset, we have observations at 2522 unique spatial locations, giving 17654 total observations. The averages of nearsurface temperature over the first week of January are plotted in Figure 1. Additionally, we give the density estimate of near-surface temperature for each day in Figure 1. Figure 1 shows that the overall temperature distribution is similar across days and demonstrates a clear spatial structure. Because our covariance model allows space to be scaled by time while using the spherical distance, we expect that our model may be able to capture the strong spatial structure in this data more effectively than the models with which we compare them.
With 17654 data, carrying out fully Bayesian inference using a full GP model is computationally burdensome; thus, we utilize a NNGP model. For these models, we use simple neighborhood selection presented in Datta et al. (2016b) using m = 25 neighbors, using the five nearest neighbors at the five most recent times, including the current time. We utilize two covariance models from each of the Gneiting class, the inverted-Gneiting class using spherical distance, and our new Gneiting class (Theorem 2). These models are presented in (11) to (15). For all models, we use inverse-gamma prior distributions for τ 2 and σ 2 with 0.1 for both the shape and scale parameters (corresponding to the rate parameter of a gamma distribution). We use prior distributions of c s ∼ Unif(0, π), c t ∼ Unif(0, 10), α ∼ Unif(0, 2], and β ∼ Unif(0, 1] for correlation function parameters. Because many covariance models have limited parameter identifiability, we constrain δ + βd/2 = 1 for (11), and δ + β/2 = 1 for (12), (13), and (15).
We compare these six models in terms of predictive performance on a randomly selected subset of the hold-out data. In total, we use 1000 locations over the week, giving 7000 hold-out observations. These hold-out locations are plotted in Figure 5 in Appendix C. We compare these models in terms of predictive root mean squared error, mean absolute error, continuous ranked probability score, and 90% prediction interval coverage, as discussed in Section 4.3. For predictions, neighbors are chosen to make prospective predictions using 25 neighbors (See Appendix B for details on modeling and prediction).
The results presented are based on 25,000 posterior draws after a burn-in of 5,000 iterations using a Gibbs sampler presented in Appendix B. These posterior samples are used for prediction and posterior inference. The results of the model comparison are given in Table 2. For this data, the covariance models from our class (14) and (15) had the best out-of-sample predictive performance, and the model (14) was the very best. For comparison, the Gneiting class using chordal distance had continuous ranked probability scores 7% and 16% higher than the best model for ν = 1/2 and ν = 3/2, respectively. Both models from the inverted Gneiting class with spherical distance were 13% worse in terms of continuous ranked probability scores. For the best model, (14), we provide posterior summaries in Table 3. Additionally, we display the correlation contour as function of spherical distance θ and time t for the posterior mean in Figure 2. Correlation is very persistent as a function of time; thus, decreases in autocorrelation are almost completely attributable to changes in spatial location. We can also compute posterior summaries for σ 2 /(σ 2 +τ 2 ), interpreted as the proportion of total variance attributable to the spatiotemporal random effect. Here, the posterior mean of σ 2 /(σ 2 + τ 2 ) is 0.834. In other words, our selected covariance model accounts for 83.4% of the total variance. In Table 3 and Figure  2, c s and c t suggest that the surface temperature process exhibits persistent correlation over both space and time. Low values of α in (14) add temporal smoothness.

Total cloud coverage reanalysis data
For this section, we utilize the 2017 National Centers for Environmental Prediction/National Center for Atmospheric Research daily average total cloud coverage reanalysis data (Kalnay et al., 1996). Total cloud coverage is defined as the fraction of the sky covered by any visible clouds, regardless of type. Total cloud coverage takes values between 0 and 100, representing a percentage of cloud coverage. Values of total cloud coverage close to 0 indicate clear skies, values from 40 to 70 percent represent broken cloud cover, and overcast skies correspond with 70 to 100 percent.
The degree of cloudiness impacts how much solar energy radiates to the Earth (see, e.g., Svensmark and Friis-Christensen, 1997). Total cloud coverage, like changes in global surface temperature, has been impacted by global climate changes (see, e.g., Melillo et al., 1993;Wylie et al., 2005), and changes in total cloud coverage are linked with many biological changes (see Pounds, Fogden and Campbell, 1999). Thus, tracking, predicting, and anticipating changes in cloudiness have important implications for understanding global climate changes and their effects on ecosystems.
The daily total cloud coverage reanalysis data represent daily averages and are given on a global grid with 1.9 • spacing for latitude and 1.875 • spacing for longitude. The spatial averages of cloud coverage over the first week of January are plotted in Figure 3. This map shows clear spatial variability that suggests that a spatial model is appropriate. We provide density estimates of total cloud coverage for each day of the week in Figure 3. These density estimates show that cloud coverage is similar across days.
Again, we thin the data to 3.8 • spacing for latitude and 3.75 • for longitude to carry out model comparison on hold-out data. In total, we have 4512 unique spatial locations, giving 31584 total observations. With 31584 data, carrying out fully Bayesian inference using a full Gaussian process model is intractable; thus, we utilize a nearest neighbor Gaussian process model. For these models, we use the same neighborhood formulations and fit the same covariance models with the same prior distribution to these data as we did in Section 5.2.
To obtain a test set to compare the six competing models, we randomly select 1000 locations and predict at these locations over the time-span of our data, giving a test set of size 7000. These hold-out locations are plotted in Figure  6 in Appendix C. As before, we compare these models in terms of predictive mean squared error, mean absolute error, continuous ranked probability scores, and 90% prediction interval coverage, as discussed in Section 4.3. The results of the model comparison are given in Table 4.
Again, an example from our new class was best and models in terms of prediction for the total cloud coverage dataset. All competing models were at   least 26% worse in terms of continuous ranked probability score compared to the best model (15).
For the best predictive model in (15), we provide posterior summaries in Table 5. Additionally, we display the correlation contour as function of spherical distance θ and time t for the posterior mean in Figure 4. The scale of the plots in Figure 4 are not the same as Figure 2. Correlation falls off sharply as a function of great circle distance. In this way, the total cloud coverage dataset differs greatly from the near-surface temperature dataset which demonstrated very persistent autocorrelation over space and time. For the total cloud coverage data, spatiotemporal variance σ 2 accounts for 96.40% of the total variance σ 2 + τ 2 (see Table 5). In Table 5, the parameter c t suggests that the surface temperature process exhibits persistent correlation over time; however, as discussed, the parameter c s indicates rapid decay as a function of space. The separability parameter β ∈ [0, 1] is close to one, meaning that the covariance process is nonseparable.

Discussion and conclusion
In this paper, we generalize the Gneiting criteria for nonseparable covariance functions (Gneiting, 2002) in Theorem 1 and present new classes of nonseparable covariance models for spatiotemporally data in Theorems 2 and 3. In a simulation study, we explored the identifiability of covariance parameters for two covariance functions from Theorem 2 and noted that these covariance functions have limited parameter identifiability. However, some of these challenges are remedied by fixing the spatial range parameter. We then illustrate the utility of our new Gneiting-like class using spherical distance through two climate reanalysis datasets from the National Centers for Environmental Prediction and National Center for Atmospheric Research (Kalnay et al., 1996). In these two data examples, covariance models from Theorem 2 outperform similar stationary nonseparable covariance models from Gneiting (2002) and Porcu, Bevilacqua and Genton (2016) using continuous ranked probability scores, root mean squared error, and mean absolute error. As discussed, we do not suggest that covariance functions from our new covariance classes are preferable for all datasets. However, these results highlight the benefit of allowing spatial distance to be scaled by the difference in time and the importance of using the spherical distance relative to Euclidean distance or chordal distance for these datasets.
The result in Theorem 1 presents a key for extending results obtained in Euclidean spaces to spheres cross time. Due to the lack of literature for multivariate cross-covariance models on spheres over time, with the notable exception of , we recommend this as a valuable area of expansion. In addition, the development of nonstationary covariance models for spheres cross time is an important direction for future research.
showing that B d is bounded and continuous. Further, B d is positive definite on R because positive definite functions are a convex cone being closed under pointwise convergence. To complete the result, we need to prove that B d ∈ L 1 (R). This comes from the fact that The proof is completed.
To prove (2) −→ (1), we let C τ as defined through (8) and suppose that C τ ∈ Ψ d a.e. τ ∈ R. By Schoenberg (1942) is nonnegative, where κ > 0 (Berg and Porcu, 2017). Using again Schoenberg (1942) theorem, we can write C τ as Since C τ ∈ Ψ d a.e. τ , this in turn implies ∞ > C τ (0) = k b k,d (τ ) for all τ ∈ R. We now define Apparently B d is nonnegative. Let us now show that B d ∈ L 1 (R). To do so, we note that so that B d ∈ L 1 (R) as asserted. This in turn implies that b k,d ∈ L 1 (R) for all k = 0, 1, . . .. Thus, we can define a function C : and where b k,d (·) = 1/(2π) e i·τ b k,d (τ )dτ is positive definite on R for all n ∈ N. Thus, the proof is completed by invoking Theorem 3.3 in Berg and Porcu (2017) and by verifying that We now prove the implication (2) −→ (3). Since C τ ∈ Ψ d for almost every τ ∈ R, we have that (20) holds. This implies that is positive definite. Summability of the sequence {b k,d (u)} ∞ k=0 at u = 0 follows easily from previous arguments.

Proof of Theorem 2
We start by noting the beautiful formula We now consider the function where (θ, u) ∈ [0, π] × R. This shows that H ξ is positive definite on every ddimensional sphere cross time (R) because the mapping θ → exp(−rθ) is positive definite on every d-dimensional sphere S d (Gneiting, 2013, Theorem 7) and because u → exp(−ξrψ(u 2 )) is positive definite on the real line. Since ϕ ∈ S, we have, using (9), that and this proves the assertion.

Proof of Theorem 3
We make use of the arguments in the proof of Theorem 2, in concert with formula (15) on page 15 of Bateman (1954): for x, t ≥ 0. We now replace x with ψ [0,π] (θ) and t with |u|. Since the composition of the negative exponential with a positive functions having completely monotonic derivative provides a completely monotonic function, we can invoke Theorem 7 in Gneiting (2013) to infer that the mixture above provides, in view of analogous arguments to the proof of Theorem 2, a positive definite function on S d × R for all d. The proof is completed.

Supplementary Theorem for Section 3.1
We now show how Theorem 1 can be useful to understand connections and analogues between the class of positive definite functions R × R and positive definite functions on the circle S 1 cross R.
Proof. Since ϕ is positive definite in R × R, by Lemma 1 in Gneiting (2002) we get that ϕ τ is positive definite in R a.e. τ ∈ R. Additionally, ϕ τ (x) = 0 whenever |x| ≥ π. Call ψ τ the restriction of ϕ τ to [0, π]. By Corollary 3 in Gneiting (2013) we have that the coefficients b k,1 (τ ), τ ∈ R in the Schoenberg expansion of ψ τ , as defined in (20) are nonnegative and strictly decreasing in k for any fixed τ ∈ R. This implies that ϕ τ (θ) is positive definite in S 3 for almost every τ ∈ R. Application of Theorem 1, Assertion 2, shows that C(θ, u) is positive definite in S 3 × R. The proof is completed.

Appendix B: Modeling details for the nearest neighbor Gaussian process
Suppose we begin with a parent Gaussian process over R d × R or S d−1 × R. Nearest neighbor Gaussian processes induce sparsity in the precision matrix of the parent Gaussian process by assuming conditional independence given neighborhood sets (Datta et al., 2016a,b). Let S = {(s 1 , t 1 ), (s 2 , t 2 ), ..., (s k , t k )} of k distinct location-time pairs denote the reference set, where we allow time to act as a natural ordering and impose an ordering on the locations observed at identical times. Then, we define neighborhood sets N S = {N (s i , t i ); i = 1, ..., k} over the reference set with N (s i , t i ) consisting of the m nearest neighbors of (s i , t i ), selected from {(s 1 , t 1 ), (s 2 , t 2 ), ..., For the Gibbs sampler, we need to define an inverse of the neighborhood set, which we call U (s i , t i ). The set U (s i , t i ) consists of all sites that include (s i , t i ) in their neighborhood sets. Along with S, N S defines a Gaussian directed acyclic graph w S with a joint distribution where N is a normal distribution, where C (si,ti),N (si,ti) is a vector of covariances between (s i , t i ) and its neighbors, we define C N (si,ti) to be the covariance matrix for the neighbors of (s i , t i ), and w N (si,ti) is the subset of w S corresponding to neighbors N (s i , t i ) (Datta et al., 2016a). Datta et al. (2016a) extend this Gaussian directed acyclic graph to a GP. This Gaussian process formulation only requires storage of k m×m distance matrices and requires many fewer floating point operations than full Gaussian process models (see Datta et al., 2016a). Like any other GP model, the NNGP can be utilized hierarchically for spatiotemporal random effects. In this article, we use NNGP as an alternative to the full Gaussian process specification. We envision our model taking the following form: Y (s, t) = x(s, t) β + w(s, t) + (s, t), (23) w(s, t) ∼ NNGP (0, C((s, t), (s , t ))), where Y (s, t) is a spatiotemporal process measured (with error), x(s, t) are p spatiotemporal covariates, and δ b a is the Kronecker delta function. We define C((s, t), (s , t )) using a covariance model discussed in Section 2 or Section 3. We recommend using inverse gamma (IG) prior distributions for τ 2 from the pure error term and σ 2 from the covariance function because this selection gives closed form full conditional distributions. If the outcomes and covariates are centered, then an intercept is unnecessary. If covariates are not available, then x(s, t) β is replaced with μ. Here, for more compact notation, we index space-time location pairs with i as (s i , t i ) and refer to corresponding outcomes, covariates, and spatiotemporal random effects as y i , x i , and w i , respectively.
The prior mean and variance for regression coefficients β are m β and V −1 β , respectively. Additionally, let a V and a τ be shape parameters for the inverse gamma prior distributions for σ 2 and τ 2 . Similarly, let b V and b τ be scale parameters (corresponding to rate parameter of the gamma distribution) for the inverse gamma prior distributions for σ 2 and τ 2 .
The full conditional distributions for the Gibbs sampler, which we denote · | · · · , are To express V * wi and m * wi , we must define some additional terms. First, we let B (s ,t ),(si,ti) be the scalar in B (s ,t ) corresponding to (s i , t i ). Second, we define a (s ,t ),(si,ti) = w(s , t ) − (sj ,tj )∈N (s ,t ),(sj ,tj ) =(sj ,tj ) B (s ,t ),(si,ti) w(s i , t i ).
The parameters of the full conditional distributions are as follows:  (si,ti) w N (si,ti) ) (w i − B (si,ti) w N ((si,ti)) )/F (si,ti) Prediction at an arbitrary location and time requires selection of m nearest neighbors from the reference set S for that location-time pair. We discuss two ways of selecting m neighbors. In theory, any location-time pair from the reference set can be selected as a neighbor for any prediction. If we allow predictions to depend on data occurring after the prediction time, then we call this a retrospective prediction since such a prediction could only be made retrospectively. On the other hand, a prospective prediction limits neighbor selection to elements of S that occur at the same time or prior to the time of prediction. Datta et al. (2016b) selects neighbors for prospective predictions, and we do the same in our analyses.
Predicted spatiotemporal random effects at location-time pairs follow a conditional normal distribution, where conditioning is limited to its neighbors. For any space-time pair (s, t), the conditional distribution of the random effect is Then, the posterior prediction Y (s, t) | Y is x(s, t) β + w(s, t) + i (s, t), where, in practice, posterior samples of β, w(s, t), and τ 2 are used to sample from the posterior predictive distribution. Predictions at hold-out location-time pairs can be used to compare competing models.

Appendix C: Locations of hold-out data from Section 5
Here, we provide the locations of hold-out data used for model validation. The locations for hold-out air temperature are given in Figure 5. The locations for hold-out cloud coverage are in Figure 6.