Spatially Varying Anisotropy for Gaussian Random Fields in Three-Dimensional Space

Isotropic covariance structures can be unreasonable for phenomena in three-dimensional spaces such as the ocean. In the ocean, the variability of the response may change with depth, and ocean currents may lead to spatially varying anisotropy. We construct a class of non-stationary anisotropic Gaussian random fields (GRFs) in three dimensions through stochastic partial differential equations (SPDEs), where computations are done using Gaussian Markov random field approximations. The approach is evaluated in a simulation study that explores the amount of data required to estimate these models. The method is then applied to construct a GRF prior on an ocean region outside Trondheim, Norway, based on simulations from the complex numerical ocean model SINMOD. This GRF prior is compared to a stationary anisotropic GRF using in-situ measurements collected with an autonomous underwater vehicle, and our approach outperforms the stationary anisotropic GRF for real-time prediction at unobserved locations.


1 Introduction
Gaussian random fields (GRFs) are a powerful tool for spatial and spatio-temporal geostatistical modeling (Diggle et al., 1998; Cressie and Wikle, 2015). When the key goal is prediction at unobserved locations, i.e., kriging, isotropic covariance functions often perform well, and more flexible covariance structures should be used with care (Fuglstad et al., 2015b). However, the screening effect in kriging (Stein, 2002) is not relevant in settings where the primary goal is the estimated covariance structure itself, for example, to describe internal variability in a climate model ensemble (Castruccio et al., 2019), or to produce a spatial prior based on numerical simulations that will later be used to guide autonomous sampling (Fossum et al., 2021; Foss et al., 2021). For the former, Fuglstad and Castruccio (2020) and Hu et al. (2021) demonstrated that flexible covariance structures can perform better than stationary covariance structures.
There are many approaches to constructing flexible covariance structures (Sampson, 2010; Salvaña and Genton, 2021; Schmidt et al., 2011). Some early approaches are the deformation method (Sampson and Guttorp, 1992) and kernel convolutions (Paciorek and Schervish, 2006), but both involve the covariances between all pairs of locations. This means standard implementations are infeasible for large datasets. There are many ways to overcome such computational issues in spatial statistics, and some are applicable to flexible covariance structures (Heaton et al., 2019). The stochastic partial differential equation (SPDE) approach (Lindgren et al., 2011) is attractive because it directly gives rise to computationally efficient models and easily extends to non-stationary covariance models.
However, increasing the flexibility of the covariance structure requires increasing the number of parameters. The common isotropic Matérn covariance functions (Stein, 2012) are parametrized through 3 parameters: marginal variance, range, and smoothness. Flexible models can have hundreds or more parameters (Fuglstad et al., 2015b). An appealing way to reduce dimensionality is to describe the covariance structure through covariates (Schmidt et al., 2011; Neto et al., 2014; Ingebrigtsen et al., 2014, 2015; Risser and Calder, 2015).
The aforementioned works all consider flexible covariance structures in two-dimensional space, and while the methods can be extended to three-dimensional space, the literature is sparse. For example, the SPDE approach has been used for simple anisotropic covariance structures in the context of fMRI data from the brain (Sidén et al., 2021), and for more complex covariance structures in astronomy (Lee and Gammie, 2021), though the latter treated two-dimensional space and time as a three-dimensional space. However, spatially varying anisotropy in the SPDE approach (Fuglstad et al., 2015a) has not been extended to three-dimensional space.
The aim of this paper is to develop a new method for spatially varying anisotropy in three-dimensional space through the SPDE approach. A key advantage is that the formulation as an SPDE guarantees a valid covariance structure, and the main challenge is how to describe and parametrize non-stationary covariance structures. Fuglstad et al. (2015a) used one vector field to describe spatially varying anisotropy, but in three dimensions, two spatially varying orthogonal vector fields are necessary for full generality.
In a simulation study, we investigate how much data is necessary to recover parameters for three different model complexities: stationary isotropic, stationary anisotropic, and non-stationary anisotropic. We then estimate GRF priors to encode knowledge about the ocean from a numerical forecast generated by the numerical model SINMOD by SINTEF. A stationary GRF prior and a non-stationary GRF prior are updated based on in-situ measurements from an autonomous underwater vehicle (AUV), and we evaluate the predictive ability during a mission in Trondheimsfjorden, Norway, on May 27, 2021. Improved predictions are key, for example, in autonomous sampling of the oceans (Fossum et al., 2019, 2021), but current approaches in autonomous ocean sampling are limited to stationary GRFs.
In Section 2, we describe how to model anisotropy and non-stationarity in three dimensions using SPDEs. In Section 3, we describe how to perform inference for the new model in a computationally efficient way. In Section 4, we describe the simulation study and discuss the results, and we continue with the application to sampling in the ocean in Section 5. We end with a discussion in Section 6.
2 Constructing SPDEs with spatially varying anisotropy

2.1 Existing models
The Matérn covariance function on R^3 is given by

r(s_1, s_2) = σ^2 (2^{1−ν} / Γ(ν)) (κ ||s_2 − s_1||)^ν K_ν(κ ||s_2 − s_1||),   (1)

where ||·|| is the Euclidean distance in R^3, σ > 0 is the marginal standard deviation, K_ν is the modified Bessel function of the second kind and order ν > 0, and κ > 0 is an inverse spatial scale parameter. As discussed in Lindgren et al. (2011), GRFs with this covariance function are the stationary solutions of the SPDE

(κ^2 − ∇ · ∇)^{α/2} (τ u(s)) = W(s), s ∈ R^3,   (2)

where α = ν + 3/2, τ > 0 controls the marginal variance (for ν = 1/2, σ^2 = 1/(8π κ τ^2)), ∇ · ∇ is the Laplacian, and W is a standard Gaussian white noise process. Lindgren et al. (2011) proposed to introduce non-stationarity by allowing κ and τ to vary in space (Ingebrigtsen et al., 2014, 2015) or by deformations of space (Hildeman et al., 2021). Fuglstad et al. (2015a,b) consider a version of the SPDE where the Laplacian is replaced by an anisotropic Laplacian in which the direction and degree of anisotropy vary spatially. This was further extended to spherical geometry in Fuglstad and Castruccio (2020) and Hu et al. (2021). However, all of these works were in two-dimensional base spaces, and only simpler models have been applied in three-dimensional base spaces (Sidén et al., 2021).
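The Matérn covariance above can be evaluated directly; the following minimal sketch (plain NumPy/SciPy, not code from the paper) illustrates the parametrization, and for ν = 1/2 it reduces to the exponential covariance used later in the paper.

```python
import numpy as np
from scipy.special import gamma, kv


def matern(h, sigma=1.0, kappa=1.0, nu=0.5):
    """Matern covariance r(h) = sigma^2 2^{1-nu}/Gamma(nu) (kappa h)^nu K_nu(kappa h)."""
    h = np.atleast_1d(np.asarray(h, dtype=float))
    out = np.full(h.shape, sigma**2)  # r(0) = sigma^2 by continuity
    pos = h > 0
    x = kappa * h[pos]
    out[pos] = sigma**2 * 2.0 ** (1 - nu) / gamma(nu) * x**nu * kv(nu, x)
    return out


# For nu = 1/2 the Matern covariance reduces to the exponential covariance.
r = matern([0.5, 1.0], sigma=2.0, kappa=0.7, nu=0.5)
```

For ν = 1/2, r(h) = σ^2 exp(−κh), which is the smoothness the anisotropic models in Section 2.2 correspond to.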
The key idea in Fuglstad et al. (2015a) was to replace ∇ · ∇ by ∇ · H(s)∇, where H(s) is everywhere a symmetric positive definite 2 × 2 matrix that controls the strength and direction of anisotropy. The matrix-valued function was specified as H(s) = γ(s) I_2 + v(s) v(s)^T, where γ(·) > 0 is a scalar function and v(·) is a vector field. This allows γ(·) to control the baseline strength of dependence in all directions, and v(·) to control the strength and direction of additional spatial dependence. However, the same parametrization in R^3 is not sufficiently general to control anisotropy fully.

2.2 Stationary anisotropy in R^3
We follow the idea in Fuglstad et al. (2015a) for R^2, and change the SPDE in Equation (2) to

(κ^2 − ∇ · H∇) u(s) = W(s), s ∈ R^3,   (3)

where ∇ · H∇ is an anisotropic Laplacian and the symmetric positive definite 3 × 3 matrix H controls the anisotropy. The parameter τ has been dropped since κ and H together control both the marginal variance and the correlation.
As shown in Appendix A.1, the resulting marginal variance is

σ_m^2 = 1 / (8π κ (det H)^{1/2}),   (4)

and the covariance function is explicitly known as

r(s_1, s_2) = σ_m^2 exp(−κ ((s_2 − s_1)^T H^{−1} (s_2 − s_1))^{1/2})   (5)

for s_1, s_2 ∈ R^3. The latter is derived in Appendix A.2. This corresponds to geometric anisotropy in the Matérn covariance function with smoothness ν = 1/2. To understand the behavior of the covariance function, it is useful to think about H in terms of its eigenvalue decomposition. Let ṽ_1, ṽ_2, and ṽ_3 be orthonormal eigenvectors corresponding to eigenvalues λ_1, λ_2, and λ_3, respectively. Figure 1 shows an example of the 0.37 level iso-correlation surface that arises from the covariance function in Equation (5). The semi-axes of the ellipsoid in the figure are v_1 = (√λ_1/κ)ṽ_1, v_2 = (√λ_2/κ)ṽ_2, and v_3 = (√λ_3/κ)ṽ_3; evaluating the covariance function at either of these semi-axes yields the relationship r(v_i)/σ_m^2 = e^{−1} ≈ 0.37, i = 1, 2, 3.

We generalize the parametrization described in Section 2.1 and decompose H as

H = γ I_3 + v v^T + w w^T, w = ρ_1 w_1 + ρ_2 w_2,

where γ > 0, v ∈ R^3, and w is orthogonal to v. The eigenvalue decomposition of H then has eigenvalues λ_1 = γ, λ_2 = γ + ||v||^2, and λ_3 = γ + ||w||^2 with corresponding eigenvectors ṽ_1 = v × ω, ṽ_2 = v, and ṽ_3 = ω, where ω = w/||w||. We construct w as a linear combination of two orthonormal vectors w_1 and w_2 in the plane with v as normal vector, where ρ_1, ρ_2 ∈ R. For example, one can take w_1 proportional to (−v_y, v_x, 0)^T and w_2 proportional to v × w_1, which works whenever (v_x, v_y) ≠ (0, 0). An alternative is to use the Euler–Rodrigues parametrization (Euler, 1771; Rodrigues, 1840) to obtain both v and ω; however, in that case the parameters are less interpretable, and in practice the issue is avoided by numerical optimization with appropriate initial parameter values.
The above parametrization of H uses six parameters, γ, v_x, v_y, v_z, ρ_1, and ρ_2, to describe all forms of geometric anisotropy. The parametrization is interpretable: 1) γ controls the isotropic effect, 2) v_x, v_y, and v_z control anisotropy in one direction, and 3) ρ_1 and ρ_2 control anisotropy in a second direction orthogonal to the first. Lastly, κ simultaneously controls the scaling of spatial dependence equally in all directions and, together with the six other parameters, the variance of the GRF, as seen in Equation (4).
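A quick numerical check of this parametrization confirms the stated eigenstructure of H. This is a sketch with our own function name, and the particular in-plane basis w_1, w_2 is one possible convention:

```python
import numpy as np


def anisotropy_matrix(gamma, v, rho1, rho2):
    """H = gamma*I_3 + v v^T + w w^T with w = rho1*w1 + rho2*w2 orthogonal to v.

    w1, w2 form an orthonormal basis of the plane with normal vector v; this
    particular choice fails only when v is parallel to the z-axis.
    """
    v = np.asarray(v, dtype=float)
    n = np.hypot(v[0], v[1])
    if n == 0:
        raise ValueError("v parallel to the z-axis: in-plane basis is not defined")
    w1 = np.array([-v[1], v[0], 0.0]) / n     # unit vector in the xy-plane, orthogonal to v
    w2 = np.cross(v, w1) / np.linalg.norm(v)  # completes the orthonormal pair
    w = rho1 * w1 + rho2 * w2
    return gamma * np.eye(3) + np.outer(v, v) + np.outer(w, w), w


H, w = anisotropy_matrix(2.5, [1.0, 2.0, 0.5], 0.3, -0.4)
```

The eigenvalues are γ, γ + ||v||^2, and γ + ||w||^2, with ||w||^2 = ρ_1^2 + ρ_2^2, exactly as described above.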

2.3 Spatially varying anisotropy on a bounded domain D ⊂ R^3
Non-stationarity and spatially varying anisotropy are achieved by making the coefficients in Equation (3) spatially varying,

(κ(s)^2 − ∇ · H(s)∇) u(s) = W(s),   (8)

where κ(·) is a positive function and H(·) is a spatially varying symmetric positive definite 3 × 3 matrix. Heuristically, one can imagine that the SPDE glues together different local behaviors, described by the ellipsoids discussed in Section 2.2, into a valid non-stationary covariance structure.
In practice, we need to limit Equation (8) to a bounded domain to parametrize the non-stationarity. The SPDE we propose is

(κ(s)^2 − ∇ · H(s)∇) u(s) = W(s), s ∈ D,

where D is bounded, and we enforce homogeneous Neumann boundary conditions on the boundary of D.

3 Estimating SPDEs with spatially varying anisotropy

3.1 Parameterizing the non-stationarity
Before using the SPDE in Equation (8) in inference, we parametrize the non-stationarity through a finite number of parameters. This involves expanding κ(·), γ(·), v_x(·), v_y(·), v_z(·), ρ_1(·), and ρ_2(·) in basis functions. The log-transform is used for κ(·) and γ(·) since they must be positive functions.
Let g : R^3 → R denote a generic function that we want to expand in a basis, and let p > 0 be the number of basis functions. We use basis splines similar to Fuglstad et al. (2015b), and set

g(s) = α_g^T f(s),

where α_g ∈ R^p, and f(s) = (f_1(s), ..., f_p(s))^T is a p-dimensional vector with the basis functions evaluated at location s.
In this paper, we will use rectangular domains and a basis constructed as a tensor product of three one-dimensional B-splines.
This means that p = m^3, where m > 0 is the number of basis functions used in each dimension. We use clamped splines where the derivative is 0 at each boundary, and the construction of the clamped one-dimensional B-splines is discussed in Appendix A.3. Figure 2 shows an example of the resulting basis functions in one dimension. Let B_{x,i} denote the i-th basis function of the second-order basis in the x-dimension, and similarly B_{y,j} and B_{z,k} for the y- and z-dimensions. The resulting three-dimensional basis is then

f_{ijk}(s) = B_{x,i}(s_x) B_{y,j}(s_y) B_{z,k}(s_z)

for all combinations i, j, k ∈ {1, ..., m}. This means that α_g ∈ R^{m^3}, and m^3 parameters must be estimated for each of the seven functions described at the start of the section. In Sections 4 and 5, we use p = m^3 = 3^3 = 27, for a total of 189 parameters in the seven functions. When data is sparse, such a model can easily result in overfitting (Fuglstad et al., 2015b), and it may be necessary to introduce penalties on the seven functions. In Fuglstad et al. (2015b), this was achieved through a hierarchical model placing smoothing priors on the spatially varying functions, together with Neumann boundary conditions of zero derivatives on the boundary of the domain. However, this requires selecting a reasonable smoothing parameter τ_g > 0 for each of the seven functions, which is computationally expensive if done using cross-validation. In the context of this paper, we are constructing a stochastic model that mimics the behavior of a densely "observed" numerical simulation model, and we do not include penalties beyond the restriction of using 27 basis functions. We demonstrate that this model can be estimated in our context in the simulation study in Section 4, where we also investigate the amount of data needed to estimate the model.
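The tensor-product construction can be sketched with SciPy's B-spline design matrix. The helper names are ours, and we use standard clamped (repeated-boundary) knot vectors here, without the zero-derivative boundary modification of Appendix A.3:

```python
import numpy as np
from scipy.interpolate import BSpline


def bspline_basis_1d(x, m=3, degree=2):
    """m one-dimensional B-spline basis functions of the given degree on [0, 1]."""
    interior = np.linspace(0.0, 1.0, m - degree + 1)[1:-1]
    knots = np.concatenate([np.zeros(degree + 1), interior, np.ones(degree + 1)])
    return BSpline.design_matrix(x, knots, degree).toarray()  # shape (len(x), m)


def tensor_basis_3d(s, m=3):
    """Evaluate the m^3 basis functions B_{x,i} B_{y,j} B_{z,k} at locations s (n x 3)."""
    Bx, By, Bz = (bspline_basis_1d(s[:, d], m) for d in range(3))
    # einsum forms all i,j,k products per location, flattened to m^3 columns
    return np.einsum("ni,nj,nk->nijk", Bx, By, Bz).reshape(len(s), m**3)


F = tensor_basis_3d(np.array([[0.1, 0.2, 0.3], [0.5, 0.7, 0.93]]), m=3)
```

Each spatially varying function is then g(s) = α_g^T f(s), with f(s) a row of this matrix; the rows sum to one (partition of unity).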

3.2 Hierarchical model and discretization
Consider a bounded domain D ⊂ R^3, and observations y = (y_1, y_2, ..., y_n) made at locations s_1, s_2, ..., s_n ∈ D. We assume the Gaussian observation model

y_i = η(s_i) + ε_i, ε_i ~ N(0, σ_N^2), i = 1, ..., n,

where σ_N^2 > 0 is the nugget variance, and

η(s) = x(s)^T β + u(s)

describes the true spatial variation as a combination of covariates and a GRF. Here x(·) is a spatially varying vector of k covariates, β ∈ R^k are the coefficients of the covariates, and u(·) is a GRF with spatially varying anisotropy as presented in Section 2.
As described in Appendix B, the GRF u(·) is discretized using a regular grid with l cells, giving a Gaussian Markov random field w = (w_1, ..., w_l)^T. Let θ be the vector of all parameters controlling u(·); then

w | θ ~ N_l(0, Q^{−1}),

where the dependence on θ is suppressed for Q, and Q is an l × l precision matrix with a three-dimensional spatial sparsity structure. The vector w is linked to u(·) through the linear transformation u(s) = a(s)^T w, where a(s) has only one non-zero entry, corresponding to the grid cell that location s belongs to. This gives u = (u(s_1), ..., u(s_n))^T = Aw, where the n × l matrix A has only one non-zero entry in each row.
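The matrix A can be built cheaply since each observation falls into exactly one grid cell. A minimal sketch with our own function name, assuming a regular grid on a rectangular box:

```python
import numpy as np
from scipy.sparse import csr_matrix


def projection_matrix(locs, lower, upper, shape):
    """n x l sparse matrix with one nonzero per row: the grid cell containing each location."""
    locs = np.asarray(locs, dtype=float)
    lower, upper, shape = map(np.asarray, (lower, upper, shape))
    frac = (locs - lower) / (upper - lower)                  # position in [0, 1]^3
    idx = np.minimum((frac * shape).astype(int), shape - 1)  # clip the upper boundary
    flat = np.ravel_multi_index(idx.T, shape)                # C-ordered cell index
    n, l = len(locs), int(np.prod(shape))
    return csr_matrix((np.ones(n), (np.arange(n), flat)), shape=(n, l))


A = projection_matrix([[0.1, 0.1, 0.1], [0.9, 0.1, 0.6]], (0, 0, 0), (1, 1, 1), (2, 2, 2))
```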
The coefficients of the fixed effects, β, are assigned the weak penalty β ~ N_K(0, V I_K) for a fixed V > 0. Thus, we can write y as

y = Aw + Xβ + ε,

where X is the design matrix of covariates and ε ~ N_n(0, σ_N^2 I_n) is an n-dimensional vector of random noise. This gives rise to the hierarchical formulation

y | w, β, σ_N^2 ~ N_n(Aw + Xβ, σ_N^2 I_n),
w | θ ~ N_l(0, Q^{−1}), β ~ N_K(0, V I_K).

Let s* ∈ D be an unobserved location. After the parameters θ and σ_N^2 are estimated, one can predict the underlying value η(s*) = x(s*)^T β + a(s*)^T w through its conditional distribution given y. The estimation of the parameters is detailed in the next section.

3.3 Computationally efficient inference

Simplify notation by letting z = (w^T, β^T)^T, with z | θ ~ N_{l+K}(0, Q_z^{−1}) and Q_z = diag(Q, I_K/V), and let S = [A X]; then the observation model can be rewritten as

y | z, σ_N^2 ~ N_n(Sz, σ_N^2 I_n).

Using this notation, the log-likelihood can be expressed as

l(θ, σ_N^2; y) = const + log π(θ, σ_N^2) + (1/2) log det(Q_z) − (1/2) log det(Q_C) − (n/2) log(σ_N^2) − (1/2) y^T y / σ_N^2 + (1/2) µ_C^T Q_C µ_C.   (14)

Here the dependence on θ is suppressed for µ_C, Q_z, and Q_C, and π(θ, σ_N^2) can be used to assign a penalty on θ, e.g., like the random-walk penalty used in Fuglstad et al. (2015b). The conditional precision matrix is

Q_C = Q_z + S^T S / σ_N^2,

and µ_C is the conditional mean,

µ_C = Q_C^{−1} S^T y / σ_N^2.

Parameter inference is done by maximizing Equation (14) with respect to θ and σ_N^2. The parameter vector θ includes all coefficients of the basis functions and, when using 27 basis functions for each function, has 189 parameters. The parameter space is challenging to search, and we therefore use an analytical expression for the gradient in the optimization algorithm. The derivation of the analytical gradient involves many nested chain rules and a technique to calculate a partial inverse of sparse matrices (Rue and Held, 2010); see Appendix A.5 for a complete description.
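For intuition, the integrated likelihood can be checked on a small dense example against the equivalent Gaussian marginal y ~ N(0, S Q_z^{−1} S^T + σ_N^2 I). This is a sketch with our own helper names, not the sparse implementation used in the paper:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve


def marginal_loglik(y, S, Q_z, sigma2):
    """log pi(y | theta, sigma2) with the latent z integrated out (dense sketch)."""
    n = len(y)
    Q_C = Q_z + S.T @ S / sigma2             # conditional precision
    c_z, c_C = cho_factor(Q_z), cho_factor(Q_C)
    mu_C = cho_solve(c_C, S.T @ y / sigma2)  # conditional mean
    logdet_z = 2.0 * np.sum(np.log(np.diag(c_z[0])))
    logdet_C = 2.0 * np.sum(np.log(np.diag(c_C[0])))
    return (-0.5 * n * np.log(2.0 * np.pi * sigma2)
            + 0.5 * (logdet_z - logdet_C)
            - 0.5 * y @ y / sigma2 + 0.5 * mu_C @ Q_C @ mu_C)


rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
Q_z = B @ B.T + 5.0 * np.eye(5)  # a positive definite prior precision
S = rng.standard_normal((8, 5))
y = rng.standard_normal(8)
ll = marginal_loglik(y, S, Q_z, 0.3)
```

The sparse version replaces the dense Cholesky factorizations with sparse ones, which is where the O(N^2) cost discussed later arises.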

4 Simulation study
In this section, we perform a simulation study to investigate the amount of data required to obtain reasonable parameter estimates for models of varying complexity specified through the SPDE. The estimates are compared across simulated data generated from three different parametrizations of the covariance structure.
The observation model for the different parametrizations is

y = A w_mod + ε,   (17)

where A selects the observed grid cells, w_mod is the GMRF controlled by the parameters θ_mod of the respective model, and ε is the independent noise term with mean zero and standard deviation σ_N = 0.1, identical across parametrizations. Furthermore, the models are discretized on the same domain with a grid of size (M, N, P) = (30, 30, 30), resulting in a total of 27,000 grid nodes whose centers are the spatial locations. The first and simplest model is a Stationary Isotropic (SI) model, which has a covariance structure controlled by the three parameters θ_SI = (log κ^2, log γ, log σ_N^2), assigned the values κ^2 = 0.2, γ = 2.5, and σ_N = 0.1. The resulting spatial range is 10.59 with a marginal variance of 0.023.
The second is a Stationary Anisotropic (SA) model composed of the 8 parameters θ_SA = (log κ^2, log γ, v_x, v_y, v_z, ρ_1, ρ_2, log σ_N^2). The parameters of these first two models are simply assigned reasonable values; however, for the third and most complex model, with a non-stationary anisotropic (NA) covariance structure and a total of 190 parameters, the values are much more troublesome to select. Therefore, functions are chosen to assign the parameter values in θ_NA throughout the domain D such that the dependency directions imitate a vortex. Using these functions and evaluating them at the spatial locations of the discretization, the parameters of the B-splines described in Section 3.1 are found by optimization. These parameters are θ_NA = (α_{log κ^2}, α_{log γ}, α_{v_x}, α_{v_y}, α_{v_z}, α_{ρ_1}, α_{ρ_2}, log σ_N), with σ_N = 0.1, and the resulting covariance structure can be viewed in Figure 4.
We now examine the amount of data required to recover the parameters of the three models described above. First, we simulate multiple datasets from the observation model in Equation (17), with different numbers of observed spatial locations and realizations (replicated observations at these spatial locations). The number of spatial locations varies between 100, 10,000, and 27,000 (all), and the number of realizations ranges between 1, 10, and 100, giving nine different combinations of dataset sizes. Furthermore, we perform 100 different trials for each of these combinations, and thereby have 900 datasets in total per model. Note that the observed spatial locations are chosen randomly in each trial. From this, statistics can be computed about the model estimates that give insight into the applicability of the different parametrizations.
Table 1 shows the root mean square error (RMSE) between the true parameter values of each model and the values inferred from the different datasets. The estimates were obtained using the inference method described in Section 3.3 with the observation model in Equation (17) for each respective parametrization and trial. The columns give the different numbers of observation locations (No. loc.) and numbers of realizations. From Table 1, we observe that the (simple) stationary models, SI and SA, require very little data. In fact, observing under 1% of the grid for 10 or more realizations is good enough for the SI model, and the SA model only requires a few more realizations to attain similar parameter accuracy.
On the other hand, the most flexible parametrization, the NA model, requires much more data and only reaches reasonable parameter accuracy when the whole grid is observed with 10 or more realizations. There is, however, a large discrepancy between 10,000 observed locations (37%) and 27,000 (100%), so it could be interesting to investigate where in this range reasonable estimates are obtained; we have chosen not to explore this here. We also note that these results will change with the complexity of the covariance structure and with the initial values in the optimization.
5 GRF prior for statistical sampling of the ocean

5.1 Aim
Forecasts produced by numerical ocean models describe realistic behavior of the ocean, but local behavior, such as plumes created by freshwater discharge from a river into the ocean, is hard to forecast accurately. However, we can construct a prior based on the numerical ocean model that encodes prior beliefs about the ocean, which can help AUVs sample the ocean more effectively. In this paper, the goal is to determine the three-dimensional extent of a freshwater plume in the ocean, and we assume the operation time is short enough to justify a purely spatial prior that does not model dynamical changes in time.
There are two steps in our approach.
Step 1 is to estimate a stationary GRF prior and a non-stationary GRF prior based on a simulation from the numerical ocean model as described in Section 5.2.
Step 2 is to combine each of the estimated priors with an observation model and evaluate the predictive ability on in-situ observations from an AUV as described in Section 5.3. The GRFs that we estimate based on the numerical ocean model can be viewed as statistical emulators of the ocean.

5.2 The numerical ocean model and the GRF prior
The vertical layers in the numerical model vary in thickness, with either 0.5 m or 1 m increments, so the SINMOD simulations do not require any additional modification to fit within our statistical model.
We first estimate the model

z_t = Φ z_{t−1} + ε_t, t = 1, ..., 143,

where Φ is a diagonal matrix of AR(1) coefficients. The diagonal entries of Φ are estimated with maximum likelihood separately for each spatial location,

Φ̂_ii = (Σ_{t=1}^{143} z_{t,i} z_{t−1,i}) / (Σ_{t=1}^{143} z_{t−1,i}^2), i = 1, ..., NMP,

where z_{t,i} is the value in cell i at time t. We then compute the empirical innovations ε̂_t = z_t − Φ̂ z_{t−1}, t = 1, ..., 143.
These empirical innovations describe the spatial covariance structure for short-term changes in salinity.
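The per-cell AR(1) fit and the innovations can be computed in a few vectorized lines. This is a sketch with our own names, where z has one row per time step and one column per grid cell:

```python
import numpy as np


def ar1_innovations(z):
    """Per-cell AR(1) coefficients phi_i = sum_t z_{t,i} z_{t-1,i} / sum_t z_{t-1,i}^2
    and empirical innovations eps_t = z_t - phi * z_{t-1}."""
    phi = np.sum(z[1:] * z[:-1], axis=0) / np.sum(z[:-1] ** 2, axis=0)
    eps = z[1:] - phi * z[:-1]
    return phi, eps


# simulate two cells with true AR(1) coefficient 0.8 to sanity-check the estimator
rng = np.random.default_rng(1)
z = np.zeros((2000, 2))
for t in range(1, 2000):
    z[t] = 0.8 * z[t - 1] + rng.standard_normal(2)
phi_hat, eps = ar1_innovations(z)
```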
We fit the flexible non-stationary anisotropic model with 190 parameters, θ_NA = (α_{log κ}, α_{log γ}, α_{v_x}, α_{v_y}, α_{v_z}, α_{ρ_1}, α_{ρ_2}, log σ_N^2), and the stationary anisotropic model with 8 parameters, θ_SA = (log κ^2, log γ, v_x, v_y, v_z, ρ_1, ρ_2, log σ_N^2), to the empirical innovations ε̂_1, ..., ε̂_143, which are treated as independent realizations of a GRF. In the next step, we construct the expected value of the GRF using the time average over the whole day, µ = Σ_{t=0}^{143} z_t / 144. The mean is shown in Figure 7a and shows the overall tendency of freshwater near the river outlet and saltwater further out in the ocean. We choose the prior x_prior = µ + e, where we combine the fixed mean vector µ with a new realization e from the estimated stationary anisotropic model or the non-stationary anisotropic model.
This is a spatial prior on a 32 m × 32 m × 1 m resolution.

5.3 In-situ data collection and emulator evaluation
In-situ measurements were made with the AUV on May 27, 2021, between 10:30 and 14:30. The AUV followed 9 pre-planned paths within the area of operation: two transects at 0.5 m depth, one northbound and one north-westbound, starting from the river; two zig-zags in each depth layer (0.5 m, 2 m, 5 m); and one up-and-down pattern in depth, ranging from 0.5 m to 10.5 m, moving north-westbound starting from the river. Figure 8 displays the locations of the measurements in the top 5 layers of the field.
The AUV moves at 1.5 m/s and continuously samples the salinity. This means that multiple measurements are made within each 32 m × 32 m × 1 m grid cell. Measurements are represented as y_i, i = 1, ..., n_obs, where y_i is the average value measured in grid cell i. We combine these measurements with the prior in Equation (18) using the observation model

y_i | x ~ N(a_i^T x, σ_meas^2), i = 1, ..., n_obs,

where a_i selects the correct grid cell, x ~ N(µ, Q_Prior^{−1}) with Q_Prior the estimated precision matrix of the GMRF, and the Gaussian likelihood with nugget variance σ_meas^2 describes measurement noise and sub-grid variation. In general, we would estimate σ_meas^2 using a trial run, but in this case, we estimated σ_meas^2 using the average empirical variance over all observed grid cells in the total dataset. Note that we have not accounted for uncertainty in the AUV's position in these models. As the AUV dives, it loses its GPS signal and relies only on an estimated location. When the GPS signal returns, a linear interpolation is made to account for drift, but no positional uncertainty is included.
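Conditioning the GMRF prior on the gridded AUV measurements is a standard sparse Gaussian update. A sketch with hypothetical variable names; in practice, the prior precision comes from the fitted SPDE model:

```python
import numpy as np
from scipy.sparse import csr_matrix, identity
from scipy.sparse.linalg import spsolve


def update_field(mu, Q_prior, A, y, sigma2_meas):
    """Posterior mean and precision of x ~ N(mu, Q_prior^{-1}) given y = A x + noise."""
    Q_post = (Q_prior + A.T @ A / sigma2_meas).tocsc()
    mu_post = mu + spsolve(Q_post, A.T @ (y - A @ mu) / sigma2_meas)
    return mu_post, Q_post


# toy example: three cells with prior precision 2*I, one observation of cell 0
Q_prior = 2.0 * identity(3, format="csc")
A = csr_matrix(np.array([[1.0, 0.0, 0.0]]))
mu_post, Q_post = update_field(np.zeros(3), Q_prior, A, np.array([1.0]), 0.5)
```

Because the posterior precision keeps the sparsity of the prior plus one diagonal entry per observed cell, this update is cheap enough for real-time use on the AUV.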
We evaluated the two priors, or emulators, by randomly ordering the 9 segments and then sequentially including more and more observations for predicting the remaining held-out data. The random permutation of the segments was repeated to determine the variation in scores over different paths. This scheme evaluates the models' ability to predict future observations while maintaining the sequential structure of the measurements. Figure 9 shows that the non-stationary model provides a better prior for the salinity in the ocean than the stationary model. The differences are largest when little data is available, which is consistent with the idea that the prior matters most in this case. The non-stationary model can leverage knowledge about which areas are most uncertain through the spatially varying marginal variance and update the prior based on expected similarities from the spatially varying anisotropy. The improvements are seen both in point predictions, through the RMSE, and in predictive distributions, as measured by the CRPS (Gneiting and Raftery, 2007).
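The CRPS for a Gaussian predictive distribution has the closed form given by Gneiting and Raftery (2007), which can be applied cell by cell; a sketch with our own function name:

```python
import numpy as np
from scipy.stats import norm


def crps_gaussian(y, mu, sigma):
    """CRPS of a N(mu, sigma^2) predictive distribution at observation y."""
    z = (y - mu) / sigma
    return sigma * (z * (2.0 * norm.cdf(z) - 1.0) + 2.0 * norm.pdf(z) - 1.0 / np.sqrt(np.pi))
```

Lower is better, and unlike the RMSE, the score also rewards well-calibrated predictive uncertainty.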

6 Discussion
We extend the class of SPDE-based GRFs introduced in Fuglstad et al. (2015a) to three-dimensional space by overcoming two key issues: parametrization and computation. For the former, we develop a specification of spatially varying anisotropy through a spatially varying baseline isotropic dependence and two orthogonal spatially varying vector fields that describe extra dependence. This allows for an interpretable description of the 3 × 3 positive definite matrix describing the anisotropy. For the latter, we use a finite volume method to construct a GMRF that approximates the solution of the SPDE.
The specification of spatially varying marginal variance and spatially varying anisotropy requires specifying 7 spatially varying real functions. In this paper, we expand each function in a clamped B-spline basis. If each function uses P^3 basis functions, this gives in total 7P^3 coefficients. As demonstrated in the simulation study, unpenalized estimation of these parameters requires a densely observed domain and multiple realizations. Application of the new models in data-sparse situations will require penalties that restrict the regularity of the 7 spatially varying functions. However, more research is needed to find a practical way to determine the appropriate strength of penalization for each of the functions.
While we did not experience any practical issues with the chosen way to describe the two orthogonal vector fields, the construction has a "gimbal lock" type issue. If one vector field points exactly along the z-axis, there is no unique choice for the second vector field. A potential way to avoid this issue is to describe the orientation of the two orthogonal vector fields through quaternions or Euler–Rodrigues parameters.
Moving from two-dimensional to three-dimensional space introduces an asymptotically higher computational cost as a function of grid size. For a regular three-dimensional grid with N nodes, the computational cost is O(N^2), compared to O(N^{3/2}) in two-dimensional space. This increased cost arises from increased fill-in in the Cholesky factor. However, the application demonstrates that a grid size of N = 22,275 is unproblematic even for real-time updates on an AUV.
For the predictions of salinity in Trondheimsfjorden, we see the largest improvement of the complex GRF prior over the stationary GRF prior for sparse in-situ measurements. As more data is collected, the difference between the models decreases. This suggests that the key advantage of training the more complex GRF is to encode prior physical knowledge so that we can more effectively update knowledge about unobserved locations. Salinity was used as an example, but in general, the same approach could be used to map other biologically interesting quantities such as phytoplankton (Fossum et al., 2019). The GRFs developed in this paper are a step forward in quantifying beliefs about unobserved regions in the ocean, which is essential for optimal decisions and more effective autonomous sampling (Fossum et al., 2021).
In future work, it would be interesting to add a dynamic component to the model to capture physical processes such as diffusion and advection. However, this substantially increases the computational cost, and it is not clear to what degree an advection field from a numerical model should be trusted, or which boundary conditions are best in an advection-dominated problem. The new class of GRFs shows great promise for encoding prior knowledge about a phenomenon in a computationally efficient way. However, overfitting is an important issue, and we must consider ways to penalize complexity. In particular, we need ways to allow flexibility in areas where it is needed, such as a river outlet, and to restrict flexibility in areas where we expect stationarity.
A.1 Marginal variance

Note that the integral in Equation (S3) is solved by converting to spherical coordinates.

A.2 Covariance function
Evaluating Equation (S2) at ν = 1/2 and including the expression for the marginal variance, the covariance function can be written explicitly. The modified Bessel function of the second kind evaluated at order 1/2 is

K_{1/2}(x) = (π/(2x))^{1/2} e^{−x}.

The covariance function can then be written as

r(s_1, s_2) = σ_m^2 exp(−κ ((s_2 − s_1)^T H^{−1} (s_2 − s_1))^{1/2}).   (S4)

A.3 One-dimensional clamped B-splines
We illustrate the construction of one-dimensional clamped B-splines on the interval [0, 1]. Let t_0 ≤ t_1 ≤ ... ≤ t_p be the knot points. The zero-order B-splines are then constructed as

B_{i,0}(x) = 1 if t_i ≤ x < t_{i+1}, and 0 otherwise,

for i = 0, ..., p − 1. Let r denote the order of the B-splines. The first- and second-order basis splines are constructed recursively as

B_{i,r}(x) = ((x − t_i)/(t_{i+r} − t_i)) B_{i,r−1}(x) + ((t_{i+r+1} − x)/(t_{i+r+1} − t_{i+1})) B_{i+1,r−1}(x),

for i = 0, ..., p − r − 1.
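The recursion can be implemented directly. A small sketch in the Cox–de Boor form with our own function name; clamping via repeated or adjusted boundary knots is applied on top of this:

```python
import numpy as np


def bspline(i, r, t, x):
    """i-th B-spline of order r on the knot sequence t, evaluated at points x."""
    if r == 0:
        return np.where((t[i] <= x) & (x < t[i + 1]), 1.0, 0.0)
    # the 0/0 convention: a term with coincident knots contributes zero
    left = 0.0 if t[i + r] == t[i] else \
        (x - t[i]) / (t[i + r] - t[i]) * bspline(i, r - 1, t, x)
    right = 0.0 if t[i + r + 1] == t[i + 1] else \
        (t[i + r + 1] - x) / (t[i + r + 1] - t[i + 1]) * bspline(i + 1, r - 1, t, x)
    return left + right


# quadratic B-spline on uniform knots: known values 1/2 at x=1 and 3/4 at x=1.5
vals = bspline(0, 2, [0.0, 1.0, 2.0, 3.0, 4.0], np.array([1.0, 1.5]))
```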

A.4 Integrated likelihood
The distribution of z = (w^T, β^T)^T is given by

z | θ ~ N(0, Q_z^{−1}), Q_z = diag(Q, I_K/V),

and the observation model is

y | z, σ_N^2 ~ N_n(Sz, σ_N^2 I_n).

From this, the distribution of z given observations y is

z | θ, σ_N^2, y ~ N(µ_C, Q_C^{−1}).

Then, integrating out z from the joint distribution gives

π(y | θ, σ_N^2) = π(z | θ) π(y | z, σ_N^2) / π(z | θ, σ_N^2, y),

where the left-hand side does not depend on z, so the right-hand side may be evaluated at any value of z. Evaluating it at z = µ_C makes the exponential term of π(z | θ, σ_N^2, y) equal to 1, leaving only its normalizing constant. Conditioning on y and taking the log then gives the log-likelihood in Equation (14).

A.5 Gradient of the log-likelihood
This section is similar to the derivation of the gradient presented in the supplementary material of Fuglstad et al. (2015b).
Note that the last two terms are rewritten for simplicity in the gradient calculation, and that the variance of the Gaussian noise term, σ_N^2, is re-parametrized by its inverse, the precision τ_N = 1/σ_N^2. Derivatives of the log-likelihood are taken with respect to θ_i, the elements of θ, and with respect to the log precision, log(τ_N).
The first term is a constant, and its derivative with respect to any of the parameters is therefore zero. The next term, the penalty or prior on the parameters, is not used in this paper and otherwise depends on the choice of penalty, so the gradient calculation is not specified for this term.
To continue, note the derivatives of the precision matrix, which are used in the following derivations. First, the derivatives with respect to θ_i are considered. The derivatives of the log-determinant terms and of the quadratic terms are combined to give the derivative of the log-likelihood with respect to θ_i. Next, the derivative with respect to the log precision, log τ_N, is considered.
The derivatives of the log-determinant terms are treated first. Further, the derivative of (1/2) y^T y τ_N with respect to log(τ_N) is the same expression, and adding the last quadratic term simplifies the result. Combining all these terms gives the derivative of the log-likelihood with respect to log(τ_N). Note that the derivative of Q_C can be calculated quickly; it is derived from a series of chain rules, first on Q_C, then on A and A_H, and finally within H. The most computationally heavy calculation in the gradient of the log-likelihood is the difference of inverses Q^{−1} − Q_C^{−1}. However, since this term is multiplied with the derivative of Q with respect to θ_i, which carries the non-zero structure of Q, only the elements of Q^{−1} and Q_C^{−1} corresponding to the non-zero structure of Q need to be calculated. This is done by computing a partial inverse of the two matrices, as described in Rue and Held (2010).
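The workhorse identity behind the log-determinant derivatives is d/dθ log det Q(θ) = tr(Q^{−1} ∂Q/∂θ). A dense sketch with our own function name; the sparse implementation only needs the entries of Q^{−1} on the sparsity pattern of ∂Q/∂θ, which is exactly what the partial inverse provides:

```python
import numpy as np


def dlogdet(Q, dQ):
    """Derivative of log det Q along the perturbation dQ: tr(Q^{-1} dQ)."""
    return np.trace(np.linalg.solve(Q, dQ))


# for Q = diag(2, 3, 4) and dQ = I, d log det Q = 1/2 + 1/3 + 1/4
g = dlogdet(np.diag([2.0, 3.0, 4.0]), np.eye(3))
```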

B.1 Discretization
To find the local solution of the SPDE, the domain D = [A_1, B_1] × [A_2, B_2] × [A_3, B_3] is divided into equally sized rectangular cells. We use M cells on [A_1, B_1] in the x-direction, N cells on [A_2, B_2] in the y-direction, and P cells on [A_3, B_3] in the z-direction. The cells have sides parallel to each axis of size h_x = (B_1 − A_1)/M, h_y = (B_2 − A_2)/N, and h_z = (B_3 − A_3)/P, and are indexed by their cell number along each axis starting from 0: i ∈ {0, …, M − 1} along x, j ∈ {0, …, N − 1} along y, and k ∈ {0, …, P − 1} along z. Figure S1 shows one cell and its closest neighbours. Furthermore, as a regular grid is employed, the volume of every cell is V = h_x h_y h_z.
To further define the local solution of the SPDE, we denote the faces of a grid cell by σ^F_{i,j,k} (front), σ^B_{i,j,k} (back), σ^L_{i,j,k} (left), σ^R_{i,j,k} (right), σ^U_{i,j,k} (up) and σ^D_{i,j,k} (down), with respective face centres s_{i,j−1/2,k}, s_{i,j+1/2,k}, s_{i−1/2,j,k}, s_{i+1/2,j,k}, s_{i,j,k+1/2} and s_{i,j,k−1/2}. Figure S2 describes the different faces of a cell.

Figure S1: One cell E_{i,j,k} in the discretization with its closest neighbours E_{i±1,j,k}, E_{i,j±1,k}, and E_{i,j,k±1}.
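The cell bookkeeping above can be made concrete with a short sketch. The grid dimensions below are arbitrary toy values, the helper names are our own, and the index map follows the vectorization order used later in Section B.2 (fastest along z, then x, then y).

```python
# Toy grid dimensions (M along x, N along y, P along z) -- illustrative values
M, N, P = 4, 3, 5
hx, hy, hz = 0.5, 0.5, 1.0
V = hx * hy * hz          # every cell in the regular grid has this volume

def cell_index(i, j, k):
    """Running index l = j*M*P + i*P + k: fastest along z, then x, then y."""
    return j * M * P + i * P + k

def cell_ijk(l):
    """Inverse map recovering (i, j, k) from the running index l."""
    j, r = divmod(l, M * P)
    i, k = divmod(r, P)
    return i, j, k

# the last index L = (N-1)MP + (M-1)P + (P-1) equals MNP - 1
print(cell_index(M - 1, N - 1, P - 1) == M * N * P - 1)  # True
# the map is a bijection over the grid
print(all(cell_ijk(cell_index(i, j, k)) == (i, j, k)
          for i in range(M) for j in range(N) for k in range(P)))  # True
```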

B.2 Local solution of the SPDE
Note that this description is an extension to three dimensions of the derivation described in Fuglstad et al. (2015a), and the reader is referred there for further details. To locally solve the SPDE, a finite volume scheme is derived.

Figure S2: One cell E_{i,j,k} of the discretization with all its faces σ^F_{i,j,k} (front), σ^B_{i,j,k} (back), σ^L_{i,j,k} (left), σ^R_{i,j,k} (right), σ^U_{i,j,k} (up), and σ^D_{i,j,k} (down), each with its respective face centre.
First, Equation (S1) is integrated over a cell E_{i,j,k},
\[
\int_{E_{i,j,k}} \kappa^2(s)\,u(s)\,ds - \int_{E_{i,j,k}} \nabla\cdot\big(H(s)\nabla u(s)\big)\,ds = \int_{E_{i,j,k}} dW(s), \tag{S6}
\]
where ds is a volume element. The integral of the Gaussian white noise on the right-hand side is a Gaussian variable with mean zero and variance equal to the volume of a cell, independent of the corresponding integrals over neighbouring cells. Let z_{ijk} be a standard Gaussian variable; then Equation (S6) becomes
\[
\int_{E_{i,j,k}} \kappa^2(s)\,u(s)\,ds - \int_{E_{i,j,k}} \nabla\cdot\big(H(s)\nabla u(s)\big)\,ds = \sqrt{V}\,z_{ijk}.
\]
Then, applying the divergence theorem to the second integral gives
\[
\int_{E_{i,j,k}} \nabla\cdot\big(H(s)\nabla u(s)\big)\,ds = \oint_{\partial E_{i,j,k}} \big(H(s)\nabla u(s)\big)\cdot n(s)\,d\sigma,
\]
where n(s) is the outward unit normal of the cell boundary ∂E_{i,j,k}. The first integral is approximated by letting κ²_{ijk} be the average value of the continuous function κ²(s) within a cell, i.e. κ²_{ijk} = (1/V)∫_{E_{ijk}} κ²(s) ds, resulting in
\[
\int_{E_{i,j,k}} \kappa^2(s)\,u(s)\,ds \approx V\,\kappa^2_{ijk}\,u_{ijk}.
\]
To describe the solution of the second integral, it is divided into integrals over each surface,
\[
\oint_{\partial E_{i,j,k}} = \sum_{\mathrm{dir}} \int_{\sigma^{\mathrm{dir}}_{i,j,k}},
\]
where dir denotes the surface; R (positive x-direction), L (negative x-direction), B (positive y-direction), F (negative y-direction), U (positive z-direction), and D (negative z-direction). Now, an approximation of this surface integral over each face is required. It is assumed that the gradient of u(s) is constant over each face and equal to the value at the centre of the face; the resulting scheme for the gradient on each face is described in Table S1. Furthermore, letting H be approximated by its value at the centre of the face, we have the approximation
\[
\int_{\sigma^{\mathrm{dir}}_{ijk}} \big(H\nabla u\big)\cdot n\,d\sigma \approx A(\sigma^{\mathrm{dir}}_{ijk})\,\big(H(c^{\mathrm{dir}}_{ijk})\nabla u(c^{\mathrm{dir}}_{ijk})\big)\cdot n^{\mathrm{dir}}, \tag{S9}
\]
where c^dir_{ijk} is the centre of face dir in the cell E_{ijk}, and A(σ^dir_{ijk}) is the area of the face. Equation (S9) is then combined with the scheme for ∇u(c^dir_{ijk}) from Table S1.

Table S1: Numerical scheme of the partial derivatives of u_{ijk} with respect to x, y and z on the different faces of cell E_{ijk}.
Denoting the components of H by H_{xx}, H_{xy}, H_{xz}, H_{yy}, H_{yz} and H_{zz}, the approximations for each face are written out componentwise by combining Equation (S9) with the schemes in Table S1. Next, a vectorization of the discretization is made; first moving along the z-direction, then along the x-direction, and lastly along the y-direction. Let us denote this with the common index l = j·M·P + i·P + k, so that s_{ijk} = s_{j·MP + i·P + k} = s_l, which gives u(s_{ijk}) = u_l and κ²(s_{ijk}) = κ²_l, and let the last index be L = (N − 1)MP + (M − 1)P + P − 1. Further, the vectorization results in the linear system of equations
\[
\big(D_V D_{\kappa^2} - A_H\big)\,u = D_V^{1/2} z, \tag{S10}
\]
where D_V = V·I_{MNP}, D_{κ²} = diag(κ²_0, …, κ²_l, …, κ²_L), and z ∼ N(0, I_{MNP}). For simplicity, the indices of the neighbours are denoted k_p = k + 1, k_n = k − 1, j_p = j + 1, j_n = j − 1, i_p = i + 1, and i_n = i − 1. Note that the corner points are not included in this scheme. Denoting A = D_V D_{κ²} − A_H, Equation (S10) can be written as
\[
A\,u = D_V^{1/2} z,
\]
and thus, the joint distribution of u is
\[
u \sim \mathcal{N}\big(0,\, Q^{-1}\big), \qquad Q = A^T D_V^{-1} A,
\]
where Q is a sparse matrix with 93 non-zero elements per row. These correspond to the point itself, its 18 closest neighbours, and their 18 closest neighbours; removing duplicates results in 93 points.
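To illustrate the structure of the resulting precision matrix, the following sketch assembles A and Q for the simplified isotropic case H = I, where the cross-derivative terms vanish and the full flux stencil collapses to the standard 7-point stencil. The grid sizes, the value of κ², and the zero-flux treatment of boundary faces are illustrative assumptions, not the discretization used in the paper.

```python
import numpy as np
import scipy.sparse as sp

# Toy regular grid; all values illustrative
M = N = P = 5
hx = hy = hz = 1.0
V = hx * hy * hz
n_cells = M * N * P
kappa2 = np.full(n_cells, 0.5)          # cell-averaged kappa^2 per cell

def idx(i, j, k):                       # vectorisation order from the text
    return j * M * P + i * P + k

rows, cols, vals = [], [], []
for i in range(M):
    for j in range(N):
        for k in range(P):
            l = idx(i, j, k)
            # for H = I the flux through a face is area * (u_nb - u_c) / h
            for di, dj, dk, area, h in [(1, 0, 0, hy * hz, hx), (-1, 0, 0, hy * hz, hx),
                                        (0, 1, 0, hx * hz, hy), (0, -1, 0, hx * hz, hy),
                                        (0, 0, 1, hx * hy, hz), (0, 0, -1, hx * hy, hz)]:
                ii, jj, kk = i + di, j + dj, k + dk
                if 0 <= ii < M and 0 <= jj < N and 0 <= kk < P:  # zero flux at the boundary
                    # subtracting the flux (this is -A_H) adds area/h on the
                    # diagonal and -area/h on the neighbour
                    rows += [l, l]; cols += [l, idx(ii, jj, kk)]
                    vals += [area / h, -area / h]

neg_A_H = sp.csr_matrix((vals, (rows, cols)), shape=(n_cells, n_cells))
A = sp.diags(V * kappa2) + neg_A_H      # A = D_V D_{kappa^2} - A_H
Q = (A.T @ (sp.identity(n_cells) / V) @ A).tocsr()
Q.eliminate_zeros()

print((abs(Q - Q.T) > 1e-12).nnz == 0)  # Q is symmetric
print(max(np.diff(Q.indptr)))           # nonzeros per row: 25 in this 7-point case
```

In this reduced case Q has at most 25 non-zero elements per row (the 7-point stencil applied twice); with a full anisotropic H, the stencil of A widens to 19 points and Q reaches the 93 non-zero elements per row stated in the text.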

C. Additional figures
In the application in Section 5, we estimate the parameters of a non-stationary anisotropic model and a stationary anisotropic model on a simulated dataset from the numerical ocean model SINMOD. The resulting properties of the non-stationary model are presented in Figure 7 in Section 5.2, since this model is the main focus of the application. The properties of the stationary anisotropic model fitted to the same dataset are presented in Figure S3. The marginal variance in Figure S3b, which should be constant for this stationary model, shows some variability caused by the boundary conditions. Notice that this boundary effect is also larger along the directions of strongest dependence, seen in the southern and northern corners.
Notice also the large discrepancies between the correlations of the two models, Figure S3c and Figure 7c; the stationary anisotropic model in effect captures an average correlation within the field.

Figure 2: Clamped B-spline basis with three basis functions in 1D.
…6 and σ_N = 0.1. This results in spatial ranges of 10.08 along the x-dimension, 6.75 along the y-dimension, and 3.88 along the z-dimension, with a marginal variance of 0.023.

Figure 4: Spatial correlation at location [26, 26, 20] (a) and variance of the spatial effect (b) in the non-stationary anisotropic model.
Figure 5: The area of operation in Trondheimsfjorden at Ladehammaren just outside of Trondheim, Norway. The compass shows the cardinal directions relative to the map.
Note that there are NMP = 22275 spatial locations and that the empirical innovations cover the whole day of May 27, 2021. Figure 7b shows the resulting variance of the spatial effect, and Figure 7c the spatial correlation with location (x, y, z) = (22, 10, 0) in the non-stationary anisotropic model. The corresponding figures for the stationary anisotropic model can be found in Appendix C, Figure S3.

Figure 7: Prior field (a) found from SINMOD simulations, the variance of the spatial effect (b), and the spatial correlation with point [22, 10, 0] (marked) (c) in the non-stationary anisotropic model. The N-arrow shows cardinal north.

Figure 8: Measurement locations of the AUV in the top 6 depth layers of the spatial field on May 27th, 2021, in Trondheimsfjorden at Ladehammaren just outside of Trondheim, Norway. The N-arrow shows cardinal north.

Figure 9: The root mean square error (RMSE, top) and the continuous ranked probability score (CRPS, bottom) of predictions from the stationary anisotropic (orange) and non-stationary anisotropic (blue) models given different proportions of observed data (5% to 95%). The error bars show the standard deviations of the measures under random permutations of the 9 segments.


Figure S3: Prior field (a) found from SINMOD simulations, the variance of the spatial effect (b), and the spatial correlation with point [22, 10, 0] (c) in the stationary anisotropic model. The N-arrow shows cardinal north.