Spatial Analysis Made Easy with Linear Regression and Kernels

Kernel methods are an incredibly popular technique for extending linear models to non-linear problems via a mapping to an implicit, high-dimensional feature space. While kernel methods are computationally cheaper than an explicit feature mapping, they are still subject to cubic cost on the number of points. Given only a few thousand locations, this computational cost rapidly outstrips the currently available computational power. This paper aims to provide an overview of kernel methods from first-principals (with a focus on ridge regression), before progressing to a review of random Fourier features (RFF), a set of methods that enable the scaling of kernel methods to big datasets. At each stage, the associated R code is provided. We begin by illustrating how the dual representation of ridge regression relies solely on inner products and permits the use of kernels to map the data into high-dimensional spaces. We progress to RFFs, showing how only a few lines of code provides a significant computational speed-up for a negligible cost to accuracy. We provide an example of the implementation of RFFs on a simulated spatial data set to illustrate these properties. Lastly, we summarise the main issues with RFFs and highlight some of the advanced techniques aimed at alleviating them.


Introduction
Spatial analysis can be seen as a supervised learning problem where we seek to learn an underlying function that maps a set of inputs to an output of interest based on known input-output pairs.In supervised learning, it is crucial that the mapping is done such that the underlying function can accurately predict (or generalise) to new data.Generally, the output, or response variable, is a variable measured at multiple geolocated points in space (e.g.case counts for a disease under investigation [1], anthropocentric indicators like height and weight [2,3], or socioeconomic indicators such as access to water or education [4,5]).Here, we shall consider response variables as y = (y 1 , y 1 , ..., y N ) ∈ R N , indicating that the metric of interest is a vector of length corresponding to the N locations.e inputs are referred to as explanatory variables or covariates and consist of multiple independent variables taken at the same N locations as the response variables.Epidemiological examples of explanatory variables are population, age, precipitation, urbanicity and spatial or space-time coordinates.Explanatory variables are given as X = [x 1 , x 2 , ..., x N ] ∈ R N ×d , i.e. a matrix (signi ed by the upper case X) of length (rows) corresponding to the N locations and width (columns) of d representing the number of explanatory variables at each location.is matrix is o en referred to as the design matrix or basis, with each row of the matrix represents a location and each column an explanatory variable.e nal step is to de ne a measurement equation, y = f(X) + , that links our explanatory variables to our responses.e goal of this paper is to cast complex spatial methods into a tool familiar to most researches -the linear model.We will introduce a large body of theory known as model-based geostatistics [6] from a machine learning perspective with the goal of slowly build the theory from scratch and arrive at a predictive algorithm capable of making inferences.To arrive at this goal, we will rst introduce linear regression and a more powerful variant of linear regression called ridge regression.We will then introduce kernels as a way to create complex functions and show how kernels can be learnt in the linear regression framework.Finally, we introduce a powerful new approach, random Fourier features, that is computationally favourable.We present code wherever relevant, include a brief toy example in R where we t a spatial problem using nothing more than linear regression and some transforms.
Within this paper, we focus on real-valued response variables as this scenario is easier to handle computationally and also be er at illustrating the mathematical theory of Gaussian processes.ere is considerable overlap between our introduction of kernel learning and the more traditional formulations based on Gaussian process regression.For an introduction to the Gaussian process and model-based geostatistics, we refer the reader here [7,8].For a detailed description of the mathematical correspondence between kernels and Gaussian processes, we refer the reader here [9].

Linear Regression
One of the most popular choice of model in supervised learning is the linear model.A linear model assumes that the outputs are a weighted linear combination of the inputs with some additional uncorrelated (independent) noise, i , [10] such that for any location x i,j w j + i , for j = 1, 2, ..., d Where w = (w 1 , w 2 , ..., w d ) ∈ R d represents the column vector of weights (or coe cients) of length d used to transform the input to the outputs.We can write this much more succinctly using matrix notation: By changing the values of the weights, w, we can generate a wide range of di erent functions.e learning task is to nd an optimal set of weights (or more broadly parameters) that can produce a function which reproduces our data as close as possible.To do this, we rst need to de ne what is meant by "as close as possible".Mathematically, we need to de ne a function, known as the object function, that measures the quality of our model given a set of weights.In the case of linear regression, a common objective function is given by: is objective function is o en called the squared loss function and computes the sum of the squared di erences between the measured response variables and the model's prediction of the responses for a given set of weights.Note, the multiplication by half only added to make taking the derivative simpler and does not a ect the solutions.e learning task is therefore to nd the set of weights that minimise (or maximise the negative of) the objective function and correspondingly produce the best possible solution for our model.Formally we seek to solve: min w∈R d y − Xw 2 Elementary calculus tells us that a minima of a given function can be found when its derivative with respect to the parameters are zero.i.e.
Rearranging the derivative allows us to solve for the optimal set of weights, w * , that minimise the objective function Equation 5 is termed the ordinary least squares (OLS) solution.With these optimal weights, the model can predict the response for any given set of inputs.Linear regression is ubiquitous across all elds of science due to the ease of derivation, the simplicity of the solutions, and the guarantee that when the assumptions of the Gauss-Markov theorem are met the OLS estimator is a best linear unbiased estimator (BLUE) (see [11] for the standard proof).However, this does not mean that linear regression will always yield good results.Some problems are non-linear such that even the best linear model is inappropriate (we will address this using kernels in the following sections).
In other scenarios, the data o en violate the assumptions that guarantee OLS as the BLUE.
One of the most common violated assumptions is that the explanatory variables are uncorrelated (or orthogonal).In the most extreme case, one explanatory variable can be an exact linear combination of the one or multiple other variables, termed perfect multicollinearity [12].A common cause for perfect multicollinearity is in overdetermined systems when the number of explanatory variables exceeds the number of data points (the width of X is greater than its length, d > N ), even though there may be no relationship between the variables.Both multicollinear and overdetemined systems are common in epidemiological and genetic studies [13,14].When the data contains perfect multicollinearity, the matrix inversion (X T X) −1 is no longer possible, preventing the solving for the weights in Equation 5. Multicollinearity (as well as other violations to the Gauss-Markov assumptions) is characterised by a poor performance in model testing and aberrant weights o en despite good performance in training.
e Bias-Variance Trade-o Surprisingly, one of the biggest problems with OLS is that it is unbiased.When there is a large number of predictors (i.e.d is large), having an unbiased estimator can greatly over t the data, resulting in very poor predictive capability.A celebrated result in machine learning, the bias-variance decomposition [15,16] explains why this is the case for the squared loss function.For the squared loss, and a linear model Xw, some data y and the true function f (X) the loss can be decomposed as a sum of expectations as: Here, the bias term can be thought of as the error caused by the simplifying assumptions of the model.e variance term tells us how much the model moves around the mean.e more complex a given model, the closer it will be able to t the data and the lower its bias.However, a more complex model can "move" more to capture the data points resulting in a larger variance.Conversely, if the model is very simple it will have low variance but might not be able to t a complex function resulting in high bias.is creates a trade-o , reducing bias means increasing variance and vice versa.In Figure 1, data points were generated from a simulated epidemic, with the y-axis representing the number of cases and the x-axis time measured in days.e true latent function of the simulated epidemic is a degree-2 polynomial with added Gaussian noise given by y = 1 + X + 0.2X 2 + , where ∼ N (µ = 0, σ 2 = 150).
e function is le truncated such that the number of cases is always greater than zero.
ree Polynomial models of di erent degrees were ed (linear/degree-1 in red, degree-2 in green and degree-5 in blue).
e most complex, degree-5 polynomial has the closest ts to the training points in Figure 1A.However, when the error is calculated for the testing data (not used to train the model), the degree-2 model has the lowest error, shown in Figure 1B.e crucial point here is despite the degree-5 model ing the training data be er than the degree-2 model, it poorly predicts data not included in the training data-set.e extra parameters in the degree-5 model allow the model to move to meet all of the training points resulting in a very low bias.However, this movement to meet the training points results in very high variance, with the model ing to the random noise in the data.erefore, the degree-5 polynomial can be said to over t the data, with the model's high variance out-weighting its low bias.e opposite is true of the linear model, where the linear functions cannot capture the non-linearity in the data resulting in a high bias that outweighs its low variance.e linear model under ts the data.e degree-2 polynomial strikes a balance between bias and variance, with the functions exible enough to capture the non-linearity of latent function without ing to the noise, capturing an optimal balance between bias and variance.
An approach that is frequently adopted to maintain optimal bias-variance trade-o is to restrict the choice of functions in some way.Such a restriction is referred to as regularisation and can constrain weights values, stabilise matrix inversion, and prevent over-ing.

Ridge Regression
e most common choice of regularisation is called Tikhonov regularisation or ridge regression/weight decay in the statistical literature [17][18][19].e fundamental idea behind ridge regression is to prevent overly complex functions by favouring weights and thus functions with small norms (small absolute values). is is achieved by adding a penalisation term with respect to the norm of the weights to our objective function giving us the standard objective function for ridge regression: e objective function for ridge regression is identical to the OLS objective function but with the addition of a regularisation term.Again, multiplying both components by half is to improve the simplicity where taking derivatives.e regularisation term is consists of a Euclidean norm term, w 2 , and and a regularisation parameter λ. e Euclidean norm terms computes the positive square root of the sum of squares of the weights, w 2 := w 2 1 + w 2 2 + ... + w 2 d and measures the complexity of a function.Functions with many large weights capable of producing highly non-linear functions will have large norms.
e λ parameter, a positive number that scales the norm and controls the amount of regularisation.When λ is big, complex functions with large norms are heavily penalised, with λ w 2 term signi cantly increasing the value of the objective function.erefore, when λ is big, minimising the objective function favours less complex functions and small weights.As λ approaches zero, large norms are less heavily penalised, with the product of λ w 2 ge ing smaller and smaller, allowing more complex functions with larger weights as optimal solutions to Equation 7. When λ = 0 the objective function is the same as the OLS model.e addition of regularisation forces the objective function to t the data as closely as possible without creating functions that are too complex.erefore, λ can be used to control the bias-variance trade-o .As λ get bigger the bias increases and variance decreases and vice versa.However, regularisation results in model weights that are biased towards zero, and thus (by design) the ridge estimate is a biased estimator.But as shown in the bias-variance trade-o , the increase in bias is out-weighted by lower variance and improved testing performance of the model.
As was the case with linear regression, the aim is to nd the vector of weights that minimise the ridge regression objective function: Again this is calculated by taking the derivative of the objective function, se ing it to zero, and solving for the weights: For prediction we now have y * = x * w * = x i (X T X + λI n ) −1 X T y where I n is the identity matrix (a square matrix in which all the elements of the principal diagonal are ones and all other elements are zeros) of dimensions N × N .
e above solution is termed the primal solution of the ridge regression.e addition of λI n ensures that when λ > 0 the matrix (X T X + λI n ) is always invertible, and hence allows for stable computation.Secondly, the addition of lambda allows us to control the complexity of the optimal solutions.Optimising the primal involves solving the d × d system of equations given by X T X, with computational complexity O(d 2 N ) in the best case (the big O notation, O, used to denote how the relative running time or space requirements grow as the input size grows).is complexity is extremely useful because even when the number of locations, N , is large, the dimensions of the explanatory variables dominate the primal solution.

e Dual Solution and the Gram Matrix
Interestingly, the primal form can be rewri en using the following matrix identity Taking this rearrangement and plugging it back into the primal solution in Equation 5gives a new equation for the weights and for model prediction: is new equation can be further abstracted by de ning α = (XX T + λI n ) −1 y such that the equations can be expressed as is derivation shows that the vector of weights, w, can be wri en as a linear combination of the training points where each x i represents a row of the design matrix corresponding to the explanatory variables at a location.For any new observation the output is given by y is solution is termed the Dual solution of the ridge regression and the variables α ∈ R N are called dual variables.Rather than nding the optimal weights for each of the d explanatory variable (the column of the design matrix), we nd the appropriate dual variable for each of the N locations (the rows of the design matrix).In contrast to the primal, the dual solution requires inverting an N × N dimensional matrix, XX T , with complexity O(N 3 ) in the worst case.
Given, that the dual has been derived directly from the primal it is easy to see that the solutions are equivalent, with the two solutions said to exhibit strong duality [20].However, the dual form can be derived directly from ridge regression objective function independent of the primal by expressing the problem as a constrained minimisation problem and solving using the Lagrangian (Supplementary Equations 1).
Given the much higher complexity of the dual solution, it is not immediately obvious why this solution would ever be useful.However, a property of dual solution in Equation 12is that it requires computing XX T , with the resulting matrix a symmetric positive semide nite matrix called the Gram matrix, G. G contains all of the pairwise inner product of inputs across all N locations.An inner product is a way to multiply vectors together, with the result of this multiplication being a scalar measure of the vectors similarity.Explicitly, the inner product of two vectors x and z ∈ R d is given by: where •, • is used to signify the inner product.erefore, the Gram matrix can be thought of as containing the similarity between all pairs of inputs.Indeed, if these vectors are centred random variables, G is approximately proportional to the covariance matrix.is notion of similarity is central in spatial analysis where we want to leverage the fact the points close to each other in space are similar.
As an example, the entry at row i column j of matrix G represents the inner product between the vectors of explanatory variables at location i and j respectively (corresponding to rows i and j in the design matrix).
is is wri en as g ij = x i , x j .e full gram matrix is given by G = X, X .erefore, in the dual solution, we can substitute XX T for the G, and x * X T with g(x * , X), the inner product between a new data point and the training points, ( x * , X ) giving: However, the question remains, how can we use this useful construction of the Gram matrix to model non-linear functions?

Non-Linear Regression
Up to this point, the equations have only represented linear models, where the outputs are assumed to be some linear combination of the inputs.However, many problems cannot be adequately described in purely linear terms.One approach to introduce non-linearity to the model is to transform the explanatory variables using non-linear transformations, such that the output is described as a linear combination of non-linear terms.For example, rather than a weighted sum of linear terms (i.e.x 1 + x 2 + x 3 ), we may instead use terms with exponents, logarithms or trigonometric functions (i.e.exp(x 1 )+log(x 2 )+sin(x 3 )).
Transforming the inputs rather changing the model allows us to maintain all of the convenient maths we have derived for linear models to create non-linear ones.
Mathematically, the input data is said to be projected from a linear input space to some new, and potentially non-linear, space.e projecting of data is termed a (non-linear) feature mapping and the new space to which the data is mapped is called the feature space.Figure 2 is an example of a mapping to a feature space such that the output can be expressed in linear terms (Supplementary Code 2). Figure 2: An example of non-linear feature mapping, where the data is mapped for (A) an input space in which the problem is non-linear into a new feature space (B) in which the outputs can be described as a linear combination of the inputs.
Generally, a mapping is denoted by φ (or Φ when applied to a matrix).
e general form of a feature mapping is given by: φ : e vector, x i , is mapped from a vector of length d to a vector of length D denoted as φ(x i ).Applying this mapping to the entire design matrix gives a new design matrix of explanatory variables in the feature space, Φ(X) ∈ R N ×D .A mapping can project to into higher-dimensional (d < D) or lower-dimensional (d > D) space, although in the context of regression, li ing to a higher dimension is more common.As an explicit example, consider the following mapping: is mapping li s the data from 2-dimensions into 3-dimensions (R 2 → R 3 ).e same set of equations can be used to solve a linear model in feature space a er the mapping.For example, the primal for ridge regression is now given by: min All that was required to derive these equations is to substitute the design matrix in the original input space, X, with the new design matrix in the feature space, Φ(X) (with the equivalent mapping for new input points, φ(x * )).
is substitution also applies to standard linear regression and the ridge regressor dual but are omi ed for brevity.e primal solution now requires solving for the weights in R D .erefore, it is easy to see how a very high dimensional mapping (D d and D > N ) solving the primal will computationally very di cult.ankfully, due to our dual ridge solution, this overdetermined situation does not present a computational problem -no ma er how many terms we add, the complexity will still be O(N 3 ).
However, it is rarely apparent a priori what is the most appropriate mapping to apply to the data.erefore, while the dual allows for very high dimensional feature mapping, the question remains, which mapping should we use?How many terms should be added?How do we capture interactions between terms?A brute force approach can quickly become combinatorially large.Given that we can limit model complexity using a ridge regularisation, the ideal situation would be to be able to de ne a large, if not in nite mapping capable of nearly any function.We can do this using kernels.

Kernel Methods
As highlighted earlier, an ideal feature mapping would be to a large or in nite feature space to which we can then apply regularisation.e dual solution ensures that we only need to solve for the same number of dual variables regardless of the dimensionality of the feature mapping (remember dual variables are in R N with complexity is O(N 3 )).However, this raises an interesting question: can we solve ridge regression in an in nite-dimensional feature space?
We do not want to (and can't) compute all the in nite terms required for explicit in nite-dimensional feature mapping.To work with these in nite dimensional spaces, we need to move away from explicit to implicit feature maps.To arrive at an implicit map, consider solving the dual using the aforementioned feature mapping in Equation 18. e dual solution requires computing the inner product the inner product between all pairs of inputs in feature space, such that the inner product between any two input vectors, X and Z, rst requires their mapping to feature space: en computing the inner product between the vectors in feature space: is calculation of the inner product gives a scalar value of the similarity between the two vectors in feature space but required explicitly computing each of the feature mappings of the two vectors before taking their inner product.What is interesting is that the equation for the inner-product in feature space (Equation 24) takes the form of a quadratic equation between x and z.Factorising the quadratic gives: We have just arrived at a new way of computing the inner product in feature space.e inner product between the vectors in feature space is equal to the square of the inner product in input space.e beauty of this result is our new function for the inner product in feature space does not require computing the explicit mapping of our data (does not require computing Φ(X)).We have already established that solving the dual only requires the inner product.erefore, a solution to the dual can be obtained without ever computing and storing the explicit feature mapping as long as we have a function to directly compute the inner product in feature space.
e ability for linear models to learn a nonlinear function while avoiding an explicit mapping is a famed result in machine learning known as the kernel trick [21] with the functions that compute the inner product in feature space are called kernels functions.Kernels, therefore, perform the mapping k : X × X → R between inputs x, z ∈ X .
e resulting matrix of inner products generated by the kernel function is termed the kernel matrix (and can be thought of as the Gram matrix in feature space).We can represent the kernel matrix by K (or k for vectors) such that: e dual equations can now be wri en in terms of kernels given by: is equation encapsulates all the theory we have developed thus far.Using the squared loss, we can compute in closed form the optimal solution for the weights.Using the kernel trick, we can project our input data implicitly into an in nite-dimensional feature space.Using regularisation through the ridge penalty, we can prevent our solution from over ing.us, we have derived a powerful non-linear model using no more same equations than we would have used for a linear model.For the speci c example of the mapping in Equation 23, the kernel function is given by K(x, z) = x, z 2 and represent a second-degree polynomial kernel.A more general formula for a degree-p polynomial kernel is given by K θ,p (x, z) = x, z + θ p (for values of p ∈ Z + and θ ∈ R ≥0 ), where θ is a free parameter that allows control over the in uence of higher-order and lower-order terms in the polynomial in the kernel function.However, there are many kernel functions; an example of an in nite dimensional kernel is the popular and widely used squared exponential kernel: Where called the length scale, controls the distance over which the range over which the kernel similarly operates and σ determines the average distance from the mean.e proof that a squared exponential kernel gives rise to an in nite feature mapping uses a Taylor series expansion of the exponential, shown in Supplementary Equations 2.

e Big N Problem
It is evident that the combination of the dual estimator and the kernel function is a powerful tool capable of extending linear models to handle non-linear problems.One of the primary motivations for considering the dual was that it required computing and inverting an N × N dimensional matrix rather than the d × d dimensional matrix of features in the primal.Even as d grows in nitely large, the kernel matrix remains of constant size and involves solving for the same number of dual variables.Historically this led to the widespread adoption of kernel methods (kernel ridge regression, support vector machines, Gaussian processes etc.) to solve di cult problems on small datasets.However, the dual still requires inverting and storing the kernel matrix, which for N observations will be O(N 3 ) complexity and need O(N 2 ) storage.Given only a few thousand points, these polynomial costs rapidly outstrip computational power.
A plethora of methods have been developed to aid the scaling of kernels methods to large datasets.Broadly, these methods aim to nd smaller/simpler matrices that are good approximations of the full kernel matrix.
e three major techniques are low-rank approximations, sparse approximations and spectral methods.Low-rank approximations of a matrix aim to nd smaller representations of the kernel matrix that contains all (or nearly all) of the information in the full kernel [22].For example, the popular Nyström approximates the full kernel matrix through a subset of its columns and rows [23].In comparisons, sparse methods aim to nd representations of the matrix that are mostly zeros because there exist e cient algorithms for the storage of and computation with such matrices [24][25][26].One of the best examples is the sparse matrix generated when modelling spatial data as a Gaussian Markov random eld (GMRF) that are solutions to Stochastic Partial Di erential Equation (SPDE) [27][28][29].However, the remainder of this paper will focus on a new, exciting subset of spectral methods called random Fourier features (RFF).

Random Fourier Features
RFF and other spectral method utilise the characterisation of the kernel function through its Fourier transform.A Fourier transform allows for the decomposition of any function into the periodic functions of di erent frequencies that make it up.e central idea behind these methods is that a good approximation of the kernel in the frequency domain (where the function described in terms of the periodic functions that make it up) will naturally yield a good approximation to the kernel function in its original domain.
All spectral methods are based on the same mathematical foundation; speci cally, the celebrated Bochner's theorem [30].Loosely, Bochner's theorem states that a shi -invariant kernel functions (where the output of the kennel is only dependent on the di erence between the inputs and not the explicit values of the input themselves) k(x 1 , x 2 ) = k(δ) for δ = |x 1 − x 2 |, can be expressed through a Fourier transform [31]: If we apply Euler's identity to the exponential and ignore the imaginary component of Equation 29we can express this integral as: is real part of Bochner's theorem computed by projecting the data onto the line drawn by ω (given by ω T x 1 and ω T x 2 ), pass these projections through cos and sin and stack them together.To understand why this process works consider the two key components, ω and its distribution P(ω).e variable ω is the frequency of the periodic functions.e distribution of these frequencies, P(ω), is called the spectral density of the kernel and gives the 'importance' of a given frequency, ω, in constructing the kernel function.
is is visualised in Figure 3, showing di erent spectral densities (Figures 3A,C,E,G,I) and the resulting functions produced by sampling from the kernel generated by each spectral density (Figures 3B,D,F,H,J).
e code for sampling the spectral densities and generating functions is given in Supplementary Code 3. e spectral densities in A,C,E,G correspond to sampling from delta functions (arrowheads) such that the sampled frequencies, ω, can only take the point values corresponding to each delta function.
In gure 3A the spectral density is composed of two delta functions, such that samples frequencies (ω) can only take values of 1 or 2. e functions generated by sampling from this spectral density show strong periodicity and closely resemble the standard trigonometric functions with corresponding frequencies (Figure 3D).When the frequencies of the two delta functions are increased to ∈ {10, 20}, the functions are again highly cyclical but, due to their higher frequencies, have rougher sample paths and a much smaller period (small changes in x can cause greater changes in y) (Figure 3C,D).By expanding the spectral density to 5 peaks, the sample paths show considerably more variation due to the inclusion of a larger variety of frequencies (Figure 3E,F and 3G,H).Finally gures 3H and 3J show samples functions generated by sampling frequencies form a Gaussian and Cauchy distribution respectively.e Gaussian spectral density corresponds to a spectral density of the squared exponential kernel (Gaussian kernel) and gives rise to smooth sample functions with a tremendous amount of variety when compared to the simpler spectral densities (Figure 3J).e Cauchy distribution corresponds to a spectral density generated by the Fourier transform of the Laplacian kernel and generates functions with a high degree of roughness (Figure 3L) due to the inclusion of very high frequencies in the long tails of the distribution (Figure 3K).Table 1 below shows the resulting power spectral densities generated by some common shi -invariant kernels.It should also be noted that empirical spectral densities can be used [32][33][34], and non-stationary extensions of Bochner's theorem exist [32,35] (discussed in Non-stationary and arbitrary Kernel Functions section).However, a major problem is that evaluating the integral in Equation 30 requires integrating over the in nite set of all possible frequencies.To get around this, we can approximate this in nite integral by a nite one using Monte Carlo integration.In Monte Carlo integration the full integral of a function is approximated by computing the average of that function evaluated at a random set of points.erefore, for RFF, rather than an in nite set of frequencies, the spectral density is approximated by averaging the sum of the function evaluated at random samples of ω drawn from the spectral density.e more samples that are evaluated, the closer the approximation gets to the full integral.Indeed, one of the best properties of random Fourier features is that of uniform convergence, with the Monte Carlo approximation of the entire kernel function converging to the full kernel uniformly (rather than pointwise).

Kernel
Kernel Function, K( Matén Table 1: Common shi -invariant kernels and their associated spectral densities erefore, the in nite integral in Equation 30 can be converted to a nite approximation by taking multiple independent samples from the power spectral density, P(ω), and computing the Monte Carlo approximation to the kernel k(x 1 − x 2 ) via: ∼ P(ω) (31) Given that the spectral densities are probability distributions, it is o en reasonably trivial to sample frequencies from them.For example, generating the frequencies for approximating a squared exponential is as simple as independently sampling ω's from a Gaussian distribution (Table 1).Equation 31is truly an astoundingly condensed result and can be wri en in its entirety in just 4 lines of R-code: e RFF approach results in an approximation of the whole kernel matrix.erefore, it is di erent from other low-rank methods that try to approximate parts of the full kernel matrix.Using the Woodbury matrix inversion formula we can reduce the complexity of inverting the whole kernel matrix from O(N 3 ) to O(m 3 ) (where m is the number of i.i.d.samples from P(ω)) to solve the dual [36].However, one of the key observations about RFF's is that the random basis matrix Φ RFF de nes a function space of its own!Technically a function space that is dense in a reproducing Hilbert space -the same space of functions from our kernel matrix.We can de ne this feature space as: is can be wri en using matrix notation as Φ(X) = [cos(XΩ T ) sin(XΩ T )] ∈ R N ×2m , where matrix X ∈ R N ×d is the design matrix and Ω ∈ R m×d is the frequency matrix with rows corresponding a sampled ω ([ω 1 , ..., ω m ]).See Supplementary Equations 3 for a more comprehensive walk-through of the steps from Bochner's theorem to the RFF equation.
We can use the Φ RFF (X) T Φ RFF (X) ∈ R m×m matrix to solve the primal.e resulting linear model takes the form y ∼ Φ RFF (X)w.erefore, for a spatial data set, the method only requires mapping the explanatory variables using RFF and ing a linear model.erefore, the ubiquitous linear model is converted to a much more expressive form, while retaining all the desirable mathematical properties that make linear models so popular.e theoretical properties of RFF estimators are still far from fully understood.
e seminal paper by Rahimi and Recht [37] showed that every entry in our kernel matrix is approximated to an error of ± with m = log(N ) 2 Fourier features.A more recent result shows that only

√
N log(N ) features can achieve the same learning bounds as full kernel ridge regression with squared loss [38]. is is particularly relevant for large datasets.For example, given 100, 000 data points, we would only need 3600 features to achieve the generalisation error as if we had used all points.
Beyond the General Linear Model roughout this paper, we have focused on Gaussian likelihoods/Squared loss, but Fourier bases can be used with any loss function.For example, when performing classi cation with binary data, i.e. y ∈ {0, 1} the cross-entropy loss (also known as log loss) can be used, given by: where ϕ is the Sigmoid or Logit function.Another example is the use of a Poisson likelihood to model count data [39].More broadly, generalised linear models (GLM) encompass the extension of linear regression to response variables that have a restricted range of potently values (such as binary or count data) or non-Gaussian error [40]. is generalisation is achieved by le ing the linear be related to the response variable via a link function.e link function ensures that the model's estimated responses are on the correct range and have the appropriate error structure but is still a function of the weighted linear sum of explanatory variables.us, we can still apply feature mapping including RFF and can continue to t these models using maximum likelihood.
ese models can easily be extended to include uncertainty through Bayesian inference.For example, Bayesian linear regression is achieved by assuming that both the response variables and the model parameters are drawn from distributions, with the aim to nd the posterior distribution of the model parameters given the input and output variables.By specifying a Gaussian likelihood, with a Gaussian prior on our coe cients, in expectation, our posterior is the same as that of ridge regression.We can evaluate full posterior uncertainty by sampling from the posterior distribution.

Toy Example of Random Fourier Features for Spatial Analysis
As an example, we simulate a non-linear spatial regression problem.e code for this example is provided in Supplementary Code 4. A set of random points in space is generated, such that each location has unique coordinates (longitude and latitude).Each location has a response variable generated from the same latent function as in Figure 2 with added Gaussian noise (y = x 2 1 + x 2 2 + where ∼ N (µ = 0, σ 2 = 1)).In gure 4a we show random 500 points drawn from the spatial process (the code can easily be changed to any function of the user s choice).Note that in the provided code generates points at random and therefore a user's results may di er from the exact results shown here.e simulated data were used to train three models, a linear regression model, a non-linear, kernel regression model and a kernel ridge regression model (KRR).Both the kernel regression and kernel ridge regression models use RFFs to approximate a Gaussian kernel approximation (user can specify the number of features).We assume, as is common for nearly all real-world spatial processes, that we only observe a subset of all possible locations.erefore, models are trained on only 20% of all the generated points, shown in Figure 4b.e remaining data not used to training is used for testing.Each model was trained using the same training data and their predictive performance measured by MSE for the testing data.For both kernel methods (kernel regression and kernel ridge regression), 100 Fourier features are used.For KRR, k-fold cross-validation was used to nd the optimal regularisation parameter (λ ridge ).e training and testing performance of the three models are shown in Table 2.As expected, the non-linear nature of the latent function results in very poor performance of the linear model with large training and testing error.e di erence between the kernel regression (without regularisation) and KRR (with regularisation) is indicative of bias-variance trade-o discussed earlier.
e Kernel regression model has excellent training performance, with the in nite feature space of the Gaussian kernel permi ing the ing of highly complex functions.However, in the absence of regularisation, the kernel regression model greatly over ts the training data resulting in poor testing performance.If we consider the bias-variance trade-o , the kernel regression model has a very low bias, but high variance.In comparison, the regularisation in kernel ridge regression help to prevent this over ing by penalising the overly complex models.A er training, the KRR model (with λ ridge = 3.98) has marginally higher training error than the kernel regression model but less than half the testing error.e regularisation has increased the KRR model's bias, but this is outweighed by the reduction in variance, corresponding to a small decrease the training accuracy but signi cant improvement in testing performance.Note, in this example the latent function is fairly simple (a 2D quadratic); therefore the number of features required for good performance is low.

Model
Training  5. Increasing the number of sampled Fourier bases increases the model's ability to t the training data and results in a steady reduction in training error kernel regression (Figure 5a, blue line).In comparison, KRR has a higher than the training error than kernel regression that remains constant even with additional features (Figure 5a, red line).As the number of features increases the kernel regression model increasingly over ts the training data resulting in poor testing performance (Figure 5b, blue line).Importantly, the testing performance of KRR is signi cantly lower than kernel regression and remains low even with additional features (Figure 5b, red line).e regularisation in KRR constrains model complexity, such that as more features are added the regularisation prevents over ing by increasing the magnitude of the regularisation parameter, λ ridge (Figure 5b, inset).As a result, KRR maintains a good bias-variance trade-o across all numbers of features.

Advanced Methods for Random Fourier Features Limitations of RFF's
Given the immense popularity and good empirical performance of the RFF method, li le has been published on their limitations.However, RFF does have its limitations, including in the context of spatial analysis.From here onward, we term the RFF method described previously as the standard RFF method.
Firstly, RFF's can be poor at capturing very ne scale variation as noted in Ton et al. [32]. is is likely due to ne-scale features being captured by the tails of the spectral density that will be infrequently sampled in the Monte Carlo integration.Secondly, from a computational perspective, RFF's are very e cient but can still be poor compared to some state-of-the-art spatial statistics approaches.For example, the sparse matrix approaches based on SPDE as solution GMRF provide impressive savings with complexity O(m 1.5 ) (compared to O(m 3 ) from the RFF primal solution) [27].Other methods such as the multiresolution Kernel approximation (MRA) provide incredible performance [41].However, it should be noted that many of these methods, such as MRA, are only valid for two dimensions [41], unlike RFF's which naturally extend to high-dimensions.irdly, while the convergence properties of RFF suggest excellent predictive capability [38], alternative data-dependent methods such as the Nyström approximation can perform much be er in many se ings [42,43].e following sections will discuss the current methods that address some of these limitations.

asi-Monte-Carlo Features (QMC RFF)
One of the most signi cant limitations of standard RFF is the Monte Carlo integration.e in nite integral that describes the kernel function is converted to a nite approximation by sampling ω's from the spectral density.e convergence of Monte Carlo integration to the true integral occurs at the rate O(m 0.5 ), which means for some problems a large number of features are required to approximate the integral accurately.
A popular alternative is to use asi-Monte Carlo (QMC) integration [44].In QMC integration, the set of points chosen to approximate the integral is chosen using a deterministic, low-discrepancy sequence [45].In this context, low-discrepancy means the points generated appear random even though they are generated from a deterministic, non-random process.For example, the Halton sequence, that generates points on a uniform hypercube before transforming the points through a quantile function (inverse cumulative distribution function) [46].Low-discrepancy sequences prevent clustering and enforce more uniformity in the sampled frequencies, allowing QMC to converge at close to O(m −1 ) [47].QMC can provide substantial improvements in the accuracy of the approximation of the kernel matrix for the same computational complexity.Crucially, QMC is trivial to implement within the RFF framework for some distributions.For example, for the squared exponential kernel, instead of generating features as by taking random samples from a Gaussian, we generate them as:

Leverage Score Sampling
In the standard RFF method, frequencies are sampled with a probability proportional to their spectral density.However, the formula for the power spectral density, and concurrently sampling probability of a given frequency, is data-independent such that the formula for spectral density does not require data (see Table 1).Data-independent sampling is sub-optimal and can yield very poor results [48][49][50], and has been identi ed as one of the reasons RFFs perform poorly in certain situations [51].An alternative is a data-dependent approach, that considers the importance of various features given some data.Several datadependent approaches for RFF have been proposed [38,51,52], but one of the most promising and easiest to implement is sampling from the leverage distribution of the RFF (abbreviated to LRFF) [51].
Leverage scores are popular across statistics and are a key tool for regression diagnostics and outlier detection [53,54].e leverage score measures the importance a given observation will have on the solution of a regression problem.However, this perspective of leverage scores as a measure of importance can be extended to any matrix.e leverage scores of a matrix A is given by the diagonal matrix T = A(AA T ) −1 A T .e leverage score for i-th row of matrix A is equal to the i-th diagonal element in matrix T , denoted as τ i,i and calculated by: τ i,i can also be seen as a measure of the importance of the row a i .
Most leverage score sampling methods apply ridge regularisation to the leverage scores,c ontrolled by regularisation parameter λ LRFF given by: e resulting scores are termed ridge leverage scores [55].e regularisation serves a nearly identical purpose as when applied in the context of linear regression; ensuring the inversion required to compute the scores is always unique and less sensitive to perturbations in the underlying matrix, such as when only partial information about the matrix is known.e ridge parameter is key in stabilising leverage scores and permits fast leverage score sapling methods that approximate leverage scores using subsets of the full data [56][57][58][59].
Leverage score sampling aims to improve the RFF method by sampling features with probability proportional to importance (rather than their spectral density).
is is achieved by sampling columns of the feature matrix with a probability proportional to their leverage scores.e resulting samples should contain more of the important features, and thus a more accurate approximation to the full matrix given the same number of samples.
e computation and inversion of the matrix M increases the computational burden of this approach, but only has to be calculated once for a given regularization parameter.A key distinction to note is that the formula for the leverage score up to this point (and in the majority of the literature) have sought the leverage score of the rows.For RFF we want the leverage scores of the columns as they correspond to the Fourier bases.erefore, the ridge leverage score of the Fourier features matrix is given by: With suitable scaling, we can now sample Fourier features with a probability proportional to the leverage distribution, allowing us to sample features proportional to their importance.e code is given by: 1 # G e n e r a t e t h e m f r e q u e n c y s a m p l e s ( Omega ) from t h e PSD 2 P r o j = x % * % t ( Omega ) # P r o j e c t i o n 3 P h i = c b i n d ( c o s ( P r o j ) , s i n ( P r o j ) ) / s q r t (m) # B a s i s 4 M = t ( P h i ) % * % P h i # P r i m a l m a t r i x 5 T l r f f <− M % * % s o l v e (M + n * d i a g ( x= lambda ,m) ) 6 p i s <− d i a g ( T l r f f ) # D i a g o n a l e l e m e n t s 7 l <− sum ( d i a g ( T l r f f ) ) #Sum o f d i a g o n a l e l e m e n t s o f T 8 i s wgt <− s q r t ( p i s / l ) # L e v e r a g e s c o r e s Code 3: LRFF Example

Orthogonal Random Features
One of the bene ts of RFF's is that they can de ne kernels in high dimensions.For example, one can use a kernel in 4-dimensions to represent Cartesian spatial coordinates x, y, z and time, t, the foundation for any spatiotemporal modelling.However, increasing dimensionality comes at a cost.While RFF's are unbiased estimators with respect to the expectation of the kernel matrix [37], increase the number of dimensions of the data signi cantly increases the variance of the RFF's estimate and requiring signi cantly more features to achieve an accurate approximation [60].One proposed approach to solve this issue with high dimensional kernels is to draw each new feature dimension orthogonally.
In the standard RFF method, the sampled frequencies can be concatenated into a frequency matrix, Ω ∈ R m×d .If we consider a squared exponential/ Gaussian kernel, Ω is actually just a random Gaussian matrix, as the sampled frequencies are drawn from a standard normal distribution and scaled by the kernel parameter, σ. erefore, the matrix Ω = 1 σ G, where G is a random Gaussian matrix of dimension R d×d (and should not be confused with the gram matrix).
In orthogonal random features (ORF), the aim is to impose orthogonality on Ω, such that it contains signi cantly less redundancy than a random Gaussian matrix, capable of faster convergence to the full kernel matrix with lower variance estimates.e simplest method to impose orthogonality would be to replace G with a random orthogonal matrix, O.However, if we consider the squared exponential/Gaussian kernel, the row norms of the G matrix follow will Chi distribution.In comparison, the orthogonal matrix, O, will have (by de nition) rows with unit norm.us, simply replacing G with O means that the RFF will no longer be an unbiased estimator.
To return to an unbiased estimator, the orthogonal matrix, O, must be scaled by the diagonal matrix, S, with diagonal entries random variables from Chi distributed with D degrees of freedom.
is ensures that the row norms of G and SO will be identically distributed and can be used to construct a matrix of orthogonal random features given by Ω ORF = 1 σ SO.It should be noted that the elements of S are only Chi distributed random variables when G is a random Gaussian matrix (the de nition of a Chi distributed random variable identical to taking the L2 norm of a set of standard normally distributed variables).For other kernels, the diagonal elements of matrix S are computed as the norm for the corresponding row in G. erefore, the i-th diagonal element of S is calculated by: erefore, generating orthogonal random features for a given kernel requires two steps.First, derive the orthogonal matrix O by performing QR decomposition on the feature matrix G (where O corresponding to the Q matrix of QR decomposition).See [61] for an excellent summary of the QR decomposition.Second, compute the diagonal entries of the matrix, S, by talking the norms of corresponding rows of G.We then compute the orthogonal feature matrix as Ω ORF = 1 σ SO. e random Fourier feature matrix is replaced with orthogonal random feature matrix to generate the orthogonal random basis matrix Φ ORF (X), computed as Φ ORF (X) = [cos(XΩ T ORF ) sin(XΩ T ORF )] ∈ R N ×2m .e R code for ORFs is as follows: 1 omega <− c ( ) 2 w h i l e ( Nrow < N f e a t u r e s ) { e SORF method replaces the random orthogonal matrix, O, by a class of specially structured matrices (consisting of products of binary diagonal matrices and Walsh-Hadamard matrices) that has orthogonality with near-Gaussian entries [60].
e SORF method maintains a lower approximation error the standard RFF, but is signi cantly more computationally e cient than ORF, with computing Φ SORF (X) taking only O(N log(N )) time.

Non-stationary and Arbitrary Kernel Functions
One of the most signi cant limitations to the standard RFF method is the restriction to shi -invariant kernels, where K(x, z) = K(x − z). is restriction means that the kernel value is only dependent on the lag or distance x−z rather than the actual locations.is property imposes stationarity on the spatiotemporal process.While this assumption is not unreasonable, and non-stationarity is o en unidenti able, in some cases the relaxation of stationary can signi cantly improve model performance [62].
To extend the RFF method to non-stationary kernels requires a more general representation of Bochner's theorem capable of capturing the spectral characteristics of both stationary and non-stationary kernels.
is extension [35] states than any kernel (stationary or non-stationary) can be expressed as its Fourier transform in the form of: 2 ) P(ω 1 )P(ω 2 ) dω 1 dω 2 (38) is equation is nearly identical to the original derivation of Bochner's theorem given in Equation 29, but now we have two spectral densities on R D to integrate over.It is easy to how if the two spectral densities are the same, the function equates the de nition for stationary kernels.Applying the same treatment and Monte Carlo integration can be performed to give the feature space of the non-stationary kernel [32], now given by: Φ RFF (x) = cos(ω 1 T x) + cos(ω 2 T x) sin(ω 1 T x) + sin(ω 2 T x) , ω {1,2} m j=1 i.i.d.
∼ P {1,2} (ω) (39) Note that this derivation requires drawing independent samples for both of the spectral densities, P l (ω), such that we generate two frequency matrices, Ω l ∈ R m×d .
In both the stationary and non-stationary case the choice of the kernel is o en arbitrary or made with knowledge of the process being modelled.For example, if the spatial data is expected to be very smooth, then a squared exponential kernel can be used.It is, however, possible to treat ω as unknown kernel parameters variables and infer their values [32]. is is equivalent to deriving an empirical spectral distribution.is strategy is data dependent and can achieve impressive results; however, great care must be taken to avoid over ing [32].

Conclusion
Regression is a key technique for nearly all scienti c disciplines and can be extended from its simplest forms to highly complex and exible models capable of describing nearly any type of data.Within this paper, we have provided an overview of one such tool in random Fourier features.RFF allows for the extension of the linear regression model into a highly expressive non-linear regression model only using some trigonometric functions and can be implemented with only a few lines of code and provides signicant computational bene ts to working with a full kernel.To that end, random Fourier features and their extensions represent an exciting new tool for multi-dimensional spatial analysis on large datasets.
Figure 1 allows us to visualise the consequences of this trade-o (Supplementary Code 1).

Figure 1 :
Figure 1: An example of bias-variance trade-o for a linear/degree-1 (red), degree-2 (green) and degree-5 (blue) polynomial model given the same training data.(A) All three models were ed to the same subset of 20 points derived a simulated epidemic with a latent function f (X) = 1 + X + 0.2X 2 with added Gaussian noise ( ∼ N (µ = 0, σ 2 = 150)).e function is le truncated such that the number of cases is always greater than zero.(B) e mean squared error (MSE) for each model was calculated for the remaining data not used for the ing, so-called testing data.

Figure 3 :
Figure 3: Power spectral densities and the functions produced by sampling from the resulting kernel.espectral densities in A,C,E,G correspond to sampling from delta functions (arrowheads) such that the sampled frequencies, ω, can only take the point values corresponding to each delta function.

1 # d a t a m a t r iCode 1 :
x X o f d i m e n s i o n s ( N x d ) , number o f f e a t u r e s m 2 Omega = m a t r i x ( rnorm (m * n c o l ( X ) ) , m) # s q u a r e d e x p o n e n t i a l k e r n e l 3 P r o j = x % * % t ( Omega ) # p r o j e c t i o n 4 P h i = c b i n d ( c o s ( P r o j ) , s i n ( P r o j ) ) / s q r t (m) # b a s i s 5 K = P h i % * % t ( P h i ) # k e r n e l m a t r i x Example of creating random Fourier features to approximate a Gaussian kernel matrix.

Figure 5 :
Figure 5: An example of the bias-variance trade-o for kernel regression (blue) and kernel ridge regression model (red) with a random Fourier feature approximation of a Gaussian kernel.e mean squared error (MSE) is calculated for both the (A) training data and the testing data (B) for an increasing number of sampled Fourier features, with optimal λ Ridge (by k-fold cross-validation) for the ridge regression model shown (B inset) 1 l i b r a r y ( r a n d t o o l b o x ) # P a c k a g e t o g e n e r a t e H a l t o n s e q u e n c e 2 Omega = m a t r i x ( qnorm ( ( h a l t o n (m, n c o l ( X ) ) ) ) ,m) Code 2: Example of asi-Monte Carlo sampling of a Gaussian power spectral density using a Halton sequence e remaining code for generating the features is exactly the same as Code 1.

3 G 4 G1 5 O 6 S 8 Nrow <− nrow ( omega ) 9 }
<− m a t r i x ( rnorm ( D1 * D1 ) , nrow=D1 , n c o l =D1 ) # Random G a u s s i a n m a t r i x <− m a t r i x ( rnorm ( D1 * D1 ) , nrow=D1 , n c o l =D1 ) <− q r .Q( q r ( G ) , c o m p l e t e =TRUE ) # O r t h o g o n a l M a t r i x by QR d e c o m p o s i t i o n <− d i a g ( s q r t ( rowSums ( G1 ˆ2 ) ) ) # D i a g o n a l M a t r i x 7 omega1 <− r b i n d ( omega , S Q) Code 4: ORF example Yu et al.[60] also included an extension to the ORF method termed structured ORF (SORF) to avoid the computationally expensive steps of deriving the orthogonal matrix (O(N 3 ) time) and computing random basis matrix (O(N 2 ) time).

Table 2 :
Training and testing performance of di erent models e importance of the bias-variance trade-o further illustrated by comparing how the training and testing performance of kernel regression and KRR vary with the number of Fourier features increases with, shown in Figure