Poisson intensity estimation with reproducing kernels

Despite the fundamental nature of the inhomogeneous Poisson process in the theory and application of stochastic processes, and its attractive generalizations (e.g. the Cox process), few tractable nonparametric modeling approaches for intensity functions exist, especially when the observed points lie in a high-dimensional space. In this paper we develop a new, computationally tractable Reproducing Kernel Hilbert Space (RKHS) formulation for the inhomogeneous Poisson process. We model the square root of the intensity as an RKHS function. Whereas RKHS models used in supervised learning rely on the so-called representer theorem, the form of the inhomogeneous Poisson process likelihood means that the representer theorem does not apply. However, we prove that the representer theorem does hold in an appropriately transformed RKHS, guaranteeing that the optimization of the penalized likelihood can be cast as a tractable finite-dimensional problem. The resulting approach is simple to implement, and readily scales to high dimensions and large-scale datasets.


Introduction
Poisson processes are ubiquitous in statistical science, with a long history spanning both theory (e.g. [19]) and applications (e.g. [12]), especially in the spatial statistics and time series literature. Despite their ubiquity, fundamental questions in their application to real datasets remain open. In particular, scalable nonparametric models for intensity functions of inhomogeneous Poisson processes are not well understood, especially in multiple dimensions, since the standard approaches, based on kernel smoothing, are akin to density estimation and hence scale poorly with dimension. In this contribution, we propose a step towards such scalable nonparametric modeling and introduce a new Reproducing Kernel Hilbert Space (RKHS) formulation for inhomogeneous Poisson process modeling, based on the Empirical Risk Minimization (ERM) framework. We model the square root of the intensity as an RKHS function and consider a risk functional given by a penalized version of the inhomogeneous Poisson process likelihood. However, standard representer theorem arguments do not apply directly due to the form of the likelihood. The fundamental difference arises because the observation that no points occur in some region is just as important as the locations of the points that do occur. Thus, the likelihood depends not only on the evaluations of the intensity at the observed points, but also on its integral across the domain of interest. As we will see, this difficulty can be overcome by appropriately adjusting the RKHS under consideration. We prove a version of the representer theorem in this adjusted RKHS, which coincides with the original RKHS as a space of functions but has a different inner product structure. This allows us to cast the estimation problem as an optimization over a finite-dimensional subspace of the adjusted RKHS. The derived method is demonstrated to give better performance than a naïve unadjusted RKHS method, which resorts to an optimization over a subspace without representer theorem guarantees. We describe cases where the adjusted RKHS admits an explicit Mercer expansion, and propose numerical approximations where Mercer expansions are not available. We observe strong performance of the proposed method on a variety of synthetic, environmental, crime and bioinformatics data.

Poisson process
We briefly state relevant definitions for point processes over domains S ⊂ R^D, following [8]. For Lebesgue measurable subsets T ⊂ S, N(T) denotes the number of events in T; N(·) is a stochastic process characterizing the point process. Our focus is on providing a nonparametric estimator for the first-order intensity of a point process, which is defined as

λ(x) = lim_{|dx| → 0} E[N(dx)] / |dx|.   (2.1)

The inhomogeneous Poisson process is driven solely by the intensity function λ(·): the counts N(T) are Poisson distributed with mean ∫_T λ(x)dx, with counts on disjoint subsets independent. In the homogeneous Poisson process, λ(x) = λ is constant, so the number of points in any region T simply depends on the volume of T, which we denote |T|: N(T) ∼ Poisson(λ|T|). For a given intensity function λ(·), the likelihood of a set of N = N(S) points x_1, ..., x_N observed over a domain S is given by

p(x_1, ..., x_N | λ) = Π_{i=1}^N λ(x_i) e^{−∫_S λ(x)dx}.   (2.4)
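As a concrete sketch, an inhomogeneous Poisson process on [0, 1] can be simulated by thinning, and the log of the likelihood in Eq. (2.4) evaluated numerically; the intensity below is a made-up example and the integral is approximated on a grid:

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up intensity on S = [0, 1]; lam_max bounds it, as required for thinning.
lam = lambda x: 50.0 * (1.0 + np.sin(2.0 * np.pi * x))
lam_max = 100.0

# Thinning: simulate a homogeneous process of rate lam_max on [0, 1], then keep
# each proposed point x with probability lam(x) / lam_max.
n_prop = rng.poisson(lam_max)                  # N(S) ~ Poisson(lam_max * |S|)
props = rng.uniform(0.0, 1.0, n_prop)
points = props[rng.uniform(0.0, 1.0, n_prop) < lam(props) / lam_max]

# Log of Eq. (2.4): sum_i log lam(x_i) - int_S lam(x) dx, with the integral
# approximated by a grid average (|S| = 1).
grid = np.linspace(0.0, 1.0, 1001)
integral = np.mean(lam(grid))
loglik = np.sum(np.log(lam(points))) - integral
```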

Reproducing Kernel Hilbert Spaces
Given a non-empty domain S and a positive definite kernel function k : S × S → R, there exists a unique reproducing kernel Hilbert space (RKHS) H_k. An RKHS is a space of functions f : S → R in which evaluation is a continuous functional, meaning that it can be represented by an inner product: f(x) = ⟨f, k(x, ·)⟩_{H_k} for all f ∈ H_k and x ∈ S (this is known as the reproducing property); cf. Berlinet and Thomas-Agnan [5]. While H_k is in most interesting cases an infinite-dimensional space of functions, due to the classical representer theorem [18], [28, Section 4.2], optimization over H_k is typically a tractable finite-dimensional problem.
In particular, if we have a set of N observations x_1, ..., x_N, x_i ∈ S, and consider the problem

min_{f ∈ H_k} R(f(x_1), ..., f(x_N)) + Ω(‖f‖_{H_k}),   (2.5)

where R(f(x_1), ..., f(x_N)) depends on f through its evaluations on the set of observations only, and Ω is a non-decreasing function of the RKHS norm of f, there exists a solution to Eq. (2.5) of the form f*(·) = Σ_{i=1}^N α_i k(x_i, ·), and the optimization can thus be cast in terms of the so-called dual coefficients α ∈ R^N. This formulation is widely used in the framework of regularized Empirical Risk Minimization (ERM) for supervised learning, where R(f(x_1), ..., f(x_N)) = (1/N) Σ_{i=1}^N L(y_i, f(x_i)) is the empirical risk corresponding to a loss function L, e.g. squared loss for regression, logistic or hinge loss for classification.
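For intuition, the representer theorem is what makes kernel ridge regression finite-dimensional: the penalized squared-loss minimizer has dual coefficients α = (K + λI)^{−1} y, a standard textbook result sketched below (the kernel, data and λ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# By the representer theorem, the minimizer of
#   sum_i (f(x_i) - y_i)^2 + lam * ||f||_Hk^2
# is f*(.) = sum_i alpha_i k(x_i, .) with alpha = (K + lam I)^{-1} y.
def se_kernel(p, q, ls=0.2):
    return np.exp(-0.5 * (p[:, None] - q[None, :]) ** 2 / ls ** 2)

x = rng.uniform(0.0, 1.0, 30)
y = np.sin(2.0 * np.pi * x) + 0.1 * rng.normal(size=30)

lam = 1e-3
K = se_kernel(x, x)
alpha = np.linalg.solve(K + lam * np.eye(30), y)   # dual coefficients

f_hat = lambda t: se_kernel(t, x) @ alpha          # f*(t) = sum_j alpha_j k(x_j, t)
```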
If the domain S is compact and the kernel k is continuous, one can assign to k its integral kernel operator T_k : L²(S) → L²(S), given by T_k g = ∫_S k(x, ·)g(x)dx, which is positive, self-adjoint and compact. There thus exists an orthonormal set of eigenfunctions {e_j}_{j=1}^∞ of T_k and corresponding eigenvalues {η_j}_{j=1}^∞, with η_j → 0 as j → ∞. This spectral decomposition of T_k leads to Mercer's representation of the kernel function k [28, Section 2.2]:

k(x, y) = Σ_{j=1}^∞ η_j e_j(x) e_j(y),   (2.6)

with uniform convergence on S × S. Any function f ∈ H_k can then be written as f = Σ_j b_j e_j, where ‖f‖²_{H_k} = Σ_j b_j²/η_j < ∞. Note that above we have focused on the Mercer expansion with respect to the Lebesgue measure, but other base measures are also often considered in the literature, e.g. [27, Section 4.3.1].

Related work
The classic approach to nonparametric intensity estimation is based on smoothing kernels [26,11] and has a form closely related to the kernel density estimator:

λ̂(x) = Σ_{i=1}^N h^{−D} κ((x − x_i)/h),

where κ is a smoothing kernel (related to but distinct from the RKHS kernels described in the previous section), that is, any bounded function integrating to 1, and h is a bandwidth. Early work in this area focused on edge-corrections and methods for choosing the bandwidth [11,6,7]. Connections with RKHS have been considered by, for example, Bartoszynski et al. [4], who use a maximum penalized likelihood approach based on Hilbert spaces to estimate the intensity of a Poisson process.
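A minimal sketch of this classical estimator, with a Gaussian smoothing kernel and no edge correction (all settings illustrative):

```python
import numpy as np

# Kernel intensity estimation: lam_hat(x) = sum_i kappa_h(x - x_i), where
# kappa_h is a Gaussian bump of bandwidth h integrating to 1.  Unlike a
# density estimate there is no division by N, so lam_hat integrates to ~N
# (minus mass lost past the edges, which we do not correct for here).
def kie(x_eval, pts, h=0.05):
    diffs = (x_eval[:, None] - pts[None, :]) / h
    kappa = np.exp(-0.5 * diffs ** 2) / (np.sqrt(2.0 * np.pi) * h)
    return kappa.sum(axis=1)

rng = np.random.default_rng(2)
pts = rng.uniform(0.0, 1.0, 200)          # a realized point pattern on [0, 1]
grid = np.linspace(0.0, 1.0, 2001)
lam_hat = kie(grid, pts)
```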
There is a long literature on maximum penalized likelihood approaches to density estimation, which also contains interesting connections with RKHS, e.g. [29]. Much recent work on estimating intensities for point processes has focused on Bayesian approaches to modeling Cox processes. The log-Gaussian Cox process [23] and related parameterizations of Cox (doubly stochastic) Poisson processes in terms of Gaussian processes have been proposed, along with Monte Carlo [1,12,30], Laplace approximation [16,10,14] and variational [21,20] inference schemes.
Another related body of literature concerns Cox processes with intensities parameterized as the sum of squares of k Gaussian processes, called the permanent process [22]. Interestingly, calculating the density of the permanent process relies on a kernel transformation similar to the one we propose below. Unlike these approaches, however, we are not working in a doubly stochastic (Cox process) framework; rather, we take a penalized maximum likelihood estimation perspective to estimate the intensity of an inhomogeneous Poisson process. As future work, it would be worthwhile to explore deeper connections between our formulation and the permanent process, e.g. by considering an RKHS formulation of Cox processes, or an inhomogeneous Poisson process whose intensity is the sum of squares of functions in an RKHS.

Proposed method and kernel transformation
Let S be a compact domain of observations. Let k : S × S → R be a continuous positive definite kernel, and H_k its corresponding RKHS of functions f : S → R. We model the intensity function λ(·) of an inhomogeneous Poisson process as

λ(x) = a f²(x),   (3.1)

which is parameterized by f ∈ H_k and an additional scale parameter a > 0.
The flexibility of choosing k means that we can encode structural assumptions about our domain, e.g. periodicity in time or periodic boundary conditions (see Section 4.1.1). Note that we have squared f to ensure that the intensity is nonnegative on S, a pragmatic choice that has previously appeared in the literature (e.g. [21]). While we lose identifiability (since f and −f are equivalent), as shown below we end up with a finite-dimensional, and thus tractable, optimization problem.
The rationale for including a is that it allows us to decouple the overall scale and units of the intensity (e.g. number of points per hour versus number of points per year) from the penalty on the complexity of f, which arises from the classical regularized Empirical Risk Minimization framework (and which should depend only on how complex, i.e. how "wiggly", f is).
We use the inhomogeneous Poisson process likelihood from Eq. (2.4) to write the log-likelihood of a Poisson process corresponding to the observations {x_1, ..., x_N}, x_i ∈ S, and intensity λ(·) = a f²(·):

log p({x_i} | λ) = Σ_{i=1}^N log(a f²(x_i)) − a ∫_S f²(x)dx.   (3.2)

We will consider the problem of minimizing the penalized negative log-likelihood, where the regularization term corresponds to the squared Hilbert space norm of f in the parametrization of Eq. (3.1):

min_{f ∈ H_k} J[f] = −Σ_{i=1}^N log(a f²(x_i)) + a ∫_S f²(x)dx + γ ‖f‖²_{H_k}.   (3.3)

This objective is akin to the classical regularized empirical risk minimization framework over an RKHS: there is a term that depends on evaluations of f at the observed points x_1, ..., x_N, as well as a term corresponding to the RKHS norm. However, the representer theorem does not apply directly to Eq. (3.3): since there is also a term given by the L²-norm of f, there is no guarantee that a solution of Eq. (3.3) lies in span{k(x_i, ·)}_{i=1}^N. We will show that Eq. (3.3) fortunately still reduces to a finite-dimensional optimization problem, corresponding to a different kernel function k̃ which we define below.
Using the Mercer expansion of k in Eq. (2.6) and writing f = Σ_j b_j e_j, the objective Eq. (3.3) becomes

J[f] = −Σ_{i=1}^N log(a f²(x_i)) + a Σ_j b_j² + γ Σ_j b_j²/η_j.   (3.4)

The last two terms can now be merged together, giving

a Σ_j b_j² + γ Σ_j b_j²/η_j = Σ_j b_j²/η̃_j,  where η̃_j = η_j/(a η_j + γ).   (3.5)

Now, if we define the kernel k̃ to be the kernel corresponding to the integral operator T_k̃ = T_k (a T_k + γ I)^{−1}, k̃ is given by

k̃(x, y) = Σ_j [η_j/(a η_j + γ)] e_j(x) e_j(y),

and we see that

J[f] = −Σ_{i=1}^N log(a f²(x_i)) + ‖f‖²_{H_k̃}.   (3.6)

Thus, we have merged the two squared norm terms into a squared norm in a new RKHS. We note that a similar idea has previously been used to modify Gaussian process priors in [9], albeit in a different context, and that a similar transformation appears in the expression for the distribution of a permanent process [22]. We are now ready to state the representer theorem in terms of the kernel k̃.
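The merged-norm identity can be checked numerically in the spectral domain; the eigenvalue decay below is made up, and the check verifies coordinate-wise that a‖f‖²_{L²} + γ‖f‖²_{H_k} equals the squared norm under the transformed eigenvalues:

```python
import numpy as np

# If k has Mercer eigenvalues eta_j, the adjusted kernel has eigenvalues
# eta_j / (a * eta_j + gamma).  Spectrum below is a hypothetical decay.
eta = 1.0 / np.arange(1, 51) ** 2
a, gamma = 10.0, 0.5
eta_tilde = eta / (a * eta + gamma)

# Check the identity for arbitrary coefficients b_j of f = sum_j b_j e_j:
b = np.random.default_rng(3).normal(size=50)
lhs = a * np.sum(b ** 2) + gamma * np.sum(b ** 2 / eta)
rhs = np.sum(b ** 2 / eta_tilde)
```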
Theorem 1. There exists a solution of Eq. (3.3) for observations x_1, ..., x_N which takes the form f*(·) = Σ_{j=1}^N α_j k̃(x_j, ·).

Proof. Since 1/η̃_j = a + γ/η_j and the eigenvalues η_j are bounded, for any f = Σ_j b_j e_j we have Σ_j b_j²/η̃_j < ∞ if and only if Σ_j b_j²/η_j < ∞, so the two spaces correspond to exactly the same set of functions. Optimization over H_k is therefore equivalent to optimization over H_k̃.
The proof now follows by applying the classical representer theorem in H_k̃ to the representation of the objective function in Eq. (3.6). We decompose f ∈ H_k̃ as the sum of two functions, f = Σ_j α_j k̃(x_j, ·) + v, where v is orthogonal in H_k̃ to span{k̃(x_j, ·)}_j. We prove that the first term in the objective, −Σ_i log(a f²(x_i)), is independent of v: it depends on f only through the evaluations f(x_i) for all i. Using the reproducing property we have

f(x_i) = ⟨f, k̃(x_i, ·)⟩_{H_k̃} = Σ_j α_j k̃(x_j, x_i) + ⟨v, k̃(x_i, ·)⟩_{H_k̃} = Σ_j α_j k̃(x_j, x_i),

where the last step is by orthogonality. Next we substitute into the regularization term:

‖f‖²_{H_k̃} = ‖Σ_j α_j k̃(x_j, ·)‖²_{H_k̃} + ‖v‖²_{H_k̃}.

Thus, the choice of v has no effect on the first term in J[f], and a non-zero v can only increase the second term ‖f‖²_{H_k̃}, so we conclude that v = 0 and that f*(·) = Σ_j α_j k̃(x_j, ·). Note that the notions of inner product in H_k and H_k̃ are different, and thus in general span{k(x_i, ·)} ≠ span{k̃(x_i, ·)}.
Remark 2. Notice that, unlike in a standard ERM setting, γ = 0 does not recover the unpenalized risk, because γ appears in k̃. Notice further that the overall scale parameter a also appears in k̃. This is important in practice, because it allows us to decouple the scale of the intensity (which is controlled by a) from its complexity (which is controlled by γ).
Illustration. The eigenspectrum of k̃, where k is a squared exponential kernel, is shown in Figure 1 for various settings of a and γ. Reminiscent of the spectral filtering studied by Muandet, Sriperumbudur and Schölkopf [24], the top plot shows that, depending on the settings of a and γ, the eigenvalues of k̃ are shrunk or inflated compared to those of k, shown in black. In the bottom plot, the values of k̃(0, x) are shown for the same set of kernels.

Computation of k̃
In this section, we first consider the case in which an explicit Mercer expansion is known, and then the more commonly encountered situation in which we only have access to the parametric form of the kernel k(x, x′), so that k̃ must be approximated. We show experimentally that our approximation is very accurate by considering the Sobolev kernel, which can be expressed in both ways.

Explicit Mercer Expansion
We start by assuming that we have a kernel k with an explicit Mercer expansion with respect to a base measure of interest (usually the Lebesgue measure on S), with eigenfunctions {e_j(x)}_{j∈J} and eigenvalues {η_j}_{j∈J}:

k(x, y) = Σ_{j∈J} η_j e_j(x) e_j(y),

for an at most countable index set J. Given a and γ we can then calculate

k̃(x, y) = Σ_{j∈J} [η_j/(a η_j + γ)] e_j(x) e_j(y)

up to a desired precision, as informed by the spectral decay of {η_j}_{j∈J}. Below we consider kernels for which explicit Mercer expansions are known: a kernel on the Sobolev space on [0, 1] with a periodic boundary condition, the squared exponential kernel, and the Brownian bridge kernel. We also show how our formulation can be extended to multiple dimensions using a tensor product formulation. Although not practical for large datasets, the Mercer expansions given below, summing terms up to j = 50 (for which the error is less than 10^−5), can be used to evaluate the approximations proposed for the cases in which Mercer expansions are not available.

Sobolev space on [0, 1] with a periodic boundary condition
We consider a kernel on the Sobolev space on [0, 1] with a periodic boundary condition, proposed by Wahba [31, Chapter 2] and recently used by Bach [2]. The kernel is given by

k(x, y) = 1 + (−1)^{s−1} (2π)^{2s}/(2s)! · B_{2s}({x − y}),

where s = 1, 2, ... denotes the order of the Sobolev space and B_{2s}({x − y}) is the Bernoulli polynomial of degree 2s applied to the fractional part of x − y. The corresponding RKHS is the space of functions on [0, 1] with absolutely continuous f, f′, ..., f^{(s−1)} and square integrable f^{(s)}, satisfying the periodic boundary conditions f^{(l)}(0) = f^{(l)}(1), l = 0, ..., s − 1. For more details, see [31, Chapter 2].
Bernoulli polynomials admit a simple form for low degrees; in particular, B_2(x) = x² − x + 1/6. Moreover, note the Fourier expansion

(−1)^{s−1} (2π)^{2s}/(2s)! · B_{2s}({x − y}) = Σ_{m=1}^∞ (2/m^{2s}) cos(2πm(x − y)).

Thus, the desired Mercer expansion (with respect to the Lebesgue measure) is given by k(x, y) = Σ_{m∈Z} η_m e_m(x) e_m(y), with eigenfunctions e_0(x) = 1 and, for m ≠ 0, e_m(x) = √2 cos(2πmx) for m > 0 and e_m(x) = √2 sin(2π|m|x) for m < 0, with eigenvalues η_0 = 1 and η_m = |m|^{−2s}.
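The expansion can be sanity-checked numerically for s = 1, comparing the closed form 1 + 2π² B_2({x − y}) against a truncated eigenexpansion (a sketch, not part of the method itself):

```python
import numpy as np

# For s = 1, the closed form k(x, y) = 1 + 2 pi^2 B_2({x - y}) should match
# the truncated expansion 1 + sum_m 2 cos(2 pi m (x - y)) / m^2.
B2 = lambda t: t ** 2 - t + 1.0 / 6.0      # Bernoulli polynomial of degree 2

def k_closed(x, y):
    return 1.0 + 2.0 * np.pi ** 2 * B2((x - y) % 1.0)

def k_mercer(x, y, n_terms=2000):
    m = np.arange(1, n_terms + 1)
    return 1.0 + np.sum(2.0 * np.cos(2.0 * np.pi * m * (x - y)) / m ** 2)

gap = abs(k_closed(0.37, 0.81) - k_mercer(0.37, 0.81))
```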

Squared exponential kernel
A Mercer expansion for the squared exponential kernel was proposed in [35] and refined in [13]. However, this expansion is with respect to a Gaussian measure on R, i.e., it consists of eigenfunctions which form an orthonormal set in L²(R, ν) for a Gaussian measure ν. The formalism can therefore be used to estimate Poisson intensity functions with respect to such a Gaussian measure. In the classical framework, where the intensity is with respect to the Lebesgue measure, the numerical approximations of the Mercer expansion described in Section 4.2 are needed. Following the exposition in [27, Section 4.3.1] and the relevant errata, we parameterize the kernel as

k(x, x′) = exp(−(x − x′)²/(2ℓ²)).   (4.3)

The Mercer expansion with respect to ν then has eigenvalues η_j ∝ B^j, decaying geometrically with a constant B = b/A ∈ (0, 1) determined by the lengthscale ℓ and the variance of ν, and eigenfunctions given by suitably scaled Hermite polynomials (see [27, Section 4.3.1] for the explicit constants). Thus we have the eigenvalues η_j/(a η_j + γ) for k̃, while the eigenfunctions remain the same.

Brownian bridge kernel

The Brownian bridge kernel on [0, 1] is given by k(x, y) = min(x, y) − xy. Its Mercer expansion with respect to the Lebesgue measure has eigenfunctions e_j(x) = √2 sin(πjx) and eigenvalues η_j = (πj)^{−2}, j = 1, 2, .... The functions in the corresponding RKHS are pinned to zero at both ends of the segment.

Extending the Mercer expansion to multiple dimensions
The extension of any kernel to higher dimensions can be constructed by considering tensor product spaces H_{k_1 ⊗ k_2} (where k_1 and k_2 could potentially be different kernels with different hyperparameters). If k_1 has eigenvalues η_i and eigenfunctions e_i, and k_2 has eigenvalues δ_j and eigenfunctions f_j, then the eigenvalues of the product space are given by the products η_i δ_j over all pairs i, j, and similarly the eigenfunctions are given by e_i(x) f_j(y). Our regularized kernel then has the Mercer expansion

k̃((x, y), (x′, y′)) = Σ_{i,j} [η_i δ_j/(a η_i δ_j + γ)] e_i(x) f_j(y) e_i(x′) f_j(y′).

Notice that this approach does not lead to a method that scales well in high dimensions, which is further motivation for the approximations developed below.
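A small sketch of the tensor-product spectrum and its transformation (both one-dimensional spectra below are hypothetical):

```python
import numpy as np

# Eigenvalues of k1 (x) k2 are all products eta_i * delta_j, and the adjusted
# kernel's eigenvalues follow the same transformation as in one dimension.
eta = 1.0 / np.arange(1, 11) ** 2        # hypothetical spectrum of k1
delta = 1.0 / np.arange(1, 11) ** 2      # hypothetical spectrum of k2
a, gamma = 10.0, 0.5

prod = np.outer(eta, delta).ravel()      # eigenvalues of the product space
prod_tilde = prod / (a * prod + gamma)   # eigenvalues of the adjusted kernel
```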

Numerical approximation when Mercer expansions are not available
We propose an approximation to k̃ given access only to a kernel k for which we do not have an explicit Mercer expansion with respect to the Lebesgue measure. We only assume that we can form Gram matrices corresponding to k and calculate their eigenvectors and eigenvalues. As a side benefit, this representation will also enable scalable computations through Toeplitz / Kronecker algebra [10,15,14] or primal reduced-rank approximations [33].
Let us first consider the one-dimensional case and construct a uniform grid u = (u_1, ..., u_m) on [0, 1]. The integral kernel operator T_k can then be approximated by the scaled kernel matrix (1/m)K_uu : R^m → R^m, where [K_uu]_ij = k(u_i, u_j), and thus K̃_uu is approximately K_uu ((a/m) K_uu + γI)^{−1}. Note that for the general case of a multidimensional domain S, the kernel matrix would have to be multiplied by vol(S); without loss of generality we assume vol(S) = 1 below.
We are not primarily interested in evaluations of k̃ on this grid, but at the observations x_1, ..., x_N. Simply adding the observations into the kernel matrix is not an option, however, as this would change the base measure with respect to which the integral kernel operator is computed (the Lebesgue measure on S). Thus, we consider the relationship between the eigendecomposition of K_uu and the eigenvalues and eigenfunctions of the integral kernel operator T_k.
Let λ_i^u, e_i^u be the eigenvalue/eigenvector pairs of the matrix K_uu, i.e., its eigendecomposition is given by K_uu = Σ_i λ_i^u e_i^u (e_i^u)^T. Then estimates of the eigenvalues/eigenfunctions of the integral operator T_k are given by the Nyström method (see Rasmussen and Williams [27, Section 4.3] and references therein, especially Baker [3]):

η̂_i = λ_i^u/m,   ê_i(x) = (√m/λ_i^u) K_xu e_i^u,

with K_xu = [k(x, u_1), ..., k(x, u_m)], leading to:

k̃(x, x′) ≈ Σ_i [η̂_i/(a η̂_i + γ)] ê_i(x) ê_i(x′).

For an estimate of the whole matrix K̃_xx we thus have

K̃_xx ≈ K_xu ((a/m) K_uu² + γ K_uu)^{−1} K_ux.   (4.11)

The above is reminiscent of the Nyström method [33] proposed for speeding up Gaussian process regression. It has computational cost O(m³ + N²m). A reduced-rank representation for Eq. (4.11) is straightforward, by considering only the top p eigenvalues/eigenvectors of K_uu. Furthermore, a primal representation with features corresponding to the kernel k̃ is readily available, given by φ_i(x) = (η̂_i/(a η̂_i + γ))^{1/2} ê_i(x), which allows computational cost linear in the number N of observations. For D > 1 dimensions, one can exploit Kronecker and Toeplitz algebra approaches. Assuming that the K_uu matrix corresponds to a Cartesian product structure of one-dimensional grids of size m, one can write K_uu = K_1 ⊗ ⋯ ⊗ K_D. Thus, the eigenspectrum can be efficiently calculated by eigendecomposing each of the smaller m × m matrices K_1, ..., K_D and then applying standard Kronecker algebra, thereby avoiding ever having to form the prohibitively large m^D × m^D matrix K_uu. For regular grids and stationary kernels, each small matrix will be Toeplitz structured, yielding further efficiency gains [34]. The resulting approach thus scales linearly in the dimension D.
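The grid and Nyström formulas above can be sketched as follows; the squared exponential kernel and all settings are illustrative, and small eigenvalues are truncated for numerical stability (the reduced-rank variant mentioned above):

```python
import numpy as np

rng = np.random.default_rng(4)

# Grid approximation of the adjusted kernel on [0, 1]:
#   K~_uu ~= K_uu ((a/m) K_uu + gamma I)^{-1},
# and Nystrom extension to arbitrary points x via the eigenpairs of K_uu:
#   K~_xx ~= (K_xu V) diag(1 / (lam_i ((a/m) lam_i + gamma))) (K_xu V)'.
def k(p, q, ls=0.1):
    return np.exp(-0.5 * (p[:, None] - q[None, :]) ** 2 / ls ** 2)

m, a, gamma = 100, 10.0, 0.5
u = np.linspace(0.0, 1.0, m)
x = rng.uniform(0.0, 1.0, 50)

K_uu, K_xu = k(u, u), k(x, u)
Kt_uu = K_uu @ np.linalg.inv((a / m) * K_uu + gamma * np.eye(m))

evals, evecs = np.linalg.eigh(K_uu)
keep = evals > 1e-8                       # drop numerically null modes
lam, V = evals[keep], evecs[:, keep]
coef = 1.0 / (lam * ((a / m) * lam + gamma))
Kt_xx = ((K_xu @ V) * coef) @ (K_xu @ V).T
```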
An even simpler alternative to the above is to sample the points u_1, ..., u_m uniformly from the domain S using Monte Carlo or Quasi-Monte Carlo (see [25] for a discussion in the context of RKHS). We found this approach to work well in practice in high dimensions (D = 15), even when m was fixed, meaning that the scaling was effectively independent of the dimension D.
Using the Sobolev kernel from Section 4.1.1, we compared the exact calculation of K̃_uu with s = 1, a = 10, and γ = .5 to our approximate calculation, for a coarse grid of size 10 on the unit interval and a finer grid of size 100. The RMSE was 2E-3 for the coarse grid and 1.6E-5 for the fine grid, as shown in Fig. 2. In the same figure we compare the exact calculation of K̃_xx with s = 1, a = 10, and γ = .5 to our Nyström-based approximation, where x_1, ..., x_400 ∼ Beta(.5, .5); the RMSE was 0.98E-3. A low-rank approximation using only the top 5 eigenvalues gives an RMSE of 1.6E-2.
As Figure 2 demonstrates, a good approximation is possible with a fairly coarse grid u = (u_1, ..., u_m), as well as with a low-rank approximation.

Inference
The penalized risk can be readily minimized with gradient descent. Let α = [α_1, ..., α_N]^T and K̃ be the Gram matrix corresponding to k̃, with [K̃]_ij = k̃(x_i, x_j). Then [f(x_1), ..., f(x_N)]^T = K̃α, and the gradient of the objective function J from (3.6) is given by

∇_α J = −2 K̃ (1 ./ (K̃α)) + 2 K̃α,

where ./ denotes element-wise division. Computing K̃ requires O(N²) time and memory, and each gradient and likelihood computation requires matrix-vector multiplications which are also O(N²). Overall, the running time is O(qN²) for q iterations of the gradient descent method, where q is usually very small in practice.
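A minimal sketch of this inference step, with an arbitrary positive definite matrix standing in for the Gram matrix of k̃, and L-BFGS (as used in the experiments) in place of plain gradient descent:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)

# Minimize J(alpha) = -sum_i log(a * (K~ alpha)_i^2) + alpha' K~ alpha,
# with gradient  -2 K~ (1 ./ (K~ alpha)) + 2 K~ alpha.  The matrix Kt below
# is a placeholder; in the method it would be computed as in Section 4.
N, a = 40, 10.0
A = rng.normal(size=(N, N))
Kt = A @ A.T / N + np.eye(N)              # placeholder positive definite "K~"

def objective(alpha):
    f = Kt @ alpha
    return -np.sum(np.log(a * f ** 2)) + alpha @ Kt @ alpha

def grad(alpha):
    f = Kt @ alpha
    return -2.0 * (Kt @ (1.0 / f)) + 2.0 * (Kt @ alpha)

res = minimize(objective, np.ones(N), jac=grad, method="L-BFGS-B")
```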

Hyperparameter selection
Analogously to the classical problem of bandwidth selection in kernel intensity estimation (e.g. [11,6,7]), some criteria must be adopted in order to select the hyperparameters of the kernel k̃, and also γ and a. We suggest cross-validating on the negative log-likelihood of the inhomogeneous Poisson process (i.e. before introducing the penalty term) from Eq. (3.2). The difficulty with this approach is that we must deal with the integral ∫_S f²(u)du of the intensity over the domain, which, for our model f(·) = Σ_{j=1}^N α_j k̃(x_j, ·), is generally intractable. As an approximation, we suggest either grid or Monte Carlo integration. Recall that in Section 4.2 we approximated k̃ using a set of locations u = (u_1, ..., u_m). We can reuse these points to approximate the integral:

∫_S f²(u)du ≈ (1/m) Σ_{i=1}^m f²(u_i).

As f(u_i) = K̃_{u_i x} α, this approximation is given by (1/m) αᵀ K̃_xu K̃_ux α.
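The grid approximation of the integral can be sketched as follows; the kernel is an arbitrary stand-in for k̃, and the coefficients are illustrative rather than fitted:

```python
import numpy as np

rng = np.random.default_rng(6)

# int_S f^2(u) du ~= (1/m) sum_i f(u_i)^2 = (1/m) alpha' K~_xu K~_ux alpha,
# reusing a grid u_1, ..., u_m covering S = [0, 1].
kt = lambda p, q: np.exp(-np.abs(p[:, None] - q[None, :]) / 0.2)  # stand-in

x = rng.uniform(0.0, 1.0, 25)        # "observed" points
alpha = rng.normal(size=25)          # illustrative dual coefficients

m = 2000
u = np.linspace(0.0, 1.0, m)         # integration grid
Kt_ux = kt(u, x)                     # m x N
integral_grid = np.mean((Kt_ux @ alpha) ** 2)

# Reference: midpoint rule on an offset grid of the same size.
u_mid = (np.arange(m) + 0.5) / m
integral_mid = np.mean((kt(u_mid, x) @ alpha) ** 2)
```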

Naïve RKHS model
In this section, we compare the proposed approach, which uses the representer theorem with the transformed kernel k̃, to the naïve one, where a solution to Eq. (3.3) of the form f(·) = Σ_{j=1}^N α_j k(x_j, ·) is sought even though the representer theorem in k need not hold. Despite being theoretically suboptimal, this is a natural model to consider, and it might perform well in practice.
The corresponding optimization problem is:

min_α −Σ_{i=1}^N log(a f²(x_i)) + a ∫_S f²(x)dx + γ ‖f‖²_{H_k},  with f(·) = Σ_j α_j k(x_j, ·).

While the first and the last terms are straightforward to calculate for any f(·) = Σ_j α_j k(x_j, ·), the integral ∫_S f²(x)dx needs to be estimated. As in the previous section, we consider a uniform grid or a set of sampled points u = (u_1, ..., u_m) covering the domain, and use the approximation ∫_S f²(x)dx ≈ (1/m) Σ_{i=1}^m f²(u_i). The optimization problem thus reads:

min_α −Σ_{i=1}^N log(a (Kα)_i²) + (a/m) αᵀ K_xu K_ux α + γ αᵀ K α.

As above, the gradient of this objective with respect to α can be readily calculated, and optimized with gradient descent.
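A sketch of the naïve objective (kernel and settings illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)

# Naive objective in the ORIGINAL kernel k, with the L2 term on a grid:
#   J(alpha) = -sum_i log(a (K alpha)_i^2)
#              + (a/m) alpha' K_xu K_ux alpha + gamma alpha' K alpha.
def k(p, q, ls=0.1):
    return np.exp(-0.5 * (p[:, None] - q[None, :]) ** 2 / ls ** 2)

N, m, a, gamma = 30, 200, 10.0, 0.5
x = rng.uniform(0.0, 1.0, N)
u = np.linspace(0.0, 1.0, m)
K, K_xu = k(x, x), k(x, u)           # K is N x N, K_xu is N x m

def J_naive(alpha):
    f_x = K @ alpha
    l2 = (a / m) * alpha @ (K_xu @ K_xu.T) @ alpha
    return -np.sum(np.log(a * f_x ** 2)) + l2 + gamma * alpha @ K @ alpha
```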

Experiments
We use cross-validation to choose the hyperparameters in our methods: the intensity scale a, the roughness penalty γ, and the length-scale of the kernel k, minimizing the negative log-likelihood as described in Section 5.1.
To calculate RMSE, we either make predictions at a grid of locations and compute the RMSE against the true intensity on that grid, or, for the high-dimensional synthetic examples, pick a new uniform sample of locations over the domain and calculate the RMSE at these locations. We used limited-memory BFGS in all experiments involving optimization, and found that it converged very quickly and was not sensitive to initial values. Code for our experiments is available at https://github.com/BigBayes/kernelpoisson.

1-d synthetic example
We generated a synthetic intensity using the Mercer expansion of an SE kernel with lengthscale 0.5, producing a random linear combination of 64 basis functions, weighted with iid draws α ∼ N(0, 1). In Fig. 3 we compare ground truth to estimates made with: our RKHS method with SE kernel, the naïve RKHS approach with SE kernel, and classical kernel intensity estimation with bandwidth selected by cross-validation. The results are typical of what we observed on 1-d and 2-d examples: given similar kernel choices, each method performed similarly, and numerically there was not a significant difference in terms of the RMSE compared to the true underlying intensity.

Fig 3: Comparison of our RKHS method, the naïve model, and kernel smoothing to a synthetic intensity "true". The rug plot at bottom gives the locations of points in the realized point pattern. The RMSE for each method was similar.

Environmental datasets
Next we demonstrate our method on a collection of two-dimensional environmental datasets giving the locations of trees, obtained from the R package spatstat. Intensity estimation is a standard first step in both exploratory analysis and modelling of these types of datasets. We calculated the intensity using various approaches: our proposed RKHS method with k̃ based on a squared exponential kernel, the naïve RKHS method with a squared exponential kernel, and classical kernel intensity estimation (KIE) with edge correction; each method thus used a squared exponential kernel. We report average held-out cross-validated likelihoods in Table 1. With the exception of our method performing better on the Red oak dataset, the methods had comparable performance. It is interesting to note, however, that our method does not require any explicit edge correction, because we are optimizing a likelihood which explicitly takes the observation window into account. A plot of the resulting intensity surfaces for each method, and the effect of edge correction, are shown in Fig. 4 for the Black oak dataset.

High dimensional synthetic examples
We generated random intensity surfaces in the unit hypercube for dimensions D = 2, . . ., 15.The intensity was given by a constant multiplied by the square of the sum of 20 multivariate Gaussian pdfs with random means and covariances.
The constant was automatically adjusted so that the number of points in the realizations would be held close to constant, in the range 190-210. We expected this to be a relatively simple synthetic example for kernel intensity estimation with a Gaussian kernel in low dimensions, but not in high dimensions. From each random intensity, we generated two random realizations, and trained our model using 2-fold cross-validation with these two datasets. We predicted the intensity at a randomly chosen set of points and calculated the mean squared error compared to the true intensity. For each dimension we repeated this process 100 times, comparing kernel intensity estimation, the naïve approach, and our approach with k̃. Using the same procedure, but with a sum of 20 multivariate Student-t distributions with 5 degrees of freedom, random means and covariances, and a number of points per realization ranging from 10 to 1000, we generated 500 random surfaces with dimension D = 2, ..., 25. We expected this to be a difficult synthetic example for all of the methods due to the potential for model misspecification: we continue to use squared exponential kernels, while the intensity surface is non-Gaussian.
As shown in Fig. 7(a), once we reach dimension 9 and above, our RKHS method with k̃ begins to outperform kernel intensity estimation, where performance is measured as the fraction of times that the MSE is smaller across the 100 random datasets for each D. Our method also significantly outperforms the naïve RKHS method, as shown in Fig. 7(b); for D = 15 the difference between the two RKHS methods is not significant. This could be due to the fact that the number of points in the point pattern remains fixed, so the problem becomes very hard in high dimensions. As shown in Fig. 7(c), kernel intensity estimation is almost always better than the naïve RKHS approach, although the difference is not significant in high dimensions.
For the Student-t experiment, as shown in Fig. 7(d)-(f), our RKHS method always outperforms kernel intensity estimation and is better than the naïve method in dimensions below D = 20. To assess the amount of improvement, rather than just its statistical significance, we compared the percent improvement in terms of MSE gained by our method versus the competitors, focusing on D = 10 in Fig. 8. On this metric (intuitively, "how much do you expect to improve on average") our method shows reasonably stable results compared to KIE, while the performance of the naïve method is revealed to be very variable. Indeed, the standard deviation of the MSE across the random surfaces for D = 10 was 56 for both our method and KIE, but 166 for the naïve method, perhaps due to overfitting.

Computational complexity
Using the synthetic data experimental setup, we evaluated the time complexity of our method with respect to the dimensionality d, the number of points n in the point pattern dataset, and the number of points s used to estimate k̃, confirming our theoretical analysis. The effect of the dimensionality d was negligible in practice, because the main calculations rely only on an n × n Gram matrix, whose calculation is relatively fast even in high dimensions. Our method's time complexity scales as O(n²), as shown in Fig. 5, where we used s = 200 sample points to estimate k̃ (though, as discussed in Section 4.2, a primal representation is available which would give linear scaling). While a small s worked well in practice, we also investigated much larger values of s: as shown in Fig. 6, the time complexity scaled as O(s²), with the number of points fixed at n = 150 and the rank of the eigendecomposition fixed at 20.

Spatiotemporal point pattern of crimes
To demonstrate the ability to use domain-specific kernels and learn interpretable hyperparameters, we used 12 weeks (84 days) of geocoded, date-stamped reports of theft obtained from Chicago's data portal (data.cityofchicago.org), starting January 1, 2004: a relatively large spatiotemporal point pattern consisting of 18,441 events. We used the kernel

k = exp(−s²/(2λ_s²)) (exp(−2 sin²(πpt)) + 1) exp(−t²/(2λ_t²)),

where s and t denote spatial and temporal distances; this is the product of a separable squared exponential space and decaying periodic time kernel (with frequency p, in a time domain normalized to range from 0 to 1), plus a separable squared exponential space and time kernel. After finding reasonable values for the lengthscales and other hyperparameters of k through exploratory data analysis, we used 2-fold cross-validation and calculated average test log-likelihoods for the number of total cycles p in the 84 days, p = 1, 2, ..., 14, or equivalently periods of length 12 weeks (meaning no cycle), 6 weeks, ..., 6 days. These log-likelihoods are shown in Fig. 9; we found that the most likely frequency is 12, or equivalently a period lasting 1 week. This makes sense given known day-of-week effects on crime.
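Written as code, the composite kernel takes the following form; the hyperparameter values are illustrative placeholders, not those fitted to the Chicago data:

```python
import numpy as np

# Composite spatiotemporal kernel: SE in space times
# (decaying-periodic + constant) in time, where s and t are spatial and
# temporal distances.  lam_s, lam_t and p below are illustrative.
def k_st(s, t, lam_s=1.0, lam_t=0.5, p=12.0):
    space = np.exp(-0.5 * s ** 2 / lam_s ** 2)
    periodic = np.exp(-2.0 * np.sin(np.pi * p * t) ** 2)
    decay = np.exp(-0.5 * t ** 2 / lam_t ** 2)
    # Expanding the product: SE(space) * periodic(t) * SE(t)  +  SE(space) * SE(t)
    return space * (periodic + 1.0) * decay
```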

Conclusion
We presented a novel approach to inhomogeneous Poisson process intensity estimation using a representer theorem formulation in an appropriately transformed RKHS, providing a scalable approach with strong performance on synthetic and real-world datasets. Our approach outperformed the classical baseline of kernel intensity estimation, and a naïve approach for which the representer theorem guarantees do not hold. In future work, we will consider marked Poisson processes and other more complex point process models, as well as Bayesian extensions akin to Cox process modeling.

Fig 7: In (a) and (d): our RKHS method versus kernel intensity estimation (KIE). In (b) and (e): our RKHS method versus the naïve RKHS method. In (c) and (f): comparison of KIE and the naïve RKHS approach. We used squared exponential kernels for all methods. In the Gaussian case (a)-(c), our method significantly outperforms kernel intensity estimation as the dimension increases, and outperforms the naïve method throughout; kernel intensity estimation almost always outperforms the naïve approach. In the Student-t case (d)-(f), our method always outperforms kernel intensity estimation, and outperforms the naïve approach until very high dimensions; neither kernel intensity estimation nor the naïve approach is consistently better than the other.

Fig 8: In the setting where we generated random surfaces by squaring sums of multivariate Student-t densities, we considered dimension D = 10, in which our RKHS method was better than both the naïve method and kernel intensity estimation (KIE), while there was not a significant difference between KIE and the naïve method, and calculated the percent improvement in terms of MSE comparing our RKHS method to the naïve method (left) and our RKHS method to KIE (right). The improvement of our method over KIE is apparent, albeit perhaps only modest in this example; meanwhile, our method is sometimes quite a bit better than the naïve method, which is often very inaccurate.

Fig 9: Log-likelihood for various frequencies of a periodic spatiotemporal kernel on a dataset of 18,441 geocoded, date-stamped theft events from Chicago, using our RKHS model. The dataset covers 12 weeks starting January 1, 2004, and the maximum log-likelihood is attained when the frequency is 12, meaning that there is a weekly cycle in the data. Results using the naïve model were less sensible, with a maximum at period 1 (indicating no periodicity), and with periods 5 and 2 (corresponding to a 16.8 day cycle and a 42 day cycle) also having high likelihoods.

Fig 1 :
Fig 1: Eigenspectrum of k (top) and values of k (bottom) for various settings of a and γ.

Fig 2 :
Fig 2: Using the Sobolev kernel in Sec. 4.1.1, we compared the exact calculation of Kuu with s = 1, a = 10, and γ = .5 to our approximate calculation. For illustration we tried grids on the unit interval from a coarse grid of size 10 (top left) to a finer grid of size 100 (top right). The RMSE was 2E-3 for the coarse grid and 1.6E-5 for the fine grid. We also compare the exact calculation of Kxx with s = 1, a = 10, and γ = .5 to our Nyström-based approximation, where x_1, . . ., x_400 are drawn from a Beta(.5, .5) distribution (bottom left). The RMSE was 0.98E-3. A low-rank approximation using only the top 5 eigenvalues gives an RMSE of 1.6E-2 (bottom right).

Fig 3 :
Fig 3: A synthetic dataset, comparing (a) KIE with edge correction, (b) KIE without edge correction, (c) our RKHS method with k, and (d) the naïve RKHS method.

Fig 4 :
Fig 4: Location of white oak trees in Lansing, Michigan, smoothed with various approaches. Squared exponential kernels are used throughout. Edge correction makes a noticeable difference for classical kernel intensity estimation. Comparing (a) and (c), it is clear that our method automatically performs edge correction.
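The edge correction discussed for the classical baseline can be sketched in one dimension: the kernel intensity estimate is divided by the kernel mass that falls inside the observation window at each evaluation point. This is an illustrative sketch only; `kie_1d`, `h`, and `domain` are our names and assumptions, not the paper's implementation, and the actual experiments are spatial.

```python
import numpy as np
from math import erf, sqrt

def kie_1d(x_eval, points, h, domain=(0.0, 1.0), edge_correct=True):
    """Classical kernel intensity estimation (KIE) with a Gaussian kernel
    of bandwidth h on an interval, with optional edge correction."""
    a, b = domain
    dens = np.zeros_like(x_eval, dtype=float)
    for xi in points:
        dens += np.exp(-0.5 * (x_eval - xi) ** 2 / h ** 2) / (h * sqrt(2 * np.pi))
    if edge_correct:
        # fraction of each evaluation point's kernel mass inside [a, b];
        # dividing by it counteracts the downward bias near the boundary
        mass = np.array([0.5 * (erf((b - x) / (h * sqrt(2)))
                                - erf((a - x) / (h * sqrt(2)))) for x in x_eval])
        dens = dens / mass
    return dens
```

Near the boundary the correction inflates the estimate, since part of each kernel's mass would otherwise be lost outside the window; the RKHS method needs no such explicit adjustment.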

Fig 5 :
Fig 5: Run-time of our method versus number of points in the point pattern dataset.

Fig 6 :
Fig 6: Run-time of our method versus number of sample points used to calculate k.

Fig 7 :
Fig 7: Three methods were compared: our RKHS method, the naïve RKHS method, and kernel intensity estimation, based on 100 random surfaces for each dimension D in two experimental setups. In (a)-(c), the intensity surface was the squared sum of skewed multivariate Gaussians. In (d)-(f), the surface was a mixture of skewed multivariate Student-t distributions with 5 degrees of freedom. In (a) and (d): our RKHS method versus KIE. In (b) and (e): our RKHS method versus the naïve RKHS method. In (c) and (f): KIE versus the naïve RKHS approach. We used squared exponential kernels for all methods. In the Gaussian case (a)-(c), our method significantly outperforms kernel intensity estimation as the dimension increases, and outperforms the naïve method throughout; kernel intensity estimation almost always outperforms the naïve approach. In the Student-t case (d)-(f), our method always outperforms kernel intensity estimation, and outperforms the naïve approach until very high dimensions; neither kernel intensity estimation nor the naïve approach is consistently better than the other.

Fig 8 :
Fig 8: To understand the practical (as opposed to statistical) significance of the results in Fig. 7(d)-(f), where random surfaces were generated by squaring the sums of multivariate Student-t distributions, we considered dimension D = 10, in which our RKHS method was better than both the naïve method and kernel intensity estimation (KIE), but there was no significant difference between KIE and the naïve method. We calculated the percent improvement in MSE of our RKHS method over the naïve method (left) and over KIE (right). The improvement of our method over KIE is apparent, albeit perhaps only modest in this example. Meanwhile, our method is sometimes considerably better than the naïve method, which is often very inaccurate.

Fig 9 :
Fig 9: Log-likelihood for various frequencies of a periodic spatiotemporal kernel in a dataset of 18,441 geocoded, date-stamped theft events from Chicago, using our RKHS model. The dataset covers 12 weeks starting January 1, 2004, and the maximum log-likelihood is attained when the frequency is 12, meaning that there is a weekly cycle in the data. Results using the naïve model were less sensible, with a maximum at period 1 (indicating no periodicity), and with periods 5 and 2 (corresponding to a 16.8-day cycle and a 42-day cycle) also having high likelihoods.

Table 1
Tree Point Patterns from R Package spatstat