Kernel Density Estimation with Linked Boundary Conditions

Kernel density estimation on a finite interval poses an outstanding challenge because of the well-recognized boundary bias issues at the end-points. Motivated by an application of density estimation in biology, we consider a new type of boundary constraint, in which the values of the estimator at the two boundary points are linked. We provide a kernel density estimator that successfully incorporates this linked boundary condition. This is studied via a nonsymmetric heat kernel which generates a series expansion in nonseparable generalised eigenfunctions of the spatial derivative operator. Despite the increased technical challenges, the model is proven to outperform the more familiar Gaussian kernel estimator, yet it inherits many desirable analytical properties of the latter KDE model. We apply this to our motivating example in biology, as well as providing numerical experiments with synthetic data. The method is fast and easy to use and we also compare against existing methods, which do not incorporate this constraint and are therefore inaccurate near the boundary.


Introduction
Suppose we are given an independent and identically distributed statistical sample X 1 , . . . , X n from some unknown continuous density f X (·). 1 Estimating the density f X is one of the most common methods for discovering patterns in statistical data [44,5,43].
When the support of f X is the whole real line, a simple and popular non-parametric method for estimating f X is the kernel density estimator with a Gaussian kernel ϕ(x) = exp(−x 2 /2)/ √ 2π. Here √ t is the so-called bandwidth parameter that controls the smoothness of the estimator (see, for example, [43,44] and references therein).
It is well-known that f (x; t) is not an appropriate kernel estimator when f X has compact support [18], which (without loss of generality) we may assume to be the unit interval [0, 1]. The main reason for this is that f (x; t) exhibits large boundary bias at the end-points of the interval. For example, no matter how small the bandwidth parameter, f (x; t) will have non-zero probability mass outside the interval [0, 1]. Various solutions have been offered to cope with this boundary bias issue, which we may classify into three main types: (a) Using special (non-Gaussian) kernels with support on [0, 1] or on [0, ∞), as in [4,28,39]; (b) Adding bias-correction terms to f (x; t) as in [8,23]; (c) Employing domain transformations [18,30], which work by mapping the data to (−∞, ∞), constructing a kernel density estimator on the whole real line, and finally mapping the estimate back to [0, 1].
Sometimes, we not only know that f X has support on [0, 1], but also we have extra information about the values of f X at the boundaries. One example of this situation is what we will refer to as a linked boundary condition, where we know a-priori that f X (0) = rf X (1) for some known given parameter r > 0.
This situation arises in the field of Systems Biology [25], where inferences from many snapshots of cell data from unknown time points are made via pseudo-time algorithms and where data are known to be cyclical because of the cell cycle, for example. One then wishes to estimate the distribution of cells on the pseudo-time scale. The value at the left boundary of this distribution must as a consequence of cell division be double the value at the right boundary. In other words, we have linked boundary conditions with the constant r = 2, but otherwise we do not know the value of the density at the boundaries of the domain. See [25], for example, for further motivation and discussion.
Unfortunately, to the best of our knowledge, all of the currently existing kernel density estimation methods, bias-correcting or not, cannot satisfactorily handle the linked boundary condition. Figure 4 in the numerical section 5.2 shows a typical example of what can go wrong when currently available density estimators are applied to real biological data. The result is a smooth density with two unacceptable features: • The domain x ∈ [0, 1] is not respected, and instead the solution has positive density for negative values of x, and also for x > 1, which are physically unreasonable.
• The density does not respect the important biological constraint of the linked boundary condition (that the left value should be double the right, in this particular application), and instead the density decays to zero as |x| becomes large.
The purpose of this paper is to describe a new kernel density estimator that can handle the more general problem of linked boundary conditions with an arbitrary value of r; the situation of interest in the biological application where r = 2 is then solved as an important special case. Figure 5 (C) shows a successful application of our proposed method.
Our proposed estimator is of type (a), that is, we construct a special kernel with support on [0, 1], and such that the linked boundary condition is incorporated into the resulting estimator. Our kernel is inspired by the solution of a diffusion-type PDE [1,3,31,38,50]. In particular, we modify the diffusion model in [3] so that it satisfies the linked boundary conditions.
We also consider the discrete counterpart of the continuous model for two reasons. First, it is a numerical approximation to the continuous model and a useful way to compute the solution of the PDE. Second, the discrete model is relevant when we deal with data which is already pre-binned in a large number of bins.
The rest of the article is organized as follows. In the next section we describe the continuous model for the application at hand. Our results provide the necessary assurances for this to be a useful model in statistics for the purposes of density estimation. Next, we study a discrete model of the same situation, and establish convergence to the continuous model. We then briefly discuss the issue of choosing a stopping time, including pointwise bias, asymptotic properties and the AMISE. These results are applied to an example with biological data in §5.2 and some simulated examples. We then finish the main part of the paper with a short conclusion. After that, we present a derivation of the solution in §7, and explanations of its properties in §8. We formally solve the non self-adjoint initial boundary value problem as a contour integral using the unified transform, a novel transform for analysing boundary value problems for linear (and integrable nonlinear) partial differential equations. This formal solution is then rigorously verified and studied via a nonsymmetric heat kernel which generates a series expansion in nonseparable generalised eigenfunctions of the spatial derivative operator.

The Continuous Linked-Boundaries Model
First, recall the diffusion PDE approach to density estimation [3,24], where a diffusion partial differential equation in one space dimension is employed for kernel density estimation purposes [43,44]. We now present the continuous diffusion model that satisfies the linked boundary condition, and derive its analytical solution. Our proposed diffusion model for a linked-boundary KDE is the continuous solution of the PDE: Here the initial condition (IC), is simply the empirical measure of the given data X 1 , . . . , X n . In other words, f 0 is a sum of Dirac delta distributions. Our results hold for the common cases of practical importance, such as for initial data of the form in (2), and also for functions that are probability densities (or even general distributions) which we denote by f 0 . The first boundary condition (BC) involves a constant r. Notice the ratio of the left boundary value to the right boundary value is The constant ratio r links the boundaries 2 and is assumed to be non-negative: r ≥ 0. This PDE (1) is a natural candidate for density estimation with such a linked boundary condition. However, it is not immediately obvious what properties the PDE has. For example, one question is whether or not the solution is a probability density. More fundamental questions concern existence and uniqueness, which are well known in the special self-adjoint case of periodic boundary conditions when r = 1.

Well-posedness of the continuous model
We begin by establishing existence and uniqueness in Theorem 1. It turns out that results analogous to the well known r = 1 case hold. First we require the following definitions.
such that µ t is continuous in the vague topology. 3 Notably, this is weaker than C(0, T ; M ([0, 1])), where M ([0, 1]) has the usual Banach space structure with total variation norm. In this case the µ t would need to be continuous in this norm. For instance, The usual story in something like the finite element method, for example, is to first multiply the original differential equation (1) by a smooth test function, then integrate, and then apply integration by parts. This procedure leads to the weak formulation of the problem, and similar steps apply here too as follows, where we use g to denote our smooth test functions. In terms of notation, g x denotes the derivative with respect to x, whereas µ t does not denote a derivative, and instead the subscript t denotes the dependence on time. The notation µ t (g) denotes integration of the smooth test function against the measure.
The following adjoint boundary conditions are exactly those that arise from integration by parts (adjoint with respect to the standard inner product). That is, F (r) defined below in (3) is a subspace of D(A * ), where D(A) is defined in (7), and where A * is the adjoint operator, that is defined in (47). Definition 2. Let F(r) denote all g ∈ C ∞ ([0, 1]) satisfying the adjoint linked boundary conditions Theorem 1 (Well Posedness). Assume our initial condition f 0 lies in M ([0, 1]) and satisfies the linked boundary conditions in the very weak sense that f 0 ({0}) = rf 0 ({1}). Then there exists a unique weak solution to (1) for t ∈ [0, T ) for any T ∈ (0, ∞], which we denote by f (·, t).
For t > 0 this weak solution is smooth in t and real analytic in x. Furthermore, in the more general setting of f 0 not necessarily being of the form (2), the solution has the following properties which generalise the classical case of r = 1:

Solution of the continuous model
If we ignore the constant r in the boundary conditions of (1) (and replace it by the special case r = 1), then we would have the simple diffusion equation with periodic boundary conditions. One can successfully apply Fourier methods, separation-of-variables or Sturm-Liouville theory to solve the periodic version of this PDE [13,19]. However, when r = 1, making the ansatz that a solution is of the 'rank one', separable form f (x, t) = g(x)h(t) leads to no interesting solutions and separation of variables fails. It is possible to construct a somewhat non-obvious series solution. Surprisingly, this is not possible in general for well posed constant coefficient evolution equations on a finite interval [34]. The differential operator associated with the evolution equation in (1) is regular in the sense of Birkhoff [2] but not self-adjoint when r = 1 due to the boundary conditions. It is possible to generalise the notion of eigenfunctions of the differential operator [6] and these generalised eigenfunctions form a complete system in L 2 ([0, 1]) [33,26] (and in fact form a Riesz basis). However, the expansions become complicated due the fact that the characteristic values (which determine the spectrum) are not simple and the Green's function does not have simple poles. In particular, this complicates the time dependence of the following series solution.
Theorem 2 (Representation). The unique solution described in Theorem (1) has the following representations. Fourier integral representation: Here the contours ∂D ± are shown in Figure 6 and are deformations of the boundaries of D ± = {k ∈ C ± : Re(k 2 ) < 0}. The determinant function is given by where k n = 2nπ and In the more general case where r = 1, in addition to the usual separable solutions exp(ik n x − k 2 n t/2), the series expansion also includes the nonseparable solutions exp(ik n x − k 2 n t/2)(x + ik n t). We can understand these as being generalised eigenfunctions in the following sense (see the early papers [27,48]). Define the operator, where u x denotes a derivative, Then it is easily checked 4 that φ n ∈ N ((A − k 2 n I) 2 ). In particular, both φ n and (A − k 2 n I)φ n satisfy the required boundary conditions. Evaluating the series at t = 0 gives a realisation of Theorems 3.1-3.5 of Chapter 5 of [26] which state that these generalised eigenfunctions form a complete system in L 2 ([0, 1]). In particular, these functions block diagonalise the operator in an analogous form to the Jordan normal form for finite matrices. If we consider any generalised eigenspace N ((A − k 2 n I) 2 ) corresponding to k 2 n = 4π 2 n 2 with n > 0 and choose the basis {sin(k n x), φ n (x)/(2k n )}, the operator acts on this subspace as the matrix k 2 which cannot be diagonalised for r = 1. For our context of kernel density estimation, we define an integral kernel K so that we can write the solution as Owing to (39) and the residue calculus, this is given by the somewhat complicated expression K(r; x, y, t) = n∈Z exp(ik n x − k 2 n t/2) exp(−ik n y) + 1 − r 1 + r (x + ik n t)e −ikny which can be expressed in terms of the more common r = 1 kernel and its derivative (see (44)). For the initial data (2) this gives the density estimate f (x, t) = 1 n n k=1 K(r; x, X k , t).

Remark 1.
Theorem (1) also shows that the pointwise bias vanishes if f X is continuous with f X (0) = rf X (1). Namely, uniformly in x. This is in contrast to the standard Gaussian kernel density estimator [49] or its periodic version (when r = 1) [3], where this is not the case at the boundary. In fact, a more careful analysis of §8.1 shows that if f X ∈ C 2 [0, 1] and x t = x + O( √ t) then our estimator satisfies with C(f X ) independent of x ∈ [0, 1].

Properties of the continuous model
The Maximum Principle [13,19] for parabolic PDEs states that the diffusion equation attains a maximum on the 'boundary' of the two dimensional region in space x ∈ [0, 1] and time t ≥ 0. If our initial distribution is given by a continuous function then the maximum principle gives the following.
Proposition 1 (Bounds on Solution). Suppose that the initial distribution is continuous with f 0 (0) = rf 0 (1) and non- Then for any t > 0 and x ∈ [0, 1] we have min 2r 1 + r , In particular, f remains bounded away from 0 if a > 0 and r > 0.
Similar bounds can be found for negative f 0 and even negative or complex r, but we exclude those cases in this article.
The second boundary condition in words, says 'left slope equals right slope'. A consequence of this BC is formal conservation of mass: If we start with an initial condition that integrates to one, then the solution for all later times also integrates to one. In the context of density estimation, this essential property corresponds to conservation of probability. Seeing both that the solution is non-negative (by the Maximum Principle above), and that the solution always integrates to one if the initial density integrates to one, confirms that the solution is indeed a probability density. This is made precise in the following Theorem (a more careful proof could come via integral kernels, but we omit that here) which does not require continuous initial data.
Theorem 3 (Conservation of Probability). Suppose that the assumptions in Theorem (1) hold and f denotes the unique weak solution (which is well behaved for t > 0 by Theorem 2). Suppose also that f 0 is a probability measure. Then We also have the following, which immediately follows from (6), and describes the behaviour of the solution for large time.
This linear function is the unique stationary distribution that obeys the boundary conditions and has the same integral over [0, 1] as f 0 .
Two simple examples Choose r = 2 so that the linked boundary condition becomes, in words, the 'left boundary value is double the right boundary value.' A simple and explicit solution, which is a probability density, is The initial condition must be a probability density, and must be compatible with the above form in (12). In this example, the long time behaviour of (11) is f ∞ (x) = 4/3 − 2/3x. This solution in (12) is separable, and it has The initial condition does not satisfy the 'equal slopes' boundary condition. Nonetheless, our theory shows that for t > 0 all boundary conditions are satisfied, and moreover the methods that we propose still converge to the correct density, albeit at a slower rate.
constant boundary values (4/3 on the left, and 2/3 on the right) that do not change in time. However, the solution in (12) is too simplistic. To be useful in applications we need solutions of (1), for any choice of r ≥ 0, to have boundary values that automatically evolve in time, from the given initial data, and where we do not make assumptions about the form of that initial data. For instance, Figure 1 shows a second example, with f 0 (x) = 6 11 (−2x 2 + x + 2), that, unlike (12), illustrates boundary values that evolve with time. This is one reason why we need results such as Theorem 2 which allows us to handle non-separable solutions.

The Discrete Linked-Boundaries Model
The exact, continuous PDE (1) is replaced by a standard, finite-difference approximation that remains continuous in time but is discrete in space. This is sometimes known as the 'method of lines' [20], and it yields the first order ODE system: The structure of the matrix A is described below. The solution u of (13) is a vector that approximates the exact solution f . That is, . Here x i = ih is the ith grid point on the grid of n + 2 equally spaced points in the domain [0, 1], for i = 0, 1, . . . , n, n + 1. The spacing between two consecutive grid points is h = 1 n+1 . The two boundary conditions give two equations involving values at the two boundary nodes, i.e. at node 0 and at node n + 1. That is, This motivates us to make the following definitions for the boundary nodes: (We have eliminated u 0 and u n+1 from our equations in favour of u 1 and u n . We also always have that u 0 + u n+1 = u 1 + u n .) We are left with a set of n equations involving n unknown values u 1 , . . . , u n , at the n interior nodes 1, . . . , n (and only an n × n corresponding matrix).

The Four Corners Matrix
All this leads to an n × n matrix A, with the following structure: The first row and the last row come from the boundary conditions. All interior rows come from the standard second-order finite difference approximation of the (spatial) second derivative. Various linked boundary conditions can be modelled by choosing different values for the four numbers α, β, γ and δ. We will only consider choices of α, β, γ and δ that maintain these key properties of the matrix (17), namely that it has zero column sum, and off-diagonals are negative and the main diagonal entries are positive. Then an attraction of this discrete model (13) is that it can be interpreted as a continuous-time Markov process. This ensures that the solution vector u is a non-negative probability vector. For example, our linked boundary conditions in (1), with r = 2, corresponds to the choice α = γ = −2/3 and β = δ = −1/3, which does allow the interpretation as a Markov process. The solution of the discrete model problem (13), with a constant coefficient matrix comes as a matrix exponential One way to compute that exponential is via diagonalization where exp(− 1 2h 2 Dt) is a diagonal matrix, with ith diagonal entry given by the scalar exponential exp(− 1 2h 2 λ i t). The ith column of the matrix V is the eigenvector corresponding to eigenvalue λ i .
The 'Four Corners Matrix' (17), is a non-symmetric example of a 'tridiagonal Toeplitz matrix with four perturbed corners' [51,47]. Exact and explicit formulas for the eigenvalues and eigenvectors are available. Thus we have exact expressions for the solutions to this discrete model, via the above diagonalisation, if desired.
There is a unique zero eigenvalue, corresponding to the stationary density as t → ∞. The stationary density is an affine function in the continuous PDE setting. In the discrete setting, with r ≥ 0, the components of the stationary eigenvector v are equally-spaced, in the sense that ∀i, Then it is easy to verify that the corresponding rank one matrix vv is Toeplitz-plus-Hankel, for example by quickly checking the tests described in [47]. The same matrix (17), in the special symmetric case that α = β, appears in the Four Corners Theorem of [47]. Here we observe that the Four Corners Theorem of [47] can be generalized to the non-symmetric case. That is, the Four Corners Theorem still holds for the Four Corners Matrix (17) in our non-symmetric case when α = β: all functions of our nonsymmetric Four Corners Matrix (17), with r ≥ 0 in (19) below, are the sum of a Toeplitz matrix and a Hankel matrix. To see this is still true, we could check that the projections onto all eigenspaces are Toeplitz-plus-Hankel. To do that, we could use the explicit expressions for the eigenvectors ( for v or for w given below in (20)). Then the (i, j) entry of the rank one matrix vv , or of ww , is a sum of terms of the form sin(iθ) sin(jθ). The elementary identity 2 sin(iθ) sin(jθ) = cos((i − j)θ) − cos((i + j)θ) now allows us to write this as the sum of a Toeplitz part (coming from the part that is only a function of (i − j) and not of i or j separately, which in our case is cos((i − j)θ)) and a Hankel part (coming from functions of (i + j), i.e. cos((i + j)θ)). Since the projection onto all eigenspaces is Toeplitz-plus-Hankel, then the solution u(t) = exp (−At/2h 2 ) u(0) = Tu(0) + Hu(0) is likewise the sum of two parts: (i) a Toeplitz part, which can be thought of as the solution without boundary conditions; and (ii) a Hankel part, which is precisely the correction due to the boundary conditions.
All other eigenvalues of −A are negative. In the notation of [51] we have: and In the case that r = 1, we can group the spectral data into two classes with eigenvalues: The zero eigenvalue, when k = n−1 2 + 1, has already been discussed. Other eigenvalues correspond to eigenvectors with components (listed via subscripts) Some properties of the discrete model and its links to the continuous model are: • All eigenvalues of the Four Corners Matrix are purely real. Also, the eigenvalues of the operator in the continuous model are likewise purely real. 5 This is perhaps surprising since the matrix is not symmetric, and the operator is not self-adjoint.
• The Four Corners Matrix is diagonalizable, and in that sense of the right hand side of (18), the solutions to the discrete model are 'separable.' In contrast, the operator for the continuous PDE is not diagonalizable, and instead the analogy of the Jordan Normal Form from linear algebra applies to the operator. Despite this, it is straightforward to show that: 1. The eigenvalues of the discrete model converge 6 to that of the continuous model (including algebraic multiplicities).
2. The eigenvectors converge to the generalised eigenfunctions of the continuous operator.
• Finally, the discrete model is stable in the maximum norm. This together with consistency and the fact that the continuous model is well posed implies that the discrete model converges to the solutions of the continuous model from the Lax Equivalence Theorem [36]. In particular, the bounds proven in Proposition 1 for the continuous model carry over to the discrete model with 7 From the interpretation of the discrete model as a Markov process, it follows that sup and hence we also gain stability in any p-norm via interpolation. 5 Recall that in this article we insist on the PDE in (1) where r ≥ 0 and r real, and also where the equal slopes boundary condition applies. Generalisations to other values of r and other boundary constraints are certainly possible, but we do not consider those generalisations, nor their spectra, in this article. 6 This can be made precise in say the Attouch-Wets topology -the convergence is locally uniform. 7 We use a subscript l p to denote the matrix l p norm.
In summary, we offer two different practical computational methods to perform density estimation with linked boundary conditions: (i) the discrete model, via the Four Corners Matrix; and (ii) the continuous model, via the solutions that we find here as a series or as a contour integral. 8 The two methods have relative advantages and disadvantages. The Four Corners Matrix is a finite difference method of modest accuracy but it is simple and easy to use. This might be attractive, especially if the initial data is already discretely binned, as is often the case. It also essentially guarantees maintaining positivity because of the attractive interpretation as a Markov process. In contrast, the series solution we provide for the continuous model is typically highly accurate for t > 0.

Choosing the stopping time
An important issue in density estimation is how to choose the bandwidth parameter (for kernel density estimation) or equivalently, the final time T at which we compute the solution of the PDE. This issue has already received extensive attention in the literature. We now give a brief summary of that issue, and we also make two suggestions for known methods already available. After that, we address the issue specifically in the context of our linked boundaries model.
At one extreme, if we choose T = 0 then we recover the initial condition, which is precisely the raw data, with no smoothing. At the other extreme, if we let T → ∞, then we obtain a stationary density that is a linear, straight line, which contains no information whatsoever about the raw data. In between, 0 < T < ∞, we have some smoothing effect while also retaining some information from the original data, and often this is interpreted as an optimal balance between the data and the prior.
One would like such a choice of an intermediate value of T to satisfy certain statistical properties. For example, as more and more data are included, we would like to converge to the 'true' density. Various proposals and their properties are available. Perhaps the most common choice is 'Silverman's rule of thumb' which works very well when the data is close to being normally distributed. We expect that this choice is fine for the simpler datasets and examples that we consider in this article. However, the methods that we present in the article are more general, and for other applications it might well be necessary to adopt more sophisticated approaches.
In that regard, another possible approach is to use the output from the freely available software of one of the authors. 9 In particular, the first output from that software is bandwidth=sqrt(t), when the software is called via This is expected to be a better choice than Silverman's rule in situations where there are many widely separated peaks in the data.

Asymptotic properties
We now give a more precise treatment of the choice of stopping time for the linked boundaries model, as well as discussing the Mean Integrated Squared Error (MISE) defined by Often one is interested in the asymptotic approximation to the MISE, denoted AMISE, under the requirements that t = t n ↓ 0 and n √ t n → ∞. The asymptotically optimal bandwidth is then the minimizer of the AMISE. For our continuous model of kernel density estimation we have the following result which gives the same O(n −4/5 ) rate of convergence as the Gaussian kernel density estimator on the whole real line.
Theorem 4. Let t = t n be such that lim n→∞ t n = 0 and lim n→∞ n √ t n = ∞ and suppose that f X ∈ C 2 ([0, 1]) with f X (0) = rf X (1). Then the following hold: 1. The integrated variance has the asymptotic behaviour 2. If f X (0) = f X (1) then 3 Corollary 1. Combining the leading order bias and variance terms gives the asymptotic approximation to the MISE: Hence, the square of the asymptotically optimal bandwidth is with the minimum value

If
Hence, the square of the asymptotically optimal bandwidth is A few remarks are in order. First, it is interesting to note that in the case of f X (1) = f X (0), the optimum choice t * and the minimum AMISE do not depend on r, and are the same as the more familiar 'whole line' situation -in other words, we can confidently use existing methods in the literature (such as recommended above) to choose a stopping time. Second, it seems plausible that we could estimate f X (1) − f X (0) (or the value of r) adaptively and change the boundary conditions in the model accordingly. A full discussion of solving the heat equation with linked boundary conditions for the first spatial derivative is beyond the scope of this paper, but can be done using the same methods we present here. Future work will aim to incorporate an adaptive estimate of the true boundary conditions (both for the density function and its first derivative) and resulting adaptive boundary conditions. We mention a result in this direction which will appear when we compare our model to that of [3], whose code is based around the discrete cosine transform, the continuous version of which solves the heat equation subject to the boundary conditions We have used the subscript c to avoid confusion with our solution f to (1). The analogous result to Theorem 4 is the following theorem which can be proven using the same techniques and hence we have omitted the proof. Similarly, one can then derive the optimum choice of t and the minimum AMISE.
Theorem 5. Let t = t n be such that lim n→∞ t n = 0 and lim n→∞ n √ t n = ∞ and suppose that f X ∈ C 2 ([0, 1]). Then the following hold: 1. The integrated variance has the asymptotic behaviour 2 3. If f X (0) = 0 or f X (1) = 0 then 5 Numerical Examples

Numerical Examples with Synthetic Data
Here we shall test the method for examples where the true density f X is known. We begin with the trimodal distribution shown in Figure 2. We have also shown a typical approximation of the distribution function using our proposed method and the density estimation proposed in [3] for a sample size of n = 10000. The proposed method is more accurate near the boundaries of the domain (see zoomed in section of plot) and behaves similarly in the middle of the domain. For stopping times we have used the bandwidth bandwidth=sqrt(t) discussed in §4. We have also measured the error in two ways. We took the mean over 100 simulations for each n of both where g is the given estimator. Figure 2 shows these measures of error as we increase n, as well as the minimum AMISE as a function of n for each method. We can clearly see agreement with the analysis in §4.1. The difference between the two methods is more striking when measured using the uniform norm. We found this to be the case for a range of other tested distributions. Even in the case when f X (0) = f X (1) and both methods obtain minimum AMISE scaling as O(n −3/4 ), there was a significant difference in the uniform norm. Next we consider the case when f X is log-concave and not necessarily smooth. Denoting the PDF of the beta distribution with parameters (α, β) by b(α, β; x), we let The parameter a controls the smoothness of f X near x = 0. We have compared our method to a method that computes log-concave maximum likelihood estimators [11,10]. This seeks to compute the log-concave projection of the empirical distribution through an active set approach. 10 Details on this method and other similar methods can be found in [37], with a study of the more involved case of censored data in [12]. Tables 1 and 2 show the squared mean L 2 and L ∞ errors respectively over 10 simulations for n = 10 5 , as we vary a for the linked boundary diffusion estimator and the log-concave projection method (abbreviated to LC), as well as its smoothed version (LCS).
In each case we have shaded the most accurate estimator. The linked boundary diffusion estimator performs much better when measured in the uniform norm but is slightly worse in the L 2 sense when the distribution function becomes less smooth. This is demonstrated in Figure 3 for a typical estimation using n = 10000. The experiments in Tables 1 and 2

Numerical Example with Cell Data
This short section demonstrates the application of the methods that we propose to a problem in biology with real data, where the data are taken from [25]. As mentioned in the introduction, Figure 4 shows an example of what goes wrong when current methods are applied. Figure 5 C demonstrate our proposed method, which performs satisfactorily. Our example originates from the study of biological processes, in particular the cell cycle, with single cell data. The cell cycle can be split in different stages (Fig. 5 A): It starts with a growth phase (G1) followed by the S-phase in which DNA is replicated. The subsequent shorter growth phase (G2) then leads into M-phase (mitosis). A cell then divides into two daughter cells at the end of M-phase. The example dataset is obtained from an engineered cell line which has a fluorescent reporter protein named "geminin" which starts to accumulate once a cell transitions from G1-to S-phase and which is degraded during cell division (Fig. 5 A). This dataset compromises levels of DNA and geminin in single cells measured by flow cytometry (Fig. 5 B). The red curve in Fig. 5 B indicates the path that the average cell takes when it goes through the cell cycle. Pseudotime algorithms perform a dimensionality reduction by assigning a pseudotime value to each cell which can be interpreted as its position on the average curve. A recently developed theory based on ergodic principles makes it possible to derive a transformation to a real-time scale from cell ordering in pseudotime reflecting cell cycle progression ( [21,25]). The method relies on the distribution of cells on the pseudotime scale. As Figure 4: A typical example of output from a kernel density estimator (ksdensity from MATLAB) applied to our real biological data. This does not respect the domain, and it also does not respect the important linked boundary conditions. The novel methods that we propose in this article address those issues. mentioned in the introduction, the distribution at the beginning and the end of the cell cycle are linked due to cell division by The example dataset ( Figure 5) compromises levels of DNA and geminin in single cells in an exponential growing population of NCI-H460/geminin cells measured by flow cytometry ( [25]). A pseudotime algorithm provides a dataset X 1 , . . . , X N of pseudotime values which represents a quantitative value for the progression of every single cell through the cell cycle. The kernel density estimator with linked boundary condition (r = 2) thus produces an distribution that satisfies the conditions (33) on the density due to cell division (Figure 5 C).

Conclusion
Our study was motivated by a question from biology. This biological application required a method of density estimation that can handle the situation , binned data (blue) and kernel density estimate (red). The kernel density estimate was obtained by solving our continuous PDE (1) by our discrete numerical method in (13), with the 'four corners matrix' in (17), and this solution looks satisfying. The stopping time, t = 0.00074, came from the kernel density estimator in [3].
of linked boundaries. Boundary conditions are known to be a difficult problem in the context of kernel density estimation. To our knowledge, the linked boundary conditions that we handle here have not been previously addressed in the literature. We have provided two approaches: 1) a continuous PDE model with theoretical guarantees that its solution is well behaved and suitable for the proposed statistical applications; 2) a discrete model, which we prove converges to the continuous model. As a numerical example, we have applied the discrete model on data that arises in biology. The models and the analysis we provide naturally lead us to provide algorithms and software suitable for this application. We believe this can be useful to a wider community interested in this class of problems with linked boundary conditions. We found the method to compete well with other existing methods, both in terms of speed and accuracy. In particular, the method is more accurate close to the boundary.
There remain some open questions regarding the provided models. First, it seems likely that a good strategy may be to adaptively approximate suitable boundary conditions satisfied by the probability distribution (such as r or more generally linking the derivatives) and use this estimate to choose the appropriate model. Second, it is possible to adapt the methods in this paper to multivariate distributions. We leave both of these topics for a further study.

Derivation of the Solution
We begin with a description of how to obtain the solution in Theorem 2. It is straightforward to verify the solution works but this does not indicate how it is constructed. We will solve the problem via a powerful transform technique called the unified transform [14,15,17,16,7]. The unified transform applied to linear initial-boundary value problems can be considered as constructing a transform tailor-made to the given PDE and boundary conditions. It generalises many of the classical methods and computes solutions even in some cases when classical methods apparently fail [46,22]. A review of this method can be found in [9]. So far the only case of our problem considered in the literature on this method has been r = 1 (trivially periodic). For the heat equation with oblique Robin boundary conditions/non-local boundary conditions we refer the reader to [29] and for interface problems we refer the reader to [41,42,40].
We then construct the series solution by deforming the contours in the integral representation and applying Cauchy's residue theorem. We remark that the unified method is similar in spirit to the contour method of Birkhoff in that it makes considerable use of the tools of complex analysis (see [32] for more details and history on this method). Birkhoff's method tackles the problem of generalised eigenfunction expansions of the spatial derivative operator by using asymptotic estimates of the Green's function associated with the problem for large eigenvalue parameter. Further discussion of the link between the unified transform and the spectral theory of two-point linear differential operators is beyond the scope of this paper but we refer the reader to [35,45]. We found that the unified transform considerably simplifies the proof of Theorem 1 as opposed to the use of series solutions. To the authors' knowledge, our problem has not been considered (other than the obvious periodic case) in the literature concerning either classical methods or the unified transform.

Obtaining the solution
The first step is to write the PDE in divergence form: We will employ Green's theorem, over the domain (0, 1)×(0, t). Here one must assume a priori estimates on the smoothness of the solution f which will be verified later using the candidate solution. Define the transforms: where again we assume these are well defined. Green's theorem and the boundary conditions imply (after some small amount of algebra) the so called 'global relation', coupling the the transforms of the solution and initial data: The next step is to invert via the inverse Fourier transform, yielding However, this expression contains the unknown functionsg andh. To get rid of these we use some complex analysis and symmetries of the global relation (35). Define the domains These are shown in Figure 6. A quick application of Cauchy's theorem and Jordan's lemma means we can re-write our solution as We now use the symmetry under k → −k of the global relation (35) and the fact that the argument in each ofg andh is k 2 /2 to set up the linear system: .
Defining the determinant function Υ(k) = 2(1 + r)(cos(k) − 1), solving the linear system leads to the relations: Since Υ(k) is zero whenever cos(k) = 1, before we substitute these relations into our integral solution we deform the contours ∂D + and ∂D − as shown in Figure 6 to avoid the poles of Υ(k) −1 along the real line. Upon substitution, we are still left with unknown contributions proportional to We will argue that the integral I 1 (x, t) along ∂D + vanishes and the argument for I 2 (x, t) follows the same reasoning. First observe that as k → ∞ in C + , Υ(k) −1 ∼ exp(ik)/(1 + r). Also we must have that is also bounded in C + and hence the function is bounded in C + . It follows that we can close the contour in the upper half plane and use Jordan's lemma to see that I 1 (x, t) vanishes. We then obtain the integral form of the solution in Theorem 2.
To obtain the series form we can write 2 exp(ik)−(1+r) = − exp(ik)Υ(k)+ exp(ik)[(1 + r) exp(ik) − 2r], which implies Taking into account the orientation of ∂D − , upon deforming the first of these integrals back to the real line, we see that it cancels the first integral in (5).
Define the function The integrand in (39) has a double pole at k n = 2nπ so we deform the contour ∂D to ∂D shown in Figure 7. Cauchy's residue theorem then implies that It is then straightforward to check the equality of (40) and (6).

Further Explanation of Results
For completeness, we collect here proofs of the results in Section 2, as well as the bound (21).

Discussion of the limit t ↓ 0
To rigorously prove the theorems in Section 2, we shall need to study the solution as t ↓ 0. Recall the definition in (8). We shall also need the function Figure 7: Deformation of the contour to circle the poles. The contributions along the real line between these circles cancel.
defined for t > 0. Using the Poisson summation formula, we can write K 1 as a periodic summation of the heat kernel. The following lemma is well known and hence stated without proof.
The following Proposition then describes the limit properties of our constructed solution as t ↓ 0 in the case of continuous initial data.
Proof. We can write Here means the derivative with respect to the spatial variable.
To study the limit as t ↓ 0, we first show that we can ignore the terms with a factor of t. It is enough to prove the following Lemma.
Proof. By the Riemann-Lebesgue lemma, we have that lim n→∞f0 (k n ) = 0. So given > 0, let N be large such that if n ≥ N then f 0 (k n ) ≤ . Then Let h = 2π. The sum approximates the integral ∞ h(N +1) exp(−y 2 t/2)ydy and we have for some constantC. It follows that lim sup t↓0 t n∈N exp(−k 2 n t/2)k nf0 (k n ) ≤C .
Since > 0 was arbitrary, the lemma follows.
By a change of variable we have The bound (43) now follows from Lemma 1, as do the pointwise limits from a straightforward somewhat tedious calculation. Now suppose that f 0 (0) = rf 0 (1) and split the initial data as follows: Then p 0 ∈ C([0, 1]) with the crucial property that p 0 (0) = p 0 (1) = 0. Arguing as above and using Lemma 1, we see that the following limit holds uniformly lim t↓0 1 0 K(r; x, y, t)p 0 (y)dy = p 0 (x).
Proof. Note that the case r = 1 is well known. The fact that f 0 ∈ L 1 ([0, 1]) by Hölder's inequality together with Lemma 2 show that we can ignore the parts multiplied by t in the kernel representations (44). The fact that yf 0 (y) ∈ L p ([0, 1]) implies the convergence by simply summing the parts in (44) and using the r = 1 case with a change of variable for the K 1 (x + y, t) part.

Further explanation of well-posedness and representation of the solution
Proof of Theorems 1 and 2. For t > 0, it is clear that the function in (5) is smooth in x, t and real analytic in x, as well as solving the heat equation. This follows from being able to differentiate under the integral sign due to the exp(−k 2 t/2) factor and the fact that extending x to a complex argument yields an analytic function. It is also easy to check via the series (6)  converges for all x and is uniformly bounded as t ↓ 0. We will use the explicit calculation of the endpoints limits at x = 0, 1. By the dominated convergence theorem we have which proves the required weak continuity.
To prove uniqueness, suppose that there exists µ t , τ t ∈ M ([0, 1]) which are both weak solutions with µ 0 = τ 0 = f 0 . Set m t = µ t − τ t . We will consider expansions of functions in the generalised eigenfunctions of the adjoint problem (see the discussion after the statement of Theorem 2). It is straightforward to check that the adjoint problem (with the boundary conditions in (3)) is Birkhoff regular and hence the generalised eigenfunctions are complete in L 2 ([0, 1]). In fact we can show that any continuous function g ∈ C([0, 1]) of bounded variation with g(0) = g(1) can be approximated uniformly by linear combinations of these functions. This follows by either arguing as we did in the proof of Proposition 3 (the case of non-matching derivatives holds but is more involved) or follows from Theorem 7.4.4 of [32]. Now suppose that λ lies in the spectrum of the adjoint A * defined by (47) In our case, the generalised eigenfunctions associated with λ correspond to a basis of N ((A − λI) l ) where l = 1 or 2. If l = 2, and the nullity of (A − λI) 2 is greater than A − λI, we can choose a basis {g 1 , g 2 } such that (A−λI)g 2 = g 1 . For the general case and chains of generalised eigenfunctions we refer the reader to [32]. Now suppose that g ∈ N (A − λI) then g must be smooth on [0, 1]. It follows that for t > 0 2 d dt m t (g) = −λm t (g).
Note that m 0 (g) = 0 and hence we must have that m t (g) = 0 for all t ≥ 0. Similarly, suppose that {g 1 , g 2 } ⊂ N ((A − λI) 2 ) with (A − λI)g 2 = g 1 . Then by the above reasoning we have m t (g 1 ) = 0 for all t ≥ 0 and hence Again we see that m t (g 2 ) = 0 for all t ≥ 0. Though we don't have to consider it in our case, it is clear that the same argument would work for chains of longer lengths. The expansion Theorem discussed above together with the dominated convergence theorem shows that if g ∈ C([0, 1]) of bounded variation with g(0) = g(1) = 0, then m t (g) = 0 for all t ≥ 0. This implies that if U ⊂ (0, 1) is open then m t (U ) = 0 for all t ≥ 0. In particular, we must have m t = a(t)δ 0 + b(t)δ 1 with a, b continuous. In fact, for any f ∈ F(r) we have from which we easily see that a = b = 0 and hence uniqueness follows. This also shows uniqueness in the space C(0, A; L p ([0, 1])), where no argument at the endpoints is needed.
Proof of Proposition 1. We first show that in this case the solution is continuous on [0, 1] × [0, T ) for any T ∈ (0, ∞]. The case of continuity at points t > 0 has already been discussed so suppose that (x n , t n ) → (x, 0) then The first term converges to zero by continuity of f 0 whilst the second term converges to zero by the proven uniform convergence as t ↓ 0. Using the limit given by Proposition 2, we will take T = ∞ without loss of generality.
Since the solution is regular in the interior and continuous on the closure, this immediately means that we can apply the maximum principle to deduce that sup where Ω = (0, 1) × (0, T ). A similar result holds for the infimum. Evaluating (40) at x = 0 leads to where we have used the function K 1 defined by (41) and extended f 0 periodically (values at the endpoints contributed nothing). Hence Similar calculations yield The fact max{2r/(1 + r), 2/(1 + r)} ≥ 1 (recall r ≥ 0) finishes the proof.

Derivation of bound (21) for the discrete model
Let n ≥ 2. Let u ∈ R n ≥0 be any initial vector with u ∞ ≤ 1 and let 1 denote the vector with 1 in all entries. The eigenvector in the kernel of the Four Corners Matrix A is a linear multiple of w 0 defined by This has components Extend this vector to have x 0 = 0, then an application of the discrete Fourier transform implies that we can write for j = 0 where G n (k) = exp(4πik/(n + 1)) − 1 (exp(2πik/(n + 1)) − 1) 2 = −i 2 sin(2πk/(n + 1)) sin 2 (kπ/(n + 1)) .

Explanation for choosing the bandwidth with the linked boundaries model
Here are the arguments for Theorem 4.
Proof. We begin with the proof of 1.
Recall that where K is the kernel given by (44). The second of these terms is bounded by a constant multiple of 1/n so we consider the first. Recall the decomposition (44): K(r; x, y, t) = K 1 (x − y, t) 1 + (x − y) 1 − r 1 + r + K 1 (x + y, t)(x + y − 1) 1 − r 1 + r + t(1 − r) 1 + r K 1 (x + y, t) + K 1 (x − y, t) , where K 1 is the standard periodic heat kernel. For x, y ∈ [0, 1] we have that with the rest of the expansion exponentially small as t ↓ 0 and the asymptotics valid upon taking derivatives. Using this, it is straightforward to show that we can write K(r; x, y, t) = 1 √ 2πt G(r; x, y, t), where G is bounded. From the above asymptotic expansions we can write G(r; x, y, t) = exp − (x − y) 2 2t + h 1 (x, y, t) exp − (x − y − 1) 2 2t + h 2 (x, y, t) exp − (x − y + 1) 2 2t + h 3 (x, y, t) exp − (x + y) 2 2t + h 4 (x, y, t) exp − (x + y − 2) 2 2t + E(x, y, t), where the h i are bounded and the error term E(x, y, t) is exponentially small as t ↓ 0 uniformly in x, y. Furthermore, we have is bounded and converges point-wise for almost all x ∈ [0, 1] to f X (x). It follows that (51) The rate (24) now follows.
Next suppose that p 0 (0) = p 0 (1). In this case we have ∂F ∂t (x, t) = 1 2 1 0 K 1 (x − y, t) ∂ 2 w ∂y 2 (x, y)dy (60) for some bounded function U (x, t) which converges to 0 as t ↓ 0 for almost all x ∈ [0, 1]. It follows that F (x, t) = t 1 2 for some bounded functionF (x, t) which converges to 0 as t ↓ 0 for almost all x ∈ [0, 1]. Similarly we have for some bounded function V (x, t) which converges to 0 as t ↓ 0 for almost all x ∈ [0, 1]. It follows from (54) that for some bounded function W (x, t) which converges to 0 as t ↓ 0 for almost all x ∈ [0, 1]. But we have The dominated convergence theorem then implies that