Transmetric Density Estimation

Abstract. A transmetric is a generalization of a metric that is tailored to the properties needed in kernel density estimation. Using transmetrics in kernel density estimation is an intuitive way to make assumptions on the kernel of the distribution in order to improve convergence orders and to reduce the number of dimensions in the graphical display. This framework is required for discussing the estimators suggested by Hovda (2014). Asymptotic arguments for the bias and the mean integrated squared error are difficult in the general case, but some results are given when the transmetric is of the type defined in Hovda (2014). An important contribution of this paper is that the convergence order can be as high as 4/5, regardless of the number of dimensions.


Introduction
The multivariate kernel density estimator with a fixed bandwidth is given by

f̂_H(x) = (1/n) ∑_{i=1}^{n} K_H(x − X_i),  (1)

where K_H(x) : R^m → R is a unimodal, symmetric, nonnegative and zero-mean function that integrates to 1. The bandwidth matrix H describes the level of smoothing. When using a kernel K that is independent of H, then

f̂_H(x) = (n|H|^{1/2})^{−1} ∑_{i=1}^{n} K(H^{−1/2}(x − X_i)).  (2)

The first univariate version of (2) was proposed by Fix and Hodges (1951), while the general class was investigated by Rosenblatt (1956) and Parzen (1962). The multivariate extension was outlined by Cacoullos (1966) and Epanechnikov (1969), while the full bandwidth matrix was discussed by Deheuvels (1977).
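Under the standard reading of equation (2), the estimator can be sketched in a few lines. The following is an illustrative implementation (all names are ours), using a Gaussian kernel and a Cholesky factor of H as one valid choice of H^{−1/2}:

```python
import numpy as np

def gaussian_kernel(u):
    # standard multivariate normal density: unimodal, symmetric, zero-mean,
    # and integrates to 1 over R^m
    m = u.shape[-1]
    return np.exp(-0.5 * np.sum(u * u, axis=-1)) / (2.0 * np.pi) ** (m / 2.0)

def kde(x, data, H):
    # f_H(x) = (1/n) sum_i |H|^{-1/2} K(H^{-1/2}(x - X_i)); the Cholesky
    # factor L of H gives one valid square root, and for a radially symmetric
    # K only u'u = (x - X_i)' H^{-1} (x - X_i) matters.
    L = np.linalg.cholesky(H)
    u = np.linalg.solve(L, (x - data).T).T
    return gaussian_kernel(u).mean() / np.sqrt(np.linalg.det(H))
```

For a Gaussian kernel centered at each observation, the estimate at a point is simply the average of the kernel values, scaled by |H|^{−1/2}.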
This paper involves using generalizations of metrics in kernel density estimation. To improve readability, we recall the following definitions.
If the third criterion (identity) is relaxed, d is a pseudometric, and the pseudometric space and its balls are defined in the same way as in definition 1.2. Similarly, the semimetric is defined by relaxing the fifth criterion of definition 1.1. The quasimetric is defined by relaxing the fourth criterion, and the pseudoquasimetric by relaxing both the third and the fourth criteria. The premetric is the most general of them all, in that it keeps only the first two criteria.
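As an illustration of these relaxations (our example, not from the paper), the distance d(x, y) = ||x| − |y|| on R satisfies every metric criterion except identity, and is therefore a pseudometric:

```python
# Illustrative example: d(x, y) = | |x| - |y| | on R satisfies nonnegativity,
# d(x, x) = 0, symmetry and the triangle inequality, but not identity, since
# d(-1, 1) = 0 with -1 != 1. It is therefore a pseudometric but not a metric.
def d(x, y):
    return abs(abs(x) - abs(y))

pts = [-2.0, -1.0, 0.0, 0.5, 1.0, 3.0]
assert all(d(x, x) == 0.0 for x in pts)                      # d(x, x) = 0
assert all(d(x, y) >= 0.0 for x in pts for y in pts)         # nonnegativity
assert all(d(x, y) == d(y, x) for x in pts for y in pts)     # symmetry
assert all(d(x, z) <= d(x, y) + d(y, z) + 1e-12
           for x in pts for y in pts for z in pts)           # triangle inequality
assert d(-1.0, 1.0) == 0.0                                   # identity fails
```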
The above definition of a pseudometric is exactly the same as the definition of a semimetric in the field of nonparametric functional data analysis (Ferraty and Vieu, 2006). To avoid confusion, we will use the term pseudometric, even though most of the literature in nonparametric functional data analysis uses the term semimetric.
In nonparametric functional data analysis, the explanatory variables are functional. Most of the work has been done on regression and classification, and an overview of the field is given in the book by Ferraty and Vieu (2006). In the case of density estimation, Delaigle and Hall (2010) proved that a probability density function generally does not exist for functional data, but developed notions of density and mode when functional data is considered in the space determined by the eigenfunctions of the principal components. In the paper by Ferraty, Kudraszow and Vieu (2012), an infinite dimensional analogue of a density estimator is discussed, when the small ball probability associated with the functional data can be approximated as a product of two independent functions: one depending on the center of the ball and one depending on the radius.
In Hovda (2014), some finite dimensional pseudometric spaces are considered, in which balls with finite radius have finite volumes. Such spaces exist as a specification of the theory of nonparametric functional data analysis (Ferraty et al., 2012). The theory in Hovda (2014) is distinguished from that of nonparametric functional data analysis because the pseudometric spaces that are considered have unit ball volumes that depend on the locations of their centers. This is mitigated by defining equivalent distances of pseudometrics.
In this paper, we propose the definition of the transmetric, which generalizes the equivalent distances of pseudometrics of Hovda (2014), the finite dimensional pseudometrics that are common in nonparametric functional data analysis, and the metrics that are used in common kernel density estimation. Moreover, the formulation in this paper allows using multiple transmetrics. This is a generalization of Hovda (2014), where only one transmetric is allowed. It is proven that the convergence rate can be as high as 4/5 regardless of the number of dimensions; this was only indicated numerically in Hovda (2014).
Using transmetrics in density estimation involves training parameters of the transmetrics to fit the kernel of the distribution. To avoid confusion, it is worth emphasizing that the kernel of the distribution is a different concept than the kernel function, such as K(•) in equation (2). The kernel of the distribution, or more roughly the family of level sets of the distribution, is defined as

ker(f) = {(x, y) : f(x) = f(y)},

where x and y are elements in X.
The choice of the family of transmetrics describes the constraints on the kernels of the distributions that can be modeled. In the case of fitting parameters, it is therefore important that the level sets of the actual distribution do not violate these constraints. This is somewhat analogous to parametric models, where serious misbehavior is observed when the distribution has a different shape than any of the members of the family of models. If the true distribution does not violate the constraints imposed by the choice of family of transmetrics, it is shown in this paper that convergence orders can be improved and the distribution can be graphically displayed in fewer dimensions.
In section 2, transmetric density estimation is defined as a product of kernels that take transmetrics as arguments. The transmetric as of Hovda (2014) is outlined in section 3. In section 4, asymptotic arguments are discussed, while the proofs are given in section 5. The paper is concluded in section 6. Table 1 contains a list of symbols that are frequently used in this paper.

Definition of Transmetric Density Estimators
Table 1. List of important symbols:
x_j — the P_j elements of H^{−1/2}x
y_j — the P_j elements of H^{−1/2}y
X_{i,j} — the P_j elements of H^{−1/2}X_i
c_j — center point parameter in the jth transmetric (not the center of the balls)
p_j — vector of positive exponents in the jth transmetric
p̄_j — harmonic average of the elements in p_j
V_{p_j} — volume of the generalized unit ball with parameters p_j
V_{t,j} — volume of the unit ball in the jth transmetric space
c — concatenated vector of all the c_j s; its size is m
c^j — the jth component of c (not any of the c_j s)
p — concatenated vector of all the p_j s; its size is m
p^j — the jth component of p (not any of the p_j s)
d_{type,j,par} — the jth premetric of type "type" with parameters "par"
d_{type} — tuple of q premetrics of type "type"

In general there are two popular paths for constructing multi-dimensional kernels. The product kernels are products of one-dimensional kernels, while the circularly symmetric kernels depend on the Euclidean distance from 0. We propose a different definition,

f̂_{H,d}(x) = (n|H|^{1/2})^{−1} ∑_{i=1}^{n} K(d(H^{−1/2}x, H^{−1/2}X_i)),  (3)

where m is the number of dimensions and d is some metric. It is assumed that the kernel is nonnegative and monotonically decreasing with finite first and second moments. This kernel generalizes all symmetric kernels when d is chosen to be Euclidean. It also generalizes all product kernels with convex level sets. The convexity restriction is a consequence of the triangle inequality property of the metric. Moreover, it makes sense to also assume that the metrics are of at least differentiability class C^1 and that they are translation invariant (i.e. inducing a normed vector space).
One example is the normal kernel, where the metric is Euclidean and

K(d(x, y)) = (2π)^{−m/2} exp(−d(x, y)^2/2).  (4)

Another example is the uniform kernel, where the metric is the Chebyshev distance and

K(d(x, y)) = 2^{−m} 1{d(x, y) ≤ 1}.  (5)

Notice that the kernels are dependent on the metric being used. In this paper we propose a generalization of the metric as an input to the kernel.
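The two examples can be written directly as functions of a metric. The sketch below (function names are ours) uses the constants that make each kernel integrate to 1 over R^m; in particular, the Chebyshev unit ball is the cube [−1, 1]^m with volume 2^m, which motivates the factor 2^{−m}:

```python
import numpy as np

def normal_kernel(x, y):
    # K(d) = (2*pi)^(-m/2) * exp(-d^2/2), with d the Euclidean distance;
    # integrates to 1 over R^m
    x, y = np.asarray(x, float), np.asarray(y, float)
    dist = np.linalg.norm(x - y)
    return np.exp(-0.5 * dist ** 2) / (2.0 * np.pi) ** (len(x) / 2.0)

def uniform_kernel(x, y):
    # K(d) = 2^{-m} * 1{d <= 1}, with d the Chebyshev (max) distance;
    # constant on the cube [-1, 1]^m of volume 2^m, so it integrates to 1
    x, y = np.asarray(x, float), np.asarray(y, float)
    dist = np.max(np.abs(x - y))
    return 2.0 ** (-len(x)) if dist <= 1.0 else 0.0
```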
Definition 2.1. A transmetric on a set R^m is a function (distance function) d_t : R^m × R^m → R^+, which for all x, y ∈ R^m and ϵ ∈ R^+ satisfies the following conditions: This is a premetric with the additional criteria of topology and translation invariance of ball volumes, which is the reason for the trans prefix in the name. For comparison, a metric that induces a norm is a premetric with the additional criteria of translation invariance and the last three criteria of definition 1.1. Clearly, all metrics that induce a norm are transmetrics, which covers practically all metrics that are used in common kernel density estimation. This is why definition 2.3 below is a generalization of common kernel density estimation.
The key idea of this definition is to generalize the metric that induces a norm as much as possible, while still making sense for kernel density estimation. The first two criteria ensure that close points (in the sense of the transmetric) are weighted higher than points that are further away. The third criterion implies that if the kernel is continuous, then so is the estimator. The last criterion in definition 2.1 is important for achieving the result in theorem 4.1. It is worth emphasizing that relaxing the identity criterion yields some opportunities for smoothing along level sets of the distribution.
As commented on with regard to equations (4) and (5), there is a relationship between the kernel being used and the metric. This motivates the following definition.
Definition 2.2. A kernel K is said to be associated to a transmetric space (R^m, d_t), or associated to a transmetric d_t, if K is nonnegative and monotonically decreasing with compact support, and satisfies the stated integral condition in the case when the unit ball volume V_{B_{d_t}(x,1)} is finite, with a corresponding condition in the infinite case. In the case when V_{B_{d_t}(x,1)} is infinite, it is not meaningful to set the integral equal to one, since the kernel would clearly approach zero everywhere. To elaborate on this, we define the equivalence relation x ∼_{d_t} y if and only if d_t(x, y) = 0. It is clear that all equivalence classes in R^m with respect to ∼_{d_t} are unbounded. To compensate for this unboundedness, we have decided to let the integral be proportional to V_{B_{d_t}(x,1)}. Another way to handle this would be to impose boundaries on the transmetric space (i.e. let it be a subset of R^m), such that all B_{d_t}(x, 1)s would be bounded. This is not discussed further in this paper.
We are now ready to define transmetric density estimation.
Definition 2.3. Let {P_1, P_2, ..., P_q} be a partition of {1, 2, ..., m}, where m_j = |P_j|. Let x_j, y_j and X_{i,j} be vectors that, respectively, consist of the P_j elements of H^{−1/2}x, H^{−1/2}y and H^{−1/2}X_i. If we choose a tuple of transmetrics d_t, such that for every P_j there is a K_j associated with the transmetric space (R^{m_j}, d_{t,j}(x_j, y_j)), then the transmetric density estimator is defined as

f̂_{H,d_t}(x) = |H|^{−1/2} ∑_y ∏_{j=1}^{q} K_j(d_{t,j}(x_j, y_j)) f(y),

where f(y) is equal to 1/n at the X_i s and 0 elsewhere.
The idea is to weight the contributions of the X_{i,j} s according to the distances in the transmetric spaces. It is very important to note that in the case when any of the transmetrics has an unbounded unit ball, f̂_{H,d_t}(x) is only proportional to f(x). As said before, this is because the whole of R^m is considered, rather than a subset where all transmetric spaces have bounded unit balls.
In the trivial case, the estimator f̂_{H,d_t}(x) = f̂_{H,d}(x), where all P_j are equal to {j} and, for all j, d_{t,j}(x_j, y_j) = |x_j − y_j|. If we let ∏_{j=1}^{m} K_j(u_j) = K(u), then we arrive at the estimator defined in equation (1). It is easy to see how any metric that defines a normed vector space can be used, but the interesting question is what other options we have. What transmetrics make sense to use, and which kernels can be associated to them? It is of particular interest to investigate the opportunities of relaxing the identity criterion. This motivates the definition of associated distributions, which was first defined in Hovda (2014). This term is adapted to a tuple of transmetrics here.
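A minimal sketch of definition 2.3, with the partition, the transmetrics and the associated kernels passed in explicitly (all names and the Gaussian kernel choice are ours); in the trivial case described above it reduces to a product-kernel estimator:

```python
import numpy as np

def transmetric_kde(x, data, H, partition, transmetrics, kernels):
    # x: (m,), data: (n, m), H: (m, m) bandwidth matrix,
    # partition: list of index arrays P_j covering {0, ..., m-1},
    # transmetrics[j]: callable d_{t,j}(x_j, y_j) on the P_j coordinates,
    # kernels[j]: callable K_j associated with d_{t,j}.
    L = np.linalg.cholesky(H)             # one valid square root of H
    xs = np.linalg.solve(L, x)            # H^{-1/2} x
    Xs = np.linalg.solve(L, data.T).T     # H^{-1/2} X_i
    total = 0.0
    for Xi in Xs:
        w = 1.0
        for P, d_t, K in zip(partition, transmetrics, kernels):
            w *= K(d_t(xs[P], Xi[P]))     # product over the q partitions
        total += w
    return total / (len(data) * np.sqrt(np.linalg.det(H)))
```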

Definition 2.4. A probability density function is said to be an associated distribution of f̂_{H,d_t}, denoted as f^∼_{H,d_t}.
Here, the equivalence relation ∼_{d_{t,j}} is defined as: for all x_j, y_j ∈ R^{m_j}, x_j ∼_{d_{t,j}} y_j if and only if d_{t,j}(x_j, y_j) = 0.
While discussing the usefulness of a family of transmetrics, it makes sense to ask which density functions are invariant under the equivalence relations that are implied by the transmetrics.
Another point to note is that the graphical display of the estimated density can be reduced to a q-dimensional display, because the estimated density at a certain point depends only on the q transmetric distances.

All the transmetric spaces that will be discussed in this paper have the following property regarding how the volumes of the balls vary as a function of the radius.
Definition 2.5. A transmetric d_t, or a transmetric space (R^m, d_t), is said to be of order u if the volume of the ball is of the form

V_{B_{d_t}(x,ϵ)} = V_t ϵ^u.

This property is useful, as it makes it easy to find associated kernels.
Theorem 2.1. Given a transmetric space (R^m, d_{t,V_t,u}) of order u, then for any function g : R^+ → X, where X ⊂ R, that is bounded and continuous almost everywhere, the following property is valid:

∫_{R^m} g(d_{t,V_t,u}(x, y)) dy = u V_t ∫_0^∞ g(ϵ) ϵ^{u−1} dϵ.

A direct consequence of this is that, when V_t is finite, the normal and the uniform associated kernels of d_{t,V_t,u} are

K(ϵ) = (V_t 2^{u/2} Γ(u/2 + 1))^{−1} exp(−ϵ^2/2) and K(ϵ) = V_t^{−1} 1{ϵ ≤ 1},

while in the infinite case, equations (4) and (5) can be used.
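The integral identity in theorem 2.1 can be checked numerically. The sketch below (our construction, not from the paper) uses the Euclidean metric on R^2, which is a transmetric of order u = 2 with V_t = π; with g(ϵ) = exp(−ϵ^2/2), both sides equal 2π:

```python
import numpy as np

u, V_t = 2, np.pi
g = lambda eps: np.exp(-0.5 * eps ** 2)

# right-hand side: one-dimensional integral u * V_t * int_0^inf g(eps) eps^(u-1) deps
eps = np.linspace(0.0, 10.0, 200001)
rhs = u * V_t * np.sum(g(eps) * eps ** (u - 1)) * (eps[1] - eps[0])

# left-hand side: integral of g(||y - x||) over R^2, truncated where g is ~0
t = np.linspace(-7.0, 7.0, 1401)
X, Y = np.meshgrid(t, t)
lhs = np.sum(g(np.hypot(X, Y))) * (t[1] - t[0]) ** 2

assert abs(lhs - 2.0 * np.pi) < 1e-3
assert abs(rhs - 2.0 * np.pi) < 1e-3
```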
An important source for finding transmetrics is pseudometrics. In general, the volumes of balls of constant radius in pseudometric spaces depend on the locations of the centers. This means that pseudometrics cannot be used directly. However, the following theorem shows how transmetrics can be designed from pseudometrics that induce a topology on R^m.
where V_t is chosen to be the volume of the unit ball of the transmetric space (R^m, d_{t,V_t,u}).
To fix ideas, the framework described in this section is required to discuss the transmetrics of Hovda (2014), which are discussed next.

Transmetrics as of Hovda (2014)
In the paper by Hovda (2014), a family of pseudometrics is defined as the absolute value of the difference between two semimetrics,

d_{p,c,p}(x, y) = |d_{s,p}(x, c) − d_{s,p}(y, c)|.

In that paper the concept of the equivalent distance of the pseudometric is used. However, this is essentially the same as designing a transmetric by using theorem 2.2, where V_t is chosen to be the volume of either the m-sphere or the m-cube. Both choices are valid, but the mathematical formulation may be simpler depending on which kernel is chosen.
Here, c is a chosen point (for instance the mean or the mode of the distribution), p is a vector of positive p_i s and p̄ is the harmonic average of the p_i s, i.e. p̄ = m/(∑_{i=1}^{m} 1/p_i). Moreover, O is the set of positive odd numbers that are smaller than or equal to m.
The unit ball of the semimetric space (R^m, d_{s,p}) is called the generalized unit ball. The volume of this ball is V_p, which is proven by Wang (2005), Gao (2013) and Hovda (2014). In general, d_{s,p} is a semimetric, because the triangle inequality does not hold when at least one of the p_i s is less than one. Depending on the p_i s, the generalized balls describe a wide range of multi-dimensional geometrical objects, of which spheres, cubes, cylinders and stars are special cases. A few three-dimensional objects are shown in figures 1(a), 1(b) and 1(c).
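The volume V_p can be evaluated numerically. The sketch below (function names are ours) assumes the Dirichlet-integral form of the volume of {x ∈ R^m : ∑_i |x_i|^{p_i} ≤ 1}, which reduces to the familiar L^p-ball volume when all p_i are equal, and checks it against a Monte Carlo estimate over the enclosing cube [−1, 1]^m:

```python
import numpy as np
from math import gamma, prod

def v_p(p):
    # assumed closed form: 2^m * prod Gamma(1 + 1/p_i) / Gamma(1 + sum 1/p_i)
    return (2 ** len(p) * prod(gamma(1.0 + 1.0 / pi) for pi in p)
            / gamma(1.0 + sum(1.0 / pi for pi in p)))

def v_p_monte_carlo(p, n=200000, seed=0):
    # fraction of the cube [-1, 1]^m inside the generalized ball, times 2^m
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, size=(n, len(p)))
    inside = (np.abs(x) ** np.asarray(p, float)).sum(axis=1) <= 1.0
    return 2 ** len(p) * inside.mean()
```

For instance, v_p([2, 2, 2]) recovers the sphere volume 4π/3 and v_p([1, 1]) recovers the cross-polytope volume 2.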
In the family of pseudometric spaces (R^3, d_{p,c,p}), the balls can be viewed as generalizations of the spherical shell, and two examples are shown in figures 1(d) and 1(e).
The formula for V_{B_{d_{p,c,p}}(a,ϵ)} is proven in Hovda (2014), and the continuity of the ball volume at ϵ = d_{s,p}(a, c) can be seen from it. In the case when ϵ < d_{s,p}(a, c), the volume is a polynomial with respect to either ϵ or d_{s,p}(a, c). The same is true when ϵ ≥ d_{s,p}(a, c), but the polynomial is different. This demonstrates how the volumes of the balls depend on the locations of the centers. Notice also that the derivative of the volume is not continuous at ϵ = d_{s,p}(a, c), meaning that the derivative of the transmetric is also not continuous.
As discussed in Hovda (2014), a subset of the family of associated distributions is the generalized normal distributions (also known as the exponential power distributions). Some univariate cases are shown in figure 1(f). In Hovda (2014), several distributions are used to compare the new estimators with classical kernel density estimators. Because there is only one transmetric involved, the distributions can be shown using two-dimensional plots. It is shown that the bias for multivariate distributions is significantly lower away from the mode, given that the underlying distribution is an associated distribution. In the present paper, more asymptotic results are given in sections 4.1 and 4.2.
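For reference, the univariate generalized normal density can be sketched as follows (the function name and the (α, β) parameterization with scale α and shape β, as in figure 1(f), are ours):

```python
import numpy as np
from math import gamma

def gen_normal_pdf(x, mu=0.0, alpha=1.0, beta=2.0):
    # f(x) = beta / (2 * alpha * Gamma(1/beta)) * exp(-(|x - mu| / alpha)^beta):
    # beta = 2 gives a normal distribution, beta = 1 a Laplace distribution,
    # and beta -> infinity tends to the uniform on [mu - alpha, mu + alpha]
    c = beta / (2.0 * alpha * gamma(1.0 / beta))
    return c * np.exp(-(np.abs(np.asarray(x, float) - mu) / alpha) ** beta)
```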

Discussion of Bias, Variance and Mean Integrated Squared Error
Density estimators are commonly measured by global error criteria. Among these, we have chosen the mean integrated squared error (MISE), which is given by

MISE(f̂_H) = E ∫ (f̂_H(x) − f(x))^2 dx.

This expression is also known as the L_2 risk function. In common kernel density estimation, there are two main paths for discussing the bias, the variance and the mean integrated squared error. For certain distributions, such as mixtures of normals, these statistics can be described analytically (Simonoff, 1996), but for more general distributions, analytic expressions are derived from asymptotic arguments (Rosenblatt, 1956; Cacoullos, 1966; Epanechnikov, 1969; Deheuvels, 1977).
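A Monte Carlo sketch of this L_2 risk for an ordinary one-dimensional Gaussian kernel density estimator (the truth, grid and kernel choices are ours, for illustration only):

```python
import numpy as np

def mise_mc(n, h, reps=50, seed=0):
    # MISE = E [ integral (fhat - f)^2 dx ], estimated by averaging the
    # integrated squared error over independent N(0, 1) samples
    rng = np.random.default_rng(seed)
    grid = np.linspace(-6.0, 6.0, 1201)
    dx = grid[1] - grid[0]
    f = np.exp(-0.5 * grid ** 2) / np.sqrt(2.0 * np.pi)   # true density
    ises = []
    for _ in range(reps):
        data = rng.standard_normal(n)
        u = (grid[:, None] - data[None, :]) / h
        fhat = np.exp(-0.5 * u ** 2).mean(axis=1) / (h * np.sqrt(2.0 * np.pi))
        ises.append(((fhat - f) ** 2).sum() * dx)          # one ISE
    return float(np.mean(ises))                            # average -> MISE
```

As expected, a severely undersmoothed bandwidth gives a much larger value than a moderate one.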
In the case of transmetric density estimation, we can present the following asymptotic result for the variance.
Theorem 4.1. The asymptotic variance of f̂_{H,d_t}(x) is

σ^2(f̂_{H,d_t}(x)) ≈ (n|H|^{1/2})^{−1} f(x) ∏_{j=1}^{q} R(K_j),

where the factor ∏_{j=1}^{q} R(K_j) is independent of x.
In the case when d_{t,j} is a transmetric with an order, theorem 2.1 can be used to derive analytical expressions for R(K_j). In the trivial case when f̂_{H,d_t} = f̂_{H,d}, the expression for the variance is identical to what is reported in Deheuvels (1977).
This result may be desirable for modelling purposes, as the asymptotic variance of the estimator is everywhere proportional to the distribution itself and the determinant of the bandwidth matrix. However, it is far more difficult to find asymptotic results for the bias and the mean integrated squared error.
In section 4.1, we present some asymptotic arguments for the bias in the case when the transmetric density estimator is applied to an associated distribution. In section 4.2, an asymptotic expression for MISE is derived when q is equal to one. These results tie in well with the results of Hovda (2014).

Asymptotic Arguments for the Bias
In this section we describe the asymptotic bias in the case when all transmetrics in d_t are of the type described in section 3 and in Hovda (2014). It is an important observation that d_{t,j}(x_j, y_j) is equal to |x_j − y_j| when the dimension m_j and the order are one and c_j is either very high or very low. This means that the common kernel density estimator f̂_{H,d}(x) is a special case. With regard to the associated distributions, we have the following results.
Lemma 4.1. Let f^∼_{H,d_t}(x) be an associated distribution of f̂_{H,d_t} and let each d_{t,j} be of the type described in section 3, designed by a pseudometric d_{p,j,c_j,p_j}. Also, let each d_{p,j,c_j,p_j} be constructed from a semimetric d_{s,j,p_j}(x_j, c_j). Then f^∼_{H,d_t}(x) is equal to a function g(ϵ), where each ϵ_j = d_{s,j,p_j}(x_j, c_j)^{p̄_j}. Here p̄_j is the harmonic average of p_j.
Lemma 4.2. The m-dimensional generalized normal distribution is an associated distribution of d_t, when each d_{t,j} is of the type described in section 3 and designed by a pseudometric d_{p,j}.
It is important to note that the family of generalized normal distributions is only a special case of the total family of associated distributions. The bias of the estimator is described next.
Theorem 4.2. Let each d_{t,j} of d_t be of the type described in section 3 and designed by a pseudometric d_{p,j}, and denote α_{x,j} = d_{s,j,p_j}(x_j, c_j), α_x = {α_{x,1}, α_{x,2}, ..., α_{x,q}}, α_{y,j} = d_{s,j,p_j}(y_j, c_j), α_y = {α_{y,1}, α_{y,2}, ..., α_{y,q}}, β_{x,j} = (V_{p_j}/V_{t,j})^{1/m_j} α_{x,j} and β_{y,j} = (V_{p_j}/V_{t,j})^{1/m_j} α_{y,j}. Here, V_{t,j} and V_{p_j} are the volumes of the unit balls in the transmetric space (R^{m_j}, d_{t,j}) and the semimetric space (R^{m_j}, d_{s,j}), respectively. Moreover, define f_H such that D f_H(α_x) is the vector of first order partial derivatives of f_H and H f_H(α_x) is the Hessian matrix. The entries of ψ and ϕ are given by integrals over the β_{y,j} s, where O_s is the set of positive odd numbers that are smaller than or equal to s.
Note that the expression for the expected value is affected by the bandwidth matrix only through the way that α_x is constructed. We will return to this in section 4.2.
In the trivial case when the estimator f̂_{H,d_t}(x) = f̂_{H,d}(x), each P_j is equal to {j}, each c_j goes to −∞, each dV_{B_{d_{s,j,p_j}}(c_j,α_{y,j})}/dα_{y,j} is equal to one and ψ = α_x. In the more general case, when the estimator is not just f̂_H(x), some results are found when choosing the kernels to be uniform.
Lemma 4.3. Let all parameters be the same as in theorem 4.2, and also let the kernels be m_j-dimensional uniform kernels given by equation (5), with each V_{t,j} = 2^{m_j}. The function r_s(u) appearing here has analytical solutions only for s < 5, but since the functions have only s and u as variables, they can easily be put into a table for fast lookup on a computer. In order to describe this function further, we have found the approximations in equation (13). The top expression in equation (13) is exact and can easily be derived. The bottom expression is an approximation that is based on the form of the volume of the ball in the far field. The middle expression is a function that is fitted to ensure continuity with the other two expressions, including the derivative at √3/2.

Asymptotic Arguments for MISE Where q is Equal to One
In this section we consider the special case of the model in section 4.1 when q is equal to one. This is exactly the same case as in section 3 and in Hovda (2014). Here, α_x = α_{x,1} and α_y = α_{y,1}. This means that the expected value and the variance σ^2(f̂_{H,d_t}(x)) can be approximated by inserting equation (12) into theorem 4.2 and using theorem 4.1. The variance σ^2(f̂_{H,d_t}(x)) is exactly the same as given in Epanechnikov (1969), when a uniform kernel is assumed.
When it comes to the bias, the results are different in several ways. The bias when β_x = 0 differs from the results of Epanechnikov (1969) in two ways. First, the ball where the kernel is nonzero has the shape of B_{d_{s,p}}(0, √3) and not the m-cube. The second difference is that the bias of the new estimator is dependent on the first derivative of f, which is not the case in Epanechnikov (1969).
However, the greatest difference is due to the fact that the bias decreases as a function of β_x, regardless of the first and second derivatives, when m is at least 2 and β_x is large compared to r_m(β_x). It is now clear that the bias vanishes rapidly as β_x increases above √3. This result complies with the Monte Carlo simulations that are conducted on various distributions in Hovda (2014).
Assuming large n, i.e. small bandwidths, we can compute the asymptotic mean integrated squared bias (AMISB) based on the approximations in equation (16).
Theorem 4.3. Let all parameters be the same as in lemma 4.3, and also let q be equal to one. We define u_1 through equation (19), where L is a chosen limit; then, for small |H|^{1/m}, the asymptotic mean integrated squared bias follows, as do the optimal |H|^{1/2} and the optimal AMISE. Notice that A(m, p, f_1) is independent of |H|, given that |H| is small enough; this is related to L, the accuracy parameter of the approximation. For instance, if we choose L to be 0.05, the bias in a region that includes the center and that has 0.05 probability is ignored in the estimation.
The important result here is that AMISE_opt(f̂_H) is proportional to n to the power of −4/5, i.e. the convergence order is 4/5. The convergence order is independent of the number of dimensions, which means that the new estimators defy the curse of dimensionality when q is equal to one and f is an associated distribution of f̂_{H,d_t}. It is also an important result that A(m, p, f_1) is straightforward to compute, which is relevant for finding pilot bandwidths as part of parameter estimation. For comparison, for the common kernel density estimators, H_AMISE and AMISE_opt are published in Epanechnikov (1969) as equation (20). Here, the convergence order of AMISE_opt(f̂_H) is 4/(4 + m), which is dependent on the number of dimensions.
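The practical gap between the two orders can be illustrated by asking how much n must grow to reduce AMISE by a factor of ten (an illustrative calculation of ours, not from the paper):

```python
# AMISE_opt decays like n^(-r), with r = 4/5 for the transmetric estimator
# (any m, q = 1) and r = 4/(4 + m) for the common kernel estimator. The
# factor by which n must grow to cut AMISE by a factor of 10 is 10^(1/r).
def growth_factor(rate):
    return 10.0 ** (1.0 / rate)

transmetric = growth_factor(4.0 / 5.0)                        # same for every m
common = {m: growth_factor(4.0 / (4.0 + m)) for m in (1, 2, 4, 8)}
```

For m = 8 the common estimator needs roughly a thousandfold increase in sample size for each tenfold AMISE reduction, against a factor of about 17.8 for the transmetric estimator.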

Graphical Description of B_1 and B_2
In order to understand the behavior of B_1(m, p, f_1) and B_2(m, p, f_1), we will discuss the multi-dimensional generalized normal distribution defined in equation (10). We let all elements of p be equal to p and the center be 0. The integral part of equation (24) is derived by substitution of variables to fit the definition of the upper incomplete gamma function Γ(•, •). The u_1 parameter is found by inserting f_1 into equation (19), where γ^{−1}(•, •) is the inverse of the lower incomplete gamma function.
In figure 2, log(B_1(m, p, f_1)) and log(B_2(m, p, f_1)) are plotted for various m and p, where the approximation limit L is set to 0.05 and 0.2. Based on the plots, it can be seen that for p larger than two, it is a fair simplification that B_1(m, p, f_1) and B_2(m, p, f_1) are exponential functions of m alone.
The effect of increasing L from 0.05 to 0.2 is minor, but can be seen as an upward shift of the values of log(B_1(m, p, f_1)) and log(B_2(m, p, f_1)). Using equations (17), (18) and (19), it can be determined for which population sizes these approximations are valid. In the case of L = 0.05, the minimum required population sizes vary between 4.4 × 10^6 and 5.8 × 10^7 when the number of dimensions ranges from two to eight. In the case of L = 0.2, the required population sizes are in the range of 1.5 × 10^5 to 3.2 × 10^5, when the number of dimensions also ranges from two to eight.
Proof of Theorem 2.1

This is recognized to be equal to the Riemann–Stieltjes integral. This integral coincides with the Riemann integral, because V_{B_{d_{t,V_t,u}}} is differentiable and the derivative is continuous. This concludes the first property. The second property is a result of the fact that g has compact support.

Proof of Theorem 2.2
The volume of the ball B_{d_{t,V_t,u}}(x, ϵ) is invariant to x. The proof is completed by noting that the criteria of topology, nonnegativity and mild identity are obvious.

Proof of Theorem 4.1
The variance is found by changing variables, Taylor expanding around x and neglecting all higher order terms.

Proof of lemma 4.1

We recognize that the kernel of f^∼_{H,d_t}(x) is a function of ϵ ∈ R^{q+}, where R^{q+} indicates the positive octant in R^q. Each level set is therefore uniquely determined by ϵ, thus f^∼_{H,d_t}(x) = g(ϵ).
The proof is completed by noting the fact that f^∼_{H,d_t}(x) must integrate to one.

Proof of lemma 4.2
We consider the special case when g(ϵ) is written as a product of univariate functions g_j(ϵ_j), and each g_j(ϵ_j) has the form of the exponential function a_j exp(−ϵ_j). In this case, the product of the coefficients a_j is found from equation (9) and theorem 2.1.
Proof of theorem 4.2

The parameters ψ_j and ϕ_j are manipulated into those defined in theorem 4.2 by inserting the volumes and changing variables to the β_{y,j} s.

Proof of lemma 4.3
This proof is straightforward, by applying uniform kernels to equation (11).

Proof of theorem 4.3
To keep the exposition simple, we have decided to exclude the third term in equation (14).

Figure 1. Generalized unit balls are shown in figures (a), (b) and (c). Balls in the pseudometric spaces (R^3, d_{p,0,p}) are shown in figures (d) and (e); here, one octant is removed. In figure (f), examples of the univariate generalized normal distribution are shown. This parametric family of symmetric unimodal distributions includes all normal and Laplace distributions, and as limiting cases it includes all continuous uniform distributions on bounded intervals of the real line. It includes a shape factor β and a scale factor α. Various choices of shape factor are shown, where the scale factor is one and the mean is zero.

Figure 2. log10(B_1) and log10(B_2) as functions of p and m. In the top row, the approximation limit L is set to 0.05, while in the bottom row L is set to 0.2. Increasing L from 0.05 to 0.2 can be seen as an upward shift of the values of log(B_1(m, p, f_1)) and log(B_2(m, p, f_1)). For p larger than two, a rule of thumb is that B_1(m, p, f_1) and B_2(m, p, f_1) are exponential functions of m.
