Properties of Transmetric Density Estimation

Transmetric density estimation is a generalization of kernel density estimation that was proposed in Hovda (2014) and Hovda (2016). This framework makes it possible to place assumptions on the kernel of the distribution in order to improve convergence orders and to reduce the number of dimensions in the graphical display. In this paper we show that several state-of-the-art nonparametric, semiparametric and even parametric methods are special cases of this formulation, meaning that there is a unified approach.

A generalization is the transmetric density estimator, which is described in Hovda (2016). This framework is motivated by the work in Hovda (2014), which involved using only one transmetric. In Hovda (2014), the bias and the variance are described using Monte Carlo simulations, while asymptotic arguments are given in Hovda (2016). In the latter paper, it is shown that the convergence order of the mean integrated squared error can be as high as 4/5. The important point is that this convergence order is independent of the number of dimensions.
For comparison, the convergence orders of common kernel density estimators are 4/(4 + m), where m is the number of dimensions (Epanechnikov, 1969). This improvement in accuracy over common kernel density estimation motivates the search for useful and practical applications of this theory.
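To make the comparison concrete, the common kernel density estimator can be sketched in a few lines of Python. This is an illustration only, not code from the paper: the function names are ours, and the uniform kernel on the Euclidean unit ball with a scalar bandwidth is one simple choice among many.

```python
import numpy as np
from math import gamma, pi

def unit_ball_volume(m):
    """Volume of the m-dimensional Euclidean unit ball."""
    return pi ** (m / 2) / gamma(m / 2 + 1)

def kde_uniform(x, data, h):
    """Common kernel density estimate at the point x, using the uniform
    kernel on the Euclidean unit ball and a scalar bandwidth h
    (i.e. H^{1/2} = h*I)."""
    n, m = data.shape
    # scaled distances ||(x - X_i)/h|| to every data point
    d = np.linalg.norm((x - data) / h, axis=1)
    # fraction of points inside the unit ball, normalized by the kernel volume
    return np.count_nonzero(d <= 1.0) / (n * h ** m * unit_ball_volume(m))
```

The m-dependence of the convergence order shows up in practice: as m grows, the ball contains ever fewer points for a fixed h, so the bandwidth must shrink more slowly with n.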
This paper contains two contributions in this direction. The first contribution is a dedicated section that shows how a number of state-of-the-art problems are special cases of transmetric density estimation. The examples include distributions with elliptic level sets (Liescher, 2005), linear regression, partially linear models (Härdle et al., 2000), projection pursuit models (Friedman et al., 1984) and an example from nonparametric functional data analysis (Ferraty and Vieu, 2006).
Based on these examples, it should be straightforward to see that other methods can be formalized in this way as well. This clarifies the relationship between methods and opens up new ways to combine them.
The second contribution is a description of how unbiased cross-validation can be used for parameter selection. The parameters in the model are the bandwidth matrix and the parameters related to the transmetric. In ordinary kernel density estimation, two classes of methods have received most attention, namely plug-in methods (Duong and Hazelton, 2003) and cross-validation methods (Duong and Hazelton, 2005). The plug-in methods require an analytic expression for the asymptotic mean squared error. The special cases that are outlined in Hovda (2016) are complex, and deriving such an expression is probably impossible in the general case.
In the case of cross-validation, there are more opportunities. In Duong and Hazelton (2005), the unbiased, the biased and the smoothed cross-validation methods are described and compared. The biased cross-validation method requires an expression for the asymptotic mean squared error and is therefore not suitable for transmetric density estimators. The unbiased cross-validation method is called unbiased since it is designed to minimize the mean integrated squared error and not its asymptotic version. The smoothed cross-validation method is similar, but more complex, as it involves finding a pilot bandwidth matrix. The main conclusion of that paper is that the smoothed cross-validation method is the most reliable. However, it also points out that the unbiased cross-validation method performs reliably on several distributions.

Table 1. List of important symbols

x_i: The i'th component of x.
X_i: The i'th data point of a data set.
X_{i,j}: The j'th component of X_i.
m: Number of dimensions.
q: Number of transmetrics.
{P_1, P_2, ..., P_q}: Partition of {1, 2, ..., m}.
m_j: Size of P_j, i.e. sum_{j=1}^{q} m_j = m.
H: Symmetric bandwidth matrix of size m x m.
x_j: The P_j elements of H^{-1/2} x.
y_j: The P_j elements of H^{-1/2} y.
X_{i,j}: The P_j elements of H^{-1/2} X_i.
c_j: Center point parameter in the j'th transmetric (not the center of the balls).
p_j: Vector of positive exponents in the j'th transmetric.
p̄_j: Harmonic average of the elements in p_j.
V_{p_j}: Volume of the generalized unit ball with parameters p_j.
V_{t,j}: Volume of the unit ball in the j'th transmetric space.
c: Concatenated vector of all the c_j's; size is m.
c_j: The j'th component of c, not any of the c_j's above.
p: Concatenated vector of all the p_j's; size is m.
p_j: The j'th component of p, not any of the p_j's above.
d_{type,j,par}: The j'th premetric of type "type" with parameters "par".
d_type: Tuple of q premetrics of type "type".
In this paper, we have chosen to develop an expression for the unbiased cross-validation method.This is because of its algorithmic simplicity and computational efficiency.
In section 2, the definition of transmetric density estimation and relevant theorems from Hovda (2016) are repeated. The relationship to other methods is described in section 3, while the unbiased cross-validation method is outlined and discussed using Monte Carlo simulations in section 4. The paper is concluded in section 6. Table 1 contains a list of symbols that are frequently used in this paper.

Transmetric Density Estimation
In Hovda (2016) the transmetric is defined as a function d_t which, for all x, y in R^m and ϵ in R^+, satisfies four conditions. The first two criteria are the definition of a premetric. A transmetric is therefore a premetric with the additional criteria of topology and translation invariance of ball volumes. All metrics that induce a norm are transmetrics, which covers practically all metrics that are used in common kernel density estimation.
The key idea of this definition is to generalize the metric that induces a norm as much as possible, while still making sense for kernel density estimation. The first two criteria ensure that close points (in the sense of the transmetric) are weighted higher than points that are further away. The third criterion implies that if the kernel is continuous, then so is the estimator. The last criterion in definition 2.1 ensures that the variance of the estimator is everywhere proportional to the density (Hovda, 2016). It is worth emphasizing that relaxing the identity criterion yields some opportunities for smoothing along level sets of the distribution.
Moreover, we also need the following definition. Definition 2.2. A kernel K is said to be associated to a transmetric space (R^m, d_t), or associated to a transmetric d_t, if K is nonnegative, monotonically decreasing with compact support, and satisfies the normalization below when the unit ball volume V_{B_{d_t}(x,1)} is finite, and the proportionality condition in the infinite case. When V_{B_{d_t}(x,1)} is infinite, it is not meaningful to set the integral equal to one, since the kernel would then approach zero everywhere. To elaborate on this, we define the equivalence relation ~_{d_t} as follows: for all x, y in R^m, x ~_{d_t} y iff d_t(x, y) = 0. It is clear that all equivalence classes in R^m with respect to ~_{d_t} are unbounded. To compensate for this unboundedness, we have decided to let the integral be proportional to V_{B_{d_t}(x,1)}. Another way to handle this would be to impose boundaries on the transmetric space (i.e. let it be a subset of R^m), such that all the balls B_{d_t}(x, 1) would be bounded.
Two useful examples of kernels arise when the transmetric is either the Euclidean or the Chebyshev distance; these give the kernels in equations (2) and (3), respectively. Transmetric density estimation is defined as follows.
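Since equations (2) and (3) are not reproduced in this extract, the following sketch assumes one plausible reading of the two examples: uniform kernels that are constant on the respective unit balls and normalized by the ball volumes. The function names are ours.

```python
import numpy as np
from math import gamma, pi

def uniform_kernel_euclidean(u, m):
    """Uniform kernel associated with the Euclidean distance on R^m:
    constant on the unit ball and normalized to integrate to one."""
    v_ball = pi ** (m / 2) / gamma(m / 2 + 1)   # volume of the Euclidean unit ball
    u = np.atleast_2d(np.asarray(u, dtype=float))
    return np.where(np.linalg.norm(u, axis=1) <= 1.0, 1.0 / v_ball, 0.0)

def uniform_kernel_chebyshev(u, m):
    """Uniform kernel associated with the Chebyshev (max) distance on R^m:
    the unit ball is the cube [-1, 1]^m, whose volume is 2^m."""
    u = np.atleast_2d(np.asarray(u, dtype=float))
    return np.where(np.abs(u).max(axis=1) <= 1.0, 1.0 / 2 ** m, 0.0)
```

Both kernels are nonnegative, monotonically decreasing (weakly) and have compact support, as required by definition 2.2.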
Definition 2.3. Let {P_1, P_2, ..., P_q} be a partition of {1, 2, ..., m}, where m_j = |P_j|. Let x_j, y_j and X_{i,j} be vectors that, respectively, consist of the P_j elements of H^{-1/2} x, H^{-1/2} y and H^{-1/2} X_i. If we choose a tuple of transmetrics d_t such that for every P_j there is a kernel K_j associated with the transmetric space (R^{m_j}, d_{t,j}(x_j, y_j)), then the transmetric density estimator is defined accordingly, where f(y) is equal to 1/n at the X_i's and 0 elsewhere.
The idea is to weight the contributions of the X_{i,j}'s according to the distances in the transmetric spaces. It is important to note that when any of the transmetrics has an unbounded unit ball, f̂_{H,d_t}(x) is only proportional to f(x). As noted above, this is because the whole of R^m is considered, rather than a subset on which all transmetric spaces have bounded unit balls.
In the trivial case, all P_j are equal to {j} and, for all j, d_{t,j}(x_j, y_j) = d_j = |x_j − y_j|. In this case f̂_{H,d_t}(x) = f̂_{H,d}(x), where d is the tuple of the d_j's. If we let the product of the K_j(u_j)'s over j = 1, ..., m equal K(u), then we arrive at the estimator defined in equation (1). It is easy to see how any metric that defines a normed vector space can be used, but the interesting question is what other options we have. Which transmetrics make sense to use, and which kernels can be associated to them? It is of particular interest to investigate the opportunities for relaxing the identity criterion. This motivates the definition of associated distributions. Definition 2.4. A probability density function is said to be an associated distribution of f̂_{H,d_t}, denoted f~_{H,d_t}: R^m → R^+, when x_j ~_{d_{t,j}} y_j for all j implies that f~_{H,d_t}(x) = f~_{H,d_t}(y) for all x, y in R^m. Here, the equivalence relation ~_{d_{t,j}} is defined as follows: for all x_j, y_j in R^{m_j}, x_j ~_{d_{t,j}} y_j iff d_{t,j}(x_j, y_j) = 0.
While discussing the usefulness of a family of transmetrics, it makes sense to ask which density functions are invariant to the equivalence relations that are implied by the transmetrics.
Another point to note is that the graphical display of the estimated density can be reduced to a q-dimensional display. This is because the estimated density at a certain point depends only on the q transmetric distances. All the transmetric spaces that will be discussed in this paper have the following property regarding how the volumes of the balls vary as a function of the radius.
This property is useful as it makes it easy to find associated kernels.
Theorem 2.1. Given a transmetric space (R^m, d_{t,V_t,u}) of order u, then for any function g: R^+ → X, where X is a subset of R, that is bounded and continuous almost everywhere, the following integral property is valid. A direct consequence of this is that the normal and the uniform associated kernels of d_{t,V_t,u} follow when V_t is finite; in the infinite case, equations (2) and (3) can be used.
An important source for finding transmetrics is pseudometrics. In general, the volumes of balls with constant radius in pseudometric spaces depend on the locations of their centers. This means that pseudometrics cannot be used directly. However, the following theorem shows how transmetrics can be designed from pseudometrics that induce a topology on R^m,
where V_t is chosen to be the volume of the unit ball of the transmetric space (R^m, d_{t,V_t,u}).
Proofs of theorems 2.1 and 2.2 are found in Hovda (2016). To fix ideas, the framework described in this section is required to discuss the transmetrics of Hovda (2014). Moreover, the next section demonstrates that many state-of-the-art problems can be described using this framework.

Relationship to State-of-the-art Problems
The purpose of this section is to describe the relationship to other methods and to show the flexibility of modeling problems with transmetric density estimators.The discussion is far from extensive, but the selection is chosen to show some of the generality of definition 2.3.

Distributions with Elliptic Level Sets
In a paper by Liescher (2005), the level sets of the distributions are constrained to be of elliptic shape. The elliptic shape of the level sets is enforced by a transformation that is performed prior to applying the nonparametric density estimator. The estimators in Hovda (2014) and Hovda (2016) can be viewed as a generalization of the estimators of Liescher (2005), because a wider family of distributions is allowed. In Liescher (2005), it is shown that the convergence rates are independent of the number of dimensions, except in the neighborhood of the mode. This result complies with the results in Hovda (2014) and Hovda (2016).

Nonparametric Functional Data Analysis
As said in the introduction, pseudometrics are commonly used in nonparametric functional data analysis. A function can be viewed as a point in an infinite-dimensional space. It is worth noting that the number of dimensions in definition 2.3 can approach infinity, but only countably so. This is less general than the uncountably infinite number of dimensions that is needed when the domain of the functions is, for instance, R. In this respect, definition 2.3 is a specialization of what is common in nonparametric functional data analysis. There are two reasons why this specialization is acceptable. First, it has little practical implication, as functions are usually discretized. Second, it is probably possible, although outside the scope of this paper, to generalize definition 2.3 to also include variables of uncountably infinite dimension.
In chapter 3.4.1 of the book by Ferraty and Vieu (2006), some finite-dimensional pseudometrics are given. As an example, we mention a pseudometric that is based on the first c principal components of the data, where the discretized representations of the functions x and y have m entries each and the w_j's are the weights that define the approximate integration. Moreover, v_{ij} is the j'th coordinate of the i'th eigenvector of the covariance matrix of a relevant dataset.
This is a transmetric of order c, where V_t is clearly infinite and V_t = lim_{L→∞} L^{m−c} V_c, where V_c is the volume of the c-sphere. The c-dimensional kernel as defined in equation (2) is an associated kernel. If we choose H^{1/2} = hI, the estimator in our notation is analogous to what is found in Ferraty and Vieu (2006). It is worth noting that the associated distributions of this transmetric are the family of all functions that can be described as a linear combination of the first c eigenvectors.
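The principal-component pseudometric described above can be sketched directly from the prose: project the difference between two discretized functions onto the first c eigenvectors, using quadrature weights w_j, and take the Euclidean norm of the c scores. The function and argument names here are ours, not from the paper or from Ferraty and Vieu (2006).

```python
import numpy as np

def pca_pseudometric(x, y, eigvecs, weights, c):
    """Pseudometric based on the first c principal components.
    eigvecs[i] holds the coordinates v_{ij} of the i'th eigenvector of the
    covariance matrix of a reference dataset; weights are the quadrature
    weights w_j that define the approximate integration."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    # score of the difference along each of the first c eigenvectors
    scores = np.array([np.sum(weights * eigvecs[i] * diff) for i in range(c)])
    return float(np.sqrt(np.sum(scores ** 2)))
```

The pseudometric property is visible in the code: any difference x − y that is orthogonal to the first c eigenvectors has distance zero, which is exactly why the associated distributions are linear combinations of those eigenvectors.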
Other pseudometrics that are discussed in Ferraty and Vieu (2006) are based either on partial least squares or on the L_2-norm of the derivatives of some order. Without derivations, it should be clear that the finite-dimensional versions of these pseudometrics are also transmetrics.

Linear Regression
We define a transmetric space (R^m, d_t), which is a transmetric of order one with V_t = lim_{L→∞} 2L^{m−1}. The one-dimensional uniform kernel taken from equation (3) is associated with d_t. If we choose H appropriately, the regression model is found by investigating the expected value of x_1 given x_2, x_3, ..., x_m, which leads to the linear regression model. The intercept is zero when the data points are divided by the individual sample means. Obviously, this result has no effect on the parameter selection method, which could for instance be minimizing the squared residuals. Based on this, it should be clear that other parametric regression models, such as polynomial regression, can also be described in the sense of transmetric density estimation.

Partially Linear Models
The partially linear model consists of a linear part and a nonparametric part g(X_{i,m_1+1}, X_{i,m_1+2}, ..., X_{i,m}). The data points are assumed to be i.i.d. and E(u_i | X_{i,2}, X_{i,3}, ..., X_{i,m}) = 0. An estimator for g(x_{m_1+1}, x_{m_1+2}, ..., x_m) is found by investigating the conditional expectation. We make an estimate of g, denoted ĝ, by inserting an estimator for f(x). We choose a tuple of transmetrics d_t, where d_{t,1} is an m_1-dimensional version of equation (8), associated with an m_1-dimensional kernel as in equation (3). The other transmetrics are one-dimensional and associated with one-dimensional kernels.

Projection Pursuit
In the context of reducing the curse of dimensionality of nonparametric density estimators, it is also worth mentioning the projection pursuit density estimators. This method was first introduced in Friedman, Stuetzle and Schroeder (1984), and a parametric extension was given by Welling, Zemel and Hinton (2003). In projection pursuit, one projects the explanatory variables onto principal directions and fits one-dimensional smooth density functions to these projections. The resulting density is the product of these densities.
The analogous projection pursuit density estimator can be described as a product of transmetric density estimators, where x_j is a vector in the projected space which contains only a subset of the coordinates. In the special case when each d_{t,j} is a single one-dimensional transmetric, the usual projection pursuit model appears.
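The product structure described above can be sketched as follows. This is a simplified illustration: real projection pursuit chooses the directions by optimizing a projection index, whereas here the directions are simply given, and a one-dimensional Gaussian kernel is used for convenience.

```python
import numpy as np

def pp_density(x, data, directions, h):
    """Projection pursuit style density sketch: project the data onto each
    given direction, form a one-dimensional Gaussian kernel density estimate
    of every projection, and return the product of these densities at x."""
    dens = 1.0
    for a in directions:              # each a is a unit vector in R^m
        proj = data @ a               # one-dimensional projections of the data
        t = x @ a                     # projection of the evaluation point
        # one-dimensional Gaussian kernel density estimate at t
        k = np.exp(-0.5 * ((t - proj) / h) ** 2) / (h * np.sqrt(2 * np.pi))
        dens *= k.mean()
    return dens
```

When the directions are the coordinate axes, this reduces to a product of marginal density estimates, which is exact for distributions with independent components.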
A similarity between transmetric density estimators and projection pursuit density estimators is that the resulting density can be visualized in a lower-dimensional space. In projection pursuit, it is enough to show the one-dimensional ridge functions along with the principal directions to understand the full distribution.

Parameter Estimation by Cross-validation
The global error criterion to be minimized in the unbiased cross-validation method is the mean integrated squared error (MISE). This expression is also known as the L_2 risk function. The unbiased cross-validation (UCV) method aims to minimize MISE and employs the objective function in equation (11), in which a leave-one-out estimator of f appears. The function UCV(f̂_{H,d_t}) is unbiased in the sense that the expected value of UCV(f̂_{H,d_t}) is equal to MISE(f̂_{H,d_t}) − R(f). Here, R(f) is the L_2-norm of f. The second term of equation (11) can be expanded into a form that is straightforward to treat computationally. The first term of equation (11) can be expanded into equation (13). In the case when i is equal to j, equation (13) becomes equation (14), in which R(K_k) appears. If all transmetrics in d_t have an order, theorem 2.1 can be used to obtain analytical expressions. Notice that the second part of equation (14) is an expression for the variance part of MISE. The challenge in estimating UCV(f̂_{H,d_t}) is to estimate the c_{ijk}'s in equation (14). Two special cases are treated below.

Special Case When Transmetrics are of the Type Described in Hovda (2014)
The result is stated as equation (15). Proof. We start by defining α_{x,k}, and since α_{x,k} is a transmetric of order m_k, we can apply theorem 2.1. The integral is thereby reduced to one dimension. The proof is concluded by a change of variables. This means that the arguments in the kernels can be expressed through the U_{m_k} function.
The integral is now one-dimensional, but unfortunately it is not trivial to solve for arbitrary m_k. For most choices of kernel, we must rely on numerical solutions. In those cases, it is only necessary to integrate in regions that are close to β_{x,k,i} and β_{x,k,j}.
If we define these two functions, choose a uniform kernel and let V_{t,k} = 2^{m_k}, then equation (15) becomes computationally tractable. This is because each β_{x,k,i} and each r_{m_k}(β_{x,k,i}) can be pre-computed before calculating the double sums in equation (14). Note that C_{1,m_k}(β_{x,k,i}, β_{x,k,j}) is bounded by one, with the bound attained when β_{x,k,i} = β_{x,k,j}.
To summarize, the estimator of UCV for the uniform kernel follows, and it can be reduced further. It is worth noting that C_{1,s} is generally smaller than one and that it takes values other than zero more often than C_{2,s}. It is also worth noting that the two points β_{x,k,i} and β_{x,k,j} contribute maximally to decreasing UCV when β_{x,k,j} = β_{x,k,i} ± r_{m_k}(β_{x,k,i}). By contrast, when β_{x,k,j} is slightly more than β_{x,k,i} + r_{m_k}(β_{x,k,i}) or slightly less than β_{x,k,i} − r_{m_k}(β_{x,k,i}), these two points contribute maximally to increasing UCV. This effect is also seen for UCV in common kernel density estimation when the kernel is uniform.

Special Case when Transmetrics are Metrics
It is straightforward to see that UCV for the common kernel density estimator with a uniform kernel, as in Duong and Hazelton (2005), is given by the analogous expression of the generalized method.
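The paper's uniform-kernel expression is not reproduced in this extract, but the generic UCV objective it specializes, UCV(h) = ∫ f̂² dx − (2/n) Σ_i f̂_{−i}(X_i), can be sketched for a one-dimensional Gaussian-kernel estimator, where both terms have closed forms. This is an illustration of the UCV principle, not the paper's estimator.

```python
import numpy as np

def phi(u, s):
    """Gaussian density with mean zero and standard deviation s."""
    return np.exp(-0.5 * (u / s) ** 2) / (s * np.sqrt(2 * np.pi))

def ucv(data, h):
    """Unbiased cross-validation score for a one-dimensional Gaussian-kernel
    density estimator: an unbiased estimate of MISE(f_hat) - R(f)."""
    x = np.asarray(data, dtype=float)
    n = x.size
    d = x[:, None] - x[None, :]                       # pairwise differences
    # integral of f_hat^2: Gaussian convolution gives sd h*sqrt(2)
    term1 = phi(d, np.sqrt(2) * h).sum() / n ** 2
    # leave-one-out sum: exclude the diagonal (i = j) terms
    off = phi(d, h)
    np.fill_diagonal(off, 0.0)
    term2 = 2.0 * off.sum() / (n * (n - 1))
    return term1 - term2
```

A bandwidth is then selected by minimizing `ucv` over a grid of candidate values of h, which mirrors how the bandwidth matrix and transmetric parameters would be selected in the generalized method.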

Comparison of Transmetric Density Estimation and Kernel Density Estimation Using Monte Carlo Simulations
This section focuses on comparing transmetric density estimation with common kernel density estimation. In particular, AMISE is compared to optimal MISE, and optimal UCV is compared to optimal MISE minus R(f). The kernels are uniform and the data sets are drawn from multi-dimensional normal distributions, where the number of dimensions and the population sizes are varied.
The procedure for the MISE calculations is straightforward. A region R_f of volume V_{R_f} is identified, where f is assumed to be close to zero outside this region. For fixed f, H and d_t, the following is repeated n_MSE times. In iteration i, n data points are drawn randomly from f and placed in a dataset X_i. A single sample T_i is drawn from a uniform distribution on R_f. The estimator f̂_i is calculated based on the dataset X_i, and the estimator of MISE is the scaled average of the squared errors. In experiment one, f̂_H is computed and in experiment two, f̂_{H,d_t} is computed. To summarize, MISE is estimated by averaging a number of mean squared error (MSE) estimates at random points taken uniformly within R_f. The relevant parameters are listed in tables 2 and 3.
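The Monte Carlo procedure above can be sketched directly, here in one dimension with a uniform-kernel estimator and a standard normal target density. The specific choices (one dimension, the region [-4, 4], the helper names) are ours, made to keep the sketch self-contained.

```python
import numpy as np
from math import sqrt, pi, exp

rng = np.random.default_rng(0)

def kde_uniform_1d(t, data, h):
    """One-dimensional uniform-kernel density estimate at the point t."""
    return np.count_nonzero(np.abs(data - t) <= h) / (len(data) * 2 * h)

def normal_pdf(t):
    """Standard normal density, the known true density f."""
    return exp(-0.5 * t * t) / sqrt(2 * pi)

def estimate_mise(h, n, n_mse, lo=-4.0, hi=4.0):
    """Monte Carlo MISE estimate following the procedure in the text: in each
    of n_mse iterations, draw a dataset of size n from f, draw one test point
    T_i uniformly on the region R_f = [lo, hi], and average the squared
    errors, scaled by the region volume V_Rf."""
    v_rf = hi - lo
    sq = []
    for _ in range(n_mse):
        data = rng.standard_normal(n)        # dataset X_i drawn from f
        t = rng.uniform(lo, hi)              # single uniform sample T_i
        sq.append((kde_uniform_1d(t, data, h) - normal_pdf(t)) ** 2)
    return v_rf * float(np.mean(sq))
```

Repeating this for a grid of bandwidths and population sizes reproduces the kind of MISE-versus-n curves shown in figures 1 and 2.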

Results:
The result of experiment one is shown in figure 1. Here, the common kernel density estimator is evaluated. It is clear that H_AMISE overestimates H_MISE for small n, but this effect is smaller for larger n. Since the graphs are on a log scale, the shrinking gap between them means that the normalized experimental error |H_AMISE − H_MISE| / H_MISE is also decreasing.
Moreover, it seems that AMISE_opt is an overestimate of MISE_opt, but the normalized experimental error is reduced with increased n. The reason for this is that H_AMISE overestimates H_MISE on this distribution.
The fact that MISE(H_AMISE) seems to lie between MISE_opt and AMISE_opt is a verification of the approximations given in equation (20) in Hovda (2016). It is also worth noting that this experiment verifies the convergence orders of the common kernel density estimators. On a log scale, the convergence orders are proportional to the slopes of the graphs, and it is clear that the slopes for AMISE are similar to those for MISE. It is also clear that the slopes level out as the number of dimensions increases. This is expected.
The result of experiment two is shown in figure 2. Here, the transmetric density estimator is evaluated. Similar to the result ...

Experiments on cross-validation
In this section, the optimal UCV(f̂_{H,d_t}) and UCV(f̂_H) are calculated for various choices of bandwidths. Again, the multi-dimensional normal distributions are considered.
The optimal bandwidth H_UCV is the bandwidth that minimizes UCV, giving UCV_opt. The mean integrated squared error that corresponds to H_UCV is denoted MISE_UCV. For a given population size n, each search for H_UCV is repeated n_MIN times. A distribution of estimates of H_UCV is therefore available. MISE_UCV is found by calculating MISE based on the average of n_MSE MSE calculations, with the bandwidths chosen randomly from the H_UCV distribution.
In the experiments, the distributions are normal and known, and therefore UCV + R(f) is compared to the estimates of MISE that are found in experiments one and two. The parameters of experiments three and four are listed in table 4.

Results
When using transmetric density estimation on multi-dimensional normal distributions, the convergence order of the mean integrated squared error, away from the mode, approaches 4/5 for large population sizes. This result is independent of the number of dimensions and complies with the asymptotic arguments in Hovda (2016). Moreover, it is shown that the parameters can be trained using unbiased cross-validation. However, the convergence order is slower for the transmetric density estimator when the number of dimensions is small.
The result of experiment three is shown in figure 3. It is clear that using UCV on the common density estimator converges for increasing n in all dimensions. The fact that the gaps between the quartiles of H_UCV and H_MISE decrease on the log scale indicates that the normalized experimental error also decreases for increasing n. Moreover, it is seen that the expected value of UCV(f̂_H) + R(f) converges towards MISE(f̂_H) as n increases. Notice that the normalized experimental error of MISE(H_UCV) is decreasing as a function of n.
The result of experiment four is shown in figure 4. Here, UCV of the transmetric density estimator is shown. Clearly, the method seems to pick too small bandwidths, and this effect is most evident in two dimensions. The effect increases with n. This may not be too surprising, since this method coincides with the common kernel density estimator for small n.
It is not completely clear whether the expected value of UCV(f̂_H) + R(f) converges towards MISE(f̂_H) or not. On the one hand, since all graphs are decreasing everywhere on these log plots, it is clear that the experimental error is decreasing as a function of n. Moreover, in two and four dimensions the normalized experimental error seems constant, which is an important indication of convergence. However, this is not the case in eight dimensions, which could be a sign that this will level out only for very large n.
However, it is encouraging that this effect is not seen in MISE(H_UCV); there, the normalized experimental error seems constant everywhere. It is worth noting that the poorer performance of UCV on the transmetric density estimator is somewhat surprising, and the possibility of an error in the computer program must always be considered. To mitigate this, the author has implemented the code in both Java and Matlab, with identical results.
The right plot in figure 5 summarizes the cross-validation experiments. The ratio of the two methods is shown, where parameter estimation is part of the methods. In general, some of the gain that is achieved by using transmetrics is lost through the poorer performance of the cross-validation method. In fact, in two dimensions, the common kernel density estimator works better when parameter estimation is part of the equation. However, in higher dimensions, the transmetric density estimator seems superior.

Conclusion
A great variety of state-of-the-art problems can be defined using transmetric density estimation. This has clarified how methods relate to each other and has opened up new ways in which they can be combined. Moreover, unbiased cross-validation is possible even when the distribution is not an associated distribution.
Using Monte Carlo simulations, it is shown that parameters such as the scaling of the bandwidth matrix can be estimated using unbiased cross-validation. Although the method seems to underestimate the bandwidth in two dimensions, it seems appropriate when the number of dimensions is higher. The experimental error of the unbiased cross-validation method seems constant with increasing population size. Moreover, the Monte Carlo simulations have verified the asymptotic properties that were outlined in Hovda (2016).

Figure 1 .
Figure 1. Result of experiment one, where the common kernel density estimator is evaluated. The left column shows log plots of AMISE_opt, MISE(H_AMISE) and MISE_opt as functions of the population size n. The distributions vary in the number of dimensions, but are all standard normal. The right column shows the corresponding optimal bandwidths in the sense of MISE and AMISE. The plots clearly show that optimal AMISE and optimal MISE decrease approximately as n^{-4/(4+m)}, while the optimal bandwidths are proportional to n^{-1/(4+m)}.

Figure 2 .
Figure 2. Result of experiment two, where the transmetric density estimator is evaluated. The left column shows log plots of AMISE_opt, MISE(H_AMISE) and MISE_opt as functions of the population size n. The distributions vary in the number of dimensions, but are all standard normal. The right column shows the corresponding optimal bandwidths in the sense of MISE and AMISE. The plots clearly show that optimal AMISE and optimal MISE decrease approximately as n^{-4/5}, while the optimal bandwidths are proportional to n^{-1/5}.


Table 2. Parameters that are common for all experiments (including the approximation limit L = 0.05).

Monte Carlo Simulations of MISE and AMISE in the Case When q = 1
From equation (19) in Hovda (2016), we have analytical expressions for the optimal AMISE of f̂_{H,d_t}, AMISE_opt(f̂_{H,d_t}), with the corresponding optimal bandwidth H_AMISE(f̂_{H,d_t}). Equation (20) in the same article gives analytical expressions for AMISE_opt(f̂_H) and H_AMISE(f̂_H). All these expressions are given as functions of n. In order to verify these approximations, we have chosen to simulate MISE when |H|^{1/2} = H_AMISE. This value is denoted MISE(H_AMISE) and it is compared with AMISE_opt. Moreover, we have also estimated the minimal MISE, that is, MISE_opt.

Table 4. Parameters in experiments three and four, Hovda (2016).

Moreover, the normalized experimental errors of both H_AMISE and AMISE_H improve for increasing n. This is expected, because the restriction in equation (18) in Hovda (2016) suggests that the approximations of AMISE are only valid for n larger than 10^6. Unfortunately, limited computational power has restricted us from evaluating larger population sizes. However, the convergence of MISE_H towards AMISE_H lends credibility to the approximations of equation (19) in Hovda (2016). Another verification is that MISE(H_AMISE) seems to follow AMISE_opt. It is worth commenting that for the smallest population sizes, MISE_opt(f̂_{H,d_t}) is similar to MISE_opt(f̂_H) and H_MISE(f̂_{H,d_t}) is similar to H_MISE(f̂_H). This is because the two methods coincide when n is small. However, MISE_opt(f̂_{H,d_t}) is substantially smaller than MISE_opt(f̂_H) for larger n. This is seen in the left subfigure of figure 5, where the ratio of MISE_opt(f̂_{H,d_t}) to MISE_opt(f̂_H) is plotted. In eight dimensions, when n = 100000, the MISE of the transmetric density estimator is 20 times smaller.