Maximum likelihood estimation of a multivariate log-concave density

Density estimation is a fundamental statistical problem. Many methods are either sensitive to model misspecification (parametric models) or difficult to calibrate, especially for multivariate data (nonparametric smoothing methods). We propose an alternative approach using maximum likelihood under a qualitative assumption on the shape of the density, specifically log-concavity. The class of log-concave densities includes many common parametric families and has desirable properties. For univariate data, these estimators are relatively well understood, and are gaining in popularity in theory and practice. We discuss extensions for multivariate data, which require different techniques. After establishing existence and uniqueness of the log-concave maximum likelihood estimator for multivariate data, we see that a reformulation allows us to compute it using standard convex optimization techniques. Unlike kernel density estimation, or other nonparametric smoothing methods, this is a fully automatic procedure, and no additional tuning parameters are required. Since the assumption of log-concavity is non-trivial, we introduce a method for assessing the suitability of this shape constraint and apply it to several simulated datasets and one real dataset. Density estimation is often one stage in a more complicated statistical procedure. With this in mind, we show how the estimator may be used for plug-in estimation of statistical functionals. A second important extension is the use of log-concave components in mixture models. We illustrate how we may use an EM-style algorithm to fit mixture models where the number of components is known. Applications to visualization and classification are presented. In the latter case, improvement over a Gaussian mixture model is demonstrated. Performance for density estimation is evaluated in two ways. 
Firstly, we consider Hellinger convergence (the usual metric of theoretical convergence results for nonparametric maximum likelihood estimators). We prove consistency with respect to this metric and heuristically discuss rates of convergence and model misspecification, supported by empirical investigation. Secondly, we use the mean integrated squared error to demonstrate favourable performance compared with kernel density estimates using a variety of bandwidth selectors, including sophisticated adaptive methods. Throughout, we emphasise the development of stable numerical procedures able to handle the additional complexity of multivariate data.


About this document
This document is an introduction to the R package LogConcDEAD (log-concave density estimation in arbitrary dimensions) based on Cule et al. (2009). It aims to provide a detailed user guide based on simple, reproducible worked examples. This package is available from the Comprehensive R Archive Network at http://CRAN.R-project.org/package=LogConcDEAD.
LogConcDEAD depends on MASS (Venables and Ripley 2002) for some vector operations and geometry (Grasman and Gramacy 2008) for convex hull computation. The package rgl (Adler and Murdoch 2007) is recommended for producing graphics.
This document was created using Sweave (Leisch 2002) and LaTeX (Lamport 1994) using R (R Development Core Team 2008). This means that all of the code has been checked by R, and can be reproduced exactly by setting an appropriate seed (as given at the beginning of each example), or tested on different examples by using a different seed.

Log-concave density estimation
We address the fundamental statistical problem of estimating a probability density function f_0 from independent and identically distributed observations X_1, ..., X_n taking values in R^d.
If a suitable parametric model is available, a common method is to use maximum likelihood to estimate the parameters of the model. Otherwise, a standard nonparametric approach is based on kernel density estimation (Wand and Jones 1995), implemented in the R function density. In common with many nonparametric methods, kernel density estimation requires the careful specification of a smoothing parameter. For multivariate data, the smoothing parameter is a bandwidth matrix with up to d(d + 1)/2 entries to choose, meaning that this method can be especially difficult to apply in practice.
An alternative to kernel density estimation and other estimation techniques based on smoothing (all of which require the selection of a smoothing parameter, which is nontrivial especially in the multivariate case) is to impose a qualitative shape restriction on the density. If the shape restriction is suitable, there is enough structure to guarantee the existence of a unique and fully automatic maximum likelihood estimate, even though the class of densities may be infinite-dimensional. This avoids both the restrictions of a parametric model and the difficulty of bandwidth selection in kernel density estimation. The price is some restriction on the shape of the density; however, these restrictions are less severe than those imposed by a parametric model. Shape-constrained maximum likelihood dates back to Grenander (1956), who treated monotone densities in the context of mortality data. Recently there has been considerable interest in alternative shape constraints, including convexity, k-monotonicity and log-concavity (Groeneboom et al. 2001; Dümbgen and Rufibach 2009; Balabdaoui and Wellner 2007). However, these works have all focused on the case of univariate data.

Log-concave densities
A function h : R^d → [−∞, ∞) is concave if
\[ h(\lambda x + (1 - \lambda) y) \ge \lambda h(x) + (1 - \lambda) h(y) \]
for all x, y ∈ R^d and λ ∈ (0, 1). This corresponds to what Rockafellar (1997) calls a proper concave function. We say a probability density function f is log-concave if log f is a concave function. Several common parametric families of univariate densities are log-concave, such as Gaussian, logistic and Gumbel densities, as well as Weibull, Gamma and Beta densities for certain parameter values (An 1998). In fact, Cule et al. (2010) showed that even though the class of multivariate log-concave densities is large (infinite-dimensional), it still retains some of the simple and attractive properties of the class of Gaussian densities.
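As a quick numerical illustration (not part of the package), the following base R sketch checks the midpoint form of the concavity inequality for the standard normal log-density over a grid of pairs of points:

```r
# Illustration only: verify the midpoint concavity inequality
#   log f((x + y) / 2) >= (log f(x) + log f(y)) / 2
# for the standard normal density, whose log-density is -x^2/2 - log(2*pi)/2.
logf <- function(x) dnorm(x, log = TRUE)
grid <- seq(-3, 3, by = 0.5)
pts <- expand.grid(x = grid, y = grid)
ok <- logf((pts$x + pts$y) / 2) >= (logf(pts$x) + logf(pts$y)) / 2 - 1e-12
all(ok)  # TRUE: the Gaussian is log-concave
```

The small tolerance 1e-12 guards against floating-point rounding when x = y, where the inequality holds with equality.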
One-dimensional log-concave density estimation via maximum likelihood is discussed in Dümbgen and Rufibach (2009); computational aspects are treated in Rufibach (2007). It is in the multivariate case, however, where kernel density estimation is more difficult and parametric models less obvious, that a log-concave model may be most useful.
Theoretical and computational aspects of multivariate log-concave density estimation are treated in Cule et al. (2010). In particular, it is proved that if Y_1, ..., Y_m are (distinct) independent and identically distributed observations from a distribution with log-concave density f_0 on R^d, then (with probability 1) there is a unique log-concave density f̂_m satisfying
\[ \hat{f}_m = \mathop{\mathrm{argmax}}_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^m \log f(Y_i), \tag{1} \]
where \mathcal{F} is the class of all log-concave densities on R^d. Further, it is shown that this infinite-dimensional maximization problem can be reduced to one of maximizing over functions of the form h̄_y for some y = (y_1, ..., y_m) ∈ R^m, where
\[ \bar{h}_y(x) = \inf\{ h(x) : h \text{ is concave}, \ h(Y_i) \ge y_i \text{ for } i = 1, \ldots, m \}. \tag{2} \]
As discussed in Cule et al. (2010), we may think of h̄_y as the function obtained by placing a pole of height y_i at Y_i and stretching a rubber sheet over the top of the poles.
Therefore, to completely specify the maximum likelihood estimator, we need only specify a suitable vector y ∈ R^m, as this defines the entire function h̄_y. A main feature of the LogConcDEAD package is that it provides an iterative algorithm for finding such an appropriate vector y.
From our knowledge of the structure of functions of the form (2), we may deduce some additional properties of f̂_m. It is zero outside the convex hull of the data, and strictly positive inside the convex hull. Moreover, we can find a triangulation of the convex hull into simplices (triangles when d = 2, tetrahedra when d = 3, and so on) such that log f̂_m is affine on each simplex (Rockafellar 1997).
In practice our observations will be made only to a finite precision, so the observations will not necessarily be distinct. However, the same method of proof shows that, more generally, if X_1, ..., X_n are distinct points in R^d and w_1, ..., w_n are strictly positive weights satisfying \sum_{i=1}^n w_i = 1, then there is a unique log-concave density f̂_n, which is of the form f̂_n = exp(h̄_y) for some y ∈ R^n, and which satisfies
\[ \hat{f}_n = \mathop{\mathrm{argmax}}_{f \in \mathcal{F}} \sum_{i=1}^n w_i \log f(X_i). \tag{3} \]
The default case w_i = 1/n corresponds to the situation described above, and is appropriate for most situations. However, the generalization (3) obtained by allowing w_i ≠ 1/n allows us to extend to binned observations. In more detail, if Y_1, ..., Y_m are independent and identically distributed according to a density f_0, and distinct binned values X_1, ..., X_n are observed, we may construct a maximum likelihood problem of the form given in (3), setting
\[ w_i = \frac{1}{m} \, \#\{ j : X_i \text{ is the binned value of } Y_j \}. \]
This generalization may also be used for a multivariate version of a log-concave EM algorithm (Chang and Walther 2007, also discussed in Cule et al. 2010).
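To make the weighting concrete, here is a small base R sketch (the data and the bin width are invented for illustration) that computes the distinct binned values X_i and the weights w_i from raw observations Y_1, ..., Y_m:

```r
# Hypothetical raw observations Y_1, ..., Y_m, binned to one decimal place
Y <- c(0.11, 0.12, 0.53, 0.54, 0.55, 0.98)
binned <- round(Y, 1)              # each Y_j replaced by its binned value
tab <- table(binned)
X <- as.numeric(names(tab))        # distinct binned values X_1, ..., X_n
w <- as.numeric(tab) / length(Y)   # w_i = #{j : Y_j is binned to X_i} / m
sum(w)  # the weights sum to 1, as required
```

Passing x = X and w = w to mlelcd then sets up the weighted problem (3).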

Outline of the remainder of this document
In Section 2, we outline the algorithm used to compute the maximum likelihood estimator, including the various parameters used in the computation. This is essentially an adaptation of Shor's r-algorithm (Shor 1985), implemented as SolvOpt by Kappel and Kuntsevich (2000), and depends on the Quickhull algorithm for computing convex hulls (Barber et al. 1996). This section may be skipped on first reading.
In Section 3, we demonstrate the main features of the package through four simple examples (one with d = 1, two with d = 2 and one with d = 3). This section includes a description of all of the parameters used, as well as the output structures. We also introduce the plotting functions available, as well as functions for sampling from the density estimate and for evaluating the density at a particular point.

Introduction
Recall that the maximum likelihood estimator f̂_n of f_0 may be completely specified by its values at the observations X_1, ..., X_n. Writing C_n for the convex hull of the data, Cule et al. (2010) showed that the problem of computing the estimator may be rephrased as one of finding
\[ \mathop{\mathrm{argmin}}_{y \in \mathbb{R}^n} \sigma(y), \quad \text{where} \quad \sigma(y) = -\sum_{i=1}^n w_i y_i + \int_{C_n} \exp\{\bar{h}_y(x)\}\, dx. \]
The function σ is convex, but not differentiable, so standard gradient-based convex optimization techniques such as Newton's method are not suitable. Nevertheless, the notion of a subgradient is still valid: a subgradient of σ at y is any direction which defines a supporting hyperplane to σ at y. Shor (1985) developed a theory of subgradient methods for handling convex, non-differentiable optimization problems. The r-algorithm, described in Shor (1985, Chapter 3) and implemented as SolvOpt in C by Kappel and Kuntsevich (2000), was found to work particularly well in practice. A main feature of the LogConcDEAD package is an implementation of an adaptation of this r-algorithm for the particular problem encountered in log-concave density estimation.

Shor's r-algorithm
Our adaptation of Shor's r-algorithm produces a sequence (y_t) with the property that
\[ \sigma(y_t) \to \min_{y \in \mathbb{R}^n} \sigma(y) \]
as t → ∞. At each iteration, the algorithm requires the evaluation of σ(y_t) and of a subgradient of σ at y_t, denoted ∂σ(y_t), which determines the direction of the move to the next term y_{t+1} in the sequence.
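The r-algorithm itself involves space dilation and is best left to SolvOpt, but the basic shape of a subgradient iteration can be seen in the following toy base R sketch, which minimises the nondifferentiable convex function σ(y) = |y − 2| with a diminishing step size (this is a plain subgradient method, not the r-algorithm):

```r
# Toy example: plain subgradient descent on sigma(y) = |y - 2|.
# The subgradient is -1 for y < 2 and +1 for y > 2 (any value in [-1, 1] at y = 2).
sigma <- function(y) abs(y - 2)
subgrad <- function(y) if (y > 2) 1 else if (y < 2) -1 else 0
y <- 4
for (t in 1:200) {
  y <- y - (1 / t) * subgrad(y)  # diminishing steps guarantee convergence
}
y  # close to the minimiser 2
```

The step size 1/t is the textbook choice; the r-algorithm instead adapts the metric using the difference of successive subgradients, which is why it converges much faster in practice.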
Exact expressions for σ(y_t) and ∂σ(y_t) are provided in Cule et al. (2010). In practice, their computation requires the evaluation of convex hulls and triangulations of certain finite sets of points. This can be done in a fast and robust way via the Quickhull algorithm (Barber et al. 1996), available in R through the geometry package (Grasman and Gramacy 2008). Due to the presence of some removable singularities in the expressions for σ(y_t) and ∂σ(y_t), it is computationally more stable to use a Taylor approximation to the true values for certain values of y_t (Cule and Dümbgen 2008). The values for which a Taylor expansion (rather than direct evaluation) is used may be controlled by the argument Jtol to the LogConcDEAD function mlelcd. By default this is 10^{-3}; altering this parameter is not recommended.
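The flavour of the problem can be seen in a quantity like (e^a − e^b)/(a − b), which has a removable singularity at a = b; terms of this kind arise when integrating the exponential of an affine function over a simplex. The following base R sketch (an illustration of the numerical issue, not the package's actual code) compares direct evaluation with a truncated Taylor expansion:

```r
# (exp(a) - exp(b)) / (a - b) = exp(b) * (exp(d) - 1) / d with d = a - b,
# which has the Taylor expansion exp(b) * (1 + d/2 + d^2/6 + d^3/24 + ...).
ratio_direct <- function(a, b) (exp(a) - exp(b)) / (a - b)
ratio_taylor <- function(a, b) {
  d <- a - b
  exp(b) * (1 + d / 2 + d^2 / 6 + d^3 / 24)
}
ratio_taylor(1 + 1e-12, 1)  # stable: essentially exp(1)
ratio_direct(1 + 1e-12, 1)  # suffers cancellation in the numerator
```

Switching between the two forms according to the size of |a − b| is the role played by a threshold like Jtol.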
Several parameters may be used to control the r-algorithm as detailed by Kappel and Kuntsevich (2000). In the function mlelcd, they may be controlled by the user via the arguments stepscale1, stepscale2, stepscale3, stepscale4 and desiredsize. For a detailed description of these parameters, as well as of this implementation of the r-algorithm, see Kappel and Kuntsevich (2000).

Stopping criteria
The implementation of the r-algorithm used in the main function mlelcd terminates after the (t + 1)th iteration if each of the following conditions holds:
\[ |y_{t+1,i} - y_{t,i}| \le \delta \, |y_{t,i}| \quad \text{for } i = 1, \ldots, n, \tag{4} \]
\[ |\sigma(y_{t+1}) - \sigma(y_t)| \le \epsilon \, |\sigma(y_t)|, \tag{5} \]
\[ \Bigl| \int \exp\{\bar{h}_{y_{t+1}}(x)\} \, dx - 1 \Bigr| \le \eta, \tag{6} \]
for some small tolerances δ, ϵ and η.
Criteria (4) and (5) are those suggested by Kappel and Kuntsevich (2000); (6) is based on the observation that the maximum likelihood estimator is a density, and so integrates to 1 (Cule et al. 2010). By default, these tolerances are δ = 10^{-4}, ϵ = 10^{-8} and η = 10^{-4}, but they may be modified by the user as required, using the parameters ytol, sigmatol and integraltol respectively. The default parameters have been found to work well and altering them is not recommended.

Usage
In this section we illustrate the functions available in LogConcDEAD through several simple simulated data examples. These functions include mlelcd, which computes the maximum likelihood estimator, as well as graphics facilities and the function rlcd for sampling from the fitted density.

Example 1: 1-d data
For this example, we will use 100 points from a Gamma(2, 1) distribution. The seed has been set (to 1), so this example can be reproduced exactly; you may also like to try with a different seed.
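The code for this example did not survive typesetting, so the following is a hedged reconstruction (it assumes LogConcDEAD and its dependencies are installed; the plotting details are our own choices, not necessarily those of the original vignette):

```r
# Reconstructed sketch of Example 1 (assumes the CRAN package LogConcDEAD)
library(LogConcDEAD)
set.seed(1)
x <- rgamma(100, shape = 2, rate = 1)
out <- mlelcd(x)                 # fit the log-concave maximum likelihood estimator
plot(out, uselog = TRUE)         # plot the fitted log-density
xs <- seq(min(x), max(x), length = 200)
lines(xs, dgamma(xs, shape = 2, rate = 1, log = TRUE), lty = 2)
legend("topright", c("estimate", "true log-density"), lty = c(1, 2))
```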

For one-dimensional data, we note that the alternative active set algorithm from logcondens (Rufibach and Dümbgen 2006; Dümbgen et al. 2007) may also be used to compute the log-concave maximum likelihood estimator, and appears to be faster. Unsurprisingly, the output from the two procedures is essentially the same.

Example 2: 2-d normal data
For this section, we will generate 100 points from a bivariate normal distribution with independent components. Again, we have set the seed (to 22) for reproducibility.

Basic usage
The basic command in this package is mlelcd, which computes the log-concave maximum likelihood estimate f̂_n. The verbose option controls the diagnostic output, which will be described in more detail below. The default print statement shows the value of the logarithm of the maximum likelihood estimator at the data points, the number of iterations of the subgradient algorithm required, and the total number of function evaluations required to reach convergence.

R> out <- mlelcd(x, verbose = 50)
In the next two subsections, we will describe the input and output in more detail.

Input
The only input required is an n × d matrix of data points. One-dimensional (vector) input will be converted to a matrix. Optionally a vector of weights w, corresponding to (w_1, ..., w_n) in (3), may be specified. By default this is w = (1/n, ..., 1/n), which is appropriate for independent and identically distributed observations.
A starting value y may be specified for the vector (y_1, ..., y_n); by default a kernel density estimate (using a normal kernel and a diagonal bandwidth selected using a normal scale rule) is used. This is performed using the (internal) function initialy.
The parameter verbose controls the degree of diagnostic information provided by SolvOpt. The default value, −1, prints nothing. The value 0 prints warning messages only. If the value is m > 0, diagnostic information is printed every mth iteration. The printed information summarises the progress of the algorithm, displaying the iteration number, the current value of the objective function, the (Euclidean) length of the last step taken, the current value of ∫ exp{h̄_y(x)} dx and the (Euclidean) length of the subgradient. The last column is motivated by the fact that 0 is a subgradient only at the minimum of σ (Rockafellar 1997, Chapter 27), and so for smooth functions a small value of the subgradient may be used as a stopping criterion. For nonsmooth functions, we may be close to the minimum even if this value is relatively large, so only the middle three columns form the basis of our stopping criteria, as described in Section 2.2.1.
The remaining optional arguments are generic parameters of the r-algorithm, and have already been discussed in Section 2.

Output
The output is an object of class "LogConcDEAD", which has the following elements:
R> out$logMLE[1:5]
[1] -2.698592 -3.543324 -2.935386 -2.259437 -1.804418

As was mentioned in Sections 1 and 2, there is a triangulation of C_n = conv(X_1, ..., X_n), the convex hull of the data, such that log f̂_n is affine on each simplex in the triangulation. Each simplex in the triangulation is the convex hull of a subset of {X_1, ..., X_n} of size d + 1. Thus the simplices in the triangulation may be indexed by a finite set J of (d + 1)-tuples, which are available via

R> out$triang[1:5,]

For each j ∈ J, there is a corresponding vector b_j ∈ R^d and β_j ∈ R, which define the affine function coinciding with log f̂_n on the jth simplex in the triangulation. These values b_j and β_j are available in

R> out$b[1:5,]

(In all of the above cases, only the first 5 elements are shown.) As discussed in Cule et al. (2010), for each j ∈ J we may find a matrix A_j and a vector α_j such that the map w ↦ A_j w + α_j maps the unit simplex in R^d to the jth simplex in the triangulation. The inverse of this map, x ↦ A_j^{-1} x − A_j^{-1} α_j, is required for easy evaluation of the density at a point, and for plotting. The matrix A_j^{-1} is available in out$verts and the vector A_j^{-1} α_j in out$vertsoffset. The "LogConcDEAD" object also provides some diagnostic information on the execution of the SolvOpt routine: the number of iterations required, the number of function evaluations needed and the number of subgradient evaluations required (together in the vector NumberOfEvaluations), and the minimum value of the objective function σ attained (MinSigma).

R> out$NumberOfEvaluations
[1] 282 858 283

R> out$MinSigma
[1] 3.516167

The indices of the simplices forming the boundary of the convex hull C_n are available via:

R> out$chull[1:5,]

In addition, an outward-pointing normal vector for each face of the convex hull C_n and an offset point (lying on the face of the convex hull) may be obtained.
R> out$outnorm[1:5,]
R> out$outoffset[1:5,]

This information may be used to test whether or not a point lies in C_n, as x ∈ C_n if and only if p^T (x − q) ≤ 0 for every face of the convex hull, where p denotes the outward normal and q the offset point of that face.
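For instance, given matrices of outward normals and offset points (here invented by hand for the unit square, rather than taken from a fitted object), the membership test can be written in base R as:

```r
# Each row of outnorm is an outward normal p, each row of outoffset an offset
# point q on the corresponding face; x is inside iff p' (x - q) <= 0 for all faces.
# Hypothetical values describing the unit square [0, 1] x [0, 1]:
outnorm   <- rbind(c(1, 0), c(-1, 0), c(0, 1), c(0, -1))
outoffset <- rbind(c(1, 0.5), c(0, 0.5), c(0.5, 1), c(0.5, 0))
in_hull <- function(x, outnorm, outoffset) {
  xmat <- matrix(x, nrow(outoffset), ncol(outoffset), byrow = TRUE)
  all(rowSums(outnorm * (xmat - outoffset)) <= 0)  # p' (x - q) <= 0 for every face
}
in_hull(c(0.5, 0.5), outnorm, outoffset)  # TRUE
in_hull(c(1.5, 0.5), outnorm, outoffset)  # FALSE
```

With a fitted object, the matrices out$outnorm and out$outoffset would play the roles of outnorm and outoffset above.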
When d = 1, the convex hull consists simply of the minimum and maximum of the data points, and out$outnorm and out$outoffset are NULL, although out$chull still takes on the appropriate values.

Graphics
Various aspects of the log-concave maximum likelihood estimator can be plotted using the plot command, applied to an object of class "LogConcDEAD". The plots are based on interpolation over a grid, which can be somewhat time-consuming. Since several plots will be made here, we save the results of the interpolation separately, using the function interplcd, and use them to make several plots. The number of grid points may be specified using the parameter gridlen. By default, gridlen = 100, which is suitable for most plots.
Where relevant, the colors were obtained by a call to heat_hcl in the package colorspace (Ihaka et al. 2008), following the recommendation of Zeileis et al. (2009). Thanks to Achim Zeileis for this helpful suggestion.
The rgl package allows user interaction with the plot (e.g. the plot can be rotated using the mouse and viewed from different angles). Although we are unable to demonstrate this feature on paper, Figure 5 shows the type of output produced by rgl, using the following code:

R> plot(out, g = g, type = "r")

Figure 6 shows the output produced by setting uselog = TRUE to plot on the log scale. Here we can clearly see the structure of the log-concave density estimate. This is produced using the command

R> plot(out, g = g, type = "r", uselog = TRUE)

Figure 5: rgl output for Example 2

Other functions
In this section we will describe the use of the additional functions rlcd and dlcd.

Sampling from the MLE
Suppose we wish to estimate a functional of the form θ(f) = ∫ g(x) f(x) dx, for example the mean or other moments, the differential entropy −∫ f(x) log f(x) dx, and so on. Once we have obtained a density estimate f̂_n, such as the log-concave maximum likelihood estimator, we may use it as the basis for a plug-in estimate θ̂_n = ∫ g(x) f̂_n(x) dx, which may be approximated using a simple Monte Carlo procedure even if an analytic expression is not readily available.
In more detail, we generate a sample Z_1, ..., Z_N drawn from f̂_n, and approximate θ̂_n by
\[ \frac{1}{N} \sum_{i=1}^N g(Z_i). \]
This requires the ability to sample from f̂_n, which may be achieved given an object of class "LogConcDEAD" using the function rlcd. Details of the function rlcd are given in Cule et al. (2010) and Gopal and Casella (2010).
Once we have a random sample, plug-in estimation of various functionals is straightforward.
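The Monte Carlo step itself is generic. In the base R sketch below, rnorm stands in for a sample drawn from the fitted density (in practice one would use rlcd), and we estimate the second moment:

```r
# Plug-in estimation of theta = integral of g(x) f(x) dx by Monte Carlo.
# Here rnorm stands in for sampling from the fitted density via rlcd.
set.seed(100)
Z <- rnorm(1e5)            # sample Z_1, ..., Z_N from the (stand-in) density
g <- function(z) z^2       # functional of interest: the second moment
theta_hat <- mean(g(Z))    # approximates E g(Z); about 1 for the standard normal
```

The Monte Carlo error decreases at rate N^{-1/2}, so N can simply be taken large enough for the accuracy required.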

Example 4: Higher-dimensional data
In our Ąnal example we illustrate the use of the log-concave density estimate for higherdimensional data. For this example the seed has been set to 4444. The log-concave maximum likelihood estimator is deĄned and may be computed and evaluated in exactly the same way as the 2-dimensional examples in Sections 3.2 and 3.5. This estimate will be based on 100 points.