Exact Solutions in Log-Concave Maximum Likelihood Estimation

We study probability density functions that are log-concave. Despite the space of all such densities being infinite-dimensional, the maximum likelihood estimate is the exponential of a piecewise linear function determined by finitely many quantities, namely the function values, or heights, at the data points. We explore in what sense exact solutions to this problem are possible. First, we show that the heights given by the maximum likelihood estimate are generically transcendental. For a cell in one dimension, the maximum likelihood estimator is expressed in closed form using the generalized W-Lambert function. Even more, we show that finding the log-concave maximum likelihood estimate is equivalent to solving a collection of polynomial-exponential systems of a special form. Even in the case of two equations, very little is known about solutions to these systems. As an alternative, we use Smale's alpha-theory to refine approximate numerical solutions and to certify solutions to log-concave density estimation.


Introduction
Nonparametric methods in statistics emerged in the 1950-1960s [26,50,44,3] and fall into two main streams: smoothing methods and shape constraints.Examples of smoothing methods include delta sequence methods such as kernel, histogram and orthogonal series estimators [58], and penalized maximum likelihood estimators, e.g., spline methods [25].Their defining feature is the need to choose the smoothing or tuning parameters.It is a delicate process because smoothing parameters depend on the unknown probability density function.In contrast to smoothing methods, shape constrained nonparametric density estimation is fully automatic and does not depend on the underlying probability distribution, though this comes at the expense of worse L 1 convergence rates for smooth densities [24].Some previously studied classes of functions include non-increasing [27], convex [29], k-monotone [7] and s-concave [20].We refer the reader to [55,53,57,28] for general references on nonparametric statistics.The definitions of k-monotone and s-concave can be found in [6] and [18], respectively.
In this paper we focus on the class of log-concave densities, which is an important special case of s-concave densities.The choice of log-concavity is attractive for several reasons.First of all, most common univariate parametric families are log-concave, including the normal, Gamma with shape parameter greater than one, Beta densities with parameters greater than 1, Weibull with parameter greater than 1 and others.Furthermore, log-concavity is used in reliability theory, economics and political science [4].In addition to this, log-concave densities have several desirable statistical properties.For example, log-concavity implies unimodality but log-concave density estimation avoids the spiking phenomenon common in general unimodal estimation [21].Moreover, this class is closed under convolutions and taking pointwise limits [14].We refer the reader to [52] for an overview of the recent progress in the field.
Let X = (x 1 , x 2 , . . ., x n ) be a point configuration in R d with weights w = (w 1 , w 2 , . . ., w n ) such that w i ≥ 0 and w 1 +w 2 +• • •+w n = 1.The log-concave maximum likelihood estimation (MLE) problem aims to find a Lebesgue density that solves max n i=1 w i log(f (x i )) s.t.log(f ) is concave and (1.1) It has been shown that the solution exists with probability 1 and is unique, and its logarithm is a tent function, i.e., a piecewise linear function with regions of linearity inducing a subdivision of the convex hull of X [59,43,15,49], see Figure 1 for an example.While MLE is the most widely studied estimator in this setting, it is not the only one, for examples see [19,16].
The maximum likelihood estimator is attractive because of its consistency under general assumptions [43,21,14,23] and superior performance compared to kernel-based methods with respect to mean integrated squared error, as observed in simulations [15].At the same time, the convergence rate is still an open question and only lower [34,35] and upper [34,10] bounds are known.Further theoretical properties have been studied for some special cases of log-concave densities, e.g., k-affine densities [33] and totally positive densities [48].Several algorithms have been developed to compute the log-concave MLE in one dimension [51] and in higher dimensions [15,2,46].Software implementations include R packages such as logcondens [22] and cnmlcd [37] in one dimension, and LogConcDEAD [13] and fmlogcondens [45] in higher dimensions.
How many cells does the subdivision induced by the logarithm of the optimal log-concave density have?
Using the R package LogConcDEAD with default parameters, one obtains that the logarithm of the maximum likelihood estimate is a piecewise linear function with seven unique linear pieces.However, when one investigates the optimal density more closely, it appears that several linear pieces are similar.For example, a visual inspection of the optimal density depicted in Figure 1 makes it impossible to distinguish all 7 regions and suggests that there are only four unique linear pieces.Using LogConcDEAD one also obtains the two triangles, but according to the LogConcDEAD output the quadrangle consists of two linear pieces and the hexagon consists of three linear pieces.The subdivision corresponding to the LogConcDEAD result is depicted in Figure 8a.What is the true number of unique linear pieces of the optimal density?Is it four, seven or another value?
Theoretically, the algorithm used in LogConcDEAD finds the true optimal density, however, in practice, the answer is a numerical approximation.By changing the parameter sigmatol from default value 10 −8 to 10 −10 , LogConcDEAD outputs four unique linear pieces, exactly as we observed in Figure 1.Although it might seem obvious that four is the correct number of linear pieces, in reality the situation is more complicated, see Example 4.16.How do we find the correct number of linear pieces?
The goal of this paper is to study exact solutions to log-concave maximum likelihood estimation.An exact solution will have three different meanings in this paper.First, one might hope that it is an algebraic number.This would enable exact symbolic computations by way of storing a floating point approximation of a number along with a polynomial that vanishes on it.Such computations are not possible for transcendental numbers.Thus, the first main result of our paper is Theorem 3.7, which states that the heights at the sample points of the logarithm of the log-concave density estimate are transcendental for an open ball of weights.
Second, in light of Theorem 3.7, we would like to express the maximum likelihood estimator in closed form using well-known mathematical operations and functions, although not necessarily elementary functions.In the simplest case of one cell in one dimension, we derive the log-concave density estimator in closed form using the generalized W -Lambert function, see Proposition 3.9.It is known that the generalized W -Lambert function is not an elementary function.More generally, solving the MLE can be restated as a collection of polynomial-exponential systems of equations, which have been studied in the literature.However, even in the case of two equations, only bounds on solutions are known [38].This suggests that it might be difficult to express the log-concave maximum likelihood estimator in closed form.As an alternative, we turn to Smale's α-theory, which we describe briefly now.Third, given a sufficiently close floating point solution to the MLE problem, one hopes that it can be refined to any desired precision using Newton iteration or other techniques.A natural question arises: when is the approximate solution good enough for these methods to succeed?A way to make this mathematically rigorous is Smale's α-theory [9,56], which we discuss in Section 4. We obtain the α-certified solutions to log-concave density estimation.This allows us to test and compare numerical solvers, as well as rigorously decide the certified, correct subdivision for a given log-concave density estimation problem.Our methods are especially relevant when the precision of the log-concave density estimate is important.This opens new pathways to answering the motivating question: what is the correct number of cells?
The code for computations in this paper can be found at [30].

Geometry of log-concave maximum likelihood estimation
We start by reviewing the geometry of log-concave maximum likelihood estimation mostly following [49].
Definition 2.1.Let P be the convex hull of a point configuration For a fixed real vector y ∈ R n , we define a function h X,y on R d , called the tent function, as the smallest concave function such that h X,y (x i ) ≥ y i for i = 1, . . ., n.Here the term smallest means that for any other concave function h on R d such that h(x i ) ≥ y i for i = 1, . . ., n, one must have h(x) ≥ h X,y (x) for all x ∈ R d .The tent function h X,y is piecewise linear on P with linear pieces equal to upper facets of the convex hull of the points (x 1 , y 1 ), (x 2 , y 2 ), . . ., (x n , y n ) in R d+1 .We have h X,y (x) = −∞ at all points x ∈ R d outside P .If h X,y (x i ) = y i for i = 1, . . ., n, then y is called relevant.
It was shown by Cule, Samworth and Stewart for uniform weights [15] and by Robeva, Sturmfels and Uhler in general [49] that the constrained optimization problem (1.1) of finding the log-concave maximum likelihood estimate is equivalent to the unconstrained optimization problem (2.1) Moreover, the log-concave maximum likelihood estimate is a tent function with tent poles at some of the x i .Therefore finding the log-concave density which maximizes the likelihood of (X, w) is equivalent to finding an optimal height vector y * .
Definition 2.2.We follow the definitions in [17].Given a point configuration X in R d , a subdivision ∆ of X is a collection of d-polytopes, denoted σ i , such that the union of polytopes in ∆ equals conv(X), the vertex set of polytopes in ∆ is contained in X and the intersection of polytopes in ∆ can only happen along lower dimensional faces.A subdivision ∆ is called a triangulation, if all polytopes in ∆ are simplices.A triangulation ∆ of the point configuration X is called maximal, if every element of X is a vertex of a simplex in ∆.A subdivision is called regular if its full dimensional cells σ i are combinatorially equivalent to the regions of linearity of a tent function on X for some height vector y ∈ R n .
Corollary 2.3.[49, Corollary 2.6] To find the optimal height vector y * in (2.1) is to maximize the following rational-exponential objective function over y ∈ R n : where ∆ is any regular triangulation that refines the regular subdivision induced by the tent function h X,y .
If y induces a regular subdivision ∆ that is not a maximal regular triangulation, then we can consider any maximal regular triangulation that refines ∆.Thus if there are m maximal regular triangulations of X, then to find the optimal y * we must compare the optimal values y * ∆ 1 , y * ∆ 2 , . . ., y * ∆m which are obtained by solving the optimization problem (2.2) m times, once for each maximal regular triangulation ∆ 1 , ∆ 2 , . . ., ∆ m .Notation 2.4.We will denote by S ∆ the function given by the right hand side of (2.2) for a fixed triangulation ∆.To visualize the situation, we consider the Samworth body which was introduced in [49].The unconstrained optimization problem (2.1) is equivalent to the constrained optimization problem of maximizing the linear function w • y over the Samworth body.For different choices of weight vector w = (w 1 , w 2 , w 3 ), we obtain different optimal height vectors y = (y 1 , y 2 , y 3 ) on the surface of the Samworth body, and the height vector determines the triangulation.The Samworth body consists of two regions that can be seen in Figure 2. The green region comes from the one-simplex triangulation ∆ 1 = {{1, 3}}, while the red region comes from the two-simplex triangulation ∆ 2 = {{1, 2}, {2, 3}}.Moreover, one can see lines separating the green region into two pieces and the red region into three pieces (ignore the curve separating the green and the red regions for now).These lines correspond to the degenerate cases where y 1 = y 3 , y 1 = y 2 or y 2 = y 3 , and hence the right hand side of (2.2) is not defined.Therefore those lines are simply artifacts of the reformulation (2.2) since in the original unconstrained setting (2.1) these points present no difficulty.The intersection of the three lines is the point (− log 5, − log 5, − log 5).Consider the curve separating the green and red regions of the Samworth body.This curve is made of all the points y that form a relevant tent function, inducing the subdivision ∆ 1 .To understand the green region, see the piecewise linear functions drawn in Figure 3. Since the lowest (dotted) function is not concave, it is invalid as a tent function.Therefore, if the height y 2 is too low, the optimal tent function will be the (solid-line) linear function.In effect, the optimal tent-function ignores heights y i if they are too low.This basic phenomenon is responsible for the green part of the Samworth body being flat in the y 2 direction, meaning that it is a pencil of half-lines parallel to the y 2 -axis.
The transition from the red region to the green region is not smooth.For every y on the curve between the green and red regions, there is a two-dimensional cone of weight vectors that give y as an optimal solution.The generators of this cone are described in [49,Theorem 3.7].The optimal height vector y * for w = ( 1 3 , 1 3 , 1 3 ) lies on the curve between the red and green regions.It is not a critical point of the function (2.3), because w is not a normal vector to the red region at the point y * .
We now return to the general situation and consider the specific approach of critical equations for solving the optimization problem (2.2).Let X = (x 1 , . . ., x n ) be a configuration of n points x i ∈ R d .Fixing a maximal regular triangulation ∆ of our point configuration X, we can find the optimal y * ∆ for S ∆ in (2.2) over y ∈ R n by solving the system of critical equations ∂S ∆ /∂y i = 0.These partial derivatives take the form (see [49, Proof of Lemma 3.4]): .
(2.4) Definition 2.6.For a fixed maximal regular triangulation ∆ of X, let A be the matrix such that the system of n critical equations (2.4) can be written in the form where e y is a column vector of exponentials (e y 1 , e y 2 , . . ., e yn ) T , and w is a column vector of weights (w 1 , . . ., w n ) T .The matrix A is called the score equation matrix.
The entries of A are in the field of rational functions in the variables y 1 , . . ., y n .Diagonal entries of A are .
The matrix A can be written as a sum of matrices over maximal simplices σ ∈ ∆.This will be described explicitly in the proof of Theorem 3.1.
There are two caveats when solving the optimization problem (2.2) using the method of critical equations.First, it is not enough to consider the system of critical equations ∂S ∆ /∂y i = 0 only for each of the maximal regular triangulations ∆, since the optimization problem (2.2) is not smooth.One has to consider a system of critical equations for each subdivision of X.For a general subdivision ∆ of X, this system is constructed in the following way.We consider S ∆ ′ (y 1 , . . ., y n ) for any maximal triangulation ∆ ′ that refines ∆, substitute y i that can be expressed in terms of other y's in the subdivision ∆ and construct the system of critical equations ∂ S ∆ /∂y i = 0 for the resulting function S ∆ .For maximal triangulations, we have S ∆ = S ∆ and the system of critical equations is given by (2.4).We will demonstrate this phenomenon on the point configuration from Example 2.5.
).The output from LogConcDEAD suggests that the optimal tent function is supported on one cell, with heights given by y * 1 = −1.816665,y * 2 = −1.576024and y * 3 = −1.415597.However, the vector y * is neither a critical point of S ∆ 2 nor of the function This can be seen by taking partial derivatives of these functions with respect to y 1 , y 2 , y 3 and substituting y * 1 , y * 2 , y * 3 .In the case of ∂S ∆ 1 /∂y i = 0, it is particularly easy to see that there are no solutions, since ∂S ∆ 1 /∂y 2 = w 2 ̸ = 0.In the case of ∂S ∆ 2 /∂y i = 0, the system of critical equations fails to certify in the sense of Section 4.
We will verify in Example 4.13 that y * is a critical point of the function S ∆ 2 .
The second caveat is that to find the optimal tent function, it is not enough to merely compare the optimal critical points y * ∆ of ∂S ∆ /∂y i = 0 for each subdivision ∆.Denote by Y ∆ the set of y that induce a subdivision that is equal to or coarser than ∆.For each ∆, it also has to be checked that is depicted in Figure 5.We have P exp(f * ∆ (t))dt = 1 for all subdivisions ∆.This implies that if y * ∆ is not relevant, then exp(h X,y * ∆ ) is not a distribution.The optimal tent function is supported on the subdivision {{1, 3}, {3, 5}}.Also subdivisions {{1, 4}, {4, 5}} and {{1, 5}} give concave piecewise-linear functions f * ∆ , however, the value of S ∆ at y * ∆ is less for these subdivisions (respectively −2.32524 and −2.32556) than for the optimal subdvision (−2.31007).Moreover, only for the optimal subdivision we obtain (a) {12, 23, 34, 45} {{1,3},{3,5}} agrees with y * when rounded to the third decimal digit.This suggests a method for checking whether a subdivision supports the optimal tent function: The piecewise-linear function f * ∆ should be concave and the height vector y * ∆ should be close to y * obtained by LogConcDEAD.We see from this example, if a subdivision ∆ is incompatible with the optimal subdivision, then f * ∆ might or might not be concave.The subdivisions {{1, 4}, {4, 5}} and {{1, 2}, {2, 5}} are both incompatible with the subdivision {{1, 3}, {3, 5}}, and f * {{1,4},{4,5}} is concave whereas f * {{1,2},{2,5}} is not concave.In all examples that we have done, if a subdivision ∆ refines the optimal subdivision, then f * ∆ is not concave and if a subdivision ∆ is coarser than the optimal subdivision, then f * ∆ is concave.Whether this is true in general, is left as an open question.

Transcendentality and closed-form solutions
In this section we use notions from geometric combinatorics to study the structure of (2.6).In particular, we will prove that the matrix A is invertible.This will be our main tool in proving the transcendentality of log-concave MLE and deriving closed form solutions in the one-dimensional one cell case using Lambert functions.

Score equation matrix invertibility and transcendentality
Towards proving transcendentality, we first investigate the invertibility of the matrix A. Theorem 3.1.Consider a point configuration X = (x 1 , . . ., x n ) in R d , let ∆ = {σ 1 , . . ., σ m } be a maximal regular triangulation of X.The score equation matrix A from (2.5) is invertible.Definition 3.2.Given a triangulation ∆, we define the neighborhood N (j) of a vertex j in ∆ to be the set of vertices N (j) = {i : (i, j) ∈ σ k for some k} .
Before giving the proof of Theorem 3.1, we illustrate the construction in the proof with a small example.
Example 3.3.Let X = (x 1 , x 2 , x 3 , x 4 ) be a four point configuration in R 2 with ∆ = {σ 1 , σ 2 }, where σ 1 = {1, 2, 3} and σ 2 = {2, 3, 4}.Let A be the score equation matrix for the entire regular triangulation ∆.Let us denote the difference y i − y j by y ij .Then A = A(σ 1 ) + A(σ 2 ), where  We define matrix B to be the matrix A with its j-th column multiplied by i∈N (j) y 2 ji , for all j from 1 to 4. We obtain the following matrices The product of the diagonal entries of B = B(σ 1 ) + B(σ 2 ) is a polynomial of degree 12. Whereas a term in the expansion of the determinant of B with off-diagonal entries has at most degree 10.
Proof of Theorem 3.1.The score equation matrix A associated to a maximal regular triangulation ∆ can be written as where the entries of A(σ) for i ̸ = j are The matrix A(σ) is sparse: If i or j does not belong to σ then A i,j (σ) = 0.
Let B (resp.B(σ)) be the matrix that is obtained by multiplying the j-th column of A (resp.A(σ)) by α∈N (j) (y j − y α ) 2 for j = 1, . . ., n: Fix σ ∈ ∆.We describe separately the off-diagonal and diagonal entries of B(σ).For i, j ∈ σ and i ̸ = j we get And for the diagonal entries Given a polynomial f ∈ R[y 1 , . . ., y n ], we can rewrite f = d j i=0 f i y i j as a univariate polynomial in y j of degree d j , where f i ∈ R[y i : i ̸ = j] is a constant with respect to y j .We then define the initial form of f with respect to j to be We observe that for the off-diagonal entries B(σ) i,j , the initial form with respect to j is in j (B(σ) i,j ) = y , where γ j = |N (j)| is the number of vertices adjacent to j in ∆.Whereas for the diagonal entry B(σ) j,j , the initial form is in j (B(σ) j,j ) = y 2γ j −d j .
In both cases, the degree of the initial form is the degree of the polynomial.We sum the matrices B(σ) for σ ∈ ∆, to get B and note that the coefficient of the monomial y 2γ j −d j in B j,j is the number of simplices in ∆ containing vertex j.Hence, using the Leibniz formula to compute the determinant of B, we get that the product of diagonal entries is a polynomial of degree n j=1 2γ j − d .All off-diagonal entries in that column of B are of degree one smaller, thus any monomial in the expanded form of the determinant with off-diagonal entries must have degree at least two smaller than the product of diagonal entries.The following equality is a direct consequence of (3.1) Since det(B) is not identically 0, det(A) is not identically zero, hence A is invertible over the field of rational functions.
We will explore rational-exponential systems of the form (3.2) further in Sections 3.2-3.3.The following is a result from transcendental number theory, for a textbook reference see Theorem 1.4 of [5].
A special case of the Lindemann-Weierstrass theorem is the Lindemann theorem which states that exp(y) is transcendental for algebraic y ̸ = 0. Theorem 3.7.Let X ⊆ Q d .If vol(conv(X )) ̸ = 1, then there exists an open ball of weights U ⊆ R n such that for every w ∈ U, at least one coordinate of the optimal height vector y * is transcendental.If vol(conv(X )) = 1, then all coordinates of y * are algebraic if and only if w is in the cone over the secondary polytope Σ(X).
Proof.Let ∆ be a maximal regular triangulation.According to [49, Theorem 1.2], there exists an open ball U ⊆ R n of weights that induces the maximal regular triangulation ∆.Take any w ∈ U and consider the rational-exponential system (3.2) for this choice of ∆ and w.Then we have exp(y 1 ) = p 1 (y 1 , . . ., y n ) where p 1 is a rational function in Q(y 1 , . . ., y n ).Assume that y 1 , . . ., y n are algebraic.By Lindemann's theorem exp(y 1 ) is algebraic if and only if y 1 = 0.
However, p(y 1 , . . ., y n ) is always algebraic, since y 1 , . . ., y n are algebraic and the algebraic numbers form a field.Hence y 1 = 0. We can argue similarly that y i = 0 for all i.The vector y = (0, . . ., 0) belongs to the boundary of the Samworth body if and only if the volume of the convex hull of X is 1.In this case, y is the optimal solution if w is in the cone over the secondary polytope Σ(X) by [49, Corollary 3.9].

One cell in one dimension
In this section we apply the invertibility of the score equation matrix to give a closed form solution to log-concave maximum likelihood estimator in case the logarithm of the optimal density is a linear function on the real line.If X = (x 1 , x 2 ) ⊂ R, then Hence the polynomial-exponential system (3.2) has the form In the rest of the section we will discuss how to solve Equation (3.5) using Lambert functions.The solutions for y 1 and y 2 can then be obtained from Equations (3.3) and (3.4) by solving for y 12 .
Definition 3.8 (Section 2 in [42]).For x, t i , s j ∈ R, consider the function We denote its (generally multi-valued) inverse function at the point a ∈ R by We have W (; ; a) = log(a).
Remark 3.10.Proposition 3.9 generalizes to the case when we have n points on a line and the optimal tent function is supported on one cell.The generally multi-valued generalized W -Lambert function W (ρ + 1; −ρ −1 − 1; −ρ) is plotted in Figure 6.We explore its branches, i.e., single-valued functions of ρ, using r-Lambert functions.Definition 3.11 (Section 3.2 in [42]).If r ∈ R, consider the function We denote its inverse function in the point a ∈ R by W r (a) and call it the r-Lambert function.
The following theorem makes the connection between the generalized Lambert function and the r-Lambert function: Theorem 3.12 (Theorem 3 in [42]).If t, s, a ∈ R, the following equality holds: The second case happens when ρ > 0, in which case we have the double branch of constant zero function and an additional branch.This is the branch that is relevant to us in the context of Proposition 3.9.The first case happens when ρ < 0, in which case there exists a double branch of the constant zero function.This cannot appear for positive weights w i .The third case does not happen.
The r-Lambert function can be computed with the C++ implementation [41].Alternatively, one can use results about computing roots of polynomial-exponential equations.In [38], a symbolic-numeric algorithm is proposed for constructing explicitly an interval containing all the real roots of a single real polynomial-exponential equation, and counting how many roots are contained in a non-bounded interval.In [47], the decision problem of the existence of positive roots of such functions is discussed.This subject is strongly related to quantifier elimination [60], and to transcendentality problems [40,11,12].The latter problem of the transcendence theory appears in our Theorem 3.7.

Two cells in one dimension
Recall y 12 = y 1 − y 2 and y 23 = y 2 − y 3 .Then Consider the polynomial-exponential system exp(y) = A −1 w as in (3.2).Dividing the first equality with the second one and the second one with the third one gives: Hence we could reduce a polynomial-exponential system with three equations and three variables to a polynomial-exponential system with two equations and two variables.Systems of two rational bivariate polynomial-exponential equations such as (3.6) are studied in [38].An algorithm giving the number of solutions of such a system is provided, where all the solutions are contained in a generalized open rectangle of type I 1 × I 2 ⊂ R 2 , under the hypothesis that at least one of the intervals I 1 or I 2 is bounded.Remark 3.13.Let X ⊂ R. If we consider tent functions h X,y that are supported on two cells such that h X,y is a constant function on one of the two cells, then one can use methods similar to the one cell case (see Section 3.2) to give the optimal solution using the Lambert function.

Certifying solutions with Smale's α-theory
As explained in Section 2, our task is to maximize the objective function S(y 1 , . . ., y n ) defined in Corollary 2.3.For a subdivision ∆, we can find the optimal y * ∆ by considering S ∆ ′ (y 1 , . . ., y n ) for any maximal triangulation ∆ ′ that refines ∆, substituting y i that can be expressed in terms of other y's for the subdivision ∆ and solving the system of critical equations ∂ S ∆ /∂y i = 0 for the resulting function S ∆ .For maximal triangulations, we have S ∆ = S ∆ and the system of critical equations is given by (2.4).We will write S ∆ instead of S ∆ also when talking about general subdivisions and for brevity we denote the system of critical equations by ∇S ∆ (y) = 0. We say the system is square because we have n equations ∂S ∆ /∂y i = 0 in n variables y 1 , . . ., y n .Usually it will be impossible to write down exact solutions to these systems, but there is a way forward.In what follows we discuss the computation of certified solutions to this system of equations.To do so, we discuss Smale's α-theory, which makes mathematically rigorous the idea of approximate zeros in the sense of quadratic convergence of Newton iterations.The following influential definition was given in [9,56].Definition 4.1 (Chapter 8 of [9]).Let Df (x) be the n × n Jacobian matrix of the square system of complex-analytic equations f (x) = 0 ∈ C n , where f : C n → C n is written as a column vector of its component functions A point z ∈ C n is an approximate zero of f if there exists a zero z * ∈ C n of f such that the sequence of Newton iterates for all k ≥ 1 where z 0 = z.If this holds, then we call z * the associated zero of z.
Here ∥x∥ := ( n i=1 x i x i ) 1 2 is the standard norm in C n , and the zero z * is assumed to be nonsingular, meaning that detDf (z * ) ̸ = 0.
Therefore the problem becomes two-fold.Given a system of equations f , we need a way to (1) generate approximate solutions, and (2) certify their quadratic convergence under Newton iterations.The methods of Smale's α-theory solve exactly this second problem.This is accomplished using the constants α(f, x), β(f, x) and γ(f, x), which we will discuss in Section 4.1.Typically γ is difficult to compute, since it is defined as the supremum of infinitely many quantities depending on higher-order derivatives of our system of equations.However, explicit upper bounds on γ were calculated in [31] which we can specialize to the system required for log-concave density estimation.These upper bounds have the advantage that they are easily computed from our system ∇S = 0, and can therefore be used to αcertify approximate solutions coming from numerical software.In Section 4.1, we make this precise, discussing recent work on the subject [31,32,54,56] and how it applies in our context.Remark 4.2.One might wonder why we do not directly evaluate the equations in question to the approximate height values given by statistical packages.The reason is that we want to have a measure of how accurate this solution is, which is also very sensitive to the system.Consider for example the system consisting of the single polynomial f (x) = x.We would not accept 1/2 as a solution.But if we consider the system f (x) = x 10 and we evaluate at x = 1/2, we get a value that is less than 0.001.This could have been tempting, but note that in both cases the difference between actual solution and approximation is the same.
Another example that illustrates the potential difficulties involved in judging a numerical solution based on its evaluation into the original system of equations comes from [8].Consider the univariate polynomial A solution which is accurate within 9.4 × 10 −12 of the true solution is z * = 30.00000000000142− 0.00000000000047i, but evaluating the polynomial at this solution yields a complex number f (z * ) with norm |f (z * )| = 31.371,which certainly seems far from zero.However, refining the accuracy of this solution to z * * = 29.9999999999998983894731343124+ 0.0000000000000000000000062i, we find that |f (z * * )| = 0.00000000032, which is much better.

Smale's α-theory
The intuition behind α-theory is as follows.The size of the initial Newton iteration step combined with the size of the derivatives control how quickly Newton iteration converges to a true solution.We can calculate the size of the Newton iteration step, so if we have some control over the higher order derivatives of f , then we should be able to certify whether a solution satisfies the criterion of Definition 4.1.This motivates the definition of the following constants α, β, γ ∈ R, associated to a system of equations f at a point x.These constants measure quantities relevant to certifying approximate zeros.
Definition 4.3.Let f : C n → C n be a system of complex-analytic functions and let x ∈ C n .We define α(f, x) to be the product of β(f, x) and γ(f, x): The constant β(f, x) measures the size of the Newton iteration step applied at x, namely: while γ(f, x) bounds the sizes of the following quantities, involving the higher order derivatives: If we can compute these constants β, γ for a candidate solution, then we can utilize the following Theorem 4.4 (Chapter 8 of [9]).If f : C n → C n is a system of complex-analytic functions and x ∈ C n satisfies ≈ 0.157671, then x is an approximate zero of f = 0.
For polynomial systems, all higher-order derivatives eventually vanish.Exactly this fact was used in [54] to derive an upper bound for γ(f, x) which involves the degrees of the polynomials in the system f .This is highly convenient since, even for systems of polynomials, calculating γ(f, x) purely based on the definition is quite a difficult task.Yet, if we are to certify candidate solutions to our system of equations, we need to calculate γ and β at our candidate x, multiply them, and hope they are below ≈ 0.157671.

Polynomial-exponential systems
For polynomial-exponential systems f , calculating γ(f, x) is even harder.However, in [31], an upper bound was computed for γ involving quantities more readily apparent in a given system f than what appears in the bare definition of γ.In fact, an upper bound for γ is calculated which applies to a general class of systems, as well as upper bounds for several special cases.One of these special cases can be further specialized to the system of equations ∇S = 0 arising in log-concave density estimation (this is Lemma 4.9 below).In [31] an example is given where the bounds for the special cases allowed candidate solutions to be α-certified despite failure using the more general bounds.In this section we summarize the results of [31] as they relate to log-concave density estimation.First we need a few definitions.Definition 4.5.For a point x ∈ C n define For a polynomial g : For a polynomial system f : We now define a quantity µ(f, x) associated to a polynomial system which will play a role in bounding γ later.Definition 4.6.Let f : C n → C n be a polynomial system with deg Following [31], we extend Definition 4.6 to certain polynomial-exponential systems.Definition 4.7.Let a ∈ Z ≥0 , δ i ∈ C, and σ i ∈ {1, . . ., n}.Consider the polynomialexponential system where P : C N → C n is a polynomial system with N = n + a variables.Thus, the system G is a square system of size N .We write X := (x, u).Define µ(G, X) = max 1, DG(x, u) −1 C P (x, u)∥P ∥ I a .
The following specializes Corollary 2.6 of [31].Then, for any X = (x, u) ∈ C N such that the Jacobian of G is invertible, Proof.This is a straight-forward specialization of Corollary 2.6 of [31].We set to zero quantities that deal with functions not relevant to log-concave density estimation.
Therefore, reformulating our system of polynomial-exponential equations ∇S ∆ = 0 in the format (4.1) will allow us to calculate an upper bound on γ, which will allow us to certify solutions to our critical equations.Lemma 4.9.Fix a maximal regular triangulation ∆.The polynomial-exponential system ∇S ∆ = 0 can be reformulated as a system of equations of the form (4.1), demonstrating that Theorem 4.8 applies in the context of log-concave maximum likelihood estimation.
Proof.The partial derivatives ∂S ∆ /∂y k are rational functions of the y i and the exp(y i ).Since we set each partial derivative to zero, we can clear denominators, creating a system of equations, each of which is a polynomial in the y i and the exp(y i ).Setting each δ i = 1 in (4.1), we can replace each occurrence of exp(y i ) with u i , creating the polynomial system P (y 1 , . . ., y n , u 1 , . . ., u n ), hence a = n as well.Appending the equations u i − exp(y i ) to the system of polynomials P , we have a system of 2n equations in 2n unknowns.This system is of the required form in order to apply Theorem 4.8.
Thus, we have everything we need to compute the upper bound in (4.2) for a system of critical equations ∇S ∆ = 0 when ∆ is a maximal regular triangulation.By calculating this upper bound for a given system of equations, we can certify approximate numerical solutions obtained in any way.When ∆ is not a maximal regular triangulation, one must impose further linear constraints on some of the y i , as was the case in Example 2.7.After simplifications, one might still end up with terms involving exponentials of fractional convex combinations of the y i .This poses no threat for the purposes of α-certification, as one may in fact use products of exponentials of the form e βy i .In particular, a bound for γ(G, X) also for these more general polynomial-exponential systems is given in [31,Corollary 2.6].
In algebraic statistics, it is common to find algebraic invariants which characterize algebraic complexity.For example, the maximum likelihood degree of a statistical model gives information about the critical points of the likelihood function of a parametric model [1].Similarly, in nonparametric algebraic statistics, it could be the case that the combinatorial complexity of the optimal subdivision gives us information about the computational complexity of finding a numerical solution.
Question 4.10.Does increasing the combinatorial complexity of the optimal subdivision decrease the likelihood that the numerical output from LogConcDEAD is α-certified?
We study this question experimentally in the next section.In future work, one could hope to precisely describe this phenomenon, should it exist.Of course, higher degrees, more variables, more equations will always increase the bound on γ we calculate, but the combinatorics should still play some role.

A procedure for α-certifying
One of our motivating questions was to determine the correct subdivision for a given data set, as was the case in Example 1.1.In this section we describe a procedure based on Smale's α-theory that in principle allows us to find the certifiably correct subdivision.Recall that the objective function S(y 1 , . . ., y n ) depends on a subdivision of the convex hull of the data set X.If there are m subdivisions, then there are m different objective functions S 1 , . . ., S m , and m different possible systems of equations ∇S 1 = 0, . . ., ∇S m = 0. Given an estimate of a solution y * , perhaps computed numerically using existing software, we can attempt to α-certify that solution using any of these systems as input to Lemma 4.9 and Theorem 4.8.
As we collect α-certified critical points for the various objective functions, we can use this data to determine the correct subdivision, helping to answer our motivating question.
In practice, we have found that numerically computed solutions y * are often not αcertified, using any of the systems ∇S i = 0.However, using a brute-force search over all possible additional digits, we often can find one system ∇S j = 0 to which y * + ε is an αcertified solution.Here, ε = (ε 1 , . . ., ε n ) is a vector providing additional digits of precision to each component of y * .As we compute α-values for each y * + ε, we move in the direction which causes a decrease in the computed α-value, until we are able to find an α-certified y * + ε.We describe this in the following Algorithm 1: Testing certifiability by digit refinement Input: A system ∇S i = 0 coming from the ith candidate subdivision and a candidate approximate solution y * = (y 1 , . . ., y n ).Result: A refinement of the heights y * + ε along with alpha certification of the system, or inability to certify. 1 Let p be the number of trusted significant digits (in binary) of the approximate solution y * . 2 Expressing y * in binary, compute the α-value for all 3 n points y i + ϵ i 2 −p , ϵ i ∈ {−1, 0, 1}.Keep the point with the lowest alpha value, and set this as the new y i .3 If the alpha value is below 0.157671 stop and return the solution.If it has decreased between steps or remained the same, increase p by 1 and go to step 2. If there is no improvement for several loops in a row, stop and declare inability to certify the system.
Remark 4.11.Here we collect a few comments on Algorithm 1.
1. We note that this brute-force search over all possible digits could be replaced by any numerical procedure for finding solutions to a given set of equations, see for instance the refine command in the Numerical Algebraic Geometry package for Macaulay2 [36].For example, Newton iteration could be used on the system of equations to produce more accurate solutions, which could then be α-certified.However, to compare the outputs of LogConcDEAD for problems of increasing combinatorial complexity (see Table 1), we wanted to use a completely "blind" brute-force search as described above.
2. One does not need to stop at Step 3 once a solution is certified.Repeating the loop allows increasing the precision of the solution by moving to lower α values.This is in contrast to statistical software like LogConcDEAD which only allows up to 7 significant digits.
3. Although precision can be added, our (first) goal with Algorithm 1 is to find the correct subdivision induced by the heights.One can test several subdivisions here, therefore we say that we test the (approximate) solution against the corresponding system of equations.
4. It might happen that the α-value does not immediately decrease from one loop to the next even if we have the correct system of equations.One reason is that if the next significant digit is a zero for all heights, we are computing an α-value for the same point multiple times.
5. In step 1 of the above algorithm, we let p be the number of trusted significant digits of the approximate solution y * .We have found that several of the last digits of a solution computed with LogConcDEAD were incorrect, in the sense that if we start our search (in Algorithm 1) earlier in the significant digits of y * we are able to α-certify some y * + ε.
In this way, we can correct for some of the imprecision of a numerical solver.
Example 4.13.We now consider the same sample X = (2, 5, 7) with uniform weights.As discussed in Example 2.7, LogConcDEAD output suggests that the logarithm of the optimal density has a single region of linearity (Figure 7c).Can we certify this assessment?Recall that substituting y 2 = 2 5 y 1 + 3 5 y 3 to S(y 1 , y 2 , y 3 ) = The system of equations ∇ S = 0 does have solutions, and we were able to check that the numerical solution y * computed by LogConcDEAD is an α-certified solution to this amended system of equations.
Example 4.14.We used Algorithm 1 to certify the sample X = (0, 1, 2, . . ., n) ⊂ R for weights given by the binomial distribution with p = 6/11, i.e., w i = n i (6/11) i (5/11) n−i .Looking at the LogConcDEAD output, we suspect that the triangulation given by the points consists of all consecutive line segments {i − 1, i} for i ∈ 1, 2, . . ., n.We therefore compute α-values using the system of equations corresponding to the full triangulation.In all cases tested, we were able to certify the system for some refinement of the original LogConcDEAD output.In Table 1, we summarize the number of binary digits required for certification in each case.This table suggests that the complexity of α-certifying increases when the number of sample points increases.Table 1: Number of binary digits needed to certify n + 1 points with weights coming from an asymmetric binomial distribution.We now present an example in two dimensions that needs more significant digits than the previous cases.
After a few repetitions, this finds a point with a lower alpha value.We repeat this process, each time decreasing the exponent of 2 when creating the new test points.After 95 rounds we detect the refined point with alpha value 0.125519.Therefore, this new solution is α-certified.Note that this number has 34 decimal digits; we have rounded digits coming from the conversion from base 2 (109 digits) after this position.Our conclusion is that the triangulation obtained by the heights in the LogConcDEAD output is certifiably correct.
Example 4.16.We finish our paper by returning to our motivating example 1.1 from the introduction, and consider two possible subdivisions of P = conv(X) for the regions of linearity of the optimal tent function: The first subdivision ∆ 1 in Figure 8a arises from the LogConcDEAD output with default parameters after using the "unique" function.The second subdivision ∆ 2 in Figure 8b is given by the four regions of linearity in Figure 1 that we get by adjusting the precision in LogConcDEAD and then using the "unique" function.Unfortunately the objective functions involved have too many summands for α-certification to be feasible.
As an alternative, we use the NMaximize command in Mathematica directly on the objective functions S ∆ 1 and S ∆ 2 .The optimal y * ∆ 1 for the 7-cell subdivision gives a tent function whose regions of linearity are {{1, 2, 3}, {1, 8, 13}, {1, 3, 13}, {2, 3, 14}, {3, 13, 14}}, which are depicted in Figure 8c.This triangulation is not refined by the subdivision ∆ 1 : For example, the triangle {1, 3, 4} in the subdivision ∆ 1 intersects the interiors of triangles {1, 3, 13}, {2, 3, 14}, {3, 13, 14}.Thus the 7-cell subdivision ∆ 1 is not the subdivision that we are looking for.In fact, the vector y * ∆ 1 is not relevant, i.e. there exists x i such that h X,y * ∆ 1 (x i ) > y i , and as a result P exp(h X,y * In comparison, the optimal height vector that we obtain using LogConcDEAD is A computation in Polymake verifies that y * ∆ 2 gives a tent function whose regions of linearity are exactly the cells of ∆ 2 .This suggests that the 4-cell subdivision ∆ 2 is indeed the subdivision induced by the optimal y * in Example 1.1.We conclude with a haiku.

Approximate heights, subdivisions inexact.
A long road ahead.

Figure 1 :
Figure 1: The optimal tent function for the sample of 14 points in Example 1.1.

Figure 3 :
Figure 3: The red tent function corresponds to a vector y in the red region of the Samworth body.The solid green tent function corresponds to a vector y on the curve separating red and green regions of the Samworth body.The dotted green function is not convex.Its height vector y belongs to the green region of the Samworth body and both green sets of heights give the same tent function.

Figure 4 :
Figure 4: Maximizing S ∆ over y restricted to Y ∆ .

Example 2 . 8 .
then the maximum must be on the boundary of Y ∆ , see Figure4for an illustration.The boundary of Y ∆ is stratified into regions Y ∆ corresponding to the various subdivisions ∆ which are refined by ∆.Hence one should consider critical points for strictly coarser subdivisions ∆.We consider the point configuration X = {0, 1, 2, 3, 4} ⊆ R and the weight vector w = (3/15, 4/15, 5/15, 2/15, 1/15).This point configuration has exactly eight subdivisions.For each subdivision ∆, we use the Mathematica commmand NMaximize to find the maximum y * ∆ of the function S ∆ .For each subdivision ∆, the smallest piecewise-linear function f

Figure 5 :
Figure 5: Piecewise-linear functions induced by y * ∆ maximizing S ∆ for each subdivision ∆ in Example 2.8.The notation ij in the subcaptions refers to the set {i, j}.