Estimation of a discrete probability under constraint of k-monotony

We propose two least-squares estimators of a discrete probability under the constraint of k-monotony and study their statistical properties. We give a characterization of these estimators based on the decomposition of k-monotone sequences on a spline basis. We then develop an algorithm derived from the Support Reduction Algorithm and finally present a simulation study illustrating the properties of the estimators.


Introduction
The estimation of a density under shape constraint is a statistical problem that was first raised by Grenander [17] in 1956, in the case of a density under monotony constraint. Over the past 30 years there have been several studies of estimators under shape constraints, most of them maximum likelihood or least-squares estimators. In these cases the authors characterize the estimators, study their asymptotic law and rate of convergence, and discuss their implementation. The constraints considered are, for example, monotony, convexity, log-concavity (f is log-concave if log(f) is concave) and k−monotony.
The k−monotony notion was introduced by Knopp [22] in 1929 for discrete functions: it generalizes to the k-th order the notion of convex series (or 2−monotone series) and corresponds to a sign condition on the k-th differences. In 1941, Feller [15] extended that definition to k−monotone continuous functions, and Williamson [26] characterized these k−monotone functions through their decomposition on a spline basis:

Property 1 (Williamson, 1955) Let g be a continuous function and let k ≥ 2. The function g is k−monotone if and only if there exists a nonnegative measure µ on R₊* such that g(x) = ∫ (t − x)₊^{k−1} dµ(t) for all x > 0.

Consequently, k−monotone functions can be described in an integral form. The estimation of a k−monotone distribution has been studied by Balabdaoui et al. ([1], [5], [4]): they proposed the maximum likelihood and least-squares estimators under the k−monotony constraint in the continuous setting, and studied their theoretical properties (consistency and rate of convergence) as well as their limit distribution. They also discussed the adaptation of an algorithm proposed by Groeneboom et al. [19].
Most of the work on estimation under shape constraint has focused on densities supported on R or on an interval but, recently, discrete probabilities have gained interest because of their numerous applications in ecology and financial mathematics (see [13] or [24]). Jankowski and Wellner [20] studied estimation under the monotony constraint and Balabdaoui et al. [2] investigated log-concave discrete densities. More recently, the estimation of a convex discrete distribution was treated by Durot et al. ([12], [13]) and Balabdaoui et al. [2].
In this article we propose two least-squares estimators of a k-monotone discrete probability, with k ≥ 2. The first one is the projection of the empirical estimator on the set of k−monotone sequences; the second one is the projection of the empirical estimator on the set of k−monotone probabilities. We show the existence of these estimators and give a characterization of each of them, based on the decomposition of k−monotone sequences on a spline basis established by Lefevre and Loisel [24]. Thanks to this characterization we generalize several results known in the convex case (k = 2, see [12]) to k > 2, as for example the comparison with the empirical estimator (Theorem 3).
However, differences between the convex case and the case k > 2 arise. First, the projection of a discrete probability on the set of k−monotone sequences is not a probability in general when k > 2. This structural property of the set of k−monotone functions, k ≥ 3, justifies the definition of two different estimators, while they coincide in the convex case. Secondly, the proofs of some other properties require new tools. In fact, the results about the support of our estimator require a control of the decrease of the tails of k−monotone probabilities, while truncation is sufficient in the convex case.
Although the construction of our estimators is inspired by the work of Balabdaoui [1], our results are not deduced from the continuous case. In fact, for k ≥ 3, unlike in the convex case, we could neither construct a k−monotone density that goes through the points of a k−monotone sequence, nor approach a k−monotone sequence with a k−monotone density. It is however interesting to note that connecting the points of a convex sequence can provide a convex continuous function, because no differentiability assumption is required in the definition of convexity. Moreover, the practical implementation of the estimator is structurally different from the continuous case. For the discrete case we implement the estimators using exact iterative algorithms inspired by the Support Reduction Algorithm described in Groeneboom et al. [19], and we discuss a practical stopping criterion (see Section 4). Differences with the continuous case also emerge when we consider the rate of convergence in terms of the l2-error, since our estimators are consistent with the typical parametric √n-rate of convergence (see Theorem 5).
The paper is organized as follows: the definition of k−monotony and some properties of k−monotone discrete functions are recalled in Section 2, and a characterization of the estimator is given in Section 3.1. Statistical properties of this estimator are then presented in Section 3.3. In Section 4, a method to implement the estimator in practice using the Support Reduction Algorithm of Groeneboom et al. [19] is presented. The stopping criterion for this algorithm, which differs from the one used in the convex case (k = 2), is also discussed. In Section 5 we discuss the possibility of choosing an estimator on the set of k−monotone sequences instead of the set of k−monotone probabilities. Finally, a simulation study is presented in Section 6.

Characterizing k-monotone sequences
Let us begin with a list of notation and definitions that will be used throughout the paper.
The same notation is used to denote a discrete function f : N → R₊ and the corresponding sequence of real numbers (f(j), j ∈ N). For all r ∈ N \ {0}, the classical L^r-norm of f is defined as follows:
‖f‖_r = ( Σ_{i∈N} f(i)^r )^{1/r},
and we denote by L^r(N) the set of functions f such that ‖f‖_r is finite. In particular, L²(N) is a Hilbert space and the associated scalar product is denoted ⟨·, ·⟩.
For any integer k ≥ 1, let ∆^k f be the k-th difference operator of f, defined for all i ≥ 0 by the following recurrence equation:
∆¹f(i) = f(i + 1) − f(i), and ∆^k f(i) = ∆^{k−1}f(i + 1) − ∆^{k−1}f(i) for k ≥ 2.
It is easy to see that the operator ∆^k satisfies the following equation:
∆^k f(i) = Σ_{j=0}^{k} (−1)^{k−j} C(k, j) f(i + j),
where C(a, b) denotes the binomial coefficient "a choose b".

Definitions
• A sequence f is said to be k-monotone if, for all i ∈ N, (−1)^k ∆^k f(i) ≥ 0.
• Let f be a k-monotone sequence. The integers i such that (−1)^k ∆^k f(i) > 0 are called the k-knots of f. If, for all integers i in the support of f, the quantities (−1)^k ∆^k f(i) are strictly positive, f is said to be strictly k-monotone.
• The maximum s_f of the support of f is defined as s_f = sup{i ∈ N : f(i) > 0} and may be infinite.
Let us remark that if the support of a k-monotone sequence f is finite, then s f is a k-knot.
An example of a k−monotone function in L¹(N) is the non-negative and nonincreasing piecewise polynomial function of degree k − 1 given by f(i) = max(0, m − i)^{k−1} for some positive constant m, as illustrated by the sketch below.
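The definition above translates directly into a few lines of R, the language used for the simulations of Section 6. The following sketch is ours, not the authors' published code; it checks k-monotony only on the observed range of a finite sequence.

```r
# k-th difference of a sequence f = (f(0), ..., f(N)) via iterated diff():
# diff(d) returns d(i+1) - d(i), so k applications give Delta^k f.
delta_k <- function(f, k) {
  d <- f
  for (j in seq_len(k)) d <- diff(d)
  d
}

# A sequence is k-monotone when (-1)^k * Delta^k f(i) >= 0 for all i.
is_k_monotone <- function(f, k) all((-1)^k * delta_k(f, k) >= 0)

# Example from the text: f(i) = max(0, m - i)^(k-1) is k-monotone.
m <- 10; k <- 3
f <- pmax(0, m - (0:20))^(k - 1)
is_k_monotone(f, k)  # TRUE
```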
It should be noticed that if a sequence f is k-monotone for some k ≥ 2, then for all j < k it is strictly j−monotone on its support. This property, shown in Section 7.3.1, is not true in general in the continuous case (see Balabdaoui [1] for example).
Finally we will denote by S k the set of k-monotone sequences that are in L 1 (N), and by P k the set of k-monotone probabilities on N. We denote by P the set of probabilities on N.

Decomposition on a spline basis
The characterization of k-monotone functions defined on R as a mixture of polynomial functions has been established by Lévy [25], and the inversion formula that specifies the mixture function follows from the results of Williamson [26] (see Lemma 1. in [5] for example). In the case of k-monotone sequences a similar decomposition has been simultaneously established for convex sequences by Durot et al. [12], and in the more general case of k-monotony by Lefevre and Loisel [24]. Many of our proofs will rely on this decomposition.
For any integer k, let us define a basis of spline sequences (Q̃_j^k)_{j∈N} as follows: for all i, j ∈ N,
Q̃_j^k(i) = C(k − 1 + j − i, k − 1) if 0 ≤ i ≤ j, and Q̃_j^k(i) = 0 if i > j. (1)
Let m_j^k be the mass of Q̃_j^k, namely m_j^k = Σ_{i=0}^{j} Q̃_j^k(i) = C(k + j, k), and let Q_j^k = Q̃_j^k / m_j^k be the normalized spline. We can now formulate the mixture representation of k-monotone sequences.
Property 2 (Lefevre and Loisel [24]) Let f be a sequence in L¹(N).
• The sequence f is k−monotone if and only if there exists a positive measure π on N such that, for all i ∈ N, f(i) satisfies:
f(i) = Σ_{j∈N} π(j) Q_j^k(i). (2)
• If f is k−monotone, the measure π is unique and defined as follows:
π(j) = (−1)^k ∆^k f(j) m_j^k for all j ∈ N.
In particular Q_j^k is k−monotone, and the set of k-knots of f is the set of integers j such that π(j) is strictly positive.
These properties are shown in Lefevre and Loisel [24]. From this representation it appears that monotone discrete probabilities are mixtures of uniform distributions, convex probabilities are mixtures of triangular distributions, and, more generally, k-monotone probabilities are mixtures of splines of degree k − 1.
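The mixture representation can be checked numerically. The sketch below is our illustration (not code from the paper): it builds the splines of Equation (1), recovers the mixture measure π of Property 2 from the k-th differences of a k-monotone sequence, and rebuilds the sequence from its k-knots.

```r
# Spline basis and mass, following Equation (1) and Property 2.
Q_tilde <- function(j, k, N) {            # Q~_j^k(i) = C(k-1+j-i, k-1), i <= j
  i <- 0:N
  ifelse(i <= j, choose(k - 1 + j - i, k - 1), 0)
}
m_jk <- function(j, k) choose(j + k, k)   # mass of Q~_j^k
Q <- function(j, k, N) Q_tilde(j, k, N) / m_jk(j, k)

# Recover the mixture measure pi(j) = (-1)^k Delta^k f(j) * m_j^k and rebuild f.
k <- 3; N <- 30
f <- pmax(0, 8 - (0:N))^(k - 1)           # a 3-monotone sequence
d <- diff(f, differences = k)             # Delta^k f(j) for j = 0..N-k
pi_f <- (-1)^k * d * m_jk(0:(N - k), k)
f_rebuilt <- Reduce(`+`, lapply(which(pi_f > 0), function(j1)
  pi_f[j1] * Q(j1 - 1, k, N)))            # j1 is a 1-based index, j = j1 - 1
max(abs(f_rebuilt - f))                   # ~ 0: exact reconstruction
```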
Constrained least-squares estimation on the set of k-monotone discrete probabilities

Suppose that we observe n i.i.d. random variables X₁, . . . , X_n with distribution p on N, such that for all i = 1, . . . , n and j ∈ N, p(j) = P(X_i = j).
We propose to build an estimator of p that satisfies the k-monotony constraint.
Since the projection on the set S_k of k−monotone sequences is not a probability in general for k ≥ 3 (see Section 5), we consider the least-squares estimator p̂ defined as follows:
p̂ = argmin{ ‖f − p̃_n‖₂² , f ∈ P_k }, (4)
where p̃_n is the empirical estimator of p:
p̃_n(j) = (1/n) Σ_{i=1}^{n} 1{X_i = j} for all j ∈ N.
Since the set P_k of k−monotone discrete probabilities is convex and closed in the Hilbert space L²(N), it follows from the projection theorem for Hilbert spaces that p̂ exists and is unique.

Characterizing the constrained least-squares estimator
A cornerstone for deriving the statistical properties of our estimator is the following characterization of p̂. Let us begin with a few notation. For any positive sequence f in L¹(N) we define the j-th primitive of f as follows: for all i ∈ N,
F_f^1(i) = Σ_{l=0}^{i} f(l), and F_f^j(i) = Σ_{l=0}^{i} F_f^{j−1}(l) for j ≥ 2.
Moreover, we define the quantity
β(f) = ⟨f, f − p̃_n⟩. (5)

Theorem 1 Let p̃_n ∈ P. The projection p̂ defined at Equation (4) is the unique k−monotone probability f satisfying:
1. For all l ∈ N,
F_f^k(l) ≥ F_{p̃_n}^k(l) + β(f) m_l^k. (6)
2. If l is a k−knot of f, then the previous inequality is an equality.
The proof of this theorem is given in Section 7.1.1. It uses the connections between successive primitives of the spline sequences (Q̃_j^k, j ∈ N). In the particular case of convexity, the same result can be established with 0 in place of β(f); see Lemma 2 in [12]. Let us recall that, in that case, we have the nice property that the least-squares estimator over convex sequences is a convex probability distribution. This property is no longer satisfied when k ≥ 3. We will come back to this point in Section 5.
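The primitives F_f^j are iterated cumulative sums, which makes the quantities in Theorem 1 easy to compute. The sketch below is ours; the identity it checks, ⟨f, Q̃_l^k⟩ = F_f^k(l), is our reconstruction of the spline-primitive connection used in the proofs.

```r
# k-th primitive of f by iterated cumulative sums: F^1 = cumsum(f), etc.
F_k <- function(f, k) { for (j in seq_len(k)) f <- cumsum(f); f }

k <- 3; N <- 40; l <- 12
f <- dpois(0:N, lambda = 0.3)              # any summable nonnegative sequence
Qt <- ifelse(0:N <= l, choose(k - 1 + l - (0:N), k - 1), 0)
sum(f * Qt)                                # <f, Q~_l^k>
F_k(f, k)[l + 1]                           # F_f^k(l); equal up to rounding
```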

Support of p̂
A key feature of the estimator p̂ is that its support is finite. Let us denote by s̃ = s_{p̃_n}, respectively ŝ = s_{p̂}, the maximum of the support of p̃_n, respectively p̂.
Theorem 2 Let p̂ be the least-squares estimator defined by Equation (4). Then the support of p̂ is finite.
In the particular case of convexity, when k = 2, it is shown that ŝ ≤ s̃. Whether such a property still holds for k ≥ 3 remains an open question.

Statistical properties of p̂ when p is k-monotone
Let us now evaluate the behaviour of p̂ for estimating a probability and, in particular, how it compares with the empirical estimator p̃_n. The following theorem proves that the constrained least-squares estimator is closer (with respect to the L²-norm) to any k-monotone probability than p̃_n is.
Theorem 3 For any k−monotone probability f, the following inequality is satisfied:
‖p̂ − f‖₂ ≤ ‖p̃_n − f‖₂. (7)
If p̃_n is not k−monotone, then the inequality is strict. Moreover, if p is k−monotone and if there exists i ∈ N such that ∆^k p(i) = 0, then for all k−monotone probabilities f we have:
lim inf_{n→∞} P( ‖p̂ − f‖₂ < ‖p̃_n − f‖₂ ) ≥ 1/2.
In particular, if p is k-monotone and not strictly k-monotone, the estimator p̂ is strictly closer to p than p̃_n is, with probability at least 1/2. This theorem is a straightforward generalization of Theorem 4 in [12] for the convex case and its proof is omitted.
Theorem 4 compares the moments of the distributions p̂ and p̃_n; its conclusion, Equation (8), involves the correction term β(p̂)(ŝ + 1 − a).
If p̂ satisfies β(p̂) = 0, the result is the same as the one obtained in the convex case. In fact, it will be stated in Section 5 that β(p̂) ≤ 0, and that β(p̂) = 0 if the minimizer of ‖f − p̃_n‖² over f ∈ S_k equals the minimizer of ‖f − p̃_n‖² over f ∈ P_k. This is the case if p̃_n is k-monotone, or if k = 2.

Asymptotic properties of p̂
In this section we consider the asymptotic properties of p̂ as the sample size n tends to infinity. We first establish the consistency of p̂, both when the model is well specified and when it is misspecified.
Theorem 5 Let p^k be the orthogonal projection of p on the set P_k. Then, for all r ∈ [2, +∞], the random variable √n ‖p^k − p̂‖_r is bounded in probability.
In particular, this theorem states that if the distribution p is k-monotone, then p̂ converges to p at the parametric rate √n with respect to the L^r-norm.
The case of a finite support. In the particular case where the distribution p is k-monotone and has a finite support, we characterize the asymptotic behaviour of the k-knots of p̂ and give an upper bound for ŝ, the maximum of the support of p̂.
Theorem 6 Let p be a k−monotone probability with finite support.
1. Let j ∈ N be a k−knot of p. Then, with probability one, there exists n₀ ∈ N such that for all n ≥ n₀, j is a k-knot of p̂.
2. Let s, respectively ŝ, be the maximum of the support of p, respectively p̂. Then, with probability one, there exists n₀ ∈ N such that for all n ≥ n₀ we have
• ŝ ≤ s if k is even.
The proof of the first part of the theorem (see Section 7.1.5) is based on the fact that, for all j ∈ N, ∆^k p̂(j) converges almost surely to ∆^k p(j). It follows that if j is a k-knot of p, then (−1)^k ∆^k p̂(j) will be strictly positive for n large enough. Conversely, if j is not a k-knot of p, which means that ∆^k p(j) = 0, then (−1)^k ∆^k p̂(j) may be strictly positive for all n. Therefore, the set of k-knots of p̂ does not estimate consistently the set of k-knots of p.
Concerning the second part of the theorem, we can notice that the result we get concerning ŝ is weaker than what was obtained in the convex case. Indeed, when k = 2, we know that ŝ ≤ s̃, and consequently that ŝ = s for n large enough if p has a finite support.

Implementing the estimator p̂
The practical implementation of p̂ requires a specific algorithm composed of two parts. The first part consists in solving the problem defined at Equation (4) for sequences f whose support is finite. More precisely, for a chosen positive integer L, we compute p̂_L, the minimizer of ‖f − p̃_n‖² over the sequences f ∈ P_k whose support is included in {0, . . . , L}. The second part consists in checking whether p̂_L = p̂. For that purpose, starting from Theorem 1, we propose a stopping criterion that can be calculated in practice.

Constrained least-squares estimation on a given finite support
We know from Property 2 that if f ∈ P_k, there exists a unique probability π on N such that f and π satisfy Equation (2). Therefore, solving (4) is equivalent to minimizing, over the set of probabilities π on N, the criterion Ψ(π) defined as follows:
Ψ(π) = ½ ‖ Σ_{j≥0} π(j) Q_j^k − p̃_n ‖₂².
The first part of our algorithm computes p̂_L = Σ_{j≥0} π̂_L(j) Q_j^k, where π̂_L is the minimizer of Ψ(π) over the probabilities π whose support is included in {0, . . . , L}:
π̂_L = argmin{ Ψ(π), π ∈ P, s_π ≤ L }. (9)
The algorithm we use to compute π̂_L is based on the support reduction algorithm introduced by Groeneboom et al. [19]. In the particular case k = 2 it was developed by Durot et al. [12]. Nevertheless, when k ≥ 3, an adaptation is needed to guarantee that π̂_L is a probability. Let us underline that this algorithm is proven to give the exact solution in a finite number of steps.
For all ν ∈ P, let D_ν Ψ be the directional derivative of Ψ in the direction ν, defined as follows:
D_ν Ψ(µ) = lim_{ε↓0} [ Ψ((1 − ε)µ + εν) − Ψ(µ) ] / ε.
The Support Reduction Algorithm is based on the property that π̂_L is the solution of (9) if and only if the directional derivatives calculated at π̂_L in the directions ν = δ_j, where δ_j denotes the Dirac probability at {j}, are nonnegative for all 0 ≤ j ≤ L. Moreover, these derivatives are exactly 0 for all j in the support of π̂_L.
Starting from this property, the Support Reduction Algorithm is composed of two steps. In the first step, the support of the current probability µ is augmented by a point j where D_{δ_j}Ψ(µ) is strictly negative (if any). In the second step, Ψ(µ) is minimized over the sequences µ whose support is the current support and which satisfy Σ_{j≥0} µ(j) = 1; the current support is then reduced so as to obtain a nonnegative sequence.
Let us introduce the notation used in the second step of the algorithm. For a set S = {j₁, . . . , j_s} ⊂ {0, . . . , L}, we denote by Q_S and H_S the matrices of the quadratic problem restricted to S that appear in the proof of Lemma 3, and by I the vector with L + 1 components all equal to 1.
The algorithm for computing π̂_L for a fixed L is given in Table 1. It is shown in Section 7.2 that this algorithm returns π̂_L in a finite number of steps.

Stopping criterion
The second part of the procedure consists in finding L such that p̂_L = p̂.
The characterization of p̂ given by Theorem 1 cannot be used directly for the practical implementation, because the necessary condition in that theorem requires an infinite number of calculations. Nevertheless, if f is a k-monotone probability with maximum of support s_f, it is possible to find an integer M such that if f satisfies Inequality (6) for all l ≤ M, then f satisfies Inequality (6) for all l > M. Such a property results from writing the difference between the two sides of Inequality (6) as a polynomial function of the variable l. On the one hand, Property 3, shown in Section 7.3.2, states that F_f^k(l) − F_{p̃_n}^k(l) is a polynomial function of l of degree at most k − 1 as soon as l is greater than the maxima of the supports of f and p̃_n.
Property 3 Let f be a discrete sequence with finite support and let s_f be the maximum of its support. Let τ = max(s_f, s̃). Then for all l ≥ τ + 1 we have the following equalities:
F_f^k(l) = Σ_{i=0}^{s_f} f(i) C(l − i + k − 1, k − 1) and F_{p̃_n}^k(l) = Σ_{i=0}^{s̃} p̃_n(i) C(l − i + k − 1, k − 1). (10)

• Step 1 :
– If for all j ∈ {0, . . . , L}, D_{δ_j}Ψ(π) ≥ 0, stop.
– Else, choose j such that D_{δ_j}Ψ(π) < 0, set S′ ← S ∪ {j} and go to Step 2.
• Step 2 :
– Compute π_{S′}, the minimizer of Ψ over the sequences supported on S′ with total mass one.
– If for all l ∈ S′, π_{S′}(l) ≥ 0, set π ← π_{S′}, S ← S′ and return to Step 1.
– Else, reduce the support S′ as in Lemma 4 and return to Step 2.
Table 1 Algorithm for computing π̂_L for a fixed L.
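A compact R transcription of Table 1 is sketched below. It is not the authors' published code: the function name sra, the starting support {L} and the resolution of the mass-constrained least-squares subproblem through its KKT linear system are our own choices, made under the reconstructed criterion Ψ(π) = ½‖Σ_j π(j)Q_j^k − p̃_n‖².

```r
# Sketch of the algorithm of Table 1; columns of B are the splines Q_j^k.
sra <- function(p_tilde, k, L, tol = 1e-10) {
  N <- length(p_tilde) - 1L
  B <- sapply(0:L, function(j) {
    q <- ifelse(0:N <= j, choose(k - 1 + j - (0:N), k - 1), 0)
    q / choose(j + k, k)
  })
  dir_deriv <- function(j1, pi_full) {        # D_{delta_j} Psi, with j1 = j + 1
    f <- B %*% pi_full
    sum((f - p_tilde) * (B[, j1] - f))
  }
  ls_on_support <- function(S) {              # min Psi, supp(pi) = S, sum(pi) = 1
    G <- crossprod(B[, S, drop = FALSE])
    K <- rbind(cbind(G, 1), c(rep(1, length(S)), 0))
    rhs <- c(crossprod(B[, S, drop = FALSE], p_tilde), 1)
    solve(K, rhs)[seq_along(S)]
  }
  S <- L + 1L; pi_S <- 1                      # start from pi = delta_L
  repeat {
    pi_full <- numeric(L + 1L); pi_full[S] <- pi_S
    dd <- sapply(1:(L + 1L), dir_deriv, pi_full = pi_full)
    if (all(dd >= -tol)) return(pi_full)      # optimality criterion (Lemma 2)
    S_new <- sort(c(S, which.min(dd)))        # Step 1: add a descent direction
    repeat {                                  # Step 2: minimize, then reduce
      cand <- ls_on_support(S_new)
      if (all(cand >= -tol)) { S <- S_new; pi_S <- cand; break }
      old <- numeric(L + 1L); old[S] <- pi_S
      new <- numeric(L + 1L); new[S_new] <- cand
      bad <- which(new < 0)
      eps <- min(old[bad] / (old[bad] - new[bad]))     # largest feasible step
      S_new <- which((1 - eps) * old + eps * new > tol) # drop zeroed points
    }
  }
}
# Usage sketch: pi_hat <- sra(p_tilde, k = 2, L = 30); the estimator is then
# the mixture sum_j pi_hat[j + 1] * Q_j^k of Property 2.
```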
On the other hand, starting from Pascal's rule, it is easy to see that for all k ≥ 2 and l ≥ 0,
m_l^k = C(l + k, k), (11)
which is a polynomial function of l of degree k. Putting Equations (10) and (11) together, it appears that for all l ≥ τ = max(s_f, s̃) there exist coefficients (a₀, a₁, . . . , a_{k−1}) such that the difference between the two sides of Inequality (6) can be written as the polynomial Σ_j a_j l^j. Let d be the degree of this polynomial (the smallest integer such that a_j = 0 for all j ≥ d + 1) and let M be defined by
M = 1 + max_{0≤j≤d−1} |a_j| / |a_d|.
By Cauchy's theorem for the localization of polynomial roots, the largest root of this polynomial is smaller than M.
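The bound M is one line of R. The sketch below is ours; the coefficient vector a and the bound M = 1 + max_j |a_j|/|a_d| follow the reconstruction above.

```r
# Cauchy's bound on the roots of a_0 + a_1*l + ... (a[j + 1] stores a_j);
# every root has modulus below 1 + max_{j < d} |a_j| / |a_d| (degree >= 1 assumed).
cauchy_bound <- function(a) {
  d <- max(which(abs(a) > 0))              # index of the leading coefficient
  1 + max(abs(a[seq_len(d - 1)])) / abs(a[d])
}
cauchy_bound(c(-6, 11, -6, 1))             # polynomial (l-1)(l-2)(l-3): bound 12
```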
This leads to the following characterization of p̂, which is a corollary of Theorem 1. Its proof is omitted.

Theorem 7 Let f be a sequence in P with a finite support. Let M and a_d be defined as above. The two following assertions are equivalent:
1. For all l ≤ M, Inequality (6) is satisfied and, if l is a k−knot of f, the previous inequality is an equality.
2. The sequence f is exactly p̂.
This theorem answers our initial problem, namely checking whether p̂_L equals p̂: the coefficients a_j, which depend on p̂_L and p̃_n, can be calculated in practice, as well as M. Nevertheless, M can be very large, leading to tedious computations. For small values of k, more efficient criteria can be proposed. In particular, for k = 3 and k = 4, it is possible to obtain a characterization of p̂ that only depends on s̃ and ŝ. This is the object of the following theorem, shown in Section 7.1.6.
Theorem 8 Let f be a sequence in P with a finite support and let s̄ = max{s_f, s̃}. Let k ∈ {3, 4}. The two following assertions are equivalent:
1. For all l ≤ s̄ + 1, Inequality (6) is satisfied and, if l is a k−knot of f, the previous inequality is an equality.
2. The sequence f is exactly p̂.
When k > 4 we are not able to propose a similar stopping criterion. Indeed, the proof is based on the properties of the spline functions Q_j^k for k > 4 and, in particular, requires the calculation of the number of k′-knots of Q_j^k for k′ ≤ k, which is intractable.

Constrained least-squares estimation on the set of k-monotone sequences

By definition, our estimator p̂ is a probability. We could instead have proposed to estimate p by minimizing the least-squares criterion under the constraint of k-monotony only. Let p̂* be that estimator:
p̂* = argmin{ ‖f − p̃_n‖₂² , f ∈ S_k }. (12)
The following property, shown in Section 7.3.3, establishes the link between the two estimators.
Property 4 Let p̂ and p̂* be defined at Equations (4) and (12), and let β be defined at Equation (5). The coefficient β(p̂) is null if and only if p̂ = p̂*.
In the particular case where k = 2, p̂* is exactly p̂ (see Theorem 1 in [12]). As soon as k ≥ 3, this property is no longer satisfied in general, and it is proven in Section 7.3.4 that the mass of p̂* is greater than or equal to 1. This result was expected, because a similar property was shown by Balabdaoui [1] in the continuous framework.
To illustrate this point in the discrete framework, let us consider the projection of δ₁ (the Dirac probability at 1) on the set of 3−monotone sequences. Some calculation (see Section 7.3.7 for a proof) shows that this projection can be written as a combination αQ₅³ + βQ₆³ of two splines, and that its mass is close to 1.14.
Nevertheless we can show (see Section 7.3.5) the following asymptotic result.
Property 5 Let p be a k−monotone probability and let p̂* be defined at Equation (12). Then, with probability one, the mass of p̂* converges to one.
The properties shown in Section 3 for the estimator p̂ hold true for p̂*. More precisely, the estimator p̂* satisfies Theorems 2, 5 and 6. Theorem 3 is also true for p̂*, apart from the first assertion, where Equation (7) is satisfied for any k-monotone sequence f. Finally, Theorem 4 is true with 0 in place of β(p̂)(ŝ + 1 − a) in Equation (8).
The implementation of p̂* is similar to that of p̂, except that in the first part of the algorithm (see Section 4.1) the Support Reduction Algorithm can be used without any modification at Step 2 (where the estimator of π is no longer constrained to sum to one). The stopping criterion used in the second part is obtained in the same way as for p̂. The proofs of these last results are omitted; they are based on Property 4.

Simulation
We designed a simulation study to assess the quality of the least-squares estimator p̂ on the set of k-monotone probabilities, as compared to the empirical estimator p̃ and to the least-squares estimator p̂* on the set of k-monotone sequences, for k ∈ {2, 3, 4}.

Simulation Design
We considered two shapes for the distribution p: the spline distributions Q_j^ℓ with j = 10 and ℓ ∈ {2, 3, 4, 10}, and the Poisson distributions P(λ) for λ ∈ {0.3, 0.35, 0.45, 2 − √2, 0.7, 1}. These two families of distributions differ by the finiteness of their support and by the number of knots in their decomposition on the spline basis. Precisely, the distribution Q_j^ℓ has one ℓ-knot, at j, while an ℓ-monotone Poisson distribution has an infinite number of ℓ-knots. The following property, shown in Section 7.3.6, characterizes the k-monotony of Poisson distributions.
Property 6 Let P(λ) be the Poisson distribution with parameter λ. For each ℓ ≥ 1, let λ_ℓ be defined as the smallest positive root of the following polynomial function:
P_ℓ(λ) = Σ_{j=0}^{ℓ} (−1)^j C(ℓ, j) λ^j / j!.
Then P(λ) is ℓ-monotone if and only if λ ≤ λ_ℓ.

Some simple calculation gives the following values: λ₁ = 1, λ₂ = 2 − √2 ≈ 0.59, λ₃ ≈ 0.41 and λ₄ ≈ 0.32. For each distribution p we considered several values of the sample size n: n ∈ {20, 50, 100, 250, 500, 1000}. In some cases we also considered very large values of n in order to illustrate the asymptotic framework. We denote by p̃_n the empirical estimator and by p̂_n^k, respectively p̂_n^{*k}, the least-squares estimator of p on the set of k-monotone probabilities, respectively sequences. For each simulation configuration, 1000 random samples were generated. The simulations were carried out with R (www.r-project.org), the code being available at the following web address: http://www.maiage.jouy.inra.fr/jgiguelay
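The thresholds λ_ℓ can be recomputed from the polynomial of Property 6 (as reconstructed above); a minimal R sketch of ours, whose output matches the design of the study:

```r
# lambda_l: smallest positive real root of P_l(x) = sum_j (-1)^j C(l, j) x^j / j!
lambda_l <- function(l) {
  coefs <- (-1)^(0:l) * choose(l, 0:l) / factorial(0:l)
  r <- polyroot(coefs)                   # coefficients in increasing degree
  min(Re(r[abs(Im(r)) < 1e-8 & Re(r) > 0]))
}
round(sapply(1:4, lambda_l), 3)          # 1.000, 0.586 (= 2 - sqrt(2)), 0.415, 0.322
```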

Global fit
To assess the quality of the estimators of the distribution p, we consider the l2-loss and the Hellinger loss. We have also considered the total variation loss, but the results are not shown because they are very similar to those obtained for the l2-loss.

Estimators comparison based on the l2-loss
The l2-loss between p and any estimator q of p is defined as the expectation of the l2-error: l2(p, q) = E(‖p − q‖₂²). We first compared the quality of fit of the estimators p̂_n^k and p̃_n by computing, for each simulated sample, ‖p − p̂_n^k‖₂² and ‖p − p̃_n‖₂². The l2-losses were estimated by the mean of 1000 independent replications of the l2-errors. In all simulation configurations, the l2-losses decrease towards 0 when n increases. In what follows we consider the ratios l2(p, p̂_n^k)/l2(p, p̃_n) to compare the estimators, computed along the lines of the sketch below.
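The following sketch is ours; the helper fit_k_monotone is an assumed stand-in for the constrained estimator of Section 4 (e.g. built on sra() above), not a function from the paper's code.

```r
# Monte Carlo estimate of the ratio l2(p, p_hat_k) / l2(p, p_tilde).
l2_ratio <- function(p, n, k, nrep = 1000, fit_k_monotone) {
  N <- length(p) - 1L
  errs <- replicate(nrep, {
    x <- sample(0:N, n, replace = TRUE, prob = p)
    p_tilde <- tabulate(x + 1L, nbins = N + 1L) / n
    c(cons = sum((fit_k_monotone(p_tilde, k) - p)^2),  # constrained estimator
      emp  = sum((p_tilde - p)^2))                     # empirical estimator
  })
  mean(errs["cons", ]) / mean(errs["emp", ])
}
```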
The results for the spline distributions Q_j^ℓ are presented in Figure 1. When n is small, p̂_n^k has a smaller l2-loss than p̃_n whatever the value of k. When n tends to infinity, two cases have to be considered, according to the discrepancy between k, the degree of monotony imposed on the estimator, and ℓ, the degree of monotony of p. As expected from Theorem 3, when k ≤ ℓ the ratio is smaller than 1. Moreover, we note that the smaller the deviation ℓ − k is, the smaller the ratio. In particular, when k = ℓ the ratio tends to a constant strictly smaller than 1, while when k < ℓ the ratio tends to 1. For example, when ℓ = 4 and k = 3, the ratio of the l2-losses equals 0.45 for n = 10000 and 0.80 for n = 100000. When k > ℓ, the ratio tends to infinity. For example, when ℓ = 2 and k = 3, the ratio of the l2-losses equals 9.93 for n = 10000 and 259 for n = 100000. This result was expected because the empirical estimator p̃_n is consistent while our estimator is not. Indeed, following Theorem 5, the ratio of l2-losses is of order n‖p − p^k‖₂², which is zero if p is ℓ-monotone and k ≤ ℓ.

The results for the Poisson distributions are similar to those obtained for the spline distributions, except that the asymptotics are reached for smaller values of the sample size n. Only the case λ = 0.35, where the corresponding Poisson distribution is 3-monotone, is presented in Figure 2. It appears that when k = 2 the ratio of l2-losses tends to one, when k = 3 it tends to a value close to 0.9, and when k = 4 it tends to infinity.

Finally, we compare the l2-losses of the estimators p̂_n^k, p̂_n^{*k} and p̃_n for k = 3 and k = 4 (recall that for k = 2, p̂_n^{*k} = p̂_n^k). The ratios l2(p, p̂_n^{*k})/l2(p, p̃_n) behave similarly to the ratios l2(p, p̂_n^k)/l2(p, p̃_n) (not shown). Next we compare the values of the l2-losses of p̂_n^{*k} and p̂_n^k. For the spline distributions Q_j^ℓ with ℓ = 2 and ℓ = 3, the differences between the l2-losses are not significant (they are smaller than twice their empirical standard error calculated on the basis of the 1000 simulations). When ℓ increases, the distribution p is more hollow, and it appears that l2(p, p̂_n^{*k}) is greater than l2(p, p̂_n^k); see Table 2.

Estimators comparison based on the Hellinger loss
Let us now consider the Hellinger loss defined, for any estimator q of p, as H(p, q) = E(‖√p − √q‖₂²). The results for the spline distributions Q_j^ℓ are similar to those obtained for the l2-loss, except that the ratios H(p, p̂_n^k)/H(p, p̃_n) are not necessarily smaller than 1 when k ≤ ℓ; see Figure 3 for the triangular distribution Q₁₀².

Table 2: Spline distributions: difference (×1000) between the l2-loss of p̂_n^{*k} and the l2-loss of p̂_n^k, for different values of n, for k = 3 (green) and k = 4 (blue). The symbol "-" stands for a non-significant result.

In the case of the Poisson distributions, the differences between the l2-loss and the Hellinger loss are more obvious. As illustrated by Figure 4 (where the degree of monotony ℓ of each distribution is the one given by Property 6), if the degree of monotony of p is strictly greater than k, then the ratio is smaller than 1 (see case (a) with k = 2, 3 and case (b) with k = 2). If k = ℓ, then H(p, p̂_n^k) is smaller than H(p, p̃_n) if the distribution p is "ℓ-monotone enough", that is to say if the parameter λ of the Poisson distribution is such that λ_ℓ − λ is large enough, where λ_ℓ has been defined in Property 6; see for example cases (c) and (d) with k = 2, where λ₂ = 2 − √2.

Some characteristics of interest
We consider the estimation of some characteristics of interest, such as the entropy, the variance and the probability at 0. For each of these characteristics, denoted L(p), we measure the performance in terms of the root mean squared error of prediction, calculated as follows:
RMSEP = √(BIAS² + SE²),
where BIAS and SE are the estimated bias and standard error of the estimator based on the simulations. Let L̂ be an estimator of L(p); then BIAS = L̄ − L(p), where L̄ = Σ_s L̂_s/1000 with L̂_s the estimate of L(p) at simulation s, and SE² = Σ_s (L̂_s − L̄)²/1000.
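In R, the criterion reads as follows (our sketch; L_hat is the vector of the 1000 simulated estimates and L_true the value of the characteristic for the true distribution):

```r
# RMSEP = sqrt(BIAS^2 + SE^2), with SE^2 the (non-corrected) empirical variance.
rmsep <- function(L_hat, L_true) {
  bias <- mean(L_hat) - L_true
  se2  <- mean((L_hat - mean(L_hat))^2)
  sqrt(bias^2 + se2)
}
```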

Entropy
The entropy is defined as Ent(p) = −Σ_{j∈N} p(j) log(p(j)). We compare the estimators Ent(p̂_n^k) and Ent(p̃_n) through the ratio of their RMSEPs. The results differ according to the family of distributions. For the spline distributions Q_j^ℓ (see Figure 5), it appears that if k < ℓ, then Ent(p̂_n^k) has a smaller RMSEP than Ent(p̃_n). However, when k = ℓ, the ratio of the RMSEPs increases and may reach an asymptote close to or greater than 1. For example, in Figure 5, case (b) with k = 3, the ratio tends to 0.96, while in case (c) with k = 4 the ratio tends to 1.93. In fact, if we consider the space of ℓ-monotone distributions with maximum of support j, the distribution Q_j^ℓ may appear as a "limiting case" in this space, in that it admits only one ℓ-knot, at j. It seems that for these Q_j^ℓ distributions, the projection on the space of (ℓ−1)-monotone discrete probabilities gives better results than the projection on the space of ℓ-monotone discrete probabilities. For the Poisson distributions (see Figure 7), when n is small, the estimator based on the empirical distribution, Ent(p̃_n), has a smaller RMSEP than Ent(p̂_n^k). When n is large, the RMSEP ratios tend to one if k ≤ ℓ, and tend to infinity if k > ℓ.
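For reference, the plug-in entropy used here is, in R (our sketch):

```r
# Plug-in entropy Ent(q) = -sum_j q(j) log q(j), with 0 log 0 taken as 0.
entropy <- function(q) { q <- q[q > 0]; -sum(q * log(q)) }
```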

Probability mass at 0
We compare the performances of p̂_n^k(0) and p̃_n(0) by comparing the corresponding renormalized SE and BIAS.
The results for the spline distributions are presented in Table 3. When k ≤ ℓ, p̂_n^k(0) has a smaller SE than p̃_n(0). Its bias is greater in absolute value and always negative, but the RMSEP stays smaller. For each k, the variations of √n·SE/p(0) with n are very small and tend to stabilize around a value that increases with ℓ − k.
When k > ℓ, p̂_n^k(0) keeps a smaller RMSEP than p̃_n(0) for small n. But when n increases, the absolute bias as well as the standard error increase.

Variance
We compare the estimators var(p̂_n^k) and var(p̃_n) of the variance of p through the ratio of their RMSEPs. The results are similar for the spline distributions and the Poisson distributions, and we present only the RMSEPs for the spline distributions Q_j^ℓ in Figure 7.
When k = ℓ, the ratio of the RMSEPs tends to a constant smaller than 1 when n tends to infinity. When k < ℓ and n is large, the ratio of the RMSEPs increases with ℓ − k and may go beyond 1. For example, for k = 3 and ℓ = 4 the ratio of the RMSEPs equals 0.68 when n = 10000, while for ℓ = 10 the ratio is greater than 1 as soon as k ≥ 3 and n ≥ 1000. Conversely, when k > ℓ (a misspecified model), the ratio of the RMSEPs tends to infinity when n tends to infinity. Finally, when n is small, var(p̂_n^k) has a smaller RMSEP than var(p̃_n) whatever the values of k and ℓ.

About the mass of the non-constrained estimator p̂^{*k}
We were also interested in the mass of the non-constrained estimator p̂_n^{*k}. Figures 8 and 9 illustrate the results for the spline distributions with n = 20 and n = 100. As expected, the mass is always larger than 1 and, whatever k, the distribution of the mass concentrates around one when n increases (compare Figures 8 and 9). The larger ℓ is, the smaller the median and the dispersion around the median are. On the other hand, when k increases, the distributions are more scattered and their medians move away from 1 (compare the rows of each figure).

Conclusion
Let us consider the case where p is ℓ−monotone and p̂_n^k is the least-squares estimator of p on P_k for k ≤ ℓ. Concerning the l2-loss, the total variation loss and the estimation of p(0), p̂_n^k performs better than the empirical estimator p̃_n. Moreover, the superiority of p̂_n^k is larger when n is small. Concerning the Hellinger loss, or the estimation of the variance and the entropy, we get the following results. For small n, as before, the least-squares estimator is always better than the empirical estimator p̃_n. When n is large, p̂_n^k and p̃_n behave similarly. If p is a frontier distribution in P_ℓ, as for example the Poisson distribution with λ = λ_ℓ or a spline distribution Q_j^ℓ, then p̂_n^{ℓ−1} performs better than p̂_n^ℓ. If not, then p̂_n^ℓ performs better than p̂_n^k for all k < ℓ. Finally, for all considered criteria, the estimator p̂_n^k performs better than p̂_n^{*k} when n is small, and both estimators perform similarly when n is large.

Proofs
Properties of the estimator

Proof of Theorem 1
For all discrete functions f, let Q(f) = ½‖f − p̃_n‖². When no confusion is possible we write p̂ instead of p̂^k. Let us first prove that p̂ satisfies 1., or equivalently that for all integers l the following inequality is satisfied:
F_{p̂}^k(l) ≥ F_{p̃_n}^k(l) + β(p̂) m_l^k. (13)

By definition β(p̂) = ⟨p̂, p̂ − p̃_n⟩, so that (13) is equivalent to:
⟨p̂ − p̃_n, Q_l^k − p̂⟩ ≥ 0 for all l ∈ N. (14)
Let us derive this inequality by considering limits of directional derivatives. For all ε ∈ (0, 1] and l ≥ 0 we define a function q_ε^l as follows:
q_ε^l = (1 − ε) p̂ + ε Q_l^k. (15)
The function q_ε^l is a k−monotone probability and, because p̂ minimizes Q on the set of k-monotone probabilities, we have Q(q_ε^l) ≥ Q(p̂), that is:
ε⟨p̂ − p̃_n, Q_l^k − p̂⟩ + (ε²/2)‖Q_l^k − p̂‖² ≥ 0.
Letting ε tend to 0 we obtain (14). Now, by Lemma 5 (see Section 7.4), for all k ≥ 2 and all positive discrete measures f in L¹(N):
⟨f, Q_l^k⟩ = F_f^k(l) / m_l^k.
Applying this equality with f = p̂ and f = p̃_n, and using β(p̂) = ⟨p̂, p̂ − p̃_n⟩, we obtain exactly (13).
Let us now show that p̂ satisfies 2. Let l be a k−knot of p̂; we need to show that Inequality (13) is an equality. As before we consider q_ε^l defined at Equation (15), and show that q_ε^l is a k−monotone probability for ε negative and small enough. Thanks to the following equality:
(−1)^k ∆^k q_ε^l(i) = (1 − ε)(−1)^k ∆^k p̂(i) + ε(−1)^k ∆^k Q_l^k(i),
and because p̂ is k−monotone and (−1)^k ∆^k Q_l^k(i) vanishes for i ≠ l, we get (−1)^k ∆^k q_ε^l(i) ≥ 0 for all i ≠ l. As l is a k−knot of p̂, (−1)^k ∆^k p̂(l) > 0, so (−1)^k ∆^k q_ε^l(l) ≥ 0 for ε negative small enough. Proceeding as before with ε < 0, we get:
⟨p̂ − p̃_n, Q_l^k − p̂⟩ ≤ 0,
which, combined with (14), leads to an equality in (13).
Conversely, assume that f is a k−monotone probability that satisfies:
F_f^k(l) ≥ F_{p̃_n}^k(l) + β(f) m_l^k for all l ∈ N. (16)
We have to show that f = p̂. By definition of p̂ we need to show that Q(g) ≥ Q(f) for all k−monotone probabilities g. Let g be a k-monotone probability. We have:
Q(g) − Q(f) = ⟨f − p̃_n, g − f⟩ + ½‖g − f‖² ≥ ⟨f − p̃_n, g⟩ − β(f).
Moreover, g being a k−monotone probability, according to Property 2 we have the decomposition g = Σ_l π(l) Q_l^k on the spline basis, with π(l) ≥ 0 and Σ_l π(l) = 1. Using Lemma 5 and (16):
⟨f − p̃_n, g⟩ = Σ_l π(l) ( F_f^k(l) − F_{p̃_n}^k(l) ) / m_l^k ≥ β(f) Σ_l π(l) = β(f).
Finally Q(g) ≥ Q(f) for all k−monotone probabilities g and, by uniqueness of the projection, f = p̂.

Proof of Theorem 2 : Support of p̂
The support of p̂ is finite. Let us first consider the case where β(p̂) = 0. According to Property 4, this is equivalent to p̂ = p̂*.
The result is proved by contradiction: assuming that p̂ has an infinite support, we can build a probability p̄ satisfying the following properties:
ii) for all i ≤ s̄, p̄(i) = p̂(i);
iii) there exists i such that p̄(i) < p̂(i);
iv) p̄ is k-monotone and non-negative, with s̄ the maximum of the support of p̄.
The probability p̄ is constructed as follows. We define, for all j ∈ {1, . . . , k − 1} and all i ∈ N, the j-th derivative function q_j of p̂ by q_j(i) = ∆^j p̂(i).
We have q_{j+1}(i) = ∆¹q_j(i) for all i ∈ N. Then p̂ is k-monotone (and non-negative) if and only if q_{k−1} is non-increasing (and non-negative).
Because p̂ has an infinite support, all the functions q_j have infinite support too. The next step is to modify q_{k−1} into a function q̄_{k−1} such that, if the functions q̄_j and the sequence p̄ are defined from q̄_{k−1} by inverting the preceding relations (Equations (18) and (19)), then p̄ satisfies i), ii), iii) and iv).
The function q_{k−1} has an infinite support and is non-increasing; therefore it has an infinity of 1−knots (points where q_{k−1} strictly decreases). Assume first that k is odd (k ≥ 3). Let i₀ be a 1−knot of q_{k−1} such that i₀ > s̃. We define q̄_{k−1} by modifying q_{k−1} at the knot i₀ by a quantity ε, where ε is some positive real number chosen such that q̄_{k−1} is still non-increasing; for example, take ε = (q_{k−1}(i₀) − q_{k−1}(i₀ + 1))/2. The function q̄_{k−1} is shown in Figure 10.

Then the distribution p̄ defined at Equation (18) satisfies iv).
To show the properties i) to iii), we will use an equality, whose proof is straightforward and omitted, expressing p̂(i) as an iterated sum of q_{k−1}; the innermost index h_{k−1} runs over the set {0, . . . , i − k + 1}, which is empty if i ≤ k − 1.
It remains to show that p̄ ≤ p̂. By construction of q̄_{k−1}, the primitives of q_{k−1} dominate those of q̄_{k−1}; combining this with the preceding equality shows that p̂(i) − p̄(i) is nonnegative, which gives point i).
If k is even, the proof is based on another construction of q̄_{k−1}. Let us first recall that i is a 1−knot of q_{k−1} if ∆¹q_{k−1}(i) = q_{k−1}(i + 1) − q_{k−1}(i) is strictly negative (because k is even). We have two cases.
Case 1 : There exist (i₀, i₁) such that s̃ ≤ i₀ < i₁, i₁ − i₀ ≥ 2, ∆¹q_{k−1}(i₀) and ∆¹q_{k−1}(i₁) are strictly negative, and ∆¹q_{k−1}(i) = 0 for i₀ < i < i₁. The function q̄_{k−1} is then defined by modifying q_{k−1} between i₀ and i₁.
Case 2 : For all i ≥ s̃ + 1, ∆¹q_{k−1}(i) < 0. Let i₀ = s̃ + 1; the function q̄_{k−1} is then defined by modifying q_{k−1} after i₀.
The functions q̄_{k−1} are presented in Figure 11. The rest of the proof is similar to the one when k is odd.
Therefore, Theorem 2 is proved in the case β(p̂) = 0. Assume now that β(p̂) ≠ 0. By Theorem 1 we know that, if l is a k−knot of p̂, β(p̂) can be written as follows:
β(p̂) = ( F_{p̂}^k(l) − F_{p̃_n}^k(l) ) / m_l^k. (21)
Let us prove that if the support of p̂ is infinite, then β(p̂) = 0. Indeed, if the support of p̂ is infinite, p̂ has an infinite number of k−knots and Equation (21) is true for an infinite number of integers l. Moreover, by Equation (11), the term m_l^k is a polynomial function of l of degree k and, by Lemma 8 (see Section 7.4), the term F_{p̂}^k(l) − F_{p̃_n}^k(l) is bounded by a polynomial function of degree at most k − 1. Therefore the right-hand side of Equation (21) tends to zero when l tends to infinity, showing that β(p̂) = 0. This contradicts β(p̂) ≠ 0, so the support of p̂ is finite in this case too.
As ŝ is a k−knot of p̂, there exist two consecutive 1−knots between r and ŝ. This allows us to define the functions q̄_{k−1} and p̄ as before, in Equations (18) and (19).
The proof is similar when k is even.

Remark 1
The second case requires q_{k−1}(r + 2) > 0 for q̄ to be nonnegative; that is to say, we need r + 2 ≤ ŝ. This is the reason why the two sets {ŝ − k + 2, . . . , ŝ − 2} and {ŝ − k + 2, . . . , ŝ − 1} differ according to whether k is odd or even.

Proof of Theorem 4 : Comparison between the moments of p̂ and the moments of p̃_n
Let q be a sequence and let ε be a real number such that (1 − ε)p̂ + εq is a k−monotone probability. Because p̂ minimizes Q over the set of k−monotone probabilities, we get:
Q((1 − ε)p̂ + εq) ≥ Q(p̂),
which is equivalent to:
ε⟨p̂ − p̃_n, q − p̂⟩ + (ε²/2)‖q − p̂‖² ≥ 0,
and leads, after simplifications (see the proof of Theorem 1 for more explanations), to:
⟨p̂ − p̃_n, q⟩ ≥ β(p̂).
For q(i) = |i − a|₊^u / m(a, u) we get the result. Moreover, for q = δ₀ we find p̂(0) − p̃_n(0) ≥ β(p̂).

Proof of Theorem 5 : Rate of convergence of p̂
The proof is based on Lemma 6.2 of Jankowski and Wellner (2009) [20]. First we assume that r = 2. The projection on a closed convex set of a Hilbert space is 1-Lipschitz; therefore, if p^k is the projection of p on the set P_k, we have:
‖p̂ − p^k‖₂ ≤ ‖p̃_n − p‖₂.
We thus need to show that √n‖p − p̃_n‖₂ = O_P(1), or equivalently that the sequence W_n = √n(p̃_n − p) is tight in L²(N). Using Lemma 6.2 of Jankowski and Wellner (2009), it suffices to control the variances of the components of W_n, which is easily done by noting that var(p̃_n(j)) = p(j)(1 − p(j))/n.

Proof of Theorem 6. First part. Since the projection is 1-Lipschitz, ‖p̂ − p‖₂ ≤ ‖p̃_n − p‖₂, which tends to zero almost surely; hence, for all j ∈ N, ∆^k p̂(j) converges almost surely to ∆^k p(j). Because (−1)^k ∆^k p(j) > 0, almost surely (−1)^k ∆^k p̂(j) > 0 for n large enough, which proves that j is a k−knot of p̂.
Second part. If ŝ ≤ s the theorem is true; we assume now that ŝ > s. The proof is similar to the proof of Theorem 2. For a real number ε and some j ∈ {2, . . . , k − 1}, the function q_ε is defined as follows:
q_ε = (1 − ε)p̂ + ε Q^j_{ŝ+1},
where Q̃^j_{ŝ+1} is defined at Equation (1). The function q_ε is a k−monotone probability for ε small enough. Indeed, (−1)^k ∆^k Q^j_{ŝ+1}(i) is strictly negative only for i = ŝ. Moreover, (−1)^k ∆^k p̂(ŝ) = p̂(ŝ) > 0, so for ε small enough,
(−1)^k ∆^k q_ε(ŝ) = (1 − ε)p̂(ŝ) + ε(−1)^k ∆^k Q^j_{ŝ+1}(ŝ)
is nonnegative. On the other hand, as p̂ minimizes Q, we get Q(q_ε) ≥ Q(p̂), which is equivalent to:
ε⟨p̂ − p̃_n, Q^j_{ŝ+1} − p̂⟩ + (ε²/2)‖Q^j_{ŝ+1} − p̂‖² ≥ 0.
This leads to the following inequality:
⟨p̂ − p̃_n, Q^j_{ŝ+1}⟩ ≥ β(p̂).
By Lemma 5 (see Section 7.4) we deduce that:
( F^j_{p̂}(ŝ + 1) − F^j_{p̃_n}(ŝ + 1) ) / m^j_{ŝ+1} ≥ β(p̂),
which is exactly 2.(c).
Reciprocally, we now assume that f satisfies 1. and show that f = p̂ which, by Theorem 1, amounts to showing that Inequality (6) holds for all l ∈ N, with equality if l is a k−knot of f. This is true for l ≤ s̄ + 1 because f satisfies 2.(a) and 2.(b). Because f has no k−knot after s̄, it remains to show that the inequality holds for l ≥ s̄ + 2.
We begin with the case k = 3. Because s_f ≤ s̄ and both f and p̃_n are probabilities, applying Theorem 3 we obtain, for all l ≥ s̄ + 1, an inequality denoted (22) below, which can be made explicit using the fact that f satisfies 1.(a).
After some calculations, we can show that (22) is satisfied if and only if P₃(l) ≥ 0, where P₃ is the polynomial function P₃(l) = (l − s̄)(l − (s̄ + 1))(l + 2s̄ + 7). This is true because l ≥ s̄ + 1.
Estimating π on a finite support

Theorem 9 The algorithm described in Table 1 returns π̂_L in a finite number of steps.

Proof of Theorem 9
During Step 1, the set S is a subset of {0, . . . , L} and π is the minimizer of Ψ over the probabilities supported on S. The criterion allowing us to determine whether π = π̂_L (and to stop the algorithm) is given by Lemma 2 (see Section 7.2.2).
In order to show that the algorithm returns π̂_L in a finite number of steps, we need to show both of the following assertions: • Assertion 1 : Going from Step 2 to Step 1 is done in a finite number of runs.
• Assertion 2 : If π_m denotes the value of π at iteration m of the algorithm, then (Ψ(π_m))_m converges to the minimum of Ψ on the set of probabilities with support included in {0, . . . , L}, that is to say to Ψ(π̂_L).

At Step 2 the set S may be reduced to a single element, but it cannot become empty, because the minimizer of Ψ on a singleton is non-negative. That proves Assertion 1. Let us show Assertion 2 by proving that for all m ∈ N*, Ψ(π_{m+1}) < Ψ(π_m). Let S be the support of π_m at iteration m, and let j ∈ {0, . . . , L} be an integer such that D_{δ_j}Ψ(π_m) < 0. We set S′ = S ∪ {j}, and Ψ(π_{S′}) < Ψ(π_S) by Lemma 4 (see Section 7.2.2). We consider two cases:
1 : If π_{S′} is a nonnegative measure, we go to Step 1 with π = π_{S′}. In other terms, π_{m+1} = π_{S′} and therefore Ψ(π_{m+1}) < Ψ(π_S) = Ψ(π_m).
2 : If π_{S′} is not a nonnegative measure, the algorithm iterates inside Step 2 and π_{S′} is updated at each loop. We need to verify that, at the end of this iterative procedure, Ψ(π_{m+1}) < Ψ(π_m). Let r be the number of passes through Step 2 during the m-th loop and let S_h be the value of the set S at the h-th pass. We have π_{m+1} = π_{S_r}. We show by induction the following property HR_h: Ψ(π_{S_h}) < Ψ(π_S). Thanks to point 2. of Lemma 4, the property HR₁ is true. Assume now that HR_h is true for some h ≤ r − 1: Ψ(π_{S_h}) < Ψ(π_S). Let l and ε be chosen as in the reduction step.
Then (1 − ε)π_S + επ_{S_h} is a measure of total mass one with support S_{h+1} = S_h − {l}. It follows, by convexity of Ψ and by definition of π_{S_{h+1}} as the minimizer of Ψ on the support S_{h+1}, that:
Ψ(π_{S_{h+1}}) ≤ (1 − ε)Ψ(π_S) + εΨ(π_{S_h}).
Thanks to HR_h, it follows that Ψ(π_{S_{h+1}}) < Ψ(π_S), and HR_{h+1} is true. Then HR_r is true, that is to say Ψ(π_{m+1}) < Ψ(π_m) for all integers m, and (Ψ(π_m))_{m∈N} converges when m tends to infinity (because it is a nonincreasing and bounded sequence). Since the number of possible supports S ⊂ {0, . . . , L} is finite and Ψ decreases strictly at each iteration, no support is visited twice; the algorithm therefore stops after a finite number of iterations, at the minimum of Ψ.

Proof of the lemmas
The proof of Theorem 9 is based on the following lemmas, whose proofs are given afterwards. All the notation used in this section was defined in Section 4.
Lemma 1 Let µ be a probability with support included in {0, . . . , L}. Then we have the following equality:
D_µΨ(π) = Σ_{j=0}^{L} µ(j) D_{δ_j}Ψ(π).

Lemma 2 There is equivalence between:
1. π = π̂_L ;
2. D_{δ_j}Ψ(π) ≥ 0 for all j ∈ {0, . . . , L}.
Moreover, if π = π̂_L, then for all j ∈ supp(π) we have D_{δ_j}Ψ(π) = 0.
Lemma 3 Let M_S be the set of positive measures π whose support is included in the set S. Let π_S be the minimizer of Ψ over the probabilities of M_S, and let π′_S be the minimizer over M_S of the Lagrangian L(·, λ_S) introduced in the proof below. Then we have π_S = π′_S.
The proof of the following lemma is in Durot et al. [12]:

Lemma 4 Let π_S = Σ_{i∈S} a_i δ_i be the minimizer of Ψ over the set of nonnegative sequences with support S ⊂ {0, . . . , L}. Let j be an integer such that j ∉ S and D_{δ_j}Ψ(π_S) < 0. Let π_{S′} = Σ_{i∈S′} b_i δ_i be the minimizer of Ψ over the set of sequences with support S′ = S ∪ {j}. Then the two following results hold:
1. Ψ(π_{S′}) < Ψ(π_S).
2. Assume that b_i is strictly negative for some i ∈ S, and let l be the index selected by the reduction step. If π_{S″} is the minimizer of Ψ over the set of sequences with support S″ = S′ − {l}, then Ψ(π_{S″}) < Ψ(π_S).
Proof of Lemma 1 Let µ be a probability with support included in {0, . . . , L}. We write µ = Σ_{j=0}^{L} µ(j)δ_j and expand the directional derivative D_µΨ(π). In particular, for µ = δ_j we recover D_{δ_j}Ψ(π). Then, by noting that Σ_j µ(j) = 1, we obtain the equality stated in the lemma, and the lemma is proved.
Proof of Lemma 2 We first show that π̂_L satisfies 2. For all 0 < ε < 1 and j ∈ {0, . . . , L}, the measure (1 − ε)π̂_L + εδ_j is a probability, and then, by definition of π̂_L, we have the following inequality:
Ψ((1 − ε)π̂_L + εδ_j) ≥ Ψ(π̂_L),
which leads to D_{δ_j}Ψ(π̂_L) ≥ 0, showing point 2.
Proof of Lemma 3 Let π_S be the solution of the first minimization problem, and let Q_S and H_S be defined as in Section 4. The KKT conditions give that π_S is the unique sequence satisfying the stationarity of the Lagrangian L(·, λ_S). Differentiating L, where I is the vector with L + 1 components equal to 1, we obtain:
1 = ⟨Q_S π_S , I⟩ = ⟨H_S p̃_n , I⟩ − λ_S ⟨H I, I⟩,
which determines λ_S. Then, for all π with support included in S, we have L(π_S , λ_S) ≤ L(π, λ_S), and π_S is the solution of the second problem:
π′_S = argmin_{supp(π)⊂S} L(π, λ_S).
Because we are considering strictly convex minimization problems, we get π_S = π′_S.

Proofs of the properties
Property 7 For all k ≥ 2, if p ∈ L¹(N) is a k−monotone discrete function, then p is j−monotone, and strictly j−monotone on its support, for all j < k.
We show this result by induction. First, a convex (or 2-monotone) discrete function in L¹(N) is nonincreasing (see [23]).

Proof of Property 6
We prove that the Poisson distribution P(λ) is ℓ−monotone if and only if λ ≤ λ_ℓ. The distribution q = P(λ) is ℓ−monotone if and only if for all i ∈ N we have (−1)^ℓ ∆^ℓ q(i) ≥ 0. For all i ∈ N we have the following equalities:
(−1)^ℓ ∆^ℓ q(i) = Σ_{j=0}^{ℓ} (−1)^j C(ℓ, j) q(i + j) = e^{−λ} (λ^i / i!) R_ℓ(λ, i),
where R_ℓ is the polynomial function defined as follows:
R_ℓ(λ, i) = Σ_{j=0}^{ℓ} (−1)^j C(ℓ, j) λ^j i! / (i + j)!.
Therefore a necessary condition for P(λ) to be ℓ−monotone is that R_ℓ(λ, 0) is nonnegative, which is equivalent to P_ℓ(λ) nonnegative, where P_ℓ(λ) = R_ℓ(λ, 0) is the polynomial of Property 6. Conversely, because i ↦ R_ℓ(λ, i) is an increasing function for λ ∈ [0, 1], the condition is sufficient. Since P_ℓ(0) = 1 > 0, the polynomial P_ℓ(λ) is nonnegative up to its smallest positive root λ_ℓ. In other terms, the condition holds if and only if λ ≤ λ_ℓ.

Projection of δ 1 onto S 3
Our purpose is to show that the projection of δ₁ on the cone S₃ has a mass strictly greater than one. After some calculations, we find that this projection can be written as f = αQ₅³ + βQ₆³. To calculate the coefficients (α, β), we need to establish a necessary and sufficient condition ensuring that f is p̂*. This condition is given in Property 8 (see Section 7.3.3).
Proof of Lemma 5 The computation uses the relation Q̃_h^k(i + 1) = Q̃_{h−1}^k(i), with the convention Q̃_0^k = 0. Finally, we obtain the equality stated in the lemma, and the lemma is shown.
Proof of Lemma 6 The lemma is proved by induction. First, it is true for k = 0 with the convention ∆⁰f(i) = f(i) = F_f⁰(i). Assume now that the result is true for some k − 1 ≥ 0. By Property 2 (see Section 2) we have the decomposition of f on the spline basis, and consequently Σ_{l=i}^{∞} (−1)^k ∆^k f(l) Q̃_l^k(i) ≤ 1.

It follows that the result holds at order k, and the lemma is proved.
The last equality is due to a result of Bernoulli on Faulhaber's sums: the k−th Faulhaber sum is denoted by S_k and defined as S_k(n) = Σ_{i=1}^{n} i^k. It is shown that:
S_k(n) = ( 1/(k + 1) ) Σ_{j=0}^{k} C(k + 1, j) B_j n^{k+1−j},
where the B_j are the Bernoulli numbers (with the convention B₁ = 1/2). A proof of this result can be found in [9].
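As a quick numerical sanity check of this formula (our sketch; the Bernoulli numbers are hard-coded up to B₆ with the B₁ = +1/2 convention):

```r
# Faulhaber's formula: sum_{i=1}^n i^k = (1/(k+1)) sum_j C(k+1, j) B_j n^(k+1-j).
bernoulli <- c(1, 1/2, 1/6, 0, -1/30, 0, 1/42)    # B_0..B_6, with B_1 = +1/2
faulhaber <- function(n, k) {
  j <- 0:k
  sum(choose(k + 1, j) * bernoulli[j + 1] * n^(k + 1 - j)) / (k + 1)
}
faulhaber(10, 3); sum((1:10)^3)                   # both equal 3025
```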