Multinomial and empirical likelihood under convex constraints: directions of recession, Fenchel duality, perturbations

The primal problem of multinomial likelihood maximization restricted to a convex closed subset of the probability simplex is studied. Contrary to widely held belief, a solution of this problem may assign a positive mass to an outcome with zero count. Related flaws in the simplified Lagrange and Fenchel dual problems, which arise because the recession directions are ignored, are identified and corrected. A solution of the primal problem can be obtained by the PP (perturbed primal) algorithm, that is, as the limit of a sequence of solutions of perturbed primal problems. The PP algorithm may be implemented by the simplified Fenchel dual. The results permit us to specify linear sets and data such that the empirical likelihood-maximizing distribution exists and is the same as the multinomial likelihood-maximizing distribution. The multinomial likelihood ratio reaches, in general, a different conclusion than the empirical likelihood ratio. Implications for minimum discrimination information, compositional data analysis, Lindsay geometry, bootstrap with auxiliary information, and Lagrange multiplier tests are discussed.


Introduction
Zero counts are a source of difficulties in the maximization of the multinomial likelihood for log-linear models. A considerable literature has been devoted to this issue, culminating in the recent papers by Fienberg and Rinaldo [12] and Geyer [14]. In these studies, convex analysis considerations play a key role.
Less well recognized is that zero counts also cause difficulties in the maximization of the multinomial likelihood under linear constraints, or, in general, when the cell probabilities are restricted to a convex closed subset of the probability simplex; see Section 2 for a formal statement of the considered primal optimization problem P. Though in this case the nature of the difficulties is different from that in the log-linear case, convex analysis considerations are important here as well, because they permit developing a correct solution of P, one of the main objectives of the present work.
Contrary to widely held belief, a solution of P may assign a positive weight to an outcome with zero count; cf. Theorem 2, Examples 3 and 4, as well as the examples in Section 5. This fact also affects the Lagrange and Fenchel dual problems to P.
The restricted maximum of the multinomial likelihood defined through the primal problem P is not amenable to asymptotic analysis, and the primal form is not ideal for numerical optimization. Thus, it is common to consider the Lagrange dual problem instead of the primal problem. This permits the asymptotic analysis (cf. Aitchison and Silvey [4]), and reduces the dimension of the optimization problem, because the number of linear constraints is usually much smaller than the cardinality of the sample space. Smith [41, Sects. 6, 7] has developed a solution of the Lagrange dual problem under the hidden assumption that every outcome from the sample space appears in the sample at least once; that is, ν > 0, where ν is the vector of observed relative frequencies of outcomes. The same solution was later considered by several authors; see, in particular, Haber [19, p. 3], Little and Wu [30, p. 88], Lang [27, Sect. 7.1], Bergsma et al. [8, p. 65]. It has remained unnoticed that, if the assumption ν > 0 is not satisfied, the solution of Smith's Lagrange dual problem does not necessarily lead to a solution of the primal problem P.
El Barmi and Dykstra [10] studied the maximization of the multinomial likelihood under more general, convex set constraints, where it is natural to replace the Lagrange duality with the Fenchel duality. When the feasible set is defined by the linear constraints, El Barmi and Dykstra's (BD) dual B reduces to Smith's Lagrange dual. The BD-dual B leads to a solution of the primal P if ν > 0. The authors overlooked that this is not necessarily the case if a zero count occurs.
Taken together, the decisions obtained from El Barmi and Dykstra's simplified Fenchel dual B can be severely compromised. It is thus important to know the correct Fenchel dual F to P. This is provided by Theorem 6, which also characterizes the solution set of F. It is equally important to know the conditions under which the BD-dual B leads to a solution of P. The answer is provided by Theorem 16. An analysis of directions of recession is crucial for establishing the theorem.
Since obtaining a solution of the Fenchel dual is numerically demanding, a simple algorithm for solving the primal is proposed. The PP algorithm forms a sequence of perturbed primal problems. Theorem 20 demonstrates that the PP algorithm converges to a solution of P, in the sense of epi-convergence. Even stronger, pointwise convergence can be established for a linear constraint set; see Theorem 21. The convergence theorems imply that the common practice of replacing the zero counts by a small, arbitrary value can be supplanted by a sequence of perturbed primal problems, where the δ-perturbed relative frequency vectors ν(δ) > 0 are such that lim_{δ↓0} ν(δ) = ν. Because each ν(δ) is strictly positive, the PP algorithm can be implemented through the BD-dual to the perturbed primal, by the Fisher scoring algorithm, Gokhale's algorithm [16], or similar methods.
The findings have implications for the empirical likelihood. Recall that 'in most settings, empirical likelihood is a multinomial likelihood on the sample'; cf. Owen [32, p. 15]. As the empirical likelihood inner problem E (cf. Section 9) is a convex optimization problem, it has its Fenchel dual formulation. If the feasible set is linear, then the Fenchel dual to E is equivalent to El Barmi and Dykstra's dual B to E. Thanks to this connection, Theorem 16 provides conditions under which the solution set S_P of the multinomial likelihood primal problem P and the solution set S_E of the empirical likelihood inner problem E are the same, and the maximum L̂ of the multinomial likelihood is equal to the maximum L̂_E of the empirical likelihood. Consequently:

• If C is an H-set or a Z-set with respect to the type ν (for the definition, see Section 6), the maximum empirical likelihood does not exist, though the maximum multinomial likelihood exists. The notion of H-set corresponds to the convex hull problem (cf. Owen [31, Sect. 10.4]) and the notion of Z-set corresponds to the zero likelihood problem (cf. Bergsma et al. [7]). By Theorem 16, these are the only ways the empirical likelihood inner problem may fail to have a solution; cf. Section 9.

• If any of conditions (i)-(iv) in Theorem 16(b) is not satisfied, then L̂_E < L̂, and the empirical likelihood may lead to different inferential and evidential conclusions than those suggested by the multinomial likelihood.
Fisher's [13] original concept of the likelihood carries the discordances between the multinomial and empirical likelihoods also into the continuous iid setting; cf. Section 9.1.
The findings also affect other methods, such as the minimum discrimination information, compositional data analysis, Lindsay geometry of multinomial mixtures, bootstrap with auxiliary information, and Lagrange multiplier test, which explicitly or implicitly ignore information about the support and are restricted to the observed outcomes.

Organization of the paper
The multinomial likelihood primal problem P and its characterization (cf. Theorem 2) are presented in Section 2. The Fenchel dual problem F to P is introduced in Section 3. A Lagrange dual formulation of the convex conjugate (cf. Theorem 5) serves as the ground for Theorem 6, one of the main results, which provides a relation between the solutions of P and F. If the feasible set C is polyhedral, a solution of F can also be obtained from a different Lagrange dual to P; cf. Section 3.1. The special case of a single inequality constraint is discussed in detail in Section 3.2, where a flaw in Klotz's [26] Theorem 1 is noted. In Section 4, El Barmi and Dykstra's [10] dual B is recalled; Section 4.1 introduces its special case, the Smith dual problem. Theorem 2.1 of El Barmi and Dykstra [10] and its flaws are presented in Section 5, where they are also illustrated by simple examples. Section 6 studies the scope of validity of the BD-dual B; cf. Theorem 16. Sequential, active-passive dualization is proposed and analyzed in Section 7. The perturbed primal problem P_δ is introduced in Section 8, where the epi-convergence of a sequence of perturbed primals for a general convex C, and the pointwise convergence for a linear C, are formulated (cf. Theorems 20, 21) and illustrated. Implications of the results for the empirical likelihood method are discussed in Section 9. A brief discussion of implications of the findings for the minimum discrimination information, compositional data analysis, Lindsay geometry of multinomial mixtures, bootstrap with auxiliary information, and the Lagrange multiplier test is contained in Section 10. Finally, Section 11 comprises detailed proofs of the results.
An R code and data to reproduce the numerical examples can be found in [18].
2 Multinomial likelihood primal problem P

Let X denote a finite alphabet (sample space) consisting of m letters (outcomes) and ∆_X denote the probability simplex ∆_X ≔ {q ∈ R^m : q ≥ 0, Σq = 1}; identify R^m with R^X. Suppose that (n_i)_{i∈X} is a realization of the closed multinomial distribution Pr((n_i)_{i∈X}; n, q) = n! ∏_{i∈X} q_i^{n_i}/n_i! with parameters n ∈ N and q = (q_i)_{i∈X} ∈ ∆_X. Then the multinomial likelihood kernel L(q) = L_ν(q) ≔ e^{−n ℓ(q)}, where ℓ = ℓ_ν : ∆_X → R̄, Kerridge's inaccuracy [25], is

ℓ(q) ≔ −⟨ν, log q⟩, (2.1)

and ν ≔ (n_i/n)_{i∈X} is the type (the vector of the relative frequencies of outcomes). The conventions log 0 = −∞, 0 · (−∞) = 0 apply; R̄ denotes the extended real line [−∞, ∞] and ⟨a, b⟩ is the scalar product of a, b ∈ R^m. Functions and relations on vectors are taken component-wise; for example, log q = (log q_i)_{i∈X}. For x ∈ R^m, Σx is a shorthand for Σ_{i∈X} x_i.
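In the notation of (2.1), Kerridge's inaccuracy and the likelihood kernel can be computed directly; the minimal sketch below (function names are ours) implements the stated conventions log 0 = −∞ and 0 · (−∞) = 0:

```python
import math

def kerridge_inaccuracy(nu, q):
    """Kerridge inaccuracy l(q) = -<nu, log q>, with the conventions
    log 0 = -inf and 0 * (-inf) = 0 (zero-count letters contribute nothing)."""
    total = 0.0
    for nu_i, q_i in zip(nu, q):
        if nu_i == 0.0:
            continue            # convention 0 * (-inf) = 0
        if q_i == 0.0:
            return math.inf     # positive type mass on a letter with q_i = 0
        total -= nu_i * math.log(q_i)
    return total

def likelihood_kernel(nu, q, n):
    """Multinomial likelihood kernel L(q) = exp(-n * l(q))."""
    l = kerridge_inaccuracy(nu, q)
    return 0.0 if l == math.inf else math.exp(-n * l)

nu = (0.5, 0.25, 0.25, 0.0)   # a type with one zero count (a passive letter)
print(kerridge_inaccuracy(nu, nu))                    # finite: 1.5 * log 2
print(likelihood_kernel(nu, (0.4, 0.3, 0.3, 0.0), 4))
```

Over the full simplex, ℓ is minimized (and L maximized) at q = ν, which is why the unconstrained problem is trivial and the difficulties arise only under the constraint q ∈ C.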
Consider the problem P of minimization of ℓ, restricted to a convex closed set C ⊆ ∆_X. The goal is to find the solution set S_P as well as the infimum ℓ̂_P of the objective function over C. The problem P will be called the multinomial likelihood primal problem, or primal, for short. Special attention is paid to the class of polyhedral feasible sets

C = {q ∈ ∆_X : ⟨q, u_h⟩ ≤ 0 for h = 1, 2, . . . , r}, (2.2)

and to its subclass of sets C given by (a finite number of) linear equality constraints

C = {q ∈ ∆_X : ⟨q, u_h⟩ = 0 for h = 1, 2, . . . , r}, (2.3)

where u_h are vectors from R^m. These feasible sets are particularly interesting from the applied point of view and permit establishing stronger results. Without loss of generality it is assumed that X is the support supp(C) of C; that is, for every i ∈ X there is q ∈ C with q_i > 0; in other words, structural zeros (cf. Baker et al. [6, p. 34]) are excluded. Due to the convexity of C this is equivalent to the existence of q ∈ C with q > 0. Under this assumption, Theorem 2 gives a basic characterization of the solution set of P. Before stating it, some useful notions are introduced.
Definition 1. For a type ν (or, more generally, for any ν ∈ ∆_X), the active and passive alphabets are X_a ≔ {i ∈ X : ν_i > 0} and X_p ≔ X \ X_a. The elements of X_a, X_p are called active, passive letters, respectively.
Put m_a ≔ card X_a > 0, m_p ≔ card X_p ≥ 0. Let π_a : R^m → R^{m_a}, π_p : R^m → R^{m_p} be the natural projections; identify R^{m_a} with R^{X_a} and R^{m_p} with R^{X_p}. Note that if X_p = ∅ then m_p = 0 and R^{m_p} = {0}. For x ∈ R^m, x_a and x_p are shorthands for π_a(x) and π_p(x), respectively. (If no ambiguity can occur, the elements of R^{m_a}, R^{m_p} will also be denoted by x_a, x_p.) Identify R^m with R^{m_a} × R^{m_p}, so that it is possible to write x = (x_a, x_p) for every x ∈ R^m. Finally, for a subset M of R^m and x ∈ M let M_a ≔ π_a(M) and M_a(x_p) ≔ π_a({x' ∈ M : x'_p = x_p}) be the active projection and the x_p-slice of M; analogously define M_p and M_p(x_a).
Theorem 2 (Primal problem). Let ν ≥ 0 be from ∆_X. Let C be a convex closed subset of ∆_X with support X. Then ℓ̂_P is finite, S_P is compact, and there is q̂_P^a ∈ C_a, q̂_P^a > 0, such that S_P = {q ∈ C : q_a = q̂_P^a}; moreover, S_P is a singleton if and only if the q̂_P^a-slice C_p(q̂_P^a) of C is a singleton.

Thus the primal always has a solution q̂_P. Its active coordinates q̂_P^a are unique and the passive coordinates q̂_P^p are arbitrary such that q̂_P ∈ C. It is worth stressing that C_p(q̂_P^a) need not be equal to {0_p}; that is, a solution of P may put positive mass on passive letter(s). The following pair of simple examples illustrates the points; see also the examples in Section 5. Hereafter X denotes a random variable supported on X.
The Fisher scoring algorithm, which is commonly used to solve P when C is a linear set, may fail to converge when zero counts are present; cf. Stirling [42]. Other numerical methods, such as the augmented Lagrange multiplier methods, which are used to solve the convex optimization problem under polyhedral and/or linear C, may have difficulties coping with large m. Thus, it is desirable to approach P also from another direction.
3 Fenchel dual problem F to P

Consider the Fenchel dual problem F to the primal P, where C^* ≔ {y ∈ R^m : ⟨y, q⟩ ≤ 0 for every q ∈ C} is the polar cone of C and ℓ^* : R^m → R̄ is the convex conjugate of ℓ (in fact, the convex conjugate of ℓ̃ : R^m → R̄ given by ℓ̃(x) = ℓ(x) for x ∈ ∆_X and ℓ̃(x) = ∞ otherwise). The Fenchel dual F is often more tractable than the primal P, in particular when C is given by linear equality and/or inequality constraints. This is the case for the models for contingency tables mentioned in the Introduction. Also an estimating equations model C_Θ leads to a linear feasible set C_θ when θ ∈ Θ is fixed; there θ ∈ Θ ⊆ R^d, the maps u_h : Θ → R^m, h = 1, 2, . . . , r, are the estimating functions, and d need not be equal to r. Since r is usually much smaller than m, F may be easier to solve numerically than P.
Observe that the convex conjugate is itself defined through an optimization problem, the convex conjugate primal problem (cc-primal, for short); the structure of its solution set S_cc(z) is described by Proposition 36. The conjugate ℓ^* can be evaluated by means of the Lagrange duality, with the Lagrange function k_z; cf. Lemma 34. For every μ ∈ R and a, b ∈ R^{m_a}, a > 0, b > −μ, define

Theorem 5 (Convex conjugate by Lagrange duality). Let ν ∈ ∆_X and z ∈ R^m. Then

The key point is that the μ̂(z_a) which solves (3.4) is not always the μ̂(z) which minimizes k_z(μ). They are the same if and only if μ̂(z_a) ≥ max(z_p). As will be seen in Theorem 6, if z ∈ S_F then this inequality decides whether the solution of the primal P is supported only on the active letters, or some probability mass is placed also on the passive letter(s).
Theorem 5 serves as a foundation for Theorem 6, which states the relation between the Fenchel dual and the primal.
Theorem 6 (Relation between F and P). Let ν, C, q̂_P^a be as in Theorem 2. If ŷ_F^a satisfies (3.5),
then (ŷ_F^a, −1_p) ∈ S_F and the following hold: (a) If Σ q̂_P^a = 1 then μ̂(−ŷ_F^a) = 1 and

Thus, in the active coordinates, the solution of F is unique and is related to q̂_P^a by (3.5). Together with Theorem 2 this yields a solution of P. The structure of the solution set of F in the passive letters is determined by the relation of μ̂(−ŷ_F^a) to 1 = μ̂(−ŷ_F). In the case (a), μ̂(−ŷ_F^a) = μ̂(−ŷ_F), and the passive projections ŷ_F^p of the solutions satisfy ŷ_F^p ≥ −1. Then it suffices to solve the primal solely in the active letters and S_P is a singleton. This happens, for instance, if ν > 0.
In the case (b), μ̂(−ŷ_F^a) < μ̂(−ŷ_F), and ŷ_F^p satisfies min(ŷ_F^p) = −1. Then every solution of the primal assigns positive probability to at least one passive letter.
To sum up, the Fenchel dual problem, once solved, permits finding q̂_P^a through (3.5), and in this way it incorporates C^* into q̂_P^a. In the case of linear or polyhedral C the dual may reduce the dimensionality of the optimization problem, yet the numerical solution of F is somewhat hampered by the need to obey (3.3). Moreover, q̂_P^p remains to be found; in this respect see Proposition 47.

Polyhedral and linear C
In the polyhedral case (2.2), C = {q ∈ ∆_X : ⟨q, u_h⟩ ≤ 0 for h = 1, 2, . . . , r}, a solution of the Fenchel dual F can also be obtained from the saddle points of a Lagrange function L(q, α). Indeed, (q̂, α̂) is a saddle point of L(q, α) if and only if q̂ ∈ S_P and Σ_h α̂_h u_h ∈ S_F.
There may exist many base solutions. However, if u_1^a, . . . , u_r^a are linearly independent then the base solution is unique, since then the system of equations Σ_h α_h u_h^a = ŷ_F^a has a unique solution α̂. An analogous claim holds for a linear C (cf. (2.3)); in this case α ∈ R^r, by the Farkas lemma.

Single inequality constraint and Klotz's Theorem 1
To illustrate the base solution in connection with Theorem 6, consider C given by a single inequality constraint. By the Farkas lemma, C^* = {αu : α ≥ 0} + R^m_−. If u^p ≥ 0, the case (a) of Theorem 6 applies (for if not, there is a base solution α̂_F u and, by Theorem 6(b), min(α̂_F u^p) should be −1; this is not possible since min(α̂_F u^p) ≥ 0).
Assume that min(u^p) < 0 and the case (b) of Theorem 6 applies. Take any base solution α̂_F u. Then, by Theorem 6(b), min(α̂_F u^p) = −1; so

q̂_P^a = ν^a / (1 − u^a/min(u^p)). (3.6)

Further, by Theorem 2, q̂_P^p is arbitrary from C_p(q̂_P^a). If u^p attains its minimum at a single letter, then the solution of the primal P is always unique.
To sum up, the case (b) of Theorem 6 happens if and only if min(u^p) < 0 and Σ_{i∈X_a} ν_i/(1 − u_i/min(u^p)) < 1.
Thus, under Klotz's condition (3.1b), the solution of P should assign zero weight to any passive letter. This is not the case, as the following example demonstrates.
4 El Barmi-Dykstra dual problem B to P

El Barmi and Dykstra [10] consider a simplified Fenchel dual problem B (the BD-dual, for short), whose objective involves what will be called the BD conjugate of ℓ. Note that the BD-dual B is easier to solve than F, as in the former μ̂ is fixed to 1. If C is polyhedral then, in analogy with the concept of the base solution of F, the vectors from S_B of the form Σ_h α̂_{B,h} u_h are called the base solutions of B. As above, every solution of B can in this case be written as a sum of a base solution and a vector from R^m_−.

Smith dual problem
By the Farkas lemma, for the feasible set C = {q ∈ ∆_X : ⟨q, u_h⟩ = 0, h = 1, 2, . . . , r} given by r linear equality constraints, the polar cone is C^* = {Σ_h α_h u_h : α ∈ R^r} + R^m_−.

Theorem 8 (El Barmi and Dykstra [10]). Let C be a convex closed subset of ∆_X.
(B → P): If ℓ̂_B is finite, then S_B is nonempty and ℓ̂_B = −ℓ̂_P. Moreover, for every ŷ_B ∈ S_B, q̂_B ≔ ν/(1 + ŷ_B) belongs to S_P.
Though ℓ̂_P is always finite, ℓ̂_B may be infinite, leaving the solution set S_P inaccessible through (B → P) of [10, Thm. 2.1]. Moreover, the claims (B → P) and (P → B) of the theorem are not always true. In fact, there are three possibilities. To illustrate them, six simple examples are given below: one where the BD-duality works and five where it fails. In Examples 9, 10, 13, 14 the set C is linear, whereas Example 11 presents a nonlinear C, and in Example 12 the set C is defined by linear inequalities.
, which does not have finite infimum.
Observe that Theorem 2.1 of [10] implies that q̂_B^p = 0_p, provided that ℓ̂_B is finite. However, C_p(q̂^a) may be different from {0_p}; in such a case q̂_P^p has a strictly positive coordinate and the BD-duality gap occurs.

Definition 15. If a nonempty convex closed set C ⊆ ∆_X and a type ν are such that C_a(0_p) = ∅, then we say that C is an H-set with respect to ν. The set C is called a Z-set with respect to ν if C_a(0_p) is nonempty but its support is strictly smaller than X_a.
Note that C a (0 p ) comprises those q ∈ C which are supported on the active letters. Thus, C is neither an H-set nor a Z-set if and only if there is q ∈ C with q a > 0, q p = 0.
Clearly, in Example 10, C is an H-set with respect to the ν. The same set C becomes a Z-set with respect to the ν considered in Example 13. And it is neither an H-set nor a Z-set with respect to the ν studied in Example 9. Further, the feasible set C considered in Example 14 is neither an H-set nor a Z-set with respect to the particular ν. In Examples 11 and 12, C is an H-set with respect to the ν.
Theorem 16 (Relation between B and P). Let ν, C, q̂_P^a, ŷ_F^a be as in Theorems 2 and 6. (a) If C is either an H-set or a Z-set with respect to ν then ℓ̂_B is not finite. (b) If C is neither an H-set nor a Z-set then ℓ̂_B is finite, ℓ̂_B ≤ ℓ̂_F, and there is ŷ_B^a ∈ C_a^* such that μ̂(−ŷ_B^a) = 1. Moreover, there is no BD-duality gap, that is, a solution of B leads through (B → P) to a solution of P, if and only if any of the (equivalent) conditions (i)-(iv) holds.

Informally put, Theorem 16(a) demonstrates that the BD-dual breaks down if C is either an H-set or a Z-set with respect to the observed type ν. Then the (B → P) part of Theorem 8 does not apply. At the same time the (P → B) part of Theorem 8 does not hold, as S_P ≠ ∅, yet S_B = ∅. This is illustrated by Examples 10-13.
Part (b) of Theorem 16 captures the other discomforting fact about the BD-dual: even if a solution of B exists, it may not solve the primal problem P. By (i), this happens whenever the solution q̂_P of P assigns a positive weight to at least one of the passive letters (provided that S_B ≠ ∅). See Example 14.
At least, for ν > 0 the BD-dual works well.
The corollary justifies the use of the BD-dual for solving P when ν > 0. Recall that in the case of linear C the BD-dual is just Smith's simplified Lagrange dual problem (4.1), which is an unconstrained optimization problem. It can be solved numerically by standard methods for unconstrained optimization or by El Barmi & Dykstra's [10] cyclic ascent algorithm.
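For a single linear equality constraint and ν > 0, Theorem 8 gives the solution the form q̂ = ν/(1 + αu) with a base solution αu, so the Smith dual reduces to one-dimensional root finding. A minimal numerical sketch (our own toy data; plain bisection stands in for Fisher scoring or the cyclic ascent algorithm):

```python
import math

def smith_dual_solve(nu, u, tol=1e-12):
    """Solve the primal for a single linear equality constraint <q, u> = 0
    when nu > 0, via the Smith (BD) dual: the solution has the form
    q_i = nu_i / (1 + alpha * u_i), where alpha solves g(alpha) = 0 with
    g(alpha) = sum_i nu_i u_i / (1 + alpha u_i)  (strictly decreasing)."""
    # alpha must keep 1 + alpha*u_i > 0 for every letter i
    lo = max((-1.0 / u_i for u_i in u if u_i > 0), default=-math.inf)
    hi = min((-1.0 / u_i for u_i in u if u_i < 0), default=math.inf)
    lo = lo + 1e-9 if math.isfinite(lo) else -1e6
    hi = hi - 1e-9 if math.isfinite(hi) else 1e6

    def g(a):
        return sum(n_i * u_i / (1.0 + a * u_i) for n_i, u_i in zip(nu, u))

    for _ in range(200):                 # bisection on the decreasing g
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    a = 0.5 * (lo + hi)
    return [n_i / (1.0 + a * u_i) for n_i, u_i in zip(nu, u)]

nu = [0.5, 0.3, 0.2]        # strictly positive type: the BD-dual applies
u  = [-1.0, 0.0, 1.0]       # constraint E[X] = 0 on X = {-1, 0, 1}
q = smith_dual_solve(nu, u)
print(q, sum(q), sum(q_i * u_i for q_i, u_i in zip(q, u)))
```

For this toy ν and u the solution is q̂ = (0.35, 0.3, 0.35); the normalization Σq̂ = 1 holds automatically once ⟨q̂, u⟩ = 0 is met, since Σ ν_i/(1 + αu_i) = 1 − α g(α).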
Finally, it is worth noting that the solution sets S_P and S_F are always compact, but S_B, if nonempty, is compact if and only if ν > 0, that is, if there is no passive letter. If ν ≯ 0 and S_B ≠ ∅, then S_B is unbounded from below.

Base solution of B and no BD-duality gap
The case (iv) of Theorem 16(b) provides a way to find out whether a solution of P assigns zero weights to the passive letters or not. First, determine whether C is neither an H-set nor a Z-set with respect to ν. Then, solve the BD-dual B and find a solution ŷ_B of it. Finally, verify that (ŷ_B^a, −1_p) belongs to the polar cone C^*. For example, if ŷ_B^p ≥ −1 then this is satisfied automatically. On the other hand, in order to have q̂_P^p = 0_p (that is, to have no BD-duality gap), ŷ_B^p ≥ −1 must be satisfied by some ŷ_B ∈ S_B. In the case when C is polyhedral, there must exist a base solution ŷ_B = Σ_h α̂_h u_h of B with ŷ_B^p ≥ −1. The next example illustrates the point.
To sum up, El Barmi and Dykstra's dual may fail to lead to a solution of the multinomial likelihood primal P in different ways. For a particular type ν the feasible set C may be an H-set or a Z-set, in which case the BD-dual fails to attain a finite infimum. Even if this is not the case, B may fail to provide a solution of P, due to the BD-duality gap. Theorem 16(b) states equivalent conditions under which the BD-dual is in the extremality relation with P and leads to a solution of P; see also Lemma 58.
In the next two sections other possibilities of solving P are explored. First, an active-passive dualization is considered. Then, a perturbed primal problem and the PP algorithm are studied. Interestingly, a solution of the perturbed primal problem may be obtained from the BD-dual problem.

Active-passive dualization
The active-passive dualization is based on a reformulation of the primal P as a sequence of partial minimizations. Assume that q_p is such that the slice C_a(q_p) has support X_a (this is not a restriction, since otherwise the inner infimum is ∞). Since ν_a > 0, Corollary 17 gives that a solution of the inner (active) primal problem A_κ can be obtained from its BD-dual B_κ.

Theorem 19 (Relation between B_κ and A_κ). Let q_p ∈ C_p be such that the support of C_a(q_p) is X_a. Then there is a unique solution ŷ_a(q_p) of B_κ, and q̂_a(q_p) is the unique member of S_P(q_p). Moreover, q̂_a(q_p) ⊥ ŷ_a(q_p) and ℓ̂_B(q_p) = −ℓ̂_P(q_p).
Thanks to (7.1) and the extremality relation between A_κ and B_κ, the active-passive (AP) dual form of the active-passive primal is obtained. The active-passive dualization is illustrated by the following example.
Example 13 (cont'd). Here C_p = [0, 1/2] and the AP dual can be written in the form

where v_a = u_a + (κ(q_1^p) q_1^p u_1^p) 1_a. The inner optimization gives

which has to be maximized over q_1^p ∈ C_p = [0, 1/2]. The maximum is attained at q̂_1^p = 1/4; q̂_{−1}^a = 1/4 and q̂_0^a = 1/2. Hence, q̂ = (1, 2, 1)/4. In the outer, passive optimization it is possible to exploit the structure of S_P (cf. Theorem 2) and in this way reduce the dimension of the problem. This is the case, for instance, when C is polyhedral.

Perturbed primal P δ and PP algorithm
For δ > 0 let ν(δ) ∈ ∆_X be a perturbation of the type ν; it is assumed that

ν(δ) > 0 and lim_{δ↓0} ν(δ) = ν. (8.1)

The perturbation activates the passive, unobserved letters. For every δ > 0 consider the perturbed primal problem P_δ of minimizing ℓ_δ ≔ ℓ_{ν(δ)} over C. Since the activated type ν(δ) has no passive coordinate, the perturbed primal problem P_δ can be solved, for instance, via the BD-dualization; recall Corollary 17. Thus, for every δ > 0, the solution q̂_P(δ) of P_δ is unique.
How is q̂_P(δ) related to S_P, and ℓ̂_P(δ) to ℓ̂_P? Theorem 20 asserts that (ℓ_δ)_C epi-converges to (ℓ_ν)_C as δ ↓ 0. (There, for a map f : R^m → R̄ and a set C ⊆ R^m, the map f_C : R^m → R̄ agrees with f on C and equals ∞ outside C.) The epi-convergence (cf. Rockafellar and Wets [39, Chap. 7]) is used in convex analysis to study the limiting behavior of perturbed optimization problems. It is an important modification of the uniform convergence (cf. Kall [24], Wets [43]).
Theorem 20 (Epi-convergence of P_δ to P). Assume that C is a convex closed subset of ∆_X with support X, ν ∈ ∆_X, and (ν(δ))_{δ>0} is such that (8.1) holds. Then (ℓ_δ)_C epi-converges to (ℓ_ν)_C, and the active coordinates of the solutions of P_δ converge to the unique point q̂_P^a of S_P^a: lim_{δ↓0} q̂_P^a(δ) = q̂_P^a. Moreover, if S_P is a singleton (in particular, if ν > 0) then also the passive coordinates converge and lim_{δ↓0} q̂_P(δ) = q̂_P.

Thus, if δ is small enough, the (unique) solution q̂_P(δ) of the perturbed primal P_δ is close to a solution of the primal problem P. Theorem 20 also states that in the active coordinates the convergence is pointwise, to the unique q̂_P^a of S_P^a. The next theorem demonstrates that if C is given by linear constraints and ν(δ) is defined in a 'uniform' way in δ, the passive coordinates of q̂_P(δ) also converge pointwise.
Theorem 21 (Convergence of P_δ to P, linear C). Let C be given by (2.3), ν ∈ ∆_X, and let (ν(δ))_{δ>0} be such that (8.1) holds. Further, assume that ν(·) is continuously differentiable and that there is a constant c > 0 such that condition (8.3) holds for every i ∈ X_p.
Then lim_{δ↓0} q̂_P(δ) exists and belongs to S_P.
This corresponds to the case when every passive coordinate is 'activated' by equal weight.
The following example demonstrates that without the assumption (8.3) the convergence in passive letters need not occur.
The next example provides a numeric illustration of the pointwise convergence of a sequence of perturbed primals to P. The perturbed primal solutions are obtained through their BD-duals. It is worth stressing that B to P δ is, for a linear C, an unconstrained optimization problem; cf. Section 4.1.
As an aside, note that for this type ν the BD-dual to the original, unperturbed primal P breaks down, since C θ is an H-set. In fact, it is an H-set with respect to this ν for any θ ∈ Θ; cf. the empty set problem in Grendár and Judge [17].
The convergence theorems suggest that the practice of replacing the zero counts by an ad hoc value can be superseded by the PP algorithm, that is, by a sequence of perturbed primal problems with ν(δ) > 0 such that lim_{δ↓0} ν(δ) = ν. Since each ν(δ) > 0, the PP algorithm can be implemented through the BD-dual to P_δ, by the Fisher scoring algorithm, or by the Gokhale algorithm [16], among other methods.
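To illustrate, the sketch below runs the PP algorithm on a toy problem of our own choosing: alphabet X = {−1, 0, 1}, constraint E[X] = 0, type ν = (3/4, 1/4, 0), with a uniform activation of the passive letter. Each perturbed primal is solved through its BD-dual, which for a single equality constraint has the closed form q_i = ν_i(δ)/(1 + αu_i), with α found by bisection:

```python
import math

def solve_perturbed(nu, u):
    """BD-dual (Smith) solution q_i = nu_i/(1 + a*u_i) for a strictly
    positive nu and a single equality constraint <q, u> = 0; the scalar a
    is located by bisection on the strictly decreasing dual equation g."""
    lo = max(-1.0 / u_i for u_i in u if u_i > 0) + 1e-12
    hi = min(-1.0 / u_i for u_i in u if u_i < 0) - 1e-12
    g = lambda a: sum(n * v / (1 + a * v) for n, v in zip(nu, u))
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) > 0 else (lo, mid)
    a = 0.5 * (lo + hi)
    return [n / (1 + a * u_i) for n, u_i in zip(nu, u)]

# PP algorithm sketch: a type with a zero count, constraint E[X] = 0
nu = [0.75, 0.25, 0.0]                 # letter 1 is passive
u  = [-1.0, 0.0, 1.0]
m  = len(nu)
for delta in (1e-1, 1e-3, 1e-6, 1e-9):
    nu_d = [(1 - delta) * n + delta / m for n in nu]   # uniform activation
    print(delta, solve_perturbed(nu_d, u))
```

The iterates converge to q̂_P = (0.375, 0.25, 0.375): in the limit, the PP algorithm puts positive mass on the passive letter 1, exactly as Theorem 2 allows.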

Implications for empirical likelihood
Let us point out some of the consequences of the presented results for the empirical likelihood (EL) method; cf. Owen [31]. In most settings, including the discrete one, empirical likelihood is 'a multinomial likelihood on the sample', Owen [32, p. 15]. It is usually applied to an empirical estimating equations model, which is in the discrete setting defined as C_{Θ,ν_a} ≔ ∪_{θ∈Θ} C_{θ,ν_a}, where u_h^a : Θ → R^{m_a} are the empirical estimating functions. The empirical likelihood estimator is defined through (9.1). For a fixed θ ∈ Θ the data-supported feasible set C_{θ,ν_a} is a convex set and the inner optimization in (9.1) is the empirical likelihood inner problem E. Since C_{θ,ν_a} is just the 0_p-slice C_{θ,a}(0_p) of C_θ (given by (3.1)), the EL inner problem E can equivalently be expressed through ℓ restricted to that slice. Its dual is (9.2). Note that (9.2) is just Smith's simplified Lagrangian (4.1), that is, the BD-dual B to the multinomial likelihood primal problem P, for the linear set C_θ. This connection implies, through Theorem 16, that the maximum of the empirical likelihood does not exist if C_θ is either an H-set or a Z-set with respect to ν. The two possibilities are recognized in the literature on EL, where an H-set is referred to as the convex hull condition (cf. Owen [32, Sect. 10.4]), and a Z-set is known as the zero likelihood problem (cf. Bergsma et al. [7]). Theorem 16 also implies that these are the only ways the EL inner problem may fail to have a solution. Note that the inner empirical likelihood problem may fail to have a solution for any θ ∈ Θ; cf. the empty set problem, Grendár and Judge [17]. In addition, Theorem 16 implies that, besides failing to exist, the empirical likelihood inner problem may have a different solution than the multinomial likelihood primal problem. If C_θ is neither an H-set nor a Z-set then, by Theorem 16(b), it is possible that the BD-duality gap occurs. Since ℓ̂_B = −ℓ̂_E, in the latter case ℓ̂_P < ℓ̂_E and S_P ≠ S_E. This happens when any of the conditions (i)-(iv) from Theorem 16(b) is not satisfied.
Then the distribution that maximizes the empirical likelihood differs from the distribution that maximizes the multinomial likelihood. Moreover, the multinomial likelihood ratio may lead to different inferential and evidential conclusions than the empirical likelihood ratio. The next example illustrates the points.
As the two solutions are very close, the multinomial likelihood ratio indicates inconclusive evidence. However, the empirical likelihood ratio leads to a very different conclusion. Note that the active letters are X_a = {−2, −1, 2}, and C_θ is neither an H-set nor a Z-set with respect to the observed type ν for the considered θ_j (j = 1, 2). Hence for both θ's the solution of E exists and it is

• q̂_E(θ_1) = (0.00286, 0.996, 0.00048), for θ_1,
• q̂_E(θ_2) = (0.01429, 0.983, 0.00238), for θ_2.
The weights given by EL to −2 are very different in the two models; the same holds for 2. The empirical likelihood ratio indicates decisive support for θ_2; cf. Zhang [46].
The BD-duality gap thus implies that in the discrete iid setting, when C is linear, EL-based inferences from finite samples may be grossly misleading.
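The discrepancy can be reproduced numerically. In the toy setting below (our own choice of alphabet and constraint), the EL inner problem is solved on the active letters, while the multinomial primal is solved by a single PP step; the feasible set is neither an H-set nor a Z-set, yet the two solutions differ and ℓ̂_P < ℓ̂_E:

```python
import math

def smith_solve(nu, u):
    """Smith-dual form q_i = nu_i/(1 + a*u_i) for nu > 0 and a single
    equality constraint <q, u> = 0; a is found by bisection."""
    lo = max(-1.0 / v for v in u if v > 0) + 1e-12
    hi = min(-1.0 / v for v in u if v < 0) - 1e-12
    g = lambda a: sum(n * v / (1 + a * v) for n, v in zip(nu, u))
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) > 0 else (lo, mid)
    a = 0.5 * (lo + hi)
    return [n / (1 + a * v) for n, v in zip(nu, u)]

def inaccuracy(nu, q):
    # Kerridge inaccuracy -<nu, log q>; zero-count letters contribute nothing
    return -sum(n * math.log(p) for n, p in zip(nu, q) if n > 0)

x  = [-1.0, 1.0, 5.0]          # alphabet; constraint E[X] = 0, i.e. u = x
nu = [0.9, 0.1, 0.0]           # letter 5 is passive

# EL inner problem E: restricted to the active letters {-1, 1}
q_E = smith_solve(nu[:2], x[:2])
# Multinomial primal P: one PP step with a small uniform activation
d = 1e-9
q_P = smith_solve([(1 - d) * n + d / 3 for n in nu], x)

print(q_E, inaccuracy(nu, q_E + [0.0]))
print(q_P, inaccuracy(nu, q_P))
```

Here q̂_E = (1/2, 1/2), while q̂_P ≈ (3/4, 1/8, 1/8) puts mass 1/8 on the unobserved letter 5 and attains a strictly smaller inaccuracy, hence a strictly larger multinomial likelihood.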

Continuous case and Fisher likelihood
As far as continuous random variables are concerned, due to the finite precision of any measurement 'all actual sample spaces are discrete, and all observable random variables have discrete distributions', Pitman [34, p. 1]. Already Fisher's original notion of the likelihood [13] (see also Lindsey [29, p. 75]) reflects the finiteness of the sample space. For an iid sample X_1^n ≔ (X_1, X_2, . . . , X_n) and a finite partition A, Fisher's likelihood is based on the counts n(A_l), where n(A_l) is the number of observations in X_1^n that belong to A_l. Thus, this view carries the discordances between the multinomial and empirical likelihoods also to the continuous iid setting. The type shown in Figure 25 is induced by a random sample of size n = 100 from the A-quantized standard normal distribution. The EL estimate of θ is −0.052472, and the associated EL-maximizing distribution q̂_E is different from the multinomial likelihood-maximizing distribution q̂_P, which is associated with the estimated value 0.000015 and assigns positive weight also to the passive letters −4 and 4.

Implications for other methods
Besides the empirical likelihood, the minimum discrimination information, compositional data analysis, Lindsay geometry, bootstrap in the presence of auxiliary information, and the Lagrange multiplier test ignore information about the alphabet and are restricted to the observed data. Thus, they are affected by the above findings. 1) In the analysis of contingency tables with given marginals, the minimum discrimination information (MDI) method (cf. Ireland and Kullback [23]) is more popular than the maximum multinomial likelihood method. This is because the former is more computationally tractable, thanks to the generalized iterative scaling algorithm (cf. Ireland et al. [22]). MDI minimizes I_0(q ‖ ν) over q, so a solution of the MDI problem must assign zero mass to any passive, unobserved letter. Thus, MDI is effectively an empirical method. This implies that the MDI-minimizing distribution restricted to a convex closed set C may not exist; the multinomial likelihood-maximizing distribution, however, always exists (cf. Theorem 2), and may assign a positive mass to unobserved outcomes.
Example 26 (Contingency table with given marginals). Consider a 3 × 3 contingency table with given marginals. Let X = {1, 2, 3} × {1, 2, 3}, and let the observed bivariate type ν have all its mass concentrated at (1, 1); the remaining eight cells have zero counts. Let the column and row marginals be f_c = (1, 2, 7)/10 and f_r = (5, 4, 1)/10, respectively. One of the multinomial likelihood maximizing distributions q̂_P is displayed in Table 2. In the active letter, q̂^a_P = 0.1 is unique; in the passive letters, q̂^p_P ∈ C^p(q̂^a_P). The table exhibits the q̂^p_P which can also be obtained by the PP algorithm with the uniform activation (8.4). Note that C^a(0^p) = ∅, so the MDI-minimizing distribution does not exist. It is worth stressing that the PP algorithm makes the multinomial likelihood primal problem P computationally feasible.
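For intuition about the scaling method mentioned above, here is a minimal iterative proportional fitting sketch (our own toy illustration, not the paper's computation; f_r and f_c are those of Example 26). With a strictly positive seed the scaling steps are well defined, and any zero seed cell stays zero forever, which is the mechanism by which MDI inherits the empty cells of the observed type; for the degenerate ν of Example 26 the row and column sums of the seed vanish and the scaling steps break down, reflecting the nonexistence of the MDI solution.

```python
# Iterative proportional fitting (IPF): rescale rows, then columns, of a seed
# table until it matches target marginals f_r (rows) and f_c (columns).
f_r = [0.5, 0.4, 0.1]
f_c = [0.1, 0.2, 0.7]

def ipf(seed, f_r, f_c, sweeps=2000):
    q = [row[:] for row in seed]
    for _ in range(sweeps):
        for i in range(3):                      # fit row sums
            s = sum(q[i])
            if s > 0:
                q[i] = [f_r[i] * x / s for x in q[i]]
        for j in range(3):                      # fit column sums
            s = sum(q[i][j] for i in range(3))
            if s > 0:
                for i in range(3):
                    q[i][j] = f_c[j] * q[i][j] / s
    return q

q = ipf([[1.0] * 3 for _ in range(3)], f_r, f_c)

# A zero in the seed is never revived: MDI-style fitting keeps empty cells empty.
seed0 = [[1.0] * 3 for _ in range(3)]
seed0[0][0] = 0.0
q0 = ipf(seed0, f_r, f_c)
```

Note that q0[0][0] remains exactly zero although the fitted marginals are matched, in contrast with the multinomial maximizer of Example 26, which revives unobserved cells.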
2) Multinomial likelihood maximization has the same solution regardless of whether the proportions ν or the counts (n_i)_{i∈X} are used. Note that the vector ν of normalized frequencies is an instance of compositional data. In the analysis of compositional data it is assumed that the compositional data (x_1, . . . , x_m) belong to {(x_1, . . . , x_m) : x_1 > 0, . . . , x_m > 0, ∑_i x_i = 1}; cf. Aitchison [3, Sect. 2.2]. This assumption transforms ν ∈ ∆_X into ν^a ∈ ∆_{X^a}. Consequently, the multinomial likelihood problem P is replaced by the empirical likelihood problem E. However, this replacement is not without consequences, as the solution of the empirical likelihood problem E (if it exists) may differ from the solution of P; cf. Section 9.
3) Lindsay [28, Sect. 7.2] discusses multinomial mixtures under linear constraints on the mixture components, and assumes that it is sufficient to consider the distributions supported in the data (i.e., in the active alphabet). Though the objective function ℓ(·) in P is a 'single-component' multinomial likelihood, the present results for the H-set, Z-set, and BD-gap suggest that it would be more appropriate to work with the complete alphabet; see also Anaya-Izquierdo et al. [5, Sect. 5.1]. 4) Bootstrap in the presence of auxiliary information (cf. Zhang [45], Hall and Presnell [20]) in the form of a convex closed set resamples from the EL-maximizing distribution q̂_E. Hence, this method intentionally discards information about the alphabet. Resampling from q̂_P seems to be a better option.
5) The Lagrange multiplier (score) test (cf. Silvey [40]) of the linear restrictions on q (cf. C given by (2.3)) fails if C is an H-set or a Z-set with respect to ν, because the Lagrangian first-order conditions do not lead to a finite solution of P. However, the multinomial likelihood ratio exists.

Notation and preliminaries
In this section we introduce notation and recall notions and results which will be used later; the section is based mainly on Bertsekas [9] and Rockafellar [38, 37]. We do not repeat the definitions introduced in the previous part of the paper.
We assume that the extended real line R̄ ≔ [−∞, ∞] is equipped with the order topology; so it is a compact metrizable space homeomorphic to the unit interval. The arithmetic operations on R̄ are defined in the usual way; further, we put 0 · (±∞) ≔ 0. For α ≤ 0 we define log(α) ≔ −∞; then log : R̄ → R̄ is continuous. In the matrix operations, the members of R^m are considered to be column matrices. If no confusion can arise, a vector with constant values is denoted by a scalar.
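These conventions can be mirrored in code. A small sketch (the helper names are our own) of the extended log and of the objective ℓ(q) = −⟨ν, log q⟩, where the convention 0 · (−∞) ≔ 0 makes passive letters contribute zero even when q_i = 0:

```python
import math

def ext_log(a):
    """log extended to all of R: log(a) = -inf for a <= 0."""
    return math.log(a) if a > 0 else -math.inf

def ell(nu, q):
    """ell(q) = -<nu, log q> with the convention 0 * (±inf) := 0."""
    total = 0.0
    for n, x in zip(nu, q):
        if n != 0.0:                  # 0 * (-inf) := 0: zero-count letters drop out
            total -= n * ext_log(x)
    return total
```

For example, with ν = (1/2, 1/2, 0), the value ℓ((1/2, 1/2, 0)) = log 2 is finite although the passive letter has zero mass, while ℓ is +∞ whenever an active letter gets zero mass.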
Let C be a nonempty subset of R^m. The convex hull of C is denoted by conv(C). The polar cone of C is the set C* ≔ {y ∈ R^m : ⟨y, q⟩ ≤ 0 for every q ∈ C}. This is a nonempty closed convex cone [9, p. 166]. Assume that C is convex. The relative interior ri(C) of C is the interior of C relative to the affine hull aff(C) of C [9, p. 40]; it is nonempty and convex [9, Prop. 1.4.1].
The recession cone of a convex set C is the convex cone R_C ≔ {y ∈ R^m : x + αy ∈ C for every x ∈ C and α ≥ 0}; its lineality space L_C ≔ R_C ∩ (−R_C) is a linear subspace of R^m. Note that if C is a cone then R_C = C and L_C = C ∩ (−C).
Let X be a subset of R^m and f : X → R̄ be a function. By f′(x; y) we denote the directional derivative of f at x in the direction y [9, p. 17]. By ∇f(x) and ∇²f(x) we denote the gradient and the Hessian of f at x. For a nonempty set C ⊆ X, argmin_C f and argmax_C f denote the sets of all minimizing and maximizing points of f over C, respectively; that is, argmin_C f ≔ {x ∈ C : f(x) = inf_C f} and argmax_C f ≔ {x ∈ C : f(x) = sup_C f}. The (effective) domain and the epigraph of f are the sets [9, p. 25] dom(f) ≔ {x ∈ X : f(x) < ∞} and epi(f) ≔ {(x, γ) ∈ X × R : f(x) ≤ γ}. The function f is called
• proper if dom(f) ≠ ∅ and f(x) > −∞ for every x ∈ X;
• closed if epi(f) is closed;
• convex if epi(f) is convex;
• concave if (−f) is convex.
When dealing with closedness of f , we will often use the following simple lemma [9, Prop. 1.2.2 and p. 28].
Lemma 27. Let f : X → R̄ be a map defined on a set X ⊆ R^m. Define f̃ : R^m → R̄ by f̃(x) ≔ f(x) for x ∈ X and f̃(x) ≔ ∞ otherwise. Then the following are equivalent: (a) f is closed; (b) f̃ is closed; (c) f̃ is lower semicontinuous.
The recession cone R_f of a proper convex closed function f : R^m → R̄ is the recession cone of any of its nonempty level sets V_γ ≔ {x : f(x) ≤ γ} [9, p. 93]. The lineality space L_f of R_f is, due to the convexity of f, the set of directions y in which f is constant (that is, f(x + αy) = f(x) for every x ∈ dom(f) and α ∈ R); thus L_f is also called the constancy space of f [9, p. 97]. If g : R^m → R̄ is concave, the corresponding notions for g are defined via the convex function (−g).
The fundamental results underlying the importance of recession cones are the following theorems.
Theorem 28. Let C be a nonempty convex closed subset of R^m and f : R^m → R̄ be a convex closed function with C ∩ dom(f) ≠ ∅. Then the following are equivalent:
(a) the set argmin_C f of minimizing points of f over C is nonempty and compact;
(b) C and f have no common nonzero direction of recession, that is, R_C ∩ R_f = {0}.
Both conditions are satisfied, in particular, if C ∩ dom(f) is bounded.
Theorem 29. Let C be a nonempty convex closed subset of R^m and f : R^m → R̄ be a convex closed function such that C ∩ dom(f) ≠ ∅. If
R_C ∩ R_f ⊆ L_C ∩ L_f, (11.2)
or if C is polyhedral and R_C ∩ R_f ⊆ L_f, then the set argmin_C f of minimizing points of f over C is nonempty. Under condition (11.2), argmin_C f can be written as C̃ + (L_C ∩ L_f), where C̃ is compact.
Standing Assumption. If not stated otherwise, in the sequel it is assumed that a nonempty convex closed subset C of ∆ X having support X , and a type ν ∈ ∆ X are given.
Since no confusion can arise, by ℓ we denote also an extension of the original function ℓ (defined in (2.1)) to R^m; the conventions 0 · (−∞) = 0 and log β = −∞ for every β ≤ 0 apply.

Proof of Theorem 2 (Primal problem)
In the next lemma we prove that ℓ is a proper convex closed function. Since C is compact, C has no nonzero direction of recession. From this and Theorem 28, Theorem 2 will follow.
It remains to prove (e). Fix any γ ∈ R such that the level set V ≔ {x : ℓ(x) ≤ γ} is nonempty. If z is such that z^a ≱ 0, then there is an active letter i with z_i < 0. In such a case, for any x ∈ V there is α > 0 with x_i + αz_i ≤ 0, and so x + αz ∉ dom(ℓ) ⊇ V. Hence, by (11.1), z ∉ R_V = R_ℓ. Now take any z with z^a ≥ 0. Then, for every x ∈ V and α > 0, ℓ(x + αz) ≤ ℓ(x) by the monotonicity of the logarithm. That is, x + αz ∈ V and so z ∈ R_V = R_ℓ. The property (e) is proved.
Proposition 31. If the support of C is X, then the primal value ℓ̂_P is finite and the solution set S_P is nonempty and compact. Moreover, if ν > 0 then S_P is a singleton.
Proof. Since C is compact, its recession cone is trivial. Thus the first assertion follows from Theorem 28. The fact that S_P is a singleton provided ν > 0 follows from the strict convexity of ℓ.
Proposition 32. Let ν ∈ ∆_X and C be a convex closed set having support X. Then the π^a-projection of S_P onto the active letters is always a singleton {q̂^a_P}, and S_P = {q̂^a_P} × C^p(q̂^a_P). Consequently, S_P is a singleton if and only if C^p(q̂^a_P) is a singleton.
Proof. Note that C^a = π^a(C) is a nonempty convex closed subset of R^{m_a}. Define ℓ^a : R^{m_a} → R̄ by ℓ^a(x^a) ≔ −⟨ν^a, log x^a⟩ for x^a ∈ C^a and ℓ^a(x^a) ≔ ∞ otherwise. Since
ℓ(q) = ℓ^a(q^a) for every q ∈ C, (11.4)
it holds that inf_C ℓ = inf_{C^a} ℓ^a. The map ℓ^a is proper, convex and closed (use Lemma 27 and the fact that the restriction of ℓ^a to the closed set C^a is continuous, hence closed). Since dom(ℓ^a) ⊆ C^a is bounded, Theorem 28 gives that argmin ℓ^a is a nonempty compact set. This set is a subset of dom(ℓ^a) and the restriction of ℓ^a to dom(ℓ^a) is strictly convex (Lemma 30(c)), so argmin ℓ^a is a singleton. Hence there is a unique point q̂^a_P ∈ C^a such that ℓ^a(q̂^a_P) = ℓ̂_P. Now, (11.4) gives that q ∈ S_P if and only if q^a = q̂^a_P; so S_P = {q̂^a_P} × C^p(q̂^a_P). Theorem 2 immediately follows from Propositions 31 and 32.
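Proposition 32 can be illustrated numerically on a toy instance of our own (not from the paper): with ν = (1/2, 1/2, 0, 0) and C = {q ∈ ∆_X : q_1 + q_2 = 0.6}, the active part of any solution is the unique maximizer (0.3, 0.3), while the remaining passive mass 0.4 may be split between the two passive letters arbitrarily, so S_P = {(0.3, 0.3)} × C^p(q̂^a_P) is not a singleton:

```python
import math

nu_a = (0.5, 0.5)        # active part of the type; the two passive letters have zero counts
active_mass = 0.6        # the constraint q_1 + q_2 = 0.6 fixes the total active mass

def ell(q1):
    """ell as a function of the active part (q1, 0.6 - q1); passive letters drop out."""
    q2 = active_mass - q1
    if q1 <= 0 or q2 <= 0:
        return math.inf
    return -(nu_a[0] * math.log(q1) + nu_a[1] * math.log(q2))

# Grid search over the active part: the minimum is at q1 = 0.3.
grid = [i / 10000 for i in range(1, 6000)]
q1_best = min(grid, key=ell)

# The passive part is free: any split of the remaining 0.4 gives the same ell value,
# because ell(q) = ell^a(q^a) for every q in C (cf. (11.4)).
splits = [(0.4, 0.0), (0.2, 0.2), (0.0, 0.4)]
values = [ell(q1_best) for _ in splits]
```

The active projection is unique while C^p(q̂^a_P) is a whole segment, exactly as the proposition asserts.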

Proof of Theorem 5 (Convex conjugate by Lagrange duality)
In this section we prove Theorem 5 on the convex conjugate ℓ*, defined by the convex conjugate primal problem (cc-primal, for short) ℓ*(z) = sup_{q∈∆_X} (⟨q, z⟩ − ℓ(q)).
The proof is based on a reformulation of the cc-primal in terms of the Lagrangian function K_z; cf. Lemma 34. Then we will show that the map µ ↦ sup_{x≥0} K_z(x, µ) is minimized at µ̂(z) = max{µ̂(z^a), max(z^p)}; cf. Section 11.3.2. The structure of the solution set S_cc(z) of the cc-primal is described in Section 11.3.3. Additional properties of the convex conjugate, which will be utilized in the proof of Theorem 6, are stated in Section 11.3.4.
Proof. The first two equalities follow from the facts that ξ_{z^a+c}(· + c) = ξ_{z^a}(·) and that max(z^p + c) = max(z^p) + c. The final one is a trivial consequence of the definition of ℓ*; indeed, since ⟨q, c⟩ = c for every q ∈ ∆_X, ℓ*(z + c) = sup_q (⟨q, z + c⟩ − ℓ(q)) = ℓ*(z) + c.

Lagrange duality for the convex conjugate
Assume that ν ∈ ∆_X and z ∈ R^m are given, and let h_z, F_z and k_z be the associated extended-real-valued functions.
Lemma 34 (Lagrange duality for the convex conjugate). For every ν ∈ ∆_X and z ∈ R^m, ℓ*(z) = inf_{µ∈R} k_z(µ).
Proof. We follow [37, Sect. 4]. Denote by R^m_⊕ the subset {x ∈ R^m : x^a > 0} of R^m. The set D ≔ {(x, u) : x ≥ 0, ⟨x, 1⟩ ≤ 1 + u_1} is closed convex (in fact, polyhedral), and the map F̃_z : D → R̄, (x, u) ↦ −h_z(x), is convex and continuous, hence closed. Since the epigraphs of F̃_z and F_z coincide, F_z is convex and closed jointly in x and u. (11.7) The corresponding optimal value function ϕ_z : R² → R̄ is defined as in [37, Sect. 4].
The Lagrangian function L_z : R^m × R² → R̄ associated with F_z is defined as in [37, Sect. 4].
Proof of Theorem 5. Keep the notation from Section 11.3.1. By Lemma 34, ℓ*(z) = inf_{µ∈R} k_z(µ), and, using partial maximization, k_z(µ) decomposes into a supremum over x^a plus a term c that does not depend on x^p. Thus the second case immediately gives
k_z(µ) = ∞ if µ < max(z^p). (11.11)
If µ ≥ max(z^p) and z^a_i − µ ≥ 0 for some i ∈ X^a, then k_z(µ) = ∞ (use (11.10) and the fact that, for any a ≥ 0 and b > 0, the map f(x) ≔ ax + b log x is strictly increasing and lim_{x→∞} f(x) = ∞). That is,
k_z(µ) = ∞ unless µ ≥ max(z^p) and µ > max(z^a). (11.12)
Assume now that µ ≥ max(z^p) and µ > max(z^a); then the supremum defining k_z(µ) is finite and attained.
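The resulting value ℓ*(z) = −1 + µ̂(z) + I_{µ̂(z)}(ν^a ‖ z^a), with µ̂(z) = max{µ̂(z^a), max(z^p)} and µ̂(z^a) the unique root of ξ_{z^a}(µ) = ⟨ν^a, 1/(µ − z^a)⟩ = 1, is easy to evaluate; writing I_µ(ν^a ‖ z^a) = ⟨ν^a, log(ν^a/(µ − z^a))⟩, a bisection-based sketch (our own implementation; the concrete ν and z in the test are made up) that is consistent with the translation identity of Lemma 33 and the Lipschitz bound of Lemma 38:

```python
import math

def mu_hat_active(nu_a, z_a):
    """Unique root mu > max(z_a) of xi(mu) = sum_i nu_i / (mu - z_i) = 1, by bisection."""
    top = max(z_a)
    xi = lambda mu: sum(n / (mu - z) for n, z in zip(nu_a, z_a))
    lo, hi = top + 1e-12, top + 1.0
    while xi(hi) > 1.0:          # xi decreases from +inf to 0 on (max(z_a), inf)
        hi = top + 2.0 * (hi - top)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if xi(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def conj(nu_a, z_a, z_p):
    """ell*(z) = -1 + mu + <nu_a, log(nu_a / (mu - z_a))>, mu = max(mu_hat(z_a), max(z_p))."""
    mu = mu_hat_active(nu_a, z_a)
    if z_p:
        mu = max(mu, max(z_p))
    return -1.0 + mu + sum(n * math.log(n / (mu - z)) for n, z in zip(nu_a, z_a))
```

Since conj computes a supremum, it dominates ⟨q, z⟩ + ⟨ν, log q⟩ at every feasible q, and it shifts by exactly c when z is shifted by c.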

The structure of the solution set S cc (z)
The structure of the cc-primal solution set S cc (z), defined by (3.2), is described. First, recall the definition (11.5) of ξ.
Proposition 36. For every z ∈ R^m, S_cc(z) is a nonempty compact set. In particular, if µ̂(z^a) ≥ max(z^p), then S_cc(z) = {(q̂^a_cc(z), 0^p)} is a singleton.
Proof. Fix z ∈ R^m and put µ ≔ µ̂(z), q̂^a_cc ≔ q̂^a_cc(z); then, by Theorem 5, ℓ*(z) = −1 + µ + I_µ(ν^a ‖ z^a). The fact that S_cc(z) is nonempty and compact follows from Theorem 28.
Hence q̂ ∈ S_cc(z) if and only if ⟨q̂^p, z^p⟩ = µ(1 − γ). Since the total passive mass ∑_i q̂^p_i equals 1 − γ, the last condition is equivalent to the fact that q̂^p_i = 0 for every i ∈ X^p with z^p_i < µ. In view of (11.14) this proves the proposition.

Additional properties of the convex conjugate
Some additional properties of the convex conjugate ℓ*, concerning its monotonicity and differentiability, are summarized in the next lemmas.
Proof. (a) If z^a ≥ z̃^a then ξ_{z^a}(µ) ≥ ξ_{z̃^a}(µ) for every µ > max(z^a); moreover, ξ_{z^a}(µ) = ξ_{z̃^a}(µ) for some µ > max(z^a) if and only if z^a = z̃^a. Using (11.6), (a) follows.
Since the convex conjugate ℓ* is finite-valued and convex, by [38, Thm. 10.4] it is locally Lipschitz. The following lemma claims that ℓ* is even globally Lipschitz with Lipschitz constant equal to 1. (Here we assume that R^m is equipped with the sup-norm ‖x‖_∞ = max_i |x_i|.)
Lemma 38. The convex conjugate ℓ* : R^m → R is a (finite-valued) convex function which is Lipschitz with Lip(ℓ*) = 1.
Lemma 39. The map µ̂ : R^{m_a} → R is differentiable (even C^∞), with ∇µ̂(z^a) obtained from the implicit function theorem applied to the identity ξ_{z^a}(µ̂(z^a)) = 1; moreover, the map µ̂ : R^m → R is continuous.
Proof. Since µ̂(z) = max(µ̂(z^a), max(z^p)), the continuity of µ̂ : R^m → R follows from the continuity of µ̂ : R^{m_a} → R; thus it suffices to prove the first part of the lemma.
Lemma 40. For every z, v ∈ R^m, the subdifferential and the directional derivative of ℓ* are given by ∂ℓ*(z) = S_cc(z) and (ℓ*)′(z; v) = max_{q∈S_cc(z)} ⟨q, v⟩. In particular, if µ̂(z^a) ≥ max(z^p) then ℓ* is differentiable at z and ∇ℓ*(z) = (q̂^a_cc(z), 0^p).
Note that µ̂(·) is continuous by Lemma 39, and hence also q̂^a_cc(·) is continuous. Thus q̂^a_cc(z) > ε on a neighborhood U of z̄, and so ℓ* = f on U. This proves the first assertion.
If µ̂(z̄^a) ≥ max(z̄^p) then S_cc(z̄) is a singleton {q̂}, where q̂ ≔ (q̂^a_cc(z̄), 0^p). In such a case Danskin's theorem gives that ℓ* is differentiable at z̄ with ∇ℓ*(z̄) = ∇_z ϕ(z̄, q̂) = q̂. So also the second assertion of the lemma is proved.
The convex conjugate ℓ* is not strictly convex, even if ν > 0. This is so because ℓ*(z + c) = ℓ*(z) + c for every constant c; cf. Lemma 33. However, the following holds.
Lemma 41. Let ν > 0. Then ℓ* is C^∞ and, for every z ∈ R^m, the Hessian ∇²ℓ*(z) is positive semidefinite, with x^⊤∇²ℓ*(z)x = 0 if and only if x_1 = · · · = x_m.
Proof. We use the notation from the proof of Lemma 39. By Corollary 35 and Lemma 39, the second-order partial derivatives of ℓ* can be computed explicitly, both for k = l and for k ≠ l. Thus the first assertion of the lemma is proved.
To prove the second part of the lemma, fix any z, x ∈ R^m and put A ≔ ∇²ℓ*(z). Since the weights appearing in x^⊤Ax are positive and sum to one, Jensen's inequality gives that x^⊤Ax ≥ 0, and x^⊤Ax = 0 if and only if x_1 = · · · = x_m. The lemma is proved.

Proof of Theorem 6 (Relation between F and P)
The proof of Theorem 6 goes through several lemmas. First, in Lemma 43, the extremality relation between P and F is established using the Primal Fenchel duality theorem. Then the minimax equality for L (cf. (11.15)) is proved; see Corollary 45. Proposition 47 gives a relation between the solution set S_P of the primal and the solution set S_cc(−ŷ) of the convex conjugate ℓ* for ŷ ∈ S_F. Lemma 48 provides a key for establishing the second part of Theorem 6. The structure of the solution set S_F is described in Lemma 50.

Extremality relation
In the following, denote by ∆⁺_X the set {q ∈ ∆_X : q^a > 0}; note that neither the primal nor the convex conjugate is affected by restricting to ∆⁺_X.
Lemma 42. Let C ⊆ ∆_X. If y ∈ C* then y + R^m_− ⊆ C*.
Now the Primal Fenchel duality theorem [9, Prop. 7.2.1, pp. 439-440], applied to the convex set ∆⁺_X, the convex cone K_C, and the real-valued convex function ℓ|_{∆⁺_X}, gives the extremality relation ℓ̂_P = −ℓ̂_F, and that the supremum on the right-hand side is attained. That is, ℓ̂_P = −ℓ̂_F and S_F ≠ ∅.
Lemmas 43 and 44 immediately yield the minimax equality for L.

Fromq P toŷ F
The proof of the following lemma is inspired by that of [10, Thm. 2.1].

Structure of S F
Lemma 49. Let q̂^a_P be as in Theorem 2.
Proof. Take any ŷ ∈ S_F. By Propositions 47 and 36, the asserted description of S_F holds; from this the result follows immediately.

Proof of Theorem 6
Now we are ready to prove Theorem 6.
Proof of Theorem 6. The facts that ℓ̂_F = −ℓ̂_P and S_F ≠ ∅ were proved in Lemma 43. Further, the set S_F is convex and closed due to the fact that the conjugate ℓ* is convex and closed (even continuous, see Lemma 38). To show compactness of S_F it suffices to prove that it is bounded. We already know from Lemmas 48 and 50 that ŷ^a > −1 and ŷ^p ≥ −µ̂(−ŷ) = −1 for every ŷ ∈ S_F; thus S_F is bounded from below by −1.
The rest of Theorem 6 follows from Lemmas 48 and 50.

Proof of Theorem 16 (Relation between B and P)
First, basic properties of ℓ*_B are proven. Then Lemma 52 provides a preparation for the recession cone considerations of B. This leads to Proposition 53, giving conditions for the finiteness of ℓ̂_B. There B is seen as a primal problem, and Theorem 29 is applied to it. The solution set S_B is described in Lemma 54. Lemma 55 provides properties of q̂_B ∈ C, defined via ŷ_B, noting that q̂_B need not belong to S_P. Conditions equivalent to q̂_B ∈ S_P are stated in Lemma 58. Its proof utilizes also Lemmas 56 and 57.
Let ν ∈ ∆_X be a type. Recall from Section 4 that the map ℓ*_B : R^m → R̄ is, for y^a > −1, defined by ℓ*_B(y) = I_1(ν^a ‖ y^a) = ⟨ν^a, log(ν^a/(1 + y^a))⟩.
Since ℓ*_B(y) = ℓ(1 + y) + ⟨ν^a, log ν^a⟩ for every y ∈ R^m (where ℓ is defined on R^m by (11.3)), Lemma 30 yields the following result.
Lemma 51. If ν ∈ ∆_X, then (e) the recession cone R_{ℓ*_B} and the constancy space L_{ℓ*_B} admit descriptions analogous to those in Lemma 30(e).

Finiteness of the BD-dual B
Lemma 52. Let ν ∈ ∆_X and C be a convex closed subset of ∆_X having support X. Assume that C is either an H-set or a Z-set. Then there is an active letter i with the following property: for any γ > 0 and v > 0 there is y ∈ C* such that
y^a ≥ −γ and y^a_i = v. (11.18)
Proof. Assume first that C is a Z-set; that is, C^a(0^p) ≠ ∅ and there is an active letter i such that q^a_i = 0 whenever q ∈ C satisfies q^p = 0. Fix any γ > 0, v > 0, and choose ε ∈ (0, 1) such that ε < γ/(v + 2γ). By compactness of C we can find 0 < δ < ε such that, for every q ∈ C, ‖q^p‖₁ ≤ δ implies q^a_i < ε. (For if not, there is a sequence (q^{(n)})_n in C with ‖(q^{(n)})^p‖₁ ≤ 1/n and (q^{(n)})^a_i ≥ ε for every n; by compactness, there is a limit point q ∈ C of this sequence, and any such q satisfies q^p = 0 and q^a_i ≥ ε, a contradiction.) Finally, take any w ≥ v/δ.
Define y ∈ R^m by
y^a_i = v, y^a_j = −γ for j ∈ X^a \ {i}, and y^p_j = −w for j ∈ X^p. (11.19)
Take arbitrary q ∈ C; we are going to show that ⟨y, q⟩ ≤ 0. If ‖q^p‖₁ ≤ δ then q^a_i < ε and ‖q^a‖₁ ≥ 1 − δ > 1 − ε, so
⟨y, q⟩ ≤ vε − γ(1 − 2ε) < 0
by the choice of ε. On the other hand, if ‖q^p‖₁ > δ then, using q^a_i ≤ 1,
⟨y, q⟩ ≤ v − wδ ≤ 0
by the choice of w. Thus y ∈ C*. Now assume that C is an H-set; that is, C^a(0^p) = ∅. By compactness, there exists δ > 0 such that ‖q^p‖₁ > δ for every q ∈ C. Continuing as above we obtain that, for any γ > 0, v > 0, and w ≥ v/δ, the vector y given by (11.19) belongs to C*.
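The construction (11.19) can be checked numerically on a toy Z-set of our own making: X = {1, 2, 3} with letters 1, 2 active and letter 3 passive, and C = {q ∈ ∆_X : q_1 = q_3} = {(t, 1 − 2t, t) : t ∈ [0, 1/2]}. Here C^a(0^p) = {(0, 1)} ≠ ∅ and q^p = 0 forces q_1 = 0, so C is a Z-set with i = 1. With γ = 1 and v = 2 (so that ε < γ/(v + 2γ) = 1/4; we take ε = 0.2, δ = 0.1, and w = v/δ = 20), the vector y = (2, −1, −20) of (11.19) satisfies ⟨y, q⟩ ≤ 0 on all of C:

```python
# Z-set toy example: C = {(t, 1 - 2t, t) : t in [0, 1/2]} inside the simplex on
# {1, 2, 3}; letter 3 is passive. Construction (11.19) with gamma = 1, v = 2,
# epsilon = 0.2 < gamma / (v + 2 * gamma) = 0.25, delta = 0.1, w = v / delta = 20.
gamma, v = 1.0, 2.0
delta = 0.1
w = v / delta
y = (v, -gamma, -w)          # (y_1^a, y_2^a, y_3^p), cf. (11.19)

def inner(y, q):
    return sum(a * b for a, b in zip(y, q))

# <y, q> is linear in t, but we sweep a grid of C anyway.
qs = [(t, 1.0 - 2.0 * t, t) for t in (k / 1000 for k in range(501))]
worst = max(inner(y, q) for q in qs)   # should be <= 0, i.e. y lies in C*
```

Despite the large positive entry y_1 = v, the heavy penalty −w on the passive letter keeps y in the polar cone C*, which is exactly the mechanism driving ℓ̂_B to −∞ in Proposition 53.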
Proposition 53. Let ν ∈ ∆_X and C be a convex closed subset of ∆_X having support X. Then the following are equivalent: (a) the dual B is finite; (b) the set C is neither an H-set nor a Z-set (with respect to ν).
Proof. Assume that C is either an H-set or a Z-set. Fix any γ ∈ (0, 1). Then, by Lemma 52, for arbitrary v > 0 there is y ∈ C* satisfying (11.18). For such y,
ℓ*_B(y) ≤ ν_i log(ν_i/(1 + v)) + ∑_{j∈X^a\{i}} ν_j log(ν_j/(1 − γ)).
Since γ is fixed and v > 0 is arbitrary, ℓ̂_B = −∞. This proves that (a) implies (b). Now we show that (b) implies (a). Assume that C is neither an H-set nor a Z-set; that is, there is q ∈ C with q^a > 0 and q^p = 0.
Lemma 54. Let ν ∈ ∆_X. Let C be a convex closed subset of ∆_X having support X, which is neither an H-set nor a Z-set (with respect to ν). Then the solution set S_B is nonempty, and there is ŷ^a_B ∈ C*^a such that ŷ^a_B > −1 and S_B = {y ∈ C* : y^a = ŷ^a_B}. Moreover, S_B is a singleton if and only if ν > 0.
Proof. The proof is analogous to that of Proposition 32. Define ℓ*^a_B : C*^a → R̄ by ℓ*^a_B(y^a) ≔ ⟨ν^a, log(ν^a/(1 + y^a))⟩ if y^a > −1, and ℓ*^a_B(y^a) ≔ ∞ otherwise. Then ℓ*_B(y) = ℓ*^a_B(y^a) for any y ∈ C*, hence
ℓ̂_B = inf_{y^a∈C*^a} ℓ*^a_B(y^a) and S_B = {y ∈ C* : y^a ∈ argmin_{C*^a} ℓ*^a_B}. (11.20)
By Lemma 51, ℓ*^a_B is strictly convex, so argmin_{C*^a} ℓ*^a_B contains at most one point. The set C is neither an H-set nor a Z-set, so there is q ∈ C with q^a > 0 and q^p = 0. Thus, as in the proof of Proposition 53, R_{ℓ*^a_B} ∩ R_{C*^a} ⊆ R_{ℓ*^a_B} ∩ R_{{q^a}*} = {0^a}. Since C*^a ∩ dom(ℓ*^a_B) ≠ ∅, Theorem 28 gives that argmin_{C*^a} ℓ*^a_B is nonempty, hence a singleton; denote its unique point by ŷ^a_B. Since trivially ŷ^a_B > −1, the first assertion of the lemma follows from (11.20). The second assertion follows from the first one and Lemma 42.

No BD-duality gap; proof of Theorem 16
The proof of the following lemma is inspired by that of [10,Thm. 2.1].
Lemma 55 (Relation between ŷ_B and q̂_B). Let ν ∈ ∆_X. Let C be a convex closed subset of ∆_X having support X, which is neither an H-set nor a Z-set (with respect to ν). Let ŷ^a_B be as in Lemma 54 and put q̂_B ≔ (ν^a/(1 + ŷ^a_B), 0^p). Note that q̂_B need not belong to S_P; for conditions equivalent to q̂_B ∈ S_P, see Lemma 58.
for every y ∈ C*. (11.22) Applying (11.22) to y = 2ŷ_B and then to y = (1/2)ŷ_B (both belonging to C* since C* is a cone) gives (11.23). Now (11.22) and (11.23) yield ⟨q̂_B, y⟩ ≤ 0 for every y ∈ C*; that is, q̂_B ∈ C** = K_C (for the last equality use that C is convex closed). Further (recall the definition (11.5) of ξ), by (11.23) and (11.21), since max(−ŷ^a_B) < 1 we get µ̂(−ŷ^a_B) = 1. Moreover, q̂_B ∈ ∆_X (use that q̂_B ≥ 0), and so q̂_B ∈ (K_C ∩ ∆_X) = C.
Lemma 56. Let ν ∈ ∆_X and C be a convex closed subset of ∆_X having support X. Then ℓ̂_B ≤ ℓ̂_F.
Proof. It follows from Theorems 6 and 5 that the claimed inequality holds. If ν > 0 then µ̂(−y) = µ̂(−y^a) for every y; so, by Lemma 55, ℓ̂_B = ℓ̂_F.
Lemma 57. Let ν ∈ ∆_X, C be a convex closed subset of ∆_X having support X, and q̂^a_P be as in Theorem 2. Assume that C is neither an H-set nor a Z-set (with respect to ν), and that ‖q̂^a_P‖₁ = 1. Then ℓ̂_B = ℓ̂_F and ŷ ≔ (ν^a/q̂^a_P − 1^a, −1^p) belongs to S_B ∩ S_F.
Proof. By Theorems 6(a) and 5, ŷ ∈ S_F and ℓ̂_F = ℓ*(−ŷ) = I_1(ν^a ‖ ŷ^a) = ℓ*_B(ŷ). Moreover, ξ_{−ŷ^a}(1) = ‖q̂^a_P‖₁ = 1, so µ̂(−ŷ^a) = 1. Now we prove that ℓ̂_B = ℓ̂_F. For ν > 0 this follows from Lemma 56. In the other case put C′ ≔ C^a(0^p) = C^a ∩ ∆_{X^a}. This is a nonempty convex closed set with supp(C′) = X^a (use that C is neither an H-set nor a Z-set). Put ν′ ≔ ν^a > 0 and consider the corresponding primal problem with the objective ℓ′(q′) ≔ −⟨ν′, log q′⟩ defined on ∆_{X^a}. Since q̂^a_P ∈ C′ we trivially have ℓ̂′_P = ℓ̂_P. Further, C′* ⊇ C*^a. (To see this, take any y^a ∈ C*^a; then there is y^p such that ⟨y^a, q^a⟩ + ⟨y^p, q^p⟩ ≤ 0 for every q ∈ C. Since (q′, 0^p) ∈ C for any q′ ∈ C′, for every such q′ we have ⟨y^a, q′⟩ ≤ 0; that is, y^a ∈ C′*.) Hence ℓ̂′_B ≤ inf_{y^a∈C*^a} I_1(ν^a ‖ y^a) = ℓ̂_B (for the equality see the definition (B) of ℓ̂_B). By Lemma 56 and Theorem 6, ℓ̂′_B = ℓ̂′_F = −ℓ̂′_P = −ℓ̂_P = ℓ̂_F; hence ℓ̂_B ≥ ℓ̂_F, which together with Lemma 56 gives ℓ̂_B = ℓ̂_F.
Proof of Lemma 58. We first prove that the conditions (a)-(k) are equivalent. First, Theorem 2 and Lemma 55 yield that the conditions (a)-(c) are equivalent and that any of them implies (d). By the definitions of ŷ^a_F and q̂^a_B from Theorem 6 and Lemma 55, (a) is equivalent to (e). Theorem 6 and Lemma 54 yield that the conditions (e)-(g) are equivalent. Since (d) implies (f) by Lemma 57, it follows that (a)-(g) are equivalent.
Since µ̂(−ŷ^a_B) = 1 for any ŷ_B ∈ S_B by Lemmas 54 and 55, (h) is equivalent to (i). Since S_F is nonempty, (g) implies (h). If (h) is true then, for some ŷ_B ∈ S_B,

11.7 Proof of Theorem 20 (Perturbed primal P_δ: the general case, epi-convergence)

We start with some notation, which will be used also in Section 11.8. Then we embark on proving the epi-convergence of the perturbed primal problems.
11.7.1 Perturbed primal P_δ: notation

Fix any ν ∈ ∆_X. Recall that m_a, m_p denote the cardinalities of X^a, X^p. For every δ > 0 take ν(δ) ∈ ∆_X such that (8.1) is true. Recall also the definitions of ℓ_δ, ℓ̂_P(δ), S_P(δ) from (P_δ) and the fact stated in (8.2). The maps ℓ_C, ℓ_C^δ : R^m → R̄ are defined as in Section 8.

Epi-convergence
For the definition of epi-convergence, see [39, Chap. 7]. We will use [39, Ex. 7.3, p. 242], stating that, for every sequence (g_n)_n of maps g_n : R^m → R̄ and for every x ∈ R^m,
(e-liminf_n g_n)(x) = lim_{ε↓0} liminf_n inf_{x′∈B(x,ε)} g_n(x′),
(e-limsup_n g_n)(x) = lim_{ε↓0} limsup_n inf_{x′∈B(x,ε)} g_n(x′), (11.28)
where B(x, ε) is the closed ball with center x and radius ε. If e-liminf_n g_n = e-limsup_n g_n ≕ g, we say that the sequence (g_n)_n epi-converges to g and we write e-lim_n g_n = g. For a system (g_δ)_{δ>0} of maps indexed by real numbers δ > 0, we say that (g_δ)_δ epi-converges to g for δ ↓ 0, and we write e-lim_δ g_δ = g, provided e-lim_n g_{δ_n} = g for every sequence (δ_n)_n decreasing to zero.
Before proving the epi-convergence of ℓ_C^δ to ℓ_C, two simple lemmas are presented.
Lemma 59. Let C be a convex closed subset of ∆_X with supp(C) = X. Then there exists a positive real β such that q̂_P(ν) ≥ βν for every ν > 0, where, for ν > 0, q̂_P(ν) denotes the unique point of S_P(ν).
Proof. Fix some q > 0 from C and put β ≔ min_i q_i. For any ν > 0 and q̂ ≔ q̂_P(ν) we have q̂ > 0; so, by Lemma 30(d), ℓ_ν is differentiable at q̂ and the directional derivative ℓ′_ν(q̂; q − q̂) = −∑_j ν_j (q_j − q̂_j)/q̂_j is nonnegative. Hence ∑_j ν_j q_j/q̂_j ≤ 1, and so q̂_j ≥ ν_j q_j ≥ βν_j for every j. From this the lemma follows.
Lemma 60. Fix β > 0, put D ≔ {(ν, q) ∈ ∆_X × ∆_X : q ≥ βν}, and define F : D → R by F(ν, q) ≔ ∑_{i: ν_i>0} ν_i log q_i. Then F is continuous on D.
Proof. Take any sequence (ν^n, q^n)_n from D such that ν^n → ν, q^n → q. Then q ≥ βν. The sum ∑_{i: ν_i>0} ν^n_i log q^n_i converges to ∑_{i: ν_i>0} ν_i log q_i. Further, for every i such that ν_i = 0, 0 ≥ ν^n_i log q^n_i ≥ ν^n_i log(βν^n_i) (which is true also in the case when ν^n_i = 0), and the right-hand side converges to zero. Hence lim_n F(ν^n, q^n) = F(ν, q) and the lemma is proved.
Lemma 61. Let C be a convex closed subset of ∆_X with supp(C) = X. Then the maps ℓ_C^δ epi-converge to ℓ_C; that is, e-lim_{δ↓0} ℓ_C^δ = ℓ_C.
Proof. Take any sequence (δ_n)_n decreasing to 0. To prove that e-lim_n ℓ_C^{δ_n} = ℓ_C we use (11.28). Take x ∈ R^m, ε > 0, and put ψ_n(x, ε) ≔ inf_{y∈B(x,ε)} ℓ_C^{δ_n}(y).
Finally, assume that x ∈ C is such that x^a > 0 and there is i ∈ X^p with x^p_i = 0. By Theorem 2 (applied to ν(δ_n) > 0 and to the nonempty convex closed subset C ∩ B(x, ε) of ∆_X), for every n there is a unique x̄_{n,ε} > 0 in C ∩ B(x, ε) such that ψ_n(x, ε) = ℓ_{δ_n}(x̄_{n,ε}).
Lemma 62. Let (δ_n)_n be a sequence decreasing to 0. Then ℓ̂_P(δ_n) → ℓ̂_P, and every cluster point of a sequence (q̂_P(δ_n))_n of solutions of the perturbed primal problems belongs to S_P.
Proof. Since (ℓ_C^{δ_n})_n epi-converges to ℓ_C and ℓ̂_P is finite, the result follows from [39, Thm. 7.1, p. 264] (in part (a) take B ≔ C; use that ℓ̂_P(δ_n) = min ℓ_C^{δ_n} and that ℓ̂_P = min ℓ_C).
The lemma states that every cluster point of a sequence (q̂_P(δ_n))_n of solutions of the perturbed primal problems P_{δ_n} (δ_n ↓ 0) is a solution of the (unperturbed) primal P. Of course, not every solution of P can be obtained as a cluster point of a sequence of perturbed solutions. Indeed, it is shown in the following section that if (ν(δ))_δ satisfies the regularity condition (8.3), then the perturbed solutions converge to a unique solution of P, regardless of whether S_P is a singleton.
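The PP idea can be sketched in a few lines of Python for a single mean constraint (a toy instance of our own; ν(δ)_i = (1 − δ)ν_i + δ/m is one activation satisfying the perturbation requirements). With X = {0, 1, 2}, ν = (1/2, 1/2, 0), and C = {q ∈ ∆_X : ∑ x q(x) = 1.2}, C is an H-set, yet the perturbed solutions, obtained from the Lagrange conditions as q_i = ν_i(δ)/(1 + β(x_i − t)) with β solving the first-order equation, converge to q̂_P = (0.2, 0.4, 0.4):

```python
# Perturbed primal (PP) sketch for C = {q : sum_x x q(x) = t} on X = {0, 1, 2}.
# For nu(delta) > 0 the maximizer of sum_i nu_i(delta) log q_i over C is
# q_i = nu_i(delta) / (1 + beta (x_i - t)), with beta the root of
# g(beta) = sum_i nu_i(delta) (x_i - t) / (1 + beta (x_i - t)) = 0.
letters = [0.0, 1.0, 2.0]
nu = [0.5, 0.5, 0.0]
t = 1.2

def solve_perturbed(delta, iters=300):
    m = len(letters)
    nu_d = [(1.0 - delta) * n + delta / m for n in nu]    # uniform activation
    d = [x - t for x in letters]
    g = lambda b: sum(n * di / (1.0 + b * di) for n, di in zip(nu_d, d))
    # g is decreasing on the interval keeping all denominators positive.
    lo = -1.0 / max(d) + 1e-15
    hi = -1.0 / min(d) - 1e-15
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    b = 0.5 * (lo + hi)
    return [n / (1.0 + b * di) for n, di in zip(nu_d, d)]

q = solve_perturbed(1e-6)   # close to the primal solution (0.2, 0.4, 0.4)
```

For δ ∈ {10⁻², 10⁻⁴, 10⁻⁶} the iterates approach (0.2, 0.4, 0.4); note the positive limit mass on the passive letter 2, which no empirical method can produce.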
Proof of Theorem 20. The first part of Theorem 20 was shown in Lemmas 61 and 62. The second part on the convergence of active coordinates then follows from Theorem 2.
11.8 Proof of Theorem 21 (Perturbed primal P_δ: the linear case, pointwise convergence)

Let a model C be given by finitely many linear constraints (2.3), where u_h (h = 1, . . . , r) are fixed vectors from R^m. Let a type ν̄ ∈ ∆_X be given and let X^a and X^p be the sets of active and passive coordinates with respect to ν̄. Assume that the perturbed types ν(δ) (δ ∈ (0, 1)) are such that the conditions (8.1) and (8.3) are true (with ν replaced by ν̄); that is, ν : (0, 1) → ∆_X is continuously differentiable, ν(δ) > 0, lim_{δ↓0} ν(δ) = ν̄, and there is a constant c > 0 such that the bound in (8.3) holds for every i ∈ X^p.
The aim of this section is to prove that, under these conditions, the solutions q̂_P(δ) of the perturbed primal problems P_δ converge to a solution of the unperturbed primal P; that is, lim_{δ↓0} q̂_P(δ) exists and belongs to S_P.

Outline of the proof
The proof is based on the following 'passive-active' reformulation of the perturbed primal problem:
min_{q∈C} ℓ_δ(q) = min_{q^a∈C^a} min_{q^p∈C^p(q^a)} ℓ_δ(q^a, q^p);
hence, the passive projection q̂^p_P(δ) of the optimal solution q̂_P(δ) is
q̂^p_P(δ) = argmin_{q^p∈C^p(q̂^a_P(δ))} ℓ_δ(q̂^a_P(δ), q^p)
(recall that C^a = π^a(C) is the projection of C onto the active coordinates and, for q^a ∈ C^a, that C^p(q^a) = {q^p ≥ 0 : (q^a, q^p) ∈ C} is the q^a-slice of C). Employing the implicit function theorem, it is then shown that the passive projections q̂^p_P(δ) can be obtained from the active ones q̂^a_P(δ) via a uniformly continuous map ϕ. Since q̂^a_P(δ) converges by Theorem 20, the uniform continuity of ϕ ensures that also q̂^p_P(δ) converges.
The above argument is implemented in the following steps:
1. Assume that S_P is not a singleton, the other case being trivial. Then the linear constraints (2.3) defining C can be rewritten in a parametric form (see Lemma 63). There, A is an m_p × s matrix of rank s, B is an m_p × m_a matrix, c ∈ R^{m_p}, none of A, B, c depends on q^a, and the (polyhedral) closed subset Λ(q^a) of R^s is defined by
Λ(q^a) ≔ {λ ∈ R^s : Aλ + Bq^a + c ≥ 0}. (11.30)
2. Define an open bounded polyhedral set G; there, for brevity, x stands for q^a and y stands for λ. Let Z denote the projection of G onto the first (1 + m_a) coordinates. For z = (δ, x) ∈ Z put
G(z) ≔ {y ∈ R^s : Ay + Bx + c > 0}, Ḡ(z) ≔ {y ∈ R^s : Ay + Bx + c ≥ 0}. (11.31)
In Lemma 65 it is proven that, for every z = (δ, x) ∈ Z, the map ψ_z (where α_i and β_i denote the i-th rows of A and B, respectively) has a unique minimizer ϕ(z) ∈ G(z).
3. Since ϕ(z) is also a local minimum of ψ_z, it satisfies F(z, ϕ(z)) = 0, where F : G → R^s is given by
F(z, y) ≔ −∇ψ_z(y). (11.32)
Using the implicit function theorem and an algebraic result (Proposition 68), it is shown that ϕ : Z → R^s is uniformly continuous (see Lemma 70).
11.8.2 Parametric expression for q^p ∈ C^p(q^a)

Let C be given by (2.3). Denote by U^p the (r + 1) × m_p matrix whose first r rows are equal to u^p_h (the passive projections of u_h, h = 1, . . . , r) and whose last row is a vector of 1's; analogously define U^a using the active projections u^a_h of u_h. Let b ∈ R^{r+1} be such that b_i = 0 for i ≤ r and b_{r+1} = 1. Then
C = {(q^a, q^p) ≥ 0 : U^p q^p = −U^a q^a + b}. (11.33)
Let V^p (of dimension m_p × (r + 1)) be the Moore-Penrose inverse of U^p. By [21, p. 5-12], (11.33) can be written in the following form, where I denotes the m_p × m_p identity matrix:
q^p = V^p(−U^a q^a + b) + (I − V^p U^p)γ, γ ∈ R^{m_p}.
Put s ≔ rank(I − V^p U^p). If s = 0 then (I − V^p U^p) = 0 and q^p uniquely depends on q^a; that is, for every q^a ∈ C^a, the set C^p(q^a) is a singleton. By Theorems 2 and 20, also S_P is a singleton and q̂_P(δ) converges to the unique member of S_P; so in this case Theorem 21 is proved.
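The pseudoinverse parametrization is easy to check numerically (a made-up small instance; only the Moore-Penrose identities U^p V^p U^p = U^p and U^p(I − V^p U^p) = 0 are used): every q^p of the displayed form solves U^p q^p = −U^a q^a + b, whatever the choice of γ.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy instance: r = 2 linear constraints, m_a = 2 active, m_p = 4 passive letters.
U_a = rng.standard_normal((3, 2))
U_p = rng.standard_normal((3, 4))
U_a[2, :] = 1.0                     # last rows are vectors of ones
U_p[2, :] = 1.0
b = np.array([0.0, 0.0, 1.0])
q_a = np.array([0.2, 0.1])

V_p = np.linalg.pinv(U_p)           # Moore-Penrose inverse of U^p
rhs = -U_a @ q_a + b

# U^p has full row rank here, so the system U^p q^p = rhs is consistent and its
# solution set is {V^p rhs + (I - V^p U^p) gamma : gamma in R^{m_p}}.
I = np.eye(4)
for _ in range(5):
    gamma = rng.standard_normal(4)
    q_p = V_p @ rhs + (I - V_p @ U_p) @ gamma
    assert np.allclose(U_p @ q_p, rhs)

s = np.linalg.matrix_rank(I - V_p @ U_p)   # dimension of the free parameter
```

Here s = m_p − rank(U^p) = 1, so a one-dimensional parameter λ survives, exactly the situation treated in Lemma 63 below.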
Hereafter, we assume that s ≥ 1 and, without loss of generality, that the first s columns of (I − V^p U^p) are linearly independent. Put A ≔ the submatrix of (I − V^p U^p) formed by the entries that lie in the first m_p rows and the first s columns. Since {(I − V^p U^p)γ : γ ∈ R^{m_p}} equals {Aλ : λ ∈ R^s}, the next lemma follows.
Lemma 63. Let C be given by (2.3) and U^p, V^p be as above. Assume that V^p U^p ≠ I. Then there are s ≥ 1, an m_p × s matrix A of rank s, an m_p × m_a matrix B, and a vector c ∈ R^{m_p} such that C = {(q^a, L(q^a, λ)) : q^a ∈ C^a, λ ∈ Λ(q^a)}.
Note that, for λ ≠ λ′, L(q^a, λ) ≠ L(q^a, λ′), since A has full column rank.
Since G(z) is open for every z, the necessary condition for optimality gives the following result.
In the next section an algebraic result (Proposition 68) implying that ϕ is Lipschitz on Z (Lemma 70) is proven.
11.8.4 Boundedness of (AᵀDA)⁻¹AᵀD

Let ‖·‖ denote the spectral matrix norm [21, p. 37-4], that is, the matrix norm induced by the Euclidean vector norm, which is also denoted by ‖·‖. If A is a matrix, Aᵀ denotes its transpose. The following result must be known, but the authors are not able to give a reference for it.
Before proving this proposition we give a formula for the inverse of AᵀDA, which is a simple consequence of the Cauchy-Binet formula; cf. [21, p. 4-4]. To this end some notation is needed. If C is an m × k matrix and H ⊆ {1, . . . , m}, K ⊆ {1, . . . , k} are nonempty, by C[H, K] we denote the submatrix of C of entries that lie in the rows of C indexed by H and the columns indexed by K; the shorthands C_H, C^{(j)}, and C^{(ij)} are used for the corresponding one- and two-index submatrices.
To finish the proof it suffices to show that ‖ςM⁻¹α‖ is bounded by a constant σ not depending on D, δ. (Indeed, if this is true then ‖(ÃᵀDÃ)⁻¹ÃᵀD‖ is bounded by a constant not depending on D, δ; this follows from the matrix norm triangle inequality applied to (E, F) = (E, 0_{s×1}) + (0_{s×m}, F), and from the matrix norm consistency property; cf. [21, p. 37-4].) If α = 0, then ςM⁻¹α = 0, so one can take σ = 0. Assume now that α = (α_h)_h ≠ 0. Using the notation from Lemma 69, one checks that the absolute value of every coordinate of v ≔ ςM⁻¹α is bounded from above by a constant not depending on D, δ. This finishes the proof of Proposition 68.

Lipschitz property of ϕ
The result of the previous section and Lemma 67 yield that ϕ is Lipschitz, hence uniformly continuous on Z.
Lemma 70. The map ϕ : Z → R^s is Lipschitz on Z. Consequently, there exists a continuous extension φ̄ : Z̄ → R^s of ϕ to the closure Z̄ of Z.
Proof. By (8.3), the norm of D⁻¹E = diag(d_i(x, y) ϑ̇_i(δ)/ϑ_i(δ))_{i=1}^{m_p} is bounded from above by a constant not depending on z = (δ, x) and y (use that 0 < d_i ≤ 1 by (11.35)). Thus, by Proposition 68 and Lemma 67, ϕ has a bounded derivative on Z. Now the Lipschitz property of ϕ follows from the mean value theorem.