The existence of the maximum likelihood estimate in high-dimensional binary response generalized linear models

Motivated by recent work on high-dimensional logistic regression, we establish that the existence of the maximum likelihood estimate exhibits a phase transition for a wide range of generalized linear models with binary outcome and elliptical covariates. This extends a previous result of Candès and Sur, who proved the phase transition for logistic regression with Gaussian covariates. Our result reveals a rich structure in the phase transition phenomenon which is obscured under Gaussianity. The main tools for deriving the result are data separation, convex geometry and stochastic approximation. We also conduct simulation studies to corroborate our theoretical findings and to explore other features of the problem.


Introduction
In this paper, we are concerned with the maximum likelihood estimate of generalized linear models [32,34] with binary outcome. More precisely, we consider n independent and identically distributed observations (x_i, y_i), i = 1, . . . , n, where the binary outcome y_i ∈ {−1, 1} is connected to the covariates x_i ∈ R^p by the probability model

P(y_i = 1 | x_i) = σ(β_0 + x_i^T β).   (1)

Here σ : R → [0, 1] is the inverse link function, and (β_0, β) ∈ R^{p+1} are the unknown parameters of the model. Popular choices for σ are
• the logit link [7]: σ(t) = e^t/(1 + e^t);
• the probit link: σ(t) = Φ(t), where Φ is the standard normal distribution function;
• the cloglog link: σ(t) = 1 − exp(−e^t).
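For concreteness, the three links used in this paper (logit, cloglog, probit; see Section 3) can be written as plain scalar maps σ : R → [0, 1]. A minimal sketch, with the probit and cloglog forms taken from their standard definitions:

```python
import math

def logit(t):
    # sigma(t) = e^t / (1 + e^t), rewritten as 1 / (1 + e^{-t})
    return 1.0 / (1.0 + math.exp(-t))

def probit(t):
    # sigma(t) = Phi(t), the standard normal distribution function
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def cloglog(t):
    # sigma(t) = 1 - exp(-e^t), the complementary log-log link
    return 1.0 - math.exp(-math.exp(t))
```

All three are strictly increasing with σ(t) → 0 as t → −∞ and σ(t) → 1 as t → +∞, as required of an inverse link.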
The maximum likelihood estimate of (β_0, β) is any maximizer of the log-likelihood

ℓ(β_0, β) = Σ_{i: y_i = 1} log σ(β_0 + x_i^T β) + Σ_{i: y_i = −1} log(1 − σ(β_0 + x_i^T β)).

In contrast to the maximum likelihood estimate of linear models, that of generalized linear models does not always exist. This phenomenon is closely related to the separability of the observed data; see Section 2.1 for a review. Classical theory deals with this issue when the number of covariates p is fixed and the number of observations n tends to infinity. In the era of data deluge, we are often in a situation where the number of covariates p and the number of observations n are comparable in size. The problems of interest are then in high-dimensional asymptotics, where p and n both tend to infinity at the same rate.
In a series of papers [12,39,40], Sur, Chen and Candès developed a theory for logistic regression with Gaussian covariates in high-dimensional regimes. They studied the asymptotic properties of the maximum likelihood estimate when p/n → κ, with applications in hypothesis testing. Candès and Sur [12] proved a phase transition for the existence of the maximum likelihood estimate in high-dimensional logistic regression with Gaussian covariates, extending an earlier result of Cover [14] in the context of information theory. Formally, there exists a threshold h_MLE, depending on the parameters of the model, such that
• if κ > h_MLE, then P(maximum likelihood estimate exists) → 0 as n, p → ∞;
• if κ < h_MLE, then P(maximum likelihood estimate exists) → 1 as n, p → ∞.
This phenomenon is referred to as the phase transition for the existence of the maximum likelihood estimate. Existence is crucial to justify the use of large sample approximations to numerous measures of goodness-of-fit, and to derive the limiting distribution of the likelihood ratio, as mentioned in [12]. However, [12] only studied the existence of the maximum likelihood estimate for Gaussian covariates, which are rarely encountered in practice. For instance, the covariates are often heavy-tailed in financial problems where p and n are large. The purpose of this paper is to generalize the results of [12] further, proving the phase transition for a large class of generalized linear models with elliptical covariates. Here we consider a large number of covariates sampled from elliptical distributions, and predict whether one can expect the maximum likelihood estimate to exist. Elliptical symmetry is a natural generalization of multivariate normality. The contribution of this paper is twofold.
• Theoretical justifications. We give a universal threshold on p/n for the existence of the maximum likelihood estimate in binary classification. Here the word 'universal' refers to a wide class of link functions and covariate distributions. Our work explores the extent to which the phase transition occurs across link functions and covariate distributions, including the logit link and Gaussian covariates as a special case. We note that the projection limit assumption (Assumption 2.6) is essential to our result, and is hidden in the special case of Gaussian covariates. This is analogous to regularity assumptions in high-dimensional signal processing [6,18]. Without this assumption, the phase transition formula may fail (for example, for log-normal covariates).
• Novel techniques. While the high-level idea is the same as in [12], we bring a few new techniques into this field. First, we give a checkable condition for the projection limit assumption. Second, we use a stochastic approximation approach (Theorem 4.3) to prove the phase transition formula. Compared with the bare-hands analysis in [12], our argument is more general and is easily adapted to other problems of interest. Finally, we provide fairly general conditions (for example, (9)) under which the phase transition occurs. These conditions reveal additional structure that is masked by Gaussianity.
In a recent paper, Montanari, Ruan, Sohn and Yan [33] considered the max-margin classifier for the random feature problem in the high-dimensional regime. They provided a phase transition threshold for the existence of the max-margin classifier, and further studied the limiting max-margin classifier. Similar to [12], however, they assumed that the covariates are Gaussian. De Loera and Hogan [31] considered the maximum likelihood estimate of a multi-class logistic regression from a different angle: they explored conditions on the number of observations and the number of classes under which the maximum likelihood estimate exists. We hope that our work will trigger further research towards a theory for multi-class classification models with non-Gaussian covariates.
The rest of the paper is organized as follows. In Section 2, we provide the background and state the main result, Theorem 2.7. In Section 3, we perform simulations to corroborate our theoretical findings. The proof of Theorem 2.7 is given in Section 4. In Section 5, we conclude with some insights and directions for future work.

Background and main result
In this section we provide the background on the existence of the maximum likelihood estimate in binary response generalized linear models, and the properties of elliptical distributions. Then we present the main result, Theorem 2.7.

Existence of the maximum likelihood estimate and data geometry
Often the computation of the maximum likelihood estimate in the logit model, implemented in many statistical packages, runs smoothly. But sometimes it fails, even when the number of covariates p is much smaller than the sample size n. One reason for this undesirable phenomenon is that the maximum likelihood estimate does not exist. It is a classical problem in statistics to characterize the existence and uniqueness of the maximum likelihood estimate in generalized linear models. Historically, Haberman [22] and Wedderburn [41] provided general criteria for the maximum likelihood estimate to exist. Silvapulle [38], and Albert and Anderson [1], gave conditions for the existence of the maximum likelihood estimate in logistic regression via data geometry. More precisely, they classified the data into the following three categories: the data are
• completely separated if there exists b := (b_0, b_1, . . . , b_p) such that y_i(b_0 + x_{i1}b_1 + · · · + x_{ip}b_p) > 0 for all i;
• quasi-completely separated if there exists a non-zero b such that y_i(b_0 + x_{i1}b_1 + · · · + x_{ip}b_p) ≥ 0 for all i, and equality holds for some i;
• overlapped otherwise.
In [1], it was proved that the maximum likelihood estimate exists in logistic regression if and only if the data points overlap; see also [36] for a generalization. Later, Lesaffre and Kaufmann [29] proposed a necessary and sufficient condition for the existence of the maximum likelihood estimate in probit regression, which coincides with the condition derived in [1] for logistic regression. In fact, their result holds for a general class of generalized linear models. It is easily seen that the logit, probit and cloglog links all satisfy the required log-concavity. Despite this nice characterization, it is not clear how to check these criteria efficiently. See also [13,16,28] for algorithmic aspects of detecting separation and overlap.

Elliptical distributions
Elliptical distributions are natural generalizations of the multivariate normal distribution, obtained from spherically symmetric distributions by affine transformations. In the sequel, S^{p−1} denotes the unit sphere in R^p. The following definition of elliptical distributions is due to Kelker [25], and Cambanis, Huang and Simons [11].

Definition 2.2.
A random vector X := (X_1, . . . , X_p) ∈ R^p is elliptically contoured, or simply elliptical, if

X =_d μ + R A U,

where U is uniformly distributed on the unit sphere S^{r−1}, R ≥ 0 is a random variable with distribution F_R independent of U, and A ∈ R^{p×r} satisfies A A^T = Σ with rank(Σ) = r. We write X ~ E_p(μ, Σ, F_R). If μ = 0, r = p, and A is orthogonal, then X ~ E_p(0, I_p, F_R) is said to be spherically symmetric. For X ~ N(0, I_p), the random variable R is chi-distributed with p degrees of freedom. Below we list a few useful properties of elliptical distributions; see [10,11,19] for further development.
1. The random vector X ~ E_p(μ, Σ, F_R) has finite moments of order k > 0 if and only if ER^k < ∞. If the first two moments exist, then EX = μ and Var X = (ER^2/r) Σ.
2. The distribution of X ~ E_p(μ, Σ, F_R) is absolutely continuous with respect to Lebesgue measure on R^p if and only if r = p = rank(Σ), and the distribution of R is absolutely continuous with respect to Lebesgue measure on R_+.
3. The marginal and conditional distributions of X ~ E_p(μ, Σ, F_R) are also elliptical. For the sake of simplicity, assume that r = p = rank(Σ). Let X := (X_1, X_2), with X_1 ∈ R^{p_1} and X_2 ∈ R^{p_2} (p_1 + p_2 = p), and partition μ := (μ_1, μ_2) and Σ into the corresponding blocks Σ_11, Σ_12, Σ_21, Σ_22. Then the conditional distribution of X_1 given X_2 = x_2 is elliptical with location μ_1 + Σ_12 Σ_22^{−1}(x_2 − μ_2) and scale matrix Σ_11 − Σ_12 Σ_22^{−1} Σ_21, and its generating variate depends on x_2 only through the Mahalanobis distance between x_2 and μ_2 in the metric associated with Σ_22.
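The stochastic representation X =_d μ + R·A U underlying Definition 2.2 is straightforward to simulate. Below is a minimal sketch (the function name and interface are ours): U is drawn uniformly on the sphere by normalizing a Gaussian vector, R is drawn from F_R, and taking R ~ χ_p with A = I_p recovers the standard Gaussian, consistent with property 1 above.

```python
import numpy as np

def sample_elliptical(n, mu, A, sample_R, rng):
    """Draw n samples of X = mu + R * A @ U with U uniform on S^{r-1}."""
    p, r = A.shape
    g = rng.standard_normal((n, r))
    U = g / np.linalg.norm(g, axis=1, keepdims=True)   # uniform on the unit sphere
    R = sample_R(n, rng)                               # radial part, R ~ F_R
    return mu + (R[:, None] * U) @ A.T

rng = np.random.default_rng(0)
p = 3
X = sample_elliptical(10_000, np.zeros(p), np.eye(p),
                      lambda n, rng: np.sqrt(rng.chisquare(p, size=n)), rng)
# sanity check: with R ~ chi_p and A = I_p, X should look like N(0, I_p)
print(np.cov(X.T))  # close to the 3x3 identity
```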

Main result
Before stating the main result, we make a few assumptions on the link function, the covariate distribution, and model parameters. As seen in Section 2.1, the existence of the maximum likelihood estimate of binary response generalized linear models can be translated into data geometry. Thus, we need the assumptions on the link function in Theorem 2.1.

Assumption 2.3 (link function). For
The phase transition for the existence of the maximum likelihood estimate is expected to occur not only with Gaussian covariates, but with a broad range of covariate distributions. Here we consider elliptical distributions. To exclude singularity, we assume that the covariate distribution is absolutely continuous with respect to Lebesgue measure on R p .
To get a meaningful result in diverging dimension, we consider a sequence of problems with the intercept β_0 fixed and Var(x_i^T β) → γ_0^2 as n, p → ∞. This leads to the following assumption on model parameters.
In the remainder of the paper, the superscript (p) emphasizes that the distribution of (Y^(p), X^(p)) or (V^(p), U^(p)) may depend on p. This dependence is easily overlooked in the Gaussian case, since for x_i ~ N(0, I_p) the variable U^(p) is standard normal independently of p. For general elliptical covariates, we make the following technical assumption.
Now we give a sufficient condition for Assumption 2.6 to hold. Let m_{p,k} be the k-th moment of U^(p). If m_{p,k} → m_k as p → ∞ for each k ≥ 1, and

Σ_{k≥1} m_{2k}^{−1/(2k)} = ∞,   (5)

then U^(p) converges in distribution to a limit U whose distribution is entirely characterized by the moments (m_k; k ≥ 1). The condition (5) is referred to as Carleman's condition; see Lin [30, Theorem 1] for a list of equivalent conditions. Denote by f_X the density of X. The main result, Theorem 2.7, establishes the phase transition at the threshold h_MLE(α_0, β_0, γ_0) defined by (8); its proof is deferred to Section 4. The assumption sup_p E[(X^(p))^8] < ∞, which is purely technical, is used to prove a law of large numbers for triangular arrays. As the simulations in Section 3 suggest, the phase transition persists even without this moment condition. The assumption (9) is used to prove that the probability that the data points can be separated via a univariate model is small. A sufficient condition for (9) to hold is that G_{p,∓} + G_{p,±} is bounded away from 1, that is, there exists ε > 0 independent of p such that

G_{p,∓}(x) + G_{p,±}(x) ≤ 1 − ε for all x,   (10)

so that the term E[p_±(X^(p))(G_∓(X^(p)) + G_±(X^(p)))^{n−1}] is exponentially small in n. To illustrate, we check the condition (10) for logistic regression with Gaussian covariates. To simplify the discussion, we take β_0 = 0 and α_0 = γ_0 = 1. In this case, one can show that G_{p,−}(x) + G_{p,+}(x) < 1 for x > 0, and similarly for x < 0. It is also easily checked that all examples in Section 3, except the log-normal distribution, satisfy the sufficient conditions (5)–(10).
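To see how Carleman's condition (5) distinguishes Gaussian-type moment growth from the log-normal, one can compare partial sums of the Carleman series numerically. A sketch with two illustrative scalar distributions (the helper names are ours): for the standard normal the series diverges, while for the log-normal it converges, consistent with the failure of (5) for log-normal covariates noted in Section 3.

```python
import math

def carleman_partial_sum(log_moment, K):
    """Partial sum sum_{k=1}^K m_{2k}^{-1/(2k)}, given j -> log E|U|^j."""
    return sum(math.exp(-log_moment(2 * k) / (2 * k)) for k in range(1, K + 1))

def normal_log_m(j):
    # standard normal: E U^j = (j - 1)!! for even j
    return sum(math.log(i) for i in range(j - 1, 0, -2))

def lognormal_log_m(j):
    # log-normal(0, 1): E U^j = exp(j^2 / 2)
    return j * j / 2.0

# the normal partial sums grow without bound (terms ~ 1/sqrt(k)), while
# the log-normal terms are exp(-k), so the series converges to 1/(e - 1)
print(carleman_partial_sum(normal_log_m, 1000))
print(carleman_partial_sum(lognormal_log_m, 1000))  # ≈ 0.582
```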

Empirical results
In this section we perform experiments to verify the phase transition for the maximum likelihood estimate by (i) computing h_MLE defined by (8); (ii) checking whether the data are separated via the linear program from [12], which seeks (b_0, b) ∈ R^{p+1} satisfying

y_i(b_0 + x_i^T b) ≥ 0,  i = 1, . . . , n.   (11)

Note that the maximum likelihood estimate of the binary response generalized linear model exists if the linear program (11) has only the trivial solution. We compare the theoretical phase transition curve with the empirical observations under several simulation designs described below.
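A possible implementation of this separation check is sketched below (our own helper). Since the feasible set of (11) is a cone, we add the normalization y_i(b_0 + x_i^T b) ≤ 1 to keep the LP bounded, maximize the sum of the margins, and declare separation when the optimum is strictly positive.

```python
import numpy as np
from scipy.optimize import linprog

def separated(X, y, tol=1e-8):
    """Detect complete or quasi-complete separation.

    The MLE fails to exist iff some nonzero (b0, b) satisfies
    y_i (b0 + x_i^T b) >= 0 for all i.  We search for such a direction by
    maximizing sum_i y_i (b0 + x_i^T b) subject to
    0 <= y_i (b0 + x_i^T b) <= 1 (the upper bound removes scale invariance).
    """
    n, p = X.shape
    A = y[:, None] * np.hstack([np.ones((n, 1)), X])  # rows: y_i * (1, x_i^T)
    c = -A.sum(axis=0)                                # linprog minimizes
    res = linprog(c,
                  A_ub=np.vstack([A, -A]),
                  b_ub=np.concatenate([np.ones(n), np.zeros(n)]),
                  bounds=[(None, None)] * (p + 1),
                  method="highs")
    return res.status == 0 and -res.fun > tol
```

When the optimum is zero, only the trivial solution b_0 = 0, b = 0 satisfies (11), and the MLE exists.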
First, we argue that A = I_p suffices to validate our theory.
For the elliptical covariates, we set μ = 0, A = I_p, and consider different distributions for the non-negative variable R, including the chi distribution with p degrees of freedom, Gamma distributions, Pareto distributions, the half-normal distribution and the log-normal distribution. When generating the binary response by (1), we choose the logit, cloglog and probit functions for the link function. We simply take β_0 = 0; see [12] for results with β_0 ≠ 0. To ensure Assumption 2.5, we let ER^2 = p α_0^2 + 1 and β = (W/‖W‖_2 + 1/p) · γ_0/α_0, where W ~ N(0, I_p). We fix n = 1000, α_0 = 1, and vary γ_0 ∈ {0.01, 0.02, . . . , 10.00} and κ = p/n ∈ {0.005, 0.01, . . . , 0.6}. The parameter α_0 is set to 1 since we observed that it does not affect the phase transition much; a large α_0 might slightly shift the phase transition curve (the red curve in Figure 1) to the right and enlarge the uncertainty band (the green bands). Once the data are generated, we solve the problem (11) by checking whether a non-trivial solution exists. We repeat the procedure 100 times, and obtain a heat map indicating the proportion of times the maximum likelihood estimate exists for each pair (γ_0, κ). See Figure 1 for the chi case with p degrees of freedom. Results for other designs can be found in Appendix A.
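One repetition of this design can be sketched as follows (our own helper; for simplicity we drop the 1/p perturbation term in β, so that |β| = γ_0/α_0, and use the logit link):

```python
import numpy as np

def sample_design(n, p, alpha0=1.0, gamma0=1.0, seed=0):
    """One draw of the simulation design: x_i = R_i u_i with u_i uniform on
    the sphere and R ~ chi_p (so E R^2 = p ≈ p*alpha0^2 + 1 for alpha0 = 1),
    beta0 = 0, and |beta| chosen so Var(x_i^T beta) = |beta|^2 E R^2 / p
    equals gamma0^2."""
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((n, p))
    u = g / np.linalg.norm(g, axis=1, keepdims=True)   # uniform on S^{p-1}
    r = np.sqrt(rng.chisquare(df=p, size=n))           # R ~ chi_p
    X = r[:, None] * u                                 # mu = 0, A = I_p
    w = rng.standard_normal(p)
    beta = w / np.linalg.norm(w) * gamma0 / alpha0
    prob = 1.0 / (1.0 + np.exp(-(X @ beta)))           # logit link, beta0 = 0
    y = np.where(rng.random(n) < prob, 1, -1)
    return X, y, beta

X, y, beta = sample_design(n=1000, p=100)
# with R ~ chi_p the covariates are exactly N(0, I_p), so the empirical
# variance of X @ beta should be close to gamma0^2 = 1
print(np.var(X @ beta))
```

Feeding (X, y) to the separation check for problem (11) and sweeping (γ_0, κ) over the grids above reproduces the heat maps of Figure 1.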

Multivariate Gaussian covariates with different link functions
We consider multivariate Gaussian covariates, which were studied for the logit link in [12]. In our setup, R is sampled from a chi distribution with p degrees of freedom and the link function is one of {logit, cloglog, probit}. Figure 1 displays the phase transition for the existence of the maximum likelihood estimate for the different link functions. The green bands in these figures indicate the region where, at the given sample size, the maximum likelihood estimate exists in some repetitions but not in others. This region is referred to as the uncertainty band. Observe that for multivariate Gaussian covariates, as expected, the h_MLE curves lie within the uncertainty bands for all link functions. We also use the same setup to investigate the situation when A ≠ I_p. In this case, we generate A with entries A_ij i.i.d. from the standard Gaussian distribution; to make sure A is of full rank, we let A ← (1/10000)‖A‖_2 · I_p + A. The result is deferred to Figure 3 (middle). We find that the result is very similar to that in Figure 1, which corroborates our argument that it suffices to use A = I_p to validate our theory.

Gamma-distributed R
In [12], U^(p) defined by (4) reduces to the Gaussian distribution, which does not depend on p. This is key to their proof of the phase transition for the existence of the maximum likelihood estimate. However, when we go beyond the chi distribution for R, U^(p) depends on p. We observe that Assumption 2.6 is satisfied for Gamma distributions, and the resulting theoretical phase transition curves agree with the simulations. More precisely, assume R ~ Gamma(k, θ), where k is the shape parameter and θ is the scale parameter. The second moment condition (12) gives θ^2 = ER^2/(k^2 + k). When k = 0.5, we get θ_{0.5}^2 = 4(p + 1)/3, which corresponds to a χ distribution with 2 degrees of freedom if θ_{0.5} is an integer; when k = 1, it is the Exponential distribution with θ_1^2 = (p + 1)/2; when k = 2, it is a Gamma distribution with θ_2^2 = (p + 1)/6. Figure 2 implies that h_MLE defined by (8) converges quickly as p increases. Table 1 indicates that all the theoretical phase transition curves align with the corresponding middle curves of the uncertainty bands.

Table 1: Summary of the theoretical h_MLE and the simulations of the phase transition for the maximum likelihood estimate with the logit link function. h_{0.5} is the κ at which the proportion of times the maximum likelihood estimate exists equals 0.5, and the number inside the bracket is the width of the uncertainty interval (a slice of the uncertainty band) for a given γ_0. MIW is the mean width of the uncertainty intervals across γ_0 ∈ (0, 10]; MD is the mean difference between h_MLE and h_{0.5}.
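The scale calculation can be checked numerically. A sketch (our own helper), matching E R² to p + 1 as in the design with α_0 = 1:

```python
import numpy as np

def gamma_scale(k, target):
    """Scale theta for R ~ Gamma(shape=k, scale=theta) with E R^2 = target.

    E R^2 = Var(R) + (E R)^2 = k*theta^2 + (k*theta)^2 = theta^2 * (k^2 + k),
    hence theta^2 = target / (k^2 + k)."""
    return np.sqrt(target / (k**2 + k))

p = 200
target = p + 1                       # E R^2 = p*alpha0^2 + 1 with alpha0 = 1
rng = np.random.default_rng(1)
for k in (0.5, 1.0, 2.0):
    theta = gamma_scale(k, target)   # theta^2 = 4(p+1)/3, (p+1)/2, (p+1)/6
    r = rng.gamma(shape=k, scale=theta, size=200_000)
    # the empirical second moment should be close to the target p + 1
    print(k, round(theta**2, 2), round(np.mean(r**2), 2))
```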

The moment condition and the tail behavior of R
First we explore a case where the eighth moment of X^(p) does not exist, as required by Theorem 2.7. To this end, we sample R from the Pareto distribution of type I. We specify the shape parameter α, and set the scale parameter via x_m^2 = (α − 2)/α · ER^2. Recall that for the Pareto distribution of type I, the fourth moment exists when α > 4, and the third moment exists when α > 3. From Table 1, we see that the simulation results match the theoretical h_MLE well. This suggests that the moment condition in Theorem 2.7 may be relaxed further.
Subsequently, we study how the tail behavior of the distribution of R influences the phase transition curve h_MLE. The previous empirical studies considered the chi distribution with p degrees of freedom and Gamma distributions, which have sub-exponential tails, and Pareto distributions, which have polynomial tails. We also investigate the half-normal distribution, with a sub-Gaussian tail, and the log-normal distribution, with a heavier tail. To ensure (12), we set the scale parameter σ^2 = ER^2 = p + 1 for the half-normal distribution, and σ^2 = 0.5 · log ER^2 = log √(p + 1) for the log-normal distribution. From Table 1, we observe that the theoretical h_MLE successfully predicts the phase transition in the simulations for the half-normal distribution, but fails for the log-normal distribution. This is because the log-normal distribution is not uniquely characterized by its moments, so the sufficient condition (5) does not hold. See Appendix A for more detailed results.

Roadmap to the proof of Theorem 2.7
Elliptical covariates. Assume that the covariates x_i ~ E_p(μ, Σ, F_R) with r = p = rank(Σ). It is easily seen that x_i = μ + Σ^{1/2} z_i with z_i ~ E_p(0, I_p, F_R). Recall from Section 2.1 that for generalized linear models satisfying Assumption 2.3, the MLE does not exist if and only if the data are completely or quasi-completely separated. We are thus in a situation where the covariates x_i := (x_{i1}, . . . , x_{ip}) are spherically symmetric and Var(x_i^T β) = |β|^2 ER^2/p → γ_0^2. By rotational invariance, we may assume that all the signal is in the first coordinate, so that (Y^(p), X^(p)) ~ F_{α_0,β_0,γ_0}. Now we want to express P(no MLE) via conic geometry. For a fixed subspace W ⊆ R^n, let

C(W) = {w + u : w ∈ W, u ≥ 0}

be the convex cone generated by W. The following proposition is read from [12, Propositions 1 & 2], and we include the proof in Section 4.2 for completeness. Let L := span(X_2, . . . , X_p) and W := span(Y^(p), X^(p)).
Let {No MLE Single} be the event that the data points can be completely or quasi-completely separated by the intercept and the first coordinate only.
By Proposition 4.1, the existence of the MLE boils down to whether L intersects C(W) non-trivially. It remains to prove the following: (1) the probability P(No MLE Single) is relatively small; (2) the probability P(L ∩ C(W) = {0}) exhibits a phase transition through the ratio κ := p/n and h_MLE(α_0, β_0, γ_0) defined by (8).
Separation of data in a univariate model. We aim to prove that the probability P(No MLE Single) is small. In [12], a sketch of proof is given for logistic regression with Gaussian covariates. In Section 4.3, we give a rigorous proof of this result in the setting of Theorem 2.7. The main difficulty comes from the fact that although the probability that the data can be separated at any fixed t_0 ∈ R is exponentially small, there are uncountably many such t_0, and the union bound does not give a good estimate.

Convex geometry and phase transition
We want to prove the phase transition of P(L ∩ C(W) = {0}) through the interplay between κ and h_MLE(α_0, β_0, γ_0). The key is to understand when a random subspace L with uniform orientation intersects C(W) non-trivially.
For any fixed subspace W ⊆ R^n, the approximate kinematic formula [2, Theorem I] shows that for any ε ∈ (0, 1), there exists a_ε > 0 such that

dim(L) + δ(C(W)) ≤ (1 − ε)n  ⟹  P(L ∩ C(W) ≠ {0}) ≤ 4 exp(−a_ε n),
dim(L) + δ(C(W)) ≥ (1 + ε)n  ⟹  P(L ∩ C(W) = {0}) ≤ 4 exp(−a_ε n).   (15)

Here δ(C) is the statistical dimension of the convex cone C, defined by δ(C) := n − E|Z − Π_C(Z)|^2, where Z ~ N(0, I_n) and Π_C is the projection onto C. An identity for δ(C(W)) is given in [12, Lemma 3] (formula (16)). Theorem 2.7 can be derived from the formulas (15)–(16) and the following theorem.
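The definition of δ(C) is easy to probe by Monte Carlo. As a sanity check, take the nonnegative orthant C = R^n_+ as a toy cone (our own choice, not the C(W) above): its projection is Π_C(z) = max(z, 0) coordinatewise, and δ(C) = n/2 exactly, since E|min(Z, 0)|² = n/2.

```python
import numpy as np

# Monte Carlo estimate of delta(C) = n - E|Z - Pi_C(Z)|^2 for Z ~ N(0, I_n)
# and C the nonnegative orthant, whose projection is coordinatewise max(z, 0).
n, trials = 40, 100_000
rng = np.random.default_rng(0)
Z = rng.standard_normal((trials, n))
resid = Z - np.maximum(Z, 0.0)               # Z - Pi_C(Z) = min(Z, 0)
delta_hat = n - np.mean(np.sum(resid**2, axis=1))
print(delta_hat)                             # should be close to n/2 = 20
```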
In [12], the authors proved Theorem 4.3 in the setting of logistic regression by a bare-hands argument. One can adapt their argument to prove Theorem 4.3, with possibly more complications. However, the statement of Theorem 4.3 suggests that it is a form of stochastic approximation. Here we show how this result follows systematically from stochastic approximation, which is of independent interest.
The following result gives the asymptotics of v_n as n → ∞. The proof will be given in Section 4.4.
To prove Theorem 4.3, we need to show that the set of minimizers argmin G n is non-empty and bounded in probability. The argument is similar in spirit to [12], and we give the proof for ease of reference. In Section 4.4, we prove that under the assumptions in Theorem 2.7: • The problem (17) has a unique minimizer λ λ λ 0 .
• Given (Y^(p), X^(p)), the random vector (X_2, . . . , X_p) is also elliptical, with distribution given by (13). By the geometric characterization (15), we get the claim up to an error ε_n → 0.
• The random variable Q_{p,n} is uniformly integrable, since Q_{p,n} ≤ |Z_+|^2/n, which is sub-exponential. This implies that E(Q_{p,n} | X, Y) converges in probability to h_MLE(α_0, β_0, γ_0).

Proof of Proposition 4.1
We aim to prove that there is no MLE if and only if there exists a non-zero vector satisfying the separation condition; this leads to (19). The identity in law (9) then implies that equality occurs with probability 0, which proves (20).

Proof of Proposition 4.2
Let (X_1, . . . , X_n) be i.i.d. samples with density f_{X^(p)}. It is well known that the distribution of the order statistics (X_(1), . . . , X_(n)) is given by n! ∏_{i=1}^n f_{X^(p)}(x_i) for x_1 < · · · < x_n. Note that there exists t ∈ R separating X_(1) < · · · < X_(n) if and only if, for some k ∈ {0, . . . , n}, the responses corresponding to X_(1), . . . , X_(k) are of one sign and those corresponding to X_(k+1), . . . , X_(n) are of the opposite sign. Combining the above identities, and using the condition (9) together with (21), leads to the desired result.

Proof of Theorem 4.3
We start with the proof of Lemma 4.4.
By Lemma 4.4, it suffices to prove that the set of minimizers argmin G_n is non-empty and bounded in probability. We aim to show that, under the assumptions of Theorem 2.7, the problem (17) has a unique minimizer λ_0, and for any minimizer λ_n ∈ argmin G_n, |λ_n − λ_0| = O_P(1).
We prove these statements in the next two lemmas. It is easily seen that G(λ) and G_n(λ) are convex. The following lemma shows that the function G is strongly convex, which was stated in [12] without proof. Here we give a complete proof.
Finally, we prove that the set of minimizers argmin G_n is bounded in probability. The argument can be used to show that |λ_n − λ_0| = O_P(n^{−1/4}), where λ_n is any minimizer of G_n. The proof is adapted from [12, Lemma 4], which we include for completeness.