Hypothesis testing near singularities and boundaries

The likelihood ratio statistic, with its asymptotic $\chi^2$ distribution at regular model points, is often used for hypothesis testing. At model singularities and boundaries, however, the asymptotic distribution may not be $\chi^2$, as highlighted by recent work of Drton. Indeed, poor behavior of a $\chi^2$ for testing near singularities and boundaries is apparent in simulations, and can lead to conservative or anti-conservative tests. Here we develop a new distribution designed for use in hypothesis testing near singularities and boundaries, which asymptotically agrees with that of the likelihood ratio statistic. For two example trinomial models, arising in the context of inference of evolutionary trees, we show the new distributions outperform a $\chi^2$.


Introduction
The likelihood ratio statistic is commonly used to compare a null model to an alternative model. In many circumstances this statistic is asymptotically χ 2distributed, which greatly facilitates testing of large data sets. As is well known, for smaller data sets, or when there are few observations of some outcomes, a χ 2 approximation may not be close enough to the true distribution for reliable testing. Even for large data sets, however, work of Drton [6] has highlighted that problems can arise in using a χ 2 approximation at singularities and boundaries of the null model. The correct asymptotic distribution can be quite different from standard distributions at nearby regular points.
Drton's work shows how one can understand and often calculate an asymptotic distribution at boundaries and singularities, but it does not suggest how to use these distributions in practice. Indeed, this is a difficult question, as the nature of these asymptotic distributions make clear. For instance, one may find that an asymptotic distribution is χ 2 d with a fixed d degrees of freedom at almost all model points, but that at a boundary or singularity it discontinuously jumps to a different distribution -for instance, a mixture of several χ 2 distributions, or something more complicated. However, for the true non-asymptotic distribution, for any fixed sample size no matter how large, we do not expect such a jump to occur.
One might surmise that the asymptotics at the singularity or boundary could be relevant to testing even when the true parameter value is near that point, for fixed sample sizes. As the sample size is increased, the region on which the asymptotics give poor approximations shrinks, but no matter how large a sample is, the discontinuous behavior of the asymptotic distribution indicates there is some parameter region on which it is inappropriate for empirical use. What is needed for a practical test is a different approximation, which is dependent on both sample size and parameter value, but still tractable to evaluate.
In this work we explore this issue of practical testing near a singularity or boundary, using particular examples of hypothesis testing that arise in phylogenomics. Phylogenomics is concerned with inferring evolutionary trees relating several different species from genomic-scale data. It builds on phylogenetics (the inference of trees based on sequences of a single gene), but brings in populationgenetic effects that lead to many inferred gene trees differing from the species (or population) tree. Basics of the underlying multispecies coalescent model are explained below, though little familiarity with it is necessary for this work. It simply provides two motivating examples of nicely structured and accessible submodels of a trinomial (3-category multinomial) distribution, for which we can investigate behavior of tests near singularities and boundaries.
Using both theory and simulations, we investigate distributions relevant to hypothesis testing with the likelihood ratio statistic for these models. We illustrate the problematic behavior of a χ 2 distribution near boundaries and singularities, even when the sample size is large. We define an alternative approximating distribution, and show that it leads to better testing for these models. While applications of the material developed here are highly relevant to phylogenomic practice, we defer discussion for empiricists to a later paper. This paper is organized as follows. In Section 2 we lay out basic definitions, and illustrate with a simple example the problems that might arise when χ 2 distributions are used to approximate the distributions of likelihood ratio statistics near boundaries and singularities of null models. The specifics of the genomic models motivating our primary examples are then introduced.
The main theorem is given in Section 3, where an approximating distribution is defined for use in hypothesis testing. In Sections 4 and 5, we specialize to our examples, giving explicit forms of the finite sample approximating distributions. By simulation we show that using the standard χ 2 1 for hypothesis testing gives poor performance near a boundary or singularity; in contrast, the finite sample distributions we define perform very well for true parameters anywhere in the null model.
In Section 6, we use variation distances between the competing distributions (χ 2 1 and ours) to investigate the region of the null model where the standard χ 2 1 is good for testing, since this depends both on sample size and proximity to a singularity or boundary point. The final section is a discussion of our work and its potential for application beyond the examples developed here.

Definitions and examples
Let Θ, an open subset of R k , denote the parameter space for a family of probability distributions, and θ ∈ Θ an unknown parameter vector. Submodels are specified by Θ 0 ⊂ Θ ⊆ Θ, and we formulate the null hypothesis H 0 : θ ∈ Θ 0 , with alternative H 1 : θ ∈ Θ 1 = Θ Θ 0 . Given some data set, the likelihood ratio statistic is where (θ) = (θ | data) is the log-likelihood function. Then Λ ≥ 0, and a large value of Λ indicates a substantially larger likelihood that θ ∈ Θ 1 than θ ∈ Θ 0 , taken as evidence to reject H 0 . By determining the distribution of Λ under H 0 , the decision as to how large Λ must be for rejection can be quantified.
While it is commonly assumed that the distribution of the likelihood ratio statistic under H 0 is well approximated by a χ 2 distribution, establishing this depends on a number of assumptions. Wilks [21] provided an early justification for sufficiently regular models defined by hyperplanes. Chernoff [2] extended the result to more general models, elucidating the role of the tangent space to the model, and making clear that asymptotic distributions other than χ 2 can arise. Other works emphasize that the statistic may not be asymptotically χ 2 -distributed at boundaries of Θ 0 (e.g., [11], [17] and [18]).
Recent research of Drton [6] has emphasized that singularities pose problems as well. An asymptotic distribution of the statistic can be obtained at these problematic model points, as the distribution of the squared Euclidean distance between a standard normal sample and the appropriately linearly-transformed tangent cone of Θ 0 at the true parameter point θ 0 (Theorem 2.6 of [6]). Informally, the tangent cone is the set of all possible tangent vectors when approaching θ 0 along all possible paths in Θ 0 . The tangent cone generalizes the tangent space which lead to the more familiar χ 2 distributions, but may lack the closure properties of a vector space that holds at smooth points of Θ 0 .
To precisely define singularities and boundaries, we follow [6]. Assume Θ 0 is a semialgebraic subset of Θ. That is, Θ 0 is defined by a finite Boolean combination of polynomial equalities and inequalities, which ensures Chernoff regularity. The Zariski closure, Θ 0 , of Θ 0 is the smallest algebraic variety (the zero set of a finite set of polynomials) containing Θ 0 . This closure is the union of at most finitely many irreducible varieties, called components, which themselves cannot be expressed as a finite union of proper varieties.
A singularity of Θ 0 is then either a) a point in Θ 0 which lies on more than one irreducible component of Θ 0 , or b) a point that lies on only one component, but at which the n × m Jacobian matrix of the defining equations of that component has lower rank than typical. When a point lies on a single irreducible component Θ i 0 , the rank of the Jacobian is generically m − dim Θ i 0 . Lower rank indicates a problem with the notion of dimension at the point.  Figure 1: The parameter space for a possibly biased coin. The solid segment is Θ 0 = 1 2 , 1 , while the dashed segment is Θ 1 = 0, 1 2 .
A subset of Θ 0 is said to be open if it is the intersection of Θ 0 with an open subset of R k . The interior of Θ 0 is the union of its open subsets, and the boundary of Θ 0 is the complement in Θ 0 of its interior. (In most applications, including all our example models, Θ 0 is closed in Θ under the standard topology, and this coincides with the usual definition of topological boundary in Θ.) Note that the boundary and the set of singularities of a model need not be disjoint.
Example 2.1 (Simple model with boundary). To test whether a coin, modeled by a Bernoulli random variable with probability of heads θ ∈ (0, 1), is biased towards tails, formulate hypotheses Here Θ = (0, 1) = ∆ 1 , the open simplex, and Θ 0 = 1 2 , 1 , as depicted in Figure 1. The Zariski closure of Θ 0 is the real line, and Θ 0 has no singularities but a single boundary point 1 2 . At any θ 0 in the interior of Θ 0 the tangent cone is the full real line, (−∞, ∞). However, for θ 0 = 1 2 the tangent cone is the half-line 1 2 , ∞ . From Theorem 2.6 of Drton [6], the asymptotic distribution of the likelihood ratio statistic is the distribution of the squared Euclidean distance between a normal random variable centered at θ 0 with variance 1 and the tangent cone at θ 0 . For θ 0 > 1 2 , the squared Euclidean distance is 0 with probability 1 asymptotically, so the asymptotic distribution is a Dirac delta function δ 0 . However, for θ 0 = 1 2 the asymptotic distribution is a mixture 1 2 δ 0 + 1 2 χ 2 1 . Intuitively this is because samples from N 1 2 , 1 lie on or off the tangent cone 1 2 , ∞ with probability 1 2 , and the distributions of the squared distances are δ 0 and χ 2 1 respectively. For this model, the maximum likelihood estimator (MLE)θ 0 of the parameter θ 0 is the maximum of 1 2 and the relative frequency of heads in a sample. If θ 0 lies in the interior of Θ 0 , then for a sufficiently large sampleθ 0 lies in the interior with probability arbitrarily close to 1. However, for a fixed sample size, no matter how large, there are points θ 0 close to 1 2 but still in the interior of Θ 0 for which this probability is much smaller (in fact, as close to 1 2 as desired). A better approximation to the distribution of the likelihood ratio statistic at such a point might be, for instance, a mixture of δ 0 and the square of a truncated normal centered at θ 0 with variance dependent on sample size. The mixing parameters depend on both θ 0 and the variance, while the truncation point of 1 2 is not generally the mean of the normal. When θ 0 = 1 2 , the normal distribution is truncated at the mean giving the asymptotic mixture distribution already described.
Of course for this model one can simply perform an exact binomial test, without any approximation. Nonetheless, this example highlights 1) that a likelihood ratio statistic's distribution can fail to converge uniformly to a χ 2 distribution even on the interior of Θ 0 , 2) the role of the tangent cone in determining correct asymptotics, and 3) the inappropriateness of these asymptotic approximations for hypothesis testing for certain parameter values.
The next examples are the primary focus of our investigations. We briefly describe their motivation from phylogenomics, with more details supplied in Appendix A. The knowledge that these are submodels of a trinomial model is sufficient for the remainder of this work.   Figure 2, where the internal branch has length t ≥ 0. Gene trees depicting evolutionary relationships for particular gene lineages (A, B, C) sampled from the three species may show differing topological relationships due to the population genetic effect of incomplete lineage sorting, illustrated in Figure 2. Under the multispecies coalescent model (see Appendix A), the three possible rooted gene tree topologies have probabilities with C|AB denoting the rooted topological gene tree matching the species tree topology with gene lineages A and B most closely related, and B|AC and A|BC, interpreted analogously, gene tree topologies that do not match that of the species tree. For a null hypothesis H 0 that the rooted topology of the species tree is c|ab, then is shown in Figure 3a. Here ∆ 2 denotes the open 2-dimensional probability simplex. The alternative hypothesis, Θ 1 = ∆ 2 Θ 0 , can be interpreted either that the species tree has a different tree structure b|ca or a|cb, or that the model of a simple species tree under the multispecies coalescent is inadequate, perhaps due to introgression or hybridization of species populations, population structure within species, or other more complex biological issues.
Samples of n rooted gene trees drawn independently from the multispecies coalescent model on the species tree of Figure 2 are thus described by a submodel of the trinomial model with parameter space Θ 0 . Example 2.3 (Model T3: Three species related by any of the three possible species trees). If the model T1 of Example 2.2 is modified, so that the specific species tree structure is not fixed, but any one of a|bc, b|ac, or c|ab might be the species tree, then H 0 is that there is some species tree giving rise to the gene tree data. The alternative H 1 is that a simple species tree model does not fit the data. The null parameter space Θ 0 ⊂ ∆ 2 , shown in Figure 3b, is the union of three submodels of trinomial models.
As seen in Figure 3a, the model T1 has a boundary point at 1 3 , 1 3 , 1 3 ∈ Θ 0 , and no singularities. For model T3, the point 1 3 , 1 3 , 1 3 is a singularity of Θ 0 , since the Zariski closure of Θ 0 is three lines (irreducible components) crossing at that point. This point is also a boundary, though we will refer to it simply as the singularity.  When a rooted species tree on three species has a short internal branch so that much incomplete lineage sorting occurs, the expected gene tree probabilities lie close to the boundary or singularity 1 3 , 1 3 , 1 3 of the models. This is exactly the situation in which it is hardest to resolve species tree relationships, and therefore often one of pressing biological interest. Indeed, motivation for this paper is the recognition that the use of the standard asymptotic approximation is not reliable near boundaries and singularities, and a careful investigation of this problem is of practical as well as theoretical interest. The models T 1 and T 3 , and the more general multispecies coalescent model for larger trees, are increasingly used in inference of species trees from genomic-scale data, though typically little is done to test whether the model is appropriate for data. For relating three species, Degnan and Rosenberg [5] describe a hypothesis test using a χ 2 distribution, though our work here underscores that this test can be problematic near singularities and boundaries. Results of Allman, Degnan and Rhodes [1] show that this test extends to the unrooted 4-species trees this paper focuses on, though the same boundary and singularity issues arise in using the χ 2 . Gaither and Kubatko [8] introduce a different hypothesis test for 4-species trees, but in a different framework, working from DNA sequence data under a combined model of coalescence with sequence evolution, and not on gene tree frequencies. Most empirical studies simply assume the coalescent model on a species tree is appropriate, even though several biological processes are known which could violate it.

Approximate distributions of likelihood ratio statistics
We now illustrate that, in principle, one can obtain an alternative, potentially more useful, approximation to the distribution of the likelihood ratio statistic than the asymptotic one. For a statistical model with parameter spaces Θ 0 ⊂ Θ ⊆ Θ, Θ 1 = Θ Θ 0 , and parameter θ 0 ∈ Θ 0 , let X (1) , . . . , X (n) denote n independent and identically distributed random observations. The likelihood function for a sample realization Maximizers of the likelihood over Θ 0 and Θ are the maximum likelihood estimators (MLEs) over the corresponding parameter spaces.
The likelihood ratio statistic for a sample then is Under appropriate regularity conditions (see Theorem 16.7 of Van der Vaart [19]) the asymptotic distribution of this statistic is that of for X ∼ N (0, I), I (θ 0 ) the Fisher information matrix at θ 0 , T 0 and T the tangent cones to Θ 0 and Θ at θ 0 , and where x − B denotes the minimal Euclidean distance between a point x and set B. In essence, establishing this theorem using local asymptotic normality depends on two approximations: the likelihood ratio process from sample realizations is approximately normal, and the model parameter space is approximated locally by its tangent cone.
Of these two approximations, it is that of the tangent cone which leads to the discontinuous behavior of the asymptotic distribution, since the tangent cone's features behave discontinuously as a function of the parameter. For example, if a model is parameterized by a closed ball in R k , at interior points the tangent space will be a k-dimensional Euclidean space, while at the boundary it becomes a half-space. For a model with parameter space a curve in the plane that crosses over itself, the tangent space will be a line at most points, but at the singularity it is two crossed lines.
Examining a derivation of the asymptotics of the likelihood ratio statistic more closely, local asymptotic normality allows for the approximation by a normal for large samples. For large samples the distribution's covariance approaches 0, and rescaling to a standard normal means the parameter space must be dilated around the true parameter. It is this dilation that allows the parameter space of the model to be approximated by a tangent cone. Thus these two approximations are interrelated, and are not made independently.
Nonetheless, we informally reason that while the normal approximation may be a good one even for a relatively small sample size, a much larger sample may be needed for the approximating normal to be sufficiently concentrated that the tangent approximation of the model is accurate. This motivates Theorem 3.1 below.
For parameter spaces Θ 0 ⊂ Θ ⊆ R k and parameter value θ 0 ∈ Θ 0 , define sequences of scaled translated parameter spaces T n = √ n Θ − θ 0 and T n,0 = √ n (Θ 0 − θ 0 ). Suppose T n → T and T n,0 → T 0 in the sense defined in [19]. As pointed out by [6], a condition such as Chernoff regularity ensures this convergence of spaces, with T and T 0 the tangent spaces at θ 0 of Θ and Θ 0 .
Then under the regularity assumptions of Proposition 16.7 of [19], for a sample of size n the likelihood ratio statistic Λ n for H 0 vs. H 1 is approximately distributed as the random variable in the sense that both the likelihood ratio statistic and this random variable converge in distribution to the same limit as n → ∞.
Proof. By Theorem 16.7 of [19], the likelihood ratio statistic converges in distribution to Since T n → T and T n,0 → T 0 , applying Lemma 7.13 of [19] yields the result.
Note that the condition that the sample is i.i.d. is not necessary in the theorem; a more general result is possible if √ nI (θ 0 ) 1 2 is replaced with the square root of the Fisher information matrix for a sample of size n.
Moreover, this theorem offers no measure of accuracy of the approximation for any finite sample size, and thus does not indicate whether it gives a better approximation than the asymptotic one in practice. This is typical of results on approximate distributions of test statistics. To highlight the theorem's potential for improved testing, in subsequent sections we present simulation results indicating that this distribution outperforms the asymptotic one in our example models T1 and T3.
Though the above theorem is stated for the likelihood ratio statistic, this is but one member of the power-divergence family of goodness-of-fit statistics of Cressie and Read [3]. For multinomially distributed data, with appropriate assumptions on the null model, all members of the family converge in distribution to the same asymptotic distribution. Thus the theorems and results in this paper are potentially useful for all members of the family. Although the Neyman-Pearson lemma (Neyman and Pearson [12]) states that the likelihood ratio test is the uniformly most powerful test for simple hypotheses, Cressie and Read [4] highlighted that in other scenarios other family members, such as Pearson's chi-squared statistic, may be better approximated by a χ 2 distribution than the likelihood ratio statistic is. It is of interest to investigate the use of the distribution of Theorem 3.1 for these other statistics.
In practice, θ 0 and I (θ 0 ) are estimated using the MLE θ 0 . Florescu [7] states that a regular exponential family has consistent MLEs, and members of a regular exponential family satisfy the regularity conditions of Drton [6]. However, the approximate distribution of the likelihood ratio statistic in Theorem 3.1 may not be accurate for small sample sizes and a consistent θ 0 may still be biased for finite samples, in which cases attempts should be made to correct the bias.
We emphasize that Theorem 3.1 is focused on obtaining a useful approximate distribution near singularities and boundaries of Θ 0 within the open parameter space Θ ⊆ R k . Close to the topological boundary of Θ in R k , both the approximate distribution of Theorem 3.1 and the χ 2 may perform poorly for tests, even with a large sample. This occurs for the models T1 and T3, where Θ = ∆ 2 , if the true parameter is near the vertices of the triangle bounding the simplex. Then frequencies of two tree topologies may be very low, and the normal approximation is poor. When this occurs, other methods such as exact tests or parametric bootstrapping may be used instead.

Application to Model T3
We now apply Theorem 3.1 to determine an approximate distribution for the likelihood ratio statistic when testing the model T3 vs. an alternative of "nospecies-tree". More formally, for t (i) the branch length in species tree i ∈ {1, 2, 3} and taking φ , the hypotheses are: We view the model Θ = Θ 0 ∪Θ 1 = ∆ 2 as a subset of R 2 through an appropriate affine transformation (see Appendix B for full details) which maps the singularity of Θ 0 to the origin and the true parameter point θ 0 = 1 − 2 3 φ 0 , 1 3 φ 0 , 1 3 φ 0 , without loss of generality, to a point (0, µ 0 ) as in Figure 4. This affine transformation scales the simplex so that the normally distributed variable Y of Theorem 3.1 now has mean (0, µ 0 ) and identity covariance, where µ 0 is measured in standard deviations from the singularity and can be interpreted analagously for model T1. Unless θ 0 = 1 3 , 1 3 , 1 3 the affine transformation does not preserves angles. For other parameter values θ 0 , the angle α 0 shown in Figure 4 is less than π 6 . We make one additional simplification, valid under the assumption that θ 0 is far from the triangle bounding the simplex Θ, in a sense dependent on the sample size: the mass of the normal distribution of Y outside the image of Θ is negligible. This leads to the following proposition which is proved in Appendix B.
and The singularity is mapped to the origin (0, 0) and the true parameter point θ 0 to (0, µ 0 ). The mapping is not conformal unless θ 0 is the singularity.
Note that all the trigonometric functions in Equation (1) can be expressed as algebraic functions of φ 0 .
To understand Equation (1), note that Z andZ are random variables corresponding to the x and y components of the sample point in the transformed space. The first argument then is simply the squared distance of Z,Z to the vertical half-line in the null parameter space. The second argument is the squared distance to the other two half-lines, provided the closest point is not the origin. Z,Z will be closest to the vertical half-line when the closest point on the other two half-lines is the origin. As shown in the proof, the distance predicted by the first argument of Equation (1) is then minimal. Thus, Equation (1) is the minimum squared Euclidean distance between the sample point and the transformed null parameter space.
By replacing sgn (Z) and sgn Z with ±1, the arguments are easily recognizable as χ 2 distributions. Moreover, suppose µ 0 > 0 corresponds to any non-singular point in Θ 0 , then as the sample size n goes to infinity, µ 0 also goes to infinity, causing the distribution of sgn Z to concentrate on 1, and the minimum in the formula tends toward selecting the first argument. It follows that Λ n is asymptotically χ 2 1 -distributed as is the likelihood ratio statistic Λ n , though for Λ n the asymptotic behavior is typically determined more directly using the tangent cone approximation. Now suppose µ 0 = 0, so φ 0 = 1; that is, the true parameter is the singularity. Then for any sample size n the approximate distribution in Equation (1) simplifies, with both Z andZ standard normal. Although this distribution is not a χ 2 , it is exactly the asymptotic distribution, found using the tangent cone as in [6]. This is not surprising, as the tangent cone at this point locally agrees with the model itself.
Additional computations in Appendix B give the following.
Proposition 4.2. The probability density function for the random variable Λ n given for model T3 in Proposition 4.1 is, for λ > 0, One can show that for φ 0 ∈ (0, 1) as n → ∞ Equation (2) gives the probability density function of χ 2 1 . Although Proposition 4.2 expresses the probability density function in terms of the error function, this density can quickly be integrated numerically to obtain a highly accurate approximation. Figure 5 compares the density functions of Equation (2) at the singularity µ 0 = 0 (φ 0 = 1) and a regular point near the singularity µ 0 = 1 (φ 0 ≈ 0.9993 when n = 10 6 ) to that of χ 2 1 . At the singularity, the asymptotic density is given exactly by Equation (2), since there is no dependence on n. At all other points µ 0 > 0, the asymptotic density is given by χ 2 1 . The density plot for the parameter near the singularity, at µ 0 = 1, lies between the other two plots, and can be considered a sort of interpolant that depends both on the sample size n and value of the parameter φ 0 . Unlike the asymptotic densities, which have a jump discontinuity at the singularity, the density of Equation (2) is a continuous function of φ 0 ∈ [0, 1) for any fixed n.

Simulations
We performed simulations to compare the use of the probability density function of Equation (2) to the χ 2 1 density for determining p-values of the likelihood ratio statistic when testing H 0 vs. H 1 . We focused on true parameter values both at (µ 0 = 0) and near the singularity (µ 0 = 1, n varying). Near the singularity both distributions agree asymptotically, but at the singularity the χ 2 1 distribution is not the asymptotic distribution, while that of Equation (2) is. As the χ 2 1 distribution might naively be applied by an empiricist at the singularity, this last comparison is relevant. The value µ 0 = 1 was chosen to be near enough, but not too near, to the singularity so that the χ  (2) at the singularity µ 0 = 0 (φ 0 = 1) is in black; the approximating density at the nearby parameter value µ 0 = 1 (φ 0 ≈ 0.9993 and n = 10 6 ) is in blue; and the asymptotic density at non-singular points, the χ 2 1 distribution, is in red. The blue approximating density can be viewed as an interpolant of the two asymptotic densities at and near the singularity.
the asymptotic distribution at the singularity were both poor approximations. A range of sample sizes was chosen, in part to demonstrate that near the singularity the χ 2 1 distribution can perform relatively poorly even for a large sample size, despite its being the asymptotic distribution.
Specifically, for the simulations presented in Figures 6, 7, (and later in Figures  9 and 10), θ 0 ∈ Θ 0 was chosen making µ 0 = 0 or 1 for sample sizes n = 30, 10 3 , 10 6 . For each setting, µ 0 , n, data was simulated from the multinomial distribution 10 6 times, and likelihood ratio statistics were calculated for each replicate. The probability density functions of Proposition 4.2 were used to determine p-values by numerical integration from the observed value of the statistic to infinity; p-values were also calculated using the χ 2 1 approximation by standard software. For each setting an empirical cumulative distribution function for 10 6 p-values was graphed. In Figures 6 and 7, the discrete nature of the multinomial distribution is strongly apparent, particularly for n = 30. Since the possible likelihood ratio statistics form a discrete set and are unevenly spaced, jumps in the cumulative plots of p-values are unavoidable regardless of the simulation size.
Ideally, when Θ 0 has lower dimension than Θ (unlike Example 2.1) as for model T3, an approximate density function for the likelihood ratio statistic produces a simulated empirical cumulative distribution function of p-values close to F X (x) = x for x ∈ (0, 1). The left column of Figure 6 shows that this holds for the density function of Equation (2) for the singularity, even for a relatively small sample size of n = 30. In contrast, this fails for the χ 2 1 distribution (which is not the asymptotic distribution), as seen in the right column.
In Figure 7, the results of these simulations are shown for the parameter near the singularity. Again, plots in the left column show that the density function of Equation (2) performs extremely well, even for a sample size of n = 30. The right column illustrates that the χ 2 1 distribution is a poor approximation for each of the three sample sizes, even though it is the asymptotic distribution. As an approximate density, the χ 2 1 performs better here than at the singularity where it is not the asymptotic distribution, but not as well as the approximating density of Λ n . In summary, naively assuming the χ 2 1 distribution is an accurate approximation for the likelihood ratio statistic near (or at) a singularity can lead to inaccurate estimates of p-values.
Significantly, the right columns of Figures 6 and 7 suggest that the use of the χ 2 1 distribution gives a conservative test, as it produces larger p-values than desired, leading to rejecting H 0 less often than desired. Moreover, such a test is increasingly conservative closer to the singularity. This behavior has an intuitive geometric interpretation: When θ 0 is on the vertical line segment of Θ 0 and near, but not at, the singularity, then the observation can be substantially closer to an incorrect segment of Θ 0 than to the correct segment. The observation is then interpreted to be less extreme than it should be. Use of the χ 2 1 distribution then gives a larger p-value than desired.

Application to Model T1
We now examine our second example, model T1, in which the null hypothesis is that the species tree has a specific topology.
Our two hypotheses for this test are: The model Θ = Θ 0 ∪ Θ 1 is again the open probability simplex ∆ 2 , which is viewed as a subset of R 2 through the same affine transformation used for model T3. This is as depicted in Figure 4, but with the two non-vertical line segments erased. Applying Theorem 3.1, an approximate distribution of the likelihood ratio statistic can be found. The proof of the following is given in Appendix C.
Note that the distribution is the same as the first argument of the minimum in the distribution in Proposition 4.1 for model T3. This is expected as the first argument referred to the single line segment which is Θ 0 in this example.
Again, if sgn Z was always positive then the distribution would be a χ 2 1 distribution, while if sgn Z was always negative then it would be a χ 2 2 distribution. Further calculations in Appendix C yield the following.
At the singularity, where µ 0 = 0, Equation (3) gives the probability density function of 1 2 χ 2 1 + 1 2 χ 2 2 . This is as one expects from Example 1.2 of Drton [6]. One can also show that for φ 0 ∈ (0, 1) as n → ∞ Equation (3) gives the probability density function of χ 2 1 , since M 0 (x) → 0 as x → ∞. Again the approximate probability density function can be integrated numerically quickly to obtain a highly accurate numerical approximation. Figure 8 gives a graphical comparison of the probability density functions of Equation (3) at µ 0 = 1 (φ 0 ≈ 0.9993 and n = 10 6 ) and at µ 0 = 0 (the probability density function 1 2 χ 2 1 + 1 2 χ 2 2 at the boundary) to that of χ 2 1 . The black and red densities are the asymptotic densities at and near the boundary, respectively. The graph for a parameter near the boundary (µ 0 = 1) lies between those for the asymptotic distributions, interpolating them in a way dependent on both sample size n and parameter φ 0 . Unlike the asymptotic distributions, which jump discontinuously at the singularity, the density of Equation (3) is a continuous function of φ 0 .
Note that the χ 2 1 density (red curve) is closer to the approximate density (blue curve) in Figure 8 than in Figure 5, indicating it is closer to our distribution for T1 than for T3. This is not surprising, since the derivation of the asymptotic χ 2 1 is based on replacing the model with a single vertical line, which more closely matches the geometry of the model T1 than T3.

Simulations
The performance of the approximate density function of Proposition 5.2 was compared to the density function of the χ 2 1 distribution through simulations for model T1, similar to those previously described for T3.
In Figure 9, it can be seen that at the boundary our approximate density function outperforms the χ 2 1 approximation, which is biased towards smaller pvalues. This is expected, since the distribution in Proposition 5.1 is the asymptotic distribution and χ 2 1 is not. We note that the χ 2 1 approximation rejects H 0 more often than it should and thus gives an anti-conservative test.
Near the boundary, as shown in Figure 10, our probability density function again fits the distribution of the likelihood ratio statistic better than the χ 2  does, though the improvement is minimal compared to that in Figure 9 for the boundary. This is expected as the χ 2 1 is now the asymptotic distribution. Moving away from the boundary (simulations not shown), the χ 2 1 distribution becomes a progressively better approximation, but remains biased towards smaller p-values. Thus the use of the χ 2 1 approximation leads to rejection of H 0 more often than it should, and is anti-conservative. Again, the χ 2 1 performs better for model T1 than for model T3 for some µ 0 .
The anti-conservative behavior of the χ 2 1 distribution is geometrically intuitive. For a true parameter θ 0 near the boundary point of Θ 0 , some sample points will lie lower than the boundary, giving an MLE that is the boundary point. Such sample points are thus further from the MLE than they are from the vertical line extending Θ 0 . However, the χ 2 1 distribution is appropriate for judging their squared distance from that line. This causes them to be viewed as more extreme than they should be, and their p-values to be calculated as smaller than desired.

Approximating likelihood ratio statistic distributions with χ 2
The distributions of Propositions 4.1 and 5.1 interpolate between the asymptotic distribution at the singularity or boundary, respectively, and the asymptotic χ 2 1 distribution far from the singularity or boundary. The further the true parameter point is from the singularity or boundary, the more accurate the χ 2 1 approximation is.
Indeed, while we have shown these approximate distributions for likelihood ratio statistics perform better than the asymptotic ones for finite sample sizes near the singularities and boundaries of our example models, it may still be desirable to use the asymptotic χ 2 1 distribution for testing sufficiently far from those points. The simpler form of these distributions and ready availability in standard software remains attractive. A natural problem, then, is how to decide when the simpler distribution is likely to lead to adequate performance in testing.
To approach this question quantitatively, we employ the total variation distance between our approximate distributions and the χ 2 1 . The total variation distance between two continuous probability distributions F , G, with densities f , g, of support R, is which can be interpreted as the maximum absolute difference of probabilities of events.
Using the distribution of Proposition 4.1 or Proposition 5.1, one can choose an acceptable upper bound on the total variation distance between this distribution and the χ 2 1 . Then, using a numerical optimization routine, one can determine the values of φ 0 , n for which this bound is not exceeded. The χ 2 1 approximation might be considered acceptable for such φ 0 and n.

Application to Model T1
For model T1, the dependence of the distribution from Proposition 5.1 on φ 0 and n is only through µ 0 = µ 0 (φ 0 , n), so let F µ0 denote this distribution viewed as a function of µ 0 . From the derivation of the density in Appendix C, it is clear that δ F µ0 , χ 2 1 is a decreasing function of µ 0 . It is thus sufficient to determine numerically the valueμ 0 for which δ Fμ 0 , χ 2 1 = . Then µ 0 >μ 0 characterizes the parameters and sample sizes for which the χ 2 1 approximation might be considered acceptable. Table 1 summarizes, for several choices of , the threshold valueμ 0 . It also shows for several choices of sample size n, the corresponding thresholds φ 0 <φ 0 and t >t = − log φ 0 , since µ 0 is a function of n and φ 0 . For a given bound , larger sample sizes allow for shorter internal branches of the tree in Figure 2, while maintaining the χ 2 1 distribution as a reasonable approximation for the distribution of the likelihood ratio statistic. Table 1 For model T1, the threshold valuesμ 0 are given for which µ 0 >μ 0 ensures the total variation distance between the distribution of Proposition 5.1 and χ 2 1 is less than , for various . For a fixed sample size n, the thresholds are also given in terms ofφ 0 ort.

Application to Model T3
For model T3, the dependence of the density of Proposition 4.2 on parameter φ 0 and sample size n is through both µ 0 and α 0 . However, it is clear from the derivation in Appendix B that an upper bound on the variation distance is obtained by setting α 0 to its minimum value, α 0 = arctan 1 3 , for any value of µ 0 . This simplifies the computations and leads to a conservative estimate of the thresholdμ 0 . Table 2 summarizes thresholds found in this way.

Discussion
As the examples of models T1 and T3 illustrate, not only should we expect non-standard asymptotic distributions for hypothesis testing at singularities and boundaries of models, but that even near such points the standard χ 2 asymptotic Table 2 For model T3, conservative threshold valuesμ 0 are given for which µ 0 >μ 0 ensures the total variation distance between the distribution of Proposition 4.1 and χ 2 1 is less than , for various . For a fixed sample size n, the thresholds are also given in terms ofφ 0 ort. distributions may behave poorly when testing. Although increasing sample size may lead to better performance at any specific point, the discontinuous behavior of the asymptotic distribution means a region of poor performance can remain, though it shrinks in size. While Drton [6] commented that convergence to the asymptotics can be slow near a boundary or singularity, we further emphasize that the nonuniformity of the rate of convergence poses even more of a problem. Unless we have an a priori quantitative bound separating the true parameter from the singularities and boundaries, no finite sample size can be found which will lead to uniformly good performance of the standard asymptotic approximation. Moreover, depending on the model, use of the χ 2 asymptotic approximation may lead to either conservative or anti-conservative tests (or both, in different regions), depending on the geometry of the model beyond the singularity or boundary. Thus no simple rule can be adopted for adjusting one's test. Theorem 3.1 suggests one alternative approach, of avoiding the approximation of the model by its tangent cone inherent in the derivation of the asymptotic distribution, and using a different approximate distribution dependent on both the true parameter and the sample size. For our example models this performed well, as illustrated by our simulations.
Even for our models, there are a number of hypothesis tests not presented here for which Theorem 3.1 will be useful. For instance, one may wish to test whether data fits a null hypothesis of a particular tree, model T1, vs. an alternative of the other trees, model T3 T1. Failure to reject the null hypothesis for each of the three choices of T1 would, in biological terminology, be interpreted as a soft polytomy, where an unresolved (star) tree represents ignorance of the true resolution. Similarly, one may wish to test whether data fits a simple hypothesis of an unresolved tree, θ 0 = 1 3 , 1 3 , 1 3 , vs. an alternative of a resolved tree, model T 3 {θ 0 }. For this test failure to reject the null hypothesis would, in biological terminology, be interpreted as a hard polytomy, where an unresolved tree represents what are believed to be true relationships.
Within phylogenetics, another possible use of Theorem 3.1 is for conducting hypothesis tests for distance data to fit a tree. An evolutionary distance d (a, b) is typically a numerical measure of the amount of mutation between two species a and b, and under certain modeling assumptions should in expectation match the sum of lengths of branches between them on a tree. The 3-point condition states that for an ultrametric tree to exist relating species a, b, c, the expectations of d (a, b), d (a, c), and d (b, c) must have the two largest equal, with the smallest pair indicating the correct tree topology. This is similar to models T1 and T3, with the inequality reversed, except that the distances may have any non-negative values. Again the model has a singularity or boundary.
Several works [9,10] have proposed statistical tests involving distances. For instance, Gu and Li [9] tested the 3-point condition by focusing on the difference of the two distances that are assumed to be equal under H 0 . Arguing that this difference is asymptotically normally distributed, a Z-test is performed. However, when all three distances are near equal, as they would be near the singularity or boundary point corresponding to a star tree, this test becomes inaccurate, as the smallest value may well not correspond to the true topology. Just as with models T1 and T3, the test could either be anti-conservative or conservative, depending on whether the null hypothesis was of a specific 3-species ultrametric tree or of any of the three possible trees, respectively.
Our example models have rather special structure making them amenable to our approach. Since Θ 0 was locally linear, except at the singularities and boundaries, we were able to compute explicit density functions for the relevant distributions, so that using them was no more difficult than using a χ 2 . Our examples also had the interesting feature that in the biological application one is often most interested in data around the singularity or boundary, and so effective hypothesis testing in that region is of special concern. Although we do not believe similar calculations of our approximate finite sample distribution will be tractable for all models, there are likely to be some where this approach will prove useful. For models that are not amenable to such calculations, special attention still needs to be paid near singularities and boundaries, perhaps through the use of parametric bootstrapping to obtain approximations of the distribution.
With a broader perspective, Theorem 3.1 suggests that whenever the asymptotic distribution performs badly for hypothesis testing, one might do better by using a distribution taking the local geometry of the model into account in a more subtle way than just through the tangent cone. For instance, if a model were described by a curve in the plane, one should expect that even at regular points the asymptotic distribution may be less useful in regions of high curvature, where the tangent cone approximation of the model is poor. However, unlike in the case of singularities or boundaries one should be able to work out a sample size ensuring a reasonable fit by a χ 2 , as long as the curvature is bounded. If obtaining a data set of that size is not possible, then even if the distribution of Theorem 3.1 cannot be computed, first approximating the model by a simpler curve with similar curvature, such as an appropriate polynomial, and then using the theorem might lead to a better distribution for hypothesis testing. Failing that, parametric bootstrapping again remains an option.

Acknowledgements
This work is supported by the US National Institutes of Health grant R01 GM117590, awarded under the Joint DMS/NIGMS Initiative to Support Research at the Interface of the Biological and Mathematical Sciences.

Appendix A: The multispecies coalescent model
We briefly introduce the multispecies coalescent model, which underlies models T1 and T3 of Examples 2.2 and 2.3. This model, introduced by Pamilo and Nei [14] (see also [15]), extends the Kingman coalescent model of population genetics, from applying to a single population, to a tree of populations, called a species tree. It is the fundamental model of the biological process of incomplete lineage sorting, by which gene trees of sampled lineages can fail to match the structure of the tree relating species overall. Incomplete lineage sorting is one of several processes that can make inference of species relationships from genetic data difficult. An example of a single such gene tree sampled for a particular species tree is shown in Figure 2.
The Kingman coalescent models a finite number of lineages, traced backward in time within a single population, as they merge, or coalesce, at common ancestors. The most convenient time scale is in coalescent units t, where ∆t = ∆τ /N (τ ), with time τ measured in number of generations and N (τ ) the population size. In these units, if k lineages are sampled, the time to the first coalescence of the first pair of lineages is exponentially distributed with rate k 2 . The pair that coalesces is then chosen uniformly at random. Then the coalescent process begins again with one less lineage, and hence rate k−1 2 . Wakeley [20] provides a comprehensive introduction to this model. While in population genetics, one often views the Kingman model as running until all lineages coalesce to a single one, in the multispecies coalescent that may not happen within a single population, which has a finite duration.
The parameters of the multispecies coalescent model are a rooted metric species tree, with branch lengths given in coalescent units. The branches of the species tree should be thought of as representing unstructured populations, which stretch back in time until they merge with another population. We also consider a population ancestral to the root of the species tree, which is considered to have infinite length, so that lineages in it coalesce into one with probability 1.
Specific finite numbers of lineages are to be sampled from each species' population at the leaves of the tree. Then the Kingman coalescent model applies for the duration of that population to its parental node in the tree. At that point, there are fewer lineages if any coalescent event occurred, but we gain more lineages from the other branch of the species tree which descends from that node. The combined collection of lineages then starts a new coalescent process on the branch leading towards the root. Continuing in this way, eventually a finite number of lineages reach the root, where a final Kingman coalescent process leads to a rooted metric gene tree. Finally, ignoring branch lengths yields a sampled rooted topological gene tree.
While for species trees with many species it is difficult to compute the probability of any gene tree (e.g., Rosenberg [16]), in the applications based on models T1 and T3, the species tree has only three species, and only one lineage is sampled from each. With only one lineage per species, coalescence can occur only in the internal branch of the tree or "above-the-root", and not in any terminal branch. Thus the only relevant branch length is the internal one.
Suppose that the true species tree is a rooted three species tree ((a, b) :t, c), as shown in Figure 2. There are three possible gene tree topologies, AB|C, AC|B, BC|A.
In this case, the probability of gene trees discordant from the species tree are easiest to compute. For instance the gene trees AC|B and BC|A can only form if no coalescence occurs except above the root. From the exponential distribution of coalescent times, the probability of no coalescence of two lineages in a branch of length t is e −t . Then, with three lineages present at the root, due to the exchangeability of lineages, the formation of each of the three rooted trees must have equal probability of 1 3 . Thus p AC|B = 1 3 e −t . The same argument gives which we apply to the planar image of ∆ 2 . The point θ 0 is mapped to (0, µ 0 ) with Under these transformations the null parameter space Θ 0 is mapped nonconformally, provided φ 0 ∈ (0, 1), to three line segments emanating from the origin, one to 0, , and varies from π 6 for φ 0 = 1 down to arctan 1 3 as φ 0 → 0. Letting γ 0 = tan (α 0 ), in the transformed space the image of Θ 0 is contained in the union of the half-lines {(0, y) | y ≥ 0} and y = −γ 0 sgn (x) x.
By Theorem 3.1, the approximate distribution of the likelihood ratio statistic is the distribution of the minimum squared Euclidean distance between a normal sample, N ((0, µ 0 ) , I), and three line segments in the transformed space. Assuming that θ 0 is not too close to the boundary of Θ 0 in a sense dependent on sample size, little of the mass of N ((0, µ 0 ) , I) is outside the image of the simplex. Thus, for the remainder of the argument, we replace these line segments with half-lines emanating from the singularity (0, 0).
Denote the marginal probability distributions of the bivariate normal sample by Z ∼ N (0, 1) andZ ∼ N (µ 0 , 1). We next determine the minimum squared distance of a sample point Z,Z to the three half-lines.
Consider first the half-line {(0, y) | y ≥ 0}. IfZ is non-negative, then the squared Euclidean distance is Z 2 , while ifZ is negative, then the squared distance is Z 2 +Z 2 . Thus the squared Euclidean distance is Now consider the half-lines y = −γ 0 sgn (x) x and denote the closest point to Z,Z by (X, −γ 0 sgn (X)X). Assuming X = 0, then sgn (X) = sgn (Z), and minimizing (Z − X) 2 + Z + γ 0 sgn (Z) X In the case X = 0, the closest point to the two half lines is the origin. This can occur only whenZ ≥ 0, so the squared distance to the two half-lines is at least Z 2 , which is the squared distance to the vertical half-line given in Equation (4). Moreover, it can be shown that Z 2 is at most the value given in Equation (5) in this case.
It follows that the approximate distribution of the likelihood ratio statistic is that of the random variable Λ n = min Z 2 + 1 2 1 − sgn Z Z 2 , sin 2 α 0 Z + cot α 0 sgn (Z)Z 2 , as given in Proposition 4.1.
To determine the probability density function for the approximate distribution of Proposition 4.1, we let G X (x) denote the cumulative distribution function of the (non-squared) Euclidean distance. This can be found by integrating the bivariate normal distribution N ((0, µ 0 ) , I) over the tube of points within distance x from the transformed Θ 0 , as shown in Figure 11. Calculations are simplified by the fact that the tubular region in Figure 11 has bilateral symmetry, as does the normal distribution.
Due to symmetry, we need only integrate over the shaded regions in the figure. Let L i = L i (x) denote the half-lines forming the outer boundaries of these regions. Denote by β 0 the angle formed by the line segment joining the origin to the point of intersection of L 1 and L 2 . This angle has measure β 0 = 1 2 π 2 − α 0 . With this setup, G X (x) = 2 (G 1 (x) + G 2 (x) + G 3 (x)), where G i is the integral over the shaded strip R i , and the density of the Euclidean distance is Considering d dx G 1 (x) first, one sees that this derivative is the integral of the normal density over boundary L 1 . We show this formally using polar coordinates: d dx x cos 2 β dβ. Figure 11: The region of integration for G X (x) is between the dashed lines. The integral is evaluated as three integrals, over each of the shaded regions R i .
Substituting y = x tan β gives More briefly, over R 2 we have where f is the density of the bivariate normal. To evaluate this, we reflect about the line y = tan β 0 x, mapping the mean of the Gaussian to (µ 0 cos α 0 , −µ 0 sin α 0 ), and sending L 2 to a vertical half-line (x, y), with y ≥ x tan β 0 , so Finally, since the same reflection maps L 3 to the vertical half-line (−x, y) with Figure 12: The region of integration for model T1.
On R 2 , using polar coordinates and C 2 for the dashed semi-circle, we find where the last line is obtained after a change of variables, and M 0 (z) is the modified Struve function 11.5.5 from Olver [13].
Summing, and making a change of variable x = √ λ, we find the probability density function for Λ n is