A note on BIC in mixed-effects models

The Bayesian Information Criterion (BIC) is widely used for variable selection in mixed effects models. However, its expression is unclear in typical situations of mixed effects models, where simple definition of the sample size is not meaningful.We derive an appropriate BIC expression that is consistent with the random effect structure of the mixed effects model. We illustrate the behavior of the proposed criterion through a simulation experiment and a case study and we recommend its use as an alternative to various existing BIC versions that are implemented in available software.


Introduction
Mixed effects models are typically used in population studies where repeated measurements are observed from several independent subjects (see [3,22] for instance). Population studies are relevant in various fields such as pharmacokinetics or public health, for example to study a disease progression and to evaluate the effect of a treatment or the impact of physiological covariates [3,4,24]. The particular data structure in population approaches raises difficulties with the maximum likelihood (ML) approach. Since the likelihood cannot be expressed in a closed form, model fitting is a complicated computational issue. There is an extensive literature devoted to the development of powerful algorithms based on likelihood linearization [22] or on stochastic versions of the EM algorithm [7,11]. Available software packages (Monolix, NLMIXED in SAS, saemix and nlme in R, nlmefitsa in Matlab, among others) allow to fit a wide range of generalized linear and nonlinear mixed models. For model selection purpose, the Bayesian Information Criterion (BIC) provides a consistent and easy-to-use method [27]. Its generic form is the negative maximized log-likelihood penalized by a term that depends on the number of estimated parameters and the sample size. Surprisingly, the BIC expression differs from one software to another. The reason is that the effective sample size and the effective number of parameters are not clearly defined in this context. The aim of the paper is to clarify how the BIC should be defined in general mixed effects models.

Model formulation
To fix the notations, let N be the number of subjects and let y i = (y i1 , . . . , y i,ni ) ′ be the n i × 1 vector of observations for subject i, where y ij , j = 1, . . . , n i denotes the jth measure observed under time condition t ij and possible additional conditions u ij (such as drug doses for instance). For simplicity, we assume in this paper that the number of repeated measurements n i is the same for each subject: n i ≡ n, i = 1, . . . , N , but extension of our results to different n i s is straightforward. We write x ij = (t ij , u ij ) the values of the so-called regression variables or design variables for subject i and we denote x i = (x i1 , . . . , x i,n ). Population studies try to explain the variability of the observed profiles (x i , y i ), i = 1, . . . , N , and to determine if some of the variation is associated with subject characteristics c i that do not change with time (such as weight or age for instance).
Assuming that the N subjects are mutually independent, a mixed effects model may be presented hierarchically as a two-stage model (see for instance [3,13,22]): • Stage 1 (observations): for each individual i, the probability distribution of the observations y i depends on a set of regression variables x i and a d × 1 vector of individual parameters ψ i y i |ψ i ∼ p(·|ψ i ; x i ); (1.1) • Stage 2 (individual parameters): the inter-individual variability is modeled by considering ψ i as a random vector and by introducing in the model covariates c i ψ i ∼ p(·|θ; c i ), (1.2) where θ is a vector of population parameters.
A linear mixed effects model assumes a linear relationship between y i and ψ i at stage 1 where F i = F (x i ) is a n × d design matrix and where ε i is a n × 1 vector of normally distributed residual errors: By combining (1.3) and (1.4), we obtain the general representation of a linear mixed effects model [13]: are n × m and n × ℓ design matrices. We give an example of such linear mixed effects model in the simulation study Section 3.
Formulation (1.1)-(1.2) includes very general models such as nonlinear mixed effects models for continuous data where f describes the structural model and σ models the variability of the residual errors ε ij [3,22]. Section 4 provides a typical example of a pharmacokinetic study for which the nonlinear mixed effects model is an appropriate framework. Model (1.1)-(1.2) can also consider generalized linear mixed models for categorical, count or survival data [17,26]. In this context, the first stage essentially consists in defining the conditional mean E(y ij |ψ i ; x ij ) as a function µ(x ij , ψ i ) which depends on the fixed and random effects through a link function and a linear predictor: Our main result presented in Section 2 is obtained under the hypothesis that the individual parameters ψ i are normally distributed. For a sake of simplicity in the notations, we will denote η i = D i b i in the sequel. Then, (1.4) reduces to The vector of population parameters θ includes β and the parameters in Ω.
Equation (1.5) can consider models for which certain individual parameters are random or purely fixed. Degenerate matrices Ω may have the following block-diagonal structure: where Ω R is a d R ×d R positive-definite variance-covariance matrix, with d R ≤ d.
The basic model defined in (1.5) for the individual parameters can be extended to more general models. First, we can assume that there exists a monotonic transformation h such that φ i = h(ψ i ) = C i β + η i . For instance, the log-transformation is frequently used for non negative parameters and the logit transformation for parameters such as proportions which are known to take their values between 0 and 1. Equation (1.5) can also be extended to nonlinear Gaussian models such as φ i = h(ψ i ) = µ(β, c i ) + η i , where µ is a possibly nonlinear function of c i and β. Indeed, the proof of our results use of the parametrization that involves the φ i 's instead of the ψ i 's, and does not make any assumption of linearity for the function µ.

Covariate selection
In this paper, we focus on the covariate selection problem, i.e. the selection of the non zero elements of β. In equation (1.5), our model formulation emphasizes that this variable selection problem consists in choosing the most relevant components in the c i 's for describing the between-subjects variability of the individual parameters of the model.
To clarify the variable selection problem tackled in the present work, we consider a basic example based on the following linear mixed-effects model y ij = ψ i0 + ψ i1 t ij + ε ij , i = 1, . . . , N, j = 1, . . . , n, specified by the individual parameters ψ i = ψ i0 , ψ i1 ′ and the regression variables x ij = t ij , i = 1, . . . , N , j = 1, . . . , n, with a Gaussian measurement noise N (0, σ 2 ). We assume that the slope may vary with one covariate c i , so The decision on whether or not to include the covariate c i in the model can be based on the BIC of the fits with and without it. Our goal is to compute the appropriate BIC; to answer this question, we assume that we know which of the parameters are purely fixed or random, i.e. which diagonal element of Ω is null or not. The problem of selecting the variance-covariance structure, i.e. the non zero elements of Ω is beyond the scope of the paper.
The BIC is defined as −2LL + pen where LL is the maximized log-likelihood and pen is a term equal to the product of the number of estimated parameters and the logarithm of the sample size; this term is referred to as the BIC penalty. In a pure fixed effects model where all the observations y ij , i = 1, . . . , N, j = 1, . . . , n i are independent, the effective sample size is the total number of observations n tot = N i=1 n i . In a pure random effects model where all the components of ψ i are random, the estimation of the fixed effects β is based on N independent random vectors (y 1 , . . . , y N ). In such situation, the effective sample size is the number of subjects N . In a mixed effects model which combines fixed and random components of ψ i , the definition of the effective sample size is not straightforward. Yet, in the mixed effects model literature, the BIC penalty usually involves log n tot . From a practical point of view, the log n tot penalty is implemented in the R package nlme [23] and in the SPSS procedure MIXED [28] while the log N penalty is used in Monolix [18], saemix [2] or in the SAS proc NLMIXED [25].
From a theoretical point of view, the BIC penalty relies on asymptotic approximations. Nie [20] showed that convergence rates of maximum likelihood estimators can differ from parameter to parameter according to the level of variability designed in the mixed model. Based on the work of [20], we consider the double-asymptotic framework where both the number of subjects N and the numbers of measurements per subject n i , i = 1, . . . , N tend to infinity. We use an appropriate decomposition of the complete log-likelihood combined with the Laplace approximation to derive the asymptotic BIC approximation. We obtain a BIC penalty based on two terms proportional to log N and log n tot that adapts to the mixed effects structure of the model. BIC-type procedures were studied by Jiang and Rao [9] in linear mixed effects models. Jiang et al. [10] pointed out the difficulties encountered with these criteria in non conventional situations including mixed models and introduced new strategies called fence methods. Recently, several authors clarified how the number of parameters should be chosen to define a conditional AIC in mixed models [8,15,16,29]. Bondell et al. [1], Lai et al. [12] and Schelldorfer and Bühlmann [26] studied the problem of fixed and random effects selection with a lasso-type penalized likelihood approach in different mixed effects models. In the three procedures, a BIC step was implemented to choose the regularization parameters. The three papers used different BICs with different choices of effective sample sizes and effective degrees of freedom. Our work gives an answer on how to define the BIC accounting for the covariance structure in the model.
The rest of the article is organized as follows. We describe in Section 2 the asymptotic framework of our study and give the theoretical penalty term to select covariates using BIC in mixed models. Sections 3 and 4 are devoted to a simulation experiment and a case study. We illustrate the performance of the proposed BIC and compare it with standard BIC criteria that are implemented in available softwares. The article concludes with a discussion in Section 5.

BIC for mixed-effects models
Denote by y = (y 1 , . . . , y N ) the N vectors of n i ≡ n observations. Then, the total number of observations is n tot = i n i = N n. From here on, the c i 's and the x i 's will be omitted in the notations of the probability distributions. The conditional distribution of y i will be denoted p(·|ψ i ) rather than p(·|ψ i ; x i ) and the distribution of ψ i will be denoted p(·|θ) rather than p(·|θ; c i ). The distributions' dependency on rather the c i 's or the x ij 's is nevertheless still accounted for in the following theoretical developments.
The BIC statistic arises in the Bayesian approach for model selection. Bayesian choice is based on the posterior probability of a given model m: p(m|y) ∝ p(m)p(y|m). Assuming that the prior over models m is uniform, we need some way of approximating p(y|m). The Laplace approximation [14] of the integral where p(θ|m) denotes the prior for the parameters θ of each model m, gives Here θ denotes the maximum likelihood estimator (MLE) of the model parameters and H θ = − ∂ 2 log p(y|m) ∂θ∂θ ′ | θ= θ is the negative Hessian matrix computed at the MLE and also referred to as the observed information matrix. We use ≈ to mean "is approximately equal to", corresponding to asymptotic equivalence as sample size goes to infinity. In fixed-effects models, under basic regularity assumptions, the Hessian H θ behaves as N times a full-rank matrix I θ , namely the information matrix of the model. Then, where q = dim(θ) is the number of parameters in the model, i.e. the number of elements of θ. The second term does not grow with N and by dropping it and substituting into (2.1) we get the classical BIC approximation In mixed effects models, this is often not true, depending on whether n → ∞.
The order of accuracy of the Laplace approximation depends both on the number of subjects N and the number of observations per subject n; the second term log det(I θ ) cannot be evaluated anymore as a constant. To evaluate the respective contributions of N and n in (2.1), we investigate the orders of magnitude of the components of H θ with respect to both N and n. It requires to expand the likelihood p(y|m) around the random effects rather than around the average effect over all subjects. To see this, we write log p(y|m) = N i=1 log p(y i |θ) and consider the marginal likelihood for subject i, p(y i |θ), that is obtained by integrating the conditional distribution function of the data vector y i with respect to ψ i 's distribution: Except for particular models, the integral in (2.2) does not have a tractable expression. Along the lines of [19,20], the individual complete likelihood p(y i , ψ i |θ) is decomposed into two terms according to the covariance structure in the model. Equation (1.6) naturally divides the ψ i 's into two components: the individual parameters ψ ik which are not random (for which Ω kk = 0), and the individual parameters that randomly vary among subjects, corresponding to a non-degenerate random effect η ik with non negative variance, k = 1, . . . , d. We denote by ψ F,i the components of ψ i that are not random and ψ R,i the components of ψ i that include a random component, leading to following notations: in such a way equations (1.5) and (1.6) are still valid. Thus the population Note that although the components of β and θ are now indexed by either "F" or "R", the whole population parameters are fixed parameters in the model. Indexes "F" and "R" only refer to the fixed and random components of ψ i . Some precise illustration of such a decomposition is provided in Section 3 and summarized in Table 1. Thus, (2.2) can be replaced by The negative Hessian matrix H θ can be written as Under suitable regularity conditions, by using (2.4) and Laplace approximations of the partial derivatives of the individual log-likelihoods, we obtain where I 1 and I 2 represent full rank matrices of constant order of magnitude. By dropping constant terms and substituting into (2.1) we get the appropriate BIC approximation Details of the proof are given in the Appendix. Both N and n need to be sufficiently large if this approximation is to work. It is worth noting that the leading term of the Laplace approximation of the individual log-likelihood derivatives is at most O(1), the reminder having order of O(1/n). Hence, the order of accuracy of the leading term of the Laplace approximation to the sample log-likelihood (2.1) is O(N/n), which is negligible provided n grows faster than N . Hence the BIC approximation is obtained under the assumption that the number of observations per subject grows at a faster rate than the number of subjects. If n grows at a rate slower than N , the Laplace leading term no longer converges to the log-likelihood, although the MLEs still attain consistency [30].
Under the assumptions given in the Appendix, we obtain the following result.
The BIC procedure consists in selecting the model that minimizes Remarks: 1. The new BIC criterion penalizes the size of θ R with the logarithm of the number of subjects and the size of θ F with the logarithm of the total number of observations. In a pure fixed-effects model, θ R is empty and θ = θ F . Then, the proposed criterion is the BIC with a log n tot penalty: In the other extreme situation where all the individual parameters are random, all the population parameters are components of θ R and the proposed criterion is the BIC with a log N penalty: Thus, the BIC h proposed in equation (2.6) appears to be an hybrid BIC version that automatically adapts to the random-effects structure of a mixed model. 2. In the Akaike Information criteria (AIC) adjusted to mixed models [16,29], the log-likelihood is penalized by a term that depends on the number of degrees of freedom ρ of the model. ρ is either the number of fixed parameters (marginal AIC) or the effective number of parameters (conditional AIC, [8,15]). The choice between the two formulae depends on the focus of the model selection, i.e. prediction about the fixed effects or prediction about the random effects themselves. Our work on the BIC does not address the same issue. We give the optimal BIC to perform covariate selection, i.e. to find the non-zero elements of the fixed-effects β, whatever the random structure of the mixed model. 3. Deriving the BIC h 's value involves the computation of the population parameters' estimatorsθ and the computation of the observed log-likelihood. These calculations can be intricate in nonlinear models but several algorithms are available [11] and are implemented in standard packages. The penalty term is straightforward to compute once the decomposition of θ = (θ R , θ F ) is specified for the model under study. An illustrative example is given in the next section.

Simulation study
The objective of this section is to compare numerically the performances of the proposed "hybrid" BIC, called BIC h , with those of the two most widely used BIC versions: BIC ntot and BIC N , which systematically penalize any model using log(n tot ) and log(N ) respectively. We will consider a basic variable selection problem based on the following linear mixed-effects model: N (0, σ 2 ). Here we have ψ i = C i β +η i , with where Ω is a diagonal matrix with diagonal elements (ω 2 0 , ω 2 1 , ω 2 2 ). Here, covariate model selection reduces to selecting the non zero elements of (α 1 , α 2 ). There are therefore 4 possible covariate models to compare using different information criteria: Unlike BIC N and BIC ntot , BIC h 's penalty depends on both the number of random individual parameters and the number of covariates in the model. Table 1 displays the elements of θ R and θ F as well as the three different penalization terms used by the three different BIC, for the four covariate models (M k , 1 ≤ k ≤ 4) and the four following variance models: The aim of this numerical experiment is to investigate the behavior of the three different versions of BIC in different situations, i.e. using different covariate models, different variance models and different designs.
We have then simulated data under the 4 × 4 × 4 = 64 possible situations by combining the covariate models (M k , 1 ≤ k ≤ 4), the variance models (O ℓ , 1 ≤ ℓ ≤ 4) and four different designs obtained with different numbers of subjects  For each of these 64 models, 50 datasets were simulated as follows: the n observation time points t i1 , . . . , t in were equally spaced in [0, 10] and the residual error variance was fixed to σ 2 = 1. In order to consider different situations, the covariates (c i ) and the variances (ω 2 m , 0 ≤ m ≤ 2) where randomly drawn for each replicate: c i ∼ N (0, 1), µ 0 , α 1 , α 2 ∼ N (0.01, 1), µ 1 ∼ N (0.005, 1), µ 2 ∼ N (0.0025, 1) and ω 2 m ∼ U [0.01,1.01] , m = 0, 1, 2. For each simulated dataset, the EM algorithm was used for estimating the population parameters and the observed likelihood was computed as a Gaussian likelihood, according to (3.1). We could then derive the three versions of BIC using the penalization terms displayed in Table 1 and thus obtain three selected covariate models. The selected models were then compared to the original covariate model used for generating the data.
The results of the Monte-Carlo simulation study are displayed in Figure 1. The results obtained with the 64 models are displayed as follows: each column is associated to a covariate model, each row to a variance model. Then, any of the 16 subplots displays the results obtained with the four designs for a given covariate and variance model.
We can remark first that the performances of BIC N and BIC ntot significantly differ according to the covariance structure of the model. Figure 1 especially shows that BIC ntot is more likely to select the right covariate model than BIC N when there are no random components (model O 1 ). In this situation BIC h behaves exactly as BIC ntot since the penalizations are the same, up to a constant term (see Table 1). On the contrary, BIC N better behaves than BIC ntot in a model with a large number of random parameters (model O 4 ). Such result was expected according to Remark 1 in Section 2. In this extreme situation, we can remark that BIC h now behaves exactly as BIC N (see Table 1). In models with an intermediate covariance structure (models O 2 and O 3 ), the comparison between BIC N and BIC ntot seems to depend both on the design and the covariate model. In some situations, BIC ntot underestimates the number of covariates and in other situations, BIC N overestimates this number while BIC h uses the most adequate penalization whatever the model and the design.
In summary, BIC h is globally the best selection criterion in the present covariate selection problem, since it behaves in various situations at least as well as both standard criteria BIC N and BIC ntot . On the other hand, the performances of BIC N and BIC ntot highly depend on the number of random parameters in the model. Thus, by automatically adapting its penalization according to the covariance structure of the model, BIC h is a useful compromise between BIC N and BIC ntot for covariate selection in a population approach.

Application to the warfarin data
We will use a classical clinical pharmacology study [21] for illustrating the proposed method with a real data example. This data is available with the Monolix software and all the results presented in this section can easily be reproduced.
In this well known study, N = 32 healthy volunteers received a 1.5 mg/kg single oral dose of warfarin, an anticoagulant normally used in the prevention of thrombosis. The warfarin plasma concentration C was then measured at different times for these patients and a total number of n tot = 251 measurements was obtained.
We consider a one compartment model for this data: where D is the initial dose of drug, ka the absorption rate constant, V the volume of distribution and Cl the clearance of the drug. We then model the observations (y ij , 1 ≤ i ≤ n i ) of patient i using a proportional error model: where x ij = (t ij , D i ) are the regression variables, ψ i = (ka i , V i , Cl i ) are the PK (pharmacokinetic) parameters for patient i and the residual errors ε ij are i.i.d. centered Gaussian random variables with variance σ 2 . Due to positivity constraints, we assume that (ka i , V i , Cl i ) are log-normal random variables, i.e. log ψ i is Gaussian. We also assume a diagonal covariance matrix Ω. Available covariate for patient i is its weight w i . Here w i has been centered by the empirical mean computed on the 32 patients.
We have compared several possible covariate models for the PK parameters but we only report here four representative models. For these four models, we use the same following model for . We also assume that there is no random effect on ka i . We then consider different covariate models for ka i and Cl i :  Table 2. The numbers of non-random and random components of θ in each model are indicated in columns 2-3 and the selection criteria BIC N , BIC ntot and BIC h are computed in columns 5-7. We can remark that the four covariate models are not ranked in the same way according to the BIC penalty. BIC N increases from the largest model to the smallest one and is minimum for M 4 . BIC ntot ranks the models in the reverse order and selects the simplest one M 1 . As expected, BIC h chooses the intermediary model M 2 which includes in the model a weight effect on Cl.
We don't pretend here that M 2 is the "best" model and that BIC h is able to select it. This example only illustrates the fact that different models can be selected according to the penalization which is used. It is the simulation study presented in the previous section that illustrates the good statistical properties of BIC h .

Discussion
The classical definition of BIC is inappropriate for mixed effects models where the information associated with a given model structure is affected by the number of random effects. This article derives the appropriate BIC penalty for mixed effects models depending on both the number of subjects N and the numbers of observations per subject n i , i = 1, . . . , N . The selection procedure is consistent if N and min(n i ) tend to infinity. Intuitively, the log N term comes from standard asymptotic theory while the log n tot term comes from the Laplace approximation of the integrated likelihood. Although validated only if min(n i ) goes to infinity faster than N , the result of equation (2.6) is expected to work well when the number of measurements is moderate, as illustrated in the simulated example. Equation (2.6) is derived under usual regularity assumptions satisfied by generalized linear and nonlinear mixed-effects models. It can be extended to more general models involving a dependence structure within each subject, such as mixed-effects hidden Markov models [4,5].
We have performed a simulation study for comparing the behavior of the proposed BIC with two standard BIC criteria that are implemented in different softwares used for the analysis of mixed effects models. Using a simple linear mixed effects model, we have found that the proposed BIC mainly behaves as the best of the two standard BIC, whatever the random structure of the model. This is predicted by the theory since the proposed criterion automatically adapts to the number of random factors that define the model structure. Additional simulations involving more complex models such as "inter-occasion" models often used in pharmacology are required to investigate the empirical properties of the proposed criterion, which will be implemented soon in the next version of Monolix.
The proposed BIC is designed only for covariate selection when the structure of the random effects of the model is given. The selection of significant random effects cannot be treated in the same way and deriving the BIC for joint selection of covariate and covariance components remains an open problem.

Appendix: Technical details
The key for deriving the expected BIC penalty in mixed-effects models is to get an appropriate approximation of log det H θ when both the number of subjects N and the number of observations per subject n tend to infinity. The study of the Hessian matrix is based on the asymptotic evaluation of the partial second derivatives of the individual log-likelihood log p(y i |θ). The main steps of the proof are given below.

A.1. Decomposition of the log-likelihood
Let us first give the general form of the Laplace approximation of the partial second derivatives of subject i's observed log-likelihood [20]: and As a result of the partition of θ between θ R and θ F (2.3), l i (y i , ψ i |θ) naturally divides into according to the Bayes formula.

A.2. Regularity assumptions
Some conditions on the model are required to study the order of magnitude of H θ 's components when N, n → +∞. Let ϑ denote an open subset of the parameter space Θ and let θ ⋆ denote the true population parameter value. We assume that for any given n: (H1) For all i = 1, . . . , N , and for all θ ∈ ϑ, p(y i |θ) admits all first, second and third derivatives with respect to θ for almost all y i . (H2) (i) There exists M > 0 such that for all i = 1, . . . , N , for all θ ∈ ϑ and all k, l = 1, . . . , q, (ii) Moreover, there exists a sequence of functions {G 1 (y 1 ), . . . , G N (y N )} such that for all θ ∈ ϑ, for all i = 1, . . . , N and for all k, l, h = 1, . . . , q, Matrix H θ ⋆ is positive definite and invertible, and max where I q stands for the q × q identity matrix.
Assumptions (H1) to (H4) are classical regularity conditions encountered in most studies relative to the asymptotic maximum likelihood theory. They were adapted to mixed-effects models by [20]. Condition (H5) is necessary to evaluate the respective order of the components of the Hessian matrix and could be seen as a regularity condition on the distributions p(y i |ψ i ). (H5) is a realistic assumption in the sense that it is verified in most mixed models, even in models including a dependance structure within each subject under classical regularity conditions on p(y i |ψ i ) (see [20] and [6] for illustrative examples).

A.3. Asymptotic evaluation of H θ
In the following, we consider the Hessian matrix normalized with the number of subjects, i.e. 1 N H θ , instead of H θ itself. Denote Due to the decomposition θ = (θ R , θ F ), G can be written as a block matrix: Since log det G −1 = − log det G, we rather study the orders of magnitude of G −1 's block components. By inverting (A.2), we have: 3) The asymptotic evaluation of G −1 first requires to study the orders of magnitude of G RR , G RF and G F F separately. This relies on the Laplace approximations of ∂θF ∂θ ′ F respectively and assumption (H5). For example, using (A.1) to evaluate G RF , we write As we assume ψ i is Gaussian (see (1.2)), can be expressed as a first-order polynomial function of ψ i . Thus, have similar behaviours when N, n → +∞. According to assumption (H5)-(ii), we deduce that G RF is of the order of a constant when N, n → +∞. The orders of G RR and G F F are obtained on the same pattern. More precisely, using (H5)-(i), we show that | ψi= ψi ] is positive definite when N, n → +∞, thus G RR = O(1). Similarly, we can show that G F F = O(n), since G F F only involves the derivatives of l i1 of order O(n).
In a second step, the asymptotic behaviors of G RR , G RF and G F F are combined together. We get: where I 1 and I 2 are respectively dim(θ R ) × dim(θ R ) and a dim(θ F ) × dim(θ F ) full rank matrices of constant order of magnitude.
Hence the BIC approximation given in equation (2.6).