Posterior propriety of an objective prior for generalized hierarchical normal linear models

ABSTRACT Bayesian hierarchical models have been widely used in modern statistical applications. To deal with data having complex structures, we propose a generalized hierarchical normal linear (GHNL) model which accommodates arbitrarily many levels, usual design matrices and 'vanilla' covariance matrices. Objective hyperpriors can be employed for the GHNL model to express ignorance or to match frequentist properties, yet the common objective Bayesian approaches are infeasible or fraught with danger in hierarchical modelling. To tackle this issue, [Berger, J., Sun, D., & Song, C. (2020b). An objective prior for hyperparameters in normal hierarchical models. Journal of Multivariate Analysis, 178, 104606. https://doi.org/10.1016/j.jmva.2020.104606] proposed a particular objective prior and investigated its properties comprehensively. Posterior propriety is important for the choice of priors to guarantee the convergence of MCMC samplers. James Berger conjectured that the resulting posterior is proper for a hierarchical normal model with arbitrarily many levels, but a rigorous proof was not given. In this paper, we complete this story and provide user-friendly guidance. The main contributions of this paper are twofold: we propose a new technique for deriving an elaborate upper bound on the integrated likelihood, and we present a unified approach to checking the posterior propriety for linear models. An efficient Gibbs sampling method is also introduced and outperforms other sampling approaches considerably.


Introduction
Bayesian hierarchical models (or multilevel models) have been extensively used in modern applications, including education (Raudenbush & Bryk, 1986), psychology (Lindenberger & Pötter, 1998), clinical trials (Xia et al., 2011), economics (Shimotsu, 2010) and many other applied statistical fields. The fundamental idea of hierarchical modelling is to think of the lowest-level units (smallest and most numerous) as organized into a hierarchy of successively higher-level units. For example, students are in classes, classes are in schools, schools are in districts, and districts are in states. Accordingly, hierarchical models are naturally applicable to survey, observational or experimental data involving complicated nesting. However, the most commonly used and fully discussed hierarchical models have merely two levels. Goldstein (2011) and Berger et al. (2020b) defined 3-level hierarchical models and implemented statistical analyses on them. Hierarchical models with more levels have usually been avoided by researchers because of analytical difficulty and intractable computation. To the best of the authors' knowledge, a general hierarchical linear model with arbitrarily many levels seems never to have been defined or studied. In this paper, we introduce the definition of a generalized hierarchical normal linear (GHNL) model and carry out an in-depth theoretical investigation of Bayesian inference for GHNL models.
In order to implement a fully Bayesian analysis, priors must be specified on the hyperparameters (parameters at higher levels of the hierarchical model). Improper (objective) priors are often used to express ignorance or to match frequentist properties (see the review article by Consonni et al., 2018). When using improper priors, an important issue arises: whether the resulting posterior distributions are proper. As Hobert and Casella (1996) stated, without proper precaution, misuse of improper priors, sometimes unknowingly, will result in practical difficulties, such as nonconvergence of the Gibbs sampler. The enormous practical importance of posterior propriety motivates us to explore it in the framework of GHNL modelling. There is also a vast modern literature investigating the posterior propriety of improper priors applied to a large variety of models, for example, Sun et al. (2001), Speckman and Sun (2003), Berger et al. (2005) and Michalak and Morris (2016).
A great deal of effort has been devoted to the development of objective hyperpriors in hierarchical modelling; see, for example, Daniels and Kass (1999), Everson and Morris (2000), Gelman (2006), Gustafson et al. (2006), Berger et al. (2005) and Berger et al. (2020b). Formal objective Bayesian approaches, like the Jeffreys-rule prior or the reference prior, are only feasible in simple hierarchical settings. For instance, the exact Jeffreys-rule prior for covariance matrices at a higher level depends on the parameters from the lower levels of the model, leading to plenty of difficulties in formulation and computation. Therefore, a common way is to use less formal approaches, such as applying formal objective priors from non-hierarchical models to hierarchical modelling. Unfortunately, the non-hierarchical Jeffreys-rule prior and reference prior typically yield improper posteriors in hierarchical settings (cf. Berger et al., 2005). Those who recognize this problem often use constant priors instead for higher-level variance components. However, the constant prior is so diffuse that it requires twice as many observations as logically needed to achieve posterior propriety (cf. Berger et al., 2005 and Berger et al., 2020b).
In other words, the extra observations required are wasted on correcting the over-diffuse tail of the constant prior. The most powerful tool known for detecting over-diffuse hyperpriors is to look at the frequentist notion of admissibility of the resulting estimators (see Berger et al., 2005 for discussions and references). Sensible choices of objective hyperpriors are on the boundary of admissibility, being as diffuse as possible without leading to inadmissible estimators. Berger et al. (2005) studied the propriety and admissibility of a number of hyperpriors, but no overall conclusion was reached as to a specific prior to recommend. The reasons are as follows: (a) the admissibility of the leading candidate prior could not be proved; (b) the proposed computational methods were only efficient for relatively low-dimensional covariance matrices and remained quite challenging for the candidate priors; (c) the hierarchical model discussed had merely two levels, and the results do not adapt to a general hierarchical model with many levels. To address this issue, Berger et al. (2020b) recommended a particular objective prior for use in all normal hierarchical models. Consider the following canonical form of the 2-level hierarchical normal model. Suppose that, independently, y_i ∼ N_k(θ_i, I_k) and θ_i ∼ N_k(β, V) for i = 1, ..., m, where N_k(·, ·) denotes the k-dimensional normal distribution, the y_i are k × 1 observation vectors, the θ_i are k × 1 unobserved mean vectors, β is a k × 1 'hypermean' vector, and V ∈ R^{k×k} is an unknown 'hypercovariance' matrix. Berger et al. (2020b) proposed a particular combination of independent priors on the hyperparameters β and V as

π(β) ∝ ||β||^{−(k−1)},  π(V) ∝ |V|^{−1/(2k)} / ∏_{s<t} (v_s − v_t),   (1)

where v_1 > v_2 > ⋯ > v_k > 0 are the ordered eigenvalues of V. The recommendation (1) for hyperpriors was justified by Berger et al. (2020b) from the aspects of admissibility, ease of computation and performance. Most importantly, prior (1) is adapted to use at any level of hierarchical modelling, which is not true of the other proposed objective priors previously mentioned.
Since it is hazardous to skip demonstrating propriety at the risk of making inference from an improper posterior, Berger et al. (2020b) showed the posterior propriety of a 3-level hierarchical model using prior (1), while assuming square design matrices for a technical reason. Berger et al. (2020b) also conjectured that the posterior is proper when the recommended prior is utilized at all levels of a hierarchical normal model with arbitrarily many levels; a rigorous proof, however, was not provided. In this paper, we complete this story and prove the posterior propriety for GHNL models in general situations. Besides, as pointed out in Michalak and Morris (2016), researchers have found it daunting and time-consuming to inspect posterior propriety when using improper priors, except in the simplest models. For this reason, we supply user-friendly guidance for checking posterior propriety to practitioners in different practical situations.
In Section 2, we give an explicit definition of the GHNL model, which accommodates arbitrarily many levels and usual design matrices. It is important to note that we consider the 'vanilla' covariance matrix problem herein: we do not assume any special structure or sparsity for the hypercovariance matrices. The association between the GHNL model and a linear mixed-effect model is also discussed. In Section 3, we demonstrate that the recommended prior yields a proper posterior in the framework of GHNL modelling. In addition, we exhibit guidance for checking posterior propriety. An efficient MCMC algorithm for sampling from the posterior is introduced in Section 4. Section 5 provides some concluding remarks and further generalizations.

Generalized hierarchical normal linear model
In this section, we will introduce the definition of a GHNL model with (r + 1) levels, where r ≥ 1. The association between the GHNL model and a linear mixed-effect model will be demonstrated as well, which brings insight into the GHNL model. Finally, the recommended prior on the hyperparameters of the GHNL model will be presented and discussed. First, we introduce some notations to be used in the main body of this paper.
Notations Let [k] = {1, 2, ..., k} for a positive integer k; 1_{·} stands for the indicator function; N_k(μ, Σ) represents the k-dimensional normal distribution with mean μ and covariance Σ, and N_k(μ, Σ) likewise denotes a k-dimensional normal random variable with mean μ and covariance Σ; for a symmetric matrix A, A > (<) 0 means that A is a positive (negative) definite matrix, and A ≥ (≤) 0 denotes that A is a non-negative (non-positive) definite matrix.

Model structure
Berger et al. (2020b) proposed a 3-level hierarchical model, model (2), where the y_i are k × 1 observation vectors, the θ_i are k × 1 unobserved mean vectors, η is a p × 1 'hypermean' vector, V ∈ R^{k×k} and W ∈ R^{p×p} are unknown 'hypercovariance' matrices, and the Z_i are k × sp known matrices. All the normal random variables in model (2) are mutually independent. Based on the 3-level hierarchical normal model, a more general hierarchical model with (r + 1) levels (r ≥ 1) can be constructed as

Level 1: y_{i_0} | θ_1 ∼ N_{k_0}(Z_{0 i_0} θ_1, I_{k_0}), i_0 ∈ [m_0];
Level (j + 1): θ_{j i_j} | θ_{j+1} ∼ N_{k_j}(Z_{j i_j} θ_{j+1}, V_j), i_j ∈ [m_j], j ∈ [r − 1];
Level (r + 1): θ_{r i_r} ∼ N_{k_r}(Z_{r i_r} η, V_r), i_r ∈ [m_r].   (3)

First, all the normal random variables in the above model are mutually independent. Within model (3), the output of level (j + 1) consists of m_j units whose values are k_j × 1 vectors for j = 0, 1, ..., r. By stacking the output units of level (j + 1) on top of one another, we obtain the outcome vector of level (j + 1) as θ_j = (θ_{j1}^⊤, ..., θ_{j m_j}^⊤)^⊤ for j ∈ [r], and y = (y_1^⊤, ..., y_{m_0}^⊤)^⊤ for level 1. Then the θ_j's are (m_j k_j) × 1 vectors and y is an (m_0 k_0) × 1 vector. In fact, only the outcome of the lowest level can be observed; the outcomes of higher levels are inaccessible, latent variables. Hence, the outcome variables of interest are always situated at the lowest level of the hierarchy. Different units in the same level share common input effects (an intercept can be included), which are exactly the outcome vectors from the upper level, except that the input effect of level (r + 1) is η, a d × 1 vector of fixed effects. In addition, the units from the same level have the same variance component. The variance component within level (j + 1) is denoted by V_j ∈ R^{k_j × k_j} for j ∈ [r] and accounts for the magnitude of random variation within the corresponding level. The covariance matrices V_j are unobserved for j ∈ [r]. The matrices Z_{j i_j} are k_j × (m_{j+1} k_{j+1}) matrices (with the convention m_{r+1} k_{r+1} = d at the top) and denote the matrices of observed covariates for unit i_j, where j = 0, 1, ..., r and i_j ∈ [m_j]. It is natural to assume that there exist at least two units in each level and that the dimensions of all units and of η are no less than 1; mathematically, m_j ≥ 2 and k_j ≥ 1 for j = 0, 1, ..., r, and d ≥ 1. Table 1 summarizes several important notations that mainly affect the results on posterior propriety in Section 3. The extensions from Berger et al. (2020b)'s model (2) to model (3) are two-fold: model (3) accommodates arbitrarily many levels and usual covariate matrices. Further define Z_j = (Z_{j1}^⊤, ..., Z_{j m_j}^⊤)^⊤ for j = 0, 1, ..., r. Then the Z_j are (m_j k_j) × (m_{j+1} k_{j+1}) matrices for j = 0, 1, ..., (r − 1), Z_r is an (m_r k_r) × d matrix, and an alternative representation of the (r + 1)-level hierarchical model (3) is thereby given by

y = Z_0 θ_1 + ε_0,  θ_j = Z_j θ_{j+1} + ε_j, j ∈ [r − 1],  θ_r = Z_r η + ε_r,   (4)

where ε_0 ∼ N_{m_0 k_0}(0, I_{m_0 k_0}) and ε_j ∼ N_{m_j k_j}(0, I_{m_j} ⊗ V_j), j ∈ [r], are mutually independent.

Remark 2.1: If we assume that the covariance matrix for the units from level 1 in model (3) is a positive definite matrix Σ_0 instead of the identity matrix, then, when Σ_0 is known, the two assumptions are actually equivalent under the reparameterization y*_{i_0} = Σ_0^{−1/2} y_{i_0} and Z*_{i_0} = Σ_0^{−1/2} Z_{0 i_0}. Furthermore, for a technical reason, Σ_0 must be assumed known throughout this paper; this reason will be explained in Section 5.
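To make the recursive structure of models (3) and (4) concrete, here is a minimal simulation sketch. It is illustrative only: the design matrices Z_j and variance components V_j below are random placeholders (assumptions for the demonstration), not quantities tied to any real data set.

import numpy as np

def simulate_ghnl(m, k, d, rng):
    """Simulate one draw of y from the (r+1)-level GHNL model (4).

    m, k : lists of length r+1 giving m_j (units) and k_j (dimensions)
           for j = 0, ..., r; d is the dimension of the fixed effect eta.
    All Z_j and V_j below are random placeholders for illustration.
    """
    r = len(m) - 1
    dims = [m[j] * k[j] for j in range(r + 1)] + [d]
    # Stacked design matrices Z_j: (m_j k_j) x (m_{j+1} k_{j+1}); Z_r: (m_r k_r) x d.
    Z = [rng.standard_normal((dims[j], dims[j + 1])) for j in range(r + 1)]
    # Positive definite variance components V_j in R^{k_j x k_j}, j = 1, ..., r.
    V = [None]
    for j in range(1, r + 1):
        A = rng.standard_normal((k[j], k[j]))
        V.append(A @ A.T + np.eye(k[j]))
    theta = rng.standard_normal(d)                 # fixed effect eta at level r+1
    for j in range(r, 0, -1):                      # top-down: theta_r, ..., theta_1
        cov = np.kron(np.eye(m[j]), V[j])          # I_{m_j} (x) V_j
        theta = rng.multivariate_normal(Z[j] @ theta, cov)
    return Z[0] @ theta + rng.standard_normal(dims[0])  # level-1 errors ~ N(0, I)

rng = np.random.default_rng(0)
y = simulate_ghnl(m=[4, 3, 2], k=[2, 2, 2], d=2, rng=rng)
print(y.shape)  # (m_0 k_0,) = (8,)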

Connection with the linear mixed-effect model (LMM)
The two-level hierarchical normal models are often referred to as LMMs in many places. As for the GHNL model (4), let Θ = {θ_1, ..., θ_r} denote the set of unobserved outcome vectors and V = {V_1, ..., V_r} represent the set of unknown covariance matrices. If we take the θ_j's as intermediate variables, then marginalizing out over Θ yields

y | (η, V) ∼ N_{m_0 k_0}(X_r η, Σ),  Σ = I_{m_0 k_0} + ∑_{j=1}^{r} X_{j−1} (I_{m_j} ⊗ V_j) X_{j−1}^⊤,   (5)

where X_j = Z_0 Z_1 ⋯ Z_j, so that Σ is an (m_0 k_0) × (m_0 k_0) matrix and the X_j are (m_0 k_0) × (m_{j+1} k_{j+1}) matrices for j = 0, 1, ..., r. Suppose that the Z_j are of full column rank. Then, by Sylvester's rank inequality, the X_j are also of full column rank, j = 0, 1, ..., r. In the rest of the paper, the Z_j are assumed to be of full column rank for j = 0, 1, ..., r.
Consider a particular LMM of the form

y = X_r η + ∑_{j=1}^{r} X_{j−1} θ*_j + ε,   (7)

where η is the fixed effect, the θ*_j's are random effects independently distributed as N_{m_j k_j}(0, I_{m_j} ⊗ V_j) for j ∈ [r], and ε denotes the vector of random errors and is distributed as N_{m_0 k_0}(0, I_{m_0 k_0}). By integrating out the random effects, the marginal distribution of y conditional on (η, V) is identical to the distribution (5). In a word, the GHNL model is equivalent to an LMM in the sense of the marginal distribution of the observations after integrating out the intermediate outcome vectors or random effects. The equivalence between GHNL models and LMMs can also be illustrated by an example of a mixed-effect ANOVA model.
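As a numerical sanity check on the marginal form (5) as reconstructed above (the original display is compressed here, so the exact expression of Σ reflects our reading of it), the following sketch builds Σ for a small random configuration and compares it with the empirical covariance of repeated hierarchical draws; all matrices are placeholders.

import numpy as np

rng = np.random.default_rng(1)
m, k, d = [3, 2, 2], [2, 2, 1], 2
r = len(m) - 1
dims = [m[j] * k[j] for j in range(r + 1)] + [d]
Z = [rng.standard_normal((dims[j], dims[j + 1])) for j in range(r + 1)]
V = [None] + [np.eye(k[j]) * (j + 1.0) for j in range(1, r + 1)]  # placeholder V_j

# X_j = Z_0 Z_1 ... Z_j, and Sigma as in (5).
X = [Z[0]]
for j in range(1, r + 1):
    X.append(X[-1] @ Z[j])
Sigma = np.eye(dims[0])
for j in range(1, r + 1):
    Sigma += X[j - 1] @ np.kron(np.eye(m[j]), V[j]) @ X[j - 1].T

# Empirical covariance of y (with eta = 0) over repeated hierarchical simulations.
n_sim, eta = 20_000, np.zeros(d)
draws = np.empty((n_sim, dims[0]))
for s in range(n_sim):
    t = eta
    for j in range(r, 0, -1):
        t = Z[j] @ t + rng.multivariate_normal(np.zeros(dims[j]), np.kron(np.eye(m[j]), V[j]))
    draws[s] = Z[0] @ t + rng.standard_normal(dims[0])
print(np.abs(np.cov(draws, rowvar=False) - Sigma).max())  # small, up to Monte Carlo error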

Example 2.1 (Mixed-effect ANOVA model):
Suppose we observe the scores of p courses for student (ijk) as y_{ijk}, for i = 1, ..., s_1, j = 1, ..., s_2 and k = 1, ..., s_3. The observed data lie within a hierarchy of three levels: student (ijk) is nested within class (ij), and class (ij) is nested within school i. Thus, we have s_1 schools in total, each school has s_2 classes, and each class has s_3 students. Consider a mixed-effect ANOVA model

y_{ijk} = η + α_i + β_{ij} + ε_{ijk},   (8)

where y_{ijk}, η, α_i, β_{ij} and ε_{ijk} are all p × 1 vectors for i = 1, ..., s_1, j = 1, ..., s_2 and k = 1, ..., s_3; η denotes the overall mean and is a fixed effect; α_i ∼ N_p(0, V_α) represents the effect of school i; β_{ij} ∼ N_p(0, V_β) represents the effect of class (ij); the student-level independent random error is denoted by ε_{ijk} and has distribution N_p(0, Σ_0), where Σ_0 is a known matrix. At the same time, the α_i, β_{ij} and ε_{ijk} are independently distributed. Consequently, V_α, V_β and Σ_0 are the variance components describing the school-level, class-level and student-level variations, respectively. Due to the hierarchical structure of the observations, we can naturally build a hierarchical model as

y_{ijk} | β*_{ij} ∼ N_p(β*_{ij}, Σ_0),  β*_{ij} | α*_i ∼ N_p(α*_i, V_β),  α*_i ∼ N_p(η, V_α),   (9)

independently, for i = 1, ..., s_1, j = 1, ..., s_2 and k = 1, ..., s_3, where β*_{ij} = η + α_i + β_{ij} and α*_i = η + α_i. Stack the observations and errors of school i into the (s_3 p) × s_2 matrices Y_i and E_i, whose j-th columns collect the y_{ijk} and ε_{ijk} over k, and stack the class effects into the (s_2 p) × 1 vectors β_i = (β_{i1}^⊤, ..., β_{i s_2}^⊤)^⊤ and β*_i = (β*_{i1}^⊤, ..., β*_{i s_2}^⊤)^⊤. Let y_i = vec(Y_i) and ε_i = vec(E_i), where vec(A) denotes the column vector obtained by stacking the columns of the matrix A on top of one another. Define y = (y_1^⊤, ..., y_{s_1}^⊤)^⊤, ε = (ε_1^⊤, ..., ε_{s_1}^⊤)^⊤, β = (β_1^⊤, ..., β_{s_1}^⊤)^⊤, β* = (β*_1^⊤, ..., β*_{s_1}^⊤)^⊤, α = (α_1^⊤, ..., α_{s_1}^⊤)^⊤ and α* = (α*_1^⊤, ..., α*_{s_1}^⊤)^⊤. Thus, y and ε are (m_0 p) × 1 vectors, β and β* are (m_1 p) × 1 vectors, and α and α* are (m_2 p) × 1 vectors, where m_0 = s_1 s_2 s_3, m_1 = s_1 s_2 and m_2 = s_1. Then the hierarchical normal model (9) can be expressed as a GHNL model of the form

y = Z_0 β* + ε,  β* = Z_1 α* + ε_1,  α* = Z_2 η + ε_2,   (10)

with Z_0 = I_{s_1 s_2} ⊗ 1_{s_3} ⊗ I_p, Z_1 = I_{s_1} ⊗ 1_{s_2} ⊗ I_p and Z_2 = 1_{s_1} ⊗ I_p, where 1_q denotes the q × 1 vector with all elements equal to one, ε ∼ N(0, I_{m_0} ⊗ Σ_0), ε_1 ∼ N(0, I_{m_1} ⊗ V_β) and ε_2 ∼ N(0, I_{m_2} ⊗ V_α). Thus, model (8) can be summarized as

y = X_2 η + X_1 α + X_0 β + ε,   (11)

with X_0 = Z_0, X_1 = Z_0 Z_1 and X_2 = Z_0 Z_1 Z_2. By integrating out (α, β) and (α*, β*), the marginal distributions of y under model (11) and model (10) are identical and of the form

y ∼ N_{m_0 p}(X_2 η, I_{s_1 s_2} ⊗ (1_{s_3} 1_{s_3}^⊤) ⊗ V_β + I_{s_1} ⊗ (1_{s_2 s_3} 1_{s_2 s_3}^⊤) ⊗ V_α + I_{m_0} ⊗ Σ_0).

Example 2.1 provides a simple illustration of how the hierarchical model and the mixed-effect model can be constructed from nested data, and the equivalence between the two models is also presented. In Appendix 1, we define a special LMM which is a special case of model (7) with m_j = 1 for all j ∈ [r]; the theoretical investigation of this special LMM is distinct from that of the GHNL model. This special LMM could be common in application, and we provide some theoretical results for interested readers to refer to. We will still focus on the GHNL model in the following sections.
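The Kronecker-product forms of Z_0, Z_1, Z_2 in (10) and of X_0, X_1, X_2 in (11) can be verified numerically. The sketch below assumes the vec-stacking order described above; the identities checked by the assertions are exactly the products X_j = Z_0 ⋯ Z_j.

import numpy as np

s1, s2, s3, p = 2, 3, 4, 2   # schools, classes/school, students/class, courses
one = lambda q: np.ones((q, 1))

Z0 = np.kron(np.kron(np.eye(s1 * s2), one(s3)), np.eye(p))  # (s1 s2 s3 p) x (s1 s2 p)
Z1 = np.kron(np.kron(np.eye(s1), one(s2)), np.eye(p))       # (s1 s2 p) x (s1 p)
Z2 = np.kron(one(s1), np.eye(p))                            # (s1 p) x p

# X_j = Z_0 ... Z_j recover the design matrices of the mixed-effect form (11).
X0, X1, X2 = Z0, Z0 @ Z1, Z0 @ Z1 @ Z2
assert np.allclose(X1, np.kron(np.kron(np.eye(s1), one(s2 * s3)), np.eye(p)))
assert np.allclose(X2, np.kron(one(s1 * s2 * s3), np.eye(p)))
print(X0.shape, X1.shape, X2.shape)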

Priors on the hyperparameters
In order to implement a fully Bayesian analysis, we should specify hyperpriors on the parameters (η, V). Following the recommendation of Berger et al. (2020b), we assume priors on (η, V) of the form

π(η) ∝ ||η||^{−(d−1)},   (12)

π(V_j) ∝ |V_j|^{−1/(2k_j)} / ∏_{s<t} (ω_{js} − ω_{jt}),   (13)

where ω_{j1} > ω_{j2} > ⋯ > ω_{j k_j} > 0 are the decreasingly ordered eigenvalues of V_j, j ∈ [r]. Apart from prior (12), common choices of prior on η include the constant prior and the conjugate prior. None of the three priors results in an improper posterior or difficulties in computation. However, among the three priors, prior (12) is the most satisfactory for all dimensions from the perspective of admissibility. Besides, as noted in Berger et al. (2005), prior (12) is a mixture-of-normal prior with the hierarchical structure

η | λ ∼ N_d(0, λ I_d),  π(λ) ∝ λ^{−1/2},   (14)

and such mixture-of-normal priors have shown great success in shrinkage estimation particularly (cf. Fourdrinier et al., 1998) and robust Bayesian estimation generally (cf. Berger, 1980). Therefore, prior (12) was actually recommended by Berger et al. (2005) for default use.
As for prior (13) on the unknown covariance matrices V_j, j ∈ [r], consider the transformation from V_j to Λ_j = diag(ω_{j1}, ..., ω_{j k_j}) and the orthogonal matrix Γ_j of corresponding eigenvectors; the Jacobian is proportional to ∏_{s<t} (ω_{js} − ω_{jt}). Consequently, prior (13) on V_j becomes the prior density of (Λ_j, Γ_j),

π(Λ_j, Γ_j) ∝ ∏_{l=1}^{k_j} ω_{jl}^{−1/(2k_j)},

with respect to Lebesgue measure on {ω_{j1} > ⋯ > ω_{j k_j} > 0} and the invariant Haar measure over the space O(k_j) = {Γ_j : Γ_j^⊤ Γ_j = I_{k_j}}. Note that the prior on Λ_j is improper and, independently, the prior on Γ_j is constant. Use of a uniform prior for Γ_j ranging over a compact space is natural and non-controversial and has no influence on the eigenvalues. The term ∏_{s<t} (ω_{js} − ω_{jt}) is eliminated after changing variables for prior (13). In contrast, the commonly used priors on a covariance matrix, such as the inverse Wishart, Jeffreys-rule and constant priors, contain the term ∏_{s<t} (ω_{js} − ω_{jt}) in the transformed space. This special term gives low mass to close eigenvalues and hence effectively forces the eigenvalues apart. This is contrary to the common intuition, which would suggest choosing a prior that pushes the eigenvalues closer together. As a result, prior (13) is essentially neutral as to expansion or shrinkage of the eigenvalues.
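The neutrality can be seen numerically by comparing log eigenvalue densities with and without the repulsion term. The exponent −1/(2k_j) used below follows the reconstruction of prior (13) above (the original display is compressed, so the exact constant is an assumption); the qualitative contrast relies only on the presence or absence of the Vandermonde factor.

import numpy as np

def log_eigen_density_recommended(omega):
    # Reconstructed eigenvalue density of prior (13): prod_l omega_l^{-1/(2k)};
    # no Vandermonde term, hence no repulsion between eigenvalues.
    k = len(omega)
    return -np.log(omega).sum() / (2 * k)

def log_eigen_density_with_repulsion(omega, a=1.0):
    # Jeffreys/constant-type priors keep prod_{s<t}(omega_s - omega_t) after
    # the change of variables, penalizing close eigenvalues.
    w = np.sort(omega)[::-1]
    rep = sum(np.log(w[s] - w[t]) for s in range(len(w)) for t in range(s + 1, len(w)))
    return -a * np.log(w).sum() + rep

close, apart = np.array([1.01, 1.00]), np.array([10.0, 0.1])
# Repulsion heavily penalizes close eigenvalues; the recommended prior is nearly neutral.
print(log_eigen_density_with_repulsion(close) - log_eigen_density_with_repulsion(apart))
print(log_eigen_density_recommended(close) - log_eigen_density_recommended(apart))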
In the context of the 2-level hierarchical normal model, Theorem 1 of Berger et al. (2020b) demonstrates that the combination of priors (12) and (13) is on the boundary of admissibility, being as diffuse as possible without yielding inadmissible estimators. Furthermore, it is shown that the generalization allowing covariates at all levels of the hierarchical model does not affect the admissibility result (cf. Berger et al., 2020b). Nonetheless, the admissibility of the recommended prior for the (r + 1)-level hierarchical model with r ≥ 2 is not clear. Generally speaking, this is a very difficult question to answer, and we mainly justify the recommendation of hyperpriors from the angles of posterior propriety and computation in the framework of the GHNL model. Berger et al. (2020b) showed that the resulting posterior of the recommended prior is proper for the 3-level hierarchical model (2), but under a narrow set of assumptions. They also conjectured the posterior propriety for a hierarchical model with any number of levels, although a rigorous proof was not given. In the next section, we comprehensively investigate the conditions for the posterior propriety of the GHNL model (4) using the recommended prior in more general situations. The dimension of η affects the investigation of posterior propriety considerably, and the two cases d ≥ 2 and d = 1 will be discussed separately.

Posterior propriety
Based on (5) and (14), by integrating out η, we obtain the marginal distribution of y conditional on (V, λ) as

y | (V, λ) ∼ N_{m_0 k_0}(0, Σ + λ X_r X_r^⊤).   (17)

The posterior propriety of the GHNL model (4) employing priors (13) and (14) amounts to

m(y) = ∫ L(V, λ; y) π(λ) ∏_{j=1}^{r} π(V_j) dλ dV_1 ⋯ dV_r < ∞,

where L(V, λ; y) is the density of (17) viewed as the integrated likelihood of (V, λ), and m(y) denotes the marginal density of the observation vector. Next, we present some definitions and additional notations frequently used in this section.
More Notations Let card(A) denote the cardinality of the set A. For 0 < I_1, I_2 ≤ ∞, write I_1 ≍ I_2 if there exist constants 0 < C_1 ≤ C_2 such that C_1 I_2 ≤ I_1 ≤ C_2 I_2. For a symmetric matrix A ∈ R^{n×n}, λ_i(A) represents the i-th largest eigenvalue of A, namely λ_1(A) ≥ ⋯ ≥ λ_n(A); let λ_max(A) and λ_min(A) denote the maximum and minimum eigenvalues of an arbitrary symmetric matrix A.
Definition: For convenience, let ω_{r+1,1} ≜ λ, m_{r+1} ≜ d and k_{r+1} ≜ 1. For an arbitrary non-empty set A, define F(A) as the collection of all non-empty subsets of A, and define H(A) = {(j, l) : j ∈ A, l ∈ [k_j]} as the set of level–eigenvalue index pairs generated by a non-empty set of levels A. Further define the composition of F and H as S(A) = F(H(A)).
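Under this reading of the definition (the original display is compressed, so H and S as written are reconstructions), the collections F, H and S can be enumerated directly:

from itertools import chain, combinations

def H(A, k):
    """Level-eigenvalue index pairs (j, l) generated by a set of levels A."""
    return [(j, l) for j in A for l in range(1, k[j] + 1)]

def F(s):
    """All non-empty subsets of a finite collection s."""
    s = list(s)
    return chain.from_iterable(combinations(s, n) for n in range(1, len(s) + 1))

def S(A, k):
    """Composition S(A) = F(H(A)) appearing in Theorems 3.1 and 3.3."""
    return F(H(A, k))

# r = 2 with k_1 = 2, k_2 = 1 and the convention k_3 = k_{r+1} = 1:
k = {1: 2, 2: 1, 3: 1}
print(sum(1 for _ in S([1, 2, 3], k)))  # 2^(2+1+1) - 1 = 15 candidate sets D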

Two key lemmas
Before formally investigating the posterior propriety, we first introduce two important lemmas which are central to the proofs of the main theorems in this paper.
Lemma 3.1 establishes two inequalities concerning the sum of a series of quadratic forms with matrix inputs. Since the X_j's have full column rank and the A_j's are positive definite, we have n ≥ p_j and rank(X_j A_j X_j^⊤) = p_j. It is worth noting that, in the lower bound on |H|, the non-zero diagonal elements of D_j are the decreasingly ordered eigenvalues of A_j; this relation deeply influences the final result when we derive the sufficient condition for posterior propriety. Besides, one can never find a constant C* > 0 such that the reverse inequality on |H| holds with C*.

Lemma 3.2: Consider the integral I over λ = (λ_1, ..., λ_k) ∈ Λ ≡ [0, ∞]^k, where a_j, j = 1, 2, ..., k, and b_i, i = 1, 2, ..., n, are real constants. Then the integral I is finite if and only if (iff) the following two conditions are both satisfied.
Here, Lemma 3.2 (b) may not be straightforward enough, so we will use the following example to elaborate on how Lemma 3.2 can be employed.

Example 3.1: Consider an integral I_0 of the form in Lemma 3.2 with k = 3, where a_1, a_2, a_3, b_1, b_2, b_3 are all real constants. Similar to Lemma 3.2, we can define M = {{1}, {1, 2}, {2, 3}}. Then I_0 < ∞ iff all the following inequalities hold: (a) a_j < 1, j ∈ [3]; (b) the inequalities of condition (b) corresponding to each D ∈ F([3]). Note that, no matter how M is defined, we always need to check the inequalities corresponding to all D ∈ F([k]). Even though some inequalities may turn out to be trivial once written down, for the sake of assurance, we had better take all non-empty subsets of [k] into account at the early stage. Lemma 3.2 plays a crucial role in obtaining the subsequent theorems; a detailed proof of Lemma 3.2 is given in Appendix A.2.

Conditions for the posterior to be proper when d ≥ 2
In this subsection, the case d ≥ 2 is mainly considered.
Theorem 3.1: Consider the GHNL model (4) with priors (12) and (13) on η and V_i ∈ V, respectively. When d ≥ 2, a sufficient condition for the posterior propriety is given by

card{s ∈ [m_0 k_0] | ∃ (j, l) ∈ D such that c_{jl,s} > 0} > ∑_{(j,l)∈D} 1/k_j,   (23)

for any D ∈ S([r + 1]).
Proof: It follows from (17) that the integrated likelihood of (V, λ), after marginalizing out (θ_1, ..., θ_r, η), is given by

L(V, λ; y) ∝ |Σ + λ X_r X_r^⊤|^{−1/2} exp{−y^⊤ (Σ + λ X_r X_r^⊤)^{−1} y / 2}.

By dropping the exponential term involving y (since it is less than one), we have L(V, λ; y) < |Σ + λ X_r X_r^⊤|^{−1/2}. By applying Lemma 3.1 (b), we can further bound the integrated likelihood by C_1 |M_1|^{−1/2}, where C_1 is a positive constant that depends only on the X_j's and M_1 = I_{m_0 k_0} + ∑_{j=1}^{r+1} D_j with D_j = diag(Λ_j ⊗ I_{m_j}, O_{q_j}); here the Λ_j are the diagonal matrices of the decreasingly ordered eigenvalues of V_j for j ∈ [r], Λ_{r+1} = ω_{r+1,1}, the O_{q_j} are q_j × q_j zero matrices, and q_j = m_0 k_0 − m_j k_j for j ∈ [r + 1]. Since Λ_{r+1} = (ω_{r+1,1}), Λ_{r+1} degenerates to the scalar λ, and the prior on λ becomes π(λ) ∝ λ^{−1/2}. The definition of c_{jl,s} in (19) allows the bound to be written eigenvalue-wise, and applying Lemma 3.2 yields the finiteness condition (28); (28) is equivalent to that in (23).
Applying Lemma 3.1 yields the upper bound on the integrated likelihood of the hyperparameters as in (26). Therefore, the special notation c_{jl,s} can be understood as the indicator of whether the eigenvalue ω_{jl} appears in the s-th diagonal element of M_1, for j ∈ [r + 1], l ∈ [k_j] and s ∈ [m_0 k_0]. At the same time, the left-hand side of inequality (23) is exactly the cardinality of the set {s | ∃ (j, l) ∈ D such that c_{jl,s} > 0}, for any D ∈ S([r + 1]).
The cardinality of S([r + 1]) is 2^{∑_{j∈[r+1]} k_j} − 1, which means that the total number of inequalities to be checked in (23) grows exponentially with r and the dimensions k_j. Admittedly, this would impose a considerably heavy computational burden in common practice if Theorem 3.1 were applied directly. Nevertheless, researchers need not worry about this burden, because most of the inequalities in Theorem 3.1 are trivial. To make this point precise, we have the following corollary.
Corollary 3.1: Recursively define R_1 = {j ∈ [r + 1] | m_j ≤ r + 1} and R_{i+1} = {j ∈ R_i | m_j ≤ card(R_i)} for i ≥ 1, and let p be the smallest positive integer i such that {j ∈ R_i | m_j > card(R_i)} = ∅. We call the levels within R_p kernel levels and denote R_p by R_ker. Inequality (23) holds for any D ∈ S([r + 1]) iff inequality (23) holds for any D ∈ S(R_ker). Consequently, if R_ker = ∅, then the posterior is always proper.
Proof: Let R_1^c = [r + 1] − R_1; thus m_j > r + 1 for j ∈ R_1^c by the definition of R_1. For any D ∈ S([r + 1]), if there exists j* ∈ R_1^c such that (j*, l) ∈ D for some l ∈ [k_{j*}], then the left-hand side of (23) is at least m_{j*} > r + 1 ≥ ∑_{(j,l)∈D} 1/k_j, so the corresponding inequality is a trivial one. As a result, inequality (23) holds for any D ∈ S([r + 1]) iff inequality (23) holds for any D ∈ S(R_1). It can then be recursively shown that inequality (23) holds for any D ∈ S([r + 1]) iff inequality (23) holds for any D ∈ S(R_i), i ∈ [p], where p is the smallest positive integer i such that R_i − R_{i+1} = ∅.
By using the technique of extracting kernel levels, we dramatically narrow down the checking region for posterior propriety: we need only check the inequalities for the levels within R_ker, substantially reducing the number of inequalities to be checked from 2^{∑_{j∈[r+1]} k_j} − 1 to 2^{∑_{j∈R_ker} k_j} − 1. Moreover, Corollary 3.1 also indicates two interesting conclusions, depicted as follows.
(a) First, it reveals the mechanism by which three quantities, the number of levels, the numbers of units in the levels and the dimensions of the levels, affect the posterior propriety simultaneously. Roughly speaking, in the context of the GHNL model, the more levels with fewer units of lower dimensions, the less likely the posterior is to be proper. For example, if m_{r−2} = m_{r−1} = m_r = 2 and k_{r−2} = k_{r−1} = k_r = 1, inequality (23) fails for some D, and the posterior propriety can hardly be guaranteed by Theorem 3.1. Conversely, if the units in each level are adequate enough that the set of kernel levels is empty, i.e., R_ker = ∅, then the posterior is always proper. As a consequence, more attention should be focused on the levels with small numbers of units, namely, the kernel levels.

(b) Second, the recommended prior for use at any level in hierarchical modelling is further justified from the aspects of posterior propriety and ease of implementation. For instance, suppose we switch the prior on V_j, j ∈ [r], from (13) to

π(V_j) ∝ |V_j|^{−a} / ∏_{s<t} (ω_{js} − ω_{jt}),   (30)

where 0 < a ≤ 1 (a has to be larger than zero by Lemma 3.2). Then the condition in Theorem 3.1 becomes

card{s ∈ [m_0 k_0] | ∃ (j, l) ∈ D such that c_{jl,s} > 0} > 2a · card(D).   (31)

On one hand, when a is greater than but close to zero (denoted a ↓ 0), all the inequalities in (31) always hold, and the posterior is always proper. However, it is impractical to decide how small a should be and to find such a fixed value that fits all levels. On the other hand, when a ↑ 1, meaning that a is less than but close to 1, we have 2a · card(D) > ∑_{(j,l)∈D} 1/k_j, so inequality (31) is harder to satisfy than inequality (23), especially for large dimensions k_j. Therefore, the posterior using prior (30) is less likely to be proper than that using prior (13). Similar to Corollary 3.1, we can recursively define sets R*_i for i ∈ [p*], where p* is the smallest positive integer l such that {j ∈ R*_l | m_j > 2a · card(R*_l)} = ∅. Then the posterior using prior (30) is proper if (31) holds for any D ∈ S(R*_{p*}). When a ↑ 1, card(R*_{p*}) will be remarkably larger than card(R_ker) for large values of the k_j's, imposing a dramatically heavier burden of checking inequalities than for prior (13). Above all, one sensible choice is to let a be inversely proportional to k_j at level j, which takes both practical and theoretical considerations into account. A sketch of the kernel-level extraction follows below.
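The recursive extraction of Corollary 3.1 is easy to implement; the sketch below follows the recursion as reconstructed above (the threshold_shift argument anticipates the d = 1 variant of Corollary 3.3, whose thresholds are larger by one).

def kernel_levels(m, threshold_shift=0):
    """Extract the set of kernel levels by the recursion of Corollary 3.1.

    m : dict mapping each level index j to m_j (for d >= 2, include the top
        level j = r + 1 with m_{r+1} = d; for d = 1, use levels 1, ..., r only).
    threshold_shift : 0 for Corollary 3.1 (d >= 2); 1 for Corollary 3.3 (d = 1).
    """
    R = set(m)
    while True:
        R_next = {j for j in R if m[j] <= len(R) + threshold_shift}
        if R_next == R:          # no level dropped: R_i - R_{i+1} is empty
            return R
        R = R_next

# r = 2, d = 2: m_1 = 4, m_2 = 2, m_3 = d = 2 (Example 3.2 with s1 = p = 2).
print(kernel_levels({1: 4, 2: 2, 3: 2}))  # {2, 3}, matching R_ker in the text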
The upper bound card(R_i) on ∑_{(j,l)∈D} 1/k_j for D ∈ S(R_i) leads to an effective way of extracting kernel levels, as presented in Corollary 3.1, but this upper bound is still too rough. Next, an elaborate upper bound on ∑_{(j,l)∈D} 1/k_j is demonstrated, and a sufficient condition of clean form for posterior propriety is then derived.

Theorem 3.2: Consider the GHNL model (4) with priors (12) and (13) on η and V_i ∈ V, respectively. When d ≥ 2, if

∑_{j∈R_ker} 1/k_j < m*,  where m* = min_{j∈R_ker} m_j,   (32)

then the posterior is proper.

Proof: It follows from Corollary 3.1 that we only need to prove that inequality (23) holds for any D ∈ S(R_ker). Also, for any D belonging to S(R_ker), the right-hand side of (23) is bounded above by ∑_{j∈R_ker} 1/k_j, as in (34). Distinct eigenvalues from the same level (up to the r-th) never occur in the same row of the matrix M_1; mathematically, c_{j l_1,s} c_{j l_2,s} = 0 for 1 ≤ l_1 < l_2 ≤ k_j, j ∈ [r], s ∈ [m_0 k_0]. Thus, for any j ∈ [r], the left-hand side of (23) is at least m_j, as stated in (35), which is also true for j = r + 1 since k_{r+1} = 1. It is obvious that min_{j∈[r+1]} m_j = min_{j∈R_ker} m_j by the definition of R_ker, and this common value is denoted by m*. Combining (35) with (34) yields (36) and completes the proof.

According to the proof of Theorem 3.2, it can be deduced that ∑_{j∈[r+1]} 1/k_j < m* is also sufficient for posterior propriety; obviously, the condition in Theorem 3.2 is easier to satisfy. Theorem 3.2 reveals that, for fixed m*, the posterior is more likely to be proper for higher dimensions of the units in the kernel levels. Theorem 3.2 also provides researchers with a powerful tool to check the posterior propriety quickly.
Remark 3.1: Consider model (4) with r = 1, namely a two-level hierarchical model. When d ≥ 2, we have m* = min{d, m_1} ≥ 2. Then, referring to Theorem 3.2, the posterior using the recommended prior is always proper for k_1 ≥ 2.
Example 3.2 (Continuation of Example 2.1): Consider the GHNL modelling of the mixed-effect ANOVA as in (10), which is a 3-level hierarchical model with r = 2, m_0 = s_1 s_2 s_3, m_1 = s_1 s_2, m_2 = s_1, m_3 = p, k_0 = k_1 = k_2 = p and k_3 = 1. It is natural to assume that we have at least two schools, each school has at least two classes, and each class has at least two students, namely s_1 ≥ 2, s_2 ≥ 2 and s_3 ≥ 2. If (a) p > 2 and s_1 ≥ 2, or (b) p ≥ 2 and s_1 > 2, it can readily be derived that the set of kernel levels is empty; thus, the posterior is always proper according to Corollary 3.1. When s_1 = p = 2, the set of kernel levels is R_2 = {2, 3}; since 1/k_2 + 1/k_3 = 3/2 < 2, the posterior is proper by applying Theorem 3.2. In conclusion, the posterior using the recommended prior is always proper when p ≥ 2.
In Berger et al. (2020b)'s work, for a technical reason, they assumed k = sp for the 3-level hierarchical normal model (2), so that the design matrices for units within levels 2 and 3 are square matrices. They eventually reached the conclusion that the posterior employing the recommended prior in model (2) is always proper for k = sp and p ≥ 2 (p being the dimension of the hypermean in model (2)). However, the assumption that the design matrices for the units at high levels are square appears to be unnatural and hard to interpret in practice. Nevertheless, we still generalize Berger et al. (2020b)'s result to the GHNL model below, so as to draw a conclusion consistent with theirs.

Corollary 3.2: Consider the GHNL model (4) with priors (12) and (13) on η and V_i ∈ V, respectively. Assume that d ≥ 2 and k_j = m_{j+1} k_{j+1}, j ∈ [r]. Then the posterior is always proper.
Above all, a general procedure for checking the posterior propriety of the GHNL model (4) employing the recommended prior for d ≥ 2 can be summarized as follows.
Guidance for checking the posterior propriety when d ≥ 2:
(a) If the design matrices for each unit in each level are square matrices, then the posterior is proper; otherwise, turn to (b).
(b) Derive the set of kernel levels, R_ker. If R_ker = ∅ or inequality (32) holds, then the posterior is proper. If neither, turn to (c).
(c) Check inequality (23) for all D belonging to S(R_ker). If it always holds, then the posterior is proper. If not, the posterior propriety can hardly be guaranteed.
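Steps (a) and (b) of this guidance are mechanical and can be scripted; step (c) needs the model-specific indicators c_{jl,s}, so the sketch below (which reuses kernel_levels from above; the function names are ours) only reports when step (c) is still required.

def check_propriety_d_ge_2(m, k, square_designs=False):
    """Guidance steps (a)-(b) for d >= 2.

    m, k : dicts over levels j = 1, ..., r + 1, with m_{r+1} = d >= 2 and
           k_{r+1} = 1. Returns 'proper' or a pointer to step (c).
    """
    if square_designs:
        return "proper"                              # step (a), Corollary 3.2
    R_ker = kernel_levels(m)
    if not R_ker:
        return "proper"                              # step (b), Corollary 3.1
    m_star = min(m[j] for j in R_ker)
    if sum(1.0 / k[j] for j in R_ker) < m_star:
        return "proper"                              # step (b), Theorem 3.2 / (32)
    return "step (c): check (23) over S(R_ker)"

# Example 3.2 with s1 = p = 2: R_ker = {2, 3} and 1/2 + 1 < 2, hence proper.
print(check_propriety_d_ge_2({1: 4, 2: 2, 3: 2}, {1: 2, 2: 2, 3: 1}))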

Conditions for the posterior to be proper when d = 1
It is quite common in practice that the dimension of the fixed effect η is one. However, when d = 1, note that for D = {(r + 1, 1)}, the left-hand side of (23) equals d = 1, which does not exceed ∑_{(j,l)∈D} 1/k_j = 1/k_{r+1} = 1, resulting in the failure of the sufficient condition in Theorem 3.1. Therefore, in this subsection, we reinvestigate the conditions for the posterior to be proper when d = 1.

Theorem 3.3: Consider the GHNL model (4) with priors (12) and (13) on η and V_i ∈ V, respectively. When d = 1, a sufficient condition for the posterior propriety is given by

card{s ∈ [m_0 k_0] | ∃ (j, l) ∈ D such that c_{jl,s} > 0} > ∑_{(j,l)∈D} 1/k_j + 1,   (37)

for any D ∈ S([r]).

Proof: When d = 1, the vector η degenerates to a scalar η, and the prior (12) on η becomes a constant prior. Based on (5), by integrating out η, we obtain an upper bound on the integrated likelihood of V after dropping the exponential term (which is less than one). Since X_r^⊤ Σ^{−1} X_r ≥ λ_min(Σ^{−1}) X_r^⊤ X_r = λ_max(Σ)^{−1} X_r^⊤ X_r, using Lemma 3.1 (a) and (b), we can bound the integrated likelihood by C_0 C_1 λ_max(M_2)^{1/2} |M_2|^{−1/2}, where C_0 and C_1 are constants independent of the Λ_j for j ∈ [r], M_2 = I_{m_0 k_0} + ∑_{j=1}^{r} D_j, and the D_j's are defined in (26). Similar to the proof of Theorem 3.1, we can then derive the upper bound on m(y), and it follows from the definition of c_{jl,s} in (19) and Lemma 3.2 that m(y) < ∞ whenever (38) holds for any D ∈ S([r]). It is obvious that inequality (38) is equivalent to that in (37).
Resembling the interpretation of Theorem 3.1, the left-hand side of inequality (37) denotes the cardinality of the set {s | ∃ (j, l) ∈ D such that c_{jl,s} > 0}, for any D ∈ S([r]). To reduce the burden of checking inequalities, we have the following corollary.
Corollary 3.3: Recursively define R̃_1 = {j ∈ [r] | m_j ≤ r + 1} and R̃_{i+1} = {j ∈ R̃_i | m_j ≤ card(R̃_i) + 1} for i ≥ 1, and let q be the smallest positive integer i such that {j ∈ R̃_i | m_j > card(R̃_i) + 1} = ∅. We call the levels within R̃_q kernel levels and denote R̃_q by R̃_ker. Inequality (37) holds for any D ∈ S([r]) iff inequality (37) holds for any D ∈ S(R̃_ker). Consequently, if R̃_ker = ∅, then the resulting posterior is always proper.
In the process of extracting kernel levels, the thresholds on m_j used to split up the levels are increased by one in Corollary 3.3 compared with Corollary 3.1, because the upper bound on the right-hand side of inequality (37) is larger by one than that of inequality (23). Except for this point, the proof of Corollary 3.3 is the same as that of Corollary 3.1. For good measure, a simple tool to check the posterior propriety, the counterpart of Theorem 3.2 for d = 1, is given as follows.

Theorem 3.4: Consider the GHNL model (4) with priors (12) and (13) on η and V_i ∈ V, respectively. When d = 1, if

∑_{j∈R̃_ker} 1/k_j < m* − 1,  where m* = min_{j∈R̃_ker} m_j,   (39)

then the posterior is proper.

Proof: By Corollary 3.3, we only need to verify inequality (37) for any D ∈ S(R̃_ker). Similar to (36), we have ∑_{(j,l)∈D} 1/k_j + 1 ≤ ∑_{j∈R̃_ker} 1/k_j + 1 < m* for any D ∈ S(R̃_ker), where the left-hand side of (37) is denoted by L(D). Since L(D) ≥ m* always holds, the proof is completed.

Remark 3.2:
In model (4), when r = 1 and d = 1, suppose m 1 ≥ 2 and k 1 ≥ 2. The posterior using the recommended prior is always proper by applying Theorem 3.4.

Remark 3.3:
If all the k_j's are equal to one, the sufficient condition in Theorem 3.3 simplifies to

card{s ∈ [m_0 k_0] | ∃ j ∈ D such that c_{j1,s} > 0} > card(D) + 1,

where D ∈ F([r]). By employing the technique of extracting kernel levels, Theorem 3.4 is then equivalent to Theorem 3.3, rather than a mere sufficient condition.

Example 3.3 (Continuation of Example 2.1): Consider the GHNL modelling (10) with the assumption that s_1 ≥ 2, s_2 ≥ 2 and s_3 ≥ 2. If p = 1, we have k_0 = k_1 = k_2 = k_3 = m_3 = 1. When s_1 > 2, we can easily derive that the set of kernel levels is empty; thus, the posterior is always proper by Corollary 3.3. If s_1 = 2, inequality (37) fails, and the posterior propriety can hardly be guaranteed. Consequently, when p = 1, the posterior using the recommended prior is always proper for s_1 > 2.
Next, we generalize Berger et al. (2020b)'s result to the GHNL models for d = 1, assuming that the design matrices Z_{j i_j} for the units are square matrices.

Corollary 3.4: Consider the GHNL model (4) with priors (12) and (13) on η and V_i ∈ V, respectively. Assume that d = 1, k_j = m_{j+1} k_{j+1} for j ∈ [r], and m_j ≥ 3 for j ∈ [r]. Then the posterior is always proper.

Proof: According to the conditions k_j = m_{j+1} k_{j+1}, j ∈ [r], and m_{r+1} = k_{r+1} = 1, we have k_j = ∏_{s=j+1}^{r} m_s, j ∈ [r]. Thus, since m_j ≥ 3 for j ∈ [r], ∑_{j∈R̃_ker} 1/k_j ≤ ∑_{i=0}^{∞} 3^{−i} = 3/2 < 2 ≤ m* − 1, so inequality (39) holds and the result follows from Theorem 3.4.

Summing up the theoretical results above, a general procedure for checking the posterior propriety of the GHNL model (4) employing the recommended prior for d = 1 can be depicted as follows.
Guidance for checking the posterior propriety when d = 1:
(a) If the design matrices for each unit in each level are square matrices and m_j ≥ 3, j ∈ [r], then the posterior is proper; otherwise, turn to (b).
(b) Derive the set of kernel levels, R̃_ker. If R̃_ker = ∅ or inequality (39) holds, then the posterior is proper. If neither, turn to (c).
(c) Check inequality (37) for all D belonging to S(R̃_ker). If it always holds, then the posterior is proper. If not, the posterior propriety can hardly be guaranteed.
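The d = 1 guidance admits the same mechanization, now with the '+1' thresholds and inequality (39); the sketch again reuses kernel_levels defined earlier and stops short of step (c).

def check_propriety_d_eq_1(m, k, square_designs=False):
    """Guidance steps (a)-(b) for d = 1; m, k range over levels j = 1, ..., r."""
    if square_designs and all(mj >= 3 for mj in m.values()):
        return "proper"                              # step (a), square designs
    R_ker = kernel_levels(m, threshold_shift=1)      # Corollary 3.3 thresholds
    if not R_ker:
        return "proper"                              # step (b), Corollary 3.3
    m_star = min(m[j] for j in R_ker)
    if sum(1.0 / k[j] for j in R_ker) < m_star - 1:
        return "proper"                              # step (b), Theorem 3.4 / (39)
    return "step (c): check (37) over S(R_ker)"

# Example 3.3 with p = 1: r = 2, m_1 = s1*s2, m_2 = s1, all k_j = 1.
print(check_propriety_d_eq_1({1: 6, 2: 3}, {1: 1, 2: 1}))  # s1 = 3 > 2: proper
print(check_propriety_d_eq_1({1: 4, 2: 2}, {1: 1, 2: 1}))  # s1 = 2: step (c)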

Computation
In this section, we consider MCMC sampling from the posterior arising from the model in Section 2. For the GHNL model (4) with priors (14) and (13) on η and V, respectively, the joint posterior of (Θ, V, η, λ) is proportional to the product of the likelihood of model (4) and the priors, denoted (40). Sampling (Θ, V, η, λ) from the posterior density (40) can be handled by a Gibbs sampling method. The main difficulty of the computation is to sample the covariance matrices V_j efficiently.

Gibbs sampling for input effects
The full conditionals of the input effects (Θ, η) can be derived from the joint posterior (40) and are illustrated as follows.
(a) Conditioning on θ_2 and V_1, the posterior distribution of θ_1 is normal, with mean and covariance given in (41). (b) The full conditional posteriors of θ_j, j = 2, ..., r, have analogous normal forms, with the convention θ_{r+1} ≜ η. (c) By using (14), the full conditional of η can also be derived as a normal distribution. The input effects θ_j ∈ Θ and η can therefore be readily sampled from their full conditionals during the Gibbs sampling procedure, as these full conditional posterior distributions are all standard distributions.
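Because the conditional-distribution displays are compressed in this version, the sketch below re-derives the standard Gaussian full conditionals directly from the model structure (4); the function names and organization are ours, not the paper's.

import numpy as np

def sample_theta_j(rng, th_lower, th_upper, Z_lower, Z_j, Vinv_lower, Vinv_j):
    """Draw theta_j from its Gaussian full conditional under model (4).

    th_lower  : theta_{j-1} (with theta_0 = y); th_upper : theta_{j+1} (theta_{r+1} = eta)
    Z_lower   : Z_{j-1};  Z_j : Z_j
    Vinv_lower: (I_{m_{j-1}} (x) V_{j-1})^{-1} (the identity when j = 1)
    Vinv_j    : (I_{m_j} (x) V_j)^{-1}
    """
    prec = Z_lower.T @ Vinv_lower @ Z_lower + Vinv_j           # conditional precision
    rhs = Z_lower.T @ Vinv_lower @ th_lower + Vinv_j @ (Z_j @ th_upper)
    cov = np.linalg.inv(prec)
    return rng.multivariate_normal(cov @ rhs, cov)

def sample_eta(rng, theta_r, Z_r, Vinv_r, lam):
    """Draw eta from its Gaussian full conditional, using eta | lam ~ N_d(0, lam I_d)."""
    d = Z_r.shape[1]
    prec = Z_r.T @ Vinv_r @ Z_r + np.eye(d) / lam
    cov = np.linalg.inv(prec)
    return rng.multivariate_normal(cov @ (Z_r.T @ Vinv_r @ theta_r), cov)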

Gibbs sampling for variance components
The variance components, which include the V_j's and λ, can be updated from their full conditionals; these conditionals have the following densities.
(a) Given η, the conditional posterior density of λ is an inverse gamma distribution, so the updating of λ can be carried out simply by sampling from an inverse gamma distribution. (b) The full conditional posteriors (45) of the V_j belong to a class of distributions for covariance matrices recently proposed by Berger et al. (2020a), called the shrinkage inverse Wishart (SIW) distributions. The class SIW(a, H) for a k × k covariance matrix W has density

SIW(W | a, H) ∝ etr(−W^{−1} H / 2) / [|W|^a ∏_{s<t} (ν_s − ν_t)],

where etr(A) denotes exp(tr(A)) for a square matrix A, ν_1 > ν_2 > ⋯ > ν_k > 0 are the ordered eigenvalues of W, a is a real constant and H is a k × k non-negative definite matrix. Thus, the V_j are distributed as SIW(t_j, H_j), j ∈ [r]. To sample the covariance matrices from their full conditional posteriors, previously suggested methods include the Metropolis–Hastings algorithm (cf. Berger et al., 2005) and the hit-and-run method (cf. Yang & Berger, 1994). Both methods generate full candidate matrices from full-parameter proposal distributions, so they only work for moderate dimensions of the covariance matrices. To tackle this issue, Berger et al. (2020a) proposed a powerful Gibbs method for efficiently sampling covariance matrices from such conditional densities, and this new method works for higher dimensions k. Readers can refer to Berger et al. (2020a) or Appendix 3 for details of this Gibbs sampling method. According to the simulation results of Berger et al. (2020a), the new Gibbs method outperforms the Metropolis–Hastings and hit-and-run methods for moderate dimensions and works for k up to 100, while the other two algorithms break down in much lower dimensions.
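For the λ step, assuming the mixture representation (14) with π(λ) ∝ λ^{−1/2} as reconstructed earlier (an assumption, since the display is compressed), completing the square in η gives λ | η ∼ IG((d − 1)/2, η^⊤η/2) for d ≥ 2; a minimal sketch:

import numpy as np

def sample_lambda(rng, eta):
    """Inverse-gamma update for lambda, assuming eta | lambda ~ N_d(0, lambda I_d)
    and pi(lambda) ∝ lambda^{-1/2} as in (14); valid for d >= 2."""
    d = len(eta)
    shape, scale = (d - 1) / 2.0, float(eta @ eta) / 2.0
    return scale / rng.gamma(shape)   # X ~ IG(a, b) iff X = b / Gamma(a, scale=1)

rng = np.random.default_rng(0)
print(sample_lambda(rng, rng.standard_normal(5)))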
In the framework of the 2-level HNLM, Berger et al. (2020b) compared the numerical performance, from the mean squared error (MSE) perspective, of a dozen objective hyperpriors, formed as the products of three objective hyperpriors for the hypermean and four objective hyperpriors for the hypercovariance matrix. Priors on the hypermean include the constant prior, the conjugate prior and the recommended prior (12). Priors on the hypercovariance matrix include the constant prior, the hierarchical Jeffreys prior, the hierarchical reference prior and the recommended prior (13). Their simulation results show that the recommended combination of hyperpriors dominates all the others in terms of Bayes risk, and the constant prior on the hypercovariance performs the worst; however, neither of the two remaining choices for the hypercovariance is computationally easy. Considering the 4-level HNLM, Song et al. (2020) performed numerical experiments to compare the recommended prior with the constant prior for the hypercovariance matrices; the other two priors were excluded owing to intractable computation. Song et al. (2020)'s results likewise demonstrated the dominance of the recommended hyperpriors over the other priors. In conclusion, both Berger et al. (2020b) and Song et al. (2020) provide strong numerical evidence for the superiority of the recommended hyperpriors for use in the GHNL model, since the 2-level and 4-level HNLMs are both special cases of the GHNL model.

Discussions
We have proposed a generalized hierarchical normal linear model applicable to nested data with complex structures. The GHNL model proves to be equivalent to an LMM, while the GHNL model is more natural for researchers modelling nested data from scratch, especially when incorporating covariates at high levels. Like generalizations of the simple normal linear model, the GHNL model can be generalized to a hierarchical model with a generalized linear model in the first level, so that discrete observations can be handled. Besides, the first level (or even higher levels) of the GHNL model can also be extended to the setting of semiparametric regression models, such as the single index model and the partially linear model. The techniques of modelling and investigation in this paper can be applied to the linear parts of the models mentioned above. The statistical analysis could be complicated, however, and such explorations are beyond the scope of this paper. Berger et al. (2020b) put an end to the long search for appropriate hyperpriors in hierarchical modelling and investigated their properties comprehensively to justify the recommendation. Nonetheless, when it came to the propriety of the resulting posterior, they only conjectured that it holds for use at any level of a general hierarchical normal model; the conditions for this were not given. To complete the story, we have studied the conditions for the posterior to be proper in more general situations than Berger et al. (2020b) when employing the recommended prior for the GHNL model. Theorems 3.1 and 3.3 present the main results, and Corollaries 3.1 and 3.3 reduce the computational burden by defining kernel sets for d ≥ 2 and d = 1, respectively. In addition, Theorems 3.2 and 3.4 provide powerful tools of simple form for checking posterior propriety for d ≥ 2 and d = 1, separately. User-friendly guidance for checking posterior propriety is supplied as well. Note that our results only present sufficient conditions; necessary conditions have not been discussed. The reason is that the derivation of a lower bound on the integrated likelihood of the hyperparameters is intractable. Moreover, it is not worthwhile to investigate necessary conditions, as the derived upper bounds are tight enough that the corresponding sufficient conditions are very modest, according to the remarks and examples in Section 3. Finally, an efficient and powerful Gibbs sampling method for sampling from the posterior has been introduced, overcoming the computational bottleneck that previously proposed sampling methods work only in low dimensions, or in moderate dimensions inefficiently. Numerical evidence supporting the superiority of the recommended prior for hierarchical models was presented in Berger et al. (2020b) and Song et al. (2020).
Although we have made much progress in hierarchical linear modelling, a major obstacle to applying our results is that the variance component for the first level is supposed to be known, which is hardly satisfactory in practice. If we assume an unknown covariance matrix Σ_0 for the first level and specify prior (13) on it, the exponential term within the likelihood can no longer simply be dropped when deriving the upper bound; otherwise, the resulting integral is always infinite. An upper bound on the exponential term with respect to the eigenvalues of the covariance matrices is very tricky to obtain, and the condition for the integrability of the resulting integral remains to be studied further. Thus, GHNL modelling with unknown Σ_0 is left as a sequel to this paper.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Funding
The research was supported by the National Natural Science Foundation of China.

Appendix

Lemma A.2: For n × n symmetric matrices A_j, j ∈ [r], the inequalities (A11) and (A12) hold.

Proof: For (a), given an n × 1 non-zero vector x, the claim follows by (A10); the proof of (a) is completed by using (A10) again. For (b), it suffices to prove (A12) for any j ∈ [r]. Since A_l ≥ 0, l ∈ [r], the Rayleigh quotients satisfy the stated inequality for any j ∈ [r] and any x ∈ R^n with x ≠ 0. Minimize both sides of the inequality above over {x | x ∈ U, x ≠ 0} first, and then take the maximum over {U | U ⊆ R^n, dim(U) = k}; (A12) is then easily obtained by using Lemma A.1.

A.1 Proof of Lemma 3.1
For (a), by Lemma A.2 (a), it suffices to prove that λ_max(X_j A_j X_j^⊤) ≤ C_1 λ_max(A_j), j ∈ [r]. For any j ∈ [r], applying (A10) yields a maximizer x* of the Rayleigh quotient R(X_j A_j X_j^⊤, ·). It is obvious that X_j^⊤ x* ≠ 0; otherwise R(X_j A_j X_j^⊤, x*) = 0, which is a contradiction. In addition, since ⟨X_j^⊤ x*, X_j^⊤ x*⟩ ≤ λ_max(X_j X_j^⊤) ⟨x*, x*⟩ ≤ C_1 ⟨x*, x*⟩, we have R(X_j A_j X_j^⊤, x*) ≤ C_1 R(A_j, X_j^⊤ x*) ≤ C_1 max{R(A_j, z) | z ∈ R^{p_j}, z ≠ 0} = C_1 λ_max(A_j).
Therefore, we have proved part (a).
For (b), we only need to prove that λ_k(H) ≥ (C_2 / r) ∑_{j=1}^{r} a_{jk}, k ∈ [n].
It follows from Lemma A.2 (b) that it suffices to bound λ_k(X_j A_j X_j^⊤) from below for any k ∈ [n]. Since rank(X_j A_j X_j^⊤) = rank(A_j), we have λ_k(X_j A_j X_j^⊤) = 0 for p_j < k ≤ n. Thus, it remains to show (A13). Firstly, for any j ∈ [r], we introduce a linear transformation L_j : R^n → R^{p_j} defined by L_j(ν) = X_j^⊤ ν, ν ∈ R^n. The kernel space of L_j is denoted by Ker(L_j) = {ν ∈ R^n : L_j(ν) = 0}. Since X_j is of full column rank p_j ≤ n, the dimension of the orthogonal complement of Ker(L_j) is p_j, i.e., dim(Ker(L_j)^⊥) = p_j. Thus, the mapping L_j : Ker(L_j)^⊥ → R^{p_j} is one-to-one. For any U ⊆ R^{p_j} with dim(U) = k, k ∈ [p_j], define L*_j(U) = {ν ∈ Ker(L_j)^⊥ : L_j(ν) = x, x ∈ U}.
It is obvious that L*_j(U) ⊆ Ker(L_j)^⊥ and dim(L*_j(U)) = k. For any U ⊆ R^{p_j} with dim(U) = k and any x ∈ U, there exists one and only one ν ∈ L*_j(U) such that L_j(ν) = x. Since ⟨x, x⟩ = ν^⊤ (X_j X_j^⊤) ν ≥ C_2 ⟨ν, ν⟩, we have R(X_j A_j X_j^⊤, ν) ≥ C_2 R(A_j, x).
(A14) By Lemma A.1, (A15) holds for k ∈ [p_j]. Minimizing both sides of inequality (A14) over {x | x ∈ U, x ≠ 0} first, and then taking the maximum over {U | U ⊆ R^{p_j}, dim(U) = k}, (A13) is easily obtained by using Lemma A.1 and (A15).

A.2 Proof of Lemma 3.2
The domain Λ of the integral can be divided into Λ_0 = {λ | 0 ≤ λ_j ≤ 1, j ∈ [k]} and its complementary regions, on which some coordinates exceed one. To verify condition (b), we only need to justify statement (A16) via (A17) below; also, we assume condition (a) is always satisfied hereafter.
The reason why formula (A17) is required is that it plays an important role in verifying condition (A16).