Gram–Schmidt–Fisher scoring algorithm for parameter orthogonalization in MLE

: The estimation of parameters is a key component in statistical modelling and inference. However, parametrization of certain likelihood functions could lead to highly correlated estimates, causing numerical problems, mathematical complexities and difficulty in estimation or erroneous interpretation and subsequently inference. In statistical estimation, the concept of orthogonalization is familiar as a simplifying technique that allows parameters to be estimated independently and thus separates information from each other. We introduce a Fisher scoring iterative process that incorporates the Gram–Schmidt orthogonalization technique for maximum likelihood estimation. A finite mixture model for correlated binary data is used to illustrate the implementation of the method with discussion of application to oesophageal cancer data.


Introduction
The problem of estimating parameters is one of the key stages in fitting a statistical model to a set of data. Typically, the model will contain some parameters which are not of interest in themselves, but whose values affect the inferences that we make about the parameters of direct interest. Parametrization used when working with certain distributions could lead to high correlations of the maximum likelihood estimates. Maximization algorithms converge rapidly if the initial estimates are good, the likelihood function is well approximated by a quadratic in the neighbourhood of the parameter space and the information matrix is well conditioned, which means that the parameter estimates are not strongly intercorrelated. High intercorrelation of parameters causes numerical problems, and difficulty or erroneous interpretation in the parameter estimation. In statistical estimation, the concept of orthogonalization is familiar as a simplifying technique that allows parameters to be estimated independently, through rotation of the parameter space. Computationally, all that is required is that the original parameters should be computable from the new ones, and vice-versa. An injective transformation will therefore be desirable, though not necessary. Ross (1970) gave a comprehensive discussion of techniques that can be used to reduce correlation in particular problems. Among them are sequential and nested maximizations. Kalbliesch and Sprott (1970) discuss methods aimed at eliminating nuisance parameters from the likelihood function so that inference can be made about only the parameters of interest. Amari (1982Amari ( ,1985 derived from a theoretical point of view, a general criteria for the existence and construction of orthogonal parameters for statistical models using differential geometry. Cox and Reid (1987) gave a more general procedure and covered the advantages in maximum likelihood estimation. The approach, though popular in theoretical statistics, is computationally impractical in many application situations. Willmot (1988) discussed orthogonal parametrization but only for a two-parameter family of discrete distributions. His method, however, does not allow for higher parametric problems. Hurlimann (1992) discusses the existence of orthogonal parameterization to the mean, characterized by a partial differential equation involving the mean, the variance and cummulant generating functions. Godambe (1991) deals with the problem of nuisance parameters, within a semi-parametric set-up which includes the class of distributions associated with generalized linear models. Their technique uses the optimum orthogonal estimating functions of Godambe and Thompson (1989). Bonney and Kissling (1986) applied the Gram-Schmidt orthogonalization (Clayton, 1971) to multinormal variates and presented an application in genetics. Kwagyan, Apprey and Bonney (2001) presented the idea of Gram-Schmidt parameter orthogonalization in genetic analysis. A major consequence of parameter orthogonalization is that the maximum likelihood estimate of one parameter is not affected or changes slowly with the misspecification of the other. The most direct interpreted consequence is that the maximum likelihood estimates of orthogonal coordinates are asymptotically independent.
The remainder of this paper is organized as follows. In Section 2, the disposition model for clustered binary data proposed by Bonney (1995Bonney ( ,1998a and further investigated by Kwagyan (2001) is introduced and used as a motivation. In Section 3, an overview of maximum likelihood estimation and parameter orthogonalization is then presented. Section 4 develops the proposed Gram-Schmidt parameter orthogonalization and iterative scheme for maximum likelihood estimation. In section Section 5, fitting algorithm of the disposition model is discussed with application to oesophageal cancer data in Chinese families. Finally, Section 6 wraps up with discussion of the methodology and outlines directions for further research.

Model for correlated binary data
Consider a sample of N groups, each of size n i , i = 1, … , N; and let Y i = (Y i1 , … , Y in i ) T denote the vector of binary outcomes for the ith group. It is postulated that the measures of the outcomes within a group are possibly correlated. Further suppose the jth subject in the ith group has 1 × p individual-specific covariates X ij = (x ij1 , … , x ijp ) and let the ith group have group-specific covariates Z i = (z 1 , … , z q ). Let ij denote the conditional probability of Y ij = 1 given Y ij � = 1; that is We call ij the individual disposition which is simply interpreted as the probability of the outcome on one unit given another unit from the same cluster has the attribute. This in essence is an indication of aggregation. Assume further that within the group, a pair of observations satisfy the relation where i is common for all pairs. Clearly, i = 1 implies independence of the observations. Thus, i is a measure of the departure from independence and is called the relative disposition. With the above definitions, Bonney (1995Bonney ( ,1998a, and from further investigation through a latent mixture formulation by Kwagyan (2001), has shown that the joint distribution for the N groups can be based on The i and ij are generally modelled as where M(Z i ) is function of mean effects, D(Z i ) is a measure of within-group dependence and W(X ij ) a function describing the effect of the individual covariates. Typically, M(Z i ), D(Z i ) and W(X ij ) are modelled, respectively, as where and are unknown parameters to be estimated. When data consist of a few large clusters or truncated or size-biased samples, Bonney (1998b) has shown that there is little or no information to separate the effects of M(Z) and D(Z). The estimates of the parameters tend to be highly correlated. Similar problems occur in other approaches as well. The popular estimating equations approach, GEE (Liang & Zeger, 1986), suffers the same problems when there are only a few large clusters (see discussion of Prentice, 1988).

Maximum likelihood estimation and parameter orthogonalization
There are different statistical methods for estimating parameters, but the approach most commonly used is that based on maximum likelihood estimation. Maximum likelihood estimates of parameters are those values of the parameters that make the likelihood function a maximum. Let = (y 1 , y 2 , … , y n ) be observed data from a population with a probability distribution P(y; ) indexed by the vector of unknown parameters = ( 1 , … , p ). The likelihood function for is the joint distribution According to the likelihood principle, ̂ is regarded as the value of which is most strongly supported by the observed data. In practice, ̂ is obtained by direct maximization of Log L( ) or by solving the set of equations from the score function, U( ),where Suppose is a p parameter vector that is partitioned into two vectors 1 and 2 of lengths p 1 and p 2 , respectively, p 1 + p 2 = p . Cox and Reid (1987) define 1 to be orthogonal to 2 if the elements of the information matrix satisfy If this holds for all in the parameter space, it is sometimes called global orthogonality. If it holds at only one parameter value say 0 , then 1 and 2 are said to be locally orthogonal at 0 . We now discuss Cox and Reid (1987) approach for the construction of orthogonal parameters.
Suppose that initially the likelihood is specified in terms of ( , ) , = ( 1 , … , p ) and let l( , ) be the log-likelihood function. Assume is the parameter of interest and the set of nuisance parameters.
It is easiest to think of the original parameters as some function of the new parameters, , that is Then, the log-likelihood function can be expressed as where now l * refers to the log-likelihood in the new parametrization. Taking derivatives of this equation with respect to the new parameters, we have by the chain rule; and The derivatives of are not functions of the data, and hence are constant with respect to expectation. Therefore, after taken expectations with respect to the distribution of the data indexed by the parameters { , } ,the last term in Equation 3.1 is zero. If the parameters are orthogonal, then again from Equation 3.1, so that the orthogonality equations are We require the transformation from ( , ) to ( , ) to have a non-zero Jacobian; hence, E l This is a system of p differential equations which must be solved for ( , ). In fact, since does not enter explicitly into the equations, the solution for j can contain an arbitrary function of as the integrating constant. It is noted in the discussion of Cox and Reid that although it is sometimes theoretically possible to find a differential equation, simple explicit solutions of the differential equation were not feasible for the some models. There are also cases where explicit solutions are possible, but the original nuisance parameters could not be explicitly expressed in terms of the orthogonal parameter (Hills, 1987). In general, global orthogonalization can also not be achieved by this approach.

Gram-Schmidt parameter orthogonalization
The Gram-Schmidt orthogonalization process (Clayton, 1971) is equivalent to the linear transforma- In linear regression set-up, * j is j adjusted for 1 , … , p , and b jk , j = 2, … , p; k = 1, … , (j − 1) are the multiple regression coefficients. Writing this using matrix notation, we have where , the transformation matrix, is lower triangular with ones along the diagonal. The transformation matrix is chosen so that * 1 , * 2 , … , * p are mutually uncorrelated. The Jacobian of the transformation is unity. Suppose is the covariance matrix of Θ, then the covariance matrix of * , * = B B T is a diagonal matrix. We recommend that the parameters be ordered in terms of interest.
This ensures that the parameters of most interest are least affected by round-off errors.

Evaluation of the transformation matrix
To evaluate the transformation matrix, , let , are elements of the covariance matrix. The orthogonality relation, Thus, the system of linear equations to be solved for entries of the transformation matrix based on the covariance matrix is: Consequently, solving the system of linear equations (4.2), we obtain the elements of the transformation matrix as where I r(j−k) are entries of the information matrix. The observed information matrix would be a good approximation of the expected information, if there is difficulty evaluating it.
For illustration, when p = 4, we have Gram-Schmidt orthogonalization process in matrix notation as, The orthogonalization relationship of the parameters ( * 1 , * 2 , * 3 , * 4 ) implies And so the system of linear equations to be solved for the elements of the transformation matrix is: Let Q be the matrix of coefficients for the system of equations (4.4) Above; then, we note that Q is a patterned block diagonal matrix. Furthermore, let D r (r = 1, 2, 3), be the block diagonal of Q; then, in this illustration, Having obtained * = ( * 1 , … , * p ) the original parameters, = ( 1 , … , p ) can be obtained recursively as or writing this using matrix notation, we have

Block orthogonalization
Further suppose the vector is the set of parameters of interest. Then, the Gram-Schmidt orthogonalization process computes the new set of parameters in terms of the original parameters as follows: where B 21 , B 31 and B 32 are matrices with dimensions q × r, p × q and p × r, respectively.
In matrix notation, we write or where is a lower triangular block matrix whose diagonal unit matrix is the transformation matrix chosen such that * , * , * are mutually uncorrelated and where * T is a block diagonal matrix.
The Jacobian of the transformation is unity. The only way in which this procedure could break down would be if one of the vectors of is identically zero. From the method of formation of * , it is clear that it is a linear combination of the vectors in . If is identically zero, this means that are linearly dependent, thus contradicting the assumption of independence. Extensions to more than three sets of vectors of parameters readily follow.

Gram-Schmidt-Fisher scoring algorithm
We introduce a modification of the Fisher scoring algorithm incorporating the Gram-Schmidt process to obtain an information matrix which is diagonal and thus ensuring the approximate (near) orthogonality and consequently the stability of the estimates of the new parameters. Since the transformation is linear and injective, the original parameters are readily obtained.
Let l( ; ) be the log-likelihood in the original Parametrization; then, Fisher scoring algorithm is given by the iterative routine where ( ) is the expected information matrix and ( ) is the score function.
Suppose is transformed to * through the Gram-Schmidt orthogonalization process. Let = (b ij ) be the matrix of transformation, defined as in Equation (3.4). Then, since is square and non-singular, we can write from Equation (4.5).

Algorithmic process
The proposed Gram-Schmit-Fisher scoring algorithm is a 2-stage iterative process that oscillates between Equations (4.3) and (4.6) until convergence and is described iteratively as follows: 1 Start with an initial estimate of the original parameter, , denoted by (0) .

Example
We consider a sample problem which concerns inference about the difference between two exponential means. Let Y 1 and Y 2 be the independent exponential random variables with means and ( + ), respectively. Then, the joint distribution is given by the function The score vector is The information matrix is and its inverse, the variance-covariance matrix, is

Cox and Reid approach
Orthogonal parametrization, following Cox-Reid method, requires solving the differential equation

Gram-Schmidt approach
The proposed Gram-Schmidt approach requires seeking a transformation ( , ) to ( , ( , )) such that = − b 21 and solving the equation From the variance-covariance matrix, we obtain Just like the Cox-Ried approach, it is impractical to obtain or express uniquely in terms of and and subsequently formulate a modified approximate orthogonal joint distribution function. Therefore, one could proceed iteratively, using the proposed Gram-Schmidt-Fisher scoring algorithm.

Fitting the disposition model
We will present procedures for estimating the parameters in the correlated model Equation (2.1).
The calculations are standard, and so we will only outline the results. We write the likelihood of the joint distribution of the ith group as where Let and define the following The contribution of the i-th group to the log-likelihood is the term and the contribution to the score function is the term The parameter estimates can subsequently be obtained by the Gram-Schmidt-Fisher scoring algorithm given by *

Application to oesophageal cancer in Chinese families
This application involves the study of oesophageal cancer in 2951 nuclear families collected in Yangcheng County, Shanxi Province in China (Kwagyan, 2001). In this study, we consider as a group the nuclear family unit. The outcome variable is whether an individual is affected with oesophageal cancer or not. The objective of the study is to assess the presence and aggregation of oesophageal cancer in these families. Table 1 summarizes the distribution of number of affected individuals by family size. Of the 2951 families, 1580 (53%) had no affected individuals. The combined total number of individuals from the studied families was 14310, with mean ± sd age of 48.26 ± 18.17 years.
The respondent within a family has correlated outcomes which are influenced in part or wholly by the group variables as well as the variables on the individual respondent. The main objective here is to assess the presence of familial aggregation of oesophageal cancer adjusting for measured risk factors: sex, age and alcohol consumption. Here, the model for the regression analysis of disposition is parametrization as In this application, the vector of parameters = ( 1 , 2 , 3 ) is the set of parameters of interest.
Computations are performed using the computer program we developed CORRDAT (Bonney, Kwagyan, Slater, & Slifker, 1997b), which was linked with the likelihood optimization software MULTIMAX (Bonney, Kwagyan, & Apprey, 1997a). Computations can also be accomplished in MATLAB. Table 2 gives estimates of the correlation matrices. The correlations between the original parameters are quite high compared to those of the orthogonal parameters. In particular, the correlation between and is −0.179; the correlation between and 1 is −0.414; and that between and 3 is high, −0.841. The correlations between the orthogonal parameters are near zero or nonexistent. The correlation between * is and * is −0.026; the correlation between * and * 1 is now low, 0.004 and that between * and * 3 is also low, 0.0188.
As expected, the orthogonal likelihood converged in fewer iterations than the non-orthogonal one-the orthogonal likelihood converged after 17 iterations; the non-orthogonal one converged * after 56 iterations. Estimates of the parameters are summarized in Table 3. The maximum likelihood estimate of the relative dependence parameter, ̂= 0.577 with a corresponding asymptotic 95% confidence interval of (0.529, 0.625) suggests that there is significant familial aggregation of oesophageal cancer in the families sampled. Sex and age have a positive significance, while alcohol was negative. The results suggest that males are at a higher risk of getting oesophageal cancer than females and also that it is more prevalent in older people. The negative effect of alcohol seems to suggest that it has the propensity to lower the risk of oesophageal cancer in the Chinese population studied, perhaps if drank in moderation. In summary, we conclude that oesophageal cancer aggregates in the families sampled.

Conclusions
Parameter orthogonalization is used as an aid in computation, approximation, interpretation and elimination or reduction of the effect of nuisance parameters. The resulting algorithm that we have proposed based on the Gram-Schmidt orthogonalization process is computationally feasible. Unlike the approach of Cox-Reid which requires solving a system of differential equations to obtain orthogonality, the proposed method only requires solving a system of linear equations. Our approach has some similarities with the conjugate gradient algorithm, but is distinctly different from it. The conjugate gradient method is an optimization routine, whereas the method we have proposed is principally an approximate orthogonalization technique combined with a maximization algorithm to aid in parameter estimation. Amari (1985) claims that global orthogonalization is possible only in a special case. Our method allows for global orthogonalization, if the information matrix can be found or estimated. The approach we have proposed is based on exact calculations. The transformation is surjective, that is the original parameters can be sufficiently obtained. Clearly, the process can be cumbersome if a large number of parameters must be accurately estimated because of the inversion of the information matrix at every stage to obtain the covariance matrix and subsequently the transformation matrix. The advantages, however, are that convergence is generally rapid in iteration times, sure and accurate. Block parametrization of parameters would be desirable if interest is in sets of parameters and where the dimension of the parameter space is large. The work we have presented in this article is the first attempt to consider the use of Gram-Schmidt process for estimation of the parameters. It is possible, however, that closer scrutiny, practical considerations, numerical studies for numerical stability and properties, and simulation studies to evaluate and compare convergence times would suggest modifications and refinements to the methods we have discussed.
In conclusion, the algorithm that we have proposed provides an exceedingly convenient method for multi-parameter statistical problems first to reduce numerical difficulties and second improve accuracy in parameter estimation. The method would be efficient for use of function minimization in both linear and non-linear maximum likelihood estimations and particularly useful for small parameter space estimation problems.