PIRLS: Poisson Iteratively Reweighted Least Squares Computer Program for Additive, Multiplicative, Power, and Non-linear Models

A computer program is described that estimates Poisson regression coefficients, standard errors, Pearson χ2 and deviance goodness-of-fit statistics (and residuals), leverages, and Freeman-Tukey, standardized, and deletion residuals for additive, multiplicative, power, and non-linear models. Data used by the program must be stored in and input from a disk file. The output file contains an optional list of the input data, observed and fitted count data (deaths or cases), coefficients, standard errors, relative risks and 95% confidence intervals, the variance-covariance matrix, the correlation matrix, goodness-of-fit statistics, and residuals for regression diagnostics.


Poisson Distribution
Poisson regression models are commonly used for modeling risk factors and other covariates when disease rates are small, e.g., cancer rates [1,2,3,4,5]. An area of research in which Poisson regression of cancer rates is commonly employed is the investigation of radiation-induced cancer among Hiroshima and Nagasaki atomic bomb survivors [6,7,8] and individuals occupationally [9], medically [10], and environmentally [11] exposed to ionizing radiation. Poisson regression has also been used to investigate the effects of information bias on lifetime risks [12], diagnostic misclassification [13,14], and dose-response curves for radiation-induced chromosome aberrations [15].
Several computer programs have been developed that perform Poisson regression [16,17,18,19,20]. While the PREG program [3] can fit additive, multiplicative, and non-linear models, it does not provide the capability to fit power models nor perform regression diagnostics. This report documents the PIRLS Poisson regression algorithm for fitting additive, multiplicative, power, and non-linear models and regression diagnostics using leverages of the Hat matrix, deletion and standardized residuals.
The critical data required to apply the computer code are:
• Number of cases for each stratum
• Number of person-years for each stratum
• Design matrix of independent variables
There is only one output format, listing the input data, observed and fitted count data (deaths or cases), coefficients, standard errors, relative risks (RR) and 95% confidence intervals, variance-covariance matrix, correlation matrix, goodness-of-fit statistics, and residuals for regression diagnostics. The program and all of its subroutines are written in FORTRAN-77 and are run on a desktop PC with an AMD-K6 166 MHz (MMX) chip. The algorithm can be compiled and run on Unix machines as well. The PC version of the executable requires 180,224 bytes of random access memory (RAM) to operate. Average execution time is several seconds per run, depending on the amount of input data.

PROGRAM DESCRIPTION
The computer program proceeds as follows. During run time, the user interface with the program involves the following:
• Specifying the type of model to be fit (1-additive, 2-multiplicative, 3-power, and 4-non-linear)
• For additive models, specifying the power, ρ, which ranges from 1 for purely additive models to 0 for purely multiplicative models
• Specifying the number of independent variables (not cases or person-years) to be read from the input file
• Specifying the number of parameters to estimate
• For non-linear models, optional starting values for parameters
• For multiplicative models, specifying whether the regression coefficients are to be exponentiated to obtain RR and its 95% CI
• Specifying whether the variance-covariance and correlation matrices are written to the output file whose name is specified by the user
• Specifying whether the input data are written to the output file
• Specifying whether the observed and fitted count data (deaths or cases) are written to the output file
• Specifying whether the leverages, standardized residuals, and deletion residuals are written to the output file

Once the data above have been specified during run time, the program does the following:
1. A "starting" regression is performed by regressing the ratio of counts (deaths or cases) to person-years of follow-up, as the dependent variable, on the independent variables. For non-linear models, the default starting regression can be skipped by specifying starting values for each parameter.
2. The differences between the observed and fitted counts (error residuals) from the starting regression are used as the dependent variable and regressed on the cross product of the inverse of the information matrix (variance-covariance matrix) and the score vector to obtain the solution vector.
3. The solution vector is then added to the regression coefficients obtained during the starting regression.
4. Iterations are performed until the sum of the solution vector is below a fixed value called the convergence criterion, for which the default in PIRLS is 10^-5.
5. After convergence is reached, the standard errors are obtained from the diagonal of the variance-covariance matrix and, if specified, residuals for regression diagnostics are estimated.
6. Depending on the specifications made by the user, various data are written to the output file, and the program terminates.
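The iterative scheme above can be sketched in a few lines of code. The following is a minimal pure-Python illustration, not the program itself (PIRLS is FORTRAN-77): it fits a one-parameter multiplicative model λi = exp(β·xi) to invented toy data, updating β by the score divided by the information at each step, with the PIRLS default convergence criterion.

```python
import math

# Toy data (invented for illustration): deaths d, person-years T, one
# covariate x; model: rate lambda_i = exp(beta * x_i), so mu_i = lambda_i * T_i.
d = [100.0, 122.0, 149.0, 182.0]
T = [100.0, 100.0, 100.0, 100.0]
x = [0.0, 1.0, 2.0, 3.0]

beta = 0.0      # crude starting value (PIRLS uses a starting regression instead)
eps = 1e-5      # convergence criterion, the PIRLS default

for iteration in range(100):
    lam = [math.exp(beta * xi) for xi in x]              # predicted rates
    mu = [li * Ti for li, Ti in zip(lam, T)]             # expected deaths
    z = [xi * li for xi, li in zip(x, lam)]              # d(lambda)/d(beta)
    w = [Ti * Ti / mi for Ti, mi in zip(T, mu)]          # weights T^2 / mu
    Y = [(di - mi) / Ti for di, mi, Ti in zip(d, mu, T)] # observed - fitted rates
    score = sum(zi * wi * yi for zi, wi, yi in zip(z, w, Y))
    info = sum(zi * zi * wi for zi, wi in zip(z, w))
    delta = score / info                                 # solution "vector" (1 parameter)
    beta += delta
    if abs(delta) < eps:                                 # convergence criterion
        break
```

With these toy counts (rates roughly exp(0.2·x)), the loop converges in a handful of iterations to β ≈ 0.2.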

Binomial Basis of Poisson Regression
For large n and small p, e.g., a cancer rate of say 50/100,000 population, binomial probabilities are approximated by the Poisson distribution [21], given by

Pr(X = x) = e^(-µ) µ^x / x!,  x = 0, 1, 2, ...,

where µ is the mean of the Poisson distribution. As a rule of thumb, as long as npq is less than 5, the data may be treated as Poisson distributed.
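The quality of the approximation is easy to check numerically. The sketch below (an illustration, not part of PIRLS) compares the exact binomial probability with the Poisson probability for a rate of 50 per 100,000, i.e., n = 100,000 and p = 0.0005, at the modal count x = 50.

```python
import math

# Binomial pmf with large n and small p vs. the Poisson approximation
# with mu = n*p. Values mimic a rate of 50 per 100,000 person-years.
n, p = 100_000, 50 / 100_000
mu = n * p              # 50.0
k = 50                  # observe exactly 50 cases

binom = math.comb(n, k) * p**k * (1 - p)**(n - k)
poisson = math.exp(-mu) * mu**k / math.factorial(k)

print(f"binomial P(X=50) = {binom:.6f}")
print(f"poisson  P(X=50) = {poisson:.6f}")
```

Both probabilities come out near 0.056, agreeing to roughly three decimal places.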

Poisson Assumption
The Poisson assumption states that, for stratified data, the number of deaths, d, in each cell takes the values x = 0, 1, 2, ... according to the formula

Pr(d = x) = e^(-λT) (λT)^x / x!,

where λ is the rate parameter, equal to the number of deaths in each cell divided by the respective number of person-years (T) in that same cell. As an example, Table 1 lists a multiway contingency table for the British doctors data on coronary disease and smoking [21]. The rates for non-smokers, λi(0), are simply the ratio of the number of deaths to Ti in age group i and exposure group 0, whereas the rates for smokers, λi(1), are the ratio of deaths to Ti in age group i for exposure group 1.
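Computing the cell rates is a one-line operation. The numbers below are invented placeholders (not the actual Table 1 values), used only to show the λ = d/T calculation for one age stratum.

```python
# Hypothetical stratum (invented numbers, not Table 1): deaths and
# person-years for non-smokers and smokers in one age group.
deaths = {"non-smoker": 5, "smoker": 30}
person_yrs = {"non-smoker": 20000.0, "smoker": 50000.0}

# Rate parameter lambda = deaths / person-years for each cell
rates = {grp: deaths[grp] / person_yrs[grp] for grp in deaths}
for grp, lam in rates.items():
    print(f"{grp}: {1000 * lam:.3f} deaths per 1,000 person-years")
```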

Ordinary Least Squares
Let us now look at the sum of squares for common linear regression models. We recall that most of our linear regression experience is with the model

Y = Xβ + ε,

where Y is a column vector of responses, X is the data matrix, β is the column vector of coefficients, and ε is the error. Another way of expressing the observed dependent variable, yi, in model terms is

yi = β0 + β1 xi1 + β2 xi2 + ... + βj xij + εi,

and in terms of the predicted values, ŷi,

yi = β̂0 + β̂1 xi1 + β̂2 xi2 + ... + β̂j xij + ei.

The residual sum of squares is then Σ ei² = Σ (yi − ŷi)². Because e = Y − Xβ̂, we can rewrite this in matrix notation as

SSresidual = (Y − Xβ̂)'(Y − Xβ̂).
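A small numerical check makes the identity concrete. The sketch below (toy data, one predictor, closed-form least squares) verifies that summing the squared elementwise residuals gives the same value as the quadratic form (Y − Xβ̂)'(Y − Xβ̂).

```python
# One-predictor least squares on toy data, checking that
# sum(e_i^2) matches the matrix form (Y - Xb)'(Y - Xb).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.9]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
     sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar

fitted = [b0 + b1 * xi for xi in x]
resid = [yi - fi for yi, fi in zip(y, fitted)]
ss_res = sum(e * e for e in resid)   # elementwise and quadratic-form SS agree
print(f"b0 = {b0:.4f}, b1 = {b1:.4f}, SS_res = {ss_res:.6f}")
```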

Partitioning Error Sum of Squares for Weighted Least Squares
Suppose we have the general least squares model Y = Xβ + ε in which the errors no longer have equal variances, i.e., Var(ε) = Vσ² with V ≠ I. Following Draper and Smith [22], since V is symmetric and positive definite we can write V = PP' for a nonsingular matrix P. Now let

Q = P⁻¹X and f = P⁻¹ε,

and premultiply the model by P⁻¹, as in

P⁻¹Y = P⁻¹Xβ + P⁻¹ε, or P⁻¹Y = Qβ + f.

The transformed errors satisfy Var(f) = P⁻¹V(P⁻¹)'σ² = Iσ², so ordinary least squares applies to the transformed model. Applying basic least squares methods, the residual sum of squares is

(P⁻¹Y − Qβ)'(P⁻¹Y − Qβ),

and since (P⁻¹)'P⁻¹ = V⁻¹, this is equivalent to the weighted residual sum of squares

(Y − Xβ)'V⁻¹(Y − Xβ).

Partitioning Error Sum of Squares for Iteratively Reweighted Least Squares
According to McCullagh and Nelder [23], a generalized linear model is composed of the following three parts:
• A random component, λ, which is normally distributed with mean (1/Ti)·µi and variance µi
• A systematic component, the linear combination x'iβ of the independent variables
• A link between the random and systematic components, which for a linear model is λ̂i = λi(x'iβ), where λi is the observed rate parameter and λi(x'iβ) is the predicted rate parameter.

It should be pointed out that the rate parameter, λi(x'iβ), serves as its own link function. Thus, the rate parameter provides a linearization of our distributional models so that we can use multiple linear regression to obtain Best Linear Unbiased Estimators (BLUEs). In effect, the rate parameter allows us to reduce non-linear (non-Gaussian) models to linear models so that we can solve the system of linear equations. The rate parameter can take on various linear and non-linear functions; these are shown in Table 2. The power transformation was adapted from [24]. As shown in Table 2, the random component, λi, can take on a non-linear relationship with the systematic component, x. To linearize λi in a least squares approach by a succession of stages or iterations, we carry out a Taylor series expansion of λi about a point β0:

λi(x'iβ) ≈ λi(x'iβ0) + Σj [∂λi(x'iβ0)/∂βj](βj − βj0).

Rearranging this equation, we have

λi(x'iβ) − λi(x'iβ0) ≈ Σj zij δj,

where δj = βj − βj0 and zij = ∂λi(x'iβ0)/∂βj. The Jacobian matrix of partial derivatives of λi(x'iβ) with respect to the parameters β at the zeroth iteration is then Z = [zij]. The residuals, ei, now become

ei = λi − λi(x'iβ0).

Development of Dependent Variable, Y
Using the linearization above, let Yi = λi − λi0(x'iβ). We now rewrite the residuals so that the column vector Y of dependent values becomes

Y = [Yi] = [λi − λi0(x'iβ)].

Development of Weighting Matrix, W
We know that Var(cX) = c²Var(X), so substituting λ̂i = µ̂i/Ti we have

Var(λ̂i) = Var(µ̂i)/Ti² = µ̂i/Ti².

Therefore, each diagonal element of the weight matrix (the inverse of this variance) becomes

wii = Ti²/µ̂i.

Because λ̂i can be very small and Ti very large, and with a computer it is better to divide by a larger number than a smaller one, we choose this weight.
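A quick sketch (toy numbers, not PIRLS output) shows the weight construction and confirms that wii is exactly the reciprocal of Var(λ̂i) = µ̂i/Ti².

```python
# Inverse-variance weights w_ii = T_i^2 / mu_i for the diagonal of W
# (toy person-years and fitted counts).
T = [1000.0, 2000.0, 3000.0]
mu = [5.0, 12.0, 21.0]         # fitted expected counts

w = [Ti * Ti / mi for Ti, mi in zip(T, mu)]

# Check: each weight is the reciprocal of Var(lambda_hat_i) = mu_i / T_i^2
var_lam = [mi / (Ti * Ti) for Ti, mi in zip(T, mu)]
print(w)
```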

Defining Maximum Likelihood Solution Vector, δ
Let V⁻¹ = W, and now substitute our new definition of the residuals into the weighted residual sum of squares derived above to obtain

S = (Y − Zδ)'W(Y − Zδ) = Y'WY − (Zδ)'WY − Y'WZδ + (Zδ)'WZδ.

The transpose of the term (Zδ)'WY is Y'WZδ because (AB)' = B'A'. In addition, since (Zδ)'WY is a 1 × 1 matrix, or scalar, it has the same value as its transpose. Thus, S is transformed into

S = Y'WY − 2δ'Z'WY + δ'Z'WZδ.

From Searle [25] we know that ∂(a'δ)/∂δ = a and ∂(δ'Aδ)/∂δ = 2Aδ for symmetric A. In order to minimize S with respect to the δ parameters, we take the first partial derivative,

∂S/∂δ = −2Z'WY + 2Z'WZδ = 0,

which provides us with the consistent (normal) equations

(Z'WZ)δ = Z'WY, i.e., δ = (Z'WZ)⁻¹Z'WY.

At the ith iteration, the values of the parameters are

β(i+1) = β(i) + δ(i).

Convergence is reached at the point when Σj |δj| < ε, where ε is 10⁻⁵ by default in PIRLS.
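One update δ = (Z'WZ)⁻¹Z'WY can be computed explicitly for a two-parameter problem. The sketch below uses invented toy values for Z, W, and Y and an explicit 2×2 inverse; the resulting δ can be checked against the normal equations (Z'WZ)δ = Z'WY.

```python
# One solution-vector update  delta = (Z'WZ)^{-1} Z'WY  for p = 2
# parameters, with an explicit 2x2 inverse (toy values).
Z = [[1.0, 0.5],
     [1.0, 1.5],
     [1.0, 2.5]]            # Jacobian (n x p)
W = [4.0, 2.0, 1.0]         # diagonal weights
Y = [0.3, -0.1, 0.2]        # working dependent variable

# Elements of Z'WZ (2x2) and Z'WY (2x1)
a = sum(W[i] * Z[i][0] * Z[i][0] for i in range(3))
b = sum(W[i] * Z[i][0] * Z[i][1] for i in range(3))
c = sum(W[i] * Z[i][1] * Z[i][1] for i in range(3))
s0 = sum(W[i] * Z[i][0] * Y[i] for i in range(3))
s1 = sum(W[i] * Z[i][1] * Y[i] for i in range(3))

det = a * c - b * b
delta = [(c * s0 - b * s1) / det,
         (-b * s0 + a * s1) / det]
print(delta)
```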

Variance-covariance Matrix
The variance-covariance matrix, with parameter variances on the diagonal and covariances on the off-diagonals, at convergence is identical to the inverse of the information matrix, i.e., the inverse of the weighted dispersion matrix,

Var(β̂) = (Z'WZ)⁻¹.

Standard Errors of Regression Coefficients
The standard errors of the regression coefficients are the square roots of the diagonal elements of the variance-covariance matrix, s.e.(β̂j) = √σjj. A Wald test for each parameter is simply β̂j/s.e.(β̂j) and indicates significance if the ratio is < −1.96 or > 1.96 at the α = 0.05 level of significance.

Residuals and Regression Diagnostics
The error residual, ei = µi − µ̂i, has mean 0 and measures the lack of concordance of the observed µi against the fitted value µ̂i. The greater the value of ei, the worse the fit of the model. Pearson residuals,

rPi = (µi − µ̂i)/√µ̂i,

provide a measure of the residual variation. The deviance residuals allow the investigator to determine the goodness of fit of the model, i.e., how well the observed data agree with the fitted values. Freeman and Tukey [26] introduced a variance-stabilized residual for Poisson models given by

rFTi = √µi + √(µi + 1) − √(4µ̂i + 1).

The leverage [23] for each observation, hi, provides information about the geometric distance between a given point (zi1, zi2, ..., zij) and the centroid of all points in the predictor space. Individual leverages, hi, are obtained from the diagonal of the "Hat" matrix given by

H = W½Z(Z'WZ)⁻¹Z'W½.

Because Σi hi = trace(H) = p, observations with high leverage can approach unity. A criterion introduced by Hoaglin and Welsch [27] for determining high leverage states that a point has high leverage if hi > 2p/n. Points with high leverage are located at a substantial distance from the fitted regression line, and act to "pull" the fitted line toward their location. Increases in the goodness of fit of a model can be obtained by refitting a model after the observation(s) with high leverage have been removed.
Cook [28] introduced the deletion residual, ∆(β)−ij, which measures the difference between β̂ when the observation is included in the model and β̂ when the observation is excluded. The more positive the value of ∆(β)−ij, the more influence an observation has when compared with other observations. Individual ∆(β)−ij residuals are obtained from an n × p matrix. Finally, the standardized residual,

r̃i = rPi/√(1 − hi),

is used to measure the model error introduced by a particular observation, scaled so that the residual has constant variance. Values of r̃i that exceed 1.96 tend to be outliers at the α = 0.05 level of significance.
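The diagnostics are straightforward to compute once the fit is in hand. The sketch below (toy values, a single-parameter fit so the hat diagonal reduces to hi = wi·zi²/Σ wj·zj²) computes Pearson, Freeman-Tukey, leverage, and standardized residuals.

```python
import math

# Regression diagnostics for a one-parameter fit (toy values).
d = [4.0, 9.0, 18.0]           # observed counts
mu = [5.0, 10.0, 16.0]         # fitted counts
z = [1.0, 2.0, 3.0]            # Jacobian column
w = [20.0, 10.0, 5.0]          # weights T^2/mu (toy)

info = sum(wi * zi * zi for wi, zi in zip(w, z))   # Z'WZ (scalar, p = 1)

pearson = [(di - mi) / math.sqrt(mi) for di, mi in zip(d, mu)]
freeman_tukey = [math.sqrt(di) + math.sqrt(di + 1) - math.sqrt(4 * mi + 1)
                 for di, mi in zip(d, mu)]
leverage = [wi * zi * zi / info for wi, zi in zip(w, z)]   # diag of hat matrix
standardized = [r / math.sqrt(1 - h) for r, h in zip(pearson, leverage)]
```

Note that the leverages sum to p = 1, matching trace(H) = p, and the high-leverage cutoff here would be 2p/n = 2/3.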

Goodness-of-Fit Statistics
The Pearson χ2 goodness of fit is

χ2 = Σi (µi − µ̂i)²/µ̂i,

and the deviance goodness of fit is

D = 2 Σi [µi log(µi/µ̂i) − (µi − µ̂i)],

which is also χ2 distributed. If a model fits, its χ2 and D will be lower than the degrees of freedom (n − p).
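Both statistics are simple sums over the cells. The sketch below computes them for invented observed/fitted counts; by convention the µ·log(µ/µ̂) term is taken as 0 when an observed count is 0.

```python
import math

# Pearson chi-square and deviance for observed counts d and fitted mu
# (toy values).
d = [10.0, 15.0, 22.0, 34.0]
mu = [11.2, 13.8, 23.5, 32.5]

chi2 = sum((di - mi) ** 2 / mi for di, mi in zip(d, mu))

# d*log(d/mu) is defined as 0 when d == 0
dev = 2.0 * sum((di * math.log(di / mi) if di > 0 else 0.0) - (di - mi)
                for di, mi in zip(d, mu))
print(f"chi2 = {chi2:.4f}, deviance = {dev:.4f}")
```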

Structure
Program flow is outlined in Figure 1.

Auxiliary Algorithms
MATMUL [29] performs matrix multiplication for the matrix manipulations. TRNPOS transposes arrays as needed for matrix multiplication. MATINV calls SVDCMP, based on [30], to invert the information matrix using singular value decomposition. ALG and GAMAIN are used for estimating tabled values of χ2.

Restriction
All input parameters are checked for nullity and a fault message is returned if there is an illegal entry.

Time
A thorough investigation of absolute timing has not been performed; however, it should be noted that execution time is a function of the number of model parameters and effects to be considered.

Precision
The algorithm may be converted to double precision by making the following changes:
1. Change REAL to DOUBLE PRECISION in the algorithm.
2. Change the constants to double precision.
3. Change EXP to DEXP and ALOG to DLOG in all applicable routines.

Data Arrangement and Partial Derivatives
We shall enumerate the steps required for fitting a multiplicative model. Before we start the regression run, however, we must set up the data file and also determine the first partial derivatives of the rate ratio λ with respect to each parameter in the model.

1. Arranging the data in an ASCII text file. The first step is to arrange our data in an ASCII text file. The typical ASCII text file arrangement of data for a Poisson regression run using data in Table 1 is shown below, where the person-years in Table 1 are divided by 1,000 and the covariates denoting the age group are dummy coded (0,1), with a 1 indicating that the rates are from a given cell and whether or not the subgroup smoked.
2. Determining partial derivatives of λi(x'iβ) with respect to each βj. Next we must write out the partial derivatives of the rate parameter with respect to each parameter (these will be used later in the algorithm). The partial derivatives are used in the Jacobian Z matrix to form the scores during each iteration; thus, each partial derivative forms an element, zij, of the Jacobian Z matrix. Given that the partial derivative of an exponentiated function of the form e^u is e^u ∂u, the partial derivative of the multiplicative λi(x'iβ) = exp(x'iβ) with respect to each parameter βj is ∂λi/∂βj = xij exp(x'iβ).
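A worked check of one such element (toy covariates and coefficients, not Table 1 values): the analytic derivative xj·exp(x'β) should agree with a central finite difference of λ.

```python
import math

# For lambda = exp(x'beta), the partial derivative with respect to
# beta_j is x_j * exp(x'beta). Check one element against a central
# finite difference (toy values).
x = [1.0, 0.0, 1.0]            # covariate row x_i
beta = [0.2, -0.1, 0.5]

def lam(b):
    return math.exp(sum(xj * bj for xj, bj in zip(x, b)))

j = 2
analytic = x[j] * lam(beta)    # z_ij element of the Jacobian

h = 1e-6
bp = beta[:]; bp[j] += h
bm = beta[:]; bm[j] -= h
numeric = (lam(bp) - lam(bm)) / (2 * h)
print(f"analytic = {analytic:.8f}, numeric = {numeric:.8f}")
```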

Steps in the Algorithm
The following section discusses the steps taken by the algorithm for performing iteratively reweighted least squares.
2. Estimate the column vector of dependent variables for the log-linear model. The algorithm starts by reading in the data from the input file and then estimates the rate parameter, λi, for each record as the ratio of the number of observed deaths to the number of observed person-years. The log of λi is then calculated to obtain the dependent variable column vector

Y = [log(d1/(T1/1,000)), log(d2/(T2/1,000)), ..., log(d10/(T10/1,000))]'.

3. Construction of the X matrix of independent variables. In addition to the estimation of the dependent variable column vector Y, the algorithm constructs the X matrix of independent variables from the independent variables read in from the input file.

4. "Starting regression": log-linear regression to estimate initial values of parameters. Once the Y and X matrices are constructed, the algorithm fits the log-linear model, which in matrix terms is given as

β = (X'WX)⁻¹X'WY,

where the diagonal elements of W are set equal to T. Figure 1 shows the program flow in PIRLS for the various models. Table 3 shows the link functions for λ for each starting regression and the linear predictor used for estimating λi(x'iβ) during subsequent iterations.

Table 3: Link functions used during the "starting regression" and linear predictor used to estimate λ̂ during subsequent iterations.

Model                    Link function for λ    Form of predictor for λi(x'iβ)
Non-linear (optional)*   log(λi)                λi(x'iβ) = exp(x'iβ)

* For non-linear models, the user can specify starting values for parameters, rather than using the default multiplicative fit to obtain starting values.

5. Estimation of the predicted rate parameter. After the first regression or iteration, the regression coefficients, β̂j, are used to estimate the predicted rates λ̂i from the linear predictor.

6. Estimation of the expected number of deaths. The predicted rates are then multiplied by the observed person-years in each record to give the expected number of deaths, µ̂i = λ̂i Ti.

7. Estimation of the dependent variable. The expected numbers of deaths, µ̂i, are then subtracted from the observed deaths, µi, to give the dependent variable Yi = µi − µ̂i.

10. Construction of the Jacobian matrix of partial derivatives. Now the algorithm constructs the n × p Jacobian matrix Z composed of the first partial derivatives estimated above, which appears as Z = [zij].

11. Construction of the inverse variance weighting matrix. Since the Poisson variance of µ̂i changes at each iteration, we must weight the dispersion matrix with weights given as W = diag[wii], where wii = Ti²/µ̂i.

14. Repeat steps 5-13 above until convergence is reached. The above steps (5 through 13) are repeated until the Euclidean norm of the score vector is below some non-negative value ε, where ε is 10⁻⁵ by default in PIRLS. At this point, convergence is reached at a global maximum.

Application -Fitting a Non-linear Model
For this example, we will fit a non-linear model to estimate the number of stem cell colonies in spleens of laboratory mice into which irradiated cells were transplanted [31]. Table 4 shows the colonies formed, µ, number of trials, Ti, cell concentration, xi1, and radiation dose (Sv), xi2, arranged in tabular notation.
The non-linear model we will fit to obtain maximum likelihood estimates of colonies formed is a function of xi1, the concentration of transplanted cells, xi2, the radiation dose (Sv), and the parameters of interest β1, β2, and β3.
In order to build the Jacobian matrix, Z, we need to take the first partial derivatives of λi with respect to β1, β2, and β3, i.e., ∂λi/∂β1, ∂λi/∂β2, and ∂λi/∂β3. Since there are three β coefficients, the first column entries for row i of Z, zi1, are estimated from ∂λi/∂β1, the second column entries, zi2, from ∂λi/∂β2, and the third column entries, zi3, are numerically estimated from ∂λi/∂β3. When fitting this model, we chose to specify starting values of β1 = 7.6364, β2 = 0.9341, and β3 = 7.6364 from [3], rather than use the default multiplicative fit to obtain starting values. Input and output for this model using the data in Table 4 are shown in the Appendix.

Input
During run time, the user must respond to the following queries concerning each run:
• The type of model to be fit (1-additive, 2-multiplicative, 3-power, and 4-non-linear)
• For additive models, the power, ρ, which ranges from 1 for purely additive models to 0 for purely multiplicative models
• The number of independent (input) variables to be read in from the input file
• The number of parameters to estimate
• For non-linear models, optional starting values for parameters
• For multiplicative models, whether the regression coefficients are to be exponentiated to obtain RR and its 95% CI

The examples in the Appendix give listings of input files.

SAMPLE INPUT/OUTPUT CASES
The output for all runs is in tabular form in the ASCII text file whose name is specified by the user at run-time.
Four examples using an input file are given in the Appendix. Table 5 lists the names of the files that were used to program, link, and execute PIRLS. The source code is an ASCII text file, and the object and executable files were compiled with Microsoft FORTRAN Powerstation 4.0. There is one required input data file, whose name is specified at run time.

AVAILABILITY
The program and all of its subroutines are available from the Journal of Statistical Software free of charge at http://www.stat.ucla.edu/journals/jss/

APPENDIX: SAMPLE INPUT/OUTPUT CASES

Example 1: Additive (linear) model using data in Table 1 adapted from [21]. The input data file for this run is distribution file EXAMP1-3.DAT and is listed below. The screen queries to be entered by the user are as follows. On the first page of the output, we see that the model is indeed additive; we also see the input data and the value of the deviance goodness of fit at each iteration. The numbers of observed and expected deaths are given, as well as the Freeman-Tukey and Pearson residuals. At convergence, the coefficient for smoking (No. 6) is 0.5907. Thus, smoking adds 0.59 deaths per 1,000 person-years (or 59/100,000) of follow-up at each age. However, since the χ2 and D goodness-of-fit statistics are greater than the 4 degrees of freedom, we conclude that the model does not fit the data well. Table 6 shows the comparison between the modeled baseline rates and the modeled baseline rates to which 0.59/1,000 was added. Table 6: Interpretation of absolute risks from the additive model.

Example 2: Multiplicative model using data in Table 1 adapted from [21]. The input data file for this run is the same as that in Example 1.

The screen queries to be entered by the user are as follows. At convergence, the coefficient for smoking (No. 6) is 0.3545. Since we exponentiated the regression coefficients, we can interpret the RR for smoking and its 95% confidence interval as 1.43 (1.15, 1.76). This implies that the risk of coronary death in smokers is 1.43 times the age-specific rates of the non-smokers. The goodness-of-fit statistics for this run are greater than those generated in Example 1, indicating that the multiplicative model fits worse than the additive model. Table 7 shows the comparison between the modeled baseline rates and the modeled baseline rates multiplied by the RR of 1.43. Table 7: Interpretation of relative risks from the multiplicative model.

Example 3: Power model using data in Table 1 adapted from [21]. The input data file for this run is the same as that in Example 1.
The screen queries to be entered by the user are as follows. This run fitted ten models with ρ varying from one to zero, i.e., going from a purely additive link, ρ = 1, to a purely multiplicative link, ρ = 0. The model equation is λi(x'iβ)^ρ, where ρ is the power function. By inspecting the fits of the models as a function of ρ, one can see that for these example data the lowest deviance is obtained with ρ between 0.5 and 0.6. A single run can now be made to fit the model with ρ = 0.55 by following the directions given in Example 1, but specifying 0.55 for ρ instead of one. Table 8 lists the modeled baseline rates, βj^(1/0.55), and the modeled baseline rates to which β(smoke)^(1/0.55) was added.

Example 4: Non-linear model using cell concentrations and radiation doses to estimate the number of splenic stem cell colonies formed after transplantation into laboratory mice. The input data (see Table 4) for this run are provided in distribution file EXAMPLE4.DAT and are listed below: