The R Package bild for the Analysis of Binary Longitudinal Data

We present the R package bild for the parametric and graphical analysis of binary longitudinal data. The package performs logistic regression for binary longitudinal data, allowing for serial dependence among observations from a given individual and a random intercept term. Estimation is via maximization of the exact likelihood of a suitably defined model. Missing values and unbalanced data are allowed, with some restrictions. The code of bild is written partly in R language, partly in Fortran 77, interfaced through R. The package is built following the S4 formulation of R methods.


Introduction
This paper describes the R (R Development Core Team 2011) package bild (Gonçalves, Cabral, and Azzalini 2012) available from the Comprehensive R Archive Network at http: //CRAN.R-project.org/package=bild for the parametric and graphical analysis of binary longitudinal data. Similarly to the generalized estimating equations (GEE) approach when this is applied to binary response data, the present methodology works by introducing a parametric model for the marginal distribution of the response variable but it differs from the GEE approach on another front, in that the parametric analysis developed here is associated to a fully specified stochastic model for the individual profiles. Important work on the likelihood approach for discrete longitudinal data, with emphasis on the important special case of binary response, has been done by , Fitzmaurice, Laird, and Rotnitzky (1993), and Fitzmaurice, Laird, and Lipsitz (1994), via the so-called mixed-parametrization. However their approach is unsuitable to handle series of different length of response across individuals, and the interpretation of the association parameters is somewhat problematic; see the discussions of these papers.

Parametric models for binary data 2.1. Binary Markov chains
Denote by y it ∈ {0, 1} the binary response value at time t (t = 1, . . . , T i ) from subject i (i = 1, . . . , n), and by Y it its generating random variable whose mean value is P(Y it = 1) = θ it . For each observation time and for each subject, a set of p covariates, x it , is available. In our formulation the parameter of interest is the marginal probability of success, that is related to the covariates via a logistic regression model, where β is a p-dimensional parameter. The dependence structure of the process is taken to be a second order Markov chain, which is suitably parameterized so that the marginal parameter β retains its meaning irrespective of the serial dependence.
This set-up leads to consideration of the joint distribution of three components of the process at the time, (Y t−2 , Y t−1 , Y t ). To simplify notation, we drop the subscript i where it is not essential. Our choice is to impose the constraints where the notation OR(U, V ) denotes the odds ratio of the joint distribution of a pair of binary random variables (U, V ), and ψ 1 and ψ 2 denote two positive parameters. In the context of binary processes, dependence is more conveniently measured by odds ratios than by correlations, and conditions (2)-(3) provide a parametrization whose interpretation is similar to the partial autocorrelation of a Gaussian process, transferred to the odds ratio scale. The algebraic problem is finding the transition probabilities satisfying the above-stated conditions; see Gonçalves and Azzalini (2008) for more details and Gonçalves (2002) for a full account.
Therefore in the adopted formulation, interest lies in the marginal probability of success, and this is related to the covariates via the logistic regression model (1); hence β is the parameter of interest. Serial dependence is regulated by λ = (λ 1 , λ 2 ) = (log ψ 1 , log ψ 2 ), which we assume to be constant across time and individuals. Notice that, when λ 2 = 0, the Markov chain reduces effectively to first order dependence, and we recover the formulation of Azzalini (1994).

Likelihood inference
We shall now consider likelihood inference based on the a sample of n individual profiles, assumed to be independent from each other. The contribution from a generic individual to the log-likelihood for the parameters (β, λ) is where the three blocks on the right-hand side represent the contribution to the log-likelihood from y 1 , y 2 , and (y 3 , . . . , y T ), respectively, and where p j = P(Y t = 1|Y t−1 = j). The overall log-likelihood function is obtained as the sum of the n individual contributions of type (5).
Maximization of the log-likelihood must be performed by numerical methods. To improve speed of convergence, expressions of the score functions have been obtained; see the references quoted in Section 2.1.

Residuals
We introduce a form of residuals of a fitted model, to be used for diagnostic purposes. Denote by Y (s) the set of random variables {Y 1 , . . . , Y s } representing the portion of the i-th individual profile up to time s. The conditional mean given the past observations is denoted by for t = 2, 3, . . . , and t = 1 does not involves any conditioning, so m 1 = E(Y 1 ). Then define the standardized residual for t = 1, 2, . . .
Graphical analysis of residuals is difficult even in the simple case of logistic regression for independent data, due to the extreme discreteness of binary data. To alleviate the problem of discreteness, and at the same time to reduce the number of plots to examine, it is sensible to aggregate the residuals across individuals, at each given time point, in the form where the index i, now re-introduced, runs across the whole set of n individuals, or possibly an homogeneous sub-group of them.

Missing data
A frequent problem with longitudinal studies is the presence of missing data, since it is difficult to have complete records of all individuals, especially in cases when measurements are taken at occasions very distant in time. Missing values are allowed on the response, provided they are missing at random in the terminology of Little and Rubin (1987).
If missing data occur at the beginning or at the end of an individual profile, this poses no problems, since this case is equivalent to a designed unbalance in the length profile T i for that individual. Some restrictions exist for the presence of missing data when they occur in the middle of the profile. If the first order dependence model is considered the present implementation requires that only one missing value should appear between two significant observations. For the second order dependence model we need that the pattern of missing observations satisfies the requirement that missing data have two observed values on each side of the time sequence, except for the two end portions of the observation period, where no restriction is made. Therefore, if there is a missing value at time point t − 2, it is required that there are observations at time points t − 4, t − 3, t − 1, t.
In practice, the program performs the fit even when the above conditions are not fulfilled. In these cases, however, a warning message is printed since the log-likelihood is not computed exactly, with consequent slight inaccuracy of the results.

Random effects
Individual random effects b i ∼ N (0, σ 2 ) can be incorporated as an additive term to the linear predictor in (1), leading to the logistic model with random intercept where the b i 's are assumed to be sampled independently from each other.
The corresponding expression for the contribution of the i-th subject to the likelihood function is obtained by integrating over the distribution of b i in the expression of the likelihood for the fixed effect model evaluated at β (b i ) , which is the same of β with the intercept β 0 replaced by β 0 + b i . It is convenient to reparametrize ω = log σ 2 both for numerical convenience and to improve accuracy of the asymptotic approximation to the distribution of maximum likelihood estimates (MLE's). In explicit terms, the likelihood for the random intercept case is Clearly, the log-likelihood for the whole sample is given by In practice the integrals in (8) are computed using adaptive Gaussian quadrature. To improve efficiency of the numerical optimization of the log-likelihood, it is convenient to make use of its derivatives. See Gonçalves and Azzalini (2008) for more details and Gonçalves (2002) for a full description.
In most generalized linear mixed model (GLMM) formulations for binary data, an expression of type (7) is also used, but assuming independence of observations within a given individual, conditionally on b i . Since here we allow for the presence of serial dependence, the methodology can be used to test the assumption of conditional independence underlying a GLMM model, by examining the estimates of λ k 's and the associated quantities.
In interpreting the parameters β of (7), one must bear in mind that the inclusion of the random effect b i , as given by (7), alters the meaning of the β's with respect to their meaning in a model with fixed effects only, since clearly where η = x it β denotes the systematic part of the linear predictor.
To obtain the exact value on the left-hand side an integration is required, in practice via numerical methods. To avoid this integration, a simple and effective approximation has been considered by Zeger, Liang, and Albert (1988), which in our case amounts to evaluating the right-hand side of (9) with η replaced by a(ω) η, where and c = 16 √ 3/(15π). This adjustment is incorporated by the package in the graphical display of estimated probabilities, and referred to as "adjusted fit" in Section 3.1.
In addition to estimation of β and λ, it is of interest to obtain estimatesb i of the individual random effects, b i 's. The appropriate quantity to consider is the conditional expectation of b i given the observed value of the i-th individual profile y i , that is E(b i |y i ; β, λ), but its exact computation is difficult. A simple alternative follows naturally from the following argument: if the parameters β of the systematic component η i = x it β of (7) were available, one could estimate b i by fitting a simple logistic model, separately for each individual, regarding η i as a fixed constant. In practice one replaces β by its estimate to compute η i , and then fits a logistic regression model to y i , with η i treated as an "offset".
Once estimatesb i of the individual random effects are available, one can use the estimated conditional linear predictorb i + x itβ to compute transition probabilities. This process will deliver an approximation to the conditional mean which in turns allows computation of residuals r it , along the lines of Section 2.3. See Gonçalves and Azzalini (2008) for details.

Package overview
The package is built around its main function bild() which performs the fit of models of type described in the previous section by maximizing the log-likelihood according to some serial dependence structure (5).
Serial dependence of first order and second order Markovian type will be identified as MC1 and MC2, respectively; the corresponding parameters λ 1 = log ψ 1 and λ 2 = log ψ 2 are denoted log.psi1 and log.psi2. When a random intercept term is included, then notation for the dependence structure will be either MC1R and MC2R, depending of the order of serial dependence. In addition the dependence ind is allowed, to select independence.
The arguments used in a call to the function bild() are: bild(formula, data, time, id, subSET, aggregate, start, trace, dependence = "ind", method = "BFGS", control = bildControl(), integrate = bildIntegrate()) We summarize next the main arguments of bild() plus two auxiliary functions, bildControl() and bildIntegrate(), which help to regulate the working of the main function.
formula: Description of the model to be fitted of the form response~predictors.
data: data.frame whose structure is described in Section 3.2. Notice that certain components of data influence the working of the function.
time: String that matches the name of the time variable in data. By default, the program expects a variable named time to be present in the data.frame, otherwise the name of the variable playing the role of time must be declared by assigning time here.
id: String that matches the name of the id variable in data. By default, the program expects a variable named id to be present in the data.frame, otherwise the name of the variable playing the role of id must be declared by assigning id here.
subSET: Optional expression indicating the subset of data that should be used in the fit. This is a logical statement of the type (variable1 == "a" & variable2 > x) which identifies the observations to be selected. All observations are included by default.
aggregate: String that identifies the levels of the factor to perform the parametric fit in the plot method.
start: Starting values for the optimization can be defined through the argument start. The values in start vector depends on the structure of the dependence model and not on the parameters of model predictor. The structure of the vector start should be: (λ 1 ) when dependence = "MC1", (λ 1 , λ 2 ) when dependence = "MC2", (λ 1 , ω) when dependence = "MC1R" or (λ 1 , λ 2 , ω) when dependence = "MC2R". dependence: Expression stating which dependence structure should be used in the fit. The default value is "ind". According to the stochastic model chosen serial dependence and random effects are allowed. There are five options: "ind" (independence), "MC1" (first order Markov chain), "MC2" (second order Markov chain), "MC1R" (first order Markov chain with random intercept) or "MC2R" (second order Markov chain with random intercept).
method: The method to be used in the optimization process: "BFGS", "CG", "L-BFGS-B" and "SANN". The default is "BFGS". See optim for details.
control: bildControl() returns a list of algorithmic constants for the optimizer optim, via a call of the form: bildControl(maxit, abstol, reltol).
integrate: bildIntegrate() returns a list of constants that are used to compute integrals based on a Fortran 77 subroutine package QUADPACK for the numerical computation of definite one-dimensional integrals. The list is generated by a call of the form: bildIntegrate(li, ls, epsabs, epsrel, limit, key, lig, lsg). For given values of li and ls, the above-described numerical integration is performed over the interval (li·σ, ls·σ) to compute the integral given by (8) where σ = exp(ω/2) is associated to the current parameter value ω examined by the optim function. In some cases, this integration may generate an error, and the user must suitably adjust the values of li and ls. In case different choices of these quantities all lead to a successful run, it is recommended to retain the one with largest value of the log-likelihood. Integration of the gradient of (8) is regulated similarly by lig and lsg. which = 5 provides the parametric fitted model if the dependence structure is "ind", "MC1" or "MC2". If the dependence structure is "MC1R" or "MC2R" the parametric adjusted fit is provided and the user can set add.unadjusted = TRUE to obtain the unadjusted fit.
which = 6 provides individual mean profiles and is used only if the random intercept is present.
The show-method displays simple summary of a bild() object and the summary-method returns a more detailed list of summary statistics of the fitted model. Moreover, bild-class allows the user to extract several items produced by the maximum likelihood procedure. Besides the parameter estimates and other quantities featuring in the summary table of the fitted model, one can extract the residuals values, the fitted values and the transition probabilities. For a full description of the available quantities, see the list of slots of bild-class provided with the package documentation.

Data structure
The structure of the data is a data.frame. Each element of the data argument must be identifiable by a name. NA values can then be inserted in the response variable, provided that the missing values are missing at random (for details see Section 2.4). The response variable represent the individual profiles of each subject, it is expected a variable in the data.frame that identifies the correspondence of each component of the response variable to the subject that it belongs, by default is named id variable. It is expected a variable named time to be present in the data.frame. The time variable should identify the time points that each individual profile has been observed.
If a response profile is replicated several times, a variable called counts must be created accordingly. This vector is used for weighting the response profile indicating for each individual profile the number of times that is replicated. The vector counts must repeat the number of the observed replications for each individual profile as many times as the number of observed time points for the correspondent profile. If each profile has been observed only once, the construction of the vector counts is not required.
In the two next subsections we illustrate the working of the package with the help of two real datasets.

Example: Locust data
The dataset has been presented and analyzed by MacDonald and Raubenheimer (1995), and subsequently examined by other authors, including Gonçalves and Azzalini (2008) using the methodology considered here. The data have been collected to study the effect of hunger on the locomotion behaviour of locusts (Locusta migratoria). Specifically 24 locusts have been observed at 161 time points, at thirty-second intervals; the subjects were divided in two treatment groups ("feed" and "unfeed"), and within each of the two groups, the subjects were alternatively "male" and "female". For the purpose of this analysis the categories of the response variable were "moving" and "not moving". To start analyzing the data we first load the package

R> library("bild")
The available data have the following structure:
where the time unit has been set equal to an hour, for numerical convenience. The function bild() is called to fit the model logit(θ it ) = β 0 + β 1 time + β 2 time 2 + β 3 feed + β 4 time × feed + β 5 time 2 × feed using a dependence structure MC2 via the statements R> locust2 <-bild(move~(time + I(time^2)) * feed, data = locust, + start = NULL, aggregate = feed, dependence = "MC2") R> summary(locust2) Call: bild(formula = move~(time + I(time^2)) * feed, data = locust, aggregate = feed, start = NULL, dependence = "MC2") All the parameter estimates are significant at 5% level. This shows, among other things, that a quadratic time effect is present, and that a difference exists between the two groups ("feed" and "unfeed"). Moreover the estimates of λ 1 = log ψ 1 and λ 2 = log ψ 2 point strongly to second order serial dependence. To explore further this point, we can fit a similar model but adopting MC1 dependence structure via R> locust1 <-bild(move~(time + I(time^2)) * feed, data = locust, + start = NULL, aggregate = feed, dependence = "MC1") and compute the likelihood ratio test to compare the two dependence structures R> 2 * (getLogLik(locust2) -getLogLik(locust1)) [1] 88.45186 This change of deviance, compared with the χ 2 1 reference distribution, produces a p value about 0, confirming that the MC2 structure is significantly preferable to MC1. The graphical display of the fitted probabilities as shown in Figure 1 can be obtained using the plot method (setting which = 5) by R> plot(locust2, which = 5, ylab = "probability of locomotion") The residual analysis can be summarized, as shown in Figure 2, using the plot method setting which = 1 for residuals vs. fitted, which = 2 for residuals vs. time, which = 3 for ACF residuals and which = 4 for PACF residuals via the following statements,  R> plot(locust2, which = 1) R> plot(locust2, which = 2) R> plot(locust2, which = 3) R> plot(locust2, which = 4) For models with random intercept, MC1R and MC2R, the user may need to specify the integrate argument list changing the integration limits in the bildIntegrate() function.
If a random intercept is considered in the previous model (dependence = "MC2") the call to the function bild() is now done setting dependence = "MC2R". However, using the default settings for the integration limits of log-likelihood and its gradient, the call to bild generates an error. Therefore, the function bildIntegrate() was used to define new limits for likelihood and gradient as indicated here: R> Integ <-bildIntegrate(li = -2.5, ls = 2.5, lig = -2.5, lsg = 2.5) R> locust2r <-bild(move~(time + I(time^2)) * feed, data = locust, + start = NULL, aggregate = feed, dependence = "MC2R", integrate = Integ) The fitted probabilities for the model given by locust2r, choosing which = 5 in the plot method, can be adjusted (by default) or unadjusted (add.unadjusted = TRUE), see the Figure 3. The adjusted fits use the approximation considered by Zeger et al. (1988) and presented in equation (10). The unadjusted fits give the estimate probability of locomotion for the locust with the unobserved random intercept b 0i = 0, in both groups.
R> plot(locust2r, which = 5, ylab = "probability of locomotion", + add.unadjusted = TRUE) The individual mean profile for all subjects (by default) or to a subset (subSET = ...) can be obtained using the plot method choosing which = 6. The identification of the subjects is also allowed by setting ident = TRUE. These options are only available for random intercept models. See Figure 4 as the results of the settings below, R> plot(locust2r, which = 6, ylab = "probability of locomotion", + main = "Feed & Unfeed groups") R> plot(locust2r, which = 6, ident = TRUE, subSET = feed == "0", + ylab = "probability of locomotion", main = "Unfeed group") In Figure 5 examples of individual mean profiles for females (sex = 0) are given for both treatments groups and for unfeed group (feed = 0) only. In these two plots the identification of the subjects is produced setting ident = TRUE.
R> plot(locust2r, which = 6, subSET = (sex == "0"), main = "sex==0", + ident = TRUE) R> plot(locust2r, which = 6, subSET = (feed == "0" & sex == "0"), + main = "Unfeed & Female", ident = TRUE) 3.4. Example: Muscatine data Fitzmaurice et al. (1994) and Azzalini (1994) have analyzed a subset of data from the Muscatine Coronary Risk Factor Study, a longitudinal study of coronary risk factors in school children from Muscatine (Iowa, USA). The data set contains records on 1014 children who were 7-9 years old in 1977 and were examined in 1977, 1979 and 1981. The binary response of interest is whether the child is obese (1) or not (0). A marginal model is appropriate to examine the probability of obesity as a function of gender and age. However, many data records are incomplete, since not all children participate in all the surveys.
R> musc1r <-bild(obese~time1, data = muscatine, time = "time1", + start = c(1, 1), dependence = "MC1R") R> summary(musc1r) Call: bild(formula = obese~time1, data = muscatine, time = "time1", start = c(1, 1), dependence = "MC1R") Number of profiles in the dataset: 52  The decrease in deviance between the models given by musc1r and musc2r is ∆D = 2 × (950.92 − 947.24) = 7.37 on five degrees of freedom (p value= 0.19429). Thus, the model with only the linear effect over time and MC1R is not rejected at the level of significance 5%, the results of summary(musc1r) confirms the linear increase (on the logit scale) in the rate of obesity over time. To obtain the individual means profile we set which = 6 in the plot method and the result is obtained in Figure 6.

Closing remarks
In this paper we present an overview of the bild package for the analysis of binary longitudinal data. The theory used for model fitting is summarized briefly, and the functions of the package are described in detail. Practical use of bild is illustrated for the case of two real examples. A substantial computational burden is involved by the numerical integration connected to the random effects of Section 2.5, but this is not heavier than other formulations which incorporate random effects in discrete longitudinal data when a similar exact numerical integration is performed.