Model Averaging Software for Dichotomous Dose Response Risk Estimation

Model averaging has been shown to be a useful method for incorporating model uncertainty in quantitative risk estimation. In certain circumstances this technique is computationally complex, requiring sophisticated software to carry out the computation. We introduce software that implements model averaging for risk assessment based upon di-chotomous dose-response data. This software, which we call Model Averaging for Di-chotomous Response Benchmark Dose ( MADr-BMD ), ﬁts the quantal response models, which are also used in the US Environmental Protection Agency benchmark dose software suite, and generates a model-averaged dose response model to generate benchmark dose and benchmark dose lower bound estimates. The software fulﬁlls a need for risk assessors, allowing them to go beyond one single model in their risk assessments based on quantal data by focusing on a set of models that describes the experimental data.


Introduction
Risk assessors are frequently concerned in finding an exposure level which is associated with some specified level of excess risk from exposure to a hazardous agent. This level of exposure is often estimated using the benchmark dose method (BMD, Crump 1984). Here a point estimate of the dose, the BMD, or a lower limit on this endpoint (i.e., 100(1 − α)% lower confidence limit), the benchmark dose lower bound (BMDL), is typically reported. The benchmark dose estimate is dependent on the parametric model used to fit the dose-response data, with the BMDL often being more sensitive to modeling assumptions than the BMD. Further, as one model is frequently selected to estimate the BMD, while other plausible models are ignored, model uncertainty is part of any risk assessment. This problem is well noted in the literature, and is often problematic when two or more models that describe the data similarly produce widely varying estimates of the benchmark dose. A technique which allows one to estimate the BMD from more than one model that has been put forth in recent years, has been Bayesian model averaging (Raftery 1995), and its related frequentist analogue model averaging (MA) (Buckland, Burnham, and Augustin 1997).
Model averaging, for risk assessment, was first proposed by Kang, Kodell, and Chen (2000) for continuous microbial dose-response data, and further considered by Bailer, Noble, and Wheeler (2005a) for dichotomous dose-response data. These papers suggested a model averaged benchmark dose (MA-BMD) formed by taking a weighted average of model specific BMDs. The corresponding 100(1 − α)% lower bounds are also derived in this manner. This method was extensively studied through computer simulation experiment by Wheeler and Bailer (2009) and this averaging procedure was frequently found to calculate a lower limit that failed to have the nominal coverage probability on the true BMD value. This failure to effectively characterize the lower limit on the BMD estimate was most evident in cases when the exposed agent was highly hazardous, e.g. a steep dose-response relationship, which lead to questions about the broader applicability of MA in risk assessment. Subsequently Wheeler and Bailer (2007) modified the above procedure to a more computationally intensive methodology. They found this modified MA procedure, which averaged individual dose-response curves instead of the individual BMDs, more effectively described both the dose-response curve as well as the given benchmark dose often yielding estimated BMDs with minimal bias and lower bound estimates that achieved coverage at the nominally specified 100(1 − α)% level.
Although this work was promising, the computational complexity of the method makes the general use of such a method difficult to all but the most computationally sophisticated risk assessors. Consequently, this paper introduces a software program that implements the "average-model" MA procedure of Wheeler and Bailer (2007) for dichotomous risk assessment and is freely available to the risk analysis community. The paper proceeds by briefly describing this form of model averaging in the context of risk assessment applied to dichotomous dose-response data, and in the process the software is described. It then highlights the model averaging for dichotomous response benchmark dose (MADr-BMD) software against features and performance of the popular US EPA benchmark dose software package (BMDS US Environmental Protection Agency 2001).

Modeling dichotomous dose-response data
In the context of risk assessment the term "dichotomous response data" is used to describe a two level response denoting the presence or absence of a potentially unwanted outcome (e.g., tumor response, death). Frequently, the probability of adverse response is modeled by some set of covariates. For experimental data, a single covariate corresponding to the dose administered is often considered. Dichotomous dose-response modeling describes the probability of the adverse outcome given a specific dose administered to subjects under study. In this context, parametric models are used to link a dose to the observed dichotomous response under the assumption that the observed data are binomially distributed with a probability of response related to the dose.
Many models exist that can be used in this situation. The MADr-BMD software package estimates a variety of models that are commonly used in risk assessment to fit dichotomous dose response data. The following list describes the dose-response models, as well as the default parameter bounds that are used, in the software package. logistic: (1) log-logistic: probit: log-probit: Here π(d) represents the probability of adverse response given the dose d, Φ(x) is the cumulative distribution function of a standard normal random variable at x, and π i (d) = γ when d = 0 for the log-logistic (2) and the log-probit (6) models. The bounds stated for the above models allow for model families having a wide range of curvature that includes both sub-linear and supra-linear behavior. These constraints also represent a wider range of curvature than available, by default, in the US EPA software. In particular the Weibull, gamma, log-probit, and log-logistic models allow a wider degree of curvature, when fitting dichotomous response data. The increased curvature is added, based upon the results of Wheeler and Bailer (2009). Here they suggest that the low-dose linear behavior is better estimated, using model averaging, given this wide range of curvature.

Model averaging
Model averaging (MA) is a statistical technique that combines estimates from different doseresponse models. This is accomplished through a weighted average of the dose-response functions considered in the analysis, where the weights reflect the relation of the fitted function to the observed experimental data. For instance, a model's posterior probability can be used as a weight, and in this case this is a form of model averaging known as Bayesian model averaging (here the posterior probability represents the probability that a given model is the true model given the observed data). The weights are used to combine the models considered into one central model form.
Given the individual model weights the "averaged-model," is defined asπ ma (d) = K i=1 w i · π i (d), whereπ i (·) represents a dose-response function evaluated at the MLE, and w i represents the weight of the i th model. Conceptually the functionπ ma (d) can be thought of as a smoothed dose-response, which includes information from all models considered in the analysis.
The model weights, used in the above computation, can be estimated using a variety of methods. The MADr-BMD software program is designed to accommodate weights formed using the AIC (Akaike 1978), the BIC (Schwartz 1978), and the KIC (Cavanaugh 1999) information criterion. These criteria are defined as −2 log(L) + K, and −2 log(L) represents the maximum value of the −2 log likelihood for a particular model, K = 2P for the AIC, K = P log(n) for the BIC, and K = 3P for the KIC, here n represents the total sample size (i.e., the number of observations not the number of dose groups), and P represents the number of parameters in the model. These criteria were used in MA for risk assessment by Kang et al. (2000), Bailer et al. (2005a) Bailer, Wheeler, Dankovick, Noble, andBena (2005b), and Moon, Kim, Chen, and Kodell (2005). Given one of these criteria the weights are formed using the following formula: . , which was used by both Raftery (1995) and Buckland et al. (1997). It is important to note that three weighting criterion options are available to the MADr-BMD program, even though six options can be specified. These extra three options represent a modification to the AIC, BIC and KIC; we call these the AICB, BICB and the KICB respectively, and they are based upon the number of non-bounded parameters in the model. It is often the case that some of the models parameters are estimated at their lower, or upper, bounds, and in this case the effective number of parameters in the model is reduced. Although it can be argued that there is still the original number of parameters in the model, the effective number of parameters may be a better estimate than the total number of parameters. Though the effective number of parameter method is the default behavior of the US EPA BMDS software it is not clear which construction is more correct. Consequently the MADr-BMD software provides both estimates for exploration and further research; however, all results presented below are formed using the full number of parameters originally specified in the model. Thus we recommend using this approach, and leave the other three options for further research.

Benchmark dose estimation
Given the model-averaged dose-response curveπ ma (d), the benchmark dose (BMD) is defined as the dose that increases the risk above the background rate by some predefined level. Excess risk, for dichotomous data, is represented by the probability of adverse response above the background response rate. The value, which represents a specified increase in the probability of response, is known as the benchmark response (BMR), and uniquely determines the benchmark dose; the BMD is defined as the dose d that satisfies the equation: where the BMR is typically set at values of 1%, 5% and 10%. The above formulation is known as the extra risk specification of the BMD, and can be thought of as the dose that increases the probability of response, or risk, above background by the BMR, given that the event would not occurred in the absence of exposure. An alternative specification, also known as the added risk, is the dose d that satisfies the equation: which represents the absolute increase in risk, relative to no-exposure. The software package MADr-BMD can estimate either risk type of the BMD specified above.
Although the MA-BMD value can be readily estimated using the above formula, no such closed form formula exists in calculating the model averaged benchmark dose lower bound (MA-BMDL). Consequently the MADr-BMD calculates this value using a parametric bootstrap (Efron and Tibshirani 1993). Here parametric bootstrap resamples are taken forπ ma (d), and the calculated 100(1 − α)% lower bound is obtained using the bias corrected and adjusted confidence limits (BCa). The parametric bootstrap assumes that the response, at each dose d, is distributed binomially having success probabilityπ ma (d) given n trials specified by the original experiment (note that it is possible, though rare, forπ ma (d) to be equal to zero, specifically at the background dose, in this case the bootstrap proceeds by generating zero positive responses for the dose d) . The acceleration constant, which is required when computing the BCa, is estimated through the jackknife procedure also described by Efron and Tibshirani (1993).
MADr-BMD computes these estimates (i.e., the MA-BMD and its corresponding MA-BMDL) automatically. It also assumes a monotonic increasing dose-response, and in cases where shallow or no dose-response is evident the software can produce BMD estimates that are greater than the maximum dose administered. Further, in cases of flat or negative dose-response the software will produce an arbitrarily large BMD estimate of 10 8 . As these estimates are often far outside of the doses administered care should be taken on the risk assessor's part on interpreting the output when very shallow dose-response data are observed.

Implementation details
Estimation of the MA-BMD and the corresponding MA-BMDL is computationally intensive requiring large amounts of CPU time. Consequently, the MADr-BMD software is written in C++ and compiled into machine readable code as a stand-alone command line program that is executable under the Microsoft Windows operating system's shell command prompt. The program itself utilizes many routines that are under public license or in the public domain. The maximum likelihood estimation is done using the Fortran subroutine dmngb (Dennis, Gay, and Welsch 1981), which is available at http://www.netlib.org/. This routine is the same routine used within the US EPA's benchmark dose software. Further the C++ function, which calls dmngb routine in the benchmark dose software, is borrowed from the source code of the US EPA's software, and is available at the US EPA's web site (http://www.epa.gov/ncea/bmds/). The GNU scientific library GSL Galassi, Davies, Theiler, Gough, Booth, and Rossi (2005), was used for all other numerical routines (e.g., the matrix algebra and numerical integration routines were GSL algorithms) except those involved in differentiation. In this case, a finite difference algorithm (Press, Teukolsky, and Flannery 1992), which was borrowed with slight modifications from the EPA's benchmark dose software (US Environmental Protection Agency 2001), was used. The finite difference algorithm was tested by applying it to a range of functions. These test functions included Gaussian, probit and Weibull dose-response functions with known parameters, i.e. functions whose derivatives were known. In all cases the numerical derivative produced by the algorithm agreed with a high degree of precision, e.g. an absolute difference < 10 −6 , to the known derivatives. The algorithm was further tested within the context of maximum likelihood estimation. Here 2000 randomly simulated data sets were generated and models (1)-(9) were fit using the finite difference algorithm and compared to the US EPA's BMDS suite. In most cases the likelihood evaluated at the solution to the maximized likelihood estimating equations was identical to that based on the BMDS software, and differences, when they occurred, were small with the test algorithm and the BMDS software outperforming the other approximately 50% of the time.

Software specifications
The MADr-BMD program is invoked through the command line argument madrbmd, taking its input through a user specified text file. Given no input arguments (i.e., typing madrbmd at the command prompt) it assumes an input file named input.txt is located in the present working directory. Alternative input files can be specified as an additional command line argument that names file. For example, invoking the program with the command madrbmd ethychlor.txt would prompt the program to take its input from the file ethychlor.txt. The structure of the input file is relatively straightforward and is outlined above, where the meaning of each line completely described in Table 1. The file specifies which models, out of the models (1)-(9), to include in the analysis, as well as the MA weighting criterion (i.e., AIC, BIC, KIC etc.), and the specifics on the benchmark dose calculation. Individual analyses can be specified by only including one model in the specification. In these cases, the benchmark dose, and the corresponding lower bound, represents estimates based upon the single model and the lower bound based on the bootstrap. Here the choice of the weighting criterion is irrelevant to the calculation, as the single model will always receive 100% of the weight.
MADr-BMD, by default, outputs all calculations to the command prompt; however, output can be saved to a file using redirection at the command prompt. For example the command madrbmd ethychlor.txt > output.txt would take data from the file ethychlor.txt and output the data to the file output.txt. The output can then be viewed using any text editor. Partial output from the MADr-BMD program is shown below.  The output itself describes fit statistics of each model, along with the computed MA benchmark dose output. The model weights formed based upon the user specified weighting criterion, as well as the −2 log likelihood the BIC and AIC (computed using the full number of parameters) are shown in the initial output table created by the program. The subsequent output describes the calculation of the MA-BMD, its corresponding lower bound, as well as information that corresponds to its calculation. Specifically the term labeled "acceleration" represents the estimated acceleration constant used in the BCa calculation, and the random seed number is the value passed to the random number generator when computing the bootstrap resamples. Both values are both provided for transparency in the calculation. Specifying this value allows the researcher to replicate the results sin the future. Not surprisingly, if the same random number seed is not used, these values may change slightly each time the program is executed. Finally the goodness of fit statistic, i.e.  Table 1: Description of each line of input when specifying the model average (MA) dose response analysis in the MADr-BMD program. It is important to note that the degree of the multistage polynomial must be specified regardless of its inclusion in the model averaging.

Model Fit Statistics
Further no error checking is done on this value, and thus one can specify a polynomial that has more parameters than there are data, which may cause unpredictable results.
corresponding significance level estimated from the bootstrap resamples assuming the functionπ ma (d) generated data. The program produces further output, not reported here, which includes model-specific parameter estimates as well as the individual bootstrap resamples.

Software comparison
4.1. Software features Table 2 compares the basic feature set of the US EPA BMDS and the MADr-BMD software programs for fitting dichotomous dose-response data, and highlights the similarities, and differences between the two programs. Both packages rely on the same optimizer, and thus produce nearly identical results in most situations. In the situations when the two methods do not give identical parameter estimates, a difference in initial estimates along with the existence of local maxima, within the likelihood of the given model being fit, is usually the cause. A small simulation study by Wheeler and Bailer (2007) showed that there was no clear advantage between either the BMDS software and the MADr-BMD optimization strategy. In most all cases, the software converged to the same value, and when the two failed to achieve the same results, the larger likelihood value was found approximately half the time in the US EPA's software, and half the time in the MADr-BMD package. Parameter bounds*  models. This is done, as described previously, by simply including the desired model in the model specification line of the input file, while excluding all other models. If this is done the estimated BMD will be based upon the maximum likelihood, of the desired model, while the 100(1 − α)% BMDL will be estimated through parametric bootstrap. This is an important difference from the method used by the US EPA's software; in that they use the method of profile likelihoods to compute the lower bound. BCa and profile methods yield BMDLs with similar coverage properties; however, the value of these estimated lower limits may differ, with the difference more pronounced with small sample sizes. Finally, we note that only the MADr-BMD package can calculate MA-BMD estimates.

Data example
Consider the National Toxicology Program study where male F344/N rats were exposed to 2,4-hexadienal via gavage (National Toxicology Program 2001). For this study four groups of 50 rats were exposed to one dose of 2,4-hexadienal for two years. The observed proportion of rats exhibiting squamous cell papilloma of the fore stomach was 0/50, 3/50, 10/50, and 29/50, which corresponded to hexadienal doses of 0, 22.5, 45, and 90 mg/kg/day. Table 3 describes the estimated BMDs computed from models (1)-(9) for a BMR of 1% and 10%, using the added risk BMD specification, where the estimates were computed using both the US EPA's BMD software and the MADr-BMD package. A model average using the AIC as a weighting criterion and all the models except the quantal-linear (7) Table 3: Benchmark dose comparisons for squamous cell papillomas in rats exposed to 2,4hexadienal.(16) Data were analyzed using both the MADr-BMD program and the US EPA BMDS software, with benchmark responses of 1% and 10%. The added risk BMD formulation was used for all calculations. Finally for the model averaging data were analyzed with the software package MADr-BMD, using the AIC as weights and with the BCa lower bound confidence interval reported (see Table 2 for the models and the weights used in the model average).
(8) models is also reported. For both packages, the estimated BMD for models (1)-(9) are nearly identical. The MADr-BMD BMDL estimates tend to be slightly less than the BMDL values estimated using the corresponding profile likelihood counterparts calculated using the US EPA software. These differences are most noticeable in the Weibull, gamma, log-probit, and log-logistic. Here supra-linear fits are allowed, which frequently produce BMDL much smaller than the estimated BMD, and results in MADr-BMD BMDL estimates that are frequently smaller than the US EPA BMDS by an order of magnitude, though this behavior is not observed in this example.

Coverage comparison
We also consider the differences of model averaging using MADr-BMD compared to fitting the true model using the US EPAs benchmark dose software. For this comparison the true dose-response model is assumed to be known; though this is something that is seldom, if ever, the case in practice. The comparison is instructive as it shows that model averaging can frequently estimate the 100(1 − α)% lower bound of the BMD at a rate which is similar to that which would be obtained using the true dose-response model.
The coverage data for the true model fits, using the US EPA software, was obtained from the study of Wheeler and Bailer (2009), where the MA-BMDL coverage was obtained from the subsequent study of Wheeler and Bailer (2007). These studies were preformed using identical simulation conditions. In these studies, dichotomous dose-response experiments with 4 dose groups and 50 experimental units per dose group were simulated 2000 times and the BMD/BMDL was recorded. In these simulations, models (1)-(9) were used as the true underlying dose-response forms, and for each model six unique parametric forms were used given the underlying model, and the observed coverage (i.e., P(BMDL ≤ BMD true )) was reported for the BMDL estimated from the true model, as well as the observed coverage from two different "average-model" MA-BMDLs families of models. The first model space, from which the MA was constructed, consisted of three flexible models which included the 2-stage multistage (4), the log-probit (6), and the weibull (9). The second model space consisted of seven models adding the quantal linear (7), the quantal quadratic (8), the probit (5), and the logistic (1) models to the three model space. Finally, we note that even though the MADr-BMD software program can calculate BCa MA-BMDL estimates, the lower bound estimates described were calculated using the percentiles of the bootstrap distribution due to the increased computational demand required for the BCa estimate and the size of the simulation. For a comprehensive review of the simulation design, as well as other implementation details, the reader is referred to the aforementioned manuscripts Bailer 2009, 2007). Table 4 compares the observed coverage of the BMDL, with a BMR of 10%, when the true model is fit as compared to using model averaging BMDL calculation. Further, the nominally specified coverage rate was 95% for all calculations. With the exception of the quantal linear case, model averaging using the MADr-BMD software performed similarly in terms of observed coverage when calculating the BMDL. In some cases, the log-probit case in particular, MA's performance was superior to knowing and fitting the true model. Potential causes for this observed behavior warrant further investigation.

Conclusions
The software package MADr-BMD gives risk assessors increased flexibility when estimating risk from dichotomous dose-response data. By combining estimates from multiple models, MA gives researchers and regulators alike a method that accounts for statistical variability as well as model uncertainty. This method has been shown by Wheeler and Bailer (2007) to yield point estimates with minimal bias and confidence limits with nominal coverage properties, while producing estimates that are superior to picking one single model. This is true even if this model describes the data "better" than the other models used. Further, as shown above, in most cases model averaging performs similarly, in terms of observed coverage, to actually knowing the true model. The MADr-BMD software package thus fills a need within quantitative risk assessment community. By allowing researchers to estimate risk, and corresponding benchmark doses readily, researchers will be less encumbered by difficult questions pertaining to which single model should be used, and instead will be able to focus, on the far more manageable question "what curvature is plausible, given the current state of knowledge on the hazard, and what models should be fit prior to averaging?" With this said, it is important to note that the MADr-BMD software package does not constitute a "magic-bullet" when estimating risk from dichotomous dose-response data. Many questions should be addressed with further research, and care should be taken when using values estimated from the software package. For example, research questions about the best model space to use, as well as diagnostic methods for the appropriateness of the BCa cal-   Table 4: Comparison of observed coverage (i.e., P(BMDL ≤ BMD true ) ) between two different model averaging(MA) model spaces and the benchmark dose lower bound (BMDL) formed from fitting the true model to the data. The above data compares simulation results from the two studies of Bailer (2009, 2007). Here the true models were fit using the US EPA BMDS software, and the model averaging was conducted using a modified version of the MADr-BMD package with the AIC used as a weighting criterion. Further the MA-BMDL was computed using percentile based confidence intervals. culation in apparent shallow dose-response curves, are still open. Further, by incorporating uncertainty, the software does not immediately give one the license to estimate BMDs based upon an "averaged-model" that constitute low dose extrapolations beyond the range of the data. This was best illustrated in Wheeler and Bailer (2007) In their study, the model space directly influenced the ability of the "averaged-model" to recover the BMD and BMDL, for the logistic and probit conditions. Here the 3-model average covered the true BMD at the nominal 95% level when the BMR was set to 10%, but failed to cover the true BMD when the BMR was set at 1%. However the 7-model average condition covered, at the nominal level, in both situations. This was primarily because the fact that the true model was included in the 7-model condition, which stabilized the lower bound estimate. Finally, the current implementation assumes an underlying binomial response distribution. Extensions to over dispersed responses could be considered, and it is hoped that the software may provide a template for this extension.
Finally note that the development of this software does not remove the need for expert judgment on the part of the researcher when conducting an analysis using the MADr-BMD software. The researcher should carefully choose the model space, where mechanistic model knowledge should be used if available, and should approach low-dose extrapolations with considerable care when using model averaging. The software advances risk assessors ability to address model uncertainty, but should not be used as a substitute for scientific reasoning.