The NIEHS Predictive-Toxicology Evaluation Project.

The Predictive-Toxicology Evaluation (PTE) project conducts collaborative experiments that subject the performance of predictive-toxicology (PT) methods to rigorous, objective evaluation in a uniquely informative manner. Sponsored by the National Institute of Environmental Health Sciences, it takes advantage of the ongoing testing conducted by the U.S. National Toxicology Program (NTP) to estimate the true error of models that have been applied to make prospective predictions on previously untested, noncongeneric chemical substances. The PTE project first identifies a group of standardized NTP chemical bioassays that are either scheduled to be conducted or ongoing but not yet complete. The project then announces and advertises the evaluation experiment, disseminates information about the chemical bioassays, and encourages researchers from a wide variety of disciplines to publish their predictions in peer-reviewed journals, using whatever approaches and methods they feel are best. A collection of such papers is published in this Environmental Health Perspectives Supplement, giving readers the opportunity to compare and contrast PT approaches and models within the context of their prospective application to an actual-use situation. This introduction to the collection summarizes the predictions made and the final results obtained for the 44 chemical carcinogenesis bioassays of the first PTE experiment (PTE-1) and presents information that identifies the 30 chemical carcinogenesis bioassays of PTE-2, along with a table of prediction sets that have been published to date. It also provides background about the origin and goals of the PTE project, outlines the special challenge associated with estimating the true error of models that aspire to predict open-system behavior, and summarizes what has been learned to date.

Level of evidence (LOE). NTP assigns a LOE to each sex-species chemical carcinogenicity experiment, as defined in each NTP Technical Report (TR). These are CE, clear evidence; SE, some evidence; EE, equivocal evidence; NE, no evidence; for older studies, they are P, positive; E, equivocal; and N, none.
Overall LOE. The LOEs assigned to the individual sex-species experiments are combined into a classification for the overall bioassay study, using the following algorithm: a) If the LOE for one or more of the experiments is CE, SE, or P, then the overall classification is positive (POS); b) If the LOE for all of the experiments is NE or N, then the overall classification is negative (NEG); c) If the LOE for one or more of the experiments is EE or E and the LOE for the other experiments is NE or N, then the overall classification is equivocal (EQV); d) Experiments classified as inadequate study (IS) are given no consideration in arriving at the overall LOE classification.
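The overall-LOE algorithm can be sketched in a few lines of Python. This is a minimal sketch, not NTP's own implementation: the function name `overall_loe`, the set names, and the choice to return `None` when every experiment is an inadequate study are assumptions made for illustration, and the negative rule is read as covering the no-evidence codes NE and N.

```python
from typing import Iterable, Optional

# LOE codes taken from the definitions above (current and older studies).
POSITIVE = {"CE", "SE", "P"}   # clear evidence, some evidence, positive
EQUIVOCAL = {"EE", "E"}        # equivocal evidence, equivocal
NEGATIVE = {"NE", "N"}         # no evidence, none
INADEQUATE = "IS"              # inadequate study: ignored (rule d)


def overall_loe(experiment_loes: Iterable[str]) -> Optional[str]:
    """Combine per-experiment LOE codes into POS, EQV, or NEG.

    Returns None when no adequate experiment remains; that fallback is an
    assumption of this sketch, not part of the stated algorithm.
    """
    # Rule d: drop inadequate studies before applying the other rules.
    considered = [loe.upper() for loe in experiment_loes
                  if loe.upper() != INADEQUATE]
    unknown = set(considered) - POSITIVE - EQUIVOCAL - NEGATIVE
    if unknown:
        raise ValueError(f"Unrecognized LOE code(s): {sorted(unknown)}")
    if not considered:
        return None
    # Rule a: any positive experiment makes the whole study positive.
    if any(loe in POSITIVE for loe in considered):
        return "POS"
    # Rule c: no positives, at least one equivocal, the rest negative.
    if any(loe in EQUIVOCAL for loe in considered):
        return "EQV"
    # Rule b: every remaining experiment is NE or N.
    return "NEG"
```

For example, a study with LOEs of NE, EE, NE, and NE for its four sex-species experiments would be classified EQV under this sketch, while a single CE result would make the study POS regardless of the other three.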

Need for Predictive-Toxicology Models
The NTP conducts standardized chemical bioassays in rodents to identify and characterize exposures to substances that may be associated with carcinogenic or other toxicological effects on human health (1). Current regulations require that safety testing be performed in connection with the development of new chemicals or new uses of known chemicals. However, before the advent of such regulations, more chemicals came into use than can ever be tested using conventional methods. At the present time, society in general, and the discipline of toxicology in particular, face the parallel tasks of performing safety evaluations that support the development of new chemical uses before human exposures are permitted and assessing the potential hazard posed by exposures to chemicals that lack safety evaluations. This situation creates an urgent need to develop PT models that
* generate predictions that are of known reliability or are accompanied by confidence-level estimates
* identify hazardous-chemical exposures more rapidly and at a lower cost than current procedures
* apply to all types of test articles, including organic, inorganic, polymeric, mineral, and mixtures
* provide information that supports sound decision making for the effective and efficient management of laboratory animal testing that is still needed by regulatory and chemical development programs
* refine and reduce reliance on the use of large numbers of laboratory animals in the conduct of chemical bioassays
* accelerate the performance of risk assessments and the conduct of research and development programs.

Goals of Predictive-Toxicology Research
The development of models that reliably identify the hazard for untested chemical substances, of any type, using attribute values that can be computed or obtained with minimum testing time and cost is widely recognized to be the most immediate goal of PT research. The return of information and overall value of an NTP bioassay increase when it is included in a PTE experiment because each prediction made about its outcome represents an additional hypothesis that is tested by the bioassay. Thus, in addition to characterizing the toxicity of individual chemicals (i.e., identifying hazard), standardized bioassay tests also stimulate PT research by providing both learning sets for the development of models and the means to subject model performance to hypothesis testing.
Another, less perceived, aspect of PT research has potential value that far exceeds the generation of reliable predictions per se. Some PT models are based on pattern-recognition analysis of a learning set (2-8). The learning set is a database that includes a representative number and range of classified cases, where the chemical bioactivity of each case towards a particular toxicity end point has been determined by standardized testing. Each classified case in the learning set is represented by a corresponding array of values on attributes, selected to reflect various aspects of either or both biological factors and chemical structure that may influence activity. Although "data-mining" by pattern-recognition analysis can be limited by the availability of suitable learning sets, it represents a new approach that has great potential to help discover and confirm the key factors and relationships that govern the various, multifactorial, mechanistic pathways and determine toxic effects. Thus, the ultimate value and most important goal of PT research may lie in the development of its potential to help identify, characterize, and understand the various mechanisms or modes of action that determine the type and level of response observed when biological systems are exposed to chemicals. Because PT research can confirm existing hypotheses regarding mechanisms and stimulate the formation of new ones (9), it is complementary to and synergistic with the conduct of mechanistic studies.
The discovery aspect of PT research may also lead to an important refinement in the use of quantitative structure-activity-relationship (QSAR) models. A classical, extrathermodynamic QSAR approach (10,11) can only be applied to model chemical bioactivities governed by a unique mechanistic pathway, i.e., where chemical bioactivity is controlled by a single rate-limiting step. This limits the legitimate application of each different QSAR model to untested chemicals that can be expected to be processed under the control of the same mechanism for which the QSAR was developed. When faced with selecting a QSAR model to study the mechanistic behavior of an untested chemical, there is no legitimate way to determine which of the many available might apply most appropriately. This uncertainty would be eliminated by the development of PT models that not only predict the activity expected for an untested chemical but also indicate the mechanistic pathway that governs it. Thus, the output of such PT models would serve to guide the selection of QSAR models that may be used legitimately to elucidate mechanistic details and gain understanding that fosters better interpretation of the activity predicted.

Evaluation of Predictive-Toxicology Models
The advantages offered by PT research are clear; however, difficult problems remain that involve both model development and acceptance issues (12). A recent, definitive study of difficulties associated with the model confirmation problem (9) reports: "Verification, validation, and confirmation of numerical models of natural systems is impossible. This is because natural systems are never closed and because model results are always nonunique. Models can be confirmed by the demonstration of agreement between observation and prediction, but confirmation is inherently partial. Complete confirmation is logically precluded by the fallacy of affirming the consequent and by incomplete access to natural phenomena. Models can only be evaluated in relative terms, and their predictive value is always open to question. The primary value of models is heuristic." This important publication explains why it is impossible to establish confidence limits on boundaries of the feature space spanned by a PT model, which might otherwise be used to guide and restrict its application to legitimate cases. Also, because the boundaries of PT models are inexact, the legitimate range of application for PT models will always be uncertain, to some extent. The complex nature of the model confirmation problem presents a perplexing challenge to both developers and potential users; to gain acceptance and fulfill their promise, PT models must demonstrate performance accuracy that earns the confidence of would-be users.
PT-model evaluations based on cross-validation techniques (13) provide useful feedback during development of a model by analysis of a learning set of classified cases. On their own, however, they cannot reveal whether high classification accuracy is a sign of model brittleness due to overlearning, which would lead to low prediction accuracy for unclassified cases.
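To make concrete what a cross-validation estimate measures, the following Python sketch compares resubstitution accuracy (classifying the same cases the model was built from) with k-fold cross-validated accuracy for a model that simply memorizes its learning set. Everything here is an illustrative assumption: the one-nearest-neighbor "model", the function names, and the contrived toy learning set are hypothetical and are not taken from any PTE participant's method. The sketch also says nothing about prospective accuracy on genuinely untested chemicals, which is what a PTE experiment is designed to measure.

```python
from typing import List, Sequence, Tuple

# A classified case: (attribute values, class label). Hypothetical layout.
Case = Tuple[Tuple[float, ...], str]


def nearest_neighbor_predict(train: Sequence[Case],
                             attrs: Tuple[float, ...]) -> str:
    """Return the label of the training case closest to attrs."""
    def squared_distance(case: Case) -> float:
        return sum((a - b) ** 2 for a, b in zip(case[0], attrs))
    return min(train, key=squared_distance)[1]


def resubstitution_accuracy(cases: Sequence[Case]) -> float:
    """Accuracy when the model classifies the cases it was built from."""
    hits = sum(nearest_neighbor_predict(cases, attrs) == label
               for attrs, label in cases)
    return hits / len(cases)


def k_fold_accuracy(cases: Sequence[Case], k: int) -> float:
    """Accuracy when each case is classified by a model built without it."""
    hits = 0
    for fold in range(k):
        held_out = cases[fold::k]  # every k-th case, starting at `fold`
        train = [c for i, c in enumerate(cases) if i % k != fold]
        hits += sum(nearest_neighbor_predict(train, attrs) == label
                    for attrs, label in held_out)
    return hits / len(cases)


# Contrived toy learning set: a single attribute whose value carries no
# usable signal, because the class labels simply alternate.
learning_set: List[Case] = [((float(i),), "+" if i % 2 == 0 else "-")
                            for i in range(12)]

resub = resubstitution_accuracy(learning_set)
cv = k_fold_accuracy(learning_set, k=4)
print(f"resubstitution accuracy: {resub:.2f}")
print(f"4-fold cross-validated accuracy: {cv:.2f}")
```

On this toy set the memorizing model classifies every learning-set case correctly, yet it misclassifies every held-out case, because each held-out case's nearest remaining neighbors carry the opposite label. The gap between the two figures is the kind of feedback cross-validation provides during development.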

The PTE Project
Overview. This project enlists the interdisciplinary resources of the entire PT community in the conduct of experiments that rigorously determine the extent to which predictions, made prospectively, agree with experimental observation. It provides objective, experimentally determined estimates for true error of model performance. It creates unique opportunities for the user and model-developer communities to jointly assess the strengths and weaknesses of various PT models and to evaluate the principles and ideas underpinning their development. More specifically, the PTE project
* identifies test sets of bioassays that focus predictive-toxicology research efforts on a common goal and thereby provides a means for the rigorous, experimental evaluation of PT models;
* provides information on NTP test results, as well as samples of test chemicals, to the research community;
* encourages involvement of researchers from diverse disciplines to promote the application of a wide range of alternative approaches to
With the support and cooperation of the editor of Mutagenesis, others were invited to publish sets of predictions, basing them on the methods they preferred (15). A variety of researchers responded, and the original set of published predictions evolved to become PTE-1. Figure 1 illustrates how Tennant et al. used the NTP-standardized testing program to first develop their human-heuristic PT model and then to evaluate the accuracy of its performance. The flow diagram identifies the basic components needed to develop and evaluate PT models and indicates the type, source, and flow of information typical of what might be used to generate prospective predictions and organize a PTE experiment.
The "Tox testing" module in Figure 1 represents the engine that drives learning in toxicology, because it is the primary source of phenomenological observations, the foundation for learning in science. Standardized toxicity testing fosters the healthy growth and maturation of this relatively young discipline (12) by providing learning sets that support the development of models and theories. It is important to use learning sets that include a sufficient number and variety of classified cases to adequately represent the uncertain number of multifactorial, mechanistic pathways that are associated with a complex toxicity endpoint like chemical carcinogenicity. Figure 2 illustrates how a fully evaluated and confirmed PT model simplifies, when testing, learning, comparing, and modifying steps are no longer needed. A fully confirmed model needs only a few basic components to generate reliable predictions about hazard associated with exposure to untested chemicals. Information generated by the model is interpreted and used with confidence by decision makers.

PTE-1: Prediction Sets, Final Bioassay Results, and Workshop Conclusions
Final results for the 44 NTP bioassays that made up PTE-1 are presented in Table 1. The sets of predictions generated by PTE-1 are listed in Table 2. Several papers evaluating various aspects of the PTE-1 experiment have already been published. We hope that this compilation of PTE-1 prediction sets, accompanied by presentation of the final results for all 44 of the PTE-1 bioassays, will inspire the publication of more papers that involve analyses of the results from this experiment to extend what has already been learned.
During 1993 the NIEHS conducted an international workshop to evaluate what had been learned from the PTE-1 collaboration. Broad consensus was evoked during discussions on some points while widely different opinions were heard on others. The workshop reached two main conclusions (16). First, SAR-based models do not perform as accurately as models that utilize biological attributes and, second, models that used multiple attributes to represent the chemical carcinogenicity endpoint performed better than models that were based on one or two attributes.
Abbreviations: +, positive; -, negative; NP, no prediction made; W+, weakly positive; W+U, weak positive or uncertain probability for being positive; E, equivocal. (a) Separate predictions were made for rats and mice; when the predictions were different, both were entered into the table, separated by a / mark. (b) See Table 1, footnote i. (c) The original, published prediction was changed at the request of these authors, after information about the correct identity, structure, and CAS RN for the chemical tested was sent to all participants, along with a request for them to notify us in writing if the new information led to a revised prediction.

Support Provided to Foster Participation in PTE Experiments
The primary purpose of a PTE experiment is to learn by focusing the intellectual resources of different research groups on a common problem. When the set of test cases for a PTE experiment is reasonably representative for the end point activity, the overall learning potential for an evaluation experiment is influenced more by the number and variety of models applied to generate predictions than by the number of test-set bioassays. Therefore, it is important that as many predictors participate as possible. The original announcement for PTE-2 (17) made available a package of comprehensive information that was distributed by mail or fax. Early in 1996, a page for the PTE Project was established on the Internet, as a link to the NIEHS homepage. It provides updates about the current status of the PTE-2 experiment and access to NTP database information of particular interest to PTE participants; the more important Internet addresses include: