Relationship between molecular connectivity and carcinogenic activity: a confirmation with a new software program based on graph theory.

For a database of 826 chemicals tested for carcinogenicity, we fragmented the structural formula of the chemicals into all possible contiguous-atom fragments with size between two and eight (nonhydrogen) atoms. The fragmentation was obtained using a new software program based on graph theory. We used 80% of the chemicals as a training set and 20% as a test set. The two sets were obtained by random sorting. From the training sets, an average (8 computer runs with independently sorted chemicals) of 315 different fragments were significantly (p < 0.125) associated with carcinogenicity or lack thereof. Even using this relatively low level of statistical significance, 23% of the molecules of the test sets lacked significant fragments. For 77% of the molecules of the test sets, we used the presence of significant fragments to predict carcinogenicity. The average level of accuracy of the predictions in the test sets was 67.5%. Chemicals containing only positive fragments were predicted with an accuracy of 78.7%. The level of accuracy was around 60% for chemicals characterized by contradictory fragments or only negative fragments. In a parallel manner, we performed eight paired runs in which carcinogenicity was attributed randomly to the molecules of the training sets. The fragments generated by these pseudo-training sets were devoid of any predictivity in the corresponding test sets. Using an independent software program, we confirmed (for the complex biological endpoint of carcinogenicity) the validity of a structure-activity relationship approach of the type proposed by Klopman and Rosenkranz with their CASE program.

In the field of structure-activity relationship (SAR) studies, the software programs CASE (computer-automated structure evaluation) and MULTICASE, created by Klopman and Rosenkranz (1), represent an original approach for elucidating mechanisms of interaction between biological systems and exogenous compounds to predict the biological activities of chemicals. The strategy adopted is based on the hypothesis that molecular connectivity identifies the tridimensional structure: fragments of connected atoms and their interatomic bonds determine to a significant extent angles between pairs of contiguous atoms and their interatomic distance. The program should be able to detect, with the help of a statistical procedure, the submolecular structures that could interact with biological sites (i.e., receptors) involved in the biological process analyzed. The structure can be responsible for the biological activity of the compound (biophore) or its inhibition (biophobe). This view partially agrees with the work of Ashby and Paton (2), who singled out specific molecular fragments associated with genotoxicity.
The analytical capabilities of CASE increase with the amount of data input. CASE minimizes the possibility of bias due to human factors because it identifies parameters objectively, independent of human judgment. The only human operations are the choice of the data to be submitted to analysis and the interpretation of data in output. The selection of the descriptors (molecular fragments) that are used to predict biological activity is completely automated. The choice of descriptors is based on statistically significant prevalence in active or inactive molecules.
Since 1984, many studies have been published by Klopman and Rosenkranz (3-11) on this subject: sets of congeneric and noncongeneric compounds have been tested for several biological endpoints (mutagenicity, carcinogenicity, etc.). We have selected for discussion in this report some papers among the most pertinent to our work. Concerning predictivity, the results obtained by Klopman and Rosenkranz change for different endpoints and for different chemical classes analyzed and overall show a high level of accuracy; often, however, predictivity has been tested only in the training set or in arbitrarily built test sets.
The general strategy of CASE is known, but the detailed structure of the software is not available because it is protected by copyright. Up to now, all reports on predictivity using CASE have been published solely by the program creators or by authors using the CASE program by license or permission. Due to these restrictions, we saw the need to develop a new, completely independent program to confirm (or disprove) the validity of the type of SAR approach used by CASE.
Our software uses graph theory to reproduce basic operations characterizing the CASE program. The program associates a graph with a molecule to represent its topological properties. The program searches for subgraphs (molecular fragments) characteristic of groups of carcinogenic or noncarcinogenic compounds. To test the performance of the software, we chose the induction of tumors in rodents as a biological endpoint. Tumors are the endpoint of carcinogenesis, a complex multistage event, in which genetic alterations are only one part of the story. We used the Carcinogenic Potency Database (CPDB) (12)(13)(14)(15) and the National Toxicology Program (NTP) (16)(17)(18) data to obtain information on rodent carcinogenicity. We divided the data into two subsets: a randomly selected learning set including 80% of the chemicals, and a nonoverlapping test set including 20% of the chemicals. An additional control analysis tested an artificially paired set of data where carcinogenicity is attributed randomly to the molecules of the training set but not to the molecules of the test set.

Software Features
To analyze the possible relationships between the structure of molecular fragments and carcinogenicity, our software analyzes the topological properties of molecular fragments using graph theory. For a detailed introduction to graph theory, see Christofides (19).
Graph theory is used to relate the topological properties of molecules to their possible carcinogenicity. A graph is a pair $(V, E)$, where $V$ is the set $\{v_i,\ i = 1, \ldots, n\}$ of vertices, and $E$ is the set $\{e_{ij} = (v_i, v_j),\ v_i, v_j \in V\}$ of edges that express existing relations between vertices; both vertices and edges may be labeled (i.e., they may have an associated name or value). Any compound can be represented as a graph by associating the atoms with the vertices and the bonds with the edges. This kind of representation is frequently adopted in the literature because it allows easy handling of the topological properties of compounds. In fact, graph theory has many applications, such as in nomenclature, coding and information processing, storage, and retrieval (20).
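The atom-to-vertex and bond-to-edge mapping described above can be sketched in a few lines. This is only an illustration in Python, not the authors' C implementation; the class name `MolGraph`, its methods, and the acetamide example are hypothetical choices made here.

```python
from dataclasses import dataclass, field


@dataclass
class MolGraph:
    """Labeled graph: vertices are heavy atoms, edges are bonds (hypothetical helper)."""
    atoms: dict = field(default_factory=dict)   # vertex id -> element symbol (vertex label)
    bonds: dict = field(default_factory=dict)   # frozenset({i, j}) -> bond order (edge label)

    def add_atom(self, idx, symbol):
        self.atoms[idx] = symbol

    def add_bond(self, i, j, order=1):
        self.bonds[frozenset((i, j))] = order

    def neighbors(self, i):
        """Return the sorted ids of the atoms bonded to atom i."""
        return sorted(j for pair in self.bonds if i in pair for j in pair if j != i)


# Acetamide, CH3-C(=O)-NH2, with hydrogens omitted (heavy atoms only).
mol = MolGraph()
for idx, symbol in enumerate(["C", "C", "O", "N"]):
    mol.add_atom(idx, symbol)
mol.add_bond(0, 1)
mol.add_bond(1, 2, order=2)
mol.add_bond(1, 3)
```

Here the carbonyl carbon (vertex 1) has three neighbors, in line with the small vertex degree typical of organic compounds.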
Our software system uses a fragmentation approach to determine whether subfamilies of compounds with carcinogenic activity, or lack thereof, are characterized by the presence of some common structural features (molecular fragments). A similar approach has already been applied in earlier computer-aided methods (21)(22)(23) for predicting different biological activities (antiarthritic-immunoregulatory effects and antineoplastic effects). In these earlier works, not all the possible fragments within a given range of nonhydrogen atoms were generated, but only a limited subset of fragments, such as augmented atoms, heteropaths, and ring fragments. A definition of these substructural units is given by Chu et al. (22). Our work is mainly based on the works of Rosenkranz and Klopman (3,4) and on the studies of Ashby (24,25), who has defined indicators that can be thought of as subgraphs usually present in genotoxic compounds (genotoxicity is an important component of carcinogenicity). Essentially, the system searches all the fragments (i.e., subgraphs) of the compounds present in the training set whose activity is known, in an attempt to determine a reliable set of fragments whose presence in compounds of unknown carcinogenicity (test set) may be an indicator of their activity. In particular, the main procedure of the program that executes the fragmentation works as follows: all the fragments within a given size of each compound of the training set are produced; a unique code is associated with any fragment yielded, and, if this code is not already present in a fragment dictionary, it is inserted in the dictionary. A list of the compounds to which the fragment belongs is linked to the fragment code and it is initially filled with the code of the compound currently examined. Otherwise, if the fragment code is already present in the dictionary, only the corresponding compound list is updated. 
Once all the compounds of the training set have been fragmented, the system scans the dictionary by searching for the fragments that satisfy the statistical conditions (described later).
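The fragmentation and dictionary procedure can be illustrated with a minimal Python sketch. It rests on several assumptions that the text does not state: fragments are taken as induced connected subgraphs, the "unique code" is obtained by a brute-force canonical ordering (practical only for small fragments, since a fragment of eight atoms has 8! orderings), and all function names and the two example compounds are hypothetical.

```python
from itertools import permutations


def connected_subsets(adj, min_size=2, max_size=8):
    """Yield every connected set of atoms whose size lies in min_size..max_size."""
    frontier = {frozenset([v]) for v in adj}
    size = 1
    while frontier and size <= max_size:
        if size >= min_size:
            yield from frontier
        # Grow each connected set by one neighboring atom; the set removes duplicates.
        frontier = {sub | {w} for sub in frontier for v in sub for w in adj[v] if w not in sub}
        size += 1


def fragment_code(sub, labels, bonds):
    """Assumed canonical code: the smallest (atom labels, bonds) tuple over all atom orderings."""
    best = None
    for order in permutations(sorted(sub)):
        pos = {v: k for k, v in enumerate(order)}
        lab = tuple(labels[v] for v in order)
        edg = tuple(sorted(
            (min(pos[a], pos[b]), max(pos[a], pos[b]), o)
            for (a, b), o in bonds.items() if a in sub and b in sub))
        if best is None or (lab, edg) < best:
            best = (lab, edg)
    return best


def build_dictionary(compounds, min_size=2, max_size=4):
    """Map each fragment code to the set of compounds that contain the fragment."""
    dictionary = {}
    for name, (labels, bonds) in compounds.items():
        adj = {v: set() for v in labels}
        for a, b in bonds:
            adj[a].add(b)
            adj[b].add(a)
        for sub in connected_subsets(adj, min_size, max_size):
            # New code: create the entry; known code: only update its compound list.
            dictionary.setdefault(fragment_code(sub, labels, bonds), set()).add(name)
    return dictionary


# Two small example compounds (heavy atoms only); bonds map (i, j) -> bond order.
compounds = {
    "acetamide": ({0: "C", 1: "C", 2: "O", 3: "N"}, {(0, 1): 1, (1, 2): 2, (1, 3): 1}),
    "acetic acid": ({0: "C", 1: "C", 2: "O", 3: "O"}, {(0, 1): 1, (1, 2): 2, (1, 3): 1}),
}
fragments = build_dictionary(compounds)
```

In this toy dictionary the C-C and C=O fragments are linked to both compounds, whereas the C-N fragment is linked only to acetamide. The demonstration limits fragments to four atoms to keep the brute-force coding fast; a production system would need a more efficient canonical labeling.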
The program was developed in standard C language, and it can be compiled on both MS-DOS and Unix architecture. The version used for the experiments described here can run on any machine with a 3.0 or later version of MS-DOS operating system, and it requires at least 4 MB of memory and 100 MB of hard disk. A typical experiment (a single run of a standard training set of 661 molecules) takes about 4 hr of computation time on a 486 machine to develop the database of significant fragments. Two additional hours are required for the statistical analysis that selects the significant fragments. The amount of time needed to determine if a new compound of a test set contains one or more of such fragments depends mainly on the compound structure; for example, the analysis of a 40-atom (nonhydrogen) compound, normally connected, takes about 5 min, whereas a 10-atom (nonhydrogen) compound takes no more than 30 sec.
The program accepts as input an ASCII file that describes, by means of a connectivity matrix, the structure of each compound to be analyzed. A separate interface program has been developed to graphically input such structures, storing them in that ASCII file. In general, the analysis system yields synoptic report files, but it also stores information in ASCII files in which data are organized in tables; in this way such information can be easily accessed by most database software.

Statistical Methods
After the software has considered all the molecular subunits with size between two and eight "heavy" atoms, a statistical analysis is performed to select only significant fragments. The first selection is based on the distribution of the fragments between positive and nonpositive molecules. The training set initially generates a global number of about 278,000 fragments. Of these, about 103,000 are different fragments. For the successive stages of the analysis, the software keeps only those fragments that have a probability of random association with carcinogenicity (or lack thereof) lower than 0.125 (one tailed) according to the binomial distribution. We computed our statistical estimate for the tail in the direction of biological prevalence; however, statistical fluctuations can make a fragment significant in both directions (carcinogenicity or lack thereof). Therefore, conceptually, the real confidence limits have to be considered two tailed and about twice the one-tailed level of confidence. We have calculated the probability for the entire tail of the distribution to estimate statistical significance.
For each monomial of the distribution we have used the classical formula:

$$\Pr(X) = \frac{N!}{X!\,(N - X)!}\; p^{X} q^{N - X}$$

where $N$ is the number of times in which a given fragment has been generated in different molecules (trials); $X$ is the number of times in which the fragment has been generated by positive molecules (successes); $p$ is the probability that one fragment has been generated by a positive molecule (probability of success); its value is determined by the ratio

$$p = \frac{\text{fragments generated by positive chemicals } (\approx 159{,}000)}{\text{fragments generated by all chemicals } (\approx 278{,}000)}$$

$q$ is the probability that the fragment has been generated by a nonpositive molecule (probability of failure $= 1 - p$); and $\Pr(X)$ is the probability of $X$ successes (single monomial). The fragments selected in this way are labeled "activating" if their occurrence in carcinogenic chemicals is higher than the statistical limit that we established. Similarly, the fragments are labeled "inactivating" if their occurrence in nonpositive compounds is higher than the established statistical limit. In a second stage, the program removes the fragments that are redundant because they are "imbedded" in larger fragments and have identical behavior (only the subunit with smaller size is kept). At this stage the number of fragments is reduced by a factor of more than 300 relative to the initial set of different fragments generated (generally from 103,000 to 315 fragments).
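The one-tailed binomial selection can be sketched as follows. This is a minimal Python illustration, assuming the tail is summed from the observed count upward and using the approximate ratio 159,000/278,000 for p; the function names and the returned labels are hypothetical, and the redundancy-removal stage is not shown.

```python
from math import comb


def binomial_tail(n, x, p):
    """P(X >= x) for X ~ Binomial(n, p): the entire upper tail of the distribution."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x, n + 1))


def classify_fragment(n_pos, n_neg, p_pos=159_000 / 278_000, alpha=0.125):
    """Label a fragment found in n_pos positive and n_neg nonpositive training molecules."""
    n = n_pos + n_neg
    if binomial_tail(n, n_pos, p_pos) < alpha:
        return "activating"      # over-represented in carcinogens
    if binomial_tail(n, n_neg, 1 - p_pos) < alpha:
        return "inactivating"    # over-represented in nonpositive chemicals
    return None                  # not significant in either direction


# A fragment seen only in carcinogens needs four occurrences to pass the 0.125 limit,
# whereas a fragment seen only in nonpositive chemicals needs three.
labels = [classify_fragment(3, 0), classify_fragment(4, 0), classify_fragment(0, 3)]
```

With these assumptions the two tails cannot both fall below 0.125 for the same fragment, so the order of the two checks does not affect the label.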
A test set, a random sample of the overall data set, is tested to search each chemical for the presence of significant fragments selected in the training stage. On the basis of fragment distribution for the chemicals in the test set, a prediction of their carcinogenicity is made.
A molecule of the test set can have one or more fragments that are present in molecules of the training set. Combining the statistical significance of these fragments, we calculate an empirical index, PI (probability index), for the molecules of the test set. An example of the calculation of this simple index follows.
A molecule, Xv, of the test set contains three fragments among those selected as statistically significant in the training set (F1 and F2 "activating," F3 "inactivating"). Fragment F1 has been selected because it is present, in the training set, in five active molecules (AT, BT, CT, DT, ET) and in one inactive molecule (GT). Similarly, F2 is contained in four active molecules (AT, BT, CT, HT), whereas the selection of fragment F3 originates from the presence of this subunit in four inactive molecules (GT, QT, ST, TT). Fragments F1 and F2 are probably related because they were generated by a similar set of molecules. To remove the redundancies, the two fragments are treated as one fragment that originates from seven chemicals (AT, BT, CT, DT, ET, GT, HT). In a similar way, the information obtained from fragment F3 is added to create a single aggregate (AT, BT, CT, DT, ET, GT, HT, QT, ST, TT), in which the ratio between molecules with carcinogenic properties (six) and all the molecules contributing to the evaluation (ten) is 0.6. This value is used as a PI.
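The worked example can be reproduced with a set union, so that each training molecule contributes only once however many fragments it shares with the test molecule. This is a minimal Python sketch; the function name `probability_index` and the data layout are assumptions made for illustration.

```python
def probability_index(fragment_sources, is_active):
    """PI = active training molecules / all training molecules behind the matched fragments."""
    contributing = set().union(*fragment_sources.values())  # each molecule counted once
    if not contributing:
        return None  # no significant fragment: the molecule cannot be assessed
    return sum(1 for m in contributing if is_active[m]) / len(contributing)


# Training molecules behind the three fragments found in the test molecule.
fragment_sources = {
    "F1": {"AT", "BT", "CT", "DT", "ET", "GT"},
    "F2": {"AT", "BT", "CT", "HT"},
    "F3": {"GT", "QT", "ST", "TT"},
}
is_active = {m: True for m in ("AT", "BT", "CT", "DT", "ET", "HT")}
is_active.update({m: False for m in ("GT", "QT", "ST", "TT")})

pi = probability_index(fragment_sources, is_active)  # 6 active out of 10 molecules
```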
The next step is the calculation of the PI value that is used as a cut-off value to define two categories (positives and negatives) of predicted activity for the test set. This cut-off index is the value that maximizes the accuracy of the 2 x 2 contingency table (carcinogenicity or lack thereof versus predicted activity) in the training set.
Accuracy in the training set as a function of the PI is illustrated in Figure 1. Levels of accuracy higher than 0.73 are obtained in the training set in a range of PI values between 0.35 and 0.8. This is because the majority of molecules have a probability index higher than 0.8 or lower than 0.35 (Fig. 2). A cut-off within this range only slightly affects the attribution to the carcinogenic or noncarcinogenic class. The average optimal cut-off value for eight runs was 0.41.
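The choice of the cut-off can be sketched as a scan over the PI values observed in the training set. This Python fragment is illustrative only: it assumes that a molecule is predicted positive when its PI is greater than or equal to the cut-off and that ties in accuracy are resolved in favor of the lowest cut-off, neither of which is specified in the text. The function name and the toy data are hypothetical.

```python
def best_cutoff(pi_values, is_carcinogen):
    """Return (cutoff, accuracy) maximizing training-set accuracy of the rule PI >= cutoff."""
    best_c, best_acc = None, -1.0
    for c in sorted(set(pi_values)):
        correct = sum((pi >= c) == label for pi, label in zip(pi_values, is_carcinogen))
        acc = correct / len(pi_values)
        if acc > best_acc:  # strict comparison keeps the lowest cut-off among ties
            best_c, best_acc = c, acc
    return best_c, best_acc


# Toy training set: PI of each molecule and its known activity.
pi_values = [0.1, 0.2, 0.4, 0.6, 0.9, 1.0]
is_carcinogen = [False, False, True, False, True, True]
cutoff, accuracy = best_cutoff(pi_values, is_carcinogen)
```

In the toy data, cut-offs of 0.4 and 0.9 give the same accuracy, which mirrors the observation that a cut-off anywhere within a broad PI range changes the classification only slightly.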
Preliminary runs of our program showed, for partial subsets of carcinogenicity data, statistical fluctuations in terms of predictivity indices. For this reason, we performed eight runs using our final database (826 compounds, 515 carcinogens and 311 noncarcinogens). For each run we randomly drew 80% of compounds for the training set and used the remaining 20% as the test set. We also performed eight paired runs using the same chemicals, but, in this case, the property of carcinogenicity in the training set was randomly attributed (pseudo-training set). The procedure for randomly selecting the chemicals for the training set and the test set imposed the condition that in both sets, 62.3% of the chemicals must be positive carcinogens. This simple procedure uses a routine of BASIC language (RANDOMIZE TIMER) as a random-number generator to assign the chemicals for the training sets and to assign the carcinogenic property in the pseudo-training sets.
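The sampling scheme can be sketched in Python as a stratified split followed by a random reassignment of activity. The original used a BASIC routine; here the standard `random` module stands in for it. Treating the random attribution as a permutation of the training labels (which preserves the fraction of positives) is an assumption, as are the function names and the placeholder chemical identifiers.

```python
import random


def stratified_split(chemicals, is_positive, train_fraction=0.8, seed=None):
    """Split chemicals into training and test sets with the same fraction of positives."""
    rng = random.Random(seed)
    train, test = [], []
    for flag in (True, False):
        group = [c for c in chemicals if is_positive[c] == flag]
        rng.shuffle(group)
        k = round(train_fraction * len(group))
        train += group[:k]
        test += group[k:]
    return train, test


def pseudo_training_labels(train, is_positive, seed=None):
    """Assumed control: same chemicals, activity labels permuted at random."""
    rng = random.Random(seed)
    labels = [is_positive[c] for c in train]
    rng.shuffle(labels)
    return dict(zip(train, labels))


# Placeholder database with the reported composition: 515 carcinogens, 311 noncarcinogens.
chemicals = [f"chem{i}" for i in range(826)]
is_positive = {c: i < 515 for i, c in enumerate(chemicals)}
train, test = stratified_split(chemicals, is_positive, seed=1)
pseudo = pseudo_training_labels(train, is_positive, seed=2)
```

With these counts the split gives 661 training and 165 test chemicals, matching the set sizes quoted elsewhere in the text.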
To evaluate the predictivity level of our methodology, we adopted some indices that are conventionally used for diagnostic tests. In addition, according to Klopman and Kolossvary (26), we evaluated two further parameters defined in terms of X, the fraction of active molecules in the data set, and Y, the fraction of molecules predicted as active.
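Because the formulas themselves are not reproduced here, the following Python sketch shows only the standard definitions of sensitivity, specificity, and accuracy for a 2 x 2 contingency table. It also assumes that the chance-expected accuracy takes the usual form X·Y + (1 − X)(1 − Y); this may or may not coincide exactly with the parameters of Klopman and Kolossvary (26). The function and key names are hypothetical.

```python
def predictivity_indices(tp, fp, fn, tn):
    """Standard diagnostic indices from a 2 x 2 table (true/false positives and negatives)."""
    n = tp + fp + fn + tn
    x = (tp + fn) / n  # X: fraction of active molecules in the data set
    y = (tp + fp) / n  # Y: fraction of molecules predicted as active
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / n,
        # Assumed form of the accuracy expected if predictions were unrelated to activity.
        "expected_by_chance": x * y + (1 - x) * (1 - y),
    }


# A strongly unbalanced set: 23 carcinogens and 1 noncarcinogen, all predicted correctly.
indices = predictivity_indices(tp=23, fp=0, fn=0, tn=1)
```

For such an unbalanced set the chance-expected accuracy is already about 92%, which shows why accuracy alone says little when nearly all chemicals belong to one class.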

Sources of Data
We gathered the carcinogenicity data analyzed here from two of the main databases: CPDB (12)(13)(14)(15), in which more than 4000 experimental values are reported (1053 chemicals are considered in the database), and the NTP database (16)(17)(18), in which 301 chemicals have been tested with standardized protocols in mice and rats. The two databases provide qualitative and quantitative data for each experiment. We considered only qualitative results because our software can process only categorical outcomes at this time. To simplify the situation, in our first analysis we used only binary data: we classified the experimental results for each chemical as "positive" or "nonpositive." To this end, we arbitrarily fixed criteria to make a binary outcome. Table 1 shows the rules adopted for CPDB data, and Table 2 describes the rules used for NTP data. The two databases overlap extensively because NTP data (except for the most recent experiments) are already contained in CPDB. For only a few chemicals was there incomplete agreement between the two sources: Table 3 considers all the possible combinations of matched results. A large portion of the compounds for which there are data available in the two databases is included in our database. No intentional selection was performed. We discarded 50 (4.4%) chemicals with uncertain carcinogenicity status (not classified according to Tables 1-3); 263 (23.1%) chemicals were excluded for one or more of the following reasons: 1) administered in mixture; 2) fewer than three "heavy" atoms; 3) molecules too large for the input interface (more than 50 heavy atoms); 4) contained unusual atoms (only chemicals composed exclusively of H, C, S, N, Cl, O, Na, F, Br, and P were included in the database); 5) difficulty finding the structural formula. Our program can currently analyze 826 chemicals. The CAS numbers of these chemicals are given in Appendix A.

Results
The fragmentation stage of the process produces about 278,000 fragments (average of 8 runs), adding up all the fragments produced for each molecule; of these, about 103,000 are different fragments. From the analysis of their occurrence and after removal of redundant fragments, on the average, 315 fragments significantly associated with carcinogenicity or lack thereof (p < 0.125 according to binomial distribution) are kept for the successive steps of the analysis. The number of fragments is significantly lower for the paired training sets with a random attribution of carcinogenicity: on average, 174 fragments are selected. Detailed features of the data analyzed are summarized in Table 4. We also counted the fragments generated with a threshold of statistical significance at p < 0.01. In this case, the training set of all the 826 chemicals in our database generated 50 fragments, whereas 6 pseudo-training sets (see Methods) of 826 chemicals generated an average of only 11.8 fragments. Examining the distribution of the fragments shown in Appendix B, we observe that the most common size is 4 heavy atoms (15 fragments), although sizes between 3 and 7 are also relatively common (5-10 fragments). Only two significant fragments of eight heavy atoms and only one fragment of two heavy atoms are present.
The 315 fragments obtained from the training stage are prevalently "inactivating" (60.6%), and only 39.4% are "activating." This fact may be due to the ratio between fragments generated from carcinogens and noncarcinogens in the database studied. In our global database we have more carcinogens (62.3%) than noncarcinogens (37.7%). However, noncarcinogens have an average size larger than carcinogens (15.1 heavy atoms versus 13.0 heavy atoms). Most likely for this reason, out of the total number of generated fragments (redundant fragments included), 57.0% come from carcinogens and 43.0% from noncarcinogens. Figure 3 shows the distribution of the occurrences of 103,000 fragments of the average training set. In the case of negative fragments, those present in three noncarcinogens reach our established limit of statistical significance ($0.43^{3} < 0.125$). This is not the case for positive fragments ($0.57^{3} > 0.125$). For a positive fragment to become significant, it has to be present in at least four carcinogens ($0.57^{4} < 0.125$). As shown in Figure 3, many more fragments are present at least three times than those present at least four times. Statistically significant negative fragments can be sorted from a larger set than statistically significant positive ones.
As a consequence, even if we start with more positive (57%) than negative fragments (43%), we end up with 60.6% statistically significant negative fragments and 39.4% statistically significant positive ones (in the final set of 315 statistically significant different and nonredundant fragments).

Table footnotes: [...] (13): a, National Cancer Institute (NCI) or NTP evaluation is that the incidence of tumors at that site(s) was associated with administration of the compound. This code is used for technical reports before March 1986; c, NTP evaluation is clear evidence of carcinogenic activity. For NCI/NTP reports before March 1986, c indicates that the evaluation was carcinogenic; e, NTP evaluation is equivocal evidence of carcinogenic activity: studies that are interpreted as showing a marginal increase of neoplasms that may be chemically related; p, NTP evaluation is some evidence of carcinogenic activity: studies that are interpreted as showing a chemically related increased incidence of neoplasms (malignant, benign, or combined) in which the strength of the response is less than that required for clear evidence; +, author in general literature evaluated site as positive; -, in the general literature the author evaluated site as negative. NTP evaluation is no evidence of carcinogenic activity: studies that are interpreted as showing no chemically related increases in malignant or benign neoplasms; NE, no evaluation for NTP and general literature. b: P, positive; NC, not classified; NP, nonpositive. A chemical that could be defined as positive at least in a single species, in a single sex, in a single site, was defined as positive. A chemical that could be defined as nonpositive in all sites was defined as nonpositive. Chemicals with a mixture of not classified and nonpositive evaluations were discarded as equivocal.
Among the 315 significant and nonredundant fragments, similar (not identical), related fragments are still present, but the possible bias that they could introduce in terms of predictivity is lessened by the statistical treatment described in the previous section. These fragments generate the predictions of carcinogenicity or lack thereof for the test sets. For each run, a 2 x 2 contingency table is created and all the most important indices of qualitative predictivity are calculated.

[Table note: Carcinogenicity randomly attributed (in the training sets); average of eight runs (± SE). Footnote: As defined in Klopman and Kolossvary (26).]

The indices obtained with the real training sets seem to show a high level of predictivity. However, even the indices obtained with the eight training sets where carcinogenicity was randomly attributed (Table 6) show a high predictivity performance. It is clear that the results obtained are not due to the predictive capability of the program but mainly to the many degrees of freedom existing in the system. These degrees of freedom allow for an a posteriori adaptation of the program to the pattern of positive and negative data in the training sets. In conclusion, the training sets cannot be used for an assessment of predictivity. It must be noted that the pseudo-training sets generate fewer "significant" fragments than the real training sets. As a consequence, there are fewer chemicals associated with a positive or negative prediction (376.9) compared with the real training sets (521.6). Table 7 shows the contingency table obtained for an average of eight test sets. The level of accuracy (67.5%) is significantly higher (p = 0.0006) than the expected level, based on the hypothesis of no association between connectivity and carcinogenicity (53.2%). The results obtained when the training sets with carcinogenicity randomly attributed are used to predict the same test sets (Table 8) do not show any association. These results, together with the previous observation that a random attribution of carcinogenicity generates only about 55% as many apparently significant fragments as a real training set, strongly suggest that connectivity is associated only with a real biological property and not with a randomly distributed simulated property.
Among the 165 chemicals of the test sets: 1) 32.4% (average of eight runs) contained only statistically significant positive fragments and were predicted with an accuracy of 78.7%; 2) 24.4% of the chemicals contained only statistically significant negative fragments and were predicted with an accuracy of 60%; 3) 19.8% of the chemicals contained both statistically significant positive and negative fragments and were predicted with an accuracy of 59.3%; 4) 23.3% of the chemicals contained no statistically significant fragments (70.8% of these chemicals were carcinogens and 29.2% were noncarcinogens), thus preventing a prediction of carcinogenicity.
Of those chemicals without statistically significant fragments, the ratio between carcinogens and noncarcinogens (70.8/29.2) is higher than the ratio present in the global database (62.3/37.7). This result can be explained by the fact that among the 315 statistically significant fragments selected by the program, more negative fragments (60.6%) than positive fragments (39.4%) are detected. For this reason, perhaps, we more often detected noncarcinogens than carcinogens. This could explain the enrichment in carcinogens among the molecules not associated with significant fragments.
Environmental Health Perspectives

Discussion
The major drawback to this type of automated analysis is the number of elementary operations performed and the quantity of memory needed. Determining the largest common subgraph between two graphs is a nonpolynomial task and requires time that depends exponentially on the size of the graphs and subgraphs involved. Fortunately, some characteristics of the chemical compounds partially simplify this otherwise formidable task: 1) the maximum number of edges converging at a node is usually small (around four); 2) the number of atoms in the compounds of our database is relatively small: the average number of heavy atoms (nonhydrogen) per compound is 13.8, and the largest compound contains 48 heavy atoms (see Fig. 4); 3) the maximum size of the searched fragments was limited to eight heavy atoms. As can be observed in Figures 5 and 6, fragments of greater size tend to appear in large numbers, but each of them tends to be present in too few compounds to be statistically significant. We have also observed that in our database, the information (associated with carcinogenicity or lack thereof) related to fragments of size 9 is redundant with respect to the information of smaller sizes in 100% of the cases (data not reported).
Finally, thus far, the adopted technique of representation of molecular fragments does not make a distinction among steric isomers; such cases will be dealt with in a future improvement to the system.
We have described the method for calculating our PI value in Methods. We used the PI value as a discriminant for deciding if a molecule of the test set will be predicted to be a carcinogen or a noncarcinogen. The strategy adopted prevents strongly related fragments from contributing to the analysis as independent fragments. In this way the informative content of a single chemical in the training set can have only one unit weight: we thus avoid the introduction of a bias of redundancy resulting from the multiplication of information related to a single molecule.
This strategy can introduce a different potential bias for a subset of molecules with different active substructures all common to the same molecules: in this case the index calculated can be underestimated. However, in our opinion, adding up the contributions of highly correlated fragments would cause more distortion than discarding multiple contributions present in the same molecule.
As a general result, we have confirmed what has been suggested by Klopman and Rosenkranz (4): an approach based on molecular connectivity can predict carcinogenicity. The results obtained in our test sets are statistically significant (p = 0.0006). We believe that the observed levels of predictivity are not only statistically significant but also biologically relevant and potentially useful as one component of a spectrum of information that can contribute to hazard evaluations. Our initial work is promising, but we must test the software in additional experiments to develop it as a predictive toxicology system. For instance, we have to investigate in detail the performance of our program for different thresholds of statistical significance when we are selecting significant fragments from the training set to be used for predictions in the test set.
We can logically presume that with a smaller (and/or less diversified) training set, a fragment potentially associated with carcinogenicity or lack thereof could not reach statistical significance (or reach a more equivocal statistical significance). Therefore, we would expect that the percentage of nonassessable chemicals should decrease for a larger training set, and we should obtain better predictivity in general.
We plan to test our software program using smaller training sets (i.e., from 200 to 400 chemicals randomly selected) to verify if our assumption is correct. Klopman and Rosenkranz (11) have already verified this assumption. However, for the moment, we do not know if the similarities between the CASE program and our program are sufficient to allow extrapolation of their results to the results of our program.
We also have to look in detail at the fragments selected as significant to comment about their biological plausibility and compare them with the alert structures of Ashby (2,16,17,18,24,25) and also with fragments identified by the CASE and MULTICASE programs. We plan to coordinate with the authors of CASE and MULTICASE to test our respective programs with identical training sets and identical test sets so that we can compare the results obtained.
We used a database much larger than those used previously by other authors. We have obtained an average (eight runs) level of accuracy of 67.5% (standard error, ±1.3). As shown in Table 7, we predicted 82.1 chemicals as positive and 44.4 as negative. If these predictions (with the same proportions of predicted positives and negatives) had been based only on chance, the level of accuracy would have been 53.2% (ECP value). In our database, the prevalence of positive carcinogens is 62.3%. If we had predicted all the chemicals of the test sets as carcinogens, we would have obtained an accuracy of 62.3%. However, when all chemicals are predicted to be potential carcinogens, the sensitivity is 100% and the specificity is 0%, and the prediction is not very useful.
An accuracy of 62.3% is apparently not very different from 67.5%, but we would anticipate for our software program levels of accuracy in the range of 65-70% at a ratio of carcinogens/noncarcinogens of 50/50, or even 38/62. We plan to perform these experiments in a future study. Different levels of predictivity were observed for different subclasses of chemicals. For instance, the confidence of the prediction for a chemical of the test sets, characterized only by positive fragments, is significantly higher (78.7%) than the confidence of the prediction for a chemical characterized only by negative fragments or contradictory fragments (60.7% and 59.3%, respectively).
We have met some difficulties in performing a direct comparison of our results with the results obtained by CASE. At the level of the training set, accuracy was higher (95%) for CASE (8,9) than for our program. This difference is probably related to differences in the decisional-statistical procedures used for the information obtained from different molecular fragments. In addition, the carcinogenicity database used by Klopman and Rosenkranz was different from ours. We have clearly demonstrated that accuracy at the level of the training sets is not correlated to the real predictivity of the software program (compare Tables 6 and 8).
A test set concerning carcinogenicity is present in two different reports by Klopman and Rosenkranz (8,9). The training set contained 189 chemicals of the NTP study (50.2% active, 22.2% marginally active, and 27.5% noncarcinogens). The rodent carcinogens (or noncarcinogens) considered in the test sets of the two papers are the same chemicals. They had been evaluated for carcinogenicity in the GeneTox program. In this test set, 23 out of 24 chemicals were rodent carcinogens. The expected correct predictivity was 92%, and the observed predictivity (accuracy) was 100%. Obviously, it is not possible to directly compare this extremely unbalanced database with ours.
In 1990, an analysis of the capability of CASE to predict carcinogenicity for a group of polycyclic aromatic hydrocarbons (PAHs) was reported by Richard and Woo (27). Thirty-one active and 25 inactive PAHs were used in the training set ("LEARN"), and 9 active and 15 inactive PAHs were used in the test set ("VALIDATE"). The authors reported an accuracy of 75% (sensitivity, 89%; specificity, 67%). In a recent publication (28), results concerning the predictive capabilities of CASE were reported for a group of chemicals for which carcinogenicity data recently became available (NTP studies). Out of 25 chemicals predicted by CASE, 17 were carcinogens and 8 were noncarcinogens (6 equivocals omitted). The degree of accuracy was 64% (sensitivity, 59%; specificity, 75%). Obviously, these results are from small test sets and are not directly comparable with ours.
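The accuracy figures quoted above follow from the reported sensitivity and specificity and from the class sizes of each test set. The sketch below checks this arithmetic. The counts of correct calls (8 of 9, 10 of 15, 10 of 17, 6 of 8) are not given in the text; they are inferred here from the rounded percentages and should be read as an assumption.

```python
def accuracy(sensitivity, specificity, n_pos, n_neg):
    """Overall accuracy from sensitivity, specificity, and class sizes."""
    correct = sensitivity * n_pos + specificity * n_neg
    return correct / (n_pos + n_neg)


# PAH test set: 9 active and 15 inactive chemicals.
# Inferred counts: 8/9 is about 89%, 10/15 is about 67%.
pah = accuracy(8 / 9, 10 / 15, n_pos=9, n_neg=15)

# NTP test set: 17 carcinogens and 8 noncarcinogens.
# Inferred counts: 10/17 is about 59%, 6/8 is 75%.
ntp = accuracy(10 / 17, 6 / 8, n_pos=17, n_neg=8)

print(f"PAH accuracy: {pah:.2f}")
print(f"NTP accuracy: {ntp:.2f}")
```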
Among the works published by Klopman and Rosenkranz, a larger database (more similar to our database) was used to predict mutagenicity in Salmonella. In a recent study (1), Klopman and Rosenkranz used mutagenicity data from the GeneTox program and NTP studies to perform the analysis. The training set was built using GeneTox mutagenicity data, and the test set was built using NTP mutagenicity data.
Chemicals present in both databases were not submitted to CASE and MULTICASE analysis. In this way, the training set contained 450 mutagens, 253 marginally active mutagens, and 123 nonmutagens, whereas the test set contained 63 mutagens, 21 marginally active mutagens, and 61 nonmutagens. The highest level of predictivity obtained using the MULTICASE program was about 80%, as opposed to an expected correct prediction of about 50%. According to Ashby and Tennant (29), mainly electrophiles (directly or after metabolic activation) are involved in Salmonella mutagenicity. It is reasonable to think that mutagenicity in Salmonella should be easier to predict than the complex endpoint of carcinogenicity: phenomena such as promotion, clonal expansion, remodeling, tissue necrosis and regeneration, and modulation of proliferation, apoptosis, and differentiation are clearly involved in the carcinogenic process, but not in mutagenicity in Salmonella or in other short-term tests of genotoxicity. We would expect a wider and more heterogeneous spectrum of molecular fragments to be involved in carcinogenicity than in genotoxicity. In the future, we will have to apply our software program not only to carcinogenicity but also to mutagenicity in Salmonella to test our hypothesis that it is in general easier to predict genotoxicity than carcinogenicity.
After analyzing recent studies evaluating the qualitative correlation between short-term tests for genotoxicity and carcinogenicity (30,31), we conclude that the accuracy of such tests in predicting carcinogenicity is in the range of 56-62%. It seems reasonable that short-term genotoxicity tests can reflect irreversible alterations in the genome during carcinogenesis. On the other hand, short-term tests should not be able to monitor nongenotoxic events (for instance, those events linked to promotion and clonal expansion of preneoplastic cells). The fact that the predictivity of molecular connectivity is better than the predictivity of short-term genotoxicity tests suggests that molecular connectivity can detect not only electrophilic fragments, like the ones described by Ashby et al. (2,16-18,24,25), but also fragments linked to nongenotoxic effects (promotion, modulation of differentiation, etc.). An alternative explanation of this difference in accuracy could be related to the fact that nongenotoxic carcinogens may be more abundant in the databases used to assess the predictivity of short-term tests (30,31) than in our larger database. In the future we will investigate the predictivity of molecular connectivity for genotoxic and nongenotoxic carcinogens.
We have discussed the predictive capability of short-term genotoxicity tests. How much higher would this predictivity be with a test biologically closer to carcinogenicity in rodents? We can partially answer this question. The endpoint of carcinogenicity in a single species of small rodents is not very different in the evolutionary scale from the endpoint of carcinogenicity in at least one of two closely related species. If our endpoint is now only in mice or rats, we can predict carcinogenicity in one species with carcinogenicity in the other. For the database of Gold et al. (12-15), a concordance of 75% between rat and mouse studies has been reported (32), and for the chemicals of the NTP studies, a concordance of 74% has been reported (33); the predictivity of molecular connectivity is only moderately lower than the values reported above. This can be considered an additional indication of the good behavior of our parameter. We will have to confirm this impression in future experiments using only mouse data or rat data.
Within the framework of hazard evaluation, we believe that the computerized SAR approach should be given a weight similar to that of a standard short-term test in a multifactorial analysis of the carcinogenic potential of a given chemical. With regard to genotoxicity and carcinogenicity, Ashby (34) has pointed out that some fragments detected as significant by Klopman and Rosenkranz (and likewise by us) could not stand an in-depth analysis performed by a human expert, considering both biological and chemical specific arguments.
We agree with this observation. Because we found in the pseudo-training sets a number of apparently significant fragments equal to about 55% of the statistically significant fragments found in the real training sets, we suspect that (as a first approximation) about half of the fragments defined as significant according to our statistical threshold (p<0.125, one-tailed) are spurious. According to our analysis, only about 50% of apparently significant fragments emerging from a training set can be fragments of real biological significance. The remaining 50% is probably generated by chance and can also be present in a pseudo-training set in which carcinogenicity is assigned randomly. The level of predictivity reached in our experiments is probably due to a mixture of approximately 50% predictive fragments and approximately 50% noise fragments. We think that fragments suggested as significant by our software program should be considered only as candidates for biological significance, but are by no means foolproof biological indicators of carcinogenicity. Their probability of being significant is higher, as expected, when we select a more severe statistical threshold. As a consequence of these considerations, a new potentially significant fragment detected by our software program is only submitted to the attention of investigators as a possible fragment characterizing a subfamily of molecules, potentially responsible for their common carcinogenic activity. Additional biological and chemical considerations could lead to the acceptance or rejection of the fragment as biologically significant. For instance, if the chemicals considered are similar procarcinogens, a similar metabolism should generate similar proximate carcinogens and perhaps also similar DNA adducts.
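The comparison between real and pseudo-training sets can be sketched as a label-shuffling experiment. The code below is an illustration only and not the program described in this paper. It assumes that each molecule is already reduced to a set of fragment identifiers, and it uses a one-tailed binomial test against the prevalence of carcinogens as a stand-in for the statistical procedure, which this section does not specify. All function names are hypothetical.

```python
import random
from math import comb


def one_tailed_p(k, n, p0):
    """P(X >= k) for X ~ Binomial(n, p0)."""
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k, n + 1))


def count_significant(fragment_sets, labels, alpha=0.125):
    """Count fragments associated with carcinogenicity or with its absence.

    fragment_sets: one set of fragment identifiers per molecule.
    labels: 1 for carcinogen, 0 for noncarcinogen, in the same order.
    """
    p0 = sum(labels) / len(labels)  # prevalence of carcinogens
    occurrences = {}  # fragment -> (molecules containing it, carcinogens among them)
    for frags, label in zip(fragment_sets, labels):
        for frag in frags:
            n, k = occurrences.get(frag, (0, 0))
            occurrences[frag] = (n + 1, k + label)
    significant = 0
    for n, k in occurrences.values():
        p_pos = one_tailed_p(k, n, p0)          # enriched among carcinogens
        p_neg = one_tailed_p(n - k, n, 1 - p0)  # enriched among noncarcinogens
        if min(p_pos, p_neg) < alpha:
            significant += 1
    return significant


def pseudo_training_count(fragment_sets, labels, rng, alpha=0.125):
    """Same count after assigning carcinogenicity at random (shuffled labels)."""
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return count_significant(fragment_sets, shuffled, alpha)
```

Calling `count_significant` on the real labels and `pseudo_training_count` with, for example, `random.Random(0)` on the same molecules gives the two numbers whose ratio is discussed above. Shuffling preserves the prevalence of carcinogens, so any fragment that still passes the threshold does so by chance.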
There are also cases in which it is impossible to reach a definite conclusion. Statistical significance is only one factor; however, when the statistical threshold is much more severe (p<0.01 instead of p<0.125), the number of significant fragments generated in a real training set is four to five times larger than the number of significant fragments generated in a pseudo-training set (against a ratio of 2/1 for the threshold p<0.125). Fragments with a higher statistical significance deserve priority in subsequent biological investigations with the aim of confirming or disproving the existence of a new molecular structure relevant for carcinogenicity or genotoxicity. On the other hand, the information obtained with the threshold p<0.125, while less significant than the information obtained with the threshold p<0.01, still allowed us to make predictions about a much larger fraction of chemicals. For this reason, the threshold p<0.125 was selected for the general predictivity study presented here.
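The ratios quoted above translate directly into a rough estimate of the share of spurious fragments at each threshold. The sketch below assumes, as a first approximation, that the number of chance fragments in a real training set equals the number found in a pseudo-training set; the function name is hypothetical and the counts are placeholders chosen only to reproduce the stated ratios.

```python
def estimated_spurious_fraction(n_pseudo, n_real):
    """Share of apparently significant fragments attributable to chance.

    Assumption: a real training set contains about as many chance
    fragments as a pseudo-training set with randomly assigned labels.
    """
    return min(1.0, n_pseudo / n_real)


# Threshold p<0.125: real/pseudo ratio of about 2/1.
loose = estimated_spurious_fraction(1, 2)

# Threshold p<0.01: real/pseudo ratio of four to five.
strict_high = estimated_spurious_fraction(1, 4)
strict_low = estimated_spurious_fraction(1, 5)

print(f"p<0.125: about {loose:.0%} spurious")
print(f"p<0.01:  about {strict_low:.0%} to {strict_high:.0%} spurious")
```

Under this approximation, the stricter threshold cuts the estimated share of noise fragments from about one half to between one fifth and one quarter, at the cost of covering fewer chemicals.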
We used the overall evidence of carcinogenicity in at least one species, one sex, and one tissue, without any consideration of carcinogenic potency, to determine whether or not a chemical is a carcinogen (yes or no). In the future we plan to stratify our database according to spectrum of carcinogenicity (large spectrum, narrow spectrum), as suggested by Tennant (35), and perhaps take into consideration different ranges of potency. A subfamily of chemicals sharing a common chemical fragment could also display a relatively homogeneous behavior with respect to a different subfamily sharing a different fragment.
In conclusion, we have confirmed that with a large database, using an independent software program, SAR approaches based on the computer-automated detection of molecular fragments statistically associated with a given biological property can be used to predict carcinogenicity in rodents. We are not aware of other independent validations of this type of SAR approach.