Prediction of rodent carcinogenicity bioassays from molecular structure using inductive logic programming.

The machine learning program Progol was applied to the problem of forming the structure-activity relationship (SAR) for a set of compounds tested for carcinogenicity in rodent bioassays by the U.S. National Toxicology Program (NTP). Progol is the first inductive logic programming (ILP) algorithm to use a fully relational method for describing chemical structure in SARs, based on using atoms and their bond connectivities. Progol is well suited to forming SARs for carcinogenicity as it is designed to produce easily understandable rules (structural alerts) for sets of noncongeneric compounds. The Progol SAR method was tested by prediction of a set of compounds that have been widely predicted by other SAR methods (the compounds used in the NTP's first round of carcinogenesis predictions). For these compounds no method (human or machine) was significantly more accurate than Progol. Progol was the most accurate method that did not use data from biological tests on rodents (however, the difference in accuracy is not significant). The Progol predictions were based solely on chemical structure and the results of tests for Salmonella mutagenicity. Using the full NTP database, the prediction accuracy of Progol was estimated to be 63% (+/- 3%) using 5-fold cross validation. A set of structural alerts for carcinogenesis was automatically generated and the chemical rationale for them investigated; these structural alerts are statistically independent of the Salmonella mutagenicity. Carcinogenicity is predicted for the compounds used in the NTP's second round of carcinogenesis predictions. The results for prediction of carcinogenesis, taken together with the previous successful applications of predicting mutagenicity in nitroaromatic compounds, and inhibition of angiogenesis by suramin analogues, show that Progol has a role to play in understanding the SARs of cancer-related compounds.


Introduction
An understanding of the molecular mechanisms of chemical carcinogenesis is central to the prevention of many environmentally induced cancers. One approach is to form structure-activity relationships (SARs) that empirically relate molecular structure with ability to cause cancer. This work has been greatly advanced by the long-term carcinogenicity tests of compounds in rodents by the National Toxicology Program (NTP) of the National Institute of Environmental Health Sciences (1). These tests have resulted in a database of more than 300 compounds that have been shown to be carcinogens or noncarcinogens. The database of compounds can be used to form general SARs relating molecular structure to formation of cancer.
The compounds in the NTP database present a problem for many conventional SAR techniques because they are structurally very diverse, and many different molecular mechanisms are involved. Most conventional SAR methods are designed to deal with congeneric compounds, i.e., compounds having a common molecular template and presumed similar molecular mechanisms of action. Numerous approaches have been taken to forming SARs for carcinogenesis. Ashby and co-workers (2)(3)(4) developed a successful semiobjective method of predicting carcinogenesis based on the identification of chemical substructures (alerts) that are associated with carcinogenesis. A similar but more objective approach was taken by Sanderson and Earnshaw (5), who developed an expert system based on rules obtained from expert chemists. An inductive approach, not directly based on expert chemical knowledge, is the computer-automated structure evaluation (CASE) system (6,7). This system empirically identifies structural alerts that are statistically related to a particular activity. A number of other approaches have been applied based on a variety of sources of information and SAR learning methods (8)(9)(10)(11)(12)(13). The effectiveness of these different SAR methods was evaluated on a test set of compounds for which predictions were made before the trials were completed (round 1 of the NTP's tests for carcinogenesis prediction) (8,14,15). There is currently a second round of tests.
The machine-learning methodology Inductive Logic Programming (ILP) has been applied to a number of SAR problems. Initial work was done using the program Golem to form SARs for the inhibition of dihydrofolate reductase by pyrimidines (16)(17)(18). This work was extended by the development of the program Progol (19) and its adaptation for application to noncongeneric SAR problems (20). Progol has been successfully applied to predicting the mutagenicity of a series of structurally diverse nitroaromatic compounds (21), and the inhibition of angiogenesis by suramin analogues (20). The Progol SAR method is designed to produce easily understandable rules (structural alerts). For the nitroaromatic and suramin compounds the rules generated provided insight into the chemical basis of action.
Most existing SAR methods describe chemical structure using attributes, i.e., general properties of objects. Such descriptions can be displayed in tabular form, with the compounds along one dimension and the attributes along the other dimension. This type of description is very inefficient at representing structural information. A more general method of describing chemical structure is to use logical statements, or relations. This method is also clearer, as chemists are used to relating chemical properties and functions for groups of atoms. The Progol method is the first to use a general relational method for describing chemical structure in SARs. The method is based on using atoms and their bond connectivities and is simple, powerful, and generally applicable to any SAR. The method also appears robust and suited to SAR problems difficult to model conventionally (21).
The most similar approaches to Progol are those of CASE (6), MULTICASE (7), and the symbolic machine learning approaches of Bahler and Bristol (8) and Lee (22). However, the Progol methodology is more general, as the other approaches are based on attributes and therefore have built-in limitations in representing structural relationships.
This article describes application of the Progol SAR method to predicting chemical carcinogenesis. Progol was first benchmarked on the test data of round 1 and then applied to produce predictions for round 2. The predictions for round 2 are completely blind trials. Such tests are very important because they ensure that the predictions are free from any conscious or unconscious bias.

Materials and Methods
Data
The compilation of 330 chemicals used in this study was taken from the literature (2,3,8) as well as directly from the collective database of the National Cancer Institute (NCI) and the NTP (1). The compounds used were all the organic compounds for which there were completed NTP reports at the time of this work. A listing of the compounds and their activities is given in Table 1. Inorganic compounds were not included because it was considered that there are too few of them to allow meaningful generalizations. Of the 330 compounds, 182 (55%) are classified as carcinogenic, and the remaining 148 as noncarcinogenic. Carcinogenicity is determined by analysis of long-term rodent bioassays. Compounds classified by the NTP as equivocal are considered noncarcinogenic; this allows direct comparison with other predictive methods. No analysis was made of differences in incidence between rat and mouse cancer, or the role of sex, or particular organ sites.
The Progol SAR method was first tested using the test data considered in the first round of the NTP trial (3). This allowed direct comparison with the results of many other SAR techniques (8). The training set consisted of 291 compounds, 161 (55%) carcinogens and 130 noncarcinogens. In addition to this train/test split, a 5-fold cross-validation split of the 330 compounds was tested for a more accurate estimate of the efficacy of Progol. The compounds were randomly split into five sets, and Progol was successively trained on four of the splits and tested on the remaining split.

Progol
In inductive logic programming (ILP) all the inputs and outputs are logical rules (23) in the computer language PROLOG. Such rules are easily understandable because they closely resemble natural language. For any application the input to Progol consists of a set of positive examples (i.e., for SAR, the active compounds), negative examples (i.e., nonactive compounds), and background knowledge about the problem (e.g., the atom/bond structure of the compounds) (Figure 1). The output of Progol is a hypothesis expressed as a set of rules that explain the positive and negative examples using the background knowledge. The rule found for each example is optimal in terms of simplicity (information compression) and the language used to describe the examples. This guarantee of optimality does not extend to sets of rules constructed by Progol, as it does not follow that a set of rules consisting of individually optimal rules is itself optimal for information compression. Information compression is defined as the difference in the amount of information needed to explain the examples with and without using the rule. It is statistically highly improbable that a rule with high compression does not represent a real pattern in the data (24).
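The 5-fold cross-validation protocol described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `five_fold_splits`, the integer compound identifiers 1 to 330, and the fixed random seed are hypothetical and are not taken from the study.

```python
import random


def five_fold_splits(compound_ids, k=5, seed=0):
    """Randomly partition the compounds into k sets and yield (train, test) pairs.

    Each set is used once as the test set while the other k - 1 sets
    together form the training set.
    """
    ids = list(compound_ids)
    random.Random(seed).shuffle(ids)          # random split, reproducible via seed
    folds = [ids[i::k] for i in range(k)]     # k disjoint sets of near-equal size
    for i in range(k):
        test = folds[i]
        train = [c for j, fold in enumerate(folds) if j != i for c in fold]
        yield train, test


# Hypothetical identifiers standing in for the 330 compounds.
splits = list(five_fold_splits(range(1, 331)))
for train, test in splits:
    print(len(train), len(test))
```

With 330 compounds each test set holds 66 compounds and each training set 264, and every compound is tested exactly once.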

Compound Representation for Progol
The generic atom/bond representation that we previously applied to mutagenesis was used (21). Two basic relations were utilized to represent structure: atom and bond. For example, for compound 1 (CAS no. 117-79-3), atom(127, 127_1, carbon, aromatic_carbon_&6ring, -0.133) states that in compound 127, atom no. 1 is of element carbon, and of type aromatic carbon in a 6-membered ring, and has a partial charge of -0.133. The type of the atom and its partial charge were taken from the molecular modeling package QUANTA™; any similar modeling package would also have been suitable. Similarly, bond(127, 127_1, 127_2, aromatic) states that in compound 127, atom no. 1 and atom no. 2 are connected by an aromatic bond. In QUANTA™ the assignment of partial charges is based on a specific molecular neighborhood; this has the effect that a specific molecular substructure can be identified by an atom type and partial charge. This relational representation is completely general for chemical compounds and no special attributes need to be invented. The structural information of these compounds was represented by ...
Information was also given about the results of Salmonella mutagenicity tests for each compound. The mutagenic compounds were represented by the relation ames, e.g., ames(127) states that compound 127 is mutagenic.
The Progol algorithm allows for the inclusion of complex background knowledge in the form of either facts or computer programs. This allows the addition, in a unified way, of any information that is considered relevant to learning the SAR. In general, the more that is known about a problem, the easier it is to solve. The ability to use a variety of background knowledge is perhaps the most powerful feature of Progol. In this study we included the background knowledge of chemical groups from our work on predicting mutagenesis (21), and the structural alerts identified by Ashby et al. (4) were also encoded and tested. It is important to appreciate that encoding PROLOG programs to define these concepts is not the same as including them as simple indicator variables. This is because Progol can learn SARs that use structural combinations of these groups, e.g., Progol could in theory learn that a structural indicator of activity is diphenylmethane (as a benzene single-bonded to a carbon atom single-bonded to another benzene). In contrast, a normal SAR method would only be able to use the absence or presence of the different groups, not a bonded combination of them. To represent compounds to the equivalent level of detail using a CASE-type representation (6) would require several orders of magnitude more descriptors than needed for only the simple atom/bond representation (21). In the future the background knowledge used could be extended to include more information, e.g., 3D structure, knowledge about metabolism, subchronic in vivo toxicity, route of administration, minimally toxic dose (MTD) levels, etc.
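To make the atom, bond, and ames relations concrete, the sketch below mirrors them as plain Python records. It is an illustration only: the study stores these as PROLOG facts, and the tuple names `Atom` and `Bond`, the helper `neighbours`, the simplified atom-type spelling, and the second atom's type and charge are assumptions introduced here.

```python
from collections import namedtuple

# One record per fact: atom(Compound, AtomId, Element, Type, Charge)
# and bond(Compound, AtomA, AtomB, BondType).
Atom = namedtuple("Atom", "compound atom_id element atom_type charge")
Bond = namedtuple("Bond", "compound atom_a atom_b bond_type")

atoms = [
    # First atom as in the text (atom-type name spelled in a simplified way).
    Atom(127, "127_1", "carbon", "aromatic_carbon_6ring", -0.133),
    # Hypothetical second atom: its type and charge are not given in the text.
    Atom(127, "127_2", "carbon", "aromatic_carbon_6ring", -0.133),
]
bonds = [Bond(127, "127_1", "127_2", "aromatic")]
ames_positive = {127}  # ames(127): compound 127 is mutagenic


def neighbours(atom_id, bond_facts):
    """Return (neighbour atom, bond type) pairs; bonds are treated as undirected."""
    result = []
    for b in bond_facts:
        if b.atom_a == atom_id:
            result.append((b.atom_b, b.bond_type))
        elif b.atom_b == atom_id:
            result.append((b.atom_a, b.bond_type))
    return result


print(neighbours("127_1", bonds))
print(127 in ames_positive)
```

Because every structural statement is just another record, no compound-specific attribute columns have to be designed in advance, which is the point made in the text about relational descriptions.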
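The diphenylmethane example can be illustrated with a small relational query. This is a sketch under assumptions, not the background knowledge used in the study: benzene rings are supplied as precomputed sets of atom identifiers (standing in for a PROLOG program that recognizes benzene), and all names and toy molecules are hypothetical.

```python
def has_diphenylmethane(atom_elements, single_bonds, benzene_rings):
    """True if a non-ring carbon is single-bonded to atoms of two different benzene rings.

    atom_elements: {atom_id: element}
    single_bonds:  set of frozenset({atom_a, atom_b})
    benzene_rings: list of sets of atom ids (assumed output of a ring detector)
    """
    def ring_of(atom):
        for index, ring in enumerate(benzene_rings):
            if atom in ring:
                return index
        return None

    for centre, element in atom_elements.items():
        if element != "carbon" or ring_of(centre) is not None:
            continue  # the bridging atom must be a carbon outside both rings
        rings_touched = set()
        for bond in single_bonds:
            if centre in bond:
                (other,) = bond - {centre}
                ring = ring_of(other)
                if ring is not None:
                    rings_touched.add(ring)
        if len(rings_touched) >= 2:
            return True
    return False


ring_a = {f"a{i}" for i in range(1, 7)}
ring_b = {f"b{i}" for i in range(1, 7)}
elements = {atom: "carbon" for atom in ring_a | ring_b | {"c7"}}

# Toy diphenylmethane: bridging carbon c7 bonded to one atom of each ring.
diphenylmethane_bonds = {frozenset({"c7", "a1"}), frozenset({"c7", "b1"})}
# Toy toluene-like case: only one ring is attached to the extra carbon.
toluene_elements = {atom: "carbon" for atom in ring_a | {"c7"}}
toluene_bonds = {frozenset({"c7", "a1"})}

print(has_diphenylmethane(elements, diphenylmethane_bonds, [ring_a, ring_b]))
print(has_diphenylmethane(toluene_elements, toluene_bonds, [ring_a]))
```

An indicator-variable description could only record that benzene is present; the query above additionally uses how the groups are bonded, which is the distinction drawn in the text.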

Other SAR Algorithms Compared with Progol
The train/test dataset has previously been studied using a number of SAR methods. We use the predictions from these methods and the predictions from two default methods to compare their results with those of Progol. The two default methods that we implemented were the following:
* The largest class prediction method is to predict all compounds to be carcinogenic (the largest class).
* The Ames prediction method is to predict a compound to be carcinogenic if it has any form of a positive Ames test.
The previously applied prediction methods that were compared with Progol can be placed into two groups. In the first group are the prediction methods that do not directly use data from experiments on rodents. The Progol SAR method belongs to this group and can be directly compared with such methods. These methods are as follows:
* The Bakale and McCreary method (9) used experimentally measured electrophilic reactivity (Ke) values to discriminate between carcinogenic and noncarcinogenic compounds.
* The DEREK method (deductive estimation of risk from existing knowledge) (5) is an expert system that predicts carcinogenesis based on a set of rules derived from experienced chemists.
* The COMPACT method (computer-optimized molecular parametric analysis of chemical toxicity) (10) predicts carcinogenesis based on the predicted interaction of the compound with cytochrome P450 and the Ah receptor.
* The CASE method (25) is based on a statistical method of selecting chemical substructures associated with carcinogenesis.
* The TOPKAT system (toxicity prediction by komputer [sic] assisted technology) (11) uses structural attributes to describe the compounds and applies statistical discrimination and regression to estimate the probability of carcinogenesis; it uses a number of noncarcinogenic pharmaceuticals and food additives to increase the number of negative examples.
* The Benigni method (12) forms a Hansch-type quantitative structure-activity relationship (QSAR) using estimated electrophilic reactivity (K) and Ashby's structural alerts (below).
The second group of prediction methods that have been previously applied uses information from biological tests on rodents. It is unfair to directly compare these procedures with methods based only on chemical structure and Salmonella mutagenicity since they use more information. Rodent biological tests are very expensive both in money and animal welfare terms. The prediction methods that use rodent biological tests are as follows:
* The Ashby prediction method (3) ...
Progol is marginally the most accurate prediction method that does not use rodent tests (although this is not statistically significant). The more accurate prediction methods of Ashby, TIPT, and RASH are based on use of short-term rodent in vivo tests. This information is much more difficult and expensive to obtain than chemical structural and Salmonella mutagenicity data. The Ashby and RASH methods are also based on the subjective application of a set of structural alerts formed by Ashby et al. (4); the TIPT method uses an objective application of these expert-defined alerts.
A number of the errors in prediction made by Progol were repeated by most other methods, suggesting some anomaly with these compounds (14): methylphenidate hydrochloride and methyl bromide. Progol correctly identified naphthalene as a carcinogen, while it was missed by all other methods.

Cross-validation Results
Progol has an accuracy of 63% (standard error ± 3%) for all compounds estimated by 5-fold cross validation. This compares with estimated accuracies of 55% using the default rule, and 63% using the Ames rule. There is a significant difference at p < 0.05 between the accuracy of Progol and the default rule. Although there is no significant difference in accuracy between Progol and the Ames rule, there is a large difference in the number of carcinogens identified. Progol makes fewer errors of omission than the Ames rule and more errors of commission, i.e., Progol identifies more carcinogens than the Ames rule at the cost of classifying more noncarcinogens as carcinogens.

Method          Accuracy (%)   Cover   (PP, PN, NP, NN)
...                                    (16,9,5,7)
DEREK (5)       57             37      (12,8,8,9)
COMPACT (10)    54             37      (14,10,7,6)
TOPKAT (11)     54             26      (6,3,9,8)
CASE (25)       49             37      (11,9,10 ...
...                                    (19,11,2,7)
Terms and abbreviations: Accuracy = number correctly predicted/total number predicted. Cover = number of compounds predicted. PP = predicted to be carcinogenic and is carcinogenic; PN = predicted to be carcinogenic and is not carcinogenic; NP = predicted to be not carcinogenic and is carcinogenic; NN = predicted to be not carcinogenic and is not carcinogenic. Default methods are those that use simplistic prediction strategies. The basic methods are those that use information solely from chemical structure and Salmonella mutagenicity tests. The complex methods use information from rodent biological tests; the Ashby and RASH methods also exploit expert chemical knowledge and are therefore not automatic.
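The accuracy definition given in the table legend can be checked against the reported (PP, PN, NP, NN) counts for the rows that survive intact. The sketch below does this and also computes a binomial standard error for the cross-validated accuracy; using the simple binomial formula for the ± 3% figure is an assumption made here, since the text does not state how the standard error was obtained. Function names are hypothetical.

```python
from math import sqrt


def accuracy(pp, pn, np_, nn):
    """Accuracy = number correctly predicted / total number predicted."""
    return (pp + nn) / (pp + pn + np_ + nn)


# (PP, PN, NP, NN) counts as reported for three of the compared methods.
reported = {
    "DEREK": (12, 8, 8, 9),
    "COMPACT": (14, 10, 7, 6),
    "TOPKAT": (6, 3, 9, 8),
}
for name, counts in reported.items():
    cover = sum(counts)  # number of compounds the method predicted
    print(name, cover, round(100 * accuracy(*counts)))


def binomial_standard_error(p, n):
    """Standard error of a proportion p estimated from n independent predictions."""
    return sqrt(p * (1 - p) / n)


# Assumed formula applied to the 63% accuracy over the 330 compounds.
print(round(100 * binomial_standard_error(0.63, 330), 1))
```

The recomputed cover and rounded accuracy agree with the tabulated values for all three methods, and the binomial estimate of about 2.7 percentage points is consistent with the quoted standard error of ± 3%.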

Rules
The Progol SAR method produces prediction rules in the form of easily understood chemical patterns. The prediction rules are given in Figures 2 and 3. There is a direct translation from the rules generated by Progol into chemical structure. For example, rule 3 in PROLOG notation is: ... atom(Drug, Atom_1, Element_1, ester_carbon, Charge1) and atom(Drug, Atom_2, Element_2, aromatic_hydrogen, Charge2) and less_than_or_equal(Charge2, 0.041) (names with capital letters are variables).
The particular use of partial charges requires some explanation. They are given to three significant places because of a peculiarity in the method of assigning partial charges in QUANTA™ (above), not because it is considered that these exact values are important to this accuracy.
It is important that rules produced by any automatic SAR procedure are screened to ensure that they make chemical sense. More confidence can be put in a rule if a mechanism of action can be identified (27)(28)(29). This is a general application of the principle of using prior knowledge to guide decision making. All the rules found by Progol were analyzed to try to identify their chemical rationale.
It was found that use of the Ames test for Salmonella mutagenicity (rule 1) was the most effective rule for predicting carcinogenicity. While learning rule 1, Progol automatically searched for structural features that improved rule 1 and no such rule was found that had higher compression than rule 1 (recall that compression is an objective way of balancing sensitivity/specificity of a rule). This does not conflict with the results of Ashby and Tennant (4), who showed that the Ames test was correlated with a set of structural alerts.
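The conditions quoted for rule 3 translate directly into a test on a compound's atom facts. The sketch below is an illustration only: it checks the two atom conditions and the charge threshold given in the text, the function name `rule3_fires` and both toy compounds are hypothetical, and the charges are invented for the example.

```python
def rule3_fires(atoms):
    """Check the quoted conditions of rule 3 for one compound.

    atoms: list of (atom_type, partial_charge) pairs for the compound.
    The rule requires an atom of type ester_carbon and an atom of type
    aromatic_hydrogen whose partial charge is <= 0.041.
    """
    has_ester_carbon = any(atom_type == "ester_carbon" for atom_type, _ in atoms)
    has_low_charge_aromatic_h = any(
        atom_type == "aromatic_hydrogen" and charge <= 0.041
        for atom_type, charge in atoms
    )
    return has_ester_carbon and has_low_charge_aromatic_h


# Hypothetical compounds with invented charges.
aromatic_ester = [("ester_carbon", 0.703), ("aromatic_hydrogen", 0.030)]
aliphatic_ester = [("ester_carbon", 0.703), ("aliphatic_hydrogen", 0.030)]
high_charge_case = [("ester_carbon", 0.703), ("aromatic_hydrogen", 0.142)]

print(rule3_fires(aromatic_ester))
print(rule3_fires(aliphatic_ester))
print(rule3_fires(high_charge_case))
```

Only the first toy compound satisfies all conditions, which shows how the charge threshold acts as a modifier on the ester group rather than the ester group being predictive on its own.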
The remaining rules found by Progol are new and are automatically generated structural alerts for carcinogenesis. As Progol removes examples covered by previous rules when searching for a new rule, rules found after rule 1 are indicators for carcinogenic compounds not recognized by the Ames test. This means that they could be either structural alerts for nongenotoxic carcinogenesis (4) (i.e., not based on induction of DNA damage by the test agent or its metabolites), or structural alerts for genotoxic carcinogens that are missed by the Ames test. Most of the structural features identified by Progol appear to be for highly reactive structures, suggesting that they mainly act by genotoxic carcinogenesis. Chemical interpretations of the rules are given below (arranged by chemical group):
* Rules 2 and 3 identify ester groups as indicators for carcinogenesis. The meaning of the modifying groups is unclear, but they are essential, as ester groups on their own have no discriminatory power. Rules 2, 6, and 11 use the generic background knowledge that was first used in applying Progol to predicting mutagenesis (21).
* Rule 4 is concerned with ether oxygens with high partial charges. All such groups are bonded to aromatic rings, suggesting the involvement of electrophilic substitution in activity.
* Rule 5 identifies an ether group in a 6-membered ring. These cyclic ethers may also be involved in electrophilic reactions.
* Rules 6 and 7 identify reactive halides as indicators of carcinogenesis; such compounds have been widely recognized as potential carcinogens.
* Rule 8 identifies an aldehyde group as an indicator of carcinogenicity. Aldehyde groups are potentially very reactive.
* In rule 9, the aromatic amine group indicates high reactivity, as does the low partial charge on the unsaturated carbon (it is associated with a double bond to an oxygen group).
* The high partial charge on the unsaturated carbon in rule 10 occurs in reactive alkenes.
* Rule 11 occurs in substituted cyclohexenes; note the similarity with rule 5.
* Rule 12 occurs when a 6-membered aromatic ring is bonded to a nonaromatic ring.
* Rule 13 occurs when a carbon atom in a single 6-membered aromatic ring is bonded to an amine or carbon-substituted amine group.
* Rule 15 occurs in chlorinated alkane groups; see rule 6.
* Rule 16 occurs when a hydroxyl group is attached to an aliphatic carbon.
* In rule 17 the indicator of a halide atom is attached to a tetrahedral carbon. This is the only rule that uses the structural alerts from Ashby et al. (4). It is possible that this rule is an artifact, since there appears to be no chemical reason why 4 halide atoms should be chosen instead of, say, 3 or 5.
* Rules 14 and 18 may also be artifacts because there appears to be no chemical rationale for them.
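The statement that Progol removes examples covered by earlier rules before searching for the next one can be sketched as a covering loop. This is a simplification under assumptions: in Progol each rule is learned from the remaining examples, whereas here the rules are supplied in a fixed order, and the feature labels and compounds are hypothetical.

```python
def sequential_covering(positives, ordered_rules):
    """Apply rules in order, crediting each rule only with positives not yet covered.

    positives:     iterable of compound ids known to be carcinogenic
    ordered_rules: list of (rule_name, predicate) in the order they were found
    Returns ({rule_name: newly covered ids}, ids left uncovered).
    """
    remaining = set(positives)
    newly_covered = {}
    for name, predicate in ordered_rules:
        covered = {compound for compound in remaining if predicate(compound)}
        newly_covered[name] = covered
        remaining -= covered  # later rules only see what is still unexplained
    return newly_covered, remaining


# Hypothetical carcinogens described by coarse feature labels.
features = {
    "c1": {"ames"},
    "c2": {"ames", "ester"},
    "c3": {"ester"},
    "c4": {"aldehyde"},
    "c5": set(),
}
rules = [
    ("rule 1 (Ames)", lambda c: "ames" in features[c]),
    ("ester alert", lambda c: "ester" in features[c]),
    ("aldehyde alert", lambda c: "aldehyde" in features[c]),
]

covered_by, uncovered = sequential_covering(features, rules)
for name, compounds in covered_by.items():
    print(name, sorted(compounds))
print("uncovered", sorted(uncovered))
```

In this toy run the ester alert is credited only with c3, because c2 was already explained by the Ames rule; this is why alerts found after rule 1 point to carcinogens that the Ames test misses.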

Discussion
Prediction of Results of Ongoing NTP Studies
The Progol rules were used to predict the compounds in the second round of the NTP test of strategies for predicting chemical carcinogenesis in rodents; the 25 organic compounds were predicted but no prediction was made for the 5 inorganic compounds. ... and those experimentally predicted will indicate how relevant the assumptions underlying the Progol predictions are. A major statistical problem in trying to predict the results of this NTP trial is that the distribution of compounds in the trial is not the same as that from which the rules were learned, e.g., the percentage of compounds with positive Ames tests is only 16% (4 of 25) compared with 42% for the compounds previously tested.
The change in distribution between training and test data for NTP trials has previously been noted by Tennant et al. (3). This is called concept drift in machine learning and it is a problem because almost all statistical methods are based on the assumption of an underlying constant distribution.
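One way to gauge the size of this shift is to ask how likely 4 or fewer Ames-positive compounds out of 25 would be if the new compounds were drawn from the earlier distribution with a 42% positive rate. The sketch below computes that binomial tail probability; treating the 25 compounds as an independent random sample is an assumption made here for illustration and is not an analysis reported in the study.

```python
from math import comb


def binomial_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), computed exactly from the definition."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


# 4 of the 25 round-2 compounds are Ames positive; earlier compounds: 42%.
tail_probability = binomial_cdf(4, 25, 0.42)
print(tail_probability)
```

A small tail probability would indicate that the drop from 42% to 16% is unlikely to be sampling noise alone, which is the concern behind the concept-drift remark.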

Comparison with Other SAR Results
The estimated accuracy of 63% for predicting carcinogenesis by Progol is higher but not statistically significantly higher than the results obtained using other SAR methods that do not incorporate results from rodent biological tests. This confirms the results of Benigni (15), who showed that all the SAR approaches to carcinogenicity had similar prediction profiles. The relatively low prediction accuracy of approximately 60% is probably due to the diversity of mechanisms of action and the complexity of interactions in vivo.

Comparison of the Ashby et al. Structural Alerts and Those Generated by Progol
The Ashby et al. structural alerts (2)(3)(4)(28) and those generated by Progol differ fundamentally in their formation and application. The Ashby alerts were generated by a human expert and applied subjectively. The Progol alerts were generated automatically by machine and are applied objectively. The Ashby structural alerts are based on electrophilic attack on DNA. This means that they are not statistically independent of the Ames test (4), and there is some redundancy between the Ames test and the structural alerts. The Progol structural alerts were selected so that they covered compounds not covered by the Ames test. This makes them much more independent of each other than those of Ashby. Many of the structural alerts found by Progol are similar to those identified by Ashby, e.g., Ashby recognizes forms of esters (rules 2 and 3), ethers (rules 4 and 5), halogenated compounds (rules 6, 7, and 15), and aldehydes (rule 8) as structural alerts. The exact forms of the alerts differ significantly between Ashby and Progol. This strongly suggests that it may be possible to develop a system for predicting chemical carcinogenesis that combines the best features of human-based prediction with the objectivity and speed of the Progol rules to develop a superior SAR system.
The results for prediction of carcinogenesis, taken together with the previous successful applications of predicting mutagenicity in nitroaromatic compounds and inhibition of angiogenesis by suramin analogues, show that Progol has a role to play in understanding the SARs of cancer-related compounds.
The Progol algorithm starts by randomly selecting a positive example, e.g., drug number 3: high(drug3). The order of example selection does not affect the final theory, only the efficiency of the learning process. Progol generalizes the example using inverse resolution to construct the most specific rule that explains the example in terms of the background knowledge. This rule logically implies the original example. The rule is high(X) :- subst1(X), not subst2(X), subst3(X). In plain language, this rule states that a drug has high activity if it has a substitution at position 1 and it does not have a substitution at position 2 and it has a substitution at position 3.
This rule covers 1 positive example and no negative examples. Progol further generalizes this rule by removal of redundant parts of its body (literals) to find the maximally compressive rule using a complete top-down search. The most compressive rule for our example is high(X) :- subst1(X), not subst2(X). This rule is optimal in terms of compression and the descriptive language. The final theory produced is: a drug has high activity if it has a substitution at position 1 or it has a substitution at position 2. This is called the exclusive or rule.
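The generalization step described above, dropping literals from the most specific rule to find the most compressive one, can be sketched on toy data. Several things here are assumptions rather than Progol's actual procedure: the score (positives covered minus negatives covered minus number of literals) is a simple stand-in for Progol's compression measure, exhaustive enumeration of literal subsets replaces its top-down search, and the drugs and their substituent positions are invented.

```python
from itertools import combinations

# Body literals of the most specific rule, as tests on a drug's substituted positions.
LITERALS = {
    "subst1(X)": lambda positions: 1 in positions,
    "not subst2(X)": lambda positions: 2 not in positions,
    "subst3(X)": lambda positions: 3 in positions,
}

# Hypothetical drugs: the set lists the positions that carry a substituent.
positives = {"drug1": {1}, "drug3": {1, 3}, "drug5": {2}, "drug7": {1}, "drug8": {1, 3}}
negatives = {"drug2": {1, 2}, "drug4": set(), "drug6": {3}, "drug9": {1, 2, 3}}


def covers(body, positions):
    """A rule body covers a drug if every literal in it holds for that drug."""
    return all(LITERALS[literal](positions) for literal in body)


def compression(body):
    """Assumed score: positives covered - negatives covered - rule length."""
    p = sum(covers(body, positions) for positions in positives.values())
    n = sum(covers(body, positions) for positions in negatives.values())
    return p - n - len(body)


def most_compressive(most_specific_body):
    """Search every subset of the most specific body for the best score."""
    candidates = [
        subset
        for size in range(len(most_specific_body) + 1)
        for subset in combinations(most_specific_body, size)
    ]
    return max(candidates, key=compression)


best = most_compressive(("subst1(X)", "not subst2(X)", "subst3(X)"))
print("high(X) :- " + ", ".join(best))
```

On this toy data the search drops subst3(X) and keeps the other two literals. Any subset of the most specific body still covers the seed example drug3, so generalization by literal removal never loses the example the rule was built from.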