Prediction of the rodent carcinogenicity of organic compounds from their chemical structures using the FALS method.

Fuzzy adaptive least-squares (FALS), a pattern recognition method recently developed in our laboratory for correlating structure with activity rating, was used to generate quantitative structure-activity relationship (QSAR) models on the carcinogenicity of organic compounds of several chemical classes. Using the predictive models obtained from the chemical class-based FALS QSAR approach, the rodent carcinogenicity or noncarcinogenicity of a group of organic chemicals currently being tested by the U.S. National Toxicology Program was estimated from their chemical structures.


Introduction
The prediction of carcinogenicity has become a subject of great importance for regulatory purposes and ecotoxicity assessments. Prediction from the chemical structure alone is especially desirable, since it can be used even when a test compound is unavailable or does not exist. Approaches using correlative methods for noncongeneric chemicals were reviewed by Richard (1), who found that published prediction accuracies were in excess of 90%, whereas prospective prediction accuracies were less than 70%. Moreover, worse results were published for a prospective prediction of rodent carcinogenicity using a variety of quantitative structure-activity relationship (QSAR) approaches (2). Further studies are required to improve the predictive reliability.
We have recently developed fuzzy adaptive least-squares (FALS) (3,4), a pattern recognition method for correlating structure with activity rating, and applied this method to a noncongeneric structure-carcinogenicity correlation (5). Ideally, rational preclassification of compounds based on possible carcinogenic mechanisms should be extensively investigated to enhance the predictive accuracy of noncongeneric QSAR approaches. Unfortunately, knowledge of the molecular mechanisms of carcinogenicity is still insufficient for this purpose. In this study, a rough chemical classification was therefore adopted to generate the predictive models. Using carcinogenicity data from the International Agency for Research on Cancer (IARC) (6) and the National Toxicology Program (NTP) (7,8) as training sets, FALS QSAR models for eight chemical classes were generated. Based on these models, prospective predictions of rodent carcinogenicity were made for 25 organic chemicals issued by the National Institute of Environmental Health Sciences (NIEHS).

FALS Methodology
FALS is a nonparametric pattern classifier. It formulates QSAR in a single discriminant function irrespective of the number of activity rating classes:

$$Z = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_p x_p \qquad [1]$$

In this equation, $x_k$ is the $k$th descriptor ($k = 1, 2, \ldots, p$) for structures, and $w_k$ ($k = 0, 1, \ldots, p$) is the corresponding weight coefficient. [...]

For noncarcinogenic activity,

$$M(Z) = \frac{1}{1 + \{(\mathrm{Boundary} - Z)/0.1 - 1\}^{4}} \quad \text{when } Z \ge \mathrm{Boundary} - 0.1, \quad \text{otherwise } M(Z) = 1 \qquad [3]$$

In these equations, Boundary takes the value of $(n_1 - n_2)/(n_1 + n_2)$, where $n_1$ and $n_2$ are the numbers of noncarcinogens and carcinogens, respectively, in the training set. The calculated value of $M(Z)$ is the membership grade. The weight coefficients in the discriminant function are generated by an adaptive least-squares iteration so as to maximize the sum of the membership grades over the set of compounds. The resultant discriminant functions, which contain various descriptors, are validated by leave-one-out prediction. The discriminant function with a scientifically reasonable set of structural descriptors that gives the best leave-one-out prediction is finally adopted as the QSAR model. The FALS methodology has been described on a number of occasions (3-5).
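As a minimal sketch of how the discriminant score and membership grade fit together, the following Python functions evaluate Equation 1, the Boundary value, and Equation 3. The function names and the `width` parameter are illustrative and not part of the FALS software. The carcinogenic-side membership function is written here as the mirror image of Equation 3, which is an assumption rather than a formula taken from this text.

```python
def discriminant(weights, descriptors):
    """Equation 1: Z = w0 + w1*x1 + ... + wp*xp.

    `weights` holds w0..wp (length p + 1); `descriptors` holds x1..xp.
    """
    w0, *ws = weights
    if len(ws) != len(descriptors):
        raise ValueError("need exactly one weight per descriptor plus w0")
    return w0 + sum(w * x for w, x in zip(ws, descriptors))


def boundary(n_noncarcinogens, n_carcinogens):
    """Boundary = (n1 - n2) / (n1 + n2) for the training set."""
    total = n_noncarcinogens + n_carcinogens
    return (n_noncarcinogens - n_carcinogens) / total


def membership_noncarcinogen(z, b, width=0.1):
    """Equation 3: membership grade of a noncarcinogen with score z."""
    if z < b - width:
        return 1.0
    return 1.0 / (1.0 + ((b - z) / width - 1.0) ** 4)


def membership_carcinogen(z, b, width=0.1):
    """ASSUMPTION: mirror image of Equation 3 for carcinogens."""
    if z > b + width:
        return 1.0
    return 1.0 / (1.0 + ((z - b) / width - 1.0) ** 4)
```

Under these definitions a compound whose score lies exactly on the boundary receives a grade of 0.5, and the grade falls toward 0 as the score moves further onto the wrong side.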

Database and Chemical Classes
A database of 586 compounds, listed in Table 1, was used for the training sets. The compounds had been designated as carcinogenic or noncarcinogenic by IARC (6) and/or NTP (7,8) based upon evaluation of rodent test data. If the two agencies' carcinogenicity/noncarcinogenicity assignments differed for any given compound, the NTP designation was adopted. Compounds giving equivocal evidence of carcinogenicity were not used. Inorganic and metallo-organic chemicals, polymers, and mixtures were also excluded from the training sets. The chemical classification was designed to be broad enough to permit a reasonable number of training compounds to fall into each class for generation of statistically significant QSAR models. With special reference to the chemical features of the compounds to be predicted, the following eight chemical classes were investigated: class 1, hydrocarbons (39 compounds); class 2, heterocyclics (185 compounds); class 3, nitro and nitroso compounds and N-oxides (98 compounds); class 4, halides (152 compounds); class 5, alcohols, phenols, and ethers (160 compounds); class 6, carbonyl compounds (205 compounds); class 7, nonaromatic amines (25 compounds); and class 8, oxygenated sulfur compounds (52 compounds). An individual compound can appear in several classes according to its chemical structure. 2,3,5,6-Tetrachloro-4-nitroanisole, for example, appears in classes 3, 4, and 5.
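The multiple-class assignment can be illustrated with a short Python sketch. The feature labels (such as "nitro" or "ether") are hypothetical names chosen for this example. The step that recognizes these features from a chemical structure is not shown and is assumed to be done separately.

```python
# Hypothetical feature labels mapped to the eight chemical classes.
CLASS_FEATURES = {
    1: ("hydrocarbon",),
    2: ("heterocycle",),
    3: ("nitro", "nitroso", "n_oxide"),
    4: ("halide",),
    5: ("alcohol", "phenol", "ether"),
    6: ("carbonyl",),
    7: ("nonaromatic_amine",),
    8: ("oxygenated_sulfur",),
}


def chemical_classes(features):
    """Return every class number whose defining features occur in `features`."""
    present = set(features)
    return sorted(
        number
        for number, labels in CLASS_FEATURES.items()
        if any(label in present for label in labels)
    )


# 2,3,5,6-Tetrachloro-4-nitroanisole: nitro group, chlorine atoms, methoxy ether.
print(chemical_classes({"nitro", "halide", "ether"}))  # [3, 4, 5]
```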

Structural Descriptors
Three kinds of variables (continuous, discrete, and indicator) were investigated as candidate descriptors. Molecular weight, hydrophobic constant (log P), and its squared value were used as continuous variables. The log P (octanol/water) values were calculated using the revised version (10) of our simple method (11,12). Discrete variables were defined as the number of specific atoms, bonds, functional groups, and specific ring and chain structures. The upper values of the discrete variables other than the number of specific atoms and bonds were empirically set at 3.0 to avoid possible overestimation for polyfunctional structures. Indicator variables were defined as 1 for the presence and 0 for the absence of any kind of structural or physicochemical feature considered to contribute to carcinogenicity.
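A minimal Python sketch of how one descriptor row might be assembled from the three kinds of variables follows. The function name, argument names, and dictionary keys are illustrative assumptions. The log P value is taken as an input here and is not computed by the method of references 10-12.

```python
def build_descriptors(mol_weight, log_p, atom_bond_counts, group_counts,
                      indicators, cap=3.0):
    """Assemble continuous, discrete, and indicator descriptors for one compound.

    atom_bond_counts: counts of specific atoms and bonds (not capped).
    group_counts: counts of functional groups, rings, and chains (capped at `cap`).
    indicators: feature name -> True/False for presence/absence.
    """
    row = {
        "MW": float(mol_weight),
        "logP": float(log_p),
        "logP^2": float(log_p) ** 2,
    }
    for name, count in atom_bond_counts.items():
        row[name] = float(count)
    for name, count in group_counts.items():
        # Upper value of 3.0 avoids overweighting polyfunctional structures.
        row[name] = min(float(count), cap)
    for name, present in indicators.items():
        row[name] = 1.0 if present else 0.0
    return row
```

For example, a compound with five hydroxyl groups would receive a hydroxyl descriptor of 3.0, whereas a count of four chlorine atoms would be kept as 4.0.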

Generation of Predictive Models
The FALS analyses were performed for carcinogenic/noncarcinogenic dichotomization using eight sets of data for the various chemical classes. As a result, eight satisfactory equations containing from 5 to 25 descriptors (Moriguchi et al., unpublished data) were derived. They are listed in Table 2.
Descriptors with positive coefficients are usually considered to contribute in a [...] However, this is not always valid beyond the chemical classes. Moreover, strictly speaking, these coefficients cannot be used to make general inferences about the contribution of each fragment within a variety of structures. They are valid only when used in the context of the present multidimensional model within each chemical class.
The results of recognition and leave-one-out prediction for the eight QSAR models are shown in Table 3. The values of the mean membership grade were fairly good, ranging from 0.860 to 0.949 in recognition and from 0.783 to 0.923 in leave-one-out prediction. The false-negative rate ranged from 1.6 to 5.8% in recognition and from 3.1 to 8.0% in leave-one-out prediction. These equations were then used for the carcinogenicity prediction of 25 organic chemicals.
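The leave-one-out procedure itself is independent of the classifier, and a generic Python sketch may make the distinction between recognition and leave-one-out prediction concrete. The `fit_midpoint` and `predict_midpoint` functions below are a deliberately trivial stand-in model on a single descriptor. They are not the FALS adaptive least-squares fit and serve only to make the example runnable.

```python
def leave_one_out(xs, ys, fit, predict):
    """Predict each compound from a model fitted to all the other compounds."""
    predictions = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        model = fit(train_x, train_y)
        predictions.append(predict(model, xs[i]))
    return predictions


def fit_midpoint(xs, ys):
    """Stand-in model (NOT FALS): threshold midway between the two class means."""
    neg = [x for x, y in zip(xs, ys) if y == 0]
    pos = [x for x, y in zip(xs, ys) if y == 1]
    return (sum(neg) / len(neg) + sum(pos) / len(pos)) / 2.0


def predict_midpoint(threshold, x):
    """Label 1 (carcinogen) above the threshold, 0 (noncarcinogen) otherwise."""
    return 1 if x > threshold else 0


xs = [0.0, 1.0, 2.0, 8.0, 9.0, 10.0]
ys = [0, 0, 0, 1, 1, 1]
print(leave_one_out(xs, ys, fit_midpoint, predict_midpoint))
```

Recognition corresponds to fitting once on all compounds and scoring those same compounds, which is why its figures are generally more optimistic than the leave-one-out figures.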

Prospective Prediction of the Organic Chemicals
The second NIEHS Predictive-Toxicology Evaluation Project involves the rodent carcinogenicity of 30 chemicals consisting of 25 organic and 5 inorganic compounds. The five inorganic compounds were omitted from our FALS prediction because sufficient carcinogenicity data for inorganic chemicals were not available for generating predictive QSAR models. The prediction of the 25 organic compounds was performed using the QSAR models for the eight chemical classes listed in Table 2. Salts such as scopolamine hydrobromide trihydrate and sodium xylenesulfonate were treated as undissociated forms. The results are shown in Table 4.
On the basis of their chemical features, compounds 1 (scopolamine) and 2 (codeine) fall into three chemical classes, and compounds 5 (tetrahydrofuran), 10 (D&C Yellow No. 11), 13 (1-chloro-2-propanol), 14 (diethanolamine), 15 (phenolphthalein), 18 (furfuryl alcohol), 19 (primaclone), 24 (oxymetholone), and 26 (emodin) fall into two chemical classes. When there were discrepancies between the estimates by two or three QSAR models, we evaluated them as "equivocal." Among the 25 organic chemicals, 14 were predicted to be positive, 5 equivocal, and 6 negative for carcinogenicity. Further detailed predictions by the correlative method are thought to be unreliable, since there are not sufficient data concerning mechanisms and sites of tumor formation for a wide variety of chemicals to generate statistically significant QSAR models.
In these predictions, the mutagenicity and subchronic toxicity test data were not considered. Prediction based on the QSAR models can be performed in a very short time at a very low cost, and it can be used even when the test compound does not exist. Unfortunately, the first round of this exercise showed that the results obtained by the correlative methods were not very good (2). The predictive power of correlative methods depends significantly upon the quality and quantity of the training set data used. Sufficient high-quality data covering a large variety of chemical structures, as well as the use of mechanism-based descriptors, will enhance the prospective prediction accuracies of the QSAR approaches.
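The rule for reconciling the estimates of several class models can be written as a short Python sketch. The function name and the string labels "positive", "negative", and "equivocal" are illustrative choices for this example.

```python
def combine_predictions(class_predictions):
    """Combine per-class model calls into a single overall call.

    class_predictions: one "positive" or "negative" call per applicable
    class model. Any disagreement between models is reported as "equivocal".
    """
    calls = set(class_predictions)
    if not calls:
        raise ValueError("at least one class-model prediction is required")
    if len(calls) == 1:
        return calls.pop()
    return "equivocal"


print(combine_predictions(["positive", "positive"]))              # positive
print(combine_predictions(["positive", "negative", "positive"]))  # equivocal
```

A compound that falls into a single chemical class simply keeps the call of that one model.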