Integration of QSAR and SAR methods for the mechanistic interpretation of predictive models for carcinogenicity

The knowledge-based Toxtree expert system (SAR approach) was integrated with the statistically based counter propagation artificial neural network (CP ANN) model (QSAR approach) to contribute to a better mechanistic understanding of a carcinogenicity model for non-congeneric chemicals that uses Dragon descriptors and carcinogenic potency for rats as a response. The transparency of the CP ANN algorithm was demonstrated using its intrinsic mapping technique, specifically Kohonen maps. Chemical structures were represented by Dragon descriptors that express the structural and electronic features of molecules, such as their shape and the electronic environment related to molecular reactivity. It was illustrated how the descriptors are correlated with particular structural alerts (SAs) for carcinogenicity that have a recognized mechanistic link to carcinogenic activity. Moreover, the Kohonen mapping technique enables one to examine the separation of carcinogens and non-carcinogens (for rats) within a family of chemicals with a particular SA for carcinogenicity. The mechanistic interpretation of models is important for the evaluation of the safety of chemicals.


Introduction
Carcinogenicity is among the toxicological endpoints that pose the highest public concern. The standard bioassays in rodents used to assess the carcinogenic potency of chemicals are time-consuming, costly and require the sacrifice of a large number of animals. Cancer bioassays should be reduced according to the EU regulation REACH (Registration, Evaluation, Authorisation and restriction of Chemicals) [1], while the Seventh Amendment to the EU cosmetics directive will ban the bioassay for cosmetic ingredients from 2013 [2].
For these reasons, there is a need for alternative methods for carcinogenicity testing. Quantitative structure-activity relationship (QSAR) methods can contribute to a reduction in animal studies. To be accepted for regulatory use, a QSAR model should follow the five OECD principles [3]. The fifth principle is that models should have a mechanistic interpretation, if possible. The goal of the mechanistic interpretation of QSAR models is to find the relationship between descriptors and the endpoint and to incorporate mechanistic understanding and/or biological information.
It is a challenge to provide a mechanistic interpretation for models that predict the carcinogenic potency of different classes of chemicals (so-called non-congeneric chemicals), as these comprise a wide diversity of molecular structures related to a variety of biological mechanisms.
Both statistically based and knowledge-based methods are used in carcinogenicity models for the prediction of non-congeneric chemicals [4][5][6][7]. The statistically based methods (MultiCASE, Leadscope, TOPKAT, LAZAR and CAESAR systems) rely on techniques such as multivariate analysis, rule induction, artificial intelligence, cluster analysis and pattern recognition. They deal with limited or no prior chemical or biological classification according to the mechanism of carcinogenicity [8]. The knowledge-based (or rule-based) methods (HazardExpert, OncoLogic, Toxtree and DEREK systems) include toxicological knowledge, expert judgment and fuzzy logic, taking into consideration toxicokinetics, toxicodynamics and metabolism related to processes with cellular macromolecules or receptors. Each of the above-mentioned approaches has strengths and limitations that are described in the literature [7]. The knowledge-based approaches provide an opportunity to gain insight into the mechanism underlying carcinogenicity. The main advantage of statistically based models is their higher accuracy of prediction.
In this study we have combined the QSAR (statistically based) and SAR (knowledge-based) approaches. The QSAR model provided information about the association between chemical features (expressed as chemical descriptors) and the endpoint being predicted (carcinogenicity in our case), while the knowledge-based Toxtree expert system [9] was employed as a supporting tool to interpret the results in terms of the possible mechanism of carcinogenic activity of the studied chemicals (encoded in the carcinogenic SAs). The mechanistic basis of the QSAR model was determined a posteriori (after the modelling) by interpreting the final set of training structures and descriptors. The descriptors belong to the topological, electrotopological and hydrogen bonding classes; they express different aspects of the shape and size of molecules, encode information about the electronic interactions of atoms and capture features of the electrostatic interaction between molecules.
We considered a counter propagation artificial neural network (CP ANN) model for the prediction of carcinogenicity containing twelve Dragon descriptors. From a statistical point of view the model was correlated with carcinogenicity and showed good recall ability and acceptable accuracy of prediction (69%), as reported in [10]. We then used the selected descriptors correlated with carcinogenicity for the prediction of SAs for carcinogenicity. The mapping technique inherent to CP ANN (Kohonen maps) was applied to view the distribution of chemicals, individual descriptors (in weight level maps), carcinogenic potency (Yes/No) and SAs for carcinogenicity in the same 2D space.
The integration of the CP ANN mapping technique with the decision-tree-based Toxtree module for carcinogenicity makes it possible to obtain a mechanistic interpretation of a CP ANN QSAR model. In this paper we study the correlation between the statistically selected descriptors and the carcinogenic potency, as well as the possible mechanism of carcinogenic action (encoded in SAs for carcinogenicity).

Data and Methods
The 805 chemicals extracted from an initial dataset of 1481 chemicals (taken from the Distributed Structure-Searchable Toxicity (DSSTox) Public Database Network, http://www.epa.gov/ncct/dsstox/sdf_cpdbas.html) were used for modelling. The carcinogenic potency for rats was selected as the response (see [10]). The information about structural alerts for carcinogenicity and the type of alert (genotoxic alert (GA), non-genotoxic alert (nGA) or no alert (NA)) for each chemical in the dataset was taken from the Toxtree expert system.
The diversity of the dataset, with the proportion of P (positive, carcinogen) and NP (non-positive, non-carcinogen) chemicals for rats among chemicals with NA, GA and nGA, is presented in Figure 1 and in Table S1 in the Supplement Material section.
One should keep in mind that the carcinogenic potency of chemicals in the Toxtree knowledge-based system is based on observations in humans collected through epidemiological studies and on studies in animals, while in the CP ANN model we considered the carcinogenic potency for rats.
The list of 33 SAs for carcinogenicity is reported by Benigni and Bossa [9] in the Toxtree rulebase. Carcinogenic SAs are functional groups or molecular substructures that have been mechanistically and/or statistically associated with the induction of cancer. In a broad sense, the set of chemicals characterised by the same SA can be regarded as a family of compounds with the same mechanism of action. A recent review [11] presents information about chemical classes with a recognized mechanistic link to carcinogenicity, coded as SAs in the Toxtree 2.1.0 software. The SAs used in the CP ANN model, together with the number of chemicals in the dataset corresponding to each SA, are listed in Table 1. This table also contains the number of carcinogens (P) and non-carcinogens (NP) for each group of chemicals with a particular SA.
The DRAGON professional 5.4 program [12] was employed for the calculation of 835 Dragon descriptors. Different descriptors represent different ways or perspectives of viewing a molecule, taking into account its mono-dimensional (e.g. simple counts of atoms and groups), bi-dimensional (e.g. the topological graph) or three-dimensional (e.g. the minimum energy conformation) features. Thus, descriptors express different aspects of the shape and size of molecules, encode information about the topological environment and electronic interactions of atoms, and reflect the electrostatic interaction between molecules. The "Handbook of Molecular Descriptors" by Todeschini and Consonni [13] provides an encyclopaedic reference to molecular descriptors that are suitable for (Q)SAR studies.
A subset of twelve Dragon descriptors was identified using statistical analysis (cross-correlation matrix, multicollinearity and Fisher ratio techniques). These Dragon descriptors maximally explain the variance in the observed carcinogenic potency (the property or activity of interest). The twelve Dragon descriptors selected for modelling are listed in Table 2.
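The selection procedure itself is described in [10] and is not reproduced here. As a minimal sketch of how a Fisher ratio ranking can be combined with a cross-correlation filter, the following Python fragment ranks descriptors by a two-class Fisher ratio and skips any descriptor that is highly correlated with one already chosen. The greedy strategy, the correlation threshold of 0.9, the function names and the toy descriptor values are all assumptions made for illustration; they are not taken from the study.

```python
from statistics import mean, pvariance


def fisher_ratio(values, labels):
    """Two-class Fisher ratio: squared mean difference over summed variances."""
    a = [v for v, y in zip(values, labels) if y == 1]  # class 1: non-carcinogen
    b = [v for v, y in zip(values, labels) if y == 2]  # class 2: carcinogen
    spread = pvariance(a) + pvariance(b)
    return (mean(a) - mean(b)) ** 2 / spread if spread > 0 else 0.0


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((p - mx) * (q - my) for p, q in zip(x, y))
    sxx = sum((p - mx) ** 2 for p in x)
    syy = sum((q - my) ** 2 for q in y)
    return sxy / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0


def select_descriptors(table, labels, max_corr=0.9, keep=12):
    """Greedy selection (assumed strategy): best Fisher ratio first,
    dropping descriptors too correlated with an already selected one."""
    ranked = sorted(table, key=lambda n: fisher_ratio(table[n], labels), reverse=True)
    chosen = []
    for name in ranked:
        if all(abs(pearson(table[name], table[c])) < max_corr for c in chosen):
            chosen.append(name)
        if len(chosen) == keep:
            break
    return chosen


# Hypothetical toy data: three descriptors for six chemicals.
labels = [1, 1, 1, 2, 2, 2]
toy = {
    "d_a": [0.10, 0.20, 0.15, 0.90, 1.00, 0.95],
    "d_b": [0.20, 0.40, 0.30, 1.80, 2.00, 1.90],  # redundant: twice d_a
    "d_c": [0.50, 0.10, 0.90, 0.40, 0.80, 0.20],
}
chosen = select_descriptors(toy, labels)
```

In this toy case only one of the two redundant descriptors survives the correlation filter, while the weakly discriminating but uncorrelated descriptor is retained.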
The molecular descriptors thus provide the information needed to generate a mechanistic interpretation of the underlying structure-activity or structure-property relationship, because they represent the relevant features of molecular structure that affect the observed property (carcinogenicity) of a studied molecule.
The CP ANN method was used in the modelling. It belongs to the self-organizing map techniques, which are often used to analyse data in multi-dimensional space. The basis of this technique is a non-linear projection from multi-dimensional space onto a two-dimensional map. The topology-preserving projection is achieved via a non-linear algorithm known as training. The fundamental property of the trained network is the close vicinity of similar objects. Therefore, it is expected that chemicals with similar structures will form clusters, which is what we examine here.
The architecture of the CP ANN is shown in Figure 2. The network is constructed of neurons arranged in two layers: the input layer (Kohonen layer), which contains the structural information encoded as descriptor values, and the output layer (response). The two layers of neurons are placed exactly one above the other, and the output layer has exactly the same layout of neurons as the input one [14]. The input layer has as many levels (weights of the input neurons) as there are descriptors, i.e. the dimension of the input vector X, while the output layer has as many levels as the target vectors have responses.
Kohonen maps enable visualisation of the distribution of chemicals (in the top map) and of the distribution of descriptor values (in weight level maps). The CP ANN, in turn, is a generalization of the self-organizing map: it additionally takes into account the property (output) values [15,16], which are held in the output layer. Learning in the input layer of the CP ANN is the same as in a Kohonen neural network, i.e. the similarity among input variables determines the arrangement of objects in the input layer map. Once the arrangement is set, the positions of the objects are projected to the output layer, where the weights are modified so that the weights at the projected positions become similar to the response values of the corresponding objects.
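The two-layer learning scheme described above can be sketched in a few lines of plain Python. This is an illustrative toy under stated assumptions, not the implementation used in the study: the linearly decreasing learning rate, the shrinking square neighbourhood with a triangular correction, the 4x4 map size, the 0/1 coding of the response and the toy data are all hypothetical choices (the model in this paper uses a 35x35 map trained for 200 epochs on twelve descriptors).

```python
import random


def winner(koh, x):
    """Position of the Kohonen neuron whose weights are closest to x."""
    best, pos = float("inf"), (0, 0)
    for i, row in enumerate(koh):
        for j, w in enumerate(row):
            dist = sum((a - b) ** 2 for a, b in zip(w, x))
            if dist < best:
                best, pos = dist, (i, j)
    return pos


def train_cpann(X, Y, nx=4, ny=4, epochs=50, lr_start=0.5, lr_end=0.01, seed=0):
    """Train a toy CP ANN: the winner is chosen from the inputs only, then
    both layers are corrected at the same map positions."""
    rng = random.Random(seed)
    n_in, n_out = len(X[0]), len(Y[0])
    koh = [[[rng.random() for _ in range(n_in)] for _ in range(ny)] for _ in range(nx)]
    out = [[[0.5] * n_out for _ in range(ny)] for _ in range(nx)]
    total, step = epochs * len(X), 0
    for _ in range(epochs):
        for x, y in zip(X, Y):
            frac = step / total
            lr = lr_start + (lr_end - lr_start) * frac  # assumed linear decay
            radius = max(nx, ny) * (1.0 - frac)         # assumed shrinking neighbourhood
            wi, wj = winner(koh, x)
            for i in range(nx):
                for j in range(ny):
                    d = max(abs(i - wi), abs(j - wj))
                    if d > radius:
                        continue
                    h = lr * (1.0 - d / (radius + 1.0))  # weaker correction far from winner
                    koh[i][j] = [w + h * (v - w) for w, v in zip(koh[i][j], x)]
                    out[i][j] = [w + h * (v - w) for w, v in zip(out[i][j], y)]
            step += 1
    return koh, out


def predict(koh, out, x):
    """Read the response stored in the output layer under the winning neuron."""
    i, j = winner(koh, x)
    return out[i][j]


# Hypothetical toy data: two scaled descriptors, response 0 (NP) or 1 (P).
X = [[0.10, 0.10], [0.15, 0.05], [0.05, 0.20], [0.90, 0.95], [0.85, 0.90], [0.95, 0.80]]
Y = [[0.0], [0.0], [0.0], [1.0], [1.0], [1.0]]
koh, out = train_cpann(X, Y)
top_map = {tuple(x): winner(koh, x) for x in X}  # position of each object in the top map
```

Each level of `koh` corresponds to one weight level map (one descriptor), `top_map` plays the role of the Kohonen top map, and `out` is the response surface from which a prediction is read.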
Two kinds of models were examined in the study: a model for the prediction of carcinogenic class (model_cancer_class) and a model for the prediction of SAs for carcinogenicity (model_cancer_SA). It should be noted that the input Kohonen layer is the same for both models, while the responses are different (see Figure 2).
In Figure 2 the inputs x1, x2, x3,..., xn are vector components representing the chemical structure; they correspond to the descriptors calculated for all chemicals in the training dataset. In other words, x1i, x2i, x3i,..., xni can be represented as a matrix of the values of descriptors 1, 2, 3,..., n for all 644 chemicals (i=1,…, 644) in the training dataset. The distribution of chemicals and their clusters in 2D space is examined in the Kohonen top map. Weight levels 1, 2, 3,..., n are the maps with the distribution of the particular descriptors 1, 2, 3,..., n, respectively. Output variables are expressed in the output layer as a carcinogenicity class (class 2 was marked as carcinogen and class 1 as non-carcinogen) in the model for the prediction of carcinogenicity and as particular SAs for carcinogenicity in the model for the prediction of SAs.
In the study the following 2D maps were used to analyse the similarities in the non-congeneric set of substances: the Kohonen top map (distribution of chemicals), weight level maps (distribution of individual descriptors) (Figure 2 (a)), and the corresponding response surface output layer maps with the distribution of carcinogenicity class (carcinogens/non-carcinogens) in the model for prediction of carcinogenicity (Figure 2

Results and Discussions
Special features of the CP ANN model for prediction of carcinogenic class
In the first part of the study we present the CP ANN model for the prediction of carcinogenic class (carcinogen (2) and non-carcinogen (1)) (model_cancer_class). We examined a model based on twelve Dragon descriptors, with a 35x35 artificial neural network (ANN) trained for 200 epochs (see [10]).
A wide series of statistical checks was used to assess the validity of the models. The models yielded an accuracy of 89% for the training set (644 compounds) and 69% for the test set (161 compounds). The accuracy of the leave-20%-out cross-validation for the training set was 62%. To verify whether the models perform correctly on new compounds, an external validation was carried out. The external test set was composed of 738 compounds. The accuracy of the external validation was 60.0%.
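For orientation, the accuracy figures above are simply the fraction of correctly classified compounds, and a leave-20%-out scheme can be read as five folds in which every training compound is held out once. The sketch below illustrates both under that reading; the interleaved fold assignment and the function names are assumptions, since the way the folds were formed is not described here.

```python
def accuracy(true_classes, predicted_classes):
    """Fraction of objects whose predicted class equals the observed class."""
    hits = sum(t == p for t, p in zip(true_classes, predicted_classes))
    return hits / len(true_classes)


def leave_fraction_out_folds(n_items, n_folds=5):
    """Yield (train_indices, held_out_indices); each item is held out once.
    With n_folds=5 roughly 20% of the items are left out per fold."""
    indices = list(range(n_items))
    for k in range(n_folds):
        held_out = indices[k::n_folds]  # assumed interleaved assignment
        held = set(held_out)
        train = [i for i in indices if i not in held]
        yield train, held_out


# Split sizes reported in the text: 644 training + 161 test compounds.
n_train, n_test = 644, 161
folds = list(leave_fraction_out_folds(n_train))
fold_sizes = [len(held_out) for _, held_out in folds]

# Toy check of the accuracy function on hypothetical class labels (1 or 2).
toy_accuracy = accuracy([1, 2, 2, 1], [1, 2, 1, 1])
```

The 644/161 split corresponds to holding out 20% of the 805 modelled chemicals as a test set.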
Our investigations focused on the relationships between carcinogenic class, the structure of chemicals and the applied descriptors.
Because of the diversity of the molecules studied in this work, the carcinogenic property of the compounds is related to the molecular structure in a complex way. The descriptors used in the study encode different aspects of the molecular structure. We used the CP ANN, which combines the mapping capability of a Kohonen network with a supervised learning strategy. We examined the patterns in the weight levels of the trained network; the clusters formed there visualize the effect of individual variables and give the researcher deeper knowledge of the mechanistic background.
The present paper summarizes and extends the mechanistic interpretation of the model using twelve Dragon descriptors reported in [10]. We considered the top map of the Kohonen layer (with the distribution of chemicals), the weight level maps of the descriptor variables and the output layer corresponding to the carcinogenic class in order to show the connection between the structural features of chemicals, individual descriptors and the corresponding carcinogenic class.
The correlation between carcinogenic potency, nitroso compounds and Dragon descriptors
We investigated the distribution of carcinogens and non-carcinogens in the output layer of model_cancer_class and found an area populated with a majority of carcinogens (2) in the bottom left section of the map (Figure 3a). Firstly, we considered the output layer of the model, which is a 35x35 map in the x and y directions. The small green squares (1) in the output layer (Figure 3a) are non-carcinogens, while the small brown squares (2) are carcinogens. Figure 3d shows the fragment of the bottom left section of this 35x35 map of model_cancer_class (see red rectangle) with coordinates nx from 1 to 15 and ny from 1 to 10, labelled with the IDs of the chemicals occupying individual neurons and with an indication of the SAs for carcinogenicity. Additionally, the weight maps of Dragon descriptors D12 (N-078-Ar-N=X / X-N=X) and D9 (nRNNOx-Number of N-nitroso groups (aliphatic)) are shown in Figure 3b and

Indeed, the descriptors identify certain structural features or particularities. Descriptor D9 belongs to the functional group counts (number of aliphatic N-nitroso groups), while D12 belongs to the atom-centred fragments (Ar-N=X / X-N=X).
Thus, we found a relationship between descriptors that encode features of nitro compounds, which enabled the neural network to organize these families of chemicals in topologically close locations (neurons). The majority of chemicals from this class are carcinogens, i.e. they possess the same biological activity. The nitro SAs are evidently important for carcinogenic activity, which is in good agreement with the selection of the Dragon descriptors D9 and D12 that resulted from our modelling methodology (see [10]).

The study of influential zones of Dragon descriptors
The following part of our study was dedicated to the influential zones of descriptors (areas with the largest values) and their correlation with the structure of the chemicals located in those areas. By analysing the individual descriptor layers in the self-organizing maps, one can recognize the importance and role of individual descriptors in the studied model. The results of our investigations are presented in the supplementary material section in Tables S1-S9.
After careful consideration of the data presented in Tables S1-S9, we found that the influential zones of some Dragon descriptors, such as D2 (D/Dr06-distance/detour ring index of order 6) (Figure S2), D7 (GGI2-topological charge index of order 2) (Figure S4), D10 (nPO4-Number of phosphates/thiophosphates) (Figure S7) and D11 (N-067-Al2-NH) (Figure S8), occupy a small, limited area (see Figure 4) corresponding to non-carcinogens. The influential areas of descriptors D2, D7, D10 and D11 do not coincide with the influential areas of descriptors D9 and D12 (see Figure 3), which are related to the location of carcinogens. Possibly, descriptors D2, D7, D10 and D11 capture features that explain non-carcinogenic properties, while descriptors such as D9 and D12 correspond to the carcinogenic property of compounds.
In contrast, Dragon descriptors such as D1 (PW5-Path/walk 5 -Randic shape index) (Figure S1), D3 (MATS2p-Moran autocorrelation -lag 2 / weighted by atomic polarizabilities) (Figure S3), D5 (ESpm11x-Spectral moment 11 from edge adj. matrix weighted by edge degrees) and D6 (ESpm09d-Spectral moment 09 from edge adj. matrix weighted by dipole moments) (Figure S5) have influential zones spread over the whole map (see Figure 5). This is probably evidence that these descriptors encode features that affect the majority of chemicals in the dataset and their properties. The Dragon descriptors D4 and D7 (Figure S4), D5 and D6 (Figure S5) and D12 and D9 (Figure S9) have influential zones in similar locations.
Several descriptors were found to have overlapping influential zones for the same chemicals, as Figure S4 demonstrates. Beta-Cyclodextrin (CASRN 7585-39-9) and Vinblastine were found in the influential zones of Dragon descriptors D4 and D7 (Figure S4). Beta-Cyclodextrin is also located in the influential zone of Dragon descriptor D2 (Figure S2). This means that the descriptors selected for the characterization of carcinogenicity have similar features. In the second example, two non-genotoxic halogenated cycloalkanes with SA_20 (Mirex and Chlordecone (kepone) (CASRN 143-50-0)) were found in the influential zones of Dragon descriptors D1 (Figure S1), D5 and D6 (Figure S5).
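The notion of an influential zone and of overlapping zones can be made concrete with a small sketch. Here a zone is taken to be the set of neurons whose weight in a given descriptor level lies in the upper part of that level's range; the cut-off of 0.8, the 3x3 toy maps and the chemical names are hypothetical, since no numerical threshold is stated in the text.

```python
def influential_zone(level, fraction=0.8):
    """Neurons whose weight is at least `fraction` of the way from the
    smallest to the largest value of this weight level map."""
    flat = [w for row in level for w in row]
    lo, hi = min(flat), max(flat)
    cut = lo + fraction * (hi - lo)
    return {(i, j) for i, row in enumerate(level) for j, w in enumerate(row) if w >= cut}


def chemicals_in_zone(zone, positions):
    """Names of chemicals whose top-map neuron lies inside the zone."""
    return sorted(name for name, pos in positions.items() if pos in zone)


# Hypothetical 3x3 weight level maps for two descriptors.
level_a = [[0.9, 0.8, 0.1],
           [0.7, 0.2, 0.1],
           [0.1, 0.1, 0.0]]
level_b = [[1.0, 0.1, 0.0],
           [0.9, 0.1, 0.0],
           [0.0, 0.0, 0.0]]

# Hypothetical top-map positions of three chemicals.
positions = {"chem_1": (0, 0), "chem_2": (0, 1), "chem_3": (2, 2)}

zone_a = influential_zone(level_a)
zone_b = influential_zone(level_b)
shared = chemicals_in_zone(zone_a & zone_b, positions)  # chemicals in both zones
```

Intersecting the zones of two descriptors and looking up the chemicals located there mirrors the comparison made above for descriptors with overlapping zones.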
How the CP ANN algorithm creates clusters of chemicals containing a particular SA for carcinogenicity
In the first part of the study we described the CP ANN model for the prediction of carcinogenic class (model_cancer_class). The model for the prediction of SAs for carcinogenicity (model_cancer_SA), presented in the second part of the study, was based on the same input data (the numeric representation of the twelve Dragon descriptors). A 35x35 ANN was trained for 200 epochs.
For example, 5-SA21 corresponds to green, 8-SA28 to orange, 6-SA27 to yellow, and so on (see Figure 6).
The Kohonen map makes it possible to obtain clusters of congeneric substances, to gain insight into congeneric sets of chemicals, and to determine similarities or dissimilarities within groups of chemicals characterised by a particular carcinogenic SA. The weight level maps with the distribution of the chemicals containing particular SAs (5-SA_21; 8-SA_28; 7-(SA_27+SA_28); 4-(SA13+SA27) and NA (no alert)) are shown in Figure 6 as the distribution of the values of each SA over the 35x35 map. The highest values correspond to red, followed by yellow and light green, while dark blue relates to the smallest values. We noticed that the following groups of chemicals each form one cluster: 5-SA_21 (N-nitroso compounds) (Figure 6a) and 4-(SA13+SA27) (hydrazine and nitro-aromatic) (Figure 6d). Two clusters are visible in the case of 7-(SA_27+SA_28) (nitro-aromatic, primary aromatic amines) (Figure 6c), while the other groups of chemicals, marked as 1, 2, 3, 6, 8, 9 and 10, are scattered over the whole map. The chemicals with 8-SA_28 (Figure 6b) form several clusters spread over the map.
As the distribution of chemicals in the Kohonen map is caused by similarity in their activity, we considered the mechanism of action for the groups of chemicals marked as 5, 4, 7 and 8 (see Figure 6).

Mechanism of action intrinsic to studied groups of chemicals
The N-nitroso compounds (5) containing SA_21 belong to the alkylating, indirect-acting agents. The mechanism of action intrinsic to N-nitroso compounds containing SA_21 is reported in [11,17]. We can conclude that chemicals containing SA21 possess similar activity. Indeed, N-nitrosamines and N-nitrosamides represent a well-established class of chemical carcinogens, as reported in [11].
The cluster marked with number (4) corresponds to compounds containing two SAs: SA_27 + SA_13 (nitro-aromatic compounds and hydrazines). The hydrazines (chemicals containing SA13) belong to the alkylating, indirect-acting agents, while nitro-aromatic compounds (containing SA27) belong to the indirect-acting agents that form aminoaryl DNA adducts. The structures of the studied compounds (12 chemicals) are presented in Table S2 in the Supplement Material section. It should be noted that only one chemical from this set is a non-carcinogen (C.I. Pigment red 23 (CASRN 6471-49-4)). This compound has a molecular weight (MW) of 486. Chemicals with very high MW and size have little chance of being absorbed in significant amounts and are therefore inactive. The rest of the chemicals are carcinogens and have MW between 200 and 300. A detailed explanation of the mechanism of action intrinsic to chemicals containing SA27 and SA13 is reported in [11].
Thus, the Dragon descriptors represent features that organize the group of chemicals with SA13+SA27 into a clearly visible cluster in the Kohonen map, which is evidence of their similarity. Indeed, almost all of these chemicals (with one exception) are active (carcinogens). Table S2 demonstrates that the 12 chemicals in this list have similar structures and, as a result, possess the same activity.
The cluster marked with number (7) corresponds to compounds containing two SAs: SA_27 (nitro-aromatic) + SA_28 (primary aromatic amine, hydroxyl amine and its derived esters) (14 chemicals) (Figure 6c). Both classes of chemicals belong to the indirect-acting agents that form aminoaryl DNA adducts. Chemicals containing these two alerts (SA27 + SA28) (see Table S3) are placed in neighbouring neurons because of their similar structure and the presence of the same groups responsible for a similar mode of action or biological activity. It is interesting that all chemicals in this cluster are positive in mutagenicity tests (Salmonella typhimurium TA98 strain).
The chemicals marked with number (8) correspond to compounds containing SA28 (primary aromatic amine, hydroxyl amine and its derived esters). These 52 chemicals are spread over the Kohonen map because of the large diversity of their structures and the presence of many other functional groups (see Table S4 in the Supplement Material section).
As for the chemicals marked as NA (no alert), they are distributed over the whole map and individual clusters are difficult to separate (see the weight level map of NA in Figure 6e).
How the CP ANN algorithm separates carcinogens (P) and non-carcinogens (NP) inside a group of chemicals containing a particular SA
The Kohonen mapping technique intrinsic to CP ANN makes it possible to see the distribution of carcinogens (P) and non-carcinogens (NP) for rats inside different families of chemicals with a particular SA for carcinogenicity. Figure 7 shows the output layer of the CP ANN model for the prediction of SAs for carcinogenicity, complemented with the weight maps illustrating the distribution of carcinogens (P) and non-carcinogens (NP) for chemicals containing the following individual SAs: SA_27 (b, c), SA_21 (d, e) and SA_27+SA_28 (f, g).
The locations of P and NP chemicals differ in the Kohonen maps because of the different activity of the chemicals considered. It should be highlighted that the carcinogenic potency of chemicals in the Toxtree knowledge-based system is based on observations in humans collected through epidemiological studies and on studies in animals, while in the CP ANN model we considered carcinogenic potency for rats. This is why we used SAs for carcinogenicity only to explain the possible mechanism of carcinogenic activity in a broad sense. Determining the differences within congeneric chemicals (with a particular SA for carcinogenicity) using Kohonen mapping is very important for the evaluation of the safety of chemicals.
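A simple way to quantify how well P and NP chemicals are separated within one SA family is to count, per alert, the number of P and NP chemicals and the number of neurons they share. The sketch below does this for a list of (alert, class, neuron) records; the record layout, the function name and the toy records are hypothetical and serve only to illustrate the idea.

```python
from collections import Counter, defaultdict


def separation_summary(records):
    """Per alert: number of P and NP chemicals and the number of neurons
    occupied by both classes (0 means the classes do not share a neuron)."""
    neurons = defaultdict(lambda: {"P": set(), "NP": set()})
    counts = defaultdict(Counter)
    for alert, label, neuron in records:  # label must be "P" or "NP"
        neurons[alert][label].add(neuron)
        counts[alert][label] += 1
    return {
        alert: {
            "P": counts[alert]["P"],
            "NP": counts[alert]["NP"],
            "shared_neurons": len(neurons[alert]["P"] & neurons[alert]["NP"]),
        }
        for alert in neurons
    }


# Hypothetical records: (structural alert, rat carcinogenicity class, map position).
records = [
    ("SA_21", "P", (1, 1)),
    ("SA_21", "P", (1, 2)),
    ("SA_21", "NP", (5, 5)),
    ("SA_27", "P", (3, 3)),
    ("SA_27", "NP", (3, 3)),
]
summary = separation_summary(records)
```

In this toy example the first family is cleanly separated on the map, whereas in the second a carcinogen and a non-carcinogen occupy the same neuron and cannot be distinguished by the map alone.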

Conclusion
Mechanistic insight into CP ANN models was demonstrated using the inherent mapping technique (i.e. Kohonen maps), which enables the visualization of the following features in 2D space: the carcinogenic potency; the distribution of descriptors in individual layers; and the distribution of congeneric groups of chemicals, with an indication of the specific carcinogenic SAs and of the broad mechanisms of action.
We have examined twelve statistically selected Dragon descriptors that are correlated with carcinogenicity. These descriptors express structural and electronic features such as molecular shape (linear,