Transitioning to composite bacterial mutagenicity models in ICH M7 (Q)SAR analyses

The International Council on Harmonisation (ICH) M7(R1) guideline describes the use of complementary (quantitative) structure-activity relationship ((Q)SAR) models to assess the mutagenic potential of drug impurities in new and generic drugs. Historically, the CASE Ultra and Leadscope software platforms used two different statistical-based models to predict mutations at G-C (guanine-cytosine) and A-T (adenine-thymine) sites, to comprehensively assess bacterial mutagenesis. In the present study, composite bacterial mutagenicity models covering multiple mutation types were developed. These new models contain more than twice as many chemicals (n = 9,254 and n = 13,514) as the corresponding non-composite models and show better toxicophore coverage. Additionally, the use of a single composite bacterial mutagenicity model simplifies impurity analysis in an ICH M7 (Q)SAR workflow by reducing the number of model outputs requiring review. An external validation set of 388 drug impurities representing proprietary pharmaceutical chemical space showed performance statistics ranging from 66 to 82% in sensitivity, 91 to 95% in negative predictivity and 96% in coverage. This effort represents a major enhancement to these (Q)SAR models and their use under ICH M7(R1), leading to improved patient safety through greater predictive accuracy, applicability, and efficiency when assessing the bacterial mutagenic potential of drug impurities.


Introduction
The bacterial reverse mutation assay is designed to detect and classify mutagens. Specifically, the test uses several auxotrophic strains of Salmonella enterica serovar Typhimurium and Escherichia coli to detect point and frame-shift mutations, which include substitution, addition, or deletion of one or more DNA base pairs (Ames et al., 1973; Green et al., 1976; Maron and Ames, 1983). The principle of the bacterial reverse mutation assay is to detect mutagens through the reversion of auxotrophic bacteria to wild type in the presence of the test substance. This assay can be conducted in Salmonella enterica Typhimurium strains TA98, TA100, TA1535, TA1537 (or TA97, or TA97a), and TA102 (or E. coli WP2 uvrA with or without pKM101) (ICH, 2011). The bacterial reverse mutation assay is one of the most widely used components of the International Council on Harmonisation (ICH) S2 genotoxicity test battery to assess the safety of pharmaceuticals prior to clinical exposure (ICH, 2011). The battery includes multiple assays to detect mutagenic, clastogenic and aneugenic effects in vitro and in vivo, where the most commonly used combination of tests comprises the bacterial reverse mutation assay, the mouse lymphoma assay, the in vitro chromosomal aberration assay, and the in vivo micronucleus assay (Gatehouse, 2012; Stavitskaya et al., 2015). The test battery is intended to identify genotoxic substances that exhibit a greater likelihood of subsequently causing carcinogenicity in humans.
A pivotal study conducted by Ashby and Tennant (1988) showed that although not all carcinogens are genotoxic, many genotoxic chemicals are carcinogenic in rodents. This was later confirmed by Kirkland et al. (2005), who examined the correlation between carcinogenicity and genotoxicity in at least one of the three assays (Ames + mouse lymphoma assay, in vitro micronucleus assay, and in vitro chromosomal aberration assay).
The authors found that 93% of the examined carcinogens had positive results in one or more genotoxicity assays. Furthermore, the results showed that the Ames test had the best specificity, at 74%, for predicting the outcome of the rodent carcinogenicity 2-year bioassay when compared to the other genotoxicity assays, making it the most promising early-screening assay. Early screening is especially important in drug development, where a positive mutagenic result is unfavorable for a pharmaceutical candidate unless the candidate's benefit clearly outweighs the risk.
The use of (quantitative) structure-activity relationship ((Q)SAR) models in drug development has become increasingly important as it provides rapid, early screening of pharmaceutical candidates based upon their chemical structures. (Q)SAR models describe the correlation between chemical moieties and their biological activities under the general assumption that similar chemical structures exhibit similar biological activities (Benigni and Bossa, 2011; Enoch and Cronin, 2010; Kazius et al., 2005; Mortelmans and Zeiger, 2000). They have been used for a variety of endpoints related to drug safety, including several assays present in the ICH S2 battery such as the bacterial mutagenicity assay, the mouse lymphoma assay, the in vitro chromosome aberration assay and the in vivo micronucleus assay (Hsu et al., 2018; Kruhlak et al., 2012; Matthews et al., 2006b). In a regulatory environment, (Q)SAR models can contribute to the weight of evidence for decision-making by predicting toxicological endpoints for chemicals with limited or no experimental data (Kruhlak et al., 2012; Rouse et al., 2018). While (Q)SAR models have historically been used for predicting the toxicological profiles of active pharmaceutical ingredients (APIs), models have more recently found mainstream use for predicting the mutagenic potential of drug impurities.
Under the ICH M7(R1) guideline, drug impurities and degradants can be assessed for bacterial mutagenicity using two complementary computational methodologies, statistical-based (QSAR) and expert rule-based (SAR) (ICH, 2017). Statistical-based models offer the benefit of being rapid to construct from extremely large and chemically diverse datasets, while expert rule-based models provide greater interpretability by capturing human-derived and often mechanistically defined structural alerts that contribute to biological activity. If, following expert review, an impurity is predicted to have mutagenic properties by either methodology and the carcinogenic potential is unknown, the ICH M7 guideline recommends that the compound be controlled to a level at or below the threshold of toxicological concern (TTC) (ICH, 2017; Muller et al., 2006). Under the ICH M7 guideline, (Q)SAR data may be submitted by pharmaceutical applicants in place of conventional in vitro bacterial mutagenicity assay data for drug impurities up to 1 mg (ICH, 2017).
Over the past several decades, numerous bacterial mutagenicity (Q)SAR models have been constructed using a variety of (Q)SAR modeling methodologies and data sets (Cariello et al., 2002; Chakravarti and Saiakhov, 2018; Chakravarti et al., 2012; Contrera et al., 2005; Hanser et al., 2014; Jolly et al., 2015; Marchant et al., 2008; Matthews et al., 2006b; Saiakhov et al., 2013; Stavitskaya et al., 2013a; Stavitskaya et al., 2013b; Valerio and Cross, 2012; Votano et al., 2004; Williams et al., 2016; Zeiger et al., 1996). In one of the earlier studies, Zeiger et al. (1996) described the development of Salmonella mutagenicity (Q)SAR models using CASE and TOPKAT, as well as an SAR model based on structural alerts extracted from the published literature (Ashby, 1985; Ashby and Tennant, 1988; Ashby and Tennant, 1991). Two versions of the CASE models were constructed: 1) CASE/n contained 820 NTP chemicals and 2) CASE/e contained 808 EPA GENE-TOX chemicals. The external validation performance statistics of these models ranged from 67% to 78% in sensitivity and 66% to 84% in specificity. Similarly, the TOPKAT model was also constructed using data primarily derived from the EPA GENE-TOX database. External validation was performed using a set of fewer than 100 chemicals (45% positive), and performance statistics for this model showed 71% sensitivity and 76% specificity (Zeiger et al., 1996).
A report by Votano et al. described the use of atom E-state descriptors and MDL (Q)SAR modeling software to predict Salmonella mutagenicity using both artificial neural networks and multiple linear regression-genetic algorithm modeling techniques. A model was constructed from 2693 compounds and 400 compounds were used for validation, where concordance ranged from 81% to 91%, false positive rates ranged from 3% to 11%, and false negative rates ranged from 6% to 8% (Votano et al., 2004). In a subsequent report applying the same approach, Contrera et al. constructed Salmonella mutagenicity models demonstrating sensitivity of 81% and specificity of 76% (Contrera et al., 2005). The authors also constructed E. coli and composite bacterial mutagenicity models; however, the E. coli training set was limited in the number of chemicals (n = 472) and the composite bacterial mutagenicity training set contained data from several strains (e.g., TA2638) and organisms (e.g., B. subtilis) that are inconsistent with current regulatory guidelines (Contrera et al., 2005; ICH, 2011). Furthermore, atom E-state indices do not provide sufficient transparency and interpretability for the qualification of pharmaceutical impurities under ICH M7 and therefore have limited utility in a regulatory environment.
In 2006, FDA/CDER developed Salmonella and E. coli mutagenicity models using the fragment-based MC4PC software. The models were validated by external cross-validation, achieving specificities of 90% and 94% and sensitivities of 70% and 31% for Salmonella and E. coli mutagenicity, respectively (Matthews et al., 2006a; Matthews et al., 2006b).
Although these models provided sufficient transparency and interpretability, they were tuned for specificity rather than sensitivity making them more suitable for early screening in drug development rather than regulatory decision-making.
To this end, in 2013, FDA/CDER enhanced the predictive performance profile of the Salmonella mutagenicity models using CASE Ultra and Leadscope by expanding the training sets with recently marketed drugs, compounds containing previously out-of-domain (OOD) toxicophores, and previously unmodeled atoms such as boron, silicon, selenium, and tin (Stavitskaya et al., 2013b). Additionally, the models were refined to yield increased sensitivity (82%) and negative predictivity (up to 73%), which are performance statistics of greater importance for (Q)SAR models used for regulatory decision-making. The models also achieved greater coverage (up to 88%) during external validation using the Hansen data set (Hansen et al., 2009); however, lower specificity values ranging from 58% to 68% were reported. Subsequently, FDA/CDER constructed additional (Q)SAR models that predict A-T base pair mutations, based on a combination of E. coli and Salmonella TA102 mutagenicity data, with improved coverage and performance over previous models. The models were enhanced over those previously described by Matthews et al. (2006b) in part by expanding the training set to include molecular features from more recently marketed drugs, as well as by targeting areas of chemical space where the previous models were known to have weaknesses (Stavitskaya et al., 2013a). Cross-validation performance statistics for these models ranged from 68% to 73% in sensitivity, 80% to 87% in specificity and 77% to 81% in negative predictivity.
The need for greater sensitivity in detecting potential mutagens resulted in a decrease in specificity (primarily due to additional false positives), which can be mitigated through the application of expert knowledge. The ICH M7(R1) guideline contains a provision for the application of expert knowledge to provide additional evidence to support the final conclusion, and several recent publications have described best practices for this process (Amberg et al., 2016; Barber et al., 2015; Bower et al., 2017; Kruhlak et al., 2012; Myatt et al., 2018; Powley, 2015; Sutter et al., 2013). Expert knowledge may be applied by: 1) examining the chemical environment of the training set compounds supporting an alert to ensure they are relevant to the query chemical; 2) reviewing additional analogs to identify relationships between structures and their mutagenic activity; and/or 3) reviewing publications to identify mechanisms of genotoxicity relevant to the query chemical (Hsu et al., 2018; Amberg et al., 2016; Bower et al., 2017; Myatt et al., 2018). In cases where additional evidence suggests that a prediction may be incorrect, or when the model outcome is ambiguous (e.g., equivocal or OOD) or conflicts with results from another model or alert set, expert knowledge may be used to support a revised conclusion.
In the present study, two statistical-based models for composite bacterial mutagenicity were constructed based upon Salmonella and E. coli mutagenicity data combined. The use of a composite bacterial mutagenicity model simplifies impurity analysis in an ICH M7 (Q)SAR workflow by reducing the number of model outputs requiring review. The new models contain more than double the number of chemicals than the earlier models to enhance the domain of applicability. Data gaps were identified and compounds were added to improve predictions in those areas of chemical space. In addition, discrepant and/or deficient studies for 1,140 chemicals were examined to resolve conflicting calls. The newly-constructed models were externally validated using a test set representing proprietary pharmaceutical chemical space. Furthermore, the external test set was used to examine the predictive performance of existing structural alerts for bacterial mutagenicity in a commercially available expert rule-based model. Finally, the use of multiple (Q)SAR models in various combinations was examined in accordance with ICH M7 guidelines. These models provide greater predictive accuracy, applicability, and efficiency when assessing the mutagenic potential of drug impurities under ICH M7(R1), consistent with FDA/CDER's regulatory imperative to protect patient safety.

Data sources
The training sets used to construct the (Q)SAR models comprised nonproprietary bacterial reverse mutation assay data harvested from US FDA approval packages and other regulatory authorities (e.g., the Japanese NIHS and the Japanese Ministry of Health), online repositories of genetic toxicology data (e.g., NTP, EPA GENE-TOX, and CCRIS), data sharing efforts, the published literature, and MultiCASE and Leadscope internal databases. Data were harvested for the following strains: Salmonella TA98, TA100, TA1535, TA1537 (or TA97, or TA97a), and/or TA102 (or E. coli WP2 uvrA, or WP2 uvrA (pKM101)) in the absence and presence of metabolic activation. All findings were captured using a binary scoring system for modeling purposes, where a "0" denotes a negative response and a "1" denotes a positive response as indicated by the author call. Chemicals with a positive response in the presence and/or absence of S9 were scored as overall positive. Chemicals with equivocal, ambiguous or conflicting study results from multiple sources were re-reviewed according to ICH S2(R1) guidance (ICH, 2011) and given a resolved binary call or removed from the database. All bacterial mutagenicity training sets were constructed by expanding previously published databases for Salmonella mutagenicity and E. coli/TA102 mutagenicity (n = 3,979 and n = 1,199, respectively) (Stavitskaya et al., 2013a; Stavitskaya et al., 2013b). References and activity scores are provided in Supplementary Table S1 for data harvested from the Japanese Ministry of Health, Labor and Welfare, the Japanese National Institute of Technology and Evaluation, US FDA/CFSAN, published literature, and recently marketed drugs (n = 380) (Ellis et al., 2013; Greene et al., 2015; Honma et al., 2019; Amberg et al., 2015; Araya et al., 2015; Scott and Walmsley, 2015; Zhu et al., 2014).
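The binary scoring rule described above can be illustrated with a minimal Python sketch. The function name `overall_call` and the handling of untested conditions (represented as `None`) are assumptions made for illustration only; the source states only that a positive response with and/or without S9 yields an overall positive call.

```python
from typing import Optional


def overall_call(without_s9: Optional[int], with_s9: Optional[int]) -> Optional[int]:
    """Collapse per-condition author calls (1 = positive, 0 = negative) into one score.

    Assumption: None means "not tested under this condition"; if neither
    condition was tested, no overall call is returned.
    """
    tested = [call for call in (without_s9, with_s9) if call is not None]
    if not tested:
        return None
    # A positive response in the presence and/or absence of S9 is overall positive.
    return 1 if any(call == 1 for call in tested) else 0
```

For example, a chemical that is negative without S9 but positive with S9 would receive an overall score of 1 under this sketch.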

Data review
Published study conclusions (i.e., mutagenic, non-mutagenic) were used for this investigation without being re-reviewed unless conflicting results were obtained from multiple sources. In those cases, studies were re-reviewed according to ICH S2(R1) guidance (ICH, 2011) and given a resolved binary call or removed from the database. Specifically, chemicals with a 2-fold dose-related increase in revertants in any strain in the presence or absence of S9 were scored as overall positive, while chemicals tested under standard conditions with negative study results were scored as negative. Chemicals tested in a single strain with an overall negative result, those tested with non-standard or modified tester strains, or those with equivocal or ambiguous study results were excluded from the database. Overall, the activity scores of 94 chemicals were updated, and 98 chemicals were removed due to unacceptable study design (e.g., chemicals tested as a mixture, use of nonstandard strains). Detailed information about the chemicals with updated conclusions (i.e., mutagenic, non-mutagenic) is provided in Supplementary Table S2.

Chemical structure curation
Electronic representations of the chemical structures were created using MDL molfile format. Inorganic chemicals, mixtures, and high molecular weight compounds (e.g., peptides, polysaccharides, proteins, and polymers >1000 Daltons) were excluded from the training sets due to processing limitations within the (Q)SAR software used in this investigation. Furthermore, the neutralized free form of any simple salt was included.

(Q)SAR software
Two commercial (Q)SAR software platforms, CASE Ultra (CU) version 1.7.0.5 (MultiCASE Inc., USA) and Leadscope Enterprise (LS) version 3.6.3 (Leadscope Inc., USA), were used to construct two distinct composite bacterial mutagenicity models. Derek Nexus (DX) version 6.0.1 (Lhasa Limited, UK) was used concurrently with the new composite bacterial mutagenicity models as a complementary expert rule-based model when testing the predictive performance under ICH M7 guidelines. It is of note that each software platform contains both expert rule-based and statistical-based models. Historically, FDA has used the DX expert rule-based system and the CASE Ultra and Leadscope statistical-based models as a first pass for internal evaluations. All software was acquired and used under Research Collaboration Agreements between FDA/CDER and the software providers mentioned above.

CASE Ultra (CU)
CU includes a statistical-based (Q)SAR software platform that uses a machine-learning algorithm in combination with molecular descriptors generated by fragmentation of training set structures. Fragments that are identified as being statistically associated with active molecules in the training set are defined as structural alerts. Additional fragments are also identified as deactivating features that decrease the potency of the alerts. During model application, the model generates a confidence score between 0 and 1 to indicate the likelihood of a test chemical being positive based on the presence of alerts and deactivating features. The model also verifies that all 3-non-hydrogen atom fragments present in the test compound are present among the training set structures. In cases where the model identifies no alerts or produces a non-positive confidence score and one or more "unknown fragment(s)," the model returns an OOD response.
The new composite bacterial mutagenicity model was constructed in CU using a training set of 13,514 chemicals. The model was cross-validated using a 10 by 10% leave-many-out (LMO) method. Briefly, the entire dataset was randomly divided into 10 equal subsets, with a single subset (10% of the total training set) set aside as a test set and the remaining 9 subsets (90% of the total training set) used to reconstruct a model. CU recalculated descriptor weights for each prediction cycle based solely on the remaining 9 subsets. This process was repeated 10 times, with a different training set for each iteration. The classification threshold was selected based on optimal balance between sensitivity and specificity on the receiver operating characteristic (ROC) curve. During model application, predictions were classified as equivocal when a predicted confidence was within ±0.1 of the classification threshold. Predicted values above the upper bound of this range were treated as positive, and those below this range were treated as negative. An OOD response was given to any chemicals that contained one or more unknown fragments not recognized by the model.
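The classification logic described for CU model application can be sketched as follows. This is an illustrative reading of the text, not the vendor's implementation: the function name `classify_cu` is hypothetical, the classification threshold is passed in as a parameter because its value is not stated here, and the OOD rule follows the simpler statement that any unknown fragment yields an OOD response.

```python
def classify_cu(confidence: float, threshold: float,
                has_unknown_fragment: bool = False, margin: float = 0.1) -> str:
    """Map a CU confidence score (0 to 1) to a categorical call.

    Assumptions: `threshold` is the ROC-derived classification threshold
    (value not given in the text); any unknown fragment produces OOD.
    """
    if has_unknown_fragment:
        return "OOD"
    # Scores within +/- margin of the threshold are treated as equivocal.
    if abs(confidence - threshold) <= margin:
        return "equivocal"
    return "positive" if confidence > threshold else "negative"
```

With a hypothetical threshold of 0.5, a confidence of 0.55 would be equivocal, 0.95 positive, and 0.2 negative.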

Leadscope Enterprise (LS)
LS is a data mining and visualization software package that includes statistical-based (Q)SAR modeling functionality. To construct a bacterial mutagenicity (Q)SAR model, a training set (n = 9,254) was imported into LS and fingerprinted using a set of 27,142 pre-defined medicinal chemistry structural features as candidates for model building descriptors. A small predictive subset of these features was used to construct the model. Additionally, a set of unique scaffolds was automatically constructed from the pre-defined structural features that specifically defined structure-activity relationships in the training set. The unique set of scaffolds was generated using the following settings: 1) the minimum number of compounds per scaffold (10); 2) the minimum number of atoms per scaffold (6); 3) the maximum number of rotatable bonds (unspecified); and 4) the minimum absolute Z-score (2.0). Additionally, inclusion of properties such as charge, hydrogen bonding, and lipid solubility was explored to improve predictive performance.
The most highly predictive features were identified for retention, while weakly predictive features were removed using Z-score and mean activity as constraints (Roberts et al., 2000). Additional pruning was manually performed to reduce the number of features while maintaining optimal predictive performance. Subsequently, additional structural features based on known mechanisms of chemical mutagenesis were manually identified and included in the model. Lastly, the total number of model features was reduced using a partial least squares regression algorithm, leaving only those that best fit the experimental activity scores in the training set (Cross et al., 2015).
The model was cross-validated 25 times using a 10 × 10% LMO method. This method sets aside 10% of the training set for testing and reconstructs a reduced model using the remaining 90% of the compounds recalculating the descriptor weights. This process was repeated 10 times with 10 different training sets ensuring that all of the training set compounds have been predicted. The entire process was then repeated 25 times and the average predicted values were used in calculating the Cooper statistics (Cooper et al., 1979).
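The repeated 10 × 10% leave-many-out procedure can be sketched in a few lines of Python. This is a generic illustration rather than the LS implementation: `repeated_lmo` and the `fit_predict` callable (which stands in for rebuilding a reduced model on 90% of the compounds and predicting the held-out 10%) are hypothetical names, and the random partitioning scheme is an assumption.

```python
import random
from typing import Callable, Dict, List, Sequence


def repeated_lmo(ids: Sequence[str],
                 fit_predict: Callable[[List[str], List[str]], Dict[str, float]],
                 n_folds: int = 10, n_repeats: int = 25, seed: int = 0) -> Dict[str, float]:
    """Average held-out predictions over repeated leave-many-out cycles.

    `fit_predict(train_ids, test_ids)` is a placeholder for retraining on the
    training subset and returning a predicted value for each test compound.
    """
    rng = random.Random(seed)
    totals = {i: 0.0 for i in ids}
    for _ in range(n_repeats):
        shuffled = list(ids)
        rng.shuffle(shuffled)
        # Partition into n_folds roughly equal subsets (about 10% each).
        folds = [shuffled[k::n_folds] for k in range(n_folds)]
        for k, held_out in enumerate(folds):
            train = [i for j, fold in enumerate(folds) if j != k for i in fold]
            predictions = fit_predict(train, held_out)
            for i in held_out:
                totals[i] += predictions[i]
    # Every compound is predicted exactly once per repeat, so divide by n_repeats.
    return {i: total / n_repeats for i, total in totals.items()}
```

The averaged values returned here correspond to the quantities that would then be thresholded and summarized with Cooper statistics.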
A classification threshold was determined by varying the positive cutoff probability thresholds for equivocal results and analyzing the resulting Cooper statistics. The optimal probability range for indeterminate predictions was identified to be 0.4 to 0.6. Predictions that are above the 0.6 probability cutoff are classified as positive, while predictions below 0.4 are classified as negative. The domain of applicability is determined using two criteria. The first criterion is defined as the presence of at least one chemical descriptor (in addition to all property descriptors) for the test compound. The second criterion is defined as the presence of at least one structurally similar analog in the training set, defined as a structure within a Tanimoto distance of ≥ 0.3. If either criterion is not met by the test chemical, then an OOD response is generated.
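As an illustration of the LS decision rules above, the following minimal sketch combines the two domain-of-applicability criteria with the 0.4 to 0.6 indeterminate range. The function and parameter names are hypothetical, and the two domain checks are passed in as precomputed booleans because the descriptor and similarity calculations themselves are internal to the software.

```python
def classify_ls(probability: float, has_structural_descriptor: bool,
                has_similar_analog: bool, low: float = 0.4, high: float = 0.6) -> str:
    """Map an LS predicted probability to a categorical call.

    Assumption: the domain checks are supplied by the caller; if either
    criterion fails, the result is OOD regardless of the probability.
    """
    if not (has_structural_descriptor and has_similar_analog):
        return "OOD"
    if probability > high:
        return "positive"
    if probability < low:
        return "negative"
    return "indeterminate"
```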

Derek Nexus (DX)
DX is an expert rule-based system that identifies SAR alerts derived from mechanistic knowledge and relevant experimental data (Segall and Barber, 2014). The software uses a controlled vocabulary of confidence terms to express the likelihood that a prediction is correct based on the weight of evidence for and against it. Compounds that matched a structural alert for bacterial mutagenesis with a likelihood of "plausible" were treated as positive, "equivocal" predictions were treated as equivocal, "inactive" as negative, and compounds that were designated as "inactive with misclassified or unclassified features" were treated as negative, but were subjected to further investigation in an expert review workflow. DX v.6.0.1 contains a total of 135 structural alerts for bacterial mutagenesis.
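The mapping from DX likelihood terms to the calls used in this study can be written as a small lookup. This sketch covers only the four terms named above; the function name and the choice to raise an error for any other term are assumptions for illustration.

```python
from typing import Tuple

# Likelihood term -> (call used in this study, flagged for expert review?)
DX_CALLS = {
    "plausible": ("positive", False),
    "equivocal": ("equivocal", False),
    "inactive": ("negative", False),
    "inactive with misclassified or unclassified features": ("negative", True),
}


def map_dx_likelihood(likelihood: str) -> Tuple[str, bool]:
    """Return the study call and an expert-review flag for a DX likelihood term."""
    key = likelihood.strip().lower()
    if key not in DX_CALLS:
        # Terms not described in the text are deliberately left unmapped.
        raise ValueError("unmapped DX likelihood term: %r" % likelihood)
    return DX_CALLS[key]
```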

Combining model outputs
The new bacterial mutagenicity models were compared to previous models for Salmonella mutagenicity, based on Salmonella TA97, TA97a, TA98, TA100, TA1535 and TA1537 (Stavitskaya et al., 2013b), and E. coli/TA102 mutagenicity, based on Salmonella TA102, E. coli WP2 uvrA, and/or WP2 uvrA (pKM101) (Stavitskaya et al., 2013a). To compare the new bacterial mutagenicity models to the previous Salmonella and E. coli/TA102 models, predictions from the Salmonella and E. coli/TA102 models were combined, where a positive prediction from either model within a given software platform justified an overall positive prediction for that software's Salmonella/E. coli prediction. Note that this approach can mathematically improve the sensitivity when applying two models, but also increases the false positive rate. An equivocal prediction from any one model resulted in an overall equivocal prediction except in cases where the Salmonella model returned an OOD call and the E. coli/TA102 model returned an equivocal call, which resulted in an overall OOD call.
Additionally, an overall OOD call was given when one of the two models generated a negative prediction and the other was OOD.
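The within-platform combination rules for the earlier Salmonella and E. coli/TA102 models can be expressed as a short decision function. This is a sketch of the rules as stated; the function name and the string labels are hypothetical, and the outcomes for two negative or two OOD calls are inferred rather than stated explicitly.

```python
def combine_within_platform(salmonella: str, ecoli_ta102: str) -> str:
    """Combine Salmonella and E. coli/TA102 calls from one software platform.

    Each input is one of "positive", "negative", "equivocal", or "OOD".
    """
    calls = (salmonella, ecoli_ta102)
    if "positive" in calls:
        return "positive"
    if "equivocal" in calls:
        # Stated exception: Salmonella OOD plus E. coli/TA102 equivocal gives OOD.
        if salmonella == "OOD":
            return "OOD"
        return "equivocal"
    if "OOD" in calls:
        # One negative plus one OOD gives OOD; two OODs are assumed to give OOD.
        return "OOD"
    return "negative"  # both negative (inferred)
```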
In a second exercise, CU and LS were each combined with DX, giving two combinations of one statistical-based model and one expert rule-based model (CU/DX and LS/DX), consistent with the ICH M7 guideline. Additionally, all three models were combined into a single prediction for each query chemical (CU/LS/DX). When combining model outputs across different software platforms, a positive prediction from any one software platform was used to justify an overall positive prediction. Similarly, an equivocal prediction from any one software platform was used to justify an overall equivocal prediction, in the absence of a positive prediction. In cases where both statistical models returned an OOD call and DX returned a negative prediction, an overall OOD result was reported. However, if only one statistical model was OOD and the other statistical model generated a prediction, the OOD was disregarded and the remaining predictions were used to generate an overall call.
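The cross-platform rules can likewise be sketched as a single function that accepts one or two statistical calls plus the DX call. The names are hypothetical, and treating a single statistical OOD paired with a negative DX call as an overall OOD (the two-model case) is an interpretation of the text rather than an explicit statement.

```python
from typing import Sequence


def combine_across_platforms(statistical_calls: Sequence[str], dx_call: str) -> str:
    """Combine calls from one or two statistical models with the DX call.

    Inputs use "positive", "negative", "equivocal", or "OOD"; DX is assumed
    never to be OOD because its unclassified-feature results are treated as negative.
    """
    all_calls = list(statistical_calls) + [dx_call]
    if "positive" in all_calls:
        return "positive"
    if "equivocal" in all_calls:
        return "equivocal"
    # Only reached when DX is negative: OOD is reported if every statistical
    # model is OOD; a single OOD is disregarded when another model predicts.
    if all(call == "OOD" for call in statistical_calls):
        return "OOD"
    return "negative"
```

For instance, `combine_across_platforms(["OOD", "negative"], "negative")` returns "negative", because the single OOD is disregarded.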

External validation
Predictive performance of the models was assessed using an external validation set comprised of 388 proprietary compounds (72 actives and 316 inactives). The compounds are pharmaceutical impurities that were harvested from New Drug Applications (NDAs) submitted to FDA/CDER. Chemicals that were already part of the CU and LS training sets were removed. The final set contains a range of chemical classes, including aromatic amines, aromatic nitro compounds, carbamates, and alkyl halides.

Performance statistics
Cooper statistics were used to evaluate the performance of individual and combined model outputs. Briefly, predictive performance was evaluated using a classic 2×2 contingency table containing counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). Chemicals classified as OOD and equivocal were excluded from Cooper statistic calculations. Statistics such as sensitivity, specificity, positive predictivity, negative predictivity, and concordance were calculated as described by Cooper et al. (1979). Coverage was calculated as the percentage of all chemicals screened for which a prediction could be made (OOD results do not constitute a prediction).
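A minimal sketch of these calculations is shown below. The function name and the treatment of equivocal results as covered (since only OOD results are stated not to constitute a prediction) are assumptions, and any counts passed to it in examples are illustrative rather than study results.

```python
from typing import Dict


def cooper_statistics(tp: int, tn: int, fp: int, fn: int,
                      n_equivocal: int = 0, n_ood: int = 0) -> Dict[str, float]:
    """Compute Cooper statistics from a 2x2 contingency table.

    Equivocal and OOD chemicals are excluded from the 2x2 counts; coverage is
    the fraction of all screened chemicals that did not return OOD.
    Assumes every denominator is non-zero.
    """
    n_scored = tp + tn + fp + fn
    n_screened = n_scored + n_equivocal + n_ood
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "positive_predictivity": tp / (tp + fp),
        "negative_predictivity": tn / (tn + fn),
        "concordance": (tp + tn) / n_scored,
        "coverage": (n_screened - n_ood) / n_screened,
    }
```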
To account for the bias present within the external validation set, the negative predictive value and positive predictive value were normalized using the following equations:
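The equations themselves are not reproduced in this text. One commonly used normalization, shown here as an assumption rather than as the authors' confirmed formulas, rescales the predictive values to a hypothetical test set containing equal numbers of active and inactive chemicals, so that they depend only on sensitivity and specificity:

```latex
\mathrm{PPV}_{\mathrm{norm}} = \frac{\mathrm{Sensitivity}}{\mathrm{Sensitivity} + (1 - \mathrm{Specificity})},
\qquad
\mathrm{NPV}_{\mathrm{norm}} = \frac{\mathrm{Specificity}}{\mathrm{Specificity} + (1 - \mathrm{Sensitivity})}
```

These follow from the general expression PPV = Sens·p / (Sens·p + (1 − Spec)(1 − p)) evaluated at a prevalence of p = 0.5, and analogously for NPV.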

Overview of bacterial mutagenicity data sets
The bacterial mutagenicity training sets were expanded by each software provider, independently, from published databases (2013 Salmonella training set (n = 3,979) and 2013 E. coli/TA102 training set (n = 1,199)). The final composite bacterial mutagenicity training sets were increased to 13,514 and 9,254 for CU and LS, respectively.
The active use of previously constructed models at FDA/CDER revealed some situations where previous training set classifications were inconsistent with newer study calls. In general, the reproducibility of the Ames test is about 85%, affected by a variety of test conditions (Benigni and Giuliani, 1988; Cooper et al., 1979; Piegorsch and Zeiger, 1991). In order to improve the predictive performance of the earlier (Q)SAR models, Ames data were re-evaluated for 1,140 chemicals (192 of which had conflicting data) in accordance with published methods (Gatehouse, 2012; ICH, 2011; Mortelmans and Zeiger, 2000) to enhance the overall quality of underlying training data. Of the 1,140 chemicals re-reviewed, 46 chemicals were changed to negative and 48 were changed to positive, while 98 were removed due to inadequate study data. Of those chemicals removed, 40 were removed because activity could not be assigned to a single structure. References for the updated chemicals are provided in Supplementary Table S2.
Among the re-reviewed chemicals was the acid halide class, specifically carboxylic and sulfonic acid halides. This class of chemicals is often used in the synthesis of organic chemicals and has been shown to produce a positive result in the bacterial reverse mutation assay when tested in DMSO. Furthermore, it has been recently reported that the acid halide class is capable of reacting with DMSO, via a Pummerer rearrangement, to form alkyl halides, a well-known class of mutagens (Amberg et al., 2015). However, when tested in solvents other than DMSO, the majority of these chemicals were experimentally negative. As a result, 5 chemicals were reclassified in the database as non-mutagens.
In addition to the non-drug compounds containing under-represented functional groups, 150 drug substances approved between 2013 and 2018 were added to the database, including opioids, benzodiazepines, and nucleic acid analogs (Table 1).

Predictive performance using cross-validation and external validation
The predictive performance of the bacterial mutagenicity models based on 10 × 10% LMO cross-validation experiments is presented in Table 2. The CU composite bacterial mutagenicity model achieved a sensitivity of 90.6% and a negative predictivity of 88.9% in cross-validation, while the LS model achieved a sensitivity of 84.7% and a negative predictivity of 82.5%. It should be noted that the cross-validation statistics for the two software platforms are not directly comparable because the total numbers of chemicals in their cross-validation training sets differ, whereas the external validation statistics are directly comparable.
In a subsequent evaluation, the performance of the newly-constructed models was assessed using an external validation set of 388 proprietary compounds, where 19% were experimentally positive (

external validation, in contrast to the Salmonella and E. coli/TA102 models combined, which achieved a sensitivity of 69.4% and a negative predictivity of 92.8%. In both software platforms, transitioning to the combined bacterial mutagenicity models resulted in a decrease in false positive rates and an increase in positive predictivity as expected. Additionally, normalized positive and negative predictivity were determined to account for the small number of active chemicals in the external validation set (Table 2). Overall, the normalized positive predictive values substantially increased in both LS and CU while the normalized negative predictive values showed a decrease in CU but maintained good predictive performance.
The combined predictive performance of the CU and LS models and the DX expert rule-based system was also assessed (Table 2). Both CU+DX and LS+DX showed an increase in sensitivity (19.9% and 8.6%, respectively) and normalized negative predictivity (12% and 5.9%, respectively) when compared to the bacterial mutagenicity statistical-based models alone. In contrast, a decrease in specificity was observed for both CU and LS when combined with DX (−12.5% and −10.2%, respectively). Similarly, the normalized positive predictivity decreased by 4.6% in CU and 5.2% in LS. These results were expected, given that combining predictions across different platforms increases the number of false positive predictions.
The combined use of all three (Q)SAR software platforms resulted in similar overall predictive performance to the use of two, except that higher coverage was obtained. The numbers of chemicals that were classified as OOD are reported in Table 3. CU returned 21 OOD responses with the Salmonella/E. coli models and 16 OOD responses using the new bacterial model, whereas LS generated 57 OOD results with the previous Salmonella/E. coli models and 16 OOD results using the new bacterial model. When combined with DX predictions, the frequency of OODs remained the same since compounds that were predicted by DX as inactive with misclassified or unclassified features were treated as negative in the absence of expert review (see Section 2.4.3). However, when predictions from all three software platforms (CU, LS, DX) were combined, 4 chemicals generated an OOD in both statistical programs in the previous Salmonella/E. coli models whereas only one chemical was OOD in both statistical programs when the new bacterial mutagenicity models were applied. The three newly predicted chemicals, which represent different chemical classes, comprised a true positive, a true negative, and a false positive.

Discussion
Computational methods that provide early screening of pharmaceutical candidates and impurities to predict the outcome of a bacterial mutagenicity assay have become increasingly important for industry as well as regulatory agencies. However, for regulatory purposes, it is desirable that models provide sufficient transparency and interpretability in their predictions to facilitate the use of expert knowledge for the qualification of pharmaceutical impurities under the ICH M7 guideline. Furthermore, the application of two complementary systems that use different methodologies to leverage different strengths has been generally shown to provide greater sensitivity in detecting mutagens (Amberg et al., 2016;Barber et al., 2015;Bower et al., 2017;Kruhlak et al., 2012;Myatt et al., 2018;Powley, 2015;Sutter et al., 2013). In the current study, predictions from the newly developed composite bacterial mutagenicity CU and LS models in combination with predictions from DX were examined. Additionally, selected toxicophores are presented to illustrate how the new models have improved for specific chemical classes. Lastly, a series of case studies is presented below for three newly approved drugs.

External Validation of (Q)SAR Models
In a regulatory environment, high sensitivity and negative predictivity are important characteristics of (Q)SAR models used to support drug safety decisions, thereby minimizing risk to patients. In the present study, negative predictivity was maintained by both CU and LS in external validation, while a 12.9% increase in sensitivity was observed in the LS bacterial mutagenicity model as compared to the Salmonella/E. coli models (Fig. 1). In contrast, CU showed a 19% decrease in sensitivity when compared to the previous models.
However, it is noted that changes in sensitivity may be magnified in the current study due to the low number of active chemicals (n=72) in the external validation set. Also of note is the increase in positive predictivity, which shows that the models are more reliable in their positive predictions. This is due in part to the combined use of Salmonella and E. coli mutagenicity models in the previous version, which mathematically resulted in an increase in false positive predictions. Furthermore, the LS model demonstrated a substantial increase in coverage (10.6%) as compared to the previous models, resulting in CU and LS now showing the same coverage of the external validation set (95.9%) (Table 2).
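To give a sense of how much uncertainty 72 active chemicals imply, the following Python sketch computes a Wilson score interval for a sensitivity estimate. The Wilson interval is one standard choice and is not stated to be the method used in this study; the count of 50 correctly predicted actives is hypothetical, and the true denominators may be smaller once OOD chemicals are excluded.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives about 95%)."""
    p = successes / n
    denominator = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denominator
    half_width = (
        z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denominator
    )
    return centre - half_width, centre + half_width


n_active = 72          # active chemicals in the external validation set
true_positives = 50    # hypothetical count, for illustration only

low, high = wilson_interval(true_positives, n_active)
print(f"sensitivity = {100 * true_positives / n_active:.1f}%")
print(f"approximate 95% interval: {100 * low:.1f}% to {100 * high:.1f}%")

# Each active chemical that changes classification moves sensitivity by
# 100 / 72, or roughly 1.4 percentage points.
print(f"one chemical = {100 / n_active:.1f} percentage points")
```

With so few actives, a shift of several percentage points in sensitivity can correspond to only a handful of chemicals changing classification.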
The combined use of two complementary (Q)SAR methodologies is recommended by ICH M7(R1) to take full advantage of the relative strengths of different model descriptors and algorithms, thereby providing a more robust assessment of mutagenic activity for impurities in the absence of empirical data. Performance of a single statistical methodology compared to the performance of two complementary (Q)SAR models is shown in Fig. 2. The combined use of CU with DX as well as LS with DX increased sensitivity by 19.9% and 8.6%, respectively, and the negative predictivity by 4.2% and 1.7%, respectively. This supports the regulatory imperative to protect patient safety by reducing the number of false negative predictions, which is particularly important as drug impurities provide no therapeutic benefit. As previously reported, the combined use of two methodologies also resulted in a decrease in specificity and positive predictivity, generally producing an increased number of false positive predictions; however, the false positive rate can be decreased through the application of expert knowledge.
In addition, the performance of all three (Q)SAR models was assessed with results showing a slight improvement in sensitivity and no further improvement in negative predictivity when predictions were combined from all three software platforms. However, using three software platforms instead of two substantially decreased the number of OOD calls (Fig. 3). Furthermore, the OOD results have decreased from 4 chemicals to 1 when comparing combined predictions from the previous CU and LS models and current DX model to the new composite bacterial mutagenicity CU and LS models and current DX model (Table 3).
Of the three chemicals that were determined to be OOD by the previous Salmonella/E. coli models and DX, two were correctly predicted (1 TN and 1 TP) and one was incorrectly predicted as positive (FP) by the composite bacterial models. While a false positive is not desirable, it can be mitigated through the application of expert knowledge, using structurally similar analogs to provide evidence to dismiss the positive prediction. Overall, this result demonstrates that negative predictivity and sensitivity can be increased by combining bacterial mutagenicity predictions from two software applications and better coverage can be obtained by using three (Q)SAR models.

Toxicophore analysis
A toxicophore analysis was performed to assess the predictive performance of known toxicophores in the new bacterial mutagenicity models. A toxicophore or a "structural alert" is a unique structural feature associated with toxicity. The new bacterial mutagenicity (Q)SAR models contain larger, more comprehensive training sets with more than double the number of chemicals and have a broader applicability domain, which has not only introduced new structural alerts and deactivating fragments but also led to the refinement of previously identified alerts. Although CU and LS utilize different algorithms for identifying structural alerts, both software platforms generated a more comprehensive set of structural features to better represent mechanistic alerts and mitigating effects. As an example, the complexity of a primary aromatic amine alert and a set of associated features that were generated by CU and LS are shown in Figs. 4 and 5. Compounds containing primary aromatic amines have the potential to exhibit mutagenic activity, which is heavily dependent on the chemical environment of the alert, wherein minor structural modifications can either increase or decrease mutagenic activity (Ahlberg et al., 2016). As such, CU identified several related alerts and deactivating fragments that share a primary aromatic amine substructure (Fig. 4). In contrast, the deactivating features that were identified by LS (Fig. 5) may or may not be specific to aromatic amines. Each feature in LS is instead given a calculated mean that is based on all instances in the training set, rather than the instances related to the alerting substructure of concern (in this case, a primary aromatic amine). The LS deactivating features (Fig. 5) were selected to show that similar features may be present in CU and LS (Fig. 4).
The performance of selected toxicophores was assessed by comparing the frequencies and the mean activities of previously defined Salmonella mutagenicity model features and the new composite bacterial mutagenicity model features for both CU and LS (Fig. 6). Specifically, the features that were examined in this study were considered closely related between model generations. The results of this investigation showed that the sulfonate toxicophore is now present as a general and a more specific fragment in the new CU model. Both of the new fragments contain a methylene carbon atom in the alpha position relative to the ester oxygen atom, which is essential for the nucleophilic substitution reaction to take place (Benigni and Bossa, 2011).
Similarly, there are now two vinyl halide features in the LS bacterial mutagenicity model: a more specific vinyl halide feature, and a vinyl geminal dihalide feature (both excluding fluorine). Simple vinyl halides can be metabolized by cytochrome P450 to halo oxiranes, which are directly able to alkylate DNA (Guengerich, 1994), while more highly halogenated vinyl halides can take several active forms, including halo oxiranes and acyl halides (Benigni and Bossa, 2011). Furthermore, the new vinyl halide feature has been restricted to bromides, chlorides and iodides due to the chemicals' potential to alkylate DNA. Fluorides have been excluded because fluorine atoms are believed to be biologically inert (Benigni and Bossa, 2011) and many of the training set chemicals containing a vinyl fluoride were found to be negative.
Additionally, the new CU model contains a refined feature for the aromatic diazo toxicophore. Aromatic diazos are metabolized by azo reductase to form aromatic amines, which are then further metabolized to reactive electrophiles capable of directly reacting with DNA (Benigni and Bossa, 2011). The definition of the previous diazo feature in the Salmonella CU model was not specific and based on aromatic imines. In contrast, the new fragments in the composite bacterial mutagenicity model are specific to aromatic diazos and show increased mean activities.
Another toxicophore that has been refined in the LS model is the aromatic hydrazine. Hydrazines are often metabolized to azo compounds, which can generate reactive carbocations and free radical species capable of interacting with DNA (Benigni and Bossa, 2011). By limiting the substitution patterns present on both nitrogens, the LS feature became more specific and its mean activity increased. Moreover, another new terminal aromatic hydrazine feature showed even greater positive predictivity and mean activity.
In addition to the refined features, the new models contain new toxicophores as well as deactivating features. An example of a toxicophore that has been introduced into both software platforms is the aromatic amide. Aromatic amides can be metabolized to nitrenium ions, which can then directly react with DNA (Benigni and Bossa, 2011).
The enhanced training set and improved model features facilitate expert analysis of (Q)SAR predictions. A more refined feature set simplifies the identification of toxicophores and evaluation of the training set chemicals supporting predictions. Furthermore, an increase in the number of training set chemicals has resulted in more structurally relevant analogs to support or modulate predictions.

Case Studies
To further explore the practical application of the newly developed models, the (Q)SAR methods described in the ICH M7 guideline were applied to a set of three recently approved drugs (Figs. 7-9). The guideline recommends the use of two complementary (Q)SAR methodologies and states that the results may be further examined using "expert knowledge" to provide "additional supportive evidence on the relevance of any positive, negative, conflicting or inconclusive prediction and to provide a rationale to support the final conclusion" (ICH, 2017). Furthermore, the use of expert knowledge has been shown to improve the overall performance of models used under ICH M7 (Barber et al., 2016;Sutter et al., 2013).
The Ames-negative drug, solriamfetol, was predicted to be negative by the Salmonella/E. coli CU models and DX, while predicted to be positive by the Salmonella/E. coli LS models (Fig. 7). In contrast, the new CU bacterial mutagenicity model generated an equivocal prediction based on the presence of a carbamate moiety (highlighted below), while the LS bacterial mutagenicity model gave a negative prediction. A review of the training set compounds supporting the carbamate alert revealed that carbamates that contain simple alkyl chains on the oxygen substituent (e.g., urethane) are more likely to exhibit a positive response. However, a review of additional training set analogs (e.g., felbamate) indicated that larger molecules containing an aromatic ring have limited mutagenic activity. A supplemental substructure search of an external database identified a relevant analog, phenylalanine, which was negative in TA98, TA100, TA1537 and TA1535. Although phenylalanine does not contain a carbamate, it does provide evidence that the backbone is likely to be negative. Based on the entire weight-of-evidence, solriamfetol was predicted to be overall negative for bacterial mutagenicity.
Amifampridine, a newly approved, non-mutagenic drug, was predicted to be equivocal by the CU and LS Salmonella/E. coli models and positive by DX based on the presence of an aromatic amine (Fig. 8). However, the new bacterial mutagenicity models both predicted the drug to be negative. The new models identified the diamine moiety as a structural alert and a heterocyclic nitrogen in the para position to the amine as a deactivating fragment, which is consistent with the observed trends for anilines (Ahlberg et al., 2016). A review of structurally similar analogs from additional databases revealed that the presence of an amine group in the ortho position to the heterocyclic nitrogen can result in mutagenicity. However, the strong deactivating effect of a heterocyclic nitrogen in the para position is supported by the aminopyralid analog when compared to the effects of chloramben. Based on the weight-of-evidence, amifampridine was predicted to be non-mutagenic.
Triclabendazole, a recently approved drug, tested negative in the bacterial reverse mutation assay; however, no prediction could be made by either the LS or the CU Salmonella and E. coli models (Fig. 9). Additionally, triclabendazole was outside the applicability domain of the CU bacterial mutagenicity model. In contrast, triclabendazole was within the applicability domain and predicted to be negative by the new LS bacterial mutagenicity model. DX also generated a negative prediction but identified a misclassified feature, based on an experimentally positive structural analog, carmethizole. However, this analog lacks a fused ring system and contains two carbamate moieties that are not present in triclabendazole, suggesting a lack of relevance. Furthermore, two structurally similar analogs in the bacterial mutagenicity model training sets, 2-benzimidazolethiol and benzothiazole, support a negative prediction for the fused ring system. Supplemental searching of additional databases identified 5-methoxy-2-aminobenzimidazole, which shows that the methoxy group is unlikely to function as an activating group. Lastly, a review of training set chemicals containing the dichlorinated moiety (e.g., dichlorophenol) revealed that it is unlikely to be mutagenic. Considering the negative (Q)SAR predictions in combination with the negative structural analogs, triclabendazole was predicted to be overall negative for bacterial mutagenicity.

Impact of new models on equivocal and out-of-domain results for drug impurities
In 2017, a retrospective analysis was conducted by FDA/CDER to assess the impact of applying expert review to (Q)SAR predictions for 519 drug impurities evaluated in new and generic drug applications (Amberg et al., 2019). The ICH M7 guideline includes a provision for the application of expert knowledge to increase prediction confidence and resolve conflicting calls. Expert knowledge, which includes structural analog searching and mechanistic interpretation, has been particularly valuable in situations where models return an indeterminate (equivocal) result or are unable to generate a prediction due to a lack of relevant training set analogs (OOD). The use of expert knowledge was previously found to change the overall predictions 13% of the time and to resolve 72% of equivocal predictions (Amberg et al., 2019) and 95% (103/108) of OOD results (unpublished data).
This assessment was repeated using the new models described in this report. The percentage of the 519 chemicals that gave at least one OOD result decreased substantially, from 21% to only 6%. Similarly, the percentage of chemicals that gave an equivocal prediction dropped from 31% to 25%, indicating improved prediction resolution. In contrast, the percentage of chemicals whose overall predictions changed through the application of expert knowledge increased slightly from 13% to 18%, suggesting that expert review of predictions still plays an important role in resolving conflicting predictions in an ICH M7 (Q)SAR analysis workflow.
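The reported percentages can be cross-checked with simple arithmetic, as in the Python sketch below. It assumes that the 108 OOD results cited for the earlier analysis refer to chemicals within the same set of 519 impurities, which the text suggests but does not state explicitly. The variable names are illustrative.

```python
N_IMPURITIES = 519
OOD_TOTAL_PREVIOUS = 108     # OOD results in the earlier analysis
OOD_RESOLVED_PREVIOUS = 103  # of which resolved by expert knowledge


def percent(part, whole):
    """Express part/whole as a percentage."""
    return 100.0 * part / whole


# Share of impurities with an OOD result under the previous models.
ood_share_previous = percent(OOD_TOTAL_PREVIOUS, N_IMPURITIES)
# Share of those OOD results resolved through expert knowledge.
ood_resolution_rate = percent(OOD_RESOLVED_PREVIOUS, OOD_TOTAL_PREVIOUS)

# Reported (previous, new) percentages, and the change in percentage points.
reported = {
    "at least one OOD result": (21, 6),
    "equivocal prediction": (31, 25),
    "call changed by expert knowledge": (13, 18),
}
changes = {name: new - old for name, (old, new) in reported.items()}

print(f"{ood_share_previous:.1f}% OOD, {ood_resolution_rate:.1f}% resolved")
for name, delta in changes.items():
    print(f"{name}: {delta:+d} percentage points")
```

Under that assumption, 108 of 519 chemicals rounds to the 21% OOD rate quoted for the previous models.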

Conclusions
Computational toxicology continues to evolve as an increasingly valuable and important tool for both drug development and regulatory review. Determining the risk associated with pharmaceutical candidates for a variety of endpoints, including bacterial mutagenicity, is invaluable for preliminary high-throughput screening during development, as well as for the evaluation of pharmaceutical impurities. From a regulatory perspective, the ability to predict the toxicological profile of impurities greatly enhances the review process.
In the present study, two statistical software platforms were utilized to construct two composite bacterial mutagenicity (Q)SAR models. This was achieved by enhancing the previous model training sets and improving upon them through data collection and curation efforts. The newly-constructed bacterial mutagenicity models maintain good sensitivity and negative predictivity while showing greater coverage of proprietary pharmaceutical chemical space. Sensitivity and negative predictivity were further improved by applying two different (Q)SAR software in accordance with ICH M7 recommendations, and the use of three software platforms was found to increase overall coverage and the ability to obtain at least two valid, complementary predictions. Additionally, when two or more software platforms were in consensus, greater confidence could be inferred. In cases where the results are inconclusive or conflicting, the case studies demonstrated that expert review remains a critical step in providing additional evidence to support a final conclusion in an ICH M7 workflow.
In conclusion, the new composite bacterial mutagenicity models represent a major improvement over previous models for AT and GC mutations, providing better predictions and more efficient evaluation of drug impurities under ICH M7.

Highlights
• (Q)SAR models for predicting bacterial mutagenicity of drug impurities under ICH M7
• Published study calls with discrepant and/or deficient experimental studies were re-reviewed for 1,140 chemicals
• Expanded bacterial mutagenicity training sets (n = 9,254 and n = 13,514)
• Case studies illustrating model application

Changes (Δ) in sensitivity, specificity, positive predictivity, negative predictivity, and coverage between Salmonella/E. coli cumulative predictions and bacterial mutagenicity models in external validation.

Validation statistics for single and combined models. Columns 2 and 3 show cross-validation performance statistics and columns 4-9 show external validation performance statistics.