Novel Prediction Models for Genotoxicity Based on Biomarker Genes in Human HepaRG TM Cells *

Transcriptomics-based biomarkers are promising new approach methodologies (NAMs) to identify molecular events underlying the genotoxic mode of action of chemicals. Previously, we developed the GENOMARK biomarker, consisting of 84 genes selected based on whole genomics DNA microarray profiles of 24 (non-)genotoxic reference chemicals covering different modes of action in metabolically competent human HepaRG™ cells. In the present study, new prediction models for genotoxicity were developed based on an extended reference dataset of 38 chemicals including existing as well as newly generated gene expression data. Both unsupervised and supervised machine learning algorithms were used, but as unsupervised machine learning did not clearly distinguish both groups, the performance of two supervised machine learning algorithms, i.e., support vector machine (SVM) and random forest (RF), was evaluated. More specifically, the predictive accuracy was compared, the sensitivity to outliers for one or more biomarker genes was assessed, and the prediction performance for 10 misleading positive chemicals exposed at their IC10 concentration was determined. In addition, the applicability of both prediction models on a publicly available gene expression dataset, generated with RNA-sequencing, was investigated. Overall, the RF and SVM models were complementary in their classification of chemicals for genotoxicity. To facilitate data analysis, an online application was developed, combining the outcomes of both prediction models. Furthermore, this research demonstrates that the combination of gene expression data with supervised machine learning algorithms can contribute to the ongoing paradigm shift towards a more human-relevant in vitro genotoxicity testing strategy without the use of experimental animals


Introduction
Genetic toxicity testing is routinely performed to ensure the safety of newly developed chemical entities for human health. Traditionally, a step-wise standardized approach is applied, starting with a battery of in vitro tests covering both gene mutations as well as structural and numerical chromosome aberrations. In case of a positive outcome in one of the in vitro tests, an adequate in vivo follow-up test is performed. Despite its wide applicability and high sensitivity, the current genotoxicity battery is facing several limitations including the lack of information on the underlying mode of action (MoA) and the high number of misleading positive results. These "misleading positives" are chemicals with a positive result in at least one of the in vitro tests but a negative result in the associated follow-up in vivo test and are caused by the low specificity of the in vitro genotoxicity tests (Kirkland et al., 2007;Ates et al., 2014;Corvi and Madia, 2017). As the misleading positive results trigger needless animal studies, which are costly, time-consuming, morally irresponsible and not always biologically relevant to humans, the existing in vitro genotoxicity testing strategies need to be improved. Over the last years, efforts have been undertaken to develop new in vitro assays that can be used in a weight of evidence (WoE) approach to de-risk a misleading positive result for genotoxicity. NAMs for genotoxicity testing proposed by the Scientific Committee on Consumer Safety (SCCS) include, amongst others, the 3D reconstructed human skin comet and micronucleus test, toxicogenomics, recombinant cell models, hen's egg test for micronucleus induction (HET-MN) and assays based on the evaluation of the phosphorylated form of H2A histone family member (γH2AX) (SCCS, 2021).
Not only in genetic toxicology, but in toxicology in general, there is currently a transition ongoing to reduce or even completely step away from animal testing and to move towards the use of innovative and new approaches that do not (directly) rely on animals (SCCS, 2021;Parish et al., 2020). Several of the recently developed NAMs for understanding and predicting compound toxicity are based on the evaluation of changes at the molecular level upon exposure to the chemical of interest (Alexander-Dann et al., 2018). Gene expression technologies such as microarray analysis or next generation sequencing allow to evaluate the impact of chemicals on a large part of or even of the complete transcriptome. As chemicals which exhibit similar mechanisms of toxicity are assumed to induce similar profiles of gene expression, such transcriptomic data can thus be used to understand and predict toxicity (Merrick, 2019;David, 2020). In genetic toxicology, the value of transcriptomics data for collecting insights in the early molecular events involved in a chemicals' genotoxic MoA is becoming increasingly recognized. However, analysis of the whole transcriptome may overcomplicate the analysis as many of the genes may not be affected by genotoxic compounds. For this reason, several biomarkers consisting of a defined set of genes (also referred to as 'gene signatures') have been developed based on transcriptomics data (David, 2020). These transcriptomic-based biomarkers facilitate the interpretation of complex genomic data sets and thus increase their relevance for risk assessment (Buick et al., 2021). When combining gene signatures with machine learning algorithms, predictive models can be developed that classify chemicals for a specific type of hazard and thus strengthen the hazard identification process (Vo et al., 2020). To our knowledge, there are three in vitro biomarkers for genotoxicity based on transcriptomics data collected in HepG2, TK6 and HepaRG TM cells. Magkoufopoulou et al. used Affymetrix DNA microarrays to develop a biomarker in human liver HepG2 cells (Magkoufopoulou et al., 2012). The 33 genes of their biomarker were selected based on the transcriptomic changes in the HepG2 cells after 12, 24 and 48 h exposure to 34 reference chemicals. Prediction analysis of microarrays (PAM), a nearest shrunken centroid method, was used to classify chemicals for their genotoxicity. Later, Li et al. developed the TGx-DDI biomarker of 64 genes by using transcriptomics data obtained from the human TK6 lymphoblastoid cells exposed to 28 (non-)DNA damage inducing agents for 4 h (Li et al., 2015). In order to classify chemicals as direct or non-direct DNA damaging, a three-pronged analytical approach including twodimensional clustering (2DC), principle component analysis (PCA) and finally a probability analysis (PA) were applied to the TGx-DDI gene panel. Later, studies of the research group showed that the TGx-DDI biomarker can also be used in other cell lines such as the human metabolically competent HepaRG TM cells (Buick et al., 2020(Buick et al., , 2021. The third biomarker, developed by our research teams and further referred to as GENOMARK, consists of 84 genes for which the selection was based on transcriptomic data collected in HepaRG TM cells. The 84 genes of the GENOMARK biomarker were selected based on the microarray results collected after 72 h exposure of HepaRG TM cells to low cytotoxic concentrations, i.e., IC10 concentrations, of 12 genotoxic and 12 non-genotoxic chemicals (Ates et al., 2018). The 24 reference chemicals were specifically chosen to address a broad range of mechanisms of genotoxicity including bulky adduct formation, DNA alkylation, cross-linking, radical generation causing DNA strand breaks, inhibition of tubulin polymerization and base analogues (Ates et al., 2018). Afterwards, a prediction model based on a machine learning algorithm, i.e., support vector machine (SVM), was developed to classify test chemicals as genotoxic, non-genotoxic or equivocal based on the gene expression values for the 84 genes. In order to facilitate the implementation and use of the GENOMARK biomarker, the selected 84 genes were translated into an easy-to-handle qPCR array and the applicability of the SVM prediction model to the collected qPCR data was assessed (Ates et al., 2018). When considering equivocal results as positive, GENOMARK showed a predictive accuracy of 100% when applied on the qPCR data of 5 known in vivo genotoxicants, 5 in vivo non-genotoxicants and 2 chemicals with debatable genotoxicity data. Despite the promising results, the existing SVM prediction model could be further improved. For example, when running the SVM algorithm on a particular dataset, a new prediction model was created instead of using a fixed model, resulting in uncontrolled models that can highly affect the prediction outcomes. In the present study, we therefore describe the development and comparison of new improved prediction models to classify chemicals based on the GENOMARK gene expression levels. Additionally, the predictive accuracy of the new prediction models to de-risk misleading positives was evaluated for the first time. For this purpose, the existing reference dataset of 24 compounds was enlarged to 38 by including 9 out of the 10 validation chemicals described in the study of Ates et al. (2018) and by including 5 additional known in vivo (non-)genotoxic compounds for which new gene expression data were generated. Next, both unsupervised and supervised methods were applied on the gene expression data of the extended reference list. As the two supervised machine learning algorithms yielded the best results, the predictive capacity of both models was further compared by applying them on newly generated gene expression data for 10 misleading positive chemicals. The applicability of both models on a publicly available transcriptomic dataset collected with RNA-sequencing was investigated as well. Finally, an online application was developed to facilitate application of the GENOMARK prediction models by other scientists 1 .

Chemical name
In vitro genotoxicity

Collection of gene expression data by qPCR
qPCR was performed using pre-spotted 96-well plates (Integrated DNA technologies) containing the primers and probes for the 84 biomarker genes and 5 housekeeping genes (Ates et al., 2018). The 7 remaining wells of the 96-well plate consisted of one no template control with H2O as input sample and 3 controls in duplicate: (i) a no amplification control with RNA of the test chemical as input sample, (ii) a positive control and (iii) a negative control. As a positive qPCR control, the cDNA of the well-known in vivo human genotoxicant methyl methanesulfonate (MMS) was used. As a negative control i.e., vehicle control, the cDNA of 0.5% dimethyl sulfoxide (DMSO) in medium was used. On the qPCR plate, 2 μl (0.05 μg/μl) purified cDNA (GenElute™ PCR Clean-Up Kit, Sigma) was used in a total reaction mix of 20 μl per well (master mix: TaqMan® Gene Expression Master Mix, Applied Biosystems™). The qPCR plates were run according to the following protocol: 0.20 min at 95 °C; 0.01 min at 95 °C; 0.20 min at 60 °C (40 cycles). Normalization of the mRNA expression was done against the geometric means of the mRNA expression levels of the 5 housekeeping genes to generate the ΔΔCq values. The log2 fold changes per treatment versus vehicle control were calculated for every sample using the 2(−ΔΔCq) method (Livak and Schmittgen, 2001).

Selection and annotation of reference and test chemicals
The previous dataset of 24 reference chemicals (n=1) as described in the publication of Ates et al. (2018) was expanded with data of 14 chemicals and their replicates (n=3) resulting in a total amount of 38 reference chemicals. Compared to the previous dataset which was solely based on microarray data, the new dataset contained gene expression values generated both with microarray and qPCR techniques. The 14 new reference chemicals included 9 of the 10 validation compounds, except climbazole, described in the publication of Ates et al. (2018) as well as 5 additional chemicals (section 2.1) for which GENOMARK data were generated with qPCR as part of the current study. The 5 additional reference chemicals were selected based on the publicly available expert opinions of the European Food Safety Authority (EFSA) and the SCCS. The 38 reference chemicals of the extended dataset consist of 19 known "in vivo genotoxic chemicals" and 19 known "in vivo non-genotoxic chemicals" and cover different application domains (pharmaceuticals, pesticides, food contact materials and cosmetics) and MoAs of genotoxicity. A list with more detailed information on the 19 known genotoxic and 19 known non-genotoxic reference chemicals can be found in Tab. S1 and Tab. S2 3 , respectively. Ten "misleading positive" chemicals (section 2.1) were selected as test chemicals to determine the classification accuracy of the prediction models. A "misleading positive" chemical was defined as a chemical with a positive result in at least one of the in vitro tests (e.g., Ames test, in vitro mammalian gene mutation test, in vitro chromosome aberration test (CAvit), and/or in vitro micronucleus test (MNvit)) and a negative result in the adequate in vivo follow-up test. All 10 chemicals were selected based on the list of recommended genotoxic and non-genotoxic chemicals by Kirkland et al. (2016), the EURL ECVAM Genotoxicity and Carcinogenicity Consolidated database 4 and/or expert opinions such as the publicly available opinions of the SCCS 5 . It should be noted that two of these "misleading positives" (HBM and 1-NAP) were also included in the previous reference dataset as two clearly known in vivo non-genotoxic chemicals. However, they showed some positive historical in vitro findings which could not be confirmed in vivo and therefore are considered as misleading positives. Furthermore, the gene expression data of the reference dataset for both chemicals were collected with microarray experiments. In order to evaluate the performance of the new prediction models on data collected with qPCR, both chemicals were also included in the present study.

Unsupervised machine learning
HC, a Pearson's correlation coefficient test, and PCA were applied on the reference dataset with the objective to group the gene expression data of 38 chemicals that has not been labeled, categorized or classified (i.e., unlabeled dataset). In the R statistical environment, the stats package was used for PCA and Pearson's correlation whereas the gplots 6 package was used for HC. The expression of the 84 selected genes of the reference dataset was visualized in a heatmap using gplots package in R.

Support Vector Machine
SVM classification analysis was performed on the expression of 84 genes using R packages e1071 7 and caret (Kuhn, 2008). The dataset of the 38 reference chemicals labeled as genotoxic or non-genotoxic (i.e., labeled dataset) was randomly split in a training (80% of dataset) and a test set (20% of dataset). To gain a better separation between the two classes, the model was tuned using the following parameters: kernel = "linear" and cost = 1. A confusion matrix was performed to determine the classification accuracy using the labeled test set. The accuracy was calculated as the total of two correct predictions (true positives (TP) + true negatives (TN)) divided by the total number of a dataset (P+N). The output of the SVM algorithm is a probability value between 0 and 1 for genotoxicity. A chemical is classified as genotoxic when the probability > 0.55 and as non-genotoxic when the probability < 0.45. Probabilities obtained between 0.45-0.55 are marked as equivocal. An illustration of the development of a prediction model using machine learning can be found in Fig. 1.

Random forest
The generation of a prediction model based on RF was performed on the expression of 84 genes using gplots 6 , randomForestExplainer 8 and randomForest (Breiman, 2001) packages. The gplots package was applied to plot the correlation between the gene expression and the reference dataset using the heatmap.2 tool. Classification and regression based on a forest of trees was done with the randomForest package using the expression of 84 genes as input data. The most important variables in the RF were identified with RandomForestExplainer. Ntree was set on 100. The labeled dataset of the 38 reference chemicals was randomly split in training (56%), validation (24%) and test set (20%) (Fig. 1). Prediction accuracy of the test set for the RF model was calculated using the caret package in R. The output of the RF algorithm is a probability value between 0 and 1 for genotoxicity. A test chemical is classified in groups based on their probability value as described above.

Comparing the performance of the SVM model to the RF model
First, the SVM and RF model were both applied on the gene expression values for the 84 genes of the test set of the reference dataset as illustrated in Fig. 1. A probability result smaller than 0.45 was considered as negative (non-genotoxic), a result between 0.45-0.55 was considered as equivocal (intermediary region) and a result higher than 0.55 was considered as positive (genotoxic). The sensitivity, specificity and predictive accuracy of both models were determined. The Pearson's correlation coefficient (R-value) was calculated for the predictions generated by SVM versus RF using dplyr, ggplot2 and ggpubr packages in R. The interpretation of the Pearson's correlation coefficient was done as described in (Akoglu, 2018) considering a R-value > 0.5 as a moderate correlation.
Afterwards, the impact of outlier gene expression values on the prediction outcomes of both models was examined by manually creating outlier log2fold changes values for a specific gene within the gene expression data of two known in vivo non-genotoxic chemicals (2M4I and SoB) and two in vivo genotoxic chemicals (ethyl methanesulfonate (EMS) and aflatoxin B (AFB1)). In order to evaluate the impact of outlier gene expression values, four genes (FOLH1, SLC39A11, SLC22A7 and CCDC178) of the 84 biomarker genes were selected for which recurrently no cycle threshold (Ct) value was obtained with the qPCR assay after exposure to the test chemicals. For each of these four genes, the gene expression Ct value of these genes was changed into a low (Ct 0), mid (Ct 20) or high (Ct 40) expression value, individually. The gene expression data containing the outlier values were then analyzed by both prediction models (SVM and RF). The sensitivity, specificity and predictive accuracy of both models were calculated.

Application of the SVM and RF prediction model on test data sets
Both prediction models were used to evaluate the genotoxicity of the 10 misleading positives (n=3) based on their newly generated gene expression values. Next, both prediction models were applied on one publicly available dataset of gene expression data. In a study by Buick et al. (2020), HepaRG TM cells were exposed for 55h to increasing concentrations (lowmid-high) of 10 chemicals to study genotoxicity. The chemicals consisted of 6 known genotoxic chemicals (i.e., AFB1, cisplatin (CISP), ETP, MMS, 2-nitrofluorene (2-NF), and the aneugen colchicine (COL)) and four known non-genotoxic chemicals (i.e., AMP, 2-deoxy-D-glucose (2DG), sodium ascorbate (ASC), and sodium chloride (NaCl)). The normalized reads per million files, generated with Ion AmpliSeq™ whole transcriptome sequencing, were downloaded to test the GENOMARK biomarker (GEO accession number GSE136009). Log2 fold changes were calculated for treatment versus vehicle control in R for the 84 GENOMARK genes. For the missing genes, infinite or missing values in the dataset, the median log2 fold change value of the reference dataset of GENOMARK corresponding to the missing gene value was added. The SVM and RF classifier were applied to predict genotoxicity of the 10 test chemicals following the 55h exposure in human HepaRG TM cells. The predictive accuracy for both models was calculated.

Development of the GENOMARK biomarker online application
To facilitate the analysis of gene expression data with the newly developed GENOMARK prediction models, an online application was developed using the Shiny package 9 in R Cran, Version 4.0.4.

Collection of additional GENOMARK gene expression data using qPCR
To expand the reference dataset, gene expression data were collected with qPCR for 5 additional chemicals, i.e., 2 known in vivo genotoxic (GLY, 4AP) and 3 known in vivo non-genotoxic reference chemicals (4M2P, 2M1P and PHTH). The concentrations used in the qPCR experiments for each of the reference chemicals were selected based on the results of the MTT experiments and can be found in Tab. S1 and S2 3 . For 3 out of the 5 reference chemicals (PHTH, GLY and 4AP), an IC10 value could be derived. No cytotoxicity was observed in the MTT test for the remaining 2 chemicals (4M2P, 2M1P) within the tested concentration range (0.1-10 mM) and therefore, 10mM was selected as concentration of exposure of the HepaRG TM cells. For all 5 reference chemicals, gene expression values could be successfully collected in 3 different badges of HepaRG TM cells (n= 3). To verify how GENOMARK positions "misleading positives", qPCR was also performed for 10 test chemicals inducing a positive result in at least one of the in vitro tests but not in the in vivo follow-up test (i.e., HBM, 2M4I, 1-NAP, 4A3N, SoB, DHA, tBHQ, SoS, EUG and GLU). For 7 out of the 10 chemicals, an IC10 value could be determined based on the MTT experiments. However, due to trypsinization at the IC10 concentration, a lower concentration of 2M4I had to be used for the qPCR experiments. For the remaining 3 chemicals, SoB, SoS and DHA, no cytotoxicity was observed in the MTT assay within the tested concentration range (0.1-10 mM) and therefore, 10mM was selected for the qPCR experiments. An overview of the concentrations used in the tests with the misleading positives is provided in Tab. 1. As for the reference chemicals, gene expression data could be collected with qPCR for all the misleading positives in at least 3 different batches of HepaRG TM cells (n=3). The log2 fold changes can be found in Tab. S4 10 .

Unsupervised clustering is inefficient to distinguish between genotoxic and non-genotoxic chemicals
The dataset of reference chemicals was extended with the gene expression data of 9 out of the 10 chemicals from Ates et al.  (n=3). Climbazole was not selected as a new reference chemical. This known in vivo non-genotoxicant showed a negative result for genotoxicity using qPCR and an equivocal result using microarray in Ates et al. (2018). However, when included in the new reference dataset followed by PCA analysis, climbazole was clearly grouped in the genotoxicity class. Therefore, climbazole was considered as an outlier which may result in a prediction model with a lower accuracy and was not included in the new reference dataset of 38 chemicals. Furthermore, the newly generated expression data of 2 known in vivo genotoxic (GLY, 4AP) and 3 known in vivo non-genotoxic reference chemicals (4M2P, 2M1P and PHTH) (n= 3) were also included to extend the reference dataset.
To distinguish the 19 genotoxic from the 19 non-genotoxic chemicals of the enlarged dataset, 3 different unsupervised clustering analyses were applied to the gene expression data of the 84 GENOMARK genes of the 38 reference chemicals: HC, Pearson's correlation and PCA. In Fig. 2 the results of the different unsupervised clustering analyses are depicted. Fig. 2A represents the results of the HC, demonstrating that one panel of the 84 genes is upregulated after exposure to a genotoxicant (purple region) and that the other panel of genes is clearly downregulated after exposure to a genotoxicant (yellow region). The detailed list of genes can be found in Tab. S3 3 . However, HC showed to be not sufficient to distinguish the genotoxic (GTX) and non-genotoxic (NGTX) chemicals since the main branch of the dendogram does not perfectly separates both classes. In Fig. 2B the chemicals were clustered by the Pearson's correlation analysis. The dendogram demonstrates that the Pearson's correlation analysis is not sufficient to group the chemicals in the correct class. The green bars correspond to the NGTX chemicals and the red bars correspond to the GTX chemicals. The blue bars represent the Pearson correlation coefficient between the genes. In Fig. 2C  components of the dataset depicts three clusters: a cluster of genotoxicants (red dots), a cluster of non-genotoxicants (green dots) and a grey zone with equivocal results. There is no clear separation between the two classes.
To conclude, unsupervised learning algorithms are inefficient to distinguish between GTX and NGTX chemicals and therefore cannot be used to develop an accurate prediction model.

Supervised learning distinguishes genotoxic and non-genotoxic chemicals
Since unsupervised machine learning algorithms were not sufficient to build the prediction model, we next applied two supervised machine learning algorithms on the gene expression data of the reference dataset, i.e., SVM and RF.
First, the sensitivity, specificity and predictive accuracy of SVM versus RF were determined to compare the predictive accuracy of both models on the test set of the enlarged reference dataset (Tab. 2). As illustrated in Fig. 1, the dataset was separated into training and test data. Additionally, for RF, the training data was divided into training and validation data. The results in Tab. 2 show that both SVM and RF have a high and identical predictive accuracy on the test set of 92.3%, although the RF model was characterized by a slightly higher sensitivity whereas the SVM model clearly had a higher specificity.

The random forest model is more robust compared to the support vector machine model
To compare the robustness of the prediction models, the impact of outlier values (low, mid or high expression) for four individual genes (FOLH1, CCDC178, SLC22A7, SLC39A11), also expressed as outlier gene expression values, on the prediction outcomes of both RF and SVM models for two known in vivo NGTX chemicals (2M4I and SoB indicated in green symbols) and two known in vivo genotoxic chemicals (EMS and AFB1 indicated in red symbols) was investigated. Fig. 3

Fig. 3: Overview of the prediction scores for two in vivo non-genotoxicants (2-methyl-2H-isothiazol-3-one(2M4I) and sodium benzoate (SoB), green symbols) and two in vivo genotoxicants (aflatoxin B1 (AFB1) and ethyl methanesulfonate (EMS), red symbols) using random forest (RF) (on x-axis) and support vector machine (SVM) (on y-axis) based on their gene expression data for the 84 biomarker genes including outlier values for four individual genes (FOLH1, CCDC178, SLC22A7, SLC39A11)
Each of the squares represents the outcome of the prediction models for the four test chemicals when the values of one of the four genes were modified (Ct value 0, 20 or 40). The figures show that SVM is affected by outliers for CCDC178, SLC22A7 and SLC39A11. RF is not affected by outliers in each of the four genes.
represents four scenarios illustrated as four squares, each square corresponds to the prediction results of the four chemicals by the RF (x-axis) and SVM (y-axis) model when modifying the expression values for one gene to 0, 20 or 40. In all four scenarios, RF classified the two known in vivo NGTX chemicals and the two in vivo GTX chemicals correctly as negative and positive, respectively. Thus, the RF model resulted in a predictive accuracy of 100% in all four scenarios when having outlier values for a certain gene. The SVM model showed a lower sensitivity and accuracy to outlier gene expression values as in three out of the four outlier scenarios (i.e., CCDC178, SLC22A7, SLC39A11), the two non-genotoxicants were classified as GTX when considering the equivocal results as positive. The accuracies of the SVM model for the prediction on the FOLH1, CCDC178, SLC22A7, SLC39A11 genes are 100%, 77%, 93% and 95%, respectively. An outlier value for the CCDC178 gene has the most impact on the prediction by SVM while RF is less affected by outlier values.

Prediction scores of both models correlate to predict the genotoxicity of misleading positive chemicals
Both the RF and the SVM model were applied on the gene expression data generated with qPCR for the 10 misleading positive chemicals (HBM, 2M4I, 1-NAP, 4A3N, SoB, DHA, tBHQ, GLU, SoS and EUG) to predict their genotoxicity. The resulting prediction scores can be found in Tab. 3. Both prediction models classified six (2M4I, 4A3N, SoB, DHA, Sos and EUG) out of the ten chemicals as NGTX. GLU was clearly classified as positive and therefore identified as GTX by both prediction models. Three chemicals (1-NAP, HBM and tBHQ) were classified differently by the RF and the SVM model. The Pearson's correlation analysis was applied on the predictions made by SVM and RF on the individual gene expression data of the ten "misleading" positive chemicals to test the correlation between the two machine learning models. The individual predictions by SVM and RF based on the gene expression data for each chemical (n=3) resulted in a moderate correlation of 0.66 and p=5.3 * 10 -5 . The results of the Pearson's correlation analysis can be found in Fig. S1 3 .

Both the RF and the SVM model accurately predict genotoxicity of chemicals based on publicly available sequencing data collected in human HepaRG TM cells
To further compare the prediction performance of the SVM and the RF model and to evaluate the use of GENOMARK on gene expression values collected with technologies other than microarrays and qPCR, a publicly available sequencing dataset in HepaRG TM cells was used. In a study by Buick et al. (2020), HepaRG TM cells were exposed to 3 increasing concentrations of 10 chemicals belonging to two different classes: 6 in vivo genotoxicants and 4 in vivo non-genotoxicants. The external sequencing dataset contained 76/84 genes of the GENOMARK biomarker, missing the following 8 GENOMARK biomarker genes for each of the 10 chemicals: CDIP1, ANGPTL8, LRMDA, LINC01503, ENSG0259347, ENSG0260912, ENSG0261051, ENSG0261578. Both the SVM and the RF prediction model resulted in a predictive accuracy of 90%. The predictions made by both prediction models for the 10 chemicals in the different concentrations (low-mid-high) can be found in Tab. 4. In case there was a positive prediction for at least one of the three concentrations, the chemical was considered to be predicted as genotoxic. All four in vivo non-genotoxic chemicals were correctly classified as NGTX by both prediction models. Four out of the 6 known in vivo genotoxicants were classified as GTX by both prediction models. The two remaining in vivo GTX chemicals (CISP and COL) were classified differently by both prediction models. CISP, a known in vivo genotoxicant was classified as NGTX by the RF model. COL, was wrongly classified as NGTX by the SVM model. An overview of the sensitivity, specificity and accuracy for the RF and SVM model can be found in Tab. 5.

The online application allows easy and fast analysis of GENOMARK gene expression data
Both the RF and SVM prediction models were combined in an online application 1 to rapidly evaluate the genotoxic potential of a chemical of interest. In the interface of the online application, an example dataset is provided and new data files can be uploaded. Data to be analyzed should be uploaded as tab-delimited csv files containing the log2 ratios from treated vs. control data. The output of the analysis consists of a heatmap and a table containing the individual outcomes of both prediction models as well as the overall prediction based on a WoE approach. According to this WoE approach, a positive or negative call for genotoxicity in both models results in a classification of the chemical as GTX or NGTX, respectively. However, when having a different outcome in both models, the result is considered as inconclusive. The output table can be downloaded from the interface as an csv file.

Discussion
Previously, we have described the development of a SVM prediction model to classify chemicals for their genotoxicity based on the expression of the 84 GENOMARK genes in human-derived metabolically competent HepaRG TM cells (Ates et al., 2018).

Tab. 4: Predicted classification as genotoxic (+) or non-genotoxic (-) by the random forest (RF) and support vector machine (SVM) prediction models and the corresponding historical in vivo genotoxicity data for the 10 chemicals
The overall prediction classification result is depicted in the grey bars. Data for the 10 chemicals in three concentrations (low-midhigh) were obtained from the published sequencing dataset in HepaRG TM cells by Buick et al. (2020).  The use of HepaRG TM cells is an added value of this biomarker, as the commonly used human-derived cell types HepG2 and TK6 cells have several limitations (Mišík et al., 2019;Seo et al., 2019). HepaRG TM cells closely reflect the metabolism of xenobiotics in the human liver and do not require the use of exogenous S9-mix, which is of particular interest when developing a next generation in vitro genotoxicity test (Gerets et al., 2012). However, our previous SVM prediction model was modified by every run, resulting in uncontrolled and unvalidated models that can highly affect the prediction outcomes. Within the present study, we therefore developed new fixed prediction models based on a more extended reference dataset consisting of gene expression data collected with both microarray and qPCR technologies for 38 chemicals, equally balanced in the number of 19 known in vivo genotoxicants and 19 in vivo non-genotoxicants. The results of this study showed that unsupervised machine learning (clustering and PCA) algorithms were insufficient to develop a more accurate prediction model for genotoxicity based on the extended dataset. In contrast, promising results were obtained with two supervised machine learning algorithms (SVM and RF). It should be noted that the gene expression data of the reference dataset was obtained from two different technologies, microarrays and qPCR. Both technologies require and utilize different normalization procedures and the correlation of gene expression results between both technologies is influenced by data quality parameters (fold-change and q-value) and the amount of change in expression reported (Morey et al., 2006). However, data on the correlation between microarray and qPCR data are scarce. Some studies demonstrated that data obtained with the two different technologies yield comparable results when properly filtered (Dallas et al., 2005;Ach et al., 2008). Since the gene expression levels from both qPCR and microarray data are log-transformed and the SVM and RF algorithms use a threshold value for the genotoxicity predictions, the outcome of the GENOMARK biomarker is expected not to be affected by the technology used to collect the gene expression data. In addition, our group has previously compared GENOMARK predictions based on microarray data and qPCR data for eight chemicals using the same experimental conditions and demonstrated a high correlation (Ates et al., 2018). Various studies have already investigated the performance of classifiers or prediction models using multiple machine learning algorithms on different types of datasets. Both SVM and RF are two popular machine learning algorithms which can handle learning tasks with a small amount of training data and have a relatively high similar performance in terms of classification accuracies (Wu and Wang, 2018). Different studies demonstrated that a choice for only SVM or RF is difficult to state (Statnikov and Aliferis, 2007). Statnikov et al. showed that RF is outperformed by SVM on different diagnostic and prognostic datasets for cancer classification (Statnikov and Aliferis, 2007;Statnikov et al., 2008). However in other studies, from Delgado et al. and Deist et al. better results for different datasets were obtained using RF compared to SVM (Fernandez-Delgado et al., 2014;Deist et al., 2018). Overall, these diverging results indicate that the performance of a classifier highly depends on the dataset used (Statnikov and Aliferis, 2007;Deist et al., 2018). In the present study, both the SVM and the RF model had a high predictive accuracy of 92.3% for the reference dataset. However, the RF model showed a higher sensitivity whereas the SVM model demonstrated a higher specificity. Although in genotoxicity testing a high specificity of the tests is desired in order to reduce the number of misleading positives and the need for unnecessary animal testing, this might not be at the expense of sensitivity. Indeed, from a regulatory point of view, it is essential to have a high sensitivity to avoid that hazardous genotoxic chemicals would not be picked up (Kirkland et al., 2005). Within this context and based on the chemicals used to build the RF model, the latter would be preferred.
Furthermore, the RF model showed to be more robust to outliers. RF classifies chemicals based on the sum of the predictions of all decision trees and therefore, outlier values for specific genes do not have a large impact on the prediction outcome. A cycle threshold value between 0 and 20 might result in a log2 fold change beneath the threshold value to classify the chemical in the decision tree in the same group. This is in contrast to SVM in which the classification is based on the input value. An outlier in log2 fold change will thus have a higher influence on the prediction outcome of a chemical by SVM. Log2 fold changes are used in both prediction models to detect genotoxic responses since gene expression levels are heavily skewed in a linear scale. By log-transforming, the data becomes better for statistical testing since log-transformed data has a less skewed distribution, less extreme values compared to the untransformed data and is symmetrically centered around zero (Zwiener et al., 2014). Since differences between relative fold changes (treated versus control) can be used as substitute values for changes in gene expression, expression data measured by different platforms (microarrays, RNA-sequencing and RT-qPCR) could be used to predict the genotoxicity.
Although its higher sensitivity and robustness to outliers would suggest RF to be the preferred model over SVM, the prediction outcomes obtained for the 10 misleading positives demonstrate that both models are rather complementary. Six out of the 10 misleading positives were classified as clearly NGTX by the RF and the SVM model. In contrast, GLU was classified by both the SVM and RF model as genotoxic, despite the fact that the available data demonstrate that the chemical is NGTX in vivo. GLU is a known DNA-protein crosslinking agent in vitro and is commonly used for biologic tissue fixation (Tsai et al., 2000;Speit et al., 2008). The negative results observed in in vivo studies have been linked to rapid metabolism and protein binding characteristics of GLU (Vergnes and Ballantyne, 2002). The HepaRG TM cell system used to collect the GENOMARK gene expression values does not take into account all toxicokinetic properties of GLU, which might explain why the compound is classified as positive by both the RF and SVM prediction model.
The remaining three misleading positive chemicals (i.e., 1-NAP, tBHQ and HBM) were classified differently by both prediction models. 1-NAP is used as an oxidizing coloring agent for hair dyeing in the cosmetic industry (SCCP, 2008). Previous studies reported that 1-NAP showed conflicting results in vitro, but was negative in vivo and therefore, the SCCS assessed 1-NAP as NGTX. Nevertheless, some uncertainty remains with respect to the genotoxicity of 1-NAP and several mechanisms have been proposed to explain a possible genotoxic MoA including an increase in oxidative stress (Doherty et al., 1984;Miller et al., 1986;Wilson et al., 1996;Kapuci et al., 2014), the formation of reactive quinone metabolites such as 1,4napthoquinone by CYP metabolism and inhibition of topoisomerase (Cho et al., 2006;Fowler et al., 2018). Consequently, it remains difficult to evaluate whether or not 1-NAP would induce a genotoxic effect in the HepaRG TM cell system.
The same is true for tBHQ, a phenolic antioxidant that is, frequently used as a preservative in food and as an antioxidant in cosmetic products. Again, conflicting data exist with respect to the possible genotoxiciy of tBHQ and its metabolites (Braeuning et al., 2012). In some in vivo studies as well as in vitro studies, tBHQ is a confirmed clastogen. The observed in vitro clastogenic effect was linked to ROS generation, while chromosome loss was hypothesized to result from binding of quinone or semiquinone metabolites to proteins critical for microtubule assembly and spindle formation (Dobo and Eastmond, 1994;Gharavi et al., 2007). As most of the in vivo studies were negative, the Joint FAO/WHO Expert Committee on Food Additives (JECFA) and EFSA considered tBHQ as non-genotoxic (EFSA, 2004;Gharavi et al., 2007). Fowler et al. (2012), hypothesized that p53-deficiency in many of the rodent cell lines used for in vitro genotoxicity testing may have been responsible for the misleading positive results. As HepaRG TM cells are metabolically active and p53 competent, we would have expected tBHQ to be classified as NGTX. Nevertheless, as for 1-NAP, the formation of reactive metabolites or degradation products, might also play a role in the genotoxic effects observed in vitro. Consequently, it is not clear whether or not the induction of DNA damage would be expected in the test system used here.
Also for HBM, an oxidative hair dye, contradictory results for genotoxicity have been reported in the scientific literature. HBM induced both positive and negative results in in vitro assays but was not genotoxic in vivo. As the positive result was only observed in the bacterial reverse gene mutation test and not in in vitro or in vivo genotoxicity studies with mammalian cells, the Cosmetic Ingredient Review Expert (CIRE) Panel and the Scientific Committee on Consumer Products (SCCP) concluded that HBM is safe for use in cosmetic products (Elder, 1991;SCCP, 2006). Previous results of our research group supported the absence of genotoxicity for HBM as the compound was predicted NGTX in three out of the five in silico ALTEX, accepted manuscript published November 4, 2022 doi:10.14573/altex.2206201 models and clustered together with NGTXs based on microarray data (Ates et al., 2016b). Thus, based on the existing in vitro results and the additional in silico information, we would have expected that HBM was also classified as NGTX by the RF model. Overall, in case of different classifications by both models, a more in-depth investigation into the gene expression values that drive the different classification by the RF and the SVM model might be needed to obtain more insight in the genotoxic profiles of the compounds.
Interestingly, both prediction models were able to classify 10 (non-)genotoxic chemicals with high accuracy based on gene expression data collected in the same human cell line (HepaRG TM ) but using a different gene expression technique, namely RNA-sequencing. Two genotoxic chemicals were classified differently by both prediction models: CISP and COL. Whereas CISP, a known in vivo GTX, was classified as GTX by the SVM model, it was considered as NGTX by the RF model. However, also in the RF model, there appeared to be a concentration-related increase in the probability value for genotoxicity of CISP. Consequently, testing a higher concentration of CISP might have resulted in a classification as GTX in the RF model as well. COL, an aneugen, was classified as GTX by the RF model, while NGTX by the SVM model. In contrast to the TGx-DDI biomarker which has been developed solely on directly damaging genotoxicants, aneugens were also included in the reference dataset of the GENOMARK biomarker. Therefore, it would be expected that this compound is also classified as GTX by our prediction models. One possible explanation underlying the different prediction outcomes of both models might be the differences in the experimental set-up to collect the gene expression data. Regulation of expression levels of many important genes are tissue-, dose-or time-specific (Lambert et al., 2009;Wei et al., 2015). Indeed, HepaRG TM cells were exposed for 55h in the experiments of Buick et al. whereas in these experiments, cells are exposed for 72h (Buick et al., 2020). Some genes may thus not yet have been significantly altered after 55h. Also the concentrations tested and the technology used to collect the gene expression data might have had an impact, although the impact of the latter is expected to be limited. Moreover, it should be noted that the predictions were based on only 76 out of the 84 GENOMARK biomarker genes, as the remaining 8 genes were not included in the publicly available dataset. Despite these differences, as demonstrated in Tab. 4, four out of the six known in vivo genotoxic chemicals were classified as genotoxic and all four known in vivo non-genotoxic chemicals were classified as non-genotoxic with the GENOMARK prediction models. To have more clear insights in how GENOMARK is really performing for CISP and COL, gene expression data should be collected for all 84 biomarker genes in HepaRG TM cells with an exposure period for 72h. Nevertheless, the high predictive accuracy (i.e., 90%) of both models suggests that GENOMARK can be applied on a different platform for gene expression such as RNA-sequencing under slightly different experimental conditions. This is of importance in view of the rapidly evolving technologies used for gene expression profiling.
The results obtained with the misleading positive chemicals and with the existing RNA-sequencing dataset suggest that both models are complementary. Using the RF and SVM prediction models in a WoE approach, rather than using only one model to make a decision about the genotoxicity of a chemical of interest, might strengthen the decision making. Therefore, both prediction models were combined in an online application that allows other scientists to easily evaluate the genotoxic potential of a chemical of interest based on their gene expression data in a WoE approach. It should be noted that the GENOMARK gene signature and prediction models have been developed based on gene expression data after exposure of metabolically active, human HepaRG TM cells to a single concentration (IC10) of the test chemical for 72 hours. When applying the online application, it is therefore recommended that new experimental data are generated under similar experimental conditions.
Although the rather small number of test chemicals to assess the prediction performance of the biomarker, the results of this study and the high prediction accuracy obtained, demonstrate that GENOMARK represents a promising tool for genotoxicity testing. However, until now, the throughput of the method was limited by the fact that gene expression levels were evaluated with qPCR. This technique was originally selected as it has the advantage that it is widely available in different labs and in addition, data interpretation is relatively straightforward. However, it requires a high amount of cell material and is rather time-consuming as RNA and DNA purification are needed. For this reason, only a limited number of test chemicals could be analyzed. In the future, the predictive capacity of the GENOMARK biomarker for gene expression data obtained with high-throughput technologies such as TempO-Seq will be investigated. Application of a higher-throughput technology will allow us to collect data for a larger amount of test chemicals in different concentrations that can then be used to further evaluate the performance/robustness of the tool. In addition, the molecular information of GENOMARK should be investigated. Indeed, although the set of reference genotoxic compounds was selected based on a maximum amount of different genotoxic MoAs, including aneugenicity, to increase the sensitivity to detect genotoxic compounds, GENOMARK cannot predict at this moment a particular MoA. On the other hand, it is a strength that GENOMARK can make accurate predictions on genotoxicity independent of a specific MoA of a chemical, indicating its potential as a first screening tool. GENOMARK might be of particular interest for evaluating the genotoxicity of cosmetic ingredients as in Europe, animal testing is no longer allowed for these purposes (EC, 2009). GENOMARK could be a useful element in the toolbox that has been proposed by the SCCS for the follow-up of in vitro positive results (SCCS, 2021). In conclusion, GENOMARK uses a human-relevant and metabolically competent cell model for genotoxicity prediction based on a broad range of pathways, molecular functions, biological processes and protein classes of the 84 genes. Via this approach, GENOMARK might contribute to the 21 st century toxicology goals/approaches to move towards a reduced animal, next generation risk assessment. Using GENOMARK as a first screening assay or in combination with other NAMs in a WoE approach, GENOMARK could enhance genotoxicity assessment and reduce the need of unnecessary animal follow-up studies when having a positive result in vitro.