An In Silico Model for Predicting Drug-Induced Hepatotoxicity

As one of the leading causes of drug failure in clinical trials, drug-induced liver injury (DILI) has seriously impeded the development of new drugs. Assessing the DILI risk of drug candidates in advance is considered an effective strategy for decreasing the rate of attrition in drug discovery. Although there have been continuous attempts to predict DILI, doing so successfully remains a considerable challenge, and a quantitative structure–activity relationship (QSAR) model with satisfactory predictive performance is urgently needed. In this work, we report a high-quality QSAR model for predicting the DILI risk of xenobiotics that combines eight effective classifiers with molecular descriptors provided by Marvin. For model development, a large-scale and diverse DILI dataset consisting of 1254 compounds was built through a comprehensive literature retrieval. The optimal model was attained by an ensemble method that averages the probabilities from the eight classifiers, achieving an accuracy (ACC) of 0.783, sensitivity (SE) of 0.818, specificity (SP) of 0.748, and area under the receiver operating characteristic curve (AUC) of 0.859. For further validation, three external test sets and a large negative dataset were utilized. Both the internal and external validation indicated that our model significantly outperformed prior studies. The data provided by the current study will also be a valuable resource for future modeling and data mining.


Introduction
As the first organ that comes into contact with most products of digestion, the liver plays a critical role in energy exchange and the biotransformation of xenobiotics. Liver damage disrupts normal metabolism and can even lead to liver failure [1][2][3]. During the past decades, drug-induced liver injury (DILI), one of the major types of liver damage, has remained an active area of research. In the process of drug discovery, DILI has been identified as the most frequent cause of the termination of drug development programs. Additionally, hundreds of drugs

Optimizing the Training Set
After data filtering, the final dataset consisted of 1416 unique compounds (707 positives, 709 negatives). The initial modeling attempts, however, failed to yield models with satisfactory performance. We attributed this situation to three aspects. Firstly, the whole dataset included 1416 unique compounds; to our knowledge, this is the largest dataset for hepatotoxicity, and such a large sample size increases the diversity and complexity of the data structure. Secondly, although we had filtered the data through a complex workflow, some poor data might still have remained in the dataset, including false negatives and false positives from the literature. Thirdly, the existence of outliers also introduced noise into model development. To further improve data quality, a voting method was applied to filter the dataset. First, eight effective classifiers, namely Naïve Bayes, KNN, Kstar, AdaBoostM1, Bagging, J48, Random Forest, and Dl4j, were trained on the dataset within 10-fold cross-validation. For each compound, the prediction returned by each classifier was recorded: a compound scored 1 for each classifier that predicted it correctly, and 0 otherwise. Finally, the total score of each compound was calculated, and compounds with scores <2 were removed from the dataset. This reduction led to a new training set of 1254 compounds, containing 636 positives and 618 negatives (Supplementary file S2).
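To make the voting step concrete, the sketch below reimplements it with scikit-learn stand-ins for the WEKA classifiers named above (the original work used WEKA; Kstar has no direct scikit-learn counterpart, and an MLP stands in for Dl4j, so the estimator choices here are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
from sklearn.neural_network import MLPClassifier

def consensus_filter(X, y, min_score=2, seed=0):
    """Drop compounds correctly predicted by fewer than `min_score` classifiers.

    X: (n_samples, n_features) descriptor array; y: binary label array.
    """
    classifiers = [
        GaussianNB(),                                     # ~ Naive Bayes
        KNeighborsClassifier(),                           # ~ KNN (no Kstar analogue)
        AdaBoostClassifier(random_state=seed),            # ~ AdaBoostM1
        BaggingClassifier(random_state=seed),             # ~ Bagging
        DecisionTreeClassifier(random_state=seed),        # ~ J48
        RandomForestClassifier(random_state=seed),        # ~ Random Forest
        MLPClassifier(max_iter=1000, random_state=seed),  # ~ Dl4j (assumption)
    ]
    scores = np.zeros(len(y), dtype=int)
    for clf in classifiers:
        pred = cross_val_predict(clf, X, y, cv=10)  # 10-fold CV, as in the text
        scores += (pred == y).astype(int)           # 1 point per correct classifier
    keep = scores >= min_score                      # compounds with score < 2 are removed
    return X[keep], y[keep]
```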

Data Analysis
To investigate the chemical diversity of the modeling dataset, the Tanimoto similarity index was calculated for the whole dataset based on FP2 fingerprints. As demonstrated in Figure 1, the Tanimoto similarity indices between most compounds ranged from 0.00 to 0.30; in fact, the average Tanimoto similarity index was only 0.16, indicating the significant chemical diversity of our training set. Additionally, we investigated the chemical space of the dataset with molecular weight and the calculated partition coefficient (ClogP) as the X-axis and Y-axis, respectively (Figure 2). The molecular weights ranged from 79.10 to 899.08, and ClogP values ranged from −9.89 to 11.93.
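A minimal sketch of the pairwise Tanimoto analysis is given below. Note that the paper used Open Babel FP2 fingerprints; RDKit's path-based fingerprint is used here as an open-source stand-in (an assumption), so the absolute similarity values would differ somewhat:

```python
from itertools import combinations
from rdkit import Chem, DataStructs

def mean_tanimoto(smiles_list):
    """Average pairwise Tanimoto similarity over a list of SMILES strings."""
    mols = (Chem.MolFromSmiles(s) for s in smiles_list)
    fps = [Chem.RDKFingerprint(m) for m in mols if m is not None]
    sims = [DataStructs.TanimotoSimilarity(a, b) for a, b in combinations(fps, 2)]
    return sum(sims) / len(sims)

print(mean_tanimoto(["CCO", "c1ccccc1O", "CC(=O)Nc1ccc(O)cc1"]))  # toy example
```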

Model Construction and Evaluation
For the final training set of 1254 compounds, a total of eight machine learning methods were applied to develop prediction models for DILI, yielding eight unique models. Five important indices, namely accuracy (ACC), sensitivity (SE), specificity (SP), area under the receiver operating characteristic curve (AUC), and balanced accuracy (BACC), were used to assess the performance of each model. As listed in Table 1, ACC varied from 0.601 to 0.777, AUC varied from 0.648 to 0.852 (the receiver operating characteristic (ROC) curve for each classifier is displayed in Figure 3), SE varied from 0.608 to 0.857, SP varied from 0.401 to 0.761, and BACC was between 0.600 and 0.777. Three of the eight classifiers, Dl4j, Naïve Bayes, and J48, failed to yield models with acceptable performance, with ACC values that did not exceed 0.700. KNN provided the highest ACC value of 0.777 but a relatively lower AUC value of 0.780. For the other four classifiers, ACC and AUC were generally greater than 0.700 and 0.800, respectively. In summary, the Random Forest algorithm produced the optimal single model, with ACC, SE, SP, AUC, and BACC of 0.761, 0.785, 0.736, 0.852, and 0.761, respectively. One can clearly see that model performance varied with the machine learning method, and each classifier has its own strengths and weaknesses. Compared to single classification algorithms, ensemble algorithms generally offer more stable and effective in silico models by integrating the advantages of multiple classifiers. For classification problems, averaging is an effective and frequently used ensemble method: it takes contributions from all classifiers into consideration and is considered less biased than many base classifiers. Here, we integrated the eight base classifiers mentioned above into an ensemble model by averaging the probabilities from each classifier. As expected, all five indices were slightly greater than the corresponding values yielded by the Random Forest algorithm. From Table 1, the ensemble model gave an ACC of 0.783, with SE, SP, BACC, and AUC of 0.818, 0.748, 0.783, and 0.859, respectively. Compared to the Random Forest model, the ACC, SE, and SP of the ensemble model improved by 2.2, 3.3, and 1.2 percentage points, respectively. Therefore, the ensemble model provided better performance than the model derived from the Random Forest classifier.
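A minimal sketch of the probability-averaging ensemble follows, again assuming scikit-learn stand-ins for the WEKA base classifiers; `VotingClassifier` with `voting="soft"` averages the class probabilities of its members, which matches the averaging scheme described above:

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

ensemble = VotingClassifier(
    estimators=[
        ("nb", GaussianNB()),
        ("knn", KNeighborsClassifier()),
        ("rf", RandomForestClassifier(random_state=0)),
        # ...the remaining base classifiers are registered the same way
    ],
    voting="soft",  # average the predicted class probabilities across members
)
# acc = cross_val_score(ensemble, X, y, cv=10, scoring="accuracy").mean()
```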
To assess the statistical significance and robustness of the ensemble model, 100 runs of Y-randomization were conducted. The performance of the Y-randomization models was rather poor, with ACC of 0.499 ± 0.018, SE of 0.503 ± 0.044, SP of 0.494 ± 0.037, BACC of 0.498 ± 0.018, and AUC of 0.499 ± 0.022. The ensemble model was clearly and significantly better than the Y-randomization models, indicating that our model is robust and reliable.

External Validation
A total of three external datasets, compiled by Ai et al. [37], Zhang et al. [30], and Kotsampasakou et al. [33], were considered in the current study (Table 2; in the dataset-size column, "+" and "−" denote the numbers of DILI-positives and DILI-negatives, respectively, and for the test results, the indicators outside and within parentheses were provided by our model and by the prior studies, respectively). We first compared our ensemble model with Kotsampasakou's model, whose test set consisted of 67 molecules. The prediction accuracies of our model and Kotsampasakou's model were 67.2% and 59.7%, respectively; our model clearly performed significantly better. We also investigated the performance of our model on the entire external test set, which combined the external test sets provided by all three studies.
Zhang's test set included 85 unique molecules, and it was a real challenge for our model to be compared with Zhang's model on this test set. Zhang's test set consisted of 58 positives and 27 negatives and was therefore highly imbalanced. Zhang's model may be a biased model, with an SE of 0.948 and an SP of 0.585; in other words, it tends to predict test data as positive. It was therefore easy for Zhang's model to achieve acceptable prediction accuracy on a test set dominated by positives. Interestingly, however, Zhang's model provided an ACC of 68.2%, which was significantly lower than our model's 74.1%, indicating the satisfactory predictive capability of the model developed in the current study.
As a publicly accessible prediction model for DILI, Ai's model was developed from 1241 diverse compounds using an ensemble learning method, and it showed stronger predictive power than many prior studies [37]. Ai's test set consisted of 66 positives and 17 negatives; similar to Zhang's test set, it was dominated by positives. Ai's model was reported to provide a higher SE (0.799) and a lower SP (0.603). Compared to Ai's model, our model gave a relatively unbiased performance, with SE and SP of 0.818 and 0.748, respectively. Predicting such a positives-dominant dataset clearly posed less of a challenge for Ai's model; interestingly, the performance of our model was as good as that of Ai's model. As demonstrated in Table 2, our model yielded the same ACC value as Ai's model, 0.831. In addition, our model provided an SP value of 64.7%, higher than that of Ai's model by 11.8 percentage points, although this corresponds to only two additional correctly predicted negatives. Such a large difference in SP between the two models may be attributed to the small proportion of negatives in the test set; after all, only 17 negatives were included. In summary, there was no significant difference between Ai's model and our model in the ability to detect positives, but it was difficult to judge whether our model outperformed Ai's model in predicting negatives based on such a positives-dominant dataset. To answer this question, a reverse validation was performed based on a large negative dataset containing 312 non-DILI molecules. Our model provided an ACC of 68.9%, significantly higher than Ai's model at 30.1%, indicating the stronger ability of our model to recognize negatives. Taken together, we can conclude that our model significantly outperformed Ai's model. In addition, the performance of our model on the entire external test set is provided in Table 2: SE, SP, ACC, and BACC were 0.773, 0.658, 0.730, and 0.716, respectively.
These comparisons demonstrated that our model outperformed many existing prediction models, but they also highlighted the deficiencies of models reported in the literature.

Discussion
As one of the major causes of drug failure in clinical trials, DILI has seriously impeded the development of new drugs. Although a series of in silico models for DILI have been reported during the past decades, successfully predicting DILI remains a huge challenge. One reason for this situation may be the relatively complicated mechanism of DILI. DILI is generally divided into three types according to the type of liver damage and the clinical chemistry biomarker alterations: cholestatic liver injury, hepatocellular liver injury, and mixed (hepatocellular and cholestatic) liver injury [38]. Some researchers even divide DILI into 21 hepatotoxicity endpoints [36], reflecting its highly complicated and diverse mechanisms. Such a complex mechanism increases the difficulty of creating an in silico model for DILI. The poor quality of available datasets may be another reason for the disappointing performance of in silico models for DILI. Owing to the lack of a 'gold standard' defining DILI risk, it is difficult to evaluate the accuracy of DILI annotations, and data from different sources with conflicting labels introduce noise into training sets and decrease model accuracy. Theoretically, mechanism-based methods could achieve satisfactory performance; however, the hepatotoxicity mechanisms of many drugs are far from elucidated. Therefore, it may be more realistic to improve the predictive capability of in silico models for DILI by improving the quality of the datasets.
In this work, a comprehensive literature retrieval was conducted, followed by critical data filtering. Consequently, a large-scale and high-quality dataset for DILI was obtained. Notably, such a large dataset for DILI is considerably rare in prior works. It should be emphasized that this dataset was not acquired by simply integrating data provided by prior studies: in total, 727 unique compounds were newly collected by us from the literature. We believe the excellent performance of our model partly benefited from this high-quality dataset. Additionally, the data provided by the current study will be a valuable resource for future modeling and data mining.
As natural products, herbs and medicinal plants are usually deemed safe and are widely used in many Asian countries. Recently, ingredients from medicinal plants have gained increasing attention for their significant efficacy in the treatment and prevention of many diseases, and they play essential roles in drug discovery and development. However, natural products are not always beneficial to health: several commonly used medicinal plants have been proven to induce liver injury [39,40]. Data provided by the Drug-Induced Liver Injury Network in the United States showed that 16% of clinical hepatotoxicity cases were attributable to herbal and dietary supplements [41]. Additionally, a multicenter investigation from China reported that no less than 39% of clinical DILI cases were ascribed to herbs. Therefore, DILI labeling for natural products is absolutely necessary. Kim et al. [24] and Huang et al. [25] attempted to screen ingredients with DILI risks from traditional Chinese medicines (TCMs) using QSAR models built on the Liver Toxicity Knowledge Base (LTKB) database. However, in silico models generated only from synthetic drugs often lack applicability to natural products, typically because of inadequate coverage of the chemical space of interest. By integrating 11 herbal ingredients with the LTKB dataset, Zhao et al. [26] established a QSAR model for assessing the DILI risk of ingredients from TCMs, and it turned out that adding natural hepatotoxins to the modeling dataset helps a QSAR model provide more rational results when applied to TCMs. In the current study, we collected hundreds of hepatotoxic/nonhepatotoxic ingredients from medicinal plants and added them to the modeling dataset, increasing the size and diversity of the training set. Therefore, theoretically, compared to a model derived solely from synthetic drugs, our model should perform better when assessing the DILI risk of natural products.
Many researchers highlight that recognizing hepatotoxins is more important than identifying nonhepatotoxins [24,26,37]. However, increasing sensitivity by sacrificing specificity is not a perfect strategy: lower specificity leads to a higher false positive rate and diminishes the reliability of the results. Currently, many QSAR models show very limited ability to differentiate compounds with a "No-DILI" assignment, typically because of unbalanced datasets [25,30,37]. In this work, the Kennard-Stone algorithm was utilized to achieve a balanced dataset, with which we attempted to develop an effective and unbiased QSAR model for DILI. As expected, our model yielded a considerably satisfactory SP value of 0.748, significantly higher than the SP values (0.317-0.700) provided by many prior studies [29][30][31][32][33][34]. We also compared the ability of our model and Ai's model to identify negatives using a large negative dataset containing 312 unique nonhepatotoxic molecules; our model significantly outperformed Ai's model.
Although our model significantly outperformed many prior studies, some limitations were difficult to avoid. In this work, DILI labels were categorized into only two groups, DILI-negative and DILI-positive, whereas in many publications, DILI is categorized into multiple levels according to the severity of the risk [35,[42][43][44][45]. Generally, multiple class labels can lead to a more accurate QSAR model by decreasing the complexity of the characteristic pattern; however, it is difficult to find a unified criterion defining DILI risk, and the lack of sufficient data also increases the difficulty of developing a multiple-level QSAR model for DILI. In addition, similar to many prior studies, the relationship between dose and the incidence of DILI was not considered in the current study. There is no doubt that dose plays an important role in the occurrence of DILI; however, data provided by different laboratories were obtained with different detection methods and evaluation indices, so integrating data from different sources is a real challenge. Taken together, a large-scale dataset with standard annotations and dose information will help in developing a more effective QSAR model for DILI risk assessment.
Improvements in the aspects discussed above, such as standardized annotations, dose information, and dataset scale, may contribute to the construction of an in silico model with stronger predictive power.

Materials and Methods

Datasets
The existence of publicly accessible datasets for DILI offered us an opportunity to build an integrated dataset. Here, a comprehensive investigation of hepatotoxicity datasets was performed by searching the PubMed database with the following terms: "drug-induced hepatotoxicity", "drug-induced liver injury", and "DILI". It should be pointed out that data from text mining and in silico prediction were not considered in the current study. Consequently, 14 unique datasets, which have been widely used in research, were identified and are detailed in Table 3. Most of the hepatotoxicity labels in these datasets originated from animal/cell experiments, clinical reports, drug labels, medical monographs, and the scientific literature. New datasets always deserve more attention and play an essential role in data-driven science. To collect new hepatotoxicity data, a systematic literature retrieval focusing on natural products with hepatotoxicity/hepatoprotection was performed in the PubMed and CNKI (China National Knowledge Infrastructure) databases for the period 2009-2018. Table 4 lists the search terms used to collect hepatotoxic/hepatoprotective compounds. A series of scientific publications was obtained, and we then collected hepatotoxic/hepatoprotective compounds by reviewing these publications one by one. To ensure the reliability of the DILI risk annotations, each record was verified by two individuals. As a result, 673 hepatotoxicity records were acquired, as well as 2914 hepatoprotective records. Compounds without hepatotoxicity records that showed liver-protective effects were labeled as "negative", and compounds that induced any hepatotoxicity were flagged as "positive".
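The labeling rule at the end of the paragraph can be expressed compactly; the sketch below assumes per-compound record counts in a pandas DataFrame (the column names and toy data are hypothetical):

```python
import pandas as pd

# Hypothetical per-compound record counts gathered from the literature review.
records = pd.DataFrame({
    "compound": ["A", "B", "C"],
    "n_hepatotoxic_records": [2, 0, 0],
    "n_hepatoprotective_records": [1, 3, 0],
})

def dili_label(row):
    if row["n_hepatotoxic_records"] > 0:
        return "positive"   # any hepatotoxicity record -> positive
    if row["n_hepatoprotective_records"] > 0:
        return "negative"   # protective effects only, no toxicity -> negative
    return None             # no evidence either way -> left unlabeled

records["label"] = records.apply(dili_label, axis=1)
```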

Data Preprocessing
Data standardization is usually the starting point for integrating different data sources. In the present study, the data preparation procedure was as follows (illustrated by the sketch after this list): (1) The simplified molecular-input line-entry system (SMILES) string of each compound was retrieved from the PubChem Compound database by name/CAS matching. (2) Compounds containing metal or rare atoms were discarded, as were inorganic compounds and mixtures. (3) Structures were standardized: (I) salts were converted into the corresponding bases or acids; (II) water molecules and solvents were deleted; (III) all aromatic compounds were normalized to the Kekule form; (IV) the Neutralize model was used to neutralize the molecules; finally, the Clear Stereo model was used to clear the stereochemical information of each molecule. (4) Compounds containing fewer than 4 carbon atoms or with a molecular weight greater than 900 were deleted from the datasets. (5) High-quality label annotation always assists in creating accurate and reliable in silico models.
To minimize the risk brought by ambiguous data, a critical data screening strategy was established. Data from DILIrank [42], Livertox [44], and LTKB [45] were retained directly and marked as dataset 1. For the other datasets, the compounds already included in dataset 1 were discarded, and the remaining data were named dataset 2. Compounds with conflicting class labels were removed from dataset 2, and the remaining data were combined with dataset 1 to build a large-scale hepatotoxicity dataset. Finally, a total of 1880 compounds were retained: 707 positives and 1173 negatives. As shown in Figure 4, 1153 compounds originated from the datasets of prior studies, and the present study contributed 819 compounds (727 of them new). (6) Machine learning algorithms are known to work best on balanced datasets; imbalanced datasets usually challenge the development of in silico models and may lead to biased prediction performance [51]. To avoid this risk, the Kennard-Stone algorithm was used to balance the datasets provided by prior studies and the present study. Finally, we acquired a large-scale balanced hepatotoxicity dataset of 707 positives and 709 negatives, which was used for the following analysis.
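A minimal sketch of steps (2)-(4) is shown below, using RDKit as an open-source stand-in for the ChemAxon "Neutralize"/"Clear Stereo" models named in step (3); the RDKit calls are analogous assumptions, and charge neutralization is omitted for brevity:

```python
from rdkit import Chem
from rdkit.Chem import Descriptors
from rdkit.Chem.SaltRemover import SaltRemover

ALLOWED = {"C", "H", "N", "O", "S", "P", "F", "Cl", "Br", "I"}
remover = SaltRemover()

def preprocess(smiles):
    """Return a standardized Kekule SMILES, or None if the compound is filtered out."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    mol = remover.StripMol(mol)                       # (3-I/II) strip salt/solvent parts
    if any(a.GetSymbol() not in ALLOWED for a in mol.GetAtoms()):
        return None                                   # (2) drop metals and rare atoms
    n_carbon = sum(a.GetSymbol() == "C" for a in mol.GetAtoms())
    if n_carbon < 4 or Descriptors.MolWt(mol) > 900:
        return None                                   # (4) size filters
    Chem.RemoveStereochemistry(mol)                   # (3) clear stereo information
    Chem.Kekulize(mol, clearAromaticFlags=True)       # (3-III) normalize to Kekule form
    return Chem.MolToSmiles(mol)
```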

External Validation
Performance comparisons with prior studies based on publicly available external validation sets are indispensable and always deserve attention. To evaluate the generalization ability of our models, the external test sets collected by Ai et al. [37], Zhang et al. [30], and Kotsampasakou et al. [33] were used, with the same data preparation procedures as for the training set. When collecting the external test sets, we found that some compounds in the test sets of Ai et al. and Kotsampasakou et al. also appeared in their training sets; for a fair comparison, such compounds were removed from the external test sets. It should be emphasized that all of the data included in the external test sets had already been removed from our training set during data preprocessing. In addition, as described in the Data Preprocessing section, the Kennard-Stone algorithm was used to extract negatives from the original negative dataset, leaving out a total of 312 non-DILI ingredients collected in the present study. In the following analysis, these non-DILI ingredients served as a reverse validation dataset to assess the specificity of our model. The final external test sets are listed in Table 2 and Supplementary file S2.
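Since the Kennard-Stone algorithm is used both here and in the Data Preprocessing section to subsample negatives, a minimal sketch of the algorithm is given below; it assumes a numeric descriptor matrix and Euclidean distances:

```python
import numpy as np
from scipy.spatial.distance import cdist

def kennard_stone(X, k):
    """Return indices of k samples selected by the Kennard-Stone algorithm."""
    dist = cdist(X, X)                        # full pairwise Euclidean distance matrix
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [i, j]                         # seed with the two most distant samples
    remaining = set(range(len(X))) - set(selected)
    while len(selected) < k:
        rem = list(remaining)
        # each remaining sample's distance to its nearest already-selected sample
        d_min = dist[np.ix_(rem, selected)].min(axis=1)
        nxt = rem[int(np.argmax(d_min))]      # pick the sample farthest from the set
        selected.append(nxt)
        remaining.remove(nxt)
    return np.array(selected)

# e.g., keep = kennard_stone(X_negatives, k=709); the unselected negatives
# would then form the reverse validation set described above.
```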
During the past decades, there have been continuous attempts to develop in silico models for predicting DILI; however, it is not feasible to compare our model with every model reported in the literature. Here, the three latest models with the strongest predictive power were selected. Comparisons between these three models and many older models have already been conducted, indicating that the three models are better than, or at least equal to, the older ones. Therefore, we compared our model only with these three models.


Molecular Descriptors
As a widely used chemical editor, Marvin provides a series of molecular descriptors that have been proven useful and effective in various QSAR studies [52][53][54][55]. In this work, a total of 85 molecular descriptors offered by Marvin were calculated; they can be divided into two categories: physicochemical properties (30) and topological geometry properties (55).
The physicochemical properties provide data about molecular charge, elemental composition, drug-like properties, donor/acceptor counts, and partition coefficients, whereas information on geometric structure, ring systems, and other topological features is represented by the topological geometry properties. Some of these molecular descriptors, including molecular polarizability, molecular weight, ClogP, and refractivity, have been demonstrated to be effective in the development of hepatotoxicity prediction systems [56,57]. A detailed description of all molecular descriptors used in this work is available in Table S1 (Supplementary file S1).
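Marvin's descriptor set is proprietary, so as an illustration only, the sketch below computes a few analogous physicochemical and topological descriptors with RDKit (an open-source assumption, not the Marvin descriptors themselves):

```python
from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors

def describe(smiles):
    """Compute a handful of physicochemical/topological descriptors."""
    mol = Chem.MolFromSmiles(smiles)
    return {
        "MolWt": Descriptors.MolWt(mol),         # molecular weight
        "ClogP": Crippen.MolLogP(mol),           # calculated partition coefficient
        "MolMR": Crippen.MolMR(mol),             # molar refractivity
        "TPSA": Descriptors.TPSA(mol),           # topological polar surface area
        "HBD": Descriptors.NumHDonors(mol),      # hydrogen-bond donor count
        "HBA": Descriptors.NumHAcceptors(mol),   # hydrogen-bond acceptor count
        "Rings": Descriptors.RingCount(mol),     # ring count
    }
```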

Feature Selection
Redundant and irrelevant features not only fail to assist in the development of QSAR models, but also introduce noise that significantly affects the performance of in silico models. Feature selection is an effective method for decreasing the redundancy and degeneracy of characteristic properties [58]. In this work, molecular features whose values were constant for more than 95% of all compounds were removed. Thereafter, a correlation analysis was performed to delete highly correlated descriptors, with the cutoff value of Pearson's correlation coefficient set to 0.95. Finally, 55 molecular descriptors remained; they are highlighted in gray in Table S1 (Supplementary file S1). As shown in Figure 5A, a large number of highly correlated features existed in the initial feature matrix, and these were removed before modeling. Figure 5B shows the correlation matrix of the remaining descriptors, in which the redundancy of the molecular features is clearly decreased.
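A minimal pandas-based sketch of this two-stage filter follows, assuming the descriptor matrix is a DataFrame with one column per descriptor:

```python
import pandas as pd

def select_features(df, const_frac=0.95, corr_cutoff=0.95):
    """Two-stage descriptor filter: near-constant removal, then correlation pruning."""
    # 1) drop descriptors whose most frequent value covers >95% of compounds
    keep = [c for c in df.columns
            if df[c].value_counts(normalize=True).iloc[0] <= const_frac]
    df = df[keep]
    # 2) for each pair with |Pearson r| > 0.95, drop the later descriptor
    corr = df.corr().abs()
    dropped = set()
    cols = list(corr.columns)
    for i in range(len(cols)):
        if cols[i] in dropped:
            continue
        for j in range(i + 1, len(cols)):
            if cols[j] not in dropped and corr.iloc[i, j] > corr_cutoff:
                dropped.add(cols[j])
    return df.drop(columns=sorted(dropped))
```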

Model Construction
Seven effective and widely used classifiers were adopted to create in silico models: Naïve Bayes, K-nearest neighbor, Kstar, AdaBoostM1, Bagging, decision tree (J48), and Random Forest. A detailed description of these classifiers is available in Supplementary file S1. Additionally, Deeplearning4j (Dl4j), a deep learning classifier, was adopted. All classification algorithms were implemented in WEKA (Waikato Environment for Knowledge Analysis, version 3.8) [59] with the default parameters and configurations, and all classification procedures took place within 10-fold cross-validation.
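The models themselves were built in WEKA; for readers without a WEKA setup, the sketch below mirrors the protocol (default settings, 10-fold cross-validation) with scikit-learn analogues where a direct counterpart exists:

```python
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier

models = {
    "NaiveBayes": GaussianNB(),
    "KNN": KNeighborsClassifier(),
    "AdaBoostM1": AdaBoostClassifier(random_state=0),
    "Bagging": BaggingClassifier(random_state=0),
    "J48": DecisionTreeClassifier(random_state=0),       # C4.5-style tree analogue
    "RandomForest": RandomForestClassifier(random_state=0),
    # Kstar and Dl4j have no direct scikit-learn counterparts
}

# for name, clf in models.items():
#     res = cross_validate(clf, X, y, cv=10, scoring=("accuracy", "roc_auc"))
#     print(name, res["test_accuracy"].mean(), res["test_roc_auc"].mean())
```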


Performance Evaluation
To investigate the predictive ability of each model developed in the current study, four important indicators were calculated: accuracy (ACC = (TP + TN)/(TP + TN + FP + FN)), balanced accuracy (BACC = (SE + SP)/2), sensitivity (SE = TP/(TP + FN)), and specificity (SP = TN/(TN + FP)). TN, TP, FN, and FP stand for the counts of true negatives, true positives, false negatives, and false positives, respectively. ACC indicates the overall classification performance, SE and SP represent the ability to correctly recognize positives and negatives, respectively, and BACC is the mean of SE and SP.
Receiver operating characteristic (ROC) analysis, another frequently used method for evaluating in silico models, was also performed. Compared to SE and SP, which reflect a model's performance at a single decision threshold, ROC analysis provides a more global and unbiased evaluation. Here, the area under the ROC curve (AUC) was calculated, with values ranging between 0.5 and 1.0: a random classifier returns an AUC of 0.5, whereas a perfect classifier would reach an AUC of 1.0 [60].
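The five indices can be computed directly from a confusion matrix and predicted probabilities; a minimal sketch follows:

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

def evaluate(y_true, y_pred, y_prob):
    """Compute the five indices from labels, hard predictions, and probabilities."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    se = tp / (tp + fn)                      # sensitivity
    sp = tn / (tn + fp)                      # specificity
    acc = (tp + tn) / (tp + tn + fp + fn)    # accuracy
    bacc = (se + sp) / 2                     # balanced accuracy
    auc = roc_auc_score(y_true, y_prob)      # area under the ROC curve
    return {"ACC": acc, "SE": se, "SP": sp, "BACC": bacc, "AUC": auc}
```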

Y-Randomization
To measure the statistical significance and robustness of the models developed, a Y-randomization test [61] was carried out. This is done by randomly permuting the DILI annotations 100 times without altering the ratio of positives to negatives, while leaving all descriptor data untouched. The same model-building procedure used for the training set was then applied to the permuted data. Ideally, the performance of the Y-randomization models should be significantly lower than that of the models based on the real data.
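A minimal Y-randomization sketch is given below; a Random Forest stands in for the full ensemble to keep the example short (shuffling the labels preserves the class ratio automatically):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def y_randomization(X, y, n_runs=100, seed=0):
    """Mean/std of 10-fold CV accuracy over label-permuted copies of the data."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_runs):
        y_perm = rng.permutation(y)              # shuffle labels, keep descriptors
        acc = cross_val_score(RandomForestClassifier(random_state=0),
                              X, y_perm, cv=10, scoring="accuracy").mean()
        scores.append(acc)
    return np.mean(scores), np.std(scores)       # should hover near 0.5
```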
The entire workflow of data processing and model construction is illustrated in Figure 6.

Conclusions
In the current study, a comprehensive dataset consisting of 1254 unique compounds was built through a systematic literature retrieval. A set of topological geometry properties and physicochemical properties was then calculated with Marvin and used to train classifiers within 10-fold cross-validation. A total of eight effective and widely used classifiers were involved: Naïve Bayes, KNN, Kstar, AdaBoostM1, Bagging, J48, Random Forest, and Dl4j. The optimal single model was attained by the Random Forest algorithm, with ACC, SE, SP, AUC, and BACC of 0.761, 0.785, 0.736, 0.852, and 0.761, respectively. To improve the predictive power of our model, we integrated the eight base classifiers into an ensemble model by averaging the probabilities from each classifier. As expected, the ensemble model achieved better performance, with ACC, SE, SP, BACC, and AUC of 0.783, 0.818, 0.748, 0.783, and 0.859, respectively. For further validation, comparisons between our model and prior studies were conducted based on three external test sets and a negative dataset. Both the internal and external validation demonstrated that our model significantly outperformed prior studies. We believe the current work will assist in evaluating the DILI risk of drug candidates in the early stages of drug discovery.

Conflicts of Interest:
The authors declare no conflict of interest.