Feature selection. Features were ranked with a filter method, the Adjusted Mutual Information (AMI) score, which reduced the dimensionality of the dataset. The feature combination for each model was selected based on the highest performance in 10-fold cross-validation. In three models, the highest median AUC scores were achieved by removing features with 100% constant values and keeping the half of the remaining features with the highest AMI scores. In general, the RDKit topological fingerprints and molecular descriptors (MD) had higher predictive performance than both ECFP fingerprints. MD had the highest score of 0.852 with the best-performing variance threshold, var_95; these features were selected as the best performing on the training set.
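The filtering step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: plain (unadjusted) mutual information stands in for the AMI score to keep the example dependency-free, and the function names, the `keep_fraction` parameter, and the toy fingerprint bits are all hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences.

    Assumption: this is the unadjusted score, used here in place of AMI.
    """
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def select_features(columns, labels, keep_fraction=0.5):
    """Drop constant columns, then keep the top-ranked fraction of the rest."""
    varying = {name: col for name, col in columns.items() if len(set(col)) > 1}
    ranked = sorted(
        varying,
        key=lambda name: mutual_information(varying[name], labels),
        reverse=True,
    )
    n_keep = max(1, math.ceil(len(ranked) * keep_fraction))
    return ranked[:n_keep]


# Hypothetical binary fingerprint bits for six compounds.
columns = {
    "bit_a": [1, 1, 1, 0, 0, 0],  # tracks the label exactly
    "bit_b": [1, 1, 0, 1, 0, 0],  # partially informative
    "bit_c": [1, 1, 0, 1, 1, 0],  # independent of the label
    "bit_d": [0, 0, 0, 0, 0, 0],  # constant, removed before ranking
    "bit_e": [0, 1, 0, 0, 1, 0],  # independent of the label
}
labels = [1, 1, 1, 0, 0, 0]

selected = select_features(columns, labels)
print(selected)
```

Of the four non-constant bits, the two with the highest score are kept, mirroring the "half of the remaining features" rule in the text.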
Model selection. The test set comprised 20% of the data and was not used to train the models. The MD model achieved the highest mean and median AUC scores on both the cross-validation and test sets, as well as the highest cross-validation accuracy of 83.1%, as can be seen in Figure 1. Based on test-set accuracy alone, the best-performing descriptor type was RDKit5, with a score of 86.7%.
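The evaluation protocol (a 20% hold-out test set plus 10-fold cross-validation on the remainder) can be sketched with index bookkeeping alone. The split below is a plain shuffled split, not necessarily the stratified or seeded split the authors used; the function names and the sample count of 100 are assumptions for illustration.

```python
import random


def train_test_split_indices(n, test_fraction=0.2, seed=0):
    """Shuffle sample indices and hold out a fraction as the test set."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = round(n * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold(indices, k=10):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, validation


# Hypothetical dataset of 100 compounds.
train_idx, test_idx = train_test_split_indices(100)
folds = list(k_fold(train_idx, k=10))
print(len(train_idx), len(test_idx), len(folds))
```

Each model would be fitted ten times, once per fold, and the fold AUC scores summarised by their mean, median, and standard deviation; the held-out indices are only touched for the final test evaluation.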
The receiver operating characteristic (ROC) curves plot the True Positive Rate (TPR) against the False Positive Rate (FPR) at varying classification thresholds. The resulting four graphs show the performance of the different descriptor types in classifying the samples of the test set. Comparing the ROC curves and AUC scores in Figure 2, the best-performing model on the test set is the one based on molecular descriptors (MD). Overall, the analysis of the ROC curves reveals that all four models performed significantly better than random prediction.
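As an illustration of how a ROC curve and its AUC are obtained from predicted scores, the following minimal sketch sweeps the classification threshold over the distinct scores and integrates the curve with the trapezoidal rule. The labels and scores are invented for the example and do not come from the study.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs obtained by lowering the threshold step by step."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Trapezoidal area under a curve given as (x, y) points ordered by x."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical test-set labels (1 = active) and predicted probabilities.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, scores)
area = auc(points)
print(points, area)
```

A random classifier traces the diagonal and gives an AUC of 0.5, which is the baseline against which the four models are compared.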
Model selection depended on performance on the cross-validation and test sets, and the MD model achieved the best AUC results on both. A detailed summary of the four models’ performance is displayed in Table 1. The MD model had the highest mean AUC score for both the training and test sets. Moreover, it achieved the lowest RMSE values on both sets, which suggests that it predicts the response most accurately of the four models. The RDKit5 model achieved the highest accuracy on the test set, but the MD model achieved the highest cross-validation score of 0.959 (mean AUC, Table 1), and its performance did not decrease significantly when classifying the compounds of the test set, suggesting limited overfitting. Therefore, the MD model, with the highest test AUC score of 0.811, was selected for further detailed analysis.
Table 1. Performance, based on the AUC score, on cross-validation of the training set and on the test set for each descriptor type: ECFP (1024-bit length), ECFP (2048-bit length), RDKit topological fingerprints, and molecular descriptors.
Model name | Number of selected features | Cross-validation on training set (mean AUC ± SD) | Test set (mean AUC ± SD)
ECFP_1024 | 1024 | 0.862 ± 0.002 | 0.725 ± 0.001
ECFP_2048 | 2022 | 0.846 ± 0.008 | 0.710 ± 0.010
RDKit5 | 2048 | 0.894 ± 0.004 | 0.771 ± 0.028
MD | 289 | 0.959 ± 0.003 | 0.811 ± 0.020
Confusion matrix. The confusion matrix of the MD model for predicting the class of the compounds in the test set is represented in Figure 3.
Based on the confusion matrix, the recall, or True Positive Rate (TPR) (Equation 1), the False Positive Rate (FPR) (Equation 2), and other confusion matrix metrics (Equations 3, 4) were calculated.
The True Positive Rate (TPR) and Negative Predictive Value (NPV) are both relatively high, suggesting that 67.7% of the compounds predicted as active indeed had anti-ageing properties, and that 88.5% of the compounds predicted as inactive were indeed inactive. The FPR is not high, possibly because the dataset is imbalanced and most entries are true negatives. Even though the data are imbalanced, the model works well, since recall should be high and the FPR as low as possible. Precision focuses on the positive class rather than the negative class and measures the probability of correct detection of positive values, whereas FPR and TPR measure the ability to distinguish between the classes. Because the DrugAge database contains many more inactive than active samples, the false positive rate increases more slowly. Hence, precision (Equation 3) is a better metric, as it is not affected by the large number of negative samples.
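For reference, the four metrics discussed above follow directly from the counts in a confusion matrix. The sketch below uses the standard textbook definitions; the counts are hypothetical and are not those of Figure 3.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Standard metrics derived from the four cells of a binary confusion matrix."""
    return {
        "TPR": tp / (tp + fn),        # recall: actives correctly identified
        "FPR": fp / (fp + tn),        # inactives wrongly flagged as active
        "precision": tp / (tp + fp),  # PPV: predicted actives that are truly active
        "NPV": tn / (tn + fn),        # predicted inactives that are truly inactive
    }


# Hypothetical counts for an imbalanced test set (more inactive than active).
metrics = confusion_metrics(tp=20, fp=10, tn=60, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because FPR is divided by the large number of true negatives, it changes slowly on an imbalanced dataset, whereas precision has no true negatives in its denominator.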
Feature importance. Feature selection is an important data pre-processing step that improves the prediction performance of the model and increases the speed of calculations30. It can be used to rank the features as well as to perform dimensionality reduction. A benefit of gradient boosting with ensembles of decision trees is that a trained predictive model automatically provides estimates of feature importance. After the boosted trees are constructed, an importance score is retrieved for each attribute, indicating how useful or valuable each feature is within the model. The selected model, MD, included 289 molecular descriptor features calculated by the MOE software26. The trained XGBoost model automatically calculated the feature importance, and the top 40 features of the MD model were ranked by their Feature Importance score. The top-ranking features fall into three categories: 1) subdivided surface areas, 2) electrostatic descriptors, and 3) atom and bond counts.
In addition, the XGBoost library provides a built-in function to plot features ordered by their ranking in the MD model; the top 15 features are shown in Table 2 and in Figure 4.
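XGBoost's `plot_importance` function defaults to the "weight" importance type, which counts how many times a feature is used to split the data across all trees. The text does not state which importance type was used, so treating the scores as split counts is an assumption. The sketch below illustrates that counting on hand-written toy trees; the nested-dictionary tree format is hypothetical and is not XGBoost's internal representation.

```python
from collections import Counter


def count_splits(node, counts):
    """Recursively count how often each feature is used as a split."""
    if "leaf" in node:
        return
    counts[node["feature"]] += 1
    count_splits(node["left"], counts)
    count_splits(node["right"], counts)


def split_count_importance(trees):
    """Rank features by the number of splits they appear in across all trees."""
    counts = Counter()
    for tree in trees:
        count_splits(tree, counts)
    return counts.most_common()


leaf = {"leaf": 0.0}
# Three toy boosted trees (hypothetical structure and feature choice).
trees = [
    {"feature": "a_nN", "left": leaf,
     "right": {"feature": "vsurf_D8", "left": leaf, "right": leaf}},
    {"feature": "vsurf_D8",
     "left": {"feature": "a_nN", "left": leaf, "right": leaf},
     "right": {"feature": "b_single", "left": leaf, "right": leaf}},
    {"feature": "a_nN", "left": leaf, "right": leaf},
]

ranking = split_count_importance(trees)
print(ranking)
```

A split count says how often a feature was used, not whether high values of that feature push the prediction towards the active or the inactive class.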
The subdivided surface areas are descriptors based on each atom’s approximate accessible van der Waals surface area (in Å²)26. The subdivided surface area descriptors with the highest feature scores are vsurf_D8, SlogP_VSA3, SlogP_VSA8, SMR_VSA3, and SMR_VSA6. The second highest-ranking feature, vsurf_D8, is a hydrophobic volume and is a structure connectivity- and conformation-dependent descriptor. This relates to the study by Rinnie et al. (2019), in which the authors suggested that the presence of nitrogen atoms reduced hydrophobicity, which was found to be the main physicochemical property for the activity of 4-alkynyldihydrocinnamic acid (ADCA) analogs as free fatty acid receptor 1 (FFAR1) agonists29. It was also reported that the presence of nitrogen atoms reduced the activity of a series of ADCA analogs as FFAR1 agonists31, which relates to the a_nN feature.
Electronic molecular descriptors describe the electron distribution in a chemical compound and its electrostatic interactions. These included the highest-ranking partial charge descriptors, such as PEOE_VSA+2, Q_VSA_PPOS, PEOE_VSA-1, and PEOE_VSA-4. They outperformed the other features and occupied more places in the feature importance ranking. The difference between the variants is the source of the partial charge. The “PEOE_” prefix denotes descriptors calculated using the partial equalization of orbital electronegativity (PEOE) algorithm for quantifying atomic partial charges in the system31. The more electronegative an atom, the greater its tendency to attract the electron density of the bonding pair and, therefore, to adopt a negative charge32. In a ligand–receptor system, partial charges contribute to the binding properties of the molecule and to molecular recognition.
Atom and bond count descriptors essentially describe the size and shape of the chemical compound. The size and shape of a compound may influence its binding to enzyme or receptor binding sites and can also affect other physicochemical properties. The highest-ranking atom and bond count descriptors were b_single, b_double, a_hyd, opr_brigid, and a_nN. The number of nitrogen atoms, a_nN, had a Feature Importance score of 20.0, the same as the number of rigid bonds (opr_brigid). This ranking is lower than in the results of Barardo et al. (2017), where a_nN was ranked highest for predicting the class of the compounds in the DrugAge database4. Atom and bond counts can partially capture overall properties of a compound, such as size, hydrogen bonding, and polarity, which can in turn affect the overall activity and performance of a drug.
Table 2. Top 15 features ranked by built-in feature importance score for the MD model. The descriptions of the features were taken from MOE™26.
Feature importance score | Feature | Description
31.0 | PEOE_VSA-4 | Total positive van der Waals surface area of atoms with a partial charge in the range of −0.25 to −0.20
30.0 | vsurf_D8 | Hydrophobic volume at −1.6
27.0 | PEOE_VSA-1 | Total positive van der Waals surface area of atoms with a partial charge in the range of −0.10 to −0.05
25.0 | SlogP_VSA3 | Sum of van der Waals surface areas such that the logP(o/w) is in the range of 0.0 to 0.1
22.0 | rsynth | A value in [0,1] indicating the synthetic reasonableness, or feasibility, of the chemical structure
22.0 | Q_VSA_PPOS | Sum of van der Waals surface areas such that the molar refractivity contribution is in the range of 0.485 to 0.560
21.0 | b_single | Number of single bonds (including implicit H)
20.0 | opr_brigid | Number of rigid bonds
20.0 | a_nN | Number of nitrogen atoms
19.0 | SlogP_VSA8 | Sum of van der Waals surface areas such that the logP(o/w) is in the range of 0.30 to 0.40
19.0 | b_double | Number of double bonds
19.0 | chi0v_C | Carbon valence connectivity index (order 0)
19.0 | SMR_VSA3 | Sum of van der Waals surface areas such that the molar refractivity contribution is in the range of 0.35 to 0.39
18.0 | PEOE_VSA+2 | Total positive van der Waals surface area of atoms with a partial charge in the range of 0.10 to 0.15
17.0 | SMR_VSA6 | Sum of van der Waals surface areas such that the molar refractivity contribution is in the range of 0.485 to 0.560
Shapley values. Feature Importance plots do not readily show how the descriptors were used to predict the class of a compound, or whether they correlated positively or negatively with activity. This was examined using the SHAP Python package33, which is based on a model-agnostic method and uses Shapley values from game theory34. To show how a single feature affected the output of the model, the SHAP values for each feature in the MD model are plotted against the value of the feature in Figure 5. Vertical dispersion represents interaction effects with other features. To help reveal these interactions, the dependence plot automatically selects another feature for colouring. Larger values of a certain feature led to a higher predicted number of life-extending compounds. In this plot, the feature for the number of nitrogen atoms (a_nN) dominates the SHAP value ranking; the resulting order of feature importance differs under the SHAP tree explainer method. Nitrogen atoms could have affected the physicochemical properties of the drugs as well as the interactions and binding of the molecules with target residues. Most of the feature values for the number of nitrogen atoms have a negative SHAP value. Assuming the features selected by the built-in feature importance of the XGBoost algorithm are indeed the most important, the SHAP summary plot allows each feature’s contribution to the classification process to be interpreted with more confidence. Therefore, combining these two methods of feature analysis helps to understand how the model makes its predictions.
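To make the underlying idea concrete, the sketch below computes exact Shapley values for a small toy model by averaging each feature's marginal contribution over all feature orderings. This brute-force enumeration is only feasible for a handful of features and is not the algorithm used by the SHAP tree explainer; the toy model, the zero baseline, and the input values are all hypothetical.

```python
from itertools import permutations


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline).

    Features are switched from their baseline value to their actual value
    one at a time, in every possible order, and the resulting changes in
    the model output are averaged per feature.
    """
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous = model(current)
        for i in order:
            current[i] = x[i]
            updated = model(current)
            phi[i] += updated - previous
            previous = updated
    return [value / len(orderings) for value in phi]


def toy_model(features):
    """Hypothetical score with one interaction term (features 0 and 2)."""
    return 2.0 * features[0] - 1.0 * features[1] + features[0] * features[2]


x = [1.0, 3.0, 2.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The values sum to the difference between the prediction for `x` and the prediction for the baseline, and the sign of each value shows whether that feature pushed the output up or down, which is the information a split-count importance score cannot provide.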
Prediction of lifespan-extending compounds. The selected highest-performing model, MD, was applied to predict the class of the compounds in an external database, DrugBank. The database consisted of 1754 approved small molecules with a total of 335 features; the top-ranking compounds, with a predicted probability of at least 0.70 of increasing the lifespan of C. elegans, are presented in Table 3. The resulting compounds were then divided into three groups: (i) flavonoids and isoflavonoids, (ii) fatty acids and conjugates, and (iii) other classes of compounds.
Table 3. Chemical compounds from the DrugBank database with their predicted probability of extending the lifespan of C. elegans.
Chemical name | Predicted probability | Compound classification
Diosmin | 0.87 | flavonoids
Rutin | 0.84 | flavonoids
Hesperidin | 0.79 | flavonoids
Sodium aurothiomalate | 0.79 | fatty acids and conjugates
Aloin | 0.78 | other
Ertugliflozin | 0.75 | other
Soy isoflavones | 0.75 | isoflavonoids
Calcium glucoheptonate | 0.74 | other
Obeticholic acid | 0.73 | fatty acids and conjugates
Alginic acid | 0.72 | fatty acids and conjugates
Microcrystalline cellulose | 0.71 | other
Calcium saccharate | 0.70 | fatty acids and conjugates
Lactose | 0.70 | other
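The screening step that produces a list such as Table 3 amounts to filtering the predicted probabilities at a cut-off and sorting the survivors. The following is a minimal sketch, assuming an inclusive threshold of 0.70; the function name and the two below-threshold entries (`compound_x`, `compound_y`) are hypothetical, while the other three probabilities are taken from Table 3.

```python
def top_candidates(predictions, threshold=0.70):
    """Keep compounds at or above the threshold, highest probability first."""
    hits = [(name, p) for name, p in predictions.items() if p >= threshold]
    return sorted(hits, key=lambda item: item[1], reverse=True)


predictions = {
    "Lactose": 0.70,
    "compound_x": 0.42,  # hypothetical, below the cut-off
    "Diosmin": 0.87,
    "compound_y": 0.69,  # hypothetical, just below the cut-off
    "Rutin": 0.84,
}

candidates = top_candidates(predictions)
print(candidates)
```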
Flavonoids and isoflavonoids. Flavonoids are a group of naturally occurring substances with variable phenolic structures, usually occurring in fruits, vegetables, grains, roots, flowers, tea, and wine35. Flavonoids are a common component in a variety of nutraceutical, pharmaceutical, medicinal, and cosmetic applications. Research has shown that flavonoids contribute to the prevention of coronary heart disease and to a decreased cardiovascular mortality rate36. Besides having the capacity to modulate key cellular enzyme functions, flavonoids possess anti-oxidative, anti-mutagenic, anti-inflammatory, and anti-carcinogenic properties relevant to diseases such as cancer, Alzheimer’s disease, and atherosclerosis37. The antioxidant activity of flavonoids relates to their ability to reduce free radical formation and to scavenge free radicals, such as reactive oxygen species (ROS). ROS can be damaging in vivo, as they attack lipids in cell membranes, proteins in tissues or enzymes, and DNA, inducing oxidation38. This causes membrane damage, protein modification, and DNA damage, which together constitute oxidative stress (see Glossary)100 that negatively affects ageing and associated pathologies39.
The compound with the highest predicted probability (0.87), diosmin, is a natural flavonol glycoside known for treating varicose veins and chronic venous insufficiency40. Diosmin possesses diverse pharmacological activities, including antioxidant, anti-inflammatory, anti-diabetic, anti-cancer, and antimicrobial activities, as well as liver, cardiovascular, retinal, and neuroprotective effects40. Its low water solubility makes diosmin difficult to use in clinical applications40. A study by Kamel et al. (2017) indicated that combining diosmin with essential oil improved its antioxidant, sun-blocking, and anti-photoageing effects41.
The flavonol glycoside rutin is a glycosylated conjugate of quercetin found in buckwheat seeds and some citrus fruits. Rutin has also been identified as the bioactive component of small berries, such as cranberry, goji berry, pomegranate, and blackcurrant, with anti-ageing activity that increased the lifespan of C. elegans42. A study has also confirmed that rutin has relatively strong antioxidative activity in the worms43. Moreover, rutin consumption might help to prevent Huntington’s disease, an inherited neurodegenerative disease, through the insulin/IGF1 signaling (IIS) pathway and autophagy activity44.
Hesperidin (HSD) is one of the principal bioflavonoids of citrus fruits. HSD is known for its bioactivity and its antioxidant, anti-inflammatory, and anti-carcinogenic functions45. It is commonly used as an anti-ageing active component in cosmetics, as it acts as a topical UV-protective agent and a potent anti-photoageing factor46. Neohesperidin showed the capacity to extend the chronological lifespan of yeast with respect to 10 different ageing factors, such as ROS scavenging, regulation of stress-related enzymes, and maintenance of a cellular pH favourable for life extension of yeast cells47.
Isoflavonoids are a large subgroup of flavonoids that includes genistein, glycitein, and daidzein, which are predominantly found in soyabeans and other leguminous plants and have the potential to fight several diseases36. Soy isoflavones (SIF) have been shown to protect against oxidative DNA damage in different cell lines and to possess antioxidant activities in both animals and humans36. Isoflavones have beneficial health-related effects due to their diphenolic structure and their phytoestrogenic activity, which may contribute to their potential anti-carcinogenic and cardio-protective effects48. SIF supplementation may effectively attenuate oxidative stress and improve parameters related to ageing and Alzheimer’s disease, as studied in mice and C. elegans48, 49.
Fatty acids and conjugates. Sodium aurothiomalate is a gold compound used for its immunosuppressive anti-rheumatic effects50. The precise mechanism of its anti-inflammatory effect is unknown, but it may alter cellular mechanisms by inhibiting sulfhydryl systems51, 52.
Calcium glucoheptonate, or calcium α-D-glucoheptonate, is a calcium supplement used to treat hypocalcemia and maintain calcium levels53. Calcium saccharate is the calcium salt of a sugar acid derived from D-glucose. Calcium ions participate in a large set of cellular processes, and studies in C. elegans have reported that calcium is a central factor in neurodegeneration54. Calcium is required for the heart, muscles, and nervous system to work properly. However, an excess of calcium ions in the worm’s pharyngeal muscle has been correlated with the loss of its mechanical function112. A correct balance in calcium levels is important for muscle contractility, and it is possible that the observed accumulation of calcium in ageing C. elegans contributed to the age-related slowing of body movement. Another study reported that feeding calcium D-saccharate did not affect the lifespan of D. melanogaster, whereas a high dietary intake of certain calcium salts can even increase the rate of ageing55.
Obeticholic acid is an analog of the natural bile acid chenodeoxycholic acid and is most commonly used to treat autoimmune liver disease56. Bile acid-like molecules control the dauer formation program in C. elegans through the nuclear receptor abnormal dauer formation protein 12 (DAF-12)57. DAF-12 modulation is closely linked to lifespan extension and metabolic homeostasis58.
Alginic acid, or algin, is a naturally occurring polysaccharide found in brown algae. Alginic acid is a biopolymer formed from chains of polyuronic acids and is used only in combination with antacids59. A study demonstrated that supplementation with polysaccharides such as hyaluronic acid (HA) and alginic acid (AA) does not extend the lifespan of C. elegans60.
Other classes of compounds. Aloin is a C-glycosyl compound extracted from aloe vera latex, with metabolite and laxative properties. Aloe vera is a useful source of vitamins, such as vitamins A, B12, C, and E and folic acid, while aloe vera gel contains 19 of the 20 amino acids needed by the human body61. Research indicated that aloin and aloe-emodin may suppress inflammatory responses by blocking iNOS and COX-2 mRNA expression62. A study showed that aloe vera supplementation extended longevity in D. melanogaster by increasing antioxidant enzyme activity and improving neuroprotection63.
Lactose is a disaccharide found in most dairy products. Lactose supplementation shortens the lifespan of C. elegans: a study by Xing et al. (2019) confirmed that lactose treatment significantly induced cellular ROS64. According to the oxidative stress theory (OST), increases in mitochondrial ROS and oxidative damage disrupt normal functions and, therefore, accelerate cellular senescence11.
Microcrystalline cellulose (MCC) is a refined wood pulp and a renewable nanomaterial; it is a chemically inert, insoluble biopolymer. It is commonly used as an excipient in the pharmaceutical industry, as a binder or adsorbent, since it has good compressibility properties and is used in solid dosage forms (tablets)65.
Ertugliflozin is a diarylmethane and a prescription oral drug, sold under the name Steglatro, used in adults with type 2 diabetes to improve blood sugar (glucose) control66.