Current Trends and Challenges in Drug-Likeness Prediction: Are They Generalizable and Interpretable?

Importance: Drug-likeness of a compound is an overall assessment of its potential to succeed in clinical trials, and is essential for economizing research expenditures by filtering out compounds with unfavorable properties and poor development potential. To this end, a robust drug-likeness prediction method is indispensable. Various approaches, including discriminative rules, statistical models, and machine learning models, have been developed to predict drug-likeness based on physicochemical properties and structural features. Notably, novel deep learning techniques have recently advanced drug-likeness prediction significantly, especially in classification performance. Highlights: In this review, we addressed the evolving landscape of drug-likeness prediction, with emphasis on methods employing novel deep learning techniques, and highlighted the current challenges in drug-likeness prediction, specifically regarding generalization and interpretability. Moreover, we explored potential remedies and outlined promising avenues for future research. Conclusion: Despite the hurdles of generalization and interpretability, novel deep learning techniques have great potential in drug-likeness prediction and are worthy of further research efforts.


Introduction
The lengthy drug development timelines and the low success rate in clinical trials give rise to significant risks and substantial expenses associated with bringing a drug to market [1][2][3][4][5]. Because tens of billions of compounds are tangible [6], it is advisable to filter out those with low development value at an early stage and thereby avoid wasteful expenditure in drug research and development.
Drug-likeness of a compound is defined by its physicochemical or structural similarity to a set of known drugs, and it holistically assesses the potential for passing clinical trials. By screening chemical libraries for drug-likeness, compounds with potentially adverse properties can be filtered out, thereby reducing the risk in the later stages of drug development. A common category of drug-likeness evaluation methods is property-based rules, defined as acceptable thresholds on the physicochemical properties of drugs or investigational candidates [7][8][9][10], such as the well-known Lipinski's Rule of Five (RO5). In addition, drug-likeness can be defined by the representative structural patterns of drugs [11][12][13][14].
While some publications [31][32][33][34][35] have thoroughly reviewed drug-likeness prediction methods, including rules and traditional machine learning (ML) models, and even from the perspective of ADME/T (absorption, distribution, metabolism, excretion, and toxicity), a review that elaborates on methods using novel deep learning techniques is absent. In this review, we primarily focused on the assessment of drug-likeness based on methods derived directly from the set of known drugs, rather than on aspects such as ADME/T properties [36][37][38][39][40][41]. We overviewed drug-likeness prediction methods in order of their development and surveyed recent advances involving novel artificial neural network (ANN) techniques. From the two aspects of generalizability and interpretability, we identified major challenges in drug-likeness research and discussed possible solutions and prospects.

Drug-likeness filters and scorers defined on physicochemical properties and structural features
The earliest and most famous drug-likeness rule is RO5 [7], which is a group of 4 empirical rules summarized from clinical phase II drugs: (a) molecular weight ≤ 500, (b) octanol/water partition coefficient ≤ 5, (c) number of hydrogen bond donors ≤ 5, and (d) number of hydrogen bond acceptors ≤ 10. Compounds that meet at least 3 of these rules are considered drug-like and likely to be well absorbed orally via passive transport. Thereafter, various empirical drug-likeness rules [8][9][10] have been developed based on the analysis of physicochemical properties of drugs or drug-like compound databases. Some studies [36][37][38][39] have also developed rules based on ADME/T properties in support of oral medications [42], considering the convenience of oral dosing and patient compliance.
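The four RO5 thresholds listed above translate directly into a counting filter. The following is a minimal sketch that assumes the properties have already been computed by some cheminformatics tool; the `Properties` container, the function names, and the example values are hypothetical and serve only as illustration.

```python
from dataclasses import dataclass


@dataclass
class Properties:
    """Precomputed properties of one compound (hypothetical container)."""
    mol_weight: float        # molecular weight
    log_p: float             # octanol/water partition coefficient
    h_bond_donors: int       # number of hydrogen bond donors
    h_bond_acceptors: int    # number of hydrogen bond acceptors


def ro5_rules_met(p: Properties) -> int:
    """Count how many of the four RO5 criteria the compound satisfies."""
    checks = [
        p.mol_weight <= 500,
        p.log_p <= 5,
        p.h_bond_donors <= 5,
        p.h_bond_acceptors <= 10,
    ]
    return sum(checks)


def is_drug_like(p: Properties, min_rules: int = 3) -> bool:
    """Apply the 'at least 3 of 4 rules' criterion described in the text."""
    return ro5_rules_met(p) >= min_rules


# Illustrative values only, not measured data: logP violates rule (b).
example = Properties(mol_weight=350.0, log_p=5.6, h_bond_donors=2, h_bond_acceptors=6)
print(ro5_rules_met(example), is_drug_like(example))  # prints "3 True"
```

Keeping the rule count separate from the final decision makes it easy to tighten or relax the `min_rules` threshold when a stricter or looser filter is wanted.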
Some analyses [43][44][45][46] of building blocks and frameworks in drug sets have found typical structural patterns that may help to understand drug-likeness concretely and to find the preferred frameworks and moieties for focused chemical library design. Therefore, drug-likeness methods can be defined by comparing the structural features of a compound to those of a set of known drugs from different perspectives. These include the distance to the cluster center extracted from drug building block descriptors [11], the comparison between the probability of substructures in the query molecule and in drugs [12,13] (Fig. 1), the presence of pharmacophores according to medicinal chemistry theory [14], and principal components analysis in the space of physicochemical and structural descriptors [47].
Based on physicochemical and structural descriptors, optimization algorithms such as genetic algorithms were adapted to explore the chemical space effectively and obtain diverse libraries with drug-like properties. SELECT [48] and its variant MoSELECT [49], which was combined with Pareto optimality to address multi-objective optimization, could be used to choose an optimal configuration for a multicomponent (e.g., amide) library. Genetic algorithms have also been used in drug-like feature selection for both substructure analysis [50] and ML modeling [18]. In addition, search algorithms including Monte Carlo tree search were used to build drug-like libraries efficiently [51,52].
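To make the notion of Pareto optimality mentioned above concrete, the sketch below filters a set of candidates down to the non-dominated ones. It illustrates only the dominance test, not the genetic algorithm of MoSELECT [49]; the two objectives (diversity and drug-likeness, both to be maximized) and all scores are hypothetical.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b on every objective and strictly
    better on at least one (all objectives are maximized)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores: List[Sequence[float]]) -> List[int]:
    """Return the indices of candidates that no other candidate dominates."""
    return [
        i
        for i, s in enumerate(scores)
        if not any(dominates(t, s) for j, t in enumerate(scores) if j != i)
    ]


# Hypothetical (diversity, drug-likeness) scores for four candidate libraries.
candidates = [(0.9, 0.2), (0.6, 0.6), (0.5, 0.5), (0.2, 0.9)]
front = pareto_front(candidates)
print(front)  # prints "[0, 1, 3]": candidate 2 is dominated by candidate 1
```

Returning a set of trade-off solutions rather than a single optimum lets the chemist choose among libraries that balance the objectives differently.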
In addition to the binary discrimination between drug-like and non-drug-like molecules, a continuous measurement may be preferred for fine-grained assessment and flexible use. Quantitative estimate of drug-likeness (QED) [53] applied desirability functions, a simple yet efficient method for multi-objective optimization, to provide a quantitative measurement of drug-likeness. Eight commonly used properties for drug-likeness assessment were taken into consideration, and a desirability function was obtained for each by fitting an asymmetric double sigmoidal function to the distribution of that property among oral drugs. The final QED score was given by (weighted) geometric averaging over the desirability function values of the molecule:

\[
\mathrm{QED} = \exp\left( \frac{\sum_{i=1}^{n} w_i \ln d_i}{\sum_{i=1}^{n} w_i} \right)
\]

where $d_i$ indicates the $i$th desirability function with $w_i$ as its weight, and $n$ is the number of properties being considered and equals 8. Other continuous measurements include the drug-likeness probabilities predicted by Bayesian probability theory [54], multivariate logistic regression [55], and various ML models.
Several new property indices have recently been proposed to measure drug-likeness, such as the fraction lipophilicity index [56] and the fraction of sp3 carbon atoms [57]. These property-based rules are straightforward and easy to use, but they may not be adequate for dealing with more complex relationships between molecular features and the success rate in clinical trials. Methods based on structural features are conceptually consistent with fragment-based drug design and facilitate the construction of combinatorial chemistry-friendly design libraries. However, their main flaw lies in the overrestriction to the frameworks of known drugs or drug candidates, so they may miss novel drug scaffolds in chemical space and limit creativity in their application.
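The weighted geometric averaging described above, that is, the exponential of the weighted mean of the log-desirabilities, can be sketched as follows. The function name and the desirability values are hypothetical; in QED [53] the desirabilities come from the fitted asymmetric double sigmoidal functions, which are not reproduced here.

```python
import math
from typing import Sequence


def weighted_geometric_mean(d: Sequence[float], w: Sequence[float]) -> float:
    """Combine desirability values d_i with weights w_i:
    exp(sum(w_i * ln d_i) / sum(w_i))."""
    if len(d) != len(w):
        raise ValueError("d and w must have the same length")
    if any(di <= 0 for di in d):
        raise ValueError("desirability values must be positive")
    total_weight = sum(w)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    log_sum = sum(wi * math.log(di) for di, wi in zip(d, w))
    return math.exp(log_sum / total_weight)


# Hypothetical desirability values for the 8 properties of one molecule.
desirabilities = [0.9, 0.8, 0.7, 0.95, 0.85, 0.6, 0.75, 0.5]
unweighted_score = weighted_geometric_mean(desirabilities, [1.0] * 8)
print(round(unweighted_score, 3))
```

Because the mean is geometric, a single property with a desirability close to zero pulls the whole score down, which an arithmetic mean would not do.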

Predicting drug-likeness with traditional ML models
Given the limited number of known drugs, ML methods were introduced to model and predict drug-likeness in consideration of their extrapolation capabilities. ML models are computer systems that learn and improve from data automatically [58]. Due to their ability to handle large-scale data and to separate complex features, ML models are widely used in various stages of drug development, providing broad opportunities in research and innovation [59].
ANNs (Fig. 2A) are a category of ML models that are characterized by nonlinear transformations and thus capable of fitting any function [60]. The use of ANNs to predict drug-likeness dates back to 1998, when Ajay and Murcko [15] used Bayesian neural networks (BNNs) to classify drug-like and non-drug-like molecules based on 7 molecular property descriptors and 166-bit MDL substructure fingerprints. In the same year, Sadowski and Kubinyi [16] used 92 Ghose-Crippen atomic types as descriptors and constructed a feedforward neural network to score the degree of drug-likeness for molecules. These early ANN methods outperformed property-based filters and even decision tree (DT) models in accuracy of discriminating between drug-like and non-drug-like structures. This predictive power was shown to generalize to external datasets.
Support vector machines (SVMs) [62] (Fig. 2B) were introduced into research on drug-likeness prediction [17][18][19][20][21][22] as they became popular in the ML community. There is also a webserver, DrugMint [63], developed based on SVM. In comparative studies [17][18][19]64], SVM showed improvement in both performance and robustness compared to ANN using the same descriptors. The performance of SVM in predicting drug-likeness is highly dependent on the input features. For instance, Li et al. [21] found that using a set of elaborately designed structural descriptors, extended connectivity fingerprints (ECFPs) [65], led to better accuracy than molecular properties or atom types (MOLPRINT 2D [66]). Moreover, Korkmaz et al. [22] improved the baseline SVM model by applying various feature selection strategies.
DT models (Fig. 2C) were adopted in predicting drug-likeness as well. Though they exhibit similar or inferior performance to SVM [19], DT models have advantages in interpretability: focusing on either moieties [23] or properties [24] depending on the initial condition settings, the branch conditions of DT models can be extracted and used as criteria for drug-like compound design.

Fig. 1. Drug-likeness methods defined on the comparison of substructure probabilities. (A) The multilevel chemical compatibility method extracts substructures by 5 kinds of atom-centered groups labeled with letter C, and their side atoms labeled with letter A (adapted with permission from [12], Copyright 1999 American Chemical Society). (B) Substructures defined by the Morgan algorithm include topological structures centered on any atom with radius r, and features of the involved atoms (adapted with permission from [13], Copyright 2010 American Chemical Society).
Overall, most of these traditional ML models achieved modest classification performance, with accuracy reaching 80%. Li et al. [21] achieved 92.73% classification accuracy by using ECFP4 and a larger dataset, which contributed increases of ~4% and ~5%, respectively.

Drug-likeness prediction using novel deep learning models
With the resurgence of deep learning methods, drug-likeness prediction methods based on novel ANN architectures and molecular representations (as shown in Fig. 3) have been proposed for better classification performance.
Due to the relatively small size of the drug set, pretraining seems a helpful strategy to leverage unlabeled molecules and learn broad chemical knowledge, and thus to improve classification performance on the downstream drug-likeness prediction task in the later fine-tuning process. The work by Hu et al. [25], which used autoencoders (AEs) for pretraining, perhaps bridges traditional ANN methods and new ANN techniques. The predictive model was initialized with the parameters of the encoder part and then trained on the classification task between drugs and non-drug-like molecules. Their model achieved 91% accuracy for drug-like/non-drug-like classification and 97% for drug/non-drug-like classification, outperforming early ANN and SVM methods. This may be attributed to the basic chemical knowledge that the model learned during the pretraining process. Hooshmand et al. [26] also employed a pretraining approach to enhance drug-likeness prediction using a deep belief network, where every 2 consecutive hidden layers make up a restricted Boltzmann machine and are pretrained layer-wise in a greedy manner using the contrastive divergence algorithm. Their model achieved 97.75% accuracy on the leave-out test set, 2% better than Hu et al. [25], and 93.08% on the external test set, to which pretraining contributed an improvement of over 7%.
Later studies concentrated more on learning drug-likeness in an end-to-end way, using graph neural networks (GNNs; Fig. 2D) that operate directly on graphs built from molecular structures instead of elaborately designed descriptors. Beker et al. [27] evaluated multilayer perceptron (MLP) classifiers with different descriptors as inputs, which were either randomly initialized or pretrained via an AE, as well as a GNN with molecular graphs as inputs, and found only minor gaps among their performance. The comparison suggests that the performance improvement over earlier studies is likely due to the use of more elaborate molecular representations rather than the model architectures. This is consistent with the satisfactory results achieved by Li et al. [21], who used SVM with ECFP rather than atom types or molecular descriptors as inputs. Having observed that in some cases one model made incorrect predictions with high variance while another made correct predictions with low variance, Beker et al. [27] further improved the classification accuracy by combining 2 BNNs modified from the former models and retaining the less uncertain predictions. This adaptation raised the external accuracy of the best predictors from the 87% to 88% range to 93%, which is at the same level as Hooshmand et al. [26]. Sun et al. [28] predicted drug-likeness with a graph convolutional attention network (D-GCAN), which introduced an attention mechanism into the GNN and achieved 1% to 3% higher performance than the combined BNN of Beker et al. [27]. Cai et al. [29] developed a 3-subdivisional drug-likeness prediction model system, which consists of 3 individually trained models for evaluating the potential of in-stock compounds to progressively reach the in vivo, investigational, and approved stages. They also combined active learning with ensemble learning to enhance the predictive ability of these models.
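The idea of combining 2 BNNs by retaining the less uncertain prediction, described above for Beker et al. [27], can be sketched in a few lines. This is an illustration under stated assumptions rather than their implementation: the `Prediction` container, the use of predictive variance as the uncertainty measure, the tie-breaking rule, and all numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Prediction:
    """One model's output for one molecule (hypothetical container)."""
    prob_drug: float   # predicted probability of the drug class
    variance: float    # predictive variance, used here as the uncertainty


def pick_less_uncertain(a: Prediction, b: Prediction) -> Prediction:
    """Keep the prediction with the lower variance (ties go to the first model)."""
    return a if a.variance <= b.variance else b


def combine(model_a: List[Prediction], model_b: List[Prediction]) -> List[Prediction]:
    """Combine two models molecule by molecule."""
    if len(model_a) != len(model_b):
        raise ValueError("both models must score the same molecules")
    return [pick_less_uncertain(a, b) for a, b in zip(model_a, model_b)]


# Hypothetical outputs of two models for three molecules.
outputs_a = [Prediction(0.30, 0.20), Prediction(0.85, 0.01), Prediction(0.55, 0.10)]
outputs_b = [Prediction(0.90, 0.02), Prediction(0.40, 0.15), Prediction(0.60, 0.10)]
combined = combine(outputs_a, outputs_b)
print([p.prob_drug for p in combined])  # prints "[0.9, 0.85, 0.55]"
```

The combination only helps when the variances are reasonably calibrated, so that a confident model is more often the correct one.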
Whether to use a non-drug-like set as negative background is a concern in drug-likeness prediction. Beker et al. [27] pointed out that QED is limited by its reliance on only the drugs, and surveyed 3 alternative non-drug datasets: ZINC15 [67], the Network of Organic Chemistry (NOC) [68,69], and the Protein Data Bank (PDB) [70]. According to the evaluation results on the drug set and the negative sets built through positive-unlabeled learning, they observed that the drug/non-drug classifiers learned from ZINC15 performed better than the others, and recommended ZINC15 as the negative set. On the contrary, Lee et al. [30] argued that these dichotomous models tend to learn ad hoc features that discriminate between drugs and non-drug-like molecules rather than the features of each class individually, so their generalization is limited when the non-drug-like molecules to be distinguished are substantially different from those in the negative training set. Instead, they adopted generative self-supervised learning to train a recurrent neural network (RNN) on the SMILES strings of only drugs to fit their distribution. As a result, the drug-likeness scores of their self-supervised model showed relatively more consistent performance than the dichotomous GNN across different negative sets including GDB17 [71], ZINC15 [67], and ChEMBL [72], although the classification metric dropped by 7% compared with the latter.
We summarized the aforementioned studies involving novel ANN techniques in Table 1. Although differing in methods, these studies have primarily focused on developing classifiers between drugs and non-drug-like compounds based on their structures, and have demonstrated high predictive performance. We also listed drug/drug-like and non-drug-like databases in Table 2.
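Scoring a molecule by its likelihood under a model fitted to drugs only, with no negative set, can be illustrated with a much simpler stand-in than the RNN of Lee et al. [30]. The sketch below swaps in a character-level bigram model with add-one smoothing; the class name, the tokenization, and the three training strings are assumptions for illustration and do not constitute a real drug set.

```python
import math
from collections import Counter
from typing import Iterable, List

START, END = "^", "$"  # hypothetical start/end markers


def tokenize(smiles: str) -> List[str]:
    """Character-level tokens (a simplification: atoms such as Cl are split)."""
    return [START, *smiles, END]


class BigramScorer:
    """Bigram stand-in for a generative sequence model trained on positives only."""

    def __init__(self, positive_smiles: Iterable[str]) -> None:
        self.pair_counts: Counter = Counter()
        self.context_counts: Counter = Counter()
        vocab = {END}
        for s in positive_smiles:
            toks = tokenize(s)
            vocab.update(toks)
            for prev, nxt in zip(toks, toks[1:]):
                self.pair_counts[(prev, nxt)] += 1
                self.context_counts[prev] += 1
        self.vocab_size = len(vocab) + 1  # one extra slot for unseen characters

    def score(self, smiles: str) -> float:
        """Mean log-probability per transition; higher means closer to the
        training distribution."""
        toks = tokenize(smiles)
        log_prob = 0.0
        for prev, nxt in zip(toks, toks[1:]):
            numerator = self.pair_counts[(prev, nxt)] + 1
            denominator = self.context_counts[prev] + self.vocab_size
            log_prob += math.log(numerator / denominator)
        return log_prob / (len(toks) - 1)


# Placeholder training strings (valid SMILES, but not an actual drug set).
scorer = BigramScorer(["CCO", "CC(=O)O", "c1ccccc1O"])
print(scorer.score("CCO") > scorer.score("[Xe]"))  # prints "True"
```

Since the score depends only on the positive set, it does not change when the background molecules change, which is the property that motivates fitting drugs alone.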

Challenges and potential directions of drug-likeness prediction
Recent research on using ANNs to predict drug-likeness may have gradually fallen into the trap of focusing solely on improving the performance of dichotomous classification, neglecting the original purpose of drug-likeness, which is to exclude compounds with poor properties that are likely to fail in later stages.

Table footnotes:
a Area under receiver operating characteristic curve.
b MDL Drug Data Report.
c World Drug Index.
d Available Chemical Directory.
e Fingerprint encoding 881 structural features implemented in the PubChem database [78].
f A set of 200 descriptors from the RDKit [79] library.
g A bit vector denoting the presence or absence of the 3,000 maximum common substructures appearing most frequently in the drug and non-drug datasets.

A desirable drug-likeness index should possess good generalizability and interpretability rather than pursuing classification performance alone. Classification metrics directly demonstrate the ability to distinguish drugs from non-drug-like compounds in a mixed dataset, but the screening power for identifying compounds with poor development value remains unclear. Drug-likeness research poses a unique challenge since, unlike tasks such as predicting solubility or activity, the only ground truth of drug-likeness to rely on is the results of clinical trials, which cost far more than laboratory tests. Therefore, it is nearly impossible to perform experimental ex post validation of a drug-likeness index. The primary challenge in drug-likeness research is the demand for robust generalizability. Dichotomous models extract features that discriminate between drugs and non-drug-like compounds, so the assigned probability is based on the ratio of distances between the given compound and the drugs/non-drug-like molecules in the training set. However, it is impractical to construct an ideal negative dataset that encompasses the entire chemical space of non-drug-like molecules, especially considering the data imbalance, as there are only thousands of drugs. To address this challenge, some drug-likeness methods have employed subdivisional labels assigned by chemists [20] or incorporated research progress [29]. Others fit only drugs to gain independence from the non-drug-like background, such as QED [53] and the self-supervised RNN [30].
An emerging approach involves the utilization of multimodal and large pretrained models, which have shown exceptional performance in fields such as natural language processing [80] and image generation [81]. For instance, transformers [82][83][84] pretrained on large-scale data [85,86], which were obtained from quantum chemistry calculations, have set new records in various molecular property prediction tasks [87,88]. Drug-likeness prediction would also benefit from molecular foundation models [89,90] for their general physicochemical and molecular structural knowledge learned from pretraining. Yet they have not been well utilized for this purpose.
Nevertheless, the scope of drugs is constantly changing with approvals and withdrawals by local drug administrations, rendering drug-likeness indices less effective over time. Drug discovery is inherently innovative, necessitating a robust drug-likeness prediction method that transcends the mere classification of compounds as drugs or non-drugs. It should possess the capability to identify compounds with desirable properties and high development potential in realistic scenarios that explore the vast chemical space.
Another significant challenge is the lack of interpretability, particularly in the context of ANNs. These models are often referred to as "black boxes" due to their complex nonlinearity, which makes it difficult to understand why and how they make predictions, despite their strong fitting capacity. Since direct ex post validation is impractical to perform, interpretability of drug-likeness methods becomes more important for validating them from other perspectives, explaining why a molecule is "like a drug." For example, if low prediction values of a drug-likeness method are generally attributed to adverse properties such as inferior pharmacokinetics or structural alerts indicating toxicity, we can rely on it more as an overall index that considers multiple factors. Besides, good interpretability could provide intuitive evidence to improve the confidence of medicinal chemists in the model, promoting cooperation between experimental research and drug-likeness prediction. Moreover, under the premise of good generalized performance, interpretable models could identify potential modifications that guide the design of new compounds with desirable properties.
Despite the significance of interpretability in ANN-based drug-likeness prediction, it has received little attention in recent studies. It is necessary to apply interpretation methods to these predictive models and to construct compound datasets containing toxicity and other adverse properties for validating their interpretive performance. While gradient-based feature attribution [91,92] and subgraph recognition methods [93][94][95][96] could be readily applied to GNN-based drug-likeness models to attribute importance to structural features, other models, including MLP and RNN, may suffer from using molecular descriptors or SMILES strings that do not directly correspond to the molecular structure. Thus, when using ANNs for drug-likeness prediction, it is important to carefully select the representation of input molecules. This can be achieved by incorporating medicinal chemistry knowledge into the model development process and utilizing feature engineering techniques to extract relevant atomic descriptors from the input data.
Last but not least, combining drug-likeness indices with one another and with other methods could be a meaningful direction. Drug-likeness is associated with multifarious properties that are crucial for passing clinical trials, and there is no direct regression label available for fitting. A single drug-likeness method is not capable of providing a comprehensive measurement. Combining multiple drug-likeness indices and other methods to leverage their respective advantages, such as minimizing uncertainty, might be an effective way to obtain a more reliable drug-likeness index.

Conclusion
The primary motivation behind this review was to address the evolving landscape of drug-likeness prediction, with emphasis on methods employing novel deep learning techniques. As novel ANN techniques develop, drug-likeness prediction methods have made great progress in classification accuracy, while also raising the risk of losing sight of the original purpose, which is to filter out compounds with adverse properties and poor development potential. Our aim was to highlight the challenges and potential solutions in this domain, and we focused on the need for both accuracy and the often-neglected aspects of generalizability and interpretability. As more attention is drawn to classifying drugs and non-drug-like molecules, the generalized performance of these classifiers remains unclear. There are several ways to tackle this, such as incorporating expert knowledge, employing large-scale pretraining, and improving the non-drug-like set (or even becoming independent of it). Moreover, along with their strong capacity, ANNs have made drug-likeness prediction more opaque and thus less credible, which is an even more neglected problem that hinders the further practical use of drug-likeness and requires more effort in exploring interpretation methods and built-in explainers. To sum up, in addition to accuracy, generalizability and interpretability are worthy of effort, yet are neglected in drug-likeness research. A practicable way to develop reliable drug-likeness indices is to employ models that are more generalizable and interpretable, and to combine them to obtain more robust and comprehensive performance.

Author contributions: [...] literature search, and drafted the manuscript. Y.W. contributed intellectual input to the study and edited the manuscript. Y.N., L.Z., and Z.L. supervised the study and edited the manuscript. Competing interests: The authors declare that they have no competing interests.

Fig. 3. A sketch map for drug-likeness prediction methods based on novel ANN architectures mentioned in the "Drug-likeness prediction using novel deep learning models" section (GNN was adapted from [61], CC-BY 4.0).

Table 1. Comparison of recent drug-likeness studies based on drug-likeness

Table 2. Commonly used databases for drug-likeness research