Comparative analysis of molecular fingerprints in prediction of drug combination effects

Abstract Application of machine and deep learning methods in drug discovery and cancer research has gained a considerable amount of attention in the past years. As the field grows, it becomes crucial to systematically evaluate the performance of novel computational solutions in relation to established techniques. To this end, we compare rule-based and data-driven molecular representations in prediction of drug combination sensitivity and drug synergy scores using standardized results of 14 high-throughput screening studies, comprising 64 200 unique combinations of 4153 molecules tested in 112 cancer cell lines. We evaluate the clustering performance of molecular representations and quantify their similarity by adapting the Centered Kernel Alignment metric. Our work demonstrates that to identify an optimal molecular representation type, it is necessary to supplement quantitative benchmark results with qualitative considerations, such as model interpretability and robustness, which may vary between and throughout preclinical drug development projects.


Introduction
In the past years, deep learning (DL) methods have been successfully applied to a variety of research topics in biomedicine and drug discovery [1][2][3]. Deep neural networks achieve state-of-the-art performance in medical computer vision tasks and protein structural modeling, enabling de novo generation of drug candidates and development of prognostic clinical models [4][5][6][7][8]. However, such performance of DL models is context-dependent [9][10][11][12]. While quantitative metrics are routinely and effectively used to compare various computational methods, overreliance on them is a well-known issue [13][14][15][16][17][18]. It is beneficial to supplement performance results on benchmark datasets with estimates of model uncertainty and robustness, as well as the ability to generalize on unseen data [19][20][21]. These aspects are particularly important in biomedical research, where in silico model predictions direct experimental design choices, as exhaustively testing all combinations of relevant factors is usually unfeasible due to the combinatorial explosion [22,23].
Advances in high-throughput screening of bioactive compounds in cancer cell lines promote the development of personalized cancer treatments [24]. A major goal in such drug sensitivity and resistance testing studies is to prioritize promising combinatorial therapies that involve coadministration of multiple drugs [25]. By combining synergistic compounds, often with distinct mechanisms of action, it is possible to overcome single-drug resistance, produce sustained clinical remissions and diminish adverse reactions [26,27]. Drug synergy refers to a degree of drug-drug interaction quantified as the difference between expected and observed dose-response profiles measured by a biological endpoint, such as cell viability or cell toxicity [28]. While synergy characterizes how compounds modulate each other's biological activity, the combination sensitivity score (CSS) quantifies drug combination efficacy [29]. In addition to the CSS, we use four synergy scores based on distinct null models, namely Bliss independence, highest single agent (HSA), Loewe additivity and zero interaction potency (ZIP), in the regression analysis of molecular fingerprints [30][31][32][33][34]. Predicting drug combination synergy and sensitivity is related to quantitative structure-activity relationship (QSAR) modeling and virtual screening [35,36]. QSAR captures mathematical associations between drug descriptors and assay endpoints based on the assumption that structurally similar compounds have similar bioactivity properties, while in virtual screening studies candidate molecules are prioritized for subsequent experimental validation according to in silico prediction results [37]. Rule-based molecular fingerprints are commonly used as drug descriptors in QSAR and virtual screening, and MACCS structural keys based on molecular topology are arguably the most popular type of rule-based fingerprints [38][39][40][41].
Other types include circular topological fingerprints that describe combinations of non-hydrogen atom types and paths between them within a predefined atom neighborhood, and pharmacophore fingerprints that incorporate local features related to molecular recognition [42][43][44].
More recently, data-driven fingerprints generated by DL models have been shown to perform well in various research projects [45]. The majority of such DL fingerprints are based on the encoder-decoder architecture, whereby an approximate identity function is learned to translate high-dimensional input into a low-dimensional, fixed-size latent manifold, which is then used to reconstruct the original input [46]. When an encoder-decoder DL model is trained on chemical structures, its latent manifold is interpreted as a data-driven fingerprint. Examples of early DL fingerprinting models include a convolutional neural network (CNN), Chemception, and a recurrent neural network, SMILES2Vec, as well as a variational autoencoder (VAE) model with a CNN encoder and a gated recurrent unit-based decoder [47][48][49][50][51]. Development of attention methods for sequence modeling further contributed to the popularity of data-driven DL fingerprints, whereas evolution of generative models enabled de novo molecular design through latent space sampling [52][53][54][55][56]. These DL solutions operate on images of molecules or SMARTS/SMILES sequences to create drug structural representations [57][58][59]. Further, DL fingerprints may be enriched with numerical drug descriptors through multitask DL learning methods or simply by concatenation to the latent space [60]. Unlike sequence-based versions, geometric DL fingerprints are derived from molecular graphs, and in addition to global molecular descriptors enable position-aware encoding of individual atom and bond features [61][62][63][64][65][66][67].

Figure 1. Study workflow. Compounds found in combinations in the DrugComb database are represented using four rule-based (blue) and seven data-driven (yellow) fingerprint types. Rule-based fingerprints include topological and 2D/3D extended connectivity variants. Data-driven fingerprints are generated using two VAE and two transformer models trained on ChEMBL 26, GAE trained on DrugComb compounds, and a pre-trained Deep Graph Infomax model (Infomax). The fingerprints are compared in three tasks: prediction of drug combination sensitivity and four synergy scores (VS I); representation similarity based on CKA (VS II); one-versus-all fingerprint clustering based on ATC drug classes (VS III). VS I results are also used to identify the most predictive synergy model.
There exist several extensive benchmark datasets for ranking DL models in chemoinformatics tasks, such as MoleculeNet, Open Graph Benchmark and Benchmarking GNNs [68][69][70]. Despite the widespread use of molecular fingerprints, there is a lack of systematic evaluation of data-driven DL and rule-based versions. To address the gap, we study 11 types of molecular representations, comprising seven DL and four rule-based variants, in prediction of cancer drug combination synergy and sensitivity, based on 17 271 848 data points from 14 cancer drug screening studies (Figure 1, experiment VS I). By comparing four synergy scores based on distinct null models, we identify a preferred synergy formulation for use in cancer drug combinations research [71,72]. We measure the fingerprint similarity by adapting centered kernel alignment (CKA) as a distance metric (Figure 1, experiment VS II). Lastly, we explore the downstream performance of molecular representations by clustering compounds assigned to 10 anatomical therapeutic chemical (ATC) classes in one-versus-all mode (Figure 1, experiment VS III). We believe that our work will contribute to the rational design of drug combinations, enable easier selection of molecular representations for in silico modeling, and promote further use of DL methods in biomedicine.

Data provenance
The DrugComb data portal, one of the largest public drug combination databases, is used to access combination sensitivity and synergy data [73]. Its October 2019 release contains standardized and harmonized results of 14 drug sensitivity and resistance studies on 4153 drug-like compounds screened in 112 cell lines for a total of 447 993 drug-drug-cell line tuples. Each pairwise drug combination is characterized by the CSS and four synergy scores, namely Bliss independence (Bliss), HSA, Loewe additivity (Loewe) and ZIP; further details are in the Supplementary Information. ChEMBL (release 26) is used to obtain SMILES strings, which are subsequently standardized by stripping salt residues and solvent molecules [74,75]. SMILES shorter than 8 and longer than 140 characters are filtered out. PubChem identifiers (CID) are used to cross-reference compounds between the databases when necessary. The final DL training dataset consists of 1 795 483 unique SMILES with a median length of 48 and a median absolute deviation of 10.
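The preprocessing described above can be sketched in a few lines. This is a hypothetical helper, not the authors' pipeline (which strips salts and solvents chemically): here the largest dot-separated SMILES fragment stands in for salt/solvent removal, followed by the 8-140 character length filter.

```python
def standardize_smiles(smiles):
    # Illustrative sketch: dots separate disconnected components such as
    # salt counter-ions and solvent molecules; keep the largest fragment.
    fragment = max(smiles.split("."), key=len)
    # Filter out SMILES shorter than 8 or longer than 140 characters.
    if not (8 <= len(fragment) <= 140):
        return None
    return fragment

# Aspirin sodium salt: the counter-ion is dropped, the drug is kept.
assert standardize_smiles("CC(=O)Oc1ccccc1C(=O)O.[Na+]") == "CC(=O)Oc1ccccc1C(=O)O"
# Ethanol alone is shorter than 8 characters and is filtered out.
assert standardize_smiles("CCO") is None
```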

Molecular representations
Fingerprints are numeric arrays of n elements (bits), where n ranges between 16 and 1024 depending on the fingerprint type. Even though n values up to 16 384 have been tested in the literature, demonstrating a positive correlation between fingerprint size and downstream prediction performance, not all studies support these findings [38,76]. Fingerprints used in the current work are classified into rule-based with binary values and DL-based with continuous values. Rule-based models are further split into topological, 2D and 3D circular subtypes. DL fingerprints are split into sequence and graph subtypes. A more detailed classification is found in Table 1.

Rule-based fingerprints
Four types of rule-based fingerprints are used in the current work: path-based (Topological, 1024 bits long), 2D circular (Morgan, 300 and 1024 bits long) and 3D circular (E3FP, 1024 bits long) [38,43]. E3FP is a 3D extension of 2D extended-connectivity models; it is generated following the no_Stereo variant [77].

Deep learning-based fingerprints
Seven data-driven molecular fingerprints of different lengths are generated using four types of unsupervised encoder-decoder DL models, namely a graph autoencoder (GAE), a VAE, a Transformer and a pre-trained Deep Graph Infomax (Infomax).

GAE fingerprints
Sixteen bits long GAE fingerprints are defined via the diagonal matrix of singular values Σ, obtained through the singular value decomposition of the GAE embedding matrix [61,62]. Inspired by the Ky Fan k-norm, equal to the sum of the k largest singular values of a matrix, the main diagonal of Σ is used as a 16 bits long fingerprint [78]. If a small molecule results in a diagonal shorter than 16 values, zero-padding is applied. Sixty-four bits long GAE fingerprints are generated by concatenating average-, min- and max-pooled representations of the embedding matrix to the 16 bits long GAE fingerprints.
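A minimal sketch of this construction, assuming a node embedding matrix is already available (the embedding below is random and purely illustrative): take the singular values of the matrix, truncate to 16 and zero-pad when the molecule yields fewer.

```python
import numpy as np

def gae_fingerprint(embedding, size=16):
    # Singular values are returned in descending order, echoing the
    # Ky Fan k-norm (sum of the k largest singular values).
    s = np.linalg.svd(embedding, compute_uv=False)
    fp = np.zeros(size)
    fp[:min(size, s.shape[0])] = s[:size]  # zero-pad small molecules
    return fp

# Hypothetical 9-atom molecule with 32-dimensional node embeddings.
emb = np.random.default_rng(0).normal(size=(9, 32))
fp = gae_fingerprint(emb)
assert fp.shape == (16,) and np.all(fp[9:] == 0)  # padded beyond 9 values
```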

VAE fingerprints
VAE fingerprints are 16 and 256 bits long latent spaces of two independently trained VAE models [49].

Transformer fingerprints
Sixty-four bits long Transformer fingerprints are constructed by concatenating average- and max-pooled latent embeddings of the 16 bits model with the first output of its last and second-to-last recurrent layers. Similarly, the 1024 bits variant is generated from the embedding space of the Transformer 256 bits model [52].

Infomax fingerprints
Infomax fingerprints are 300 bits long, generated using a pre-trained Deep Graph Infomax model that by design maximizes mutual information between local and global molecular graph features [79,80].

Deep learning models used for fingerprint generation
Graph autoencoder model
The GAE model uses a graph G with nodes V and edges E as input, where V correspond to atoms and E to atomic bonds. Additional numeric features may be incorporated via node or bond feature matrices [81]. A graph G is represented with an adjacency matrix A ∈ R^(|V|×|V|), indexed by nodes, such that nonzero elements of A correspond to existing molecular bonds [82]. A is normalized to be symmetric and contain self-loops following [83]:

A_norm = D^(-1/2) (A + I) D^(-1/2),

where I is an identity matrix equal in size to A and D is a diagonal node degree matrix whose main diagonal contains the bond counts of A_self-loop = A + I. The GAE model is initialized with a node matrix of 54 atom features, where each atom is represented by an array of one-hot encoded values denoting one of 37 atom types, six possible atom degrees, five atom charges, four variants of chiral configuration and an aromaticity indicator, all generated using RDKit. One-hot refers to encoding categorical variables as binary arrays. To make the GAE model compatible with previously unseen atoms, a placeholder for an unknown atom type is added. The GAE encoder consists of seven convolutional layers with sum pooling followed by ReLU activation [84]. The decoder is a dot product of the embedding matrix with itself, followed by 0.1 dropout and sigmoid activation. Cross-entropy over A_norm is used as the loss function. Empty nodes in A_norm are initialized with zeros.
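The symmetric normalization with self-loops described here (standard in GCN-style encoders) can be sketched directly, assuming a dense adjacency matrix for illustration:

```python
import numpy as np

def normalize_adjacency(A):
    # Add self-loops, then apply symmetric degree normalization:
    # A_norm = D^(-1/2) (A + I) D^(-1/2).
    A_loop = A + np.eye(A.shape[0])
    deg = A_loop.sum(axis=1)            # node degrees incl. self-loops
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return d_inv_sqrt[:, None] * A_loop * d_inv_sqrt[None, :]

# Three-atom chain (e.g. C-C-O): bonds 0-1 and 1-2.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
A_norm = normalize_adjacency(A)
assert np.allclose(A_norm, A_norm.T)  # symmetric by construction
```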

Variational autoencoder model
Two VAE models are trained with embedding sizes of 16 or 256 bits. Both models have a 54-character vocabulary, consisting of 53 unique alphanumeric characters found in SMILES and an additional empty character for zero-padding. The input length is 140 characters, zero-padded if necessary. The VAE encoder consists of three 1D convolutional layers of 9, 9 and 10 neurons, each followed by SELU activation [85]. The decoder consists of three GRU layers with a hidden dimension of 501, followed by softmax activation. The loss function is an equally weighted combination of binary cross-entropy and Kullback-Leibler divergence. Xavier uniform initialization is used to assign the starting weights of the two VAE models [86].

Transformer model
Two transformers are trained with embedding sizes of 16 or 256 bits. The vocabulary size for both models is 58 characters, including 53 unique SMILES characters and five tokens for end-of-string, mask, zero-pad, unknown-character and initialize-decoding. Maximum input length is 141 characters, zero-padded if necessary. Both models have four-headed attention and six transformer layers, with a dropout of 0.3 applied to the positional encoding output [54]. The loss function is negative log likelihood. Network weights are initialized with Xavier uniform.

DL model training
The GAE model is trained on 4153 DrugComb drug-like compounds, while VAE and Transformer models are trained on 1 795 483 molecules from the DrugComb and ChEMBL 26 databases. Fivefold cross-validation is used for training all the DL models. Transformer and VAE models are trained for 10 epochs on each fold; GAE is trained for 40 epochs on each fold. All models use the Adam optimizer with a learning rate decay and an initial learning rate of 1e-03; training is halted once the learning rate reaches 1e-06 or the loss reaches zero [70,[88][89][90]. GAE hyperparameters are optimized using tree-structured Parzen estimators with a budget of 1000 iterations; other DL models employ random search [91]. Further training details can be found in the Supplementary Information.

Data input
One-hot encoded cell line labels and each of 11 drug fingerprints are used as inputs to regression models to predict drug combination sensitivity and synergy. Combination fingerprints are generated by concatenating single molecular representations; topological fingerprints are bit-averaged [92]. The full dataset contains 362 635 cell line-drug combination tuples of 3421 compounds when filtered by the SMILES strings (SMILES-filtered), and 447 993 combination tuples of 4153 molecules when filtered by the CID (CID-filtered). For each cell line-drug combination tuple, four synergy scores and the CSS sensitivity score are obtained from DrugComb. If found, biological replicates are averaged; further, dose-dependent synergy scores are averaged within each cell line-drug combination tuple.
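The input construction above can be sketched as follows. The helper is hypothetical (not the authors' code): the one-hot cell line label is joined with the two drugs' fingerprints, which are concatenated for continuous DL variants and bit-averaged for binary topological ones.

```python
import numpy as np

def combination_input(cell_onehot, fp_a, fp_b, bit_average=False):
    if bit_average:
        # Binary topological fingerprints: average the two bit vectors.
        drug_part = (fp_a + fp_b) / 2.0
    else:
        # Continuous DL fingerprints: concatenate the two drugs.
        drug_part = np.concatenate([fp_a, fp_b])
    return np.concatenate([cell_onehot, drug_part])

cell = np.eye(112)[3]                        # one of 112 cell lines
fp1, fp2 = np.ones(1024), np.zeros(1024)     # toy 1024-bit fingerprints
x = combination_input(cell, fp1, fp2, bit_average=True)
assert x.shape == (112 + 1024,)
```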

Cross-validation (VS I)
Model selection for the regression analysis of molecular fingerprints is split into three steps. In the first step, 13 different regression models are tested thrice in 5-fold cross-validation on 10% of the full dataset, sampled without replacement (Supplementary Information). The goal is to identify an optimal type of regression model for prediction of the four synergy scores and the CSS. The second step concerns hyperparameter tuning of the previously selected regression model on all available data in 10-fold cross-validation. Lastly, the model is trained in 10-fold repeated cross-validation on SMILES-filtered and CID-filtered datasets with 90:10 and 60:40 train:test splits [93].

Regression performance metrics
Pearson correlation coefficient (PCC) and root-mean-squared error (RMSE) are used to assess the regression performance. PCC and RMSE 95% confidence intervals are calculated via a Student's t-distribution estimate of Fisher's z-transformed PCC values, and via empirical bootstrap with 1000 iterations and symmetric confidence intervals [94][95][96][97][98]. RMSE values are normalized by standard deviations. The Shapiro-Wilk test is used to test the normality assumption [99].
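A Fisher z-transform confidence interval for a PCC can be sketched with the standard library. Note this sketch uses a normal approximation for the critical value rather than the Student's t-distribution estimate used in the paper, which is a good approximation at the sample sizes involved here.

```python
import math
from statistics import NormalDist

def pcc_confidence_interval(r, n, level=0.95):
    z = math.atanh(r)                    # Fisher's z-transform of r
    se = 1.0 / math.sqrt(n - 3)          # standard error of z
    crit = NormalDist().inv_cdf(0.5 + level / 2)
    lo, hi = z - crit * se, z + crit * se
    # Back-transform the interval endpoints to the correlation scale.
    return math.tanh(lo), math.tanh(hi)

lo, hi = pcc_confidence_interval(0.68, n=10000)
assert lo < 0.68 < hi        # interval brackets the point estimate
```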

Related work
PCC scores of regression models, used to predict single synergy scores in three recent studies, are in Supplementary Information section.

Fingerprint similarity metric
All 11 types of molecular representations vary in length and data type, making commonly used metrics, such as Jaccard-Tanimoto or cosine distances, poor choices for fingerprint comparison [100][101][102]. Jaccard-Tanimoto is suboptimal, as it is based on bits present in one fingerprint, absent in another and shared by both [103]. Cosine distance between two vectors, defined as their inner product normalized by the corresponding L2 norms, only measures an angle between two vectors without accounting for differences in their ranges [104]. It may be possible to postprocess DL fingerprints and define common distance metrics on both the binary and real-valued arrays [105]. However, we opted against it, as we are not aware of any studies that systematically assess DL fingerprint similarity or quantify downstream effects of such transformations. Recall that an inner product is an unnormalized measure of similarity, allowing metrics based on the canonical correlation analysis (CCA), singular vector CCA and projection-weighted CCA to be defined on any real-valued arrays [106][107][108]. All these methods underperform when the number of compounds is smaller than the dimensionality of the feature space, i.e., n bits [109]. It is not intuitive to use the unnormalized inner product as a similarity measure, as it is unbounded and requires the original data to be referenced alongside the similarity scores. Since calculation of pairwise compound distances is not a prerequisite to quantify their similarity, we compare complete fingerprint matrices using CKA, a modification of the Hilbert-Schmidt independence criterion (HSIC) originally proposed to assess nonlinear dependence of two sets of variables [110].

Fingerprint matrix
Let m compounds be represented with two fingerprint matrices X and Y, where individual fingerprints x_i and y_i may be of different data types and different lengths n_x and n_y. X and Y are centered by subtracting column means from the corresponding column values.

Linear kernel k
Let K be a kernel matrix, such that its entries correspond to scalar outputs of a linear kernel function k. Let k be an inner product, k = x_i^T y_i, where x_i and y_i are 1D vectors from two fingerprint matrices X and Y corresponding to the same compound or feature. When x_i and y_i are column vectors, K = X^T Y becomes a feature similarity matrix; if x_i and y_i are row vectors, K = X Y^T is a sample similarity matrix.

Hilbert-Schmidt independence criterion
HSIC is a test statistic equal to 0 when X and Y are independent [110]. Unnormalized HSIC is without an upper bound and, for a linear kernel, equal to:

HSIC(X, Y) = ||Y^T X||_F^2,

where Y^T X is a matrix of feature inner products and ||·||_F^2 is a squared Frobenius norm. Notice that:

||Y^T X||_F^2 = tr(X X^T Y Y^T),

where X X^T and Y Y^T are left Gram matrices. Further, for centered X and Y under the linear dot product kernel, HSIC is proportional to ||cov(X, Y)||_F^2, where cov(X, Y) is the cross-covariance matrix of X and Y [109].

CKA (VS II)
HSIC is an empirical statistic that converges to its unbiased value at a rate of 1/√m, where m is the number of samples [111]. Unbiased HSIC values are used to define CKA, a normalized version of HSIC that ranges from 0 to 1. CKA is used to quantify the difference between two fingerprint matrices X and Y. When CKA is calculated via the feature similarity, it is defined as:

CKA(X, Y) = ||Y^T X||_F^2 / (||X^T X||_F ||Y^T Y||_F).

CKA is a non-linear extension of the CCA and does not require any assumptions about noise distributions in the datasets [112]. CKA with a linear kernel is equivalent to the RV coefficient and Tucker's congruence coefficient [109,[113][114][115]. If the number of samples is higher than the number of features, CKA should be calculated using feature similarities; conversely, when features outnumber samples, computing CKA in sample space using Gram matrices is preferred.
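The feature-space formula can be sketched directly in a few lines. Note this sketch uses the plain (biased) linear CKA estimator for clarity, whereas the paper's analysis is based on unbiased HSIC values.

```python
import numpy as np

def linear_cka(X, Y):
    # Center columns, as required for the HSIC/cross-covariance identity.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    # CKA = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F), bounded in [0, 1].
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return hsic / (norm_x * norm_y)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 16))               # 200 compounds, 16-bit fingerprints
assert abs(linear_cka(X, X) - 1.0) < 1e-9        # identical representations
assert abs(linear_cka(X, 2 * X + 3) - 1.0) < 1e-9  # invariant to scale and shift
```

The invariance to isotropic scaling and translation is precisely what makes CKA suitable for comparing fingerprints with different value ranges, unlike the raw inner product.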

Fingerprint clustering (VS III)
The ATC classification system is used to annotate drugs according to the biological systems on which they act, as well as their therapeutic, pharmacological and chemical properties [116]. The 2228 DrugComb compounds found in the ATC database are assigned to 10 classes. All but the GAE 16 bits and Morgan 1024 bits models are then used to generate nine fingerprint matrices. The generated fingerprint matrices are preprocessed in three ways: z-score normalization; z-score normalization followed by dimensionality reduction with PCA; and z-score normalization followed by dimensionality reduction with PLS. For the PCA preprocessing, the number of loadings explaining >0.95 of the variance is used; PLS regression for dimensionality reduction is performed with ATC labels as targets. Linear discriminant analysis (LDA) is used for one-versus-all clustering with ATC class labels as response variables; averaged Silhouette score and variance ratio criterion (VRC) are the clustering performance metrics [117,118].
The Silhouette score for a single point i is defined as:

s(i) = (b - a) / max(a, b),

where a is the mean distance between point i and all other points within its cluster C_i, and b is the smallest mean distance between point i and all points in a cluster C ≠ C_i.
VRC is the ratio of between- to within-cluster variation, adjusted by the number of clusters. VRC is closely related to the F-statistic in ANOVA [119]. Both scores are min-max scaled to the range [0, 1].
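The per-point Silhouette definition above can be sketched as follows (an illustrative implementation, not the one used in the paper's evaluation):

```python
import numpy as np

def silhouette_point(i, points, labels):
    # Euclidean distances from point i to every other point.
    d = np.linalg.norm(points - points[i], axis=1)
    same = labels == labels[i]
    # a: mean distance to the other members of point i's own cluster.
    a = d[same & (np.arange(len(points)) != i)].mean()
    # b: smallest mean distance to the points of any other cluster.
    b = min(d[labels == c].mean() for c in set(labels) - {labels[i]})
    return (b - a) / max(a, b)

# Two tight, well-separated toy clusters.
pts = np.array([[0., 0.], [0., 1.], [10., 0.], [10., 1.]])
labels = np.array([0, 0, 1, 1])
s = silhouette_point(0, pts, labels)
assert s > 0.8   # well-separated clusters score close to 1
```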

Computational facilities
All models are trained on a Tesla P100 PCIe 16GB GPU. VM deployment is automated with Docker 19.03.9 and python-openstack 3.14.13.

Regression model selection
We identify Catboost Gradient Boosting on Decision Trees (GBDT) as an optimal regression model for the prediction of drug combination sensitivity and synergy after testing 13 algorithms on 10% of the DrugComb dataset in three replicates (Table 2). Three of the tested algorithms failed to generate any predictions and are omitted. With optimized hyperparameters, GBDT models tend to reach the early stopping criterion in the last 20% of the training on all the fingerprint variants, which indicates correctly tuned hyperparameters; further details are in the Supplementary Information. There exist alternative dataset splitting modes that incorporate chemical similarity via Tanimoto distance or Murcko decomposition [68,120]. While they may better mimic current drug development practices and lead to a better correlation between in silico predictions and prospective experimental validation, we do not expect them to produce categorically different results.

Regression performance (VS I)
Among 11 fingerprinting models, Infomax 300 and VAE 256 achieved the highest PCC in prediction of the Loewe synergy score using Catboost Gradient Boosting across all the test folds, cross-validation modes and duplicate filtering methods. As seen in Figure 2 and Table 3, for the 60:40 splits on the SMILES-filtered dataset Infomax reaches a PCC of 0.6842, while the VAE 256 score is 0.6813. All tested fingerprints result in CSS prediction scores above 0.85 PCC, with Infomax 300 and VAE 256 fingerprints still ranked on top. Infomax 300 and VAE 256 have overlapping 95% confidence intervals; as such, they are considered to be equally performant. E3FP is the best rule-based fingerprint and is among the top three in most experimental runs. As seen in Figure 3 and Table 4, normalized RMSE scores further corroborate that DL-based fingerprints are better than rule-based variants in regression tasks. Further regression results for 90:10 and 60:40 train:test splits using SMILES- and CID-filtered datasets are in Supplementary Figures S1-S6 and Supplementary Tables S1-S6.
Experimental results indicate that if similarity-based clustering or identification of key molecular moieties are of primary interest, interpretable rule-based fingerprints remain a competitive choice.

CKA distance (VS II)
A heatmap of pairwise CKA similarities between 11 fingerprints, as seen in Figure 4, indicates that similar types of fingerprints cluster together. Rule-based fingerprints form two clusters corresponding to topological and circular subtypes. All the DL fingerprints generated by the trained models form the third cluster. Graph-based models appear to be far removed from all sequence- and rule-based variants. GAE 64 is the most different from other trained DL fingerprints, while being co-clustered with them. Infomax 300 fingerprints, based on a pre-trained Deep Graph Infomax model, are not part of any cluster. Smaller sequence-based DL fingerprints, namely VAE 16 and Transformer 64, are at least as similar to each other as they are to their longer in-type/subtype counterparts. We conclude that fingerprint type and subtype, as indicated in Table 1, contribute the most to the CKA similarity, followed by fingerprint pretraining status, size and data format.

LDA clustering (VS III)
To further study the differences between fingerprint models, we perform one-versus-all LDA classification of 2228 compounds based on their ATC classes, using nine different fingerprinting models to represent the molecules. The GAE 16 bits fingerprints are omitted, since GAE 64 bits fingerprints extend their shorter counterparts by concatenating average, min-and max-pooled embedding spaces. Further, due to the comparable performance of Morgan 300 and 1024 bits models in VS I and VS II experiments, only Morgan 300 bits fingerprints are used in LDA clustering experiments. VS III clustering results are in Table 5 and the overview of DrugComb compounds with the corresponding ATC classes is in Figure 5. The Infomax 300 bits model achieves the best clustering results on the z-score normalized fingerprint matrices, followed by three rule-based fingerprints. Dimensionality reduction following z-score normalization generally improves clustering performance of all rule-based fingerprints. It has the opposite effect on most DL fingerprints, with the largest reduction seen in the Infomax 300 and GAE 64 models. Longer DL sequence models, namely VAE 256 and Transformer 1024, perform better after dimensionality reduction steps, albeit with a minimal improvement in relative rankings. Such differences between graph and sequence-based DL fingerprints are supported by the CKA analysis (VS II study), indicating that the graph-based fingerprints differ the most from other DL variants.

Conclusion
Choosing an optimal fingerprint type to represent molecular features is an important step in computational drug discovery.
To this end, we systematically compared 11 variants of such molecular representations in predicting drug combination sensitivity and synergy scores, and evaluated their relationships based on the clustering performance and CKA-based fingerprint similarity. We found that the 256 bits long VAE and 1024 bits long 3D circular E3FP fingerprints generated from SMILES strings, as well as the 300 bits long Infomax fingerprints based on molecular graphs, lead to the best regression performance. Out of the four tested synergy scores, we observe that Loewe synergy is the easiest to predict, with the best models reaching a PCC of 0.72. CSS, a measure of drug combination efficacy, can be predicted above 0.85 PCC with any fingerprint type. We found that the rule-based fingerprinting methods underperform in regression tasks in comparison to the data-driven DL variants. However, the gap between the best and worst performing fingerprint models rarely exceeds 0.05 PCC. Further, we adapted CKA to quantify the extent of similarity between fingerprint matrices and to demonstrate that similar types of fingerprints cluster together. An optimal similarity measure for the comparison of single rule-based and data-driven fingerprints remains an open question. Lastly, in one-versus-all compound clustering using ATC classes as labels, rule-based fingerprints perform on par or better than the best DL representations. We conclude that the quantitative performance differences between rule-based and DL-based fingerprints are likely to be insignificant in the context of preclinical studies of small molecule drugs [122,123]. In order to identify an optimal fingerprint type for a given project, we advise enriching quantitative performance metrics with qualitative concerns, e.g., available chemometric and DL expertise, model interpretability requirements, opinions of project stakeholders and model performance on unseen data.
Fingerprints generated using the E3FP 1024, Infomax 300, Morgan 1024 and VAE 256 bits models are suggested as good starting points based on our experimental results and the distinct methodologies underlying their data generating methods [124]. We recommend the Loewe synergy score for use in drug combination screening due to its best performance among the four synergy models tested on dose-response data from 14 DSRT studies.
This work focuses on the evaluation of single fingerprint types. However, it is worth exploring the impact of combining several fingerprints together. We expect a statistically significant regression performance increase when combining molecular representations with low CKA similarity, or using models trained on multimodal data and/or key biological databases, such as Gene Ontology, Protein Data Bank and UniRef [5,[125][126][127]. Another line of inquiry could address high computational costs of DL and E3FP models. To this end, we suggest exploring alternative molecular representations and CPU-friendly generative models based on genetic algorithms, such as STONED on SELFIES [128]. Finally, we hope that in the future biomedical DL research will go beyond representation learning and will be used to derive novel biological knowledge by e.g., inferring synthetic and retrosynthetic chemical reactions, identifying novel disease-associated druggable proteins and clinically actionable biomarkers [129][130][131].

Key Points
• To choose an optimal molecular fingerprint type, it is advised to enrich quantitative metrics of model performance with qualitative concerns related to the nature of downstream tasks, model interpretability and robustness requirements, as well as available chemometric expertise.
• Data-driven fingerprints, namely the 256 bits long VAE trained on SMILES and the 300 bits long Infomax trained on molecular graphs, are well-suited for regression tasks. 1024 bits long 2D and 3D circular fingerprints are flexible and well-performing rule-based models fit for clustering tasks. The GAE 64 bits long model may be used in any analysis scenario as a baseline option.
• Loewe synergy scores enable the highest correlation between in silico predictions and subsequent experimental validation of drug combination synergy in cancer cell lines.
• CKA is an effective measure of molecular representation similarity applicable to any combination of rulebased and DL fingerprints.

Supplementary data
Supplementary data are available online at https://academic.oup.com/bib.

Data and code availability
The data and code underlying this article are available in the article and in its online supplementary material.

Combination sensitivity and synergy scores
It is known that the choice of an appropriate null model of no interaction is crucial for an accurate assessment of drug synergy. Four such models are used in the current work. The Bliss model is based on probabilistic independence of drug effects, such that single agents act as competing but independent perturbations, each contributing to the total effect [30]. HSA assumes that the expected drug combination effect is the higher of the two single-agent effects at the corresponding concentrations [31]. According to Loewe, the degree of non-interaction is equal to the outcome of a sham experiment, that is, combining a drug with itself [32,33]. ZIP assumes that non-interacting drugs minimally impact each other's dose-response curves [34]. CSS quantifies drug combination efficacy, defined as the average of the areas under the drug combination dose-response curves, whereby each curve is determined by fixing one of the drugs at its IC50 concentration [29].
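As an illustration, the two simplest null models can be sketched in a few lines of Python (function names and the fractional-inhibition input scale are our own assumptions; full ZIP and Loewe scoring additionally requires dose-response curve fitting, e.g. as implemented in the SynergyFinder toolkit):

```python
def bliss_expected(e1, e2):
    """Bliss independence: expected combined effect of two independent
    perturbations, with single-agent effects as fractions in [0, 1]."""
    return e1 + e2 - e1 * e2

def hsa_expected(e1, e2):
    """HSA: expected combination effect is the higher single-agent effect."""
    return max(e1, e2)

def excess_over_null(observed, expected):
    """Synergy score as the observed effect minus the null-model expectation."""
    return observed - expected
```

Under both models, a positive excess over the null expectation indicates synergy and a negative one indicates antagonism.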

Regression model selection
Thirteen regression models, representing a wide spectrum of ML algorithms, are compared in prediction of drug combination synergy and sensitivity. All models are tested with default hyperparameters in 5-fold cross-validation using 1024-bit Morgan and 300-bit Infomax fingerprints together with one-hot encoded cell line labels, on 10% of randomly sampled data in three replicates. The thirteen tested regression models are: Bayesian Ridge, CatBoost Gradient Boosting, ElasticNet, Gaussian Process Regression with a sum of Dot Product and White kernels, Histogram-based Gradient Boosting, Isotonic Regression, Lasso regression, LassoLars regression, Linear regression, Ridge regression, Random Forest, Support Vector Machines with a linear kernel and XGBoost Gradient Boosting. All tree-based models are limited to a depth of six. Neural networks are not included in the comparison, since on tabular data they tend to perform on par with the aforementioned methods, while being less interpretable and more difficult to set up [132,133]. The top four identified regression models are bagging and boosting ensembles, followed by linear-kernel Support Vector Regression and Bayesian Ridge. Gaussian Process Regression, Isotonic Regression and LassoLars failed to generate any predictions, and their results are omitted from Table 1. The CatBoost implementation of GBDT is selected for further experiments due to its efficient GPU utilization and two design choices aimed at reducing overfitting: out-of-the-box categorical encoding, which translates classes into numeric representations by binning them based on the expected value of the target statistic, and ordered boosting, whereby training data are randomly permuted throughout the tree-growing process to limit unwanted target prediction shift, one of the well-known GBDT disadvantages [134][135][136].
The best hyperparameters for CatBoost GBDT, tuned in 10-fold cross-validation, are: Poisson bootstrap with a 0.66 subsampling ratio, L2 regularization of 9, tree depth of 10, and a learning rate of 0.15 with 5000 boosting iterations and 50 early stopping rounds. Overall, we conclude that ensembles are the most powerful type of tested ML algorithms in prediction of drug combination synergy and sensitivity [137].
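Expressed as a CatBoost parameter dictionary, the tuned configuration above would look roughly as follows (the mapping of the paper's wording to CatBoost parameter names is our assumption; consult the CatBoost API reference for exact semantics):

```python
# Hypothetical translation of the tuned configuration into CatBoost parameters.
catboost_params = {
    "bootstrap_type": "Poisson",   # Poisson bootstrap
    "subsample": 0.66,             # subsampling ratio
    "l2_leaf_reg": 9,              # L2 regularization
    "depth": 10,                   # tree depth
    "learning_rate": 0.15,
    "iterations": 5000,            # boosting iterations
    "early_stopping_rounds": 50,
    "task_type": "GPU",            # assumed, matching the GPU utilization noted above
}
```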

Neural network training
For all DL models, the hyperparameter search consisted of testing activation functions (GELU, ELU, LeakyReLU, ReLU, SELU, Sigmoid, Softmax, Swish), dropout ratios (0.1-0.5 with a step size of 0.1), initial learning rates (1e-01 to 1e-05, with a multiplicative step of 0.1), numbers of patience epochs (1-30 with a step size of 1) and learning rate decay factors (0.9-0.1 with a step size of 0.1). Transformer models were tested with 3-6 attention heads. VAE encoders were tested with up to five convolutional layers and convolutional kernel sizes up to 10; the VAE decoder was tested with up to three recurrent GRU layers of sizes up to 600.
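The size of the shared search space implied by these ranges can be enumerated with a short stdlib sketch (Transformer- and VAE-specific dimensions are excluded; variable names are our own):

```python
import itertools

activations = ["GELU", "ELU", "LeakyReLU", "ReLU",
               "SELU", "Sigmoid", "Softmax", "Swish"]
dropouts = [round(0.1 * i, 1) for i in range(1, 6)]        # 0.1 .. 0.5
learning_rates = [10 ** -e for e in range(1, 6)]           # 1e-01 .. 1e-05
patience_epochs = list(range(1, 31))                       # 1 .. 30
decay_factors = [round(0.1 * i, 1) for i in range(1, 10)]  # 0.1 .. 0.9

# Full Cartesian product of the shared hyperparameters:
# 8 * 5 * 5 * 30 * 9 = 54 000 configurations.
grid = list(itertools.product(activations, dropouts, learning_rates,
                              patience_epochs, decay_factors))
```

The size of this grid is why, in practice, such spaces are explored with samplers (e.g. Optuna) rather than exhaustively.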
Transformer. Two transformer models are trained in 5-fold cross-validation, for up to 10 epochs on each fold, with a decay factor of 0.1 and a patience of 5 epochs, using batch sizes of 650 and 340 for the 16-bit and 256-bit fingerprint variants. The final cross-entropy losses are 2e-07 and 1e-08, respectively.
VAE. Two VAE models with latent spaces of 16 and 256 neurons are trained on ChEMBL 26 using 5-fold cross-validation and 10 epochs on each fold, with a decay factor of 0.2 and patience of 2 and 3 epochs, respectively. An equally weighted sum of binary cross-entropy and KL-divergence is used as the loss metric. The final losses are 0.9231 and 0.0984 for the 16-bit and 256-bit models.
GAE. A single GAE model is trained on 3421 unique DrugComb compounds in 5-fold cross-validation with a batch size of 340, for up to 40 epochs per fold (200 epochs in total). The learning rate decay factor is 0.1 with a patience of 30 epochs. Cross-entropy over the molecular graph adjacency matrix is used as the loss, with the best score of 0.8604 reached at the end of the 200th epoch.
It is likely that longer training, more extensive hyperparameter optimization, or the use of alternative optimizers, such as SGD with cyclic learning rate scheduling, may result in lower final loss values. However, we do not expect them to have a significant influence on the downstream experiments [138]. It is interesting to note that Optuna-based optimization of the GAE model resulted in an encoder architecture consisting of seven convolutional layers 54-46-40-34-28-22-16 neurons wide. Such a high number of convolutional layers is somewhat unexpected, as the performance of Graph Neural Networks based on spectral convolutions is expected to deteriorate with a convolutional layer count above six, most likely due to excessive feature smoothing [139,140]. The relatively deep GAE encoder architecture may be explained by a positive correlation between the performance of DL models and the number of trainable parameters: the seven-layer GAE model with a dot-product-based decoder has circa 10 000 learnable parameters, whereas the large Transformer and VAE models that reach lower test losses during training have two orders of magnitude more [141].
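For illustration, the dot-product decoder and the adjacency reconstruction loss mentioned above can be sketched in pure Python (function names and the unweighted mean binary cross-entropy are our assumptions; the actual model operates on batched GPU tensors):

```python
import math

def dot_product_decoder(z):
    """Reconstruct edge probabilities from node embeddings z
    (a list of equal-length vectors) via sigmoid(z_i . z_j)."""
    n = len(z)
    probs = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(z[i], z[j]))
            probs[i][j] = 1.0 / (1.0 + math.exp(-dot))
    return probs

def adjacency_bce(adj, probs, eps=1e-9):
    """Mean binary cross-entropy between the true adjacency matrix
    and the reconstructed edge probabilities."""
    n = len(adj)
    total = 0.0
    for i in range(n):
        for j in range(n):
            p = min(max(probs[i][j], eps), 1.0 - eps)
            total += -(adj[i][j] * math.log(p) + (1 - adj[i][j]) * math.log(1 - p))
    return total / (n * n)
```

Because the decoder is parameter-free, all of the GAE's learnable weights sit in the convolutional encoder, which is why its parameter count stays so far below that of the Transformer and VAE models.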

Prior work in drug combination synergy predictions
In the three independent studies listed below, a single synergy score is predicted and the developed models are cross-validated on a single dataset.
Random forest. A Random Forest model developed by Menden et al. achieved a 0.3 PCC in prediction of Loewe synergy using drug and cell line labels on the AstraZeneca dataset including 910 combinations tested in 85 cancer cell lines [142].
Convolutional neural network. A CNN model introduced by Preuer et al. achieved a 0.73 PCC in prediction of Loewe synergy from drug fingerprints, physicochemical molecular descriptors and basal expression of 3984 cancer-associated genes on the O'Neil dataset including 583 drug combinations tested in 39 cancer cell lines [143,144].
Gradient Boosting and Random Forest. XGBoost and Random Forest models developed by Sidorov et al. achieved a 0.64 PCC for a modified version of Bliss synergy from drug fingerprints and their physicochemical characteristics on the NCI-ALMANAC dataset including 5232 combinations tested in 60 cancer cell lines [26,145].