Modeling PROTAC Degradation Activity with Machine Learning

PROTACs are a promising therapeutic modality that harnesses the cell's built-in degradation machinery to degrade specific proteins. Despite their potential, developing new PROTACs is challenging and requires significant domain expertise, time, and cost. Meanwhile, machine learning has transformed drug design and development. In this work, we present a strategy for curating open-source PROTAC data and an open-source deep learning tool for predicting the degradation activity of novel PROTAC molecules. The curated dataset incorporates important information such as $pDC_{50}$, $D_{max}$, E3 ligase type, POI amino acid sequence, and experimental cell type. Our model architecture leverages learned embeddings from pretrained machine learning models, in particular for encoding protein sequences and cell type information. We assessed the quality of the curated data and the generalization ability of our model architecture against new PROTACs and targets via three tailored studies, which we recommend that other researchers use when evaluating their degradation activity models. In each study, three models predict protein degradation in a majority vote setting, reaching a top test accuracy of 80.8% and 0.865 ROC AUC, and a test accuracy of 62.3% and 0.604 ROC AUC when generalizing to novel protein targets. Our results are not only comparable to state-of-the-art models for protein degradation prediction, but are also part of an open-source implementation that is easily reproducible and less computationally complex than existing approaches.


Introduction
Machine learning (ML) has transformed various scientific domains, including drug design and discovery, by offering novel solutions to complex, multi-objective optimization challenges (Atance et al., 2022; Fromer and Coley, 2023; Gao et al., 2022; Winter et al., 2019). In the context of medicinal chemistry, ML techniques have revolutionized the process of identifying and optimizing potential drug candidates. Traditionally, drug discovery has relied heavily on trial-and-error experimentation, which is not only time-consuming but also expensive. ML techniques have the potential to significantly accelerate and improve this process by predicting properties of molecules in silico, such as binding affinity, solubility, and toxicity, with remarkable accuracy (Born et al., 2023; Gorantla et al., 2024; Vassileiou et al., 2023). This in turn saves time and money in early-stage drug discovery by focusing resources on the most promising candidates. At the same time, AI models' high performance can potentially lead to better designed drugs for patients.
In order to develop ML models for chemistry, ML algorithms leverage vast datasets containing molecular structures, biological activities, and chemical properties to learn intricate patterns and relationships, also called quantitative structure-activity relationships (QSAR). These algorithms can discern subtle correlations and structure in molecular data that are difficult for human experts to identify. Consequently, ML-based approaches aid in predicting which molecules are likely to be effective drug candidates, thereby narrowing down the search space and saving resources (Blaschke et al., 2020; Wu et al., 2021).
PROTACs, or PROteolysis TArgeting Chimeras, represent an innovative class of therapeutic agents with immense potential in challenging disease areas (Hu and Crews, 2022; Liu et al., 2020; Tomoshige and Ishikawa, 2021). Unlike traditional small molecule inhibitors, PROTACs operate by harnessing the cell's natural protein degradation machinery, the proteasome, to eliminate a protein of interest (POI), as summarized in Figure 1a. This catalytic mechanism of action for targeted protein degradation (TPD) offers several advantages over conventional approaches, which frequently work by having a small molecule drug bind tightly to and thus block a protein's active site. In fact, by leveraging their unique mechanism, PROTACs bypass the need for tight binding to specific protein pockets, offering a novel strategy for targeting previously "undruggable" proteins. This approach is particularly relevant in cases where inhibiting the target's activity might not be sufficient; notable examples include certain neurodegenerative diseases like Alzheimer's, where misfolded proteins agglomerate and lead to negative downstream effects in patients (Békés et al., 2022).
By catalytically degrading POIs, PROTACs have the potential to offer more comprehensive therapeutic effects at lower doses relative to traditional inhibitors. Their capacity for TPD highlights the necessity of thorough efficacy evaluations, typically conducted through dose-response assessments (Figure 1b) to determine critical parameters such as $DC_{50}$ (the molar concentration of PROTAC at half maximum degradation of the POI; the lower the better) and $D_{max}$ (the highest percentage of degraded POI; the higher the better) (Gesztelyi et al., 2012). However, PROTAC development and evaluation face significant challenges due to the limited availability of open-source tools and resources specifically designed for this molecule class, a gap predominantly filled by tools aimed at small molecule inhibitors (Mostofian et al., 2023; Nori et al., 2022).
To address these challenges, our work introduces a comprehensive machine learning toolkit and curated data specifically designed for PROTAC research. We have developed predictive models that leverage the curated data to effectively forecast the degradation activity of PROTACs, achieving high predictive accuracy and ROC-AUC scores on the test set (top 82.6% and 0.848, respectively). Our system, fully open-source and easily accessible via a Python package, is designed to streamline the predictive modeling of PROTAC degradation activity, thus facilitating the rapid evaluation and optimization of new PROTAC designs. Our contribution significantly expands the available public resources for PROTAC development, setting a new baseline in the application of ML techniques to this emerging therapeutic area.

Data Curation
For this work, we collected and curated data from PROTAC-DB (Weng et al., 2021) and PROTAC-Pedia (London and Prilusky), which represent, to our knowledge, the two largest open datasets for PROTAC data. PROTAC-DB contains experimental data, scraped from the scientific literature, for 5,388 PROTACs (as of May 2024; version 2.0). While PROTAC-DB allows users to query, filter, and analyze PROTAC data via its online platform (e.g., comparing different compounds based on their $DC_{50}$ and $D_{max}$), its data is not specifically structured for ML models, but rather for online access through its web page. Wrangling the data for use in data-driven models requires significant cleaning and curation. On the other hand, PROTAC-Pedia provides 1,190 crowd-sourced entries (as of May 2024), with details on PROTACs and their degradation activity.
To prepare the data for our models, we extracted and standardized the following features from the PROTAC-DB and PROTAC-Pedia datasets, where a specific combination of the features corresponds to one experiment: the PROTAC compound, cell line identifier, E3 ligase, POI, and degradation metrics ($DC_{50}$ and $D_{max}$).
Each dataset entry includes the SMILES representation of the PROTAC, which was canonicalized using RDKit (Landrum, 2010). In PROTAC-DB, cell line information was predominantly found in textual assay descriptions, such as "degradation in LNCaP cells after 6 h at 0.1/1000/10000 nM", with "LNCaP" being the cell type in this statement. Cell type information was extracted using regex parsing, with a few entries cleaned manually. Afterward, cell line names were standardized using Cellosaurus (Bairoch, 2018) to remove synonyms. For E3 ligases and POIs lacking a Uniprot ID (EMBL-EBI, 2023), the ID was looked up manually via web search and added as text to each entry.
For PROTAC-DB, some of the $DC_{50}$ and $D_{max}$ values were obtained by splitting entries containing information for the same PROTAC on multiple assays. Duplicates consisting of the same SMILES, POI, E3 ligase, and cell line, but different $DC_{50}$ or $D_{max}$, were handled by merging them into one entry; we assigned as $DC_{50}$ and $D_{max}$ the geometric mean of their reported $DC_{50}$ and $D_{max}$ values, respectively. A data sample is labeled as active when both its $pDC_{50}$ (i.e., the $DC_{50}$ value expressed in negative $\log_{10}$ units) and $D_{max}$ are above their respective predefined threshold values; here we used 6 and 60%, respectively. Effectively, each data point is assigned a binary label indicating degradation activity.
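The duplicate-merging step can be sketched as follows. This is a minimal illustration, not the curation code itself: the record keys (`smiles`, `poi`, `e3`, `cell_line`, `dc50_nM`, `dmax_pct`) and the nanomolar unit are hypothetical, and the sketch assumes all reported values are strictly positive, as required by a geometric mean.

```python
from collections import defaultdict
from statistics import geometric_mean


def merge_duplicates(records):
    """Merge records sharing (SMILES, POI, E3 ligase, cell line).

    Each record is a dict with hypothetical keys; DC50 and Dmax of a merged
    entry are the geometric means of the reported (strictly positive) values.
    """
    groups = defaultdict(list)
    for rec in records:
        key = (rec["smiles"], rec["poi"], rec["e3"], rec["cell_line"])
        groups[key].append(rec)

    merged = []
    for (smiles, poi, e3, cell_line), recs in groups.items():
        # Ignore missing measurements when averaging.
        dc50 = [r["dc50_nM"] for r in recs if r.get("dc50_nM") is not None]
        dmax = [r["dmax_pct"] for r in recs if r.get("dmax_pct") is not None]
        merged.append({
            "smiles": smiles,
            "poi": poi,
            "e3": e3,
            "cell_line": cell_line,
            "dc50_nM": geometric_mean(dc50) if dc50 else None,
            "dmax_pct": geometric_mean(dmax) if dmax else None,
        })
    return merged


records = [
    {"smiles": "C1", "poi": "P1", "e3": "CRBN", "cell_line": "LNCaP",
     "dc50_nM": 10.0, "dmax_pct": 50.0},
    {"smiles": "C1", "poi": "P1", "e3": "CRBN", "cell_line": "LNCaP",
     "dc50_nM": 1000.0, "dmax_pct": 98.0},
    {"smiles": "C1", "poi": "P1", "e3": "VHL", "cell_line": "LNCaP",
     "dc50_nM": 5.0, "dmax_pct": None},
]
merged = merge_duplicates(records)
```

A geometric mean is a natural choice for concentrations that span several orders of magnitude, since it corresponds to an arithmetic mean on the logarithmic scale.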

Data Representation
Given the available data consisting of PROTACs, E3 ligases, POIs, and cell lines, our goal is to encode the diverse information into efficient numerical embeddings that an ML model can leverage. Because our pool of curated data has a limited size ($\sim 10^3$ data samples), we decided to focus on learning individual embeddings for each of the following: the PROTAC, E3 ligase, POI, and cell type for each experiment.
Figure 3: Model pipeline and architecture. The normalization, softmax, and sigmoid functions are each denoted by their own symbol in the diagram. The pretrained bio-embedding model can be found in Dallago et al. (2021), while the pretrained sentence Transformer is from Reimers and Gurevych (2019).
For PROTACs, their SMILES strings are converted, via RDKit (Landrum, 2010), to Morgan fingerprints of 256 bits with a radius of 10 and stereochemistry information included, with 256 being the smallest $2^n$ vector length not resulting in the overlap of any two fingerprints. We experimented with several bit lengths and radii on the available data while counting the number of overlapping fingerprints, i.e., different PROTACs having identical fingerprints. Ultimately, we selected the combination with the smallest bit length and radius not resulting in any overlapping fingerprints. The two proteins corresponding to the E3 ligase and POI are converted into pre-computed Uniprot embeddings of 1024 elements (Dallago et al., 2021; EMBL-EBI, 2023). Appendix E includes an evaluation using amino acid sequence counts as protein embeddings, for comparison. Cell lines were one-hot encoded, although other embeddings were also explored and are described in Appendix B. Finally, a pretrained sentence Transformer model (Reimers and Gurevych, 2019) was used to encode the text descriptions into numerical embedding vectors of 768 elements.
Once we collected all the embedding representations, POI and E3 ligase embeddings were normalized independently by removing the respective mean and by scaling to unit variance. The normalization parameters are learned on the given training set and kept fixed for validation and testing. Morgan fingerprints and cell line one-hot encodings, being binary vectors, were not normalized.
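As a minimal sketch of this normalization, the snippet below fits per-dimension means and standard deviations on training vectors only and reuses them for held-out data. The function names are hypothetical, the population standard deviation is an assumption, and the fallback to 1.0 for constant dimensions is a convenience of this sketch rather than something stated in the text.

```python
from statistics import fmean, pstdev


def fit_standardizer(train_vectors):
    """Learn per-dimension mean and standard deviation from the training set."""
    columns = list(zip(*train_vectors))
    means = [fmean(col) for col in columns]
    # Constant dimensions have zero spread; use 1.0 to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def standardize(vectors, means, stds):
    """Apply fixed training-set statistics to any split (train, val, or test)."""
    return [[(x - m) / s for x, m, s in zip(vec, means, stds)]
            for vec in vectors]


train = [[0.0, 10.0], [2.0, 10.0]]
test = [[3.0, 11.0]]
means, stds = fit_standardizer(train)
train_norm = standardize(train, means, stds)
test_norm = standardize(test, means, stds)
```

Fitting the statistics on the training split alone keeps information about validation and test samples from leaking into the inputs.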

Model Architecture
An illustration of the model architecture is shown in Figure 3. The model includes a set of linear layers, each processing a separate input vector, i.e., the Morgan fingerprints, and the normalized POI, E3 ligase, and cell embeddings, respectively. A softmax is then applied to each linear layer output in order to make them of comparable magnitude, and the results are summed together. Lastly, the sum is forwarded to two additional linear layers, interleaved by a ReLU activation function and a batch norm layer. The model is trained to optimize a binary cross-entropy loss (with logits). We set the batch size to 128 and reduce the learning rate by a factor of 10 whenever the validation loss increases compared to the previous training step. Finally, we apply a sigmoid function to the output of the final linear layer before returning predictions about PROTAC activity.
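One way to read the forward pass described above is given below. The notation is ours and purely illustrative: $x_k$ denotes the input vector of branch $k$, $W_k, b_k$ the parameters of its linear layer, $g$ the ReLU and batch norm block (whose internal order the text does not specify), $\sigma$ the sigmoid, and $y \in \{0, 1\}$ the activity label.

```latex
\begin{align}
h &= \sum_{k \in \{\text{PROTAC},\, \text{POI},\, \text{E3},\, \text{cell}\}}
     \operatorname{softmax}\left(W_k x_k + b_k\right), \\
z &= W_2 \, g\left(W_1 h + b_1\right) + b_2, \\
\hat{y} &= \sigma(z), \\
\mathcal{L} &= -\left[\, y \log \hat{y} + (1 - y) \log\left(1 - \hat{y}\right) \right].
\end{align}
```

Because each softmax output sums to one, every branch contributes a vector of the same total mass to $h$, which is consistent with the stated aim of making the branch outputs comparable in magnitude.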

Evaluation Strategy
To fully assess the quality of the curated data and the potential performance of DL models in predicting degradation activity, we designed a set of three studies (Figure 2). In the first study, referred to in this work as the standard CV split, we seek to identify the potential upper bound of the model performance given the curated data. To do so, we randomly pick 10% of the data as a test set, and leave the remaining data for training with 5-fold cross validation (CV). This leads to an ensemble of five trained models, one per CV fold. In the next study, referred to in this work as the target CV split, we explore model generalization against unseen POIs. Similar to the previous study, we select 10% of the available data for testing, such that its POIs do not appear in the remaining 90% of the data, which is used for training (5-fold CV). The target CV split hold-out set was selected based on the distribution of active and inactive entries within Uniprot groups, where we prioritized less common POIs first and ensured that adding each group to the test set did not exceed the specified test split proportion while maintaining a balance between active and inactive entries. This is similar to the constraints in the similarity CV split. Finally, we evaluate the model generalization performance to new PROTACs, referred to in this work as the similarity CV split. To do so, we compute the average Tanimoto distance from each PROTAC Morgan fingerprint to all other PROTAC fingerprints in the full data. To generate the test set for this experiment, we isolated data entries starting from those whose PROTAC has the highest average Tanimoto distance, until reaching 10% of the total available data, leaving the rest for CV training.

Table 1: Parameters optimized by Optuna. The table reports the parameter name, its type, i.e., categorical (Cat) or continuous (Cont), and the range of values or options suggested in each trial. We apply SMOTE oversampling (Chawla et al., 2002) to the concatenated input data, when suggested.
Parameter | Type | Options / Range
Hidden Dimension | Cat | [32, 64, 128, 256, 512 ...
... | ... | [3, 4, ..., 15]
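The similarity CV split selection can be sketched as below, under simplifying assumptions: fingerprints are represented as sets of on-bit indices, each data entry carries exactly one fingerprint, and the function names are hypothetical. The sketch ranks entries by their average Tanimoto distance to all other entries and holds out the most dissimilar fraction.

```python
def tanimoto_distance(a, b):
    """Tanimoto distance between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(a & b) / union


def similarity_test_indices(fingerprints, test_fraction=0.1):
    """Indices of the entries with the highest average distance to the rest."""
    n = len(fingerprints)
    avg_dist = []
    for i, fp in enumerate(fingerprints):
        others = [tanimoto_distance(fp, fingerprints[j])
                  for j in range(n) if j != i]
        avg_dist.append(sum(others) / len(others) if others else 0.0)
    # Most dissimilar entries first.
    ranked = sorted(range(n), key=lambda i: avg_dist[i], reverse=True)
    n_test = int(n * test_fraction)
    return sorted(ranked[:n_test])


fps = [{1, 2, 3}, {1, 2, 4}, {1, 2, 3, 4}, {10, 11}]
held_out = similarity_test_indices(fps, test_fraction=0.25)
```

In the toy example, the fingerprint `{10, 11}` shares no bits with the others and is therefore the one held out. When several entries share the same PROTAC, they would need to be moved to the test set together, which this sketch does not handle.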
For each study, we used stratified group CV as implemented in scikit-learn to ensure each fold has a balanced distribution of active and inactive compounds. The validation set performance is obtained by averaging the validation performance on each of the five CV folds using the best set of hyperparameters found via the Optuna optimization framework (Akiba et al., 2019). The test performance is obtained from the held-out test set by averaging the performance of three models with the best hyperparameters, each trained using a different random seed on all the data used in CV.
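The two ways of aggregating the three test-time models reported in this work (averaging per-model scores and majority voting) can be illustrated with a small sketch. The 0.5 decision threshold and the function names are assumptions made for illustration only.

```python
def majority_vote(probabilities, threshold=0.5):
    """Combine per-model probabilities into one binary prediction per sample.

    `probabilities` holds one list of predicted probabilities per model,
    all of the same length (one value per test sample).
    """
    n_models = len(probabilities)
    votes = []
    for sample_probs in zip(*probabilities):
        positives = sum(p >= threshold for p in sample_probs)
        # A sample is predicted active when more than half the models agree.
        votes.append(int(positives * 2 > n_models))
    return votes


def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    return sum(int(p == y) for p, y in zip(predictions, labels)) / len(labels)


model_probs = [
    [0.9, 0.2, 0.6],  # model trained with seed 1
    [0.8, 0.4, 0.4],  # model trained with seed 2
    [0.3, 0.1, 0.7],  # model trained with seed 3
]
labels = [1, 0, 0]
voted = majority_vote(model_probs)
vote_accuracy = accuracy(voted, labels)
mean_accuracy = sum(
    accuracy([int(p >= 0.5) for p in probs], labels) for probs in model_probs
) / len(model_probs)
```

With an odd number of models, a binary majority vote never ties, which is one practical reason to train three models rather than two or four.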

Hyperparameter Tuning and Ablation Studies
For hyperparameter tuning we leveraged the Optuna optimization framework (Akiba et al., 2019). In each study, we let Optuna spawn 150 trials to suggest a model architecture and hyperparameters to be used to train the models in the CV folds (we used 5 folds). Each trial is instructed to sample all the hyperparameter values listed in Table 1. Using Optuna, the goal is to find the set of hyperparameters that maximizes the average validation ROC-AUC score across the CV folds. The best hyperparameter configuration is then used to train three separate models per study, each with randomly initialized weights (with different seeds), in order to account for model variability. The best configuration models in each study are trained on the study's combined train and validation sets and evaluated on the respective held-out test set.
Additionally, we conducted an ablation study in which we progressively set input vectors to all zeros and fed them to the three best models trained during the standard split study.

Degradation Activity Thresholds
A data sample was labeled active if its $pDC_{50}$ is ≥ 6.0 (equivalent to a $DC_{50}$ of 1 μM) and its $D_{max}$ is ≥ 60%. The $pDC_{50}$ threshold helps identify PROTACs with therapeutic potential, as molecules above this threshold are likely to show significant biological activity. Similarly, the $D_{max}$ threshold helps identify PROTACs capable of achieving substantial degradation of the target protein, indicative of efficacy. $pDC_{50}$ is particularly relevant for drug design, as it allows for the prioritization of compounds that not only bind to the POI but also lead to its effective degradation at a reasonable concentration. By choosing the above thresholds, we aimed to mitigate model bias, ensuring our dataset includes a balanced representation of both active and inactive compounds, enhancing the model's generalizability. Note that a PROTAC can be labeled active in one cell type and inactive in another, such as DT2216, a Bcl-xL degrader, which is active in MOLT-4 cancer cells ($pDC_{50}$/$D_{max}$ = 7.20/90.8%) and inactive in 2T60 hybrid cells ($pDC_{50}$/$D_{max}$ = 5.52/26.0%) (Khan et al., 2019).
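The labeling rule can be written down in a few lines. The sketch below is illustrative: the function and constant names are hypothetical, and the $DC_{50}$ input is assumed to be expressed in molar units before taking the negative base-10 logarithm.

```python
import math

PDC50_THRESHOLD = 6.0   # pDC50 of 6.0 corresponds to a DC50 of 1 micromolar
DMAX_THRESHOLD = 60.0   # percent of degraded POI


def pdc50_from_molar(dc50_molar):
    """Convert a DC50 in mol/L to pDC50 (negative log10 units)."""
    return -math.log10(dc50_molar)


def is_active(pdc50, dmax_pct):
    """A sample is active only when both thresholds are met."""
    return pdc50 >= PDC50_THRESHOLD and dmax_pct >= DMAX_THRESHOLD


# DT2216 values quoted in the text for two different cell lines.
active_in_molt4 = is_active(7.20, 90.8)
active_in_2t60 = is_active(5.52, 26.0)
```

The DT2216 example shows why the cell line has to be part of the model input: the same compound receives opposite labels depending on the experimental context.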

Curated Dataset
After data curation, we were able to extract a total of 2,141 data samples, out of which 812 (37.9%) report information about $D_{max}$, and 1,350 (63.1%) include a $DC_{50}$ value. The curated dataset contained no E3 ligase knockout cell lines, as determined by searching for "-/-" in the cell line text (a few other knockouts, however, were identified). When applying the aforementioned definition of degradation activity, we isolated 759 data samples, 391 (51.52%) of which are labeled active and the remaining 368 (48.48%) inactive. An overview of the distribution of $pDC_{50}$ and $D_{max}$ values is shown in Figure 4a. We can see that the majority of the data samples are normally concentrated around the $pDC_{50}$ threshold of 7.31, with a few outliers. $D_{max}$ values, on the other hand, are more spread out, with half of the samples showing a $D_{max}$ above 80% (median value).
Figure 4b shows the distribution of E3 ligases and their frequency in the dataset, together with the percentage of active/inactive samples associated with each of them. PROTACs are roughly equally distributed among the two main E3 ligases, cereblon (CRBN) and von Hippel-Lindau (VHL), with a small fraction of PROTACs being evaluated with other E3 ligases. We see that CRBN and VHL are indeed the most common (52.8% and 40.6%, respectively), whereas 5.6% of the data samples report less common E3 ligases: IAP (3.16%), MDM2 (1.10%), cIAP1 (1.10%), FEM1B (0.37%), Ubr1 (0.11%). Regarding the distribution of active samples among E3 ligases, CRBN and VHL are quite balanced (54.7% and 53.5%, respectively), and FEM1B and Ubr1 are mostly associated with active samples. The less common MDM2, IAP, and cIAP1 are mostly associated with inactive samples.
Figure caption, panel (a): mean validation accuracy and ROC-AUC scores of the five models trained during CV (one model per fold) with the best hyperparameters found. Additionally, the plots show the performance on the test set of three models trained per study with the best hyperparameters found in CV and different initial weights. For those models, we also report the mean of the test accuracy and ROC-AUC scores, alongside the test accuracy and ROC-AUC scores calculated using majority voting. A dummy model is included as a baseline, which always predicts the majority class in the training set.

Model Performance
The performance metrics derived from the standard CV split offer an upper bound for our model's capability, with a validation average/test average/test majority vote accuracy of 81.4%/80.8%/79.5% and a validation average/test average/test majority vote ROC AUC of 0.887/0.865/0.865. These results suggest an optimal scenario where the model has access to a diverse and representative sample of the data during training, maximizing its learning potential. The standard split serves as an upper bound estimate for model performance, as real-life scenarios generally require more constrained and specialized testing conditions.
On the other hand, in the similarity CV split study, designed to evaluate the model's generalizability to unseen PROTAC compounds that do not share structural similarities with the training set, our model reached a validation average/test average/test majority vote accuracy of 79.6%/74.9%/67.5% and a validation average/test average/test majority vote ROC AUC of 0.869/0.826/0.850. The high performance in this study indicates the model's robust ability to extrapolate from known PROTACs to predict the activities of novel molecules.
Finally, the target split study presents a significant challenge for our model, as evidenced by the lower validation average/test average/test majority vote accuracy of 64.0%/62.3%/55.3% and a validation average/test average/test majority vote ROC AUC of 0.710/0.604/0.616. This study tests the model's ability to generalize across different protein targets, a critical factor for PROTAC design in novel disease mechanisms. The diminished performance suggests a need for improved protein representations or for embeddings that better capture more detailed and relevant features of the target proteins. Moreover, it underscores the necessity for more extensive and diverse datasets that include a broader array of PROTACs and targets.
Additional performance metrics are reported in Appendix A. Appendix D includes instead the performance scores of an XGBoost model evaluated on the aforementioned studies (Chen and Guestrin, 2016).

Ablation Studies
The ablation study summarized in Figure 5b highlights the contributions of various embeddings to model performance in PROTAC activity prediction. We focus on the average test accuracy of the three models trained with the best hyperparameters in the standard split study. With all embeddings enabled, the three models achieved an average test accuracy of 80.8%, serving as the baseline for full-feature utilization. Disabling cell, E3 ligase, and protein of interest (POI) embeddings individually led to varied decreases in performance, with test accuracies of 65.1%, 66.3%, and 65.4%, respectively. This highlights the importance of each type of embedding in enhancing predictive accuracy.
Notably, the model performance dropped below that of the dummy model when disabling compound information, emphasizing the importance of the PROTAC fingerprints. This is further highlighted by the test accuracy obtained when the POI, E3, and cell embeddings are all disabled (leaving the PROTAC information only), which reached 64.7%, close to other setups in which only a single component was disabled. In general, molecular fingerprints appear to be the most relevant input feature to the model. However, the general trend of large accuracy drops suggests that the contextual embeddings collectively contribute significant predictive value beyond the structural information provided by molecular fingerprints alone.
Overall, this ablation study demonstrates the synergistic effect of integrating diverse embeddings, including compound structure (PROTAC fingerprint) and biological context (cell type, E3 ligase, POI), to capture the diverse determinants of biological activity in PROTACs.

Related Work
The studies most closely aligned with our work are those of Li et al. (2022) and Nori et al. (2022). Li et al. (2022) introduce DeepPROTACs, a deep learning model for predicting PROTAC activity, whereas Nori et al. (2022) instead propose a LightGBM model for predicting protein degradation activity. LightGBM is a gradient boosting framework that uses a histogram-based approach for efficient, high-performance ML tasks (Ke et al., 2017).
The DeepPROTACs architecture encompasses multiple branches employing long short-term memory (LSTM) and graph neural network (GNN) components, all combined prior to a prediction head. Each branch processes distinct facets of the ternary complex, encompassing elements like E3 ligase and POI binding pockets, along with the individual components of the PROTAC: the warhead, linker, and E3 ligand. The model reaches an average prediction accuracy of 77.95% and a ROC-AUC score of 0.8470 on a validation set drawn from the PROTAC-DB. The LightGBM model, on the other hand, achieves a ROC-AUC of 0.877 on a PROTAC-DB test set with a much simpler model architecture and input representation.
Notwithstanding their achievements, the DeepPROTACs and LightGBM models both exhibit certain limitations. First, in DeepPROTACs there is a potential risk of information loss as the PROTAC SMILES are partitioned into their constituent E3 ligands, warheads, and linkers, which are then fed into separate branches of the model. Secondly, while the authors undertake advanced molecular docking of the entire PROTAC-POI-E3 ligase complex, their subsequent focus on the 3D binding pockets of the POI and E3 ligase renders the approach less amenable to experimental replication and practical use. Finally, and perhaps most importantly, the potential for data leakage during hyperparameter optimization and its effects on out-of-distribution (OOD) generalization was not investigated. Data leakage between the different PROTAC components in the training and test sets of the model may artificially yield a more accurate model that does not generalize well to new real-world data, necessitating more rigorous testing procedures. Because of that, generalization of the DeepPROTACs model would need to be further investigated on a separate test set.

Conclusions
In this work, we curated open-source PROTAC data and introduced a versatile toolkit for predicting PROTAC degradation effectiveness in three different experimental scenarios, aiming to assess the quality of our curated data and model generalizability. The performance of our models, with a top test accuracy of 80.8% and a test ROC-AUC score of 0.865, is competitive with, if not better than, existing methods for protein degradation prediction. Ours are also the first models to consider both $DC_{50}$ and $D_{max}$ in predicting degradation activity for PROTACs, a significant contribution as both properties are important in determining PROTAC efficacy. We show that our models can generalize well to unseen PROTACs, while struggling with unseen targets, highlighting the need for more comprehensive protein representations and more extensive datasets. Finally, our approach offers open-source accessibility, ease of reproducibility, and a less computationally complex alternative to previous work, making it a valuable resource for researchers working on data-driven approaches to PROTAC engineering.

B. Cell Line Embeddings From Text Descriptions
This section details the methods used to extract complex cell line embedding vectors from text descriptions. A basic approach is to assign a categorical (or one-hot encoded) label to each cell line in the dataset. While practical, this method ignores any inherent information about the cell lines and their biological similarity. To address this, we utilized the Cellosaurus database, which provides standardized information about common cell lines used in research (Bairoch, 2018). Our approach involves isolating relevant biological information about each cell line into a text description. Features such as omics, genome ancestry, doubling time, and sequence variations, all in text form, are ranked by uniqueness and filtered to form a concise single text description of a given cell line. We then encode this text into an embedding vector by using a sentence Transformer model (Reimers and Gurevych, 2019).
A sentence Transformer is designed to generate embedding representations of input sentences such that similar sentences have high cosine similarity. However, sentence Transformers have a fixed input size, accepting a maximum number of tokens. To process longer texts, we divide them into chunks of the maximum size, encode each chunk into a vector, and average the vectors into a single representation. To avoid diluting relevant information during this averaging process, we aim to summarize each cell line's information into concise, yet informative, short text descriptions.
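The chunk-and-average procedure for long texts can be sketched as follows. The `encode_chunk` callable stands in for a real sentence Transformer and is hypothetical, as is the toy encoder used in the example; the default of 384 tokens mirrors the input size limit mentioned later in this appendix.

```python
def encode_long_text(tokens, encode_chunk, max_tokens=384):
    """Encode a token list of any length by averaging per-chunk embeddings.

    `encode_chunk` is any callable mapping a list of at most `max_tokens`
    tokens to a fixed-length list of floats.
    """
    if not tokens:
        raise ValueError("cannot encode an empty token list")
    # Split into consecutive chunks of at most `max_tokens` tokens.
    chunks = [tokens[i:i + max_tokens]
              for i in range(0, len(tokens), max_tokens)]
    vectors = [encode_chunk(chunk) for chunk in chunks]
    dim = len(vectors[0])
    # Unweighted element-wise mean over the chunk embeddings.
    return [sum(vec[d] for vec in vectors) / len(vectors) for d in range(dim)]


def toy_encoder(chunk):
    """Stand-in encoder: embeds a chunk as [length, number of distinct tokens]."""
    return [float(len(chunk)), float(len(set(chunk)))]


tokens = ["cancer", "cell", "line", "cell", "human", "male"]
embedding = encode_long_text(tokens, toy_encoder, max_tokens=4)
```

Note that the unweighted mean gives a short trailing chunk the same influence as a full one, which is one way information can be diluted and a reason to keep descriptions short.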
We manually isolated columns containing relevant biological information about cells from the available database columns, such as their category (e.g., "hybridoma", "cancer cell line", "transformed cell line"), sex (male or female), and species of origin (e.g., "mus musculus", "homo sapiens", etc.). We discarded identification information, such as patents, synonyms, or entry dates. Additionally, Cellosaurus provides comments in various categories (e.g., "monoclonal antibody target", "sequence variation", etc.), which we also included. The list of selected information is shown on the y-axis of Figure 7a.
Next, we ranked columns and comments based on the fraction of unique entries relative to their total, as illustrated in Figure 7a. Our intuition is that comments with a high number of unique entries help identify specific cell lines, making it easier to distinguish cell types. Following this principle and after reviewing examples, we selected the following information in this order: genome ancestry, karyotypic information, senescence, biotechnology, virology, caution, donor information, sequence variation, characteristics, transfected with, monoclonal antibody target, HLA typing, knockout cell, microsatellite instability, hierarchy (HI), breed/subspecies, derived from site, population, group, monoclonal antibody isotype, cell type, transformant, selected for resistance to, and category (CA).
Finally, for each database entry, we concatenated the strings from the selected information, removed PubMed references, and stripped extra spaces. The average text description length (i.e., number of characters) of the cell lines in our curated dataset was 181.1, below the 384-token input size limit of the selected sentence Transformer model.
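The uniqueness ranking can be sketched in a few lines. The table layout and function name are hypothetical, and we assume here that "total" refers to the number of non-missing entries in each column.

```python
def rank_by_uniqueness(table):
    """Rank columns by the fraction of unique entries among non-missing ones.

    `table` maps a column (or comment category) name to its list of entries,
    with None marking a missing value.
    """
    scores = {}
    for name, values in table.items():
        present = [v for v in values if v is not None]
        scores[name] = len(set(present)) / len(present) if present else 0.0
    # Highest uniqueness first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


toy_table = {
    "sex": ["male", "female", "male", "female"],
    "genome ancestry": ["a", "b", "c", "d"],
    "category": ["cancer", "cancer", "cancer", None],
}
ranking = rank_by_uniqueness(toy_table)
ranked_names = [name for name, _ in ranking]
```

In the toy table, "genome ancestry" is unique for every cell line and ranks first, matching the intuition that highly specific fields are the most useful for telling cell lines apart.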

B.1. Cosine Similarity of Cell Line Descriptions
Table 2 presents a cosine similarity matrix for three cell line descriptions generated by following the above methodology. The cosine similarity metric quantifies the similarity between the embeddings of different cell line descriptions; in Table 2 the values lie between 0 and 1, where 1 indicates identical descriptions and 0 indicates no similarity.
For instance, the description of the cell line UKF-NB-2rDACARB4 is highly similar to that of UKF-NB-2rDOCE10, with a cosine similarity of 0.8759. Both of these cell lines are cancer cell lines derived from the same species (Homo sapiens) and are part of the resistant cancer cell line (RCCL) collection. They differ primarily in their resistance to different chemotherapeutic agents: dacarbazine for UKF-NB-2rDACARB4 and docetaxel for UKF-NB-2rDOCE10.
In contrast, the description of FHS036i-sh18961C, an induced pluripotent stem cell line, has a much lower similarity to the cancer cell lines, with cosine similarities of 0.2832 and 0.3522 to UKF-NB-2rDACARB4 and UKF-NB-2rDOCE10, respectively. This lower similarity is expected given the fundamental differences in cell type, collection origin, and specific biological characteristics.
These examples illustrate how cosine similarity can effectively differentiate between cell lines based on their detailed descriptions, reflecting both broad classifications and specific attributes.
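For reference, a cosine similarity matrix such as the one discussed above can be computed from embedding vectors as in the following sketch. The function names and the toy two-dimensional vectors are illustrative only; real description embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for zero vectors in this sketch
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Pairwise cosine similarities; symmetric, with ones on the diagonal."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


toy_embeddings = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
matrix = similarity_matrix(toy_embeddings)
```

Cosine similarity depends only on the direction of the vectors, not their length, so two descriptions of different sizes can still be scored as near-identical if their embeddings point the same way.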

B.2. UMAP Visualization of Cell Line Embeddings
Figure 7b presents a uniform manifold approximation and projection (UMAP) plot of the cell line embedding vectors. UMAP is a dimensionality reduction technique that helps visualize high-dimensional data by projecting it into a lower-dimensional space, preserving both local and global data structure (McInnes et al., 2020).
The plot showcases the embedding vectors of cell lines, color-coded according to their categories. Each point represents a cell line, and its position reflects the similarity of its embedding vector to others. Similar cell lines cluster together, indicating that the embedding vectors effectively capture meaningful biological relationships. For instance, induced pluripotent stem cells (light purple) and hybridoma cell lines (light blue) form distinct, dense clusters, demonstrating the embeddings' ability to reflect their biological differences. In contrast, some categories, such as spontaneously immortalized cell lines (purple) and cancer cell lines (yellow), exhibit partial overlap, suggesting shared biological features while maintaining enough distinction to form identifiable subclusters. This visual validation underscores the embeddings' capacity to encapsulate and differentiate between various cell line categories, supporting the efficacy of our approach.

B.3. Evaluation
Figure 8 reports the evaluation scores of models leveraging cell line embeddings from text descriptions. The models were tuned and trained following the same methodology described in Section 2.5. Compared to the baseline model scores on the standard split study in Figure 6, using text-based cell line embeddings resulted in a drop in validation average/test average/test majority vote accuracy of 0.1%/2.2%/1.3%. Notably, on the similarity split study the model reached validation average/test average/test majority vote accuracy of 78.9%/80.1%/77.9%, compared to 79.6%/74.9%/67.5% for the baseline. However, the models trained on text-based cell embeddings still struggle to generalize to new targets in the target split study, displaying slightly lower accuracy than the baseline model on this same split (validation average/test average/test majority vote accuracy of 55.4%/61.8%/52.6%).

C.1. PROTAC-DB and PROTAC-Pedia
Table 3 provides an overview of the two datasets used in our study: PROTAC-DB and PROTAC-Pedia. PROTAC-DB contains a total of 5,388 entries, whereas PROTAC-Pedia comprises 1,203 entries. The number of unique SMILES

C.2. Cross-Validation Folds and Test Sets
Table 4 presents detailed statistics for the datasets used in the three studies proposed in our evaluation strategy.
For the standard split, each fold consists of 560 training entries, 140 validation entries, and 78 test entries. The proportion of active data samples in these splits is consistent and balanced across folds, with the training and validation sets containing around 50% active samples, and the test set 55.1%. Notably, a significant percentage of entries have leaking Uniprot identifiers (around 80%) and a smaller proportion have leaking SMILES (around 8%). The average Tanimoto distance between PROTACs in the test set is 0.379, indicating moderate structural similarity.
The target split aims to evaluate model generalization to unseen POIs. The training set sizes vary between 507 and 594, with the validation set sizes ranging from 108 to 195, and the test set consistently containing 76 entries. Because of stratified folds, the active data proportions in the training, validation, and test sets vary more widely than in the standard split. As expected for this split, there are no leaking Uniprot identifiers, and the proportion of leaking SMILES is below 1.4%. The average Tanimoto distance between PROTACs in the test set is slightly higher at 0.390.
For the similarity split, designed to test generalization to new PROTACs, the training set sizes range from 510 to 571, validation sets from 110 to 171, and the test set consistently contains 75 entries. The active sample proportion in the training sets averages around 52.2%, with the validation sets showing slightly more variation. Around 60.1% of entries have leaking Uniprot identifiers, and there are no leaking SMILES, by construction. The average Tanimoto distance between PROTACs in the test set is the highest among the splits at 0.412, reflecting the structural novelty of the test PROTACs in this specific study.
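The two kinds of statistics discussed above (leakage percentages and average Tanimoto distance) can be sketched as follows. This is an illustrative sketch only: fingerprints are represented as sets of on-bit indices, which is an assumption about the representation, and all function names are hypothetical rather than taken from the released code.

```python
from itertools import combinations
from typing import Hashable, Iterable, Sequence, Set


def tanimoto_distance(a: Set[int], b: Set[int]) -> float:
    """1 - |A ∩ B| / |A ∪ B| for two fingerprints given as sets of on-bits."""
    union = len(a | b)
    if union == 0:
        return 0.0  # convention: two empty fingerprints are treated as identical
    return 1.0 - len(a & b) / union


def average_pairwise_distance(fingerprints: Sequence[Set[int]]) -> float:
    """Mean Tanimoto distance over all unordered pairs in a set of molecules."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        raise ValueError("need at least two fingerprints")
    return sum(tanimoto_distance(a, b) for a, b in pairs) / len(pairs)


def leaking_percentage(train_keys: Iterable[Hashable],
                       test_keys: Iterable[Hashable]) -> float:
    """Percentage of training entries whose key (e.g. a SMILES string or a
    Uniprot identifier) also appears among the test entries."""
    train_list = list(train_keys)
    if not train_list:
        raise ValueError("training set is empty")
    test_set = set(test_keys)
    leaking = sum(1 for key in train_list if key in test_set)
    return 100.0 * leaking / len(train_list)


# Toy example with made-up identifiers and fingerprints.
train_targets = ["P1", "P2", "P1", "P3"]
test_targets = ["P1"]
print(leaking_percentage(train_targets, test_targets))
print(average_pairwise_distance([{1, 2}, {1, 2}, {3}]))
```

A split with no leaking identifiers, such as the target split for Uniprot identifiers, would yield a leaking percentage of zero under this definition.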

D. XGBoost Performance
Given the experimental setup and evaluation strategy described in Section 2.4, we first trained different XGBoost models in a CV setting via Optuna. The selected hyperparameters tuned in Optuna are reported in Table 5. We then trained, with the best hyperparameters found, three models and evaluated them on the held-out test sets. As with the deep learning models, we evaluated the XGBoost models both individually by computing their average performance, and together via majority voting. Figure 9 compares the performance metrics for the trained XGBoost models on the different studies.
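The two evaluation modes mentioned above, average individual performance and majority voting, can be illustrated with a small sketch. The 0.5 decision threshold, the toy probabilities, and the function names are assumptions made for this example.

```python
from typing import List, Sequence


def to_labels(probabilities: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Binarize predicted probabilities of the 'active' class."""
    return [1 if p >= threshold else 0 for p in probabilities]


def majority_vote(model_probs: Sequence[Sequence[float]],
                  threshold: float = 0.5) -> List[int]:
    """model_probs[m][i] is model m's probability that sample i is active.
    Each model casts one binary vote per sample; the majority label wins."""
    n_models = len(model_probs)
    if n_models == 0 or n_models % 2 == 0:
        raise ValueError("use an odd, non-zero number of models to avoid ties")
    votes = [to_labels(probs, threshold) for probs in model_probs]
    n_samples = len(votes[0])
    return [1 if sum(v[i] for v in votes) > n_models // 2 else 0
            for i in range(n_samples)]


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of predictions that match the true labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Toy predictions from three models on three samples.
probs = [
    [0.9, 0.4, 0.2],
    [0.6, 0.7, 0.1],
    [0.3, 0.8, 0.6],
]
labels = [1, 0, 0]

voted = majority_vote(probs)
average_accuracy = sum(accuracy(to_labels(p), labels) for p in probs) / len(probs)
majority_accuracy = accuracy(voted, labels)
print(voted, average_accuracy, majority_accuracy)
```

The two numbers can differ in either direction: majority voting helps when the models make uncorrelated errors, but it cannot recover samples that most models misclassify.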
The comparison of test performances between the trained XGBoost models and the proposed deep learning models

| Parameter | Type | Range | Scale |
|---|---|---|---|
| Step size shrinkage (eta) | float | 1e-4 to 1e-1 | log |
| Maximum depth of a tree (max_depth) | int | 3 to 10 | |
| Minimum sum of instance weight (hessian) needed in a child (min_child_weight) | float | 1e-3 to 10.0 | log |
| Minimum loss reduction required to make a further partition (gamma) | float | 1e-4 to 1e-1 | log |
| Subsample ratio of the training instances (subsample) | float | 0.5 to 1.0 | |
| Subsample ratio of columns when constructing each tree (colsample_bytree) | float | 0.5 to 1.0 | |
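For illustration, the ranges listed above could be expressed as an Optuna-style search space along the following lines. This is a sketch, not the released tuning code: the function name is hypothetical, and treating parameters with no listed scale as linearly sampled is an assumption. The function only relies on the `suggest_float` and `suggest_int` methods of the trial object it receives.

```python
from typing import Any, Dict


def suggest_xgboost_params(trial: Any) -> Dict[str, Any]:
    """Sample one XGBoost hyperparameter set from the listed ranges.

    `trial` is expected to expose Optuna's `suggest_float(name, low, high, log=...)`
    and `suggest_int(name, low, high)` methods, e.g. an `optuna.trial.Trial`.
    """
    return {
        # Log-scaled parameters, as indicated in the Scale column.
        "eta": trial.suggest_float("eta", 1e-4, 1e-1, log=True),
        "min_child_weight": trial.suggest_float("min_child_weight", 1e-3, 10.0, log=True),
        "gamma": trial.suggest_float("gamma", 1e-4, 1e-1, log=True),
        # No scale is listed for these; linear sampling is assumed here.
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
    }
```

Inside an Optuna objective, the returned dictionary would typically be passed to the model constructor before computing the cross-validation score that the study optimizes.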

E. Amino Acid Counts as Protein Embeddings
We conducted an experiment to investigate whether our model utilizes latent information from protein structure embeddings or merely uses them as "barcodes" to differentiate between data samples. We encoded both the POI and E3 ligase amino acid sequences with a 1-gram count-vectorizer from scikit-learn, i.e., we counted the characters in the sequence string. We then tuned and trained the models following the same methodology detailed in Section 2.5. Figure 10 shows the obtained performance scores. When encoding protein sequences as amino acid counts, we observe validation average/test average/test majority vote accuracy differences of +0.3%/-6%/-9% compared to the baseline model scores shown in Figure 6. In particular, the models completely fail to generalize to new targets, reaching a top test accuracy of 57.9%. Overall, the embeddings from pretrained Transformers used by our models, although not perfect, appear to help the models learn more meaningful latent representations of the biological context of PROTACs.
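The character-counting encoding described above can be sketched with the standard library alone. The experiment itself used scikit-learn's count vectorizer; the fixed 20-letter alphabet, the decision to ignore non-standard residues, and the function name below are assumptions made for this illustration.

```python
from collections import Counter
from typing import List

# The 20 standard amino acids in one-letter code (fixed vocabulary, an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_counts(sequence: str, alphabet: str = AMINO_ACIDS) -> List[int]:
    """Encode a protein sequence as a vector of per-residue counts.

    Characters outside `alphabet` (e.g. 'X' for unknown residues) are ignored.
    """
    counts = Counter(sequence.upper())
    return [counts.get(residue, 0) for residue in alphabet]


# Toy sequence, not a real POI or E3 ligase.
vector = amino_acid_counts("MKKA")
print(dict(zip(AMINO_ACIDS, vector)))
```

Because the encoding discards residue order entirely, two unrelated proteins with similar composition map to nearly identical vectors, which is consistent with its use here as a low-information baseline.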

Figure 1: (a) Schematic representation of the PROTAC mechanism of action: the proteasome (violet) degrades the ubiquitinated POI targeted by the PROTAC. After degradation, the PROTAC becomes available again for new targets. (b) Example of a typical PROTAC dose-response curve, along with the activity thresholds used in this work.

Figure 2: Data curation pipeline and proposed studies.

Figure 5a reports the performance of the different models across the various studies. For each study, named after either the standard, target, or similarity split used, we show the

Figure 5: (a) Performance of the different models across the various studies, with model accuracy plotted on the left and ROC AUC plotted on the right. (b) Ablation results for the standard cross-validation split. Each bar indicates the embedding(s) not available to the model to process.

Figure 6: Additional performance metrics for the presented deep learning models: (a) F1 score, (b) precision, and (c) recall.

Figure 7: (a) Cell line information (database columns) from Cellosaurus, ranked by their percentage of unique entries over the total number of entries in that column.(b) UMAP visualization of the generated cell line embedding vectors, color-coded by cell line categories.

Figure 8: Performance of the models leveraging cell embeddings from cell line descriptions: (a) accuracy, and (b) ROC-AUC.

Figure 10: Performance of the models leveraging amino acid counts as protein embeddings: (a) accuracy, and (b) ROC-AUC.
Figure 4: (a) Histogram of $pDC_{50}$ and $D_{max}$ in the full curated dataset. Note that the $pDC_{50}$ values are scaled 10× to better display them alongside $D_{max}$ values, although they are not bounded by 0 and 100 as $D_{max}$ is. (b) The percentage of curated data associated with each E3 ligase and the active/inactive percentage of data points per E3 ligase.

Table 2
Example of cosine similarity matrix for three cell line descriptions.

Table 3
Characteristics of PROTAC-DB and PROTAC-Pedia datasets. The term single here indicates entries for which the SMILES or target appears only once in the corresponding dataset.

Table 4
Statistics of datasets used in different studies. The term leaking indicates the percentage of entries in the training set with either a SMILES or target that also appears in the test set data samples. The avg Tanimoto distance refers to the average Tanimoto distance between PROTACs in the test set.