Chemogenomic library design strategies for precision oncology, applied to phenotypic profiling of glioblastoma patient cells

Summary Designing a targeted screening library of bioactive small molecules is a challenging task since most compounds modulate their effects through multiple protein targets with varying degrees of potency and selectivity. We implemented analytic procedures for designing anticancer compound libraries adjusted for library size, cellular activity, chemical diversity and availability, and target selectivity. The resulting compound collections cover a wide range of protein targets and biological pathways implicated in various cancers, making them widely applicable to precision oncology. We characterized the compound and target spaces of the virtual libraries, in comparison with a minimal screening library of 1,211 compounds for targeting 1,386 anticancer proteins. In a pilot screening study, we identified patient-specific vulnerabilities by imaging glioma stem cells from patients with glioblastoma (GBM), using a physical library of 789 compounds that cover 1,320 of the anticancer targets. The cell survival profiling revealed highly heterogeneous phenotypic responses across the patients and GBM subtypes.


INTRODUCTION
In the past ten years, there have been encouraging advances in the treatment and consequently, the survival rates, for many cancers. This progress has largely been driven by increased molecular understanding and classification of distinct cancer subtypes, the development of novel therapeutics focusing on targets associated with specific disease subtypes, and advances in both the type and range of therapeutic molecules available for the treatment of patients. When considering molecules that have been approved or advanced in clinical trials, there are successful examples of targeted chemotherapeutics, 1 small interfering RNAs, 2 monoclonal antibodies, 3 microRNAs, 4 and virotherapy, 5 among other drug classes. These developments are very encouraging considering the complexity of human cancers and the difficulty in developing effective treatments for advanced disease. However, small-molecule chemotherapeutics still makeup the vast majority of approved drugs available to oncologists treating cancers. [6][7][8] In the case of glioblastoma (GBM) brain tumors, small-molecule chemotherapeutics are currently the only approved treatment modalities beyond surgery and radiation 9 and represent the most fertile ground for future innovation of new therapeutics to address the current challenges inherent in treating brain tumors. These challenges include (i) breaching the blood-brain barrier to effectively deliver therapeutics to the tumor, (ii) developing combinatorial treatments to systematically target redundant signaling pathways and tumor vulnerabilities inherent in brain tumors which typically exhibit wide intra-and inter-tumor heterogeneity, and (iii) selectively targeting GBM stem cells, which have been shown to be the main source for cancer recurrence in GBM. 10 Traditional drug development often employs high-throughput drug screening of large collections of diverse small-molecule libraries against a nominated therapeutic target to identify chemical starting points (hit compounds) for further optimization. This process has been relatively successful at the industrial level, but it tends to become less successful at the academic level, due to the prohibitive infrastructural costs required to develop and prosecute large-scale screens, 11 as well as the cost to develop hit compounds to viable leads and sequentially to an investigational drug candidate. Instead, many academic target discovery facilities have focused on developing more physiologically relevant phenotypic assays, which better recapitulate disease biology, 12,13 and screening of smaller, more focused libraries of small compounds, such as collections developed and curated for drug repurposing, [14][15][16] probes for target discovery, 6,17,18 or pharmacologically active targeted compound sets for specific target classes. 7,19,20 This focused approach has the advantage of providing a better understanding of the molecular basis of the disease, while simultaneously exploiting existing therapeutics and compounds with known safety profiles, along with probes that possess drug-like properties and compounds with known protein targets. In complex diseases of unmet therapeutic need, where target biology is still poorly understood or where disease heterogeneity indicates multiple target pathways that contribute to disease progression (as exemplified by GBM), phenotypic screening of target-annotated compound libraries in relevant patient-derived cell models may provide a valuable strategy for empirical identification of druggable targets or drug combinations. By circumventing major pitfalls, such as poor selectivity, cellular activity, and biological or target space diversity, these targeted libraries have great potential to accelerate the drug discovery process.
Here, we describe the construction of a comprehensive anticancer target-annotated compound library, designed to interrogate a wide range of potential cancer targets in phenotypic screening. The library design is approached as a multi-objective optimization (MOP) problem, where the aim is to maximize the cancer target coverage, while guaranteeing compounds' cellular potency and selectivity, and minimize the number of compounds arrayed into the final screening library. To do this, we used two target-based design strategies. First, we searched for small molecules against the druggable cancer targets among approved and investigational compounds (AICs) identified from the literature, drug databases, and existing oncology collections. Second, to expand the target-annotated compound library, we surveyed several pan-cancer studies to identify anticancer compound-target pairs and then expanded the chemical space around those novel targets by identifying additional bioactive compound probes through database queries. Importantly, cancer-mutated proteins, nearest neighbors, and influencer targets were further investigated for potential small-compound interactors, which generated a large in silico probe set collection. Finally, we refined the probe set collection by applying several filters, with adjustable activity and similarity thresholds, and removed redundant structures and compounds which could not be readily sourced at the time of library curation, to yield a sufficiently diverse, focused, and target-annotated compound library for phenotypic screening purposes, named the Comprehensive anti-Cancer small-Compound Library, or C 3 L. In the pilot application of C 3 L to cell survival profiling of patient-derived GBM stem cell models, we discovered widely heterogeneous patient-specific vulnerabilities and target pathway activities. All the compound libraries and their target and compound annotations, as well as the pilot screening data, are freely available as data spreadsheets and through an interactive web platform (www.c3lexplorer.com).

Identifying and curating small-molecule inhibitors of cancer-associated targets
Our first design objective was to define a comprehensive list of protein targets associated with the development and progression of cancers to form the basis of the anticancer small-molecule library. We first defined a list of proteins known to be implicated in cancers using The Human Protein Atlas 21 and nominal targets of pan-cancer studies from the PharmacoDB, 22 leading to the target space of 946 oncoproteins ( Figure 1A; see STAR Methods for details). We then expanded the target space by using additional pancancer studies linked back to cancer-related targets to define the full set of 1,655 proteins or other cancer-associated gene products. Our target space was designed to span a wide range of protein families, cellular functions, and cancer phenotypes, and it covers all the categories of ''hallmarks of cancer'' 23,24 ( Figure 1B).
After defining the comprehensive list of cancer-associated targets, our next objective was to identify and curate a small-molecule collection of compounds targeting these proteins. Since the compounds ranged from investigational and experimental probe compounds (EPCs) to approved drugs, we took a systematic approach to defining each source, through sorting the compounds targeting the cancer-associated targets, ranking the compounds for activity, diversity, and availability, and finally, producing a sortable and searchable database of the screening library along with its target and other annotations. The screening library construction started from >300,000 small molecules and ended up with 1,211 optimized for physical library size, cellular activity, chemical diversity, and target selectivity, leading to 150-fold decrease in compound space, yet still covering 84% of the cancer-associated targets. iScience Article Target-based approach: EPC collection Using the target-based design, we extracted compound-target interactions manually from public databases, leading to chemical probes and investigational compounds in three nested subsets of gradually decreasing sizes; (i) the theoretical set is an in silico set curated from established target-compound pairs covering the expanded target space of 1,655 cancer-associated proteins, (ii) the large-scale set is a broader screening collection of filtered compounds covering the same target space as theoretical set, and (iii) the screening set is the final set of most potent probes arrayed into the physical library. The screening set is the smallest subset due to the limitations in compound commercial availability for screening purposes. Figure 1C shows a schematic diagram of the construction of these three target-based compound sets, with their compound and target numbers.
The theoretical set contains 336,758 unique compounds from four probe sets for pan-cancer target space, pan-cancer collection with extended compound space, mutant target space, and mutant target collection with extended target space. The large-scale set contains a subset of 2,288 compounds from the theoretical set, filtered to reduce the number of molecules in the library while still covering the same target space, using both the activity and similarity filtering procedures with pre-defined cutoff values (see Figure S4 for the filtering parameters used in this study, but the cutoff parameters are freely adjustable for other studies and library constructions). The large-scale set was based on both the on-and off-target profiles of the 2,282 small-molecule (ii) the extended compound space consists of additional compounds that have off-target activity against the annotated targets based on bioactivity data from public drug/target repositories such as ChEMBL, 25 Drug Target Commons (DTC), 26 and DrugBank; 27 (iii) the mutant target space compounds have activity against the mutant variants of the annotated targets extracted from the COSMIC database; 28 (iv) the extended target space includes compounds with activity against cancer-related targets established through a nearest neighbor approach. 29 (B) The targets of the compounds in the theoretical set cover all the hallmarks of cancer. Note: each target protein may belong to multiple hallmarks explaining why they do not add up to 1,655. (C) The filtering criteria for designing the large-scale and physical screening sets. See STAR Methods for the step-by-step procedures for the construction of the various probe set collections.
compounds, which could be used in larger-scale screenings campaigns in academic or industrial projects (Table S1).
The screening set contains a smaller number (1,211) of compounds that are more easily purchasable and used for screening applications, thus making the probe set suitable for routine exploration of oncology-associated biological target space in complex phenotypic assays and identification of potential candidates for further drug development. This probe subset was obtained by subjecting the theoretical set to three filtering procedures (see Figure 1C). These filtering procedures involved (i) global target-agnostic activity filtering to remove 13,335 non-active probes, (ii) selecting the most potent compounds for each target to reduce the library to 2,331 compounds, and (iii) filtering by the availability of the compounds, which reduced the library size by 52%, while the target coverage remained at 86% and the target activity distributions were relatively unchanged (p > 0.05; Kolmogorov-Smirnov [K-S] test; Figure S1).

Drug-based approach: AIC collection
As expected from the target-based approach (i.e., identifying established potent small molecules for respective targets), EPC collections included mostly compounds that are currently in preclinical stages ( Figure 2A). We next compared EPC collection with an additional set of small molecules that are currently approved for clinical use, also including anticancer compounds in various clinical development stages that might be candidates for drug repurposing applications. This compound-based design strategy led to a complementary AIC collection, named AIC collection, manually curated from several public compound sources and clinical trials. The AIC collection was further subjected to removal of duplicate molecules and similarity searches using the extended connectivity fingerprint (ECFP4/6) and molecular ACCess system (MACCS) fingerprints, where Dice similarity for ECFP4/6 and Tanimoto similarity for MACCS keys with cutoff of R 0.99 were used to identify and remove structurally highly similar compounds (e.g., doxorubicin and epirubicin).
The AIC collection consisted of 546 unique compounds, out of which 74 (14%) are labeled as standard chemotherapeutics ( Figure S2). Most of the molecularly targeted compounds in AIC collections are kinase inhibitors (50%), but the collection also includes various other drug classes, such as immunomodulatory agents (3.7%) and metabolic modifiers (2.9%). As expected, the AIC compounds contain a large number of approved drugs (40.1%), and the compounds in this collection only partly overlapped with those of the EPC collections (64% with EPC and 27% with screening set; Upset plot in Figure 2A). However, even if the total number of compounds is rather different in the collections, the AIC and EPC collections cover the same anatomical and therapeutic classes (ATCs), and the proportions of ATCs in the screening set were somewhere in between the AIC and EPC collections ( Figure 2B).
To further compare the target-based EPC and compound-based AIC collections to a functional genomicsbased approach to construct an anticancer compound library, we identified potent inhibitors of the 628 priority targets extracted from the CRISPR-Cas9 screens of the DepMap project, 30

Characterization of the compound and target spaces of the probe collections
After comparing the EPC collections to the approved/investigational and priority target collections, we next analyzed the compound and target spaces of the three EPC collections, designed using the targetbased procedures and selected parameter values for filtering. The target distributions of the three probe collections remained relatively similar, whereas the screening set showed reduced numbers of multi-target compounds by its design ( Figure 3A, upper panel). However, the median number of targets per compound was one for each probe collection, indicating that the sets include relatively selective compounds. For most of the targets, the screening set contains only a single potent compound, again per its design principles, whereas in the theoretical and large-scale sets the median number of compounds per target was 42 and 2, respectively ( Figure 3A, lower panel). The target class distribution of the screening set is similar to that of the larger probe collections of the theoretical and large-scale sets (K-S test, p = 0.19; Figure 3B). This indicates that the compound filtering did not miss any target class. As expected, targeted inhibitors of kinases and other enzymes are the most frequent, but there are also other well-covered target classes, such as membrane receptors, epigenetic regulators,  Figure S7 for the other activity thresholds). (B) The distribution of target classes of compounds in the screening set and theoretical/large-scale sets (that contain the same set of targets). The numbers next to the bars indicate the percentage of targets in the two collections of different sizes. The target classification of proteins was extracted from ChEMBL. 25 (C) The number of targets associated with cancer types in the theoretical/large-scale, screening set and in the physical compound set that was tested in the patient-derived GBM stem cells in an imaging-based assay. The numbers above the bars indicate the percentage of targets in the two collections of different sizes. The disease associations were extracted from OpenTargets, 31 with overall association score >0.5 (https://www.opentargets.org/). iScience Article and ion channels ( Figure 3B). When zooming into kinases, the screening set covers all the major kinase families, and only 3 of the kinase targets present in the theoretical/large-scale sets are missed in the smaller screening set ( Figure 3D). Importantly, the compound sets cover similar proportions of targets implicated in multiple types of cancers (K-S test, p = 1.0), with breast and lung cancers having the largest numbers of targeted compounds followed by glioma ( Figure 3C).
The target distribution across cancer types remained very similar in the constructed physical library, which included 325 of the screening set compounds, and was used here in the pilot screening study on patientderived GBM stem cells (assay compounds, Figure 3C), described in the next section. At the time of the curation, 325 compounds from the optimized library were sourced first from collaborators and then from multiple commercial providers. The remaining 886 compounds could not be sourced as they were not available from any vendor, were prohibitively expensive, or the lead time for acquisition of the compound was excessive. In these cases, we selected the most potent and selective inhibitor instead, yielding a completed C 3 L of 789 small molecules, covering 1,320 protein targets and 79% of the cancer-associated targets.

Identification of compound responses across GBM subtypes in primary patient cells
In a pilot screening application to identify therapeutic vulnerabilities in GBM, we carried out phenotypic profiling of patient-derived stem cell lines from 6 GBM patients covering the three GBM subtypes: 33 classical (n = 2), proneural (n = 2), and mesenchymal (n = 2). The imaging-based compound testing assay quantifies the phenotypic effects of each compound by tracking cell proliferation or survival in response to the 789 compounds in the C 3 L physical library, each tested with 4 concentrations (see STAR Methods for details). These compounds and their 1,320 anticancer targets cover 27% and 79% of the compound and target spaces of the screening set, respectively. In this pilot screening study, we used a single phenotypic readout for further analysis: the survival fraction and its Z score, calculated based on nuclei counts in treated, untreated, and negative control DMSO samples ( Figure 4A). The lower these readouts are, the more effective is the compound in the sense that fewer cancer cells survived the tested treatment.
We developed a data-driven method to define an activity threshold separately for each patient and dose, based on mixture modeling and parameter estimation ( Figure 4B, see STAR Methods for details). Using a cutoff of 0.01 percentile of the background distribution, we identified as strong hits those compounds with responses under the data-driven activity thresholds across all the 4 concentrations. When using either the survival fraction or Z score at 0.01 percentile, we identified 10 strong response compounds from the screening set that cover multiple target classes. The benefit of such a data-driven threshold is that it provides an objective way to select patient-specific compounds that show strong activity at various concentration levels since it is estimated from the distribution of the measurements. Notably, the activity thresholds varied between the patients and dose levels, supporting the need for data-driven approach ( Figure 4B).
To summarize the activity of a compound across the 4 tested concentrations, we calculated the area under the dose-response curve (AUC) for each compound-patient pair using the survival fraction as the response readout ( Figure 4C, upper panel). The patient-specific response profiles showed a relatively large variability between the cell lines, even from the same GBM subtype, and the compound response profiles did not cluster according to the GBM subtypes ( Figure 4C, middle panel). This supports the notion that GBM is a heterogeneous disease, also in terms of phenotypic responses to compound treatment, hence requiring patient-specific treatment approaches. The identified 10 strong hits included 3 antineoplastic and immunomodulating agents, such as approved multiple myeloma treatments bortezomib and carfilzomib, which showed an overall high potency in all the 6 GBM patient-derived cell models ( Figure 4C, bottom panel).

Screening hits identify patient-specific vulnerabilities and target pathway activities in GBM
When investigating the 10 strong hit compounds across the concentration levels and patient samples, one can see an expected dose-response relationships, where higher concentrations lead to increased responses of the compounds ( Figure 5A). However, the compounds have differing activity profiles both within and between the GBM subtypes. For instance, dinaciclib, an investigational compound that inhibits cyclindependent kinases (CDKs), showed high activity in multiple patients already at lower concentrations (30nM), except for one proneural subtype patient. The broad efficacy of dinaciclib was shown recently also in 2D and 3D GBM models and in long-term culture experiments. 34  iScience Article compared to proneural and classical subtypes. 35 Notably, nicotinamide phosphoribosyltransferase (NMPRTase) inhibitor daporinad demonstrated highly selective response in a single patient of classical subtype, while the other patients had much lower responses to daporinad at all concentrations ( Figure 5A).
Pathway analyses among the protein targets of strong hit compounds demonstrated that these compounds modulate multiple cellular pathways, including immune system, signal transduction, neuronal To enable others to explore these data and freely set other data-driven thresholds for the compound selection, we have implemented a web application (www.c3lexplorer.com) that allows the users to interactively explore the signaling and other pathways targeted by the selected compounds for additional analyses and hypothesis generation.

DISCUSSION
Phenotypic screening provides a systematic approach to identify novel small molecules as promising future therapeutics, especially when focusing on realistic disease models and relevant target landscapes. 36 In this work, we developed various analytical procedures for the construction of targeted anticancer compound libraries, designed to interrogate a wide range of potential cancer-related targets for phenotypic screening. We started the systematic library construction from more than 300,000 molecules, by compiling comprehensive chemical, activity, and target information, and ended up with a physical screening set of 1,211 molecules with pre-defined cellular activity, biological and chemical diversity, target selectivity, and compound availability. This reduced the compound space by 150-fold, whereas iScience Article the remaining compounds in the screening set still covered 84% of the target space of 1,655 cancer-related proteins. The remaining 16% represent targets that need further development due to a lack of either enough specific and potent chemical probes (2%) or compound availability for sale (14% at the time of library curation).
We applied the C 3 L design principles to imaging-based phenotypic profiling of glioma stem cells from GBM patients, using a pilot library of 789 compounds, and identified patient-specific compound responses and target pathway activities. We anticipate these systematic library-design principles and the resulting broadly annotated small-molecule libraries will prove useful for the community in various phenotypic screening experiments in GBM and other cancers. We have made the compound libraries and their annotations available in the form of open-access data sheets (https://github.com/PaschalisAthan/C3L) and implemented a web platform that enables others to interactively explore the screening library and corresponding compound-target interaction networks, as well as screening data and pathway activities using user-defined activity thresholds and other filtering criteria (www.c3lexplorer.com). This enables the community to access these data and make further analyses, e.g., drug class-specific responses or for identifying combinatorial treatments to target multiple pathway redundancies and patient-specific intra-tumor vulnerabilities that are urgently needed for treating patients with GBM. 37 In the field of precision oncology, there have been several efforts to systematically catalog all of the genes implicated in cancer, 38,39 and several in-house, commercial, and governmental libraries of AICs targeting many of the most heavily investigated proteins are readily available. 14,15,40,41,[41][42][43][44][45][46] However, even though these commercial or in-house libraries have been well-utilized in many anticancer screens and drug repurposing studies, 16,17,47,48 there is a need for a comprehensive, target-annotated, anticancer library optimized for selectivity and diversity against all known oncology targets. There are well-annotated libraries for specific classes or applications, such as for kinase inhibitors, 49 metabolic modifiers, 20 and drug repurposing, 50 but what is lacking is a comprehensive and high-quality probe library that provides a starting point for various anticancer screening applications. Despite the emergence of new treatment modalities, such as monoclonal antibodies, molecularly targeted small molecules are, and will most likely remain, the most prolific anticancer therapeutics for the foreseeable future that are also applicable in advanced stages of the disease, for instance, after tumor has progressed toward metastatic disease and when the cancer has become chemo-or radiotherapy resistant. [6][7][8] The development and implementation of well-curated and comprehensive screening libraries are expected to facilitate the identification of novel anticancer drug targets, drug combinations, synthetic-lethal interactions, and therapeutic targets to overcome cancer resistance that goes beyond the recurrently mutated genes. 51 To enable cross-comparisons, we have carefully annotated our libraries with multiple compound IDs for molecules and target annotations; these include unique ChEMBL ID that makes it easy to extract additional information of the compounds, compound generic name of the active ingredient, chemical abstracts service (CAS) registry number for sourcing the compounds from potential vendors, and simplified molecular input line entry system (SMILES) for describing the structure of chemical species for further chemical analyses. We have also carefully annotated the active on/off-target space of the compounds using multiple databases, such as ChEMBL, DrugBank, and DrugTargetCommons, using both gene accession and gene symbol of the target proteins. Such annotations make it easy for others to further characterize the chemical and target spaces of the compound libraries, as well as compare with the existing and emerging libraries to allow for comparative analyses of their shared and unique chemical and target spaces for future library developments and screening applications.
We formulated the compound library construction as an MOP problem, where the aim is to simultaneously maximize the anticancer target diversity and compound potency (or selectivity), while minimizing the number and cost of the compounds in the physical screening set. This is important because a targeted library enables a more thorough evaluation of the anticancer potential in advanced, physiologically relevant cancer models, such as multiple genetically distinct patient-derived cell panels. With such a focused yet diverse-enough library, comprehensive screening paradigms including drug combinations, sequencing, and scheduling can be more easily incorporated into a primary phenotypic screening pipeline, achievable by many academic screening facilities. 52 In the present work, we used fast and heuristic procedures to select the compounds with a pre-defined potency against the selected targets that presented with sufficient structural dissimilarity, as implemented in several filtering procedures. Designing an optimal compound library of small molecules is challenged by the compound promiscuity; that is, many compounds may modulate their effects through multiple protein targets with various degrees of potency in a given context. For instance, kinase inhibitors are notorious for their target promiscuity and wide polypharmacological effects across various target classes beyond kinase families. 54 The wider target selectivity of many compounds remains still uncharted, 54,55 and therefore the phenotypedriving targets of many compounds are currently still unknown in many cellular contexts or disease applications. Compared to other compound libraries that focus more on compound selectivity, C 3 L was designed based on dual objectives: (i) to identify compounds that are effective against GBM or other cancers (either individually or in combination) and (ii) C 3 L function as a target-identifying or hypothesisgenerating library for follow-up studies. In the first application, target specificity is not as important as compound efficacy. In the second application, specificity is more important but not a necessary condition for hypothesis generation. Systematic study of cross-reactivity of the compounds will be needed to further investigate the target selectivity of the molecules in the current screening set across various target classes and cell contexts, beyond the rather limited data for target activity currently available in public databases.

Limitations of the study
We highlight below some potential caveats of the present work and how the library design could be further improved in the future studies. In addition to the potency and selectivity, chemical similarity of the compounds is another important feature of library design, especially if the aim is to have a collection of not too similar small molecules that target the cancer-related proteins of interest. In addition, further phenotypic and clinical characterization of the compounds, as implemented in Drug Repurposing Hub 50 and ChemicalChecker, 56 could aid several drug discovery tasks, including target identification, mechanism of action (MoA) classification, and library characterization. Integrated analysis of chemical, molecular target, cell-based profiling, and clinical information could also be useful to provide a relevance ranking of the targets and compounds in a library, once the disease application is defined, hence adding one further dimension to the MOP problem. Additional filtering steps to aid the eventual screening applications include, for instance, removal of interference compounds, finding on-target compounds with diverse scaffold profiles, and exclusion of compounds with known adverse off-target activities. The relatively small size of the physical screening library used in the pilot screen may miss some relevant compounds and targets in GBM. To investigate and improve the relevance of the hits, follow-up phenotypic screens with full morphological cell painting profiling are warranted.
The commercial availability of the compounds was a limiting factor that reduced the physical library size significantly. Inclusion of synthesizable compounds from the theoretical set can further increase the diverse and coverage of the screening set in the future versions. Even though screening of compounds with overlapping primary targets increases the cost, having multiple compounds with overlapping primary targets may become beneficial in specific applications, e.g., for validating the screening hits from high-throughput screens, target deconvolution of the phenotypic responses, or for discovering target activity differences when screening at multiple concentrations. The future applications in phenotypic screening, either in established cell lines or patient-derived cell models, will define the usefulness of any compound collections and libraries and the cost-benefit trade-off of compound libraries with various sizes and target overlaps. We expect that the herein presented comprehensive libraries with curated target information and the pilot screening data in patient-derived cells will provide new opportunities for many exciting drug discovery applications in GBM and other cancers, including novel target identifications, target deconvolution, drug repositioning, and drug combination prediction and testing. 57 Follow-up studies of the screening hits may lead to the identification of chemical starting points, new target-directed drug discovery programs, or approval of new therapeutics for GBM and other cancers.  For the cell passaging procedure, the media should be removed and washed with 15ml PBS. This is followed by aspirating PBS and adding 1ml accutase in order to coat the bottom of each flask. To ensure complete coverage the flask should be rocked. Then, incubation follows for 3-4 mins at 37 C and 5% CO 2 . Then, 9ml of wash media is added, and the contents are transferred to 15ml conical tube, appropriate for centrifuge (for 3 mins at 1.5 rpm). The supernatant is then aspirated and the pellet is resuspended in 10ml complete media, EGF, FGF and Laminin (4ug/mL) in a T75 flask. The final volume of media is 15ml and 45ug laminin (150ug laminin for E57 cells), but the split can be done as preferred. We note that one should not split more than 1:6 (usually 1:3 a few days before plating) and not allow the split to become (approximately) >80% confluent. Yield is approximately 3x0 6 cells per T75 flask and viability is usually 80-90%. We also note that doubling time is approximately 24 (E57) to 60 (other lines) hours. Cells tend to form spheroids when becoming too confluent and indicates they should be split. Spheroids can be recovered and will re-adhere. Also, even if cells do not need to be passaged, fresh media should be refreshed after 4 days to encourage healthy growth. Lastly, cells grow faster with prolonged culture. For this reason, passage should exceed 30 and the lowest passage should be used for screening.

Cell maintenance -Freezing cells
The freezing media consists of wash media and 10% DMSO. Cells should be detached as described above, counted and resuspended in freezing media at approximately 1-5-2x10 6 cells per ml, and 1ml for thawing into a T25 flask. Then the cells should be transferred to a specialised freezing chamber and freeze overnight at -80 C, and as soon as possible to liquid nitrogen storage.

384-Well plate pre-coating with laminin
For the pre-coating of the wells, we start first by preparing a diluted solution of laminin (10ug/ml) in complete media. Then a ViaFill dispenser is prepared under sterile conditions, and is followed by adding solution of 20ug/well and incubating at 37 C and 5% CO 2 for a minimum of 2 hours. Then 25uL of cell suspension (without aspiration) is added at 500-1500 cells/well.

384-Well plate seeding
Cell suspensions should be prepared as described above, followed by pelleting cells and resuspending those in wash media. The cells are then counted. A ViaFill dispenser is prepared under sterile conditions, and the cell suspension is prepared in complete media (without laminin), seeding at 1000 cells/25ul (E3, E21, E31, E34), 500 cells/25ul (E57) and 1500 cells/25ul (E28), and adding 25ul per well, resulting in a final volume of 45ul (4ug/ml laminin). Incubation for 30 min at RT is followed, and then plates are transferred to an incubator at 37 C and 5% CO 2 , and are incubated overnight prior to dosing with the library. We noted that for positive control, Staurosporine (1uM) is used for cell death and all wells have final DMSO concentration of 0.1% (w/w).

384-Well plate fix/stain/acquisition
Materials/Equipment Required (see Table S5). ll OPEN ACCESS iScience 26, 107209, July 21, 2023 iScience Article Fixation First the plates are removed from the incubator and are allowed to cool for approximately 10 minutes. A fresh solution of 8% formaldehyde in PBS is prepared. An equivalent volume of fixative is added and incubated at RT for 30 minutes. Then, the entire well is aspirated and washed 3 times with PBS (see Table S6).

Permeabilisation
For the permeabilization, the plates are washed 2 times with PBST, incubated in PBST at RT for 20 minutes (minimum) and then aspirated (see Table S7).

Hoechst staining
For the staining, the plates are washed 3 times in PBS. Then Hoeschst is added in PBS (1:5000) and then incubated for 30 min in RT in dark. Following that, the plates are washed 3 times with PBS and Hoechst and are sealed and stored at 4 C. Nuclei counts are carried out via built in software module (Molecular Devices), followed by a preliminary univariate analysis using TIBCO Spotfire Analyst, normalized to DMSO wells/plate (see Table S8).
iScience Article Extending the compound space (PS2) To further extend the compound space, the annotated primary targets from the pan-cancer studies were queried across publicly available compound bioactivity repositories. [25][26][27] To include also potential off-targets of the compounds, we used a relatively liberal activity threshold of % 1000 nM for multi-dose bioactivity (IC 50 , EC 50 , K i or K d ) to identify the compounds having cellular activity against these targets. In case of multiple entries corresponding to the same compound-target interaction, the median bioactivity value was recorded, similar to a previous study. 65 This curation procedure resulted in 141,087 unique compounds against 441 targets (Table S12).

Collection for the mutant target space (PS3)
In addition to investigating the wild type targets implicated in various cancer types, we also identified compounds with cellular activity against their corresponding mutant variants. The mutation information of the annotated targets of Section Pan-cancer probe collection (PS1) were retrieved from the COSMIC Database. 28 We used various types of variants:

Insertion -in frame
Complex -deletion in frame Substitution -nonsense

Deletion -in frame
We queried these mutant targets across the existing data repositories ChEMBL 25 and DTC, 26 and the corresponding compounds were compiled using the criteria similar to those described above. The resulting mutant collection consists of 944 unique compounds targeting 293 unique mutant targets (Table S13).

Extending the target space (PS4)
To further extend the target space, we queried public repositories for cancer-related targets, their first neighbors and influencers in four cancer types with high mortality rate (colon, breast, hepatocellular, and non-small cell lung cancer). Such extended ''targets'' are suggested to influence cancer pathogenesis and therefore to increase the drug target space for anticancer therapies. 29 Here, a target was considered as a cancer-related gene, when the corresponding protein was either mutated or had a differential expression in cancer. A first neighbour of cancer-related protein was defined as a protein that is directly and physically interacting with a cancer-related protein in human interactome or signalling networks, 29 according to the databases SignaLink v2, 66 Reactome, 67 HPRD, 68 DIP, 69 IntAct, 70 BioGrid, 71 or in a cancer signalling network. 72 An influencer protein was defined as a protein that has a direct interaction to one of the first neighbours. After combining these three subsets, an overlap analysis with the cancer-related targets was performed to identify a total of 1115 unique targets. The non-overlapping targets were queried in ChEMBL 25 and DTC 26 to find a total of 208 653 potent compounds following the same procedure as in the previous sections (Table S14).

Reducing the number of compounds
The full theoretical EPC probe set consists of a total of 336 758 compounds against 1655 unique protein targets (Figure 1). We next reduced the number of compounds to a more manageable size and constructed a compound library that is more attractive to academic screening projects using several filtering procedures, each with freely adjustable cut-off parameters that determine the stringency of the compound filtering process. Figure S4 summarizes the filtering steps and the specific filtering parameters we set in this study to produce the large-scale probe set.

Target-specific activity filtering
The first compound filtering technique was to apply a target-specific activity threshold to reduce the number of compounds from the theoretical set with the following steps:

Reducing the compound space
The second step in the physical screening library size reduction was to pick only a single compound for each target, with the aim to come up with the smallest number of most potent compounds that cover the same target space. More specifically, the compound that had the lowest bioactivity value among the multi-dose activity types (IC 50 , EC 50 , K i , K d ) was selected for the particular target, since that compound is assumed to have the highest binding potency against the target. The different activity types were treated equally in this process due to sparsity of bioactivity readout data.

Compound availability filtering
The final step for designing the physical screening library was to include only those compounds that are available for sale from at least one vendor using the information from ZINC15. 74 More specifically, if a particular compound was not available for sale, at the time of library curation, then the next most potent compound that is available for sale replaced the original one in the screening library. Even though this step reduced the number of compounds in the final screening library by 52% of the original size, the targets' coverage remained at 86% of the original target space (see Table S16). Furthermore, the target activity distributions remained relatively unchanged ( Figure S1), and the differences originated mainly from the few of the most potent compounds.

Cell survival profiling of glioblastoma patient cells
We performed high-throughput screening of the GBM patient glioma stem cells with an image based, nuclei staining technique. In this pilot application of the C3L design principles, the cell-based assay was performed using 789 compounds in the physical library, which included 325 of the screening set compounds (27%). The rest of the screening set compounds were not timely available at the time of library curation, or they were prohibitively expensive to purchase from any vendor. To compensate for this, we sourced additional 464 theoretical set compounds, which were selected to maximize the target space. In total, the pilot physical library covered 1320 out of the 1655 targets (80%) of the screening set. In this pilot screening study, we sourced the compounds first through collaborators, and then from commercial providers, by focusing on the most potent compounds against as many of the 1655 protein targets for comprehensive cell-based testing. iScience  Survival fraction, Z score and AUC calculation While there are multiple imaging features, such as shape and granularity, that can be quantified for each cellular compartment (nuclei, cytoplasm, etc), here we used only one feature for further analysis: the survival fraction and its z-score (see Figure 4A) from staining the nucleus with Hoechst 33342. The survival fraction is calculated by dividing the nuclei count of the treated sample by the nuclei count of the untreated control sample (DMSO 0.1% (v/v)), and the z-score is the number of standard deviations from the mean survival fraction normalized to DMSO. The lower the survival fraction or the z-score the more effective is the compound in the sense that fewer cancer cells survived the treatment or that the survival score is lower than the mean survival score of the negative controls, respectively. To summarize the activity of a compound across the 4 tested concentrations, we calculated the area under the dose-response curve (AUC) using trapezoidal rule in NumPy 75 for each compound-patient pair using the survival fraction as the response readout (see Figure 4C).

Data-driven activity threshold definition
To set an activity threshold for strong hits based on the nuclei count measurements, we applied a mixture modelling 76 on the survival fraction distributions, separately for each concentration and patient (but combining all the GBM subtypes due to the small number of samples in each subtype, n=2). As expected for molecularly-targeted compounds, most of the responses appear around 1 in terms of survival fraction (see Figure 4B). To identify extreme compound responses falling outside of these background distributions, we estimated a mixture model with two Gaussian components, one component for the background responses (centered around 1 after parameter estimation), and the other distribution for the extreme responses (parameters freely estimated from the patient-specific survival fraction data). To identify compound responses far enough from the background distribution, we chose a cut-off of 0.01 percentile of the background distribution as the activity threshold to classify a compound as a strong hit (green dotted lines in Figure 4B). In comparison to a fixed threshold based on SF levels, or their z-scores, such data-driven activity threshold ensures that the hit selection does not only depend on the full distribution of the measurements, but the activity threshold is also patient-and concentration-specific, and therefore takes into account potential cell growth and dose-response differences. Finally, the compounds with responses under the activity thresholds across all the 4 concentrations were classified as strong hits for further analyses.

Pathway analyses and web-based data platform
The target annotations available in C3L were used to map the strong hit compounds to molecular pathways and cellular processes. We used Reactome pathway tool 67 and KEGG pathways 77 (through ShinyGO tool 78 ) to explore the pathways that are modulated by the strong hits using the pre-selected activity cut-off of 0.01 percentile of the background distribution (see the section Data-driven activity threshold definition above). We have also implemented an interactive web-application (www.c3lexplorer.com) that enables users to freely set various compound selection criteria, including data-driven activity threshold for the active compound selection, and then interactively explore drug-target interactions and pathways targeted by the selected compounds for additional analyses and hypothesis generation.

QUANTIFICATION AND STATISTICAL ANALYSIS
In the section Compound availability filtering, Kolmogorov-Smirnov test was used in Python to test whether there was any difference on the distribution of the compound-target activities before and after replacing the unavailable compounds in the screening set. A significance value of p<0.05 was considered to indicate a statistically significant difference.