Clinical forecasting of acute myeloid leukemia using ex vivo drug-sensitivity profiling

Summary Current treatment selection for acute myeloid leukemia (AML) patients depends on risk stratification based on cytogenetic and genomic markers. However, the forecasting accuracy of treatment response remains modest, with most patients receiving intensive chemotherapy. Recently, ex vivo drug screening has gained traction in personalized treatment selection and as a tool for mapping patient groups based on relevant cancer dependencies. Here, we systematically evaluated the use of drug sensitivity profiling for predicting patient survival and clinical response to chemotherapy in a cohort of AML patients. We compared computational methodologies for scoring drug efficacy and characterized tools to counter noise and batch-related confounders pervasive in high-throughput drug testing. We show that ex vivo drug sensitivity profiling is a robust and versatile approach to patient prognostics that comprehensively maps functional signatures of treatment response and disease progression. In conclusion, ex vivo drug profiling can assess risk for individual AML patients and may guide clinical decision-making.


In brief
Predicting the response to chemotherapy for treatment of AML is a major challenge.Mutation analysis and gene panel diagnostics are routinely used to predict treatment outcome, but the predictive power of these methods is limited.Here, Andersen et al. demonstrate that ex vivo drug profiling can assess risk of individual AML patients and may guide clinical decision-making.

INTRODUCTION
Acute myeloid leukemia (AML) is a heterogeneous cancer where the clonal expansion of myeloid progenitor cells (blasts) in the bone marrow and peripheral blood interfere with healthy hematopoiesis, resulting in immunodeficiency, thrombocytopenia, and anemia. 1 The current 5-year survival rate of patients over 60 years of age is estimated to be 10%-15%. 2Most treatment-eligible individuals receive standard induction chemotherapy, which consists of a 3-day treatment with either 60 mg/m 2 daunorubicin or 10-12 mg/ m 2 idarubicin and 100-200 mg/m 2 cytarabine (Ara-C) intravenously for 7 days. 3Survival rates of older patients have not substantially improved over the past decades, underscoring MOTIVATION Ex vivo drug-sensitivity screening may be used to assess treatment options and predict treatment response.However, the effectiveness of this approach relies on the preservation of clinical cancer characteristics in ex vivo cell systems, and systematic data on accurate representation of drugresponse profiles are lacking.Here, we used forecasting of clinical outcomes as a standard for exploring optimal scoring of drug sensitivities and management of noise and technical confounders.We show that ex vivo drug-sensitivity profiles robustly predict AML patient survival and provide valuable insights about treatment response and disease progression.
the need for better clinical assessment and more accurate prognostic approaches.
Current risk stratification methods such as the European LeukemiaNet (ELN) guidelines for AML stratification are mainly based on parameters such as molecular pathology and genomic alterations. 45][6] Various non-genomic methods have been developed for identification of clinically and biologically relevant molecular subtypes for risk stratification, such as flow and mass cytometry, transcriptomics, and proteomics, [7][8][9][10][11][12] although clinical implementation of these methods has generally been slow.
Recently, ex vivo drug sensitivity profiling has been used as a precision medicine approach to identify potential compounds that may be repurposed for the treatment of various types of cancer, including AML. [13][14][15][16][17][18][19][20] Here, leukemic cells derived from bone marrow aspirates or peripheral blood are incubated in the presence of various drugs at different concentrations.Drug responses are then quantified, and overall drug efficacy is scored from a dose-response relation.][15][16][17][18][19] The validity of drug sensitivity profiling requires that clinically relevant characteristics are conserved in ex vivo analyses; i.e., drug responses should reflect cancer dependencies, and drug sensitivities should be measurable reliably despite potential technical confounders and noise.2][23][24] A common procedure is to use model curve fitting to de-noise dose-response data and summarize the drug sensitivities with the half-maximal effective concentration (EC 50 ) or the area under the curve (AUC). 25,268][29] However, curve fitting of incomplete or non-sigmoidal drug responses remains a challenge for largescale drug testing, and single output metrics from Hill models tend to yield incomplete information about the biological variability of a given drug response. 25,30Therefore, streamlined analysis pipelines typically offer multiple alternative dose-response evaluation and error-reporting tools. 23,313][34][35] In this study, we evaluated the clinical information available in ex vivo drug sensitivity profiles within a cohort of 69 AML patients by using several statistical techniques and machine learning routines.Regularized regression was used to evaluate the clinical forecasting potential and risk-interpretability from ex vivo drug sensitivity profiles.We show that drug sensitivity profiling can be used as a tool for clinical forecasting and decision-making.

Study approach, quality control, and comparison of metrics
To evaluate the clinical utility of ex vivo drug sensitivity profiling in AML, we systematically performed drug screens on bone marrow or peripheral blood samples from patients that were obtained at the time of diagnosis (Figure 1A and Table S1).For high-throughput curve fitting and multiparametric scoring of drug sensitivities, we used the Breeze pipeline, which performs Hill curve fitting on inhibitory responses to compute several drug sensitivity metrics, such as the EC 50 , toxicity EC 50 (TEC 50 ), and DSS1, DSS2, and DSS3. 23,27Here, DSS1 is the relative integral for Hill curves that exceed 10% inhibition, DSS2 adjusts DSS1 to normalize for strongly toxic responses at high concentrations, and DSS3 adjusts DSS2 to give weight to responses with high dose sensitivity (Figure 1B, see STAR Methods).We also used Breeze to compute the AUC under a locally estimated scatterplot smoothing (LOESS) curve fit (loess-AUC), as a more flexible model-free alternative to the Hill equation.In addition to these metrics, we calculated a raw AUC based on a stepwise normalized rectangular area under the relative viability dose response along a logarithmic concentration scale (rAUC; Figure 1B).Since the distributions of relative viabilities are positively skewed, we also performed a negative log 2 -transformation of the rAUCs, resulting in a weighted-average log 2 fold change in cell viability (rAUC-log 2 ).This step was equivalently performed on the loess-AUCs.
To obtain a measure of the quality of the drug screens we computed plate-wise Z 0 -factors and analyzed drug sensitivity profile correlations between patients to detect potential outliers (Figures S1A-S1D).Although some patient-specific plates had a suboptimal Z 0 -factor (Figure S1A), all negative plate controls (DMSO) were at least two standard deviations above the positive plate controls (BzCl, Figure S1B), and we did not observe any consistent association between plate control noise and patient profile correlations (Figure S1C).However, we did find substantial differences in profile correlations for different metrics, with DSS1-3, rAUC-log 2 , and loess-AUC-log 2 yielding the best inter-patient correlations (median Pearson correlation coefficient [PCC] over 75%) and TEC 50 and EC 50 giving the worst (Figures S1C and S1D).Furthermore, intra-patient profile correlations between treatment-naive and relapsed samples were higher than inter-patient correlations (Figure S1D).Comparison of the different metrics revealed that low-confidence Hill curve fits reported by Breeze tend to be associated with low-sensitivity dose-responses (EC 50 > 10 À7 mM) (Figure S1E), as well as with responses yielding lower correspondence between different AUC metrics.Despite this, the overall correlation between the different metrics was high, indicating that inferences made from the drug screen dataset were sufficiently robust for these analysis methods.
Assessing the predictivity of drug sensitivity metrics Median survival in the patient cohort was 1,177 days (Figure 1C).To assess ex vivo drug sensitivity profiling in predicting patient survival, we compared several regularized Cox models trained on the different drug sensitivity metrics (Figure 1D).Predictive accuracy was assessed using a C-index over multiple randomized assignments of patient test data (Figure 1D).We also tested different regularization types, as Ridge and Elastic net regression tend to perform better than Lasso when there are grouped correlations between features (which is expected for drugs with similar modes of action). 36AUC-based metrics outperformed EC 50 and TEC 50 in predicting patient survival regardless of regularization technique (Figure 1E).Interestingly, the loess-AUC, which had a strong linear correlation with the rAUC (Figure S1E), improved the survival prediction for Elastic net and Lasso models, suggesting that de-noising dose-response data using a LOESS curve fit has a beneficial effect (Figure 1E, left).This was also the case for the DSS metrics for all three model types, with DSS3 outperforming DSS1 and DSS2 and resulting in lower overfitting (Figures 1E and S1H).Strikingly, log-transforming the AUCs resulted in a substantial improvement in prediction accuracy, performing at least as good as DSS3 (Figure 1E, left), indicating that distributional skewness in relative viabilities has a negative impact on appropriate drug sensitivity scoring.
To counter potential batch effects, we standardized the drug sensitivity distribution for each patient (Figure S1F).This technique transforms the drug sensitivity metrics to a respective Z score, assuming that the major differences in patient distributions mainly reflect technical influences and that only changes in the relative magnitude of specific drug sensitivities are relevant.This operation further improved the prediction accuracy for the AUC-based metrics, with rAUC-log 2 and loess-AUC-log 2 yielding the best performances with respective median C-index scores of 71% and 74% using Lasso (Figure 1E, right).In contrast, patient standardization of the DSS metrics had little effect on the predictivity and caused a slight reduction in performance for models trained with the Lasso penalty (Figure 1E).To better understand these effects, we measured the inter-drug profile correlations, which revealed library-wide multicollinearity that was removed when standardizing the metrics but maintained to a certain degree for drug pairs targeting similar proteins (Figure 1F).Here, DSS1 and DSS2 showed a slightly stronger correlation than DSS3 and other AUC metrics.
Scaling the drug distributions (Figure S1F) to equalize the penalty of drugs with differences in drug sensitivity variance caused a reduction in prediction accuracy using most metrics except for EC 50 and TEC 50 , where it caused an improvement (Figure S1I).This suggests that drugs generating weak and noisy responses have a negative impact on the model performances.Given the large number of drugs relative to the number of patients, we therefore tested the effect of pre-selecting features ranked by their standard deviations (to remove weak drug responses) as well as a greater range of Elastic net penalties (Figure S1G).These operations, in particular feature pre-selection, improved the survival predictions further, with rAUC-log 2 yielding C-index averages over 77% for several Elastic net models (Figures 1G  and S1J).Altogether, these results show that there is high prognostic value in patients' ex vivo drug sensitivity profiles.

Versatility of clinical forecasting and integration with AML biomarkers
To evaluate the prognostic versatility and robustness of ex vivo drug profiling, we also tested the forecasting potential of other clinical outcomes, including survival status after different periods.We found that the accuracy of AUC-based metrics at predicting the initial response to induction therapy (as measured by the presence of persistent leukemia in the patients), long-term survival, or probability of relapse was limited (Figures 2A and  S2A).In contrast, the survival rates during the first 2 years after diagnosis, which is characterized by a sharp reduction in survival (Figure 2B), were predicted by AUC-based metrics with very high accuracy (Figures 2A and S2A).Furthermore, while rAUC-log 2 and loess-AUC-log 2 excelled over DSS metrics at short-term survival prediction, DSS3 robustly performed better than all other metrics in predicting long-term survival (Figures 2A and  S2A).These results show that drug sensitivity profiles predict well the initial treatment phase but also have potential for longterm survival forecasting.
We next compared the effectiveness of ex vivo drug sensitivity profiling in predicting survival to the predictiveness of other AML patient data, such as genetic features and/or prognostic clinical features that included ELN risk stratifications, age, and sex.We also performed the analysis on 2-year truncated short-term survival data to evaluate the performance on the initial treatment response phase.
As expected, due to coverage sparsity, the Ridge models performed better on the genetic and clinical feature sets (Figure 2C).The genetic feature set provided a prediction with a median C-index of 59% overall and 60% short term, which is close to what has been reported previously using much larger cohorts. 37Combining the different clinical feature sets did not improve predictivity, most likely due to redundancy in information and low sample size (Figures 2C, S2B, and  S2C).Interestingly, ex vivo drug screening performed at least as well as or better than the genetic or clinical feature sets at predicting treatment outcome (Figure 2C).Combining genetic or clinical feature sets with the entire drug dataset did not substantially improve predictivity, and combining genetic features with a reduced drug dataset using feature pre-selection only resulted in a slightly improved but non-significant combination effect (Figure 2C, left lower panel).Furthermore, the feature-reduced rAUC-log 2 dataset showed prediction accuracy of 83% for two-year survival (Figure 2C, right lower panel), which did not improve further when combined with the genetic or clinical feature sets.There were also no complementary effects between the datasets when testing for other clinical outcomes (Figure S2D).This marked a 23% improvement over the prediction using genetic features alone.Since drug sensitivities may change during the progression of a disease and in response to treatment, we evaluated the survival prediction on relapsed patient samples and compared to their treatment-naive counterparts.Here, the survival models were trained by completely excluding data from both treatment-naive and relapsed samples for the patients that were tested.Interestingly, despite a major reduction in predictivity, standardizing the rAUC-log 2 scores, but not DSS3, was able to nearly rescue the performance on relapsed samples (Figures 2D and 2E), indicating that the relative risk prediction between patients sampled from the same disease stage is preserved.Surprisingly, this was despite a significant reduction in predicted patient hazard for relapsed samples overall (Figures 2F and S2E).This suggests that whereas the cancer dependencies relevant for initial survival prognostics of high-risk patients evolve during the course of therapy, the relative patient identity characteristics remain preserved.

Removal of confounding factors in ex vivo drug sensitivity profiles
Due to the major differences in clinical forecasting performance between the different drug sensitivity metrics, we performed principal component analysis (PCA; Figure S3A) to identify po-tential confounding factors that may cause variability in the datasets, such as artifacts associated with curve fitting, batch covariates, instrument and sample type, and biological/clinical covariates based on genetic and diagnostic phenotypes.
The first principal component identified over 20% of the total variance (Figure 3A, left) and was strongly associated with curve fit non-responders (>75%, Figure 3B, left).Moreover, noise in plate controls and curve fit error had strong associations with the first two to four components of DSS1, DSS2, and DSS3, whereas the rAUCs and loess-AUCs were affected across multiple components throughout the dataset (Figure 3B, left).Strikingly, for any drug sensitivity metric, the first few components had very weak association with biological/clinical covariates, suggesting that the major cause of variability in drug sensitivities stems from patient response averages, noise, and potential curve fit issues.These effects were rectified to some extent by standardizing the drug sensitivities per patient and removing weak and noisy responses using feature pre-selection, particularly for rAUC and rAUC-log 2 (Figures 3A-3C and S3B).These operations also increased the variance explained by biological/ clinical covariates, while decreasing the effect of other sample characteristics, such as number of patient non-responders, curve fit error, and noise in the plate controls (Figures 3B and  3C).For the rAUCs and loess-AUCs metrics, these results highlighted their sensitivity to noise (represented by curve fit error), which was strongly reduced when the metrics were standardized (Figure 3C).Interestingly, considering the strongest drug responses (with feature pre-selection) improved the relative variance explained by biological/clinical covariates only for rAUC-log 2 and loess-AUC-log 2 but not the DSS metrics (Figure 3C).To investigate whether this was related to loss of information from curve fitting or removing non-inhibitory responses, we performed Hill curve fitting and computed AUCs from the rectangular area under the predicted dose-responses (Hill-rAUC; Figures 3D and  S3C-S3E).The Hill-rAUC was further adjusted for non-inhibitory responses (Hill0-rAUC) for a more direct comparison with the inhibition-only curve fits from Breeze protocol (HillB-rAUC) and the DSS metrics (Figure 3E).All of these metrics were also log 2transformed to enable comparison with the corresponding non-curve fit AUCs.These operations indicated that there was little benefit on survival prediction from curve fitting alone and a further negative influence of zero-inflating non-inhibitory dose-responses (Figure 3F).Moreover, in particular, removing variation from non-inhibitory responses replicated many of the covariance characteristics observed for the DSS metrics using PCA (Figure S3G).

A
To test whether the dominant covariances in the data represented confounding factors that interfered with accurate predic-tion of patient survival, we performed a removal of confounding principal components (RCPC) procedure. 38For most drug sensitivity metrics, removing at least one or two principal components improved prediction accuracy (Figures 4A, 4B, S4A, and S4B), which was more profound for short-term survival predictions.Interestingly, both log 2 -transformation and standardization appeared to negate the beneficial effect of RCPC on rAUC and loess-AUC (Figures 4B, 4C, and S4B-S4E), which compared to DSS metrics had a sharper decline in prediction performance when removing a greater number of principal components (Figures 4A and S4D), indicating loss of valuable information about patient survival.Finally, standardizing the rAUCs and loess-AUCs resulted in a significant decrease in the number of component subtractions required for achieving optimal performance for short-term survival (Figures 4B and S4E).These results indicate that processed ex vivo drug screening data can contain consequential confounders that may bias the dose-response curve fit and drug sensitivity scoring, and that standardization or RCPC can be used as a straightforward and reliable method for diagnosing and de-confounding the data.

Drug sensitivity risk associations
Statistical analysis of the library-wide survival associations revealed that different models showed overall good correlation in their learned coefficients (Figures 5A, 5B, S5A, and S5B), where the strongest survival associations also corresponded to high contributions to a model's prediction accuracy (Figures 5C and  S5C).Given that many compounds in the library have overlapping targets, there was substantial redundancy in single-drug withdrawals with respect to model performance (Figure 5C).Indepth investigation of specific drug responses that were significantly associated with survival revealed that ex vivo sensitivity to anthracyclines predicted a favorable outcome (Figure 5A).Daunorubicin, one of the major components of 7 + 3 chemotherapy, had one of the most significant associations with survival, while idarubicin also had a favorable (albeit statistically non-significant) survival association.Epirubicin, another anthracycline used in treatment of several types of cancers, and the anthracycline analog mitoxantrone were also significantly associated with survival (Figure 5A).Interestingly, we found that ex vivo sensitivity to the Bcl-2/Bcl-XL/Bcl-w inhibitor ABT-263 (navitoclax) was strongly associated with increased risk (Figure 5A).ABT-263 is an experimental drug with a mechanism of action similar to the more selective Bcl-2 inhibitor venetoclax, which has been reported to significantly improve survival of AML patients who are ineligible for standard chemotherapy (please note that venetoclax had not been approved for AML therapy at the time of sample collection, which is why it was absent from the drug library). 39Strikingly, models trained on sets of treatment drugs alone achieved similar survival forecasting performance as the full library, and combining standard and alternate treatment drugs had a complementary effect on prediction accuracy (Figures 5D and S5D).This shows that improved prognostic value from drug sensitivity profiling can be achieved by carefully selecting drug libraries with non-redundant information about key cancer dependencies.Hierarchical clustering of learned risk associations against different clinical outcomes revealed interesting correlations (Figures 5E and S5E).For instance, sensitivity to the mTOR inhibitors rapamycin, everolimus, deforolimus, and temsirolimus was associated with initial positive response to induction therapy but also with increased risk of relapse and decreased long-term survival.This implies that dependence on mTOR may be an early prognostic marker for acquired resistance to chemotherapy.Similarly, whereas sensitivity to daunorubicin was most positively associated with short-term survival and response to induction therapy, it was also associated with increased risk for relapse (Figure 5E).
Enrichment analysis of drug targets associated with Cox model coefficients demonstrated a library-wide risk association of sensitivity to kinase inhibitors, including JAK and PI3K signaling (Figures 5F and S5F-S5H).On the other hand, relapse risk was associated with sensitivity to multiple receptor tyrosine kinase inhibitors, and sensitivity was significantly increased in relapsed samples (Figures 5F-5H and S5F-S5I).Conversely, sensitivity to topoisomerase II-targeting chemotherapeutics in treatmentnaive samples, which was associated with a favorable shortterm survival and increased risk of relapse, was decreased in samples from relapsed patients (Figures 5F-5H).Consistent with previous studies, 40,41 ex vivo sensitivity to agents that target anti-apoptotic proteins, particularly ABT-263, was strongly reduced in relapsed samples (Figure 5G), indicating an increased apoptotic threshold or development of cross-resistance.This was also accompanied with a library-wide negative correlation trend between differential relapse sensitivity and risk association in treatment-naive samples, where several prognostic drug markers for adverse response to standard treatment diminished in sensitivity after relapse (Figure 5G, lower panel).We conclude that ex vivo drug sensitivity profiles can serve as prognostic markers and may provide insight into relevant cancer dependencies and mechanisms of disease progression.

DISCUSSION
In this retrospective study, we used ex vivo drug sensitivity screening to predict the survival of AML patients subjected to standard chemotherapy.An important topic that has been the subject of much discussion is how to appropriately score drug sensitivities in large-scale drug screens.Although sigmoid-constrained curve fitting and response scoring is a valuable tool in large-scale drug screens, it is inherently sensitive to dose coverage in the dynamic range, and model parameters tend to be unreliable. 25,30,42Consistent with this, we observed that the EC 50 was inferior in predicting patient survival compared to various scoring metrics based on the average dose-response efficacy.To combat the noisy nature of ex vivo drug screens, a common approach is to only consider the area under inhibitory Hill-response curves that cross a certain threshold, exemplified by the DSS. 27We also observed that this compression of drug response variability resulted in zero-inflated data with substantial loss of information.Some of this is likely due to bias associated with Hill curve fitting, introducing confounding covariation that could be countered by removing the leading principal components in the data.
An alternative approach to the DSS that worked well without or in combination with model-free curve fitting was log-transforming averages of the relative viabilities.This is a common re-scaling procedure that when used in this context effectively reduces the leverage of noisy and weak data points around one or above (non-inhibitory responses) and improves the separability of strongly inhibiting drug responses.For these drug sensitivity metrics, which retained the full range of drug response variation, patient standardization (Z scoring) generally improved representation of clinical information in the data even further.This operation removes patient variation in general levels of drug sensitivity, which is responsible for substantial drug-wise correlations (multicollinearity) as observed by others. 43,44This property has previously been linked to biological differences in gene expression and where conditioning on this feature can improve the inferences between datasets. 43,44Here, using RCPC to remove this feature completely improved short-term survival forecasting, indicating that it served as a negative confounder for utility of specific drug sensitivity variations.Moreover, while the burden of confounding factors may vary between datasets, the use of PCA to investigate which sample features dominate the leading components can serve as a valuable tool to determine the approximate number of components to remove. 38This can further be used in conjunction with other batch correcting methods to harmonize data from different centers. 45n our study, we used a drug library that not only includes standard chemotherapeutics but also a wide variety of other compounds, and we identified informative ''drug sensitivity fingerprints'' that may not only help guide treatment choice but that can also provide valuable prognostic information.Importantly, merging the data from drug sensitivity profiling with information from genetic biomarkers did not further improve the predictivity of drug profiling alone.This suggests that predictive information in genetic profiles is already captured by ex vivo drug sensitivities and indicates that ex vivo drug profiling can be further developed into a prognostic tool through rational design of focused drug libraries that fully cover cancer dependencies.Examples of drugs that should be included in such next-generation libraries for risk stratification are BCL2, mTOR, and HSP90 inhibitors, which, as already discussed above, are clearly associated with risk.7][48] We are currently analyzing data from other large-scale ex vivo drug screening efforts, such as the FIMM and BEAT-AML datasets, [49][50][51] which we expect will identify additional compounds associated with risk.Such analyses will not only be important for inter-center validation of data but also for development of effective harmonization methods.
Although progress has been made to update recommendations for AML risk stratification and treatment guidelines, prediction accuracies in large and heterogeneous cohorts remain modest. 4,37One reason for this is that AML is in part driven by non-genetic alterations, 52,53 including epigenetic alterations and rewiring of metabolic pathways and signaling circuits, which are not easily detected by more conventional methods, such as genomics and transcriptomics.Functional ex vivo drug profiling may be better suited for identifying such cancer dependencies. 50,51Thus, with further research, we envision that ex vivo drug profiling can be a useful tool in the clinical decision-making process to stratify patients into treatment groups.
We conclude that ex vivo drug profiling reveals cellular dependencies leading to chemoresistance and cancer progression, and it robustly predicts patient survival and response to induction therapy.

Limitations of the study
This study was conducted within a single center on a relatively homogeneous and small patient cohort.Because AML is genetically a heterogeneous disease, the mutation profiles were sparsely distributed in the patient population, which resulted in low statistical power when cross-examining genetic data with ex vivo drug sensitivities for patient risk group stratification and treatment response prediction.Thus, future studies on larger patient cohorts from multiple centers will be instructive for evaluating the clinical forecasting performance of drug sensitivity profiles across different age groups and sub-classes of AML.This may also allow for development of more flexible multimodal machine learning strategies that generalize across cohorts and can provide biologically informed risk categorization based on the optimal combination of genetic and drug response screen results.
In this study, we focused on variations of intensive chemotherapy given as a first-line treatment in the clinic and not on targeted therapies.The accuracy of clinical outcome predictions may vary depending on the selected therapy and the preservation of treatment-specific cancer determinants to ex vivo cell cultures.Given that several targeted therapies have entered the clinic, it will be important to perform similar studies on these patient cohorts.Ideally, this will identify a focused set of drugs that are highly predictive of the outcome of chemotherapy as well as targeted therapies.Using the systematic approaches that we have outlined here on more heterogeneous patient groups subjected to a wider array of treatment options may further refine the use of drug sensitivity screens as a tool for stratifying patients in appropriate treatment groups.
EC50 to the maximum concentration for the models with R max under 25%.Before further use, both EC50 and TEC50 were log-transformed and mean-subtracted per drug.
The drug sensitivity score (DSS) is computed from an analytical solution to the integral ; We also used Breeze to compute an AUC based on a LOESS curve fit (referred to as loess-AUC), which models both growth promoting and inhibitory responses.For bidirectional Hill curve fitting, we modified the Breeze algorithm by setting R max = ½ À 100; 100, and thus allowing fitting of non-inhibitory responses.Hill-rAUC was computed by predicting the relative viability and using the rAUC formula as indicated before.
Patient-wise standardization of drug sensitivity metrics (s), was performed as follows to compute a drug sensitivity Z score: Here m pt and s pt is the drug sensitivity mean and standard deviation per patient respectively.For scaling of the drugs, a drug-wise standardization procedure was done over the drug sensitivity metrics or drug sensitivity z-scores.

Clinical data processing
The clinical data were grouped into two feature sets, with binarized dummy variables for categorical data.Clinical prognostic features included age at the time of diagnosis, sex, WHO patient performance status, and ELN2022 risk stratifications.Genetic features included specific mutations and chromosomal rearrangements with coverage for at least three patients.Missing data were zeroimputed.The survival times were computed from the date of diagnosis until the registered date of death, and censoring times were computed to the last recorded visiting date.Response to induction was binarized based on reaching blast clearance (remission) or not after induction therapy.Relapse was binarized for the sub-population of patients having reached complete remission.

Model training and cross-validation
Regularized Cox models were trained on the different drug sensitivity metrics and the binarized clinical variables using the glmnet package. 56Ridge, Lasso, or Elastic net models were trained by setting the penalty mixture parameter (a) to 0, 1, or 0.4, respectively, unless otherwise specified, and screening over a sequence of penalties using leave-one-out cross-validation (due to the low sample size).The model with the best average cross-validation (partial likelihood) deviance score was selected for testing or further analysis of model coefficients.The Cox model coefficients represent the change in log-hazard ratio (log-HR) as a function of drug sensitivity.Regularized binomial models were used for logistic regression of binarized clinical outcomes, and optimized using the same procedure.
Pre-selection of model features prior to model training was based on thresholds for standard deviations in drug sensitivity (as indicated), where the number of drugs with the greatest drug sensitivity standard deviations were selected.For a random sampling of features (p = 100), the rAUC standard deviation was used as sampling weights.

Model testing
For testing the models, a random sample of 10 patients was withheld from the training procedure, with a fixed proportion of deceased to surviving patients to maintain the approximate proportions in the complete dataset (5 deceased and 5 surviving for the full survival models, and 4 deceased and 6 surviving for the short term survival models).The prediction accuracy on test data was scored with Harrell's concordance index (C-index), which defines the proportion of patient pairs with concordance between their observed survival times and predicted risk.It is especially suited for clinical studies as it takes into account the censoring of patients where data is not recorded after a given point. 57The training and testing procedure was repeated 50 or 200 times, as indicated.For testing of models predicting binary clinical outcomes, a 5-fold training and testing procedure was performed and prediction accuracy was measured using an area under the receiver operating characteristic curve (ROC AUC).

Analysis of confounding factors
Principal component analysis (PCA) was performed using singular value decomposition (SVD) on scaled matrices of drug sensitivity metrics or z-scores, and removal of confounding principal components (RCPC) was performed by reconstructing the datasets after removing the leading singular values and singular vectors, followed by rescaling of the metrics (see Figure S3A).SVD was performed on scaled matrices of drug sensitivity metrics (or z-scores) such that The association of specific sample characteristics with a principal component was assessed using linear regression against v k and measured with the adjusted r-squared (r 2 adj ).The patient sample characteristics with the following features were evaluated: log average CPS signal for DMSO and BzCl; average CPS noise measured as the coefficient of variation for DMSO and BzCl as well as their SSMD; the number of curve fit non-responders measured as the number of DSS3 values equal zero or EC50 or TEC50 values at maximum; curve fit error features, which included the number of low-confidence curve fits reported by Breeze, and the average log-EC50 standard error, mean absolute error and max residual error per patient reported by Breeze; batch covariates, which included time period clusters for screen executions (defined using k-means clustering), instrument batch, cell seeding number (over or under 10000 cells) and blast source (PBMC or BM); biological/clinical covariates, which included genetic and cytogenetic features, age, sex, ELN2022 risk stratifications, clinical history (primary AML or secondary after antecedent myelodysplastic syndromes or other causes), and AML FAB classifications.Cumulative variance explained by a sample characteristic up to a component K was calculated as

Variable importance
The importance of individual drugs was assessed by their contribution to predictivity through a variable withdrawal test, and the statistical significance of the model coefficients.The contribution to predictivity was measured by training and testing models leaving out individual drugs one-by-one and computing the change in C-index from the model containing all drugs.The statistical significance of model coefficients was estimated by training 200 models on bootstrapped datasets and computing the mean and confidence interval for the fitted coefficients.

Model coefficient clustering
Estimated Ridge coefficients from models predicting different clinical outcomes were normalized by the standard deviation, and drugs with a normalized coefficient greater than three were selected for hierarchical clustering using euclidean distance and ward.D2.Before the clustering, the sign of coefficients for the different models were harmonized such that positives and negatives were associated with unfavorable and favorable clinical outcomes, respectively.

Enrichment test
Parametric enrichment analysis was performed using the GSEA function from ClusterProfiler based on ranked Ridge model coefficients or differential drug sensitivity z-scores, where drug sets were defined based on drug target or class association. 58ANTIFICATION AND STATISTICAL ANALYSIS The similarities in drug sensitivity profiles between drugs or between samples/patients in Figures 1F, S1C, S1D, S1E, S1C, and S3E, were measured using the Pearson correlation coefficient (PCC, or denoted as R in the scatterplots).
Statistical significance of drug-wise survival associations based on Ridge model coefficients in Figures 5A and S5, was determined based on the 95% confidence intervals, where significance (*) was determined when 95% of the bootstrapped coefficients did not include or cross zero.
For analysis of differential drug sensitivity z-scores between relapse and treatment-naı ¨ve samples in Figure 5G, statistical significance was evaluated using a paired t-test, with pairing based on patient identity (n = 8).Significance was indicated for p values % 0.05.Statistical significance of enriched drug sets in Figures 5F, 5H, and S5F-S5I were assessed using the permutation test option of GSEA.

Figure 1 .
Figure 1.Survival prediction from ex vivo drug sensitivity profiles (A) Study overview and workflow.(B) Drug sensitivity metric computation.The gray areas indicate how the rAUC and DSS are calculated.(C) Survival curve of the study cohort.Ticks indicate censoring.(D) Machine learning routine for testing survival prediction from different drug sensitivity metrics and data processing operations.(E) C-index results (200 tests) for Cox models trained on different drug sensitivity metrics (left) or drug sensitivity Z scores (right).(F) Pearson correlation coefficients of drug sensitivities between drug pairs having the same or different target.(G) Mean test C-index results (50 tests) for Cox models trained on rAUC-log 2 or DSS3 Z scores with various penalty mixture parameters (a) and feature preselection thresholds based on rAUC standard deviations.

Figure 2 .
Figure 2. Versatility of drug sensitivity profiling for clinical outcome predictions (A) Average ROC-AUC for classification of various binarized clinical outcomes using different drug sensitivity metrics.The values are averaged over all the ROC-AUC scores from models with different penalties and pre-selection thresholds in Figure S2A.(B) The first 2 years after diagnosis characterized by a 40% drop in survival was defined as an initial phase for short-term survival (STS) modeling using Cox regression.The STS models were compared with the full survival (FS) models that were trained on data from the entire study interval.(C) Test C-index results (200 tests) for Cox models trained on different dataset compositions based on clinical feature sets and rAUC-log 2 or DSS3 Z scores.The lower panels represent pre-selection of 40 features based on rAUC standard deviations, and the right panels represent prediction results from short-term survival modeling.(D) Test C-index on samples from treatment-naive and relapsed patients, for different models trained on rAUC-log 2 or DSS3, or their corresponding Z scores.The two right panels represent prediction results from short-term survival modeling.(E) Overview of the time of relapse in relation to the time of death for the eight patients in the relapse cohort.(F)Predicted hazard ratio (relative to population median) on samples from treatment-naive and relapsed patients for Elastic net models trained on rAUC-log 2 using 100 pre-selected features.Right panel represents short-term survival modeling.p values were computed using a paired Wilcoxon test.

Figure 3 .
Figure 3. Exploring confounding factors in ex vivo drug sensitivity profiles (A) Percent variance explained by principal components for different drug sensitivity metrics and Z scores.(B) Principal component variance explained by different patient sample characteristics for different drug sensitivity metrics and Z scores (to the right).(C) Cumulative variance explained by different patient sample characteristics, for different drug sensitivity metrics and Z scores, with or without feature preselection based on rAUC standard deviations to remove weak and noisy drug responses.(D) Correlation between rAUC and Hill-rAUC estimated from a stepwise normalized rectangular area under the relative viability dose-response predicted by a Hill curve fit.Hill0-rAUC represents only inhibitory responses by setting all non-inhibitory values to zero.(E) PCA plot of the first two principal components indicating the similarity between the different AUC-and DSS-based drug sensitivity metrics compared in the study.The direction (sign) of all the metrics were harmonized, and the PCA was done without scaling.(F) Test C-index (200 tests) for Cox models trained on Hill-based rAUC or rAUC-log 2 scores or corresponding Z scores (upper panels) and their computed C-index change from the non-curve fit raw rAUC or rAUC-log 2 counterparts (lower panels).

Figure 4 .Figure 5 .
Figure 4. Effect of removing confounding principal components on survival prediction (A) Mean test C-index (left four panels) or C-index change (right four panels) for different survival models trained on different drug sensitivity metrics or Z scores with different numbers of principal components removed.The C-index changes are computed from the respective reference datasets (RCPC = 0).The values are averaged over all the C-index test scores from models with different penalties and pre-selection thresholds in Figure S4A.The lower panels represent results for short-term survival (STS) models.(B) Number of components removed to achieve the highest mean test C-index for different drug sensitivity metrics or Z scores shown in Figure S4A.The lower panels represent results for short-term survival (STS) models.p values were computed using a paired Wilcoxon test.(C) Mean test C-index results(50 tests) for Lasso survival models comparing zero or the optimal number of components removed for 50 datasets generated under weighted random sampling of features for different drug sensitivity metrics or Z scores (shown in FigureS4C).The lower panels represent results for short-term survival (STS) models.p values were computed using a paired Wilcoxon test.
ðsdr;pt À m dr Þ sdr = P k u dr;k ffiffi ffi l p k v k;pt , where u dr,k and v k,pt are components of u k and v k which are left and right singular vectors of the principal axis k with variance l k .The variance explained by each principal component was computed as r k = lk P k lk .e3 Cell Reports Methods 3, 100654, December 18, 2023 Article ll OPEN ACCESS 2 adj;k .PCA on dose-response data was done such that ðy ðdr;cÞ;pt À m ðdr;cÞ Þ s ðdr;cÞ ;pt , where y (dr,c),pt represent the raw, processed or Hill fitted drug response or curve fit residuals of a patient (columns) and a drug at a specific concentration (rows).When indicated, standardization was applied over the columns.To test the removal of potentially confounding principal components, the drug sensitivity metrics (or z-scores) were reconstructed as s 0 dr;pt = s dr ;pt + m dr , where l represents the minimum number of principal components removed.RCPC testing was performed for different Cox models and different dataset sizes based on the feature pre-selection described earlier (Figures 4A, 4B, S4A, and S4B) and for Lasso trained on 50 randomly sampled datasets (Figures 4C and S4C-S4E).
Statistical significance was assessed using paired Wilcoxon test for Figures2F, 3B, 3C, S2E, S4B, and S4E.For predicted hazard ratio comparison between relapse and treatment-naı ¨ve samples in Figures2F and S2E, data points were paired based on patient identity (n = 8).For the other Fig., data points were paired based on methodological parameters (Figures3B and S4B) or random sample identity (Figures3C and S4E).Significance was indicated for p values %0.05.