Adaptive digital tissue deconvolution

Abstract
Motivation: The inference of cellular compositions from bulk and spatial transcriptomics data increasingly complements data analyses. Multiple computational approaches have been suggested, and recently machine learning techniques were developed to systematically improve estimates. Such approaches make it possible to infer additional, less abundant cell types. However, they rely on training data that do not capture the full biological diversity encountered in transcriptomics analyses; data can contain cellular contributions not seen in the training data, and as a result analyses can be biased or blurred. Thus, computational approaches have to deal with unknown, hidden contributions. Moreover, most methods are based on cellular archetypes that serve as a reference; for example, a generic T-cell profile is used to infer the proportion of T-cells. It is well known that cells adapt their molecular phenotype to the environment and that pre-specified cell archetypes can distort the inference of cellular compositions.

Results: We propose Adaptive Digital Tissue Deconvolution (ADTD) to estimate cellular proportions of pre-selected cell types together with possibly unknown and hidden background contributions. Moreover, ADTD adapts prototypic reference profiles to the molecular environment of the cells, which further resolves cell-type specific gene regulation from bulk transcriptomics data. We verify this in simulation studies and demonstrate that ADTD improves on existing approaches in estimating cellular compositions. In an application to bulk transcriptomics data from breast cancer patients, we demonstrate that ADTD provides insights into cell-type specific molecular differences between breast cancer subtypes.

Availability and implementation: A Python implementation of ADTD and a tutorial are available at GitLab and Zenodo (doi:10.5281/zenodo.7548362).


Estimate hidden cell proportions
The hidden cell proportions $c = (c_1, c_2, \ldots, c_n) \in \mathbb{R}_+^{1 \times n}$ of ADTD can be estimated analogously to Eq. (7) of the main manuscript as follows: […] where we used $\sum_{j=1}^{p} x_j = 1$ and assumed that the $\epsilon_i$ are small.

Optimization with respect to C
Next, we derive an estimate for $C$. Consider the ADTD loss function $L_{\mathrm{ADTD}}(C, x, \Delta)$ for given $x$ and $\Delta$: […] The latter expression allows us to reformulate the estimate of $C$ as a quadratic programming problem. Let $c = C_{\bullet i}$ and $y = Y_{\bullet i}$; then estimating $C$ can be achieved by minimizing $\tfrac{1}{2} c^T P c + q^T c$ with respect to $c$, subject to […] and $-c \preceq 0$, where […]. This procedure is performed for all columns $C_{\bullet i}$.

Optimization with respect to x
Finally, we derive an estimate for $x$, where we use the abbreviation $Z = Y - (\Delta \circ X) C$: […] The resulting problem is to minimize $\tfrac{1}{2} x^T P' x + q'^T x$ subject to $x \succeq 0$ and $\sum_{j=1}^{p} x_j = 1$, with $P' = 2 (c c^T) \Gamma$ and $q'^T = -2 c Z^T \Gamma$, where one should note that $c c^T$ and $x^T \Gamma x$ are scalars. Thus, this optimization problem also reduces to quadratic programming.
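The quadratic program for $x$ can be illustrated numerically. The following is a minimal sketch, not the ADTD implementation. It assumes that $\Gamma$ is diagonal with positive gene weights `g` (so $\Gamma = \mathrm{diag}(g)$) and that `c` is non-zero, and it replaces a dedicated QP solver with plain projected gradient descent onto the probability simplex. The function names `project_to_simplex` and `estimate_x` are hypothetical.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = 1}."""
    u = np.sort(v)[::-1]                      # entries in descending order
    css = np.cumsum(u) - 1.0
    k = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]  # last index kept positive
    tau = css[rho] / (rho + 1)
    return np.maximum(v - tau, 0.0)

def estimate_x(Z, c, g, n_iter=500):
    """Minimize 0.5 x^T P' x + q'^T x subject to x >= 0, sum(x) = 1.

    Z: (p, n) residual Y - (Delta o X) C
    c: (n,) hidden proportions (assumed non-zero)
    g: (p,) positive gene weights; assumption: Gamma = diag(g)
    """
    P = 2.0 * (c @ c) * np.diag(g)            # P' = 2 (c c^T) Gamma
    q = -2.0 * g * (Z @ c)                    # q' = -2 Gamma Z c^T
    step = 1.0 / np.linalg.eigvalsh(P).max()  # 1 / Lipschitz constant of the gradient
    x = np.full(Z.shape[0], 1.0 / Z.shape[0]) # start at the simplex centre
    for _ in range(n_iter):
        x = project_to_simplex(x - step * (P @ x + q))
    return x

# Toy usage: a noise-free residual generated by a known hidden profile.
x_true = np.array([0.5, 0.3, 0.2, 0.0])
c_demo = np.array([0.1, 0.2, 0.05])
g_demo = np.array([1.0, 0.5, 2.0, 1.5])
x_hat = estimate_x(np.outer(x_true, c_demo), c_demo, g_demo)
```

Projected gradient descent is used here only to keep the sketch free of solver dependencies; any standard QP solver that accepts the pair $(P', q')$ together with the non-negativity and sum-to-one constraints would serve the same purpose.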

Optimization with respect to ∆
In the following, we derive a procedure to minimize $L_{\mathrm{ADTD}}(C, x, \Delta)$ with respect to $\Delta_{j,\bullet}$, while $C$, $x$ and $\Delta_{k,\bullet}$ with $k \neq j$ are kept fixed. Let $\delta_k = (0, \ldots, 0, 1, 0, \ldots, 0)^T$, where the 1 is at the $k$th position, and let […] where we used the abbreviation […] and summarized all terms independent of $\Delta_{j,\bullet}$ as const. To minimize the former equation using quadratic programming, we rewrite it as […] to formulate a typical quadratic programming problem: […] The latter constraint can be derived from the ADTD constraint […]

Supplementary Figures and Tables
Table S1: ADTD performance for different hyperparameters on the training data. A comprehensive parameter grid consisting of all combinations of $\lambda_1 \in \{0, 10^{-5}, 10^{-4}, \ldots, 1, 10, \infty\}$ with $\lambda_2 \in \{10^{-9}, 10^{-8}, \ldots, 10, \infty\}$ was evaluated. Performance was assessed by (1) calculating Pearson's correlation between ground truth and predictions for each of the included cell types and (2) subsequently averaging these correlations.
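The two-step performance measure and the parameter grid described in the Table S1 caption can be sketched as follows. This is an illustrative sketch only: the function name `mean_pearson`, the variable names, and the array layout (rows are cell types, columns are mixtures) are assumptions and are not taken from the ADTD code.

```python
import itertools
import numpy as np

def mean_pearson(C_true, C_pred):
    """Return (average, per-cell-type list) of Pearson correlations.

    C_true, C_pred: arrays of shape (cell types, mixtures).
    A cell type with constant proportions yields NaN, as Pearson's
    correlation is undefined in that case.
    """
    per_type = [float(np.corrcoef(t, p)[0, 1])
                for t, p in zip(np.asarray(C_true), np.asarray(C_pred))]
    return float(np.mean(per_type)), per_type

# Hyperparameter grid as stated in the caption.
lambda1_grid = [0.0] + [10.0 ** k for k in range(-5, 2)] + [np.inf]
lambda2_grid = [10.0 ** k for k in range(-9, 2)] + [np.inf]
grid = list(itertools.product(lambda1_grid, lambda2_grid))
```

Each pair in `grid` would be passed to a model fit, and `mean_pearson` applied to the resulting estimates; the fitting step itself is omitted here because it depends on the actual implementation.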

Figure S2: ADTD performance for different hyperparameters on the validation and test data. A comprehensive parameter grid consisting of all combinations of $\lambda_1 \in \{0, 10^{-5}, 10^{-4}, \ldots, 1, 10, \infty\}$ with $\lambda_2 \in \{10^{-9}, 10^{-8}, \ldots, 10, \infty\}$ was evaluated. Panels a to c correspond to the validation data and panels d to f to the test data. Panels a and d show the average performance in predicting the known cellular contributions, and panels b and e the performance for the hidden contributions. Panels c and f give the corresponding areas under the ROC curves for detecting cell-type specific gene regulation.
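The area under the ROC curve reported in panels c and f can be computed from a per-gene score and a binary ground-truth label. The sketch below uses the pairwise (Mann-Whitney) formulation of the AUC. How the regulation score is defined in the study is not stated in this section; scoring a gene by the deviation of its rescaling factor from one, $|\Delta_{jk} - 1|$, is an assumption made purely for illustration, and the function name `roc_auc` is hypothetical.

```python
import numpy as np

def roc_auc(scores, labels):
    """AUC as the probability that a positive outscores a negative.

    Ties between a positive and a negative count one half.
    Memory use grows with (#positives x #negatives).
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one positive and one negative label")
    diff = pos[:, None] - neg[None, :]   # all positive-negative score differences
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

# Toy usage with an assumed score |Delta - 1| for one cell type.
delta_col = np.array([1.0, 1.8, 0.4, 1.05])        # hypothetical rescaling factors
regulated = np.array([False, True, True, False])   # hypothetical ground truth
auc = roc_auc(np.abs(delta_col - 1.0), regulated)
```

The pairwise form is chosen for transparency; for large gene sets a rank-based computation gives the same value with less memory.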

Figure S3: ADTD performance for recovering cell-type specific gene regulation on the validation data. The left panel shows areas under the ROC curve for recovering cellular regulation for different regularization parameters $\lambda_2$, where $\lambda_1 = 10^{-1}$ was kept fixed. The corresponding performance of ADTD in terms of Pearson's correlation for estimating the known and hidden cellular contributions is shown on the right. Abbreviation: "hidden" = hidden cellular contributions.

Performance of ADTD, EPIC, CIBERSORTx and Scaden on training data. Observed Pearson's correlations obtained by comparing the estimated cellular proportions with the ground truth for artificial cellular training mixtures generated from single-cell data of healthy breast tissue specimens (see Methods). The errors correspond to ±1 standard deviation obtained over 10 simulation runs. For ADTD, the parameters $\lambda_1 = 10^{-1}$ and $\lambda_2 = 10^{-8}$ were used (see hyper-parameter selection in validation performance section).