cypress: an R/Bioconductor package for cell-type-specific differential expression analysis power assessment

Abstract Summary Recent methodology advances in computational signal deconvolution have enabled bulk transcriptome data analysis at a finer cell-type level. Through deconvolution, identifying cell-type-specific differentially expressed (csDE) genes is drawing increasing attention in clinical applications. However, researchers still face a number of difficulties in adopting csDE genes detection methods in practice, especially in their experimental design. Here we present cypress, the first experimental design and statistical power analysis tool in csDE genes identification. This tool can reliably model purified cell-type-specific (CTS) profiles, cell-type compositions, biological and technical variations, offering a high-fidelity simulator for bulk RNA-seq convolution and deconvolution. cypress conducts simulation and evaluates the impact of multiple influencing factors, by various statistical metrics, to help researchers optimize experimental design and conduct power analysis. Availability and implementation cypress is an open-source R/Bioconductor package at https://bioconductor.org/packages/cypress/.

1 Simulation details

Gamma-Poisson compound
Using Negative Binomial (NB) distribution is a common practice to model RNA-seq data [Wu et al., 2015, Meng et al., 2023].Given its equivalence to the Gamma-Poisson hierarchical distribution (shown below), RNA-seq data simulation can also be constructed from these two distributions for the sake of cell-type proportions incorporation: then Y is equivalent to a negative binomial distribution.For systematic details, please refer to [Meng et al., 2023].
Deconstructing the Negative Binomial (N B) model within the two-step Gamma − P oisson process offers several advantages.It separates the control of the biological and technical noises respectively by the Gamma and P oisson distribution.Moreover, it facilitates the integration of cell-type proportion information into the underlying expressions generated by the Gamma distribution.This divide-and-reintegrate process allows us to fine-tuning of the cell-type proportion parameters while retaining the control over the underlying cell-type specific gene expression.

Notations and Overview
The simulation of cell-type specific profiles is based on real datasets and the simulation procedures are slightly different depending on the availability of cell-type abundances.We denote g as gene index, ranging from 1 to G, and i as sample index, ranging from 1 to N .The mixing proportion for each sample i is represented by θ i = (θ 1i , θ 2i . . ., θ ki ), where k θ ki = 1 satisfies the constraint, with k indicating cell-type index.Furthermore, z i represents a subject-specific phenotypical group assignment, where z i = 1/0 denotes cases and controls, respectively, for two-group comparisons.
The underlying cell-type-specific gene expression for each phenotypical group is denoted as a vector x T g,z=1/0 = (x g,z=1/0,1 , . . ., x g,z=1/0,k ), while the underlying gene expression across cell types for each subject, m gi , represents a weighted sum of cell-specific expressions evaluated by subject-specific cell types proportions.Subsequently, RNA-seq read count data, Y gi , is generated from a P oisson distribution.

Cell-type proportions unavailable
If users are unable to provide cell-type proportions and could only supply bulk RNA-seq data, cypress employs a referencefree deconvolution approach to infer cell-type proportions for each sample.This inference is achieved through a quadratic programming matrix decomposition algorithm [Houseman et al., 2016].Following this initial step, sample-and cell-typespecific expression profiles can be estimated using the tensor function provided by the TCA package [Rahmani et al., 2019].
The parameters for cell-specific gene expression are inferred from sample-and cell-type-specific expression profiles.The cell-specific underlying expression parameters for cell k and gene g are estimated using the PROPER package [Wu et al., 2015].This package assumes that the read counts follow a N B distribution with parameters µ gk and φ gk , representing the mean expression and biological dispersion for gene g and cell-type k in log-scale.
We further employ a Multivariate Normal distribution to model the gene expression mean and dispersion across the entire genome.Utilizing Maximum Likelihood Estimation (MLE), we also estimate the variance-covariance structure for both ( Σm and Σφ), respectively.M ultivariate − Gaussian distribution is also used to simulate two matrices, representing gene expression mean and dispersion parameters of cell-specific gene expression, for each phenotypical group.The underlying cell-specific gene expression matrix is simulated using a re-parameterized Gamma distribution for each group.
We assume cases and controls share the identical dispersion parameters, Φ G×K , but different M z G×K , where z = 1/0.LFCs are randomly drawn from N (µ LF C , 0.5).

Cell-type proportions available
If users provide cell-type proportions, the reference-free deconvolution process becomes redundant and is skipped.cypress would directly infer sample-and cell-type specific expression profiles and estimate cell-type-specific underlying expression parameters.Similarly to the methodology described above, gene expression mean and dispersion parameters are simulated using a M ultivariate − Gaussian distribution.Subsequently, the underlying cell-type specific gene expression matrix is generated using a re-parameterized Gamma distribution for each group.

Cell-type proportion generation
The cell-type proportions are simulated using a Dirichlet distribution based on α parameters, where α T C and α T D represent the parameters for controls and cases, respectively.
α T C and α T D are solved by dirichlet.mlefunction (sirt package) [Robitzsch, 2022] from the available sample-specific cell-type abundance information obtained from the previous section.

Distinguish each cell type
As noted in the supplementary materials section 1.2, the cell type-specific expression parameters, mean (µ gk ) and dispersion (φ gk ), for cell type k and gene g are estimated using the PROPER package.For each gene g, the mean and dispersion across all cell types are represented as µ g = (µ g1 , . . ., µ gK ) T and φ g = (φ g1 , . . ., φ gK ) T , respectively.To account for correlations among multiple cell types, we apply a multivariate normal distribution (M V N ) to both µ g and φ g : Maximum likelihood estimation (MLE) is used to estimate the parameters m, Σ m , d, and Σ d , where m and d are vectors of mean values for the six cell types, and Σ m and Σ d are 6 × 6 variance-covariance matrices.This approach guarantees that each cell type is not only characterized by unique mean and dispersion parameters but also by distinct variance-covariance relationships with other cell types.
For example, in Figure 1B from the main manuscript, simulation parameters are derived from an immune-associated disease (IAD) study (GSE60424), which includes six cell types: B-Cell, CD4, CD8, Monocytes, Neutrophils, and NK.The estimated mean ( m) and dispersion vectors ( d), along with their respective variance-covariance matrices ( Σm and Σd ) are detailed in Table S1 and Table S2.These parameters capture the means and dispersions of the cell-type-specific expression matrix.In the simulation process, we use 'Cell type 1' through 'Cell type 6' as labels to mask the names of these cell types.As shown in Table S1 S2: Estimated mean vector ( d) and variance-covariance matrix ( Σd ) for dispersions of the cell-type-specific expression matrix.

RNA-seq count data
For each sample, we could obtain the underlying gene expression, m gi , through a weighted sum of cell-specific gene expression, x gi , evaluated by θ T i .Conditioning on m gi , RNA-seq count data are simulated by P oisson Distributions.: True discovery rate (TDR) influenced by sample size, effect size, and gene expression strata.Each point on the line plot is an average value over N=30 simulations based on a real bulk RNA-seq data with 6 cell lines (GSE60424).Log fold change (LFC) has a distribution with a mean given in each panel, with a standard deviation of 0.5.(A) TDR by top-rank genes under the scenario of increasing LFC mean value, each line represents one cell type.TDR was averaged across sample sizes.(B) TDR at the top 300 genes by strata under the scenario of increasing LFC mean value, each line representing one cell type.TDR was averaged across cell types.(C) TDR by top-rank genes under the scenario of increasing sample size, each line represents one cell type.TDR was averaged across effect sizes.

Impact of the number of cell types
Users can obtain results from different numbers of cell types using the simFromData() function in the cypress package.Specifically, users need to specify the total number of cell types using the 'CT_index' arguments and set the 'CT_unk' syntax to 'True'.Under this scenario, the cypress package assumes that only bulk RNA-seq data is available for users.
Therefore, it will automatically run an RF deconvolution algorithm to estimate the simulation parameters for prospective usage.
We utilize bulk RNA-seq data from a large Autism Spectrum Disorder (ASD) study Gandal et al. [2018], Parikshak et al. [2016], which is attached to the cypress package, to evaluate how the power is affected by the number of cell types.
We set 'CT_index ' to 3,6,8,and 10 and 'CT_unk' to 'True' to simulate the conditions with different numbers of cell types.The simulation scenario is set with a sample size of N = 50 for each group, a total of G = 3000 genetic features, and an effect size defined as LF C = 2. Table S3 below summarizes the results across different numbers of cells.A decrease in the total number of cell types can lead to increased power or decreased FDC.This conclusion aligns with previous findings that cell type proportions are positively associated with the csDE genes detection accuracies Meng et al. [2023].The explanation is straightforward: for cell types that have small proportions, the technical noises could easily overwhelm biological signals.S3: Impacts of the total number of cell types on power and false discovery cost (FDC).The sample size for each group is set at 50, and the effect size is defined as LFC=2.

Impact of the number of genetic features and simulations
We conducted additional simulations at various combinations of the total number of genetic features (10,000, 20,000, and 30,000) and simulation counts (20, 30, 40, and 50).We notice that these factors have a minor impact on the power assessment, as shown in Table S4.Here, the sample size is fixed at N=50 for each group, the effect size is set to LFC=2, and the percentage of DE genes for each cell type is set at 5%. 6 Impact of the estimated cell-type-specific expression matrices To explore how the precision of estimated cell-type-specific expression matrices affects power inferences, we conducted additional simulations based on parameters estimated from GSE60424.Initially, we need to create different levels of precision for estimating cell-type-specific expression matrices, measured by the correlation between the true and estimated.
This could be achieved by introducing levels of random noises (±1%,±10%,±20%,±30%) to the simulated cell type proportions.As shown in the left panel of Figure S8, the precision of the estimated cell-type-specific expression matrices decreases with increasing noise levels, indicated by different colors.When the precision of the estimated cell-type-specific expression matrices decreases, the power inference also diminishes.This trend is also demonstrated in the right panel of Figure S8, which shows power boxplots for different noise levels added to the cell proportions.Figure S8: The left panel displays the relationship between sensitivities and precisions of the estimated cell-type-specific expression matrices, defined as the correlation between the true and estimated.This correlation is categorized by different levels of noise added to cell proportions.The right panel presents boxplots of the simulated sensitivity across different noise levels added into cell proportions.
7 Impact of DE genes proportions cypress assumes each DE gene is present in only one cell type at a time and does not account for a gene being DE in multiple cell types.Users can specify the percentage of DE genes for each cell type, and the total number of DE genes is the sum across cell types.For example, if a user sets 5% DE genes for each of the six cell types, the total would be 30%.
In Table S5, powers are evaluated under different DE genes percentages (1%, 2%, 5%, and 8% for each cell type) and effect sizes.It shows the amount of DE genes assigned to each cell type has minimal impact on power, at the current design.
Power  et al. [2016].We evaluate the simulated data generated from these three sources by overlapping density plots with the respective real data, as shown in Figure S9.We can conclude that our simulation pipeline is a reliable data generation process that produces data that are highly similar to real data.We additionally present a scatter plot (Figure S10) of estimated dispersion over mean for 3000 genes across 100 samples from one iteration of the data simulation process, based on the IAD data (GSE60424) simulation parameter.We observe a trend that for the genes with high mean expression, the dispersion across replicates tends to decrease accordingly.9 cypress Run time

YFigure S1 :Figure S3 :Figure S4 :Figure S5 :
Figure S1:The figure illustrates the target power versus cell-type abundances under different simulation scenarios.Each row corresponds to a different effect size (LFC) of 0.5, 1, and 1.5, respectively, while each column represents a different sample size (N) of 10, 20, and 50, respectively.The target power threshold is LFC=0.5.

Figure S9 :
Figure S9: The distributions of the simulated RNA-seq data compared to real data.(A) Overlapped distributions of simulated cell type-specific expressions and real RNA-seq counts for each immune cell subset from GSE60424.(B-C) Comparisons of distributions between simulated datasets and real datasets, with parameters respectively estimated from IBD and ASD studies.

Figure S10 :
Figure S10: Dispersion estimates relative to mean for a simulated bulk RNA-seq dataset: This plot is based on simulation parameters from GSE60424.Each dot represents the dispersion estimate of an individual gene.
below, monocytes is the most abundant cell (4.72) and neutrophils has the largest dispersion (4.05).
Thus, detecting csDE genes among minor cell types is a more challenging task.

Table S4 :
Power assessment across different combinations of genetic features and simulation counts.The sample size for each group is set at 50, the effect size is defined as LFC=2, and the percentage of DE genes for each cell type is set at 5%

Table S6 :
TableS6below shows runtime information across different numbers of genes and simulations.Increasing the number of genetic features and iterations will lead to longer run times.In general, our method is efficient in providing comprehensive evaluation results in minutes.Run time across different combinations of genetic features and simulation counts.The sample size for each group is set at 50 for each group, and the effect size is defined as LFC=2.