clrDV: a differential variability test for RNA-Seq data based on the skew-normal distribution

Background Pathological conditions may result in certain genes having expression variance that differs markedly from that of the control. Finding such genes from gene expression data can provide invaluable candidates for therapeutic intervention. Under the dominant paradigm for modeling RNA-Seq gene counts using the negative binomial model, tests of differential variability are challenging to develop, owing to dependence of the variance on the mean. Methods Here, we describe clrDV, a statistical method for detecting genes that show differential variability between two populations. We present the skew-normal distribution for modeling gene-wise null distribution of centered log-ratio transformation of compositional RNA-seq data. Results Simulation results show that clrDV has false discovery rate and probability of Type II error that are on par with or superior to existing methodologies. In addition, its run time is faster than its closest competitors, and remains relatively constant for increasing sample size per group. Analysis of a large neurodegenerative disease RNA-Seq dataset using clrDV successfully recovers multiple gene candidates that have been reported to be associated with Alzheimer’s disease.

where α ∈ R is the skewness parameter.Suppose Z is a random variable that has a skew-normal distribution (i.e.Z ∼ SN(α)).Its mean and its variance are given by µ z = E(Z) = bδ , σ 2 z = Var(Z) = 1 − (bδ ) 2 , respectively, where b = 2/π and δ = α/ √ 1 + α 2 .The formal derivation of the properties of the skew-normal distribution is due to Azzalini (1985) who treated the skew-normal distribution as a generalization of the normal distribution.Historically, the skew-normal model was arrived at by several different authors in other contexts (e.g. as a prior distribution in Bayesian analysis by O'Hagan and Leonard (1976); see Azzalini (2022)).However, these authors did not elaborate further on the theoretical properties of the skew normal as in Azzalini (1985).

S1.2 The skew-normal distribution -direct and centered parametrizations
The probability density function of a skew-normal distribution with direct parameters (DP) is given by with location parameter ξ g ∈ R, scale parameter ω g ∈ R + , and skewness parameter α g ∈ R; φ (•) and Φ(•) are the probability density function and the cumulative distribution function of the standard normal distribution, respectively.The skew-normal distribution with centered parameters (CP) is derived from the DP form via the mapping (Azzalini and Capitanio, 2014) and the inverse mapping is provided by where b = 2/π, δ g = α g / 1 + α 2 g , and R = 3 2γ g /(4 − π).For a single sample, the log-likelihood function for θ θ θ (D)  g = (ξ g , ω g , α g ) T is given by where c is a constant and ζ 0 (•) = log 2Φ(•) .Taking z gi = (y gi − ξ g )/ω g , we obtain the partial derivatives of ℓ 1 : thus the likelihood equations for a sample of size n are given by n where . Numerical methods are necessary to solve these equations.Azzalini and Capitanio (2014) suggested that a sample size up to about 50 may be necessary for the skew-normal distribution.To initialize the search, method of moments (MM) estimates are chosen as starting points for the CP components in Equation (1).
The MM estimators for the centered parameters are given by respectively, where Ȳg is the sample mean, S g is the sample standard deviation, and M g,3 is the sample third central moment.By estimating the CP components in Equation (1) using Equation (4), and then converting them to DP components using Equation (2), we obtain the MM estimators of the DP components: ξg , ωg and ᾱg .Subsequently, a search of the DP space where Equation (3) holds is done.Once θ θ θ g = ( μg , σg , γg ), the maximum likelihood estimators of the centered parameters.
Under regular maximum likelihood estimation, certain data values can produce a divergent αg .To overcome this problem, Azzalini and Arellano-Valle (2013) proposed a maximum penalized likelihood estimation ("Qpenalty") approach.A non-negative penalty term Q that penalizes the divergence of the skewness parameter α g is formulated as Q = c 1 log(1 + c 2 α 2 g ), where c 1 ≈ 0.87591 and c 2 ≈ 0.85625 (Azzalini and Arellano-Valle, 2013;Azzalini and Capitanio, 2014).Then, the maximum penalized likelihood for θ θ θ (D)  g is the penalized log-likelihood where y y y g = (y g1 , y g2 , . . ., y gn ), ℓ(θ θ θ (D) g ; y y y g ) is the log-likelihood function with respect to the parameter vector θ θ θ (D)  g : The maximum penalized likelihood estimator (MPLE), θ θ θ g , is a finite point that maximizes ℓ p (θ θ θ (D)  g ).The standard errors of θ θ θ g can be approximated from the corresponding penalized information matrix as Var( θ θ θ g ) −1 .The "MPpenalty" approach (Azzalini and Capitanio, 2014) defines the penalty function Q in Equation ( 5) as − log π m (α g ), where π m is a prior distribution for the skewness parameter α g .The matching prior (Cabras et al., 2012) for α g , allowing for the presence of ψ ψ ψ = (ξ g , ω g ), is given by where the terms involved are specific blocks of the Fisher information matrix I I I of θ θ θ (D) g .Since π m (0) = 0, the matching prior penalty effectively penalizes α g = 0 with Q = ∞.

S1.3 Fisher Information Matrix
For the direct parameters (DP) vector θ θ θ (D) = (ξ , ω, α), the Fisher information matrix is given by , (Azzalini, 1985).The matrix I θ θ θ (D) becomes singular as α → 0. This problem prevents the direct application of maximum likelihood estimation (MLE) for estimating the parameters of the DP form of the skew-normal distribution.In the same paper, Azzalini (1985) introduced the centered parametrization form to address the singulariy problem.He redefines a skew-normal variable which has E(Y ) = µ and Var(Y ) = σ 2 .The centered parameters (CP) vector θ θ θ (C) = (µ, σ , γ) has parameter space R × R + × (−0.9953, 0.9953) (Azzalini and Capitanio, 2014).The skewness parameter γ is the coefficient of skewness of Z, and also that of Y .Thus, we write Y ∼ SN C (µ, σ , γ). Figure S1 shows examples of the probability density functions of the skew-normal distribution with CP for different values of µ, σ and γ.

4/8 S3 CAPTIONS FOR SUPPLEMENTARY TABLES
Table S3: List of DV genes detected by clrDV for the control vs. AD comparison in the analysis of the Mayo RNA-Seq dataset.Table S4: List of DV genes detected by clrDV for the control vs. PSP comparison in the analysis of the Mayo RNA-Seq dataset.Table S5: List of DV genes detected by clrDV, MDSeq, and GAMLSS for the control vs. AD comparison in the analysis of the Mayo RNA-Seq dataset.Table S6: List of DV genes detected by clrDV, MDSeq, and GAMLSS for the control vs. PSP comparison in the analysis of the Mayo RNA-Seq dataset.