ALL Metrics
-
Views
-
Downloads
Get PDF
Get XML
Cite
Export
Track
Method Article

Fast effect size shrinkage software for beta-binomial models of allelic imbalance

[version 1; peer review: 3 approved with reservations]
PUBLISHED 28 Nov 2019
Author details Author details
OPEN PEER REVIEW
REVIEWER STATUS

This article is included in the Bioconductor gateway.

Abstract

Allelic imbalance occurs when the two alleles of a gene are differentially expressed within a diploid organism, and can indicate important differences in cis-regulation and epigenetic state across the two chromosomes. Because of this, the ability to accurately quantify the proportion at which each allele of a gene is expressed is of great interest to researchers. This becomes challenging in the presence of small read counts and/or sample sizes, which can cause estimates for allelic expression proportions to have high variance. Investigators have traditionally dealt with this problem by filtering out genes with small counts and samples. However, this may inadvertently remove important genes that have truly large allelic imbalances. Another option is to use Bayesian estimators to reduce the variance. To this end, we evaluated the accuracy of three different estimators, the latter two of which are Bayesian shrinkage estimators: maximum likelihood, approximate posterior estimation of GLM coefficients (apeglm) and adaptive shrinkage (ash). We also wrote C++ code to quickly calculate ML and apeglm estimates, and integrated it into the apeglm package. The three methods were evaluated on both simulated and real data. Apeglm consistently performed better than ML according to a variety of criteria, including mean absolute error and concordance at the top. While ash had lower error and greater concordance than ML on the simulations, it also had a tendency to over-shrink large effects, and performed worse on the real data according to error and concordance. Furthermore, when compared to five other packages that also fit beta-binomial models, the apeglm package was substantially faster, making our package useful for quick and reliable analyses of allelic imbalance. Apeglm is available as an R/Bioconductor package at http://bioconductor.org/packages/apeglm.

Keywords

RNA-seq, Allelic imbalance, Allele-specific expression (ASE), Beta-binomial, Shrinkage estimation, Empirical Bayes, Bioconductor, Statistical software

Introduction

Allelic imbalance (AI) occurs when the two alleles of a gene are expressed at different levels in a diploid organism, and the measurement of AI is valuable in elucidating the factors that regulate the expression of genes. For example, for a diploid organism, the allele on one chromosome may have higher or lower expression levels compared to the allele on the other chromosome due to genetic variation in nearby non-coding regulatory sites, a process known as cis-regulation. Allelic imbalance in expression may also be associated with differential epigenetic state of the genomic region across the chromosomes. In some cases, differential allelic expression resulting from differential epigenetic state can be linked to the parent-of-origin of the alleles, a phenomenon known as genetic imprinting.

One challenge currently faced in allelic expression studies is that estimates for allelic expression proportions can be highly variable in the presence of low read counts and/or small sample sizes. Large estimates of allelic proportions in these cases often result from estimation error as opposed to true differences in allelic expression. Though small samples and low counts are a problem for RNA-seq data in general, they are especially problematic when dealing with allele-specific counts. When a subject is heterozygous for a gene at a particular SNP(s), RNA-seq reads that overlap the SNP(s) allow for quantification of the levels of expression from either chromosome1. Thus, allelic expression cannot be measured within a gene for subjects that are homozygous for that gene, and the number of samples with allele-specific counts for a gene can be much less than the number of samples in the study. Furthermore, alleles are often differentiated by a single SNP, and RNA-seq reads that do not overlap the SNP cannot be mapped to either allele. For these reasons, the proportion of RNA-seq reads that are allele-specific can be quite low, depending on both read length and heterozygosity of the subjects. For instance, one study with 2x50 base pair (bp) paired-end reads and 30 million heterozygous SNPs from breast tumors of 550 human subjects found that allele-specific counts made up only 3.4% of RNA-seq reads2. Experiments making use of model organism crosses can maximize the number of RNA-seq reads overlapping heterozygous SNPs, for example Raghupathy et al.3 found in an RNA-seq dataset of a mouse F1 cross that 22% of uniquely mapping reads were allele-specific.

One traditional remedy investigators have used to deal with the challenges of high-variance estimates is to filter out genes that have low counts or small samples. While this does cause the resulting estimates to be more stable and thus representative of true allelic expression proportions, filtering may also remove genes that have true allelic imbalances. Furthermore, the cutoff used to determine what genes to filter out (i.e. how many counts a gene must have for it to not be removed) must be chosen per dataset by the analyst. Another potential remedy to the problem of variable estimates from low-count and low-sample genes is to use Bayesian shrinkage estimators to moderate estimates.

A large number of Bayes estimators have already been developed for allelic expression studies. For instance, MMSEQ4 uses a Gamma prior on allele-specific transcript abundance to provide isoform and allelic imbalance estimates that are more accurate and stable in the face of low coverage. Other methods that have used Bayesian approaches to test for AI include those by Leòn-Novelo et al. 20145 and Skelly et al.6 However, these methods can only test overall AI, and cannot test the effects of covariates, such as different groups, on AI. More recently, a method was developed that expanded on that by Leòn-Novelo et al. 2014 and was able to estimate AI within groups as well as compare AI between groups7. It uses Bayesian shrinkage estimates for its parameters to shrink allelic proportions within groups toward 0.5, overdispersion toward a pre-specified prior mean, and the total counts of both alleles toward a pooled estimate. While the method is more flexible than the other methods listed, it still cannot estimate the effects of continuous covariates on allelic imbalance, nor can it estimate differences in AI between groups while controlling for additional confounding variables. Furthermore, while the authors showed their method to be effective in reducing type I and type II error in the face of different sources of bias, the advantage of their method in estimation accuracy itself and in the face of genes with low counts need to be more thoroughly investigated.

Though gene expression read counts are typically larger than allele-specific counts and can be measured for all subjects, the uncertainty of estimates in the presence of low counts and/or low sample sizes is still an issue. Thus, several shrinkage estimators for log fold changes in gene expression have also been developed which shrink estimates that are only large due to the variance of the estimator and leave unchanged estimates that are likely to be large due to true expression changes811. Many of these methods directly involve or can easily be applied to linear models, which provide great flexibility in the kinds of study designs that can be treated and hypotheses that can be investigated. Though these methods were originally developed for improving accuracy and stability of log fold change estimates in gene expression, several can be can be directly applied or at least easily extended to estimating the effects of covariates on allelic expression proportions.

To this end, we look at three different estimation methods and their performance on data sets with small-to-moderate numbers of samples: maximum likelihood (ML), approximate posterior estimation of GLM coefficients (apeglm)11 and adaptive shrinkage (ash)10 ML estimates are based on estimating effects by modelling allele-specific counts with a beta-binomial GLM. Apeglm and ash are Bayesian shrinkage estimators which shrink maximum likelihood-based estimates toward zero. Our results show that while apeglm is not always the best method, it always performs better than ML and never performs much worse than ash for most metrics, making it the most robust and reliable when dealing with small sample sizes in our analysis. We also introduced new source code for the apeglm package to improve computational performance for beta-binomial GLMs, and compared our improved package to other R packages that can also fit beta-binomial GLMs. As the apeglm package can calculate both ML and Bayesian shrinkage estimates, our improvements can be used even by those who wish not to use shrinkage estimators. Compared to other R packages, we show that apeglm with our improved code gives better running times and greater scalability with the number of covariates.

Methods

Estimation methods

We evaluated three estimation methods on their ability to estimate allelic expression proportions (or equivalently, the effects of covariates on allelic expression proportions): maximum likelihood estimation (ML estimation or MLE) with the likelihood described below, approximate posterior estimation of GLM coefficients (apeglm) and adaptive shrinkage (ash). All analyses was done using R version 3.5.112. The first two methods mentioned are implemented in the apeglm v.1.7.5 package, while the last is implemented in the ashr v.2.2.32 package. When using the ash function in the latter package, we set the method parameter equal to "shrink". While there are many Bayesian estimation methods that can be used to quantify allelic imbalance, these allow for arbitrary design matrices. For instance, these methods can estimate differences in AI between groups while controlling for, or allowing interactions with, multiple additional variables, and can estimate the effects of continuous variables on AI.

For the g-th gene (1 ≤ gG), a beta-binomial GLM was fit to model allele-specific counts as follows. Let Yig be the read counts of the first of the two alleles (which allele is designated as the first allele is arbitrary) for the i-th subject, 1 ≤ iI. Investigators may designate the first and second alleles of a gene as the paternal and maternal alleles or as the alternate and reference alleles. It is assumed that Yig ∼ BetaBin(nig , pig ,ϕg), where nig is the equal to the total counts of both alleles for the i-th subject, pig is the probability of counts belonging to the first allele of the i-th subject, and ϕg is the overdispersion parameter. For the remainder of this paper, we will refer to the total allele-specific counts for both alleles of a particular gene and for a particular sample as the ‘total counts’ for that gene and sample. Furthermore, we will refer to the probability that counts for a particular gene belong to a particular allele for a particular sample as the ‘allelic proportion’ for that particular allele and sample. In this case, ϕ → ∞ implies no overdispersion beyond what would be seen in a binomial distribution and ϕ → 0 implies increasing variance. n1g , ..., nIg are assumed to be fixed and known. As the beta-binomial probability density function has multiple forms and parameterizations, we specify our parametrization as:

f(y;n,p,ϕ)=(ny)B(y+ϕp,ny+ϕ(1p))B(ϕp,ϕ(1p))

where B specifies the beta function. Furthermore, let xi be the i-th row of the design matrix X (matrix where columns are vectors of covariates of interest). Potential predictors include disease status for association studies, parent of origin for imprinting studies, and the presence of a SNP for eQTL linkage studies. We also assume that pig=[1+exp(xiTβg)]1, or equivalently logit(pig)=xiTβg, where βg = (β1g , ...,βKg)T is a vector of coefficients representing the effect sizes for the predictors in the design matrix. For ML estimation, βg is estimated via maximum likelihood. Constrained optimization is used for the nuisance parameter ϕg with a maximum of 500, so that genes with no overdispersion have finite estimated values of ϕ.

Apeglm additionally assumes a zero-centered Cauchy prior distribution for the effects of one of the predictors11. For estimating the effect of the j-th predictor in our model, where 1 ≤ jK is chosen by the user, and for the g-th gene, we have:

Yig|βgBetaBin(nig,pig,ϕg)

pig=11+exp(xiTβg)

βjgCauchy(0,γj)

Apeglm shrinks the effect of one chosen predictor at a time, across all genes. The scale parameter of the Cauchy prior, γj, is estimated by pooling information across genes. The posterior distribution of βg is the product of the above Cauchy prior and beta-binomial likelihood, and apeglm provides Bayesian shrinkage estimates based on the mode of the posterior as well as standard errors. Genes with lower expression, smaller numbers of heterozygous subjects and higher dispersion in allelic proportions will have flatter likelihoods, which will lead to the prior having more influence and shrinkage being greater. Furthermore, if the ML estimates are tightly clustered about zero, the estimated scale parameter of the Cauchy prior will be smaller. This will lead to more peakedness in the prior and also cause shrinkage to be greater.

The original apeglm package estimated regression coefficients using C++ for negative binomial GLMs, while GLMs with other likelihoods, such as the beta-binomial, were fit completely in R. To improve scalability for large data sets with beta-binomial GLMs, we wrote fast C++ code for calculating maximum likelihood and apeglm shrinkage estimates of beta-binomial regression coefficients. We also changed the source code to speed up computation of the standard errors (though such computations were still done in R) and prevent convergence issues. Details can be found in the Supplementary Methods section13.

Ash is a general Empirical Bayes shrinkage estimator for hypothesis testing and measuring uncertainty in a vector of effects of interest, such as a set of log fold changes in gene expression between biological conditions10. Suppose again that one is interested in the effect sizes of the j-th predictor, β.j = (βj1, ..., βjG), where 1 ≤ j ≤ K. Ash takes as input a vector of ML estimated effects β^.j=(β^j1,...,β^jG) and corresponding estimated standard errors σβ.j = (σβ j1 , ..., σβ jG). Here we take the estimated standard errors to be the true standard errors as suggested in the original methodology for ash, though the developers of ash have recently proposed an extension to their method that allows for random errors14. For all 1 ≤ g ≤ G, it is assumed that β^jg|βjgN(βjg,σβjg) and that βjg ~ hj , where hj is some unimodal, zero-mode prior distribution. hj is estimated from the ML estimates using mixtures of uniforms and a point-mass at zero, a choice guided by the author’s claim that any unimodal distribution can be approximated as a mixture of uniforms with arbitrary accuracy. The posterior is βjg|β^jgN(βjg,σβjg)×hj, and ash provides Empirical Bayes shrinkage estimates using the mean of the posterior as well as standard errors. Genes with larger standard errors for their ML estimates will have a flatter likelihood that will be less impactful on the estimation. Thus, estimates for these genes will be shrunk more. Like apeglm, ash can only shrink estimates for one covariate at a time.

Datasets and simulations

We compared the three estimation methods using the data set from the allelic expression study by Crowley et al.15,16 The study took mice from three divergent inbred strains (CAST/EiJ, PWK/PhJ and WSB/EiJ) and performed a diallel cross. The data set contains ASE counts for 72 mice and 23,297 genes in the resulting cross, with 12 mice of each possible parent combination (e.g. CAST/EiJ as mother and PWK/PhJ as father is one parent combination, and PWK/PhJ as mother and CAST/EiJ as father is another), and an equal number of males and females within each parent combination. Sequencing was performed with the Illumina HiSeq 2000 platform to generate 100-bp paired-end reads and following the TruSeq RNA Sample Preparation v2 protocol. To assure that the mice all had the same alleles, we chose one genotype to focus on, namely the genotype resulting from the cross with CAST/EiJ and PWK/PhJ. The resulting data set, which we will refer to for the remainder of this paper as the ‘mouse data set’, had 24 mice, 12 of each sex and 12 of each parent of origin, and each mouse had nearly the same nuclear genetic composition as a result of the cross.

To evaluate the estimators on estimating effect sizes of predictors when the truth is known, we first fit an intercept-only beta-binomial model on each gene for the mouse data set. ϕ = [ϕg] is the vector of ML estimates of the overdispersion parameter from each model, and µ = [µg] is the vector of ML estimates of allelic proportions (which ranges between 0 and 1). 8 mice were then selected from the data set. Denote NI×G = [nig] as the matrix of total ASE counts for the 8 mice. Finally, a matrix of counts from one of the alleles YI×G = [yig] was simulated for a sample size of 4 vs. 4, where yig was simulated from BetaBin(nig, pig, ϕg), logit(pig) = µg + βgx, βg was simulated from a standard normal distribution, and x splits the mice into two groups of size four (x = 1 if a mouse is in the first group and 0 otherwise). Samples were drawn from the beta-binomial distribution using the emdbook v1.3.11 package17. We refer to this simulation throughout the paper as the ‘standard normal simulation’, reflecting the distribution of the true effect sizes.

A second simulation was also performed that was similar in setup to the first, but with modifications to the distribution of βg and ϕg. In many studies, the effect sizes of a predictor will be zero for all but a handful of genes. Thus, βg was simulated from t3/10 (a Student’s t-distribution with 3 degrees of freedom scaled by 1/10), which gave effects mostly close to zero, but with moderate and large effects occasionally appearing (Supplementary Figure 113). Furthermore, the distribution of ϕg from the mouse data appeared to be a mixture of two distributions: Genes without overdispersion had an obvious point mass at 500 with 70% proportion, and the remaining 30% genes had a distribution somewhat similar to an exponential with a mean of β = 179 (Supplementary Figure 213). To get more over-dispersed allele-specific counts, ϕg was simulated from 0.5Exp(β=89) + 0.5(500), a mixture distribution where one component was exponential with a mean of 89 and had 50% proportion, and the other component was a point mass at 500 and had 50% proportion. We refer to this simulation throughout the paper as the ‘Student’s t simulation’, again reflecting the distribution of the true effect sizes. Note that these two simulations assume a data generating process, specifically the same data generating process as our assumed likelihood.

The estimators were then evaluated on real data with the focus on estimating mean, or gene-wide, allelic imbalance. From the mouse data set, random samples of size 6 were drawn, and this process was repeated 100 times. We will refer to these samples throughout the paper as the ‘random subsamples’. For each random subsample, the ML, apeglm and ash estimates of intercept-only models were calculated for the genes (where the intercept term was shrunk), and the MLE of the held-out 18 mice was taken to be the truth. Estimating the intercept in an intercept-only model for each gene is equivalent to estimating overall allelic imbalance for each gene.

Additional simulations were conducted for evaluating computational performance of our improvements to apeglm, to see how well they would scale to larger and more complicated data sets. Allele-specific counts were simulated in a similar manner as the apeglm vignette18. Briefly, we have Y100×5000 = [yig] as our simulated count matrix for one allele with associated total count matrix N100×5000 = [nig] where rows are samples and columns are genes, yig ~ BetaBin(nig, pg, θg), θg ~ U (0, 1000), pg ~ N (.5, 0.52), nig ~ NB(µg, 1g), and µg, ϕg are based on the airway data set by Himes et al.19 To see how well our improvements scaled with increasing numbers of covariates, the data were split multiple times into differing numbers of groups of approximately equal size, where the number of groups ranged from 2 to 10. With K groups, the design matrix was X100×K = [1 x1 ... xK –1], where xj is an indicator variable for the (j + 1)-th group, or a row vector whose i-th element is 1 if the i-th sample is in the (j + 1)-th group and 0 otherwise. A simulation was also conducted to see how well apeglm would work with continuous predictors. This time, Y and N was kept the same, but with the design matrix X100×4 = (1, x1, x2, x3) = [xij], where x1 = (1, 0, 1..., 1, 0)T separates the samples into two equally sized groups and xi2, xi3 ~ N (0, 1). x1 is the covariate whose effect size estimates are shrunk.

Data processing

Genes where at least three samples did not have at least 10 counts were removed, which we considered minimal filtering that shouldn’t decrease statistical power. Genes without at least one count for both alleles across all individuals were removed. Genes with a marginally significant sex or parent effect were removed, so that all samples could be assumed independent and identically distributed for all genes. Genes were removed from the mouse data set prior to conducting random sampling from the data set or simulations.

To determine whether sex or parent effects were significant, beta-binomial GLMs were estimated for each gene by maximum likelihood, with a design matrix that included a sex effect (an indicator that was 1 if male and 0 if female), a parent-of-origin effect (an indicator that was 1 if the mother was the CAST/EiJ strain and 0 if the father was the CAST/EiJ strain) and an interaction term. For each gene, if the p-value for the sex, parent-of-origin or interaction effect was less than 0.1, the effect was deemed marginally significant for that gene.

Technical details of evaluations

For each gene, we define the shrinkage score as movement from MLE to zero. We define a gene as (noticeably) shrunk if shrinkage exceeds 0.1, and substantially or most shrunk if shrinkage is greater than max (1,|β^MLE|/4). For instance, if an apeglm estimate for a gene is 0.15 closer to zero than the MLE, then the shrinkage score is 0.15 and the gene is noticeably shrunk but not substantially shrunk by apeglm.

Concordance at the top (CAT) plots20 were used to determine which estimation method could best find the most important genes (the genes with the greatest allelic imbalance or largest effect size). For an estimation method, concordance at the top takes the top genes according to the true ranking and compares it to the top genes according to the estimates, where the top genes are the genes with the largest true or estimated effect sizes in absolute value. For instance, a concordance at the top 10 of 90% means that the top 10 genes according to the estimation method and the top 10 genes according to the truth agree for 9 out of 10 genes.

For evaluating the performance of the three methods in estimating intervals, we calculated normality-based 95% confidence and credible intervals (both of which we will abbreviate as CIs) of the ML and apeglm estimators using their standard errors, or intervals based on the Laplace approximation of the likelihood and posterior. Such normality-based intervals are the default and suggested method for the apeglm package. Credible intervals in the ashr package were calculated from directly estimating tail probabilities of the posterior.

For each of the design matrices posited in our computation simulation, computational performance of apeglm estimation was compared between the old and new apeglm code. From apeglm v1.7.5, we set the method parameter equal to “betabinCR" to run the new C++ code, and set the log.lik parameter equal to a beta-binomial log-likelihood function to run the old code from before our improvements were introduced (version 1.6.0 of the package). Details can be found in the vignette18. Computational performance of ML estimation was also compared between our improved apeglm package and the following packages: aod v1.3.121, VGAM v1.122, aods3 v0.423, gamlss v5.124 and HRQoL v1.025. Computational performance was evaluated using the microbenchmark v1.4.6 package26 for estimation of a single gene and elapsed time for estimation of all 5000 genes, on a 2012 15-inch MacBook Pro with an Intel Core i7-3720QM processor.

Determining the optimal filtering rule

In addition to comparing the three estimation methods described above, maximum likelihood estimation paired with optimal filtering criteria was also assessed via concordance at the top. CAT was chosen over other benchmark metrics, such as mean absolute error, as the different number of genes after filtering would make comparisons between filtered MLE and the three unfiltered methods biased. Furthermore, as we were primarily interested in whether a good filtering rule even existed, the true ranking of genes was used to determine the filtering rule. We looked at three rules: 1) removing genes where less than half the samples had a minimum total count threshold, 2) removing genes where less than all the samples had a minimum total count threshold, and 3) removing genes where the sum of total counts across samples was less than a certain threshold. For the remainder of the paper, we will refer to the sum of total counts across samples as the ‘summed counts’ of a gene. For each rule, various different thresholds were looked at: {0, 10, ..., 200} were potential thresholds for rule 1, {0, 10, ..., 100} were potential thresholds for rule 2, and {0, 50, ..., 1000} were potential thresholds for rule 3. For each rule and threshold, the MLE was calculated and concordance among the top 50, 100, 200, 300, 400 and 500 genes were averaged. We will refer to the rule and threshold that had the best concordance as the ‘optimal filtering rule’.

Results

Standard normal simulation

We began by looking at a simulation where allelic counts came from known beta-binomial distributions and effect sizes came from a standard normal distribution. In this simulation, apeglm and ash successfully shrunk erroneously large estimates and reduced estimation error, particularly for genes that were noticeably shrunk (see Table 1 and Figure 1).

Table 1. Performance Metrics for Normal Simulation.

MLE: Maximum Likelihood Estimation, apeglm: Approximate Posterior Estimation of Generalized Linear Model Coefficients, ash: Adaptive Shrinkage.

Performance MetricMLEApeglmAsh
Mean Absolute Error0 .1520.1420.141
Mean Absolute Error (apeglm-shrunk genes)0.4980.4080.397
Mean Absolute Error (ash-shrunk genes)0.4690.380.37
Mean Absolute Error (|effect size|>2)0.2210.2030.269
Coverage Probability for 95% CI0.9510.9510.949
Average Interval Width for 95% CI0.7610.6970.685
531a33c1-e075-414b-a295-ffcd30b38610_figure1.gif

Figure 1. Truth vs. estimate and CAT Plots for normal simulation.

a) truth vs. estimate plot for MLE. Blue points represent genes substantially shrunk by apeglm only, orange points represent genes substantially shrunk by ash only and green points represent genes substantially shrunk by both ash and apeglm. b) truth vs. estimate plots for apeglm. c) truth vs. estimate plots for ash. d) CAT plot for the three methods as well as for MLE after filtering. CAT: Concordance at the Top, MLE: Maximum Likelihood Estimation, apeglm: Approximate Posterior Estimation of Generalized Linear Model Coefficients, ash: Adaptive Shrinkage.

All three estimation methods gave similar mean absolute error (MAE), as many genes did not differ much between the methods (Table 1). In exploring the behavior of shrinkage estimators, we were most interested in genes where shrinkage was high, and thus where estimates would be much closer to or much farther from the truth for one estimation method than for another. Thus, in addition to overall MAE, we also calculated MAE among genes that were noticeably shrunk by apeglm and genes that were noticeably shrunk by ash, to determine whether there was substantial improvement on average when apeglm or ash did noticeably shrink a gene. Among genes that were shrunk by apeglm, apeglm decreased the mean absolute error by 18.1%, and among genes that were shrunk by ash, ash decreased the mean absolute error by 21.1%. Moreover, from Figure 1a–c, it can be seen that apeglm shrunk most ML estimates that were inflated, bringing them closer to the truth, and mostly left truly large effects alone. Ash also shrunk ML estimates that were inflated, including some inflated estimates missed by apeglm. However, ash also had a tendency to incorrectly and excessively shrink: some genes with estimates close to the truth were severely shrunk, and several genes with truly large effects were shrunk to zero. Because of this tendency to over-shrink, ash performed worse among genes with large effects than among genes with small effects. For instance, among genes with effect sizes greater than two in absolute value, ash estimates had a higher mean absolute error than the MLE.

Ash and apeglm also performed better than the MLE in determining the most important genes, where concordance at the top was higher regardless of the number of genes being considered (Figure 1d). Apeglm performed slightly better than ash in concordance at the top 100 genes, but otherwise they performed about the same. Concordance at the top for the MLE was optimized when filtering out genes with summed counts less than 350. Using this filtering, we were able to get CAT results better than that of the shrinkage estimates, even if only by a very small amount. Thus, for this simulation, it was possible to outperform both apeglm and ash with filtering alone (provided that the true ranking of genes was known, and used to determine the optimal filtering rule).

With regard to the extent of shrinkage, both apeglm and ash mainly exhibited shrinkage for genes that had very low counts (Supplementary Figure 313). This is not too surprising for this particular simulation, as after filtering out lowly-expressed genes, the remaining ML estimates were much closer to the truth (Supplementary Figure 413). When comparing shrinkage scores between apeglm and ash, we found that there was a clear upward shift of shrinkage scores for ash (Supplementary Table 113), further showing that ash had more extreme shrinkage than apeglm for this dataset. Though all three methods gave intervals that were similar in coverage probability, average interval width was smaller for apeglm and ash compared to the MLE (Table 1).

Student’s t Simulation

We also investigated the performance of the estimators when most of the effect sizes were close to zero and overdispersion was large. Here the shrinkage estimates had even more marked improvement over the ML estimates (see Table 2 and Figure 2).

Table 2. Performance metrics for Student’s t Simulation.

Performance MetricMLEApeglmAsh
Mean Absolute Error0.1860.0940.089
Mean Absolute Error (apeglm-shrunk genes)0.3530.1220.113
Mean Absolute Error (ash-shrunk genes)0.3360.1260.115
Coverage Probability for 95% CI0.9240.9370.941
Average Interval Width for 95% CI0.8730.450.456
531a33c1-e075-414b-a295-ffcd30b38610_figure2.gif

Figure 2. Truth vs. estimate and CAT Plots for Student’s t Simulation.

a) truth vs. estimate plot for MLE. Orange points represent genes substantially shrunk by ash only and green points represent genes substantially shrunk by both ash and apeglm. All genes substantially shrunk by apeglm were shrunk by practically the same amount or more by ash. b) truth vs. estimate plots for apeglm. c) truth vs. estimate plots for ash. d) CAT plot for the three methods as well as for ML after filtering. MLE: Maximum Likelihood Estimation, apeglm: Approximate Posterior Estimation of Generalized Linear Model Coefficients, ash: Adaptive Shrinkage.

Apeglm improved mean absolute error by 49.5% among all genes, and by 65.4% among noticeably shrunk genes specifically (Table 2). Ash improved mean absolute error by 52.2% among all genes and by 65.8% among noticeably shrunk genes specifically. These improvements were greater than that seen from the standard normal simulation. Figure 2a–c show that ash successfully shrunk inflated ML estimates closer to the truth while leaving truly large effects mostly unchanged. Apeglm brought many inflated ML estimates closer to the truth as well, but not as many as ash.

Concordance at the top was better for the shrinkage estimates than for the ML estimates, regardless of the number of top genes in question. Furthermore, similar to mean absolute error, the improvements seen from the shrinkage estimates over the MLE was larger than those seen from the standard normal simulation. Ash performed better than apeglm in concordance at the top 50 and 100 genes, though performance was similar when looking at larger number of genes. Concordance at the top for the MLE was optimized when filtering out genes where less than half the samples had at least 110 counts. Though this improved CAT by quite a lot, performance was still much lower than apeglm and ash Thus, unlike in the standard normal simulation, the performance in CAT obtained by shrinkage could not be matched with filtering, even when using the true gene ranking to determine the optimal filtering rule.

In this simulation, due to the increased overdispersion, there were many effects that were overestimated or underestimated by ML, even among genes with large counts. Because of this, both ash and apeglm exhibited shrinkage for effects across the dynamic range of summed counts, as opposed to only shrinking effects with small counts (Supplementary Figure 513). Together with the true vs. estimate plots, this shows that both apeglm and ash can correctly shrink falsely large effects even when the summed counts are large. Ash had larger shrinkage scores than apeglm on average, indicating that ash tended to shrink estimates more than apeglm (Supplementary Table 213). All methods had coverage slightly less than nominal (95%), ranging from 92 to 94%. However, both apeglm and ash had half the average interval width compared to maximum likelihood, despite both having slightly higher coverage rates.

We also conducted simulations similar to that of the standard normal and Student’s t, but with 5 vs. 5 samples. Like the 4 vs. 4 case, both apeglm and ash had lower average estimation error and higher concordance at the top than the MLE (results not shown).

Sampling from the mouse dataset

To evaluate performance on real data, we took 100 random subsamples of 6 mice from the mouse data set and averaged various performance metrics across the random subsamples. Similar to the simulations, apeglm appeared to improve estimation accuracy and shrink erroneously large genes. Ash, on the other hand, appeared to perform worse than the MLE according to mean absolute error, concordance at the top and interval coverage (see Table 3 and Figure 3).

Table 3. Performance metrics averaged across random subsamples.

Performance MetricMLEApeglmAsh
Mean Absolute Error0.1130.1020.132
Mean Absolute Error (apeglm-shrunk genes)0.5560.4770.772
Mean Absolute Error (ash-shrunk genes)0.4370.3770.586
Coverage Probability for 95% CI0.9440.9370.914
Average Interval Width for 95% CI0.5970.4370.399
531a33c1-e075-414b-a295-ffcd30b38610_figure3.gif

Figure 3. Truth vs. estimate and CAT plots for random subsamples.

a) through c) is based on one of the 100 random subsamples used in the mouse data benchmarking, and plots the ML estimates of the associated held-out set against the MLE, apeglm and ash estimates from the random subsample, respectively. Orange points represent genes substantially shrunk by ash only and green points represent genes substantially shrunk by both ash and apeglm. All genes substantially shrunk by apeglm were shrunk by practically the same amount or more by ash. d) plots concordance at the top averaged across the 100 random subsamples for each method. MLE: Maximum Likelihood Estimation, apeglm: Approximate Posterior Estimation of Generalized Linear Model Coefficients, ash: Adaptive Shrinkage.

Among genes that were shrunk by apeglm, mean absolute error was 14.2% lower on average for apeglm (Table 1). On the other hand, average MAE rose by 34.1% for ash among genes that were shrunk. Moreover, the minimum MAE obtained by ash across all 100 random subsamples was larger than the maximum MAE obtained by apeglm (results not shown). From Figure 3a–c, ash appears to be over-shrinking, and some of the genes with the largest held-out effect estimates were shrunk to zero. Though some genes also appeared to have been incorrectly or overly shrunk by apeglm, apeglm mainly was observed to shrink genes with inflated estimates and over-shrinkage was normally less severe when it occurred.

Both apeglm and MLE had universally higher concordance at the top than ash (Figure 1d). While apeglm performed slightly better than the MLE in concordance at the top 50 genes, performance was identical when looking at larger numbers of genes. We found that any filtering only decreased concordance at the top, as many top genes had low counts (i.e. the optimal filtering rule was no filtering). The most likely reason for this is that for each random subsample, we are treating the MLE of the held-out set as the truth. Thus, estimation error in the face of low-count genes would affect the held-out effect estimates and bias CAT results to some degree, even though the held-out sets have larger numbers of samples and performance metrics are averaged over many random subsamples. However, because genes with very large held-out effect estimates are more likely to have low counts, metrics that average across all genes, such as mean absolute error, would not be biased as much by estimation error.

A large amount of variability in the ML estimates was discernible for genes with low counts (Supplementary Figure 613). Like in the standard normal simulation, the low-count genes were mainly the ones shrunk by apeglm and ash. As the truth vs. estimate plots suggest, ash had larger shrinkage scores than apeglm (indicating more extreme shrinkage), and with the difference in shrinkage between the two methods being larger than in the simulations (Supplementary Table 413). Though apeglm intervals appeared to have smaller coverage than the ML intervals, the difference in coverage was very small, and average interval width was also 26.8% smaller for apeglm than that of maximum likelihood. Ash intervals were slightly more narrow than apeglm, with average interval width 32.5% smaller than that of maximum likelihood, but coverage was also lower.

As the mouse data set only had 24 samples, we determined that we didn’t have the sufficient sample size to evaluate our methods on estimating effect sizes of predictors, or models with more than just an intercept term. For instance, even if we wanted to look at the performance of our methods on estimating a group effect with only four samples in each group, each held-out set would only have eight samples in each group. Thus, the ML estimates of the held-out sets would have a lot of variance and could be far from the truth.

Computational performance of Apeglm

To evaluate the computational performance of our package on larger datasets, we simulated allelic counts for 5000 genes and 100 samples, and randomly divided the samples into differing numbers of groups. apeglm with our improvements had very fast running times for both ML and apeglm estimation and scaled well with the number of covariates (see Figure 4 and Figure 5).

531a33c1-e075-414b-a295-ffcd30b38610_figure4.gif

Figure 4. Comparisons in estimation time for one gene (in milliseconds).

531a33c1-e075-414b-a295-ffcd30b38610_figure5.gif

Figure 5. Comparisons in estimation time for all genes.

a) computational time of ML estimation (in seconds) for the apeglm and aod packages by the number of groups (covariates). b) computational time of apeglm estimation for the new and old apeglm packages by number of groups (covariates).

Estimation times per gene for ML estimation was substantially faster for apeglm than all other packages (Figure 4). The next best package, aods3, took 5 to 11 times longer than apeglm and did not scale as well with the number of groups. Furthermore, the aods3. gamlss and HRQoL packages occasionally produced errors and could not fit beta-binomial models for all the simulated genes.

For estimating all genes in the simulation via maximum likelihood, apeglm took 24 seconds for two groups and added only 1–2 seconds of computational time for every group added (Figure 5a). The next fastest package that could fit beta-binomial models for all the genes, aod, took seven times longer for two groups and grew 80 times as much for every group added. Comparisons in apeglm estimation between our improved apeglm package and the original package gave similar conclusions. Furthermore, unlike the new apeglm package, which grew roughly linearly with the number of groups in the range we assessed, the order of growth from the original package was not linear: the greater the number of groups already in the model, the greater the computational time increased for adding additional groups. At 10 groups, our improvements made apeglm 27 times faster than aod for ML estimation and 33 times faster than the old package for apeglm estimation. Our improvements also performed quite favorably when fitting beta-binomial models with two groups and two numerical controls. Elapsed time was 31 seconds for ML estimation and 43 seconds for apeglm estimation with the new apeglm package. In contrast, ML estimation took over nine minutes for aod and apeglm estimation took over seven minutes for the old apeglm package. Introducing multicollinearity into the design matrix did not substantially change computational performance for any package (results not shown).

Discussion

Here the performance of three estimators was compared across two simulations and one real dataset of allele specific expression in mice. Though apeglm was not the best estimator in all cases, it was the most robust and with consistent performance. Apeglm had smaller mean absolute error and greater concordance at the top than the MLE, and was never much worse than ash in these respects. Ash also performed better than the MLE for the simulated data for most metrics, including mean absolute error and concordance at the top. Moreover, ash had higher concordance at the top than apeglm in the Student’s t simulation. However, ash also had a tendency to over-shrink some genes, shrinking some truly large effects close to zero. Furthermore, for the real data set, ash performed worse than the MLE for most metrics, including mean absolute error and concordance at the top, most likely due to over-shrinking of many genes. As performance on the real data set was based on taking random subsamples of mice and using the MLE of the held-out set as the truth, estimation error of the held-out effect estimates may have biased results. For future research, using larger data sets to analyze apeglm performance than that of Crowley et al. would allow for held-out sets with more samples and thus reduce estimation error of held-out effect size estimates.

The shrinkage estimators compared here typically shrunk only low-count genes, as low-count genes tend to be those with the most uncertain and variable estimates. However, during a simulation where extreme overdispersion and heavy tails of the distribution of true effects were introduced, there were some large-count highly-variable genes that were shrunk as well, showing that ash and apeglm will shrink large-count genes if there is high uncertainty in the estimates. Ash consistently had more extreme shrinkage than apeglm and greater estimation error among genes with truly large effects. Thus, ash would most likely perform best in a situation where most effects were small, such as in the Student’s t simulation.

No method gave confidence or credible intervals with the highest coverage rates for all scenarios. However, across both simulations and analysis of of the mouse data, differences in coverage rates between the three methods were small, and coverage rates for apeglm credible intervals in particular were always very close to the interval that had the largest coverage. Furthermore, interval width for apeglm and ash were always smaller than that of maximum likelihood. This suggests that interval estimates from apeglm could be advantageous over those by maximum likelihood. For future research, it would be beneficial to evaluate the accuracy of hypotheses tests based on the estimates or posterior distribution of apeglm using metrics such as type I and type II error. The method of Leòn-Novelo et al. 20187 rejected hypotheses based on credible intervals of its posterior distribution, and if a similar step was taken for apeglm, its narrower intervals and robust coverage could potentially give more powerful hypothesis tests without suffering from inflated type I error.

Our changes to the apeglm package greatly improved computational performance for both ML and apeglm estimation of beta-binomial GLMs, particularly when larger numbers of covariates were involved. Among the R packages that we looked at which could fit beta-binomial models, the new apeglm package was always the fastest for fitting many GLMs in sequence, e.g. across many genes or variant locations. Thus, the new apeglm package is useful for quick and reliable analyses of allelic imbalance even for researchers who wish to only use likelihood-based estimators. Moreover, only coefficient estimates are currently calculated in C++, and even better computational performance would be achieved if overdispersion and standard error calculations were integrated into C++ as well. We are not aware of any other R packages that utilize faster programming languages such as C or C++ to estimate numerous beta-binomial regression models based on large matrices of observed allelic counts. The most similar package we noted was fastglm27, which fits individual quasi-binomial models in C++. While quasi-binomial models also estimate proportions and control for overdispersion, they do so in a different manner and with different assumptions.

Based on previous work, there are several ways in which the apeglm methodology could potentially be improved for allelic expression studies. For instance, while our extension of apeglm estimated overdispersion by MLE, the original methodology for apeglm as applied to negative binomial GLMs utilized Bayesian estimates for overdispersion as well as for regression coefficients. Introducing a prior for beta-binomial overdispersion that pools information across genes may lead to better estimation and inference of regression coefficients. We also assumed that the total allele-specific counts were fixed and known. Allowing such quantities to be random, as in the method by Leòn-Novelo et al. 2018, may lead to better inference as well. Adjusting for read mapping biases and ambiguities (Leòn-Novelo et al. 20145; Leòn-Novelo et al. 20187; Raghupathy et al. 20183) could also lead to better estimates when such biases and quantification uncertainty are present. Lastly, though here we focused on beta-binomial GLMs, a wide variety of statistical models can be used for ASE, from quasi-binomial28 to Poisson-lognormal models29.

Data availability

Underlying data

Zenodo: RNA-seq Dataset from Crowley et al. 2015. http://doi.org/10.5281/zenodo.340468916.

This project contains the following underlying data:

  • fullGeccoRnaDump.csv

This file contains the Crowley et al. mouse dataset which was was obtained from http://csbio.unc.edu/gecco/data/fullGeccoRnaDump.csv.gz15,30. We uploaded the dataset to Zenodo on the authors’ behalf with their permission, due to the fact that the original dataset is not currently hosted in a stable repository.

The dataset from this repository is available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).

Extended data

Zenodo: Supplementary Material for Zitovsky and Love 2019. http://doi.org/10.5281/zenodo.340469713.

This project contains the following extended data:

  • Supplementary Methods.pdf (Contains the mathematical and algorithmic details of how the apeglm package estimates beta-binomial coefficient effect sizes by maximum likelihood and apeglm, including the steps taken to improve computational performance, increase numerical stability and prevent convergence issues)

  • Supplementary Figures and Tables.pdf (Contains supplementary figures 1–6 and supplementary tables 1–3. These figures and tables were referenced and described in the main body of the article)

Data are available under the terms of the CC-BY 4.0 license.

Software availability

Zenodo: Apeglm v1.7.5 Source Code. http://doi.org/10.5281/zenodo.340450431. This repository contains the source code for the version of the apeglm package used in this paper.

The software from this repository is available under the terms of the GNU General Public License v3.0 (GPL-3).

Zenodo: Source Code for Zitovsky and Love 2019. http://doi.org/10.5281/zenodo.340466932. This repository contains the R scripts used to run the analyses described in this article and generate all of its figures. All figures associated with this paper, including figures present in the main article and supplementary figures, were generated as separate .png and .eps files and can also be found in this repository. The R scripts can be found under the ‘Code’ folder while the figures can be found under the ‘Figures’ folder.

Material from this repository are available under the terms of the GPL-3 license.

apeglm is available as part of the Bioconductor project33 at http://bioconductor.org/packages/apeglm. The vignette18 and manual provide detailed information on how to use the package.

Comments on this article Comments (0)

Version 2
VERSION 2 PUBLISHED 28 Nov 2019
Comment
Author details Author details
Competing interests
Grant information
Copyright
Download
 
Export To
metrics
Views Downloads
F1000Research - -
PubMed Central
Data from PMC are received and updated monthly.
- -
Citations
CITE
how to cite this article
Zitovsky JP and Love MI. Fast effect size shrinkage software for beta-binomial models of allelic imbalance [version 1; peer review: 3 approved with reservations] F1000Research 2019, 8:2024 (https://doi.org/10.12688/f1000research.20916.1)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
track
receive updates on this article
Track an article to receive email alerts on any updates to this article.

Open Peer Review

Current Reviewer Status: ?
Key to Reviewer Statuses VIEW
ApprovedThe paper is scientifically sound in its current form and only minor, if any, improvements are suggested
Approved with reservations A number of small changes, sometimes more significant revisions are required to address specific details and improve the papers academic merit.
Not approvedFundamental flaws in the paper seriously undermine the findings and conclusions
Version 1
VERSION 1
PUBLISHED 28 Nov 2019
Views
25
Cite
Reviewer Report 10 Feb 2020
Ernest Turro, Department of Hematology, University of Cambridge, Cambridge, UK;  MRC Biostatistics Unit, University of Cambridge, Cambridge, UK 
Approved with Reservations
VIEWS 25
This paper has two components:

1) An advance in computational efficiency for estimating beta-binomial regression coefficients with shrinkage. The authors have produced a C++ implementation of the inference code previously written in R. Both versions of the ... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Turro E. Reviewer Report For: Fast effect size shrinkage software for beta-binomial models of allelic imbalance [version 1; peer review: 3 approved with reservations]. F1000Research 2019, 8:2024 (https://doi.org/10.5256/f1000research.23018.r57280)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
  • Author Response 14 Dec 2020
    Josh Zitovsky, Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, 27516, USA
    14 Dec 2020
    Author Response
    This paper has two components:

    1) An advance in computational efficiency for estimating beta-binomial regression coefficients with shrinkage. The authors have produced a C++ implementation of the inference code ... Continue reading
COMMENTS ON THIS REPORT
  • Author Response 14 Dec 2020
    Josh Zitovsky, Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, 27516, USA
    14 Dec 2020
    Author Response
    This paper has two components:

    1) An advance in computational efficiency for estimating beta-binomial regression coefficients with shrinkage. The authors have produced a C++ implementation of the inference code ... Continue reading
Views
25
Cite
Reviewer Report 04 Feb 2020
Jarad Niemi, Department of Statistics, Iowa State University, Ames, IA, USA 
Ignacio Alvarez-Castro, University of the Republic, Montevideo, Uruguay 
Approved with Reservations
VIEWS 25
Is the rationale for developing the new method (or application) clearly explained?

Yes.
  • In our work a key issue is bias of allele reads toward a reference genome as explained in Sun and
... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Niemi J and Alvarez-Castro I. Reviewer Report For: Fast effect size shrinkage software for beta-binomial models of allelic imbalance [version 1; peer review: 3 approved with reservations]. F1000Research 2019, 8:2024 (https://doi.org/10.5256/f1000research.23018.r58251)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
  • Author Response 14 Dec 2020
    Josh Zitovsky, Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, 27516, USA
    14 Dec 2020
    Author Response
    Is the rationale for developing the new method (or application) clearly explained?
     
    Yes.
    • In our work a key issue is bias of allele reads toward a reference
    ... Continue reading
COMMENTS ON THIS REPORT
  • Author Response 14 Dec 2020
    Josh Zitovsky, Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, 27516, USA
    14 Dec 2020
    Author Response
    Is the rationale for developing the new method (or application) clearly explained?
     
    Yes.
    • In our work a key issue is bias of allele reads toward a reference
    ... Continue reading
Views
64
Cite
Reviewer Report 17 Dec 2019
Matthew Stephens, Department of Statistics, University of Chicago, Chicago, IL, USA;  Department of Human Genetics, University of Chicago, Chicago, IL, USA 
Approved with Reservations
VIEWS 64
Summary:

The paper presents new implementations of shrinkage methods for beta binomial models, implemented in the R software package apeglm. One potential application of these models is estimating allele-specific biases in various sequencing-based assays (and differences in bias between groups), ... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Stephens M. Reviewer Report For: Fast effect size shrinkage software for beta-binomial models of allelic imbalance [version 1; peer review: 3 approved with reservations]. F1000Research 2019, 8:2024 (https://doi.org/10.5256/f1000research.23018.r57281)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
  • Author Response 14 Dec 2020
    Josh Zitovsky, Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, 27516, USA
    14 Dec 2020
    Author Response
    Summary:
     
    The paper presents new implementations of shrinkage methods for beta binomial models, implemented in the R software package apeglm. One potential application of these models is estimating allele-specific ... Continue reading
COMMENTS ON THIS REPORT
  • Author Response 14 Dec 2020
    Josh Zitovsky, Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, 27516, USA
    14 Dec 2020
    Author Response
    Summary:
     
    The paper presents new implementations of shrinkage methods for beta binomial models, implemented in the R software package apeglm. One potential application of these models is estimating allele-specific ... Continue reading

Comments on this article Comments (0)

Version 2
VERSION 2 PUBLISHED 28 Nov 2019
Comment
Alongside their report, reviewers assign a status to the article:
Approved - the paper is scientifically sound in its current form and only minor, if any, improvements are suggested
Approved with reservations - A number of small changes, sometimes more significant revisions are required to address specific details and improve the papers academic merit.
Not approved - fundamental flaws in the paper seriously undermine the findings and conclusions
Sign In
If you've forgotten your password, please enter your email address below and we'll send you instructions on how to reset your password.

The email address should be the one you originally registered with F1000.

Email address not valid, please try again

You registered with F1000 via Google, so we cannot reset your password.

To sign in, please click here.

If you still need help with your Google account password, please click here.

You registered with F1000 via Facebook, so we cannot reset your password.

To sign in, please click here.

If you still need help with your Facebook account password, please click here.

Code not correct, please try again
Email us for further assistance.
Server error, please try again.