Identifying dysregulated regions in amyotrophic lateral sclerosis through chromatin accessibility outliers

Summary The high heritability of amyotrophic lateral sclerosis (ALS) contrasts with its low molecular diagnosis rate post-genetic testing, pointing to potential undiscovered genetic factors. To aid the exploration of these factors, we introduced EpiOut, an algorithm to identify chromatin accessibility outliers that are regions exhibiting divergent accessibility from the population baseline in a single or few samples. Annotation of accessible regions with histone chromatin immunoprecipitation sequencing and Hi-C indicates that outliers are concentrated in functional loci, especially among promoters interacting with active enhancers. Across different omics levels, outliers are robustly replicated, and chromatin accessibility outliers are reliable predictors of gene expression outliers and aberrant protein levels. When promoter accessibility does not align with gene expression, our results indicate that molecular aberrations are more likely to be linked to post-transcriptional regulation rather than transcriptional regulation. Our findings demonstrate that the outlier detection paradigm can uncover dysregulated regions in rare diseases. EpiOut is available at github.com/uci-cbcl/EpiOut.


Figure S1. The comparison of joint and per-sample peak calling methods
Peaks are identified using MACS2 in both approaches.For the cohort-wide joint peak calling, all ATACseq reads are aggregated into a single BAM file, which is subsequently processed with MACS2 to produce a set of peaks consistent across samples.In the sample-wise peak calling approach, peaks are identified individually for each sample using MACS2, resulting in a separate set of peaks for each sample.Then, overlapping peaks from different samples are collapsed, ensuring a consistent comparison of peaks during the outlier calling.(A) The length distribution (in base pairs) of accessible chromatin regions detected with joint and per-sample peak calling.Joint peak calling produces a greater number of peaks that are narrower compared to per-sample peak calling.(B) The moving average of the replication rate by the rank of peaks, sorted by total read counts.Joint peak calling exhibits a higher replication rate in regions with lower read depth.(C) Overlap of over-accessibility outliers identified through joint and per-sample peak calling methods.Both approaches generate a comparable number of over-accessibility outliers.~63% of the peaks identified by one method were detected by the other method.A B

9%). (C)
The evaluation of gene expression prediction performance, utilizing the ABC-score, was conducted using both observed Hi-C scores (blue) and predicted Hi-C scores (orange).A power regression model (Figure S5B) was employed to compute the Hi-C scores for outlier pairs within a distance of up to 1 million base pairs.There is no statistically significant difference (ns≥0.05) between models to predict gene expression outliers based on the Mann-Whitney U test (PR-AUC for blue = 22.5 ± 7.7% and orange = 22.6 ± 7.9%).The findings suggest that the co-outlier status serves as adequate evidence to infer interactions between accessible regions for predicting gene expression outliers.promoters compared to expression outliers with non-outlier promoters.(C) Among genes exhibiting proteomic aberrations (defined by |-score| > 1), missense and potentially NMD-triggering rare variants show depletion in genes with outlier promoters relative to gene expression outliers with nonoutlier promoters.(D) Splicing-disrupting variants predicted by AbSplice are enriched in genes with nonoutlier promoters against all gene expression outliers.In contrast, genes with outlier promoters are strongly depleted for the potentially splicing-disrupting variants.( ≥ 0.05, * < 0.05, * * < 10 !" , * * * < 10 !# , * * * * < 10 !$ based on the hypergeometric test).

Figure S10:
The enrichment of variants in the vicinity of chromatin accessibility outliers (A) Small variants such as SNVs or indels in the accessible chromatin regions are slightly enriched for low frequency and rare variants ( = 0.002,  = 268,   = 1.2).However, variants in the vicinity of the chromatin accessibility outliers are not enriched compared to non-outliers.The small effect size can be explained by the fact that the majority of these variants are non-coding and are likely to have weak or neutral regulatory effects.(B) Chromatin accessibility outliers are more likely to occur near structural variants ( < 0.001,  = 47,   = 10.5).However, only ~1% of chromatin accessibility outliers have at least one structural variant in 1 kbp distance, and ~1.3% have structural variants in proximity.the NEK1 gene.However, no chromatin accessibility outlier is observed in the vicinity.(B) Manhattan plot presenting variants near chromatin accessibility outliers, with two loci identified by variants rs10280711 and rs73271169 in the vicinity of a chromatin accessibility outlier in two ALS cases.However, no gene expression outlier is observed in the vicinity.(C) MAGMA Z-score distribution across genes, categorized by outlier status.There is no statistically significant difference between non-outlier genes and the expression of outlier genes with or without outlier promoters.(D) OUTRIDER and MAGMA p-values for genes.Genes prioritized based on GWAS and outlier analysis are mostly orthogonal.(E) Odds of observing a potentially NMD-triggering rare variant with a specific frequency in a gene expression outlier.The enrichment of rare and private potentially NMD-triggering variants is stronger than the enrichment of low-frequency variants.The results indicated that outlier analysis prioritized the molecular effect of ultra-rare variants, which GWAS may not prioritize.

Figure S3 .Figure S4 .Figure S5 .Figure S6 .
Figure S3.Read coverage and replication rate statistics.(A) Read coverage distribution across samples.Notable differences in read coverage are observed among the samples, highlighting the importance of size factor normalization to adjust for coverage disparity.(B) Replication rate of accessible regions throughout samples.In the scatterplot, each dot signifies an accessible region for a sample.The x-axis denotes the read counts supporting the accessible region in the sample with the highest read count.Filters, represented by the red lines, were implemented to ensure an accessible region is supported by at least 100 reads in one sample and is replicated in at least 50% of the samples by at least 2 reads.

Figure S7 .
Figure S7.Interaction between outlier pairs.(A)Hi-C score distribution of accessible regions, which are at least 100 kilo-bp apart, by outlier status and annotation.P-values were calculated with the U-Mann Wily test and corrected for multiple testing using the Bonferroni method.(B) Log-log shows the Hi-C scores of non-outlier pairs (in black) and outlier pairs (in red).Hi-C scores decay by power law with increasing distances.We fit a power regression (indicated by black and red lines) on the data to test the interaction between outlier pairs while using distance as a control variable.

Figure S9 :
Figure S9: The enrichment and depletion of variants across outlier types by variant effect category.(A)Odds of observing potentially NMD-triggering variants in expression outliers (green), expression outliers with non-outlier promoters (red), and expression outliers with outlier promoters are presented.Expression outliers with outlier promoters are strongly depleted of potentially NMD-triggering variants compared to those with non-outlier promoters.Meanwhile, expression outliers with non-outlier promoters are significantly enriched for NMD-triggering variants when compared to the baseline comprising all expression outliers.(B) Analysis similar to panel-A but limited to genes under strong

Figure S11 :
Figure S11: Number of chromatin accessibility and gene expression outliers detected per sample.(A) Number of chromatin accessibility outliers per ALS case (11 outliers in median case) (B) and clinically healthy controls (10 outliers in median control sample).There is no statistically significant difference in the number of outliers detected between cases and controls ( = 0.69, based on the Mann-Whitney U test).(C) Number of gene expression outliers per ALS case (6 outliers in median case) (D) and controls (4 outliers in median control sample).There is no statistically significant difference in the number of outliers detected between cases and controls ( = 0.054, based on the Mann-Whitney U test).

Figure S13 .Figure S14 .Figure S15 .Figure S16 .Figure S17 .
Figure S13.Aberrations observed at multiple omics levels of ALYREF gene.(A) ATAC-seq read coverage at the promoter of the gene.The sample with an outlier promoter is indicated by red across panels.(B) Expected and observed accessibility in the promoter of the gene across samples (C) Z-score distribution of promoter accessibility (D) gene expression (E) protein levels across samples.

(D) Overlap
Figure S2.Runtime of counting step.Bam and bed files for 20 ATAC-seq experiments were downloaded from ENCODE.Each bam file has 142 ± 48 million ATAC-seq reads.Overlapping accessible regions across bed files were collapsed, leading to 365,090 accessible regions.The counting was performed with these accessible regions and bam files using EpiCount, bedtools, and PyRanges.During bedtools counting, bed files are first sorted, then counting is performed using the -sorted flag in bedtools to leverage the chrom-sweep algorithm.(A) CPU time of counting methods.EpiCount is twice as fast as PyRanges and comparable with bedtools.The mean runtime of EpiOut per bam file is 169 ± 54 seconds; meanwhile, the runtime is 549 ± 225 seconds for PyRanges and 214 ± 74 for bedtools.Epiout counts one million reads per 1.2 ± 0.1 seconds, while PyRanges counts one million reads in 3.8 ± 0.2 seconds and bedtools counts in 1.5 ± 0.1.(B) EpiOut and bedtools have a much smaller memory footprint than PyRanges.PyRanges loads the entire bam file to memory while EpiCount and bedtools perform stream counting by only loading a read per iteration; thus, EpiCount (mean memory consumption 0.53 ± 0.009 GB) and bedtools (