Misexpression of inactive genes in whole blood is associated with nearby rare structural variants

Summary Gene misexpression is the aberrant transcription of a gene in a context where it is usually inactive. Despite its known pathological consequences in specific rare diseases, we have a limited understanding of its wider prevalence and mechanisms in humans. To address this, we analyzed gene misexpression in 4,568 whole-blood bulk RNA sequencing samples from INTERVAL study blood donors. We found that while individual misexpression events occur rarely, in aggregate they were found in almost all samples and a third of inactive protein-coding genes. Using 2,821 paired whole-genome and RNA sequencing samples, we identified that misexpression events are enriched in cis for rare structural variants. We established putative mechanisms through which a subset of SVs lead to gene misexpression, including transcriptional readthrough, transcript fusions, and gene inversion. Overall, we develop misexpression as a type of transcriptomic outlier analysis and extend our understanding of the variety of mechanisms by which genetic variants can influence gene expression.


Supplemental Figures
is quantified as the percentage of samples where the gene has a TPM > 0.1 (x-axis).E.) Percentage of genes labeled as active, inactive or unassigned from chromHMM k-means clusters stratified by gene expression activity.For each gene, activity is quantified as the percentage of samples where the gene has a TPM > 0.1 (x-axis).Inactive genes are defined as having a TPM > 0.1 in less than 5% of samples.Enrichment of all 82 gene-level features within genes that are misexpressed versus non-misexpressed genes across different misexpression Z score thresholds.Features are grouped into different categories.Tiles shaded in gray do not pass a Bonferroni-adjusted p-value threshold (p < 0.05).

Figure S4. Properties of misexpression-associated rare SVs.
A.) SV length distributions of misexpression-associated and control duplications and deletions restricted to singletons only.The lower, middle and upper hinges of the box plots correspond to the 25th percentile, median and 75th percentile, respectively.P values were calculated using a one-sided Mann-Whitney test comparing the lengths of control and misexpression-associated SVs B.) Enrichment (xaxis) without adjusting for SV length of misexpression-associated deletions (left panel, red) and duplications (right panel, green) compared to controls for genomic scores (y-axis) including evolutionary conservation (phyloP), predicted deleteriousness (CADD-SV), constraint (gnomAD Z score constraint and gwRVIS), and HARs.Enrichments were calculated as the log odds ratio with lines indicating 95% confidence intervals for the fitted parameters using the standard normal distribution.Asterisks indicate significant enrichment after Bonferroni correction.C.) Enrichment without adjusting for SV length of misexpression-associated deletions and duplications compared to controls for regulatory features including CTCF candidate cis-regulatory elements from ENCODE, TAD boundaries shared across multiple cell-lines, A and B compartments, chromatin states from the Roadmap Epigenomics Project and CpG islands from the UCSC genome browser.Enrichments were calculated as the log odds ratio and tiles shaded in gray do not pass Bonferroni correction.

Figure S8. Enrichment of SNVs and indels within the gene body and flanking sequence of genes involved in misexpression events across different misexpression Z score thresholds and MAF cutoffs.
A flanking sequence of ±10 kb was used for SNVs and indels.Enrichments were calculated as the relative risk of having a nearby variant type given the misexpression status.Bars represent 95% Wald confidence intervals of the relative risk estimates.The line at enrichment = 1 indicates no enrichment; asterisks positioned either side of the line indicate significant enrichment or underenrichment after Bonferroni correction.

Figure S1 .
Figure S1.Overview of the SV callset.A.) The percentage of different SV classes in the callset.Text labels indicate the total number of SV calls for each SV class.B.) Histogram showing the number of SVs across different length bins (logtransformed) stratified by SV class.C.) The percentage of SVs within each allele frequency bin by SV class.Text labels indicate the number of SV calls within each allele frequency bin.D.) The length distribution of SVs within each allele frequency bin by SV class.The lower, middle and upper hinges of the box plots correspond to the 25th percentile, median and 75th percentile, respectively.

Figure S2 .
Figure S2.Removal of global expression outliers and inactive gene set validation.A.) Number of top expression events (y-axis) ranked across all samples (x-axis).The dashed line indicates the threshold for removing aberrant samples.Failed samples (red) had a greater number of top expression events than this threshold while samples passing (blue) had a lower number.B.) Different inactive gene validation approaches showing the percentage of inactive genes identified in INTERVAL (y-axis) within different gene sets (x-axis).C.) Heatmap showing the percentage overlap of 60,603 genes (x-axis) over 15 chromHMM states from PBMC data.Genes are clustered into 8 k-means clusters and each cluster is labeled according to the types of overlapping states.D.) Percentage of genes in each k-means cluster stratified by gene expression activity.For each gene, expression activity

Figure S3 .
Figure S3.Different properties of misexpressed and non-misexpressed genes across misexpression Z score thresholds.Enrichment of all 82 gene-level features within genes that are misexpressed versus non-misexpressed genes across different misexpression Z score thresholds.Features are grouped into different categories.Tiles shaded in gray do not pass a Bonferroni-adjusted p-value threshold (p < 0.05).

Figure S5 .
Figure S5.Misexpression metrics across genes and samples.A.) Number of misexpression events across different misexpression Z score threshold bins.B.) Percentage of 8,650 inactive genes that have at least one misexpression event across different misexpression Z score thresholds.Text labels indicate the total number of genes with at least one misexpression event.C.) Percentage of 4,568 samples that have at least one misexpression event across different misexpression Z score thresholds.Text labels indicate the total number of samples with at least one misexpression event.D.) Number of misexpression events per sample across different misexpression Z score thresholds.

Figure S6 .
Figure S6.Properties of misexpressed genes.A.) Enrichment of gene-level features within genes that are misexpressed (Z score > 2 and TPM > 0.5) versus non-misexpressed genes split across all genes, protein-coding genes and lncRNA genes.The union of the top 15 features with the highest absolute log odds passing Bonferroni correction within each gene set are shown.Lines indicate 95% confidence intervals for the fitted parameters using the standard normal distribution.B.) Human Phenotype Ontology (HPO) terms, by -log10(adjusted p-value) on the x-axis, underrepresented within 1,070 misexpressed protein-coding genes using 3,092 inactive protein-coding genes as the custom background.Only significant terms are shown.The log-odds ratio

Figure S7 .
Figure S7.Percentage of misexpression events with a rare SV within 10 kb and 200 kb.Percentage of misexpression events (y-axis) with a rare SV within A.) 10 kb and B.) 200 kb at different misexpression Z score thresholds (x-axis).Text labels indicate the total number of misexpression events with an SV.

Figure S9 .
Figure S9.Enrichment of rare SNVs, indels and SVs across genomic windows and misexpression Z score thresholds.Enrichment of rare (MAF < 1%) SNVs (orange), indels (green) and SVs (blue) within 200 kb genomic windows and the body of the misexpressed gene across different misexpression Z score thresholds.Enrichments were calculated as the relative risk of having a nearby rare variant type given the misexpression status.The line at enrichment = 1 indicates no enrichment; Asterisks positioned either side of the line indicate significant enrichment or underenrichment after Bonferroni correction.Bars represent 95% Wald confidence intervals of the relative risk estimates.

Figure S10 .
Figure S10.Enrichment of rare SNVs and indels across genomic windows and misexpression Z score thresholds.Enrichment of rare (MAF < 1%) SNVs (orange) and indels (green) within 200 kb genomic windows and the body of the misexpressed gene across different misexpression Z score thresholds.Enrichments were calculated as the relative risk of having a nearby rare variant type given the misexpression status.The line at enrichment = 1 indicates no enrichment; Asterisks positioned either side of the line indicate significant enrichment or underenrichment after Bonferroni correction.Bars represent 95% Wald confidence intervals of the relative risk estimates.

Figure S11 .
Figure S11.Enrichment of rare SV classes.Enrichments were calculated as the relative risk of having a nearby variant type or consequence given the misexpression status.Bars represent 95% Wald confidence intervals of the relative risk estimates.The line at enrichment = 1 indicates no enrichment; stars positioned either side of the line indicate significant enrichment or underenrichment after Bonferroni correction.A.) Enrichment of rare (MAF < 1%) inversions and mobile element insertions in a ±200 kb window around the tested genes across different misexpression Z score thresholds.B.) Enrichment of rare (MAF < 1%) deletions, duplications,

Figure S12 .
Figure S12.Enrichment of rare SVs stratified by their class and predicted VEP consequences across misexpression Z score thresholds.Enrichments were calculated as the relative risk of having a nearby variant consequence given the misexpression status.The line at enrichment = 1 indicates no enrichment; asterisks positioned either side of the line indicate significant enrichment or underenrichment after Bonferroni correction.Bars represent 95% Wald confidence intervals of the relative risk estimates.Only SV consequences with at least one Bonferroni significant enrichment at any Z score threshold are shown.Missing points indicate tests failing to pass the nominal p-value threshold (p ≥ 0.05).

Figure S13 .
Figure S13.Transcriptional readthrough region FPKM and FBNC, transcription readthrough SV consequences and OTP misexpression.A.) FPKM and B.) FBNC at the 17 predicted readthrough regions for samples with candidate transcriptional readthrough deletions (red) and duplications (green), as well as samples with no candidate SVs (gray).C.) Proportion of candidate transcriptional readthrough deletions and duplications by their predicted VEP consequence on the misexpressed gene.D.) Expression of OTP comparing samples with DEL chr5:77674588−77771600 (GRCh38) to samples without the deletion.Red color indicates samples passing the misexpression threshold TPM > 0.5 and Z score > 2 while gray samples are below this threshold.E.) Deletion of the 3' end of TBCA results in transcriptional readthrough.Transcriptional readthrough leads to OTP misexpression (orange gene) and intergenic splicing between TBCA and OTP (intergenic reads, orange).In the sashimi plot, the line width corresponds to the number of reads spanning a given junction.

Figure S14 .
Figure S14.Examples of chimeric misexpression via transcript fusion.Comparison of expression in samples with and without an SV associated with chimeric misexpression via transcript fusion alongside FusionInspector visualization of the fusion transcript in a sample with a misexpression-associated SV for A.) CPHXL misexpression via KARS1-CPHXL fusion in a sample with