Genome-wide detection of human variants that disrupt intronic branchpoints

Significance The search for candidate variants underlying human disease in massive parallel sequencing data typically focuses on coding regions and essential splice sites, mostly ignoring noncoding variants. The RNA spliceosome recognizes intronic branchpoint (BP) motifs at the beginning of splicing and operates mostly within introns to define the exon–intron boundaries; however, BP variants have been paid little attention. We established a comprehensive genome-wide database and knowledgebase of BP and developed BPHunter for systematic and informative genome-wide detection of intronic variants that may disrupt BP and splicing, together with an effective strategy for prioritizing BP variant candidates. BPHunter not only constitutes an important resource for understanding BP, but should also drive discovery of BP variants in human genetic diseases and traits.


Prediction of BP in the region [-3, -40] nt upstream of 3'ss
As the previous BP predictions only used one or two eBP datasets for training, and overlapped the prediction in the region [-21, -34] nt of 3'ss, we supplemented them with an additional high-precision BP prediction in the region [-3, -40] nt of 3'ss that covered the sequences closer to 3'ss. We used all 198,256 consensus-guided position-adjusted eBP as positive training data (Results), and randomly generated 1,000,000 non-BP positions from intronic and exonic regions as negative training data. We extracted the flanking 13-nt motif [-9, +3] nt of each BP and non-BP position in the training data, based on the interaction mode between BP and snRNA (15) (Figure 1b). We then vectorized the 13-nt motif into 52-bit binary code by one-hot-encoding (converting A to 0001, C to 0010, G to 0100, and T to 1000). We developed three machine learning classification models: gradient boost machine (GBM), random forest (RF), and logistic regression (LR), by using scikit-learn (16) (Figure S2d). For each model, we first performed parameter optimization by using grid search of different combinations of key parameters, and evaluated the performance of each set of parameters by stratified-shuffled 10-fold cross-validation based on its F1 score (F1 = 2 * (precision * recall) / (precision + recall)). The parameters yielding the highest F1 score were selected as the optimal parameters (Table S10). As we aimed to identify the BP candidates with high precision, we performed thresholding optimization to establish the optimal probability cutoff for each model. We generated precision-recall curves (PRC) and receiver operating characteristic curves (ROC), by averaging the performance of stratified-shuffled 10-fold cross-validation. We determined the optimal threshold of each model by requiring precision ≥ 0.95 and maximizing Youden's J statistic (J = sensitivity + specificity -1). We therefore trained and optimized GBM-BP model, RF-BP model and LR-BP model respectively, and then combined them by majority voting for improved performance (Table S10). We then extracted all positions in the region [-3, -40] nt of all 3'ss, and vectorized their flanking 13-nt motif into binary code as input for BP prediction.

Intronic data
We obtained human genome sequence and gene annotation data on the hg19/GRCh37 genome assembly from the GENCODE database (11). By focusing on protein-coding genes and transcripts, and requiring the gene/transcript status = 'KNOWN' and the confidence level = "1 or 2", we extracted a total of 43,225 transcripts from 19,149 proteincoding genes. We identified multi-exon transcripts and removed introns that were shorter than 10 nucleotides, thereby obtaining 355,472 introns (200,059 unique introns) from 41,975 transcripts of 17,372 genes ( Figure S6). We tested the genomic overlaps between these 200,059 introns, and identified 41,952 introns with alternative splicing. We also collected 672 and 752 minor introns reported by the IAOD (17) and MIDB (18) databases respectively. We then detected 18 new minor introns by implementing the intron classification criteria proposed by MIDB (Supplemental Data 3). A gene ontology analysis revealed that the genes harboring minor introns were enriched in intracellular transport and ion channels. We also obtained canonical transcript data from the MANE database (19).

Mapping BP to introns
The genomic positions of all 546,559 BP and the genomic ranges of all 200,059 introns were formatted into BED files. We then mapped BP to introns using BEDTools (20), to identify all the pairwise BP-intron associations based on their positional intersection.

Nucleotide composition
We defined the region of union [-9, +3] as the BP motif. We measured the nucleotide frequency of BP, 5'ss and 3'ss motifs in major and minor introns respectively, and plotted them by using SeqLogo (21).

BP-U2/U12 snRNA binding energy
U2/U12-snRNA binds to the [-5, +3] and [-7, +2] regions of BP in major/minor introns respectively, whereas the BP site itself bulges out and is not involved in the interaction with snRNA (1, 22) (Figure 1b). We used the RNAcofold function from ViennaRNA package (23) to estimate the binding energy between BP motifs (excluding the BP sites) and U2/U12 snRNA sequences (U2: AUGAUGUG, U12: AAGGAAUGA), according to the associated intron type. RNAcofold allows the intermolecular base pairing between two RNA sequences to form static interactions, and computes the minimum free energy (MEF), which is always negative or equal to zero, to represent the binding energy (in unit: kcal/mol): a value close to zero denotes unstable binding, whereas a more negative value denotes more stable binding between the BP motif and U2/U12 snRNA.

Motif searching in the region [-50, +20] nt surrounding BP
We searched for enriched motif patterns potentially concealed within the 50-nt upstream regions and 20-nt downstream regions of the BP positions separated by their nucleotides (adenine-BP, cytosine-BP, guanine-BP, and thymine-BP). We performed XSTREME analysis (24) for motif discovery (motif width: 5-10 nt), and reported those enriched motifs having p-values <0.05 and appearing >5% in each of the nucleotide-separated BP datasets. The pvalues were computed using the randomly shuffled nucleotides from the input sequences as the background.

Human population variant data
We obtained human genetic variants from the gnomAD database (25) v3.1, which contained 76,156 WGS datasets on the hg38/GRCh38 genome assembly. We converted the variants' genomic positions from GRCh38 to GRCh37 by using the liftover program from the UCSC Genome Browser (26), to allow a consistent presentation of all the genomic data in this article. By focusing on protein-coding genes, we obtained population variants and their total allele count (AC) and minor allele frequencies (MAF). We categorized variants into singleton (AC = 1), rare (MAF < 1%), and common (MAF ≥ 1%).

Cross-species conservation scores
We obtained the pre-computed genome-wide cross-species conservation scores (GERP and PhyloP-46way) from the UCSC Genome Browser (26). GERP (Genomic Evolutionary Rate Profiling) computes position-specific scores of evolutionary constraint using maximum likelihood evolutionary rate estimation by aligning 35 mammals (27). PhyloP-46way (Phylogenetic P-values) measures the evolutionary base-wise conservation based on the alignment of 46 vertebrates (28). Both scores indicate the strength of purifying selection of a given genomic position: a positive value denotes that the genomic position is likely to be evolutionarily conserved across species, whereas a negative value indicates that the genomic position is probably evolving neutrally.

Variant deleteriousness, mis-splicing prediction scores, and splice site strength
We used CADD v1.6 (29), which predicts the deleteriousness of variants by taking account of an array of nucleotide sequence information and variant annotations (including conservation, amino acid change, epigenetic modification, human population variation, splicing, etc.). We extracted CADD PHRED-scaled scores (ranging from 0 to 99): the larger the score, the higher the probability of deleteriousness. It precomputed and ranked all possible variants in the human genome, and then assigned a score of 10 to the top 10% of predicted deleterious variants, a score of 20 to the top 1% of variants, and a score of 30 to the top 0.1% of variants, etc. Usually, a high-cutoff of 20 or a moderatecutoff of 10 were used for large-scale variant filtration for deleterious candidate variants (30,31). We also recruited SpliceAI (32) and MMSplice (33) to evaluate the mis-splicing prediction scores on BP variants. SpliceAI predicts splice junctions from RNA sequence, by means of a deep neural network model (32). It claimed its capability to recognize cis-acting elements (including BP), and to predict their mutational impact on splicing. SpliceAI provides a score for disrupting the acceptor site (ranging from 0 to 1): the higher the score, the higher the probability of altering the acceptor site. SpliceAI suggested a high-cutoff of 0.8 for high-precision, and also recommended a moderatecutoff of 0.5. MMSplice predicts the effect of variants on splicing, by means of a neural network-based modular modelling on different components of splicing. It claimed to include 50 nt upstream of 3'ss to cover BP regions in training their model. MMSplice computes a score (unspecified range) for acceptor site inclusion (positive score) or exclusion (negative score). MMSplice suggested a cutoff of -2 to be considered as evidence for acceptor site disruption (33). In the study of splice site strength, we used SeqTailor (34) to extract the wild-type and mutated 23nt DNA sequences surrounding the variants of interest, and then used MaxEntScan (35) to estimate the splice site strength in wt and mt sequences respectively.

A cohort of patients with critical COVID-19
In this study, we used the whole-exome sequencing (WES) data from a cohort of 1,035 patients with life-threatening COVID-19, which were recruited through an international consortium -The COVID Human Genetic Effort (36). All human subjects in this study were approved by the appropriate institutional review board.

Exon trapping assay
DNA segments encompassing STAT2 exon 5 and 6 region (chr12:56749479 to chr12:56748872 region, GRCh37 reference) were amplified from genomic DNA extracted from PBMCs of a healthy control and were cloned into a pSPL3 vector, using the EcoRI and BamHI sites. c.472-24 A>T of STAT2 (an intronic variant located in intron 5 and predicted to alter a branchpoint) was generated by site-directed mutagenesis. Plasmids containing wild-type (wt) and mutant STAT2 exon 5 and 6 region were then used to transfect COS-7 cells. After 24 hours, total RNA was extracted and reverse transcribed. cDNA products were amplified using flanking HIV-TAT sequences of the pSPL3 vector, and ligated into the pCRTM4-TOPO® vector (Invitrogen). StellarTM cells (Takara) were transformed with the resulting plasmids. Colony PCR and sequencing using primers located in the flanking HIV-TAT sequences of the pSPL3 were performed to assess the splicing products transcribed by the wt and mutant alleles.

TOPO-TA cloning and RT-qPCR
Total RNA was extracted from a whole blood sample from the patient and a healthy control using Tempus Blood RNA Tube and Tempus Spin RNA Isolation Kit (Applied Biosystems), and reverse transcribed into cDNA using SuperScript III (Invitrogen). For TOPO-TA cloning, specific primers located in exon 3 (forward primer, CATGCTATTCTTCCACTTCTTG) and exon 7-8 boundaries (reverse primer, GGCATCCAGCACCTCCTTTC) were used to amplify STAT2 cDNA by PCR. PCR products were then purified and ligated into a pCRTM4-TOPO® vector (Invitrogen). StellarTM cells (Takara) were transformed with the resulting plasmids. Colony PCR and sequencing using the primers used to amplify STAT2 cDNA were performed to assess the splicing products generated from the wt and mutated alleles. For RT-qPCR, STAT2 mRNAs were quantified using probes Hs01013129_g1 (exons 5-6) and Hs01013130_g1 (exons 6-7; Thermo Fischer Scientific), with the Taqman Gene Expression Assay (Applied Biosystems), and normalized to the expression level of human β-glucuronidase. Results were expressed using the ΔΔCt method, as described by the manufacturer, and the amount of STAT2 canonical transcript (ENST00000314128.9) was estimated based on the TOPO-TA data using the following formula: ΔΔCt x percentage of canonical transcript/percentage of transcripts with canonical exon 5-6 junction for probe Hs01013129_g1 or ΔΔCt x percentage of canonical transcripts/percentage of transcripts with canonical exon 6-7 junction for probe Hs01013130_g1.

A cohort of lymphoma patients
We studied the somatic variants from a cohort of 53 diffuse large B-cell lymphoma patients, whose paired WGS and RNA-seq data from the tumor tissues were also available (37)(38)(39). We focused on a set of 212 genes that are frequently mutated in B-cell lymphomas or known to be important for B-cell lymphomagenesis (37). All human subjects in this study were approved by the appropriate institutional review board.

COSMIC database
We collated the somatic variants documented in the COSMIC database (40) v94, which have been detected in cancer patients from a variety of different sources. We also identified four gene sets of interest, which were associated with cancer formation and progression: 123 tumor suppressor genes, 161 apoptosis genes, 150 DNA repair genes and 714 cell cycle genes, based on the COSMIC (40) and MSigDB (41) databases. We used the following criteria to retain the candidate BP variants: (1) in canonical transcripts; (2) deletions (< 100 nt) or SNVs that remove or disrupt the entire BP motifs or the BP/BP-2 positions; (3) in 3'-proximal introns harboring single or two BP; (4) no or very rare (MAF < 1e-3) population variations; (5) having BPHunter score ≥ 3; and (6) passed the variant quality checking. Figure S1: Splicing mechanisms that use distal BP deep inside an intron. (a) Recursive splicing mechanism in an intron for multi-step intron removal. (b) Stem-loop RNA structure brings distal BP closer to 3'ss. (c) Stochastic splice site selection leads to kinetic variation in intron removal. (d) Mutually exclusive splicing by using BP closer to 5'ss.

Figure S5: Motif searching in the region [-50, +20] nt surrounding BP, in terms of different nucleotides of BP.
A motif was reported to be enriched if it had a p-value < 0.05 and > 5% occurrence in each respective BP dataset. Motif enrichment was only observed in the 50-nt upstream region of BP, whereas no motif was enriched in the 20-nt downstream region of BP. Figure S6: Collation of introns from human protein-coding genes. The reference genome and gene annotation were obtained from the GENCODE database. The introns were classified into major and minor types.

ABCC8
g.17452526T>C IVS11-20A>G Figure S18: Schematic of the wild-type (wt) and mutant (mt) sequences of the eight pathogenic "BP" variants un-detected by BPHunter. BP positions, variant sites, and exon regions are colored in red, yellow, and blue, respectively. The value in the box is the MaxEntScan 3' splice site strengths for the constitutional AGs and the newly created AGs.   The intronic reads harboring the deletion were retained.
The exon skipping was seen in 46 aberrantly spliced junction reads.
Splicing efficiency was significantly reduced, as the splice junction reads (8) were much lower than the adjacent exons (147 and 119).
The intronic reads harboring the variant were retained. The exon skipping was seen in 25 aberrantly spliced junction reads.
Intronic retention in neurons but not in fibroblasts.
The intronic reads harboring the deletion were retained.
The intronic reads harboring the deletion were retained.
Var #8 S12 MTR chr1 237016225 A G snv 5 The intronic reads harboring the variant were retained.
The intronic reads harboring the variant were retained.

SUPPLEMENTAL TABLES
Investigators had (1) one or more families, or a cohort of patients with the same disease; (2) mostly performed targeted sequencing on the known disease-associated genes, whilst a few performed massive parallel sequencing; (3) sometimes failed to detect candidate variants in the coding regions or essential splice sites of the known diseaseassociated genes, and hence had extended the search to intronic variants; or sometimes studied all variants of the known disease-associated genes; (4) found one intronic variant upstream of a 3'ss, sometimes displaying an enrichment or family segregation; (5) identified the variant residing in a region matching the BP consensus sequence (YTNAY, or relaxed TNA), and consequently suspected that the variant might disrupt BP; or in some cases directly suspected the variant might disrupt BP without consensus sequence justification; and (6) performed in vitro expression/functional assays to reveal the mis-splicing consequences of the known disease-associated gene, therefore concluding that the intronic variant in BP sites was disease-causing.