Full-length transcript sequencing of human and mouse cerebral cortex identifies widespread isoform diversity and alternative splicing

Summary

Alternative splicing is a post-transcriptional regulatory mechanism that produces distinct mRNA molecules from a single pre-mRNA and has a prominent role in the development and function of the central nervous system. We used long-read isoform sequencing to generate full-length transcript sequences in the human and mouse cortex. We identify novel transcripts not present in existing genome annotations, including transcripts mapping to putative novel (unannotated) genes and fusion transcripts incorporating exons from multiple genes. Global patterns of transcript diversity are similar between human and mouse cortex, although certain genes are characterized by striking differences between species. We also identify developmental changes in alternative splicing, with differential transcript usage between human fetal and adult cortex. Our data confirm the importance of alternative splicing in the cortex, where it dramatically increases transcriptional diversity and represents an important mechanism underpinning gene regulation in the brain. We provide transcript-level data for human and mouse cortex as a resource to the scientific community.

An overview of the analysis pipeline used to generate full-length transcript annotations in human and mouse cerebral cortex samples, related to STAR Methods. Briefly, polymerase reads from the PacBio Sequel for each dataset were processed using Iso-Seq 3.1.2 and Cupcake scripts to generate high-quality, full-length isoforms. SQANTI2 was used to fully annotate individual isoforms, with comparison to short-read RNA-Seq, ONT nanopore sequencing, and reference annotations. PacBio - Pacific Biosciences, ONT - Oxford Nanopore Technologies.

Figure S2: Saturation is reached across all Iso-Seq datasets at the gene and isoform level, related to STAR Methods and Table 1. We subsampled reads to generate rarefaction curves, using cDNA Cupcake scripts, at the gene and isoform level. Shown are comparisons between A) human (n = 7 biologically independent samples) and mouse cortex (n = 12 biologically independent samples) and B) human adult (n = 4 biologically independent samples) and human fetal cortex (n = 3 biologically independent samples). Also shown are rarefaction curves for each SQANTI2 isoform category in C) human cortex, D) human adult cortex, E) human fetal cortex and F) mouse cortex. FSM - Full Splice Match, ISM - Incomplete Splice Match, NIC - Novel In Catalogue, NNC - Novel Not in Catalogue.
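
The read subsampling behind a rarefaction curve can be sketched as follows. This is a minimal illustration, not the cDNA Cupcake implementation: the function name `rarefaction_curve`, the one-label-per-read input and the toy read counts are all hypothetical.

```python
import random

def rarefaction_curve(read_labels, fractions, seed=0):
    """Return (n_reads_sampled, n_distinct_labels) for each sampling fraction.

    read_labels: one gene or isoform label per full-length read
    (hypothetical input format, chosen for illustration).
    """
    rng = random.Random(seed)
    curve = []
    for frac in fractions:
        n = round(frac * len(read_labels))
        sample = rng.sample(read_labels, n)  # subsample without replacement
        curve.append((n, len(set(sample))))  # distinct genes/isoforms seen
    return curve

# Toy data: three isoforms with very different read support.
reads = ["iso_A"] * 90 + ["iso_B"] * 9 + ["iso_C"] * 1
curve = rarefaction_curve(reads, [0.1, 0.5, 1.0])
```

A curve that flattens as the sampled fraction approaches 1.0 indicates that additional sequencing would recover few further genes or isoforms.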

Figure S3: Consensus distribution of CCS read lengths across all cortical samples, related to Figure 1. Shown are data for CCS reads generated from A) merging human adult and fetal cortex, and for each sample in B) human adult cortex (n = 4 biologically independent samples, n = 5 SMRT cells), C) human fetal cortex (n = 3 biologically independent samples, n = 5 SMRT cells) and D) mouse cortex (n = 12 biologically independent samples, n = 12 SMRT cells). Several of the human adult and human fetal cortex samples were sequenced more than once to maximize coverage (Table S1). The number of CCS reads generated per SMRT cell can be found in Table S3. The distribution of CCS read lengths in human and mouse cortex can be found in Figure 1A. CCS - Circular consensus sequence, SMRT - Single-molecule real-time.

Figure S4: Transcripts identified by Iso-Seq are enriched near CAGE peaks, annotated transcription start sites (TSS) and transcript termination sites (TTS), related to Figure 1. Shown is the distance of transcripts to annotated CAGE peaks for A) all and B) novel transcripts. Also shown is the distance between the 5' end of each transcript and reference TSSs for C) all transcripts and D) novel transcripts. Finally, shown is the distance between the 3' end of each transcript and reference TTSs for E) all transcripts and F) novel transcripts. A negative value for distance to TSS refers to a query start site downstream of the reference, and a negative value for distance to TTS refers to a query end site upstream of the reference. Novel transcripts are classified as NIC, NNC, antisense, genic/genomic, and fusion. TSS - Transcription Start Site, TTS - Transcription Termination Site.
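
The sign convention described for Figure S4 can be made explicit with a short sketch. This is an illustration under stated assumptions, not SQANTI2 code: the function `signed_distances` is hypothetical, coordinates are assumed to satisfy start <= end on the genome, and each query is compared against a single reference transcript.

```python
def signed_distances(strand, query_start, query_end, ref_start, ref_end):
    """Return (dist_to_tss, dist_to_tts) for a query against one reference.

    Negative dist_to_tss: query starts downstream of the reference TSS.
    Negative dist_to_tts: query ends upstream of the reference TTS.
    """
    if strand == "+":
        # 5' end is the lower genomic coordinate
        query_tss, query_tts = query_start, query_end
        ref_tss, ref_tts = ref_start, ref_end
        sign = 1
    elif strand == "-":
        # 5' end is the higher genomic coordinate
        query_tss, query_tts = query_end, query_start
        ref_tss, ref_tts = ref_end, ref_start
        sign = -1
    else:
        raise ValueError("strand must be '+' or '-'")
    dist_to_tss = sign * (ref_tss - query_tss)
    dist_to_tts = sign * (query_tts - ref_tts)
    return dist_to_tss, dist_to_tts

# A query nested inside the reference gives negative values at both ends.
plus_strand = signed_distances("+", 150, 900, 100, 1000)
minus_strand = signed_distances("-", 150, 900, 100, 1000)
```

Under this convention, a transcript that is truncated at either end relative to the reference yields a negative distance regardless of strand.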

Figure S5: Longer genes and those with more exons tend to have a higher number of discrete isoforms, related to Figure 1. The number of detected multi-exonic isoforms in A) human cortex and B) mouse cortex is correlated with gene length (human cortex: Pearson's correlation = 0.19, P = 1.51 × 10^-106; mouse cortex: Pearson's correlation = 0.25, P = 1.33 × 10^-197). A stronger relationship was observed among 'highly-expressed' genes (>2.5 Log10 TPM) in both C) human cortex (Pearson's correlation = 0.49, P = 1.39 × 10^-33) and D) mouse cortex (Pearson's correlation = 0.45, P = 3.56 × 10^-31). The number of detected isoforms was also correlated with the number of exons in E) human cortex (Pearson's correlation = 0.24, P = 7.97 × 10^-155) and F) mouse cortex (Pearson's correlation = 0.24, P = 4.02 × 10^-193). A stronger relationship was observed among 'highly-expressed' genes (>2.5 Log10 TPM) in both G) human cortex (Pearson's correlation = 0.45, P = 7.42 × 10^-28) and H) mouse cortex (Pearson's correlation = 0.49, P = 2.16 × 10^-38). Gene length corresponds to the longest isoform, and density of genes is represented in increasing scale from light green to dark blue. TPM - Transcripts per Million.

Figure 2A. C) Overall Iso-Seq transcript expression of novel and known transcripts and D) different RNA isoform categories. Known transcripts were more highly expressed than novel transcripts in both human (Mann-Whitney-Wilcoxon test, W = 1.62 × 10^8, P < 2.23 × 10^-308) and mouse cortex (Mann-Whitney-Wilcoxon test, W = 3.66 × 10^8, P < 2.23 × 10^-308). E, F) Transcript length and G, H) number of exons for novel and known transcripts of annotated genes, further stratified by RNA isoform category. Novel transcripts were longer (human cortex: Mann-Whitney-Wilcoxon test, W = 1.10 × 10^8, P = 4.04 × 10^-25; mouse cortex: Mann-Whitney-Wilcoxon test, W = 2.37 × 10^8, P = 2.13 × 10^-42) and had more exons (human cortex: W = 8.83 × 10^7, P < 2.23 × 10^-308; mouse cortex: W = 1.94 × 10^8, P < 2.23 × 10^-308) than known transcripts.

Figure S10: Long-read Iso-Seq data can be used to accurately quantify levels of gene expression in the human cortex, related to STAR Methods. Shown is the relationship between expression estimated using RNA-Seq and Iso-Seq at the A) gene level (n = 9,223 genes, Pearson's correlation = 0.58, P < 2.23 × 10^-308) and B) transcript level (n = 17,583 transcripts, corr = 0.40, P < 2.23 × 10^-308) in the human cortex (data derived from three biologically independent fetal samples). Also shown is the same relationship at the C) gene level (n = 13,923 genes, corr = 0.71, P < 2.23 × 10^-308) and D) transcript level (n = 41,488 transcripts, corr = 0.48, P < 2.23 × 10^-308) for mouse cortex (n = 12 biologically independent samples). RNA-Seq gene expression was determined after aligning short-read RNA-Seq to the Iso-Seq transcriptome. Iso-Seq gene expression was determined from the sum of full-length, multi-exonic transcript reads associated with each gene, with TPM values calculated by dividing the number of full-length reads per gene by the total number of full-length reads and multiplying by one million. The density of values is represented in increasing scale from light green to dark blue. E) The number of ERCC spike-in fragments detected compared to the amount used in our mouse cortex Iso-Seq libraries and F) the relationship between the amount of ERCC used and the number of full-length reads identified (Pearson's correlation = 0.98, P = 1.42 × 10^-41). There is a near-perfect correlation between full-length reads associated with ERCC spike-in fragments and the actual amount of control used.
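
The Iso-Seq TPM calculation described for Figure S10 amounts to a simple normalization, sketched below. The function name `isoseq_gene_tpm`, the gene labels and the read counts are hypothetical and serve only to illustrate the arithmetic.

```python
def isoseq_gene_tpm(fl_reads_per_gene):
    """Convert full-length read counts per gene to TPM.

    TPM = (full-length reads for the gene / total full-length reads) * 1e6,
    as described in the legend; the input dictionary is a made-up example.
    """
    total = sum(fl_reads_per_gene.values())
    if total == 0:
        raise ValueError("no full-length reads to normalize")
    # Multiply before dividing to keep the arithmetic exact for small integers.
    return {gene: count * 1_000_000 / total
            for gene, count in fl_reads_per_gene.items()}

# Hypothetical counts of full-length, multi-exonic reads summed per gene.
tpm = isoseq_gene_tpm({"GENE_A": 750, "GENE_B": 200, "GENE_C": 50})
```

Because the values are scaled by the total number of full-length reads, the TPM values across all genes in a dataset sum to one million.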

Figure S11: Transcripts from an RNA-Seq defined transcriptome were less supported by CAGE peaks, and likely represent incomplete fragments of transcripts identified using Iso-Seq reads, related to STAR Methods. Shown is a comparison of A) the distribution of the number of isoforms per gene, B) the proportion of isoforms annotated within 50 bp of a CAGE peak, and C) the classification of isoforms using SQANTI categories, between Iso-Seq defined and RNA-Seq defined transcriptomes generated on the mouse cortex. The RNA-Seq defined transcriptome was generated using a reference-guided assembly of RNA-Seq reads (n = 12 biologically independent samples) with StringTie.

Table 1. LncRNA transcripts were found to be shorter in both A) human cortex (Mann-Whitney-Wilcoxon test, W = 2.28 × 10^7, P = 3.22 × 10^-34) and B) mouse cortex (Mann-Whitney-Wilcoxon test, W = 3.52 × 10^7, P = 8.24 × 10^-98). They also contained fewer exons in both C) human cortex (Mann-Whitney-Wilcoxon test, W = 3.31 × 10^7, P < 2.23 × 10^-308) and D) mouse cortex (Mann-Whitney-Wilcoxon test, W = 4.56 × 10^7, P < 2.23 × 10^-308). They were also characterized by lower transcript expression than non-lncRNA transcripts in E) human (Mann-Whitney-Wilcoxon test, W = 2.27 × 10^7, P = 9.44 × 10^-35) and F) mouse cortex (Mann-Whitney-Wilcoxon test, W = 3.16 × 10^7, P = 5.67 × 10^-40). Finally, they showed lower isoform diversity in G) human (Mann-Whitney-Wilcoxon test, W = 6.63 × 10^6, P = 1.21 × 10^-80) and H) mouse cortex (Mann-Whitney-Wilcoxon test, W = 7.40 × 10^6, P = 5.76 × 10^-107). LncRNA - long non-coding RNA.

Figure S15: Intron retention and NMD in the human and mouse cortex, related to Figure 5. Shown is the number of genes with intron-retained transcripts comparing A) human and mouse cortex and B) human adult and human fetal cortex. NMD is particularly enriched amongst transcripts with intron retention, as shown by the overlap of genes with IR transcripts, NMD transcripts, and transcripts with both IR and NMD in C) human cortex, D) human adult cortex, E) human fetal cortex and F) mouse cortex. Genes containing both IR and NMD transcripts were further classified into genes that contain transcripts that were both IR and NMD (purple) and genes that contain transcripts where IR and NMD were mutually exclusive (dark orange). G) A larger proportion of lowly expressed genes showed evidence for IR than highly expressed genes in both human (< 2.5 Log10 TPM, n = 2,269 (88.4%) genes; > 2.5 Log10 TPM, n = 297 (11.6%) genes) and mouse (< 2.5 Log10 TPM, n = 3,039 (90.04%) genes; > 2.5 Log10 TPM, n = 336 (9.96%) genes). IR - Intron retention, NMD - Nonsense-mediated mRNA decay, TPM - Transcripts per Million.

transcripts in human fetal brain regions, shaded by the number of full-length reads. Differential transcript usage was observed in RTN4, with one isoform (boxed in red - RTN4.1, red arrow) strongly expressed in fetal cortex while downregulated in adult cortex, and another isoform (boxed in green - RTN4.14, green arrow) strongly expressed only in adult cortex. Differential transcript usage was observed in APLP1, with one transcript (boxed in red - APLP1.20, ENST00000586861.5) strongly expressed in fetal hippocampus (red arrow) while not detected in fetal cortex, and another novel isoform (boxed in green - APLP1.9) strongly expressed in fetal cortex (green arrow).

Table S8: Examples of proteomic support for novel transcripts identified using Iso-Seq, related to Figures 2 and 5. We identified five novel peptides, each mapping uniquely to a single novel transcript, providing evidence for the stable translation of novel isoforms in the human cortex. PB.ID refers to the specific Iso-Seq transcript supported.
Peptide support for novel exon inclusion in VTI1A is illustrated in Figure 2E, intron retention in RGS11 in Figure 5F, and novel exon skipping in RELCH in Figure 5G.

Table S12: Alternative splicing events observed in human and mouse cortex, related to Figure 5. Tabulated are the A) number of splicing events and B) number of genes observed with those splicing events, in human cortex (n = 7 biologically independent samples), human adult cortex (n = 4 biologically independent samples), human fetal cortex (n = 3 biologically independent samples), and mouse cortex (n = 12 biologically independent samples). Of note, a single gene can be characterised by multiple splicing events and can thus appear more than once in B). The percentages refer to the proportion of the total number of detected genes. A combination of the SUPPA2 package and custom analysis scripts was used to identify transcripts associated with i) exon skipping (SE), ii) mutually exclusive exon use (MX), iii) alternative first (AF) and last (AL) exons, iv) alternative 3' and 5' splice sites, and v) intron retention (IR).

Table S19: Summary of transcripts mapping to disease-associated genes in human and mouse cortex, related to Figure 4. Isoform diversity was assessed in genes robustly associated with autism (393 genes nominated as being category 1 (high confidence) and category 2 (strong candidate) from the SFARI Gene database, https://gene.sfari.org/), Alzheimer's disease (three familial AD genes and 59 genes nominated from the most recent GWAS meta-analysis) and schizophrenia (SZ) (339 genes nominated from the most recent GWAS meta-analysis). AD - Alzheimer's disease, SZ - Schizophrenia, IR - Intron retention, NMD - Nonsense-mediated mRNA decay, FSM - Full Splice Match, ISM - Incomplete Splice Match, NIC - Novel In Catalogue, NNC - Novel Not in Catalogue.

Table S22: Determining a common high gene expression threshold between human and mouse, related to STAR Methods.
A high gene expression threshold was applied in several analyses to examine the relationship between isoform number and gene length and between isoform number and exon number among highly expressed genes (Figure S5C, S5D, S5G and S5H), and to investigate whether the intron retention rate differed between highly expressed and lowly expressed genes (Figure S15G). A gene expression cut-off was sequentially applied to both the human and mouse cortex Iso-Seq datasets, and the numbers of isoforms of the filtered genes were then correlated. The threshold was then set at the expression level at which the number of isoforms for commonly expressed genes was most strongly correlated between human and mouse while a sizeable number of genes still surpassed it: in this case, 2.5 Log10 TPM. Of note, a filtered gene could have an expression surpassing the threshold in mouse but not in human, and vice versa. TPM - Transcripts per Million.
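
The threshold scan described for Table S22 can be sketched as follows. This is a minimal illustration rather than the analysis script used here: the function names, the `min_genes` parameter and the toy values are hypothetical, and the rule of keeping a gene when it passes the cut-off in at least one species is an assumption based on the note that a gene may surpass the threshold in mouse but not in human, and vice versa.

```python
import math

def pearson(xs, ys):
    """Pearson's correlation; returns NaN if either variable is constant."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return math.nan
    return cov / math.sqrt(var_x * var_y)

def scan_thresholds(human, mouse, cutoffs, min_genes=3):
    """human, mouse: dict of gene -> (log10_tpm, n_isoforms).

    Returns (cutoff, n_genes_kept, correlation) for each cut-off that
    retains at least min_genes genes (min_genes is a hypothetical guard).
    """
    shared = sorted(set(human) & set(mouse))  # commonly expressed genes
    results = []
    for cutoff in cutoffs:
        # Assumption: keep a gene if it passes the cut-off in either species.
        kept = [g for g in shared
                if human[g][0] > cutoff or mouse[g][0] > cutoff]
        if len(kept) < min_genes:
            continue
        r = pearson([human[g][1] for g in kept],
                    [mouse[g][1] for g in kept])
        results.append((cutoff, len(kept), r))
    return results

# Toy values only; they are not taken from the datasets.
human = {"g1": (1.0, 2), "g2": (2.0, 5), "g3": (2.8, 8),
         "g4": (3.0, 10), "g5": (3.2, 12), "g6": (0.5, 9)}
mouse = {"g1": (1.2, 7), "g2": (2.1, 1), "g3": (2.7, 7),
         "g4": (3.1, 11), "g5": (3.0, 13), "g6": (0.4, 1)}
results = scan_thresholds(human, mouse, [0.0, 2.5, 3.1])
best = max(results, key=lambda row: row[2])
```

The `min_genes` guard stands in for the requirement that enough genes still surpass the threshold; in practice that trade-off would be judged against the full distribution of gene expression rather than a fixed count.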