SURVIV for survival analysis of mRNA isoform variation

The rapid accumulation of clinical RNA-seq data sets has provided the opportunity to associate mRNA isoform variations to clinical outcomes. Here we report a statistical method SURVIV (Survival analysis of mRNA Isoform Variation), designed for identifying mRNA isoform variation associated with patient survival time. A unique feature and major strength of SURVIV is that it models the measurement uncertainty of mRNA isoform ratio in RNA-seq data. Simulation studies suggest that SURVIV outperforms the conventional Cox regression survival analysis, especially for data sets with modest sequencing depth. We applied SURVIV to TCGA RNA-seq data of invasive ductal carcinoma as well as five additional cancer types. Alternative splicing-based survival predictors consistently outperform gene expression-based survival predictors, and the integration of clinical, gene expression and alternative splicing profiles leads to the best survival prediction. We anticipate that SURVIV will have broad utilities for analysing diverse types of mRNA isoform variation in large-scale clinical RNA-seq projects.

ukaryotic cells generate remarkable regulatory and functional complexity from a finite set of genes. Production of mRNA isoforms through alternative processing and modification of RNA is essential for generating this complexity. A prevalent mechanism for producing mRNA isoforms is the alternative splicing of precursor mRNA 1 . Over 95% of the multiexon human genes undergo alternative splicing 2,3 , resulting in an enormous level of plasticity in the regulation of gene function and protein diversity. In the last decade, extensive genomic and functional studies have firmly established the critical role of alternative splicing in cancer [4][5][6] . Alternative splicing is involved in a full spectrum of oncogenic processes including cell proliferation, apoptosis, hypoxia, angiogenesis, immune escape and metastasis 7,8 . These cancer-associated alternative splicing patterns are not merely the consequences of disrupted gene regulation in cancer but in numerous instances actively contribute to cancer development and progression. For example, alternative splicing of genes encoding the Bcl-2 family of apoptosis regulators generates both anti-apoptotic and pro-apoptotic protein isoforms 9 . Alternative splicing of the pyruvate kinase M (PKM) gene has a significant impact on cancer cell metabolism and tumour growth 10 . A transcriptomewide switch of the alternative splicing programme during the epithelial-mesenchymal transition plays an important role in cancer cell invasion and metastasis 11,12 .
RNA sequencing (RNA-seq) has become a popular and costeffective technology to study transcriptome regulation and mRNA isoform variation 13,14 . As the cost of RNA-seq continues to decline, it has been widely adopted in large-scale clinical transcriptome projects, especially for profiling transcriptome changes in cancer. For example, as of April 2015 The Cancer Genome Atlas (TCGA) consortium had generated RNA-seq data on over 11,000 cancer patient specimens from 34 different cancer types. Within the TCGA data, breast invasive carcinoma (BRCA) has the largest sample size of RNA-seq data covering over 1,000 patients, and clinical information such as survival times, tumour stages and histological subtypes is available for the majority of the BRCA patients 15 . Moreover, the median follow-up time of BRCA patients is B400 days, and 25% of the patients have more than 1,200 days of follow-up. Collectively, the large sample size and long follow-up time of the TCGA BRCA data set allow us to correlate genomic and transcriptomic profiles to clinical outcomes and patient survival times.
To date, systematic analyses have been performed to reveal the association between copy number variation, DNA methylation, gene expression and microRNA expression profiles with cancer patient survival 16,17 . By contrast, despite the importance of mRNA isoform variation and alternative splicing, there have been limited efforts in transcriptome-wide survival analysis of alternative splicing in cancer patients. Most RNA-seq studies of alternative splicing in cancer transcriptomes focus on identifying 'cancer-specific' alternative splicing events by comparing cancer tissues with normal controls (see refs 18-23 for examples). A recent analysis of TCGA RNA-seq data identified 163 recurrent differential alternative splicing events between cancer and normal tissues of three cancer types, among which five were found to have suggestive survival signals for breast cancer at a nominal P-value cutoff of 0.05 (ref. 21). Some other studies reported a significant survival difference between cancer patient subgroups after stratifying patients with overall mRNA isoform expression profiles 24,25 . However, systematic cancer survival analyses of alternative splicing at the individual exon resolution have been lacking. Two main challenges exist for survival analyses of mRNA isoform variation and alternative splicing using RNA-seq data. The first challenge is to account for the estimation uncertainty of mRNA isoform ratios inferred from RNA-seq read counts. The statistical confidence of mRNA isoform ratio estimation depends on the RNA-seq read coverage for the events of interest, with larger read coverage leading to a more reliable estimation 14 .
Modelling the estimation uncertainty of mRNA isoform ratio is an essential component of RNA-seq analyses of alternative splicing, as shown by various statistical algorithms developed for detecting differential alternative splicing from multi-group RNA-seq data 14,[26][27][28][29] . The second challenge, which is a general issue in survival analysis, is to properly model the association of mRNA isoform ratio with survival time, while accounting for missing data in survival time because of censoring, that is, patients still alive at the end of the survival study, whose precise survival time would be uncertain. To date, no algorithm has been developed for survival analyses of mRNA isoform variation that accounts for these sources of uncertainty simultaneously.
Here we introduce SURVIV (Survival analysis of mRNA Isoform Variation), a statistical model for identifying mRNA isoform ratios associated with patient survival times in large-scale cancer RNA-seq data sets. SURVIV models the estimation uncertainty of mRNA isoform ratios in RNA-seq data and tests the survival effects of isoform variation in both censored and uncensored survival data. In simulation studies, SURVIV consistently outperforms the conventional Cox regression survival analysis that ignores the measurement uncertainty of mRNA isoform ratio. We used SURVIV to identify alternatively spliced exons whose exon-inclusion levels significantly correlated with the survival times of invasive ductal carcinoma (IDC) patients from the TCGA breast cancer cohort. Survival-associated alternative splicing events are identified in gene pathways associated with apoptosis, oxidative stress and DNA damage repair. Importantly, we show that alternative splicing-based survival predictors outperform gene expression-based survival predictors in the TCGA IDC RNA-seq data set, as well as in TCGA data of five additional cancer types. Moreover, the integration of clinical information, gene expression and alternative splicing profiles leads to the best prediction of survival time.

Results
SURVIV statistical model. The statistical model of SURVIV assesses the association between mRNA isoform ratio and patient survival time. While the model is generic for many types of alternative isoform variation, here we use the exon-skipping type of alternative splicing to illustrate the model (Fig. 1a). For each alternative exon involved in exon-skipping, we can use the RNA-seq reads mapping to its exon-inclusion or -skipping isoform to estimate its exon-inclusion level (denoted as c, or PSI that is Per cent Spliced In 14 ). A key feature of SURVIV is that it models the RNA-seq estimation uncertainty of exon-inclusion level as influenced by the sequencing coverage for the alternative splicing event of interest. This is a critical issue in accurate quantitative analyses of mRNA isoform ratio in large-scale RNA-seq data sets 14,[26][27][28][29] . Therefore, SURVIV contains two major components: the first to model the association of mRNA isoform ratio with patient survival time across all patients, and the second to model the estimation uncertainty of mRNA isoform ratio in each individual patient (Fig. 1a).
Briefly, for any individual exon-skipping event, the first component of SURVIV uses a proportional hazards model to establish the relationship between patient k's exon-inclusion level c k and hazard rate l k (t).
For each exon, the association between the exon-inclusion level and patient survival time is reflected by the survival coefficient b.
A positive b means increased exon inclusion is associated with higher hazard rate and poorer survival, while a negative b means increased exon inclusion is associated with lower hazard rate and better survival. l 0 (t) is the baseline hazard rate estimated from the survival data of all patients (see Supplementary Methods for the detailed estimation procedure). A particular patient's survival probability over time S k (t) can be calculated from the patientspecific hazard rate l k (t) as S k ðtÞ ¼ expð À R t 0 l k ðtÞdtÞ. Figure 1b illustrates a simple example with a negative b ¼ À 1 and a constant baseline hazard rate l 0 (t) ¼ 1, where higher exoninclusion levels are associated with lower hazard rates and higher survival probabilities.
The second component of SURVIV models the exon-inclusion level and its estimation uncertainty in individual patient samples. As illustrated in Fig. 1c, the exon-inclusion level c k of a given exon in a particular sample can be estimated by the RNA-seq read count specific to the exon inclusion isoform (IC k ) and the exonskipping isoform (SC k ). Other types of alternative splicing and mRNA isoform variation can be similarly modelled by this framework 29 . Given the effective lengths (that is, the number of unique isoform-specific read positions) of the exon-inclusion isoform (l I ) and the exon-skipping isoform (l S ), the exoninclusion level c k can be estimated asĉ k ¼ Assuming that the exon-inclusion read count IC k follows a binomial distribution with the total read count n k ¼ IC k þ SC k , we have: The binomial distribution models the estimation uncertainty of c k as influenced by the total read count n k , in which the parameter p k represents the proportion of reads from the exoninclusion isoform, given the exon-inclusion level c k adjusted by a length normalization function f(c k ) based on the effective lengths of the isoforms. The definitions of effective lengths for all basic types of alternative splicing patterns are described in ref. 29. Distinct from conventional survival analyses in which predictors do not have estimation uncertainty, the predictors in SURVIV are exon-inclusion levels c k estimated from RNA-seq count data, and the confidence of c k estimate for a given exon in a particular sample depends on the RNA-seq read coverage. We use the statistical framework of survival measurement error model 30 to incorporate the estimation uncertainty of isoform ratio in the proportional hazards model. Using a likelihood ratio test, we test whether the exon-inclusion levels have a significant association with patient survival over the null hypothesis H 0 :b ¼ 0. The false discovery rate (FDR) is estimated using the Benjamini and Hochberg approach 31 . Details of the parameter estimation and likelihood ratio test in SURVIV are described in Supplementary Methods.
Simulation studies of SURVIV. We designed a set of simulation studies to compare the performance of SURVIV with a conventional Cox regression survival analysis using the point estimates of exon-inclusion levels. To recapitulate the features of real data sets, we mimicked the parameters of the TCGA IDC breast cancer data wherever possible in generating the simulated data. Specifically, we simulated 20,000 alternative exons in each simulation (similar to the number of exon-skipping events in the TCGA IDC data), with 90% of the exons from the null hypothesis that the exons were not associated with survival time (b ¼ 0), and the remaining 10% of the exons from the alternative hypothesis that the exons were associated with survival time. The survival coefficient b of exons representing the alternative hypothesis was randomly sampled from the top 100 significant exons based on SURVIV analysis of all IDC patients (see Supplementary Data 1). We simulated a data set of 600 individuals, which was close to the sample size of the TCGA IDC data set. To study the effect of RNA-seq coverage, we simulated the total number of exoninclusion plus exon-skipping splice junction reads (n) with mean value ( n) of 5, 10, 20, 50, 80 and 100 (Fig. 2a). The inset figure in Fig. 2a shows the cumulative distribution of the mean splice junction read counts for individual exon-skipping events averaged across all patients in the TCGA IDC RNA-seq data. The read counts for a given exon in individual patients were then simulated using a Gamma-Poisson model with the variance of read counts estimated from the TCGA IDC data. Details about the simulation procedures are described in the Methods section. The exon-inclusion reads IC k are the reads from the upstream splice junction, the alternative exon itself and the downstream splice junction. The exon-skipping reads SC k are the reads from the skipping splice junction that directly connects the upstream exon to the downstream exon.
NATURE COMMUNICATIONS | DOI: 10.1038/ncomms11548 ARTICLE Using these simulated data, we compared SURVIV with Cox regression in two settings, without or with censoring of the survival time. In the setting without censoring, the death and survival time of each individual is known. In the setting with censoring, certain individuals are still alive at the end of the survival study. Consequently, these patients have unknown death and survival time. Here, in the simulation with censoring, we assumed that 85% of the patients were still alive at the end of the study, similar to the censoring rate of the TCGA IDC data set. In both settings and with different depths of RNA-seq coverage, SURVIV consistently outperformed Cox regression in the truepositive rate at the same false-positive rate of 5% (Fig. 2a). As expected, we observed a more significant improvement in SURVIV over Cox regression when the RNA-seq read coverage was low (Fig. 2a).
To more faithfully recapitulate the read count distribution in a real cancer RNA-seq data set, we performed another simulation study with read counts directly sampled from the TCGA IDC data. To assess the influence of RNA-seq read depth on the performance of SURVIV and Cox regression, sampled read counts were then multiplied by different factors ranging from 10 to 300% to simulate data sets with different RNA-seq read depths (Fig. 2b). The TCGA IDC data set has an average RNA-seq depth of B60 million paired-end reads per patient. Thus, the read depth of these simulated RNA-seq data sets ranged from B6 million reads to 180 million reads per patient, representing low-coverage RNA-seq studies designed primarily for gene expression analysis 32 up to high-coverage RNA-seq studies designed primarily for alternative isoform analysis 29 . At all levels of RNA-seq depth, SURVIV consistently outperformed Cox regression, as reflected by the area under curve of the receiver operating characteristic (ROC) curve as well as the true-positive rate at 5% false-positive rate (Fig. 2b). The improvement of SURVIV over Cox regression was particularly prominent when the read depth was low. For example, at 10% read depth, SURVIV had 7% improvement in area under curve (68% versus 61%) and 8% improvement in the true-positive rate at 5% false-positive rate (46% versus 38%). Collectively, these simulation results suggest that SURVIV achieves a higher accuracy by accounting for the estimation uncertainty of mRNA isoform ratio in RNA-seq data.
SURVIV analysis of TCGA IDC breast cancer data. To illustrate the practical utility of SURVIV, we used it to analyse the overall survival time of 682 IDC patients from the TCGA breast cancer (BRCA) RNA-seq data set (see Methods for details of the data source and processing pipeline). We chose to analyse IDC because it is the most frequent type of breast cancer 33 , comprising B70% of patients in the TCGA breast cancer data set. To control for the effects of significant clinical parameters such as tumour stage and subtype and identify alternative splicing events associated with patient outcomes across multiple molecular and clinical subtypes, we followed the procedure of Croce and colleagues in analysing mRNA and microRNA prognostic signature of IDC 33 and stratified the patients according to their clinical parameters. We then conducted SURVIV analysis in 26 clinical subgroups with at least 50 patients in each subgroup. We identified 229 exon-skipping events associated with patient survival in multiple clinical subgroups that met the criteria of SURVIV P-valuer0.01 in at least two subgroups of the same clinical parameter (cancer subtype, stage, lymph node, metastasis, tumour size, oestrogen receptor status, progesterone receptor status, HER2 status and age as shown in Fig. 3). DAVID (Database for Annotation, Visualization and Integrated Discovery) Gene Ontology analyses 34 of the 229 alternative splicing events suggest an enrichment of genes in cancer-related functional categories such as intracellular signalling, apoptosis, oxidative stress and response to DNA damage ( Supplementary  Fig. 1). Table 1 shows a few selected examples of survivalassociated alternative splicing events in cancer-related genes. Using two-means clustering of each individual exon's inclusion levels, the 682 IDC patients can be segregated into two subgroups with significantly different survival times as illustrated by the Kaplan-Meier survival plot (Fig. 4). We also carried out hierarchical clustering of IDC patients using 176 survival- We compared our SURVIV model with Cox regression using point estimates of exon-inclusion levels, which does not consider the estimation uncertainty of the mRNA isoform ratio. (a) To study the effect of RNA-seq depth, we simulated the mean total splice junction read counts equal to 5, 10, 20, 50, 80 and 100 reads. We generated two sets of simulations with and without data-censoring. For each simulation, the true-positive rate (TPR) at 5% false-positive rate is plotted. The inset figure shows the empirical distribution of the mean total splice junction read counts in the TCGA IDC RNA-seq data (x axis in the log10 scale). (b) To faithfully represent the read count distribution in a real data set, we performed another simulation with read counts directly sampled from the TCGA IDC data. Sampled read counts were then multiplied by different factors ranging from 10 to 300% to simulate data sets with different RNA-seq read depth.  Fig. 2).
As an example of a survival-associated alternative splicing event with potential functional significance, an exon-skipping event of STAT5A (signal transducer and activator of transcription 5a) was found to be significantly associated with IDC patient survival (Fig. 5). STAT5A is a signal transducer and transcriptional activator involved in the programmed cell death 35 . It is activated by a number of cytokines. Once activated, STAT5A enters the nucleus and functions as a transcription factor 36 . The activation of STAT5A is important to both cell growth and differentiation. The alternatively spliced STAT5A mRNA isoform lacking exon 5 encodes a protein variant (DEx5 isoform; see Fig. 5a) that inhibits the production of p21 and Bax and increases cell number 37 . The STAT5A DEx5 protein isoform was found to have only 10% transcriptional regulatory activity as compared with the full-length STAT5A (ref. 37). We found that the inclusion levels of STAT5A exon 5 had a significant positive association with IDC patient survival (SURVIV P ¼ 6.8e À 4; Fig. 5b). When we clustered the 682 IDC patients into two subgroups according to the inclusion levels of STAT5A exon 5, the first group of 420 patients with higher exon 5 inclusion levels (B95% on average) had significantly longer survival times as compared with the second group of 262 patients with lower exon 5 inclusion levels (B85% on average; Fig. 5b,c). We also analysed STAT5A exon 5 in a separate set of TCGA breast cancer RNA-seq data with tumour-normal-matched pairs. Consistent with higher exon 5 inclusion levels associated with more favourable patient outcomes, normal breast tissues had 95% exon 5 inclusion levels as compared with 91% exon-inclusion levels in tumour samples (Fig. 5d). This example illustrates that the survival analysis of mRNA isoform variation can identify alternative splicing events with interesting functional implications in cancer development and progression.  Network of survival-associated alternative splicing events. Next, we characterized the splicing-regulatory network of the 229 alternative splicing events associated with IDC patient survival. Specifically, we asked whether a significant fraction of these alternative splicing events were potentially regulated by a few key splicing factors with altered expression in IDC breast cancer tissues. Splicing factors are RNA-binding proteins that recognize cis-regulatory elements within the pre-mRNA to influence exon selection and splice site choice. Using gene expression levels calculated from the TCGA IDC RNA-seq data, we identified six splicing factors whose gene expression levels were significantly associated with IDC patient survival time (Cox regression FDRr1%, Supplementary Data 2). To identify potential exon targets of these six splicing factors among the 229 survivalassociated alternative splicing events, we performed a gene-exon co-expression network analysis. Specifically, we used robust regression (R package 'robust') to test the correlation between the gene expression levels of the splicing factors and the exoninclusion levels of survival-associated alternative splicing events. Each of the six splicing factors had 41-61 significantly correlated exons among the 229 alternative splicing events (robust regression FDRr1%). We then investigated whether the correlated exons of a given splicing factor were significantly enriched among the 229 survival-associated alternative splicing events. Among the six splicing factors, three (TRA2B, HNRNPH1 and SFRS3) had significant enrichment in the co-expression network of alternative splicing events (Fisher's exact test FDRr1%, see detailed data in Supplementary Data 2). As illustrated in Fig. 6a-c, for each of the three splicing factors, the gene expression levels were significantly correlated with patient survival times, with higher gene expression associated with shorter survival time and poorer prognosis. The expression levels of these splicing factors were strongly correlated or anticorrelated with the exon-inclusion levels of survival-associated alternative splicing events in IDC RNA-seq data. For example, DHX30 is a member of the DEAD-box proteins involved in RNA metabolism 38 . Exon 4 of DHX30 was negatively correlated with splicing factor TRA2B (robust correlation P-value ¼ 1.2e À 17, r ¼ À 0.26, Fig. 6d). Likewise, exon 17 of MAP3K4 was positively correlated with splicing factor HNRNPH1 (robust correlation P-value ¼ 2.6e À 06, r ¼ 0.16, Fig. 6e). MAP3K4 is a component of the mitogen-activated protein kinase pathway, which plays critical roles in cancer development 39 . Collectively, these three splicing factors were significantly correlated with 37% (84 out of 229) of the survivalassociated alternative splicing events, thus placing them at the 'hubs' of the survival-associated splicing regulatory network in IDC (Fig. 6f). A total of 165 significant correlations existed among the 84 survival-associated alternative exons and the three splicing factors in this network (Fig. 6f). As the expression levels of all three splicing factors were negatively associated with survival, exons positively associated with these splicing factors (blue lines) should have negative association between exon-inclusion levels and patient survival times (red dots), while exons negatively associated with these splicing factors (red lines) should have positive association between exon-inclusion levels and patient survival times (blue dots). The majority of the 165 significant correlations had the expected correlation directions between the splicing factor expression level and exon-inclusion level (135/165 ¼ 82%, two-sided binomial test Po2.2e À 16). Our combined survival and co-expression network analysis identified splicing factors known to be dysregulated in breast cancer. For example, TRA2B is known to have elevated expression in breast cancers compared with normal breast tissues, and it regulates the alternative splicing of a number of genes with key functions in cancer development [40][41][42][43] . The TRA2B target genes encode key proteins involved in cell proliferation, survival and migration 44 . The splicing factor HNRNPH1 was reported to inhibit apoptosis in cancer cells partially by regulating the A-Raf kinase in the mitogen-activated protein kinase pathway 45 .
Alternative splicing predictors of cancer patient survival. Finally, we investigated the prognostic value of alternative splicing as predictors of IDC patient survival. Recent work by the TCGA consortium evaluated the clinical utilities of diverse molecular data (for example, copy number variation, DNA methylation, mRNA and microRNA expression and protein expression) as well as clinical parameters in predicting cancer patient survival 16 . Here we asked whether the use of alternative splicing information can lead to comparable or improved accuracy in predicting patient survival. To assess the clinical utility of alternative splicing-based survival predictors in an unbiased manner, we performed a two-fold Monte Carlo cross-validation analysis using the TCGA IDC RNA-seq data (see details in Methods). Briefly, in each round of cross-validation, the IDC patients were randomly split into a training group of 50% patients and a separate testing group of the remaining 50% patients. We used SURVIV to identify survival-associated alternative splicing events in the training group with a loose P-value threshold of r0.01. Then, a penalized Cox regression of L1 penalty was used to build the predictive model combining the individual alternative splicing events identified by SURVIV. We then applied the predictive model built from the training group to predict the survival outcome in the testing group. The crossvalidation was repeated for 100 times. The discriminatory power of the survival predictors was measured by the C-index 46 . A C-index of 1 indicates perfect prediction accuracy and a C-index of 0.5 indicates random guess. Besides alternative splicing-based predictors, we also built other predictive models based on clinical parameters or gene expression profiles. As shown in Fig. 7, both gene expression and alternative splicing-based predictors performed better than clinical parameters alone (gene versus clinical P ¼ 1.5e À 3; splicing versus clinical P ¼ 4.7e À 4; twosided Wilcoxon test). We also observed that alternative splicing predictors alone performed better than gene expression predictors (splicing versus gene P ¼ 4.3e À 2, two-sided Wilcoxon test). Adding clinical parameters to the predictive model collectively improved the performance of both gene expression and alternative splicing-based predictors, while the combined model of alternative splicing plus clinical parameters outperformed the combined model of gene expression plus clinical parameters (P ¼ 7.4e À 9, two-sided Wilcoxon test). We also evaluated the combined effect of incorporating clinical, gene expression and alternative splicing data. A predictive model incorporating all three types of information performed significantly better than any one type of predictors alone (clinical þ gene þ splicing versus clinical Po2.2e À 16; versus gene Po2.2e À 16; versus splicing P ¼ 6.0e À 8, two-sided Wilcoxon test). The improved prediction accuracy was not caused by over-fitting since the training and testing data were completely separated during cross-validation. This result illustrates the potential value of utilizing alternative splicing information for improved prognosis of cancer patients.
Next, we carried out the SURVIV analysis in five additional cancer types in TCGA, including GBM (glioblastoma multiforme), KIRC (kidney renal clear cell carcinoma), LGG (lower grade glioma), LUSC (lung squamous cell carcinoma) and OV (ovarian serous cystadenocarcinoma). As expected, the number of significant events at different FDR or P-value significance cutoffs varied across cancer types, with LGG having the strongest survival-associated alternative splicing signals with 660 significant exon-skipping events at FDRr5% (Supplementary Data 3 and 4). Strikingly, regardless of the number of significant events, alternative splicing-based survival predictors outperformed gene expression-based survival predictors across all cancer types ( Supplementary Fig. 3), consistent with our initial observation on the IDC data set.

Discussion
Alternative processing and modification of mRNA, such as alternative splicing, allow cells to generate a large number of mRNA and protein isoforms with diverse regulatory and functional properties. The plasticity of alternative splicing is often exploited by cancer cells to produce isoform switches that promote cancer cell survival, proliferation and metastasis 7,8 . The widespread use of RNA-seq in cancer transcriptome studies 15,47,48 has provided the opportunity to comprehensively elucidate the landscape of alternative splicing in cancer tissues. While existing studies of alternative splicing in large-scale cancer transcriptome data largely focused on the comparison of splicing patterns between cancer and normal tissues or between different subtypes of cancer 18,21,49 , additional computational tools are needed to characterize the clinical relevance of alternative splicing using massive RNA-seq data sets, including the association of alternative splicing with phenotypes and patient outcomes.
We have developed SURVIV, a novel statistical model for survival analysis of alternative isoform variation using cancer RNA-seq data. SURVIV uses a survival measurement error model to simultaneously model the estimation uncertainty of mRNA isoform ratio in individual patients and the association of mRNA isoform ratio with survival time across patients. Compared with the conventional Cox regression model that uses each patient's mRNA isoform ratio as a point estimate, SURVIV achieves a higher accuracy as indicated by simulation studies under a variety of settings. Of note, we observed a particularly marked improvement of SURVIV over Cox regression for low-and moderate-depth RNA-seq data (Fig. 2b). This has important practical value because many clinical RNA-seq data sets have large sample size but relatively modest sequencing depth.
Using the TCGA IDC breast cancer RNA-seq data of 682 patients, SURVIV identified 229 alternative splicing events associated with patient survival time, which met the criteria of SURVIV P-valuesr0.01 in multiple clinical subgroups. While the statistical threshold seemed loose, several lines of evidence suggest the functional and clinical relevance of these survival-associated alternative splicing events. These alternative splicing events were frequently identified and enriched in the gene functional groups important for cancer development and progression, including apoptosis, DNA damage response and oxidative stress. While some of these events may simply reflect correlation but not causal effect on cancer patient survival, other events may play an active role in regulating cancer cell phenotypes. For example, a survivalassociated alternative splicing event involving exon 5 of STAT5A is known to regulate the activity of this transcription factor with important roles in epithelial cell growth and apoptosis 37 . Using a co-expression network analysis of splicing factor to exon correlation across all patients, we identified three splicing factors (TRA2B, HNRNPH1 and SFRS3) as potential hubs of the survival-associated alternative splicing network of IDC. The expression levels of all three splicing factors were negatively associated with patient survival times (Fig. 6a-c), and both TRA2B and HNRNPH1 were previously reported to have an impact on cancer-related molecular pathways [40][41][42][43][44][45] . Finally, despite the limited power in detecting individual events, we show that the survival-associated alternative splicing events can be used to construct a predictor for patient survival, with an accuracy higher than predictors based on clinical parameters or gene expression profiles (Fig. 7). This further demonstrates the potential biological relevance and clinical utility of the identified alternative splicing events.
We performed cross-validation analyses to evaluate and compare the prognostic value of alternative splicing, gene expression and clinical information for predicting patient survival, either independently or in combination. As expected, the combined use of all three types of information led to the best prediction accuracy. Because we used penalized regression to build the prediction model, combining information from multiple layers of data did not necessarily increase the number of predictors in the model. The perhaps more surprising and intriguing result is that alternative splicing-based predictors appear to outperform gene expression-based predictors when used alone and when either type of data was combined with clinical information (Fig. 7). We observed the same trend in five additional cancer types (Supplementary Fig. 3). We note that this finding was consistent with a previous report that cancer subtype classification based on splicing isoform expression performed better than gene expression-based classification 25 . While this trend seems counterintuitive because accurate estimation of gene expression requires much lower RNA-seq depth than accurate estimation of alternative splicing 29 , one possible explanation may be the inherent characteristic of isoform ratio data. By definition, mRNA isoform ratio is estimated as the ratio of multiple mRNA isoforms from a single gene. Therefore, mRNA isoform ratio data have a 'built-in' internal control that could be more robust against certain artefacts and confounding issues that influence gene expression estimates across large clinical RNA-seq data sets, such as poor sample quality and RNA degradation 12 . Regardless of the reasons, our data call for further studies to fully explore the utility of mRNA isoform ratio data for various clinical research applications.
The SURVIV source code is available for download at https://github.com/Xinglab/SURVIV. SURVIV is a general statistical model for survival analysis of mRNA isoform ratio using RNA-seq data. The current statistical framework of SURVIV is applicable to RNA-seq based count data for all basic types of alternative splicing patterns involving two isoform choices from an alternatively spliced region, such as exonskipping, alternative 5 0 splice sites, alternative 3 0 splice sites, mutually exclusive exons and retained introns, as well as other forms of alternative isoform variation such as RNA editing. With the rapid accumulation of clinical RNA-seq data sets, SURVIV will be a useful tool for elucidating the clinical relevance and potential functional significance of alternative isoform variation in cancer and other diseases.

Methods
Simulation studies of SURVIV. We designed simulation studies to compare the performance of SURVIV with the conventional Cox regression survival analysis using point estimates of mRNA isoform ratios. We mimicked the characteristics of the TCGA IDC breast cancer RNA-seq data wherever possible in the simulations. For each simulation, we simulated a data set of 600 individuals, which was similar to the sample size of the TCGA IDC data. To study the effect of RNA-seq depth, for individual alternative splicing events we simulated the total number of splice junction reads (n) with the mean value ( n) equal to 5, 10, 20, 50, 80 or 100, respectively. We simulated 20,000 exons at each RNA-seq depth, with 10% of the exons from the alternative hypothesis that the exons were associated with survival time. The survival coefficient b of individual exons was randomly sampled from the top 100 significant exons in the TCGA IDC data set. The remaining 90% of the exons were from the null hypothesis that the exons were not associated with survival time (b ¼ 0).
For each exon, we then simulated the total number of splice junction reads and exon-inclusion levels across the 600 individuals. To model the variation in the total splice junction counts across individuals, we used a Gamma-Poisson model to simulate the total number of splice junction reads of each individual k. First, we generated the mean of Poisson distribution (l k ) for each individual k as: ð Þ, where the shape parameter a ¼ nb and the rate parameter b ¼ 0.05, which produced a read count variance similar to the variance in the TCGA IDC data. Then, we generated the total number of splice junction reads (n k ) for each individual k as: n k $ Poisson l k ð Þ. After that, we simulated the mean exon-inclusion level (c) according to a distribution Beta(0.51,0.36) that fitted the observed distribution of all alternative exons from the TCGA IDC data set. Similarly, the variance of the exon-inclusion levels (s) was simulated according to a distribution Gamma(1.31,15.04) that fitted the observed variance of all alternative exons from the TCGA IDC data set. Each individual's exon-inclusion level (c k ) was simulated as: c k $ Normal c; s ð Þ. The inclusion splice junction counts were generated as: IC k $ Binomial n k ; c k ð Þ . We also performed a second set of simulation, in which we directly sampled the read counts from the read counts of the exon-skipping events in the TCGA IDC data. Sampled read counts were then multiplied by different factors (ranging from 10 to 300%) to simulate data sets with different RNA-seq read depths.
We simulated the patient survival time based on the proportional hazards model, in which the individual exon-inclusion level c k was used as the covariate. We used the Nelson-Aalen estimator 50 of the baseline hazard function from the TCGA IDC data set as the baseline hazard function of the proportional hazards model. In the simulation with censored data, the censored individuals were randomly selected from the 85% of the individuals (note IDC patients had 89% censoring rate). The censor time was randomly sampled from 10 to 90% of the actual death time.
TCGA RNA-seq data. We obtained the RNA-seq read counts and clinical parameters of 682 IDC patients from TCGA breast cancer (BRCA) cohort (TCGA dbGaP Accession ID phs000178). The RNA-seq reads were mapped to the human genome and transcriptome by the TCGA consortium using Burrows-Wheeler Aligner (BWA) with default settings 15 . The TCGA consortium provided the numbers of RNA-seq read counts on splice junctions as part of the Level 3 RNAseq data available at the TCGA data portal (https://tcga-data.nci.nih.gov/tcga/). For each exon-skipping event defined in the Ensembl annotations (release #75), we identified the read counts for both exon-inclusion and -skipping splice junctions from the Level 3 data. The IDC patient list is provided in Supplementary Data 5. The same procedure was used to obtain and process TCGA RNA-seq data and clinical parameters of five additional cancer types, including GBM, KIRC, LGG, LUSC and OV.
Cross-validation of survival predictors. We evaluated and compared the survival prediction using clinical information, gene expression and alternative splicing. We used two-fold cross-validation to split the IDC patients into separate training and testing groups. The even split of the IDC patients ensured that there were enough death events in both training and testing groups for accurate model training and testing. The cross-validation was repeated for 100 times. Each time, half of the IDC patients were randomly selected as the training group and the other half were selected as the testing group. Using the training group, we built a series of survival prediction models based on clinical information, gene expression or alternative splicing predictors as well as the combination of these predictors. The clinical parameters used in the Monte Carlo cross-validation were identical to those used for the initial SURVIV analysis of 26 clinical subgroups (Fig. 3), which included cancer subtype, stage, lymph node status, metastasis, tumour size, oestrogen receptor status, progesterone receptor status, HER2 status and age. First, we used univariate regression (Pr0.01) to select potential survival predictors associated with patient survival time. We used Cox regression for the univariate analysis of clinical information and gene expression. We used SURVIV for the univariate analysis of alternative splicing. Then, starting from the predictors selected from univariate regression, we used a L1-penalized Cox regression model 51 to select a set of predictors in the multivariate prediction model. Owing to the different numerical scales of gene expression and alternative splicing predictors, when we combined these two types of predictors in the penalized regression, a z-score normalization was used to scale the gene expression and alternative splicing predictors in the penalized regression. After the model was built using the training group, we used the remaining patients from the testing group to test the model. The prediction accuracy of the model was measured by the C-index 46 . A C-index of 1 indicates perfect prediction accuracy and a C-index of 0.5 indicates random guess.