Transcriptomic and genomic profiling of early-stage ovarian carcinomas associated with histotype and overall survival

Ovarian cancer is the most lethal gynecological malignancy in the western world. Despite recent efforts to characterize ovarian cancer using molecular profiling, few targeted treatment options are currently available. Here, we examined genetic variants, fusion transcripts, SNP genotyping, and gene expression patterns for early-stage (I and II) ovarian carcinomas (n=96) in relation to clinicopathological characteristics and clinical outcome, thereby identifying novel genetic features of ovarian carcinomas. Furthermore, mutation frequencies of specific genetic variants and/or their gene expression patterns were associated with histotype and overall survival, e.g. SLC28A2 (mucinous ovarian carcinoma histotype), ARCN1 (low expression in 0-2 year survival group), and tumor suppressor MTUS1 (mutation status and overall survival). The long non-coding RNA MALAT1 was identified as a highly promiscuous fusion transcript in ovarian carcinoma. Moreover, gene expression deregulation for 23 genes was associated with tumor aggressiveness. Taken together, the novel biomarkers identified here may improve ovarian carcinoma subclassification and patient stratification according to histotype and overall survival.


INTRODUCTION
Recent advances in our understanding of ovarian carcinoma contributed to the reclassification of the disease into five major histotypes (high-grade serous (HGSC), low-grade serous (LGSC), endometrioid (EC), mucinous (MC) and clear cell (CCC) carcinomas) based on differences in origin, morphology, and clinical and biological behavior [1][2][3]. Standard treatment options for ovarian carcinoma are currently limited to surgical cytoreduction, followed by platinum-based chemotherapy, despite the recent introduction of new promising treatment options, e.g. poly (ADP-ribose) polymerase (PARP) inhibitors as a targeted therapy for carriers of genetic variants in BRCA1/2, the use of antiangiogenic agents in combination with first-line treatment or as maintenance treatment, and the administration of intraperitoneal chemotherapy [4]. The majority of ovarian cancer patients are still currently treated with conventional treatment based on tumor stage and grade, regardless of histotype or other biological characteristics. Unfortunately, there are no alternative treatment regimens alone that have www.oncotarget.com Oncotarget, 2018, Vol. 9, (No. 80), pp: 35162-35180 Research Paper www.oncotarget.com been proven to be superior to conventional therapy, which in itself is inadequate since e.g. the overall 5-year survival rate is below 50% and the majority of advancedstage ovarian cancer patients develop recurrences [4][5][6][7]. Consequently, there is a profound need for novel biomarkers with improved prognostic and diagnostic value that may guide the selection of therapeutic targets and thereby improve personalized medicine based on an individual ovarian carcinoma patient's clinicopathological and tumor characteristics [8].
Since the introduction of the revised 2014 World Health Organization (WHO) criteria for histotype diagnoses, few efforts have been made to identify novel molecular biomarkers that stratify ovarian carcinomas according to clinical outcome and the current histotypes. Surprisingly, relatively few studies have characterized histotype-specific genetic features in early-stage ovarian carcinomas (I and II) [9][10][11]. In addition, few studies have previously been performed on early-stage ovarian carcinomas to identify novel prognostic biomarkers [12]. One reason may be that the majority of all ovarian carcinomas are diagnosed at advanced stages III and IV [13].
Cancer is a complex disease comprised of genetic alterations that are accumulated during cancer development and progression. Although early-stage high grade ovarian carcinoma is rare, a recent report showed that early-and late-stage HGSC share many acquired genetic aberrations [14]. Therefore, early-stage tumors were chosen for the present study since they generally present less complex genetic profiles in comparison with late-stage tumors, thereby simplifying the analysis. Here, 96 primary ovarian carcinomas corresponding to the International Federation of Gynecology and Obstetrics (FIGO) stages I and II were profiled using RNA sequencing (RNA-seq) and genome-wide SNP genotyping. The present study thereby presents a good opportunity to characterize specific events associated with ovarian carcinogenesis and disease-specific genetic aberrations. The transcriptomic and genomic data were integrated with clinicopathological features and patient clinical outcome to characterize a) genetic aberrations associated with histotype or patient survival and b) novel prognostic and diagnostic biomarkers that may improve ovarian tumor classification and patient stratification. The current study thereby identifies novel genetic profiles and constitutes an important addition to existing research.

Landscape of genetic variation reveals few potentially deleterious variants in early-stage ovarian carcinoma
After removing common genetic variants found in the normal human population, the mean number of genomic and coding (exonic) variants per tumor was 44,138.3±1,240.3 (±SEM; range, 19,884-80,024) and 442.3±11.9 (range, 203-1,146), respectively. The variants are hereinafter termed tumor-specific variants. The number of tumor-specific variants was most prevalent within intronic regions of the genome, followed by variants within intergenic and non-coding RNA (ncRNA) regions, respectively ( Figure 1A). In exonic regions, tumor-specific coding variants spanned 11,529 different genes and the most frequent variant type was synonymous singlenucleotide variants (SNVs) followed by nonsynonymous SNVs ( Figure 1B). Base-pair substitutions associated with A>G (39.4%) and T>C (38.4%) were most prevalent in the genomic regions and G>A (15.3%), C>T (15.2%), T>C (12.8%) and A>G (12.6%) were most prevalent in the exonic regions ( Figure 1C-1D).
After conservative filtering, 8,187 tumor-specific coding variants (frameshift insertion, frameshift deletion, stopgain, or stoploss) were identified, which were predicted to have a disruptive effect on protein function (classified as potential deleterious variants, e.g. resulting in protein truncation, gain/loss of function or nonsense mediated decay). Among the 8,187 variants, potential deleterious variants were identified in 26 genes in at least 30% of the samples. Frameshift insertion and deletion were more prevalent than stopgain and stoploss and no significant differences in variant types were detected between the different histotypes and survival groups.
The mutation frequency of recurrent deleterious variants, i.e. deleterious variants present in at least 30% of the patients, were also assessed for each histotype (HGSC, EC, MC, CCC) and survival group (0-2 years, 2-5 years, 5-10 years or >10 years; Figure 2). Thirtyeight and 49 recurrent deleterious variants were identified in the histotype and survival groups, respectively. To identify tumor-specific variants in ovarian carcinoma, the presence of the identified recurrent deleterious variants were evaluated in normal ovarian tissue samples (control cohort). All of the recurrent deleterious variants significantly differed in mutation rates between the study cohort and the normal controls (Supplementary Table  1A-1B). Four recurrent deleterious variants corresponding to the CHD1L (99.0%), GFM1 (91.7%), MEIS1 (90.6%) and NFX1 (93.8%) genes were found in the majority of the tumor samples in the study cohort, but in none of the normal controls. Frameshift insertion in the UBR5 gene was the only deleterious variant to be found in the COSMIC database. Only three deleterious variants in the KIAA0040 (study cohort 31.3%, control cohort 10%) and MAML3 (study cohort 18.8%, control cohort 53.3%) genes were also present in the control cohort.
Pathway analysis with Ingenuity Pathway Analysis (IPA), performed using genes corresponding to the potential deleterious variations, revealed an association with a number of cancer-related pathways. Biological processes significantly associated with the histotype and survival groups included cell cycle, cell death and

Histotype and survival group specific mutation frequency and its impact on gene expression patterns
The mutation frequency for several deleterious genetic variants also varied between the different histotypes and survival groups ( Figure 2 showed significantly higher mutation rates in the 0-2 year survival group compared to the other survival groups (Figure 3, Supplementary Table 1A-1B).
Subsequently, differences in gene expression patterns for the potential deleterious variants were evaluated between the histotype and survival groups. The majority of the genes were differentially expressed in the different histotype (n=33, 89.2%) and survival groups (n=43, 89.6%) in comparison with the control cohort (Supplementary Table 2A-B). In addition, the expression of several genes also correlated with specific histotypes or survival groups. The AP2M1, GFM1, HIST1H1E and ZNF148 genes were significantly overexpressed and the BECN1 and NCOR1 genes significantly underexpressed in HGSC compared to the other histotypes. The SEMA4D gene was significantly overexpressed in the EC histotype compared to MC samples and ARV1 was significantly underexpressed in CCC in comparison with the other histotypes. Furthermore, ALMS1 was significantly underexpressed in the 0-2 year survival group compared to the 2-5 year and >10 year survival groups, while VWA3A was significantly underexpressed in the 0-2 year survival group compared with the 2-5 year and >10 year survival groups ( Supplementary Table 2A-B).
To determine whether mutation status had an effect on the gene expression patterns of the potential deleterious variants, the expression levels for tumors harboring (variant carriers) or lacking (non-variant carriers) the recurrent deleterious variant were compared. In total, 15 of the 52 (28.8%) different genes containing deleterious variants (AP2M1, ARHGAP11A, C3, CASP8AP2,  HIST1H1E, IPMK, ITGB1BP1, MAGI3, MAML3, MTUS1,  SDR16C5, SLC28A2, TTC3, VWA3A, ZHX1) identified in either the histotype or survival groups showed a significant difference in gene expression patterns between variant and non-variant carriers. Differential gene expression was also evaluated for deleterious variants showing significant differences in mutation rates within the histotype (15/38 variants) or survival groups (4/49 variants; Figure 3). In at least one of the histotype comparisons, BECN1, Figure 2: Recurrent deleterious variants. Blue bars depict recurrent deleterious variants present in at least 30% of at least one of the histotype groups (A) or survival groups (B) across the study cohort. Orange bars designate deleterious variants found in the control cohort. Genes known to be associated with cancer, according to the COSMIC database, are indicated by * . www.oncotarget.com CASP8AP2, GFM1, IPMK, MEIS1, SEMA4D, SLC28A2 and UBR5 were differentially regulated between variant and non-variant carriers. Differences in gene expression patterns were predominantly dependent on histotype and not mutation status. However, the SLC28A2 gene showed a significant difference in gene expression patterns between variant and non-variant carriers in the MC histotype. In addition, the ARCN1, OSTM1 and POC5 genes were differentially regulated between variant and non-variant carriers in at least one of the survival group comparisons, and the ARCN1 gene was differentially regulated between variant carriers in the 0-2 year and 5-10 year survival groups. Moreover, frameshift insertion in MTUS1 was the only gene among the deleterious variants showing significant differences in gene expression patterns dependent on mutation status, which also had a protective effect on overall survival (OS) compared with non-variant carriers of the gene (Figure 4). No correlation was found between disease-specific survival (DSS) and the deleterious variants.

Analysis of fusion transcripts identified a commonly occurring fusion transcript partner
In total, 3,735 fusion transcripts were identified in the initial analysis, of which approximately 10.4% (n=388) of the identified fusion transcripts were flagged by FusionCatcher as having a high or very high probability of being false positives and therefore removed from further analysis. Among the remaining 3,344 fusion transcripts (mean fusion transcripts per tumor 34.7±4.4 (±SEM); range, 0-266), 1,503 unique fusion transcripts were identified. The top ten most common fusion transcripts in the study The long non-coding RNA (lncRNA) MALAT1 was involved in the majority of recurrent fusion transcripts found in at least five of the study cohort samples ( Figure 5A). Overall, the majority (76.2%) of the fusion transcripts had one gene partner spanning non-coding exonic regions (no known coding DNA sequence (CDS)), such as MALAT1. A total of 675 fusion transcripts had MALAT1 as its 5' fusion partner and 701 fusion transcripts had MALAT1 as its 3' fusion partner. A small proportion (12.7%) of the fusion transcripts were predicted to be in exonic regions (out-of-frame/ coding-coding (4.9%), coding-3'UTR (3.1%), in-frame/ coding-coding (4.8%), promotor-coding (3.7%)), 2.0% in untranslated regions and 5.1% in intronic regions ( Figure  5B). Interchromosomal fusions were more common than intrachromosomal fusions and the most abundant fusion transcripts in the study cohort were fusion events involving chromosome 11. More specifically, the most abundant intrachromosomal fusions were found on chromosome 11 and the most abundant interchromosomal fusions were found between chromosomes 1 and 11 ( Figure 5C). In addition, 17 fusion gene partners (CDKL1, EEF2K,  ERBB2, HK1, IRAK2, MAST4, MASTL, OXSR1, PAK1,  PI4KA, PIP4K2B, PIP5K1B, PKM, PRKDC, TJP2,   the most commonly found fusion transcript not involving MALAT1. MALAT1 and AHNAK were both significantly overexpressed in the study cohort compared with the control cohort. The 11q13.1 locus spanning MALAT1 was the most commonly involved locus spanning at least one fusion gene partner (linked to 19.6% of all unique fusion transcripts).
Circos plots were used to summarize the genomic rearrangements (genetic variations, copy number alterations and fusion events) in the ovarian carcinoma samples ( Figure 6). SNP genotyping showed that the fusion events identified on the RNA level were frequently associated with DNA breakpoints (DNA gains or losses). Generally, DNA loss was observed more often than DNA gain. Dual-color FISH was performed on tumor touchprint preparations from at least two different tumor samples to validate fusion transcripts with MALAT1 as the 5' fusion partner in combination with the AHNAK, RPPH1, MUC16, RMRP, SYNE2, COL3A1, COL1A2, EIFA2, MACF1, C3, CANX and XIST as the 3' fusion partners. FISH analysis showed intratumoral heterogeneity with few tumor cells containing the fusion transcript of interest. A validated fusion transcript between MALAT1 and MUC16 as well as the genomic sequence at its fusion transcript break point is shown in Figure 6.
Differential gene expression analysis was performed between the four most commonly occurring fusion transcripts (AHNAK-MALAT1, MALAT1-AHNAK, MALAT1-RPPH1 and RPPH1-MALAT1) within the study cohort ( Figure 7). Significantly higher gene expression levels were seen in samples harboring specific Oncofuse was used to identify fusion transcripts with oncogenic potential. In total, 105 of the 3,344 (3.1%) true fusion transcripts were identified across the study group as being significant driver fusion transcripts (Supplementary Table 3). Six of the most frequent fusion partners containing the MUC16, AHNAK, SYNE2, COL3A1, COL1A2 and MACF1 genes were found to be significant driver fusion transcripts. Five of the 17 identified kinase fusion gene partners (EEF2K, ERBB2, MAST4, PAK1, TTN) were also proposed by Oncofuse to be driver fusions. MALAT1 was not identified in any significant driver fusion transcript. Pathway analysis using the significant driver fusion transcripts revealed an association with cancer-related pathways, including cell death and survival, cell morphology, cellular development, cellular growth and proliferation, and cellular movement.

Gene expression analysis uncover differentially expressed genes associated with tumor aggressiveness
Gene expression profiles for patient samples belonging to the 0-2 year, 2-5 year and 5-10 year survival groups (short-time survivors) were compared with the >10 year survival group (long-time survivors) for overall survival. In total, 127 (0-2 years vs >10 years), 134 (2-5 years vs >10 years) and 19 genes (5-10 years vs >10 years) were differentially expressed (adjusted P<0.05; 1.5-fold change cutoff). Twenty-three genes were recurrently deregulated in at least two of the survival group comparisons ( Figure 8A). Underexpression was generally more pronounced (lower log 2 -ratio) in the 0-2 year survival group than the other short-term survival groups (adjusted P<0.05, 1.5-fold change cutoff). Overexpression was also more pronounced (higher log 2 -ratio) in the 0-2 year survival group. Pathway analysis showed that the differentially expressed genes were involved in cancer-related pathways such as cellular functions and maintenance, cellular movement, cellular assembly and organization, protein synthesis, cellular development, cellto-cell signaling, and cellular growth and proliferation. In addition, MMP1 was the only gene associated with tumor aggressiveness to play a role in at least five of the cancerrelated biological processes. Three of the differentially expressed genes (CCDC114, CCDC173, CCDC40) are members of the CCDC gene family relating to cellular Differential gene expression analysis was then performed by further stratifying the study cohort by both histotype and survival group. To avoid compromising the validity of the results due to small sample sizes, this analysis was only performed using three of the survival groups (2-5 years, 5-10 years and >10 years) for HGSC. The MTRNR2L1 gene was significantly overexpressed in the 2-5 year group compared with long-term survivors (log 2 fold change = -4.0) and 5-10 year group compared with long-term survivors (log 2 fold change = -2.7; Figure 8B).

DISCUSSION
Due to intertumoral heterogeneity, tumors with similar histopathology and presumed tissue of origin may have distinctly different clinical behaviors and respond differently to the same treatment. It is therefore important to improve personalized medicine using robust physiological biomarkers. Multi-omics approaches enable a comprehensive identification and evaluation of novel cancer biomarkers [15]. Here, whole-transcriptome RNA sequencing and whole-genome SNP genotyping analysis were performed on early-stage ovarian tumors in relation to clinicopathological features and clinical outcome to identify novel prognostic and/or diagnostic biomarkers, and improve histotype classification and patient stratification. This study presents an important addition to existing research due to its complete genome characterization of a large sample size of early-stage ovarian carcinoma specimens since few studies have previously characterized histotype-specific genetic features in early-stage ovarian carcinomas [9][10][11]. Moreover, few studies relate to complete genome studies on non-serous ovarian carcinoma histotypes [11,16,17] and there are few studies on early-stage HGSC. It has however been shown that early-stage HGSC show similar genetic aberrations to late-stage HGSC [14]. In the present investigation, TP53 mutation analysis showed that mutation frequencies were highest in HGSC, which is in line with frequency rates observed in latestage HGSC [18]. Overall, we identified (a) recurrent deleterious tumor-specific genetic variants, not previously associated with cancer or tumor aggressiveness, that may have potential diagnostic value due to the absence of the variants in normal tissue, (b) expressed fusion transcripts predominantly in ncRNAs, in particular MALAT1, and (c) potential prognostic genes associated with tumor aggressiveness that may play a pivotal role in cancerrelated processes. In summary, these results present a comprehensive characterization of ovarian carcinoma with respect to histotype and overall survival highlighting the genetic complexity of ovarian carcinomas.
Several genetic variants had high mutation rates across the entire study group in comparison with normal ovarian controls, e.g. recurrent deleterious variants in the CHD1L, GFM1, MEIS1 and NFX1 genes, suggesting specificity to cancer. Similar mutation rates were identified for these genes in a breast cancer cohort [19]. Interestingly, no genetic variants were specific for either ovarian or breast cancer suggesting similarities between the genetic variants found in both cancer forms. Furthermore, genetic variants in CHD1L, GFM1, MEIS1 and NFX1 have previously been reported in other cancer types (the type of genetic variant was not specified) but at lower frequency than reported here, such as genetic variants in CHD1L for 10% of primary CNS lymphomas, GFM1 for 5.7% of cholangiocarcinomas, MEIS1 for 5.7% of uterine carcinosarcomas, and NFX1 for 6.7% of malignant peripheral nerve sheath tumors [20]. The majority of the recurrent deleterious variants were not found in the control cohort indicating tumor specificity. Two of the identified genes, NCOR1 and ASPM, have previously been identified as mutational cancer driver genes (high confidence driver) by Tamborero et al. [21].
Of the 52 different genes harboring recurrent deleterious variants found in at least one of the histotype or survival groups, frameshift insertion in the UBR5 gene was the only deleterious variant that has previously been associated with cancer according to the COSMIC database. This finding suggests that the present study identified novel deleterious variants associated with cancer. Moreover, novel mutation signatures in relation to histotypes were identified. Among these, increased DNA methylation of the CASP8AP2 gene has previously been associated with less sensitivity to cisplatin and taxol in a breast cancer cell line (MDA-MB-231) [22], elevated UBR5 gene expression levels have been observed in cisplatin-resistant ovarian cancer patients in comparison with cisplatin-responsive patients [23], and SEMA4D has been shown to play a role in progression of multiple cancers, such as ovarian cancer [24]. The IPMK gene has been shown to be involved in cell cycle arrest and apoptosis in ovarian cancer [25] and the BECN1 gene has been suggested to be a tumor-suppressor gene and its expression levels to be associated with ovarian cancer prognosis [26]. To our knowledge, no association with ovarian cancer has previously been found for the genetic variants in the CLEC2B, ACACA, OSTM1 and SLC28A2 genes. As next-generation sequencing analyses become more commonly used in the clinic, these novel mutation signatures may help to improve subclassification of the histotypes. The deleterious variant in the OSTM1 gene may be of prognostic importance, since it showed significantly higher mutation rates in the 0-2 year survival group compared to the other survival groups.
Several of the genes comprising the deleterious variants also had an effect on gene expression patterns. Significant overexpression of AP2M1, GFM1, HIST1H1E, and ZNF148, and underexpression of BECN1 and NCOR1 were identified in HGSC compared to the other histotypes. Out of these genes, only the AP2M1 gene (DNA amplification) has previously been associated with HGSC [27]. Moreover, the SEMA4D gene was significantly overexpressed, and the ARV1 gene was significantly underexpressed in CCC in comparison with the EC and MC histotypes. With regard to survival groups, ALMS1 and VWA3A may be used as prognostic biomarkers, since they were significantly underexpressed in the 0-2 year survival group compared to the 2-5 year and >10 year survival groups, and 2-5 year and >10 year survival groups, respectively, wherein the VWA3A gene has previously been associated with survival [28]. However, the observed changes in gene expression pattern were primarily associated with specific histotype or survival groups rather than a result of the genetic variant. These findings indicate that the gene expression patterns were influenced by other molecular mechanisms than the genetic variation in specific indels (insertions/deletions) and SNVs (single-nucleotide variants). Of the 15 genes showing significant changes in gene expression patterns in the presence of potential deleterious variants, SLC28A2 was the only gene with differential gene expression patterns that could be attributed to mutation status in the MC histotype, as well as, between variant carriers in the HGSC and MC histotypes. Consequently, only histotypespecific changes in gene expression were observed for the remaining genes containing genetic variants, which may suggest that histotype and/or other molecular mechanisms play a role in determining the gene expression levels for these genes. In the survival groups, ARCN1 showed a significant difference in gene expression patterns between variant carriers in the 0-2 year and 5-10 year survival groups, suggesting that ARCN1 may have a tumor suppressor effect as elevated expression thereof indicated longer survival times. The deleterious frameshift insertion in MTUS1 was particularly interesting since it resulted in a significant difference in both gene expression patterns and overall survival compared to non-variant carriers. The elevated gene expression levels in variant carriers compared to non-variant carriers caused by the frameshift insertion may be explained in that the MTUS1 degradation site may have been removed. Interestingly, a protective effect for overall survival was shown for the deleterious variant in MTUS1, suggesting that the genetic variant has a tumor-suppressor effect. Although the tumor suppressor effect of MTUS1 has been shown in multiple cancer forms [29][30][31], this is the first report to our knowledge correlating both MTUS1 gene expression and mutation status with overall survival.
Fusion events may play an important role in the development of epithelial cancers, since they may be strong driver mutations and promote genomic instability [32]. High resolution RNA-seq can be used to identify fusion transcripts that are expressed within tumor tissue. The majority of the identified fusion transcripts had one gene partner spanning non-coding exonic DNA regions, suggesting that the majority of fusion transcripts do not generate a corresponding fusion protein, but may influence the expression of the fusion partner. These fusion transcripts may have arisen by chance due to genomic instability. Only 12.7% of the identified fusion transcripts were predicted to be in coding regions, e.g. in-frame, outof-frame or promoters (5' UTRs). Out-of-frame fusion transcripts are also assigned to this group since they may be placed back in-frame through alternative splicing or genetic variations. Almost half of the fusion transcripts had an inverted orientation of their fusion partners, indicating fusion transcript formation via inversion events. Interestingly, we identified a higher prevalence of interchromosomal fusion events than intrachromosomal fusion events, which is in contrast to previous reports in cancer [33,34]. These differences may be the result of multiple interchromosomal fusions on chromosome 11, which predominantly spanned ncRNAs. Moreover, several interesting in-frame kinase fusions were identified which may be targetable with kinase inhibitors.
MALAT1 lncRNA has been associated with overexpression in various cancers and linked to unfavorable overall survival when overexpressed [35]. MALAT1 was significantly overexpressed in the study cohort compared to the normal controls, which is consistent with recent studies linking MALAT1 overexpression with significantly increased cell proliferation and invasion in ovarian cancer [36,37]. To our knowledge, MALAT1 has not previously been reported to be involved in fusion events in ovarian carcinomas. MALAT1 was highly promiscuous (forming fusions with multiple partners) in the present study, suggesting that the majority of the fusions occurred at the RNA level. Nevertheless, FISH analysis demonstrated that several MALAT1 fusions also occur at the DNA level. However, the fusions were observed in few cells indicating extensive clonal heterogeneity. The most commonly identified fusion transcript across the study cohort, AHNAK-MALAT1, was found in 25.8% (n=24) of the samples and this frequency is even higher (35.4%, n=34) if the reciprocal fusion transcript (AHNAK-MALAT1, MALAT1-AHNAK) is taken into account. These numbers are extremely high considering the heterogeneous nature of ovarian carcinomas. Few fusion transcripts involving MALAT1 were specific for a particular histotype or survival group. Previous reports have identified several fusion transcripts in HGSC, including BCAM-AKT2 (7%), CDKN2D-WDFY2 (20%) and ESRA-C11orf20 (15%), wherein the majority of the samples were in late-stages, but absent in the present study [38][39][40]. A further study identified nine recurrent fusion transcripts (present in at least 2/220), whereof two fusion transcripts CRHR1-KANSL1 and COL14A1-DEPTOR were also found in the present study cohort [41].
Oncofuse identified 105 potentially oncogenic fusion transcripts. Six of the most common fusion partners containing the MUC16, AHNAK, SYNE2, COL3A1, COL1A2 and MACF1 genes were identified as significant driver fusion transcripts. MALAT1 was not present as a fusion partner among the 105 driver fusion transcripts. This may suggest that MALAT1 has not previously been identified as a fusion transcript and further research is necessary to elucidate its role in ovarian tumorigenesis.
However, Lanzós et al. [42] recently identified MALAT1 as a high confidence candidate of being a cancer driver lncRNA.
In the gene expression analysis, 23 genes were found to be differentially expressed in at least two of the survival group comparisons, and associated with tumor aggressiveness. These genes may be seen as common cancer genes rather than ovarian cancer specific genes associated with tumor aggressiveness, since the histotypes have different origin and clinical behavior. Interestingly, the majority of the differentially expressed genes was involved in cancer-related pathways and/or were associated with cancer, such as female genital tract cancer. For example, ARG2 and MMP1 expression have been found to correlate with poor prognosis in gastric and ovarian cancer, respectively [43,44]. MTRNR2L1 was significantly underexpressed in both the 2-5 year survival group and 5-10 year survival group compared with longterm survivors (only HGSC samples taken into account). Little is known about the function of the MTRNR2L1 gene and it has not previously been associated with HGSC.
In summary, we have identified several novel genetic aberrations associated with histotype and/or clinical outcome for early-stage ovarian carcinomas. In addition, several potential diagnostic and prognostic biomarkers were identified. Potential diagnostic biomarkers include the recurrent deleterious tumor-specific genetic variants which were only found in the study cohort but absent in the normal controls, among which 15 genetic variants were identified with significant differences in mutation rates between the histotype groups, e.g. frameshift insertion in OSTM1 (higher mutation rates in MC compared with the other histotype groups), frameshift deletion in BECN1 (lower mutation rates in CCC compared with the other histotype groups), and absence of the frameshift insertion in SEMA4D in MC. In addition, the SLC28A2 gene showed a significant difference in gene expression patterns between variant and non-variant carriers in the MC histotype. Lastly, MALAT1 expression may also be used as a diagnostic biomarker as it was significantly higher in the study cohort compared to the normal controls. Potential prognostic biomarkers include the recurrent deleterious tumor-specific genetic variants which differed between the survival groups, e.g. frameshift insertion in OSTM1 (higher mutation rates in 0-2y compared with the other survival groups). The frameshift insertion in MTUS1 presented a potential tumor-suppressor effect due to the mutation-dependent gene expression patterns and its protective effect on OS in variant carriers. Moreover, the 23 differentially expressed genes associated with overall survival (tumor aggressiveness) may be used as prognostic biomarkers. Further studies need to be performed in order to validate the clinical significance these aberrations may have on ovarian carcinoma subclassification and patient stratification using e.g. immunohistochemistry and in vivo functional studies. www.oncotarget.com

Patients and tumor samples
Primary invasive ovarian carcinomas from 96 patients diagnosed between 1994 and 2006 were obtained from the fresh-frozen tumor bank at the Sahlgrenska University Hospital Oncology lab (Gothenburg, Sweden). Clinicopathological characteristics and overall survival data were obtained from the National Quality Registry at the Regional Cancer Center West (Gothenburg, Sweden) and the Cancer Registry at the National Board of Health and Welfare, respectively. Of the available fresh-frozen samples in the tumor bank, patients were chosen for inclusion in the study cohort according to the International Federation of Gynecology and Obstetrics (FIGO) stage I and II, as well as overall survival calculated from the date of initial diagnosis to the date of death of any cause and stratified into four survival groups, i.e. 0-2 years, 2-5 years, 5-10 years and >10 years.
The tumor specimens were reclassified according to current WHO criteria [1][2][3]45] with regard to histotype and histological grade by pathologists at Sahlgrenska University Hospital using four micrometer full-face FFPE sections stained with hematoxylin and eosin. The histotype distribution in the present study was relatively consistent with previous reports. However, a slightly higher number of HGSC samples (52%) and lower number of EC (18%) and CCC (18%) were used in the present study than has previously been reported for early-stage ovarian carcinomas (HGSC (35.5%), LGSC (1.9%), EC (26.6%), MC (7.5%), CCC (26.2%)) [46]. All procedures were performed in accordance with the Declaration of Helsinki and approved by the Regional Ethical Review Board (Gothenburg, Sweden; case number 767-14). The Regional Ethical Review Board approved a waiver of written consent to use the tumor specimens. The clinicopathological features of the 96 cases with regard to the histotype reclassification are shown in Table 1

Control cohort
Approval for access to the Cancer Genome Atlas (TCGA) genomic data was obtained through the database of Genotypes and Phenotypes (dbGaP; project #11044). Raw sequencing data for 30 normal ovarian solid tissue samples (control cohort; Supplementary Table 5) was retrieved from TCGA-OV data collection through the Genomic Data Commons Data Portal (GDC Data Portal) [18,47]. The normal controls had been analyzed by whole exome sequencing (WXS) and mapped against GRCh38. To minimize batch differences between the study cohort and the control cohort, the raw data for the control cohort was converted to FASTQ format using the BEDTools (v. 2.25.0) module and compressed using the gzip module in Uppsala Multidisciplinary Center for Advanced Computational Science (UPPMAX). The FASTQ files were thereafter processed using the same pipeline as the study cohort for variant calling and differential gene expression analysis (see below).

Whole-transcriptome RNA-seq
Total RNA was extracted from the tumor samples using the RNeasy Lipid Tissue Mini Kit (Qiagen), followed by RNA concentration and integrity assessment using Nanodrop ND-1000 (Nanodrop Technologies) in combination with QuBit (ThermoFisher Scientific) and the RNA 6000 Nano LabChip Kit with Agilent 2100 Bioanalyzer (Agilent Technologies), respectively. Samples with an RNA integrity number (RIN) of 6 or higher were processed at the Science for Life Laboratory (National Genomics Infrastructure, Stockholm). Illumina TruSeq strand-specific RNA libraries (Ribosomal depletion using RiboZero human) containing 125 bp pair-end reads were obtained for each sample on a HiSeq2000 sequencer (Illumina). Read alignment yielded approximately 9.7 to 22.3 million aligned reads per sample (median, 19.4 million aligned reads). Computations were performed on resources provided by SNIC through UPPMAX under Project SNIC b2015239.

Quality control of raw RNA-seq data
A quality control check of the FASTQ files was performed using default FastQC parameters (v. 0.11.2) and compiled using MultiQC (v. 0.6). Low quality bases (Phred quality scores below 20) and adapter sequences were removed utilizing the TrimGalore wrapper script (v. 0.4.0). FastQC and MultiQC were thereafter performed on the trimmed FASTQ files examining the quality of the trimmed reads and ensuring that all adapter sequences had been removed [48]. A principal component analysis (PCA) was performed on the variance stabilizing transformation data (vsd) values on raw gene counts for each sample using DESeq2 (v. 1.14.0) in R/Bioconductor (v. 3.2.5).

Variant calling
Genetic variants were identified according to the Genome Analysis Toolkit (GATK) (v. 3.6) Best Practices protocol, Broad Institute (https://software.broadinstitute. org/gatk/guide/article?id=3891). More specifically, the trimmed FASTQ files were aligned to the human reference genome hg19 (GRCh37) using the STAR (v. 2.5.0c) aligner in a two-pass approach [49]. MultiQC was run on Log.final.out-files received from the mapping with STAR to control the mapping quality of each sample. One sample had below 80 % uniquely mapped reads, and was hence removed from the study cohort (OV386 = 52.4% uniquely mapped reads). This sample is not accounted for under the "Patients and tumor samples" section above.
In brief, the GATK SplitNCigarReads tool was used to remove false positive calls caused by splicing inaccuracies from the STAR aligner. Base quality score recalibration (BQSR) was performed using GTF files (dbsnp_138.hg19.vcf, dbsnp_138.hg19.vcf.idx, Broad Institute). The genetic variants were called using the HaplotypeCaller. Hard filters were applied filtering clusters of at least three SNPs, which were in a window of at least 35 bases and based on Fisher Strand (FS>30) and quality by depth (QD<2). The filtered genetic variants were annotated with ANNOVAR [50] and thereafter further filtered with the 1000 Genomes Project dataset (1000g2015aug) [51,52] and dbSNP (hg19_snp138) with a minor allele frequency (MAF) threshold of 0.01 to remove common genetic variants found in the human population. Potential deleterious variants predicted to have an effect on the amino acid sequence were identified by filtering conservatively for the Sequence Ontology terms frameshift insertion (SO:0001909), frameshift deletion (SO:0001910), stopgain (SO:0001587), or stoploss (SO:0001578) [53,54]. The genetic variants were matched against the Catalogue of Somatic Mutations in Cancer (COSMIC) database (v.70, Aug. 2014) [55] to annotate known genetic variants associated with cancer.

Differential gene expression analysis
Raw read counts, i.e. the number of sequences that map to each transcript of the human reference genome (hg19 assembly), were calculated using htseq-count in the htseq (v. 0.6.1) module in UPPMAX on name sorted STAR (1-pass mode) BAM files. Fragments per kilobase of exon per million fragments mapped (FPKM) values were computed by running the Cufflinks (v. 2.2.1) module on STAR (1-pass mode) BAM files sorted by coordinates. Differentially expressed transcripts were determined using normalized count values of the tumor samples in the study and control cohort. DESeq2 (v. 1.14.0) in R/Bioconductor (v. 3.2.5) was used to compare gene expression levels between patients belonging to different histotype and/or survival groups to identify the differentially expressed transcripts.

Identification and validation of fusion transcripts
FusionCatcher [56] (v. 0.99.5a) with the associated databases ENSEMBL, UCSC and RefSeq was used to identify fusion transcripts. FusionCatcher implemented Bowtie (v. 0.12.6), BLAT (v. 35), STAR and Bowtie2 (v. 2.2.3) to identify fusion junctions and align the fusion transcripts to the GRCh37 human reference genome assembly. Identified fusion transcripts marked with fusion descriptions indicating fusion genes of high or very high probability of being a false positive fusion transcript were removed [57].
The Oncofuse tool [58] was used to predict the oncogenic potential of the identified fusion transcripts across the study cohort (tissue type: EPI (epithelial origin)). A functional prediction score was assigned to each fusion transcript describing the probability of the fusion transcript being a driver event (Bayesian probability scores < 0.5).

Fluorescence in situ hybridization (FISH)
Dual-color interphase FISH was performed using a selection of recurrent fusion transcripts across the study group identified by FusionCatcher. More specifically, bacterial artificial chromosome (BAC) clones covering each fusion partner were selected from BACPAC Resources Center, Children's Hospital Oakland Research Institute, CA, USA (Supplementary Table  6) and validated on normal metaphase chromosome preparations. BAC DNAs corresponding to each fusion transcript were extracted (Qiagen Plasmid Maxi kit) and separately labeled for dual-color FISH by nick translation with biotin-16-dUTP (detected by FITC) and digoxigenin-11-dUTP (detected by Rhodamine) (Roche Diagnostics, Mannheim, Germany), respectively. Touchprint preparations were prepared from fresh-frozen tumors corresponding to the RNA sequenced ovarian tumor samples. Biotin-16-dUTP and dioxigenin-11-dUTP labeled probes were co-hybridized to denaturated interphase nuclei slides overnight at 37 °C. The hybridized slides were counterstained with DAPI and mounted in an antifade solution (Vectashield DAPI, Vector Laboratories, Burlingame, CA, USA). Sample evaluation and image acquisition were performed using a Leica DMRA2 fluorescent microscope (Leica, Leica Microsystems, Wetzlar, Germany) equipped with an ORCA Hamamatsu CCD (charged-couple devices) camera and filter cubes specific for green fluorescein isothiocyanate (FITC), red rhodamine, and UV for blue 4′,6′-diamidino-2′phenylindole dihydrochloride (DAPI) counterstain visualization. Image preparation was performed using CW4000 software, QFISH.

Ingenuity pathway analysis (IPA)
Pathway analysis was performed using Ingenuity Pathway Analysis (Ingenuity Systems, Redwood City, USA) to identify cancer-related biological functions associated with the identified genetic variants, differentially expressed genes and fusion transcripts. The biological functions were generated using Fisher's exact test (P<0.05).

Genome-wide SNP genotyping
Genomic DNA was extracted (Wizard Genomic DNA Purification Kit, Promega) followed by phenol chloroform purification with Phase Lock Gel Light (5 Prime) for nine samples in the study cohort comprising a FISH verified fusion transcript. Genome-wide SNP www.oncotarget.com genotyping analysis (Illumina Infinium HumanOmni2.5-8 v. 1.3 Beadchips) was performed on the purified DNA at the SCIBLU Genomics DNA Microarray Resource Center (SCIBLU), Department of Oncology, Lund University.
The identified genetic variants and fusion transcripts were correlated to copy number alterations and DNA breakpoints present in the SNP genotyping data. The RNA-seq (genetic variants and fusion transcripts) and SNP genotyping data were visualized as circos plots using the Circos (v. 0.66) module in UPPMAX.

Statistical analyses
A comparison of genetic variant frequency and gene expression levels dependent on genetic variants or fusion transcripts was performed between the study and control cohorts, histotype and/or survival groups using a 0.05 p-value cutoff in Microsoft Excel with the TTEST function or analysis of variance (ANOVA) in R/Bioconductor (v. 3.2.5) as appropriate. All p-values are two-sided. Survival rates according to mutation status were calculated with Kaplan-Meier curves and tested with log-rank test (survival, v. 2.40-1). Univariate Cox proportional hazard models were calculated for mutation status using DSS (the time from initial diagnosis to ovarian cancer-related death) and OS (the time from initial diagnosis to death from any cause). Statistical significance is indicated as * P<0.05, ** P<0.01 or *** P<0.001.

Data availability
The RNA-seq and SNP genotyping data have been deposited in the NCBI Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/) under accession number GSE101109.

Author contributions
K.H. and P.K. were responsible for overall study concept, design of experiments, and collection of clinical data. T.Z.P. and J.B. contributed to the computational analyses of the RNA-seq and SNP genotyping data. E.W.R., A.K. and C.M. performed the pathological reclassification of the histotypes in the study cohort. K.S. provided technical support relating to ovarian carcinoma and E.M.V.S. extracted RNA for the RNA-seq analysis. H.E. performed the experiments, analyzed the data, and wrote the manuscript. All authors reviewed, edited, and approved the final manuscript.