Impact of Variable RNA-Sequencing Depth on Gene Expression Signatures and Target Compound Robustness: Case Study Examining Brain Tumor (Glioma) Disease Progression

Purpose Gene expression profiling can uncover biologic mechanisms underlying disease and is important in drug development. RNA sequencing (RNA-seq) is routinely used to assess gene expression, but costs remain high. Sample multiplexing reduces RNA-seq costs; however, multiplexed samples have lower cDNA sequencing depth, which can hinder accurate differential gene expression detection. The impact of sequencing depth alteration on RNA-seq–based downstream analyses such as gene expression connectivity mapping is not known, where this method is used to identify potential therapeutic compounds for repurposing. Methods In this study, published RNA-seq profiles from patients with brain tumor (glioma) were assembled into two disease progression gene signature contrasts for astrocytoma. Available treatments for glioma have limited effectiveness, rendering this a disease of poor clinical outcome. Gene signatures were subsampled to simulate sequencing alterations and analyzed in connectivity mapping to investigate target compound robustness. Results Data loss to gene signatures led to the loss, gain, and consistent identification of significant connections. The most accurate gene signature contrast with consistent patient gene expression profiles was more resilient to data loss and identified robust target compounds. Target compounds lost included candidate compounds of potential clinical utility in glioma (eg, suramin, dasatinib). Lost connections may have been linked to low-abundance genes in the gene signature that closely characterized the disease phenotype. Consistently identified connections may have been related to highly expressed abundant genes that were ever-present in gene signatures, despite data reductions. Potential noise surrounding findings included false-positive connections that were gained as a result of gene signature modification with data loss. Conclusion Findings highlight the necessity for gene signature accuracy for connectivity mapping, which should improve the clinical utility of future target compound discoveries.


INTRODUCTION
Gene expression profiling examines the altering state of the transcriptome at many levels. In cancer research, gene expression profiling has been essential in assessing biologic function, pathogenesis, and biomarker discovery. 1,2 In the past, microarrays have been used to measure gene expression; however, methodological drawbacks include background hybridization, reliance on established probes, and limited dynamic range. [3][4][5] A superior method available for gene expression Purpose Gene expression profiling can uncover biologic mechanisms underlying disease and is important in drug development. RNA sequencing (RNA-seq) is routinely used to assess gene expression, but costs remain high. Sample multiplexing reduces RNAseq costs; however, multiplexed samples have lower cDNA sequencing depth, which can hinder accurate differential gene expression detection. The impact of sequencing depth alteration on RNA-seq-based downstream analyses such as gene expression connectivity mapping is not known, where this method is used to identify potential therapeutic compounds for repurposing.
Methods In this study, published RNA-seq profiles from patients with brain tumor (glioma) were assembled into two disease progression gene signature contrasts for astrocytoma. Available treatments for glioma have limited effectiveness, rendering this a disease of poor clinical outcome. Gene signatures were subsampled to simulate sequencing alterations and analyzed in connectivity mapping to investigate target compound robustness.
Results Data loss to gene signatures led to the loss, gain, and consistent identification of significant connections. The most accurate gene signature contrast with consistent patient gene expression profiles was more resilient to data loss and identified robust target compounds. Target compounds lost included candidate compounds of potential clinical utility in glioma (eg, suramin, dasatinib). Lost connections may have been linked to low-abundance genes in the gene signature that closely characterized the disease phenotype. Consistently identified connections may have been related to highly expressed abundant genes that were ever-present in gene signatures, despite data reductions. Potential noise surrounding findings included false-positive connections that were gained as a result of gene signature modification with data loss.
Conclusion Findings highlight the necessity for gene signature accuracy for connectivity mapping, which should improve the clinical utility of future target compound discoveries. measurement is RNA sequencing (RNA-seq) of cDNA transcripts in a high-throughput manner. Sequencing reads are then aligned to a reference genome or transcriptome and mapped to an identified region. Transcript abundance is estimated, facilitating the comparison of gene expression profiles. RNA-seq has wider analytical capabilities, including single nucleotide variants, insertion-deletions, gene splice variants, post-transcriptional modifications, and gene fusion detection, but remains costly. 6,7 Experimental techniques developed to minimize sequencing costs include sample multiplexing.
Multiplexing involves labeling each sample library with a barcode identifier, allowing multiple libraries to be pooled and sequenced simultaneously, reducing costs. 7-10 Smaller volumes of RNA are analyzed for multiplexed samples; thus, the downside to multiplexing is reduced sequencing depth for this library type.
Accurate assessment of transcripts depends on length, abundance, and mappability to the reference and sufficient sequencing depth, particularly for genes with low transcript abundance. 11,12 Sequencing depth alterations can affect the detection of differentially expressed genes (DEGs) and potentially the accuracy of RNA-seq-based downstream analysis. Few studies have assessed the impact of sequencing depth alterations on RNA-seq downstream applications. 13 More studies are required, particularly to assess applications that rely on precise gene signatures, informative in classifying cancer subtypes and improved prognostic and predictive outcomes. 14,15 A gene signature is summarized by DEGs that collectively represent the most prominent features of a cancer subtype or disease progression phenotype. If a gene signature is compiled using gene expression profiles with low sequencing depth, then it may not be fully representative of that phenotype. This could be particularly problematic for connectivity mapping that examines a gene expression signature contrast with the aim of predicting potentially therapeutic US Food and Drug-approved target compounds for repurposing. 16 There is urgent need for new targeted therapies for gliomas, which are the most common form of primary brain tumor. Gliomas can be classified from grade I to IV on the basis of histologic and molecular information. 17 Depending on the cell of origin, each neoplasm is classified as an astrocytoma, oligodendroglioma, or ependymoma. Diffuse astrocytoma (WHO grade II) can demonstrate progression to anaplastic astrocytoma (WHO grade III) and malignant glioblastoma (GBM; WHO grade IV). Patient survival beyond 5 years is 58% for grade II astrocytoma, 23.6% for grade III anaplastic astrocytoma, and only 5% for grade IV GBM. [18][19][20] Patients with GBM undergo concurrent chemoradiotherapy with temozolomide according to the Stupp protocol and adjuvant chemotherapy. 21 Patients with anaplastic glioma may undergo radiotherapy with or without chemotherapy, depending on tumor molecular profile. 22 Low-grade gliomas with poor prognosis may also be considered for adjuvant treatment. 23 There has been minimal improvement in overall survival (14.6 v 12.2 months) 24 ; thus, new treatments are urgently sought for glioma. Herein, reference gene signatures were compiled from publically available sequenced tumors for astrocytoma disease progression. 2 Subsampling was applied to simulate sequencing depth alterations of gene signatures, and the performance of connectivity mapping was assessed. Results reveal that information loss to gene signatures significantly affects target compound robustness.

METHODS
Published whole transcriptome sequencing data of brain tumor biopsy specimens from adults (accession: GSE48865; Bao et al 2 ) was downloaded from the Sequence Read Archive. 25 On average, samples had 50 million reads each. Reads were quality controlled using Trimmomatic software 26 and aligned using Bowtie2, 27 allowing one mismatch against the human genome version hg38. 28 Aligned reads were mapped to genes from the GRCh38.81 annotation 29 using samExploreR software. 30,31 To benchmark a diverse range in performance of the RNA-seq analysis, mapped reads were subsampled to simulate samples with a range of lower cDNA library sequencing depths using a bioinformatics pipeline 32 (Appendix Fig A1; Data Supplement). RNA-seq reads for transcript-level abundance to gene level were summarized and normalized using the relative log expression method and analyzed for differential expression using full (f = 1.0) and simulated samples with DESeq2. 33 Gene expression signature contrasts representative of astrocytoma disease progression were compiled for low to high (L-H) and high to high (H-H)-grade astrocytoma (Data Supplement). Gene signature contrasts were assessed for consistency in a heatmap using pheatmap R package (http:// CRAN.R-project.org/package=pheatmap). The impact of information loss to gene signatures for DEGs, gene ontology (GO) terms, and target compound detection was assessed using differential expression, GO, and gene expression connectivity mapping analysis, respectively, with DESeq2, GOseq, and the QUB Accelerated Drug and Transcriptomic Connectivity (QUADrATiC) software [33][34][35] (Data Supplement). The reproducibility of significant connections to the Library of Integrated Cellular Signatures identified for all cell lines and neuronal specific cell lines (Data Supplement) by QUADrATiC was investigated 16,36-38 (Data Supplement). Results and associated false discovery rates (FDRs) were visualized using the R packages VennDiagram and ggplot2. 39,40

Assessment of the L-H and H-H Gene Expression Signatures
L-H (Dataset_I) and H-H (Dataset_II) gene signature contrasts comprised 47 and 33 patients, respectively (Data Supplement). Some 6,648 DEGs were identified for Dataset_I, which reduced to 2,550 after filtering ( Fig 1A). Just 608 DEGs were identified for Dataset_II, reducing to 327 after filtering ( Fig 1B). Each gene signature contrast clustered into two separate branches, which mostly stratified patients on the basis of disease grade (Figs 1C and 1D). Dataset_I outperformed Dataset_II; all but one patient clustered according to disease grade. For each gene signature contrast, no outliers outside of the two disease grades were identified.

Impact of Information Loss to Gene Signatures for DEG and GO Detection
For Dataset_I, initial reductions in data analyzed (f = 0.8 to 1.0) did not greatly affect the number of DEGs detected ( Fig 1A). However, the rate of loss of DEGs increased after f = 0.8. For Data-set_II, data loss was immediate, and DEG detection reduced equally for every fraction analyzed, as indicated by the linear relationship ( Fig 1B). Variation in the number of DEGs detected was lower for Dataset_I compared with Dataset_II, as evidenced by smaller confidence intervals. When data input was reduced, the FDR for the number of DEGs detected increased linearly and by approximately the same amount for both data sets (Appendix Fig A2). Dataset_I gene signature therefore demonstrated better resilience to data loss for DEG detection compared with Dataset_II.
For the full data set (f = 1.0), > 200 GO terms described the functions of the DEGs identified for Dataset_I (Appendix Fig A3A). Thus, heterogeneous biologic functions are involved in low-to high-grade astrocytoma disease transition. For Dataset_I, only small decreases in GO terms were detected using data fractions between f = 1.0 and 0.1 (Appendix Fig A3A). Thus, GO term detection was more stable compared with DEGs when Dataset_I gene signature had data loss. The impact of data loss on FDR for GO term detection was on the same scale as that observed for DEG detection for Dataset_I ( Fig A3B). Comparatively fewer GO terms, just three, described the DEGs in Dataset_II for the full data set. Given this low number, which reduced to zero on f = 0.5, GO results for subsampled Dataset_II are not depicted.

Impact of Information Loss to Gene Signatures Used in Gene Expression Connectivity Mapping
For the full data set, a greater number of significant reverse (rev) and progress (prog) connections were identified for Dataset_I compared with Dataset_II (Fig 2A). For Dataset_I, data loss did not greatly affect the number of significant rev and prog connections detected. With increasing data loss, Dataset_I significant connections remained relatively stable (f = 1.0 to 0.7) and then slightly increased. For Dataset_II, rev significant connections decreased steadily with data loss, whereas prog connections were slightly more stable. Dataset_I displayed less variability in the number of significant connections identified, compared with Dataset_II, as evidenced by smaller confidence intervals. For both Data-set_I and Dataset_II, FDR for the number of significant connections increased steadily with decreasing data used ( Fig 2B). However, FDR was three-fold greater for Dataset_II and quickly increased to approximately 10% and 20% for rev and prog connections, respectively, when just 1% of reads were removed (f = 0.99). For target compound identification, Dataset_I was therefore more resilient to alterations in cDNA sequencing depth compared with Dataset_II.
The impact of data loss to gene signatures and the reproducibility of connectivity mapping is presented in Figures 3-5. When full data sets were used for the gene signature (f = 1.0), target compound identification was consistent, and mostly the same compounds were identified between iterations (Figs 3, 4A, and 4C; frequency = 1.0). With data loss to the gene signature (f = 0.01, 0.5), fewer compounds were consistently identified, and a higher number of target compounds were detected at low frequencies of iterations. For example, 3,135 rev connections were identified for Dataset_I using f = 1.0; this increased to approximately 5,000 when subsampled to f = 0.01, but approximately 60% of compounds were infrequently detected (Figs 3B and 3C). Proportion of significant connections that are consistently identified decreases with data loss, but the impact was less for Dataset_I. For Dataset_I, when 50% of reads were removed, approximately 62.5% rev (approximately 2,500 of 4,000) and approximately 50% prog significant connections (approximately 1,500 of 3,000) were identified with every iteration (Fig 3). For Dataset_II, when 50% of reads were removed, just approximately 13% rev (approximately 400 of 3,000) and 9% prog significant connections (approximately 180 of 2,000) were identified with every iteration (Fig 4). No robust calls were identified for Dataset_II at f = 0.01, and little 4 ascopubs.org/journal/po JCO™ Precision Oncology improvement was observed when half the reads were included (f = 0.5; Fig 4). Gene signatures differed in the proportion of significant connections that were consistently identified when cDNA sequencing depth was reduced. When affected by data loss, connectivity mapping results were more robust for Dataset_I compared with the Dataset_II gene signature.
Reducing data to the gene signature led to the loss, gain, and consistent identification of significant connections to target compounds ( Fig 5). for Dataset_I, f = 0.01, that were not identified by the full data set (Fig 5A). These connections were false positives, because they had not been detected with the full gene signature. Last, we examined the loss of significant connections to target compounds for Dataset_I and II. When 50% of the reads were removed, nine and 27 target compounds identified for neuronalspecific cell lines were lost, respectively, for Dataset_I and II (Figs 5C and 5D; Table 1). Thus, for Dataset_II, more target compounds identified by the full gene signature were lost, and some of these included compounds of potential clinical utility for glioma, such as suramin and dasatinib (Table 1). A comparison of the rate of impact of data loss on GO terms and significant connections detection for Dataset_I can be seen by comparing Figures A3 and 2A.

DISCUSSION
Understanding molecular pathways and regulatory networks driving cancer is essential for the development of new therapies. Gene expression profiling using RNA-seq has led to the development of clinically relevant gene signatures that are informative for cancer subtypes. 14  more patient gene expression profiles matched their WHO grade classifications. Results support the subjective nature of tumor classification, which has interobserver variability. 41 Gene signatures provided a framework to assess connectivity mapping output for a well-performing accurate versus a poorer-performing less accurate contrast.
Characterization of the disease progression gene signatures revealed they differed in biologic complexity. L-H gene signature had ten-fold more DEGs (approximately 2,550) compared with the H-H gene signature (327). Results demonstrated the possibility that more genes are involved in low-to high-grade astrocytoma disease transition. After data reduction, DEG loss was not immediate for the L-H gene signature, but with lowering fractions DEG loss increased. For the H-H gene signature, there was immediate and steady DEG loss with reduced data input. FDR for DEG detection increased linearly for both gene signatures; however, the range of FDR values was lower for the L-H gene signature. Thus, the L-H gene signature was more resilient to data loss for DEG detection and had greater test sensitivity compared with the H-H gene signature. Gene signatures also differed in their resilience to data loss for the detection of significant connections to target compounds. Overall, the number of significant connections detected for the L-H gene signature was greater, most likely explained by the heterogeneous biologic mechanisms involved in low-to high-grade astrocytoma transition. With data loss, both rev and prog significant connections remained relatively stable for the L-H gene signature. Data loss led to a steady decrease in rev significant connections for the H-H gene signature; however, prog connections were initially more stable. For both gene signatures, the FDR of significant connections increased with data loss. Overall FDR values and CIs were smaller for the L-H gene signature. For comparative purposes, consider an FDR of 0.1 as an acceptable threshold, where one in every 10 significant connections is a false positive. With data loss, this FDR threshold was reached by the L-H and H-H gene signatures, respectively, when 70% and just 1% of reads were removed. Thus, the L-H gene signature was more resilient to data loss for the detection of significant connections to target compounds using connectivity mapping.
Subsampling of gene signatures for connectivity mapping revealed that the suite of significant connections to target compounds became modified with data loss. Notably, some connections     to target compounds of potential clinical utility were lost when the reads were reduced to 50%. Compounds lost by the H-H gene signature (WHO grade III to IV) included suramin, lopinavir, dasatinib, and vincristine, which have already been considered as glioma treatments. Suramin, an anticancer agent, inhibits the binding of growth factors understood to play a role in glioma progression, angiogenesis, and radioresistance and has been used to treat newly diagnosed GBMs. 42,43 Lopinavir, a protease inhibitor, has reached phase II clinical trials for the treatment of high-grade glioma. 44 Dasatinib, a kinase inhibitor that acts on members of the Src family of kinases, is well studied in glioma and has shown preclinical promise. 45 Vincristine, a spindle poison, is used in combination with procarbazine and lomustine to treat high-grade glioma and has also been successful in a phase III trial for the treatment of low-grade gliomas. 22,46,47 Reductions in transcript abundance probably led to the loss of low-abundance genes from the full gene signature and altered the DEGs detected, leading to the loss of these connectivity mapping connections. Perhaps lowabundance genes that closely characterize the disease phenotype may offer the greatest potential for target compound discovery. If this is the case, then the subsampling approach described herein could potentially identify these important links to target compounds. Fewer significant connections identified by the full data sets were lost by the L-H gene signature compared with the H-H gene signature, suggesting it was more resilient to data loss. It was interesting to note that reduction in cDNA sequencing depth of gene signatures also led to the gain of significant connections to target compounds. Indeed, more significant connections were identified when fewer data were used for both gene signatures; however, few of these connections were consistently identified between iterations. A greater proportion of significant connections were consistently identified with all iterations for the L-H gene signature compared with the H-H gene signature. For connections that were consistently identified, these may have related to the most highly expressed and abundant DEGs in the gene signature contrast. Similarly, in another subsampling RNA-seq study of healthy organisms from multiple taxa, highly expressed genes regulating metabolism and pathogenesis of disease were consistently identified even when downsampling RNA-seq reads to only 1 million reads, 13 thereby corroborating our findings from diseased tumors.
Results highlight the need for determining the optimal cDNA sequencing depth for accurately identifying DEGs when compiling gene signatures. In the future, RNA standard and spike-in controls may be useful to inform RNA-seq best practices. 48 The accuracy of a gene signature was particularly important when carrying out additional downstream analyses, such as connectivity mapping. Information loss to gene signatures led to erroneous and false target compound discoveries. Gene signatures with consistent sample classification and gene expression profiles were more resilient to data loss and provided robust target compound discoveries. Given the instability of gene expression, perhaps using ontology types or ontotypes 49 to characterize contrast phenotypes may be a more reliable approach compared with gene lists in connectivity mapping. Herein, we demonstrate the utility of QUADrATiC software at identifying US Food and Drug Administration-approved compounds that can be repurposed for glioma. Stringent filtering of connectivity mapping results is required to identify reliable significant connections. Subsampling revealed that the connections that were sensitive to data loss were linked to target compounds of potential clinical utility in glioma. These connections may have the best clinical promise for drug repurposing. Other target compounds sensitive to data loss are being tested for their biologic efficacy against glioma stem cells using clonogenic cell survival assays and Western blot analyses in ongoing studies by this research group. For the wider identification of potential therapeutic compounds for repurposing in glioma, gene signatures for oligodendroglioma and ependymoma disease progression could be analyzed using connectivity mapping in the future.