ABSTRACT
High-throughput short-read RNA-seq protocols often produce paired-end reads, leaving the middle portion of each fragment unsequenced. We explore whether the full-length fragments can be computationally reconstructed from the two sequenced ends in the absence of a reference genome, a problem we refer to here as de novo bridging. Solving this problem provides longer, more informative RNA-seq reads and benefits downstream RNA-seq analyses such as transcript assembly, expression quantification, and differential splicing analysis. However, de novo bridging is challenging owing to alternative splicing, transcript noise, and sequencing errors. It remains unclear whether the data provide sufficient information for accurate bridging, let alone whether efficient algorithms exist to determine the true bridges. Methods have been proposed to bridge paired-end reads in the presence of a reference genome (called reference-based bridging), but these algorithms are far from scaling to de novo bridging, because the underlying compacted de Bruijn graph (cdBG) used in the latter task often contains millions of vertices and edges. We designed a new truncated Dijkstra's algorithm for this problem, and proposed a novel algorithm that reuses the shortest-path tree, which avoids running the truncated Dijkstra's algorithm from scratch for every vertex and further speeds up the computation. These techniques result in scalable algorithms that can bridge all paired-end reads in a cdBG with millions of vertices. Our experiments showed that paired-end RNA-seq reads can be accurately bridged to a large extent. The resulting tool is freely available at https://github.com/Shao-Group/rnabridge-denovo.
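The abstract does not spell out what "truncated" means. One plausible reading is a Dijkstra search that never extends a path beyond a fixed distance bound, for example a maximum plausible fragment length. The sketch below illustrates that reading only. The function name `truncated_dijkstra`, the `max_dist` parameter, and the adjacency-list graph encoding are all assumptions made for illustration; this is not the implementation in rnabridge-denovo, and it does not include the shortest-path-tree reuse.

```python
import heapq


def truncated_dijkstra(graph, source, max_dist):
    """Bounded single-source shortest paths (illustrative sketch).

    graph: dict mapping a vertex to a list of (neighbor, weight) pairs,
           with non-negative weights (hypothetical encoding of a cdBG).
    Returns a dict of shortest distances for every vertex reachable
    from `source` within `max_dist`; farther vertices are omitted.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry; a shorter path was already found
        for v, w in graph.get(u, ()):
            nd = d + w
            if nd > max_dist:
                continue  # truncation: do not extend paths past the bound
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


# Toy graph; edge weights stand in for sequence lengths (assumed).
toy_graph = {
    "a": [("b", 2), ("c", 5)],
    "b": [("c", 1), ("d", 10)],
    "c": [("d", 4)],
}
within_six = truncated_dijkstra(toy_graph, "a", 6)
```

Because the search explores only the neighborhood within the bound, its cost depends on the size of that neighborhood rather than on the whole graph, which is the property that matters when the graph has millions of vertices.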