Scalable Reference Genome Assembly from Compressed Pan-Genome Index with Spark

  • Conference paper
Big Data – BigData 2020 (BIGDATA 2020)

Abstract

High-throughput sequencing (HTS) technologies have enabled rapid sequencing of genomes and large-scale genome analytics on massive data sets. Traditionally, genetic variation analyses have been based on the human reference genome, which was assembled from a relatively small human population. However, genetic variation can be discovered more comprehensively by using a collection of genomes, i.e., a pan-genome, as a reference. Pan-genomic references can be assembled from larger populations or from a specific population under study. Exploiting pan-genomic references with current bioinformatics tools, however, requires efficient compression and indexing methods. To leverage the accumulating genomic data, new genome analysis pipelines have to harness the power of distributed and parallel computing. We propose a scalable distributed pipeline, PanGenSpark, for compressing and indexing pan-genomes and for assembling a reference genome from the pan-genomic index. We experimentally show the scalability of PanGenSpark with human pan-genomes on a distributed Spark cluster comprising 448 cores across 26 computing nodes. Assembling a consensus genome from a pan-genome of 50 human individuals took 215 min, and from a pan-genome of 500 human individuals 1468 min. In our experiments, the index of a 1.41 TB pan-genome was compressed to 164.5 GB.

Notes

  1. https://github.com/NGSeq/PanGenSpark

  2. https://hadoop.apache.org

  3. https://github.com/disq-bio/disq

  4. https://gatk.broadinstitute.org

  5. https://github.com/bigdatagenomics/adam

  6. http://biodoop-seal.sourceforge.net

  7. https://github.com/HadoopGenomics/SeqPig

  8. https://gitlab.com/dvalenzu/CHIC

  9. https://gitlab.com/dvalenzu/PanVC

  10. https://github.com/broadinstitute/gatk

  11. https://github.com/tsnorri/vcf2multialign

  12. http://samtools.sourceforge.net/mpileup.shtml

Acknowledgements

The financial support of Academy of Finland projects #336092 and #309048 is gratefully acknowledged. The computing capacity from CSC - IT Center for Science and the FGCI2 Academy of Finland Infrastructure project are gratefully acknowledged.

Author information

Corresponding author

Correspondence to Altti Ilari Maarala.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Maarala, A.I., Arasalo, O., Valenzuela, D., Heljanko, K., Mäkinen, V. (2020). Scalable Reference Genome Assembly from Compressed Pan-Genome Index with Spark. In: Nepal, S., Cao, W., Nasridinov, A., Bhuiyan, M.Z.A., Guo, X., Zhang, LJ. (eds) Big Data – BigData 2020. BIGDATA 2020. Lecture Notes in Computer Science, vol 12402. Springer, Cham. https://doi.org/10.1007/978-3-030-59612-5_6

  • DOI: https://doi.org/10.1007/978-3-030-59612-5_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-59611-8

  • Online ISBN: 978-3-030-59612-5

  • eBook Packages: Computer Science, Computer Science (R0)
