Compression of Whole Genome Alignments Using a Mixture of Finite-Context Models

  • Conference paper
Image Analysis and Recognition (ICIAR 2012)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 7324)

Abstract

In recent years, advances in DNA sequencing technology have caused rapid growth in the amount of available genomic sequence data. One such type of data set results from multiple sequence alignments (MSA). In this paper, we propose a method for compressing these data sets using a mixture of finite-context models and arithmetic coding. The method relies on image compression concepts. It was tested on the multiz28way data set and attained a compression rate of around 0.93 bits per symbol on the sequence data, better than the ≈ 1 bit per symbol attained by a recently proposed method.
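The abstract gives no implementation details, so the following is only a minimal sketch of the general idea of mixing finite-context models, not the authors' method. It assumes a five-symbol alphabet (four bases plus a gap), hypothetical model orders, an additive smoothing constant `alpha`, and a weight update controlled by a forgetting factor `gamma`; all of these names and values are illustrative assumptions. Instead of running an arithmetic coder, it reports the ideal code length in bits per symbol, which an arithmetic coder would closely approach.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT-"  # assumed symbol set: four bases plus a gap symbol


class FiniteContextModel:
    """Order-k model: counts of each symbol following each k-symbol context."""

    def __init__(self, order, alpha=1.0):
        self.order = order
        self.alpha = alpha  # assumed additive smoothing constant
        self.counts = defaultdict(lambda: [0] * len(ALPHABET))

    def _context(self, history):
        # history[-0:] would return the whole string, so order 0 is special-cased
        return history[-self.order:] if self.order > 0 else ""

    def probs(self, history):
        c = self.counts[self._context(history)]
        total = sum(c) + self.alpha * len(ALPHABET)
        return [(n + self.alpha) / total for n in c]

    def update(self, history, symbol):
        self.counts[self._context(history)][ALPHABET.index(symbol)] += 1


def mixture_bits_per_symbol(sequence, orders=(1, 2, 4), gamma=0.9):
    """Average ideal code length (bits/symbol) under a weighted model mixture."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    models = [FiniteContextModel(k) for k in orders]
    weights = [1.0 / len(models)] * len(models)
    max_order = max(orders)
    total_bits = 0.0
    for i, symbol in enumerate(sequence):
        history = sequence[max(0, i - max_order):i]
        s = ALPHABET.index(symbol)
        per_model = [m.probs(history) for m in models]
        # Mixture probability of the symbol that actually occurred
        p = sum(w * pm[s] for w, pm in zip(weights, per_model))
        total_bits += -math.log2(p)
        # Assumed update rule: reward models that predicted this symbol well
        weights = [(w ** gamma) * pm[s] for w, pm in zip(weights, per_model)]
        norm = sum(weights)
        weights = [w / norm for w in weights]
        for m in models:
            m.update(history, symbol)
    return total_bits / len(sequence)


if __name__ == "__main__":
    print(mixture_bits_per_symbol("ACGT" * 200))
```

On a highly repetitive toy input the mixture quickly favours whichever model predicts best, so the average cost falls well below the log2(5) bits per symbol of a uniform code. Real alignment data would need the column-wise (image-like) contexts the abstract alludes to, which this one-dimensional sketch does not attempt.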

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Matos, L.M.O., Pratas, D., Pinho, A.J. (2012). Compression of Whole Genome Alignments Using a Mixture of Finite-Context Models. In: Campilho, A., Kamel, M. (eds) Image Analysis and Recognition. ICIAR 2012. Lecture Notes in Computer Science, vol 7324. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31295-3_42

  • DOI: https://doi.org/10.1007/978-3-642-31295-3_42

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-31294-6

  • Online ISBN: 978-3-642-31295-3

  • eBook Packages: Computer Science, Computer Science (R0)
