Abstract
In recent years, advances in DNA sequencing technology have led to rapid growth in the amount of available genomic sequence data. One such type of data set results from multiple sequence alignments (MSA). In this paper, we propose a method for compressing these data sets using a mixture of finite-context models and arithmetic coding. The method relies on image compression concepts. It was tested on the multiz28way data set and attained a compression rate of around 0.93 bits per symbol on the sequence data, better than the ≈ 1 bit per symbol attained by a recently proposed method.
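To make the phrase "mixture of finite-context models" concrete, the following is a minimal sketch and not the authors' implementation. It assumes a plain four-symbol DNA alphabet, a single sequence rather than an alignment, additive-smoothing count estimators, and a simple performance-based weight update; the names FiniteContextModel and mixture_bits_per_symbol and the parameters orders, alpha and gamma are all hypothetical. It reports the ideal code length in bits per symbol that an arithmetic coder driven by these probabilities would approach; the arithmetic coder itself is not implemented.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumption: no gaps, N's, or other alignment symbols


class FiniteContextModel:
    """Order-k model: counts of each symbol seen after each length-k context."""

    def __init__(self, order, alpha=1.0):
        self.order = order
        self.alpha = alpha  # additive smoothing constant (assumed value)
        self.counts = defaultdict(lambda: [0] * len(ALPHABET))

    def _context(self, history):
        # history[-0:] would return the whole string, so order 0 is special-cased.
        return history[-self.order:] if self.order > 0 else ""

    def probs(self, history):
        counts = self.counts[self._context(history)]
        total = sum(counts) + self.alpha * len(ALPHABET)
        return [(n + self.alpha) / total for n in counts]

    def update(self, history, symbol):
        self.counts[self._context(history)][ALPHABET.index(symbol)] += 1


def mixture_bits_per_symbol(sequence, orders=(1, 4, 8), gamma=0.9):
    """Average ideal code length (bits/symbol) under a weighted model mixture."""
    models = [FiniteContextModel(k) for k in orders]
    weights = [1.0 / len(models)] * len(models)
    max_order = max(orders)
    total_bits = 0.0
    for i, symbol in enumerate(sequence):
        history = sequence[max(0, i - max_order):i]
        s = ALPHABET.index(symbol)
        per_model = [m.probs(history) for m in models]
        # Mixture probability of the symbol that actually occurred.
        p = sum(w * pm[s] for w, pm in zip(weights, per_model))
        total_bits += -math.log2(p)
        # Reward models that predicted this symbol well; gamma < 1 forgets old evidence.
        weights = [(w ** gamma) * pm[s] for w, pm in zip(weights, per_model)]
        norm = sum(weights)
        weights = [w / norm for w in weights]
        for m in models:
            m.update(history, symbol)
    return total_bits / len(sequence)


if __name__ == "__main__":
    print(mixture_bits_per_symbol("ACGT" * 200))
```

With no prior statistics every model is uniform, so the first symbol costs 2 bits; on highly repetitive input the average drops well below that as the counts accumulate. The figures quoted in the abstract come from the authors' own method and data set, not from this sketch.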
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Matos, L.M.O., Pratas, D., Pinho, A.J. (2012). Compression of Whole Genome Alignments Using a Mixture of Finite-Context Models. In: Campilho, A., Kamel, M. (eds) Image Analysis and Recognition. ICIAR 2012. Lecture Notes in Computer Science, vol 7324. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31295-3_42
DOI: https://doi.org/10.1007/978-3-642-31295-3_42
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-31294-6
Online ISBN: 978-3-642-31295-3
eBook Packages: Computer Science (R0)