Open Access   Article Go Back

KMerHuffman upon Biological Sequence Compression

S.Roy 1 , S.Khatua 2

  1. Computer Science and Engineering, Academy of Technology, Kolkata, India.
  2. Computer Science and Engineering, University of Calcutta, Kolkata, India.

Correspondence should be addressed to: subhankar.roy2012@gmail.com.

Section:Research Paper, Product Type: Journal Paper
Volume-5 , Issue-12 , Page no. 201-206, Dec-2017

CrossRef-DOI:   https://doi.org/10.26438/ijcse/v5i12.201206

Online published on Dec 31, 2017

Copyright © S.Roy, S.Khatua . This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

View this paper at   Google Scholar | DPI Digital Library

How to Cite this Paper

  • IEEE Citation
  • MLA Citation
  • APA Citation
  • BibTex Citation
  • RIS Citation

IEEE Style Citation: S.Roy, S.Khatua, “KMerHuffman upon Biological Sequence Compression,” International Journal of Computer Sciences and Engineering, Vol.5, Issue.12, pp.201-206, 2017.

MLA Style Citation: S.Roy, S.Khatua "KMerHuffman upon Biological Sequence Compression." International Journal of Computer Sciences and Engineering 5.12 (2017): 201-206.

APA Style Citation: S.Roy, S.Khatua, (2017). KMerHuffman upon Biological Sequence Compression. International Journal of Computer Sciences and Engineering, 5(12), 201-206.

BibTex Style Citation:
@article{_2017,
author = {S.Roy, S.Khatua},
title = {KMerHuffman upon Biological Sequence Compression},
journal = {International Journal of Computer Sciences and Engineering},
issue_date = {12 2017},
volume = {5},
Issue = {12},
month = {12},
year = {2017},
issn = {2347-2693},
pages = {201-206},
url = {https://www.ijcseonline.org/full_paper_view.php?paper_id=1602},
doi = {https://doi.org/10.26438/ijcse/v5i12.201206}
publisher = {IJCSE, Indore, INDIA},
}

RIS Style Citation:
TY - JOUR
DO = {https://doi.org/10.26438/ijcse/v5i12.201206}
UR - https://www.ijcseonline.org/full_paper_view.php?paper_id=1602
TI - KMerHuffman upon Biological Sequence Compression
T2 - International Journal of Computer Sciences and Engineering
AU - S.Roy, S.Khatua
PY - 2017
DA - 2017/12/31
PB - IJCSE, Indore, INDIA
SP - 201-206
IS - 12
VL - 5
SN - 2347-2693
ER -

VIEWS PDF XML
575 306 downloads 297 downloads
  
  
           

Abstract

Huge amount of genomic data are produced due to high-throughput sequencing technology. Those enormous volumes of sequence data require effective storage, fast transmission and provision of quick access for alignment and analysis to any record. It has been proved that standard general purpose lossless compression techniques failed to compress these sequences rather they may increase the size. But some general purpose compression method may be useful with a modification for genome compression. In this paper, a variation of statistical Huffman algorithm have been proposed named KMerHuffman, which instead of calculating frequency of individual character it comes as a substring of length four which we have experiment to be optimal due to redundancy of genome sequence. Then KMerHuffman result on benchmark sequence has been compare with the other biological sequence specific compression algorithm. The result shows that KMerHuffman is competitive with other method. Another important aspect is that there is no need of any reference sequence so it is useful for upcoming sequence.

Key-Words / Index Term

Storage, Transmission, Alignment, Analysis, Compression, Huffman

References

[1] D.A. Huffman, “A method for the construction of minimum-redundancy codes”, Proc. Inst. Radio Eng., vol. 40, pp. 1098–1101, 1952.
[2] D.A. Lelewer, D.S. Hirschberg, “Data compression. Computing Surveys”, vol. 19, no. 3, pp. 261–296, 1987.
[3] A. Cannane, H.E. Williams, “General-Purpose Compression for Efficient Retrieval”, Journal of the American Society for Information Science & Technology, vol. 52, no. 5, pp. 430–437, 2001.
[4] Department of Chemistry, Queen Mary University of London, “Nomenclature for Incompletely Specified Bases in Nucleic Acid Sequences”.
[5] S. Wandelt, M. Bux, U. Leser, “Trends in Genome Compression”, June 4, 2013.
[6] P. R. Rajeswari and Dr. A. AppaRao, “DNABIT Compress – Genome compression algorithm”, Bioinformation,vol. 5 no. 8, (2011) January, pp. 350-360.
[7] S. Roy, A. Bhagat, K.A. Sharma, S. Khatua”, SBVRLDNACOMP: AN EFFECTIVE DNA SEQUENCE COMPRESSION ALGORITHM”, IJCSA, Vol.5, No.4, pp. 73-85, August 2015.
[8] S. Roy, S. Mondal, S. Khatua, M. Biswas, “An Efficient Compression Algorithm for Forthcoming New Species”, IJHIT, Vol.8, No.11, pp.323-332, November 2015.
[9] J. Ziv and A. Lempel, “A Universal Algorithm for Sequential Data Compression,” IEEE Trans. Information Theory, vol. IT-23, no. 3, pp. 337-343, May 1977.
[10] B. G. Chern, I. Ochoa, A. Manolakos, A. No, K. Venkat and T. Weissman, “Reference Based Genome Compression”, Information Systems Laboratory, Stanford University.
[11] M.C. Brandon, D.C. Wallace, P. Baldi, “Data structures and compression algorithms for genomic sequence data”, Bioinformatics, Vol. 25, no. 14, pages 1731–1738, May, 2009.
[12] http://site.iugaza.edu.ps/jroumy/files/Shanon-Fano.pdf
[13] A.S.E. Campos, “Arithmetic coding “http://www.arturocampos.com/ac_arithmetic.html. (Accessed 02 February 2009)
[14] G. Cormack, N. Horspool. Data compression using dynamic markov modelling. Comput. J., vol. 30: pp. 541-550, 1987.
[15] Ziv, Jacob and Lempel, Abraham, "A Universal Algorithm for Sequential Data Compression", IEEE Transactions on Information Theory, Vol. 23, no. 3, pp. 337–343, May 1977.
[16] Ziv, Jacob and Lempel, Abraham, "Compression of Individual Sequences via Variable-Rate Coding", IEEE Transactions on Information Theory, vol. 24, no. 5, pp. 530–536, September 1978.
[17] S. Kuruppu, S.J. Puglisi, and J. Zobel, “Relative Lempel-Ziv Compression of Genomes for Large-Scale Storage and Retrieval,” Proc. 17th Int’l Conf. String Processing and Information Retrieval (SPIRE ’10), pp. 201-206, 2010.
[18] S. Kuruppu, S. Puglisi, and J. Zobel, “Optimized Relative Lempel- Ziv Compression of Genomes”, Australasian Computer Science Conf., 2011.
[19] S. Deorowicz and S. Grabowski, “Robust Relative Compression of Genomes with Random Access”, Bioinformatics, vol. 27, pp. 2979- 2986, Nov. 2011.
[20] S. Kuruppu, B. Beresford-Smith, T. Conway, and J. Zobel, “Iterative Dictionary Construction for Compression of Large DNA Data Sets”, IEEE/ACM Trans. Computational Biology and Bioinformatics, vol. 9, no. 1, Jan./Feb. 2012.
[21] S. Grumbach and F. Tahi, “Compression of DNA sequences”, IEEE Symp. on the Data Compression Conf., DCC-93, Snowbird, UT, (1993), pp. 340–350.
[22] S. Grumbach and F. Tahi, “A new challenge for compression algorithms: genetic sequences”, Info. Process. & Manage, Elsevier, (1994), pp.875-866.
[23] X. Chen, S. Kwong, M. Li, “A compression algorithm for DNA sequences and its applications in genome comparison”, In Proc. 4th Annual Int. Conf. Computation. Molecular Biol. (RECOMB), pp.107 – 117, 2000.
[24] X. Chen, M. Li, B. Ma and J. Tromp, “DNACompress: Fast and Effective DNA Sequence Compression”, Bioinformatics, vol. 18, (2002) June, pp. 1696-1698.
[25] G. Korodi and I. Tabus, “An Efficient Normalized MaximumLikelihood Algorithm for DNA Sequence Compression”, ACM Trans. Information Systems, vol. 23, no. 1, pp. 3-34, 2005.
[26] J. Zhen, J. Zhou, L. Jiang, Q. H. Wu. Overview of DNA sequence data compression techniques. Acta Electronica Sinica, pp. 1113 – 1121, 2010.
[27] L. TAN, J. SUN, W. XIONG, “A Compression Algorithm for DNA Sequence Using Extended Operations”, Journal of Computational Information Systems, vol. 8, no. 18 pp. 7685–7691, 2012.
[28] The GeNML homepage: http://www.cs.tut.fi/~tabus/genml/results.html.