IMMAN: free software for information theory-based chemometric analysis

  • Full-Length Paper
  • Published in: Molecular Diversity

Abstract

This paper presents the features and theoretical background of IMMAN (Information theory-based CheMoMetrics ANalysis), a new, free program for chemometric analysis. IMMAN is multi-platform software written in Java, with a user-friendly graphical interface, that computes a collection of information-theoretic functions adapted for rank-based unsupervised and supervised feature selection. A total of 20 feature selection parameters are provided, 10 unsupervised and 10 supervised. Several information-theoretic parameters traditionally used as molecular descriptors (MDs) are adapted for use as unsupervised rank-based feature selection methods. For supervised feature selection, a generalization scheme for the previously defined differential Shannon’s entropy is discussed and the Jeffreys information measure is introduced. In addition, well-known information-theoretic feature selection parameters, such as information gain, gain ratio, and symmetrical uncertainty, are incorporated into IMMAN (http://mobiosd-hub.com/imman-soft/), following an equal-interval discretization approach. IMMAN offers data pre-processing functionalities, such as missing-value processing, dataset partitioning, and browsing, and supports ranking by a single parameter or by an ensemble (multi-criteria) of parameters. The software is therefore suitable for tasks such as dimensionality reduction, feature ranking, and comparative diversity analysis of data matrices. Simple example applications of the program are presented. A comparative study of the IMMAN and WEKA feature selection tools on the Arcene dataset showed similar behavior. In addition, the use of IMMAN unsupervised feature selection methods is shown to improve the performance of both IMMAN and WEKA supervised algorithms.
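
To make the supervised measures named in the abstract concrete, the following is a minimal sketch of equal-interval discretization followed by information gain, gain ratio, and symmetrical uncertainty, using their textbook definitions. It is not IMMAN's code: the function names, the base-2 logarithm, and the choice of two bins in the usage note are assumptions made for illustration only.

```python
import math
from collections import Counter


def equal_interval_bins(values, n_bins):
    """Assign each value to one of n_bins equal-width intervals over [min, max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature falls entirely into a single bin.
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def entropy(symbols):
    """Shannon entropy (in bits) of a sequence of discrete symbols."""
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())


def information_gain(feature_bins, classes):
    """H(class) - H(class | feature) for a discretized feature."""
    n = len(classes)
    conditional = 0.0
    for b in set(feature_bins):
        subset = [c for f, c in zip(feature_bins, classes) if f == b]
        conditional += (len(subset) / n) * entropy(subset)
    return entropy(classes) - conditional


def gain_ratio(feature_bins, classes):
    """Information gain normalized by the entropy of the feature itself."""
    h_feature = entropy(feature_bins)
    if h_feature == 0:
        return 0.0
    return information_gain(feature_bins, classes) / h_feature


def symmetrical_uncertainty(feature_bins, classes):
    """2 * IG / (H(feature) + H(class)), which lies in [0, 1]."""
    denom = entropy(feature_bins) + entropy(classes)
    if denom == 0:
        return 0.0
    return 2.0 * information_gain(feature_bins, classes) / denom


def rank_features(features, classes, score, n_bins):
    """Return feature names sorted by decreasing score (hypothetical helper)."""
    scored = {
        name: score(equal_interval_bins(values, n_bins), classes)
        for name, values in features.items()
    }
    return sorted(scored, key=scored.get, reverse=True)
```

As a usage illustration on made-up data, calling `rank_features({"a": [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9], "b": [0.1, 0.9, 0.2, 0.8, 0.1, 0.9, 0.2, 0.8]}, [0, 0, 0, 0, 1, 1, 1, 1], information_gain, 2)` places `"a"` ahead of `"b"`, because the two bins of `"a"` separate the classes perfectly while those of `"b"` carry no class information. Equal-interval binning is sensitive to outliers and to the number of bins, so any ranking obtained this way should be read with that choice in mind.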

Graphical abstract

Graphic representation for Shannon’s distribution of MD calculating software.

Acknowledgments

Barigye, S. J. acknowledges financial support from CNPq. Marrero-Ponce, Y. thanks the program ‘International Visiting Professor’ for a fellowship to work at Universidad Tecnológica de Bolívar (Colombia) in 2014. Finally, the authors are also indebted to the Molecular Diversity Editor in Chief Dr. Guillermo A. Morales for his comments and manuscript revision, as well as his kind attention.

Author information

Corresponding author

Correspondence to Yovani Marrero-Ponce.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (doc 129 KB)

Cite this article

Urias, R.W.P., Barigye, S.J., Marrero-Ponce, Y. et al. IMMAN: free software for information theory-based chemometric analysis. Mol Divers 19, 305–319 (2015). https://doi.org/10.1007/s11030-014-9565-z
