IMMAN: free software for information theory-based chemometric analysis

Urias, Ricardo W. Pino; Barigye, Stephen J.; Marrero-Ponce, Yovani; García-Jacas, César R.; Valdes-Martiní, José R.; Perez-Gimenez, Facundo

doi:10.1007/s11030-014-9565-z

Full-Length Paper
Published: 26 January 2015

Molecular Diversity, volume 19, pages 305–319 (2015)

Ricardo W. Pino Urias^1,2,
Stephen J. Barigye^3,
Yovani Marrero-Ponce^1,4,5,
César R. García-Jacas^1,6,
José R. Valdes-Martiní^2 &
Facundo Perez-Gimenez^4


Abstract

This paper presents the features and theoretical background of IMMAN (Information theory-based CheMoMetrics ANalysis), a new, free program for chemometric analysis. IMMAN is multi-platform software written in Java, with a user-friendly graphical interface for computing a collection of information-theoretic functions adapted for rank-based unsupervised and supervised feature selection. It provides 20 feature selection parameters, 10 for the unsupervised framework and 10 for the supervised one. Several information-theoretic parameters traditionally used as molecular descriptors (MDs) are adapted for use as unsupervised rank-based feature selection methods. A generalization scheme for the previously defined differential Shannon’s entropy is discussed, and the Jeffreys information measure is introduced for supervised feature selection. In addition, well-known information-theoretic feature selection parameters, such as information gain, gain ratio, and symmetrical uncertainty, are incorporated into the IMMAN software (http://mobiosd-hub.com/imman-soft/) following an equal-interval discretization approach. IMMAN offers data pre-processing functionalities such as missing-value processing, dataset partitioning, and browsing, and provides single-parameter or ensemble (multi-criteria) ranking options. The software is therefore suitable for tasks such as dimensionality reduction, feature ranking, and comparative diversity analysis of data matrices. Simple example applications of the program are presented. A comparative study of the IMMAN and WEKA feature selection tools on the Arcene dataset showed similar behavior. In addition, the IMMAN unsupervised feature selection methods were found to improve the performance of both IMMAN and WEKA supervised algorithms.
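To make the ranking measures named in the abstract concrete, the following is a minimal Python sketch of equal-interval discretization followed by Shannon entropy, information gain, gain ratio, and symmetrical uncertainty, using their standard textbook definitions. It is not IMMAN's Java implementation. The function names, the `n_bins` parameter, the toy data, and the convention that a higher entropy ranks a feature higher in the unsupervised case are all assumptions made for illustration.

```python
import math
from collections import Counter


def equal_interval_bins(values, n_bins):
    """Map each value to a bin index using equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant feature: every value falls in one bin
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value is clamped into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def entropy(symbols):
    """Shannon entropy, in bits, of a sequence of discrete symbols."""
    n = len(symbols)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(symbols).values())


def information_gain(feature_bins, labels):
    """IG = H(class) - H(class | feature)."""
    n = len(labels)
    groups = {}
    for b, y in zip(feature_bins, labels):
        groups.setdefault(b, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def gain_ratio(feature_bins, labels):
    """Information gain normalized by the entropy of the feature."""
    h_x = entropy(feature_bins)
    return information_gain(feature_bins, labels) / h_x if h_x > 0 else 0.0


def symmetrical_uncertainty(feature_bins, labels):
    """SU = 2 * IG / (H(feature) + H(class)), bounded in [0, 1]."""
    denom = entropy(feature_bins) + entropy(labels)
    if denom <= 0:
        return 0.0
    return 2.0 * information_gain(feature_bins, labels) / denom


def rank_features(scores):
    """Return feature names sorted from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # Hypothetical toy data: three features, four samples, two classes.
    labels = [0, 0, 1, 1]
    data = {
        "x_informative": [0.0, 1.0, 2.0, 3.0],
        "x_noise": [0.0, 3.0, 0.0, 3.0],
        "x_constant": [5.0, 5.0, 5.0, 5.0],
    }
    binned = {k: equal_interval_bins(v, n_bins=2) for k, v in data.items()}
    unsupervised = {k: entropy(b) for k, b in binned.items()}
    supervised = {k: symmetrical_uncertainty(b, labels) for k, b in binned.items()}
    print("unsupervised ranking:", rank_features(unsupervised))
    print("supervised ranking:", rank_features(supervised))
```

In this toy setup the constant feature has zero entropy and so falls to the bottom of the unsupervised ranking, while only the feature whose bins coincide with the class labels receives a non-zero supervised score. An ensemble (multi-criteria) ranking could, under similar assumptions, combine the per-measure ranks of each feature, for example by averaging them.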

Graphical abstract

Graphic representation for Shannon’s distribution of MD calculating software.



Acknowledgments

Barigye, S. J. acknowledges financial support from CNPq. Marrero-Ponce, Y. thanks the program ‘International Visiting Professor’ for a fellowship to work at Universidad Tecnológica de Bolívar (Colombia) in 2014. Finally, the authors are also indebted to the Molecular Diversity Editor in Chief Dr. Guillermo A. Morales for his comments and manuscript revision, as well as his kind attention.

Author information

Authors and Affiliations

1. Unit of Computer-Aided Molecular “Biosilico” Discovery and Bioinformatic Research (CAMD-BIR International), Cartagena de Indias, Bolívar, Colombia
Ricardo W. Pino Urias, Yovani Marrero-Ponce & César R. García-Jacas
2. Faculty of Mathematics Physics and Computation, Universidad Central “Marta Abreu” de Las Villas, Santa Clara, 54830, Villa Clara, Cuba
Ricardo W. Pino Urias & José R. Valdes-Martiní
3. Departamento de Química, Universidade Federal de Lavras, UFLA, Caixa Postal 3037, 37200-000, Lavras, MG, Brazil
Stephen J. Barigye
4. Facultad de Farmacia, Universitat de València, Burjasot, 46100, València, Spain
Yovani Marrero-Ponce & Facundo Perez-Gimenez
5. Grupo de Investigación en Estudios Químicos y Biológicos, Facultad de Ciencias Básicas, Universidad Tecnológica de Bolívar, Cartagena de Indias, Bolívar, Colombia
Yovani Marrero-Ponce
6. Grupo de Investigación de Bioinformática, Centro de Estudio de Matemática Computacional (CEMC), Universidad de las Ciencias Informáticas, La Habana, Cuba
César R. García-Jacas

Corresponding author

Correspondence to Yovani Marrero-Ponce.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (doc 129 KB)

About this article

Cite this article

Urias, R.W.P., Barigye, S.J., Marrero-Ponce, Y. et al. IMMAN: free software for information theory-based chemometric analysis. Mol Divers 19, 305–319 (2015). https://doi.org/10.1007/s11030-014-9565-z

Received: 29 August 2014
Accepted: 24 December 2014
Published: 26 January 2015
Issue Date: May 2015
DOI: https://doi.org/10.1007/s11030-014-9565-z
