Abstract
The features and theoretical background of IMMAN (acronym for Information theory-based CheMoMetrics ANalysis), a new and free computational program for chemometric analysis, are presented. This multi-platform software is developed in the Java programming language and provides a user-friendly graphical interface for computing a collection of information-theoretic functions adapted for rank-based unsupervised and supervised feature selection tasks. A total of 20 feature selection parameters are presented, 10 for the unsupervised framework and 10 for the supervised framework. Several information-theoretic parameters traditionally used as molecular descriptors (MDs) are adapted for use as unsupervised rank-based feature selection methods. In addition, a generalization scheme for the previously defined differential Shannon’s entropy is discussed, and the Jeffreys information measure is introduced for supervised feature selection. Moreover, well-known information-theoretic feature selection parameters, such as information gain, gain ratio, and symmetrical uncertainty, are incorporated into the IMMAN software (http://mobiosd-hub.com/imman-soft/), following an equal-interval discretization approach. IMMAN offers data pre-processing functionalities, such as missing values processing, dataset partitioning, and browsing, as well as single-parameter or ensemble (multi-criteria) ranking options. Consequently, this software is suitable for tasks such as dimensionality reduction, feature ranking, and comparative diversity analysis of data matrices. Simple examples of applications performed with this program are presented. A comparative study between IMMAN and WEKA feature selection tools using the Arcene dataset was performed, demonstrating similar behavior. In addition, it is revealed that the use of IMMAN unsupervised feature selection methods improves the performance of both IMMAN and WEKA supervised algorithms.
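To make the measures named in the abstract concrete, the following is a minimal Python sketch of equal-interval discretization followed by the textbook definitions of Shannon entropy, information gain, gain ratio, and symmetrical uncertainty. It is an illustration only and not IMMAN's code: all function names, the default of 10 bins, and the toy data are assumptions introduced here, and IMMAN itself is Java software whose internals are not described in this abstract.

```python
import math
from collections import Counter


def equal_interval_bins(values, n_bins=10):
    """Assign each value to one of n_bins equal-width intervals over [min, max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature falls into a single bin.
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def entropy(symbols):
    """Shannon entropy, in bits, of a sequence of discrete symbols."""
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())


def conditional_entropy(classes, bins):
    """H(class | bin): class entropy within each bin, weighted by bin size."""
    n = len(classes)
    groups = {}
    for b, c in zip(bins, classes):
        groups.setdefault(b, []).append(c)
    return sum(len(g) / n * entropy(g) for g in groups.values())


def supervised_scores(values, classes, n_bins=10):
    """Information gain, gain ratio and symmetrical uncertainty of one feature."""
    bins = equal_interval_bins(values, n_bins)
    h_class, h_feat = entropy(classes), entropy(bins)
    ig = h_class - conditional_entropy(classes, bins)
    gr = ig / h_feat if h_feat > 0 else 0.0
    su = 2 * ig / (h_class + h_feat) if (h_class + h_feat) > 0 else 0.0
    return {"info_gain": ig, "gain_ratio": gr, "symmetrical_uncertainty": su}


def unsupervised_score(values, n_bins=10):
    """Shannon entropy of the discretized feature (higher = more variable)."""
    return entropy(equal_interval_bins(values, n_bins))


def rank_features(columns, classes, key="info_gain", n_bins=10):
    """Return feature names sorted from highest to lowest supervised score."""
    scores = {name: supervised_scores(vals, classes, n_bins)[key]
              for name, vals in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # Hypothetical toy data: two features, eight samples, two classes.
    classes = [0, 0, 0, 0, 1, 1, 1, 1]
    columns = {
        "separating": [1, 2, 3, 4, 6, 7, 8, 9],
        "alternating": [1, 9, 1, 9, 1, 9, 1, 9],
    }
    print(rank_features(columns, classes, n_bins=2))
    print(unsupervised_score(columns["separating"], n_bins=2))
```

In this sketch a feature is scored independently of the others, which matches the rank-based (filter) setting described above; the unsupervised score uses no class labels, while the three supervised scores compare the discretized feature against the class vector.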
Graphical abstract
Graphic representation for Shannon’s distribution of MD calculating software.
Acknowledgments
Barigye, S. J. acknowledges financial support from CNPq. Marrero-Ponce, Y. thanks the program ‘International Visiting Professor’ for a fellowship to work at Universidad Tecnológica de Bolívar (Colombia) in 2014. Finally, the authors are also indebted to the Molecular Diversity Editor in Chief Dr. Guillermo A. Morales for his comments and manuscript revision, as well as his kind attention.
Cite this article
Urias, R.W.P., Barigye, S.J., Marrero-Ponce, Y. et al. IMMAN: free software for information theory-based chemometric analysis. Mol Divers 19, 305–319 (2015). https://doi.org/10.1007/s11030-014-9565-z
DOI: https://doi.org/10.1007/s11030-014-9565-z