Abstract
The biological sciences are undergoing an explosion in the amount of available data, and new data analysis methods are needed to cope with it. We present work using knowledge discovery in databases (KDD) to analyse data from mutant phenotype growth experiments with the yeast S. cerevisiae in order to predict novel gene functions. The analysis presented a number of challenges: multi-class labels, a large number of sparsely populated classes, the need to learn a set of accurate rules (rather than a complete classification), and a very large number of missing values. We developed resampling strategies and modified the C4.5 algorithm to deal with these problems. The rules learnt are accurate and biologically meaningful. They predict the function of 83 putative genes of currently unknown function at an estimated accuracy of over 80%.
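The abstract does not say how C4.5 was modified. As one illustration of what handling multi-class labels in a decision tree learner can involve, the sketch below computes an entropy for examples that may each carry several class labels, by summing a binary (member or non-member) entropy over the classes. This formulation, the function name `multilabel_entropy`, and the toy label sets are assumptions made for illustration and are not taken from the text above.

```python
import math
from typing import Iterable, Set


def multilabel_entropy(examples: Iterable[Set[str]], classes: Iterable[str]) -> float:
    """Sum of per-class binary entropies (in bits) over a set of examples.

    Each example is the set of class labels attached to one gene.
    Assumption: every class is treated as an independent member/non-member
    split. This is an illustrative choice, not a method stated in the abstract.
    """
    examples = list(examples)
    n = len(examples)
    if n == 0:
        return 0.0
    total = 0.0
    for c in classes:
        # Fraction of examples that carry class c.
        p = sum(1 for labels in examples if c in labels) / n
        # Add the entropy of the member / non-member split for this class.
        for prob in (p, 1.0 - p):
            if prob > 0.0:
                total -= prob * math.log2(prob)
    return total


if __name__ == "__main__":
    # Hypothetical toy data: four genes, two functional classes.
    genes = [{"metabolism"}, {"metabolism", "transport"}, {"transport"}, set()]
    print(multilabel_entropy(genes, ["metabolism", "transport"]))
```

A tree learner could use such a quantity in place of the usual single-label entropy when scoring candidate splits, so that a gene belonging to several functional classes contributes to each of them.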
Keywords
- Knowledge Discovery
- Functional Class
- Decision Tree Algorithm
- Functional Hierarchy
- International Human Genome Sequencing Consortium
Copyright information
© 2001 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Clare, A., King, R.D. (2001). Knowledge Discovery in Multi-label Phenotype Data. In: De Raedt, L., Siebes, A. (eds) Principles of Data Mining and Knowledge Discovery. PKDD 2001. Lecture Notes in Computer Science, vol 2168. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44794-6_4
DOI: https://doi.org/10.1007/3-540-44794-6_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42534-2
Online ISBN: 978-3-540-44794-8
eBook Packages: Springer Book Archive