Knowledge Discovery in Multi-label Phenotype Data

Clare, Amanda; King, Ross D.

doi:10.1007/3-540-44794-6_4

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2168)

Included in the following conference series:

European Conference on Principles of Data Mining and Knowledge Discovery

Abstract

The biological sciences are undergoing an explosion in the amount of available data, and new data analysis methods are needed to deal with it. We present work using knowledge discovery in databases (KDD) to analyse data from mutant phenotype growth experiments with the yeast S. cerevisiae in order to predict novel gene functions. The analysis presented a number of challenges: multi-class labels, a large number of sparsely populated classes, the need to learn a set of accurate rules (not a complete classification), and a very large number of missing values. We developed resampling strategies and modified the C4.5 algorithm to deal with these problems. The rules learnt are accurate and biologically meaningful. They predict the function of 83 putative genes of currently unknown function at an estimated accuracy of > 80%.
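
The abstract does not say how C4.5 was modified. As an illustration only, the sketch below assumes one plausible adaptation for multi-label data: the split criterion sums a binary entropy (member / non-member) over every class, so an example carrying several labels contributes to each of them. The function names `multilabel_entropy` and `information_gain` and the toy labels are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence, Set


def multilabel_entropy(examples: Sequence[Set[str]], classes: Sequence[str]) -> float:
    """Sum of per-class binary entropies (in bits) over multi-label examples.

    Assumption: each class is treated as an independent yes/no membership,
    which is one possible way to extend single-label entropy.
    """
    n = len(examples)
    if n == 0:
        return 0.0
    total = 0.0
    for c in classes:
        # Fraction of examples that carry label c.
        p = sum(1 for labels in examples if c in labels) / n
        for prob in (p, 1.0 - p):
            if prob > 0.0:  # skip zero terms, since 0 * log(0) is taken as 0
                total -= prob * math.log2(prob)
    return total


def information_gain(
    parent: Sequence[Set[str]],
    subsets: Sequence[Sequence[Set[str]]],
    classes: Sequence[str],
) -> float:
    """Entropy of the parent minus the size-weighted entropy of its subsets."""
    n = len(parent)
    if n == 0:
        return 0.0
    weighted = sum(
        len(s) / n * multilabel_entropy(s, classes) for s in subsets
    )
    return multilabel_entropy(parent, classes) - weighted


# Toy data: four genes, each annotated with a set of functional classes.
genes = [{"a"}, {"a", "b"}, {"b"}, {"b"}]
split = [genes[:2], genes[2:]]  # a hypothetical split on some phenotype attribute
gain = information_gain(genes, split, ["a", "b"])
```

A decision-tree learner would evaluate such a gain for each candidate attribute and split on the best one; leaves can then predict a set of classes rather than a single class.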

Author information

Authors and Affiliations

Department of Computer Science, University of Wales Aberystwyth, SY23 3DB, UK
Amanda Clare & Ross D. King

Editor information

Editors and Affiliations

Department of Computer Science, Albert-Ludwigs University Freiburg, Georges Köhler-Allee, Geb. 079, 79110, Freiburg, Germany
Luc De Raedt
Inst. of Information and Computing Sciences, Dept. of Mathematics and Computer Science, University of Utrecht, Padualaan 14, de Uithof, 3508 TB Utrecht, The Netherlands
Arno Siebes

About this paper

Cite this paper

Clare, A., King, R.D. (2001). Knowledge Discovery in Multi-label Phenotype Data. In: De Raedt, L., Siebes, A. (eds) Principles of Data Mining and Knowledge Discovery. PKDD 2001. Lecture Notes in Computer Science, vol 2168. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44794-6_4

DOI: https://doi.org/10.1007/3-540-44794-6_4
Published: 28 August 2001
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42534-2
Online ISBN: 978-3-540-44794-8
eBook Packages: Springer Book Archive
