A Novel Method for Prediction of Protein Domain Using Distance-Based Maximal Entropy

  • Conference paper
Advances in Neural Networks – ISNN 2007 (ISNN 2007)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4492)

Abstract

Detecting the boundaries of protein domains is an important and challenging problem in experimental and computational structural biology. In this paper, domain detection is first formulated as an imbalanced data learning problem. A novel undersampling method based on distance-based maximal entropy in the feature space of SVMs is then proposed. From multiple sequence alignments derived from a database search, multiple measures are defined to quantify the domain information content of each position along the sequence. The overall accuracy is about 87%, with high sensitivity and specificity. Simulation results demonstrate that the method is useful not only for predicting the complete 3D structure of a protein but also for machine learning on imbalanced datasets in general.
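The abstract does not spell out the selection criterion, so the following is only a minimal sketch of the general idea of undersampling a majority class using distances measured in an SVM kernel's feature space. It relies on the standard identity ||φ(x) − φ(y)||² = K(x,x) − 2K(x,y) + K(y,y). The RBF kernel, the `gamma` value, and the greedy farthest-point rule are assumptions chosen for illustration; the farthest-point rule is a simple diversity heuristic standing in for the paper's distance-based maximal entropy criterion, not a reproduction of it. All function names are hypothetical.

```python
import math
import random

def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel; the kernel choice and gamma are assumptions."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq)

def feature_space_distance(x, y, kernel=rbf_kernel):
    """Distance in the kernel-induced feature space:
    ||phi(x) - phi(y)||^2 = K(x,x) - 2*K(x,y) + K(y,y)."""
    d2 = kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)
    return math.sqrt(max(d2, 0.0))  # clamp tiny negative rounding errors

def undersample_majority(majority, n_keep, kernel=rbf_kernel):
    """Return indices of n_keep majority samples that are spread out in
    feature space (greedy farthest-point selection, a stand-in heuristic)."""
    if n_keep >= len(majority):
        return list(range(len(majority)))
    chosen = [0]  # start from the first sample (arbitrary assumption)
    # min_dist[i]: distance from sample i to its nearest chosen sample
    min_dist = [feature_space_distance(majority[0], m, kernel) for m in majority]
    min_dist[0] = -1.0  # mark as already chosen
    while len(chosen) < n_keep:
        nxt = max(range(len(majority)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        min_dist[nxt] = -1.0
        for i, m in enumerate(majority):
            d = feature_space_distance(majority[nxt], m, kernel)
            if d < min_dist[i]:  # never true for already-chosen samples
                min_dist[i] = d
    return chosen

# Toy usage: 200 synthetic "non-boundary" feature vectors reduced to 20.
random.seed(0)
majority = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(200)]
kept = undersample_majority(majority, n_keep=20)
print(len(kept))  # 20 indices into `majority`
```

In a setting like the one described, the retained majority samples would be combined with all minority (domain-boundary) samples before training the SVM, which is one common way to reduce the bias toward the majority class.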

Editor information

Derong Liu Shumin Fei Zengguang Hou Huaguang Zhang Changyin Sun

Copyright information

© 2007 Springer Berlin Heidelberg

About this paper

Cite this paper

Zou, S., Huang, Y., Wang, Y., Hu, C., Liang, Y., Zhou, C. (2007). A Novel Method for Prediction of Protein Domain Using Distance-Based Maximal Entropy. In: Liu, D., Fei, S., Hou, Z., Zhang, H., Sun, C. (eds) Advances in Neural Networks – ISNN 2007. ISNN 2007. Lecture Notes in Computer Science, vol 4492. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-72393-6_149

  • DOI: https://doi.org/10.1007/978-3-540-72393-6_149

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-72392-9

  • Online ISBN: 978-3-540-72393-6

  • eBook Packages: Computer Science, Computer Science (R0)
