A Novel Method for Prediction of Protein Domain Using Distance-Based Maximal Entropy

Zou, Shuxue; Huang, Yanxin; Wang, Yan; Hu, Chengquan; Liang, Yanchun; Zhou, Chunguang

doi:10.1007/978-3-540-72393-6_149

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4492)

Included in the following conference series:

International Symposium on Neural Networks

Abstract

Detecting the boundaries of protein domains is an important and challenging problem in experimental and computational structural biology. In this paper, domain detection is first cast as an imbalanced data learning problem. A novel undersampling method using distance-based maximal entropy in the feature space of SVMs is proposed. Several measures, defined on multiple sequence alignments derived from a database search, quantify the domain information content of each position along the sequence. The overall accuracy is about 87%, with high sensitivity and specificity. Simulation results suggest that the method is useful not only for predicting the complete 3D structure of a protein but also for machine learning on imbalanced datasets in general.
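
The abstract does not spell out the undersampling procedure, so the following is only a rough sketch of the general idea of thinning a majority class by distances measured in an SVM kernel's feature space. It relies on the standard identity ||phi(x) - phi(y)||^2 = k(x,x) - 2k(x,y) + k(y,y). Greedy farthest-point selection is swapped in here as a simple stand-in for the maximal-entropy criterion, which is not reproduced. The names `rbf_kernel`, `feature_space_distance` and `undersample_majority`, the RBF kernel and the `gamma` value are all assumptions made for illustration, not the authors' implementation.

```python
import math


def rbf_kernel(x, y, gamma=0.05):
    """RBF kernel k(x, y) = exp(-gamma * ||x - y||^2) (assumed kernel choice)."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq)


def feature_space_distance(x, y, kernel=rbf_kernel):
    """Distance between phi(x) and phi(y) computed via the kernel trick."""
    d2 = kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)
    return math.sqrt(max(d2, 0.0))  # clamp tiny negative rounding errors


def undersample_majority(majority, n_keep, kernel=rbf_kernel):
    """Return indices of n_keep majority samples that are spread out in
    feature space (greedy farthest-point selection, a hypothetical stand-in
    for the paper's entropy-based criterion)."""
    if n_keep <= 0:
        return []
    if n_keep >= len(majority):
        return list(range(len(majority)))
    selected = [0]  # arbitrary starting point
    # Distance from every sample to its nearest already-selected sample.
    min_dist = [feature_space_distance(p, majority[0], kernel) for p in majority]
    while len(selected) < n_keep:
        nxt = max(range(len(majority)), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, p in enumerate(majority):
            d = feature_space_distance(p, majority[nxt], kernel)
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


# Toy majority class: two tight pairs and one isolated point.
majority = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
kept = undersample_majority(majority, n_keep=3)
print([majority[i] for i in kept])
```

In this toy run the selection keeps one sample from each region of the majority class rather than both members of a near-duplicate pair, which is the kind of redundancy reduction an undersampling step aims for before the balanced set is passed to an SVM.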

Author information

Authors and Affiliations

College of Computer Science and Technology, Jilin University, Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, 130012, China
Shuxue Zou, Yanxin Huang, Yan Wang, Chengquan Hu, Yanchun Liang & Chunguang Zhou

Editor information

Derong Liu, Shumin Fei, Zengguang Hou, Huaguang Zhang, Changyin Sun

About this paper

Cite this paper

Zou, S., Huang, Y., Wang, Y., Hu, C., Liang, Y., Zhou, C. (2007). A Novel Method for Prediction of Protein Domain Using Distance-Based Maximal Entropy. In: Liu, D., Fei, S., Hou, Z., Zhang, H., Sun, C. (eds) Advances in Neural Networks – ISNN 2007. ISNN 2007. Lecture Notes in Computer Science, vol 4492. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-72393-6_149

DOI: https://doi.org/10.1007/978-3-540-72393-6_149
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-72392-9
Online ISBN: 978-3-540-72393-6
eBook Packages: Computer Science, Computer Science (R0)
