Identification of functionally diverse lipocalin proteins from sequence information using support vector machine

  • Original Article
  • Amino Acids

Abstract

Lipocalins are functionally diverse proteins composed of 120–180 amino acid residues. Members of this family have several important biological functions, including ligand transport, cryptic coloration, sensory transduction, endonuclease activity, stress response in plants, odorant binding, prostaglandin biosynthesis, regulation of cellular homeostasis, immunity and immunotherapy. Identifying lipocalins from protein sequence is challenging because sequence identity among family members is poor, often falling below the twilight zone. So far, no specific method has been reported for identifying lipocalins from primary sequence. In this paper, we report a support vector machine (SVM) approach, LipoPred, that predicts lipocalins from protein sequence using sequence-derived properties. LipoPred was trained on a dataset of 325 lipocalin proteins and 325 non-lipocalin proteins, and evaluated on an independent set of 140 lipocalin proteins and 21,447 non-lipocalin proteins. LipoPred achieved 88.61% accuracy with 89.26% sensitivity, 85.27% specificity and a Matthews correlation coefficient (MCC) of 0.74. When applied to the independent test dataset, LipoPred achieved 84.25% accuracy with 88.57% sensitivity, 84.22% specificity and an MCC of 0.16. LipoPred outperformed the PSI-BLAST, HMM and SVM-Prot methods. Out of 218 lipocalins, LipoPred correctly predicted 194 proteins, including 39 lipocalins that are non-homologous to any protein in the SWISSPROT database. This result shows that LipoPred is potentially useful for predicting lipocalin proteins that have no sequence homologs in the sequence databases. Further, the successful prediction of nine hypothetical lipocalin proteins and five new members of the lipocalin family shows that LipoPred can be used to identify and annotate new lipocalin proteins from sequence databases. The LipoPred software and dataset are available at http://www3.ntu.edu.sg/home/EPNSugan/index_files/lipopred.htm.
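The gap between the test-set accuracy (84.25%) and the test-set MCC (0.16) can be checked from the figures in the abstract. The sketch below is an illustration only, not part of LipoPred: it assumes the confusion-matrix counts can be recovered by rounding the reported sensitivity and specificity against the stated class sizes (140 lipocalins, 21,447 non-lipocalins). The function names are hypothetical.

```python
from math import sqrt


def confusion_from_rates(n_pos, n_neg, sensitivity, specificity):
    """Reconstruct (TP, FN, TN, FP) from class sizes and reported rates.

    Assumption: counts are obtained by rounding rate * class size;
    the abstract does not list the counts themselves.
    """
    tp = round(n_pos * sensitivity)
    tn = round(n_neg * specificity)
    return tp, n_pos - tp, tn, n_neg - tn


def accuracy(tp, fn, tn, fp):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fn + tn + fp)


def mcc(tp, fn, tn, fp):
    """Matthews correlation coefficient; 0.0 if any marginal is empty."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Independent test set figures quoted in the abstract.
tp, fn, tn, fp = confusion_from_rates(140, 21447, 0.8857, 0.8422)
test_acc = accuracy(tp, fn, tn, fp)
test_mcc = mcc(tp, fn, tn, fp)
test_precision = tp / (tp + fp)

print(f"TP={tp} FN={fn} TN={tn} FP={fp}")
print(f"accuracy={test_acc:.4f} MCC={test_mcc:.2f} precision={test_precision:.3f}")
```

Under these reconstructed counts, the negative class is about 150 times larger than the positive class, so false positives far outnumber true positives and only roughly 3.5% of predicted lipocalins are actual lipocalins. MCC penalises this, while accuracy and specificity do not, which is consistent with the low test-set MCC reported alongside high accuracy.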

Acknowledgments

GP and PNS acknowledge the financial support offered by the A*Star (Agency for Science, Technology and Research). RS acknowledges the support provided by the National Center for Biological Sciences (NCBS). KKK acknowledges the support by the Graduate School for Computing in Medicine and Life Sciences funded by Germany’s Excellence Initiative [DFG GSC 235/1]. KKK acknowledges Mr. Rajeev Gangal, CEO, Insilico division, Systems Biology India Pvt Ltd, Maharashtra, India and Prof. Thomas Martinetz and Dr. Stefen Moller, Institute for Neuro- and Bioinformatics, University of Luebeck, Germany for their support.

Author information

Corresponding authors

Correspondence to P. N. Suganthan or R. Sowdhamini.

About this article

Cite this article

Pugalenthi, G., Kandaswamy, K.K., Suganthan, P.N. et al. Identification of functionally diverse lipocalin proteins from sequence information using support vector machine. Amino Acids 39, 777–783 (2010). https://doi.org/10.1007/s00726-010-0520-8

  • DOI: https://doi.org/10.1007/s00726-010-0520-8
