Abstract
Precise annotation of protein function is central to a deep understanding of molecular biology. The overwhelming majority of proteins, across species, lack sufficient supplementary information and therefore remain uncharacterized. In contrast, every known protein has one key piece of information available: its amino acid sequence. To make algorithms applicable to proteins from different species, researchers are therefore motivated to develop computational techniques that characterize proteins from their amino acid sequence alone. However, computational techniques such as deep learning require large amounts of labeled data to produce good results, and labeling is time- and resource-consuming, which makes labeled data scarce. To address both problems, the large number of uncharacterized proteins and the label requirements of conventional deep learning, we propose a model called GOGAN that operates on the amino acid sequence of a protein to predict its functions. GOGAN does not require handcrafted features; it automatically extracts all the required information from the input sequence, and it learns these features from massively large unlabeled protein datasets. Here, "unlabeled data" refers to data that have not been assigned labels identifying their characteristics or properties. The features extracted by GOGAN can also be used in other applications, such as gene variation analysis, gene expression analysis and gene regulation network detection. The proposed model is benchmarked on the Homo sapiens protein dataset extracted from the UniProt database. Experimental results show clear improvements across several evaluation metrics compared with other methods. Overall, GOGAN achieves an F1 score of 72.1% with a Hamming loss of 9.5%, using only the amino acid sequences of proteins.
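The two metrics reported above can be illustrated with a minimal sketch, assuming protein function prediction is treated as multi-label classification in which each protein receives a binary vector over function labels. The abstract does not state how the F1 score is averaged, so micro-averaging is an assumption here, and the toy label matrices and function names are hypothetical, not data or code from the paper.

```python
from typing import Sequence

Matrix = Sequence[Sequence[int]]


def hamming_loss(y_true: Matrix, y_pred: Matrix) -> float:
    """Fraction of (protein, label) slots that are predicted incorrectly."""
    total = sum(len(row) for row in y_true)
    wrong = sum(
        t != p
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    )
    return wrong / total


def micro_f1(y_true: Matrix, y_pred: Matrix) -> float:
    """F1 computed from counts pooled over all proteins and labels.

    Micro-averaging is an assumption; the abstract does not specify it.
    """
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical toy example: 3 proteins, 4 candidate function labels each.
Y_TRUE = [[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 0, 1]]
Y_PRED = [[1, 0, 0, 0], [0, 1, 1, 0], [1, 1, 0, 1]]

print(f"Hamming loss: {hamming_loss(Y_TRUE, Y_PRED):.3f}")  # 0.167
print(f"Micro F1:     {micro_f1(Y_TRUE, Y_PRED):.3f}")      # 0.833
```

A lower Hamming loss and a higher F1 score both indicate better predictions. The two are reported together because Hamming loss also counts correctly predicted absent labels, which dominate when each protein carries only a few of the many possible function labels.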
Availability of data and materials
The dataset can be obtained from UniProt (Consortium 2015). It is also available at: https://github.com/musadaqmansoor/gogan.
Code Availability Statement
The code for this research project is available as open source at: https://github.com/musadaqmansoor/gogan.
References
(1999) Interpro. https://www.ebi.ac.uk/interpro/. Accessed on 01 July 2020
Aebersold R, Mann M (2003) Mass spectrometry-based proteomics. Nature 422(6928):198
Altschul SF, Gish W, Miller W et al (1990) Basic local alignment search tool. J Mol Biol 215(3):403–410
Ange Tato RN (2018) Improving adam optimizer. bioRxiv p 262501
Apostolopoulos ID, Mpesiana TA (2020) Covid-19: automatic detection from X-ray images utilizing transfer learning with convolutional neural networks. Phys Eng Sci Med, p 1
Arjovsky M, Chintala S, Bottou L (2017) Wasserstein gan. arXiv preprint arXiv:1701.07875
Ashburner M, Ball CA, Blake JA et al (2000) Gene ontology: tool for the unification of biology. Nat Genet 25(1):25–29
Babbar R, Schölkopf B (2019) Data scarcity, robustness and extreme multi-label classification. Mach Learn 108(8–9):1329–1351
Bartel PL, Roecklein JA, SenGupta D et al (1996) A protein linkage map of escherichia coli bacteriophage t7. Nat Genet 12(1):72
Benso A, Di Carlo S, ur Rehman H, et al (2013) A combined approach for genome wide protein function annotation/prediction. Proteome Sci 11(1):S1
Borhani M (2020) Multi-label log-loss function using l-bfgs for document categorization. Eng Appl Artif Intell 91(103):623
Bork P, Dandekar T, Diaz-Lazcoz Y et al (1998) Predicting function: from genes to genomes and back. J Mol Biol 283(4):707–725
Causier B (2004) Studying the interactome with the yeast two-hybrid system and mass spectrometry. Mass Spectrom Rev 23(5):350–367
Che J, Chen L, Guo ZH et al (2020) Drug target group prediction with multiple drug networks. Combin Chem High Throughput Screen 23(4):274–284
Chen Y, Qin X, Wang J, et al (2020) Fedhealth: a federated transfer learning framework for wearable healthcare. IEEE Intell Syst
Consortium U (2015) Uniprot: a hub for protein information. Nucleic Acids Res 43(D1):D204–D212
Cooper GM (2000) The cell: a molecular approach, 2nd edn. ASM Press, Washington
Cruz LM, Trefflich S, Weiss VA, et al (2017) Protein function prediction. Funct Genomics, pp 55–75
Deng M, Zhang K, Mehta S et al (2003) Prediction of protein function using protein-protein interaction data. J Comput Biol 10(6):947–960
Di Tullio A, Reale S, De Angelis F (2005) Molecular recognition by mass spectrometry. J Mass Spectrom 40(7):845–865
Finley RL, Brent R (1994) Interaction mating reveals binary and ternary connections between drosophila cell cycle regulators. Proc Natl Acad Sci 91(26):12980–12984
Friedberg I (2006) Automated protein function prediction-the genomic challenge. Brief Bioinform 7(3):225–242
Gaudet P, Livstone MS, Lewis SE et al (2011) Phylogenetic-based propagation of functional annotations within the gene ontology consortium. Brief Bioinform 12(5):449–462
Gene OC, et al (2015) Gene ontology consortium: going forward. Nucleic Acids Res 43(Database issue):D1049–56
Ghahramani A, Watt FM, Luscombe NM (2018) Generative adversarial networks uncover epidermal regulators and predict single cell perturbations. bioRxiv p 262501
Ghavidel A, Cagney G, Emili A (2005) A skeleton of the human protein interactome. Cell 122(6):830–832
Giot L, Bader JS, Brouwer C, et al (2003) A protein interaction map of drosophila melanogaster. Science 302(5651):1727–1736
Gligorijević V, Barot M, Bonneau R (2018) deepnf: deep network fusion for protein function prediction. Bioinformatics 34(22):3873–3881
Goodfellow I, Pouget-Abadie J, Mirza M, et al (2014) Generative adversarial nets. In: Advances in neural information processing systems, pp 2672–2680
Gulrajani I, Ahmed F, Arjovsky M, et al (2017) Improved training of wasserstein gans. In: Advances in neural information processing systems, pp 5767–5777
Gunnar H (2018) Real-valued medical time series generation with recurrent conditional gans. bioRxiv p 262501
Gupta A, Zou J (2018) Feedback gan (fbgan) for dna: a novel feedback-loop architecture for optimizing protein functions. arXiv preprint arXiv:1804.01694
Huttenhower C, Hibbs M, Myers C et al (2006) A scalable method for integration and functional analysis of multiple microarray datasets. Bioinformatics 22(23):2890–2897
Jiang Y, Oron TR, Clark WT et al (2016) An expanded evaluation of protein function prediction methods shows an improvement in accuracy. Genome Biol 17(1):184
Joo W, Kim D, Shin S, et al (2020) Generalized gumbel-softmax gradient estimator for various discrete random variables. arXiv preprint arXiv:2003.01847
Kanehisa M (2020) Kanehisa Laboratories - Growth of Major Databases. Pathway Solutions; Bioinfomatics Center. https://www.kanehisa.jp/en/db_growth.html. Accessed 01 July 2020
Killoran N, Lee LJ, Delong A, et al (2017) Generating and designing dna with deep generative models. arXiv preprint arXiv:1712.06148
Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980
Letovsky S, Kasif S (2003) Predicting protein function from protein/protein interaction data: a probabilistic approach. Bioinformatics 19(suppl_1):i197–i204
Li S, Armstrong CM, Bertin N, et al (2004) A map of the interactome network of the metazoan c. elegans. Science 303(5657):540–543
Liang G, Zheng L (2020) A transfer learning method with deep residual network for pediatric pneumonia diagnosis. Comput Methods Programs Biomed 187(104):964
Liao W, Wang Y, Yin Y et al (2020) Improved sequence generation model for multi-label classification via cnn and initialized fully connection. Neurocomputing 382:188–195
Liu X (2017) Deep recurrent neural network for protein function prediction from sequence. arXiv preprint arXiv:1701.08318
Lv Z, Ao C, Zou Q (2019) Protein function prediction: from traditional classifier to deep learning. Proteomics, p 1900119
Marcotte EM, Pellegrini M, Ng HL et al (1999) Detecting protein function and protein–protein interactions from genome sequences. Science 285(5428):751–753
Martin Arjovsky S, Bottou L (2017) Wasserstein generative adversarial networks. In: Proceedings of the 34th international conference on machine learning, Sydney, Australia
Nabieva E, Jim K, Agarwal A, et al (2005) Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps. Bioinformatics 21(suppl_1):i302–i310
Najafabadi MM, Villanustre F, Khoshgoftaar TM et al (2015) Deep learning applications and challenges in big data analytics. J Big Data 2(1):1
Nauman M, Rehman HU, Politano G et al (2019) Beyond homology transfer: deep learning for automated annotation of proteins. J Grid Comput 17(2):225–237
Ouyang W, Aristov A, Lelek M et al (2018) Deep learning massively accelerates super-resolution localization microscopy. Nat Biotechnol 36(5):460
Pal D, Eisenberg D (2005) Inference of protein function from protein structure. Structure 13(1):121–130
Pazos F, Sternberg MJ (2004) Automated prediction of protein function and detection of functional sites from structure. Proc Natl Acad Sci 101(41):14754–14759
Pellegrini M, Marcotte EM, Thompson MJ et al (1999) Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci 96(8):4285–4288
Piovesan D, Giollo M, Leonardi E et al (2015) Inga: protein function prediction combining interaction networks, domain assignments and sequence similarity. Nucleic Acids Res 43(W1):W134–W140
Radivojac P, Clark WT, Oron TR et al (2013) A large-scale evaluation of computational protein function prediction. Nat Methods 10(3):221–227
Rual JF, Venkatesan K, Hao T et al (2005) Towards a proteome-scale map of the human protein-protein interaction network. Nature 437(7062):1173
Shen LX, Basilion JP, Stanton VP (1999) Single-nucleotide polymorphisms can cause different structural folds of mrna. Proc Natl Acad Sci 96(14):7871–7876
Shoemaker BA, Panchenko AR (2007) Deciphering protein–protein interactions. Part I. experimental techniques and databases. PLoS Comput Biol 3(3):e42
Tieleman T, Hinton G (2012) Divide the gradient by a running average of its recent magnitude. Coursera neural netw. Mach Learn 6:26–31
Vazquez A, Flammini A, Maritan A et al (2003) Global protein function prediction from protein–protein interaction networks. Nat Biotechnol 21(6):697
Villani C (2008) Optimal transport: old and new, vol 338. Springer, Berlin
Vincent P, Larochelle H, Lajoie I, et al (2010) Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. J Mach Learn Res 11(12)
Walhout AJ, Sordella R, Lu X, et al (2000) Protein interaction mapping in c. elegans using proteins involved in vulval development. Science 287(5450):116–122
Watson JD, Laskowski RA, Thornton JM (2005) Predicting protein function from sequence and structural data. Curr Opin Struct Biol 15(3):275–284
Xin F, Radivojac P (2011) Computational methods for identification of functional residues in protein structures. Curr Protein Pept Sci 12(6):456–469
Zeiler MD (2012) Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701
Zhang F, Song H, Zeng M, et al (2019) Deepfunc: a deep learning framework for accurate prediction of protein functions from protein sequences and interactions. Proteomics, p 1900019
Zhang ML, Fang JP (2020) Partial multi-label learning via credible label elicitation. IEEE Trans Pattern Anal Mach Intell
Zhuang F, Qi Z, Duan K, et al (2019) A comprehensive survey on transfer learning. arXiv preprint arXiv:1911.02685
Funding
The authors did not receive any funding for this research project.
Author information
Contributions
The idea of using GANs for bioinformatics was conceived by Musadaq Mansoor and Muhammad Nauman. Hafeez Ur Rehman and Alfredo Benso provided domain knowledge. Musadaq Mansoor and Muhammad Nauman wrote the code, ran the experiments and analyzed the results. Alfredo Benso and Hafeez Ur Rehman helped analyze the results and contributed the final discussion. The manuscript was written by Musadaq Mansoor and was reviewed and updated by all authors.
Ethics declarations
Conflict of interest
The authors declare no competing financial interests in connection with this research.
Ethics approval
Not applicable.
Consent to participate
Not applicable.
Consent for publication
All authors consent to the publication of this manuscript.
Additional information
Communicated by Irfan Uddin.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Mansoor, M., Nauman, M., Ur Rehman, H. et al. Gene Ontology GAN (GOGAN): a novel architecture for protein function prediction. Soft Comput 26, 7653–7667 (2022). https://doi.org/10.1007/s00500-021-06707-z
DOI: https://doi.org/10.1007/s00500-021-06707-z