A bacterial phyla dataset for protein function prediction

Protein function prediction is one of the most studied and most challenging problems in computational biology. The vast majority of known proteins have not yet been characterised experimentally, and a significant gap remains between knowledge of their structures and of their functions. New unannotated sequences are being added to public protein databases (e.g. UniProtKB) at an enormous pace [1]. Proteins of unknown function may play key roles in metabolism, growth and developmental regulation, so leaving them uncharacterised means that important information may be missed. Computational tools can provide insight into protein function from sequence, structure, evolutionary history and association with other proteins [2]. For proteins with well-characterised close relatives, function can readily be inferred from those relatives. Orphan proteins without discernible sequence relatives present a greater challenge [3]: experimental characterisation must proceed blind and becomes unwieldy. It is highly unlikely that all known proteins will ever be fully characterised experimentally [4]. There is therefore a pressing need for fast and accurate computational approaches. To this end, we prepared a dataset for protein function prediction by extracting the sequences and annotations of reviewed prokaryotic proteins (323,719 in total, as accessed on March 10, 2019) belonging to 9 bacterial phyla: Actinobacteria, Bacteroidetes, Chlamydiae, Cyanobacteria, Firmicutes, Fusobacteria, Proteobacteria, Spirochaetes and Tenericutes. Samples were filtered to those annotated with the 1739 most frequent Gene Ontology (Molecular Function) terms, and 171,212 proteins were retained for feature generation. The dataset was generated by calculating sequence-based, sub-sequence-based, physicochemical and annotation-based features for each of the 171,212 reviewed proteins using the method in [10].
These features comprise 9890 attributes per protein sequence, accompanied by 1739 Gene Ontology terms. Each protein sequence is assigned one or more of the 1739 Gene Ontology (Molecular Function) terms as its target labels. The dataset also contains the Entry and Entry name of each sequence as given in the UniProtKB database. Because the dataset is large (171,212 samples × 9890 features, with 1739 classes and possibly several labels per sample) and provides sufficient positive and negative samples for each of the 1739 classes, it is well suited to testing the efficiency of new deep learning models [5]. We divided the full dataset of 171,212 reviewed proteins in a 3:1 ratio to form Train/Test dataset 1, with 128,409 training samples and 42,803 test samples, to facilitate the training of a deep learning model. The train and test datasets are stratified so that each of the 1739 classes is well represented in both. We then prepared dataset 2, consisting of pathogenic unreviewed proteins from the 9 bacterial phyla, each described by the same 9890 features as the train/test dataset of reviewed proteins but without target labels, so that their functions can be predicted using the deep learning model proposed in [5].
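The filtering and labelling steps described above can be illustrated with a minimal sketch. It assumes the annotations are available as a mapping from entry ID to a list of GO term IDs; the function names, the toy entry IDs and the tie-breaking rule are hypothetical and are not taken from the method in [10].

```python
from collections import Counter

def select_frequent_terms(annotations, k):
    """Return the k most frequent GO terms (ties broken by term ID)."""
    counts = Counter(
        term for terms in annotations.values() for term in set(terms)
    )
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]

def build_label_matrix(annotations, terms):
    """Keep proteins with at least one retained term; encode labels as 0/1 rows."""
    index = {term: i for i, term in enumerate(terms)}
    labels = {}
    for entry, entry_terms in annotations.items():
        row = [0] * len(terms)
        for term in entry_terms:
            if term in index:
                row[index[term]] = 1
        if any(row):  # proteins with no retained term are filtered out
            labels[entry] = row
    return labels

# Toy annotations with made-up entry IDs (assumption, for illustration only).
toy = {
    "ENTRY_A": ["GO:0005524", "GO:0003677"],
    "ENTRY_B": ["GO:0005524"],
    "ENTRY_C": ["GO:0016301"],
    "ENTRY_D": ["GO:0003677", "GO:0005524"],
}
terms = select_frequent_terms(toy, k=2)
labels = build_label_matrix(toy, terms)
```

In the dataset itself k would be 1739; here k=2 keeps the two most frequent toy terms and drops ENTRY_C, which carries neither of them.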


Data
The 171,212 extracted reviewed protein samples belong to 9 bacterial phyla: Actinobacteria, Bacteroidetes, Chlamydiae, Cyanobacteria, Firmicutes, Fusobacteria, Proteobacteria, Spirochaetes and Tenericutes. Each phylum has Train and Test .csv (comma-separated values) files; the Train file contains 75% and the Test file 25% of the data from that phylum. Test dataset 2 was constructed from pathogenic unreviewed protein sequences belonging to the 9 bacterial phyla. These UniProtKB entries have not yet received any Gene Ontology annotation [1–4] and can therefore be used for prediction.
Each data file contains the columns described in points 1 to 8 below.

1. Entry
The unique ID given to each protein entry available on UniProtKB.

2. Entry name
A mnemonic identifier for the unique ID provided to each protein entry.

3. Sequence
The amino acid sequence of the corresponding protein entry.

4. Sequence-based features
Attributes derived from the primary structure of the protein.

5. Physicochemical features
Attributes based on the physical and chemical properties of the monomeric units of a protein, i.e. its amino acids.

6. Annotation-based features
Attributes based on existing annotations regarding subcellular localisation, binding preference of proteins and presence of transmembrane regions.

7. Subsequence-based features
Attributes corresponding to local similarities within a given protein sequence.

8. Gene Ontology (Molecular Function domain only) terms
The target labels: one column per GO term, holding 1 if the protein is annotated with that term and 0 otherwise.
The supplementary data files and their short descriptions are as follows.
Dataset 1 (FASTA files of Dataset 1): FASTA sequences of the 171,212 proteins of the 9 bacterial phyla, in two parts: "Dataset1 non-proteo.fasta" (sequences of all proteins from phyla other than Proteobacteria) and "Dataset1 proteo.fasta" (sequences of all proteins from the phylum Proteobacteria). The two FASTA files are zipped together (fasta seq of dataset.zip).
Dataset 2 (Train Dataset 1): feature vectors extracted from reviewed proteins (75% of the 171,212 reviewed proteins) of the 9 bacterial phyla. A total of 18 Excel sheets, all zipped; also available on the project's GitHub repository.
Dataset 3 (Test Dataset 1): feature vectors extracted from reviewed proteins (25% of the 171,212 reviewed proteins) of the 9 bacterial phyla. A total of 12 Excel sheets, all zipped; also available on the project's GitHub repository.
Dataset 4 (Test Dataset 2): feature vectors extracted from unreviewed and hypothetical proteins of the 9 bacterial phyla from pathogenic bacterial species (9 Excel sheets, all zipped).
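The FASTA files of Dataset 1 can be read with a few lines of Python. The sketch below is a plain-Python reader that assumes the headers follow the usual UniProtKB layout ">db|Entry|EntryName description"; whether the supplementary files keep that exact header layout is an assumption, and the toy records are made up.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def split_uniprot_header(header):
    """Return (entry, entry_name) from 'db|Entry|EntryName ...' (assumed layout)."""
    first_token = header.split(None, 1)[0]
    parts = first_token.split("|")
    if len(parts) == 3:
        return parts[1], parts[2]
    return first_token, ""  # fall back to the raw token if the layout differs

# Made-up records standing in for a file such as "Dataset1 proteo.fasta".
toy_fasta = io.StringIO(
    ">sp|X00001|TOYA_EXAMP Hypothetical toy protein A\n"
    "MKTAYIAK\n"
    "QRQISFVK\n"
    ">sp|X00002|TOYB_EXAMP Hypothetical toy protein B\n"
    "MSLNE\n"
)
records = [
    split_uniprot_header(header) + (sequence,)
    for header, sequence in read_fasta(toy_fasta)
]
```

Each record then carries the Entry, the Entry name and the sequence, which correspond to the first three columns of the feature files. For a real file, the io.StringIO object would be replaced by an opened file handle.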

Experiment design, materials, and methods
Using web-scraping libraries in Python [7], reviewed proteins of the 9 bacterial phyla were extracted from UniProtKB. These samples were filtered to retain those annotated with the 1739 relevant Gene Ontology terms (molecular function domain only). For each sample, motifs were then extracted from the PROSITE server [9] using Python; the motifs were analysed to remove redundancy and added to the dataset as features. Finally, sequence-based, sub-sequence-based [8], annotation-based and physicochemical features were calculated for each sample, with Gene Ontology (Molecular Function) terms as target labels (a sample annotated with a GO term has 1 in the corresponding column and 0 otherwise). All features were generated using the method in [10] with the following packages: Biopython [7] and iFeature [6]. The resulting dataset was then randomly split into Train (75%) and Test (25%) parts for each phylum, each stratified so that every one of the 1739 classes is well represented.
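The 75%/25% split can be sketched as follows. This is a plain seeded random split followed by a check for classes that have no positive sample in a subset. It is not the stratification procedure used for the dataset, which the text does not specify, and all function names and toy labels are hypothetical.

```python
import random

def split_train_test(entries, test_fraction=0.25, seed=0):
    """Shuffle entry IDs reproducibly and hold out test_fraction of them."""
    shuffled = sorted(entries)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def classes_missing_positives(labels, subset):
    """Return indices of label columns with no positive sample in subset."""
    n_classes = len(next(iter(labels.values())))
    seen = [0] * n_classes
    for entry in subset:
        for i, value in enumerate(labels[entry]):
            seen[i] += value
    return [i for i, count in enumerate(seen) if count == 0]

# Toy 0/1 label rows for eight made-up entries and two classes.
toy_labels = {f"E{i}": [i % 2, (i + 1) % 2] for i in range(8)}
train, test = split_train_test(toy_labels)
missing_in_train = classes_missing_positives(toy_labels, train)
missing_in_test = classes_missing_positives(toy_labels, test)
```

A non-empty result from classes_missing_positives signals that the split should be redrawn or replaced by a proper multi-label stratification. With 171,212 samples, round(171212 * 0.25) gives 42,803, which matches the reported test set size and leaves 128,409 samples for training.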
Every Train/Test dataset in this bacterial phyla dataset for protein function prediction has 9890 features and 1739 GO terms and is stored in Excel (CSV) sheet format. Test dataset 2 has no target labels associated with its entries, because it consists of hypothetical and unreviewed proteins and is intended for prediction.
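A data file with this layout can be separated into identifiers, features and labels as sketched below. The sketch assumes that the identifier columns are headed exactly "Entry", "Entry name" and "Sequence", that GO label columns have headers starting with "GO:", and that every remaining column is a numeric feature. These header conventions and the toy feature names are assumptions and should be checked against the actual files.

```python
import csv
import io

ID_COLUMNS = ("Entry", "Entry name", "Sequence")  # assumed header names

def partition_columns(fieldnames, label_prefix="GO:"):
    """Split a header row into identifier, feature and label column names."""
    ids = [name for name in fieldnames if name in ID_COLUMNS]
    labels = [name for name in fieldnames if name.startswith(label_prefix)]
    features = [
        name for name in fieldnames if name not in ids and name not in labels
    ]
    return ids, features, labels

def load_rows(handle):
    """Read a CSV handle into per-protein dicts of features (x) and labels (y)."""
    reader = csv.DictReader(handle)
    _, features, labels = partition_columns(reader.fieldnames)
    rows = []
    for record in reader:
        rows.append({
            "entry": record["Entry"],
            "x": [float(record[name]) for name in features],
            "y": [int(record[name]) for name in labels],  # empty if unlabeled
        })
    return rows, features, labels

# Made-up two-row file; real files would have 9890 feature columns.
toy_csv = io.StringIO(
    "Entry,Entry name,Sequence,feat_1,feat_2,GO:0005524,GO:0003677\n"
    "X00001,TOYA_EXAMP,MKTAYIAK,0.125,3,1,0\n"
    "X00002,TOYB_EXAMP,MSLNE,0.5,5,0,1\n"
)
rows, feature_names, label_names = load_rows(toy_csv)
```

On a Train or Test file of dataset 1, a useful sanity check is that feature_names has 9890 entries and label_names has 1739. On Test dataset 2, label_names would be empty and every "y" list would be empty as well.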