TCN-HBP: A Deep Learning Method for Identifying Hormone-Binding Proteins from Amino Acid Sequences Based on a Temporal Convolution Neural Network

Hormone-binding proteins (HBPs) are carrier proteins that specifically bind to targeted hormones. Some evidence suggests that the abnormal expression of HBPs causes various diseases. Therefore, accurately identifying HBPs is important for studying these diseases. Recently, many researchers have proposed traditional machine learning methods to complete this work, but these methods are neither suitable for training on large-scale datasets nor able to take into account the contextual features of HBPs. In this paper, I propose a new deep learning method, TCN-HBP, to distinguish HBPs. TCN-HBP consists of a coding layer, an embedding layer, a convolutional neural network (CNN) layer and a temporal convolutional network (TCN) layer. The coding and embedding layers extend the protein sequences into two-dimensional matrix data. The CNN layer convolves the matrix data to form feature maps. The TCN layer captures the contextual features present in the feature maps. Experiments show that the data generalization capability and recognition accuracy (99.15%) of TCN-HBP on large datasets are better than those of previous methods.


Introduction
Hormone-binding proteins (HBPs) produce specific expression in cells because they bind specifically to targeted hormones [1,2], as shown in figure 1. HBPs fall into many types, and they take on different tasks in the human body. For example, abnormalities of thyroid hormones in serum are closely related to the functional expression of some thyroid HBPs [3], and sex hormone-binding globulins control the steroid levels in plasma [4]. However, some evidence suggests that multifarious diseases in the human body arise due to the deviant expression of HBPs [5]. Therefore, accurately identifying HBPs is important for studying these diseases and hormone regulation.
Due to the high cost of identifying protein types based on biological experiments, many researchers have proposed computational methods to complete this work in recent years. For example, Tang et al. introduced a predictor called HBPred, which extracts the dipeptide composition from the protein residue sequences and predicts HBPs through an incremental feature selection strategy and support vector machine (SVM) [6]. Wang et al. used feature rating techniques and an SVM to predict HBPs, which can remove redundant information in the feature set and improve accuracy [7]. Basith et al. developed a method to identify HBPs using an extremely randomized tree, called iGHBP [8]. Akbar et al. used location-specific scoring matrices and SVM to identify HBPs [9].
The computational methods proposed by these scholars belong to the category of machine learning, and their experiments were carried out on hundreds of protein sequences. Traditional machine learning methods, however, are neither suitable for training on large-scale datasets nor able to capture the contextual features of protein sequences.
Figure 1. Hormones specifically bind to HBPs [13].
Facing the explosive growth in the number of newly discovered proteins and the dilemma of machine learning methods, some scholars have used deep learning methods to identify protein types [14,15]. Deep learning methods consist of complex neural networks suitable for large-scale data training and have achieved great success in image recognition, intelligent speech and natural language processing [16][17][18]. The temporal convolutional network (TCN) is an innovative model for natural language processing proposed by Bai et al. in 2018 that can efficiently excavate contextual features in sequence information [19]. Studies by some scholars have shown that the contextual features of primary protein sequences (amino acid sequences) have an important influence on functional expression [20][21][22].
In this paper, I introduce a new approach, TCN-HBP, to identify HBPs. TCN-HBP comprises a coding layer, an embedding layer, a convolutional neural network (CNN) layer and a TCN layer. The coding and embedding layers extend the protein sequences into two-dimensional matrix data. The CNN layer convolves the matrix data to form feature maps. The TCN layer captures the contextual features present in the feature maps. I collected 27,682 protein sequences from the Universal Protein Resource (UniProt) and the National Center for Biotechnology Information (NCBI) for training, validating and testing TCN-HBP. Compared to previous models for identifying HBPs, TCN-HBP can excavate contextual features in protein sequences and can be applied to large-scale data training. Experimental results show that the data generalization capability and recognition accuracy of TCN-HBP on large datasets are better than those of previous methods. All the data and source codes used in this study are available at https://doi.org/10.6084/m9.figshare.14773575.v1.

Materials
Special attention should be given to the data collection process when conducting biological data analysis. Since this article addresses protein identification under big data, it is necessary to collect as much data as possible. The data come from the UniProt database [23] and the NCBI database [24], and our use of the data complies with the provisions of those databases. After searching with the keyword "hormone binding", a total of 13,749 hormone-binding proteins were collected, and 123 sequences from reference [7] were added to the dataset. After redundancy removal, 13,841 sequences were used as positive samples. A balanced dataset is very helpful in model training and does not bias the model towards positive samples. To collect negative samples annotated by experts, 13,841 sequences were retrieved from the UniProt database with the keyword "not hormone binding" and used as negative samples.

Methods
TCN-HBP comprises a coding layer, an embedding layer, a CNN layer and a TCN layer.

Coding Layer
In this process, the protein sequences consisting of amino acids were encoded into a list of numbers. The amino acid sequences converted into number lists could be easily manipulated mathematically, and they also laid the foundation for the next step in the embedding layer (see figure 2 for details).
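The coding step can be sketched in a few lines. This is a minimal illustration, not the paper's released code; the alphabetical ordering of the 20 standard amino acids, the zero padding value and the `max_len` cutoff are all assumptions, since the paper only states that the sequences are encoded as number lists.

```python
# Hypothetical coding layer: map each of the 20 standard amino acids
# to an integer in 1-20 (0 is reserved for padding/unknown residues).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed alphabetical ordering
AA_TO_INT = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}

def encode_sequence(seq, max_len=1000, pad_value=0):
    """Encode a protein sequence as a fixed-length list of integers,
    truncating to max_len and padding short sequences with pad_value."""
    codes = [AA_TO_INT.get(aa, 0) for aa in seq.upper()[:max_len]]
    return codes + [pad_value] * (max_len - len(codes))
```

For example, `encode_sequence("ACD", max_len=5)` yields `[1, 2, 3, 0, 0]`: three residue codes followed by two padding zeros.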

Embedding Layer
Word embedding is a distributed representation that maps words to low-dimensional vectors. The advantage of applying word embedding is that it can capture word semantics and the relationships between words. In the embedding stage, each number in the list is embedded in a continuous vector space. After this process, a distinct vector represents each input amino acid, and the protein sequences are transformed into a digital matrix (see figure 3 for details). The calculation process is shown in equation (2).
Figure 3. Embedding layer.
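The embedding lookup amounts to replacing each integer code with a learned row of a table. The sketch below shows only the lookup mechanics in plain Python; the random initialisation range and the 21-row vocabulary (20 amino acids plus padding) are assumptions, while the dimension 26 follows the paper's parameter section. In the real model the table entries would be trained, not fixed.

```python
import random

def make_embedding_table(vocab_size=21, dim=26, seed=0):
    """Randomly initialised embedding table of shape (vocab_size, dim).
    Row c is the vector for integer code c; row 0 covers padding."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)]
            for _ in range(vocab_size)]

def embed(codes, table):
    """Map an integer-coded sequence to a 2-D matrix of shape (len(codes), dim)
    by looking up each code's row in the table."""
    return [table[c] for c in codes]
```

A length-L number list thus becomes an L x 26 matrix, which is the two-dimensional input the CNN layer expects.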

CNN Layer
The CNN extracts potential features in two-dimensional data through convolutional mapping operations [25]. Recently, CNNs have been applied to natural language processing, using filters to obtain the characteristics of natural statements expressed in a two-dimensional matrix [26]. In this paper, the amino acids in protein sequences are treated as words in natural statements (see figure 4 for details). The CNN captures the features of the two-dimensional data through convolutional kernel operations, obtains new feature maps, and simplifies the calculation while acquiring the sequence features.
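The convolution over the embedded matrix slides a filter along the sequence axis and applies ReLU, then max pooling keeps the strongest responses. The toy functions below illustrate one filter with 'valid' padding and non-overlapping pooling; the padding scheme and single-filter simplification are assumptions made for clarity.

```python
def conv1d_valid(matrix, kernel):
    """1-D 'valid' convolution over the sequence axis, followed by ReLU.
    matrix: L rows, each a length-D feature vector (the embedded sequence).
    kernel: K rows, each length D (one filter). Returns L - K + 1 values."""
    L, K = len(matrix), len(kernel)
    out = []
    for i in range(L - K + 1):
        s = sum(matrix[i + j][d] * kernel[j][d]
                for j in range(K) for d in range(len(kernel[0])))
        out.append(max(s, 0.0))  # ReLU activation, as named in the paper
    return out

def max_pool(seq, size=2):
    """Non-overlapping max pooling with the given pool size."""
    return [max(seq[i:i + size]) for i in range(0, len(seq) - size + 1, size)]
```

Stacking many such filters produces the feature maps that the TCN layer consumes.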

TCN Layer
The recurrent neural network is a classical algorithm for capturing the dependence of sequence features on surrounding context. However, it often encounters vanishing or exploding gradients as the sequence length grows [27,28]. The TCN provides a different way of computing the hidden state that overcomes these problems through dilated causal convolutions and residual connections [19], as shown in figure 5. In the figure, the sequence x_0, ..., x_T is the TCN input and the sequence y_0, ..., y_T is the TCN output; the output of a single node, such as y_T, is associated with the entire input sequence up to that point. A TCN can process the sequence in parallel rather than step by step as a recurrent neural network does [19].
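The dilated causal convolution at the heart of a TCN can be sketched on a scalar sequence. This is a single-channel illustration of the building block from Bai et al., not the paper's implementation; left zero-padding keeps the output length equal to the input length, and increasing the dilation widens the receptive field without adding parameters.

```python
def causal_dilated_conv(x, weights, dilation=1):
    """Dilated causal 1-D convolution on a scalar sequence x.
    output[t] depends only on x[t], x[t-d], x[t-2d], ... (never on the
    future), so the layer is causal; zero left-padding preserves length."""
    k = len(weights)
    pad = (k - 1) * dilation
    padded = [0.0] * pad + list(x)
    # weights[0] multiplies the oldest tap, weights[k-1] the current input
    return [sum(weights[j] * padded[t + j * dilation] for j in range(k))
            for t in range(len(x))]
```

With `weights=[1, 1]` and `dilation=1` each output is the sum of the current and previous input; doubling the dilation makes each output reach two steps back instead.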
After the features pass through the TCN layers, a final classification function determines whether the amino acid sequence is an HBP.
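The final decision step can be sketched as thresholding a sigmoid, which is a common choice for binary classifiers; the paper does not show its exact output function, so the sigmoid and the 0.5 threshold here are assumptions.

```python
import math

def classify(score, threshold=0.5):
    """Hypothetical final decision: squash the network's scalar output with
    a sigmoid and report HBP when the probability reaches the threshold."""
    prob = 1.0 / (1.0 + math.exp(-score))
    return prob >= threshold
```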

TCN-HBP Model
The composition of TCN-HBP is shown in figure 6. The coding layer uses the numbers 1-20 to encode the 20 amino acids, and the embedding layer converts the coded sequence into a higher-dimensional feature matrix that enters the CNN layer. The CNN layer has two steps: a convolution operation and a pooling operation. In the CNN layer, the matrix output by the embedding layer is scanned by the filters to obtain new feature maps, and these feature maps are then pooled to retain the main feature information. The TCN layer performs a further, deeper extraction on the features output by the convolutional layer to mine the contextual features of HBPs.

Model Parameters
The adjustable parameters in the model will have a great effect on the training process and results of the model. The parameter selection is shown below.
• The embedding dimension in the embedding layer is 26.
• In the three CNN layers, the filter numbers are 64, 64, and 64, the kernel sizes are 13, 7, and 5, and the activation selection is ReLU.
• The pooling-size parameters in the three-layer pooling layer are all 2.
• In the two TCN layers, the filter numbers are 64 and 64, the kernel sizes are 13 and 5, the number of stacks is 1, the dropout rate is 0.1, and the activation is ReLU.
• The batch size is 64, the epoch size is 50, and the learning rate is 0.01.
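Given these parameters, it is easy to trace how the sequence-length dimension shrinks through the three convolution-plus-pooling stages. The sketch below assumes 'valid' convolutions and non-overlapping pooling, since the paper does not state its padding scheme; the input length of 1,000 is likewise only an example.

```python
# Trace the sequence-length dimension through the configured CNN stages.
CONV_KERNELS = [13, 7, 5]  # kernel sizes of the three CNN layers
POOL_SIZE = 2              # pool size of all three pooling layers

def trace_length(seq_len):
    """Return the sequence length after the three conv + pool stages,
    assuming 'valid' convolution and non-overlapping max pooling."""
    for k in CONV_KERNELS:
        seq_len = seq_len - k + 1      # 'valid' convolution shrinks by k - 1
        seq_len = seq_len // POOL_SIZE  # pooling halves the length
    return seq_len
```

Under these assumptions an input of length 1,000 is reduced to 120 time steps before reaching the TCN layers.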

Training Results
To avoid overfitting during model training and to find a more suitable model for HBP recognition, I used 10-fold cross-validation, applied to both a balanced experiment and an unbalanced experiment. The number of negative samples in the unbalanced experiment is three times that of the positive samples, while the numbers of positive and negative samples in the balanced experiment are equal. The number of epochs for both training runs is set to 50. Table 2 shows the training results. The training accuracy on the balanced dataset reaches 99.08%, and the accuracy on the validation set reaches 98.03%.
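A 10-fold cross-validation split can be sketched as follows. This is a generic illustration rather than the paper's exact procedure; the shuffling seed and the interleaved fold assignment are arbitrary choices for the example.

```python
import random

def k_fold_indices(n, k=10, seed=42):
    """Yield (train_indices, val_indices) pairs for k-fold cross-validation:
    shuffle the n sample indices, split them into k folds, and let each
    fold serve as the validation set exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, val
```

Each of the 10 iterations trains on 9 folds and validates on the held-out fold, so every sample is validated exactly once.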
During training, the model loss declines rapidly, and the model accuracy improves steeply within the first 5 epochs. As training continues, both the training loss and accuracy stabilize within narrow ranges (figure 7).

Performance in Independent Samples Data Set and Comparison with Previous Models
The four variables TP, TN, FP and FN in the above formulas represent the numbers of true positive, true negative, false positive and false negative samples, respectively. I compared TCN-HBP with several previous HBP-identification models on the independent test samples from reference [9]. The models are HBPred [6], iGHBP [8], HBPred2.0 [29] and iHBP-DeepPSSM [10]. I used the above four criteria to evaluate the performance of the models on the independent test sets. The results show that TCN-HBP outperforms HBPred, iGHBP and HBPred2.0 on the independent datasets from the literature [9] and is close to iHBP-DeepPSSM. However, unlike the other four approaches, TCN-HBP is a method based on big-data training. Therefore, I also used an independent test dataset consisting of 1,000 positive samples and 1,000 negative samples from UniProt and NCBI to test TCN-HBP (see table 3 for details).
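The four criteria can be computed directly from TP, TN, FP and FN. The exact formulas are not reproduced in this excerpt, so the sketch below assumes the usual set in HBP-prediction work (sensitivity, specificity, accuracy and the Matthews correlation coefficient) with their standard definitions.

```python
import math

def metrics(tp, tn, fp, fn):
    """Standard binary-classification metrics (assumed criteria):
    sensitivity Sn = TP/(TP+FN), specificity Sp = TN/(TN+FP),
    accuracy Acc = (TP+TN)/total, and the Matthews correlation
    coefficient MCC."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, acc, mcc
```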

Comparison to Other Deep Learning Models
To further investigate TCN-HBP, I trained a pure convolutional network model (CNN model) and a CNN-LSTM model on the same dataset [30]. The training and test comparison results are shown in the table below. Compared with the CNN model, TCN-HBP significantly improves both training accuracy and test accuracy. Compared with the CNN-LSTM model, the test accuracy of TCN-HBP is 0.9% higher, which indicates that TCN-HBP is better suited to the HBP-recognition problem than the CNN-LSTM model.
I visualized the training processes of the above three models so that the differences can be compared more clearly. The training accuracy of the CNN model does not improve after 10 epochs and stabilizes at approximately 63%. Although the training accuracy of the CNN-LSTM model is only slightly lower than that of TCN-HBP, its accuracy and loss fluctuate considerably in the later stages of training. Therefore, TCN-HBP is more robust in the field of protein sequence recognition than the CNN and CNN-LSTM models (figure 8).

Conclusions
The identification of HBPs is one of the leading issues currently studied by biologists. In this article, I introduced a novel deep learning technique to help biologists reduce preliminary work and improve the efficiency of identifying HBPs. TCN-HBP excavates the potential contextual features in protein sequences by combining a CNN and a TCN to improve recognition accuracy. Experimental results show that the data generalization capability and recognition accuracy of TCN-HBP on large datasets are better than those of previous methods.