Robust ensemble of handcrafted and learned approaches for DNA-binding proteins

Purpose – Automatic DNA-binding protein (DNA-BP) classification is now an essential proteomic technology. Unfortunately, many systems reported in the literature are tested on only one or two datasets/tasks. The purpose of this study is to create an optimal and universal system for DNA-BP classification, one that performs competitively across several DNA-BP classification tasks.

Design/methodology/approach – Efficient DNA-BP classifier systems require the discovery of powerful protein representations and feature extraction methods. Experiments were performed that combined and compared descriptors extracted from state-of-the-art matrix/image protein representations. These descriptors were trained on separate support vector machines (SVMs) and evaluated. Convolutional neural networks with different parameter settings were fine-tuned on two matrix representations of proteins. Decisions were fused with the SVMs using the weighted sum rule and evaluated to experimentally derive the most powerful general-purpose DNA-BP classifier system.

Findings – The best ensemble proposed here produced comparable, if not superior, classification results in a broad and fair comparison with the literature across four different datasets representing a variety of DNA-BP classification tasks, thereby demonstrating both the power and generalizability of the proposed system.

Originality/value – Most DNA-BP methods proposed in the literature are validated on only one (rarely two) datasets/tasks. In this work, the authors report the performance of their general-purpose DNA-BP system on four datasets representing different DNA-BP classification tasks. The excellent results of the proposed best classifier system demonstrate the power of the proposed approach. These results can now be used as baseline comparisons by other researchers in the field.


Introduction
All living things rely on DNA-binding proteins (DNA-BPs). They are vital components in eukaryotic and prokaryotic proteomes that regulate and affect cellular processes, such as transcription and DNA replication, repair and recombination [1]. Because an estimated 6–7% of all eukaryotic proteins bind DNA, methods for differentiating DNA-BPs from non-DNA-BPs have been the focus of much recent scientific research. This interest has resulted in hundreds of thousands of protein sequences available to scientists [2]. The development of automatic machine learning (ML) methods that quickly and accurately identify these proteins has now become a critical proteomic technology.
The key to building automatic DNA-BP classification systems is finding powerful protein representations. Protein representations are generally classified into two categories: sequence-based models and structure-based models. The structural model relies on information obtained from the high-resolution three-dimensional (3D) structure of a protein's sequence. Representations based on the structural model enhance DNA-BP prediction because of the close connection between structural features and protein function. To illustrate the advantages of using structural-based models for DNA-BP classification, see the literature review provided in [3]. A significant problem with structural-based models is the so-called "structure knowledge gap" [4], where a massive number of protein sequences have a limited number of known structures.
In contrast to the structural model, the sequence model of protein representation is based on extracting features directly from the amino acid composition (AAC) of a protein.
A revolutionary protein representation that expanded AAC is Chou's pseudo AAC (PseAAC) [5,6], which preserves vital information embedded in a protein's sequence (e.g. the sequential order represented as a series of rank-different correlation factors along the protein chain). There are many variants of PseAAC. Most relevant here, however, is the work of Cai and Lin [7], who classified DNA-BP, rRNA-BP and RNA-BP by representing proteins as a 40-dimensional PseAAC variant. Much later, in Liu et al. [8], this same PseAAC variant was combined with a physicochemical distance transformation [9] using a reduced alphabet to decrease computational complexity and improve prediction. In Nanni and Lumini [10], sets of reduced alphabets were taken directly from AAC via a genetic algorithm, and a multiclassifier was developed for DNA-BP identification that combined these sets with a method based on grouped weight [11]. Another protein representation approach includes the normalized occurrence frequencies (represented as a vector of length 20^n) of a given n-peptide: dipeptide [12][13][14][15], tripeptide [16] and tetrapeptide [17].
Recently, sets of features based on Chou's general PseAAC have been proposed in [18] and transformed for deep learning in [19] using an embedding layer common in natural language processing. Convolutional neural networks (CNNs) have also been trained on sequence-based descriptors in [20][21][22][23][24][25][26]. In [20], for instance, CNNs were trained on amino acid sequences combined with contextual information, and, in [26], several CNNs and recurrent neural networks (RNNs), as well as combinations of the two, were trained on sequential descriptors and compared.
The inclusion of evolutionary information is yet another effective approach based on the AAC model. Evolutionary information is included in sequence profiles generated by the position-specific iterated Basic Local Alignment Search Tool (PSI-BLAST) [27]. Using a simple support vector machine (SVM) classifier called DNAbinder, Kumar et al. [28] were the first to discover that including evolutionary information resulted in superior identification performance. Since the publication of [28], many researchers have shown that these profiles improve DNA-BP prediction [29][30][31][32][33].
Researchers have also succeeded in extracting robust descriptors from the position-specific scoring matrix (PSSM) [34], which describes a protein using a PSI-BLAST similarity search. For example, Nanni et al. [35] generated a high-performing ensemble by combining many matrix representations from PSSM, and Warris et al. [15] improved DNA-BP prediction via descriptors extracted from split AAC, dipeptide composition and PSSM. Wang et al. [36] were successful using three different feature vectors: a 200-dimension normalized Moreau-Broto autocorrelations vector [37], a 100-dimension PSSM-DCT (PSSM compressed by a discrete cosine transform) vector and a 1,040-dimension PSSM-DWT (PSSM compressed by a discrete wavelet transform) vector. Wei et al. [38] segmented PSSMs into equally sized sub-PSSMs. Pre-PSSM features [39] were extracted from these segments and classified using a random forest (RF) [40]. RF was also used to classify pseudo PSSM (PsePSSM) features in [19,40].
Finally, proteins can be treated or represented as images in several ways [41][42][43][44]. One method is to treat matrix representations of proteins as images and then extract texture features from them [41][42][43]. In Nanni et al. [41], for instance, different sets of handcrafted texture descriptors (e.g. Haralick descriptors, several local binary patterns (LBP) variants, and features based on the Radon feature transform) were extracted from the two-dimensional (2D) distance matrix, which was obtained from the 3D tertiary structure of a protein.
Combining these different feature sets produced significant improvement in ensemble classification performance across several datasets representing protein fold recognition, DNA-BP recognition, biological processes and molecular function recognition. In [42], different feature descriptors were extracted by Nanni et al., starting from a wavelet representation of the protein, and in Kavianpour and Vasighi [43], excellent performance was obtained by extracting texture features from cellular automata images of proteins.
In contrast to extracting handcrafted features from matrix representations of proteins, another image-based method is to classify 3D protein shapes from 2D image renderings as was done in [44], where a deep learning approach was applied using a set of multi-view 2D representations of proteins taken from 3D views rendered using JMol, a well-known protein visualization tool. These images were fed into different pre-trained CNNs, and the fusion of the CNNs resulted in improved classification performance.
The goal of this study is to generate the most optimal and universal system for DNA-BP classification by combining several image-based/matrix approaches as well as other descriptors. In the literature, most ML researchers investigating the DNA-BP problem test their systems on only one or occasionally two datasets. The objective here is to produce a system that works across several datasets representing different DNA-BP classification tasks. A high-performing universal system for DNA-BP classification will provide baseline performance for future comparisons. To accomplish this objective, we generate sets of ensembles trained on many powerful protein features and image-based representations to discover which ones work best across the four DNA-BP classification tasks. Experimentally tested are the following ensemble building blocks: (1) Sets of matrix representations generated from one-dimensional (1D) vector representations; (2) Sets of different features extracted from these matrix representations and trained on separate SVMs, which are then combined by sum rule; (3) Different topologies of pre-trained CNNs fine-tuned with the protein matrix representations and with 2D snapshots of the 3D protein shapes rendered by Jmol; and (4) Different combinations of heterogeneous classifiers (CNNs combined with SVMs).
We use the same ensembles with the same set of parameters across all four DNA-BP datasets for fair comparisons to demonstrate the universality of the final system proposed here. The results section shows that the proposed ensemble achieves competitive performance with the best performing systems across all four datasets, obtaining state-of-the-art results on two.

Materials and methods
Since the goal of this study is to produce a universal system for classifying DNA-BPs, heterogeneous classifiers are built that combine sets of SVMs and pre-trained CNNs, as illustrated in Figure 1 and detailed in Section 2.1. In ML, representations should produce compact and effective fixed-length descriptors. In our proposed system, DNA-BPs are represented in seven ways (primarily matrix/image-based, as described in Section 2.2). A set of CNNs with different parameter settings is fine-tuned on two representations. Seven descriptors (Section 2.3) are extracted from these representations and trained on the SVMs. These descriptors are divided into PsePSSM and a set of texture descriptors extracted from the matrix/image-based representations. For some feature extraction methods, the extraction process is applied for every physicochemical property obtained from the amino acid index database [45] available at http://www.genome.jp/dbget/aaindex.html. The amino acid index database currently contains 566 indices and 94 substitution matrices. Using only a subset of these properties is not a problem, since it is well known that a reduced number of properties is adequate for most protein classification problems. Ignored here are those properties where the amino acids only take values of 0 or 1.

Classifiers
Descriptors are extracted from the different representations, as illustrated in Figure 1, and trained on separate SVMs. The SVM [46] is a popular classifier in bioinformatics. SVMs are binary predictors that find the equation of a hyperplane dividing a two-class training set so that all the points belonging to a given class lie on the same side, with the maximum margin separating the two classes [47]. SVMs handle both linear and nonlinear data: kernel functions project the data onto a higher-dimensional feature space so that problems without a linear decision boundary can still be separated by a hyperplane. SVM is implemented here with the LibSVM toolbox (http://www.csie.ntu.edu.tw/∼cjlin/libsvm/), and all descriptors are linearly normalized to [0, 1]. Fine-tuning is not performed on the SVMs: the same parameters (radial basis function kernel with gamma = 0.1 and cost = 1,000) are applied to all extracted descriptors across all four datasets to avoid overfitting. A set of CNNs is fine-tuned on two matrix/image representations of proteins (Figure 1). CNNs are currently among the most accurate image classification methods. They are a class of deep feed-forward neural networks composed of interconnected layers of neurons with learnable weights, biases and activation functions. CNN layers have neurons arranged in three dimensions, so that every layer transforms a 3D input volume into a 3D output volume of neuron activations. Typically, CNNs are composed of five classes of layers: convolutional (CONV), activation (ACT), pooling (POOL), fully connected (FC) and classification (CLASS).
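As an illustration, the fixed SVM configuration above can be sketched as follows; this uses scikit-learn's SVC in place of LibSVM (both implement the same RBF-kernel SVM), with random placeholder arrays standing in for real protein descriptors:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))            # placeholder descriptors (e.g. texture features)
y = np.array([0] * 50 + [1] * 50)         # binary DNA-BP / non-DNA-BP labels

X01 = MinMaxScaler().fit_transform(X)     # linear normalization to [0, 1]

# The same fixed parameters are used for every descriptor and dataset (no tuning)
clf = SVC(kernel="rbf", gamma=0.1, C=1000, probability=True)
clf.fit(X01, y)
scores = clf.predict_proba(X01)[:, 1]     # class-membership scores used later for fusion
```

Keeping one fixed parameter set across all descriptors and datasets is what allows the ensemble to be compared fairly without per-task tuning.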
Eight CNN topologies pre-trained on ImageNet are tested in the experiments presented here: (1) AlexNet [48]; (2) GoogleNet [49] and (3) InceptionV3 [50], two CNNs with inception modules that approximate a sparse CNN with a normal dense construction; (4) VGGNet16 and (5) VGGNet19 [51], both of which improve on AlexNet by replacing large kernel-sized filters with multiple 3 × 3 kernel-sized filters; (6) ResNet50 and (7) ResNet101 [52], CNN topologies available in MATLAB with 50 and 101 layers, respectively; and (8) DenseNet [53], which is similar to ResNet but interconnects each of the layers. The pre-trained CNN topologies are fine-tuned for all layers using the training data in each dataset, for each of the matrix representations of the proteins and for the 2D projections.
Fine-tuning is a technique in which training is resumed on a pre-trained network with the objective of learning a new classification problem. Since it is not always possible to train a CNN with large batch sizes (BS), any configuration that produces a "GPU out of memory" error is removed from the pool. Moreover, CNNs that produce random results on the training data are also eliminated since they fail to converge. If a representation is not matrix-based, the protein features are resized into a square matrix so that they can be used to train a CNN: the side length is the maximum of the number of rows and columns, and all empty entries are padded with zeros. The matrix is then resized as needed for input into each of the pre-trained CNNs.
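A minimal sketch of this zero-padding step, assuming top-left placement of the feature block (the paper does not specify the placement):

```python
import numpy as np

def to_square_image(feat):
    """Zero-pad a feature array into a square matrix for CNN input.
    The side length is the maximum of the row and column counts."""
    feat = np.atleast_2d(np.asarray(feat, dtype=float))
    side = max(feat.shape)
    img = np.zeros((side, side))
    img[:feat.shape[0], :feat.shape[1]] = feat   # top-left placement (assumption)
    return img

img = to_square_image(np.arange(12).reshape(3, 4))
# img is 4 x 4; the extra bottom row is zero padding
```

The square matrix would then be resized (e.g. by bilinear interpolation) to each topology's expected input size, such as 224 × 224.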
Once the SVMs and CNNs are trained, the classifiers are fused, as illustrated in Figure 1, with the final decision obtained by combining the pool of classifiers by the weighted sum rule.
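The weighted sum rule can be sketched as follows; as noted in the Performance indicators section, each classifier's scores are normalized to mean 0 and standard deviation 1 before fusion (the weights and score values here are purely illustrative):

```python
import numpy as np

def weighted_sum_fusion(score_lists, weights):
    """Fuse classifier scores: z-score normalize each list, then take a weighted sum."""
    fused = np.zeros(len(score_lists[0]), dtype=float)
    for s, w in zip(score_lists, weights):
        s = np.asarray(s, dtype=float)
        s = (s - s.mean()) / s.std()     # normalize to mean 0, standard deviation 1
        fused += w * s
    return fused

svm_scores = np.array([0.9, 0.2, 0.7, 0.4])   # e.g. from an SVM ensemble
cnn_scores = np.array([0.8, 0.1, 0.6, 0.3])   # e.g. from a CNN ensemble
fused = weighted_sum_fusion([svm_scores, cnn_scores], weights=[0.5, 0.5])
```

The final label is then assigned by thresholding (or ranking) the fused score.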

Protein representations
The protein representations used in this work are predominantly matrix-based. In our approach, they are treated as images, which means that texture descriptors are extracted from the matrices and trained on the SVMs. As noted in Figure 1, the PSSM and DM matrix/images are also fed directly into the pre-trained CNNs as other images would be (after padding and resizing as described above).
2.2.2 Position-specific scoring matrix (PSSM). The PSSM [34] is computed by PSI-BLAST, which iteratively searches a database for proteins related to the query sequence. The elements considered by PSSM include (1) the position of each amino acid residue in a protein sequence; (2) the probe, which groups protein sequences based on functional similarity; (3) the profile matrix of 20 columns corresponding to the 20 amino acids; and (4) the consensus, i.e. those sequences of amino acid residues that are most similar to the alignment residues of the probes.
PSSM scores are integers that indicate whether an amino acid occurs more frequently than expected if positive or less frequently if negative.

2.2.3 Variant substitution matrix representation (VSMR). VSMR [35] is a variant of the substitution matrix representation (SMR) [54]. The SMR for a protein P = (p_1, p_2, ..., p_N) is an N × 20 matrix whose entries are

SMR(i, j) = M(p_i, a_j),   i = 1, ..., N,   j = 1, ..., 20,

where a_j is the j-th amino acid and M is a 20 × 20 substitution matrix whose element M(i, j) represents the probability of amino acid i mutating to amino acid j during the evolution process; VSMR modifies this construction as described in [35].
In the experiments reported here, 25 substitution matrix properties are randomly selected to create VSMR-based ensembles.
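A sketch of the SMR construction, using a random symmetric matrix as a stand-in for a real substitution matrix (in practice one of the 94 AAindex substitution matrices):

```python
import numpy as np

AA = "ARNDCQEGHILKMFPSTWYV"          # the 20 standard amino acids
rng = np.random.default_rng(1)
M = rng.normal(size=(20, 20))        # stand-in for a 20 x 20 substitution matrix
M = (M + M.T) / 2                    # substitution matrices are symmetric

def smr(seq, M):
    """N x 20 substitution matrix representation:
    row i of the output is M's row for residue p_i."""
    idx = [AA.index(a) for a in seq]
    return M[idx, :]

rep = smr("MKVL", M)                 # a 4 x 20 matrix for a length-4 toy sequence
```

Repeating this with 25 randomly selected substitution matrices yields the 25 matrix representations that feed the VSMR-based ensemble.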
2.2.4 Wavelet (WAVE). Wavelet encoding starts from a protein sequence in which each amino acid is replaced by the numeric value of a physicochemical property c. Different decomposition scales produce different results: high decomposition scales generate excessive redundancy, while low decomposition scales discard too much information.

The method used in this study is described in Li and Li [55]. The Meyer continuous wavelet is used to compute the wavelet transform coefficients (WAVE_c), and a feature set is extracted considering 100 decomposition scales. Twenty-five physicochemical properties are randomly selected to create a WAVE ensemble composed of 25 WAVE_c-based predictors.
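A dependency-free sketch of the encoding and multi-scale transform; the Mexican-hat wavelet and the toy property values below are stand-ins (the paper uses the Meyer wavelet and AAindex properties):

```python
import numpy as np

# Hypothetical property values for illustration only (not taken from AAindex)
PROP = {a: v for a, v in zip("ARNDCQEGHILKMFPSTWYV", np.linspace(-1, 1, 20))}

def encode(seq):
    """Replace each residue with its numeric physicochemical value."""
    return np.array([PROP[a] for a in seq])

def cwt_features(signal, scales):
    """Naive continuous wavelet transform with a Mexican-hat wavelet.
    Returns a scales x N coefficient map."""
    n = len(signal)
    t = np.arange(n)
    feats = []
    for s in scales:
        x = (t[:, None] - t[None, :]) / s        # shifted and scaled argument
        psi = (1 - x**2) * np.exp(-(x**2) / 2)   # Mexican-hat mother wavelet
        feats.append(psi @ signal / np.sqrt(s))
    return np.array(feats)

sig = encode("MKVLAGRT")
W = cwt_features(sig, scales=range(1, 11))       # 10 of the 100 scales used in the paper
```

Each row of the coefficient map corresponds to one decomposition scale; stacking all scales gives the matrix representation from which features are extracted.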
2.2.5 3D tertiary structure (DM). DM [56] is a heatmap of the inter-residue distances (it considers the distances between atoms and between residues in a Protein Data Bank (PDB) structure). When the heatmap exceeds 250 × 250 pixels, it is resized so that the computational complexity of the feature extraction steps remains manageable. DM is treated as a grayscale image from which the texture descriptors described in Section 2.3.2 are extracted.
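The distance-matrix image can be sketched as follows, with random coordinates standing in for atom positions parsed from a PDB file; the subsampling used to cap the size is an assumption, since the paper does not specify the resizing method:

```python
import numpy as np

rng = np.random.default_rng(2)
coords = rng.normal(size=(30, 3))      # stand-in 3D residue coordinates (from PDB in practice)

# Inter-residue Euclidean distance matrix, treated as a grayscale image
dm = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)

# Cap the image at 250 x 250 pixels (simple strided subsampling as one possible method)
if dm.shape[0] > 250:
    step = int(np.ceil(dm.shape[0] / 250))
    dm = dm[::step, ::step]
```

The resulting matrix is symmetric with a zero diagonal, which is what gives DM heatmaps their characteristic mirrored texture.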
2.2.6 Reaction center (RC). RC [57] takes a protein's structure and transforms it into two sets of feature maps: one describing the shape of the protein backbone from the torsion-angle densities of the local distributions of the angles φ and ψ for each amino acid, and one extracted from the distances between the amino acid building blocks. These two feature maps are treated as sets of images from which the texture descriptors are extracted. Since the torsion-angle densities build a map of size m × 19 × 19, it is treated as 19 images of size m × 19. Since the densities of the amino acid distances build a map of size m × m × 8, this feature map is treated as eight images of size m × m. Different descriptors are extracted from these two sets of images and trained on separate SVMs, which are finally combined by sum rule.
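One plausible way to slice these maps into image stacks (the axis along which the maps are sliced is not stated in the paper, so slicing the last axis is an assumption):

```python
import numpy as np

m = 50                                     # hypothetical sequence-dependent dimension
torsion_map = np.zeros((m, 19, 19))        # phi/psi torsion-angle density map
dist_map = np.zeros((m, m, 8))             # amino-acid distance density map

# Treat the m x 19 x 19 map as 19 images of size m x 19 ...
torsion_images = [torsion_map[:, :, k] for k in range(19)]
# ... and the m x m x 8 map as 8 images of size m x m
dist_images = [dist_map[:, :, k] for k in range(8)]
```

Each 2D slice is then handled exactly like the other matrix representations: texture descriptors are extracted and fed to a dedicated SVM.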
2.2.7 2D projection. 2D projection [44] starts from a 3D visualization of a protein (identified by its PDB code) rendered in Jmol [58], a molecular visualization software tool. Several multi-view projections are produced by uniformly rotating the protein's 3D structure around its central X, Y and Z viewing axes to produce 125 2D images. Jmol can generate many different protein visualizations. In this study, we use Trace, Ribbons, Rockets and Strands. Trace images illustrate the secondary structures inside a molecule with a smooth curve passing through the middle points between successive atoms in the alpha carbons of a peptide chain or the phosphorus atoms of nucleic acids. Ribbons are like Trace, except that they display a line that connects the main atoms in the backbone as a solid flat ribbon. Rockets place cylinders in stretches where there are alpha helices and planks for beta stretches, with both ending in an arrowhead. Strands are like Rockets but display the backbones as a series of thin lines so that the molecular structure is represented by parallel longitudinal threads.
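One way to obtain 125 uniformly spaced multi-view orientations is a 5 × 5 × 5 grid of rotation angles about the X, Y and Z axes; the exact angular grid is an assumption, as the paper does not state it:

```python
import itertools
import numpy as np

# Five uniform angles per axis gives 5^3 = 125 orientations (assumed grid)
angles = np.linspace(0, 360, 5, endpoint=False)     # 0, 72, 144, 216, 288 degrees
views = list(itertools.product(angles, angles, angles))
# Each (rx, ry, rz) triple could drive one Jmol rendering of the structure
```

Each of the 125 orientations is rendered once per visualization style (Trace, Ribbons, Rockets, Strands), producing the image sets fed to the CNNs.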

Methods for protein descriptor extraction
In this section, we describe the different approaches used to extract descriptors from the protein representations introduced in Section 2.2.
2.3.1 PsePSSM (PP). The idea behind PsePSSM [59,60] is to retain sequence-order information in the spirit of PseAAC. Given an input matrix Mat ∈ ℜ^(N×20), the PsePSSM descriptor is a vector PP whose components, for each amino acid column j and each lag value, are

PP(k) = (1/(N − lag)) Σ_{i=1}^{N−lag} [E(i, j) − E(i + lag, j)]^2,

where k is a linear index used to scan the pairs (j, lag), lag is the distance between one residue and its neighbors, N is the length of the sequence and E ∈ ℜ^(N×20) is the normalized version of Mat.

2.3.2 Texture descriptors. Previous studies ([41][42][43]) have demonstrated the benefit of treating a protein matrix representation as an image so that powerful texture descriptors can be extracted from it. Once these texture descriptors are extracted from the protein image, they can be trained on the classifiers, with the set of classifiers combined by sum rule for a final score.
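As a concrete sketch of the PsePSSM extraction of Section 2.3.1, for a single lag value; the row-wise z-score normalization of the PSSM is an assumption (the exact normalization is given in [59,60]):

```python
import numpy as np

def psepssm(pssm, lag=1):
    """PsePSSM sketch: normalize the N x 20 PSSM (row-wise z-score, an assumed
    choice), then for each of the 20 columns average the squared difference
    between residues that are `lag` positions apart."""
    E = (pssm - pssm.mean(axis=1, keepdims=True)) / (pssm.std(axis=1, keepdims=True) + 1e-9)
    n = E.shape[0]
    return np.array([np.mean((E[: n - lag, j] - E[lag:, j]) ** 2) for j in range(20)])

rng = np.random.default_rng(3)
toy_pssm = rng.integers(-5, 6, size=(60, 20)).astype(float)  # stand-in PSSM scores
pp = psepssm(toy_pssm, lag=2)                                # 20 components for this lag
```

Concatenating the 20-component vectors over several lag values yields the final fixed-length PP descriptor trained on the SVMs.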

Datasets
Ensembles generated from the methods described above are evaluated across four benchmark classification tasks built from the datasets PDB1075 [8], PDB594 [67], PDB676 [68] and PDB186 [67] (all derived from the Protein Data Bank, http://www.rcsb.org/pdb/home/home.do), along with the dataset in [69]. Protein sequences in these datasets with fewer than 50 amino acids or that contain the character "X" were removed. Also deleted were all sequences having more than 25% similarity with another sequence. The PDB1075 dataset has 525 DNA-BPs and 550 DNA-non-BPs; the PDB594 dataset [67] has 297 DNA-BPs and 297 DNA-non-BPs; and the PDB186 dataset, which was designed as an independent testing dataset derived from [67], contains 93 DNA-BPs and 93 DNA-non-BPs. The four tasks used in this paper are the following: (1) IND1: training takes place on the PDB1075 dataset and testing on the independent PDB186 dataset; (2) IND2: training takes place on the PDB594 dataset and testing on the independent PDB186 dataset; (3) IND3: training takes place on the PDB1075 dataset and testing on PDB676 [68], an independent dataset that contains 338 DNA-BPs and 338 non-DNA-BPs. PDB676 was obtained (1) by searching for all DNA-binding and non-DNA-binding sequences from PDB, (2) by removing sequences that had fewer than 60 amino acids or contained the character "X" and (3) by deleting all sequences having more than 25% similarity with another sequence. Filtered as well were all sequences already present in the PDB186 dataset; and (4) IND4 [69]: training takes place on 2,104 proteins and testing on 296 proteins. This new non-redundant gold-standard dataset was created following Chou's five-step rule [70]. First, 1,200 non-DNA-BPs and 1,200 DNA-BPs were collected from PDB. Second, all sequences having more than 25% similarity, as well as any chain fewer than 50 residues in length, were deleted. Also removed were any proteins containing an "X" residue. These two steps produced 1,052 proteins in each class. A total of 148 chains were then chosen from each class to form the independent validation subset.
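The length and unknown-residue filters applied to these datasets can be sketched as below; the 25% similarity filtering additionally requires a sequence-clustering tool (e.g. CD-HIT) and is omitted here:

```python
def filter_sequences(seqs, min_len=50):
    """Keep sequences that are at least min_len residues long
    and contain no unknown residue 'X'."""
    return [s for s in seqs if len(s) >= min_len and "X" not in s]

seqs = ["M" * 60,            # long enough, no 'X' -> kept
        "MK" + "X" + "A" * 60,  # contains 'X' -> removed
        "MKVL"]              # too short -> removed
kept = filter_sequences(seqs)
```

Note that PDB676 uses a stricter 60-residue threshold, which would be passed as `min_len=60`.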

Performance indicators
Two performance indicators are reported in the experiments presented below: classification accuracy and area under the ROC curve (AUC). Accuracy is the ratio between the number of correctly classified samples and the total number of samples. The ROC curve is a graphical plot of the sensitivity of a binary classifier vs the false positive rate (1 − specificity). AUC [71] is a scalar measure of the probability that the classifier will assign a higher score to a randomly picked positive pattern than to a randomly picked negative pattern. With multiclass datasets, the one-versus-all area under the ROC curve [72] is used. AUC is considered one of the most reliable performance indicators [73]. For this reason, the internal comparisons are evaluated using AUC. Accuracy is reported so that the system proposed here can be compared to others in the literature that do not report the AUC performance indicator. It should be noted that before each fusion, scores are normalized to mean 0 and standard deviation 1.
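Both indicators can be computed directly from the rank-statistic definition of AUC given above (toy labels and scores for illustration):

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of correctly classified samples."""
    return np.mean(np.asarray(y_true) == np.asarray(y_pred))

def auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative,
    counting ties as half."""
    y = np.asarray(y_true)
    s = np.asarray(scores, dtype=float)
    pos, neg = s[y == 1], s[y == 0]
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return greater + 0.5 * ties

y = np.array([1, 1, 0, 0])
s = np.array([0.9, 0.4, 0.6, 0.1])
preds = (s >= 0.5).astype(int)   # threshold the scores for accuracy
```

With these toy values, three of the four positive/negative pairs are correctly ranked, giving an AUC of 0.75, while thresholding at 0.5 misclassifies two of the four samples.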

Experiments
In Table 1, performance is reported (with accuracy at the top and AUC at the bottom of each cell) on the set of texture (TXT) descriptors described in Section 2.3.2 (namely, LBP, WLD, CLBP, RIC, MORPH and HASH) extracted from the matrix protein representations detailed in Section 2.2 (PSSM, VSMR, WAVE, DM, RC). In Table 2, the performance of PsePSSM (PP) as a matrix-based descriptor is reported.
In Tables 1 and 2, the column labeled FUS is the fusion by sum rule of PSSM, VSMR, WAVE, DM and RC. The column labeled FUS_noPDB reports the performance of the fusion of methods not based on PDB, i.e. PSSM, VSMR and WAVE.
From the results reported above, the following conclusions can be drawn: the fusion of the different feature descriptors extracted from the same matrix representation enhances performance, with FUS the best method on average across all the datasets; and, although the representations related to the PDB protein format boost performance, the enhancement is not exceptional.
In Table 3, we report the performance obtained by the deep learning ensembles labeled eCNN and Proj, as well as the performance obtained by combining, via the weighted sum rule, the different methods tested in this work. Proj is the ensemble built by combining, via the sum rule, three CNN topologies: GoogleNet, InceptionV3 and ResNet50, coupled with the four 2D projections (Ribbons, Rockets, Strands, Trace). Only three CNN topologies are examined due to computational constraints. No selection strategy is performed. For each 2D projection, a different CNN is trained, resulting in the fusion of 12 CNNs by the sum rule (due to computational constraints, we combine Proj only with BS = 30 and LR = 0.001).
With the weighted sum rule, the weights were not determined by running a parameter selection process or by overfitting them but rather by testing a small set of reasonable values (0.50 and 0.25); the resulting fusions are reported in Table 3. In Tables 4 and 5, the best ensembles presented here are compared with the literature for IND1, IND2 and IND4. So far, IND3 has been used only in [68], where an AUC of 89.78% is reported; our best method produces a similar AUC of 88.68%. As is evident in Tables 4 and 5, our proposed ensemble is similar in performance (across all four datasets) to the best-reported methods trained on each of the individual datasets. Notice that our proposed method achieves state-of-the-art performance on IND1 and IND2.

Conclusions
The purpose of this study was to generate a powerful general-purpose heterogeneous ensemble for DNA-binding proteins. Experiments performed across four DNA-binding datasets show that the PsePSSM descriptor extracted from a set of matrix representations of proteins and trained on SVMs, when combined with two sets of CNNs trained on the matrix representations and on the 2D projections of 3D renderings of proteins produced by Jmol, significantly boosts classification performance. Our best ensemble obtained state-of-the-art performance on two of the four datasets and competitive performance on all of them, demonstrating generalizability. Future studies will focus on combining other classification approaches, including ensembles made with AdaBoost and Rotation Forest. These methods, however, require enormous computational resources during the training phase.
Instead of implementing Chou's recommendation of developing a web server for identifying DNA-BP with our proposed classification system, we are sharing the MATLAB code used in this study so that the public can freely implement and set up any number of servers using our best ensemble. Sharing our source code will also allow other researchers to extend and compare our work with their novel approaches.