Network-based characterization of drug-protein interaction signatures with a space-efficient approach

Background Characterization of drug-protein interaction networks with biological features has recently become challenging in recent pharmaceutical science toward a better understanding of polypharmacology. Results We present a novel method for systematic analyses of the underlying features characteristic of drug-protein interaction networks, which we call “drug-protein interaction signatures” from the integration of large-scale heterogeneous data of drugs and proteins. We develop a new efficient algorithm for extracting informative drug-protein interaction signatures from the integration of large-scale heterogeneous data of drugs and proteins, which is made possible by space-efficient representations for fingerprints of drug-protein pairs and sparsity-induced classifiers. Conclusions Our method infers a set of drug-protein interaction signatures consisting of the associations between drug chemical substructures, adverse drug reactions, protein domains, biological pathways, and pathway modules. We argue the these signatures are biologically meaningful and useful for predicting unknown drug-protein interactions and are expected to contribute to rational drug design.


Background
Target proteins of drug molecules are classified into a primary target and off-targets. The former is the desired target, whereas the latter could lead to adverse drug reactions [1] or unexpected beneficial effects in drug repositioning [2]. Therefore, comprehensive analysis throughout primary targets and off-targets on a genome-wide scale is crucial in drug discovery. The in silico approach is expected to improve the research productivity in this field.
Several computational methods have been presented for predicting drug-protein interactions (or compoundprotein interactions) from chemogenomic and pharmacogenomic viewpoints on a large-scale. The basic idea behind the chemogenomic approach is that chemically pathways) and complicated associations between the heterogeneous features.
A variety of feature extraction methods have recently been proposed for automatically characterizing drugprotein interactions. A data mining method was proposed for extracting molecular substructure pairs appearing frequently in interacting drug-target pairs [13]. Machine learning methods with sparse statistical models were presented to associate protein domains with drug chemical substructures [14,15] or with drug side effects [16]. The inference of proteins eliciting drug side effects has been reported by several groups [17,18]. However, the scalability of these methods is very limited, and these studies were conducted from the perspective of either protein functional sites, drug chemical substructures or drug phenotypic effects. There is a strong and growing need to develop efficient and scalable methods for characterizing overall drug-protein interactions with many types of features of drugs and proteins at once.
We present a novel method for systematic analyses of the underlying features characteristic of drug-protein interaction networks, which we call "drug-protein interaction signatures". We develop a new efficient algorithm for extracting informative drug-protein interaction signatures from the integration of large-scale heterogeneous data of drugs and proteins, which is made possible by spaceefficient representations for fingerprints of drug-protein pairs and sparsity-induced classifiers. In the results, our method infers a set of drug-protein interaction signatures consisting of the associations between drug chemical substructures, adverse drug reactions, protein domains, biological pathways, and pathway modules. We argue that these signatures are biologically meaningful and useful for predicting unknown drug-protein interactions. To the best of our knowledge, this is the first report on characterizing a large-scale drug-protein interaction network with various biological features of drugs and proteins in an integrative framework. The drug-protein interaction signatures comprehensively inferred with our method are expected to contribute to rational drug design.

Drug-protein interactions
We got the information on drug-protein interactions from five databases: ChEMBL [19], KEGG [20], DrugBank [21], PDSP Ki [22], and Matador [23]. The number of unique drug-protein interactions in the merged dataset is 78,692. These interactions involve 2302 drugs and 2334 target proteins, and the number of all possible drug-protein pairs is 5,372,868. We utilized this dataset in our experiments.

Drug profiles
We described drug chemical structures by 17,017 chemical substructures using the KEGG Chemical Function and Substructures (KCF-S) descriptor [24]. We represented each drug by a 17,017-dimension binary vector where the presence or absence of each of the KCF-S substructures is coded as 1 or 0. The resulting vector is referred to as a chemical profile.
We obtained the information about adverse drug reactions (ADRs) from the public release of the adverse event reporting system (AERS) of the US Food and Drug Administration (FDA) [25]. We derived 2,904,050 reports from 2004 to 2010 and mapped the drug names to KEGG following a previous study [12]. Based on the resulting 10,543 ADRs, we represented each drug by a 10,543dimension binary vector where the presence or absence of each ADR is coded as 1 or 0. The resulting vector is referred to as an ADR profile.
Finally, we constructed an integrative feature vector of each drug by concatenating the chemical and the ADR profiles into a single one. The dimension of the resulting feature vector of each drug was 27,560.

Protein profiles
We obtained functional domains, biological pathways, and pathway modules (compactly clustered pathways) about proteins from the KEGG [20] and the PFAM [26] databases.
Based on 2678 PFAM domains, we represented each protein by a 2678-dimension binary vector where the presence or absence of a functional domain is coded as 1 or 0. The resulting vector is referred to as domain profile. Based on 270 KEGG pathway maps, we represented each protein by a 270 dimension binary vector where the presence or absence of the involvement in a biological pathway is coded as 1 or 0. The resulting vector is referred to as a pathway profile. Based on 107 KEGG pathway modules, we represented each protein by a 107-dimension binary vector where the presence or absence of the involvement in a pathway module is coded as 1 or 0. The resulting vector is referred to as module profile.
Finally, we constructed an integrative feature vector of each protein by concatenating the domain, pathway, and module profiles into a single profile. The dimension of the resulting feature vector of each protein was 3055.
We address the problem of extracting features characterizing drug-protein interaction networks in the framework of supervised classification.

Linear model for drug-protein pairs
Let C be a drug (or a drug candidate compound) and let P be a target protein (or a target candidate protein). We represent a drug-protein pair (C, P) as a high dimensional feature vector (C, P) and present a linear function, f (C, P) = w T (C, P), whose output is used to predict whether a (C, P) is an interacting pair or not. The weight vector w is estimated such that each drug-protein pair is correctly classified into the interaction class (positive class) or non-interaction class (negative class) based on the training set.
An advantage of the linear model is that one can interpret features effective for predictions from learned models. Since each element in (C, P) corresponds to an element of w, effective features can be selected by extracting highly weighted features. However, the performance of the linear model depends heavily on the feature vector design.
We represent each drug-protein pair as a high dimension feature vector by taking the tensor product of a drug profile and protein profile. The representation is similar to that in previous studies [15,16]. The profile of a C is defined as a D-dimension binary vector: The profile of a P is defined as a D -dimension binary vector: (P) = (p 1 , p 2 , ..., p D ) T , where p i ∈ {0, 1}, i = 1, ..., D . We compute the tensor product between a drug profile (C) and protein profile (P), and define a feature vector (C, P) as follows: where (C, P) is composed of all possible products between elements in (C) and those in (P). The resulting feature vector is a D × D -dimension binary vector, i.e., fingerprint, for encoding cross-integrated biological features. This is referred to as a "tensor-product fingerprint". In this study, (C) was a 27,560-dimension binary vector, and (P) was a 3055-dimension binary vector. Thus, the tensor-product fingerprint (C, P) of each drugprotein pair is a 84,195,800-dimension binary vector.
A simpler way for representing each drug-protein pair is to concatenate (C) and (P) into a single feature vector as (C, P) = (C) T , (P) T T [7]. However, it cannot determine the correlation between drug and protein features. The feature vector is referred to as a "concatenated fingerprint".

Logistic regression
We apply logistic regression to train the weight vector in the linear model and introduce L 1 -regularizations to prevent over-fitting. The L 1 -regularization induces sparsity in the weight vector and drives most of the weight elements corresponding to unimportant features to zeros, which makes it easier for us to interpret the model and extract features.
Minimizing the logistic loss with L 1 -regularization for a large number of high dimensional data is difficult, but several efficient algorithms have recently been proposed. To the best of our knowledge, LIBLINEAR [27] is the most efficient and high-performance algorithm, but it requires a huge amount of memory for extremely highdimensional data. In fact, it was not computationally feasible for our dataset in this study because of the memory problem (see the "Results" section). To overcome this difficulty, we introduce a gradient-based method.
Given a collection of drug-protein pairs and their labels (C i , P j ), y ij where y ij ∈ {+1, −1} (i = 1, ..., n, j = 1, ..., m), the logistic loss is defined as The logistic loss with L 1 -regularization is defined as where w 1 is L 1 norm (the sum of absolute value in the vector) and C is a regularization parameter.
Since L 1 -LR(w) is a convex function, the weight vector w minimizing L 1 -LR(w) can be found at zero of its gradient. However, it is impossible to compute the gradient of L 1 -LR(w), because L 1 norm contains non-differential points where w d = 0. Instead, we compute the d-th dimensional gradient ∇ d LR(w) of LR(w) as follows: The use of ∇LR(w) enables the global minimum for the optimal w in L 1 -LR(w) to be found using an efficient gradient-based optimization algorithm called orthantwise limited-memory quasi-newton (OWL-QN) [28]. The L 1 -regularized logistic regression methods, with the tensor product of the fingerprint proposed and with the concatenated fingerprint, is referred to as L1LOG-tensor and L1LOG-concat, respectively.
For comparison, we also trained models with L 2regularized logistic regression using the gradient-based algorithm called the limited memory quasi-Newton (L-BFGS) [29]. The L 2 -regularized logistic regression method, with the tensor-product fingerprint and the concatenated fingerprint, are referred to as L2LOG-tensor and L2LOG-concat, respectively.

Space-efficient representation of drug-protein pairs
Compact representation of drug-protein pairs is crucial for training linear models in memory, so we use the set representation with items corresponding to dimensions of one bit in the fingerprint. However, this still consumes a huge amount of memory for storing them, resulting Bit string R 11 1 1 10 Bit string P 11 1 1 01 Bit string R 12 10 10 10 10 Bit string P 12 01 01 01 01 Bit string R 13 10 10 10 100 Bit string P 13 01 01 01 001

i) Fingerprints of drug-protein pairs ii) Set represetation of fingerprints (SET)
iii) Compute differences between k-th element S(C i ,P j )[k] and (k-1)-th element S(C i ,P j )[k-1] iv) Variable-length array representation of fingerprints (VLA). S' (C i ,P j ) is preresented by two bit strings R ij ,and P ij . S' (C i ,P j )[k] is represented by corresponding substring in R ij . P ij encodes length of each substring of R ij .
v) Trie representation of fingerprints (left) and succinct trie representation of fingerprints (SUCTRIE) (right). SUCTRIE consists of bit string and two arrays.  Table 1 for further details) in limited scalability in memory for extremely highdimensional data. To overcome this memory problem, we constructed two space-efficient representations of fingerprints. We present a brief overview of these representations (further details are given in the supplemental material [30]). Figure 1 illustrates the construction of our two representations. We first represent each fingerprint   Table 1 contains items corresponding to dimensions of one bit in (C i , P j ). We refer to a set representation of fingerprints as SET. To minimize each item, we then compute the difference between the k-th item S( ) and keep the results in a new set S (C i , P j ). We can recover S(C i , P j ) by cumulatively adding the items in S (C i , P j ).
We constructed our two space-efficient representations of fingerprints by leveraging the idea behind succinct data structures that achieve space-efficient representations of data structures while preserving the property of fast operations. The first one is a variable-length array for compactly representing fingerprints. The S (C i , P j ) is represented by two bit strings R ij and P ij which are indexed by rank/select dictionary, i.e., a succinct data structure for bit strings. We can randomly access any element in S (C i , P j ) in O(1) time by using fast operations in the rank/select dictionary [31]. We refer to this variable-length array representation of fingerprints as VLA.
The second one is a type of succinct trie for representing fingerprints. The trie is a data structure for strings, and it is also practical for representing fingerprints. A standard point-based implementation of trie consumes a huge amount of memory, resulting in limited scalability. Alternatively, we present a compact representation of trie by using a succinct data structure called LOUDS [32]. We can recover the original fingerprints by traversing a succinct trie in a depth-first manner. We refer to this succinct trie representation of fingerprints as SUCTRIE.

Extraction of drug-protein interaction signatures
We applied the proposed method (L1LOG-tensor) to extract drug-protein interaction signatures from drug profiles (chemical substructures and adverse drug reactions) and protein profiles (protein domains, biological pathways, and pathway modules), based on a largescale drug-protein interaction network. Each signature is the association between a drug feature and protein feature, where two features in the same signature are thought of as being associated in terms of drug-protein interactions. The results of all extracted drug-protein interaction signatures are presented in the supplemental material [30].
L1LOG-tensor extracted 105,684 signatures, while L2LOG-tensor extracted 7,843,218 signatures. Note that the number of all possible combinations of drug features and protein features is 84,195,504. The number of signatures from our L1LOG-tensor method was much less than that of L2LOG-tensor, due to the sparsity induced by L1-regularization. This makes it easier to analyze the extracted drug-protein interaction signatures for biological interpretation, so we focused on analyzing the results from L1LOG-tensor below. Figure 2 shows a network representation of some of the drug-protein signatures extracted with L1LOGtensor, where highly weighted associations of five features of drugs or proteins, that is, drug-chemical substructures (blue), adverse drug reaction (red), protein pathway (green), pathway module (yellow) and protein domain (gray). Only selected results are shown due to space limitation. The inferred signature association network provides us with clues about the important features behind the drug-protein interaction network. There has been no study on the inference of these associations.   Figure 3 shows an extracted signature representing the association between a drug-chemical substructure (SKELETON C1b(N1d)-C1b(O7a) in the KCF-S format) and biological pathway (hsa04080 Neuroactive ligandreceptor interaction), where the vertical axis on the heat map (a) shows all drugs sharing the extracted substructure, and the horizontal axis shows all proteins sharing the extracted pathway. The extracted drugchemical substructures on the associated drug structures (b) are in pink. Drugs and proteins in known interacting pairs tend to have such extracted features in the same signature. For example, Propantheline bromide (D00481), Methanthelinium bromide (D00721), Acetylcholine chloride (D00999), Carbachol (D00524), Succinylcholine chloride (D00766), and Suxamethonium chloride (D02275) share a choline skeleton, and all known to act on acetylcholine receptors. However, there are many other drugs sharing the extracted drug feature and proteins sharing the extracted protein feature, and the drug-protein interactions are not known. Thus, it may be possible to predict previously unknown interactions between drugs and proteins through the extracted features in the signatures. See Table 1 and Fig. 4 for detail. Table 2 shows an extracted signature representing the association between an ADR (R01631 Graft-versus-host disease) and protein domain (PF14446 Prokaryotic RING finger family 1), where all drugs sharing the extracted ADR and all proteins sharing the extracted protein domain are also shown. Interestingly, most drugs sharing the ADR (R01631 Graft-versus-host disease) were related to inflammation, immunosuppression, and cancer, which supports the recently expanded concept that inflammation is a critical component of cancer progression [33]. See Fig. 5 and Table 3 for detail. Figure 4 shows an extracted signature representing the association between a drug-chemical substructure (RING C1x-C1x-C1y(C1z)-C1y(C2x)-C1y-C1x-C1x-C1z (C5a+O7a)-C1z(C1a) in the KCF-S format) and biological pathway (hsa04080 Neuroactive ligand-receptor interaction). It was observed that Megestrol acetate (D00952), Cyproterone acetate (D01368) and Chlormadinone acetate (D01299) share common ring structures. All these drugs are known to act on many neuroactive ligand-receptors. See Table 1 for detail. Figure 6 show an extracted signature representing the association between a drug-chemical substructure (SKELETON C5a(N1b+O5a)-C1c(N1b)-C1b-C8y-C8x-C8x-C8x-C8x-C8x in the KCF-S format) and biological pathway (hsa03050 Proteasome). Proteasome inhibitors have been applied to the treatment of cancer, especially multiple myeloma. The substructure "SKELETON C5a (N1b+O5a)-C1c(N1b)-C1b-C8y-C8x-C8x-C8x-C8x-C8x" corresponds to a phenylalanine residue, which is captured as a characteristic substructure in known proteasome inhibitors Bortezomib (D03150) and Carfilzomib (D08880). See Table 4 for detail.
We used the receiver operating characteristic curve (ROC curve), which is defined as a plot of true positive rates against false positive rates based on various thresholds, and the precision-recall curve (PR curve), which is defined as a plot of precision (positive predictive value) against recall (sensitivity) based on various thresholds, as evaluation measures for prediction performance.
We computed the area under the ROC curve (AUC score) and the area under the PR curve (AUPR score). The parameters involved in each method (e.g., regularization parameter) were fit with AUC and AUPR as the objective functions. Figure 7 shows the AUC and AUPR scores in the pairwise cross-validation, where the number of negative pairs in the training set was changed from the same number of positive examples to that of all possible negative examples in the training set. We observed that the prediction accuracy of the models trained with all five methods improved as the number of negative examples in the training set increased. This suggests that using all possible negative examples for learning a predictive model will enhance prediction reliability. L1LOG-tensor performed the best.
L1LOG-LIBLINEAR-tensor did not perform well with an increasing number of negative examples in the training set because of the memory storage problem. The learning process with the LIBLINEAR algorithm consumed all the memory of our machine with 128GB-memory. In  contrast, the other four methods with our space-efficient algorithm were able to finish the training process. This suggests that our space-efficient algorithm is more suitable and powerful for learning a predictive model on extremely high-dimensional data. L1-LOG-tensor and L2LOG-tensor performed better than L1-LOG-concat and L2LOG-concat, which suggests that the tensor-product fingerprint can capture relevant information for drug-protein interaction prediction. On the other hand, the concatenated fingerprint cannot capture enough information, even though calculation is faster. Table 5 shows the AUC score, AUPR score, training time, and consumed memory in the pair-wise crossvalidation. L1LOG-tensor and L2LOG-tensor consumed 24 GB for learning predictive models on all possible drugprotein pairs, which suggests their applicability for largescale drug-protein interaction prediction. They also took about 24 hours, which can be considered reasonable on a practical level, though they were slower than L1LOGconcat and L2LOG-concat.
In the pair-wise cross-validation, drugs and proteins in test pairs often overlap with those in the training set.
We conducted a different 5-fold cross-validation to avoid the overlap of drugs and proteins in test pairs between those in the training set, which we call "block-wise crossvalidation". The results of this block-wise cross-validation are shown in Fig. 8 and Table 6. The same tendency in the pair-wise cross-validation was also seen in the block-wise cross-validation. However, the AUC and AUPR scores in the block-wise cross-validation were much lower than those in the pair-wise cross validation. The results indicate that predictions of unknown interactions for new drug candidates (without known targets) and orphan proteins (without known ligands) are much more difficult than detecting missing interactions between drugs of known targets and proteins of known ligands in practice.
Finally, we tested SUCTRIE, VLA, and SET on their space-efficiencies of fingerprint representations. Note that SET is a standard representation, and SUCTRIE and VLA are those constructed with our proposed method. Figure 9 shows a plot of the consumed memory against the number of fingerprints. SET is known to use a large amount of memory for storing all possible fingerprints. In fact, it consumed 57GB for storing all possible drugprotein pairs in our dataset, which limits its practical  usage. In contrast, our proposed representations SUC-TREE and VLA are more space-efficient than SET. The consumed memory of SUCTREE was slightly smaller than that of VLA. SUCTREE and VLA consumed 16 and 20 GB, respectively, for storing all possible drug-protein pairs, Suggesting the usefulness of our SUCTREE and VLA. In fact, we were not able to conduct all the analyses for this study without SUCTRIE.

Conclusions
We proposed a novel method of extracting the underlying features characterizing overall drug-protein interactions, which we call "drug-protein interaction signatures". We extracted a set of drug-protein interaction signatures consisting of the associations between drug chemical substructures, adverse drug reactions, protein domains, biological pathways, and pathway modules, and argued that the extracted drug-protein interaction signatures were biologically meaningful. Our proposed method is original in that the space-efficient representations for high-dimensional fingerprints of drug-protein pairs, in the characterization of a large-scale drug-protein interaction network with various features in an integrative framework, and in the interpretability for the extracted feature associations.
Our proposed method will be useful for various applications in drug discovery. A limitation of the method is that it cannot extract the associations between different attributes of drugs or proteins. For example, it cannot detect the associations between drugchemical substructures and adverse drug reactions or the associations between protein domains and biological pathways. Extension of the method for analyzing such more complicated features is an important future work.