Biochemical and Biophysical Research Communications
Improving residue–residue contact prediction via low-rank and sparse decomposition of residue correlation matrix
Introduction
In its natural environment, a protein usually adopts a specific tertiary structure determined primarily by its amino acid sequence [3]. Under chemical and physical effects, some residues lie spatially close to others, forming a set of residue–residue contacts. These contacts are known to stabilize native protein folds [13]. Accurate prediction of residue–residue contacts provides distance information among residues, which should greatly help both free modeling [24], [26] and template-based modeling strategies [22] for protein structure prediction.
A large variety of approaches have been proposed for residue–residue contact prediction, including supervised-learning approaches [7], [9], [32], [30] and purely sequence-based approaches [5], [8], [27], [6]. Typically, a purely sequence-based approach begins by building a multiple sequence alignment (MSA) for a target protein, and then identifies possible residue–residue contacts through correlated mutation analysis [29], [12]. The underlying principle is that residue–residue contacts, being generally responsible for stabilizing protein structure, tend to be conserved during the evolutionary history of the protein; thus, if a residue in contact mutates, its contacting partner is expected to mutate accordingly to maintain the contact. This coevolution between contacting residues commonly appears as correlations between the corresponding columns in the MSA of the target protein (hereafter called true correlations); the correlations among MSA columns, in turn, can be explored to infer residue contacts.
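As a minimal illustration of correlated mutation analysis, the sketch below computes mutual information (MI) between every pair of MSA columns. The function names and the four-sequence toy alignment are hypothetical, chosen only for illustration; real pipelines additionally handle gaps, sequence weighting, and pseudocounts, none of which are shown here.

```python
from collections import Counter

import numpy as np


def column_mi(col_a, col_b):
    """Mutual information (in nats) between two alignment columns."""
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), n_ab in count_ab.items():
        p_ab = n_ab / n
        # p_ab / (p_a * p_b), written with raw counts
        mi += p_ab * np.log(p_ab * n * n / (count_a[a] * count_b[b]))
    return mi


def mi_matrix(msa):
    """Symmetric L x L matrix of pairwise column MI for equal-length sequences."""
    columns = list(zip(*msa))
    length = len(columns)
    scores = np.zeros((length, length))
    for i in range(length):
        for j in range(i + 1, length):
            scores[i, j] = scores[j, i] = column_mi(columns[i], columns[j])
    return scores


# Hypothetical toy alignment: columns 0 and 2 co-vary, column 1 is conserved,
# and column 3 varies independently of column 0.
toy_msa = ["AKDG", "AKDS", "EKRG", "EKRS"]
mi = mi_matrix(toy_msa)
```

In this toy case only the co-varying pair (0, 2) receives a nonzero score; pairs involving the conserved or independently varying columns score zero.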
Two difficulties are involved in the purely sequence-based strategy for correlation analysis [16], [24]. First, the true correlations are generally blurred by transitive correlations, also known as indirect coupling. More precisely, suppose the ith residue correlates with the jth residue, and the jth residue correlates with the kth residue; in this situation, even if the ith residue is not in contact with the kth residue, a correlation might still be observed between them due to transitive effects. Second, intrinsic background correlations usually interfere with the identification of coevolution signals. The background correlations come from at least two sources: (1) During the phylogenetic history of a protein family, mutations occurring in an ancestral protein are inherited by all of its descendants. Thus, almost all residue pairs appear to have some degree of correlation caused purely by phylogenetic biases. (2) Highly variable columns in MSAs usually lead to a relatively high level of both random and non-random correlations among these columns [8], which forms another source of background correlations. The background correlations, as well as the indirect coupling, often confound the correlation analysis and subsequent contact prediction.
Recently, there has been significant progress in overcoming the indirect coupling difficulty. For example, mfDCA employed the mean field technique for direct coupling analysis [27], while plmDCA exploited the pseudo-likelihood maximization technique to achieve the same objective [10], [18]. Another approach, PSICOV (sparse inverse covariance estimation), models the MSA using a Gaussian distribution and estimates partial correlations by inverting the empirical covariance matrix with the graphical lasso technique [16]. Following this strategy, Andreatta et al. proposed using the least-squares technique to speed up the inversion of empirical covariance matrices [2]. Noting that an MSA usually consists of proteins with divergent sequences but similar folds, Ma et al. successfully applied the group graphical lasso technique to direct coupling analysis of MSAs [23]. These approaches are known as “global” because they treat correlated residues as dependent on one another; in contrast, “local” statistical inference models, for instance MI [25] and OMES [11], treat each residue pair independently of the others [24].
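The idea behind inverse-covariance approaches can be sketched on a three-variable Gaussian chain in which i couples to j and j couples to k, but i and k are not directly coupled. The example below uses plain matrix inversion on a hypothetical precision matrix; it is not the graphical lasso used by PSICOV, and the numbers are chosen only for illustration.

```python
import numpy as np

# Hypothetical Gaussian model with direct couplings i-j and j-k only:
# the (i, k) entry of the precision matrix is zero.
precision = np.array([[2.0, -1.0, 0.0],
                      [-1.0, 2.0, -1.0],
                      [0.0, -1.0, 2.0]])
covariance = np.linalg.inv(precision)


def to_correlation(cov):
    """Ordinary (marginal) correlation matrix from a covariance matrix."""
    scale = np.sqrt(np.diag(cov))
    return cov / np.outer(scale, scale)


def partial_correlation(cov):
    """Partial correlations: rescaled, sign-flipped inverse covariance."""
    inv = np.linalg.inv(cov)
    scale = np.sqrt(np.diag(inv))
    result = -inv / np.outer(scale, scale)
    np.fill_diagonal(result, 1.0)
    return result


corr = to_correlation(covariance)        # corr[0, 2] is 1/3: transitive signal
pcorr = partial_correlation(covariance)  # pcorr[0, 2] is 0: no direct coupling
```

The marginal correlation between i and k is clearly nonzero even though they are not directly coupled, whereas the partial correlation vanishes. With real MSAs the empirical covariance matrix is noisy and often singular, which is why regularized estimators are used instead of direct inversion.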
Besides these efforts to overcome indirect coupling, a few methods have been developed for removing the background correlations caused by phylogenetic biases. In particular, it has been reported that excluding highly similar sequences helps reduce phylogenetic biases [25]. Bootstrapping and other randomization methods [33], [28] were also found to be effective in reducing phylogenetic biases. Also promising is the average product correction (APC) technique. APC was originally designed to efficiently estimate the expected level of background noise arising from phylogenetic sources [8], and it is now widely used as a post-processing procedure in both local and global inference strategies. The existing approaches have proven relatively successful on various proteins; however, the removal of background correlations remains a challenge for the correlation analysis of MSAs.
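A minimal sketch of APC as a post-processing step is given below. It follows the usual definition, in which the background estimate for a pair (i, j) is the product of the mean scores of positions i and j divided by the overall mean score, with the diagonal excluded. The function name and the toy score matrix are hypothetical, and the sketch assumes the overall mean is nonzero.

```python
import numpy as np


def apc_correct(scores):
    """Subtract the average product correction from a symmetric score matrix."""
    scores = np.asarray(scores, dtype=float)
    length = scores.shape[0]
    off_diag = ~np.eye(length, dtype=bool)
    masked = np.where(off_diag, scores, 0.0)
    # Mean score of each position against all other positions.
    position_mean = masked.sum(axis=1) / (length - 1)
    # Mean score over all off-diagonal pairs (assumed nonzero).
    overall_mean = masked.sum() / (length * (length - 1))
    corrected = scores - np.outer(position_mean, position_mean) / overall_mean
    np.fill_diagonal(corrected, 0.0)
    return corrected


# Hypothetical toy scores: uniform background of 0.3 plus one elevated pair (0, 3).
scores = np.full((4, 4), 0.3)
scores[0, 3] = scores[3, 0] = 0.9
corrected = apc_correct(scores)
```

After correction the elevated pair (0, 3) remains the top-scoring pair, while a perfectly uniform background would be removed entirely.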
In this study, we present a novel approach that employs the low-rank and sparse matrix decomposition (LRS) technique to remove background correlations. The approach distinguishes true correlations from background correlations by their different characteristics: true correlations are sparse, whereas background correlations are low-rank. On the one hand, the number of contacts in a protein of length L was estimated as ∼0.05 × L² [17]. This number is small relative to the total of L² possible contacting residue pairs, which implies considerable sparsity of the true correlations. On the other hand, the first mode (principal component) of a correlation matrix describes the “coherent” correlations among all positions caused by phylogenetic biases [14]. In fact, the APC technique is essentially equivalent to removing the first mode of a correlation matrix, which implicitly assumes that the background correlation has rank 1 (see supplementary). However, besides the first mode, phylogenetic biases might also contribute to other modes, especially when MSAs are constructed from proteins segregated into subfamilies [14]. Here, we adopted a similar but more general assumption, namely that background correlations are low-rank, and performed LRS to adaptively separate true correlations from background correlations.
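The rank-1 structure of the APC term, and the more general low-rank assumption, can be written out as follows. The notation is illustrative and not taken from the paper: C is the L × L residue correlation matrix, \bar{C}_i the mean score of position i, \bar{C} the overall mean (assumed positive), B the background component, and T the true-correlation component. Only the rank-one form of the APC term is shown; its equivalence to removing the first mode is the argument the authors refer to their supplementary material for.

```latex
% APC term as an outer product, hence of rank one
\[
\mathrm{APC}_{ij} = \frac{\bar{C}_i \, \bar{C}_j}{\bar{C}}
\quad\Longrightarrow\quad
\mathrm{APC} = u u^{\top},
\qquad u_i = \frac{\bar{C}_i}{\sqrt{\bar{C}}},
\qquad \operatorname{rank}(\mathrm{APC}) = 1 .
\]

% Low-rank plus sparse generalization
\[
C = B + T,
\qquad \operatorname{rank}(B) = r \ll L,
\qquad \lVert T \rVert_0 \ll L^{2} .
\]
```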
It should be pointed out that the LRS technique, also known as robust principal component analysis (PCA), has been widely applied in computer vision [4] and gene expression analysis [21], [31]. To the best of our knowledge, this is the first time that the LRS technique has been applied to protein contact prediction.
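For readers unfamiliar with robust PCA, the sketch below implements the standard principal component pursuit formulation, which minimizes the nuclear norm of the low-rank part plus λ times the ℓ1 norm of the sparse part, using an alternating direction scheme. This is a generic textbook solver under assumed defaults (λ = 1/√n and a fixed penalty μ); it is not necessarily the solver, objective, or parameter setting used in this study, and the function names and synthetic test matrix are hypothetical.

```python
import numpy as np


def soft_threshold(x, tau):
    """Elementwise shrinkage toward zero by tau."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)


def svd_threshold(x, tau):
    """Shrink the singular values of x by tau."""
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    return (u * soft_threshold(s, tau)) @ vt


def low_rank_sparse(matrix, lam=None, mu=None, tol=1e-7, max_iter=1000):
    """Split matrix into low-rank + sparse parts by principal component pursuit."""
    matrix = np.asarray(matrix, dtype=float)
    n_rows, n_cols = matrix.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(n_rows, n_cols))  # assumed default weight
    if mu is None:
        mu = n_rows * n_cols / (4.0 * np.abs(matrix).sum())  # assumed penalty
    low = np.zeros_like(matrix)
    sparse = np.zeros_like(matrix)
    dual = np.zeros_like(matrix)
    norm = np.linalg.norm(matrix)
    for _ in range(max_iter):
        low = svd_threshold(matrix - sparse + dual / mu, 1.0 / mu)
        sparse = soft_threshold(matrix - low + dual / mu, lam / mu)
        residual = matrix - low - sparse
        dual = dual + mu * residual
        if np.linalg.norm(residual) <= tol * norm:
            break
    return low, sparse


# Synthetic example: rank-1 "background" plus three planted symmetric "contacts".
n = 30
b = np.linspace(0.5, 1.5, n)
background = np.outer(b, b)
contacts = np.zeros((n, n))
planted = [(2, 17), (5, 25), (10, 20)]
for i, j in planted:
    contacts[i, j] = contacts[j, i] = 3.0
low, sparse = low_rank_sparse(background + contacts)
```

In a contact prediction setting, the input would be a residue correlation matrix and the entries of the sparse component would be ranked to propose contacts; the weight λ controls how much signal is assigned to the sparse part and would need to be tuned.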
We evaluated the LRS technique on the GREMLIN dataset as well as on CASP11 targets. The evaluation results suggested that using the LRS technique significantly improved contact prediction precision, regardless of whether local or global inference models were used.
Methods
To apply the LRS technique to protein contact prediction, we first built a matrix measuring the correlations among residues in the target protein. The residue correlation measure can be calculated using local statistical models (e.g., MI and OMES) or global statistical models (e.g., DCA and PSICOV). Next, using the LRS technique, we decomposed the residue correlation matrix into a low-rank component plus a sparse component. The sparse component was then used to infer residue–residue
Results and discussion
In our experiments, the PSICOV dataset of 150 single-domain monomeric proteins [16] was used to train the parameter λ, and the GREMLIN dataset of 329 proteins [18] was used as the test set to evaluate the LRS technique. We also evaluated the performance of the LRS technique on CASP11 targets.
To avoid biases incurred by overlap between the training and test datasets, similar proteins shared by the two datasets were removed. Here, the criterion of
Conflict of interests
All authors declare no conflict of interest.
Acknowledgment
This study was funded by the National Basic Research Program of China (973 Program) (2012CB316502, 2015CB910303), the National Natural Science Foundation of China (11175224, 11121403, 31270834, 61272318, 31171262, 31428012, 31471246) and the Open Project Program of State Key Laboratory of Theoretical Physics (No. Y4KF171CJ1). This work made use of the eInfrastructure provided by the European Commission co-funded project CHAIN-REDS (GA no 306819).
We greatly appreciate Sergey Ovchinnikov for
References (33)
- et al. Inter-residue interactions in protein folding and stability. Prog. Biophys. Mol. Biol. (2004)
- et al. Protein sectors: evolutionary units of three-dimensional structure. Cell (2009)
- et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. (1997)
- M. Andreatta, S. Laplagne, S. C. Li, S. Smale, 2013, Prediction of residue-residue contacts from protein families using...
- Studies on the Principles that Govern the Folding of Protein Chains (1972)
- et al. Robust principal component analysis? J. ACM (2011)
- et al. Inferring consensus structure from nucleic acid sequences. Comput. Appl. Biosci. (1991)
- et al. Emerging methods in protein co-evolution. Nat. Rev. Genet. (2013)
- et al. Deep architectures for protein contact map prediction. Bioinformatics (2012)
- et al. Mutual information without the influence of phylogeny or entropy dramatically improves residue contact prediction. Bioinformatics (2008)
- Predicting protein residue–residue contacts using deep networks and boosting. Bioinformatics
- Improved contact prediction in proteins: using pseudolikelihoods to infer Potts models. Phys. Rev. E
- Influence of conservation on calculations of amino acid covariance in multiple sequence alignments. Proteins Struct. Funct. Bioinforma.
- Correlated mutations and residue contacts in proteins. Proteins Struct. Funct. Genet.
- Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci.
- PSICOV: precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments. Bioinformatics
1 The first two authors contributed equally to this paper.