Improving residue–residue contact prediction via low-rank and sparse decomposition of residue correlation matrix

https://doi.org/10.1016/j.bbrc.2016.01.188

Highlights

  • In contact prediction, correlation analysis is hampered by background correlations.

  • We used LRS (low rank and sparse decomposition) to remove background correlations.

  • True contacts were inferred based on the sparse component of correlation matrix.

  • Our results suggested that LRS significantly improved prediction precision.

  • LRS outperformed the popular denoising technique APC.

Abstract

Strategies for correlation analysis in protein contact prediction often encounter two challenges, namely, the indirect coupling among residues and the background correlations mainly caused by phylogenetic biases. While various studies have addressed how to disentangle indirect coupling, the removal of background correlations remains unresolved. Here, we present an approach for removing background correlations via low-rank and sparse decomposition (LRS) of a residue correlation matrix. The correlation matrix can be constructed using either local inference strategies (e.g., mutual information, or MI) or global inference strategies (e.g., direct coupling analysis, or DCA). In our approach, a correlation matrix is decomposed into two components, i.e., a low-rank component representing background correlations and a sparse component representing true correlations. Finally, residue contacts are inferred from the sparse component of the correlation matrix.

We trained our LRS-based method on the PSICOV dataset and tested it on both the GREMLIN and CASP11 datasets. Our experimental results suggest that LRS significantly improves contact prediction precision. For example, when equipped with the LRS technique, the prediction precision of MI and mfDCA increased from 0.25 to 0.67 and from 0.58 to 0.70, respectively (top L/10 predicted contacts, sequence separation: 5 AA, dataset: GREMLIN). In addition, our LRS technique consistently outperforms the popular denoising technique APC (average product correction) on both local (MI_LRS: 0.67 vs MI_APC: 0.34) and global measures (mfDCA_LRS: 0.70 vs mfDCA_APC: 0.67). Interestingly, we found that when equipped with our LRS technique, local inference strategies performed comparably to global inference strategies, implying that LRS narrows the performance gap between local and global inference strategies. Overall, our LRS technique greatly facilitates protein contact prediction by removing background correlations.
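The precision figures above are reported for the top L/10 predicted contacts at a sequence separation of 5 residues. As a rough illustration of how such a metric might be computed, the sketch below ranks residue pairs by score and measures the fraction of true contacts among the top-ranked ones. The function name `top_precision`, its parameters, and the reading of the separation cutoff as |i - j| >= 5 are assumptions made for this example; the exact evaluation protocol of the study is not given in this excerpt.

```python
import numpy as np

def top_precision(scores, contacts, frac=0.1, min_sep=5):
    """Fraction of true contacts among the top frac*L scored pairs.

    scores   : (L, L) symmetric matrix of predicted contact scores
    contacts : (L, L) boolean matrix of true contacts
    min_sep  : only pairs with j - i >= min_sep are ranked (assumed convention)
    """
    L = scores.shape[0]
    i, j = np.triu_indices(L, k=min_sep)       # upper-triangle pairs with j - i >= min_sep
    order = np.argsort(scores[i, j])[::-1]     # highest score first
    n_top = max(1, int(L * frac))
    top = order[:n_top]
    return float(contacts[i[top], j[top]].mean())

# Toy example with L = 20, so the top L/10 list holds two pairs.
L = 20
contacts = np.zeros((L, L), dtype=bool)
scores = np.zeros((L, L))
contacts[0, 10] = contacts[1, 12] = contacts[0, 2] = True
scores[0, 10] = 0.9   # true contact, ranked
scores[3, 15] = 0.8   # false positive, ranked
scores[0, 2] = 1.0    # true contact, but excluded by the separation cutoff
precision = top_precision(scores, contacts)
```

In this toy case one of the two ranked pairs is a true contact, so the precision is one half; the short-range pair is ignored despite having the highest score.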

An implementation of the approach called COLORS (improving COntact prediction using LOw-Rank and Sparse matrix decomposition) is available from http://protein.ict.ac.cn/COLORS/.

Introduction

In its natural environment, a protein usually adopts a specific tertiary structure determined primarily by its amino acid sequence [3]. Under chemical and physical effects, some residues are spatially close to others, forming a set of residue–residue contacts. These contacts are known to be responsible for stabilizing the native protein folds [13]. The accurate prediction of residue–residue contacts can provide distance information among residues, which should greatly help both free modeling [24], [26] and template-based modeling strategies [22] for protein structure prediction.

A large variety of approaches have been proposed for residue–residue contact prediction, including supervised-learning approaches [7], [9], [32], [30] and purely sequence-based approaches [5], [8], [27], [6]. Typically, a purely sequence-based approach begins by building a multiple sequence alignment (MSA) for a target protein, and then identifies possible residue–residue contacts through correlated mutation analysis [29], [12]. The underlying principle is that residue–residue contacts, being generally responsible for stabilizing protein structure, tend to be preserved during the evolutionary history of the protein; thus, if a residue in contact mutates, its contacting partner is expected to mutate accordingly to maintain the contact. This coevolution between contacting residues commonly appears as correlations between the corresponding columns in the MSA of the target protein (hereafter called true correlations); the correlations among MSA columns, in turn, can be exploited to infer residue contacts.
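One common local measure of correlation between MSA columns is mutual information (MI). The sketch below computes a plain MI matrix from a toy alignment. It is a simplified illustration only: it assumes equal-length aligned sequences and omits sequence reweighting, pseudocounts, and gap handling, which practical pipelines usually add. The function names are hypothetical.

```python
import numpy as np
from collections import Counter
from math import log

def mutual_information(col_a, col_b):
    """MI (in nats) between two alignment columns given as equal-length strings."""
    n = len(col_a)
    pa, pb = Counter(col_a), Counter(col_b)
    pab = Counter(zip(col_a, col_b))
    # sum over observed pairs of p(x, y) * log( p(x, y) / (p(x) * p(y)) )
    return sum((c / n) * log(c * n / (pa[x] * pb[y])) for (x, y), c in pab.items())

def mi_matrix(msa):
    """Symmetric L x L MI matrix for a list of aligned sequences."""
    cols = ["".join(c) for c in zip(*msa)]
    L = len(cols)
    M = np.zeros((L, L))
    for i in range(L):
        for j in range(i + 1, L):
            M[i, j] = M[j, i] = mutual_information(cols[i], cols[j])
    return M

# Toy alignment: columns 0 and 1 mutate together, column 2 is conserved,
# and column 3 varies independently of column 0.
msa = ["AKLE", "AKLD", "GRLE", "GRLD"]
M = mi_matrix(msa)
```

Here the perfectly covarying columns 0 and 1 receive the maximal MI of log 2 for two equally frequent states, whereas the conserved and the independent columns receive zero.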

Two difficulties are involved in the purely sequence-based strategy for correlation analysis [16], [24]. First, the true correlations are generally blurred by transitive correlations, also known as indirect coupling. More precisely, suppose the ith residue correlates with the jth residue, and the jth residue correlates with the kth residue; in this situation, even if the ith residue is not in contact with the kth residue, a correlation might still be observed between them due to transitive effects. Second, the intrinsic background correlations usually interfere with the identification of coevolution signals. The background correlations come from at least two sources: (1) During the phylogenetic history of a protein family, mutations occurring in an ancestral protein are inherited by all of its descendants. Thus, almost all residue pairs appear to have some degree of correlation caused purely by phylogenetic biases. (2) Highly variable columns in MSAs usually lead to a relatively high level of both random and non-random correlations among these columns [8], which forms another source of background correlations. The background correlations, as well as the indirect coupling, often confound the correlation analysis and subsequent contact prediction.

Recently, significant progress has been made in overcoming the indirect coupling difficulty. For example, mfDCA employed the mean-field technique for direct coupling analysis [27], while plmDCA exploited the pseudo-likelihood maximization technique to achieve the same objective [10], [18]. Another approach, PSICOV (sparse inverse covariance estimation), models the MSA using a Gaussian distribution and estimates partial correlations by inverting the empirical covariance matrix through the graphical lasso technique [16]. Following this strategy, Andreatta et al. proposed using the least-squares technique to speed up the inversion of empirical covariance matrices [2]. Noting that an MSA usually consists of proteins with divergent sequences but similar folds, Ma et al. successfully applied the group graphical lasso technique to direct coupling analysis of MSAs [23]. These approaches are known as “global” since correlated residues are treated as dependent on each other; in contrast, the “local” statistical inference models (for instance, MI [25] and OMES [11]) treat a given residue pair independently of others [24].
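The effect of indirect coupling, and the reason why inverting a covariance matrix helps, can be illustrated with a three-variable Gaussian chain i - j - k. The sketch below is a toy numerical example using plain matrix inversion; it is not the graphical lasso used by PSICOV nor the mean-field inference of mfDCA, and the chosen matrix values are arbitrary assumptions.

```python
import numpy as np

# Precision (inverse covariance) matrix of a chain i - j - k.
# The zero in the corner encodes that i and k are not directly coupled.
precision = np.array([[ 2.0, -1.0,  0.0],
                      [-1.0,  2.0, -1.0],
                      [ 0.0, -1.0,  2.0]])

covariance = np.linalg.inv(precision)

def to_correlation(cov):
    """Ordinary (marginal) correlation matrix from a covariance matrix."""
    d = np.sqrt(np.diag(cov))
    return cov / np.outer(d, d)

def partial_correlation(cov):
    """Partial correlations from the inverse covariance (off-diagonal entries only)."""
    p = np.linalg.inv(cov)
    d = np.sqrt(np.diag(p))
    return -p / np.outer(d, d)

corr = to_correlation(covariance)          # what a "local" measure would see
pcorr = partial_correlation(covariance)    # what a "global" measure aims at
```

Although i and k are not directly coupled, their ordinary correlation `corr[0, 2]` is clearly nonzero because both are coupled to j, whereas the partial correlation `pcorr[0, 2]` vanishes up to rounding error.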

Besides these efforts to overcome indirect coupling, a few methods have been developed for removing the background correlations caused by phylogenetic biases. In particular, it has been reported that the exclusion of highly similar sequences helps reduce phylogenetic biases [25]. Bootstrapping and other randomization methods [33], [28] were also found effective in reducing phylogenetic biases. Also promising is the average product correction (APC) technique. APC was originally designed to efficiently estimate the expected levels of background noise arising from phylogenetic sources [8], and it is currently widely used as a post-processing procedure in both local and global inference strategies. The existing approaches have proven to be relatively successful on various proteins; however, the removal of background correlations remains a challenge to the correlation analysis of MSAs.
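For concreteness, APC as commonly formulated subtracts from each pairwise score the product of the two positions' mean scores divided by the overall mean score. The sketch below follows that formulation for a symmetric score matrix; the function name `apc` and the choice to ignore the diagonal are assumptions of this example.

```python
import numpy as np

def apc(M):
    """Average product correction of a symmetric score matrix (diagonal ignored).

    Returns the corrected matrix and the estimated background term.
    """
    M = np.array(M, dtype=float)
    np.fill_diagonal(M, 0.0)
    L = M.shape[0]
    row_mean = M.sum(axis=1) / (L - 1)         # mean score of each position with all others
    total_mean = M.sum() / (L * (L - 1))       # mean score over all off-diagonal pairs
    background = np.outer(row_mean, row_mean) / total_mean
    corrected = M - background
    np.fill_diagonal(corrected, 0.0)
    return corrected, background

# Toy symmetric score matrix with positive entries.
rng = np.random.default_rng(0)
A = rng.random((6, 6))
scores = (A + A.T) / 2
corrected, background = apc(scores)
background_rank = int(np.linalg.matrix_rank(background))
```

Because the background term is an outer product of a single vector with itself, it has rank 1 by construction, which is the sense in which APC models the background with a single mode.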

In this study, we present a novel approach that employs the low-rank and sparse matrix decomposition (LRS) technique for removing background correlations. The approach distinguishes true correlations from background correlations according to their different characteristics, i.e., the sparsity of true correlations and the low-rank characteristic of background correlations. On the one hand, the number of contacts in a protein of length L was estimated as ∼0.05 × L² [17]. This number is small compared with the total of L² possible contacting residue pairs, which implies considerable sparsity of true correlations. On the other hand, the first mode (principal component) of a correlation matrix describes the “coherent” correlations among all positions caused by phylogenetic biases [14]. In fact, the APC technique is essentially equivalent to removing the first mode of a correlation matrix, which implicitly assumes that the rank of the background correlations is 1 (see supplementary). However, besides the first mode, phylogenetic biases might also contribute to other modes, especially when MSAs are constructed from proteins segregated into subfamilies [14]. Here, we adopted the similar but more general assumption that background correlations are low-rank, and performed LRS to adaptively separate true correlations from background correlations.

It should be pointed out that the LRS technique, also known as robust principal component analysis (PCA), has been widely applied in the fields of computational vision analysis [4] and gene expression analysis [21], [31]. As far as we know, this is the first time that the LRS technique has been applied to protein contact prediction.
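To make the decomposition concrete, the following is a minimal sketch of robust PCA in its principal component pursuit form, which splits a matrix M into a low-rank part and a sparse part by minimizing the nuclear norm of the former plus λ times the entrywise L1 norm of the latter, subject to their sum equalling M. The solver is a standard augmented Lagrangian iteration. The function name `lrs_decompose`, the default λ = 1/sqrt(n), and the step size μ are assumptions taken from the general robust PCA literature; this excerpt does not state which solver or parameter values the study used, beyond noting that λ was trained.

```python
import numpy as np

def soft_threshold(X, tau):
    """Entrywise shrinkage toward zero by tau."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svd_threshold(X, tau):
    """Shrink the singular values of X by tau."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * soft_threshold(s, tau)) @ Vt

def lrs_decompose(M, lam=None, mu=None, tol=1e-7, max_iter=1000):
    """Split M into low_rank + sparse (hypothetical helper, not the COLORS code)."""
    M = np.asarray(M, dtype=float)
    n1, n2 = M.shape
    lam = 1.0 / np.sqrt(max(n1, n2)) if lam is None else lam    # assumed default
    mu = n1 * n2 / (4.0 * np.abs(M).sum()) if mu is None else mu
    S = np.zeros_like(M)
    Y = np.zeros_like(M)                                        # Lagrange multipliers
    for _ in range(max_iter):
        low_rank = svd_threshold(M - S + Y / mu, 1.0 / mu)
        S = soft_threshold(M - low_rank + Y / mu, lam / mu)
        residual = M - low_rank - S
        Y += mu * residual
        if np.linalg.norm(residual) <= tol * np.linalg.norm(M):
            break
    return low_rank, S

# Synthetic "correlation matrix": rank-1 coherent background plus a few strong pairs.
rng = np.random.default_rng(1)
n = 50
u = rng.uniform(0.5, 1.5, size=n)
background = np.outer(u, u)
S_true = np.zeros((n, n))
iu, ju = np.triu_indices(n, k=1)
picked = rng.choice(len(iu), size=10, replace=False)
S_true[iu[picked], ju[picked]] = 5.0
S_true[ju[picked], iu[picked]] = 5.0
M = background + S_true

low_rank, S = lrs_decompose(M)
rel_residual = np.linalg.norm(M - low_rank - S) / np.linalg.norm(M)
top = set(np.argsort(np.abs(S).ravel())[-20:])
planted = set(np.flatnonzero(S_true))
overlap = len(top & planted)
```

In this synthetic setting the largest entries of the sparse component are expected to coincide with the planted pairs, mirroring the idea of reading contacts off the sparse part. Unlike APC, the rank of the background is not fixed in advance but is determined by the trade-off parameter λ.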

We evaluated the LRS technique on the GREMLIN dataset as well as on CASP11 targets. The evaluation results suggested that the LRS technique significantly improved contact prediction precision regardless of whether local or global inference models were used.


Methods

To apply the LRS technique to protein contact prediction, we first built a matrix to measure correlations among residues in the target protein. The residue correlation measure can be calculated using local statistical models (e.g., MI and OMES) or global statistical models (e.g., DCA and PSICOV). Next, using the LRS technique, we decomposed the residue correlation matrix into a low-rank component plus a sparse component. The sparse component was then used to infer residue–residue

Results and discussion

In our experiments, the PSICOV dataset of 150 single-domain monomeric proteins [16] was used to train the parameter λ, and the GREMLIN dataset of 329 proteins [18] was used as the testing set to evaluate the LRS technique. We also evaluated the performance of the LRS technique on CASP11 targets.

To avoid the biases incurred by overlap between the training and testing datasets, similar proteins shared by the two datasets were removed. Here, the criterion of

Conflict of interests

All authors declare no conflict of interest.

Acknowledgment

This study was funded by the National Basic Research Program of China (973 Program) (2012CB316502, 2015CB910303), the National Natural Science Foundation of China (11175224, 11121403, 31270834, 61272318, 31171262, 31428012, 31471246) and the Open Project Program of State Key Laboratory of Theoretical Physics (No. Y4KF171CJ1). This work made use of the eInfrastructure provided by the European Commission co-funded project CHAIN-REDS (GA no 306819).

We greatly appreciate Sergey Ovchinnikov for

References (33)

  • M.M. Gromiha et al. Inter-residue interactions in protein folding and stability. Prog. Biophys. Mol. Biol. (2004)
  • N. Halabi et al. Protein sectors: evolutionary units of three-dimensional structure. Cell (2009)
  • S.F. Altschul et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. (1997)
  • M. Andreatta, S. Laplagne, S. C. Li, S. Smale, 2013, Prediction of residue-residue contacts from protein families using...
  • C.B. Anfinsen. Studies on the Principles that Govern the Folding of Protein Chains (1972)
  • E.J. Candès et al. Robust principal component analysis? J. ACM (2011)
  • D.K. Chiu et al. Inferring consensus structure from nucleic acid sequences. Comput. Appl. Biosci. (1991)
  • D. de Juan et al. Emerging methods in protein co-evolution. Nat. Rev. Genet. (2013)
  • P. Di Lena et al. Deep architectures for protein contact map prediction. Bioinformatics (2012)
  • S.D. Dunn et al. Mutual information without the influence of phylogeny or entropy dramatically improves residue contact prediction. Bioinformatics (2008)
  • J. Eickholt et al. Predicting protein residue–residue contacts using deep networks and boosting. Bioinformatics (2012)
  • M. Ekeberg et al. Improved contact prediction in proteins: using pseudolikelihoods to infer Potts models. Phys. Rev. E (2013)
  • A.A. Fodor et al. Influence of conservation on calculations of amino acid covariance in multiple sequence alignments. Proteins Struct. Funct. Bioinforma. (2004)
  • U. Gobel et al. Correlated mutations and residue contacts in proteins. Proteins Struct. Funct. Genet. (1994)
  • S. Henikoff et al. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. (1992)
  • D.T. Jones et al. PSICOV: precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments. Bioinformatics (2012)

    1

    The first two authors contributed equally to this paper.
