Statistical correlation of nonconservative substitutions of HIV gp41 variable amino acid residues with the R5X4 HIV-1 phenotype

The interaction of the envelope glycoprotein of HIV-1 (gp120/gp41) with coreceptor molecules has important implications for specific cellular targeting and pathogenesis. Experimental and theoretical evidences have shown a role for gp41 in coreceptor tropism, although there is no consensus about the positions involved. Here we analyze the association of physicochemical properties of gp41 amino acid residues with viral tropism (X4, R5, and R5X4) using a large set of HIV-1 sequences. Under the assumption that conserved regions define the complex structural features essential for protein function, we focused our search only on amino acids in the gp41 variable regions. Gp41 amino acid sequences of 2823 HIV-1 strains from all clades with known coreceptor tropism were retrieved from Los Alamos HIV Database. Consensus sequences were constructed for homologous sequences (those obtained from the same patient and having the same tropism) in order to avoid bias due to sequence overrepresentation, and the variability (entropy) per site was determined. Comparisons of hydropathy index (HI) and charge (Q) of amino acid residues at highly variable positions between coreceptor groups were performed using two non-parametrical tests and Benjamini-Hochberg correction. Pearson’s correlation analysis was performed to determine covariance of HI and Q values. Calculation of variability per site rendered 58 highly variable amino acid positions. Of these, statistical analysis rendered significantly different HI or Q only for the R5 vs. R5X4 comparison at twelve positions: 535, 602, 619, 636, 640, 641, 658, 662, 667, 723, 756 and 841. The largest differences in particular amino acid frequencies between coreceptor groups were found at 619, 636, 640, 641, 662, 723 and 756. A hydrophobic tendency of residues 619, 640, 641, 723 and 756, along with a hydrophilic/charged tendency at residues 636 and 662 was observed in R5X4 with respect to R5 sequences. HI of position 640 covariated with that of 602, 619, 636, 662, and 756. Variability and significant correlations of physicochemical properties with viral phenotype suggest that substitutions at residues in the loop (602 and 619), the HR2 (636, 640, 641, 662), and the C-terminal tail (723, 756) of gp41 may contribute to phenotype of R5X4 strains.


Background
Important features of the HIV-1 induced disease are determined by the interaction of three main classes of viruses with different subsets of CD4+ cells, currently designated as R5, X4 and R5X4 viruses depending on the coreceptor they use to enter cells (CCR5, CXCR4, or both, respectively). CCR5 is expressed mainly by macrophages and the activated/memory T subset, whereas CXCR4 is predominantly expressed by the naïve, but also the memory, subsets of CD4+ T-lymphocytes and by CD4+ T-cell lines. R5 viruses are responsible for transmission and persist through the whole course of the disease in most of patients. The appearance of R5X4 and X4 viruses in blood associates with the onset of AIDS [1].
Entry of the HIV-1 genome into target cells depends on trimmeric complexes of the viral envelope glycoprotein (Env) heterodimer, which is composed of a hypervariable surface subunit (gp120), and a more conserved, though highly variable, transmembrane subunit (gp41) [2]. CD4 binding to gp120 induces the exposure/formation of the binding site for the coreceptor [3]. The gp120-CD4-coreceptor interaction then allows the extension of gp41 and the insertion of the fusion peptide into the target membrane. Current models indicate that packing of three gp41 C-terminal helices into the grooves of a coiled coil formed by the N-terminal helices forms a structure known as the six-helix bundle, enforcing virus-cell membrane fusion [4,5].
Determinants of HIV-1 coreceptor tropism have been identified mainly in the hypervariable gp120 V3 loop, where a high positive net charge associates with X4 tropism [6,7]. V1, V2 and V5 loops modulates the V3 effects [8][9][10][11][12]. In addition, experimental evidence of the participation of gp41 in coreceptor recognition has been provided [13][14][15][16]. Gp41 contains approximately 346 amino acids and is composed of an ectodomain, a membrane spanning domain, and a long C-terminal tail (CTT). The ectodomain is organized in an N-terminal fusion peptide, two helical regions known as HR1 and HR2, a central loop, and the membrane proximal external region (MPER). In the ectodomain, HR2 concentrates the highest variation rate [17], whereas the Cterminal tail display the higher average diversity in the protein [2]. Theoretical studies have shown the statistical association of gp41 with coreceptor tropism although there is no a consensus about the putative sites implicated [18][19][20][21], and congruency with experimental investigations of coreceptor associated mutations [13,14] is not clear. Given the high variability and adaptive nature of gp41, discordances may be caused by differences in the databases used, as well as to distinct analytical approaches. Thus, while it seems clear that different gp41 domains participate in determination of virus phenotype, the specific changes involved may develop in a complex, context-dependent manner, similarly to the different mutational pathways observed in studies of the correlates of the gp120 sequence with coreceptor tropism [10] or that obtained for resistance to maraviroc of R5tropic viruses [22].
Unlike other studies, we focused our analysis on the relationship of the hydropathy index and charge of amino acid positions between coreceptor groups in order to determine if general physicochemical properties of gp41 residues correlate with different virus phenotypes. In addition, we focused on highly variable amino acid positions of gp41 since conserved positions are most probably engaged in maintaining the highly stringent structural properties required for membrane fusion. With this purpose, we retrieved amino acid sequences of a set of 2823 HIV-1 strains from all clades with known coreceptor tropism from Los Alamos HIV Database. Consensus sequences were constructed for homologous sequences (those obtained from the same patient and having the same tropism) in order to avoid bias due to sequence overrepresentation. Then, the variability (entropy) per site was determined and amino acid positions with high variability scores or with large differences in variability between coreceptor groups were selected. Next, we performed a statistical analysis for the association of the viral tropism (X4, R5 and R5X4) with the hydropathy index (HI) and charge (Q) of amino acid residues at those positions. Twelve positions were found linked to coreceptor usage in this analysis. We suggest that some of the most gp41 variable residues are involved in the coreceptor recognition process.

Variability of gp41
The statistical association between coreceptor tropism and hydropathy index (HI) or charge (Q) of variable amino acids was analyzed for 2823 gp41 sequences from individual viruses with known coreceptor tropism included in Los Alamos Database at January 2014, considering all clades. After alignment and construction of consensus for homologous sequences, a final number of 773 sequences was obtained as follows: 621 R5, 73 X4, and 79 R5X4. Table 1 presents the percentage of consensus sequences of strains with a given coreceptor tropism in genetic subtypes.
The protein variability calculated by means of the entropy (S k ) per site for the whole gp41 sequence is presented in Fig. 1. The highest entropy peaks concentrated at the ectodomain, particularly at positions 619-621 of the C-terminal end of the loop, and 640, 641, and 644 in HR2. In the C-terminal tail, regions with high variability were observed in the putative minor ectodomain (ME) [23][24][25] and membrane spanning domain three (MSD3) [25,26], as well as in the lentivirus lytic peptide one (LLP-1). Similar patterns of gp41 variability have been reported before [2,17].
We considered as highly variable those positions with the highest entropy scores (S k > 0.9). This criterion yielded 27 positions in the ectodomain and 31 in the transmembrane domain and cytoplasmic tail. Thus, 58 variable positions were considered for statistical analysis of correlation with coreceptor usage ( Fig. 1 and Additional file 1: Table S1).

Relationship of coreceptor usage with hydropathy index and charge of highly variable amino acids
We tested the independence of HI distributions (Mann-Whitney U test) and the association of the hydrophobic (HI > 0) or hydrophilic (HI < 0) character (χ 2 test) with coreceptor usage in the R5 vs. X4, R5 vs. R5X4, and X4 vs. R5X4 comparisons. In order to correct for multiple tests we employed the Benjamini-Hochberg procedure by considering false discovery rates (Q FD ) of 0.05 and 0.1. With both criteria significant p values were obtained only for the R5 vs. R5X4 comparison. Additional file 1: Table S1 contains the average and standard deviation of HI at each position in the R5, X4, and R5X4 groups, as well as the p values obtained for comparisons between them before correction for multiple tests. Table 2 shows the summary of statistics of positions with significant p values after Benjamini-Hochberg correction. Using a Q FD of 0.05, the test of HI-independence distribution (Mann-Whitney U test) rendered ten significant amino acid positions. Three of these positions (619, 641 and 667) as well as 602 also showed statistical linkage of hydrophilicity or hydrophobicity (χ2) with coreceptor tropism. The same tests were applied to the analysis of correlation of Q with coreceptor usage. Additional file 2: Table S2 shows p values obtained for all comparisons before correction for multiple tests and Table 3 contains the summary of significant position statistics after multiple test correction. Statistical independence of Q distribution was found only at position 636, whereas significant association of charged or uncharged character with viral tropism was obtained for this position and for 602 and 658. In total, twelve different positions rendered significant p values for HI or Q. Figure 2 compares the mean hydropathy value of all 58 variable residues (listed in Additional file 1: Table S1) among coreceptor groups. Red markers indicate  526  533  540  547  554  561  568  575  582  589  596  603  610  617  624  631  638  645  652  659  666  673  680  687  694  701  708  715  722  729  736  743  750  757  764  771  778  785  792  799  806  813  820  827  834  841  848   The number of sequences in each subtype and each tropism group is indicated in parenthesis positions that produced significant p values with a Q FD of 0.05 showed in Tables 2 and 3. According with statistical analyses, the largest differences in HI were observed for the R5X4-R5 comparison (Fig. 2a). Large increments of hydrophobicity in R5X4 respective to R5 sequences were observed at positions 619, 641, 667 and 841, and moderated increments at 640, 723 and 756, whereas increased hydrophilicity in R5X4 respective to R5 sequences was observed at positions 602, 636 and 662. Position 658, which showed significantly different Q    Table S1. (a) ΔHI between R5X5 and R5 sequences. Positions showing significant differences of HI (Q FD = 0.05) between R5X4 and R5 viruses are indicated with red circles. Position 658, which exhibited difference in charge only (Table 3) is indicated with a red square. Positions with the largest differences in amino acid frequencies between coreceptor groups (see Fig. 3) are indicated with position number. (b) ΔHI between R5X5 and X4 sequences. (c) ΔHI between X4 and R5 sequences. Positive or negative differences in HI imply a hydrophobic or hydrophilic tendency, respectively, for R5X4 (a, b) or X4 (c) sequences. Note that positions that were significant in the R5X4-R5 comparison (a) where not significant for the comparisons shown in (b) and (c), and are presented to illustrate the diminution of the hydrophobic or hydrophilic tendency of the respective residues (white circles and squares) between R5X4 and R5 sequences, is indicated with a red square. A similar pattern, although not significant, was observed in R5X4 respective to X4 sequences (Fig. 2b) and only minor differences were observed in X4 with respect to R5 sequences (Fig. 2c). Figure 3 shows a survey of the frequency distribution of particular amino acids at these sites. The major differences between coreceptor groups were at positions 619, 636, 640, and 641. The content of hydrophobic residues at positions 619, 640 and 641 was between 38 and 52 % greater in R5X4 than in R5 sequences, whereas the content of charged residues at position 636 was 40 % greater in R5X4 sequences. Positions 535, 602, 658, 662, 667, 723, 756 and 841 exhibited differences between 18 and 34 % in the content of particular residues.
In summary, taking into account the extent of differences in hydropathy and charge, as well as the frequency distribution of amino acids, a tendency to a hydrophobic character at positions 619, 640, 641, 723 and 756, and to In order to detect differences in HI or Q in other comparisons (R5X4 vs. X4 and R5 vs. X4), a statistical evaluation was performed by broadening the criterion of false discovery rate. Considering a value of Q FD = .10, again the R5 vs. R5X4 comparison was the only that provided statistically relevant sites. In addition to positions obtained using a Q FD = .05, differences in HI were obtained at positions 746 and 778, whereas different charge was observed at 809 (Tables 2 and 3).

Correlation between sites
A covariation analysis was performed for positions that were statistically different between coreceptor groups in order to assess if HI or Q values change in a correlated manner. Given the highly organized structure of gp41, it is predictable that many positions should covariate significantly, which is necessary to maintain the structure and function of the protein. However, a higher correlation index for a pair of residues in one tropism group respective to others would be indicative of a complementary contribution to virus phenotype. Thus, the analysis was performed separately on the R5, X4 and R5X4 groups. The covariance analysis also provides information about the positive or negative correlation between values, providing an assessment, for example, of the tendency to hydrophobicity of a pair of residues (positive correlation), or a tendency to hydrophobicity of one residue along with a tendency to hydrophilicity of another (negative correlation). Table 4 contains Pearson's correlation coefficients (r) for hydrophaty index of pairs of positions in the R5, X4, and R5X4 groups. As expected, most of residue pairs covariate significantly with moderate or high correlation coefficients. However, pairs 602-640, 602-723, 619-640, 636-640, 640-662, and 640-756 correlated with higher r's (>0.4) in the R5X4 group than in the R5 and X4 groups (indicated with bold characters in the column R5X4 in Table 4). Of these, a positive correlation was observed for the 619-640 and 640-756 positions, in agreement with a hydrophobic tendency observed for these residues in R5X4 sequences (Fig. 2a). Instead, negative correlations were observed for the 602-640, 602-723, 636-640 and 640-662 pairs in the R5X4 group, accordingly with the opposite hydrophaty tendencies of these residues in this group observed before (Fig. 2a). Noticeably, position 640 participated in five of six of these covariations, emphasizing the importance of the hydrophobic character of the 640 residue for the R5X4 phenotype.
Correlation with r > 0.4 was also observed for the pairs 636-723 and 641-723 in both R5 and X4 groups (indicated with bold characters in the R5 and X4 columns in Table 4), but not in the R5X4 group, indicating that R5 and X4 sequences share hydropathy features at these positions.
Regarding charge, no correlations with r > 0.4 between positions were observed (Additional file 3: Table S3).

Discussion
Our results indicate that the R5X4 phenotype associates with a hydrophobic tendency of positions at the Cterminal half of the loop (619) the HR2 (640, 641), the so called minor ectodomain (723), and the putative MSD3 (756), as well as with a hydrophilic/charged tendency in a residue at the disulfide bridge region of the loop (602), and the HR2 (636, 662). The location of the nine positions belonging to the ectodomain is shown in the structure of the six-helix bundle in Fig. 4. Since this study is correlative, it does not necessarily implicates that these residues establish contact with coreceptor molecules, but only that hydrophobic or hydrophilic residues at these positions are more frequently harbored by R5X4 than R5 and X4 viruses. However, it can be speculated that they may contribute to virus phenotype by several mechanisms. Position 602 is the most variable site in the disulfide bridge region of the loop (Figs. 1 and  4). It is known that hydrophobicity of the loop is important for the stability of the gp120-gp41 association [27], so a hydrophilic residue at position 602 may favor gp120 shedding and fusion. Position 619 is part of the LEQleucine-glutamate-glutamine in the HXB2 strainhighly variable triplet located at the loop-HR2 boundary (Fig. 4). To our knowledge, there are no experimental studies regarding the role of this position. However, a more conserved fragment comprising nearby residues 579-613 of the loop (which includes the 602 residue) and another fragment containing the 619 amino acid, interact with and perturb cellular and model membranes [28][29][30]. It has been hypothesized that the loop may bind to and destabilize the host cell membrane, as well as stabilize the trimeric helical hairpin, then favoring the formation of the fusion pore [28]. Thus, a hydrophobic 619 residue in R5X4 strains may enhance the interaction of the loop with membranes. On the other hand, since the loop is part of a wide region composing the gp120-gp41 interface [27,31], it may influence the efficiency of gp120 shedding. It has been demonstrated that gp120 shedding requires the presence of CXCR4 [5], although a similar analysis for CCR5 is still lacking.
HR2 amino acids 636, 640, and 641 may participate in coreceptor recognition by interacting with the gp120 coreceptor binding site. The HR2-based peptide T-20 interacts with peptides derived from the bridging sheet [32], and can block the interaction of gp120-CD4 complexes with the CXCR4 coreceptor through binding a region near the base of the gp120 V3 loop [33]. Recently, Moseri and cols. showed that T-20 binds to the conserved region 4 of R5 gp120 trough mostly hydrophobic interactions [34]. On the other hand, the direct interaction of the gp41 ectodomain with the coreceptor molecule has been suggested by the observation that T-20 and the related T22 peptide, inhibited the binding to CXCR4 of the anti-CXCR4 HIV-blocking antibody 12G5 [35]. CXCR4, but not CCR5, contains a highly hydrophobic groove in the region located between the second and third extracellular loops. Since the second extracellular loop is critical for coreceptor function [36,37], this region represents a putative site for interaction with the hydrophobic residues of the gp41 ectodomain of R5X4 viruses. Finally, it is possible that residues 619, 640 and 641 of R5X4 gp41 proteins strengthen the interaction of this molecule with membrane lipids. HR1 and HR2 peptides interact with membrane vesicles and it has been proposed that they play an important role in the interaction of gp41 with the viral and cellular membranes during the opening of the fusion pore [38][39][40][41][42]. Current structural models indicate that residues 636, 640, 641 are not part of the HR1-HR2 interface in the six-helix bundle [43], so they would be exposed on this structure and available for membrane interactions in late stages of the fusion process, contributing to fusogenicity and pathogenicity of R5X4 viruses (Fig. 4). Importantly, correlation analysis revealed that the hydropathy index of pairs 602-640, 602-723, 619-640, 636-640, 640-662, and 640-756, covariate with higher correlation coefficients in the R5X4 group than in the R5 and X4 groups (Table 4), suggesting a complementary functionality of these residues for determination of the R5X4 phenotype. The positive covariation of the 619-640 and 640-756 pairs suggests a joint hydrophobic effect of these positions in R5X4 viruses for membrane lipid interactions (Fig. 2). On the other hand, the negative covariation observed for positions with opposed hydropathy tendencies (602-640, 602-723, 636-640 and 640-662) remarks the importance of the concurrence of hydrophilicity at positions 602, 636 and 662 (Fig. 2) for the R5X4 phenotype. In particular, the participation of position 640 in five of six covariations and the exposed position of this residue on the six-helix bundle structure (Fig. 4), suggest an important role of this residue for the R5X4 phenotype.
Residue 723 is part of a region in the C-terminal tail that may be transiently exposed on the surface virus and infected cells and is so called the minor ectodomain [23][24][25], while position 756 locates in a region that may constitute a third membrane spanning domain (MSD3) during exposition of the minor ectodomain [25,26]. A hydrophobic residue at this position may favor the exposure of the minor ectodomain, although with still unsuspected consequences.
A less restrictive analysis (Q FD = 0.1) rendered additional positions located at different domains of gp41 and again, only for the R5X4 vs. R5 comparison. Thus, statistical analysis suggests a role for gp41 in the R5X4 virus phenotype.  [47,48]. The coordinates of this structure are available in the Additional file 4: Figure S1 (Gp41 coordinates -Homology model) Our analysis of the relationship of the gp41 sequence with virus phenotype did not yield differences between the X4 and R5 groups. It is well known that V3 gp120 residues influence the macrophage-tropic R5 (M-R5) and T-cell tropic (T-X4) viral phenotypes [44,45], yet the role of V3 as a major determinant of phenotype is less clear in the case of dually tropic viruses [8]. Since our analysis was performed independently of the gp120 sequence, it is likely that we only observed residues influencing the R5-R5X4 shift in gp41, whereas residues in gp120 would be significant in determination of the R5 and X4 phenotypes.

Conclusions
R5 and R5X4 are the two main classes of viruses found in the circulation of patients with HIV-1 infection. Our analysis suggests that physicochemical properties of the variable amino acid residues at positions 602, 619, 636, 640, 641, 662, 723 and 756 of gp41 may contribute to enhanced virus-host membrane fusion of R5X4 viruses respective to R5 viruses.

HIV-1 sequences
A total of 2823 gp41 amino acid sequences from all main subtypes with defined coreceptor usage available in Los Alamos HIV database (19) were downloaded as follows: 2346 R5, 197 X4 and 280 R5X4. Consensus were constructed for homologous sequences (i.e. those derived from the same patient and having the same tropism), by using the Consensus Maker software available in Los Alamos HIV database website (19). As a result, a set of 773 sequences was obtained and classified according to coreceptor usage: 621 R5, 73 X4, and 79 R5X4. Table 1 presents the relative abundance of consensus sequences from strains with a given coreceptor tropism in the main genetic subtypes. Sequences from B and C clades were the most abundant and belonged mainly to the R5 group (81.4 and 87.2 %, respectively). Recombinant subtypes were grouped together in "others". Sequences were aligned with respect to the reference HXBc2 strain by using the Clustal W subroutine of the MEGA 5.2 software.

Entropy determination
The softwares Entropy-one and Entropy-two available from the Los Alamos HIV database were employed to localize non-conserved regions of gp41 by evaluating Shannon's entropy (S k ) for each aligned position: where f(r, k) is the frequency of the residue r at position k. Entropy differences between groups at site k were calculated as S kB -S kA , where A and B designate either R5, X4, or R5X4 virus sequences. The entropy per site S k and the mean entropy S M for a set of sequences satisfy the relation where N is the total number of sites considered in the analysis.

Statistical analysis
Independence of the HI or Q distributions at a given amino acid position between coreceptor groups was determined by the Mann-Whitney U test. On the other hand, the hypothesis of linkage of coreceptor usage with the hydrophobic/hydrophilic or charged/uncharged character of residues was tested by means of a χ2 analysis. Correction for multiple tests was performed by means of Benjamini-Hochberg procedure [46] by considering either false discovery rates Q FD = 0.05 and Q FD = 0.10.

Correlation analysis
A covariance analysis was performed on HI and Q values for pairs of statistically significant positions. Covariation was expressed in terms of Pearson's correlation coefficient r.