Topography Prediction of Helical Transmembrane Proteins by a New Modification of the Sliding Window Method

Protein functions are specified by its three-dimensional structure, which is usually obtained by X-ray crystallography. Due to difficulty of handling membrane proteins experimentally to date the structure has only been determined for a very limited part of membrane proteins (<4%). Nevertheless, investigation of structure and functions of membrane proteins is important for medicine and pharmacology and, therefore, is of significant interest. Methods of computer modeling based on the data on the primary protein structure or the symbolic amino acid sequence have become an actual alternative to the experimental method of X-ray crystallography for investigating the structure of membrane proteins. Here we presented the results of the study of 35 transmembrane proteins, mainly GPCRs, using the novel method of cascade averaging of hydrophobicity function within the limits of a sliding window. The proposed method allowed revealing 139 transmembrane domains out of 140 (or 99.3%) identified by other methods. Also 236 transmembrane domain boundary positions out of 280 (or 84%) were predicted correctly by the proposed method with deviation from the predictions made by other methods that does not exceed the detection error of this method.


Introduction
Problem and relevance of the study of membrane proteins, including GPCRs, are as follows. Membrane proteins are responsible for many cellular functions and processes, in particular ensuring the selective exchange of substances between the cell and its environment, maintaining the electric potential inside and outside the cell, and providing the transfer of electric signals into and out of the cell. They participate in nearly all energy transduction processes in the organism.
Protein functions are specified by its three-dimensional structure, which is usually obtained by X-ray crystallography [1,2]. This method is directly applied to protein crystals, which must be produced beforehand using a very complex and laborious technique. The difficulty of handling membrane proteins during their production, purification, and crystallization due to protein instability, unfolding, aggregation, and heterogeneity has made it hard to solve their structures experimentally and to date the structure has only been determined for a very limited part of membrane proteins (<4%).
It is supposed that all information about the ultimate structure of a protein is contained in its amino acid sequence. Therefore, methods of computer modeling based on the data on the primary protein structure or the symbolic amino acid sequence have become an actual alternative to the experimental method of X-ray crystallography for studying the structure of membrane proteins [3].
From the variety of membrane proteins, the group of integral polytopic proteins (transmembrane proteins, TMPs) with multiple hydrophobic sites, domains permeating the membrane, is of considerable interest. Many of these proteins function as gateways or "loading docks" to transport specific substances and relay signals across the biological membrane.
If the repetition of transmembrane regions is aperiodic, it can be revealed by another method, that is, the method of the reiterated (four to five times) averaging of the protein hydrophobicity function in a window within the limits of 9-11 amino acids that moves along the sequence. This method is a novel advanced version of the known method of sliding window, which has been proposed and used in our previous work [4] to investigate the secondary structure of different membrane proteins.
The aim of the present work is to apply this method for the prediction of the characteristics of unknown secondary structures of TMPs, mainly of GPCRs; these characteristics specify the functional properties of the proteins.
G protein-coupled receptors (GPCRs), also known as seven-transmembrane domain receptors, comprise the largest family of membrane proteins in the human genome and the richest source of targets for the pharmaceutical industry [6].
Over 800 unique GPCRs have been revealed from human genome sequence analysis, approximately 460 of which are predicted to be olfactory receptors [7,8]. The physiologic function of a large fraction of these 800 GPCRs is unknown. There are many obstacles to obtaining structures of GPCRs by X-ray crystallography; the major difficulties include poor protein stability and absence of homogeneity during crystallization due to inherent properties of these receptors [6,9,10].
Therefore, it is necessary to develop novel approaches in structurally resolving aspects of their biology [11][12][13]. One of such useful approaches is to screen these proteins with help of structural bioinformatics and methods of computer modeling to identify those of them with the best characteristics for structural studies and for crystallography trials.

Materials and Methods
We used the method of reiterated averaging hydrophobicity function within a sliding window over the amino acid sequence. Since TM domains (TMDs) consist predominantly of hydrophobic amino acids, it is evident that the average hydrophobicity for this region, as specified in the protein sequence by a function ( ) = [ ( )] of amino acid number in the sequence, must be higher than that for both hydrophilic topological domains (TPDs) adjacent to it. Furthermore, this local property does not depend on the periodicity of the arrangement of characteristic TMDs and TPDs in the amino acid sequence. Here, ( ) = 1, 2, . . . , 20 is the number of amino acids of the 20 known (Table 1), which is located at position in the protein sequence.
For the first time, this idea was realized in [14], where averaging of the function ( ) within the limits of a segment, or window of width = 5, 7, 9, 11, or 13 amino acids, moving along the amino acid sequence, was used. The result of averaging was assigned to a member of a new numerical sequence 1 ( ) with number corresponding to the current position of the average segment point.
The scale of hydrophobicity ( ) used in this method can be specified in different ways (Table 1) depending on the physically measured value that characterizes this property [14][15][16][17][18][19][20]. In [14][15][16], the change of value of free energy of amino acid side groups upon their transfer into water from a hydrophobic medium was used as a measure of hydrophobicity. In [17,19], the measure (scale) of amino acid hydrophobicity was defined as the function 4 ( ) = 1 − ⟨ ⟩/ 0 (Table 1) based on the values of the amino acid surface area 0 ( ), which is available to solvent in the standard state, and the mean solvent accessible surface area ⟨ ( )⟩ in a folded protein conformation. In [17], the correlation between the free energy value and the surface area available to solvent was established.
The set of 20 amino acids can be divided into a few characteristic groups based on their degree of hydrophobicity by different ways. Thus, according to [19], we used the division of 20 amino acids into three groups by the degree of hydrophobicity, including hydrophobic (C, F, I, L, M, V, and W, seven in total), hydrophilic (D, E, G, K, N, P, Q, R, S, and T, ten in total), and neutral (A, H, and Y, three in total). The hydrophobic amino acids were assigned a value of +1, the hydrophilic amino acids were assigned a value of −1, and the neutral amino acids were assigned a value of 0. Thus, we obtained the crude scale 3 ( ) in Table 1. On another crude scale 2 ( ) the hydrophobic amino acids were assigned a value of +1, and the remaining amino acids were assigned a value of 0.
In our previous work [4], we proposed the procedure, different from that used in [14], for averaging the function ( ) on the scale ( ). The averaging was carried out not once, but repeatedly, using the algorithm where every new averaging was performed on the previous function −1 ( ) over a window with a greater width = 2 + 1; thus, the first averaging was over three elements, the second one was over five elements, and so on. In our opinion, the best result was obtained at = 4 and the averaging over the window of width = 9 amino acids (sometimes at = 5 and = 11 amino acids).
It is interesting to compare the values of the functions ( ) with the characteristic value of the initial hydrophobicity function 0 ( ) = ( ), its arithmetic mean, calculated for the entire length of the protein chain For the major part of each hydrophobic region, in particular TMD, the correlation ( ) > must be performed, and in the hydrophilic region (TPD), a different correlation ( ) < must be performed. The scale and function of hydrophobicity can be specified in different ways (there are more than 30 known ones). Code Abbreviation Name 1 ( ), [14] 2 ( ), [19] 3 ( ), [19] 4 ( ), [17,19] 5 ( ), [18] 6 ( ), [16] 7 ( ) [ A comparison of different scales and functions of hydrophobicity carried out in our previous work [4] showed that the numbers and arrangements of transmembrane regions obtained upon their usage were often almost identical, even for very simple (rough) scales, for example, 2 ( ) and 3 ( ) (see Table 1). However, sometimes a particular scale can be preferable for a given protein due to the better resolution of closely spaced TMDs.  (1) was applied in this work to the group of membrane proteins, such as GPCRs, and to some other transmembrane -helical proteins.

Results and Discussion
To further test the predictions of our method, first it was used to examine 5 proteins with already known structure ( Table 2). Figure 1 shows the results of averaging the hydrophobicity function for the protein sequence P47871 on the scale 5 ( ) in Table 1. Obviously, a hydrophobic segment in the form of a narrow peak relating to the signal peptide (SP) is present on the left edge of the graph of the function 4 ( ). If this peak is excluded, the remaining seven wide peaks that exceed the mean level = const = 0.27 will just correspond to 7 TMDs in the resolved structure of this protein [21,22]. In the graph of the function 2 ( ) the 2nd, the 3rd, the 5th, and the 7th TMDs have not been resolved yet, and there are several narrow peaks in their places. Figure 2 shows the results obtained for the protein sequence P34998 using the relatively rough hydrophobicity scale 3 ( ) in Table 1. Apparently, a hydrophobic segment relating to the SP is revealed on the left edge of the graph of the function 5 ( ) above the mean level = ⟨ ( )⟩ = −0.05, and also, in contrast to the function 2 ( ), all 7 TMDs known for the protein structure P34998 [21, 23] are resolved.
The boundaries of TMDs of different proteins were determined by the intersection of the graph of the function ( ) with the straight line of some level = const (e.g., the mean level = ⟨ ( )⟩ for the whole protein sequence). They are summarized in Table 2 for 5 known proteins.
The TMD boundaries from [21] are also shown for comparison in Table 2.
Taking into account the errors Δ ≈ /2 ≈ 5 ⋅ ⋅ ⋅ 6 of the TMD boundary detection, good agreement of the results of the TMD boundary position calculations with the data from [21] can be obtained. Indeed, according to Table 2, 34 TMDs out of 35 were resolved (or 97%); the obtained TMD boundary positions do not exceed the detection errors (Δ ≤ 6) for 62 out of 70 boundaries (or 89%).

Remark 1.
In the protein with a code P41595, the 2nd and the 3rd domains not resolved in calculating can be resolved using the outer boundaries of the combined segment of 89-151 aa by adding to the left border = 89 and subtracting from the right border = 151 the estimated average length of a domain 20 aa, as shown in Table 2 in a bold font.   Table 2 after averaging at = 2 and = 4 on the scale 5 ( ) in Table 1; dotted line shows the level = const = 0.266.
In [21], a signal peptide (SP) consisting of 1-25 aa of a protein sequence is indicated in the structure of the protein P47871. In this part of the protein chain, the hydrophobic region of 11-23 aa was detected by the proposed method. Similarly, the sequence of the protein P34998 [21] contains a signal peptide consisting of 1-23 amino acid residues. The proposed method was helpful to reveal here the hydrophobic region of 9-19 aa.
It is worth noting that processing with reiterated (four to five times) averaging of the hydrophobicity function ( ) on different scales (the rough scales 2 ( ) and 3 ( ) or the more  Table 2 after averaging at = 2 and = 5 on the scale 3 ( ) in Table 1 precise scales 4 ( )-7 ( )) produces different values for the TMD boundaries. Sometimes these differences are minor, but sometimes they are significant [4].

Comparison of Protein Secondary Structure Predictions
Made by the Proposed Method and Other Techniques. Secondary structure predictions of a set of 20 membrane proteins belonging to a class of GPCRs performed using the new proposed method were compared with the predictions made by other methods (Table 3).  As can be seen from Table 3, the proposed method allowed revealing 139 TMDs out of 140 (or 99.3%) identified by other methods. In the protein P35414 (the last one in Table 3) the 6th and the 7th domains "merged" into one long stretch of 246-312 aa. However, taking into account Remark 1, the boundaries of these two domains can be easily recovered using the outer boundaries of the combined segment by adding to the left border = 246 and subtracting from the right border = 312 the estimated average length of a domain 20 aa, as shown in Table 3 in a bold font.
236 TMD boundary positions out of 280 (or 84%) were predicted correctly by the proposed method with deviation from the predictions made by other methods that does not exceed the detection error of this method (Δ ≤ 6).
In [21], a signal peptide (SP) consisting of 1-21 aa of a protein sequence is indicated in the structure of the protein P25116. In this part of the protein chain the hydrophobic region of 6-17 aa was detected by the proposed method. Similarly, the sequence of the protein Q99835 [21] contains a signal peptide consisting of 1-27 amino acid residues. The proposed method was helpful to reveal here the hydrophobic region of 13-23 aa.

Predictions of Unknown Secondary Structure of GPCRs and Other Membrane Proteins.
Then the proposed method of multiple averaging of hydrophobicity function was used to predict the location of hydrophobic regions, including TMDs, in several GPCRs with unknown structure. The results are shown in Table 4.
At least two hydrophobicity scales ( ) were applied to make predictions for each of the 5 proteins. Obviously, these predictions are consistent with each other for most of the domain boundaries considering the detection errors Δ = ±6.
For the protein B5D0C2 the calculation on the 5 ( ) scale resolved the 3rd and the 4th domains, but the application of the 3 ( ) scale did not resolve these domains; they merged into a single domain. And it was vice versa for the protein M9TID6 with the 6th and the 7th TMDs. Taking into account Remark 1, the boundaries of unresolved domains can be restored, as shown in Table 4 in a bold font.
Surprisingly, for the protein Q76L88 given that ( ) is higher than the mean level = ⟨ ( )⟩, only 6 domains were surely detected instead of 7 as for other proteins in Table 4.
The results of prediction of TMDs using the proposed method are shown in Table 5 for 4 -helical membrane proteins of unknown structure. The first two proteins (P71044 and P49785) belong to the group of channels: intercellular, the third one Q8TMG0 to the group of methyltransferases, and the fourth one P77335 to the group of adventitious membrane proteins: alpha-helical pore-forming toxins.
Here, as well as in Table 4, the predictions were made on at least two hydrophobicity scales ( ). Evidently, these predictions are consistent with each other for all domain boundaries considering the detection errors Δ = ±6. Individual single domains predicted earlier by other methods [21] were also identified by the proposed method. Table 6 shows data comparison from [21] with prediction of TMDs made by the proposed method for the long ( = 2424 aa) -helical membrane protein from the group of adventitious membrane proteins: alpha-helical pore-forming toxins. Obviously, compliance between the predictions takes place for most of TMDs considering errors in determining their boundaries Δ ≤ 6.
In the calculation using the proposed method of multiple averaging of hydrophobicity function over a sliding window, besides those domains indicated in Table 6, a hydrophobic region of 16-28 aa was identified, which may belong to a signal peptide (SP) or may be the 1st one out of 24 TMDs of the present protein. Moreover, it is obvious that TMDs numbered in [21] as 5, 11, 17, and 23 and highlighted in Table 6 by a bold font in our prediction have the numbers, which are one less than in [21], but other domains that are not specified in [21] have the numbers, which are one more. Thus, two  varied predictions in Table 6 have great similarities as well as notable differences.

Conclusions
The first membrane protein topology prediction algorithms were based solely on the hydrophobicity plots, for example, [14,16,18], and it seemed that the performance of these early methods was rather poor in practice. Hence, they soon were supplied by novel statistical, machine-learning methods, which use hundreds of free parameters extracted from databases of experimentally mapped topologies [13,27]. However, as it is stated in [27], the translocons (cellular machineries) responsible for membrane-protein biogenesis do not have access to statistical data but rather exploit molecular interactions to ensure that membrane proteins attain their correct topology. Therefore, as it is concluded in [13], those methods which are based on the same physical properties that determine translocon-mediated membrane insertion, by using properly scaled hydrophobicity values, may access the same level of prediction accuracy as the best statistical methods. Thereby, here we presented the results of the study of 35 transmembrane proteins using cascade averaging of hydrophobicity function within the limits of a sliding window, as expressed in formula (1).
In the work [4], the proposed method was successfully applied to predict the location of TMDs, secondary structure elements of a number of membrane proteins, in particular, bacteriorhodopsin, halorhodopsin, sensory rhodopsin 2, some connexins, and others.
In the current work, this method was used to analyze the arrangement of the hydrophobic regions, including the transmembrane domains of another protein class, primarily GPCRs. At first, the method was tested on 5 known proteins of this class. Then an additional comparison of TMDs location predictions made by the proposed method and some other methods [21] was carried out on 20 proteins of the same class. These verifications confirmed the applicability of the proposed method for the stated purposes.
Whereupon, this method was used to predict the TMDs in proteins with unknown structure, namely, 5 GPCRs and 5 -helical transmembrane proteins of other classes. For 9 out of 10 of these proteins (Tables 4 and 5) concordant predictions were made using at least two different hydrophobicity scales. The prediction made by the proposed method for a very long protein (Table 6) is consistent largely with the prediction made by another method [21].
These facts indicate the applicability and usefulness of the new method presented in our work [4] and proposed here.