Evaluation of Transmembrane Protein Structural Models Using HPMScore

Abstract: Transmembrane proteins (TMPs) are a class of essential proteins for biological and therapeutic purposes. Despite an increasing number of structures, the gap with the number of available sequences remains impressive. The choice of a dedicated function to select the most probable/relevant model among hundreds is a specific problem of TMPs. Indeed, the majority of approaches are mostly focused on globular proteins. We developed an alternative methodology to evaluate the quality of TMP structural models. HPMScore takes into account sequence and local structural information using the unsupervised learning approach called hybrid protein model. The methodology was extensively evaluated on very different all-α TMPs. Structural models with different qualities were generated, from good to bad quality. HPMScore performed better than DOPE in recognizing good comparative models over more degenerated models, with a Top 1 of 46.9% against 40.1% for DOPE, both giving the same result in 13.0% of cases. When the sequence identity of the alignments is higher than 35%, HPM is the best for 52%, against 36% for DOPE (12% for both). These encouraging results need further improvement, particularly when the sequence identity falls below 35%. An area of enhancement would be to train on a larger training set. A dedicated web server has been implemented and provided to the scientific community. It can be used with structural models generated by approaches ranging from comparative modeling to deep learning.


Introduction
Protein structure knowledge allows the atomistic understanding of biological mechanisms. Nonetheless, most of the available protein structures in the Protein DataBank (PDB) [1] are globular. Indeed, despite their great functional importance, e.g., 20% of all human proteins [2], transmembrane proteins (TMPs) represent less than 0.7% of the PDB (as of 8 December 2020). They are implicated in a large series of pathologies [3] and are targeted by more than 60% of current drugs [4]. Thus, methods to propose efficient structural models of TMPs are of high importance [5,6].
Although the number of templates was limited, comparative modeling and de novo methods have been applied to TMPs, and deep learning protein structure predictions are now used with the most recent developments [7,8]. Whatever the approach, the major challenge is to detect the structural model with the closest conformation to the native structure, which is accomplished by the so-called model quality assessment programs (MQAPs). By definition, free energy potentials would theoretically allow this selection. Physics-based potentials taken from molecular mechanics [9,10] might be considered. This was actually proposed by Feig's group [11], which calculated the energy of models as a sum of the force field conformational energy of the membrane protein plus the interaction energy of the protein with an implicit model of the membrane environment. A web server (not available at the present time) was developed to calculate what is designated as memscore. As stated by the authors, the strategy was rather good for decoys close to the native state, but further improvements are required for models further from the native state. Thus, even when accounting for the membrane environment, force-field-based scoring functions are not the most efficient ones in practice because most of them are not calibrated on free energies.
BioMedInformatics 2023, 3, 307
The statistical potentials derived from experimentally determined protein structures remain the most efficient ones. MQAPs can be divided into different approaches; the most important ones take into account the local 3D environment of the protein structures. Briefly speaking, the scoring is based on the counting of the observed contacts and compared to a reference. However, although based on the same spirit and the same datasets, the formalism of the scoring function itself may be very different (see [12]). In this field, the most widely used was discrete optimized protein energy, or DOPE [13], which was implemented in the Modeller software [14,15]. It is mainly based on the distance between atoms in the analyzed models compared to the ones observed in the dataset of reference. Prosa [16] and its latest incarnation Prosa-web [17] are based on a classical potential of mean force; the output provided by Prosa-web was interesting as it compared the quality of the structural models in regard to a large dataset of X-ray and NMR structures. Verify3D proposed a slightly different view by considering the compatibility of the model (3D) with its sequence (1D) by looking at the environment (secondary structure, hydrophobicity, etc.) as seen in known structures [18,19]. Since this first generation, different improvements have been introduced; they consisted of adding different parameters, such as the residue distance, solvent accessibility and secondary structure content [20][21][22][23]. The weighting of these parameters was optimized with artificial neural networks, support vector machines or machine learning approaches [24][25][26][27]. Consequently, they were in general more dependent on the training procedure and on the training set than classical approaches.
TMP structural models have often been assessed using this approach. However, these MQAPs were often optimized on water-soluble proteins that bathe in a homogeneous environment. In the case of TMPs, the situation is more complex because they are in contact with two very distinct environments: a water environment for the soluble part of the protein and the lipid environment for the membrane-embedded region, and even a third one corresponding to the membrane interface. This also corresponds to a striking difference in the amino acid distribution of TMPs [28]. Thus, to make sure these specificities were taken into account, the IQ method was proposed. It is based on the analysis of four types of inter-residue interactions within the transmembrane domains [29]. The ProQM approach used support vector machines trained on contacts, solvent-accessible surface, secondary structure, topology of the TM region, Z-coordinate, and evolutionary information [30,31]. It was sensitive to the side-chain positioning.
MEMEMBED is a dedicated statistical potential that considers the membrane depth of residues [32]. More recently, MAIDEN proposed an interesting and innovative development, computing the interatomic distance between all 20 standard residue types, focusing on intramembrane residues [33,34]. QMEANBrane is a simpler approach that also uses the delineation of a theoretical membrane region to focus on the transmembrane region [35]. It was only tested on a GPCR, while MAIDEN was tested on the most diverse set of protein folds.
In the RosettaMembrane/RosettaMP approach [36][37][38], a specific function for TMPs has been established in a Rosetta way, namely the force field is a linear combination of a Lennard-Jones potential to model the VDW interactions, a backbone torsional term, a knowledge-based pair interaction term for the electrostatic interactions, reference energies to normalize the overall amino acid composition, an implicit atomic solvation term, and an orientation-dependent hydrogen bonding term [39]. This development is highly dependent on the specific generation of the models by Rosetta. All these scoring functions can only compare a set of equivalent structural models, but not different sequences. AlphaFold2 has its own quality evaluation schema, called pLDDT, for "predicted local distance difference test", which is a per-residue confidence metric [40]. pLDDT is not a score for comparing models but rather a local confidence measure of regions of the structural models [40]; it appears worse for qualifying regions in membrane proteins compared to those in globular proteins [8,41-44].
In a previous study [45], we learnt and analyzed the sequence-structure relationship of TMPs with an unsupervised learning approach, called the hybrid protein model (HPM) [46,47]. HPM was also shown to be efficient in analyzing globular proteins, e.g., building of overlapping local structural prototypes [48][49][50] or the prediction [51][52][53] of flexibility. HPM was used to analyze protein fragments present in a non-redundant databank of all-α transmembrane proteins. The method has many advantages, which are: (i) A simultaneous learning of sequence (polarity, volume, and hydrophobicity) and structures (ϕ and ψ dihedral angles) properties, e.g., distribution of amino acids associated with different local conformations; (ii) Unsupervised learning due to the given descriptors (sequence and structure), i.e., without any a priori; and (iii) The learning of the overlapping of protein fragments, taking into account the sequentiality (or continuity) essential in proteins, i.e., without any constraints. After a fine-tuning of learning parameters, the sequence-structure relationship was analyzed in light of a structural alphabet, called protein blocks [54,55], underlining two helical regions with very different hydrophobic patterns, identifying groups with properties specific to extremities of helices, or to loops, or to helices. Moreover, some groups showed preferential localizations for the periphery of the membrane or inside the membrane. This can be used for annotation as channel/non-channel, but also for the assessment of the quality of structures and structural models.
In this study, we have generated a large set of structural models ranging from very good to poor models for a variety of folds. The models were evaluated using classical root mean square deviation (RMSD) and GDT_TS. The latter is the most classical reference metric for comparing diverse structural models [56]. Its interest is to limit the influence of poorly modeled substructures for the protein considered. We also used the well-known DOPE scores [13], as using them is one of the most classical approaches to selecting protein structural models through comparative/homology modeling.
In some aspects, the HPM approach can be related to the Verify3D methodology, which encompasses sequence, structure, and environment properties to evaluate the compatibility of a given sequence with a given 3D structure. The Verify3D approach was never dedicated to TMPs. Unlike other approaches, HPM does not need to localize helical regions, and it takes the connecting loops into account. We then compared the discrimination of the quality of the models obtained with HPMScore values to that obtained with DOPE scores, and we propose a dedicated web server, HPMScore (https://www.dsimb.inserm.fr/dsimb_tools/hpmscore/index.php, accessed on 1 March 2023).

Protein Structure Dataset
The membrane protein dataset was derived from the HOMEP dataset [57]. This set of proteins is composed of 76 membrane proteins, separated into 23 categories, depending on their biological function (https://zenodo.org/record/2646540#.Y7b99C3pNTY, accessed on 1 March 2023). This dataset was completed by 13 GPCR structures. The entire dataset is composed of 89 proteins. The protein structures composed of an all-α transmembrane domain were taken from the PDB [1]. For analysis purposes, the number of TM domains and their boundaries over the whole protein sequence were predicted using the PPM web server or directly imported from the orientation of protein in a membrane (OPM) database [58,59]. We defined three main categories of protein structures according to the transmembrane content: large (more than 40% of amino acids associated with the transmembrane domain), medium (between 15% and 40%), and few (less than 15%). Please note that HOMEP was later expanded in EncoMPASS [60].

Generation of Alternative Structural Models
We have generated a large set of structural models ranging from good quality to bad, i.e., to mimic what often happens in daily research. For a given protein, the original sequence from the PDB was extracted and duplicated to create an ideal alignment where the template and the target sequence are initially identical. The alignment was further processed to reproduce point mutations or gap insertions using two strategies. First, we created a similar sequence by randomly picking an amino acid position and exchanging it with another position. This procedure kept the amino acid composition, but varied the sequence identity with the template sequence. The procedure was repeated until a target percentage of identity was obtained or a maximum number of iterations was reached. This iteration number was set arbitrarily at twice the length of the amino acid sequence to save time. The second strategy consisted of perturbing the alignment by random gap modifications, up to 5 random gaps of length between 1 and 8, either on the parent sequence or on its children. Once the alignment was produced, its overall percentage of identity was calculated using BioPerl [61]. The structural models for each alignment were created using Modeller v9.18 [14,15] (the entire process of generation and evaluation of structural models is presented in Figure A1).
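The first strategy (position swapping) can be illustrated with a minimal Python sketch. It is not the authors' script: the function names, the use of a seeded random generator, and the choice to stop as soon as the identity is at or below the target are assumptions made for illustration. Only the swap principle and the iteration cap of twice the sequence length come from the text.

```python
import random

def percent_identity(seq_a, seq_b):
    """Percentage of identical positions between two equal-length sequences."""
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)

def degrade_by_swaps(sequence, target_identity, rng=None):
    """Swap randomly chosen positions until the identity to the original
    sequence drops to target_identity (in %) or the iteration cap is hit.
    Swapping keeps the amino acid composition unchanged."""
    rng = rng or random.Random()
    shuffled = list(sequence)
    max_iterations = 2 * len(sequence)  # cap described in the text
    for _ in range(max_iterations):
        if percent_identity(shuffled, sequence) <= target_identity:
            break
        i, j = rng.sample(range(len(shuffled)), 2)  # two distinct positions
        shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
    return "".join(shuffled)
```

Because of the iteration cap, the returned sequence may still be above the requested identity; the final identity of each alignment is therefore computed afterwards, as described above.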

Assessment Scores
DOPE scores [13] are directly provided by Modeller [14,15]. HPM scores [45] are determined as follows: (i) The protein structures are cut into fragments of length L (L = 13, as obtained in [45] and recommended from previous studies [46,47,54], see next paragraph); (ii) Each fragment is translated in terms of polarity, volume, and hydrophobicity for their sequence and in the cosine and sine functions of their dihedral angles for their structure; (iii) The fragment and its local environment are then compared to each position of the optimal HPM matrix (determined in [45]); (iv) The maximal score provides the best matching between this position and the HPM matrix that reflects our current knowledge of TMP sequence-structure relationship. The HPMScore value is the sum of all these maximum scores. For further analyses, local DOPE and local HPMScore values were also investigated per domain, i.e., transmembrane region or not, using the segments defined as membranous in OPM [58,59].
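Steps (i) to (iv) can be sketched as follows. This is an illustrative simplification, not the HPMScore implementation: the real HPM sites and the exact scoring formula are defined in [45], whereas here each HPM site is assumed to be a plain numeric vector, the match is assumed to be a squared Euclidean distance (so the best match is the smallest value), and the HPM is scanned linearly. All function names are hypothetical.

```python
import math

FRAGMENT_LENGTH = 13  # L = 13, as stated in the text

def residue_vector(polarity, volume, hydrophobicity, phi_deg, psi_deg):
    """Describe one residue by its sequence properties and by the cosine
    and sine of its phi/psi dihedral angles."""
    phi, psi = math.radians(phi_deg), math.radians(psi_deg)
    return (polarity, volume, hydrophobicity,
            math.cos(phi), math.sin(phi), math.cos(psi), math.sin(psi))

def window_distance(fragment, sites):
    """Assumed match measure: squared Euclidean distance between a fragment
    and an equally long window of HPM sites."""
    return sum((a - b) ** 2
               for residue, site in zip(fragment, sites)
               for a, b in zip(residue, site))

def hpm_like_score(residues, hpm_sites, length=FRAGMENT_LENGTH):
    """Sum, over all overlapping fragments, of the best (smallest) distance
    to any window of the HPM. Returns 0.0 if the chain is shorter than
    one fragment."""
    total = 0.0
    for start in range(len(residues) - length + 1):
        fragment = residues[start:start + length]
        total += min(window_distance(fragment, hpm_sites[pos:pos + length])
                     for pos in range(len(hpm_sites) - length + 1))
    return total
```

With a distance-based measure, a lower total indicates a better agreement with the learned sequence-structure relationship.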
From a practical point of view, HPM depends on its total length and the length of the fragments presented. These two parameters were tested in [45] to end with a total length of 100 and fragments of L = 13 positions. Over several simulations, these two choices made it possible to have a sufficient number of occurrences at each position, and also two distinct types of helices. Then, with these parameters, 100 independent simulations were carried out with a high learning rate, similar to the self-organizing maps (SOM) type [62,63]; this high value limits the importance of the initialization. The most central HPM (with a minimum distance from all the others) was then taken up as a new initial HPM for a new training. Here, the learning coefficient was quite limited to fix the optimal HPM. These two stages have a strong analogy with the two main phases of learning in SOMs, i.e., diffusion then specialization.
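Selecting the most central of the 100 trained HPMs amounts to a medoid search. The sketch below assumes that "minimum distance from all the others" means the smallest summed distance, and it leaves the distance between two HPMs as a user-supplied function, since the text does not specify it; the function name is hypothetical.

```python
def most_central_index(models, distance):
    """Return the index of the model whose summed distance to all the
    other models is the smallest (the medoid of the set)."""
    best_index, best_total = None, float("inf")
    for i, candidate in enumerate(models):
        total = sum(distance(candidate, other)
                    for j, other in enumerate(models) if j != i)
        if total < best_total:
            best_index, best_total = i, total
    return best_index
```

For example, with toy one-dimensional "models" and an absolute difference as the distance, `most_central_index([0, 2, 3, 10, 11], lambda a, b: abs(a - b))` selects the third element.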

Data Analyses
The 3D structure representation is generated using the PyMOL software (http://www.pymol.org, accessed on 1 March 2023) [64]. The protein superimposition was carried out using the iPBA software [65] based on the protein block description [54]. RMSD was computed using ProFit [66], through the iPBA software. In the following step, the computation of the GDT_TS and PBscore alignment was performed [65]. TMalign was also used for comparison [67]. The GDT_TS value is a reference metric for comparing diverse structural models [56]. It combines several distance cutoffs, from close to large local deviations, to limit the influence of poorly modeled substructures for the protein considered. An ideal GDT_TS value is 100 for a "perfect" match between the model and the experimental structure; the worst value is 0. For each experiment, the best model is defined by the highest GDT_TS with regard to the true 3D structure. It is named "G-model" in the following.
Most of the analyses were carried out using the Python language and R software [68]. We have made available a companion website that contains a large number of analyses (https://clipperton.ufip.univ-nantes.fr/hpmeval/, accessed on 1 March 2023). The analyses can be viewed at the level of the whole dataset, but also by single protein and by protein type.
Various data analyses have been performed. The most classic is the calculation of the Top 1, Top 5, and Top 10. The metric is simple and corresponds to the number of times that, for the same simulation, the HPM or the DOPE method selects the best model. For Top 1, it is a direct comparison, while for Top 5 and Top 10, it is the best model as selected by DOPE and/or HPM within their best 5 and 10 scores. The only specificity of these results is that sometimes DOPE and HPM can select the same result (hence the category HPM and DOPE).
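The two measures used throughout the analysis can be sketched in Python. This is a simplified illustration, not the pipeline used in the study: `gdt_ts` takes per-residue Cα distances from a single superimposition, whereas the reference GDT_TS searches for the best superimposition at each of the 1, 2, 4 and 8 Å cutoffs, and the tie rule in `compare_top_n` (equal best GDT_TS counted as "HPM and DOPE") is an assumption. The dictionary keys are hypothetical names; both scores are treated as "lower is better".

```python
def gdt_ts(distances):
    """Approximate GDT_TS (0-100) from per-residue distances in angstroms:
    mean fraction of residues within 1, 2, 4 and 8 A."""
    cutoffs = (1.0, 2.0, 4.0, 8.0)
    n = len(distances)
    fractions = [sum(d <= c for d in distances) / n for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)

def best_gdt_in_top_n(models, score_key, n):
    """Best GDT_TS among the n models ranked first by the given score."""
    ranked = sorted(models, key=lambda m: m[score_key])  # lower is better
    return max(m["gdt_ts"] for m in ranked[:n])

def compare_top_n(models, n=1):
    """Say which method gives the better model within its Top n for one
    alignment; models is a list of dicts with 'hpm', 'dope', 'gdt_ts'."""
    hpm = best_gdt_in_top_n(models, "hpm", n)
    dope = best_gdt_in_top_n(models, "dope", n)
    if hpm > dope:
        return "HPM"
    if dope > hpm:
        return "DOPE"
    return "HPM and DOPE"
```

Counting the three possible outcomes over all alignments gives percentages of the kind reported for the Top 1, Top 5, and Top 10.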

Scripting and Web Server of HPMScore
The original code of HPMScore was developed with the use of a local PDB reader coded in C language that generated a flat file with all the information (sequence in terms of polarity, volume, and hydrophobicity, and structure in terms of ϕ and ψ dihedral angles). The latter is used by the HPM program (also coded in C language) that performs the evaluation. A dedicated web server that encompasses all these properties is made available to the scientific community. It provides a simple interface with a clear visualization (https://www.dsimb.inserm.fr/dsimb_tools/hpmscore/index.php, accessed on 1 March 2023).
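To make the structural descriptors concrete, the following Python sketch computes a signed dihedral angle from four atom positions and converts ϕ and ψ into the cosine/sine form used by HPM. It is an independent illustration, not a port of the C reader; the function names are hypothetical. By the usual definition, ϕ is measured on the atoms C(i-1), N(i), Cα(i), C(i) and ψ on N(i), Cα(i), C(i), N(i+1).

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral_degrees(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) defined by four 3D points."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)
    # Project b0 and b2 onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    return math.degrees(math.atan2(_dot(_cross(b1, v), w), _dot(v, w)))

def angle_features(phi_deg, psi_deg):
    """Encode phi/psi as (cos phi, sin phi, cos psi, sin psi), which avoids
    the discontinuity between -180 and +180 degrees."""
    phi, psi = math.radians(phi_deg), math.radians(psi_deg)
    return (math.cos(phi), math.sin(phi), math.cos(psi), math.sin(psi))
```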

Generation of a Set of Structural Models for Sequences with Various Sequence Identities with Templates
The assessment of protein model quality is essential to guide computational biologists to select the best structure for further evaluation and analysis. The main idea was to simulate a large sampling of structural models derived from TMP resolved structures, ranging from sequences close to the sequence of a known structure to sequences far from any structural template sequences leading to very poor models, as it may occur. To mimic the drift of protein sequences through evolution, the initial protein sequence of each protein model was subjected to permutations or mutations to reach a given percentage of identity.
For example, a 100-amino-acid protein sequence will attain 99% sequence identity if one mutation is virtually performed, or 98% with a permutation, since two positions are exchanged between different amino acids. This degenerated sequence and the original protein structure are then used as inputs for Modeller [14] to produce 3D models of the "drifted" protein. We will detail below how the models are assessed using our original method, HPMScore [45], and DOPE [13].
From the dataset of 89 proteins, a total of 29,571 alignments were generated, which corresponds to an average value of 332 degenerated alignments per protein. This value depends on the protein length. The distribution of scrambled sequences ranked by sequence identity is shown in Figure 1. The average sequence identity is 38.9% (for a median of 32.55%), and the distribution reaches a peak for the 10-15% interval, with more than 3500 alignments available. As the generation of sequences with very low identity percentages (<10%) can be time-consuming, we limited the number of sequence generations, which resulted in a drop in this category. This distribution, which roughly resembles an extreme value distribution, shows that it is easier to generate sequences with low sequence identity than with high sequence identity. It also underlines the interest of categorizing 3 main classes of alignments: good for sequence identities higher than 75% (3682 sequences), bad for sequence identities less than 35% (15,786 sequences), and medium for the sequences between them (10,102 sequences). For each alignment, 25 models were built using Modeller [49].
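The three alignment classes can be expressed as a small Python helper. This is a sketch following the thresholds given in the text (higher than 75%, less than 35%, and in between); assigning the exact boundary values 35% and 75% to the medium class is an assumption, and the function names are hypothetical.

```python
from collections import Counter

def identity_class(identity_percent):
    """Classify an alignment by its sequence identity (in %)."""
    if identity_percent > 75.0:
        return "good"
    if identity_percent < 35.0:
        return "bad"
    return "medium"

def class_counts(identities):
    """Count how many alignments fall into each class."""
    return Counter(identity_class(value) for value in identities)
```

For instance, the average identity of 38.9% reported above falls in the medium class, while the median of 32.55% falls in the bad class.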
Figure 1. Distribution of sequence alignments. This histogram provides the distribution of sequence alignments percentage identity (%) between the true sequence and the simulated ones. A simulated sequence is classified as a good sequence if the sequence shares 75% or more sequence identity with the reference (in blue), as medium for a sequence identity above 30% and below 75% (grey), and as bad in other cases (<30%, in tan).
Thus, a particularly large number of structural models of very different quality have been proposed, allowing a broad view of all the different types of protein folding of TMPs. This approach allows the evaluation of HPMScore and its comparison with DOPE.


HPM Selects Better Models Than DOPE
To determine which model is the closest to the experimental structure, GDT_TS values [56] were computed for all models proposed from the degenerated sequences. In ideal situations, we should observe a correlation between the scoring functions and the GDT_TS values. Consequently, we addressed two questions: (i) What is the capacity of each scoring function to rank the model with the highest GDT_TS score first? and (ii) What is the quality of the best model (Rank 1) defined by each scoring function? For the first question, we found that both DOPE and HPM can identify the absolute G-model (the one with the highest GDT_TS) with a very limited prediction rate of 7.4% for HPM and 3.7% for DOPE. Although the capacity of each scoring function to identify the absolute G-model is limited, HPM appears slightly more efficient than DOPE.
This result still stands when addressing the second question, i.e., the quality of the model ranked best by each method. Indeed, the first model ranked by HPM has the higher GDT_TS score in 46.9% of cases, compared to 40.1% for DOPE, and both methods select the same model in 13.0% of the cases (see Table 1(A)). If the first 5 HPM or DOPE best scores are considered, HPM still outperforms DOPE (48.4% vs. 44.0%), and this situation stands true even if the first 10 models are considered (48.4% vs. 45.5%). This average lower sensitivity of DOPE may be attributed to a greater weight of loop regions in the scoring function. In contrast, when only transmembrane segments are taken into account (see Table 1(B)), DOPE slightly outperforms HPM (47.2% vs. 45.6%), but only if the best model is considered. Indeed, when more models are considered (Top 5 or 10 models selected by each method), the differences between the two scoring functions are small but systematically in favor of HPM (46.8% vs. 46.4%, and 47.0% vs. 46.3%, respectively).
In a second step, we examined the influence of the sequence identity on the capacity of identifying the best model and the quality of the ranked models for each method (see Table 2). For models produced with medium sequence identity (35-75% of sequence identity), or with high sequence identity, i.e., good sequence alignment (75-100%), the best ranked model by HPM largely outperforms the best ranked model by DOPE: in both the medium and good categories, HPMScore selects the better model in about 52% of cases, DOPE in 36%, and both methods select the same model in 12%. For sequences below 35% of sequence identity, considered as poor alignments, DOPE (43.8%) performs slightly better than HPM (42.6%), and both methods select the same model in 13.6% of alignments. In summary, for target sequences with a sequence identity compatible with comparative modeling (>35%), HPM is on average more effective than DOPE.
When the sequence identity decreases, the differences between the two scoring schemas are much lower, and slightly in favor of the DOPE scoring function. Please note that for poor alignment quality, it is difficult to be sure that the fold is properly preserved. It is certain that a significant number of cases are not correct TMPs.
Figure 2 illustrates an example of the putative metal-chelate type ABC transporter (PDB ID 2NQ2) and the relationship between the generated alignments, the HPMScore value of the corresponding structural models, and the structural approximation (evaluated here by the GDT_TS). The protein is a homodimer, each monomer being composed of a large transmembrane domain containing eight TM helices and an intracellular domain composed mainly of α-helices and a few β-sheets (Figure 2a). Only Chain A has been evaluated, since Chain B is similar. Figure 2b shows the dependence of the HPM score on the percentage of identity of the target sequences with the template sequences. Since the HPM score is equivalent to a distance, the lower the HPMScore value, the better it is. Figure 2b clearly illustrates the correlation between the HPMScore and the sequence identity. Figure 2c shows the quality of the models (evaluated with the GDT_TS score) obtained after a strong randomization of the sequence alignment. This figure points out that even with a low sequence identity, the GDT_TS score can be very high, which means that it is possible to keep a native fold. It is clear, however, that the better the alignment, the lower the standard deviation of the category (good, medium, and bad). Figure 2d highlights the correlation between HPMScore values and GDT_TS scores. It is clear that for the lowest HPMScore values (associated with good quality alignments), the structural approximation is the best. Moreover, the HPMScore is able to distinguish the best ones from the worst.
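A relationship such as the one between HPMScore and GDT_TS values can be quantified with a correlation coefficient. The text does not state which coefficient, if any, was computed, so the Pearson coefficient below is only one possible choice, written as a minimal Python sketch with a hypothetical function name.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)
```

Since a lower HPMScore value and a higher GDT_TS both indicate a better model, a good agreement between the two would appear as a negative coefficient.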

Hence, the example in Figure 2 shows the complexity of proposing structural models of different quality, but also how essential it is. TMPs are more often difficult cases than simple ones. The analysis of Top 1 to Top 10 shows that the HPMScore allows on average a better selection of models. The analysis of the alignments compatible with comparative modeling (medium and good quality) shows that the HPMScore gives a better selection in 52% of the cases, 12% are common with DOPE, and DOPE performed better in 36% of the cases. The difference is clear.

Assessment of Protein Model Quality
After evaluating how HPM correlates with a robust global measure, such as the GDT_TS, we go further in the evaluation of the quality of the models ranked by HPM and DOPE, respectively. An example of the models obtained for medium sequence identity to the reference protein (38%) is presented in Figure 3. The best model according to HPM is more compact and possesses slightly more secondary structures than the model with the best DOPE score. A closer inspection reveals a more consistent architecture of the seven transmembrane segments, a better orientation of the third intracytoplasmic loop characteristic of GPCR proteins, and the conservation of the extracellular loop involved as a lid for the ligand binding pocket. Overall, both models are of poor quality and would not be considered as sufficient for further use as support models, but the HPM-selected models are better candidates for further modeling studies. All analyses for all proteins have been made available on the companion site (https://clipperton.ufip.univ-nantes.fr/hpmeval/, accessed on 1 March 2023). The analyses can be viewed at the level of the whole dataset, but also by single protein and by protein type.

Web Server Usage and Example
The web server can be accessed at the following URL: https://www.dsimb.inserm.fr/dsimb_tools/hpmscore/index.php (accessed on 1 March 2023). The main page gives a short introduction and direct access to the section for uploading the structural models (see Figure 4). Two options are possible: (i) Analyzing models one by one (see Figure 4B); or (ii) Analyzing a set of models uploaded from an archive (see Figure 4A). Please note that structural models must be provided in a classical PDB format, as generated by Modeller [14], Robetta [69], RoseTTAfold [70], AlphaFold2 [40], I-Tasser and other classical approaches.
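For the second option, a set of models can be gathered into a single archive before upload. The Python sketch below assumes that a ZIP archive of `.pdb` files is accepted, which the text does not state; the archive format, the file extension, and the function names are assumptions to be checked against the server's own instructions. The check on ATOM/HETATM records is only a rough sanity test, not a PDB validator.

```python
import zipfile
from pathlib import Path

def looks_like_pdb(path):
    """Rough check: the file contains at least one ATOM or HETATM record."""
    with open(path, "r", errors="replace") as handle:
        return any(line.startswith(("ATOM", "HETATM")) for line in handle)

def bundle_models(model_dir, archive_path):
    """Write every plausible PDB file of model_dir into a ZIP archive and
    return the names of the files that were added."""
    added = []
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for pdb_file in sorted(Path(model_dir).glob("*.pdb")):
            if looks_like_pdb(pdb_file):
                archive.write(pdb_file, arcname=pdb_file.name)
                added.append(pdb_file.name)
    return added
```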
At the top, links to access other pages are found on all pages. The first page is the Home page (see Figure 4C), followed by a page of explanation of the HPM methodology (see Figure 4D), a concrete example of usage and analysis (see Figure 4E), and finally, the last page contains the contacts of the people involved in this research (see Figure 4F).
When the files have been loaded, the program launches an intermediate note page stating 'Please wait while HPMScore is computed'. Each job is associated with a temporary directory, which will be kept for two months.

The results page (see Figure 5) is divided into six main parts. The example proposed here can be found on the website, and corresponds to a putative Halorhodopsin with no known structure and less than 40% sequence identity to related ones, i.e., a classical case of structural modeling.
The first section lists the structural models by HPM score in descending order (the best being the first, see Figure 5A). Then, a histogram shows the distribution of the HPM scores of the different models (see Figure 5B). This information allows the user to carry out analyses, for example, to compare the best and the worst model, or other ranking questions. Like the DOPE score, Verify3D and PROSA, HPM computes a local score; it uses an overlapping sequence window of 13 residues. The third section provides this information for the first model as a plot (see Figure 5C). It can be used to compare alternative proposed conformations.
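The idea of a local score computed over overlapping windows of 13 residues can be illustrated with a minimal sketch. It is not the HPM computation itself (which relies on the HPM matrix described in [45]): the function names and the toy hydrophobicity score used here are assumptions made only for illustration.

```python
def local_profile(sequence, score_fragment, window=13):
    """Return one score per overlapping window of `window` residues.

    `score_fragment` is a hypothetical callable standing in for the real
    fragment score; consecutive windows are shifted by one residue.
    """
    if len(sequence) < window:
        return []
    return [
        score_fragment(sequence[start:start + window])
        for start in range(len(sequence) - window + 1)
    ]


def hydrophobic_fraction(fragment):
    """Toy fragment score (assumption): fraction of hydrophobic residues."""
    return sum(residue in "AILMFVW" for residue in fragment) / len(fragment)


# A 15-residue sequence yields 15 - 13 + 1 = 3 overlapping windows.
profile = local_profile("A" * 13 + "DD", hydrophobic_fraction)
```

Each residue away from the termini is covered by several windows, which is what makes a per-position plot such as the one in Figure 5C possible.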
The fourth part shows the 3D model in two orientations (and an extra one with the surface), rendered with the PyMOL software. The structural model is colored according to the quality estimated by the HPM score (see Figure 5D). The user can directly interact with the structural model (see Figure 5E). An essential point is the availability of an archive summarizing all this information (see Figure 5F), which can be downloaded locally. It contains all the information detailed here, and also provides it for every model not shown on the website. Structural models are provided with an HPM score. It is possible to observe them with visualization software, such as PyMOL. All of this information makes it easy to choose the model that seems the most relevant, knowing the difficulty of this type of question for transmembrane proteins.
Thus, the HPMScore webserver allows the specialist and the neophyte (it has been particularly used in several training sessions) to evaluate models in a simple way. It then allows visualizing the areas considered as the most successful. The specialist can also use it to go further in comparative modeling by combining multiple models according to their local HPMScore values.
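Combining multiple models according to their local scores can be sketched as follows. This is only an illustrative assumption of how such a selection could be done, not a feature of the web server: the function name and the input layout (one local-score profile per model, all of the same length) are hypothetical, and lower local values are assumed to indicate better regions.

```python
def best_model_per_position(profiles):
    """For each position, return the name of the model with the lowest local score.

    `profiles` maps a model name to its list of local scores (hypothetical
    layout). Ties are resolved in favor of the alphabetically first name.
    """
    names = sorted(profiles)
    lengths = {len(profiles[name]) for name in names}
    if len(lengths) != 1:
        raise ValueError("all profiles must have the same length")
    n_positions = lengths.pop()
    return [
        min(names, key=lambda name: profiles[name][position])
        for position in range(n_positions)
    ]


# Toy values, not real HPMScore output.
choice = best_model_per_position({
    "model_a": [1.0, 5.0, 2.0],
    "model_b": [3.0, 1.0, 2.0],
})
```

The resulting list indicates, position by position, which model would be retained; assembling an actual hybrid 3D model from such a selection would still require a dedicated modeling step.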

Use with Structural Models Coming from Different Approaches
We have assessed the interest of our approach based on comparative modeling, while new approaches of interest exist (see Figure 6). We have therefore built 3D structural models of the putative Halorhodopsin used in Figure 5 with the threading approach Phyre [71] and the deep learning approaches RoseTTAfold [70], ESMFold [72], and AlphaFold2 [40]. Other approaches were tested but could not provide complete models. We have only added the best Modeller results. This works well and shows the diversity and difficulty of proposing TMP structural models. HPMScore values are distances, so the lower the better. Hence, AlphaFold2 [40] is very far away (and close to ESMFold), with the highest HPMScore values. Phyre is intermediate, while our supervised Modeller model is the one associated with the lowest (best) HPMScore value. Interestingly, RoseTTAfold is not too far away but has a wrong local topology. This last example clearly underlines the interest of HPMScore, which is a specific development for proteins of high pharmaceutical interest. Structural models were evaluated on a large scale, with a very wide range of model qualities, showing its stability. The HPM scoring function performs on average better than DOPE, which is the reference scoring function in the Modeller suite [14] (and which can be used to rank models made from other approaches).
This example highlights the importance of having an external and simple tool to test results from different tools, even in this period of Deep Learning with AlphaFold2 and related methods.
These approaches do not provide 3D structural models but they provide interesting behaviors. The first and most common way to propose TMP structural models is homology modeling with Modeller [14] and SwissModel [103]. Based on sequence alignment with a structural template, it remains essential in the TMP area. Some methods have been developed specifically for TMPs, for instance, MEMOIR (membrane protein modeling pipeline) [104] and MEDELLER [105], which proposes only high-quality regions and does not complete the others. Threading is used in TMFoldWeb [106], a web implementation of TMFoldRec [107]. Interestingly, Rosetta has incorporated a membrane-specific version of the original Rosetta energy function, which considers the membrane environment as an additional variable next to amino acid identity, inter-residue distances, and density [108]. It was included in RosettaMP [98]. In fact, all structural modeling methods, e.g., Phyre [71], Modeller [14], SwissModel [103], RoseTTAfold [70], ESMFold [72], and AlphaFold2 [40], can be used for TMPs (see Section 3.5).
However, a quasi-systematic bias is the use of score functions, such as DOPE, that are related to globular proteins and not to transmembrane proteins. Independent tools, such as Verify3D [18] or Prosa II [16,17], are based on data dominated by globular proteins, which are largely over-represented in the PDB compared to TMPs.
It is worth noting some studies of interest. Postic and collaborators have set up an empirical energy function for the structural assessment of protein transmembrane domains [33]. This statistical potential quantifies the interatomic distances between residues located in the lipid bilayer. Following a leave-one-out cross-validation procedure, they showed that their method outperforms other statistical potentials in discriminating correct from incorrect membrane protein models. The approach must be installed locally. Studer and coworkers proposed an equivalent method named QMEANBrane [35], derived from the original QMEAN scoring function [20,109]. It is integrated in the SwissModel environment [103] but cannot be used with external models. More recently, AlphaFold2 introduced its pLDDT scores [40], associated with the quality of the proposed structural models. However, they cannot be used with results from other approaches. It is therefore worthwhile to see whether the HPMScore could be useful for the scientific community.
Our work can easily raise three questions: (i) Which proteins can be used? (ii) Which structural models can be generated? and (iii) How can the results be assessed?
Transmembrane protein structures are difficult to obtain experimentally. In 2000, only one structure was in the Protein Data Bank. Thanks to new methodologies, their number has greatly increased. Now, 1561 unique TMP structures can be found, for all-α and all-β TMPs, as stated by mpstruc [110,111] (https://blanco.biomol.uci.edu/mpstruc/#news, accessed on 17 January 2023). However, the number of different folds has not really increased, and redundancies exist. We have kept the HOMEP dataset, as we know it very well and it correctly represents the different known TMP folds.
From this dataset, we needed to generate a series of structural models. Different approaches have been proposed to generate decoys that deviate from the real structure. As no dataset was available, we generated our own. To do this, we decided to make point mutations, insertions, and deletions to move further and further away from the real structure. Of course, this does not represent a directed (or rather degenerated) evolution [112], but it does allow for an important sampling of conformational space. The conservation of the membrane part relies on a smaller amino acid alphabet [113] than the one we used. Figure 2c shows how complex this is. Even with an alignment at 25% sequence identity, it is possible to have GDT_TS values ranging from 10 to 90.
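The principle of moving away from a native sequence through point mutations, insertions, and deletions can be sketched as below. This is a simplified illustration under stated assumptions, not the protocol used in this study: the function name, the uniform choice among the three event types, and the uniform choice of residues are all hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def degrade_sequence(sequence, n_events, rng):
    """Apply `n_events` random point mutations, insertions, or deletions.

    Event types and positions are drawn uniformly (an assumption); a deletion
    is replaced by a mutation when only one residue is left.
    """
    residues = list(sequence)
    for _ in range(n_events):
        kind = rng.choice(("mutation", "insertion", "deletion"))
        if kind == "deletion" and len(residues) <= 1:
            kind = "mutation"
        position = rng.randrange(len(residues))
        if kind == "mutation":
            residues[position] = rng.choice(AMINO_ACIDS)
        elif kind == "insertion":
            residues.insert(position, rng.choice(AMINO_ACIDS))
        else:
            del residues[position]
    return "".join(residues)


# Increasing the number of events drifts further from the starting sequence.
native = "MKTLLLTLVVVTIVCLDLGYT"  # arbitrary toy sequence
variants = [degrade_sequence(native, n, random.Random(42)) for n in (0, 5, 10)]
```

Fixing the seed of the random generator keeps such a decoy series reproducible.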
Finally, we have analyzed the results with RMSD [114], PBscore [65], and GDT_TS [56]. They all provide the same trends. Top 1, Top 5 and Top 10 underline the interest of the HPMScore to select the best models. As discussed before, in the context of comparative modeling, i.e., for alignments with a sequence identity higher than 35%, the HPMScore gives the better selection in 52% of the cases, DOPE gives the better selection in 36% of the cases, and both give the same result in 12% of the cases. Figure 7 shows a visualization of the quality of the prediction by slices of 5% of sequence identity. A regression is performed for the HPM results, the DOPE results, and the cases where both give the same result. The direction of the lines highlights the superiority of HPM. This evaluation unequivocally demonstrates the value of the approach. A Welch test on the question of whether HPM is better than the DOPE score alone (data in Figure 7) gave a significant positive answer (0.01). A rather complex point to apprehend is the variability of the results from one protein to another. The generation of unsupervised alternative alignments gives very different results depending on the topology of the protein, its amino acid composition, or the impact of insertions and deletions.
For convenience, we have made available an additional website (https://clipperton.ufip.univ-nantes.fr/hpmeval/, accessed on 1 March 2023) with a large number of analyses, which highlights this complexity.
The HPMScore web server allows simple and efficient use; we use it regularly (including for courses). The example presented with results from very different predictive tools clearly demonstrates the usability of the methodology. Lastly, we should remember that the HPMScore is built on the HPM matrix, fully described in [45]. The HPM strategy is based on a learning process combining sequence and structural properties, which depends on a few parameters.
In the present work, we kept the optimal HPM matrix, finely tuned after an extensive grid search of the parameters and trained on 52 PDB files. Despite its small size, this dataset contains most of the representative folds of α-helical TM proteins. Given the good results with the present version of the HPM matrix, we may reasonably expect improvement with new training on a larger dataset that includes 3D structures solved since. This will be the subject of a forthcoming study.
Figure 7. Evaluation summary. For each bin of 5% of sequence identity, the proportion of cases where the best result is obtained with HPMScore (red), DOPE (black), or both (blue) is shown. On the upper part, the number of evaluated models is given.
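A per-bin tally of the kind summarized in Figure 7 can be sketched as follows. The input layout is an assumption made for illustration: each case is a tuple holding the sequence identity (in %) and the GDT_TS of the Top 1 model selected by each scoring function, with a higher GDT_TS meaning a better selection; the function name is hypothetical.

```python
from collections import Counter, defaultdict


def tally_by_identity_bin(cases, bin_width=5):
    """Count, per sequence-identity bin, which scoring function selected the better model.

    `cases` is an iterable of (identity_percent, gdt_ts_hpm, gdt_ts_dope)
    tuples (hypothetical layout). Bins are labeled by their lower bound.
    """
    tallies = defaultdict(Counter)
    for identity, gdt_hpm, gdt_dope in cases:
        bin_start = int(identity // bin_width) * bin_width
        if gdt_hpm > gdt_dope:
            winner = "HPM"
        elif gdt_dope > gdt_hpm:
            winner = "DOPE"
        else:
            winner = "both"
        tallies[bin_start][winner] += 1
    return {start: dict(counts) for start, counts in sorted(tallies.items())}


# Toy values, not results from this study.
summary = tally_by_identity_bin([
    (37.0, 80.0, 70.0),
    (38.5, 60.0, 60.0),
    (42.0, 50.0, 55.0),
])
```

Dividing each count by the total of its bin gives the proportions that such a figure displays.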

Conclusions
When one wants to produce a model and evaluate its quality, it is important to understand how the scoring procedure will indicate the overall quality of the model. Most of the proposed structural models have been created using comparative modeling [7], while AlphaFold2 can provide an interesting alternative [41,115,116]. In our study, we first simulated the evolutionary drift in protein sequence between homologous proteins by creating degenerated sequences using amino acid mutations or permutations. For each resulting sequence, we modeled the putative target protein from the template protein for which the 3D structure was available. We then assessed the performance of our new method against the reference DOPE function, reportedly very effective for membrane proteins. Our new scoring function is based on the hybrid protein model approach, trained on a set of representative membrane proteins. It is widely accepted that membrane proteins are difficult to model since the amino acids forming the transmembrane segments are densely packed due to the hydrophobic environment and the lipid compaction surrounding the protein, whilst the extra- and intracellular amino acids are exposed to a more hydrophilic medium.
This study is interesting as the HPMScore is a non-classical approach, and was tested with the greatest number of different TMPs and the largest number of generated models. Moreover, Top 1 was used, but also Top 5 and Top 10; sequence identity rate influence was evaluated and even the analysis of the transmembrane region was assessed. It is, therefore, a systematic large-scale study.
A server is up for model validation. It can take as input a single model or a large number of models coming from various prediction methods. Interestingly, it can be used to select models and to analyze them at residue level (and so potentially combine different structural models).

Acknowledgments:
We would like to thank the anonymous reviewers who helped to improve the manuscript, Jean-Christophe Gelly for the fruitful discussions, and Sylvain Léonard for the technical support.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.