Phylogenetic analysis of Harmonin homology domains

Harmonin Homogy Domains (HHD) are recently identified orphan domains of about 70 residues folded in a compact five alpha-helix bundle that proved to be versatile in terms of function, allowing for direct binding to a partner as well as regulating the affinity and specificity of adjacent domains for their own targets. Adding their small size and rather simple fold, HHDs appear as convenient modules to regulate protein–protein interactions in various biological contexts. Surprisingly, only nine HHDs have been detected in six proteins, mainly expressed in sensory neurons. Here, we built a profile Hidden Markov Model to screen the entire UniProtKB for new HHD-containing proteins. Every hit was manually annotated, using a clustering approach, confirming that only a few proteins contain HHDs. We report the phylogenetic coverage of each protein and build a phylogenetic tree to trace the evolution of HHDs. We suggest that a HHD ancestor is shared with Paired Amphipathic Helices (PAH) domains, a four-helix bundle partially sharing fold and functional properties. We characterized amino-acid sequences of the various HHDs using pairwise BLASTP scoring coupled with community clustering and manually assessed sequence features among each individual family. These sequence features were analyzed using reported structures as well as homology models to highlight structural motifs underlying HHDs fold. We show that functional divergence is carried out by subtle differences in sequences that automatized approaches failed to detect. We provide the first HHD databases, including sequences and conservation, phylogenic trees and a list of HHD variants found in the auditory system, which are available for the community. This case study highlights surprising phylogenetic properties found in orphan domains and will assist further studies of HHDs. We unveil the implication of HHDs in their various binding interfaces using conservation across families and a new protein–protein surface predictor. Finally, we discussed the functional consequences of three identified pathogenic HHD variants involved in Hoyeraal-Hreidarsson syndrome and of three newly reported pathogenic variants identified in patients suffering from Usher Syndrome.

Two other surfaces on the side of the bundle were also observed, providing cis (Fig. 2d) and trans (Fig. 2i) noncanonical protein-protein interactions, involved in the scaffolding function of HHDs together with the formation of intramolecular supramodules, as illustrated by the HHD-PDZ complex in Harmonin (Fig. 2d) that tunes the binding affinity of the PDZ towards its partners [21,23].
Finally, HHDs can be found in crystals as oligomers (Fig. 2f, j), or swapped dimer ( Fig. 2k) with the merged α2/α3 helices of one monomer interacting with the merged α4/ α5 helices of the second monomer [24].
Despite a high structural conservation, available structures suggest that HHDs can accommodate various partners via different binding sites exposed at the surface of the five-helix bundle.

Identification of HHD-containing sequences
As a first task, we aimed at identifying new HHD-containing proteins. The Harmonin-N-like superfamily from the National Center for Biotechnology Information (NCBI; accession cd07347) was used as a starting set of sequences. It consists of the 94 known sequences of HHDs among the six HHD-containing proteins from various organisms. These domain sequences were used as a seed on which we built a profile HMM (Hidden Markov Model) to screen the UniProtKB database [25,26]. Profiles HMM are probability-based models allowing to predict true probabilities of finding each residue at every position of an alignment. These are derived from observed frequencies using a statistical approach, instead of directly using these frequencies for scoring. This provides a model that is less biased towards the starting alignment than frequency-based approaches and thus allows to detect more distant homologues. HMM profiles are adequate tools when using a rather short alignment of sequences as a starting point for a large dataset screening.
Screening of the entire UniProtKB database using our generated profile HMM yielded 2939 domain hits, with E-values lower than 10 -5 as inclusion threshold. Profile HMM and all sequence alignments are accessible online on http:// doi. org/ 10. 5281/ zenodo. 42558 47.
Our second task was to identify full-length proteins corresponding to each domain hit, given that most associated Uniprot entries were not annotated. We ran the BLASTP algorithm on each domain sequence and corresponding full-length protein, in addition to manually checking each individual UniProtKB entry. This process was supported by a clustering approach (Fig. 3) based on sequence identity of the domains. For this analysis, we started with the whole set of domain sequences (2939) and ran the clustering algorithm with an increasing identity inclusion threshold.
The following procedure was implemented: (1) We increased the inclusion threshold until a significant cluster is isolated, representing more than 5% of the initial full set of sequences. (2) If annotated sequences appear homogeneous, this annotation is used to name the cluster and we proceed to step 3, otherwise we go back to step 1. (3) We further increased the inclusion threshold until another split happened, with hits in subclusters accounting for more than 10% of the total sequences of the parental cluster. The incremental approach allows to isolate all clusters at their appearing threshold. An alternative display of the clustering is presented Fig. 3c. It is the result of a calculation at a  22:190 single threshold applied to all sequences, thus fragmenting clusters of lower homogeneity (RTEL from insects and PAH).
Our procedure allowed to estimate the conservation within each cluster and to identify subclasses of proteins. From our dataset, only 79 domain sequences (out of 2939) arise from proteins without known HHD or that could not be identified (2.7% of the dataset). These sequences did not cluster together and formed a collection of isolated hits that do not make up for a new HHD-containing family.
Interestingly, 5% of the hits (146 over 2939) corresponded to PAH (Paired Amphipathic Helices) domains that satisfied the inclusion threshold and clustered together. PAH domains are protein-protein interaction domains involved in eukaryotic transcription, found in proteins such as Mad1 and Sin3 [27]. It consists of four helices instead of five as found in HHDs. However, the similarity between PAH and HHDs, considering the four first helices, is striking both in fold and sequence conservation (Faure, et al., Proteins (2014) (https:// doi. org/ 10. 1002/ prot. 24438)) [28], with an overall identity of 12% ± 8%, ranging from 8% between Delphinin-HHD2 and PAH to 17% between HHD from RTEL of plants and PAH (Additional file 1: Figure S2, Table S1). Importantly, PAH also can interact with amphipathic helices through the interface formed by its α1 and α2 helices (ex PDB: 1pd7, 2rms), a binding mode conserved in HHDs. The overall conservation of fold and function suggests the two domains are evolutionary related.
The 2939 sequences are clustered in eight groups shown in Table 1.

HHDs in RTEL
We found 787 hits corresponding to HHDs in RTEL proteins from eukaryotic organisms. The HHD is located at the C-terminal of the protein. In contrast with other HHD-containing protein families, HHDs in RTEL form three clusters corresponding to viridiplantae, bilaterian and specifically insects. Based on our analysis, only vertebrates possess two HHDs in their RTEL proteins, as previously suggested [28], whereas a single HHD is found in other bilaterian, insects or plants.

HHDs in CCM2
We identified 490 sequences corresponding to HHDs in CCM2 proteins, only found in metazoans. For this family, exclusively one domain hit per protein sequence was found, Fig. 3 Clustering analysis of HHD-containing sequences. a The inclusion threshold (%identity) is incremented from top to bottom. Indicated percentages refer to inclusion thresholds where subclusters emerged, indicated by arrows. Outliers correspond to sequences that did not cluster successfully. The inclusion threshold is further incremented for subclusters with inhomogeneous sequence annotations. b Each resulting cluster is again submitted to an incrementation of the inclusion threshold. The first percentage value and arrow correspond to the limit identity where only one main cluster remains and point toward the number of sequences in that cluster. The second percentage value and arrows indicate the identity value where multiple clusters accounting for more than 10% of starting sequences emerge, pointing to the sizes of the two main clusters at the given identity threshold. c EFI-Enzyme Similarity Tool representation using a filter-value of 18. This threshold is determined empirically to differentiate and display all clusters in a single snapshot. The circled cluster highlights sequences from insects, corresponding to the Dyschronic sequences also circled in b (See figure on next page.)  22:190 as previously reported [29]. The HHD domain in CCM2 is identified in the C-terminal 200 amino acid region involved in the induction of cell death.

HHDs in Whirlin
We pointed out 592 hits that can be identified as belonging to Whirlin proteins. Two clusters were obtained, one with 198 sequences including the first HHD, and the other with 366 hits including the second HHD of known sequences. However, at an inclusion threshold over 55% identity, this second larger cluster leads to an additional subcluster of 110 hits. Interestingly, these sequences come exclusively from insects, and 75 of them can be identified as a Dyschronic protein, a homolog of Whirlin involved in the synaptic development of flies expressed in major neuronal tracts [30][31][32]. We can detect only one HHD in these sequences. The N-terminal region of 300 amino acids is predicted as disordered upstream its first PDZ. By contrast, Whirlin has a first HHD in the N-terminal 140 amino acids region upstream its PDZ tandem. The unique HHD of Dyschronic is located between a PDZ tandem, but its HHD is tethered to the third PDZ by a shorter disordered region (about 230 residues instead of 550 in Whirlin). Finally, the Dyschronic protein has 50 additional residues following its third and last PDZ. The two proteins end with a type-I PBM. We hereby predict that the protein referred to as Dyschronic possesses a HHD similar to the second HHD of Whirlin. However, given the confusingly high similarity of Dyschronic to Whirlin, it will be considered as a Whirlin subclass in the rest of the study.

HHDs in Delphilin
Among a pool of 413 hits, two clusters were obtained for Delphilin, one comprising the first HHDs and the other the second HHDs of annotated proteins. Protein sequences only originate from metazoans.

HHDs in Harmonin
A total of 277 hits correspond to HHDs in Harmonin proteins from metazoans. This cluster has the highest degree of conservation, up to 93% identity for cluster core. The highly conserved residues cover almost all the surface of the HHD. As illustrated in Fig. 2b-d, it has been shown that the HHD of human Harmonin exhibits several surfaces of binding providing both intramolecular (PDZ domain) and intermolecular (Cad-herin23) interactions. We suggest that the highest degree of conservation of HHDs in Harmonin might underlie multiple potential binding interfaces, resulting in an overall well conserved surface.

HHDs in PDZD7
Coming from metazoans only, 155 hits have been identified as PDZD7 proteins. The vast majority of these sequences encode for one HHD domain, however two HHDs are detected in 11 sequences from 6 organisms (Branchiostoma floridae, Crassostrea gigas, Lingula unguis, Stylophora pistillata, Capitella teleta, Strongylocentrotus purpuratus), with an additional hit before the first PDZ, leading to a Whirlin-like organization of domains. This suggests that PDZD7 might have diverged from Whirlin and lost its first HHD. This observation is consistent with the phylogenetic tree (see below) that highlights a low evolutionary distance between the second HHD of the two proteins.
Overall, only the Dyschronic homologue of Whirlin appears as a new HHD-containing protein. All domain hits identified in this study falls into known HHD-containing protein families, with no significant new architecture. We provide an ensemble of sequences to support HHD studies, with an improved classification of HHD families, highlighting 11 main clusters.

Phylogeny
Given the similarity between PAH and HHDs, in terms of conserved residues, fold and canonical binding mode, we hypothesized that PAH and HHDs are evolutionarily related and that the loss or gain of the fifth helix is the evolutionary event to transition from one domain to the other. Using this hypothesis, we decided to root our phylogenetic trees between PAH and HHDs. The various trees we generated based on the full set of sequences presented some variability. This is due to the high number of sequences we compared relatively to their length, resulting in a low signal to anchor each sequence. To overcome this issue, we subsampled each cluster for representative sequences using CD-HIT [33] and successfully obtained a stable tree with statistically supported branching (Fig. 4).
The branching between HHD families in the tree is consistent with the phylogenetic groups in which the proteins are found. First, PAH-containing proteins are found in all eukaryotic organisms and consist in a branch opposed to all HHDs, this is a direct consequence of our working hypothesis. Then, are found HHDs from RTEL in Viridiplantae against all HHDs in Opisthokonta. Among those, a branch containing HHDs from RTEL in Bilateria, with a sub-branching for HHDs in Insecta, is opposed to other HHD containing proteins that are not in plants. Finally, the last branch of the tree exclusively supports neuronal proteins and splits CCM2 from PDZ-containing sequences (Delphilin, Harmonin, Whirlin and PDZD7). The fact that the six HHDs arise closely in the tree suggests that they evolved from one ancestral HHD sequence that was retained only in these neuronal proteins. We note that, at least, the HHD1 domains of Harmonin and Whirlin can form a supramodule organization together with the adjacent PDZ domain that can regulate the inter-and intra-molecular interactions of Usher proteins [19] (also supported by unpublished data). The co-occurrence of HHD and PDZ may indicate a complementarity of function of the domains and suggests that supramodular interactions have been evolutionarily selected in this branch of paralog proteins.

Heatmap
Phylogenetic analysis based on isolated domains helps to understand the processes promoting protein differentiation, notably through partial (domain) or whole gene (protein) duplication. However, it is not straightforward to conclude on which ensembles of sequences, or branches, are significantly different from one another. Here, we used the BLASTP tool to score each HHD sequence against every other sequence detected using our HMM approach. Converting the obtained data to an adjacency matrix allows us to cluster sequences into communities. Sequences within one community are then not significantly different. Using this approach, we could group all sequences of Whirlin HHD2 and PDZD7 HHD, Delphilin HHD1 and Harmonin HHD, RTEL HHD1 and HHD2 in Metazoa and the specific cluster from insects, as well as the sequences from PAH and HHDs from plant RTEL. CCM2 HHD, Delphilin HHD2 and Whirlin HHD1 are significantly different from any other HHD cluster (Fig. 5).
These results are consistent with the phylogenetic tree (Fig. 4) and branches can be regrouped as follow: PAH and HHDs from RTEL-Viridiplantae; HHD1 and HHD2 in

Motifs analysis
Whole domain comparison, as performed with phylogenetic analysis and blast, can fail at differentiating overall similar domains based on short motifs in the sequences. Therefore, the functional features that are specific to each domain, or rather conserved among them, remain elusive. Unfortunately, the use of automatized motif discovery approaches to identify stretches of residues specific to a family, or conserved from one domain to another, did not lead to results we could decipher. This is most likely due to the high conservation within as well as between clusters, leading to a low signal to noise ratio when searching for local stretches of conserved residues with associated functions. Therefore, we manually calculated the conservation for each position of individual HHDs, mapped on a logo representation of the corresponding clusters. Sequences of each cluster were filtered to reduce global redundancy and increase signal for the detection of positionspecific conservation. To discuss the conservation of given positions across all HHDs, we will further use the numbering from Fig. 6.  22:190 We identified three stabilizing pairs of positions shared among most HHDs (Fig. 7). The residues position 30 and 69, forming a hydrogen bond (Y to charged) or cation-pi interaction (F to cationic), stabilize the bundle between the top of helices α2 and α5. These two positions are found conserved in all clusters except RTEL from Viridiplantae and the bound is indicated as a dashed line Fig. 6. A second pair of residues, positions 19 (anionic) and 54 (cationic) facing each with charge to charge interaction, stabilizes the bundle between the bottom of helices α2 and α4. The two positions are found conserved in 7 of the 12 clusters as indicated Fig. 6 by a plain colored line. Similarly, a third pair of residues, positions 62 (cationic) and 74 (anionic), links the α4 to the shorter α5, further stabilizing the helix bundle. This last stabilizing pair is found in 8 of the 12 clusters, indicating by a plain colored line Fig. 6.
Based on available experimental models, we identified a list of positions in close proximity to the canonical helix binding site that are or could be involved in the interaction with the partner. They can be sorted into two sets, some with side chain participating in the hydrophobic core of the HHD (positions 7, 11, 27 and 61) forming the background of the binding groove and the others located on the α1 and α2 helices and lining the binding groove (positions 4, 5, 8, 12, 17, 20, 21, 24, 25, 28 and 31). The first set cannot be used to discriminate between HHDs, as the four positions are conserved across all clusters with hydrophobic residues, as taking part in the hydrophobic core of the domain. Looking at the conservation of the second set of positions: 7/11 are conserved in the HHD of RTEL from Viridiplantae, 4/11 for RTEL Insecta, 8/11 for RTEL HHD1 Bilateria, 3/11 for RTEL HHD2 Bilateria, 6/11 for CCM2, 8/11 for Delphilin HHD1, 9/11 for Harmonin, 3/11 for Delphilin HHD2, 8/11 for Whirlin HHD1, 3/11 for PDZD7 and 3/11 for Whirlin HHD2. These observations suggest that some HHDs did not evolve to bind to a helical partner. However, using a proteinprotein interaction surface predictor, we could not distinguish the various HHDs in terms of size and shape of the binding site (Fig. 8, method submitted for publication). The binding pocket might be too shallow for accurate prediction, with hydrophobic contacts contributing too much to the interaction compared to subtle variations of residues along the binding site determining their specificity and promiscuity. However, the interface was detected unambiguously for every HHD suggesting that all HHDs could accommodate an amphipathic helix as a binder. More experimental data are required to support any further conclusion.
(See figure on next page.) Fig. 6 Logo representation of sequences for each individual HHD, extracted from the overall alignment of all HHDs to maintain column correspondence. Clusters corresponding to each HHD were filtered using CD-hit at 95% identity threshold to remove redundant sequences and increase diversity. Number of remaining sequences: 68 RTEL (Viridiplantae), 54 RTEL (Insecta), 83 RTEL HHD1 (Bilateria), 73 RTEL HHD2 (Bilateria), 78 CCM2, 65 Delphilin HHD1, 23 Harmonin, 42 Delphilin HHD2, 33 Whirlin HHD1, 48 PDZD7, 91 Whirlin HHD2. Grayed positions are represented in less than 80% of the sequence hits, non-colored positions are represented in most sequences (>80%), but are not conserved (< 75%) in the final set. Colored positions are conserved in more than 75% of sequences, yellow for hydrophobic, blue for cationic, red for anionic, green for polar and non-charged residues and orange for prolines and glycines. Residues were grouped as follow for conservation calculation:  22:190 Two other interfaces for protein-protein interactions are observed in crystal structures. The first one is found on the α2-α3 face of the CCM2 HHD [18] (PDB entry 4Y5O). In the crystal structure, the interaction is mediated by an alpha helix   22:190 from the MEKK3 protein, as discussed so far, as well as its NPB1 domain, which contacts the HHD on this α2-α3 surface. Picking residues from these two helices facing the NPB1 domain rise the following list of positions: 17, 18, 21, 22, 25, 28, 29, 38, 39, 42, 43 and 46. However, only position 28 is conserved using our criteria, and it is also involved in helix binding. This observation suggests that this interface was not conserved and results from crystal packing, leaving the interaction mainly mediated by the MEKK3 N-terminal helix. This is supported by authors indicating that mutations in the N-terminal helix of MEKK3 severely reduce its ability to pull-down CCM2, while mutations in the NPB1 domain only moderately affect the interaction. The second one consists of the lower segments of α3, α4 and α5 of the Harmonin HHD [21] (PDB entry 3K1R). In the crystal structure, this interface is located at the C-terminus of the HHD and promotes the formation of a supramodule with the following PDZ. Picking positions taking part in this surface of interaction are: 45, 50, 51, 55, 58, 59, 74 and 75. From our study, 7 of the 8 positions are conserved, which is consistent with a functional supramodule affecting the binding of the PDZ to its partners.

Mutations in HHD domains
According to the low number of HHD-containing proteins and their recent discovery, only few mutations associated with human diseases have been described.
The first HHD of the human RTEL protein is affected by three mutations responsible for the Hoyeraal-Hreidarsson syndrome (HHS), a severe form of dyskeratosis congenital [34]. The first mutation, K897E (UniProtKB entry Q9NZ71), is positioned in the beginning of the first helix (position 5 of the alignment) and is fully exposed to the solvent. This position is conserved in more than 75% of Bilateria RTEL HHD1 sequences we identified in this study. It is located in close proximity to the canonical binding groove of HHDs and the charge switch induced by the mutation could affect the binding of the partner. However, this cannot be ascertained as no helix binding to the RTEL HHD1 have been identified thus far. The second mutation, R957W, is located in the short loop between helices α4 and α5, corresponding to the position 66 of the alignment. This position is conserved and fully exposed and the function of the arginine or effect of the tryptophan is unclear. The last mutation, F964L is located in the middle of the α5, corresponding to the conserved position 73 of the alignment. This phenylalanine takes part in the hydrophobic core of the HHD, occupying a rather large pocket. This mutation might destabilize the core of the domain by introducing a shorter aliphatic side-chain.
Harmonin, Whirlin and PDZD7 are deafness proteins sometimes associated with the Usher syndrome. We have identified 99 variants in HHD domains of these proteins on the largest cohort of USH patients studied to date [35]; 37 in Harmonin HHD, 25 in Whirlin HHD1, 22 in Whirlin HHD2 and 12 in PDZD7 HHD. Three of them have been identified as pathogenic, while the 96 others are classified as Variants of Unknown Significance (VUS, complete list per domain Additional file 1: Table S2). The first pathogenic variant is the R63W mutant in the Harmonin HHD. It corresponds to the conserved arginine position 62 of the alignment, involved in the stability pair stabilizing  22:190 helices α4 and α5, as previously described. This mutation may thus impair the stability of Harmonin HHD. The second pathogenic variant is the D447H mutant in Whirlin HHD2. This mutation corresponds to position 28 of the alignment and is not conserved. However, this position is located right next to the binding groove and is directly involved in the binding to the helix in Harmonin and CCM2. This mutation may thus affect the binding to a partner. The last pathogenic variant is the R490H mutant in Whirlin HHD2. It corresponds to the non-conserved position 72 of the alignment, completely exposed to the solvent, and its effect is unclear.

Conclusion
HHDs are found across the whole Eukaryota phylogenetic domain and evolved from a single ancestor, in common with PAH domains. However, only six proteins could be detected in our screening, resulting in a surprisingly narrow variety of proteins encompassing. Adding to these six proteins, we identified a subcluster containing sequences referred to as Dyschronic, the closest homolog to Whirlin in Drosophila. The Dyschronic protein is a component of the circadian output pathway present in the fly central brain. This study led us to identify one HHD domain in this protein similar to the second HHD of Whirlin. Among HHD-containing proteins, the human RTEL protein was reported to possess two HHDs [28]. Here, we highlight that only bilaterians have a two-HHDs RTEL protein, except insects, while RTEL in other organisms only possesses one C-terminal HHD. Interestingly, the HHDs of the various RTEL proteins constitute multiple clusters, reflecting consistent differences in sequence that could be associated with the variability of telomeric properties between taxons. HHDs are involved in intermolecular interactions, both in homotypic (dimer) and heterotypic (HHD-helix, HHD-PDZ) complexes. The small number of domains identified in the UniProtKB databank, and the diversity of molecular recognition characterized up to now prevent us from predicting potential partners. The main function of HHD is likely to promote the formation of large oligomeric complexes with the recruitment of multidomain scaffold proteins. The reasons are still unclear to explain why the number of identified HHD is limited despite our extensive search. HHDs are small and well-folded domains and would be good candidates for duplication in proteins to serve as molecular scaffolds for partner interactions. We are currently investigating the role of HHDs in Whirlin and PDZD7 proteins to shed light on the roles of these domains in hearingassociated protein networks. These results will provide clues to decipher the molecular basis of their interactions.

Sequence alignments
All sequence alignments in this paper were performed using the G-INS-i algorithm from the MAFFT package [36], suitable for sequences containing single domains. Position filtering depending on gap percentage was carried out, when indicated, with Goalign "clean sites" option (https:// github. com/ evolb ioinfo/ goali gn). This allows to remove positions that are not representative of the overall alignment, but rather of unique sequence features.