Dali server: structural unification of protein families

Abstract
Protein structure is key to understanding biological function. Structure comparison deciphers deep phylogenies, providing insight into functional conservation and functional shifts during evolution. Until recently, structural coverage of the protein universe was limited by the cost and labour involved in experimental structure determination. Recent breakthroughs in deep learning revolutionized structural bioinformatics by providing accurate structural models of numerous protein families for which no structural information existed. The Dali server for 3D protein structure comparison is widely used by crystallographers to relate new structures to pre-existing ones. Here, we report the two most recent upgrades to the web server: (i) the foldomes of key organisms in the AlphaFold Database (version 1) are searchable by Dali, and (ii) structural alignments are annotated with protein families. Using these new features, we discovered a novel functionally diverse subgroup within the WRKY/GCM1 clan. This was accomplished by linking the structurally characterized SWI/SNF and NAM families as well as the structural models of the CG-1 family and uncharacterized proteins to the structure of Gti1/Pac2, a previously known member of the WRKY/GCM1 clan. The Dali server is available at http://ekhidna2.biocenter.helsinki.fi/dali. This website is free and open to all users and there is no login requirement.


Introduction
Dali (http://ekhidna2.biocenter.helsinki.fi/dali/) is a protein structure comparison server based on distance matrix comparison (https://onlinelibrary.wiley.com/doi/full/10.1002/pro.3749). In favourable cases, structure comparison can reveal distant evolutionary relationships not seen by sequence comparison. The web server supports searches in three databases [Protein Data Bank (PDB); PDB25, a representative subset of PDB; AlphaFold (version 1)] and enables two customised types of structure comparisons: i) pairwise structure comparison to one query structure and ii) all against all structure comparison.
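The idea behind distance matrix comparison is that the set of intramolecular distances of a structure does not change when the structure is rotated or translated, so two structures can be compared without first superimposing them. The following minimal Python sketch shows only this representation, for a list of C-alpha coordinates. It is not Dali's scoring function or alignment search, and the coordinates are toy values invented for illustration.

```python
import math


def distance_matrix(ca_coords):
    """Return the symmetric matrix of pairwise C-alpha distances (in Angstrom)."""
    n = len(ca_coords)
    dm = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(ca_coords[i], ca_coords[j])
            dm[i][j] = dm[j][i] = d
    return dm


# Toy coordinates (hypothetical, not taken from any PDB entry).
coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]
dm = distance_matrix(coords)

# Shifting every atom by the same vector leaves the matrix unchanged.
shifted = [(x + 10.0, y - 5.0, z + 2.0) for x, y, z in coords]
dm_shifted = distance_matrix(shifted)
```

Similar substructures show up as similar patches in the two distance matrices; how such patches are scored and assembled into an alignment is described in the cited Dali publication.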
The server takes the 3D coordinates of protein structures as input and returns a list of similar structures, structural alignments and superimposed structures. The all against all comparison also returns a structural dendrogram. The results are linked to sequence search and function prediction servers. Dali tutorial 2016 explains the web interface of the Dali server using live examples. It is still fully valid. This update, Dali tutorial 2022, explains more recently added features: enhanced visualization using stacked Pfam cartoons, and support for the AlphaFold Database. These features are illustrated with a case study (Table 1). Gray boxes mark exercises that you can try out interactively in the web server.

Finding structures for your protein [*new features*]
First, we use the sequence to identify the corresponding PDB and AF-DB models. Paste the FASTA-formatted sequence (https://predictioncenter.org/casp14/target.cgi?id=130&view=all) into the SANSparallel server at http://ekhidna2.biocenter.helsinki.fi/cgi-bin/daliviewer/call_sans.cgi. Select the database to search using the radio buttons. We learn that the best PDB match is 7k7vA and the best AlphaFold Database match is gugsA (Figure 1).

Figure 1.
Interactive homology search using SANSparallel: a submission example (left) and outputs from searches in the PDB (middle) and AlphaFold Database (right).
Dali internally uses four-letter identifiers for the PDB entry plus chain identifier. PDB identifiers are inherited from the Protein Data Bank and start with a number 1-9. Dali's internal identifiers for AlphaFold Database (AF-DB) models start with a letter a-h. The original AF-DB identifier can be recovered from the description field (cf. Figure 1).
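The identifier convention above can be expressed as a small helper. This is a sketch under the assumption that every identifier is exactly five characters long (a four-character entry code plus a one-character chain identifier), as in the examples in this tutorial. The function name `classify_dali_id` is hypothetical and not part of the Dali server or DaliLite.

```python
def classify_dali_id(dali_id):
    """Split a Dali identifier into (source, entry, chain).

    Assumes a four-character entry code followed by a single chain character.
    """
    if len(dali_id) != 5:
        raise ValueError("expected a 5-character identifier, got %r" % dali_id)
    entry, chain = dali_id[:4], dali_id[4]
    first = entry[0].lower()
    if first in "123456789":
        source = "PDB"      # inherited Protein Data Bank identifier
    elif first in "abcdefgh":
        source = "AF-DB"    # Dali-internal identifier of an AlphaFold model
    else:
        source = "unknown"
    return source, entry, chain


pdb_hit = classify_dali_id("7k7vA")
afdb_hit = classify_dali_id("gugsA")
```

For AF-DB hits, the original AlphaFold Database identifier still has to be read from the description field; it cannot be derived from the internal identifier.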
Second, do a pairwise Dali alignment of gugsA and 7k7vA to verify the high accuracy of the model predicted by AlphaFold2, with 0.2 Å RMSD over 192 residues (Figure 2). It is important to note that the PDB structure of Ssr4 was released after the CASP14 competition. AlphaFold generates end-to-end models of the whole protein sequence. In the present case, only the N-terminal domain of gugsA is modelled as a compact domain. The C-terminal domain is not present in the crystal structure. The crystal structure has an artificial N-terminal His tag, which is naturally not present in the AF-DB model. Subsequent analysis will focus on the N-terminal domain of Ssr4, so we use 7k7vA as the query instead of the AF-DB model.
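For readers who want to check an RMSD value by hand, the quantity can be computed as follows from two lists of structurally equivalent atoms. This is a minimal sketch that assumes the coordinates are already superimposed and paired one to one; it does not compute the optimal superposition itself. The coordinates below are toy values, not taken from gugsA or 7k7vA.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between paired, pre-superimposed atoms."""
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


# Toy example (hypothetical): every atom is displaced by 1 Angstrom along z.
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
crystal = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (2.0, 0.0, 1.0)]
value = rmsd(model, crystal)
```

RMSD should always be read together with the number of aligned residues, since a low RMSD over a short alignment is less informative than the same value over a full domain.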

Searching known structures in the Protein Data Bank
Next, we did a PDB search for structural neighbors of 7k7vA. For initial screening, it is useful to filter redundant hits by inspecting the top hits in the PDB25 subset (Figure 3, left). The stacked structural alignment shows the extent of the common core (Figure 3, right). The top row shows the query sequence. Structurally equivalent segments from the other structures are mapped onto it. We see that most structures match the beta strands, but the alpha helix preceding the beta strands is present in only three structural neighbors. We'll select these three (1ut4B, 4m8bR, 4ywzA) for further comparison with 7k7vA.
Click on the PDB link in the summary list and check the HEADER records of 1ut4B and 4m8bR. These entries were released by the PDB well before CASP14. (Hint: The coordinates are transformed to superimpose with the query structure. Download them to generate publication-quality graphics in PyMOL.)
1ut4B and 4m8bR were the top hits in the PDB search. The Z-scores are moderate (5.6-5.7), but the common core covers practically all secondary structure elements with an RMSD of 3.1-3.3 Å over 88-100 Cα atoms. We conclude that, contrary to its classification as a free modeling target in CASP, T1090 did not represent a novel fold; suitable templates were present in the PDB. 4ywzA had a relatively low Z-score in the PDB search and has a different architecture. We'll drop 4ywzA from further consideration and pursue the hypothesis that T1090 shares the DNA-binding function with 4m8bR and 1ut4B.
The loop that connects the edge strands of the beta sheet is structurally conserved between the remaining structures. It is the DNA recognition loop in 4m8bR, fitting into the major groove of dsDNA. We observe that this loop is disordered in 1ut4B, which causes chain breaks in the PDB structure. There are multiple PDB structures for this NAC-domain containing protein 19 (https://www.uniprot.org/uniprot/Q9C932). Some of these structures are complexed with DNA, e.g. PDB entry 3swm. However, they too have disordered loops. Recalling that AlphaFold generates end-to-end models, we will see how these loops are modelled by AlphaFold and whether they interfere with DNA binding. Dali's internal identifier for the AF-DB model is akgpA.
Make a new pairwise alignment of 7k7vA (first structure) against akgpA, 1ut4B, 3swmA, 3swmB, 3swmC, 3swmD and 4m8bR (second structures). Verify in the structural alignment view that akgpA models the loops missing from the 3swm chains. Inspect the contacts to DNA.

Pfam annotations [*new features*]
Next, we study what is known about the biological functions of our proteins of interest. Here, we exploit information gathered in the Pfam database.
Return to the summary page, select all hits and click the Pfam button. The Pfam graphics show that the hits belong to different protein families (Figure 6). Hover the cursor above a green cartouche for more information on the domain family. Read the documentation of the families at the Pfam website, e.g. https://pfam.xfam.org/family/PF08549. Substitute the Pfam identifier PFxxxxx for the other families.
Pfam annotations are available for PDB structures and AF-DB models (which are based on UniProt sequences). Users may upload their own structures to the Dali server, but these will not have associated Pfam annotation. Since Pfam is released at yearly intervals, annotations will also be missing for the most recent PDB structures.
It is common for distinct Pfam families to share structural similarity, either by common descent (homology) or by physical convergence. In our example (Figure 4), we found that three domain families share a common fold: SWI/SNF and RSC complexes subunit Ssr4 N-terminal (PF08549), no apical meristem (NAM) protein (PF02365), and the Gti1/Pac2 family (PF09729). Intriguingly, they are all involved in gene regulation. Pfam places the Gti1/Pac2 family in the WRKY-GCM1 clan (CL0274). Pfam clans unify remote homologs. No clan is assigned to the other two families. WRKY and GCM1 are metal-chelating DNA-binding domains (DBD) which share a four-stranded fold (http://www.ncbi.nlm.nih.gov/pubmed/17130173). Our families share a larger core consisting of five strands and a helix (Figure 2).

Searching AlphaFold Database (version 1) [*new features*]
So far, we have done a PDB search and identified known structures with the same fold as T1090. Let's move on to check whether AF-DB presents us with more potential members of the emerging superfamily of DNA-binding domains (Box 1). PDB searches (PDB search and PDB25 search tabs) and AF-DB searches (AF-DB search tab) are kept separate in the Dali server. PDB contains experimental structures, whereas AF-DB contains (remarkably accurate) predictions.
AF-DB searches in Dali are currently limited to one model organism at a time. You can choose from 21 model organisms, including human. AF-DB searches have long queueing times. Therefore, we show below the top hits from precomputed searches of 7k7vA against HUMAN (Figure 7). Visual inspection shows that members of the CG-1 family (PF03859), represented by e.g. e3ikA, resemble the DNA-binding domains. A search against Arabidopsis again shows CG-1 as the top hit, an uncharacterized protein At3g16750 at rank 6, and the previously picked up NAM family as the next best hits (Figure 8). WRKY domains also match the beta sheet. A four-stranded beta sheet is a common structural motif, so it is noteworthy that WRKY domains match better than other beta sheet proteins. A search against fission yeast recovers the query (Ssr4) and two paralogs of the Gti1/Pac2 family, already known members of our superfamily (Figure 9). Other top hits did not pass the secondary structure filtering step.

Putting it all together: structural dendrogram and sequence signatures
We performed an all-against-all comparison using representatives of old and new clan members (Figure 10). For this analysis, we used cropped AF-DB models and a local DaliLite installation (see the Download tab on the Dali server website). Cropped AF-DB models retain only residues that are confidently modelled (pLDDT>70). Cropping removes non-compact loops in AF-DB models, like the C-terminal part of T1090 (see Figure 2). Such loops add spurious equivalences to Dali alignments, which destroys the 3D superimposition.
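The cropping step can be approximated with a few lines of Python. This is a minimal sketch, not the script used for the analysis. It relies on the fact that AlphaFold model files in PDB format store the per-residue pLDDT in the B-factor field (columns 61-66), with the same value on every atom of a residue. The helper names `crop_by_plddt` and `atom_line` and the toy records are hypothetical.

```python
def crop_by_plddt(pdb_lines, threshold=70.0):
    """Keep ATOM/HETATM records whose pLDDT (B-factor field) exceeds threshold.

    All other records (e.g. HEADER, TER, END) are passed through unchanged.
    """
    kept = []
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")):
            plddt = float(line[60:66])  # PDB columns 61-66
            if plddt <= threshold:
                continue
        kept.append(line)
    return kept


def atom_line(serial, resseq, plddt):
    """Build a toy C-alpha ATOM record with correct fixed-column layout."""
    return ("ATOM  %5d  CA  ALA A%4d    %8.3f%8.3f%8.3f%6.2f%6.2f           C"
            % (serial, resseq, 0.0, 0.0, 0.0, 1.0, plddt))


# Toy model (hypothetical values): residue 2 is a low-confidence loop residue.
model = [atom_line(1, 1, 95.0), atom_line(2, 2, 50.0), atom_line(3, 3, 71.0), "END"]
cropped = crop_by_plddt(model)
```

Because the filter works record by record, it leaves gaps in the residue numbering where low-confidence segments were removed, which is the intended effect before running a local all-against-all comparison.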
Copy-paste the resulting Newick tree into iTOL (https://itol.embl.de/) to view the structural dendrogram.
The crystal structure of SWI/SNF and the AF-DB models of CG-1 and unclassified abinA share strong structural similarity and form a subgroup with the Gti1/Pac2 family. CG-1 is a specific DNA-binding protein ("CG-1, a parsley light-induced DNA-binding protein", PubMed). There are crystal structures of NAM domains complexed with dsDNA (e.g. 3swm). SWI/SNF is involved in chromatin remodeling. However, electrostatic potential calculations do not show a clear face of potential that might signal specific binding to DNA (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7716260/). abinA (At3g16750) is an uncharacterized protein. Sequence searches bring up few homologs, which seem taxonomically restricted to rosids. The protein has a negative net charge, making DNA binding unlikely. Nevertheless, there is striking conservation of a key interaction stabilizing the DNA recognition loop in the Gti1/Pac2 family (Figure 11).
In the structural alignment view, click the Show Stacked Sequence Logos button to compare sequence conservation profiles between families. Convince yourself of the conservation of the DxxxW motif of the recognition loop (Figure 11).

Figure 11.
Structural alignment of the recognition loop. The red arrow points to W72, which is essential for DNA binding in the Gti1/Pac2 family ("Crystal structure of the WOPR-DNA complex and implications for Wor1 function in white-opaque switching of Candida albicans", Cell Research). W72 forms a hydrogen bond to D68, stabilizing the loop. These residues are highly conserved in the other families.
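Outside the web interface, candidate occurrences of the DxxxW motif can be located in an ungapped sequence with a regular expression. This is a minimal sketch: the function name `find_dxxxw` and the toy sequence are hypothetical, and a sequence match alone does not show that the residues sit in the recognition loop, which has to be confirmed in the structural alignment.

```python
import re

# D, any three residues, then W. The lookahead also reports overlapping matches.
MOTIF = re.compile(r"(?=(D...W))")


def find_dxxxw(sequence):
    """Return (1-based position of D, matched segment) for each DxxxW match."""
    return [(m.start() + 1, m.group(1)) for m in MOTIF.finditer(sequence.upper())]


# Toy sequence (hypothetical, not from any of the families discussed).
hits = find_dxxxw("MKDAGSWLLDW")
```

The spacing of the pattern follows the figure legend above, where the aspartate (D68) and the tryptophan (W72) are four positions apart.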

Conclusion
Our analysis added three more Pfam families to the WRKY-GCM1 clan, namely CG-1, NAM and SWI/SNF, as well as a small family represented by At3g16750 (abinA). Despite strong structural similarity to sequence-specific DNA-binding domains, electrostatic potential calculations (e.g. https://pdbj.org/eF-surf/top.do) do not support this function in the last two families. The starting point of this study, SWI/SNF, was accurately modelled from sequence by AlphaFold. CG-1 and abinA were modelled by AlphaFold and still lack experimental structures, making them interesting targets for structural genomics.