SSDraw: Software for generating comparative protein secondary structure diagrams

Abstract The program SSDraw generates publication‐quality protein secondary structure diagrams from three‐dimensional protein structures. To depict relationships between secondary structure and other protein features, diagrams can be colored by conservation score, B‐factor, or custom scoring. Diagrams of homologous proteins can be registered according to an input multiple sequence alignment. Linear visualization allows the user to stack registered diagrams, facilitating comparison of secondary structure and other properties among homologous proteins. SSDraw can be used to compare secondary structures of homologous proteins with both conserved and divergent folds. It can also generate one secondary structure diagram from an input protein structure of interest. The source code can be downloaded (https://github.com/ncbi/SSDraw) and run locally for rapid structure generation, while a Google Colab notebook allows easy use.

released a web-based collection of >772 million structures predicted from metagenomic sequences (Lin et al., 2023). Similarly, >200 million structures predicted by AlphaFold2 (Jumper et al., 2021), a highly accurate deep-learning-based model for protein structure prediction, are available through a web repository for easy user access (Varadi et al., 2021), and many have been deposited into the UniProt sequence database (UniProt, 2021).
The enormous increase in available models of protein structure presents opportunities to identify large-scale relationships between structure and other properties, such as sequence conservation or prediction confidence. Such relationships are often most effectively depicted when multiple protein structures are compared, motivating the development of structural alignment algorithms that match common elements of protein structure rather than amino acid sequence (Pettersen et al., 2021). Nevertheless, important relationships between protein structures can be obscured by three-dimensional visualizations that do not effectively convey all structural features through one image. This shortcoming especially impacts homologous proteins with nonconserved structural features arising from insertions, deletions, or mutations that cause substantial changes in secondary structure. Indeed, the need for easily interpretable structure diagrams is underscored by several recent studies highlighting how protein structure can transform dramatically in response to seemingly minor sequence changes (Chakravarty, Sreenivasan, et al., 2023; Dishman et al., 2021; Liu et al., 2023; Ruan et al., 2023; Solomon et al., 2023). To observe these transformations accurately, secondary structures of the proteins of interest must be registered, meaning that amino acids with annotated secondary structures must be aligned with their corresponding amino acids in a multiple sequence alignment (MSA). Once registered, secondary structures of homologous proteins aligned within the MSA can be compared, and their respective secondary structure diagrams become comparative. That is, the secondary structure of Protein A at position X can be compared directly to the secondary structure of its homolog, Protein B, at position X if their secondary structures are both registered to the same MSA (Figure 1). Comparative secondary structure diagrams also simplify the visualization of fold-switching proteins, single sequences evolutionarily selected to remodel their secondary and tertiary structures in response to cellular stimuli (Murzin, 2008; Porter & Looger, 2018; Schafer & Porter, 2023). In short, as increasing evidence indicates that highly similar or identical protein sequences can assume folds with drastically different secondary structures (Porter, 2023), the need to graphically depict structural differences among homologous proteins and relate them to other protein properties increases.
To effectively depict relationships between the structures of homologous proteins and other properties of interest, we present SSDraw, a Python-based program that rapidly generates secondary structure diagrams from three-dimensional protein coordinates. These linear diagrams are registered to a user-inputted MSA and colored by any property of interest. Running SSDraw once generates a diagram of one protein from an MSA. Multiple diagrams from one MSA can be generated and stacked for easy comparison. These functionalities distinguish SSDraw images from other secondary structure visualizations (Gouet et al., 2003; Hutarova Varekova et al., 2021; Hutchinson & Thornton, 1990; Kocincova et al., 2017; Midlik et al., 2022; O'Donoghue et al., 2015; Stivala et al., 2011). For instance, ESPript (Gouet et al., 2003) relates secondary structures derived from one representative protein structure to multiple homologous sequences, usually divided on multiple lines of text. This format works well when the user seeks to visualize sequence conservation patterns in a protein family with conserved secondary structures. SSDraw may be preferable if the user seeks to compare structures of homologous proteins with divergent secondary structures by stacking each diagram and comparing structural differences. As another example, Aquaria (O'Donoghue et al., 2015) also generates stackable linear secondary structure diagrams but colors by sequence conservation only. SSDraw may be preferable if the user seeks to color the stacked diagrams by a property other than sequence conservation. In short, SSDraw was written to flexibly relate secondary structure differences between homologous proteins with other protein properties of interest. While this software was originally designed for fold-switching proteins (Porter & Looger, 2018) and homologous sequences with different secondary structures (Chakravarty, Sreenivasan, et al., 2023), it also serves as a tool to quickly generate secondary structure diagrams for individual proteins with custom coloring by sequence position in seconds (local install) to minutes (Google Colab notebook).

| Software overview
SSDraw requires two inputs to run: (1) a file containing three-dimensional protein coordinates in PDB format and (2) an MSA in FASTA format (Figure 2). SSDraw requires only alpha carbon coordinates to generate an image. The user may specify the chain ID if they input a multi-chain PDB. The MSA can be generated with programs such as MUSCLE (Edgar, 2004), Clustal Omega (Sievers & Higgins, 2014), or HMMER (Finn et al., 2011), so long as it is inputted in FASTA format. The user may also input a single ungapped FASTA sequence if they are interested in generating a diagram from a single sequence.
By default, SSDraw computes secondary structure annotations for each amino acid using Define Secondary Structure of Proteins (DSSP; Joosten et al., 2011; Kabsch & Sander, 1983), which annotates secondary structure from three-dimensional protein structures based on hydrogen bonding patterns (Section 4). In lieu of a PDB file, users may input alternative secondary structure annotations (Midlik et al., 2021; Srinivasan & Rose, 1999) or precomputed DSSP annotations in .horiz format.
Annotated secondary structures are then aligned in register with the input sequence alignment (Figure 2) in FASTA format. For proper alignment, the user inputs the name of the reference sequence in the alignment. Protein structures determined by x-ray crystallography or cryo-EM often have unresolved regions due to weak or missing electron density, leading to gaps in their experimentally determined structures. These structural gaps lead to alignment gaps between reference sequences and annotated secondary structures. Accordingly, SSDraw adjusts the reference sequence to be the same length as the secondary structure annotations taken from experimentally determined structures; experimentally unresolved regions are assumed to be disordered and are therefore visualized as loops.
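The registration step described above can be illustrated with a minimal sketch. It is not SSDraw's actual procedure: for simplicity it pads the annotation at unresolved positions rather than adjusting the reference sequence, and the function names, the 0-based indexing, and the use of "L" for loop are assumptions made for illustration only.

```python
def pad_unresolved(n_residues, observed, loop="L"):
    """Build a full-length annotation string.

    observed maps a 0-based sequence index to the annotation of a resolved
    residue; unresolved residues are assumed disordered and drawn as loops.
    """
    return "".join(observed.get(i, loop) for i in range(n_residues))


def register_to_alignment(aligned_seq, ss, gap="-"):
    """Place one annotation character under each residue of a gapped sequence.

    aligned_seq is the reference sequence as it appears in the MSA; ss holds
    one character per ungapped residue. Gap columns are kept as gaps.
    """
    ungapped_len = sum(1 for c in aligned_seq if c != gap)
    if ungapped_len != len(ss):
        raise ValueError("annotation length must equal ungapped sequence length")
    annotations = iter(ss)
    return "".join(gap if c == gap else next(annotations) for c in aligned_seq)


# Five residues, two of which (indices 2 and 3) are unresolved in the structure.
full_ss = pad_unresolved(5, {0: "H", 1: "H", 4: "E"})
registered = register_to_alignment("MK--LV-A", full_ss)
```

Because the registered string has one character per alignment column, strings produced for different homologs from the same MSA can be compared position by position.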
Secondary structures are then drawn with patches from the Matplotlib (Hunter, 2007) package for Python3 (Figure 2; Section 4). Successive slanted polygons are used to represent α-helices, arrows represent β-sheets, rectangles represent loops, and empty spaces between secondary structures represent alignment gaps. Loops are layered under secondary structures. Segments of regular secondary structure shorter than 4/3 successive residues (α-helices/β-sheets), loops, β-turns, and disordered regions are represented as thin rectangles layered under secondary structure elements (Section 4).
If desired, secondary structures can be colored by sequence conservation score, B-factor, or another user-defined input (Figure 1). This feature was originally developed to compare secondary structure conservation in a family of bacterial response regulators with some secondary structure elements that switch from α-helix to β-sheet in response to stepwise mutation (Chakravarty, Sreenivasan, et al., 2023). Sequence conservation scores are computed automatically from the input sequence alignment (Section 4), though scores from Rate4Site (Pupko et al., 2002), a more accurate conservation metric, may also be inputted. Alternatively, the image can be colored with a solid fill specified by the user. For instance, the first diagram in Figure 2 was generated using a white fill. Custom coloring schemes and custom colormaps may be specified by the user.
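As an illustration of B-factor coloring, the sketch below reads C-alpha B-factors from the fixed-width ATOM records of a PDB file and rescales them to the range 0 to 1 for use with a colormap. The function names are hypothetical, and min-max scaling is an assumption; the text does not specify how SSDraw rescales its scores.

```python
def ca_bfactors(pdb_text, chain="A"):
    """Return B-factors of C-alpha atoms in one chain, in file order."""
    values = []
    for line in pdb_text.splitlines():
        # PDB columns: atom name 13-16, chain ID 22, temperature factor 61-66.
        if line.startswith("ATOM") and line[12:16].strip() == "CA" and line[21] == chain:
            values.append(float(line[60:66]))
    return values


def min_max(values):
    """Rescale values to [0, 1] (assumed scheme); constant input maps to 0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def atom_line(serial, name, res, chain, resseq, b):
    """Helper that writes a column-correct ATOM record for the toy example."""
    return (f"ATOM  {serial:5d} {name:^4s} {res:3s} {chain}{resseq:4d}    "
            f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{b:6.2f}")


toy_pdb = "\n".join([
    atom_line(1, "N", "MET", "A", 1, 99.0),   # not a C-alpha atom, ignored
    atom_line(2, "CA", "MET", "A", 1, 10.0),
    atom_line(3, "CA", "LYS", "A", 2, 20.0),
    atom_line(4, "CA", "LEU", "A", 3, 40.0),
    atom_line(5, "CA", "GLY", "B", 1, 70.0),  # other chain, ignored
])
raw = ca_bfactors(toy_pdb, chain="A")
scaled = min_max(raw)
```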
If the user wants to assign custom coloring scores to each residue, they have two options. The first is to upload a custom scoring file that contains residue-specific scores. This file is formatted with two columns: column one contains one-letter amino acid codes for each residue to be colored; column two contains scores corresponding to the amino acids in column one; columns are delimited by one space. The second option for custom scoring is to input a PDB file with C-alpha B-factors corresponding to custom scores and to color the image by B-factor. This option allows the user to easily visualize confidence scores from structure predictors such as AlphaFold2 (Jumper et al., 2021) and ESMfold (Lin et al., 2023), if desired. Any range of scores can be used for custom coloring: scores are normalized before the image is colored. Because SSDraw uses the Matplotlib (Hunter, 2007)

2.2 | Advanced example 1: Comparing distinct structures with highly identical sequences using a custom color map
SSDraw can be used to compare secondary structures of proteins with high levels of sequence identity but different folds (Figure 4). Extensive work has been performed to engineer (Alexander et al., 2007; Alexander et al., 2009; He et al., 2012; Ruan et al., 2023) and characterize (Sikosek et al., 2016; Solomon et al., 2023; Tian & Best, 2020) variants of the human serum albumin-binding protein GA and the immunoglobulin binding protein GB. While GA folds into a trihelical bundle, GB folds into a 4β + α structure. One or several mutations can cause the protein to flip from one ground-state fold to the other (Alexander et al., 2009; He et al., 2012). The distinct secondary structures of GA and GB variants can be visualized readily with SSDraw (Figure 4). The top structure (GB95) is the reference and therefore has no mutations. Three mutations (cyan) to GB95 switch its fold to the three-helical bundle (GA95); two mutations to GA95 (magenta) switch it back to the 4β + α fold (GB98). GB98 can be switched back to the trihelical fold with one mutation (yellow), which can be switched back to the 4β + α fold with three mutations (GB98-T25I, L20A, white). Finally, another single mutation (green) switches GB98-T25I, L20A back to the trihelical fold (GB98-T25I). Interestingly, fold-switching mutations tend to occur in the central region of the protein (residues 20, 25, and 30) rather than at the termini, where the closest known fold-switching mutation is 11 residues away from the C-terminus (position 45). Furthermore, all mutations occur in regions of secondary structure rather than loops.

| Advanced example 2: Comparing sequence conservation in similar structures with a default color map
SSDraw can also be used to relate sequence conservation to secondary structure in protein families with conserved folds. These comparisons for ubiquitin and ubiquitin-like proteins (Walters et al., 2004) are shown in Figure 5. Not surprisingly, sequences in loop regions tend to be least conserved, while sequences that fold into secondary structures tend to be more conserved. One exception is the second β-sheet, which has been identified as a SUMO1 binding site and putative NEDD8 binding motif by NMR spectroscopy (Song et al., 2004) and structural modeling (He et al., 2017), respectively. Thus, sequence variation in the second β-sheet may foster different binding functions in different ubiquitin-like proteins. Sequence conservation was calculated directly from the input sequence alignment (Section 4).

FIGURE 3 SSDraw has three modes of use. In the first mode, the user inputs a protein structure and an ungapped sequence; a single ungapped secondary structure diagram is outputted (A). In the second mode, the user inputs a protein structure and a gapped sequence; a single gapped secondary structure diagram is outputted (B). In the third mode, the user inputs multiple structures and a multiple sequence alignment that aligns their sequences; multiple stacked secondary structure diagrams are outputted (C). In all three panels, the experimentally determined structure of the transcriptional regulator RfaH (Belogurov et al., 2007; Zuber et al., 2018; PDB ID: 5OND, chain A, dark purple) and its sequence (in different alignments) are inputted. In panel (C), its homolog NusG (Kang et al., 2018; PDB ID: 6ZTJ, chain CF, light purple) is also inputted with a multiple sequence alignment (MSA). RfaH and NusG are members of the only known family of transcriptional regulators conserved from bacteria to humans (Werner, 2012). They share a structurally conserved N-terminal domain, while their C-terminal domains differ dramatically in the ground state (Burmann et al., 2012; Porter et al., 2022): RfaH's is all α-helical, while NusG's is all β-sheet.

| DISCUSSION AND CONCLUSIONS
SSDraw generates publication-quality secondary structure diagrams in seconds to minutes. These diagrams can be colored by conservation score, B-factors, or a user-specified metric, allowing relationships between secondary structure and other protein properties to be observed readily. SSDraw is expected to be most useful for comparing secondary structures of homologous proteins with different folds, an emerging class of proteins (Chakravarty, Schafer, & Porter, 2023) for which few computational tools are available. Nevertheless, SSDraw may also be used to (1) diagram single structures and color them by any property of interest and (2) compare secondary structures of homologous proteins with conserved folds.

| Secondary structure annotation
SSDraw uses DSSP (Joosten et al., 2011; Kabsch & Sander, 1983) to annotate secondary structure from three-dimensional protein coordinates in PDB format. The local install uses the DSSP module in Biopython (Cock et al., 2009) to parse the annotations generated by separate compiled software. Only C-alpha coordinates are necessary for annotation. In addition to regular secondary structure (α-helices and β-sheets), DSSP annotates various local structures such as β-turns and 3₁₀ helices. These features are not displayed in SSDraw diagrams because they are not represented well enough. Due to limitations of the patches library, at least 4 consecutive identical annotations (e.g., HHHH or EEEE) would be needed to introduce additional structural elements into these diagrams. Table 1 shows that α-helices, β-sheets, and loops comprise 87% of all consecutive identical annotations; the next most frequent annotation is Turns, representing 4%. These statistics were calculated from DSSP annotations of 185,725 unique PDB files. Helices are drawn for at least four consecutive "H" annotations, and β-sheets are drawn for at least three consecutive "E" or "B" annotations, combined in any way. All other annotations are visualized as loops. Short helices with <4 consecutive "H" annotations and short β-sheets with <3 "E" or "B" annotations are also visualized as loops.
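The length thresholds above can be expressed as a short sketch that collapses a per-residue DSSP string into Helix, Strand, and Loop segments. The function name and the (category, start, length) tuple layout are assumptions for illustration and are not taken from the SSDraw source; alignment gaps are not handled here.

```python
from itertools import groupby


def classify_segments(dssp, min_helix=4, min_strand=3):
    """Collapse a DSSP string into (category, start, length) segments."""

    def provisional(code):
        # "H" is a helix; "E" and "B" count together as strand; all else is loop.
        if code == "H":
            return "Helix"
        if code in ("E", "B"):
            return "Strand"
        return "Loop"

    runs = []
    pos = 0
    for category, group in groupby(dssp, key=provisional):
        n = len(list(group))
        # Runs that are too short to draw are demoted to loops.
        too_short = (category == "Helix" and n < min_helix) or (
            category == "Strand" and n < min_strand
        )
        runs.append(("Loop" if too_short else category, pos, n))
        pos += n

    # Merge neighboring loop runs created by the demotion step.
    merged = []
    for category, start, n in runs:
        if merged and category == "Loop" and merged[-1][0] == "Loop":
            _, start0, n0 = merged[-1]
            merged[-1] = ("Loop", start0, n0 + n)
        else:
            merged.append((category, start, n))
    return merged


segments = classify_segments("TTHHHHHSEEBGGHHHEE")
# The trailing "GGHHHEE" becomes one loop: the 3-residue helix and the
# 2-residue strand fall below the thresholds.
```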
In some cases, the user who wishes to install SSDraw locally may have difficulty installing DSSP with conda. In that case, the user may run SSDraw with the PyDSSP library instead. PyDSSP is a simplified PyTorch-based implementation of DSSP that makes three-state secondary structure annotations (Helix, Sheet, Loop) that match DSSP 97% of the time (https://github.com/ShintaroMinami/PyDSSP).

| Drawing secondary structures
Annotated secondary structures are grouped into three categories: Loop, Helix, and Strand. The lengths of each segment of structure in each category are calculated. Then, each category is drawn separately using the patches library from Matplotlib (Hunter, 2007) for Python3. First, Loops are drawn with the Rectangle patch; loop lengths are calculated as the number of consecutive annotations divided by 6.0. When Loops connect elements of secondary structure, they are extended at both ends by 1.0/6.0. All loops have a zorder of 0 so that their images are layered under strand and helix diagrams. Then, coordinates for images of β-sheets and α-helices are stored to be drawn later for better performance. Strands are drawn using the FancyArrow patch with a width of 1.0, linewidth of 0.5, zorder = index increasing over all secondary structures from left to right, head_width of 2.0, and head_length of 2.0/6.0. Length is defined as the number of consecutive annotations for the strand being drawn divided by 6.0; to avoid incorrect gapping, this length is extended by 1.0/6.0 if C-terminal elements of secondary structure follow the strand. Helices are drawn as stacked Polygon patches with right-leaning patches layered on top and left-leaning patches layered underneath. The short sides of the polygons measure 1.0/6.0; the long sides measure 1.8/6. Helices begin and end with shorter polygons that align with other secondary structures (height of 1.4/6, width of 1.0/6). All lengths are proportional measures scaled to fit into a figure 25 inches long. Consequently, shorter proteins will have larger secondary structures in the horizontal dimension and vice versa. Vertical heights of all secondary structures are kept constant.
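The horizontal extents described above can be sketched in plain Python. This is one reading of the text rather than the SSDraw implementation: it assumes a loop is extended only when it has a helix or strand on both sides, that a strand is extended whenever any segment follows it, and that helices simply span their residue count. The names `UNIT` and `layout` are hypothetical.

```python
UNIT = 1.0 / 6.0  # horizontal extent of one residue, in the text's units


def layout(segments):
    """Compute (category, x_start, length) for each segment.

    segments is a list of (category, start_residue, n_residues) tuples
    ordered from N- to C-terminus.
    """
    placed = []
    last = len(segments) - 1
    for i, (category, start, n) in enumerate(segments):
        x0 = start * UNIT
        length = n * UNIT
        if category == "Loop":
            # Assumed rule: extend by one unit at each end when the loop
            # connects two elements of secondary structure.
            flanked = (
                0 < i < last
                and segments[i - 1][0] != "Loop"
                and segments[i + 1][0] != "Loop"
            )
            if flanked:
                x0 -= UNIT
                length += 2 * UNIT
        elif category == "Strand" and i < last:
            # Assumed rule: extend by one unit when something follows the strand.
            length += UNIT
        placed.append((category, x0, length))
    return placed


example = [("Loop", 0, 2), ("Helix", 2, 5), ("Loop", 7, 1), ("Strand", 8, 3), ("Loop", 11, 7)]
placed = layout(example)
```

In a drawing routine, each tuple would supply the x position and width of the corresponding Matplotlib patch, with loops given the lowest zorder so that they sit under helices and strands.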

| Coloring secondary structures
FIGURE 5 Diagrams for ubiquitin and ubiquitin-like proteins colored by sequence conservation score (1.0 is highly conserved; 0.0 is least conserved). Sequences of secondary structure elements tend to be more conserved, with the notable exception of the second β-sheet, whose binding functions vary among some ubiquitin-like proteins.

Secondary structures have black edges; their insides are filled by clipping an input colormap equal in size to the diagram. Groups of loops, helices, and strands are each converted to clipping paths using Matplotlib's mpath.Path command. These paths are then converted to patches with mpatch.PathPatch. Finally, an input colormap equal in size to the diagram is generated from user-specified parameters or a solid color and clipped to fill the insides of the path (im.set_clip_path command); the rest of the colormap is discarded. Repetitively generating the colormap slows performance considerably. For instance, generating one diagram of a 215-residue response regulator with a mixture of helices and strands (PDB ID: 1A04) takes 1 min, 5 s when a colormap for each secondary structure element, including every polygon used to make the helices, must be generated. To improve performance, SSDraw generates colormaps three times, once for each class of secondary structure. Running this improved implementation hastened image generation of 1A04 to 2.6 s, a 25-fold speed-up from 1 min, 5 s. The Google Colab notebook takes about 2 min to generate its first secondary structure diagram because it must load outside software packages, such as DSSP, before running.
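The clipping approach described in this section can be sketched with Matplotlib as follows. The sketch groups several polygons into one compound path, wraps it in a single PathPatch, and clips one colormap image to it, so the image is generated once per group rather than once per polygon. The function name, the toy shapes, and the choice of the "viridis" colormap are assumptions for illustration; SSDraw's own code may differ.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.patches as mpatch
import matplotlib.path as mpath
import matplotlib.pyplot as plt
import numpy as np


def fill_with_colormap(ax, polygons, scores, cmap="viridis", extent=(0, 10, 0, 1)):
    """Clip a single colormap image to the union of many polygons."""
    verts, codes = [], []
    for poly in polygons:
        # Each polygon becomes one closed sub-path of the compound path.
        verts.extend(poly + [poly[0]])
        codes.extend(
            [mpath.Path.MOVETO]
            + [mpath.Path.LINETO] * (len(poly) - 1)
            + [mpath.Path.CLOSEPOLY]
        )
    path = mpath.Path(verts, codes)
    patch = mpatch.PathPatch(path, facecolor="none", edgecolor="black", linewidth=0.5)
    ax.add_patch(patch)
    # One image spanning the whole diagram; only the part inside the patch shows.
    im = ax.imshow(
        np.atleast_2d(scores),
        cmap=cmap,
        extent=extent,
        aspect="auto",
        interpolation="nearest",
    )
    im.set_clip_path(patch)
    return im, patch


fig, ax = plt.subplots(figsize=(5, 1))
rect = [(0, 0.4), (3, 0.4), (3, 0.6), (0, 0.6)]              # loop-like rectangle
arrow = [(3, 0.2), (6, 0.2), (7, 0.5), (6, 0.8), (3, 0.8)]   # strand-like arrow
im, patch = fill_with_colormap(ax, [rect, arrow], scores=np.linspace(0, 1, 10))
```

Calling such a function once for loops, once for helices, and once for strands mirrors the three colormap generations mentioned above.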

| Conservation scores
Conservation scores are computed directly from an input sequence alignment. First, the consensus sequence is determined by calculating the most common amino acid in each column of the alignment. A conservation score is then calculated for each column i in two steps:
1. The number, N, of amino acids in column i with substitution scores ≥0 for the consensus residue in column i is determined. Substitution scores are calculated using the BLOSUM62 (Henikoff & Henikoff, 1992) matrix supplied by Biopython (Cock et al., 2009).
2. N is then normalized by the total number of amino acids in column i. Gaps are not included in the normalization.
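The two steps above can be sketched as follows. To keep the example self-contained, the substitution scores are passed in as a callable and the demonstration uses a small hand-entered table covering only four amino acids; in practice the callable could wrap the BLOSUM62 matrix that Biopython provides. The function names, the tie-breaking rule for the consensus residue, and the score of 0.0 for all-gap columns are assumptions, not details taken from SSDraw.

```python
from collections import Counter


def conservation_scores(alignment, subst, gap="-"):
    """Score each alignment column by similarity to its consensus residue.

    alignment is a list of equal-length aligned sequences; subst(a, b)
    returns the substitution score between residues a and b.
    """
    scores = []
    for column in zip(*alignment):
        residues = [c for c in column if c != gap]  # gaps are excluded
        if not residues:
            scores.append(0.0)  # assumed convention for all-gap columns
            continue
        # Most common residue; ties go to the residue seen first (assumption).
        consensus = Counter(residues).most_common(1)[0][0]
        n = sum(1 for r in residues if subst(r, consensus) >= 0)
        scores.append(n / len(residues))
    return scores


# Toy substitution table for illustration only (four residue types).
TOY = {
    ("I", "I"): 4, ("V", "V"): 4, ("L", "L"): 4, ("D", "D"): 6,
    ("I", "V"): 3, ("I", "L"): 2, ("L", "V"): 1,
    ("D", "I"): -3, ("D", "L"): -4, ("D", "V"): -3,
}


def toy_subst(a, b):
    """Symmetric lookup in the toy table."""
    return TOY[(a, b)] if (a, b) in TOY else TOY[(b, a)]


msa = ["ILD", "VLD", "I-D", "DLI"]
scores = conservation_scores(msa, toy_subst)
```

In the first column the consensus is I; V scores non-negatively against it while D does not, so three of four residues count toward N.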
SSDraw can also take Consurf and Rate4Site scores as input. Consurf scores are taken directly from the input file and used to color the output structure with no modification to the values. Rate4Site scores are normalized and grouped into nine bins as in Ashkenazy et al. (2016).

FIGURE 1 Comparative secondary structure diagrams result from registering secondary structure annotations with their corresponding aligned amino acid sequences. Secondary structures of unregistered diagrams (above) are not aligned, disallowing reliable inferences about secondary structure evolution. Diagrams align when their secondary structures have been registered with their corresponding aligned amino acid sequences (comparative secondary structure diagrams, below), suggesting possible secondary structure evolution where α-helices align with β-sheets. Secondary structure diagrams were made from structures of bacterial response regulators FixJ (PDB ID: 5XSO, chain A) and KdpE (PDB ID: 4KFC, chain A). The C-terminal domains of FixJ and KdpE are colored blue and red, respectively, indicating different folds (helix-turn-helix, blue; winged helix, red), while their structurally conserved N-terminal domains are gray. Arrows pointing to the gray domains indicate homologous secondary structures; arrows pointing to the colored domains indicate divergent secondary structures. Previous phylogenetic analysis and ancestral reconstruction (Chakravarty, Sreenivasan, et al., 2023) indicate that the C-terminal β-sheet of the winged helix evolved from the C-terminal α-helix of the helix-turn-helix by stepwise mutation.

FIGURE 4 Comparing the structures of proteins with highly identical amino acid sequences but different folds. Diagrams show very different secondary structures derived from the nuclear magnetic resonance structures (Protein Data Bank [PDB] IDs in parentheses) of engineered variants of immunoglobulin binding protein GB (4β + α fold) and human serum albumin binding protein GA (trihelical bundle). This figure should be read from top to bottom. Position-specific mutations required to switch a given fold from that of its predecessor, the diagram directly above it, are shown in different colors representing mutations unique to each protein. Black positions were not mutated relative to their immediate predecessors.
Abbreviation: DSSP, Define Secondary Structure of Proteins; PDB, Protein Data Bank.