The BackMAP Python module: how a simpler Ramachandran number can simplify the life of a protein simulator

Protein backbones occupy diverse conformations, but compact metrics to describe such conformations and transitions between them have been missing. This report re-introduces the Ramachandran number (ℛ) as a residue-level structural metric that could simplify the life of anyone contending with large numbers of protein backbone conformations (e.g., ensembles from NMR and trajectories from simulations). Previously, ℛ was introduced using a complicated closed form, which made it difficult to implement. This report discusses a much simpler closed form of ℛ that is easy to calculate and therefore easy to implement. Additionally, this report discusses how ℛ dramatically reduces the dimensionality of the protein backbone, thereby making it ideal for simultaneously interrogating large numbers of protein structures. For example, 200 distinct conformations can easily be described in one graphic using ℛ (rather than 200 distinct Ramachandran plots). Finally, a new Python-based backbone analysis tool, BackMAP, is introduced, which reiterates how ℛ can be used as a simple and succinct descriptor of protein backbones and their dynamics.

Large-scale changes in a protein occur due to changes in protein backbone conformations. Figure 1 is a cartoon representation of a peptide/protein backbone, with the backbone bonds themselves represented by darkly shaded bonds. Ramachandran, Ramakrishnan & Sasisekharan (1963) had recognized that the backbone conformational degrees of freedom available to an amino acid (residue) i can be almost completely described by only two dihedral angles: ϕ_i and ψ_i (Fig. 1, green arrows). Today, Ramachandran plots are used to qualitatively describe protein backbone conformations.
The Ramachandran plot is recognized as a powerful tool for two reasons: (1) it serves as a map for structural "correctness" (Laskowski et al., 1993; Hooft, Sander & Vriend, 1997; Laskowski, 2003), since many regions within the Ramachandran plot space are energetically not permitted (Momen et al., 2017); and (2) it provides a qualitative snapshot of the structure of a protein (Berg, Tymoczko & Stryer, 2010; Alberts et al., 2002; Subramanian, 2001; Lovell et al., 2003). For example, particular regions within the Ramachandran plot indicate the presence of particular locally-ordered secondary structures such as the α-helix and β-sheet (see Fig. 2A).
While the Ramachandran plot has been useful as a measure of protein backbone conformation, it is not popularly used to assess structural dynamism and transitions (unless specific knowledge exists about whether a particular residue is believed to undergo a particular structural transition). This is because of the two-dimensionality of the plot: describing the behavior of every residue involves tracking its position in two-dimensional (ϕ, ψ) space. For example, a naive description of positions of a peptide in a Ramachandran plot (Fig. 2B) needs more annotations for a per-residue analysis of the peptide backbone's structure. Given enough residues, it would be impractical to track the position of each residue within a plot. This is compounded by time, as each point in Fig. 2B becomes a curve in Fig. 2C, further confounding the situation. Picking out previously unseen conformational transitions and dynamism therefore becomes a logistical impracticality.
Beyond Ramachandran plots, tracking changes in a protein trajectory is either overly detailed or overly holistic: an example of an overly detailed study is the tracking of exactly one or a few atoms over time (this already poses a problem, since we would need to know exactly which atoms are expected to partake in a transition); an example of a holistic metric is the radius of gyration (this also poses a problem, since we will never know which residues contribute to a change in radius of gyration without additional interrogation). With our understanding of protein dynamics undergoing a new renaissance (especially due to intrinsically disordered proteins and allostery), having hypothesis-agnostic yet detailed (residue-level) metrics of protein structure has become even more relevant. Unfortunately, there has been no single compact descriptor of protein structure, which impedes the naïve or hypothesis-free exploration of new trajectories/ensembles.
It has recently been shown that the two Ramachandran backbone parameters (ϕ, ψ) may be conveniently combined into a single number, the Ramachandran number [ℛ(ϕ, ψ) or simply ℛ], with little loss of information (Fig. 3; Mannige, Kundu & Whitelam, 2016). In a previous report, detailed discussions were provided regarding the reasons behind and derivation of ℛ (Mannige, Kundu & Whitelam, 2016). This report provides a simpler version of the equation previously published (Mannige, Kundu & Whitelam, 2016), and further discusses how ℛ may be used to provide information about protein ensembles and trajectories. Finally, this report introduces a software package, BACKMAP, that can be used to produce pictograms that describe the […]

Figure 2 […] (C) represents a trajectory within which the backbone conformation is allowed to change. Here, each curve represents the evolution of each residue, with corners representing discrete states. While the Ramachandran plot is useful for getting a qualitative sense of peptide backbone structure (A, B), it is not a convenient representation for exploring peptide backbone dynamics (C). Secondary structure keys used here and throughout the document: α, α-helix; 3_10, 3_10-helix; β, β-sheet/extension; ppII, polyproline II helix.

INTRODUCING THE SIMPLIFIED RAMACHANDRAN NUMBER (ℛ)
The Ramachandran number is both an idea and an equation. Conceptually, the Ramachandran number (ℛ) is any closed form that collapses the dihedral angles ϕ and ψ into one structurally meaningful number (Mannige, Kundu & Whitelam, 2016). Mannige, Kundu & Whitelam (2016) presented a version of the Ramachandran number (shown in the appendix as Eq. (7)) that was complicated in closed form, thereby reducing its utility. Here, a simpler and more accurate version of the Ramachandran number is introduced. The appendix shows how this simplified form was derived from the original closed form (Eq. (7)). Given arbitrary limits of ϕ ∈ [ϕ_min, ϕ_max) and ψ ∈ [ψ_min, ψ_max), where the minimum and maximum values differ by 360°, the most general and accurate equation for the Ramachandran number is

$$\mathcal{R}(\phi,\psi) = \frac{\phi + \psi - (\phi_{\min} + \psi_{\min})}{(\phi_{\max} + \psi_{\max}) - (\phi_{\min} + \psi_{\min})}.$$

For consistency, we maintain throughout this paper that ϕ_min = ψ_min = -180° or -π radians, which makes

$$\mathcal{R}(\phi,\psi) = \frac{\phi + \psi + 2\pi}{4\pi}.$$

As evident in Fig. 3, the secondary structure distributions within the Ramachandran plot are faithfully reflected in corresponding distributions within Ramachandran number space. This paper shows how the Ramachandran number is both compact enough and informative enough to generate immediately useful graphs (MAPs) of a dynamic protein backbone.

Ramachandran numbers are structurally meaningful
In addition to resolving positions of secondary structures (Fig. 3), ℛ corresponds well to structural measures such as radius of gyration (R_g), end-to-end distance (R_e), and chirality (χ). These relationships are shown in Fig. 4. Note that chirality comes in many forms: for example, one could be talking about different stereo-isomers, such as L vs D amino acids, or one could be concerned with left-twisting versus right-twisting backbones, that is, handedness (Mannige, 2017). This report will primarily be focused on chirality in the context of backbone twist/handedness.
The trends in Fig. 4 show that as one progresses from low to high ℛ, various structural properties also progress smoothly. Additionally, backbones that display similar ℛ also show little variation in structural properties, as evidenced by the small standard deviation bars. It is also important to note that the standard deviations shown in Fig. 4 were calculated by first populating every possible region of (ϕ, ψ)-space. However, in reality, most regions of (ϕ, ψ)-space are unoccupied due to steric/electrostatic constraints, which means that these error bars are likely to be even smaller for natural protein backbones than those depicted here. Finally, the ℛ number is calculated by taking "sweeps" of the (ϕ, ψ)-space in lines that are parallel to the negatively-sloping diagonal. Interestingly, such "sweeps" encounter only one major (dense) region within (ϕ, ψ)-space (e.g., ℛ values in the general vicinity of 0.34 represent structures that resemble α-helices). This means that ℛ can also be used to assess the types of secondary structure present in a protein conformation.

Figure 4 Relationships between ℛ and other structural features. The Ramachandran number ℛ displays smooth relationships with respect to radius of gyration (R_g; A), end-to-end distance (R_e; B), and chirality (χ; C), as calculated within Mannige (2017). Light blue lines are average trends, dark blue horizontal lines are error bars. Average positions of dominant secondary structures are shown to the right. These trends explain why ℛ is a useful and compact structural measure. Structural measures R_g, R_e, and χ were obtained by computationally generating polyglycine peptides of length 10 for all possible ϕ, ψ ∈ {-180°, -175°, …, 175°, 180°}. This was done using the Python library PeptideBuilder (Tien et al., 2013). Values for R_g, R_e, and χ were obtained for each peptide and binned with respect to its ℛ(ϕ, ψ) (each bin represents a region in ℛ space that is 0.01 in width). Given that actual values for R_g and R_e mean little (since one rarely deals with polyglycines of length 10), actual values are omitted. χ ranges from -1 to +1. DOI: 10.7717/peerj.5745/fig-4

Ramachandran codes are stackable
An important aspect of the Ramachandran number (ℛ) lies in its compactness compared to the traditional Ramachandran pair (ϕ, ψ). The value of the conversion from (ϕ, ψ)-space to ℛ-space is that the structure of a protein can be described in various one-dimensional arrays (per-structure "Ramachandran codes" or "R-codes" or multi-angle maps); see, e.g., Fig. 5.
In addition to assuming a small form factor, R-codes may then be stacked side-by-side for visual and computational analysis. Therein lies their true power.
For example, the one-ℛ-to-one-residue mapping means that the entire residue-by-residue structure of a protein can be shown using a string of ℛ_i values (which would show regions of secondary structure and disorder, for starters). Additionally, an entire protein's backbone makeup can be shown as a histogram in ℛ-space (which may reveal a protein's topology). The power of this format lies not only in the capacity to distill complex structure into compact spaces, but in its capacity to display many complex structures in this format, side-by-side (stacking).

Figure 5 Two types of R-codes. Digesting protein structures (A) using ℛ numbers either as histograms (B) or per-residue codes (C) allows for compact representations of salient structural features. For example, a single glance at the histograms indicates that protein 1mba is likely all α-helical, while 2acy is likely a mix of α-helices and β-sheets. Additionally, residue-specific codes (C) not only indicate secondary structure content, but also secondary structure stretches (compare to D), which gives a more complete picture of how the protein is linearly arranged.

Peptoid nanosheets (Mannige et al., 2015) will be used here as an example of how multiple structures, in the form of R-codes, may be stacked to provide immediately useful pictograms. Peptoids are positional isomers of peptides, where the sidechain is attached to the backbone nitrogen rather than the carbon atom. Since both peptoids and peptides share identical backbone connectivity, the analysis described below could be applied to both peptides and peptoids. Peptoid nanosheets are a recently discovered peptide mimic that, in one molecular dynamics simulation (Mannige et al., 2015), were shown to display a novel secondary structure. In the reported model (Mannige et al., 2015), each peptoid within the nanosheet displays backbone conformations that alternate in chirality, causing the backbone to look like a meandering snake that nonetheless maintains an overall linear direction. This secondary structure was discovered by first setting up a nanosheet where all peptoid backbones were restrained to be fully extended, after which the restraints were energetically softened and then completely released. Figure 6A shows snapshots of each of these states. As evident in Figs. 6B and 6C, the two types of R-code stacks display salient information at first glance: (1) Fig. 6B shows that the extended backbone first undergoes some rearrangement with softer restraints, and then becomes more binary in arrangement as we look down the backbone; and (2) Fig. 6C shows that lifting restraints on the backbone causes a dramatic change in backbone topology, namely the birth of a bimodal distribution evident in the two parallel horizontal bands.
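The stacking idea described above can be sketched as follows. This is a minimal illustration rather than BACKMAP code: the helper names `r_code` and `stack_codes` and the toy dihedral values are assumptions, chosen only to mimic an extended frame followed by a frame that alternates in twist.

```python
def r_code(dihedrals):
    """Per-residue R-code: one Ramachandran number per (phi, psi) pair (degrees)."""
    return [(phi + psi + 360.0) / 720.0 for phi, psi in dihedrals]


def stack_codes(codes):
    """Stack per-frame R-codes side by side.

    Returns a residue-by-frame table: row i holds residue i across all frames,
    so each column is one structure and each row is one residue's history.
    """
    return [list(row) for row in zip(*codes)]


# Toy two-residue "trajectory" (values invented for illustration):
# frame 1 is fully extended, frame 2 alternates between R < 0.5 and R > 0.5.
frames = [
    [(-180.0, 180.0), (-180.0, 180.0)],
    [(-60.0, -45.0), (60.0, 45.0)],
]
stacked = stack_codes([r_code(frame) for frame in frames])

if __name__ == "__main__":
    for residue_index, row in enumerate(stacked, start=1):
        print(residue_index, ["%.3f" % value for value in row])
```

Drawing `stacked` as an image (residue on one axis, frame on the other, ℛ as color) gives the kind of per-residue pictogram discussed for Fig. 6B.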
By utilizing ℛ, maps such as those in Fig. 6 provide information about every ϕ and ψ within the backbone. As such, these maps are dubbed MAPs. A Python package called BACKMAP created Figs. 6B and 6C, and is provided as a GitHub repository (https://github.com/ranjanmannige/BackMAP). BACKMAP takes in a PDB structure file containing a single structure, or multiple structures separated by the code "MODEL."

Figure 7 […] While these plots are instructive, it is difficult to compare such plots. For example, it is difficult to pick out the change in the α-helical region of the proline plot (pro). However, when we convert Ramachandran plots to Ramachandran lines [by converting (ϕ_i, ψ_i) → ℛ_i], we are able to conveniently "stack" Ramachandran lines calculated for each residue (B). Then, even visually, it becomes obvious that proline does not occupy the canonical α-helix region, which is not evident to an untrained eye in (A). DOI: 10.7717/peerj.5745/fig-7

Case study: picking out subtle differences from high volume of data
This section expands on the notion that ℛ numbers, due to their compactness/stackability, can be used to pick out backbone structural trends that would be hard to decipher using any other metric. For example, it is well known that prolines (pro) display unusual backbone behavior: in particular, proline backbones occupy structures that are close to but distinct from α-helical regions. Due to the two-dimensionality of Ramachandran plots (Fig. 7A), such distinctions are hard to pick out visually. However, stacking per-amino-acid R-codes side-by-side makes such differences evident (Fig. 7B; see arrow).
It is also known that amino acids preceding prolines display unusual shifts in backbone twist/chirality. For example, Fig. 8C shows that amino acids appearing before prolines behave differently than they would otherwise (see the upward-facing arrow). Additionally, amino acids following glycines also appear to have their structures modified (Fig. 8D; upward arrow). Note that these results are not new, and it has already been confirmed that, for example, nearest neighbors affect the conformational behavior of an amino acid as witnessed within Ramachandran plots (Ting et al., 2010), and proline changes the backbone conformation of the preceding residue (Gunasekaran et al., 1998; Ho & Brasseur, 2005). However, Figs. 7 and 8 indicate that such information can be more concisely shown/identified when structures are stacked side-by-side in the form of R-codes. Such subtle changes are often witnessed when protein backbones transition from one state to another.

Figure 8 How residue neighbors modify structure. Similar to Fig. 7B, (A) represents the behavior of an amino acid "X" situated before a leucine (X-leu; assuming that we are reading a sequence from the N terminal to the C terminal). (B) similarly represents the behavior of specific amino acids situated before a proline (X-pro). While residues preceding a leucine behave similarly to their average behavior (Fig. 7A), most residues preceding prolines appear to be enriched in structures that change "direction" or backbone chirality (this is evident by many amino acids switching from ℛ < 0.5 to ℛ > 0.5). (C) and (D) show the behavior of individual amino acids when situated before and after each of the 20 amino acids, respectively. (C) and (D) show a major benefit of side-by-side Ramachandran line "stacking": general trends become much more obvious. For example, it is evident that prolines dramatically modify the structure of an amino acid preceding it (compared to average behavior of amino acids in Fig. 7B), while residues following glycines also have a higher prevalence of ℛ > 0.5 conformations (both trends are indicated by small arrows). Such trends, while previously discovered (see text), would not be accessible when naïvely considering Ramachandran plots because one would require 400 (20 × 20) distinct Ramachandran plots to compare. Note that the statistics for each R-line in (C) and (D) are dependent on the joint prevalence of the residues being considered. For this reason, some R-lines (e.g., those associated with cysteines) look more rough or "dotty" than others. DOI: 10.7717/peerj.5745/fig-8

USING THE BACKMAP PYTHON MODULE
Installation
BACKMAP may either be installed locally by downloading the GitHub repository, or installed directly by running the following line in the command prompt (assuming that pip exists):
> pip install backmap

Usage
The module can either be imported and used within existing scripts, or used as a standalone package using the command "python -m backmap." First the in-script usage will be discussed.

Select and read a protein PDB file
Each trajectory frame must be a set of legitimate protein databank (PDB) "ATOM" records separated by "MODEL" keywords (distinct models show up as distinct frames on the x-axis or abscissa).
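To make the expected input concrete, the sketch below splits PDB-formatted text into frames at "MODEL"/"ENDMDL" records and reads the fixed-column fields of an "ATOM" record. It is an independent illustration that uses only the standard PDB column layout; it is not BACKMAP's reader, and the names `split_models`, `parse_atom`, and `sample` are hypothetical.

```python
def split_models(pdb_text):
    """Split PDB text into frames; each frame is a list of ATOM record lines."""
    frames, current = [], []
    for line in pdb_text.splitlines():
        record = line[:6].strip()
        if record in ("MODEL", "ENDMDL"):
            # A model boundary closes whatever frame is currently open.
            if current:
                frames.append(current)
            current = []
        elif record == "ATOM":
            current.append(line)
    if current:  # file without MODEL records: one single frame
        frames.append(current)
    return frames


def parse_atom(line):
    """Return (atom name, residue number, (x, y, z)) from one ATOM record."""
    name = line[12:16].strip()      # columns 13-16
    resid = int(line[22:26])        # columns 23-26
    xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
    return name, resid, xyz


# Invented two-frame example with two backbone atoms per frame.
sample = """\
MODEL        1
ATOM      1  N   GLY A   1       0.000   0.000   0.000
ATOM      2  CA  GLY A   1       1.458   0.000   0.000
ENDMDL
MODEL        2
ATOM      1  N   GLY A   1       0.100   0.000   0.000
ATOM      2  CA  GLY A   1       1.500   0.000   0.000
ENDMDL
"""

if __name__ == "__main__":
    frames = split_models(sample)
    print(len(frames), [len(frame) for frame in frames])
```

Each frame returned here would correspond to one position on the x-axis of the resulting graphs.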

Select color scheme (color map)
In addition to custom colormaps listed in the next section, one can also use traditional colormaps available at matplotlib.org (e.g., "Reds" or "Reds_r").
Additionally, by changing how one assigns values to "X" and "Y," one can easily construct and draw other types of graphs such as time-resolved histograms, per-residue fluctuations when compared to the first (D_1) and previous structure (D_-1) within the trajectory, etc.

In-script usage III: creating custom graphs
Other types of graphs can be easily created by modifying part three of the code above. For example, the following code creates histograms of ℛ, one for each model (starting from line 11 above).
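The original listing is not reproduced in this text, so the following is only a hedged stand-in that shows the idea of one ℛ histogram per model. It assumes the per-residue data are available as (model, residue, ℛ) rows; that layout, the function name `histograms_by_model`, and the bin count are assumptions for this sketch and are not taken from BACKMAP.

```python
from collections import defaultdict


def histograms_by_model(rows, n_bins=100):
    """Build one histogram of R per model.

    rows: iterable of (model, resid, R) with R in [0, 1].
    Returns {model: [count per bin]}; plotting model on X, bin on Y and
    count as color gives a time-resolved histogram.
    """
    histograms = defaultdict(lambda: [0] * n_bins)
    for model, _resid, r in rows:
        # Clamp so that R == 1.0 falls into the last bin.
        index = min(int(r * n_bins), n_bins - 1)
        histograms[model][index] += 1
    return dict(histograms)


if __name__ == "__main__":
    # Invented rows: model 1 is uniform, model 2 is split into two states.
    rows = [(1, 1, 0.5), (1, 2, 0.5), (2, 1, 0.25), (2, 2, 0.75)]
    for model, counts in sorted(histograms_by_model(rows, n_bins=4).items()):
        print(model, counts)
```

A bimodal column, as in the second toy model, is the signature discussed for the relaxed nanosheet backbone.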

In-script usage IV: available color schemes (CMAPs)
Aside from the general color maps (cmaps) that exist in matplotlib (e.g., "Greys," "Reds," or, god forbid, "jet"), BACKMAP provides two new colormaps: "Chirality" (key: +ve twists: red; -ve twists: blue), and "SecondaryStructure" (key: potential α-helices: red; β-sheets: blue; ppII-helices: cyan). Right twisting backbones are shown in red; left twisting backbones are shown in blue. Figure 10 shows how a single protein ensemble may be described using these schematics. As illustrated in Fig. 10B, cmaps available within the standard matplotlib package do not distinguish between major secondary structures well, while those provided by BACKMAP do. In case it is known that the protein backbone accesses non-traditional regions of the Ramachandran plot, a four-color schematic will be needed (see below for more discussions).

Stand alone usage
BACKMAP can be used as a stand alone package by running "> python -m backmap -pdb <pdb_dir_or_file>." The sections below describe the expected outputs and how they may be interpreted.

Stand alone Example I: a stable protein
Figures 11B-11F below were created by running "> python -m backmap -pdb ./tests/pdbs/1xqq.pdb" (Fig. 11A was created using VMD). These graphs indicate that conformational ensemble 1xqq describes a conformationally stable protein, since (1) each conformation shows little change in the ℛ histogram over time (Fig. 11B), and (2) each residue fluctuates little in color (structure) over "time" (Figs. 11C-11F; see Methods). Here and below, it is assumed that discrete models represent distinct states of the protein over "time".
In particular, each column in Fig. 11B describes the histogram in Ramachandran number (ℛ) space for a single model/timeframe. These histograms show the presence of both α-helices (at ℛ ≈ 0.34) and β-sheets (at ℛ ≈ 0.52). Additionally, Figs. 11C and 11D describe per-residue conformational plots (colored by two different metrics or CMAPs), which show that most of the protein backbone remains relatively stable over time (e.g., few fluctuations in state or "color" are evident over frame #). Finally, Fig. 11E describes the extent to which a single residue's state has deviated from the first frame, and Fig. 11F describes the extent to which a single residue's state has deviated from its state in the previous frame. All these graphs show that this protein is relatively conformationally stable.

Stand alone Example II: an intrinsically disordered protein
Figure 12 is identical to Fig. 11, except that the panels pertain to an intrinsically disordered protein 2fft whose structural ensemble describes dramatically distinct conformations.
As compared to the conformationally stable protein above, protein 2fft is much more flexible. Figure 12B shows that the states accessed per model are diverse and dramatically fluctuate over the entire range of ℛ (this is especially true when compared to a stable protein, see Fig. 11B). The diverse states occupied by each residue (Figs. 12C and 12D) confirm the conformational variation displayed by most of the backbone (Figs. 12E and 12F similarly show that most of the residues fluctuate dramatically).
Yet, interestingly, Figs. 12C-12F also show an unusually stable region (residues 15-25) that consistently displays the same conformational (α-helical) state at ℛ ≈ 0.34 (interpreted as the color red in Fig. 12C). This trend would be hard to recognize by simply looking at the structural ensemble (Fig. 12A).

A signed Ramachandran number for "misbehaving" backbones
The Ramachandran number increases in value from the bottom left of the Ramachandran plot to the top right in sweeps that are parallel to the negatively-sloping diagonal. As discussed in Mannige, Kundu & Whitelam (2016), this method of mapping a two-dimensional space into one number is still structurally meaningful and descriptive because (1) most structural features of the protein backbone, for example, radius of gyration (Mannige, Kundu & Whitelam, 2016), end-to-end distance (Mannige, Kundu & Whitelam, 2016), and chirality (Mannige, 2017), vary little along lines parallel to the negatively-sloping diagonal (this is indicated by relatively small standard deviations in structural metrics for similar ℛ values; Fig. 4), and (2) most protein backbones display chiral centers and therefore predominantly appear on the top left region of the Ramachandran plot (above the dashed diagonal in Fig. 13A). However, not all backbones localize in only one half of the Ramachandran plot. Particularly, among biologically relevant amino acids, glycine occupies both regions of the Ramachandran plot (Fig. 13B; of note, the α_L helix region becomes relatively prominent). On the other hand, prolines are known to form polyproline-II helices (ppII in Fig. 13C), which fall on almost the same "sweep" as glycine rich peptides (red dot-dashed line). In situations where both prolines and glycines are abundant, the Ramachandran number (ℛ) would fail to distinguish α_L from ppII (Fig. 13D; regions outlined by rectangles).

Figure 13 […] non-chiral amino acids such as glycines (or their N-substituted variants, peptoids) display no strong preference, which causes distinct secondary structures that lie on the same "sweep" to be localized at similar regions in ℛ (e.g., in D, polyproline-II and α_D helices both localize at ℛ ≈ 0.6). However, a signed Ramachandran number (ℛ_S) solves this problem by multiplying those ℛ values derived from backbones with ϕ > ψ by -1. The resolving power of ℛ_S is evident by the separation of polyproline-II and α_D helices (F). The mappings of (ϕ, ψ) to ℛ and ℛ_S are shown in (E) and (G), respectively. DOI: 10.7717/peerj.5745/fig-13
To accommodate the situation where achiral backbones are expected (e.g., if peptoids or polyglycines are being studied), an additional Ramachandran number, the signed Ramachandran number ℛ_S, is introduced here. ℛ_S is identical to the original number in magnitude, but changes sign from + to - for ℛ numbers that lie to the right of (or below) the positively sloped diagonal, i.e.,

$$\mathcal{R}_S = \begin{cases} \mathcal{R}, & \text{if } \psi \ge \phi \\ -\mathcal{R}, & \text{if } \psi < \phi \end{cases}$$

Signed Ramachandran numbers are calculated by adding "-signed" within the command line implementation or by adding "signed=True" when making backmap.R() calls.
As an example of the utility of ℛ_S, Fig. 13F shows that ℛ_S easily distinguishes α_D from ppII.
Note that the signed ℛ_S, while useful, would be important in very limited scenarios, as more than 96% of the amino acids in the PDB occupy the upper-left region of the Ramachandran plot (with 3% of "rule breakers" contributed mostly by glycines).
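The sign rule described in this section can be sketched as below. This is a stand-alone illustration and does not call backmap.R(); the function name and the example angles are assumptions chosen to show two conformations that share a "sweep" but sit on opposite sides of the positively sloped diagonal.

```python
def signed_ramachandran_number(phi, psi):
    """Signed Ramachandran number for angles in degrees within [-180, 180).

    The magnitude is the ordinary Ramachandran number; the sign is flipped
    for backbones with phi > psi (below the positively sloped diagonal).
    """
    r = (phi + psi + 360.0) / 720.0
    return -r if phi > psi else r


if __name__ == "__main__":
    # Same phi + psi, so the unsigned numbers coincide; the signs differ.
    upper_left = signed_ramachandran_number(-75.0, 145.0)
    lower_right = signed_ramachandran_number(145.0, -75.0)
    print(upper_left, lower_right)
```

The two printed values have equal magnitude and opposite sign, which is what allows a stacked plot to separate structures that the unsigned number would merge.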

CONCLUSION
A simpler Ramachandran number is reported, ℛ = (ϕ + ψ + 2π)/(4π), which, while being a single number, provides much information. For example, for proteins within the protein databank, ℛ values above 0.5 are left-handed in twist, while those below 0.5 are right-handed; ℛ values close to 0, 0.5, and 1 are extended; β-sheets occupy ℛ values at around 0.52; and right-handed α-helices hover around 0.34. Given the Ramachandran number's "stackability," single graphs can hold detailed information of the progression/evolution of molecular trajectories. Indeed, Fig. 8 shows how 400 distinct Ramachandran plots can easily be fit into one graph when using ℛ. Finally, a Python script/module (BACKMAP) has been provided in an online GitHub repository to promote the utility of ℛ as a universal metric.
Given the absence of ppII annotation in the present version of DSSP, statistics for ppII (used to generate the ppII distributions in Figs. 2A, 3A and 3C) were obtained from segments within 16,535 proteins annotated by PolyprOnline (Chebrek et al., 2014) to contain three or more residues of the secondary structure. Figure 6 represents a trajectory of a portion of a single peptoid backbone within a "relaxing" peptoid nanosheet bilayer. The conformation of this backbone, derived from work by Mannige et al. (2015) and Mannige, Kundu & Whitelam (2016), is also available as "/tests/pdbs/nanosheet_birth_U7.pdb" within the companion GitHub repository.
The following protein structures were obtained from the PDB: 1mba, 2acy, 1xqq, and 2fft. The first two in the list (1mba, 2acy) describe single conformations and the last two (1xqq, 2fft) describe ensembles. ℛ-based MAPs were created for each structure X ∈ [nanosheet_birth_U7, 2fft, 2acy, 1xqq, 1mba] using the following command line code:
> python -m backmap -pdb tests/pdbs/X.pdb
The outputs of this command line implementation were used in Figs. 5B, 6B, 10B, 11B and 12B.
In order to describe changes in structure, this report uses two metrics for structural deviation: deviation in present structure when compared to the first conformation in the trajectory (D_1), and deviation in structure compared to the previous conformation in the trajectory (D_-1). For any residue r at time t, D_1 compares the residue's ℛ at time t with its ℛ in the first frame, and D_-1 compares it with its ℛ at time t - 1.
All three-dimensional representations of proteins (Figs. 5A, 6A, 10A, 11A and 12A) were created using VMD (Humphrey, Dalke & Schulten, 1996). Finally, all other figures, excepting Fig. 1 that is derived from Mannige, Kundu & Whitelam (2016), were created using helper Python scripts available in manuscript/python_generators/ within the companion GitHub repository.
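The two deviation metrics defined in this section can be sketched as follows. The exact expressions are not reproduced in this text, so the sketch assumes that each deviation is the absolute difference between two Ramachandran numbers, and that the deviation from the "previous" frame is zero for the first frame; the function name and input layout are likewise hypothetical.

```python
def deviations(trajectory):
    """Per-residue deviations of R over a trajectory.

    trajectory[t][r] is the Ramachandran number of residue r in frame t.
    Returns (d_first, d_prev) with the same shape as the input:
      d_first[t][r] = |R(t, r) - R(first frame, r)|   (assumed form of D_1)
      d_prev[t][r]  = |R(t, r) - R(t - 1, r)|         (assumed form of D_-1)
    """
    first = trajectory[0]
    d_first = [[abs(x - x0) for x, x0 in zip(frame, first)] for frame in trajectory]
    d_prev = [[0.0] * len(first)]  # assumption: no deviation defined before frame 1
    for previous, current in zip(trajectory, trajectory[1:]):
        d_prev.append([abs(x - p) for x, p in zip(current, previous)])
    return d_first, d_prev


if __name__ == "__main__":
    # Invented three-frame, two-residue trajectory.
    toy = [[0.5, 0.25], [0.5, 0.75], [0.25, 0.75]]
    d_first, d_prev = deviations(toy)
    print(d_first)
    print(d_prev)
```

A residue that changes state once and then stays put shows a persistent signal in the first metric but only a single spike in the second, which is why the two are plotted separately.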

APPENDIX
Simplifying the Ramachandran number (ℛ)
This section will derive the simplified Ramachandran number presented in this paper from the more complicated-looking Ramachandran number introduced previously (Mannige, Kundu & Whitelam, 2016).