Macromolecular modeling and design in Rosetta: new methods and frameworks

The Rosetta software suite for macromolecular modeling, docking, and design is widely used in pharmaceutical, industrial, academic, non-profit, and government laboratories. Rosetta’s advantage is interoperability between several broad modeling capabilities and it consistently ranks highly when compared to other leading methods created for highly specialized protein modeling and design tasks. Developed for over two decades by a global community of scientists at more than 60 institutions, Rosetta has been refactored and extended continuously and now comprises over three million lines of code. Here we discuss methods and applications developed in the last five years, including the latest protocols for structure prediction, protein–protein and protein–small molecule docking, protein structure and interface design, loop modeling, the incorporation of various types of experimental data, and modeling of peptides, antibodies and other proteins in the immune system, nucleic acids, non-standard amino acids, carbohydrates, and membrane proteins. We briefly discuss improvements to the score function, user interfaces, and usability of the software. Rosetta is available at www.rosettacommons.org.

Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 16 October 2019


Introduction
It has long been understood that, in biological systems, structure determines function. This relationship has motivated decades of experimental determination of protein structure and function. Many computational packages have been developed to provide valuable guidance to experimental methods, one of which is the Rosetta modeling and design suite. Most computational tools are specialized for a small number of specific purposes; in this regard Rosetta is different, and over two decades has been expanded to include broad capabilities that span many bioinformatics and structural-bioinformatics tasks. Computational structural biology tools and frameworks with similarly comprehensive scope are few, but key to progress in biology. Schrodinger 1, the Molecular Operating Environment 2,3, and Discovery Studio 4 are computational chemistry platforms for advanced modeling and design for structural biology, drug discovery and material science, based on molecular mechanics, molecular dynamics and quantum mechanics calculations. The HHSuite 5 includes various tools for bioinformatics, sequence alignments, structure prediction and modeling. The BioChemicalLibrary 6 (BCL) includes tools for structure prediction, drug discovery, and several sequence-to-structure methods using machine learning approaches. The Integrative Modeling Platform 7 (IMP) allows modeling of large macromolecular complexes by incorporation of various types of experimental data. OpenBabel 8 is a cheminformatics toolbox supporting molecular mechanics calculations but is most heavily used for interconversion of file formats.
Molecular dynamics simulation packages like CHARMM 9, AMBER 10, GROMACS 11, OPLS 12, Desmond 13, and FoldX 14 simulate most atoms explicitly with a physics-based energy function that relies on solving Newton's equations of motion. These methods can be used for folding small proteins, model refinement, high-resolution phenomena such as ion flow through membrane channels, and modeling interactions with small molecules and are therefore highly complementary to Rosetta. OpenMM 15 is an API (application programming interface) for setting up molecular simulations and can be used as a library or standalone application.
Many other tools are available for more specialized tasks, for instance for de novo modeling (AlphaFold 16, QUARK 17, RaptorX 18, MULTICOM 19), homology modeling (Modeller 20, SwissModel 21), fold recognition (iTasser 22), protein-protein docking (HADDOCK 23, Zdock 24, ClusPro 25, Gramm-X 26, PatchDock 27), ligand docking (AutoDock 28, FlexX 29, Glide 30) and the numerous other tasks that require molecular modeling. A large body of methods other than Rosetta is listed and cited in the Supplement to this paper, as the focus of this perspective is the description of methods recently developed in Rosetta. Given that it is well beyond the scope of this work to give a fully comprehensive picture of computational chemistry or biomolecular modeling software, we focus this perspective on Rosetta, which is a proven state-of-the-art code for a wide variety of bio-macromolecular prediction and design tasks. One of Rosetta's advantages is the interoperability of its large number of applications. The challenge, however, is to track the scope of functionality available to scientists who wish to use the software. This perspective is meant to be an illustrated guide for the new, returning, or seasoned user, to help find the right protocol hiding in the Rosetta haystack. We present the list of tools that were developed in the past five years, thereby providing direction as to where to find information for specific modeling problems. For detailed instructions for each of the specific protocols, we refer to the links in the Supplement, which direct to the appropriate corners of the expansive Rosetta documentation and codebase.
Development of Rosetta started in the mid-1990s for protein structure prediction and to gain insights into the protein folding problem 31, which remains a grand challenge of structural biology. Over time, the number of applications grew to address a wider array of modeling tasks, ranging from protein-protein or protein-small molecule docking to incorporating NMR data, loop modeling, protein design, and interaction with peptides and nucleic acids (Figure 1). Over more than 20 years, the community of developers and scientists, the RosettaCommons, grew from a single academic laboratory to laboratories at over 60 institutions around the globe 32. The software has undergone several transitions, including in programming language and implementation, with the latest protocols based on Rosetta3, which was first released in 2008 33. The score function has been continuously improved; detailed descriptions can be found in previous articles 34,35. Throughout Rosetta's lifetime, efforts to improve interfaces to the code and documentation have drastically improved usability and enabled modular application to new problems. As part of our sustained focus on accessibility, usability, and scientific reproducibility, we developed several interfaces (PyRosetta 36, RosettaScripts 37, Foldit 38) and emphasized publishing protocol captures 39 that accompany manuscripts. As the software's interfaces have grown more versatile, development has accelerated and branched in many directions. However, this extensibility and the very large number of scientists who combine modules in unanticipated combinations make it difficult to keep up with all the developments that are happening within the software and the scientific community. To address this growth in functionality, we have compiled the latest method developments in the Rosetta software suite from the past five years, divided into several modeling categories. The Supplement contains a more detailed tour of the protocols discussed, with extensive links to documentation, resources on the web, and limitations and competitors to each method.

Figure 1: Capabilities of the Rosetta macromolecular modeling suite
Some popular tasks that can be addressed in Rosetta (blue) and major systems that can be modeled (red). This is an incomplete list of Rosetta's broad modeling capabilities.

General overview and challenges
The general outline of a typical Rosetta protocol is depicted in Figure 2A: the conformation of a biomolecule (the Pose) is altered, either deterministically or stochastically, via a Mover and the resulting conformation is evaluated by a ScoreFunction. The move is accepted based on the Metropolis criterion and the energy difference between the original and the new conformation: if $E_{new} < E_{old}$, accept; if $E_{new} \geq E_{old}$, accept with probability $P = e^{-(E_{new} - E_{old})/kT}$. Many trajectories are generated, and the final models are evaluated based on the scientific objective. This setup highlights common limitations in Rosetta protocols involving sampling, scoring (discussed in the score function section), or technical challenges. Many protocols suffer from under-sampling 40, especially when flexibility is involved. Sampling is a limitation for structure prediction, especially for large structures; for protein design; and for unconstrained global protein-protein docking, which succeeds in only about 30% of cases. Local docking is limited by backbone flexibility and deteriorates with larger flexibility in the binding interface; small molecule docking relies on correct identification of the binding interface and is limited by flexibility between unbound and bound states; and loop and antibody modeling suffer from sampling challenges, especially for loops longer than 12 residues. Huge conformational search spaces are also prohibitive for RNA modeling due to the size of the torsion space (see RNA section), for membrane proteins due to their size, and for carbohydrates because of branching and flexibility. Some Rosetta applications suffer from technical challenges in implementation: for instance, a unified framework for various types of experimental data is lacking (see Supplement); code usability is limited by a lack of documentation, protocol captures, or support (e.g. DNA modeling, new de novo structure prediction protocols); and more diverse chemistries need to be implemented, for instance for specific carbohydrates, spin labels, non-canonical chemistries, and lipids. Technical challenges are either historical or due to a lack of interest in the community to develop and advance methods in these unique areas.
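The sampling loop described above can be illustrated with a minimal sketch in plain Python. It does not use Rosetta or PyRosetta: the one-dimensional "pose", the random-perturbation "mover", and `toy_score` are hypothetical stand-ins chosen only so that the Metropolis acceptance rule can be run and inspected in isolation.

```python
import math
import random


def metropolis_accept(e_old, e_new, kT, rng=random):
    """Metropolis criterion: always accept downhill moves; accept uphill
    moves with probability exp(-(e_new - e_old) / kT)."""
    if e_new < e_old:
        return True
    return rng.random() < math.exp(-(e_new - e_old) / kT)


def toy_score(x):
    """Hypothetical 'score function' with a single minimum at x = 2."""
    return (x - 2.0) ** 2


def run_trajectory(n_steps=2000, kT=0.5, step=0.3, seed=0):
    """One Monte Carlo trajectory; returns the best coordinate and score seen."""
    rng = random.Random(seed)
    x = -5.0                      # the "pose": here a single coordinate
    e = toy_score(x)
    best_x, best_e = x, e
    for _ in range(n_steps):
        trial = x + rng.uniform(-step, step)   # the "mover": random perturbation
        e_trial = toy_score(trial)
        if metropolis_accept(e, e_trial, kT, rng):
            x, e = trial, e_trial
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e


if __name__ == "__main__":
    print(run_trajectory())
```

In practice many such trajectories would be run with different seeds and the resulting models compared, which mirrors the statement that final models are evaluated across trajectories. The temperature parameter `kT` controls how often uphill moves are accepted.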

Rosetta's brain: its score function
Rosetta's score function has been continuously improved over many years 41 with guiding principles including: improving speed of computation, increasing extensibility, and improving accuracy across multiple tasks. The main score function is a linear combination of weighted score terms that balances physics-based or statistically derived potentials describing van der Waals energies, hydrogen bonds, electrostatics, disulfide bonds, residue solvation, backbone torsion angles, sidechain rotamer energies, and an average unfolded state reference energy (Figure 2B): $E_{total} = w_{vdw} E_{vdw} + w_{hbond} E_{hbond} + w_{elec} E_{elec} + w_{disulf} E_{disulf} + w_{solv} E_{solv} + w_{BBtorsion} E_{BBtorsion} + w_{rotamer} E_{rotamer} + w_{ref} E_{ref}$. Some energy terms are decomposed into several components to be able to parameterize each of them separately. For instance, the van der Waals energy is split into attractive and repulsive terms between different residues, in addition to an intra-residue repulsive term. A complete account of the all-atom score function was published recently 34.
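As a minimal illustration of a score that is a weighted linear combination of terms, the following Python sketch sums per-term energies multiplied by their weights. The term names, weights, and energies are hypothetical placeholders for illustration; they are not REF2015 values, and the function is not Rosetta's API.

```python
def total_score(term_energies, weights):
    """Weighted linear combination: sum over all terms of w_i * E_i."""
    missing = set(term_energies) - set(weights)
    if missing:
        raise KeyError(f"no weight for score terms: {sorted(missing)}")
    return sum(weights[name] * energy for name, energy in term_energies.items())


# Hypothetical weights and per-term energies (illustrative numbers only).
# The van der Waals energy is split into attractive and repulsive parts so
# that each can be weighted separately, as described in the text.
weights = {"vdw_atr": 1.0, "vdw_rep": 0.5, "hbond": 1.0,
           "elec": 1.0, "solv": 1.0, "ref": 1.0}
energies = {"vdw_atr": -10.0, "vdw_rep": 4.0, "hbond": -3.0,
            "elec": -1.5, "solv": 6.0, "ref": 0.5}

if __name__ == "__main__":
    print(total_score(energies, weights))
```

Keeping the weights separate from the term values is what allows a term to be re-parameterized, or a component such as the repulsive part to be down-weighted, without touching the term itself.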
The newest score function REF2015 35 features two main advancements. First, reproducibility of thermodynamic observables (such as liquid-phase properties 12 and liquid-to-vapor transfer free energies 42) was added to the optimization objectives, in addition to structure-based tests 43. Second, a new, derivative-free optimization technique was developed, which is suitable for robust optimization of >100 parameters. Further, a new energy term was added that takes into consideration non-ideality of bond lengths and angles in Cartesian space 44. The Cartesian term 44 is also the basis for a cartesian_ddG method that has been used to calculate ΔΔGs of mutation to probe changes in protein stability. Only the backbones and side chains of residues near the mutation site are allowed to move 45. Due to the local optimization, this protocol is much faster than ddg_monomer 46, while retaining the same level of accuracy. The default Rosetta score function is now also compatible with an expanded palette of chemical building blocks: canonical and non-canonical L-α-amino acids and their D-amino acid counterparts, exotic achiral amino acids like 2-aminoisobutyric acid (AIB), peptoids, and oligoureas. The ability to model metalloproteins has also been added 47,48. Score functions that enable simultaneous modeling of protein and RNA are being explored 49. The score function is now thread-safe and fully mirror symmetric, i.e. enantiomers in mirror conformations score identically. Guidance energy terms for design have been added to encourage certain features, such as specific amino acid compositions 50,51, hydrogen bonding networks, or global or local net charges, and discourage others, such as repeat sequences that hinder NMR assignments, buried unsatisfied hydrogen bond donors and acceptors, or voids within the protein 52.
Hydrogen bond networks are important for biomolecular structure and catalysis but have been challenging to design because of pairwise interactions that have multi-body, cooperative properties. The HBNet protocol 53 has been used to design de novo coiled coils with interaction specificity mediated by designed hydrogen bond networks, including homo-oligomers 53, membrane proteins 54, and large sets of orthogonal heterodimers 55. An improvement to HBNet uses a Monte Carlo search procedure to sample hydrogen bond networks with drastically improved performance 56. We further developed a statistical potential to place highly-coordinated water molecules on the surface of biomolecules. On a data set of 153 high-resolution protein-protein interfaces, the method predicts 17% of native interface waters with 20% precision within 0.5 Å of the crystallographic water positions 57. The potential is accessible through the WaterBoxMover (or ExplicitWaterMover) in RosettaScripts.
There are several limitations to the score function: (1) it does not directly model entropy 58, although doing so has been shown to improve sampling efficiency 59. However, rotamer bond angles, solvation, fragments and pair terms all implicitly model free energy contributions, which at these temperatures and solvation densities account for more than half of the entropy. (2) In most cases, knowledge-based score terms are derived from high-resolution crystal structures, which represent a single state on the energy landscape measured with a specific experimental method, do not represent flexibility well, and are in a solid-state environment (compared to solution NMR); (3) knowledge-based terms are less comprehensible than physics-based terms; (4) different score functions are required for different applications, which should not be the case as nature has a single energy function, indicating that the score functions are only approximations of the truth; (5) scoring time scales with the number of score terms, so scoring has become slower, yet more accurate, over time; (6) the solvation model is implicit, which is fast, but hinders accurate explicit modeling of ions, water molecules, or lipid environments; (7) score functions for specific applications such as RNA, membrane proteins, carbohydrates, non-canonicals, or lipids are immature compared to the score functions for 'mainstream' applications in Rosetta.

Major applications
Predicting protein structures
Rosetta was originally developed for de novo protein structure prediction, which is accomplished by assembling fragments from known protein structures via a Monte Carlo procedure and evaluating the models with the score function. While the community's main objective has moved to protein and biomacromolecular design over the past decade, performance in the CASP blind prediction challenge remains respectable 60, with ranking for refinement and prediction of multimeric complexes among the top three groups. Meanwhile, many groups have developed specialized tools exploiting evolutionary couplings and machine learning methods, for instance Google's DeepMind developed AlphaFold 16 with outstanding performance in the recent CASP13 61. Other highly ranking methods are iTasser 22 (Yang Zhang), MULTICOM 19 (Jianlin Cheng), and QUARK 17 (Yang Zhang).
Improvements in homology modeling were achieved by multi-template modeling in RosettaCM 62 (now available on the new Robetta 60,63 server), which hybridizes the most homologous portions from multiple templates into a single model while modeling missing residues de novo 64. If a template is absent, protein structures can be predicted de novo, which remains one of the most challenging tasks in structural biology, even though the incorporation of evolutionary coupling constraints (for instance from GREMLIN 65) has led to enormous improvements in model quality. To sample the conformational space further, an iterative hybridize approach was implemented. It uses a genetic algorithm that recombines models from an input pool to create models that have features from their parents but are also distinctly different. Creating several child models in each iteration, updating the input pool, and performing 30-50 iterations leads to improved model accuracy because features that are scored favorably by the score function are repeatedly used in the recombination, such that the models in the pool converge over time. This approach has been used to improve model quality of de novo predicted models 66 as well as homology models 67. Model refinement or generating ensembles of structures (useful in particular for design) can be accomplished by several algorithms in Rosetta: FastRelax 68, Backrub 69, or using vicinity sampling in the KIC/Next-Generation-KIC loop modeling algorithms 70,71.
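The pool-based recombination idea behind the iterative hybridize approach can be sketched generically in Python. This is a toy genetic algorithm under stated assumptions, not the Rosetta implementation: a "model" is a list of numbers, `score` is a hypothetical function where lower is better, and a child takes each position from one of two parents with a small random perturbation.

```python
import random


def score(model):
    """Hypothetical score: lower is better, minimum when all values are zero."""
    return sum(x * x for x in model)


def recombine(parent_a, parent_b, rng, noise=0.1):
    """Child inherits each position from one parent, plus a small perturbation,
    so it resembles its parents but is distinct from both."""
    return [rng.choice((a, b)) + rng.gauss(0.0, noise)
            for a, b in zip(parent_a, parent_b)]


def iterative_recombination(pool, n_iterations=30, n_children=20, seed=0):
    """Repeatedly create children from the pool and keep the best models."""
    rng = random.Random(seed)
    pool = sorted(pool, key=score)
    size = len(pool)
    for _ in range(n_iterations):
        children = [recombine(*rng.sample(pool, 2), rng)
                    for _ in range(n_children)]
        # Update the pool: keep the best-scoring models among parents and children.
        pool = sorted(pool + children, key=score)[:size]
    return pool


if __name__ == "__main__":
    start_rng = random.Random(1)
    start = [[start_rng.uniform(-5, 5) for _ in range(8)] for _ in range(10)]
    final = iterative_recombination(start)
    print(score(min(start, key=score)), score(final[0]))
```

Because favorably scored features are retained in the pool and reused as parents, the pool tends to converge over the iterations, which is the behavior described above. The pool size, noise level, and number of children are arbitrary choices for this sketch.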
Loop modeling 72 was implemented early in Rosetta 73,74 to close gaps in models or sample loop conformations, with initial approaches relying on fragment sampling and iterative Cyclic Coordinate Descent (CCD) 75 for chain closure. Subsequent developments introduced inverse kinematic closure (termed "KIC"), relying on polynomial resultants to analytically solve for closed conformations, producing more native-like loops 76,77. Next-Generation KIC (NGK) 71 made improvements to sampling by employing diversification (i.e. a wider range of conformations) and intensification (i.e. focus around previously generated conformations), substantially increasing the fraction of near-native models 71 and allowing modeling of longer loops. GeneralizedKIC 50 (GenKIC) samples loop geometries between fixed endpoints including nonstandard peptide chemistries, for instance L- and D-α-amino acids, β-amino acids, peptoids, oligoureas, or side-chain connections, covalently-attached ligands or crosslinkers, or chemistries that conventional loop-modeling algorithms do not typically handle.

Modeling protein-protein complexes
Another early method was RosettaDock, which predicts the structure of protein-protein complexes from input monomers. The most recent iteration, RosettaDock4.0 81, incorporates protein flexibility from pre-generated protein ensembles, mimicking conformer selection. This has improved sampling efficiency by automatically adjusting the sampling rate based on the diversity of the input ensembles. Scoring has been improved by using a novel, six-dimensional coarse-grained scoring scheme called motif_dock_score, which employs score grids generated from known complexes in the Protein Data Bank (PDB). In local docking benchmarks with backbone deviations of up to 2.2 Å, RosettaDock4.0 was able to successfully dock ~50% of complexes. For symmetric homomers, Rosetta SymDock2 82 can be used, which uses the same six-dimensional scoring scheme as in RosettaDock. Symmetry information can be extracted from a homologous complex, or a global docking search can be performed for a given point symmetry using our symmetry framework 159. An induced-fit based all-atom refinement relieves clashes in tightly-packed complexes to give physically realistic models. On a benchmark set of 43 complexes with different cyclic and dihedral symmetries, global docking on homology models had accuracies of 61% and 42% for cyclic and dihedral symmetries, respectively. These accuracies are substantially higher than for other symmetric docking tools and can be dramatically improved when adding restraints.

Figure 3: Rosetta can successfully address diverse biological questions
(A) Curved β-sheet design: overlay of the designed homo-dimeric curved β-sheet (dcs-E_4_dim_cav3) in rainbow and the crystal structure in gray (PDBID 5u35). The protein is designed de novo and features a curved β-sheet, a large pocket, and a homodimer interface 79. (B) Parametric design: overlay of the de novo designed macrocycle 3H1 in blue and the NMR structure in gray (PDBID 5v2g). This "CovCore" (covalent core) miniprotein is held together covalently by a hydrophobic cross-linker at its core (in red for the design and gray for the NMR structure) 115. (C) PyTXMS: the interactome of M1 protein (virulence factor of Group A streptococcus) and 15 human plasma proteins on the surface of bacteria (peptidoglycan layer (dark green), and the membrane (brown)). This 1.8 MDa structure was measured in a complex mixture of intact bacteria and human plasma by PyTXMS. All models are provided by Rosetta: M1 protein (gray), IgG (red), four fibrinogens (dark to light blue), six albumins (dark to light pink), coagulation factor XIII A [F13A] (purple), C4bPa (cyan), haptoglobin [HP] (brown), and alpha-1-antitrypsin [SerpinA1] (plum). This complex contains over 200 chemical cross-links 135. (D) RosettaSurface: model of an LK-α peptide (LKKLLKLLKKLLKL with a periodicity of 3.5 assuming a helical conformation) on a hydrophilic self-assembled monolayer surface. In solution, the peptide is unstructured; it assumes a helical structure 122 when on the surface, as experiments show. (E) RosettaCarbohydrate: flexible docking of a carbohydrate antigen to an antibody. The crystal structure is in gray (PDBID 1mfa) and the model in blue, with the carbohydrate in green. Antibody coordinates were taken from the PDB and glycan coordinates started from a randomized backbone conformation and rigid-body orientation 150. (F) PIPER-FlexPepDock: high-resolution model of a peptide-protein complex generated using PIPER-FlexPepDock (model: blue; solved structure in gray, PDBID 1mfg). The model was generated from a peptide sequence (LDVPV, derived from the C-terminal tail of ErbB2R) and the unbound structure of the receptor (Erbin PDZ domain, PDBID 2h3l, colored in red) 118.

Docking of small molecule ligands into proteins
Structure-based drug design has become a common approach for drug optimization due to increasing numbers of deposited structures in the PDB. RosettaLigand 83 has demonstrated success in predicting small molecule-protein interactions.
Later in the drug development process, when medicinal chemists optimize ligands based on structure-activity relationships (SAR), they synthesize ligands that typically share a core chemical scaffold and are assumed to bind to their target in a similar fashion 160. RosettaLigandEnsemble 86 improves sampling during ligand docking by taking advantage of ligand similarities and docking a congeneric series of ligands at the same time, allowing for a placement that works for all considered ligands while optimizing the binding interface for each ligand independently. Experimental SARs can be included by promoting certain binding modes.
Another approach for therapeutic intervention is to use small molecule ligands as competitive inhibitors of protein-protein interactions. A common challenge, however, is that the protein's inhibitor-bound conformation often differs from the unbound or protein-protein bound conformation. The pocket optimization approach identifies protein surface pockets and uses their volume as an additional scoring term: this allows the user to start from an unbound protein structure and carry out biased sampling of a protein such that low-energy pocket-containing states are preferentially explored 87,88. The specific conformations sampled through this approach match "druggable" alternate conformations observed in ligand-bound structures 87,88, implying that these states are excellent starting points for virtual screening. The pockets sampled on the protein surface can then be matched to complementary ligands directly, by using the pocket itself as the starting point for pharmacophore-based screening 161.

Modeling antibodies and other immune system proteins
Due to the therapeutic significance of antibodies, several protocols have been developed for structure prediction, docking and design that involve antibodies and other proteins in the immune system, such as T-cell receptors (TCR), displayed antigens of the Major Histocompatibility Complex (MHC) and other soluble antigens and immunogens. RosettaAntibody 92-95 is a protocol for homology modeling of antibodies 95. It identifies homologous templates, assembles them into a single structure and then models CDR H3 loops de novo while simultaneously refining the VH-VL orientation 162. Recent advances have focused on using multiple templates 162, incorporating key structural constraints 163,164 into CDR H3 modeling, and modeling camelid antibodies 94 and antibodies on the scale of the human repertoire 165,166. AbPredict 96 predicts antibody structures without homologous templates. Instead, it samples backbone fragments and rigid-body orientations from known antibody structures, without relying on sequence homology, and is therefore able to accurately model cases with sequence identity as low as 10%.
AbPredict2 is available as a webserver 97. SnugDock 100 is a method for antibody-antigen docking. SnugDock takes as input a plausible starting conformation and optionally an ensemble of antibodies/antigens, then runs local docking to refine both the antibody-antigen interface and the heavy-light chain interface (within the antibody) and re-models the CDR H2/H3 loops at the interface. Recent advances include a CDR H3 structural constraint 163,164 and docking of camelid antibodies 167. Limitations in antibody modeling depend on the task: docking is limited by the knowledge of the binding site (global vs. local docking); structure prediction, design and refinement are limited by protein flexibility; and modeling of CDRs or other loops is challenging if they are longer than 12 to 15 residues.

Design of antibodies and immune system proteins
RosettaAntibodyDesign 101 (RAbD) is based on RosettaAntibody 94 (see above) and allows design of specific CDRs of different clusters and lengths, sequence design using cluster-based CDR profiles or conservative mutations, or de novo design of whole antibodies. RAbD uses North-Dunbrack CDR clustering 168, reducing deleterious sequence mutations, and was benchmarked on 60 diverse antibody-antigen interfaces from both κ and λ antibodies. Experimental benchmarking of two antibody-antigen complexes showed affinity improvements between 10- and 50-fold.
Rosetta has been integrated with experimental immunogenic epitope data, MHC epitope prediction tools, and host genomic data to enable the design of proteins with reduced immunogenicity while retaining function and stability 102. The approach implements machine learning-based epitope prediction for 28 different alleles, restricts design to select 15mer epitope regions, and uses a greedy stepwise protein design 103 to eliminate the most immunogenic epitopes with as few mutations as possible, avoiding disruptive core mutations likely to destabilize the protein.
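The greedy stepwise strategy can be illustrated with a small, self-contained Python sketch. Everything here is a hypothetical stand-in: `toy_predictor` is a placeholder for a real epitope predictor, the `allowed` map encodes which positions may be mutated (for example, excluding core positions), and no stability scoring is performed. It is a sketch of greedy selection in general, not of the published protocol.

```python
def total_immunogenicity(seq, predictor, k=15):
    """Sum of predicted epitope scores over all k-mer windows of the sequence."""
    return sum(predictor(seq[i:i + k]) for i in range(len(seq) - k + 1))


def greedy_deimmunize(seq, predictor, allowed, k=15, max_mutations=3):
    """Greedy stepwise design: at each step apply the single allowed mutation
    that lowers the total predicted immunogenicity the most.

    allowed maps a 0-based position to the residues permitted at that position.
    Returns the mutated sequence and a list of (position, old, new) tuples.
    """
    seq = list(seq)
    mutations = []
    for _ in range(max_mutations):
        current = total_immunogenicity("".join(seq), predictor, k)
        best = None
        for pos, residues in allowed.items():
            for aa in residues:
                if aa == seq[pos]:
                    continue
                trial = seq[:pos] + [aa] + seq[pos + 1:]
                s = total_immunogenicity("".join(trial), predictor, k)
                if s < current and (best is None or s < best[0]):
                    best = (s, pos, aa)
        if best is None:          # no remaining mutation improves the score
            break
        _, pos, aa = best
        mutations.append((pos, seq[pos], aa))
        seq[pos] = aa
    return "".join(seq), mutations


def toy_predictor(window):
    """Hypothetical stand-in predictor: counts bulky hydrophobic residues."""
    return sum(residue in "FILVWY" for residue in window)


if __name__ == "__main__":
    print(greedy_deimmunize("AAFLLAAAA", toy_predictor,
                            {2: "AS", 3: "AS", 4: "AS"}, k=3, max_mutations=2))
```

Capping `max_mutations` reflects the goal of removing the most immunogenic epitopes with as few mutations as possible. A greedy search of this kind is not guaranteed to find the globally best combination of mutations.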
AbDesign features design by cutting experimentally determined antibody structures along conserved positions to create interchangeable segments and then recombining them to produce a conformationally diverse set of novel antibody models 104,105. The models are docked to a target of interest, either locally to a specific epitope, or globally, followed by an optimization step consisting of rigorous backbone sampling and sequence design for improving model stability and binding affinity.

Designing new proteins and functions
Protein design 169 (where the objective is identification of a sequence that best represents a given structure) relies on several of the same core functionalities needed for protein prediction, and synergy and interoperability between design and prediction models has always been a core Rosetta design principle. That this circular dependence can be achieved is highlighted by the recently implemented biased forward folding method: during computational de novo protein design 170, a stringent test for the consistency of the designed sequence is whether ab initio structure prediction will yield the same structure that was used as a starting point for the design. However, computationally testing a large number of designs is precluded by the vast conformational search space for ab initio structure prediction. To drastically limit that search space and test many more designs, the biased forward folding method 79 uses three (instead of the typical 200) fragments per residue position. Fragments are chosen based on their RMSD to the native structure in design.
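The fragment selection step can be sketched as keeping, for each residue position, the few candidate fragments with the lowest RMSD to the corresponding segment of the design model. The Python sketch below makes simplifying assumptions that are not from the source: fragments are lists of (x, y, z) tuples, and the RMSD is computed without superposition, which a real structural comparison would require.

```python
import heapq
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long coordinate lists.
    Simplification for this sketch: no superposition is performed."""
    if len(coords_a) != len(coords_b):
        raise ValueError("fragments must have the same number of atoms")
    squared = sum((a - b) ** 2
                  for point_a, point_b in zip(coords_a, coords_b)
                  for a, b in zip(point_a, point_b))
    return math.sqrt(squared / len(coords_a))


def pick_fragments(candidates, design_segment, n_keep=3):
    """Keep the n_keep candidates closest to the design segment, best first."""
    return heapq.nsmallest(n_keep, candidates,
                           key=lambda fragment: rmsd(fragment, design_segment))


if __name__ == "__main__":
    design = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    # Hypothetical candidates: the design segment shifted along y by d.
    shifts = [0.5, 0.1, 2.0, 0.0, 1.0]
    candidates = [[(x, y + d, z) for x, y, z in design] for d in shifts]
    for fragment in pick_fragments(candidates, design):
        print(round(rmsd(fragment, design), 3))
```

Restricting each position to a handful of near-design fragments shrinks the search space that forward folding must cover, which is why many more designs can be tested than with the typical 200 fragments per position.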
Protein design is somewhat easier when starting from known structures and when redesigning for thermostability or for features such as the protein surface 171. Such tasks are more successful because much information about sequence-structure relationships is readily available in public databases. The most difficult problems are de novo design (without a template structure) and design for novel folds or functions. Successes in these cases are sparse and require sampling of enormous conformational spaces: depending on the protein size, several hundred thousand to millions of models. Another simplification of de novo design is thermostabilization of the protein, essentially creating rigid structures that are mostly non-functional, by expanding the energy gap between folded and unfolded designs to facilitate structural characterization. To date, novel functional designs mostly exploit known structures, and the next frontier is the design of novel functions onto de novo scaffolds. Moreover, nature typically does not design for the global minimum energy conformation (in terms of stability) because proteins require flexibility to carry out their functions.
De novo design and design of novel protein functions towards therapeutic intervention are addressed by various methods in Rosetta. The SEWING protocol creates de novo designs by recombining parts of protein structures from randomly selected helical building blocks 106. SEWING's requirement-driven approach allows users to specify features or properties that should be incorporated into their designs during backbone generation without necessarily requiring a certain size or three-dimensional fold. New features include incorporation of functional motifs such as protein-binding peptides for protein interface design and partial or complete ligand binding sites for ligand-binding protein design 107. A somewhat similar algorithm has been implemented for antibody design (AbDesign, see above), which was generalized for enzyme design 172. A more general approach is RosettaRemodel, which performs protein design by rebuilding parts or all of the structure 108 from fragments of known protein structures. RosettaRemodel relies on a blueprint file in which the user defines the secondary and supersecondary structure of the fold to be built. Remodel interfaces with a number of Rosetta protocols and can be used for various applications such as de novo modeling, fixed-backbone sequence design, refinement, loop insertion, deletion, and remodeling, as well as disulfide engineering, domain assembly, and motif grafting.
A common task is not only to design towards a certain goal (positive design) but also to design away from undesired features (negative design). Such a Multi-State Design 173 (MSD) approach evaluates strengths and weaknesses of a single sequence on multiple backbones, for instance binding to one but not another protein partner. REstrained CONvergence 110 (RECON) takes this idea one step further by allowing each state to sample multiple sequences during the design process; the procedure is applied iteratively, with the restraint weight increased in each round to encourage sequence convergence. RECON achieves on average 70% sequence recovery (a 30% increase compared to MSD) for large multi-state design problems, such as antibody affinity maturation or prediction of evolutionary sequence profiles of flexible backbones 174,175.
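The idea of ramping a convergence restraint can be shown with a toy sketch. It is not the RECON implementation: the per-position energy tables, the mismatch penalty, the weight schedule, and the function name recon_toy are all assumptions made for illustration. Each state first designs independently; in later rounds every state re-picks its residues while paying a penalty, scaled by the current weight, for each other state that disagrees at that position.

```python
def recon_toy(energies, weights=(0.0, 0.5, 1.5, 3.0)):
    """Toy multi-state design with a ramped sequence-convergence restraint.

    energies[s][i] is a dict {amino_acid: energy} for state s, position i.
    Returns one sequence (list of amino acids) per state.
    """
    n_states = len(energies)
    n_pos = len(energies[0])
    # Start from the unrestrained optimum of every state.
    seqs = [[min(table, key=table.get) for table in state] for state in energies]
    for w in weights:
        # States are updated one after another so that each one sees the
        # latest choices of the others (avoids two states swapping forever).
        for s in range(n_states):
            for i in range(n_pos):
                others = [seqs[t][i] for t in range(n_states) if t != s]
                table = energies[s][i]
                seqs[s][i] = min(
                    table,
                    key=lambda aa: table[aa] + w * sum(aa != o for o in others),
                )
    return seqs


# Two states that disagree at position 0 and agree at position 1.
toy_energies = [
    [{"A": -2.0, "V": -1.0}, {"L": -1.0, "K": 0.0}],
    [{"A": 0.0, "V": -1.5}, {"L": -1.0, "K": 0.0}],
]
converged = recon_toy(toy_energies)
```

In this toy case the states keep their individual preferences while the weight is low and agree on V at position 0 once the restraint outweighs the energy difference, which is also the choice with the lower summed energy.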
Design of protein function can be accomplished by motif grafting, i.e. grafting a known motif or a predicted active or binding site from a template structure onto a new protein. This approach has been used for antibodies and vaccine design 111 using the fold_from_loops application, where the functional motif is used as the starting point of an extended structure that is folded following the constraints of a target topology. Iterative refinement is carried out via sequence design and structural relaxation before filtering and human-guided optimization. This protocol has been extended into the Functional Folding and Design (FunFolDes) protocol, which includes multi-segment motif grafting, insertion of motifs of different residue length, incorporation of restraints, and folding in the presence of a binding target 112. Performance of the folding stage can be improved by selecting fragments according to the target topology via the StructFragmentMover.

Designing interfaces between proteins and interaction partners
Problems related to protein design include designing interfaces of proteins with their interaction partners, such as proteins or small-molecule ligands, and predicting ΔΔGs of mutation (e.g. alanine scanning). Predicting ΔΔGs of mutations for protein stability or protein-protein interactions is a difficult problem with low correlation coefficients (0.5-0.7) 176, because the effect of the mutation is small compared to the total energy of the system and because protein flexibility adds noise to the energies that can mask the effect of mutations. In the simplest case of alanine scanning (mutating to Ala), methods that use a "soft-repulsive" score function without modeling backbone flexibility 177,178 have typically outperformed methods that allow protein flexibility and use hard-repulsive score functions 179. FlexDDG 113 was created to improve protein-protein interface ΔΔG predictions and generalize them to residues other than Ala. The protocol creates conformational ensembles using backrub sampling 180, then repacks side chains, minimizes torsions, and computes the change in protein-protein interaction energy (ΔΔG) by averaging across the ensembles. On 1240 interface mutants, FlexDDG outperforms the earlier ddg_monomer application, which was originally created and validated to predict changes in stability upon mutation, not changes at interfaces.
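The ensemble-averaging step can be written down in a few lines. This is a sketch under stated assumptions, not the FlexDDG code: it assumes that each ensemble member provides scores for the complex and for the two separated partners, and that the binding energy is the simple difference between them; the function names are hypothetical.

```python
from statistics import mean


def binding_energy(complex_score, partner_a_score, partner_b_score):
    """Interaction energy of one model: complex minus separated partners."""
    return complex_score - partner_a_score - partner_b_score


def ensemble_ddg(wt_models, mut_models):
    """Interface ddG as the difference of ensemble-averaged binding energies.

    Each model is a (complex, partner_a, partner_b) tuple of scores.
    A positive value means the mutation weakens binding.
    """
    dg_wt = mean([binding_energy(*m) for m in wt_models])
    dg_mut = mean([binding_energy(*m) for m in mut_models])
    return dg_mut - dg_wt


# Two-member toy ensembles (scores in arbitrary energy units).
wt = [(-100.0, -40.0, -50.0), (-102.0, -41.0, -50.0)]
mut = [(-98.0, -40.0, -50.0), (-99.0, -41.0, -50.0)]
ddg = ensemble_ddg(wt, mut)
```

Averaging over many backrub-generated models reduces the noise that a single flexible-backbone model would add to a quantity that is small compared to the total energy.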
Symmetric protein assemblies can now be modeled using parametric design. Nature created super-helical coiled coils that are well described by geometric equations using Crick parameters 181, which include variables for the radius of the bundle, major helical twist, minor helix rotation about the primary axis, etc. Several Movers such as MakeBundle, PerturbBundle, and BundleGridSampler allow designing helical bundles 54,115 and β-barrels based on pre-defined or sampled parameters. Since parametric methods do not rely on fragment libraries, these modules can be applied to non-canonical coiled-coil heteropolymers.
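To make the notion of a parametric backbone concrete, the sketch below generates CA-like positions for one helix of a coiled coil from a commonly quoted form of the Crick equations. Sign and phase conventions differ between implementations, so this should be read as an illustration rather than as the parameterization used internally by MakeBundle; the function name and the numerical values are assumptions.

```python
import math


def crick_trace(n_res, r0, r1, omega0, omega1, alpha, phi0=0.0, phi1=0.0):
    """Coordinates of one helix winding around the superhelical (z) axis.

    r0, r1: superhelical and minor-helix radius (angstrom)
    omega0, omega1: superhelical and minor-helix frequency (radians/residue)
    alpha: pitch angle of the superhelix (radians, must be nonzero)
    phi0, phi1: starting phases of the superhelix and the minor helix
    """
    coords = []
    rise = omega0 * r0 / math.tan(alpha)  # advance along z per residue
    for t in range(n_res):
        a = omega0 * t + phi0
        b = omega1 * t + phi1
        x = (r0 * math.cos(a) + r1 * math.cos(a) * math.cos(b)
             - r1 * math.cos(alpha) * math.sin(a) * math.sin(b))
        y = (r0 * math.sin(a) + r1 * math.sin(a) * math.cos(b)
             + r1 * math.cos(alpha) * math.cos(a) * math.sin(b))
        z = rise * t - r1 * math.sin(alpha) * math.sin(b)
        coords.append((x, y, z))
    return coords


# Illustrative values only (roughly a left-handed coiled coil).
helix = crick_trace(
    n_res=28, r0=5.0, r1=2.26,
    omega0=math.radians(-3.6), omega1=math.radians(102.9),
    alpha=math.radians(-12.0),
)
```

Because the whole backbone follows from a handful of numbers, sampling or grid-searching these parameters replaces fragment insertion as the source of backbone diversity.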

Modeling peptides and peptidomimetics
The inherent flexibility of peptides gives them a large conformational search space, which makes modeling challenging; when peptide modeling is combined with another simulation, e.g. docking, the increase in conformational space makes the modeling task virtually impossible using traditional approaches. PIPER-FlexPepDock 118 is, to our knowledge, the only global peptide docking protocol. It rigid-body docks fragments representing the peptide using PIPER FFT-based docking 182 and refines the complex using FlexPepDock 116. PIPER-FlexPepDock can generate highly accurate peptide-protein complexes from a peptide sequence and a free receptor structure (Figure 3F). Performance decreases in the case of receptor flexibility and when fragments are not available in the fragment database.
Conformations of cyclic peptides can be sampled with simple_cycpep_predict, which restricts the conformational search space through cyclization 50,51,115 via the Generalized Kinematic Closure (GenKIC) algorithm (see "loop modeling" below). The simple_cycpep_predict application does not rely on protein fragments and can model non-canonical chemistries (Figure 3B), being a generalization of earlier protocols.
Experimental protein structure determination is challenging for proteins on solid surfaces such as biominerals, self-assembled monolayers, inorganic catalysts, and nanomaterials. RosettaSurface 121 samples protein conformations ab initio in both the solution and adsorbed states (Figure 3D) in order to account for adsorption-induced conformational changes. Experimental data can be incorporated into the simulation 122 to improve scoring, which remains difficult because the score function has been optimized for soluble proteins in aqueous solvent.

Using experimental data to direct modeling
The use of experimental data in modeling can vastly restrict the conformational search space, thereby allowing larger, more complex biomolecules to be modeled with greater accuracy. Electron density maps from cryo-electron microscopy (cryoEM) or X-ray crystallography have become more readily available in the past decade, and methods to incorporate these types of data have been used successfully for high-resolution structure determination. Since cryoEM density maps are often of low resolution, de novo structure determination methods require a combinatorial search procedure to unambiguously assign all densities to residues in the protein. RosettaES 125 is an enumerative sampling approach that does not require initial assignment of densities; it gradually extends the model one residue at a time until all residues have been assigned. At each iteration, short fragments are used to sample the nearby conformational space of the growing model, followed by a series of clustering and filtering steps based on the energy and fit to the density.
If assignment is complete but the data are low-resolution, refinement into density maps is necessary. Several methods have been developed for density maps in the 3.0-4.5 Å resolution range. More recently, an automated fragment-guided refinement pipeline 128 splits the density map into independent training and validation maps. It finds regions with poor density fit, iteratively rebuilds them with fragments using the training map, filters the models based on their fit to the validation map, model geometry from MolProbity, and fit to the full map, and then optimizes against the full map. The frameworks for electron density maps and carbohydrate modeling 150 (below) were connected 151 for refinement of carbohydrates into low-resolution electron density maps from cryoEM or crystallography.
NMR data were incorporated into de novo structure prediction early in the software's development, creating RosettaNMR. Chemical shifts were used for fragment picking using CS-Rosetta 129, which could be used in conjunction with NOE, RDC 183, PCS 130,131,184, and PRE data. Improvements, for instance through RASREC resampling 185, allowed the use of sparse 186 or unassigned data 187 and easier-to-obtain data (backbone-only 188); modeling of larger and more complex proteins 189, membrane proteins 190, and symmetric systems 191; and combination with data from SAXS 192, cryoEM 193, distance restraints from homologous proteins 194, and evolutionary couplings 195. CS-Rosetta also has the AutoNOE 196,197 module for automatic assignment of NOESY data for use in structure calculations. RosettaNMR was recently overhauled and reconciled with CS-Rosetta and PCS-Rosetta to allow seamless integration of several types of NMR restraints (CS, RDC, PCS, PRE, NOE) in one consistent framework 132 that can be applied to structure prediction, protein-protein docking, protein-ligand docking, and symmetric assemblies.
Covalent labeling mass spectrometry data provide information on the relative solvent exposure of residues, and therefore on protein tertiary structure. A low-resolution score term from hydroxyl radical footprinting has been implemented that can improve model quality in structure prediction 133,134. Finally, data from chemical cross-linking mass spectrometry have been incorporated into an automated workflow to identify protein-protein interactions. The PyTXMS 135 method combines the sensitivity of mass spectrometry to analyze complex samples with the power of Rosetta structural modeling and protein-protein docking to efficiently sample the vast conformational space and identify interactions (Figure 3C). A machine learning algorithm based on high-resolution MS1 data guides the selection of the potential binding interface, which is then validated and adjusted using a repository of structural models and MS2 (DDA) samples.

Modeling nucleic acids and their interactions with proteins
DNA and RNA modeling faces a multitude of challenges: a lack of structures has led to underdeveloped score functions and low-quality alignments, the torsion space to be sampled is much larger than for proteins (a 70-residue RNA is comparable to a 200-residue protein), and a lack of interest in the scientific community has led to a gap in knowledge. Moreover, in contrast to protein helices, where sequence information is displayed on the helix exterior through side chains, helical RNA side chains point inwards, hiding sequence information from the environment and making prediction of tertiary or non-local contacts vastly more difficult. Non-local contacts are mediated by loops, which poses a further major challenge to prediction algorithms.
Several advances have been made in the representation of nucleic acids in Rosetta. The StepWise Monte Carlo protocol (SWM) has achieved RNA structure predictions reaching atomic accuracy 138; the approach provides an acceleration over the original enumerative StepWise Assembly (SWA) method 136,137. A version of SWA that rebuilds one nucleotide at a time enables fine-grained correction of errors in RNA coordinates fit into crystallographic or cryo-EM maps by Enumerative Real-space Refinement ASsisted by Electron density under Rosetta 142,143 (ERRASER).
The most recent advances in RNA tools expand the fragment assembly protocol to support modeling RNA-protein complexes through simultaneous folding and docking 141. RNA-protein interactions are handled via additional knowledge-based score terms that supplement the low-resolution RNA score function. Free energy perturbations from RNA or protein mutations can be modeled with the Rosetta-Vienna ΔΔG protocol 49. Structure coordinates can further be built into cryo-EM density maps for large RNA-protein complexes with DRRAFTER (De novo Ribonucleoprotein modeling in Real space through Assembly of Fragments Together with Experimental density in Rosetta) 145.
Redesign and prediction of protein-DNA interfaces is also possible 198,199 and has been accomplished with flexible protein backbones 200, genetic algorithms 198,200,201, and motif-biased rotamer sampling 202,203. However, the biggest limitation of these approaches is that they rely on fixed DNA backbone conformations, whereas DNA in nature can be highly flexible. Key to successful protein-DNA design is a score function that is optimized 203,204 for these highly polar and solvated interfaces. The software further supports prediction of specificity and affinity 205 and the prediction of DNA binding preferences of homologous proteins. Multi-template modeling in RosettaCM 62 was successfully applied to this challenge 206: protein homology modeling was followed by docking of multiple competing DNA sequences threaded onto the original crystal structure backbone and comparison of the energies of the resulting protein-DNA complexes.

Modeling membrane proteins
Membrane proteins constitute about 30% of all proteins and are targets for over 60% of pharmaceuticals on the market 207. However, experimental difficulties have limited our understanding of their structures 208. Previously, Yarov-Yarovoy 209,210 and Barth 211 implemented tools for low- and high-resolution structure prediction of membrane proteins, termed RosettaMembrane. These tools were recently re-engineered for compatibility with Rosetta3 33 into a platform called RosettaMP 146. RosettaMP implements core modules for representing, sampling, and scoring proteins in the context of an implicit membrane. RosettaMP is compatible with key modeling protocols including docking, design, ΔΔG prediction 176, PyMOL visualization 212, and assembly of symmetric proteins. In addition, a set of basic modeling tools 147 is implemented, for instance for scoring, transforming a membrane protein into the membrane coordinate frame, de novo modeling of single transmembrane span helices, introducing mutations, and visualization in the membrane. RosettaMP has further enabled rapid development of new modeling tools, including structure-based detection of lipid-exposed residues in the membrane 148 and domain assembly of full-length protein models from structures of transmembrane and soluble domains 149. The RosettaCM protocol for multi-template homology modeling has also been adapted to membrane proteins 39.
Describing membrane protein energetics is challenging since the proteins reside in an anisotropic environment and tend to bury polar solvent molecules (e.g. water, ions) that stabilize the structure and participate in important conformational transitions. Implicit membrane models often fail to reliably model membrane protein interiors. SPaDES, a method based on a hybrid explicit-implicit solvent model, was developed and enhanced the prediction and design of membrane protein structures 213.
Limitations of membrane protein modeling are similar to, but less severe than, those of RNA modeling: there are fewer structures in databases, fewer method developers in this field, and hence fewer available tools. As a consequence, the score function is much less mature than the latest score functions for soluble proteins: the implicit solvent hydrophobic slab model is a very coarse-grained representation of the membrane. Ongoing efforts expand this model by including pores, lipid specificity, and different thicknesses 214, yet many effects remain to be accounted for, such as measurement-specific or observed membrane geometries (micelles, bicelles, nanodiscs, vesicles, different pore types, fusion and fission of multiple membranes) and macroscopic physical phenomena like membrane tension and fluidity. Challenges in including these effects are the experimental measurements needed for parameterization of these models and the adaptation of a multitude of score terms.

Adding carbohydrates to the modeling process
Carbohydrates are fundamental to life 215,216, but because of challenges in experimental characterization and in computational sampling and scoring, their structures have historically been under-studied. The RosettaCarbohydrate framework 150 allows modeling of carbohydrate structures and complexes. The framework is integrated into the software such that it is possible to model glycosylated proteins or protein-sugar complexes (Figure 3F) with the same algorithms one would use for proteins. RosettaCarbohydrate is not limited to commonly studied sugars but can handle the full gamut of carbohydrate structures, including linear, cyclic, and branched structures, sugar modifications, and conjugations. Methods exist for sampling ring conformations, packing substituents, refining glycosidic linkages, sampling from linkage "fragments", and extending glycan chains. Scoring of saccharide-containing structures includes a quantum-mechanically derived intrinsic backbone term 217. Because saccharide residues are stored as distinct data structures, we can integrate bioinformatic and statistical data into our algorithms, which opens the door to glycoengineering and design applications. RosettaCarbohydrate has been integrated with various other frameworks, such as loop modeling (GenKIC and Stepwise Assembly), refinement (GlycanTreeModeler), symmetry, and RosettaScripts-accessible classes such as MoveMaps and ResidueSelectors. Linkages are automatically determined during PDB read-in. Carbohydrates work with Cartesian minimization, and they can be refined into electron density maps 151. Limitations of the carbohydrate framework are the increased sampling space due to carbohydrate flexibility and branching and the need to implement different chemistries for branching and cyclization, which also requires adjustments to the score function. Developments in this area have only started in the past few years and much work remains to be done.

User interfaces and usability
Advances have also focused on improving the usability of Rosetta by developing several user interfaces to suit different use cases and styles (Figure 4). The command line interface was the first and is still the most often used interface to Rosetta methods. In addition, the software features two major scripting interfaces: RosettaScripts and PyRosetta. RosettaScripts 37 is a popular scripting interface that uses Extensible Markup Language (XML) to build fairly complex protocols from core machinery 33 without requiring knowledge of the codebase. PyRosetta 36,152 is a collection of Python bindings to the source code, allowing custom protocol development that is flexible and fast but requires familiarity with the underlying codebase. Other interfaces are InteractiveRosetta 153 and the gaming interface Foldit Standalone 154,156, further described in the Supplement. Our community has devoted an enormous effort to enhancing the user-friendliness of Rosetta by rewriting and adding documentation (Figure 5). We now use a public-facing Gollum wiki (https://www.rosettacommons.org/docs/latest/Home) for various levels of documentation, such as application documentation, tutorials for beginning users, and static protocol captures that accompany manuscripts for scientific reproducibility (see Supplement for links). The Gollum wiki is easily editable by members of the RosettaCommons, which has drastically improved the quantity and quality of documentation.
A limitation of Rosetta is the need for a local installation and compilation in a Unix-like environment. Webservers provide a user-friendly alternative, and a number of independent servers have emerged in our community. However, implementing and maintaining such servers comes at a substantial cost. To make it easier to provide webservers for protocols, ROSIE (Rosetta Online Server that Includes Everyone) 158,218 (http://rosie.rosettacommons.org/) implements a simple framework for "serverification" of protocols. ROSIE currently contains 21 webservers, with additional protocols continually being added.

A look into the future
Rosetta development is ongoing and will continue to focus on expanding the scope of protein design and modeling by integrating high-throughput experimental data with high-throughput computational methods, which in turn impacts score function development and aids in developing novel therapeutic interventions 219; restructuring the software for massively parallel computing architectures (e.g. GPUs, TPUs) and quantum computers 220; greater use of machine-learning (e.g. deep-learning) approaches (e.g. for score function development); modeling more realistic cellular environments; and improving user interfaces to continue making our software accessible to more scientists. The predictive powers implemented in Rosetta that we have reviewed above can be leveraged not only to analyze and verify existing data but also to inform the experiments that will galvanize the engineering of industrial enzymes, enable the creation of novel biomaterials, and accelerate the discovery of new potent therapeutics.

Conclusion
The Rosetta software is developed by a large, global community that aims to solve complex problems through real-time collaborative code development. In the last five years, great strides have been made in our software. More protocols are now available that enable modeling of a broader range of biological and chemical macromolecular systems. Prediction accuracies have improved through advances in the score function, which is a combination of physics-based and knowledge-based potentials that were fit against known structures and thermodynamic observables. Incorporating experimental data into the modeling process has been facilitated and improved. Further, our community saw the need to develop more general, reusable, user-friendly, and scientifically reproducible protocols. This was motivated by the growth of the software and the developer community, the various user interfaces, the diversity of the community 32, and the complexities of the protocols used to solve real-world problems. The improvements to documentation allow users to quickly start using or developing custom protocols, while facilitating user support for the various interfaces (command line, RosettaScripts, PyRosetta, etc.). Over the years, these applications have moved beyond tackling basic science questions (i.e. the protein folding and design challenges) to more application-based scientific developments. The myriad advances described above have made integration of Rosetta into existing experimental and computational scientific workflows increasingly useful and standard, as evidenced by the large number of licenses (~30,000 academic and ~70 commercial), the 11 spinoff companies that were created from the RosettaCommons 32, and the ever-increasing number of citations from labs beyond those affiliated with RosettaCommons.

Figure 2: Main elements of Rosetta are scoring and sampling. (A) Three main elements are required in a Rosetta protocol. The Pose is the biomolecule, such as a protein, RNA, DNA, small molecule, or glycan, in a specific conformation. Residues in the Pose can be selected via ResidueSelectors and the behavior for side-chain optimization or mutation can be defined by TaskOperations. Specific Movers then control how the conformation of the Pose is changed, and the new conformation is subsequently evaluated by a ScoreFunction. The Metropolis criterion decides whether the new conformation is accepted in the sampling trajectory. Many independent sampling trajectories are generated, and the final models are evaluated based on the purpose of the protocol. (B) The score function consists of a weighted linear combination of various score terms, highlighted in the figure and described above.
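The sampling and scoring loop described in this caption can be sketched in plain Python. This is a toy under stated assumptions and does not use the Rosetta or PyRosetta API: the "pose" is a single number, the two score terms and their weights are invented, and the names total_score, metropolis_accept, and run_trajectory are hypothetical.

```python
import math
import random


def total_score(pose, terms, weights):
    """Weighted linear combination of score terms evaluated on a pose."""
    return sum(weights[name] * term(pose) for name, term in terms.items())


def metropolis_accept(old_score, new_score, kT, rng):
    """Always accept downhill moves; accept uphill moves with Boltzmann odds."""
    if new_score <= old_score:
        return True
    return rng.random() < math.exp(-(new_score - old_score) / kT)


def run_trajectory(pose, mover, terms, weights, n_steps=1000, kT=1.0, seed=0):
    """One Monte Carlo trajectory; returns the lowest-scoring pose seen."""
    rng = random.Random(seed)
    current, current_score = pose, total_score(pose, terms, weights)
    best, best_score = current, current_score
    for _ in range(n_steps):
        trial = mover(current, rng)  # the mover proposes a new conformation
        trial_score = total_score(trial, terms, weights)
        if metropolis_accept(current_score, trial_score, kT, rng):
            current, current_score = trial, trial_score
            if current_score < best_score:
                best, best_score = current, current_score
    return best, best_score


# Toy system: a 1D "pose" in a rugged harmonic well.
toy_terms = {"well": lambda x: (x - 2.0) ** 2, "ripple": lambda x: math.cos(5.0 * x)}
toy_weights = {"well": 1.0, "ripple": 0.5}
toy_mover = lambda x, rng: x + rng.gauss(0.0, 0.5)
start = 10.0
start_score = total_score(start, toy_terms, toy_weights)
best, best_score = run_trajectory(start, toy_mover, toy_terms, toy_weights)
```

In practice many such trajectories would be run with different random seeds, and the resulting models would be ranked or clustered according to the goal of the protocol.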

Figure 4: User interfaces to the codebase. (A) Rosetta can be run from a terminal and offers three different interfaces to the codebase. The top panel outlines the task to be accomplished: making two mutations in a protein and then refining the structure. The panels underneath show how this task can be accomplished in the different interfaces. The command line panel shows the executable, input files, and options to run two specific applications. RosettaScripts is an XML-based scripting language that offers more flexibility by combining Movers and ScoreFunctions into a custom Protocol. PyRosetta offers direct access to the underlying code objects but requires knowledge of the codebase. (B) Point-and-click interfaces to the codebase. InteractiveRosetta is a graphical user interface (GUI) to PyRosetta. It offers controls for the most popular protocols, file formats, and options. Foldit is a video game primarily used to crowd-source real-world scientific puzzles but can also be used on custom proteins of interest. It allows access to some popular applications via a game interface. ROSIE is a super-server hosting a multitude of servers, each executing a particular protocol. It currently includes servers for 21 Rosetta methods. [The InteractiveRosetta panel was reproduced with permission from Bioinformatics.]

Figure 5: Main external documentation page. In 2015, our community performed a complete overhaul of our documentation. Documentation is now hosted on a Gollum wiki, which is version controlled and easily editable by members of our community. Accessibility and the ability to edit the documentation have drastically improved the user experience of the software.