Miniprotein design: past, present, and prospects. Accounts of Chemical Research , 50 (9),

The design and study of miniproteins, that is, polypeptide chains <40 amino acids in length that adopt defined and stable 3D structures, is resurgent. Miniproteins offer possibilities for reducing the complexity of larger proteins and so present new routes to studying sequence-to-structure and sequence-to-stability relationships in proteins generally. They also provide modules for protein design by pieces and, with this, prospects for building more-complex or even entirely new protein structures. In addition, miniproteins are useful scaffolds for templating functional domains, for example, those involved in protein-protein interactions, catalysis, and biomolecular binding, leading to potential applications in biotechnology and medicine. Here we select examples from almost four decades of miniprotein design, development, and dissection. Simply because of the word limit for this Account, we focus on miniproteins that are cooperatively folded monomers in solution and not stabilized by cross-linking or metal binding. In these cases, the optimization of noncovalent interactions is even more critical for the maintenance of the folded states than in larger proteins. Our chronology and catalogue highlights themes in miniproteins, which we explore further and begin to put on a firmer footing through an analysis of the miniprotein structures that have been deposited in the Protein Data Bank (PDB) thus far. Specifically, and compared with larger proteins, miniproteins generally have a lower proportion of residues in regular secondary structure elements (α helices, β strands, and polyproline-II helices) and, concomitantly, more residues in well-structured loops. This allows distortions of the backbone enabling mini-hydrophobic cores to be made. This also contrasts with larger proteins, which can achieve hydrophobic cores through tertiary contacts between distant regions of sequence. 
On average, miniproteins have a higher proportion of aromatic residues than larger proteins, and specifically electron-rich Trp and Tyr, which are often found in combination with Pro and Arg to render networks of CH-π or cation-π interactions. Miniproteins also have a higher proportion of the long-chain charged amino acids (Arg, Glu, and Lys), which presumably reflects salt-bridge formation and their greater surface area-to-volume ratio. Together, these amino-acid preferences appear to support greater densities of noncovalent interactions in miniproteins compared with larger proteins. We anticipate that with recent developments such as parametric protein design, it will become increasingly routine to use computation to generate and evaluate models for miniproteins in silico ahead of experimental studies. This could include accessing new structures comprising secondary structure elements linked in previously unseen configurations. The improved understanding of the noncovalent interactions that stabilize the folded states of such miniproteins that we are witnessing through both in-depth bioinformatics analyses and experimental testing will feed these computational protein designs. With this in mind, we can expect a new and exciting era for miniprotein design, study, and application.

I. INTRODUCTION
Polypeptide chains fold into myriad three-dimensional shapes determined by their amino acid sequences. Currently, there are almost 120,000 protein structures in the Protein Data Bank (PDB), the chains of which adopt over 1300 distinct folds and have an average length of 800 residues. Understanding this process is known as the protein-folding problem, 1 which has three aspects: How do proteins fold mechanistically? How do proteins fold in vivo? And, what information in the amino-acid sequence encodes 3D structure and function? 2 These are complex problems because the folded states are determined by the interplay of many weak non-covalent interactions between thousands of atoms. 3 Moreover, the entropic cost of folding protein chains is only just outweighed by the enthalpy of forming these interactions, making folded proteins only marginally stable. One way to address these various aspects of the protein-folding problem is to study so-called miniproteins where complexity is reduced, providing accessible platforms for dissecting contributions to protein folding and stability both in silico and in vitro. 4,5 Here we define miniproteins as short proteins of ≤40 amino acids with well-defined folds consisting of two or more secondary structure elements, sequestered hydrophobic cores, and cooperative folding. Standalone, small and cooperatively folded secondary structures exist, e.g., single α-helical peptides; 6 however, these lack the hydrophobic cores typical of globular proteins. This definition resonates with the concept of foldamers more generally. 7 The small size of miniproteins necessarily means smaller hydrophobic cores and fewer non-covalent interactions than found in proteins generally. Therefore, many miniproteins are stabilized by metal binding or covalent cross-linking; for example, EF hands, 8 zinc fingers 9 and cysteine-knot peptides. 10 Here, we focus on water-soluble and largely monomeric miniproteins stabilized solely by non-covalent interactions and without bound metals or covalent cross-links, although we recognize that great strides have been and continue to be made with these miniproteins more generally. 11,12 We give a brief history of the sub-field, highlight key examples and themes in more detail, and tease out rules for miniprotein design supported by an analysis of the PDB. Finally, we discuss potential applications and outlook for miniproteins.
Many of the miniproteins described over the last four decades are fragments of larger globular proteins and have been subject to iterative redesign and optimization to enhance stability and impart function. 5,13-15 Successes in this have taught us some general rules of thumb for miniprotein design that we discuss with examples here. However, and despite this, the field of miniprotein design is far from mature.
Most recently, high-throughput methods have been applied to miniprotein design. 16,17 This starts with fragment-based computational design of backbones followed by the generation of libraries of best-fitting sequences. Experimentally, the miniprotein libraries are displayed on yeast and unstable and stable variants are distinguished by protease treatment and then identified by fluorescence-activated cell sorting and deep sequencing. This reveals sequence-to-stability relationships, which, reassuringly, mirror some well-established rules of protein folding.

PANCREATIC POLYPEPTIDES
The polyproline II:loop:-helix fold was first observed in the X-ray crystal structure of avian pancreatic peptide hormone (aPP) dimer. 18 This compact fold is stabilized by the interdigitation of proline residues from the polyproline-II helix with aromatic residues presented by the  helix to form a hydrophobic core, and in addition -stacking interactions stabilize the dimer interface, Figure 1. Recently, we have designed a monomeric 34-residue miniprotein with the same overall topology of the pancreatic polypeptides, 19 which we call PP and discuss later, Figures 2A&3.
Directed evolution of synthetic aPP monomers has been used to develop miniproteinbased ligands as therapeutics. Optimization of the polyproline II helix in aPP gave a variant with high affinity for the ActA target protein in Listeria monocytogenes, EVH1 mena1-112. Importantly, the miniprotein discriminates between paralogs and reduces bacterial motility. 13 A similar strategy has been used to optimize the -helix of aPP for sequence-specific DNA recognition, 14 and the introduction of arginine residues in aPP facilitates transport of the miniprotein into cells. 20 Artificial esterases have also been developed by grafting catalytic residues onto the solvent-exposed  helix of bovine pancreatic polypeptide, bPP. 21

ββα FOLDS & METAL-FREE ZINC FINGERS
Small, independent ββα units are best exemplified by DNA-binding zinc fingers. These were first identified in transcription factor IIIA from Xenopus oocytes, and an NMR structure determined for the Xfin domain followed. 23 The fold, which comprises a β hairpin, a connecting loop, and an α helix, is not driven by the conserved hydrophobic core but by the binding of zinc, usually through His2Cys2 motifs. The development and modular assembly of metal-binding zinc-finger domains has created artificial proteins and enzymes that can recognize defined regions of DNA for the activation, repression or alteration of user-specified genes, and contributed to the early development of genome editing. 24 Regarding metal-free designs, a 23-residue monomeric structure has been achieved through iterative design enhancing a hydrophobic core, α-helix structure and inclusion of a suitable turn; 25 and a computational design has produced a weakly cooperatively folded peptide with a midpoint of unfolding (TM) of 39 °C. 26 The broad transition is consistent with a low enthalpy of folding expected for a small hydrophobic core.

VILLIN HEADPIECE
Another approach to miniprotein design is to pare down larger natural proteins. The villin headpiece, a 35-residue fragment of the chicken protein (HP-35), folds in water. 27 The NMR ensemble reveals three α-helical segments with each helix contributing residues to a central hydrophobic core. 27 HP-35 is surprisingly thermostable with a TM of 70 °C. 28

β-HAIRPINS AND TRP-ZIPPERS
Early examples of free-standing β-structures based on natural fragments were only moderately folded in fully aqueous media. 32,33 However, designed Trp-zippers are well-folded examples of short (12 to 16 residues) de novo β hairpins with interlocked Trps and right-handed, highly twisted strands, Figure 2D. 29 They have exceptional thermal stabilities and reversible cooperative unfolding.
Trp-pocket β hairpins are stabilized via cation-π interactions in which a single Lys packs against a diTrp cleft on the opposite strand. 34 Some 12-residue Trp-pockets are fully folded 34 and resist degradation, 35 making them the most stable β hairpins reported.
From a wealth of studies, reliable guidelines for the design, optimization and stabilization of monomeric β-hairpins include: a hydrophobic cluster on one surface of the hairpin and close to the loop; 36 inter-strand side-chain interactions, particularly Trp-Trp; 29,37 high turn and β-sheet propensities; 38 and charged or aromatic residues, or β-capping motifs to secure the termini via cross-strand interactions. 39

DESIGNED THREE-STRANDED -SHEETS
A challenge in the design of isolated -sheets is to avoid -amyloid-like assemblies. 40 Some tentative three-stranded antiparallel -sheets have been achieved in aqueous methanol by appending strands onto -hairpins. 41 An early NMR structure of 'Betanova' in water incorporates an aromatic-rich hydrophobic cluster on one surface of the sheet. 42 However, subsequent studies indicate only partial folding, though improved stability has been achieved by computationally informed mutations. 43 Protein redesign presents another route to -sheet miniproteins. WW domains are natural antiparallel 3-stranded  sheets, named after the two conserved Trp residues in the first and third  strands. 44 X-ray crystal structures of two shorter (34 and 37 residues) natural WW domains followed, in addition to that for a designed 33-residue prototype, Figure 2E. 30 Natural WW domains have been engineered for alternative functions; e.g., the incorporation of a DNA binding pocket, 15 and for probing carbohydrate-aromatic packing interactions. 5

TRP-CAGE
The Trp-cage is a 20-residue miniprotein derived from Gila monster exendin-4. 45 The original truncation is only folded in aqueous trifluoroethanol. 45 However, NMR structures of variants show a well-ordered fold comprising an α helix followed by a well-structured loop, with a hydrophobic core centered on a single Trp residue buttressed by Pro side chains, Figure 2F. 31 Owing to its small size and the wealth of experimental data now available, the Trp-cage has become a paradigm for experimental and computational miniprotein folding. 46

βαβ DESIGNS
(βα)n repeats occur widely in natural proteins, e.g., TIM barrels, the Rossmann fold, and leucine-rich repeats. The alternating secondary structure elements lead to parallel β sheets. The first de novo designed, standalone, water-soluble βαβ unit comprises a 12-residue α helix paired with two 5-residue β strands via a small hydrophobic core. 47 Although initial designs are molten globule, a folded state has been stabilized by installing a Trp-zip-like Trp pair in the β-strands. 29 An NMR structure of the resulting 36-residue βαβ construct (Figure 2G) confirms face-to-face packing of the installed pair. The miniprotein is highly stable up to temperatures of 90 °C, which is remarkable for a miniprotein with only proteinogenic amino acids and without covalent cross-links.

TRP-PLEXUS
The first miniprotein comprising a -strand and a polyproline II helix has been achieved through a fragment-based design, Figure 2H. 22 TrpPlexus combines a short, Arg-rich, Nterminal  strand, and a C-terminal polyproline II helix that is free of Pro but rich in Trp in a WSXWS motif. These are borrowed from a fibronectin III binding domain, and linked by a D-Pro-Gly loop to give a 19-residue construct. NMR spectroscopy shows the Arg and Trp residues interdigitate to form a cation- network. 22 A disulfide-cyclized TrpPlexus tolerates Nsubstituted Gly and Pro residues in the polyproline II helix, opening up potential for proteinogenic and peptoid side-chain placements for peptidomimetic inhibitors of proteinprotein interactions. 48

PPα
Recently we combined fragment-based and rational design to create the monomeric miniprotein, PPα. 19 PPα comprises a polyproline II helix:loop:α-helix topology that is stabilized by interdigitation of Pro residues of the polyproline helix into aromatic residues presented by the α-helix, similar to knobs-into-holes interactions found in coiled coils, Figure 3. The helices were borrowed from the bacterial adhesin AgI/II 49 and an intervening loop from a pancreatic polypeptide. 50 We selected ≈6 turns of α helix to partner ≈3 turns of polyproline helix and to maintain knobs-into-holes-like interactions along the lengths of both helices. The tyrosine variant, PPα-Tyr, is water soluble, monomeric and unfolds cooperatively with a TM of 39 °C. An NMR structure reveals intimate CH-π interactions 51 between the proline and aromatic side chains, Figure 3D. We have explored the importance of these interactions by mutating the three parent tyrosine residues to para-substituted phenylalanine residues with varying ring electron densities. These experiments highlight electronic and electrostatic contributions to the interaction beyond van der Waals' contacts, as the more electron-rich aromatics lead to more stable PPα variants. Interestingly, of the proteinogenic aromatic side chains, Pro-Tyr and Pro-Trp interactions gave PPα variants with similar thermal stabilities. This corroborates our bioinformatics analyses of the PDB, which highlight a preference for CH-π interactions between Pro and the electron-rich aromatics but not Phe. 19

III. COMMON FEATURES: BIOINFORMATICS ANALYSIS OF MINIPROTEINS IN THE PDB
The above examples hint at common features that relate miniprotein sequence, structure and stability; for example, the importance of large aromatics, particularly Trp, in the hydrophobic cores of miniproteins. To explore this, and to seek other sequence-to-structure relationships, we performed a comparative bioinformatics analysis of the mini- and large proteins in the PDB. For this, we culled a non-redundant database of X-ray crystal and solution NMR structures of miniproteins of ≤40 residues and with ≤40% pairwise sequence identity, Figure 4; and a set of larger proteins with ≥100 residues and high-resolution (≤1.0 Å) X-ray crystal structures. We verified that our set of miniproteins contained only monomeric structures determined in aqueous media.
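The culling step described above can be illustrated with a minimal sketch. It assumes a simple greedy filter and a crude, ungapped identity measure; the function names (`naive_identity`, `cull`) and the toy sequences are hypothetical, and a real analysis would use alignment-based identities and is not necessarily how the dataset here was built.

```python
from typing import Callable, List


def naive_identity(a: str, b: str) -> float:
    """Ungapped, position-by-position identity over the shorter sequence.

    Assumption for illustration only: a crude stand-in for a proper
    alignment-based pairwise sequence identity.
    """
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / n


def cull(
    sequences: List[str],
    max_len: int = 40,
    max_identity: float = 0.40,
    identity: Callable[[str, str], float] = naive_identity,
) -> List[str]:
    """Greedily keep sequences that are short enough and not too similar
    to any sequence already kept."""
    kept: List[str] = []
    for seq in sequences:
        if len(seq) > max_len:
            continue  # too long to count as a miniprotein under this cutoff
        if all(identity(seq, k) <= max_identity for k in kept):
            kept.append(seq)
    return kept


if __name__ == "__main__":
    # Hypothetical toy sequences, not real miniproteins.
    toy = ["ACDEFGHIKL", "ACDEFGHIKM", "WWWWWWWWWW", "A" * 50]
    print(cull(toy))
```

A greedy filter depends on input order, so sorting the candidates first (for example by structure quality) is a design choice left open here.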
First, we compared the amino-acid compositions of the two sets, Figure 5. Miniproteins use three classes of amino acids more often than the larger proteins: long and/or charged amino acids, e.g., Arg, Glu, Lys, Met (and to a lesser extent His); electron-rich aromatics, Trp and Tyr, with Trp showing a preference for miniproteins twice that of larger proteins; and Pro. In contrast, small amino acids and the aliphatic hydrophobics are found more often in the larger proteins. This suggests that the polar aromatics and longer charged amino acids provide routes to good non-covalent interactions, i.e., salt bridges, and CH-π and cation-π interactions; and that Pro helps reduce the entropic cost of folding as well as buttressing aromatics in the core. We have also examined secondary structure content using the Define Secondary Structure of Proteins (DSSP) algorithm, Figure 4. 52 This revealed similar proportions of α helix in miniproteins and larger proteins; but less β strand in the former (with the exception of the water-soluble β-hairpins and three-stranded β-sheets, of course). It is possible that this is due to bias in the dataset rather than miniprotein requirements. Also, it appears that miniproteins have more-contorted backbones and make better use of structured loops to best sequester hydrophobics within their interiors. Concomitant with reduced regular secondary structures, we found fewer main chain-main chain hydrogen bonds in miniproteins, with half of the residues making these compared with approximately three quarters in larger proteins. The shortfall in miniproteins is likely made up by more hydrogen bonds to water, as their small size gives greater surface area-to-volume ratios.
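A comparison of amino-acid compositions between two sequence sets can be sketched as follows. This is a minimal illustration, assuming pooled residue counts and a log2 ratio of frequencies as the enrichment measure; the function names, the pseudocount, and the toy sequences are assumptions for illustration and are not taken from the analysis reported here.

```python
import math
from collections import Counter
from typing import Dict, Iterable

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 proteinogenic residues


def composition(sequences: Iterable[str]) -> Dict[str, float]:
    """Fractional amino-acid composition pooled over all sequences."""
    counts: Counter = Counter()
    for seq in sequences:
        counts.update(c for c in seq.upper() if c in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard residues found")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def log2_enrichment(
    mini: Iterable[str], large: Iterable[str], pseudocount: float = 1e-6
) -> Dict[str, float]:
    """log2(frequency in mini set / frequency in large set) per residue.

    Positive values mean the residue is more common in the first set.
    The pseudocount (an assumption) avoids division by zero.
    """
    f_mini = composition(mini)
    f_large = composition(large)
    return {
        aa: math.log2((f_mini[aa] + pseudocount) / (f_large[aa] + pseudocount))
        for aa in AMINO_ACIDS
    }


if __name__ == "__main__":
    # Hypothetical toy sets, not real PDB-derived sequences.
    scores = log2_enrichment(["WWAA"], ["WAAA"])
    print(round(scores["W"], 3), round(scores["A"], 3))
```

On this scale, a residue used twice as often in the first set as in the second scores close to +1.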
So how are miniproteins stabilized? True, the overall entropic cost of folding a miniprotein might be lower than for larger proteins, but it will still be unfavorable and will have to be recouped by enthalpically favorable interactions. However, it is clear that fewer such interactions are made in miniproteins as they generally have broader thermal unfolding transitions than larger proteins. 19 Nonetheless, there must be favorable non-covalent interactions to outweigh the T∆S term of folding. To address this, we are actively interrogating the above databases for sequence-to-structure/stability relationships. It is early in this analysis, but trends are emerging. For example, we find that, when normalized for length, miniproteins make up to eight times as many salt bridges as their longer counterparts; and, when normalized for both length and number of aromatic residues, miniproteins are approximately six times denser in CH-π interactions.
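The link between a smaller enthalpy of unfolding and a broader thermal transition can be illustrated with a minimal two-state sketch. It assumes ∆G(unfolding) = ∆H(1 − T/TM), with no heat-capacity change, which is a simplification. The TM of 39 °C is taken from the text, whereas the enthalpy values (100 and 300 kJ/mol) and the function names are hypothetical choices for illustration only.

```python
import math

R = 8.314462618e-3  # gas constant in kJ mol^-1 K^-1


def fraction_folded(temp_k: float, tm_k: float, dh_kj: float) -> float:
    """Two-state folded fraction, assuming dG_unfold = dH * (1 - T/Tm)
    and no heat-capacity change on unfolding (a simplifying assumption)."""
    dg_unfold = dh_kj * (1.0 - temp_k / tm_k)
    k_unfold = math.exp(-dg_unfold / (R * temp_k))
    return 1.0 / (1.0 + k_unfold)


def temp_at_fraction(frac: float, tm_k: float, dh_kj: float) -> float:
    """Temperature (K) at which the folded fraction equals `frac`.

    Closed form from the model above: 1/T = 1/Tm - R*ln((1-f)/f)/dH.
    """
    if not 0.0 < frac < 1.0:
        raise ValueError("frac must lie strictly between 0 and 1")
    inv_t = 1.0 / tm_k - R * math.log((1.0 - frac) / frac) / dh_kj
    if inv_t <= 0.0:
        raise ValueError("enthalpy too small for this fraction in the model")
    return 1.0 / inv_t


def transition_width(tm_k: float, dh_kj: float,
                     lo: float = 0.1, hi: float = 0.9) -> float:
    """Temperature span (K) over which the folded fraction falls from hi to lo."""
    return temp_at_fraction(lo, tm_k, dh_kj) - temp_at_fraction(hi, tm_k, dh_kj)


if __name__ == "__main__":
    tm = 39.0 + 273.15  # TM of 39 degrees C from the text, in kelvin
    for dh in (100.0, 300.0):  # hypothetical unfolding enthalpies, kJ/mol
        print(f"dH = {dh:.0f} kJ/mol: 90%-to-10% width = "
              f"{transition_width(tm, dh):.1f} K")
```

In this model the width of the transition scales roughly inversely with the unfolding enthalpy, which is why a small hydrophobic core with few interactions is expected to give a broad melt.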

IV. OUTLOOK
We have attempted to cover four decades of miniprotein research in a 6,000-word Account. As a result, we have had to omit many of the fascinating aspects of these studies, including details of the approaches taken, the experimental and theoretical methods used, the sequences explored, and the nuances of the results obtained. However, we hope to have conveyed two main aspects of miniprotein research and development that others may find useful, namely: first, a chronology and overview of miniprotein discovery and, with it, the design and engineering approaches taken to reveal them; and, second, common sequence and structural features found in miniproteins through these studies and our own, albeit preliminary, bioinformatics analysis of the RCSB PDB.
On the latter, sequence-to-structure/stability relationships are clearly emerging for miniprotein folding. Moreover, these are being discussed and, indeed, understood in terms of the non-covalent interactions that underpin them. For miniproteins, and mostly in contrast to what is possible with larger proteins, these interactions can be probed with atomic resolution using synthetic peptide chemistry: non-proteinogenic amino acids can be introduced into miniproteins, and principles and methods from physical organic chemistry can be used to rationalize their impact on structure and stability. 5,19 Such knowledge and understanding will not only illuminate how miniproteins are stabilized but also how protein structures are specified and maintained in general. 5,19,53,54 In turn, this will undoubtedly improve our abilities to engineer existing miniproteins and to design new examples of these de novo. We anticipate that, along with the in biro and rational design approaches that have been favored to date, parametric computational protein design will become increasingly used to deliver new miniproteins. Here, a challenge for the protein-design community will be to make their methods more available to non-expert users.
One thing that we have neglected in this Account is the potential to functionalize miniproteins for both basic and applied research; we have had to focus on structures and sequence-to-structure/stability relationships rather than on structure-function relationships. Many have contributed to this important aspect of the field. 5,13-15 For example, miniproteins provide scaffolds onto which functional motifs can be grafted. 14,55 The most clear-cut application here is the introduction of binding and recognition motifs, for example to interfere with protein-protein interactions. In addition, and although more challenging, there are prospects for introducing catalytic functions into simplified peptide and protein architectures. 14,21 With the array of structures that we have presented and others that are coming online, and with the underpinning thermodynamic understanding of these, we anticipate further and considerable advances on this road to functional miniproteins.
Finally, and particularly exciting to us, is the concept that miniproteins might be used as building blocks to design and engineer entirely new protein folds. This might be termed proteins from peptides (we thank Andrei Lupas for this phrase) or protein design by pieces. 56-58 It offers routes into what is being called the dark matter of protein fold space, 59 and, thus, into a truly synthetic biology. 60

Emily G. Baker
Dr Emily Baker received a Chemistry degree and then her PhD from the University of Bristol. Her PhD was on the design of single α-helical peptides and the understanding of electrostatic interactions within these. Emily is now a post-doctoral research associate at Bristol working on the de novo design of unexplored protein folds with a focus on probing non-covalent interactions within these, and using DNA-guided peptide assembly.

Gail J. Bartlett
Dr Gail Bartlett obtained her degree in Biochemistry from the University of Oxford, and her PhD in structural bioinformatics from the University of London. Her current research interests focus on the relationships between protein sequence and structure/function, and particularly on the role of non-covalent interactions in protein folding, stability and design.

Kathryn L. Porter Goff
Kate Porter Goff received her MSci in Chemistry from the University of Bristol. She is currently pursuing a PhD within the Bristol Chemical Synthesis CDT under the supervision of Prof. Dek Woolfson working on the rational design, synthesis and characterisation of new protein folds.

Derek N. Woolfson
Prof Dek Woolfson took his first degree in Chemistry at the University of Oxford, and gained a PhD in Chemistry and Biochemistry at the University of Cambridge. He then did postdoctoral research at University College London and the University of California, Berkeley. After 10 years as Lecturer through to Professor of Biochemistry at the University of Sussex, he moved to the University of Bristol in 2005 to take up a joint chair in Chemistry and Biochemistry.
Dek's research is at the interface between chemistry and biology, applying chemical methods and principles to understand biological phenomena. Specifically, his group is interested in the challenge of rational protein design, and how this can be applied in synthetic biology and biotechnology. His particular emphasis is on making completely new protein structures from peptide blocks, and peptide-based biomaterials for applications in cell biology and medicine.
Dek is also co-Director of BrisSynBio, a BBSRC/EPSRC-funded Synthetic Biology Research Centre.

ACKNOWLEDGMENTS
EGB and DNW are supported by a BBSRC/ERASynBio grant (BB/M005615/1); GJB and DNW are supported by the ERC (340764); KLPG is supported by the EPSRC-funded Bristol Chemical Synthesis Centre for Doctoral Training (EP/G036764/1); and DNW is a Royal Society Wolfson Research Merit Award holder (WM140008).