Protein diversification through post-translational modifications, alternative splicing, and gene duplication

Proteins provide the basis for cellular function. Having multiple versions of the same protein within a single organism provides a way of regulating its activity or developing novel functions. Post-translational modifications of proteins, by means of adding/removing chemical groups to amino acids, allow for a well-regulated and controlled way of generating functionally distinct protein species. Alternative splicing is another method with which organisms possibly generate new isoforms. Additionally, gene duplication events throughout evolution generate multiple paralogs of the same genes, resulting in multiple versions of the same protein within an organism. In this review, we discuss recent advancements in the study of these three methods of protein diversification and provide illustrative examples of how they affect protein structure and function.

Addresses
1 Department of Structural and Molecular Biology, University College London, London, United Kingdom
2 Department of Applied Physics, Faculty of Science and Technology, Universiti Kebangsaan Malaysia, Bangi, Malaysia

Corresponding author: Orengo, Christine (c.orengo@ucl.ac.uk)
(Sen N.) (Orengo C.)

Current Opinion in Structural Biology 2023, 81:102640
This review comes from a themed issue on Sequences and Topology (2023)
Edited by Madan Babu and Rita Casadio
For a complete overview see the Issue and the Editorial
Available online xxx
https://doi.org/10.1016/j.sbi.2023.102640
0959-440X/© 2023 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).


Introduction
Traditionally, we think of protein variation in terms of genetic variation within a population, or between species. A protein appears in two different individuals or two different species, but with small changes to its sequence, due to mutation. In contrast, there are many examples of how multiple versions, or protein species, of the same protein come about within an individual [1]. Three sources of such protein species are post-translational modifications (PTMs), alternative splicing (AS), and gene duplication (GD), which results in gene paralogs (Figure 1).
PTMs refer to the post-translational chemical modification of protein residues by the covalent addition of chemical groups such as acetyl, glucosyl, methyl, phosphoryl, or ubiquitin. These modifications expand the repertoire of the standard 20 amino acids and affect protein interactions, lifespan, folding, solubility, and localization. Hence, PTMs are important in various biological processes such as signal transduction, gene expression regulation, cell cycle, and DNA repair.
AS is the rearrangement and assembly of distinct exons (protein-coding sequences) from a single gene, resulting in multiple protein species called isoforms [2]. Multicellular organisms, including humans, other animals, and plants, have been found to exhibit AS [3]. AS can potentially increase protein diversity and introduce additional levels of regulation, as different protein isoforms can be differentially expressed in various tissues and at different developmental stages.
Finally, another evolutionary path for the formation of multiple protein species of the same protein within a species is GD [4]. The scale of GD events throughout evolution can range from duplication of single genes to the duplication of whole genomes. The resulting duplicates, called paralogs, accumulate mutations over time, resulting in evolutionary divergence in both sequence and function [5]. To study the structural diversity between paralogs within different CATH families, we compared good-quality protein domains, that is, those with high pLDDT, few unordered regions, high secondary structure content, high globularity, and high packing density (for details of the structures considered, see Bordin et al. [6]). We were interested in how much the structures diverged and hence considered only the most diverse protein domain alignment in each family. We found that in some families paralogs had similar structures, while in others the structures diverged significantly (Figure 2) [6-8]. These findings further highlight how GD contributes to structural and functional diversity. GD is a useful evolutionary tool, as it increases the tolerance for the accumulation of deleterious mutations, as well as potentially beneficial mutations, thus introducing opportunities to develop new functions.
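The selection step described above, keeping only the most divergent paralog alignment per family, can be illustrated with a minimal sketch. It assumes that pairwise TM-scores and RMSD values have already been computed by external structure comparison tools; the `PairAlignment` record, the function name, and the example identifiers are hypothetical and are not taken from the original analysis.

```python
from typing import Dict, List, NamedTuple


class PairAlignment(NamedTuple):
    """One paralog-pair superposition (hypothetical record layout)."""
    superfamily: str   # superfamily identifier
    domain_a: str      # first paralog domain
    domain_b: str      # second paralog domain
    tm_score: float    # structural similarity; lower means more divergent
    rmsd: float        # RMSD over aligned residues, in angstroms


def most_divergent_pairs(alignments: List[PairAlignment],
                         top_n: int) -> List[PairAlignment]:
    """Keep the lowest-TM-score pair per superfamily, then return the
    top_n superfamilies ranked from most to least divergent."""
    lowest: Dict[str, PairAlignment] = {}
    for aln in alignments:
        current = lowest.get(aln.superfamily)
        if current is None or aln.tm_score < current.tm_score:
            lowest[aln.superfamily] = aln
    return sorted(lowest.values(), key=lambda aln: aln.tm_score)[:top_n]


# Illustrative, made-up values only.
example = [
    PairAlignment("SF_A", "a1", "a2", 0.90, 1.0),
    PairAlignment("SF_A", "a1", "a3", 0.40, 6.5),
    PairAlignment("SF_B", "b1", "b2", 0.70, 2.5),
    PairAlignment("SF_C", "c1", "c2", 0.30, 8.0),
]
for aln in most_divergent_pairs(example, top_n=2):
    print(aln.superfamily, aln.domain_a, aln.domain_b, aln.rmsd)
```

Ranking by TM-score and then reporting the RMSD of the selected pair keeps the two measures separate: the first chooses the pair, the second quantifies its divergence.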
In this review, we examine how these three forms of protein species contribute to protein functional diversity and evolutionary fitness of an organism. We also highlight how advances in protein structure determination and analysis, both experimental and computational, can contribute to our understanding of the functional diversity of protein species.

Figure 1
Different sources of protein species. a) Post-translational modifications (PTMs) are chemical groups that modify the protein by covalently binding to one or more of its amino acid residues. b) Alternative splicing (AS) is the formation of different protein isoforms from the same gene by alternative combinations of its exons during the splicing process. c) Gene duplication events result in multiple copies of a single gene, called paralogs. The different paralogs accumulate mutations over time, resulting in different versions of the same gene.

Post-translational modifications
Figure 2
Structural diversity within paralogs. All-against-all superimpositions of paralogs of protein domains (within a superfamily and species) were created for good-quality AlphaFold domains [6,9] using FoldSeek [10]. The RMSD scores for the aligned residues for the 30 superfamilies with minimum TM-scores (calculated using FoldSeek) were calculated using the structure superimposition tool SSAP [11]. These maximal RMSD scores were plotted against the respective superfamily. Representative examples of protein domains belonging to diverse superfamilies (highlighted in red) are shown as ribbon diagrams. For the superfamilies 3.40.1110.10, 2.60.40.1180, and 2.30.29.30, the representative domains in the figure belong to UniProt IDs G5EEK8 (residue nos 361-603) and A0A7I9BBC6 (residue nos 168-345), Q9UL18 (residue nos 34-164) and Q9BQ17 (residue nos 508-632), and Q2QXF6 (residue no. ) and Q2RAZ2 (residue nos 530-647), respectively.

PTMs contribute dramatically to protein species diversification, as they provide a plethora of ways of modifying a protein. Large-scale, mass spectrometry-based proteomics studies have identified tens of thousands of PTMs, a large majority of which have not been associated with functional relevance [12]. Their effect on the biophysical and, by extension, biological properties of proteins is complex. Analysis of X-ray structures in the PDB has shown that only 7% of glycosylated and 13% of phosphorylated proteins undergo changes >2 Å [13]. The effects of PTMs on backbone conformation can be either stabilizing or destabilizing, depending on the type of PTM [14]. Recent studies involving molecular dynamics (MD) simulations have explored the effects of PTMs on protein-protein interactions [15]. In general, acetylation has been shown to have a stabilizing effect on such interactions, while phosphorylation tends to have a destabilizing effect; however, these effects are not additive. Furthermore, these PTMs affect the dynamics and allosteric interactions of proteins.

Computational studies show that phosphorylation sites tend to be predominantly on solvent-exposed residues, or on buried residues that are flexible enough to be exposed after modification [16-18]. Genomic analysis of these PTM sites in humans has shown them to be negatively selected, based on the percentage of rare substitutions and the ratio of non-synonymous to synonymous mutations. In addition, these sites have a higher number of disease-associated mutations compared to other residues [19]. Analysis of the conservation of PTMs across the tree of life has shown that PTM sites have only weak evolutionary constraints [20]. However, clade-specific studies, in which human PTMs were compared to other ordered and disordered regions in eukaryotes, have shown strong conservation signals that depend on the PTM type and on whether a PTM is in a structured or disordered region [21].

Indeed, PTMs are found in many intrinsically disordered proteins (IDPs)/intrinsically disordered regions (IDRs), which are also involved in various cellular regulation pathways. Addition or removal of PTMs in these IDPs can lead to various structural changes and transitions between ordered and disordered states [22]. PTMs in IDPs can also lead to phase separation of these proteins, producing phase-separated droplets and membrane-less compartments in the cell [23].
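The ratio-based signal of negative selection mentioned above can be illustrated with a deliberately crude sketch. It assumes that counts of non-synonymous and synonymous variants are available for PTM sites and for a background set of residues; all counts and function names are hypothetical. Published analyses typically normalize by the number of available non-synonymous and synonymous sites, which this sketch does not attempt.

```python
from typing import Tuple


def nonsyn_syn_ratio(nonsyn: int, syn: int) -> float:
    """Raw ratio of non-synonymous to synonymous variant counts."""
    if syn == 0:
        raise ValueError("ratio is undefined without synonymous variants")
    return nonsyn / syn


def relative_constraint(ptm_counts: Tuple[int, int],
                        background_counts: Tuple[int, int]) -> float:
    """Ratio at PTM sites divided by the ratio at background residues.

    Values below 1 are consistent with stronger negative selection at
    PTM sites than in the background (an assumption-laden proxy only).
    """
    return (nonsyn_syn_ratio(*ptm_counts)
            / nonsyn_syn_ratio(*background_counts))


# Made-up counts: (non-synonymous, synonymous)
print(relative_constraint((30, 60), (100, 100)))
```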
The protein huntingtin, involved in Huntington's disease, has an intrinsically disordered N-terminus that contains PTM sites. MD studies of huntingtin have shown that phosphorylation leads to helix stabilization, while N-terminal acetylation leads to charge neutralization [24]. Similarly, tau protein is an IDP with multiple PTM sites. Tau PTMs cause various structural changes that affect phase separation, aggregation, microtubule assembly, and degradation [25].
With the boom in the number of near-accurate computational predictions of protein structures, tools such as Privateer [26,28] have been developed to model the PTMs on the residues that undergo these modifications. Other tools such as StructureMap have been developed to map PTMs from proteomics studies onto these computational models [12]. Additionally, there are several databases with information on PTM sites, their function, mutations, and 3D structural context [27]. A summary of these can be found in Table 1.

Alternative splicing
AS and the resulting isoforms provide a useful mechanism for protein diversification and protein species generation within an individual. A fundamental question is the prevalence of isoforms that originate from AS in nature. The evidence from transcriptomics experiments does indicate that AS generates many transcripts. In contrast, proteomic studies have only been able to confirm the presence of a small number of isoforms that are generated by such transcripts. Transcripts may not be translated into proteins, may generate only small amounts of protein, or could be only expressed in

While the extent of the contribution of AS to protein polymorphism in general is still being established, there is emerging evidence for AS playing a functional role in biology. Perhaps the most indicative evidence of AS functionalization is the tissue specificity of expression of different transcripts. A recent study showed that more than a third of splice events for which both proteomics and RNAseq evidence can be found are tissue-specific [41]. Furthermore, from an evolutionary perspective, the vast majority of such tissue-specific splice events are ancient, conserved over more than 400 million years. This indicates a correlation between functionalization and evolutionary conservation of AS. A recent study used evolutionary splicing graphs to investigate a set of 50 genes, and the findings suggest that AS may be conserved between amphibians and primates, providing additional evidence for its potential functional significance [36]. A wide range of studies have generated AS data, which can be found in a selection of curated databases. For a summary of several such resources, see Table 1.

Table 1
Bioinformatics resources providing information on post-translational modifications (PTMs), alternative splicing (AS), and paralog structures.

An important question is how AS contributes to the functional diversity of a protein at the structural level. There are many different types of AS; one well-studied type, mutually exclusive exons (MXEs), serves as a good example (Figure 3). AS events whose isoforms are confirmed by proteomics experiments tend to be enriched in MXEs, which are less likely to disrupt the protein structural core [43,45,46]. Interestingly, structural analysis studies reveal that AS involving MXEs tends to affect surface-exposed residues at functional sites and to lead to radical amino acid substitutions [47]. In the case of tandem duplicated exons, residue substitutions also tend to be at the protein surface; however, these substitutions are more conservative in terms of amino acid identities and may serve as a means of fine-tuning protein function [48]. Protein-protein networks may undergo tissue-specific rewiring as a result of AS [49,50]. It is important to note that MXEs represent only a small part of all AS.
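The logic of mutually exclusive exons described above can be made concrete with a short sketch: exactly one exon from each MXE cluster is included in a given isoform, while constitutive exons are always included. The gene layout, exon names, peptide strings, and function names below are hypothetical, and the residue comparison assumes the two alternative exons encode equal-length, already aligned peptides.

```python
from itertools import product
from typing import List, Tuple


def enumerate_isoforms(layout: List[List[str]]) -> List[Tuple[str, ...]]:
    """Each slot lists its alternative exons. A constitutive exon is a
    one-element slot; an MXE cluster is a slot with several alternatives,
    of which exactly one is chosen per isoform."""
    return list(product(*layout))


def varying_positions(exon_a: str, exon_b: str) -> List[Tuple[int, str, str]]:
    """Positions (0-based) at which two equal-length exon peptides differ."""
    if len(exon_a) != len(exon_b):
        raise ValueError("this sketch assumes equal-length, aligned exons")
    return [(i, a, b)
            for i, (a, b) in enumerate(zip(exon_a, exon_b)) if a != b]


# Hypothetical gene: exon 2 exists as two mutually exclusive versions.
layout = [["E1"], ["E2a", "E2b"], ["E3"]]
print(enumerate_isoforms(layout))

# Hypothetical peptides encoded by the two alternative exons.
print(varying_positions("MKTAE", "MRTAD"))
```

Mapping the varying positions onto a structure would then indicate whether they fall at the surface or near functional sites, which is the kind of analysis the studies cited above report.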

Figure 3
Examples of MXE events altering protein functional sites. a) MXE-varying residues are found at the allosteric site and the tetramerization interface region of the PKM1/PKM2 isoforms. b) A key catalytic residue switches between CG42249 isoforms; enzymes with different activities may be produced by such physicochemical changes. c) Two threonine residues that serve as phosphorylation sites are replaced in one integrin-B1 isoform, potentially causing a shift in downstream signaling.
Spliceosomes, which are RNA-protein complexes responsible for catalyzing the splicing process, and splicing factors, which interact with the spliceosome to regulate its activity and guide the selection of splice sites, have been found to undergo evolutionary changes that impact the diversity of splicing isoforms [51,52]. Splicing in prokaryotes occurs independently of a spliceosome, as this complex is exclusively present in eukaryotic cells. To trace the evolutionary origins of the spliceosome in the last eukaryotic common ancestor (LECA), Vosseberg et al. conducted homology searches using human spliceosomal proteins and identified 145 spliceosomal orthogroups [52]. Their analysis revealed that the prokaryote-derived core of the spliceosome was supplemented with an excess of proteins associated with ribosome-related processes, which underwent extensive duplications, leading to increased complexity in the evolving spliceosome.
A discussion of AS and its effect on biological function would be incomplete without concrete examples. There are several cases of proteins with key biological roles that undergo AS that impacts their function. One such example is that of the histone core component H2A and its two isoforms: macroH2A1.1 and macroH2A1.2 [53]. The isoform macroH2A1.1 contains features that allow it to accommodate ADP-ribose within the binding pocket, while macroH2A1.2 lacks these features. Therefore, whichever isoform is expressed may affect ADP-ribose signaling and NAD+ metabolism. The AS of H2A appears to be a recent addition in the evolution of histones and is only observed in jawed vertebrates [54].
Another important example is that of G protein-coupled receptors (GPCRs). A recent study demonstrated the functional divergence of different isoforms of a single GPCR gene, with varied signaling capabilities [55]. The study highlights how the expression of distinct isoform combinations in different tissues activates distinct signaling mechanisms. Some isoforms may alter cellular responses to drugs and provide novel targets for treatments with greater tissue selectivity.

Gene duplication
While GD provides a way for proteins to acquire new functions without sacrificing their original role in the organism, the constraints affecting the evolutionary pathways of paralogs are anything but simple. This is particularly the case when paralogs share interactions with other proteins, resulting in trigenic interactions. In such cases, the sequence and structural divergence of one paralog can affect its counterpart via evolutionary changes in their shared partner [56]. Indeed, the more entangled the two paralogs are in their interactions, the more they tend to retain functional redundancy [57]. A particularly interesting example of how protein interactions can influence the evolution of paralogs is the case of homo-oligomeric proteins. When genes of proteins that form homo-oligomeric assemblies undergo duplication and divergence, two possible assembly outcomes emerge. The first outcome is the formation of two different sets of homo-oligomers, each corresponding to a different paralog, where the paralogs do not mix. In the second outcome, the two paralogs form hetero-oligomers [58]. In eukaryotes, hetero-oligomeric complexes appear to be more common [59-61]. Nevertheless, paralogs can evolve to avoid heteromeric assembly in certain cases [62].

Because paralogs of the same transcription factor compete for the same DNA binding sites with different affinities, their divergence is a means of tweaking gene networks [68]. Interestingly, it has been shown in yeast that paralog DNA binding affinities are modified not through changes in the DNA binding domains, but rather through changes to parts of the protein sequence that affect interactions with secondary factors, which in turn affect the affinity [69].
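The two assembly outcomes for duplicated homo-oligomers described earlier in this section can be enumerated with a small sketch. It assumes an oligomer of fixed size built from two paralogs, labeled A and B here, and ignores the spatial arrangement of subunits; the function names are hypothetical.

```python
from typing import List, Tuple


def assembly_compositions(n_subunits: int) -> List[Tuple[int, int]]:
    """All (count of A, count of B) compositions of an n-subunit assembly."""
    return [(k, n_subunits - k) for k in range(n_subunits, -1, -1)]


def classify(composition: Tuple[int, int]) -> str:
    """Homo-oligomer if only one paralog is present, else hetero-oligomer."""
    return "homo-oligomer" if 0 in composition else "hetero-oligomer"


# An ancestral homodimer after duplication: AA, AB, or BB.
for comp in assembly_compositions(2):
    print(comp, classify(comp))
```

For a dimer this gives two homo-oligomeric compositions and one hetero-oligomeric composition; which of them actually form depends on how the interfaces diverge, as discussed in the text.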
We find interesting examples of paralogs of proteins involved in fundamental cellular processes. Several proteins involved in DNA repair have associated paralogs. In humans, topoisomerase II has two paralogs, TOP2A and TOP2B, whose sequences are similar (70-80% sequence homology) but which differ in function [70]. Another example is that of RAD51, a protein involved in DNA repair and homologous recombination. The RAD51 paralogs have been shown to be involved in the formation of different protein complexes with different roles [71-75]. There are also paralogs found in the transcription machinery. Examples of such genes with paralogs are GPN, a crucial biogenesis factor of RNA polymerase II, and the paralogs POLR3G and POLR3L, which are subunits of RNA polymerase III [76,77].
Interestingly, genes coding for the components of the fundamental molecular complexes charged with protein production and degradation, the ribosome and proteasome, have paralogs as well. An emerging field of research that is attracting much attention and debate is that of ribosome heterogeneity [78-80]. The paralogs of the ribosomal protein genes bL31 and bL36 in bacteria, and RPL8 in yeast, have been shown to be involved in response to changes in the environment of the organism [81-83]. In mammals, ribosomal protein paralogs such as RPL10L, RPL39L, and RPL22L have been shown to be important in fertility, cell proliferation, and development, and some of them are involved in certain types of cancer [84-88].
Finally, in addition to the standard proteasome, there are three more versions of the proteasome with high tissue expression and functional specificity: immunoproteasome, thymoproteasome, and spermatoproteasome [89,90]. These are defined by specific paralogs that are incorporated into the subunits of the proteasome and expressed in the respective tissues. An example is PSMA8, a paralog of PSMA7, that codes for the α4s subunit, a component of the spermatoproteasome [91,92]. Additionally, PA28 is a proteasome activator for which there are 3 paralogs in jawed vertebrates: PA28α, PA28β, and PA28γ. These paralogs assemble into either a hetero-heptameric ring (PA28αβ) or a homo-heptameric ring (PA28γ) [93-95].

Conclusions
It is clear, given these different forms of protein species (PTM, AS, GD), that protein diversity within an individual is prevalent and that protein species play an important functional role in biology. Furthermore, while substantial progress has been made in mapping different protein species and analyzing their function, it is likely that many more protein species contribute to biological functions but have not yet been properly assessed.
The explosion in the number of solved protein structures in recent years, partially due to advances in cryo-EM techniques, together with the remarkable computational progress in structure determination and prediction, primarily due to AlphaFold2, presents a fantastic opportunity in this context [96]. The vast amounts of structural data can be invaluable in analyzing protein species and provide a lens with which we can determine the effects of sequence variation among protein species on their function.

Declaration of competing interest
The authors declare no competing interests.

Data availability
Data will be made available on request.

References
This study uses molecular dynamics simulations and free energy calculations to understand the modulation of protein interactions due to acetylation and phosphorylation in yeast. They have also analysed conformational changes due to PTM and its conservation in paralogues.
baab012. This article reviews the major PTM databases, tools for PTM predictions, association of PTM with diseases and biological processes.
This structural analysis of mutually exclusive exons (MXEs) in the genomes of five metazoan species revealed that MXE-specific residues are highly enriched in surface-exposed residues and cluster at/near protein functional sites, demonstrating the ability of mutually exclusive exons to fine-tune the function of proteins.
48. Martinez-Gomez L, Cerdán-Vélez D, Abascal F, Tress ML: Origins and evolution of human tandem duplicated exon substitution events. Genome Biology and Evolution. 2022, https://doi.org/10.1093/gbe/evac162. This analysis of human tandem duplicated exon substitutions revealed that despite being highly enriched in surface-exposed residues, these residues have undergone more moderate changes. Three-quarters of tandem duplicated exon events are tissue-specific and are enriched in terms of functionality pertaining to brain and skeletal muscle structures.