Mapping the Constrained Coding Regions in the Human Genome to Their Corresponding Proteins

Graphical abstract


Introduction
Predicting the impact of variants on protein function has traditionally been based on combining information derived from protein sequences, inter-species conservation and knowledge of the structure and function of the protein. With the emergence of multiple human genome sequences from different populations, observed variation patterns in humans provide an orthogonal source of information for assessing the impact of variants. The comprehensive catalogues of genetic variations, compiled from many human population sequencing projects, have fuelled the development of different metrics that measure the general tolerance of genes to variation. Metrics like the probability of being loss-of-function intolerant (pLI) and missense Z-scores are extensively used to prioritise genes in genome interpretation of individuals 2 . However, it is well established that 'all parts of a protein are not equal'. The modular presence of different domains and folds can endow different regions with different functions and structural constraints [3][4] . Hence, one would expect that regions which are extremely important for the function of a protein would be depleted of protein changing variants in healthy individuals.
In 2019 Havrilla et al. defined the Constrained Coding Regions (CCRs) as regions in the human coding genome where the following proteinchanging variants were depleted: missense, stop gained, stop lost, start lost, frameshift variant, initiator codon variant, rare amino acid variant, protein altering variant, inframe insertion, inframe deletion, and splice donor variant or splice acceptor variant when affecting the protein sequence. These regions were identified in whole exome and genome sequencing data from large cohorts of healthy control populations from the Genome Aggregation Database (gnomAD2.0.1), which is currently the largest and most widely used publicly available collection of data on population variation from harmonised sequencing data 2,5 . The premise of Havrilla et al. was that the data in gnomAD2.0.1 came from individuals who were either healthy or did not have early onset developmental abnormalities ('healthy control populations') and therefore their variant loci could be considered as 'tolerated'.
In the CCRs model, each of the variant-depleted (constrained) regions is weighted based on i) its length in base pairs, and ii) the fraction of individuals (above 50% of total individuals) having at least a 10x sequencing coverage at each bp of the region. A linear regression is then calculated comparing the weights and the CpG density of the regions, as an indicator of the region mutability upon spontaneous deamination of methylated cytosines. Regions with a greater weighted distance between protein-changing variants than expected based upon their CpG density (residual from the linear regression), are predicted to be under the greatest constraint. The residuals of the regression are ranked in CCRs percentiles (CCRpct) from 0 to 100, with 0 signifying unconstrained (i.e. having 'tolerated' variants in gnomAD) and 100 being the most highly constrained regions. Put simply, the longer a constrained region and the larger its CpG content, in general, the higher its CCRpct will be. Havrilla et al. observed that only $1% of the highly constrained regions were found to be enriched with known pathogenic variants and associated with developmental disorders, and that 72% of the genes harbouring a CCR in the 99th percentile or higher were not linked yet to any disease, suggesting that CCRs could be used to reveal regions of protein coding genes that are likely to be under potentially purifying selection.
Given their relevance, it has been proposed that analysing the presence of regions like these can complement the classical procedures of phylogenetic conservation, amino acid substitution scores, and three-dimensional protein structural characterization and aid in the process of variant interpretation 1,6 . Although recent studies have used CCRpct as an extra score for assessing the pathogenicity of variants [7][8][9][10][11][12][13] , there has been no large-scale attempt to map the distribution of these constrained genomic regions to amino acids and to analyse their co-occurrence with different protein functional features.
The human proteome is a continuum where proteins can be fully ordered, intrinsically disordered (ID) or flexible/mobile, have a mixture of folded and ID regions or even exert transitions between both states upon binding with other proteins [14][15] . These ID proteins and regions, can perform important and diverse functions in the cell from displaying sites for post-translational modifications (PTMs) to assembling molecular complexes that promote the phenomena of liquid-liquid phase separation (LLPS) and formation of membraneless organelles in the cell, amongst others [16][17][18] . Disease-causing mutations can occur in both ordered and ID regions [19][20] , and recently the focus has turned towards variants predisposing to disease in LLPS regions, mostly related to autism spectrum disorders (ASD), cancer, neurodegeneration, and infectious diseases [21][22][23] . However, ID regions are usually not well conserved, lack a stable protein three-dimensional structure, sequence alignments have poor accuracy and most studies and tools focus on ordered regions, making it a challenge to interpret the molecular mechanism behind disease-related variants in these regions. Hence, observing CCRs in these regions may provide some insight into their constraints during human evolution.
Herein we map the CCRs onto their corresponding protein sequences and 3D structure, by assigning the CCRpct to each amino acid site (residue) spanned by each CCR. This process has the potential to highlight key functional amino acids in both ordered and disordered proteins, lying in regions of the protein which are strongly constrained. We explore the distribution of these regions across human proteins and compare their co-occurrence with different protein functional features annotated at the amino acid level. We then perform an enrichment test of Gene Ontology (GO) terms to explore which protein classes and cellular pathways are more frequently associated with genes harbouring regions with high CCRpct.

Mapping the CCRs to amino acids
Our aim was to explore how CCRs are distributed across the human canonical protein sequences as defined by UniProtKB/Swiss-Prot 24 . For this purpose, first, we ran the CCRs model pipeline (available in the repository accompanying the work of Havrilla et al., 2019: https://quinlan-lab.github.io/ ccr/examples/updates) to obtain the genomic coordinates of the CCRs, but using the gnomAD3.0 dataset of variants, which aggregates 76,156 whole genomes using coordinates from the human GRCh38 genome assembly. The resulting file with the genomic coordinates of the CCRs can be obtained from our GitHub repository (https:// github.com/marciaah/CCRStoAAC/blob/main/data/ rawCCRs/gnomad3_0/vep101/sort_weightedresiduals-cpg-synonymous-novariant.txt.gz). Then, we mapped the genomic coordinates of the CCRs to amino acids in UniProtKB proteins, via the Ensmebl transcripts in the GENCODE basic set (see Figure 9 A for further details), the corresponding code of the pipeline is available in our repository https:// github.com/marciaah/CCRStoAAC, and the output of the mapping to amino acids can be found here https://github.com/marciaah/CCRStoAAC-output.
The use of GRCh38 gives a more accurate crossmapping of genes, transcripts and proteins in the Ensembl 25 and UniProtKB databases, while using the Swiss-Prot canonical set of proteins ensures the availability of functional annotations for further analysis.
From a total of 18,583 human UniProtKB/Swiss-Prot canonical proteins that matched the Ensembl protein sequences, we were able to map CCRpct to at least one region in 17,366 of them ( Figure 1 (A)). 6,608 of the 17,366 proteins had partial coverage of CCRpct for their amino acid sites (residues) and 1,217 (from the expected 18,583) completely lacked CCRpct, as a consequence of low quality conflicted genomic regions (see Methods) that prevented the identification of CCRs. In total, about 9.8 million amino acid sites (Figure 1(B)) are represented with CCRpct, out of an expected 10.7 million from the total 18,583 sequences.
68.8% of the 9.8 million amino acids sites that we could map are in constrained regions, i.e. no protein changing variants are reported in gnomAD3.0. The remaining 3.06 million (31.2% of the mapped residues) contain at least one variant with a minimum allele count of one (Figure 1(B)).
We categorised the CCRpct into different groups, as shown in Figure 1(B) and (C), based on the original considerations proposed by Havrilla et al, 2019, i.e., percentiles in the top 1% [99,100] for the most highly constrained regions down to 0 for unconstrained (i.e. sites having variants in gnomAD3.0). As expected, mapping from CCRs to amino acid sites gives approximately 10% of residues in each group (Figure 1(C)). The exception is for the 0-10% group, which is underpopulated at the residue level, since these CCRs are the shortest regions, with an average length of only 1.05 amino acids without variants.
To summarise, we were able to assign CCRs to 93.5% of human UniProt/SwissProt canonical proteins, equating to 91.6% of the expected residues; about two thirds of residues are constrained (i.e. without protein changing variants in gnomAD 3.0); of these sites only 0.24% residues are assigned to the top [99,100] CCRpct bin (i.e. most highly constrained) and are exclusively from 839 regions in 751 proteins.
Comparing CCRs percentiles with other whole gene scores (pLI and missense OEUF) In clinical genomics and population genetics a number of metrics for assessing the overall intolerance to variability for a given gene or protein have become popular. The latest version of these scores is based on gnomAD2.1.1 2,26 . One, the pLI score, represents the probability of a gene being intolerant to heterozygous putative loss of function (pLoF) variants: nonsense (stopgained), frameshift, splice acceptor, and splice donor variants. A pLI ! 0.9 has been used to highlight "essential" transcripts/genes/proteins. For missense and synonymous variants, Z-scores are used, measuring how far from the mean a gene is in terms of observed/expected (o/e) missense or synonymous variants. Accompanying these metrics, the authors recommend the use of upper bound fraction of the 90% confidence intervals around the o/e ratios (OEUF) for the different types of variants. For further details, please refer to Supplementary Methods.
For our analysis, we used the recommended thresholds: pLI ! 0.9 and missense OEUF 0.35, as a simple way to define a gene/protein as highly constrained for pLoF and missense variants, respectively.
We observed 2,916 'essential' proteins with pLI ! 0.9, and 75% of these have at least one region that scores with very high CCRpct in the range [95,100]. Also, only 113 proteins presented missense OEUF 0.35, and 93% of them have CCRpct in the [95,100] group. However, when we looked at proteins with pLI < 0.9 or OEUF > 0. 35, we found that about a quarter or more of these more "variant-tolerant" proteins also include highly constrained regions, which are distributed across the proteins with all values of missense OEUF and pLI (Supplementary Figure 1). These observations highlight the importance of looking at local constraint scores, using CCRpct, in order to understand more deeply the impact of variants in protein coding genes.
The correlation between CCRs percentiles, interspecies conservation and length of the regions In a similar way, for all amino acid positions we compared the length of the CCRs (in number of amino acids) against their percentiles (Figure 2(C) and (D)). The average length of regions increases with CCRs percentile, which is expected given that CCRs are prioritised by region length. Nevertheless, there is a high variability within each CCRs percentile category. The most highly constrained regions (percentiles [99,100]) are only present in proteins of at least 100 amino acids in length (Supplementary Figure 3).
To explore the numbers of amino acid sites having different combinations of CCRpct and conservation scores, we stratified both measures into ten groups and built a 2D matrix counting the numbers of residues in each cell. The majority of both constrained and unconstrained positions have conservation scores > 0.4 and distribute evenly in a plateau up to a conservation of 0.95 ( Figure 2). Above this level, the counts increase, in particular for the two extremes of more constrained (percentiles [90,100]) and unconstrained sites (percentile 0). The 3D surface shown in Figure 2 highlights the disparity between these CCRpct and conservation scores and the high frequency (importance) of residues which are completely conserved (ScoreCons = 1) 28 . CCRpct are able to differentiate between such residues, according to observed variation and length of conserved regions, providing a valuable score for analysis.

Protein features and CCRs percentiles
Protein annotations from UniProtKB/Swiss-Prot 24 , PDBe 29 , VarSite, 30 M-CSA 31 , BioLip 32 , MobiDB 33 , ELM 34 , Ensembl 35 and ClinVar 36 data-bases were obtained and aggregated for each amino acid site in the human canonical and annotated proteins of UniProtKB. 9.8 million sites were annotated in this way, assigning CCRpct and the 30 protein annotations listed in Figure 9 (Methods).   In order to investigate in detail how different protein features are constrained across human populations, we calculated odds ratios (OR) to measure the enrichment of residues in CCRpct categorised in 7 groups for each of the 30 protein feature annotations, compared to a random distribution (see in Methods, Odds ratios tests for enrichment: I. CCRpct and presence of each one of the 30 protein features). Figure 4 presents forest plots for comparing the resulting OR (listed in Supplementary Table 1).
Additionally, we performed OR tests for assessing the overall enrichment of sites with the different protein features and their co-occurrence with inter-species conservation scores and CCRpct. We did this by defining 6 different groups, as described in Methods (Odds ratios tests for enrichment: II. CCRpct and conservation with the presence of protein features). Put simply, we divided the cells of the heatmap in Figure 2 into 4 quadrants and the histogram for unconstrained sites into 2 halves, counted the numbers of residues and calculated the ORs. Supplementary Table 2 shows the resulting OR and Table 1 summarises the features enriched in each group combining CCRpc and conservation score.
For capturing the strongest associations, we set a threshold of OR ! 1.5 for our analysis and grouped the different features for discussing their enrichments with CCRpct and with CCRpct and conservation, as is summarised below: I. Domains and compositionally-biassed protein regions. This group includes large features (e.g. structural domains) and biassed sequences (e.g. coiled-coils) ( Figure 4(A)). The most striking ORs distribution occur for domain regions (i.e. those regions of a protein classified as being in a domain, according to Pfam or CATH) compared to those lying outside such a domain. There is a clear indication that amino acid sites within domains are more constrained than those outside. The repeated domains show a similar tendency. Surprisingly, the transmembrane regions showed little if any enrichment for highly constrained regions, perhaps reflecting the lipid environment where variants between the hydrophobic amino acids are common. The residues in low complexity and coiled-coil regions are preferentially unconstrained and rarely show the highest levels of constraint.
II. Interactions and catalytic residues. Residues involved in catalytic sites, binding to metals and/or ligands, protein-protein interactions, proteinprotein cross-linking, interactions with DNA/RNA, linear motifs, and linear Interacting peptides (LIPs) are all more likely associated with medium to high percentiles of constraint (in the range [60,100]) ( Figure 4(B)). Most of these residues are also less likely to be associated with unconstrained regions. Catalytic sites, in particular, presented the highest odds of having high CCRpct and high conservation (OR = 1.9, in Table 1), and the average residue conservation is consistently high in combination with all the CCRpct, including the unconstrained sites where gnomAD tolerated variants are located (Supplementary Figure 5). Disulphide bonds are an interesting exceptionshowing no preference to lie in a highly constrained region, but also they are rarely unconstrained. In particular, catalytic sites, disulphide bonds and cross-linking covalent linkages have ORs that suggest they are of the order of 0.5-0.6 times as likely to have tolerated variants, while for sites involved in other interactions the ORs are between 0.7 and 0.91.
III. Disorder related features. The 1.8 M amino acids that were annotated in our dataset as intrinsically disordered (ID) or mobile and with CCRpct assigned (about 18.5% from the total of 9.8 M residues) showed two contrasting tendencies. Although with weaker OR, these sites were more likely to coincide with unconstrained and very lowly constrained sites (percentiles [0,30)) and also with the top most highly constrained percentiles [99,100] (Figure 4(C)). ID regions/proteins tend to evolve faster than structured proteins at the sequence level [37][38] , this is reflected in the enrichment of lower mean interspecies amino acid conservation across all levels of constraint, even for the high CCRpct residues (Table 1).
High percentiles of constraint ([95,100]) were strongly associated with residues in disorder to order (D-to-O), disorder to disorder (D-to-D) and context dependent transitions upon binding with other protein partners, with higher OR values for those undergoing D-to-O and context dependent transitions.
Residues in regions driving LLPS present the highest association we observed, with OR 9.26 times more likely to be in the most highly constrained regions (percentiles [99,100]) and with the longest average length of 75 amino acids ( Figure 3). Furthermore, 29 (53%) of the 54 Table 5).
IV. Signalling regions. Residues in propeptides, signal peptides and transit signalling regions tend to be in regions more likely to have unconstrained to medium constrained percentiles in the interval [0,60] (Figure 4(D)). The very few highly constrained sites in propeptides and signal peptides show much lower conservation and shorter region length ( Figure 3 and Table 1).
V. Post translationally modified positions (PTMs). Although with weak OR, the overall tendency is for sites that are post translationally modified to be more likely in constrained regions at lower percentiles ( Figure 4(E)). Given that these are specific sites with, in general, very short flanking motifs, it is expected that shorter regions, and therefore lower CCRpct, are associated with these positions. Also, some PTMs may not be functionally relevant and represent false positives 39 .
Glycosylated sites tend to be more associated with being unconstrained or at low constraint (percentiles [0,60]). The number of lipidated sites is very low, therefore statistically not significant for most of the CCRs percentile categories. However, they are less likely to be unconstrained (OR = 0.62, 95% CI: 0.53-0.72).
Other types of PTMs are more likely to be associated with highly constrained regions (percentiles [95,100]). There are 482 highly constrained sites, 302 (62%) correspond to Nacetylations, and 79% are N-acetyl lysine. This is not surprising given the abundance of the latter modification. When comparing conservation and constraint for these sites, the behaviour is mixed and ORs are overall weak. Phosphorylation sites tend to be less conserved in general for all constraint levels (Table 1), and a slightly higher prevalence is for sites with high constraint and low conservation. Lipidated sites have a stronger association with being more conserved and with low percentiles. Glycosylation and other PTMs are more associated with high conservation for all constraints.
VI. Other relevant sites and regions. The localisation of "other sites" and "other regions" was obtained from UniProt. This category corresponds to regions/positions of functional relevance for proteins, identified mostly from experimental evidence, that cannot be described by other feature annotations of UniProt. We also recorded coding DNA sequence (CDS) junctions, by translating the genomic coordinates of these sites obtained from Ensembl onto the corresponding amino acids in the UniProt sequences. As suspected, given their relevance, the protein sites in these three categories are more likely to be constrained at high percentiles (Figure 4(F)) and also mostly related to high percentiles and high conservation, although with a weak OR (Table 1).
VI. Residues without annotations. 1.8 M amino acids did not have any of the functional sites or regions or domains that we aggregated in the present study. These were very weakly associated with unconstrained and low constraint regions and particularly less associated with higher percentiles (Figure 4(G)). They were slightly associated with all combinations of constraints and low conservation (Table 1), and slightly more with low CCRpct and low conservation. The 96.4 K and 658.7 K residues that are in CCRpct [50,100] with low and high conservation, could be explained by functional features that still remain to be discovered and annotated for some proteins, or some domains that were difficult to delimit, creating fuzziness at their boundaries.
In summary, when bringing together the classifications regarding CCRpct and protein functional features, for the 9.8 M positions that can be assigned to a CCRs percentile, it is possible to observe how the co-occurrence with certain functional features becomes more evident at higher percentiles ( Figure 3(A) vs (B) and Figure 4). The 23.6 K most highly constrained residues in the human genome (CCRpct in [99,100], 0.24% of the total mapped residues) correspond to, on average, the longest linear stretches depleted of tolerated variability, and  Amino acid sites with clinically interpreted variants, their CCR percentiles and cooccurrence with protein features We next investigated how residues with different types of clinically interpreted variants correlate with the different percentiles of constraint Figure 5. For this purpose, we employed variants from the ClinVar database (https://www.ncbi.nlm. nih.gov/clinvar), and classified the amino acid positions in our dataset as pathogenic (including pathogenic/likely_pathogenic), benign (including benign/likely_benign) and/or VUS/conflicting (including variants of uncertain significance or with conflicting interpretations of pathogenicity), and followed a similar methodology for performing OR tests ( Supplementary Figures 6 and 7).
Pathogenic missense variants account for only 28,398 protein sites, with the majority of missense variants in ClinVar corresponding to VUS/conflicting interpretations, affecting 194,508 residues. Residue sites with pathogenic missense variants were observed as strongly associated with higher percentiles of constraint, in the intervals between [90,100]. Surprisingly, these types of variants were also associated with unconstrained regions (percentile [0,0]), although with weaker odds. Sites with benign and VUS/conflicting missense variants had higher odds of being in unconstrained regions, although a few benign variants, affecting 476 amino acid sites, were also present in the top most highly constrained regions.  When comparing the distribution of residues with the three groups of variants with CCRpct and conservation scores, we observed that they are spread across all categories of conservation and CCRpct (Supplementary  Table  4 and Supplementary Figure 4), but only the amino acid sites affected by pathogenic/likely_pathogenic were OR = 1.63 times more likely to be more conserved and with high CCRpct.
Additionally, we assessed the co-occurrence of ClinVar variants with the 30 protein feature annotations (Supplementary Figure 7). Sites with missense pathogenic variants are mostly associated with being in domains, transmembrane regions, catalytic sites, metal binding, ligand binding, protein binding, disulphide bonds, DNA/ RNA binding, linear motifs, LIPs, D-to-O, lipidation, other regions, and CDS junctions. Sites with missense benign variants were more likely to be disordered/mobile, low complexity, LLPS, signal, propeptide and transit peptides, and sites without any protein features. Sites with missense VUS/conflicting variants, were mostly associated with those regions generally more difficult to characterise: repeats, coiled-coils, LIPS, D-to-O and D-to-D transitions, phosphorylation sites, other regions of biological relevance annotated in UniProt, and sites without any feature.
Over-representation of regions with high percentiles in GO protein classes and Reactome pathways We investigated whether there was an enrichment of specific types of proteins and biological pathways in the proteins having regions with high CCRpct. For this purpose, we submitted a 'query list' of 6,402 protein identifiers of sequences with percentiles in the interval [95,100] to the PANTHER Classification System 40 to perform a Gene Ontology (GO) Over-representation Test for two categories of terms: 'protein class' and 'Reactome pathways'. We used as the 'reference list' the 17,366 genes/proteins for which we have CCRs estimates; however, for only 17,022 PANTHER was able to assign GO terms.
Genes with [95,100] CCRpct were enriched in 14 protein classes out of a total of 196 that were assigned to our lists of proteins. Genes with [95,100] CCRpct were enriched in 70 Reactome pathways, out of a total of 2.482. Figure 6  presents this information, and Supplementary Table 6 summarises the list of clinical conditions and number of LLPS driving genes/proteins associated with them.
We observed the majority of LLPS driving genes (35 out of the 54, 63%) act in key biological processes that facilitate DNA damage repair, epigenetic gene repression and RNA metabolism (transcription, splicing, polyadenylation, transport and translation, see column 'Main biological process groups' and the corresponding genes in Supplementary Table 5). The remaining 19 genes are involved in many different processes, including neuron cell growth, adhesion, axonogenesis and synaptogenesis, development, synaptic plasticity and regulation of neurotransmitter vesicles release; signal transduction pathways for cell survival, migration, proliferation, differentiation and apoptosis; protein degradation/recycling; cell cycle regulation; immune responses; nuclear transport; elasticity of organs and tissues; muscle structure and function and glomerular filtration in kidney. 34 (64%) of the 54 proteins driving LLPS had pLI>=0.9 (i.e. essential genes highly intolerant to loss of function in heterozygosity), in particular this is the case for proteins involved in RNA metabolism. 43 of the 54 LLPS proteins (79.6%) have amino acid positions which drive phase separation and are highly constrained for variation (CCRpct in [90,100]).
30 LLPS proteins (56%) had variants related to at least one disease: seventeen (31.5%) of the 54 genes were associated with severe early onset developmental disorders, including different organ malformations, oestrogen resistance with absence of sexual maturation or severe early onset immunodeficiency, with 10 of them in particular linked to neurodevelopmental disorders (Supplementary Table 5, column 'Disease groups'). 10 genes (18.5% of the 54) were associated with later onset diseases, mostly neurodegenerative but also affecting muscles and bone and triggering earlier menopause. 5 genes (9%) were associated with different cancers of pancreas, breast, uterus and prostate, lung and leukaemia. There was also a high incidence of associations with conditions not yet described (see Supplementary Table 6, 'not provided' or 'not specified').
The remaining 24 LLPS driving proteins (44% out of the total 54) have not yet been associated to any disease by protein changing variants by the time we consulted ClinVar. Fourteen of these 24 (58%) have pLIP0.9 (i.e. extremely intolerant to LoF) and are associated to RNA metabolism (10 genes), DNA damage repair (1 gene), immune response (1 gene) and signal transduction for cell proliferation and differentiation (1 gene), all of them presenting multiple regions of high constraint (CCRpct [90,100]).
In summary, the LLPS driving regions are clearly biologically important, often related to disease and very constrained for variability in the human genome.
Exploring some examples: U2AF2, the splicing factor U2AF 65 kDa subunit, and SLC12A2, the solute carrier family 12 member 2 Aggregating different protein annotations such as inter-species conservation, human variability and constraint, functional features, 3D structure and presence of clinically interpreted variants can help understand why variants have different propensities in different contexts. We have chosen two examples of proteins for illustration, the first of which illustrates a protein with highly disordered/ mobile regions which have nevertheless been constrained during human evolution, although the conservation across species is patchy. The second example is a transmembrane protein in which the functional ion channel residues are highlighted as highly constrained by the CCRs and also a disordered region of 20 amino acids which are variable across species but depleted of tolerated variants and include a cluster of pathogenic variants.
The first example is U2AF2, the 65 kDa subunit of the U2 auxiliary splicing factor U2AF (Figure 7), a protein that is highly constrained against variability. This is an essential splicing factor that recognizes the polypyrimidine-tract (Py) 3 0 splicesite signal in pre-mRNA and initiates spliceosome assembly in the nucleus 41 . U2AF2 is highly constrained for missense and LoF variability in gno-mAD2.1 (missense OEUF = 0.31, missense Zscore = 4.21, pLI = 1, LOEUF = 0.133), suggesting essentiality for humans. There is partial structural data for this protein, derived from 3 separate PDB files, each containing a different part of the protein.
The protein contains a low complexity arginineserine rich motif (RS) at positions 27-62, which has been proposed to initiate liquid-liquid phase separation (LLPS) to form nuclear speckle drops in the nucleus, bringing together pre-mRNAs and the proteins of the spliceosome 42 . Further along the sequence, the region 85-112 is a UHM ligand motif (ULM) that has been shown responsible for the interactions with U2AF1 (also known as U2AF35) the 35 kDa subunit of the splicing factor  [44][45] , and are connected by a flexible/disordered linker region (231-258) that modulates the binding specificity for the proper Py-tracts in pre-mRNA 46 . The third RRM domain, known as U2AF Homology Motif, UHM, is atypical and has lost its RNAbinding ability, but interacts with splicing factor 1 (SF1) (right dashed box with protein structures on top) [46][47] .
The protein has long, moderate to highly constrained regions (CCRs panels) that colocalize with the U2AF2-U2AF1 binding interface, the three RRMs, the flexible linker and also with regions involved in LLPS and/or linear interacting peptides (pink and violet rectangles in the "Other protein features" panel). A few variants have been reported in the literature as associated with a developmental disorder and different types of cancer (represented with lollipops above Pfam domains in Figure 7) and only a VUS is reported in ClinVar. All the pathological missense variants represent drastic physiochemical changes, and affect highly constrained and highly conserved sites.
U2AF2 illustrates how CCRpct and conservation both highlight that this protein is not only essential, but also has functional protein sites along its length. The clinical data supports this hypothesis, with variants associated with developmental disorders and cancers.
The second example is shown in Figure 8, which illustrates how CCR data, combined with protein function information, can highlight regions with potential functions that have yet to be determined. The gene is SLC12A2, which encodes for the solute carrier family 12 member 2 protein, a Na+, K+ and 2ClÀ cotransporter 1 (NKCC1), which plays a critical role in the homeostasis of K+ enriched endolymph in the membranous labyrinth of the inner ear. NKCC1 subunits are ion channels that work as homodimers, and each protein monomer comprises a transmembrane domain (TMD), cytosolic N-and C-terminal domains (NTD and CTDs, respectively) and extracellular flexible loops stabilised by disulphide bonds. The dimeric interface involves interactions between TMDs and CTDs [50][51] .
SLC12A2 is overall highly constrained for missense and LoF variability in gnomAD2.1 (missense OEUF = 0.78, missense Z-score = 2.4, pLI = 0.96, LOEUF = 0.31), suggesting essentiality for humans. The protein structures (Figure 8), reveal that the highest CCRpct in this protein (in the range [90,99)) corresponds to functionally important regions: residues in porelining helices (involved in the ion flow through the channel), the dimer interface, and also a 'not so obviously relevant' disordered and lowly conserved (scores < 0.4) region in the C-terminal domain. Intriguingly, this last region, comprising amino acids 977 to 993, is fully encoded by exon 21 of SLC12A2 and coincides with the boundaries of a CCR ranked with a moderate percentile of 94 (small dashed rectangle in Figure 8). It has previously been noted that this region, whose functionality still remains unclear, is unique to SLC12A2 and is not shared with the other proteins in the SLC12 family 52 , suggesting that it might confers a specific functional characteristic to this protein.
Seven pathogenic missense variants have been reported in ClinVar for this protein. Two of them, associated with Delpire-McNeill neurodevelopmental syndrome, are in the TMD in highly conserved and low-medium constrained residues: N376I lining the pore (conservation = 1, CCRs pct = 37.68) and A327V adjacent to a porelining helix (conservation = 0.72, CCRs pct = 74.95). The other five: E979K, E980K, D981Y, P988T and P988S cause deafness and sensorineural hearing loss, and cluster in the moderately constrained region encoded by exon 21. Furthermore, functional assays in cultured cells showed that applying the variants E979K, D981Y and P988T, or skipping exon 21, significantly decreases chloride influx mediated by the SLC12A2 protein 53 . All this evidence and the moderately high CCRpct for this disordered and lowly conserved region, highlight its putative relevance for the function of this protein.
Both the U2AF2 and SLC12A2 proteins described above also serve to exemplify different scenarios for amino acid sites where high CCRs percentiles go hand in hand with high conservation, and the converse where high CCRs percentiles go with low conservation, and vice versa.

Discussion
In the present work we extended the characterisation of constrained coding regions in the human genome, by accurately fine-mapping these regions and their level of constraint from the Human Build 38 genomic coordinates to protein sequence coordinates in 17,366 human UniProt canonical sequences, totalling about 9.8 million amino acid positions. Furthermore, aggregating protein functional annotations, available for these positions, allowed us to analyse the distribution and correlation of the different levels of constraint and inter-species conservation with different protein features.
Overall, our results agreed with the previous observations of Havrilla et al. 1 that the correlation between the CCRpct and the average nucleotide GERP++ conservation scores 27 for the regions is very low and hence the intra-species conservation in humans complements the interspecies conservation.
For the catalytic sites and interactions with different partners (small molecules, proteins, DNA/ RNA, metals), we observed the expected associations between high percentiles of constraint and high conservation scores. This is in concordance with the observations of Havrilla et al. 1 that domains enriched with the most highly constrained regions were involved in ion transport and in different DNA/RNA interactions (like zinc fingers, helicases and translation factors). Additionally, we observed that the unconstrained (i.e. with gnomAD3.0 variants) or lowly constrained (i.e. average shorter regions depleted of variants) regions were mostly associated with signalling regions (signal, propeptides and transit peptides), low complexity, glycosylation sites, and with more mixed inter-species conservation scores. Surprisingly, the transmembrane regions showed little if any enrichment for highly constrained regions, but slightly higher enrichment for medium constraint (CCRpct [60,90)), i.e. on average shorter regions. Perhaps, this reflects the lipid environment where variants between the hydrophobic amino acids are common.
Among the unexpected results, we observed that disulphide bond cysteines were more prone to lie within regions with low to medium percentiles of constraint (CCRpct in (0,90)). Disulphide bonds are covalent tertiary interactions important for stabilising protein folds and/or performing physiologically relevant redox activity and hence highly conserved in evolution 54 . We hypothesise that the association with lower-medium CCRpct (i.e. average shorter regions depleted of variants, with mean length = 20 amino acids) reflects the fact that the formation of such bonds requires only the presence of short motifs involving only the cysteines and their immediate flanking residues 55 .
Perhaps the most unexpected results we observed were related to disordered and mobile regions in proteins, showing dual enrichment for unconstrained/lowly constrained and also for highly constrained percentiles, mostly in sites with low conservation, and this might relate to the multiplicity of functions, or "flavours", of disorder that such regions can present, which depend on their length, composition and location in proteins 14,56 . Disordered proteins and regions are able to fulfil a variety of tasks: they can serve as flexible linkers between structured regions or flexible binding sites for ligands, they can undergo disorder-order transitions upon binding to other proteins through specific molecular recognition features (MoRFs) within longer disordered regions, they can also have short linear motifs that work as targets for post-translational modifications or cell signalling, or longer regions which promote molecular recognition and protein-protein interactions. The characterisation of the dynamic of IDPs/IDRs has led to the identification of their plausible role in regulating enzymatic activity 57 and has also been useful to investigate ligand selection for developing drugs 58 . This motivates the necessity of characterising disordered proteins and regions, for discovering the function and relevant mechanism where they are involved.
In the present work, in particular, residues involved in D-to-O, context dependent transitions and in driving LLPS showed association with high constraints, for conserved and also unconserved sites. In the case of residues in order-disorder transitions, our observations align with what has been proposed in terms of the binding mechanisms. D-to-O are defined by a single, welldefined, fully ordered binding configuration, mediated by a unique well-defined contact pattern that excludes ambiguities and is determined by the presence of binding motifs. Context dependent transitions involve alternative binding configurations, which change with the cellular conditions and different partners. Conversely, Dto-D transitions are defined by many different binding configurations, including alternative contact patterns, often with weak or redundant motifs 15 . For amino acids driving LLPS, our results suggest a strong association with constrained regions ranked with the highest percentiles ([95,100]) and that such regions are, on average, the longest stretches (75 amino acids in length) depleted of protein changing variants across the human coding genome.
The scientific community is beginning to untangle the complexity of interactions and regulations involved in LLPS, and evidence shows that these protein condensates do not follow classical rules of molecular recognition. LLPS regions are generally mobile/disordered and involve long sequence stretches that orchestrate multiple and multivalent interactions with proteins and RNA for the formation of membrane-less organelles in the cell. They are important for organising and regulating key cellular processes such as transcription, splicing, translation, chromosome condensation, synapsis and downstream signalling, all essential for tightly regulating the differential expression of genes, ensuring cell survival, correct differentiation into different tissues, and for the development and function of the neuronal and immune systems [59][60][61][62][63] . However, it was remarkable that amino acid positions in regions driving LLPS were not observed as significantly associated with Pathogenic protein altering variants (Supplementary Figure 7), considering that previous works have observed these genes frequently related to cancer, autism spectrum Figure 8. An example where CCRs highlight regions important for protein function: the pore lining helices and dimer interface in the transmembrane region of SLC12A2 (NKCC1) solute carrier family 12 member 2 (UniProtKB: P55011-1, Ensembl: ENST00000262461) and a peculiar disordered region of this protein encoded by its exon 21 and with a cluster of pathogenic/likely_pathogenic variants related to deafness and hearing loss 53 . From middle to bottom the plots in the horizontal panels represent, by amino acid position, gnomAD3.0 allele counts (AC), CCRpct, species conservation scores, sites with ClinVar missense variants, Pfam domains with disorder/mobility, low-complexity and transmembrane regions and post-translationally modified sites (PTMs). Pfam domains correspond to: 'AA permease N'=Amino acid permease N-terminal, 'AA permease'= Amino acid permease, 'SLC12'=Solute carrier family 12. Over these domains, the dashed rectangles call out the transmembrane region of the protein characterised in the 6PZT PDB structure as a homodimer. This structure is coloured by CCRpct and by conservation and displayed on the top panels. The location in the structure of the pathogenic/likely_pathogenic ClinVar variants A327V and N376I is depicted with triangles and squares, respectively. The small dashed rectangle fully encloses exon 21 (amino acids 977-993). Residue conservation scores, as calculated by ScoreCons, were obtained from VarSite, domains from Pfam, low-complexity, mobility and disorder from MobiDB, transmembrane regions and PTMs from UniProtKB. disorders, neurodegeneration, and infectious diseases [21][22][23] . Here, apart from these associations, we noticed that 31.5% of these proteins were involved in diseases with profound impact in the normal human postnatal and early development, with a high prevalence of neurodevelopmental disorders. In addition, 64% of the LLPS driving genes were highly constrained for loss-of-function in heterozygosity and also, 44% of the 54 genes were not associated with any disease with 14 of them, mostly associated to RNA metabolism being highly constrained for loss-of-function variation. We hypothesise that variants in such genes could possibly cause severe phenotypes and affect embryonic viability.
Most of the significantly enriched GO terms for proteins with highly constrained regions (CCRpct in [95,100]) were related to RNA-processing, DNA binding, protein-protein interactions, and enzymatic activities. This is in concordance with our observations that amino acid sites functionally annotated as binding DNA/RNA and/or proteins, in catalytic sites and in LIPs, and/or driving LLPS, are among the ones with the greatest odds of being highly constrained, i.e. in the, on average, longest regions in the human genome intolerant to variability.
Our results also complement and extend what the authors of the CCRs model 1 have derived before by analysing the co-occurrence with Pfam domains and observing that the highly constrained regions are involved in ion transport and in different DNA/ RNA interactions (like zinc fingers, helicases and translation factors), but also that about 30% of these highly constrained regions did not correspond to any protein Pfam 3 domain.
Undoubtedly, the sequencing of genetically more diverse human populations will refine some CCRs further, but the data presented here has significant clinical utility. The key challenge for clinical genomics is interpreting the pathogenicity of rare variants. Identifying whether a rare variant lies within a defined constrained region of the protein facilitates consequence interpretation especially for novel variants absent from existing genomic databases.
We emphasise that combining interspecies and intraspecies (human population) conservation can help to highlight regions of individual genes that have appeared more recently in evolution or confer some degree of uniqueness/specificity to an individual paralogue. This data has the potential to facilitate the discovery of new associations between genes/variants with previously unknown phenotypes. CCRs also highlight many highly constrained regions currently not linked to any Mendelian disease. This may indicate mutations in these regions are lethal to humans or are sufficiently rare that they have not yet been identified. 1 The current tools that attempt to predict the clinical relevance of a specific sequence variant have been developed mostly based on the characteristics of folded protein regions 64 making it difficult to understand the effect of variants affecting intrinsically disordered/mobile and liquid-liquid phase separating regions. Furthermore, many protein functional sites still remain to be characterised and currently lack sufficient functional annotations, in particular the difficult cases, where flexible linkers/disordered regions are poorly characterised and/or can be poorly conserved across species while being constrained, at different levels, in human populations. Our mapping of CCRs to amino acids helps to define these regions in proteins more accurately and could contribute to the further annotation of these challenging regions.
Our approach, however, is limited to the analysis of single features, we are aware of the possibility that multiple features can co-occur for the same protein site or that other confounding biological factors can be present. Furthermore, because the CCRs were defined from genomic sequences, i.e. in a uni-dimensional or linear space, and the weighting of the regions takes into consideration their length, the model does not consider the possibility of having short lowly constrained regions coming together in the protein threedimensional space to define a larger structural cluster constrained for variability. Looking forward, we believe that all these current limitations will open new avenues for further research and refinement.

Generation of the CCRs based on gnomAD3
To obtain the CCRs we ran the pipeline developed by 1 (https://quinlan-lab.github.io/ccr/examples/updates) but employing the dataset of gnomAD 26 version 3.0 and the corresponding files for the coordinates of the human genome in version GRCh38 65 (https://www.ncbi.nlm.nih.gov/grc). We used the Variant Effect Predictor (VEP) 66 of Ensembl 25 version 101. We also followed the recommendations of the authors to only consider genetic variants from autosomes and chromosome X, and avoid those in conflicting genomic regions -i.e. where there are segmental duplications and/or high identity with other genomic regions (>=90% identity) or with low sequencing coverage. In the same line of recommendations, we ran the weighting of the regions for autosomes and X chromosome separately, but merged both output files into one for performing the mapping of the coordinates of the regions to protein amino acids.

Mapping the CCRs to protein amino acids
We developed an in-house pipeline in R that uses the 'ensembldb' Bioconductor R package 67 to map the genomic coordinates of CCRs boundaries, and all the coding bases in between, to the Ensembl v101 transcripts which are part of the GENCODE 68 basic set version 35. This was to ensure we were including complete and well annotated relevant transcripts. For those amino acid sites where the corresponding codon had constrained and unconstrained bases, we assigned such amino acid sites as unconstrained. We then obtained the sequence identifiers that crosslink Ensembl transcripts and proteins and UniProtKB proteins 24 by querying the APIs (application programming interfaces) of both databases. Finally, the CCRpct were accurately transferred to the amino acids in UniProtKB sequences by downloading the corresponding protein sequences from Ensembl and UniProtKB and performing Blastp local alignments 69 requesting 100% sequence identity (perfect match). The workflow is summarised in Figure 9(A), explained more in detail in Supplementary Methods and the corresponding scripts of the pipeline are available in this repository https://github.com/marciaah/ CCRStoAAC.

Aggregation of protein features annotations and clinically interpreted variants
We developed our own pipeline in R for fetching different protein features and functional annotations from multiple resources. For this purpose, we captured annotations and based our analysis only for the 9.8 million protein sites in UniProtKB/SwissProt canonical sequences because such sequences are the main references for annotations in the databases we employed. An overview of the databases and features that we included are presented in Figure 9(B), and obtained as described more in detail in Supplementary Methods.

Gene Ontology enrichment tests
The GO statistical over-representation tests were performed using the PANTHER classification system 40 (https://www.pantherdb.org/tools/index.jsp, PANTHER version 17.0 release 22-02-2022, with Reactome version 65), submitting the list of genes of interest (e.g. those presenting regions with percentiles in [95,100]) and using as "reference list' only those genes for which we were able to map CCRpct.

Odds ratios tests for enrichment
We performed four different Odds ratio (OR) test analyses to measure the enrichment of amino acid sites presenting different combinations of CCRpct, conservation, protein features and ClinVar variants: I. CCRpct and presence of each one of the 30 protein features: we binary assigned whether or not an amino acid site had any of the protein features (see Figure 9(B) for full list) and a CCRpct in any of the 7 bins: unconstrained = [0]; low-medium constraint= (0,30), [30,60) and [60,90); moderately constrained = [90,95), highly constrained = [95,99) and most highly constrained = [99,100].
II. CCRpct and conservation with the presence of protein features: we binary classified the amino acid sites as having or not any of the 30 protein features and any of the 6 combinations: a) CCRs unconstrained (0 pct) and conservation score 0.5, b) CCRs unconstrained (0 pct) and conservation score > 0.5, c) CCRs in (0,50] pct and conservation score 0.5, d) CCRs in (0,50] pct and conservation score > 0.5, e) CCRs in (50,100] pct and conservation score 0.5, f) CCRs in (50,100] pct and conservation score > 0.5. III. CCRpct and presence of ClinVar variants: amino acid sites were binary assigned whether or not they had "pathogenic/likely_pathogenic", "benign/ likely_benign" or "VUS/conflicting interpretations of pathogenicity" variants and a CCRpct in any of the 7 bins mentioned in (I).
IV. CCRpct and conservation with the presence of ClinVar variants: we classified residues according to whether they had or not any of the 3 groups of variants as described in (III) and any of the 6 combinations of CCRpct and conservation as described in (II).
For the four enrichment analyses, contingency tables were constructed counting amino acid sites with the different classifications (See Supplementary Methods: OR tests for enrichment for further details) and the OR were calculated using two-tailed and one-tailed Fisher's exact tests 70 for obtaining the corresponding P-values and 95% confidence intervals (CI 95%). It is worth clarifying that when counting residues we did not request exclusivity in the intersections, i.e. a residue with a given CCRpct can intersect with being in DOMAIN, DOSORDER_MOBILE and DNA-RNA_BIND and hence will contribute to the cells in the three corresponding contingency tables.

DATA AVAILABILITY
The code and data used for the present analysis is provided in GitHub repositories, and the authors welcome requests for additional information