Missense Variants Reveal Functional Insights Into the Human ARID Family of Gene Regulators

Graphical abstract


Introduction
Advances in high-throughput sequencing have led to a sweeping expansion in genetic variation data of human protein-coding genes. With nearly 15 million curated exome variants made available by the international Genome Aggregation Database (gnomAD), 1 statistical analyses have identified genes that are intolerant to loss-offunction and are likely associated with disease. 1,2 This has aided large-scale assessments, for example, of genetic causality in autism spectrum disorder 3 and inherited cardiomyopathies. 4 The increase in statistical power afforded by the size of these datasets has allowed genes to be ranked for their importance to human health based on their intolerance to variation.
Beyond the effects of loss-of-function variants on a gene, constraint analyses can be further narrowed to focus on missense variants at the level of a protein domain. We define a domain as an independent folding unit or a conserved sequence block that is likely to approximate an independent folding unit. Depletion of missense variants in whole domains, or specific segments within domains, has been found to correlate with evolutionary conservation. 5 Rare mutations occurring in such regions are likely to be pathogenic. 6,7 In agreement, saturation mutagenesis studies have shown that mutation-intolerant regions map to conserved protein domains, particularly at residues that are involved in DNA, protein, or ligand binding and are associated with pathogenic variants. 8,9 These findings indicate that, like whole genes, functionally important regions in proteins are subject to negative selection. Patterns in population-wide missense variation could therefore be harnessed to gain insights into protein function. 10 While calculating variant distributions along protein sequences can help to identify essential domains, it fails to consider the arrangement of variants in 3D space. 6 This has been addressed by manual mapping of variants 11 or computational mapping of variant depletion scores directly onto protein structures. For example, Hicks et al., developed a '3D Tolerance Score' which compares an observed and expected number of missense variants in 5 A-radius spheres around individual atoms in a 3D structure. 12 In an alternative approach, Tang et al. introduced PSCAN, which scores the spatial dispersion of missense variants onto structures, 13 based on a previous finding that neutral variants tend to be dispersed, while pathogenic variants cluster. 14 In a further approach, the MISCAST suite provides an approach to connect probabilities of loss of function with primary sequence level features. 15 Collectively, the above studies show that surfaces of proteins where important functional sites are located are depleted of missense variants. Statistical approaches enable sorting and/or predicting functional sites using proteome-wide approaches but may not necessarily enable nonspecialists to inspect an individual protein of their choice. We developed two programs to allow easy visual inspection of variants on primary sequences (1D) and tertiary structures (3D) from the gnomAD database ( Figure 1(A)).
First, we generated a program to calculate the average density of missense variants in a protein domain (Vd) to the average density of missense variants in the whole protein (Vp). We show that plotting variants in 1D together with the ratio of Vd/Vp helps identify domains that are missense depleted. The standalone values of Vp for the ARID family members exhibit good correlation with the missense Z score, typically used in gnomAD. 2 Second, we developed a simple, convenient program called 1D-to-3D to map the same variant data onto any 3D structure, representing variants as spheres of increasing size and color intensity with increasing allele frequency (Figure 1(A)). This allows users to find surfaces of the protein that are depleted for missense variants and to compare this with complementary information such as surface conservation.
Using 1D-to-3D, we performed a comprehensive analysis of missense variation in the "AT-rich interactive domain" (ARID) family of gene regulators (Figure 1(B)). This family was selected due to the clinical significance of its members in multiple cancer types [16][17][18] and rare developmental disorders such as Coffin-Siris syndrome. 19 In humans, the ARID family comprises 15 proteins that can further be categorized into 7 subfamilies, all of which have an ARID DNA binding domain. 20 Despite their common domain, the family members regulate distinct sets of genes via diverse molecular mechanisms (Figure 1(B)). Members of the ARID1 sub-family and ARID2 are core subunits of large nucleosome remodeling complexes while JARID2 is an accessory subunit of a transcription repressor complex. 21 In contrast, the ARID3 proteins are transcription factors, 22 while ARID4s and ARID5s are adapter proteins that recruit other transcriptional regulators. [23][24][25] ARID5A is thought to be an RNAbinding protein. 26 Finally, the four JARID1 proteins are enzymes that mediate transcriptional changes by removal of histone H3K4 di-/tri-methylation marks. 17 Here we provide a comprehensive analysis of the ARID family as a whole, including domain architecture mapping, phylogenetic analysis and searching for known pathogenic variants. We complemented these analyses with our 1D-to-3D approach to identify surfaces of proteins that are depleted (or not) of missense variants, to provide a deeper annotation of functional sites within these proteins.

Structures and models
A complete list of analyzed structures is in Supplementary  Table 1. For domains with no available structure, we used models from the AlphaFold protein structure database. 37 Only regions with predicted local-distance difference test (pLDDT) scores >70 were considered. The pLDDT score is a confidence measure that reflects the validity of local inter-atomic distances in a predicted structure. A cut-off of >70 is considered a "generally correct backbone prediction". 37 All of the structures were visualized in PyMOL (Schrödinger Inc.). Surface electrostatics scores for the ARID4A and ARID4B hybrid Tudor domains were calculated using the Adaptive Poisson-Boltzmann Solver (RRID: SCR_008387) 38 in PyMol.

Isoform expression analysis
Isoform-specific expression data for ARID5B was obtained from ISOexpresso. 41 The ratio of Isoform 1 (uc001jlt.2) to Isoform 2 (uc001jlu.2) expression levels was compared across 735 samples from 22 human tissue types of healthy individuals.

Constraint metrics
Established constraint metrics including pLi, LOEUF, missense Z, and RVIS scores for each ARID family member were obtained from the official gnomAD and Genic Intolerance web browsers. 1,42 P-values corresponding to missense Z-scores were calculated using the Excel NORM.S.DIST function (the output for positive Z-scores was subtracted from 1). A complete list of collected metrics is available in Supplementary Table 1. Variant data processing 7,652 non-synonymous variants associated with UniProt canonical sequences of the 15 ARID family proteins were extracted from the gnomAD v2.1.1 dataset (GRCh37/hg19). 1 The dataset is publicly available and contains variants from 125,748 quality-controlled exomes of unrelated, adult individuals not affected by severe pediatric disease. 1 To ensure our analysis was restricted to neutral missense variants, only variants with the Variant Effect Predictor annotation 'missense' were considered, and variants with the ClinVar annotation 'pathogenic', 'likely pathogenic', 'conflicting interpretations of pathogenicity', and 'uncertain significance' were filtered out. To perform this filtering, we developed a Python program called 1D-to-3D.py, which processes variants from csv files downloaded directly from gnomAD 1 (further details in "3D Visualization"). After filtering, 7,540 variants were used for further analyses. All raw and processed gnomAD data can be accessed in supporting information Supplementary Table 2 and Supplementary Table 3 respectively.

Pathogenic variants
A family-wide search for pathogenic missense variants was performed using ClinVar (RRID:SCR_006169) and DECIPHER (RRID:SCR_006552), two publicly-accessible databases of clinical variants and their phenotypes. 43,44 Using ClinVar, we identified variants with the search criteria 'missense' and 'pathogenic' or 'likely pathogenic.' In DECIPHER, we searched for 'research variants' from the Deciphering Developmental Disorders Study, which collected variants from $14,000 UK children with undiagnosed developmental disorders. 43 All variant accession codes are available in Supplementary Table 1.

1D plots and the Vd/Vp ratio
Filtered missense variants were mapped onto protein sequences of ARID family members using Plot Protein. 45 The Plot Protein R script was modified to allow for color manipulation and domain diagram alterations of the output graphs in Inkscape. Vd/Vp ratios of functional domains were calculated using the following formula:

3D visualization
To visualize missense variation in 3D, the 1D-to-3D program uses filtered data from gnomAD 1 and generates a PyMOL script that maps the variants onto protein structures. The variants appear as spheres at the Ca of the associated residue and increase in size and shade of blue with increasing allele frequency. In the case of multiallelic sites, the program applies the addition rule for disjoint events, i.e. if multiple variants occur at the same residue position, their allele frequencies are summed. The allele frequency values for each position are compressed using a base 10 log scale and the positions are sorted into 6 bins (allele frequency <10 À6 -10 À5 , 10 À5 -10 À4 . . .10 À1 -10 0 ). Variants in each bin are visualized as spheres of different size and shade of blue. Further details regarding user inputs and numerical handling can be found in the user instructions and program script, accessible at GitLab (https://git.ecdf.ed.ac.uk/cooklab/deak). We used the 1D-to-3D program to annotate 11 solved and 6 modelled structures of the ARID family members with missense variants. A list of these structures, PyMOL selection names, and start/end residues can be found in Supplementary Table 1

Genetic constraint in the ARID family
As a preliminary assessment of variation in the ARID family, we collated existing constraint metrics for each member from gnomAD. These included "loss-of-function observed/expected upper bound fraction" (LOEUF) 1 and missense Z scores ( Figure 2(A)) as well as pLi and RVIS scores (Supplementary Table 1). 1,42 The lower the LOEUF, the fewer the variants observed than expected, indicating negative selection against loss-of-function variation. All ARID family genes except ARID3B/3C and JARID1B/1D can be classified as loss-offunction-intolerant (Figure 2(A)). This indicates that they are subject to strong purifying selection and are statistically likely to have disease associations, a higher number of protein-protein interaction partners and broad tissue expression. 1 Intolerance to variants in the ARID family is also supported by pLI and RVIS scores (Supplementary Table 1).
The missense Z score represents the deviation of the observed from the expected number of missense variants (single amino acid substitutions), for a given gene, where positive scores indicate missense depletion and negative scores indicate missense enrichment. 2 ARID1A and JARID1C have missense Z P-values of <0.001 and nearly all members have positive scores, indicating that specific residue positions are also under selective constraint (Figure 2(A)).
We calculated Vp values that denote the average density of missense variants for each protein in our dataset. As expected, lower Vp values correlate with higher missense Z scores (Figure 2(B)). However, JARID1C and JARID1D do not fit the observed correlation between missense Z and V p values ( Figure 2(B)). Rather than missense depletion, this is likely to arise from low data availability because JARID1C and JARID1D are encoded on the X and Y chromosomes respectively. 1,46 This analysis suggests that Vp is a good proxy for missense Z and that values outside of the range of 0.29-0.42 may indicate when insufficient data are available for analysis. We excluded JARID1C and JARID1D from subsequent analyses.
These constraint metrics demonstrate that individual ARID family members are under significant selective pressure, yet they are less informative on missense depletion at the level of domains or smaller functional regions (Figure 2 (A)). To bridge this gap, we developed the '1D-to-3D' approach ( Figure 1(A)). In '1D', primary protein sequences are annotated with neutral or pathogenic missense variants and Vd/Vp ratios are calculated to compare the linear distribution of missense variants in functional domains. In '3D', the missense variants are mapped onto solved or modelled protein structures. This allows integration of the allele frequencies of variants at each residue position, with visual discrimination of allele frequencies using increasing sphere size and shade of blue (Figure 1(A)). In line with largerscale analyses, 5,12 we hypothesized that the annotated structures would reveal functionally important 3D sites depleted of neutral variants and enriched in pathogenic variants.
Validating the 1D-to-3D approach on known ARID complex structures To verify that the 1D-to-3D approach can be used to identify sites depleted of missense variants for this family, we analyzed the structurally wellcharacterized ARID1 and JARID1 subfamilies. The ARID1 subfamily comprises ARID1A and ARID1B, two vertebrate paralogs with identical domain architecture (Figure 2 chromatin remodeling complex. 47 BAF complexes modulate transcription through ATPase-dependent nucleosome sliding/ejection or the recruitment of other regulators. 34 They comprise three modules: a catalytic ATPase (BRG1/BRM), Actin-Related Protein module, and a base module that scaffolds the ATPase, nucleosome, and other regulators (Figure 3(A)). 34 ARID1A or ARID1B are incorpo-rated into the base module through a C-terminal Core Binding Region made up of conserved core binding region A and core binding region B segments connected by intrinsically disordered loops. 48 Mutations in the BAF complex promote tumorigenesis in multiple cancer types reviewed in. 16 A recent analysis of missense cancer mutations in the BAF complex revealed that several mutations cluster at a junction between the ARID1A/B core binding region and the helicase-SANT-associated helix of the ATPase (Figure 3 (A)). 49 We find that this junction is depleted of neutral missense variants (Figure 3(B)), and this correlates well with evolutionary surface conservation of the core binding region in vertebrates (Figure 3(C)). We also found that pathogenic variants associated with Coffin-Siris Syndrome and non-syndromic intellectual disability variants map in proximity to the junction (Figure 3(D)). ARID2 plays a functionally analogous role to the ARID1 family in the related Polybromo-associated BAF (PBAF) nucleosome remodeling complex. 48 Even though ARID2 is distantly related to the ARID1 family, we found a similar depletion of missense mutations around the putative helix binding site in a model of the ARID2 core binding region ( Figure S1). The distribution of missense variants therefore serves as a valuable, additional layer of information for investigating key protein-protein interaction interfaces in multi-subunit complexes. Next, we tested whether our approach could identify the catalytic site in members of the JARID1 subfamily. JARID1A/1B are Fe(II)-and 2oxoglutarate (2-OG)-dependent dioxygenases that catalyse the demethylation of di-or tri-methylated histone H3K4 via their JmjC domain (Figure 3(E)). We report that the catalytic site in the JmjC domain is visibly missense depleted (Figure 3(F)), correlating with evolutionary conservation (Figure 3(G)). The variants exhibit clear spatial surface segregation, with a lower abundance of variants on the face containing the catalytic site and higher abundance on the far surface of the enzyme (Movie S1). A similar depletion around the active site is observed for JARID1B ( Figure S2), but is less pronounced, consistent with the higher LOEUF and lower missense Z metrics reported for JARID1B (Figure 2(A)).
Apart from the catalytic site, depletion of missense variants in JARID1A is also observed in a groove between the ARID and C5HC2 domains, which was hypothesized to accommodate doublestranded DNA (Figure 3(F); Movie S1). 35 However, it should be noted that there has been no experimental evidence to confirm this hypothesis. The second site of depletion could represent a different kind of functional surface, such as a protein-protein interaction or other ligand interaction site. Overall, our findings validate the 1D-to-3D approach and demonstrate that missense variants complement the use of conservation data to identify surfaces involved in macromolecular interactions and enzyme active sites.

Comparative analysis of domains using the Vd/ Vp ratio
To further investigate the utility of missense variants at the domain level, we compared JARID1A and JARID1B. In addition to the JmjN-JmjC, ARID, and C5HC2 zinc finger domains, they also comprise three PHD fingers (Figure 4  (A)-(B)). Mapping missense variants on JARID1A and JARID1B sequences and calculating the Vd/ Vp ratios for each domain revealed differences in missense variant depletion in each PHD finger (Figure 4(B)). These differences may indicate paralog sub-functionalization at the level of individual domains. In JARID1A, PHD3 but not PHD1 is missense depleted, while in JARID1B, PHD1 but not PHD3 is missense depleted. Neither protein shows missense depletion in PHD2 (Figure 4(B)). In JARID1A and JARID1B, PHD1 preferentially binds to unmethylated histone H3K4, leading to increased affinity of the JmjC domain for its methylated H3K4 substrates. [50][51][52] In contrast, PHD3 is thought to be a reader of tri-methylated H3K4 marks. 53,54 These two domains tolerate sequence variation, suggesting that their ability to bind histone tails is not crucial for function. In particular, residues D292/Y294/L308 in JARID1A and W1502/W512 in JARID1B, which are required for histone peptide binding, 54,55 are affected by neutral missense variants (Figure 4(C)-(D)). Conversely, residues responsible for histone peptide binding in PHD3 of JARID1A (Figure 4(C)) 53,55 and PHD1 of JARID1B (Figure 4(D)) 52,56 are under selective constraint, indicating that they are important in targeting or enhancing the activity of the JARID1 enzymes at their appropriate genomic locations. Consistent with their functional importance, PHD3 in JARID1A was shown to be critical in driving the oncogenic effects of a JARID1A-NUP98 fusion protein in acute myeloid leukemia. 53 Furthermore, mutations in PHD1, but not PHD3 of JARID1B decreased the regulatory effects on cell migration in a model of triple negative breast cancer. 54 In summary, functional differences in the PHD fingers correlate with different Vd/Vp ratio patterns in JARID1A and JARID1B, suggesting that the differences in missense variant depletion we observe represent domain-level functional differences between the two paralogs.

Limitations of the Vd/Vp ratio
Despite its utility for comparing domains, it should be noted that Vd/Vp ratios are dependent on the user's definition of domain boundaries. Where domain boundaries are not clearly defined, it is possible that Vd/Vp ratios might be over-or underestimated. Furthermore, small sites that are missense depleted may also be overlooked. For example, the Vd/Vp ratios calculated for the chromobarrel domain of the ARID4 subfamily proteins are relatively high, yet mapping missense variants in 3D reveals depletion of variants in a conserved Tyr-Tyr-Trp-Tyr aromatic cage ( Figure 5).
The ARID4 subfamily comprises ARID4A and ARID4B (Figure 2(A)), two paralogous, multidomain adapter proteins that recruit transcriptional regulators such as the retinoblastoma protein, androgen receptor, and the mSin3A histone deacetylase complex to gene promoters. 25,57 Given the presence of an aromatic cage, the chromobarrel domain was hypothesized to bind histone methyl marks. 24 However, independent nuclear magnetic resonance and isothermal titration calorimetry experiments with the ARID4A chromobarrel domain and methylated histone peptides produced conflicting results. 24,58 Mapping of missense variants shows that residues that form the aromatic cage are depleted of missense variants in both the solved structure of the ARID4A chromobarrel domain ( Figure 5(A)) and a model of the ARID4B chromobarrel domain ( Figure 5(B)). This suggests that methyl-lysine recognition is intact. This observation highlights the importance of mapping missense variants onto 3D structures, especially in the case of small domains, where the Vd/Vp ratio may not provide sufficient resolution to detect functionally important sites.

Comparison of missense depletion between paralogs indicates sites of subfunctionalization
ARID4A and ARID4B have diverged only in vertebrate lineages 31 and share identical domain architecture (Figure 2(A)). They participate in the same molecular pathways, including the recruitment of the mSin3A repressive complex to gene promoters 59 and co-activation of the androgen receptor in regulation of male fertility. 57 However, ARID4A knockout mice are viable whereas ARID4B knockouts show early embryonic lethality. 60 Moreover, ARID4B is necessary for spermatogenesis while ARID4A is not. 57 These findings indicate that of the two paralogs, ARID4B is likely a more critical determinant of cell fate decisions and cell cycle progression.
Since all domains of ARID4A and ARID4B are structurally related, we investigated if missense depletion could reveal functional differences between the two paralogs. We found domain-wide missense depletion in 1D to be more pronounced in ARID4B ( Figure S3) and observed a difference in the 3D distribution of missense variants on their hybrid Tudor domains (HTDs). The HTDs comprise two sub-domains HTD-1 and HTD-2 ( Figure 6(A)). Unlike structurally analogous Tudor domains, ARID4 HTDs lack an aromatic cage associated with histone binding in HTD-2 and have instead been shown, experimentally, to bind DNA through HTD-1 61,62 ( Figure 6(A)). A conserved, structurally important glycine in the ARID4B Tudor domain is associated with a developmental disorder variant (Figure 6(A)).
We find that HTD-1 of ARID4B is missense depleted, while HTD-1 of ARID4A is not ( Figure 6 (B)). The depleted site corresponds to previously identified DNA-binding residues (Figure 6(B)) and a positively-charged DNA-binding surface of the domain (Figure 6(C)). 62 We also note that an RGR motif, recently found to enhance the DNA-binding affinity of ARID4B, 62 tolerates missense variation ( Figure 6(A)). Overall, our findings indicate that the ARID4B Tudor domain likely contributes to the functional differences in DNA binding between ARID4A and ARID4B.
We next investigated if missense variants could also reveal paralog sub-functionalization in the ARID3 subfamily. The ARID3 proteins are transcription factors comprising the ARID domain and a C-terminal oligomerization domain called REKLES (Figure 2(A)). We note that the Vd/Vp ratios of all three REKLES domains are >1.00 ( Figure S4). While this suggests that the oligomerization is less likely to be required for ARID3 activity, these domains are short motifs with no available structural data, so may not be suitable for this analysis.
The ARID domain binds to AT-rich promoter sequences and is essential for ARID3 protein function. 22 In humans, the ARID3 subfamily has three members, where the ARID3A ARID domain shares 70 % sequence identity with the ARID domain of ARID3B and 87% identity with the ARID domain of ARID3C. Given this high degree of sequence identity, we hypothesized that differences in missense variation likely reflect paralog subfunctionalization.
The ARID domain is built from six core helices (H1-H6) and two larger loops L1 (between helices H1-H2) and L2 (between helices H4-H5) (Figure 7 (A)). A solution structure of the Drosophila Dead ringer in complex with DNA 63 and nuclear magnetic resonance titration experiments of the ARID1A, 64 ARID5B, 65 and JARID1A 66 ARID domains suggest a common mode of DNA binding. This includes non-sequence specific contacts with the phosphate backbone by L1 and sequence-specific contacts with the major groove by a non-canonical helixturn-helix motif formed by H4-L2-H5 [63][64][65][66] (Figure 7 (B)).
The Vd/Vp ratios of ARID domains of the three human ARID3 paralogs are lowest in ARID3A and highest in ARID3C (Figure 7(C)-(E)). In ARID3A, the proposed DNA binding residues in L1 and H4-L2-L5, inferred from sequence conservation with Dead ringer, are clear of missense variants. 67 The domain also harbors a developmental disorder variant, 43   (C)). ARID3B has a higher Vd/Vp ratio and some missense variants map to residues typically involved in DNA binding in the ARID family. This suggests that DNA binding activity is less likely to be of functional importance in this paralog (Figure 7 (D)). Similarly, in ARID3C, several residues typically involved in DNA-binding are found to have reported variants (Figure 7(E)). Missense variation can therefore also be leveraged to filter structurally similar domains in paralogs for functional importance.

The ARID5B BAH domain shows likely loss and gain of function
Finally, we used 1D-to-3D to give insights on novel structure-guided functional predictions in ARID5B. ARID5B is a highly constrained gene (Figure 2(A)) and a key regulator of liver metabolism, chondrogenesis, and adipogenesis. 23,68,69 ARID5B is known to target the H3K9me2 demethylase PHF2 to gene promoters via its ARID domain. 23,70 ARID5B has two isoforms, 1 and 2 ( Figure 8(A)). In Xenopus, isoform expression is spatially segregated during embryonic development, where isoform 1 shows higher abundance than isoform 2. 71 Isoform 1 also shows higher expression in healthy adult human tissues. 41 Isoform1 has an additional N-terminal region that is highly conserved in a subset of vertebrate species ( Figure S5) and predicted to form a bromo-adjacent homology (BAH) domain 72 ; this likely defines the functional differences between the isoforms. We compared an AlphaFold model of the ARID5B BAH domain to experimentally determined BAH domain structures and used 1D-to-3D to map missense variants onto the domain.
We identified bovine DNMT1, mouse BAHCC1, and mouse ORC1 BAH domains as the closest structural homologs to ARID5B ( Figure S6). All three homologs read histone methyl marks (Figure 8(B)). [73][74][75] The lower lobe of each BAH fold contains a conserved aromatic cage and acidic residues that bind to methylated lysine through cationpi and electrostatic/hydrogen bonding interactions, respectively (Figure 8(C)). The lower lobe of the ARID5B BAH domain does not have an aromatic cage: one aromatic residue, F77, is present but two positions normally occupied by aromatic residues are replaced with small hydrophobic residues (C53 and L75). Two acidic residues, D81 and E100, are present but both E100 and the aliphatic L75 tolerate missense variation (Figure 8(C)). Collectively, these amino acids changes, compared with classic BAH domains, and their tolerance of variation suggest that ARID5B BAH domain is unlikely to bind methyl-lysine marks at this site.
The AlphaFold prediction for the ARID5B BAH domain differs from those in DNMT1, BAHCC1 and ORC1 in that an additional, conserved segment C-terminal to the BAH domain forms part of the fold of this domain (Figure 8(D)). When mapped to this extended AlphaFold model, we note that the missense variants are depleted on the highly conserved, positively-charged C- terminal helix, rather than the classic peptide binding interface of BAH domains. This suggests that the C-terminal domain extension could serve as a protein-protein or protein-nucleic acid interaction module within a chromatin context (Figure 8(E)). These predictions call for experimental evidence. However, our analysis demonstrates the power of missense variation in screening for functional features together with structural data.

Discussion
We provide a convenient set of tools for mapping missense variants onto primary and tertiary structures of proteins. Our initial analysis of variant depletion in proteins showed that Vp is a good proxy for other scores such as missense Z scores, which reveal resistance to variation ( Figure 2). This also allowed us to identify paralogs that were not suitable for further analysis based on limited available information, such as for JARID1C and JARID1D, which are encoded on sex chromosomes. Vp, and the related Vd/Vp ratio, have the advantage that they are easy to understand in relation to the available data.
Our approach further allowed us to visually locate regions of proteins that are depleted of population variants, indicative of negative selection pressure.
Using this approach, we demonstrated that mapping missense variants onto 3D structures in the context of a large family of proteins reveals functional insights. Our method focused on essential human proteins because our data comprised human variants. However, we showed that it is complementary to phylogenetic conservation analysis ( Figure 3) and it is likely that the conserved functional surfaces we characterized are relevant to homologues in other organisms as well. In cases where recent mammalian paralogs have a single homologue in distantly related organisms, our approach also revealed insights into paralog sub-functionalization of protein domains (Figures 6 and 7).
Using Vd/Vp ratios for individual domains may have a particular utility for researchers working on multidomain proteins where the goal is to identify which domains contribute essential functions. Ranking by Vd/Vp ratio could help prioritize which domains to delete in functional assays. This approach could also be useful for researchers seeking to provide minimal functional constructs of a protein for gene therapy approaches, where limiting the length of the protein, and therefore its coding sequence, can be critical for packaging into a virus. 76,77 However, we also note that Vd/Vp ratios do not always provide sufficient information to determine whether small domains are under selective pressure. In the example of the ARID4B chromobarrel domain (Figure 5), mapping variants onto the 3D structure revealed a likely functional site that could not have been supported by the Vd/ Vp ratio alone. Therefore, mapping variants onto structural models is a useful complementary application of the tool.
Some additional limitations are noted. First, the sample size of the gnomAD dataset does not achieve mutational saturation. 1 This sets limits on interpreting constraint in smaller domains/linear motifs and prevented us from analyzing proteins encoded on the sex chromosomes. Second, while our approach allows allele frequency to be visualized, it does not normalize for codon mutability, where nucleotide sequence composition skews missense enrichment in protein sequences. For example, methylated CpG dinucleotides are known to be hypermutable, resulting in an overrepresentation of variants in CpG rich and/or heavily methylated codons. 78 Finally, some of our analyses were based on calculated models rather than experimentally-determined structures. As missense mapping depends on positional information, validation of these models is essential to confirm any interpretations.

Conclusions
Our findings build on previous studies showing that depletion of missense variants can serve to identify functionally important protein sites. 5,12 We demonstrate how 1D and 3D mapping approaches complement existing findings, provide context to understand the impact of pathogenic variants, functionally differentiate structurally similar domains in paralogs, and support formulation of novel mechanistic hypotheses. Although we focused on proteins with catalytic, DNA binding and epigenetic roles, our approach is applicable to a broad range of protein functions.