Abstract
The depletion of disruptive variation caused by purifying natural selection (constraint) has been widely used to investigate protein-coding genes underlying human disorders (refs. 1–4), but attempts to assess constraint for non-protein-coding regions have proved more difficult. Here we aggregate, process and release a dataset of 76,156 human genomes from the Genome Aggregation Database (gnomAD), the largest public open-access human genome allele frequency reference dataset, and use it to build a genomic constraint map for the whole genome (genomic non-coding constraint of haploinsufficient variation (Gnocchi)). We present a refined mutational model that incorporates local sequence context and regional genomic features to detect depletions of variation. As expected, the average constraint for protein-coding sequences is stronger than that for non-coding regions. Within the non-coding genome, constrained regions are enriched for known regulatory elements and variants that are implicated in complex human diseases and traits, facilitating the triangulation of biological annotation, disease association and natural selection to non-coding DNA analysis. More constrained regulatory elements tend to regulate more constrained protein-coding genes, which in turn suggests that non-coding constraint can aid the identification of constrained genes that are as yet unrecognized by current gene constraint metrics. We demonstrate that this genome-wide constraint map improves the identification and interpretation of functional human genetic variation.
Data availability
The aggregated allele frequency dataset is available in a browser at https://gnomad.broadinstitute.org, with bulk downloads for VCF files and Hail tables, as well as all constraint statistics described in this manuscript. Additionally, we provide a subset of the dataset that includes individual-level data for the HGDP (ref. 85) and 1000 Genomes (ref. 86) projects; the generation and use of this dataset are described in a companion manuscript (ref. 75). There are no restrictions on the aggregate data released. External datasets used in this study are available in the following public resources: ENCODE cCREs, https://screen-v2.wenglab.org/; super enhancers, http://www.licpathway.net/sedb/download.php; FANTOM5 enhancers, https://fantom.gsc.riken.jp/5/datafiles/reprocessed/hg38_latest/extra/enhancer/; miRNA, https://genome.ucsc.edu/cgi-bin/hgTables (All GENCODE V32 track); FANTOM5 lncRNA, https://fantom.gsc.riken.jp/cat/v1/#/genes; GWAS Catalog, https://genome.ucsc.edu/cgi-bin/hgTables (GWAS Catalog track); GWAS fine-mapping, https://www.finucanelab.org/data; CNV morbidity map of DD, https://genome.ucsc.edu/cgi-bin/hgTables (Development Delay track); ClinVar, https://genome.ucsc.edu/cgi-bin/hgTables (ClinVar Variants track); TOPMed, https://bravo.sph.umich.edu/freeze8/hg38/downloads; ClinGen, https://genome.ucsc.edu/cgi-bin/hgTables (ClinGen track); MGI, https://www.informatics.jax.org/; OMIM, https://www.omim.org/; Roadmap Epigenomics Enhancer-Gene Linking, https://ernstlab.biolchem.ucla.edu/roadmaplinking/; GTEx, https://gtexportal.org/home/datasets.
Code availability
All code to perform quality control of the resource is publicly available at https://github.com/broadinstitute/gnomad_qc, and many of the functions are documented in a Python package (gnomad) at https://broadinstitute.github.io/gnomad_methods/index.html. The code to compute the constraint statistics is available at https://github.com/atgu/gnomad_nc_constraint.
Change history
15 January 2024
A Correction to this paper has been published: https://doi.org/10.1038/s41586-024-07050-7
References
Short, P. J. et al. De novo mutations in regulatory elements in neurodevelopmental disorders. Nature 555, 611–616 (2018).
Satterstrom, F. K. et al. Large-scale exome sequencing study implicates both developmental and functional changes in the neurobiology of autism. Cell 180, 568–584.e23 (2020).
Singh, T. et al. The contribution of rare variants to risk of schizophrenia in individuals with and without intellectual disability. Nat. Genet. 49, 1167–1173 (2017).
Ganna, A. et al. Quantifying the impact of rare and ultra-rare coding variation across the phenotypic spectrum. Am. J. Hum. Genet. 102, 1204–1211 (2018).
Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
Petrovski, S., Wang, Q., Heinzen, E. L., Allen, A. S. & Goldstein, D. B. Genic intolerance to functional variation and the interpretation of personal genomes. PLoS Genet. 9, e1003709 (2013).
Samocha, K. E. et al. A framework for the interpretation of de novo mutation in human disease. Nat. Genet. 46, 944–950 (2014).
Hindorff, L. A. et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc. Natl Acad. Sci. USA 106, 9362–9367 (2009).
Lanyi, J. K. Photochromism of halorhodopsin. cis/trans isomerization of the retinal around the 13–14 double bond. J. Biol. Chem. 261, 14025–14030 (1986).
Mathelier, A., Shi, W. & Wasserman, W. W. Identification of altered cis-regulatory elements in human disease. Trends Genet. 31, 67–76 (2015).
Spielmann, M. & Mundlos, S. Looking beyond the genes: the role of non-coding variants in human disease. Hum. Mol. Genet. 25, R157–R165 (2016).
Zhang, F. & Lupski, J. R. Non-coding genetic variants in human disease. Hum. Mol. Genet. 24, R102–R110 (2015).
Seplyarskiy, V. B. & Sunyaev, S. The origin of human mutation in light of genomic data. Nat. Rev. Genet. 22, 672–686 (2021).
Seplyarskiy, V. B. et al. Population sequencing data reveal a compendium of mutational processes in the human germ line. Science 373, 1030–1035 (2021).
Gussow, A. B. et al. Orion: Detecting regions of the human non-coding genome that are intolerant to variation using population genetics. PLoS ONE 12, e0181604 (2017).
di Iulio, J. et al. The human noncoding genome defined by genetic diversity. Nat. Genet. 50, 333–337 (2018).
Halldorsson, B. V. et al. The sequences of 150,119 genomes in the UK Biobank. Nature 607, 732–740 (2022).
Ritchie, G. et al. Functional annotation of noncoding sequence variants. Nat. Methods 11, 294–296 (2014).
Vitsios, D., Dhindsa, R. S., Middleton, L., Gussow, A. B. & Petrovski, S. Prioritizing non-coding regions based on human genomic constraint and sequence context with deep learning. Nat. Commun. 12, 1504 (2021).
Siepel, A. et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15, 1034–1050 (2005).
Pollard, K. S., Hubisz, M. J., Rosenbloom, K. R. & Siepel, A. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res. 20, 110–121 (2010).
Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).
Halldorsson, B. V. et al. Characterizing mutagenic effects of recombination through a sequence-level genetic map. Science 363, eaau1043 (2019).
An, J. Y. et al. Genome-wide de novo risk score implicates promoter variation in autism spectrum disorder. Science 362, eaat6576 (2018).
Collins, R. L. et al. A structural variation reference for medical and population genetics. Nature 581, 444–451 (2020).
The ENCODE Project Consortium. Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature 583, 699–710 (2020).
Andersson, R. et al. An atlas of active enhancers across human cell types and tissues. Nature 507, 455–461 (2014).
Jiang, Y. et al. SEdb: a comprehensive human super-enhancer database. Nucleic Acids Res. 47, D235–D243 (2019).
Pott, S. & Lieb, J. D. What are super-enhancers? Nat. Genet. 47, 8–12 (2015).
Bartel, D. P. Metazoan microRNAs. Cell 173, 20–51 (2018).
Welter, D. et al. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res. 42, D1001–D1006 (2014).
Kanai, M. et al. Insights from complex trait fine-mapping across diverse populations. Preprint at medRxiv https://doi.org/10.1101/2021.09.03.21262975 (2021).
Jung, R. G. et al. Association between plasminogen activator inhibitor-1 and cardiovascular events: a systematic review and meta-analysis. Thromb. J. 16, 12 (2018).
Song, C., Burgess, S., Eicher, J. D., O’Donnell, C. J. & Johnson, A. D. Causal effect of plasminogen activator inhibitor type 1 on coronary heart disease. J. Am. Heart Assoc. 6, e004918 (2017).
Schaefer, A. S. et al. Genetic evidence for PLASMINOGEN as a shared genetic risk factor of coronary artery disease and periodontitis. Circ. Cardiovasc. Genet. 8, 159–167 (2015).
Li, Y. Y. Plasminogen activator inhibitor-1 4G/5G gene polymorphism and coronary artery disease in the Chinese Han population: a meta-analysis. PLoS ONE 7, e33511 (2012).
Drinane, M. C., Sherman, J. A., Hall, A. E., Simons, M. & Mulligan-Kehoe, M. J. Plasminogen and plasmin activity in patients with coronary artery disease. J. Thromb. Haemost. 4, 1288–1295 (2006).
Lowe, G. D. et al. Tissue plasminogen activator antigen and coronary heart disease. Prospective study and meta-analysis. Eur. Heart J. 25, 252–259 (2004).
Wang, Q. S. et al. Leveraging supervised learning for functionally informed fine-mapping of cis-eQTLs identifies an additional 20,913 putative causal eQTLs. Nat. Commun. 12, 3394 (2021).
Landrum, M. J. et al. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res. 46, D1062–D1067 (2018).
Stenson, P. D. et al. Human Gene Mutation Database (HGMD): 2003 update. Hum. Mutat. 21, 577–581 (2003).
Davydov, E. V. et al. Identifying a high fraction of the human genome to be under selective constraint using GERP++. PLoS Comput. Biol. 6, e1001025 (2010).
Greenway, S. C. et al. De novo copy number variants identify new genes and loci in isolated sporadic tetralogy of Fallot. Nat. Genet. 41, 931–935 (2009).
Mefford, H. C. et al. Recurrent reciprocal genomic rearrangements of 17q12 are associated with renal disease, diabetes, and epilepsy. Am. J. Hum. Genet. 81, 1057–1069 (2007).
Sebat, J. et al. Strong association of de novo copy number mutations with autism. Science 316, 445–449 (2007).
Stefansson, H. et al. Large recurrent microdeletions associated with schizophrenia. Nature 455, 232–236 (2008).
Walsh, T. et al. Rare structural variants disrupt multiple genes in neurodevelopmental pathways in schizophrenia. Science 320, 539–543 (2008).
Wright, C. F. et al. Genetic diagnosis of developmental disorders in the DDD study: a scalable analysis of genome-wide research data. Lancet 385, 1305–1314 (2015).
Spielmann, M., Lupianez, D. G. & Mundlos, S. Structural variation in the 3D genome. Nat. Rev. Genet. 19, 453–467 (2018).
Spielmann, M. & Mundlos, S. Structural variations, the regulatory landscape of the genome and their alteration in human disease. Bioessays 35, 533–543 (2013).
Coe, B. P. et al. Refining analyses of copy number variation identifies specific genes associated with developmental delay. Nat. Genet. 46, 1063–1071 (2014).
Cooper, G. M. et al. A copy number variation morbidity map of developmental delay. Nat. Genet. 43, 838–846 (2011).
Klopocki, E. et al. Copy-number variations involving the IHH locus are associated with syndactyly and craniosynostosis. Am. J. Hum. Genet. 88, 70–75 (2011).
Barroso, E. et al. Identification of the fourth duplication of upstream IHH regulatory elements, in a family with craniosynostosis Philadelphia type, helps to define the phenotypic characterization of these regulatory elements. Am. J. Med. Genet. A 167A, 902–906 (2015).
Will, A. J. et al. Composition and dosage of a multipartite enhancer cluster control developmental expression of Ihh (Indian hedgehog). Nat. Genet. 49, 1539–1545 (2017).
Roadmap Epigenomics Consortium. Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330 (2015).
Rehm, H. L. et al. ClinGen—the Clinical Genome Resource. N. Engl. J. Med. 372, 2235–2242 (2015).
Blake, J. A. et al. The Mouse Genome Database (MGD): premier model organism resource for mammalian genomics and genetics. Nucleic Acids Res. 39, D842–D848 (2011).
McKusick, V. A. Mendelian Inheritance in Man and its online version, OMIM. Am. J. Hum. Genet. 80, 588–604 (2007).
The GTEx Consortium. The Genotype–Tissue Expression (GTEx) project. Nat. Genet. 45, 580–585 (2013).
Xu, H. et al. Elevated ASCL2 expression in breast cancer is associated with the poor prognosis of patients. Am. J. Cancer Res. 7, 955–961 (2017).
Jubb, A. M. et al. Achaete-scute like 2 (ascl2) is a target of Wnt signalling and is upregulated in intestinal neoplasia. Oncogene 25, 3445–3457 (2006).
Tian, Y. et al. MicroRNA-200 (miR-200) cluster regulation by achaete scute-like 2 (Ascl2): impact on the epithelial-mesenchymal transition in colon cancer cells. J. Biol. Chem. 289, 36101–36115 (2014).
Guo, M. H. et al. Inferring compound heterozygosity from large-scale exome sequencing data. Nat. Genet. https://doi.org/10.1038/s41588-023-01608-3 (2023).
Zhu, P. et al. Single-cell DNA methylome sequencing of human preimplantation embryos. Nat. Genet. 50, 12–19 (2018).
Tang, W. W. et al. A unique gene regulatory network resets the human germline epigenome for development. Cell 161, 1453–1467 (2015).
Ross, D. A., Lim, J., Lin, R.-S. & Yang, M.-H. Incremental learning for robust visual tracking. Int. J. Comput. Vision 77, 125–141 (2008).
Karolchik, D. et al. The UCSC Table Browser data retrieval tool. Nucleic Acids Res. 32, D493–D496 (2004).
Li, H. Toward better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics 30, 2843–2851 (2014).
Davis, C. A. et al. The Encyclopedia of DNA elements (ENCODE): data portal update. Nucleic Acids Res. 46, D794–D801 (2018).
Goldmann, J. M. et al. Germline de novo mutation clusters arise during oocyte aging in genomic regions with high double-strand-break incidence. Nat. Genet. 50, 487–492 (2018).
Zhao, H. et al. CrossMap: a versatile tool for coordinate conversion between genome assemblies. Bioinformatics 30, 1006–1007 (2014).
Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
Kent, W. J., Zweig, A. S., Barber, G., Hinrichs, A. S. & Karolchik, D. BigWig and BigBed: enabling browsing of large distributed datasets. Bioinformatics 26, 2204–2207 (2010).
Koenig, Z. et al. A harmonized public resource of deeply sequenced diverse human genomes. Preprint at bioRxiv https://doi.org/10.1101/2023.01.23.525248 (2023).
Harrow, J. et al. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res. 22, 1760–1774 (2012).
Hon, C. C. et al. An atlas of human long non-coding RNAs with accurate 5′ ends. Nature 543, 199–204 (2017).
Wang, G., Sarkar, A., Carbonetto, P. & Stephens, M. A simple new approach to variable selection in regression, with application to genetic fine-mapping. J. R. Stat. Soc. B 82, 1273–1300 (2020).
Taliun, D. et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature 590, 290–299 (2021).
Budescu, D. V. Dominance analysis: a new approach to the problem of relative importance of predictors in multiple regression. Psych. Bull. 114, 542 (1993).
Azen, R. & Budescu, D. V. The dominance analysis approach for comparing predictors in multiple regression. Psych. Methods 8, 129 (2003).
Ernst, J. et al. Mapping and analysis of chromatin state dynamics in nine human cell types. Nature 473, 43–49 (2011).
Liu, Y., Sarkar, A., Kheradpour, P., Ernst, J. & Kellis, M. Evidence of reduced recombination rate in human regulatory domains. Genome Biol. 18, 193 (2017).
Robin, X. et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics 12, 1–8 (2011).
Bergstrom, A. et al. Insights into human genetic variation and population history from 929 diverse genomes. Science 367, eaay5012 (2020).
The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68–74 (2015).
Acknowledgements
The authors thank the individuals whose data is in gnomAD for their contributions to research. Development of the Genome Aggregation Database was supported by NIDDK U54DK105566 and the NHGRI of the National Institutes of Health under award number U24HG011450. Additional funding for Genome Aggregation Database Consortium members is listed in the Supplementary Information. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Author information
Contributions
S.C., L.C.F., J.K.G., Q.W., A.O.-L., H.L.R., M.J.D., B.M.N., D.G.M. and K.J.K. contributed to the writing of the manuscript and generation of figures. S.C., R.L.C., M.K. and K.J.K. contributed to the analysis of data. L.C.F., Q.W., C.V., L.D.G., T.P., C.S., M.E.T., B.M.N. and K.J.K. developed tools and methods. L.C.F., J.K.G., J.A., M.W.W., Y.T., W.P., M.T.Y., Z.K., Y.F., E.B., S.D., S.G., N.G., S.F., C.T., S.N., L.B., D.R., V.R.-R., M.C., C.L., N.P., G.W., T.J., R.M., K.T., A.R.M., G.T. and K.J.K. contributed to the production and quality control of the gnomAD dataset. N.A.W., R.G., M.S. and K.J.K. contributed to the gnomAD browser. All authors listed under The Genome Aggregation Database Consortium contributed to the generation of the primary data incorporated into the gnomAD resource. All authors reviewed the manuscript.
Ethics declarations
Competing interests
K.J.K. is a consultant for Vor Biopharma, Tome Biosciences, and is on the Scientific Advisory Board of Nurture Genomics. D.G.M. is a paid advisor to GSK, Insitro, Variant Bio and Overtone Therapeutics, and has previously received research support from AbbVie, Astellas, Biogen, BioMarin, Eisai, Merck, Pfizer and Sanofi-Genzyme.
Peer review
Peer review information
Nature thanks Slavé Petrovski, Ryan Dhindsa and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer review reports are available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data figures and tables
Extended Data Fig. 1 Construction of mutational model and Gnocchi score.
a,b, Estimation of trinucleotide context-specific mutation rates. The proportion of possible variants observed for each substitution and context in 76,156 gnomAD genomes (y-axis) is exponentially correlated with the absolute mutation rate estimated from 1,000 downsampled genomes (x-axis). Fit lines were modeled separately for human autosomes (a) and chromosome X (b). c, Estimation of the effects of regional genomic features on mutation rates. The effects of 13 genomic features at four scales (window sizes 1kb-1Mb; x-axis) on the mutation rate of 32 trinucleotide contexts (y-axis) are shown, colored by the coefficient from regressing de novo mutations (DNMs) on each specific feature and window size. Red/Blue color indicates a positive/negative effect of increasing the feature value on mutation rates; grey crosses indicate significant features at the smallest possible window size after Bonferroni correction for 13×4 = 52 tests. Abbreviations: LCR=low-complexity region, SINE/LINE=short/long interspersed nuclear element, Dist=Distance, Recomb=Recombination, Methyl=Methylation. d,e, The distribution of Gnocchi score as a function of expected and observed variation. Each point represents the Gnocchi score of a 1kb window on the genome (N = 1,984,900 on autosomes (d) and N = 57,729 on chromosome X (e)), which quantifies the deviation of observed variation from expectation. A positive Gnocchi score (red) indicates depletion of variation (observed<expected) and the higher the score the stronger the depletion; the red dashed line indicates the 99th percentile of Gnocchi scores across the autosomes (d) or chromosome X (e).
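The caption describes the score only as a signed deviation of observed from expected variation in a window. A minimal sketch of one such statistic follows, assuming a chi-squared-style Z score; the exact formula is not given in this caption, and the function name `signed_deviation` is hypothetical.

```python
import math


def signed_deviation(observed: int, expected: float) -> float:
    """Signed chi-squared-style deviation for one genomic window.

    Assumption (not stated in the caption): the magnitude is
    sqrt((observed - expected)^2 / expected), and the sign is positive
    when fewer variants are observed than expected (depletion).
    """
    if expected <= 0:
        raise ValueError("expected must be positive")
    chi_sq = (observed - expected) ** 2 / expected
    magnitude = math.sqrt(chi_sq)
    return magnitude if observed < expected else -magnitude


# Hypothetical window: 64 variants observed where the model expects 100.
print(signed_deviation(64, 100.0))
```

Under this assumed form, a window with fewer observed than expected variants receives a positive score, matching the sign convention stated in the caption.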
Extended Data Fig. 2 Comparison of Gnocchi score between coding and non-coding regions.
a, The proportion of highly constrained windows (Gnocchi ≥ 4) as a function of the percentage of coding sequences in a window (left to right: N = 1,906/49,525, 3,244/55,676, 2,240/18,461, 1,506/7,094, 969/3,519, 569/1,946, 364/1,223, 283/910, 243/724, 10,392/30,138). The intervals (x-axis) are left exclusive and right inclusive. “Exonic only” refers to the 1kb windows created from directly concatenating coding exons into 1kb sequences. Error bars indicate standard errors of the proportions. b, The exonic-only regions (N = 27,875; purple) present a significantly higher Gnocchi score than regions that are exclusively non-coding (N = 1,843,559; blue). Dashed lines indicate the medians. c, The proportion of highly constrained windows (Gnocchi ≥ 4) as a function of the proportion of exonic windows being added to the dataset of non-coding windows. d, Gnocchi score percentiles of non-coding versus exonic windows. About 0.05% (100-99.95%) and 3.12% (100-96.88%) of the non-coding windows exhibit similar constraint to the 90th and 50th percentiles of exonic regions, respectively.
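The proportions and error bars in panel a can be recomputed from the counts listed in the caption. The sketch below assumes the usual binomial standard error, sqrt(p(1-p)/n), which the caption does not spell out; the names `COUNTS` and `proportions_with_se` are hypothetical.

```python
import math

# (highly constrained windows, total windows) per bin, left to right,
# copied from the counts listed in the caption.
COUNTS = [
    (1906, 49525), (3244, 55676), (2240, 18461), (1506, 7094), (969, 3519),
    (569, 1946), (364, 1223), (283, 910), (243, 724), (10392, 30138),
]


def proportions_with_se(counts):
    """Return (proportion, standard error) per bin.

    Assumption: the standard error is the binomial one, sqrt(p * (1 - p) / n).
    """
    rows = []
    for constrained, total in counts:
        p = constrained / total
        se = math.sqrt(p * (1 - p) / total)
        rows.append((p, se))
    return rows


for p, se in proportions_with_se(COUNTS):
    print(f"{p:.4f} +/- {se:.4f}")
```

With these counts, the proportion rises from roughly 3.8% in the first bin to roughly 34.5% in the last bin.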
Extended Data Fig. 3 Estimation of constraint for aggregated regulatory annotations.
a,b, Gnocchi scores of aggregated promoter (dark purple), enhancer (light purple), microRNA (miRNA; dark blue), and long non-coding RNA (lncRNA; light blue) annotations are compared against those of exonic (a) and non-coding (b) regions at a 1kb scale. The Gnocchi score percentiles of each annotation (y-axis) are benchmarked by the score deciles of exonic or non-coding regions (10–100 percentiles; x-axis); the grey dashed vertical line indicates the median (50th percentile).
Extended Data Fig. 4 Applications of Gnocchi for characterizing non-coding regions in addition to existing functional annotations.
a, Use of Gnocchi for prioritizing non-coding regions with or without a regulatory annotation (N = 464,504 and 1,379,055, respectively). Constrained non-coding regions are enriched for GWAS variants, independent of the candidate cis-regulatory element (cCRE) annotation from ENCODE. Error bars indicate 95% confidence intervals of the odds ratios. b, Use of Gnocchi in statistical fine-mapping. The increase in posterior inclusion probability (PIP) when incorporating Gnocchi score as a functional prior into previous fine-mapping results (that used a uniform prior; denoted as PIP_Gnocchi and PIP_unif, respectively) is shown for 164 new likely causal associations with a PIP_Gnocchi ≥ 0.8 as a function of PIP_Gnocchi.
Extended Data Fig. 5 Comparison of Gnocchi and other predictive metrics in prioritizing non-coding variants.
a, Receiver operating characteristic (ROC) curves of Gnocchi and seven other metrics in classifying putative functional non-coding variants (the “positive” variant set; left to right: 9,229 GWAS Catalog variants, 2,191 GWAS fine-mapping variants, a subset of 140 high-confidence fine-mapped variants, and 1,026 likely pathogenic variants) against a “negative” variant set randomly drawn from the population with a similar allele frequency (AF). AF>5% and allele count (AC) = 1 were applied, respectively, to match the three GWAS variant sets and the likely pathogenic variant set, based on their AF distributions in TOPMed (shown in b). b, Areas under the ROC curve (AUCs) of the classification with a varying AF threshold for the negative variant set. As most GWAS variants are common and most likely pathogenic variants are very rare (not seen in the population), AF>5% and AC = 1 were applied, respectively, in the primary analyses shown in a.
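For readers less familiar with the AUC summaries used here, the following is a minimal sketch of how an AUC can be computed for a single score from a positive and a negative variant set. It uses the pairwise (Mann-Whitney) formulation, which is a standard equivalent of the area under the ROC curve and not necessarily how the authors computed it; the function name `auc` and the example scores are hypothetical.

```python
def auc(positive_scores, negative_scores):
    """Probability that a random positive scores above a random negative.

    Ties count as half a win. This equals the area under the ROC curve.
    The pairwise loop is quadratic, so it suits small illustrative inputs only.
    """
    if not positive_scores or not negative_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical constraint scores for a few positive and negative variants.
positives = [4.2, 3.1, 5.0, 0.7]
negatives = [0.3, 1.9, -0.8, 3.5]
print(auc(positives, negatives))
```

An AUC of 0.5 corresponds to no separation between the two sets, and 1.0 to perfect separation.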
Extended Data Fig. 6 Comparison of constraint scores built from different mutational models and genomic windows.
Gnocchi (presented in this study) outperforms scores rebuilt from mutational models that consider only local sequence context, either trinucleotide (trimer-only) or heptanucleotide (heptamer-only), without adjusting the mutation rate for regional genomic features. Its performance is also robust to the artificial breaks between genomic windows, as shown by scores computed on 1kb windows sliding by 100bp.
Extended Data Fig. 7 Pairwise correlations between different constraint/conservation metrics.
The Spearman’s rank correlation between each pair of the eight metrics was computed based on the mean value of each score on 1kb windows across the genome.
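As an illustration of the comparison described in this caption, the sketch below computes Spearman's rank correlation between two per-window score vectors as the Pearson correlation of their average ranks. The helper names and the example values are hypothetical, and the authors' own tooling is not specified here.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


# Hypothetical mean scores of two metrics over the same five 1kb windows.
metric_a = [0.2, 1.5, 3.9, 4.4, -0.6]
metric_b = [0.1, 0.9, 0.8, 2.3, -1.2]
print(spearman(metric_a, metric_b))
```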
Extended Data Fig. 8 Power of constraint detection.
a,b, The sample size required for well-powered non-coding constraint detection. The percentage of non-coding regions powered to detect constraint (Gnocchi ≥ 4) at a 1kb (a) and 100bp (b) scale under varying levels of selection (depletion of variation) is shown as a function of log-scaled sample size. Lighter color indicates milder depletion of variation (weaker selection), which requires a larger sample size to detect constraint; the grey dashed vertical line indicates the current sample size of 76,156 genomes. Dotted curves (left to right) benchmark the 95th, 90th, and 50th percentile of depletion of variation observed in coding exons of similar size. The number of samples required to obtain an 80% detection power is labeled at corresponding benchmarks. c, AUCs of Gnocchi scores computed on different window sizes in identifying putative functional non-coding variants. The 1kb window size used in this study offers high performance while maintaining reasonable resolution. d, AUCs of Gnocchi scores computed from different subsets of gnomAD in identifying putative functional non-coding variants. At an equal sample size, the downsampled dataset with diverse ancestries presents higher performance than the Non-Finnish European (NFE)-only dataset.
Supplementary information
Supplementary Information
This file provides detailed information about the aggregation, processing, and release of 76,156 human genomes from the Genome Aggregation Database (gnomAD), including Supplementary Figs. 1–8, Supplementary Tables 1–3, and descriptions of supplementary datasets.
Supplementary Datasets
This zipped file contains supplementary dataset items 1–6: see Supplementary Information for supplementary dataset guide.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Chen, S., Francioli, L.C., Goodrich, J.K. et al. A genomic mutational constraint map using variation in 76,156 human genomes. Nature 625, 92–100 (2024). https://doi.org/10.1038/s41586-023-06045-0
DOI: https://doi.org/10.1038/s41586-023-06045-0