Abstract
Mosaic variants (MVs) reflect mutagenic processes during embryonic development and environmental exposure, accumulate with aging and underlie diseases such as cancer and autism. The detection of noncancer MVs has been computationally challenging due to the sparse representation of nonclonally expanded MVs. Here we present DeepMosaic, combining an image-based visualization module for single nucleotide MVs and a convolutional neural network-based classification module for control-independent MV detection. DeepMosaic was trained on 180,000 simulated or experimentally assessed MVs, and was benchmarked on 619,740 simulated MVs and 530 independent biologically tested MVs from 16 genomes and 181 exomes. DeepMosaic achieved higher accuracy compared with existing methods on biological data, with a sensitivity of 0.78, specificity of 0.83 and positive predictive value of 0.96 on noncancer whole-genome sequencing data, as well as doubling the validation rate over previous best-practice methods on noncancer whole-exome sequencing data (0.43 versus 0.18). DeepMosaic represents an accurate MV classifier for noncancer samples that can be implemented as an alternative or complement to existing methods.
Data availability
WGS data used to generate the training set are available at the SRA (accession nos. SRP028833 and SRP100797, BioData1). The gold-standard WGS data and validated capstone project data are available at the National Institute of Mental Health Data Archive (NIMH Data Archive IDs 792 and 919: https://nda.nih.gov/study.html?id=792, BioData2, and https://nda.nih.gov/study.html?id=919, BioData3) and the Brain Somatic Mosaicism Consortium Data Portal; the independent benchmark brain genotyping data are also part of SRA accession no. PRJNA736951 (BioData3). Simulated data generated from NA24385 (HG002) are available at https://humanpangenome.org/hg002/. The independent sperm and blood deep WGS data are available at the SRA (accession nos. PRJNA588332 and PRJNA660493, BioData4). Independent WES data from brain, blood and saliva samples are available in the NIMH Data Archive under study number 1484 (https://nda.nih.gov/study.html?id=1484, BioData5). TCGA-MC3 data are available on the GDC portal (https://portal.gdc.cancer.gov/, sample IDs provided with variants in Supplementary Table 3). Annotations were downloaded from the UCSC Genome Browser (https://genome.ucsc.edu/) and ANNOVAR (https://annovar.openbioinformatics.org/en/latest/).
Code availability
DeepMosaic is currently implemented in Python; the source code, documentation and demos are available at https://github.com/Virginiaxu/DeepMosaic. Code for running the different MV callers is documented in the Methods section.
References
Dou, Y., Gold, H. D., Luquette, L. J. & Park, P. J. Detecting somatic mutations in normal cells. Trends Genet. 34, 545–557 (2018).
Biesecker, L. G. & Spinner, N. B. A genomic view of mosaicism and human disease. Nat. Rev. Genet. 14, 307–320 (2013).
Lee, J. H. et al. Human glioblastoma arises from subventricular zone cells with low-level driver mutations. Nature 560, 243–247 (2018).
Yang, X. et al. MosaicBase: a knowledgebase of postzygotic mosaic variants in noncancer disease-related and healthy human individuals. Genom. Proteom. Bioinform. 18, 140–149 (2020).
Poduri, A., Evrony, G. D., Cai, X. & Walsh, C. A. Somatic mutation, genomic variation, and neurological disease. Science 341, 1237758 (2013).
Freed, D., Stevens, E. L. & Pevsner, J. Somatic mosaicism in the human genome. Genes 5, 1064–1094 (2014).
Yang, X. et al. Developmental and temporal characteristics of clonal sperm mosaicism. Cell 184, 4772–4783 e4715 (2021).
Breuss, M. W., Yang, X. & Gleeson, J. G. Sperm mosaicism: implications for genomic diversity and disease. Trends Genet. 37, 890–902 (2021).
Cibulskis, K. et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat. Biotechnol. 31, 213–219 (2013).
Kim, S. et al. Strelka2: fast and accurate calling of germline and somatic variants. Nat. Methods 15, 591–594 (2018).
Poplin, R. et al. A universal SNP and small-indel variant caller using deep neural networks. Nat. Biotechnol. 36, 983–987 (2018).
Huang, A. Y. et al. MosaicHunter: accurate detection of postzygotic single-nucleotide mosaicism through next-generation sequencing of unpaired, trio, and paired samples. Nucleic Acids Res. 45, e76 (2017).
Dou, Y. et al. Accurate detection of mosaic variants in sequencing data without matched controls. Nat. Biotechnol. 38, 314–319 (2020).
Dou, Y. et al. Postzygotic single-nucleotide mosaicisms contribute to the etiology of autism spectrum disorder and autistic traits and the origin of mutations. Hum. Mutat. 38, 1002–1013 (2017).
McNulty, S. N. et al. Diagnostic utility of next-generation sequencing for disorders of somatic mosaicism: a five-year cumulative cohort. Am. J. Hum. Genet. 105, 734–746 (2019).
Wang, Y. et al. Comprehensive identification of somatic nucleotide variants in human brain tissue. Genome Biol. 22, 92 (2021).
Huang, A. Y. et al. Postzygotic single-nucleotide mosaicisms in whole-genome sequences of clinically unremarkable individuals. Cell Res. 24, 1311–1327 (2014).
Huang, A. Y. et al. Distinctive types of postzygotic single-nucleotide mosaicisms in healthy individuals revealed by genome-wide profiling of multiple organs. PLoS Genet. 14, e1007395 (2018).
Breuss, M. W. et al. Somatic mosaicism reveals clonal distributions of neocortical development. Nature 604, 689–696 (2022).
Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J. & Wojna, Z. Rethinking the inception architecture for computer vision. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (eds. Bajcsy, R., Li, F.F., & Tuytelaars, T.) 2818–2826 (IEEE, 2016).
He, K., Zhang, X., Ren, S. & Sun, J. Deep Residual Learning for Image Recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (eds. Bajcsy, R., Li, F.F., & Tuytelaars, T.) 770–778 (IEEE, 2016).
Iandola, F. et al. Densenet: implementing efficient convnet descriptor pyramids. Preprint at arXiv arXiv:1404.1869 (2014) https://arxiv.org/abs/1404.1869
Tan, M. & Le, Q. V. Efficientnet: rethinking model scaling for convolutional neural networks. PMLR 97, 6105–6114 (2019).
Springenberg, J. T., Dosovitskiy, A., Brox, T. & Riedmiller, M. Striving for simplicity: the all convolutional net. Preprint at arXiv arXiv:1412.6806 (2014) https://arxiv.org/abs/1412.6806
Ewing, A. D. et al. Combining tumor genome simulation with crowdsourcing to benchmark somatic single-nucleotide-variant detection. Nat. Methods 12, 623–630 (2015).
Krusche, P. et al. Best practices for benchmarking germline small-variant calls in human genomes. Nat. Biotechnol. 37, 555–560 (2019).
Breuss, M. W. et al. Autism risk in offspring can be assessed through quantification of male sperm mosaicism. Nat. Med. 26, 143–150 (2020).
Pelorosso, C. et al. Somatic double-hit in MTOR and RPS6 in hemimegalencephaly with intractable epilepsy. Hum. Mol. Genet. 28, 3755–3765 (2019).
Fan, Y. et al. MuSE: accounting for tumor heterogeneity using a sample-specific error model improves sensitivity and specificity in mutation calling from sequencing data. Genome Biol. 17, 178 (2016).
Larson, D. E. et al. SomaticSniper: identification of somatic point mutations in whole genome sequencing data. Bioinformatics 28, 311–317 (2012).
Koboldt, D. C. et al. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res. 22, 568–576 (2012).
Radenbaugh, A. J. et al. RADIA: RNA and DNA integrated analysis for somatic mutation detection. PLoS ONE 9, e111516 (2014).
Ellrott, K. et al. Scalable open science approach for mutation calling of tumor exomes using multiple genomic pipelines. Cell Syst. 6, 271–281 e277 (2018).
Sahraeian, S. M. E. et al. Deep convolutional neural networks for accurate somatic mutation detection. Nat. Commun. 10, 1041 (2019).
Zink, F. et al. Clonal hematopoiesis, with and without candidate driver mutations, is common in the elderly. Blood 130, 742–752 (2017).
Lawson, A. R. J. et al. Extensive heterogeneity in somatic mutation and selection in the human bladder. Science 370, 75–82 (2020).
Xia, Y., Liu, Y., Deng, M. & Xi, R. Pysim-sv: a package for simulating structural variation data with GC-biases. BMC Bioinf. 18, 53 (2017).
Koressaar, T. & Remm, M. Enhancements and modifications of primer design program Primer3. Bioinformatics 23, 1289–1291 (2007).
Hansen, R. S. et al. Sequencing newly replicated DNA reveals widespread plasticity in human replication timing. Proc. Natl Acad. Sci. USA 107, 139–144 (2010).
Chung, C. et al. Comprehensive multiomic profiling of somatic mutations in malformations of cortical development. Nat. Genet. (in the press).
Acknowledgements
We thank Y. Dou for helping to set up the MosaicForecast pipeline. We thank M. K. Gilson for help with computational resources. We thank P. J. Park, G. W. Cottrell, J. V. Moran, M. Gymrek, P. J. Reed, A. Y. Huang, S.-J. Cheng and Y. Chen for their valuable comments, help and suggestions. This work was supported by the National Institute of Mental Health (NIMH) (grant nos. U01MH108898 and R01MH124890 to J.G.G.), Rady Children’s Institute for Genomic Medicine and the Howard Hughes Medical Institute. We thank the San Diego Supercomputer Center (grant no. TG-IBN190021 to X.Y. and J.G.G.) for computational help. This publication includes data generated at the UC San Diego IGM Genomics Center using an Illumina NovaSeq 6000 platform that was purchased with funding from a National Institutes of Health SIG grant (no. S10OD026929 to X.Y. and J.G.G.).
Author information
Contributions
X.Y., X.X. and J.G.G. conceived this project with input from M.W.B. and D.A. X.Y. designed the study and managed the project. X.X. implemented the image representation and neural network classifier under the supervision and instruction of X.Y. X.Y., C.L., X.X., J.S. and Y.C. generated and collected all the training and benchmark data with help from D.A., R.D.G., L.W. and L.B.A. X.X. performed the training and model selection under the supervision of X.Y. The independent dataset was processed by M.W.B., D.A. and R.D.G. under the supervision of J.L.S. and J.G.G. X.Y. and M.W.B. performed the validation experiments with help from L.L.B. and C.C. X.Y. and X.X. wrote the original and revised manuscript with input from all listed authors. X.Y. and J.G.G. revised and edited the manuscript. DeepMosaic is benchmarked on part of the BSMN Reference Tissue Project and common analysis pipeline for SNVs contributed by Y.W. and T.B. under the supervision of A.A., and on the BSMN capstone project contributed by M.W.B., X.Y., D.A. and X.X. under the supervision of J.G.G. All authors discussed the results and contributed to the final manuscript.
Ethics declarations
Competing interests
L.B.A. is a compensated consultant and has equity interest in io9, LLC. His spouse is an employee of Biotheranostics, Inc. L.B.A. is an inventor of a US Patent 10,776,718 and he also declares US provisional applications with serial numbers: 63/289,601; 63/269,033; 63/366,392 and 63/367,846. All other authors declare no competing interests.
Peer review
Peer review information
Nature Biotechnology thanks Anders Skanderup, Moritz Gerstung and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Training strategies and examples of training data for DeepMosaic.
(a) More than 200,000 training and validation variants were generated for DeepMosaic, including computational simulations (SimData1) and biologically validated variants from existing studies with manually curated technical artifacts (BioData1). We further included one gold-standard dataset for testing and model selection (BioData2); all selected positive or negative variants underwent amplicon sequencing in at least one tissue sample according to the publication. We further included independent simulated data (SimData2 and SimData3) and validated independent biological data (BioData3-WGS, BioData4-WGS, and BioData5-WES) to benchmark DeepMosaic. (b) The overall strategies of model training and benchmarking for each tested model. (c) The probability density distribution of expected AFs for different variants from the training set. Red: Reference homozygous variants and technical artifacts are labeled ‘Negative’ in the training set. Green: Heterozygous variants are also labeled ‘Negative’ in the training set. Blue: True mosaic variants are labeled ‘Positive’ in the training set. (d) Two examples of false-positive variants with different sequencing artifacts, left: multiple alternative alleles from sequencing bias or alignment artifacts; right: reads truncated because of sequencing or alignment artifacts. (e) All training images were down-sampled or up-sampled to depths of 30×, 50×, 100×, 150×, 200×, 250×, 300×, 400× and 500×; examples from the simulated data, with mutant allelic fractions (AFs) set to 1%, 2%, 3%, 4%, 5%, 10%, 15%, 20% and 25%, are shown.
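The depth and AF grid in panel (e) can be made concrete with a minimal sketch. It assumes a simple binomial read-sampling model, which is an illustration only and not the published simulation procedure; the function names `expected_alt_reads` and `simulate_alt_reads` are hypothetical. It shows why low-AF variants are sparsely represented at low depth: at 30× and 1% AF, fewer than one supporting read is expected.

```python
import random

# Depths and allelic fractions listed in the caption for panel (e).
DEPTHS = [30, 50, 100, 150, 200, 250, 300, 400, 500]
ALLELE_FRACTIONS = [0.01, 0.02, 0.03, 0.04, 0.05, 0.10, 0.15, 0.20, 0.25]


def expected_alt_reads(depth, af):
    """Mean number of reads carrying the alternative allele."""
    return depth * af


def simulate_alt_reads(depth, af, rng):
    """Draw an alternative-allele read count for one site.

    Assumption: each of `depth` reads independently carries the variant
    with probability `af` (binomial sampling). This is a toy model, not
    the simulation used to build SimData1.
    """
    return sum(1 for _ in range(depth) if rng.random() < af)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
grid = {
    (depth, af): simulate_alt_reads(depth, af, rng)
    for depth in DEPTHS
    for af in ALLELE_FRACTIONS
}

# Compare the expected and one simulated count at the corners of the grid.
for depth in (30, 500):
    for af in (0.01, 0.25):
        print(depth, af, expected_alt_reads(depth, af), grid[(depth, af)])
```

Under this toy model, the lowest-depth, lowest-AF corner of the grid frequently yields zero or one supporting read, which is the regime where a classifier has the least evidence to work with.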
Extended Data Fig. 2 Network model selection based on an independent gold-standard testing set.
(a) Comparison of network structures implementing a variety of classification algorithms. For different build versions of EfficientNet, only a general structure is shown. Inception v3 was used in DeepVariant, and ResNet was used in NeuSomatic. (b) All models were trained on 180,000 training variants from BioData1 and SimData1 until they reached a training accuracy > 0.9. Accuracy, Matthews correlation coefficient (MCC) and sensitivity are shown for the different network structures trained on the same data for different numbers of epochs. EfficientNet-b4 trained for 6 epochs demonstrated the highest accuracy, MCC and TPR (true positive rate, sensitivity) on the gold-standard validation set (ref. 16; BioData2); thus it was used as the default core model for DeepMosaic. We additionally provide an option for experienced users to train their own models with self-labeled training data. (c) EfficientNet-b4 models were trained on 5 additional datasets, each for 15 epochs. The training datasets were generated with different compositions of biologically validated data and simulated data. Models trained only on simulated data showed overall higher sensitivity but much lower specificity on the gold-standard evaluation set (BioData2) due to the high fraction of false-positive calls. Models trained only on biological data showed similar overall performance compared with models trained on a mixture of biological and simulated data. All three training sets were generated with the same number of positive and negative data points as the biological data and with the same number of total variants. M2S2 Positive: training variants were labeled positive by both MuTect2 and Strelka2. n = 15; boundaries are the range for each violin plot; for data in the inner boxplot, the center is the median, the upper bound is the upper hinge/75% quantile, the lower bound is the lower hinge/25% quantile, the lower whisker represents lower hinge – 1.5*IQR and the upper whisker represents upper hinge + 1.5*IQR.
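For readers unfamiliar with the three model-selection metrics named in panel (b), the following is a minimal sketch of their standard definitions from confusion-matrix counts. The function name `classification_metrics` and the example counts are hypothetical and are not taken from BioData2 or from the DeepMosaic code base.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Return accuracy, sensitivity (TPR) and MCC from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Sensitivity: fraction of true mosaic variants that were called positive.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # MCC: correlation between predicted and true labels, in [-1, 1].
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"accuracy": accuracy, "sensitivity": sensitivity, "mcc": mcc}


# Hypothetical counts, for illustration only.
example = classification_metrics(tp=80, fp=10, tn=90, fn=20)
print(example)
```

MCC is often preferred over accuracy when positive and negative classes are imbalanced, because it only approaches 1 when the classifier does well on both classes.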
Extended Data Fig. 3 The convolutional neural network of the DeepMosaic default model and gradient visualization with guided backpropagation for the DeepMosaic default model (EfficientNet-b4).
(a) Down-sampled and up-sampled image files coded from the original BAM files were used as input. 16 mobile convolutional layers were adapted from EfficientNet-b4, with optimized parameter size and structures. Numbers represent the dimensions of trained hyperparameters. (b) A mosaic variant, a homozygous variant, a heterozygous variant with artifacts and a technical artifact are shown for gradient visualization with the guided backpropagation method (ref. 25) implemented for the DeepMosaic core model, EfficientNet-b4 trained for 6 epochs; left: image coding, right: gradient heatmap. The edges of bases, the sequence information and other high-dimensional information are highlighted by the model.
Extended Data Fig. 4 Performance of DeepMosaic default model (EfficientNet-b4) on data hidden from training.
(a) Receiver operating characteristic (ROC) curve for DeepMosaic. True positive rates (TPR) and false-positive rates (FPR) were evaluated from 20,265 variants (BioData1 and SimData1) hidden from model training and model selection. Colors show groups of intended read depth. (b) Precision-recall curves for DeepMosaic, evaluated from the 20,265 hidden variants; dots show the performance of the default parameters for DeepMosaic-CM. (c) ROC curve for DeepMosaic. TPR and FPR were evaluated from 20,265 variants (BioData1 and SimData1) hidden from model training and model selection. Colors show bins of different expected AFs. (d) Precision-recall curves for DeepMosaic, evaluated from the 20,265 hidden variants; dots show the performance of the default parameters for DeepMosaic-CM for different AF bins. Iso-F1 curves, along which every precision-recall pair has the same F1 score, are shown and labeled in (b) and (d).
Extended Data Fig. 5 Performance of DeepMosaic and other mosaic variant callers on SimData2.
Sensitivity of DeepMosaic and other mosaic callers on 439,200 independently simulated benchmark variants (SimData2) at simulated read depths and AFs. DeepMosaic performed equally well or better than other tested methods, especially at lower expected AFs. The true positive sites to calculate sensitivity do not include variants that fall into genomic repetitive regions.
Extended Data Fig. 6 Sensitivity and specificity of DeepMosaic and other mosaic variant callers on BioData4.
Sensitivity and specificity were calculated from the orthogonal validation experiment of 239 variants from BioData4. Mosaic variant detection was carried out with DeepMosaic, MosaicForecast, MosaicHunter, MuTect2, NeuSomatic, and Strelka2 on 16 WGS samples sequenced at 200×. Raw variant calls are provided in Supplementary Table 1, and a summary of performance is provided in Supplementary Table 3. SM: single mode, variant calling without control; PM: paired mode, variant calling by comparing the sequences between two samples.
Extended Data Fig. 7 Comparison of DeepMosaic and traditional mosaic variant calling strategies on a WGS biological dataset (BioData4).
(a) The mosaic variant calling strategy (M2S2MH) used in a previous publication (ref. 28) is shown alongside the DeepMosaic and MosaicForecast (ref. 13) strategies. (b) Schematics for amplicon validation. Primers were designed for different candidates and amplicons were collected for Illumina sequencing. Information from aligned reads was calculated and genotypes were determined. (c) Venn diagram of the experimentally validated results and the portions of variants from different study strategies. DeepMosaic demonstrated a 96.3% (158/164) validation rate. Of the 819 variants identified by DeepMosaic, 33.0% (271/819) were missed by the MuTect2-Strelka2-MosaicHunter pipeline, with a validation rate of 97.26% (71/73), and 21.0% (172/819) were missed by the MosaicForecast pipeline, with a validation rate of 97.06% (33/34). (d) Examples of validated variants called by DeepMosaic and MosaicForecast (i), only by DeepMosaic (ii), or by DeepMosaic and other traditional methods (iii).
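The percentages in panel (c) follow directly from the counts given in parentheses. The short sketch below recomputes them; the helper name `percent` and the dictionary labels are hypothetical. Note that 271/819 evaluates to about 33.09%, so the 33.0% reported in the caption appears to be truncated rather than rounded.

```python
def percent(numerator, denominator, digits):
    """Express a count ratio as a percentage rounded to `digits` decimals."""
    return round(100 * numerator / denominator, digits)


# Counts as reported in the caption for panel (c).
rates = {
    "DeepMosaic validation rate": percent(158, 164, 1),
    "missed by M2S2MH, validated": percent(71, 73, 2),
    "missed by MosaicForecast, validated": percent(33, 34, 2),
    "fraction missed by M2S2MH": percent(271, 819, 2),
    "fraction missed by MosaicForecast": percent(172, 819, 1),
}

for label, value in rates.items():
    print(f"{label}: {value}%")
```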
Extended Data Fig. 8 Comparison of features of variants called by DeepMosaic and other pipelines.
(a) Variants detected by the 3 pipelines were separated into 7 groups according to their overlap. (b) DeepMosaic-specific (G1) variants present similar base-substitution features compared with variants detected by the MuTect2-Strelka2-MosaicHunter combined pipeline as well as the MosaicForecast pipeline (G2-G7). (c) Allelic fractions of the variants detected in the original WGS sample showed that DeepMosaic-specific variants (G1, G2, and G4) had a significantly lower average AF than variants detectable by all 3 pipelines (G3, p < 2.2e-16 by a two-tailed Wilcoxon rank sum test with continuity correction) and lower than variants detectable only in other pipelines (G5, G6, and G7, p = 0.0027 by a two-tailed Wilcoxon rank sum test with continuity correction; n = 160 for G1; n = 99 for G2; n = 548 for G3; n = 12 for G4; n = 203 for G5; n = 143 for G6; n = 130 for G7; for data in the inner boxplot, the center is the median, the upper bound is the upper hinge/75% quantile, the lower bound is the lower hinge/25% quantile, the lower whisker represents lower hinge – 1.5*IQR and the upper whisker represents upper hinge + 1.5*IQR; the boundary of the violin plot is the range). (d) Recovery rate of DeepMosaic, M2S2MH, and MosaicForecast at different depths from downsampling of BioData3. DeepMosaic showed a similar variant recovery rate compared with M2S2MH and MosaicForecast, even when considering the lower AF variants detected by DeepMosaic.
Extended Data Fig. 9 Enrichment of genomic features for variants called by DeepMosaic and conventional methods.
(a) Variants called from different pipelines shared similar variant types and contributions. The groups are defined as in Extended Data Fig. 8a. The relative contribution of different types of MVs is stable between different variant groups. (b) Enrichment analysis of variants in different genomic features. Unlike the variants shared with other callers, DeepMosaic-specific (G1) variants present depletion in high nucleosome occupancy regions. 10,000 permutations were carried out on randomly selected gnomAD variants; significant comparisons are shown in pink. Overall, DeepMosaic-specific variants (G1) do not show significantly different genomic features compared with permutation intervals.
Extended Data Fig. 10 Comparison of DeepMosaic and traditional mosaic variant calling strategies on a WES biological dataset (BioData5), and the computational resources required for WES (BioData6) and WGS (BioData4).
(a) The mosaic variant calling strategy (GATK HaplotypeCaller ‘ploidy’ 50 with heuristic filters) established in the previous publication is shown alongside the DeepMosaic strategy. (b) Venn diagram of the experimentally validated results and the portions of variants from different study strategies. DeepMosaic demonstrated a 43.1% (25/58) validation rate, significantly outperforming the 17.6% (44/250) validation rate established previously (ref. 16). (c) DeepMosaic consumes on average 1403.8 (range 9.1–50168.9) seconds to run an exome and 22718.2 (range 6565.8–60800.0) seconds for a 300× genome on a 12-core CPU node. (d) DeepMosaic consumes an average of 1.3 Gb (range 0.9 Gb–1.8 Gb) maximum memory for an exome and an average of 1.2 Gb (range 1.1 Gb–1.3 Gb) for a genome. Some exomes required more resources than others and formed a bimodal distribution, but the cause for this was not explored. Results were calculated from real data run at the San Diego Supercomputer Center. For data in (c) and (d), the upper and lower boundaries of the violin plot are the range.
Supplementary information
Supplementary Information
Extended Data Figs. 1–10, Tables 1–5 and Text.
Supplementary Tables
Supplementary Tables 1–5.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yang, X., Xu, X., Breuss, M.W. et al. Control-independent mosaic single nucleotide variant detection with DeepMosaic. Nat Biotechnol 41, 870–877 (2023). https://doi.org/10.1038/s41587-022-01559-w
DOI: https://doi.org/10.1038/s41587-022-01559-w