Control-independent mosaic single nucleotide variant detection with DeepMosaic

Yang, Xiaoxu; Xu, Xin; Breuss, Martin W.; Antaki, Danny; Ball, Laurel L.; Chung, Changuk; Shen, Jiawei; Li, Chen; George, Renee D.; Wang, Yifan; Bae, Taejeong; Cheng, Yuhe; Abyzov, Alexej; Wei, Liping; Alexandrov, Ludmil B.; Sebat, Jonathan L.; Gleeson, Joseph G.

doi:10.1038/s41587-022-01559-w

Article
Published: 02 January 2023

Xiaoxu Yang ORCID: orcid.org/0000-0003-0219-0023^1,2^na1,
Xin Xu^1,2^na1,
Martin W. Breuss^1,2,3,
Danny Antaki^1,2,
Laurel L. Ball^1,2,
Changuk Chung^1,2,
Jiawei Shen^1,2,
Chen Li ORCID: orcid.org/0000-0002-1790-6664^1,2,
Renee D. George^1,2,
Yifan Wang ORCID: orcid.org/0000-0001-8056-9755⁴,
Taejeong Bae⁴,
Yuhe Cheng^5,6,7,
Alexej Abyzov ORCID: orcid.org/0000-0001-5405-6729⁴,
Liping Wei⁸,
Ludmil B. Alexandrov^5,6,7,
Jonathan L. Sebat^9,10,11,12,
NIMH Brain Somatic Mosaicism Network &
…
Joseph G. Gleeson ORCID: orcid.org/0000-0002-6713-8018^1,2

Nature Biotechnology volume 41, pages 870–877 (2023)

Abstract

Mosaic variants (MVs) reflect mutagenic processes during embryonic development and environmental exposure, accumulate with aging and underlie diseases such as cancer and autism. The detection of noncancer MVs has been computationally challenging due to the sparse representation of nonclonally expanded MVs. Here we present DeepMosaic, combining an image-based visualization module for single nucleotide MVs and a convolutional neural network-based classification module for control-independent MV detection. DeepMosaic was trained on 180,000 simulated or experimentally assessed MVs, and was benchmarked on 619,740 simulated MVs and 530 independent biologically tested MVs from 16 genomes and 181 exomes. DeepMosaic achieved higher accuracy compared with existing methods on biological data, with a sensitivity of 0.78, specificity of 0.83 and positive predictive value of 0.96 on noncancer whole-genome sequencing data, as well as doubling the validation rate over previous best-practice methods on noncancer whole-exome sequencing data (0.43 versus 0.18). DeepMosaic represents an accurate MV classifier for noncancer samples that can be implemented as an alternative or complement to existing methods.

**Fig. 1: Image representation, model training strategies and framework of DeepMosaic.**

**Fig. 2: DeepMosaic performance on simulated benchmark variants.**

**Fig. 3: DeepMosaic performance validated on biological data.**

Data availability

WGS data used to generate the training set are available at the SRA (accession nos. SRP028833 and SRP100797, BioData1). The gold-standard WGS data and validated capstone project data are available at the National Institute of Mental Health Data Archive (NIMH Data Archive ID 792 and 919: https://nda.nih.gov/study.html?id=792, BioData2, and https://nda.nih.gov/study.html?id=919, BioData3) and the Brain Somatic Mosaicism Consortium Data Portal; independent benchmark brain genotyping is also part of SRA accession no. PRJNA736951 (BioData3). Simulated data generated from NA24385 (HG002) are available at https://humanpangenome.org/hg002/. The independent sperm and blood deep WGS data are available at the SRA (accession nos. PRJNA588332 and PRJNA660493, BioData4). Independent WES data from brain, blood and saliva samples are available in the NIMH Data Archive under study number 1484 (https://nda.nih.gov/study.html?id=1484, BioData5). TCGA-MC3 data are available on the GDC portal (https://portal.gdc.cancer.gov/, sample IDs provided with variants in Supplementary Table 3). Annotations were downloaded from the UCSC Genome Browser (https://genome.ucsc.edu/) and ANNOVAR (https://annovar.openbioinformatics.org/en/latest/).

Code availability

DeepMosaic is currently implemented in Python; the source code, documentation and demos are available at https://github.com/Virginiaxu/DeepMosaic. Code for running the different MV callers is documented in the Methods section.

References

1. Dou, Y., Gold, H. D., Luquette, L. J. & Park, P. J. Detecting somatic mutations in normal cells. Trends Genet. 34, 545–557 (2018).
2. Biesecker, L. G. & Spinner, N. B. A genomic view of mosaicism and human disease. Nat. Rev. Genet. 14, 307–320 (2013).
3. Lee, J. H. et al. Human glioblastoma arises from subventricular zone cells with low-level driver mutations. Nature 560, 243–247 (2018).
4. Yang, X. et al. MosaicBase: a knowledgebase of postzygotic mosaic variants in noncancer disease-related and healthy human individuals. Genom. Proteom. Bioinform. 18, 140–149 (2020).
5. Poduri, A., Evrony, G. D., Cai, X. & Walsh, C. A. Somatic mutation, genomic variation, and neurological disease. Science 341, 1237758 (2013).
6. Freed, D., Stevens, E. L. & Pevsner, J. Somatic mosaicism in the human genome. Genes 5, 1064–1094 (2014).
7. Yang, X. et al. Developmental and temporal characteristics of clonal sperm mosaicism. Cell 184, 4772–4783 e4715 (2021).
8. Breuss, M. W., Yang, X. & Gleeson, J. G. Sperm mosaicism: implications for genomic diversity and disease. Trends Genet. 37, 890–902 (2021).
9. Cibulskis, K. et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat. Biotechnol. 31, 213–219 (2013).
10. Kim, S. et al. Strelka2: fast and accurate calling of germline and somatic variants. Nat. Methods 15, 591–594 (2018).
11. Poplin, R. et al. A universal SNP and small-indel variant caller using deep neural networks. Nat. Biotechnol. 36, 983–987 (2018).
12. Huang, A. Y. et al. MosaicHunter: accurate detection of postzygotic single-nucleotide mosaicism through next-generation sequencing of unpaired, trio, and paired samples. Nucleic Acids Res. 45, e76 (2017).
13. Dou, Y. et al. Accurate detection of mosaic variants in sequencing data without matched controls. Nat. Biotechnol. 38, 314–319 (2020).
14. Dou, Y. et al. Postzygotic single-nucleotide mosaicisms contribute to the etiology of autism spectrum disorder and autistic traits and the origin of mutations. Hum. Mutat. 38, 1002–1013 (2017).
15. McNulty, S. N. et al. Diagnostic utility of next-generation sequencing for disorders of somatic mosaicism: a five-year cumulative cohort. Am. J. Hum. Genet. 105, 734–746 (2019).
16. Wang, Y. et al. Comprehensive identification of somatic nucleotide variants in human brain tissue. Genome Biol. 22, 92 (2021).
17. Huang, A. Y. et al. Postzygotic single-nucleotide mosaicisms in whole-genome sequences of clinically unremarkable individuals. Cell Res. 24, 1311–1327 (2014).
18. Huang, A. Y. et al. Distinctive types of postzygotic single-nucleotide mosaicisms in healthy individuals revealed by genome-wide profiling of multiple organs. PLoS Genet. 14, e1007395 (2018).
19. Breuss, M. W. et al. Somatic mosaicism reveals clonal distributions of neocortical development. Nature 604, 689–696 (2022).
20. Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
21. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J. & Wojna, Z. Rethinking the inception architecture for computer vision. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (eds. Bajcsy, R., Li, F.F., & Tuytelaars, T.) 2818–2826 (IEEE, 2016).
22. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (eds. Bajcsy, R., Li, F.F., & Tuytelaars, T.) 770–778 (IEEE, 2016).
23. Iandola, F. et al. Densenet: implementing efficient convnet descriptor pyramids. Preprint at arXiv https://arxiv.org/abs/1404.1869 (2014).
24. Tan, M. & Le, Q. V. Efficientnet: rethinking model scaling for convolutional neural networks. PMLR 97, 6105–6114 (2019).
25. Springenberg, J. T., Dosovitskiy, A., Brox, T. & Riedmiller, M. Striving for simplicity: the all convolutional net. Preprint at arXiv https://arxiv.org/abs/1412.6806 (2014).
26. Ewing, A. D. et al. Combining tumor genome simulation with crowdsourcing to benchmark somatic single-nucleotide-variant detection. Nat. Methods 12, 623–630 (2015).
27. Krusche, P. et al. Best practices for benchmarking germline small-variant calls in human genomes. Nat. Biotechnol. 37, 555–560 (2019).
28. Breuss, M. W. et al. Autism risk in offspring can be assessed through quantification of male sperm mosaicism. Nat. Med. 26, 143–150 (2020).
29. Pelorosso, C. et al. Somatic double-hit in MTOR and RPS6 in hemimegalencephaly with intractable epilepsy. Hum. Mol. Genet. 28, 3755–3765 (2019).
30. Fan, Y. et al. MuSE: accounting for tumor heterogeneity using a sample-specific error model improves sensitivity and specificity in mutation calling from sequencing data. Genome Biol. 17, 178 (2016).
31. Larson, D. E. et al. SomaticSniper: identification of somatic point mutations in whole genome sequencing data. Bioinformatics 28, 311–317 (2012).
32. Koboldt, D. C. et al. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res. 22, 568–576 (2012).
33. Radenbaugh, A. J. et al. RADIA: RNA and DNA integrated analysis for somatic mutation detection. PLoS ONE 9, e111516 (2014).
34. Ellrott, K. et al. Scalable open science approach for mutation calling of tumor exomes using multiple genomic pipelines. Cell Syst. 6, 271–281 e277 (2018).
35. Sahraeian, S. M. E. et al. Deep convolutional neural networks for accurate somatic mutation detection. Nat. Commun. 10, 1041 (2019).
36. Zink, F. et al. Clonal hematopoiesis, with and without candidate driver mutations, is common in the elderly. Blood 130, 742–752 (2017).
37. Lawson, A. R. J. et al. Extensive heterogeneity in somatic mutation and selection in the human bladder. Science 370, 75–82 (2020).
38. Xia, Y., Liu, Y., Deng, M. & Xi, R. Pysim-sv: a package for simulating structural variation data with GC-biases. BMC Bioinf. 18, 53 (2017).
39. Koressaar, T. & Remm, M. Enhancements and modifications of primer design program Primer3. Bioinformatics 23, 1289–1291 (2007).
40. Hansen, R. S. et al. Sequencing newly replicated DNA reveals widespread plasticity in human replication timing. Proc. Natl Acad. Sci. USA 107, 139–144 (2010).
41. Chung, C. et al. Comprehensive multiomic profiling of somatic mutations in malformations of cortical development. Nat. Genet. (in the press).

Acknowledgements

We thank Y. Dou for helping to set up the MosaicForecast pipeline. We thank M. K. Gilson for the help with computational resources. We thank P. J. Park, G. W. Cottrell, J. V. Moran, M. Gymrek, P. J. Reed, A. Y. Huang, S.-J. Cheng and Y. Chen for their valuable comments, help and suggestions. This work was supported by the National Institute of Mental Health (NIMH) (grant nos. U01MH108898 and R01MH124890 to J.G.G.), Rady Children’s Institute for Genomic Medicine and the Howard Hughes Medical Institute. We thank San Diego Supercomputer Center (grant no. TG-IBN190021 to X.Y. and J.G.G.) for computational help. This publication includes data generated at the UC San Diego IGM Genomics Center using an Illumina NovaSeq 6000 platform that was purchased with funding from a National Institutes of Health SIG grant (no. S10OD026929 X.Y. and J.G.G.).

Author information

These authors contributed equally: Xiaoxu Yang, Xin Xu.

Authors and Affiliations

Department of Neurosciences, University of California, San Diego, La Jolla, CA, USA
Xiaoxu Yang, Xin Xu, Martin W. Breuss, Danny Antaki, Laurel L. Ball, Changuk Chung, Jiawei Shen, Chen Li, Renee D. George, Dan Averbuj, Subhojit Roy, Eric Courchesne & Joseph G. Gleeson
Rady Children’s Institute for Genomic Medicine, San Diego, CA, USA
Xiaoxu Yang, Xin Xu, Martin W. Breuss, Danny Antaki, Laurel L. Ball, Changuk Chung, Jiawei Shen, Chen Li, Renee D. George, Dan Averbuj, Subhojit Roy, Eric Courchesne & Joseph G. Gleeson
Department of Pediatrics, Section of Genetics and Metabolism, University of Colorado School of Medicine, Aurora, CO, USA
Martin W. Breuss
Department of Quantitative Health Sciences, Center for Individualized Medicine, Mayo Clinic, Rochester, MN, USA
Yifan Wang, Taejeong Bae, Alexej Abyzov & Yeongjun Jang
Department of Cellular and Molecular Medicine, UC San Diego, La Jolla, CA, USA
Yuhe Cheng & Ludmil B. Alexandrov
Department of Bioengineering, UC San Diego, La Jolla, CA, USA
Yuhe Cheng & Ludmil B. Alexandrov
Moores Cancer Center, UC San Diego, La Jolla, CA, USA
Yuhe Cheng & Ludmil B. Alexandrov
Center for Bioinformatics, State Key Laboratory of Protein and Plant Gene Research, School of Life Sciences, Peking University, Beijing, China
Liping Wei
Beyster Center for Genomics of Psychiatric Diseases, University of California, San Diego, La Jolla, CA, USA
Jonathan L. Sebat
Department of Psychiatry, University of California, San Diego, La Jolla, CA, USA
Jonathan L. Sebat
Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, CA, USA
Jonathan L. Sebat
Department of Pediatrics, University of California, San Diego, La Jolla, CA, USA
Jonathan L. Sebat
Boston Children’s Hospital, Boston, MA, USA
August Y. Huang, Alissa D’Gama, Caroline Dias, Christopher A. Walsh, Javier Ganz, Michael Lodato, Michael Miller, Pengpeng Li, Rachel Rodin, Robert Hill, Sara Bizzotto, Sattar Khoshkhoo & Zinan Zhou
Harvard University, Cambridge, MA, USA
Alice Lee, Alison Barton, Alon Galor, Chong Chu, Craig Bohrson, Doga Gulhan, Eduardo Maury, Elaine Lim, Euncheon Lim, Giorgio Melloni, Isidro Cortes, Jake Lee, Joe Luquette, Lixing Yang, Maxwell Sherman, Michael Coulter, Minseok Kwon, Peter J. Park, Rebeca Borges-Monroy, Semin Lee, Sonia Kim, Soo Lee, Vinary Viswanadham & Yanmei Dou
Icahn School of Medicine at Mt. Sinai, New York, NY, USA
Andrew J. Chess, Attila Jones, Chaggai Rosenbluh & Schahram Akbarian
Kennedy Krieger Institute, Baltimore, MD, USA
Ben Langmead, Jeremy Thorpe & Sean Cho
Lieber Institute for Brain Development, Baltimore, MD, USA
Andrew Jaffe, Apua Paquola, Daniel Weinberger, Jennifer Erwin, Jooheon Shin, Michael McConnell, Richard Straub & Rujuta Narurkar
Sage Bionetworks, Camarillo, CA, USA
Cindy Molitor & Mette Peters
Salk Institute for Biological Studies, La Jolla, CA, USA
Fred H. Gage, Meiyan Wang, Patrick Reed & Sara Linker
Stanford University, Stanford, CA, USA
Alexander Urban, Bo Zhou & Xiaowei Zhu
Universitat Pompeu Fabra, Barcelona, Spain
Aitor S. Amero, David Juan, Inna Povolotskaya, Irene Lobon, Manuel S. Moruno, Raquel G. Perez & Tomas Marques-Bonet
University of Barcelona, Barcelona, Spain
Eduardo Soriano
University of California, Los Angeles, Los Angeles, CA, USA
Gary Mathern
University of Michigan, Ann Arbor, MI, USA
Diane Flasch, Trenton Frisbie, Huira Kopera, Jeffrey Kidd, John Moldovan, John V. Moran, Kenneth Kwan, Ryan Mills, Sarah Emery, Weichen Zhou & Xuefang Zhao
University of Virginia, Charlottesville, VA, USA
Aakrosh Ratan
Yale University, New Haven, CT, USA
Alexandre Jourdon, Flora M. Vaccarino, Liana Fasching, Nenad Sestan, Sirisha Pochareddy & Soraya Scuderi

Contributions

X.Y., X.X. and J.G.G. conceived this project with input from M.W.B. and D.A. X.Y. designed the study and managed the project. X.X. implemented the image representation and neural network classifier under the supervision and instruction of X.Y. X.Y., C.L., X.X., J.S. and Y.C. generated and collected all the training and benchmark data with help from D.A., R.D.G., L.W. and L.B.A. X.X. performed the training and model selection under the supervision of X.Y. The independent dataset was processed by M.W.B., D.A. and R.D.G. under the supervision of J.L.S. and J.G.G. X.Y. and M.W.B. performed the validation experiments with help from L.L.B. and C.C. X.Y. and X.X. wrote the original and revised manuscript with input from all listed authors. X.Y. and J.G.G. revised and edited the manuscript. DeepMosaic was benchmarked on part of the BSMN Reference Tissue Project and common analysis pipeline for SNVs, contributed by Y.W. and T.B. under the supervision of A.A., and on the BSMN capstone project, contributed by M.W.B., X.Y., D.A. and X.X. under the supervision of J.G.G. All authors discussed the results and contributed to the final manuscript.

Corresponding authors

Correspondence to Xiaoxu Yang or Joseph G. Gleeson.

Ethics declarations

Competing interests

L.B.A. is a compensated consultant and has equity interest in io9, LLC. His spouse is an employee of Biotheranostics, Inc. L.B.A. is an inventor of a US Patent 10,776,718 and he also declares US provisional applications with serial numbers: 63/289,601; 63/269,033; 63/366,392 and 63/367,846. All other authors declare no competing interests.

Peer review

Peer review information

Nature Biotechnology thanks Anders Skanderup, Moritz Gerstung and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Training strategies and examples of training data for DeepMosaic.

(a) More than 200,000 training and validation variants were generated for DeepMosaic, including computational simulations (SimData1) and biologically validated variants from existing studies with manually curated technical artifacts (BioData1). We further included one gold-standard dataset for testing and model selection (BioData2); all selected positive or negative variants underwent amplicon sequencing in at least one tissue sample according to the publication. We further included independent simulated data (SimData2 and SimData3) and validated independent biological data (BioData3-WGS, BioData4-WGS and BioData5-WES) to benchmark DeepMosaic. (b) The overall strategies of model training and benchmarking for each tested model. (c) The distribution of probability density of expected AFs for different variants from the training set. Red: reference homozygous variants and technical artifacts are labeled ‘Negative’ in the training set. Green: heterozygous variants are also labeled ‘Negative’ in the training set. Blue: true mosaic variants are labeled ‘Positive’ in the training set. (d) Two examples of false-positive variants with different sequencing artifacts. Left: multiple alternative alleles from sequencing bias or alignment artifacts; right: reads truncated because of sequencing or alignment artifacts. (e) All training images were down-sampled or up-sampled to read depths of 30×, 50×, 100×, 150×, 200×, 250×, 300×, 400× and 500×. Mutant allelic fractions (AFs) in the simulated data were set to 1%, 2%, 3%, 4%, 5%, 10%, 15%, 20% and 25%, as shown.

Extended Data Fig. 2 Network model selection based on an independent gold-standard testing set.

(a) Comparison of network structures implementing a variety of classification algorithms. For different build versions of EfficientNet, only a general structure is shown. Inception v3 was used in DeepVariant, and Resnet was used in NeuSomatic. (b) All models were trained on 180,000 training variants from BioData1 and SimData1 until the models reached a training accuracy > 0.9. Accuracy, Matthews correlation coefficient (MCC) and sensitivity are shown for different network structures trained on the same data for different numbers of epochs. EfficientNet-b4 trained for 6 epochs demonstrated the highest accuracy, MCC and TPR (true positive rate, sensitivity) on the gold-standard validation set¹⁶ (BioData2); thus it was used as the default core model for DeepMosaic. We additionally provide an option for experienced users to train their own models with self-labeled training data. (c) EfficientNet-b4 models were trained on 5 additional datasets, each for 15 epochs. The training datasets were generated with different compositions of biologically validated data and simulated data. Models trained only on simulated data showed overall higher sensitivity but much lower specificity on the gold-standard evaluation set (BioData2) due to the high fraction of false-positive calls. Models trained only on biological data showed similar overall performance compared with models trained on a mixture of biological and simulated data. All three training sets were generated with the same number of positive and negative data points as the biological data and with the same number of total variants. M2S2 Positive: training variants were labeled positive by both MuTect2 and Strelka2. n = 15; the boundaries of each violin plot are the range; for data in the inner boxplot, the center is the median, the upper bound is the upper hinge/75% quantile, the lower bound is the lower hinge/25% quantile, the lower whisker represents lower hinge – 1.5*IQR and the upper whisker represents upper hinge + 1.5*IQR.
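For readers unfamiliar with the three model-selection metrics named in panel (b), the following is a minimal sketch of how accuracy, sensitivity (TPR) and MCC are computed from confusion-matrix counts. The function name and the example counts are hypothetical and are not taken from the paper or from the DeepMosaic code base.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Return accuracy, sensitivity (TPR) and MCC from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Sensitivity is undefined when there are no true positives to find.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    # MCC denominator; by convention MCC is set to 0 when any margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "sensitivity": sensitivity, "mcc": mcc}


# Hypothetical counts for illustration only (not results from the paper).
metrics = classification_metrics(tp=80, fp=10, tn=90, fn=20)
print(metrics)
```

Unlike accuracy, MCC uses all four cells of the confusion matrix symmetrically, which is one reason it is often reported alongside accuracy when positive and negative classes are unbalanced.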

Extended Data Fig. 3 The convolutional neural network of the DeepMosaic default model and gradient visualization with guided backpropagation for the DeepMosaic default model (EfficientNet-b4).

(a) Down-sampled and up-sampled image files coded from the original BAM files were used as input. 16 mobile convolutional layers were adapted from EfficientNet-b4, with optimized parameter size and structures. Numbers represent the dimensions of trained hyperparameters. (b) A mosaic, a homozygous, and a heterozygous variant with artifacts, as well as a technical artifact, are shown here for the gradient visualization with guided backpropagation method²⁵ implemented for the DeepMosaic core model, EfficientNet-b4 trained at epoch 6, left: image coding, right: gradient heatmap. The edges of bases, the sequence information, as well as other high-dimensional information, are highlighted by the model.

Extended Data Fig. 4 Performance of DeepMosaic default model (EfficientNet-b4) on data hidden from training.

(a) Receiver operating characteristic (ROC) curve for DeepMosaic. True positive rates (TPR) and false-positive rates (FPR) were evaluated from 20,265 variants (BioData1 and SimData1) hidden from model training and model selection. Colors show groups of intended read depth. (b) Precision-recall curves for DeepMosaic, evaluated from the 20,265 hidden variants; dots show the performance of the default parameters for DeepMosaic-CM. (c) ROC curve for DeepMosaic. TPR and FPR were evaluated from 20,265 variants (BioData1 and SimData1) hidden from model training and model selection. Colors show groups of bins of different expected AFs. (d) Precision-recall curves for DeepMosaic, evaluated from the 20,265 hidden variants; dots show the performance of the default parameters for DeepMosaic-CM for different AF bins. Iso-F1 curves are shown in (b) and (d), each connecting the precision-recall pairs that share the labeled F1 score.
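The iso-F1 curves in panels (b) and (d) follow directly from the definition F1 = 2PR/(P + R): for a fixed F1, the precision at a given recall R is P = F1·R/(2R − F1). A minimal sketch of that relationship is given below; the helper names and the recall grid are illustrative assumptions, not part of DeepMosaic.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def iso_f1_precision(f1, recall):
    """Precision that gives the requested F1 at this recall.

    Returns None when no precision in (0, 1] can reach that F1,
    which happens when recall is too low for the requested F1.
    """
    denom = 2 * recall - f1
    if denom <= 0:
        return None
    precision = f1 * recall / denom
    return precision if precision <= 1 else None


# Points on the F1 = 0.8 curve over an illustrative recall grid.
recalls = [i / 10 for i in range(1, 11)]
curve = [(r, iso_f1_precision(0.8, r)) for r in recalls]
for recall, precision in curve:
    print(recall, precision)
```

Grid points that print `None` lie to the left of where the curve enters the unit square, which is why iso-F1 lines appear only in the upper-right region of a precision-recall plot.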

Extended Data Fig. 5 Performance of DeepMosaic and other mosaic variant callers on SimData2.

Sensitivity of DeepMosaic and other mosaic callers on 439,200 independently simulated benchmark variants (SimData2) at simulated read depths and AFs. DeepMosaic performed equally well or better than other tested methods, especially at lower expected AFs. The true positive sites to calculate sensitivity do not include variants that fall into genomic repetitive regions.

Extended Data Fig. 6 Sensitivity and specificity of DeepMosaic and other mosaic variant callers on BioData4.

Sensitivity and specificity were calculated from the orthogonal validation experiment of 239 variants from BioData4. Mosaic variant detection was carried out with DeepMosaic, MosaicForecast, MosaicHunter, MuTect2, NeuSomatic and Strelka2 on 16 WGS samples sequenced at 200×. Raw variant calls are provided in Supplementary Table 1, and a summary of performance is provided in Supplementary Table 3. SM: single mode, variant calling without control; PM: paired mode, variant calling by comparing the sequences between two samples.

Extended Data Fig. 7 Comparison of DeepMosaic and traditional mosaic variant calling strategies on a WGS biological dataset (BioData4).

(a) The mosaic variant calling strategy (M2S2MH) used in a previous publication²⁸ is listed alongside the DeepMosaic and MosaicForecast¹³ strategies. (b) Schematics for amplicon validation. Primers were designed for different candidates and amplicons were collected for Illumina sequencing. Information from aligned reads was calculated and genotypes were determined. (c) Venn diagram of the experimentally validated results and the portions of variants from different study strategies. DeepMosaic demonstrated a 96.3% (158/164) validation rate. Of all the 819 variants identified by DeepMosaic, 33.0% (271/819) were missed by the MuTect2 Strelka2 MosaicHunter pipeline, with a validation rate of 97.26% (71/73), and 21.0% (172/819) were missed by the MosaicForecast pipeline, with a validation rate of 97.06% (33/34). (d) Examples of validated variants called by DeepMosaic and MosaicForecast (i), only by DeepMosaic (ii), or by DeepMosaic and other traditional methods (iii).

Extended Data Fig. 8 Comparison of features of variants called by DeepMosaic and other pipelines.

(a) Variants detected by the 3 pipelines were separated into 7 groups according to their overlap. (b) DeepMosaic-specific (G1) variants present similar base-substitution features compared with variants detected by the MuTect2-Strelka2-MosaicHunter combined pipeline as well as the MosaicForecast pipeline (G2-G7). (c) Allelic fractions of the variants detected in the original WGS sample showed that DeepMosaic-specific variants (G1, G2 and G4) had a significantly lower average AF than variants detectable by all 3 pipelines (G3, p < 2.2e-16 by a two-tailed Wilcoxon rank sum test with continuity correction) and a lower average AF than variants detectable only in other pipelines (G5, G6 and G7, p = 0.0027 by a two-tailed Wilcoxon rank sum test with continuity correction; n = 160 for G1; n = 99 for G2; n = 548 for G3; n = 12 for G4; n = 203 for G5; n = 143 for G6; n = 130 for G7; for data in the inner boxplot, the center is the median, the upper bound is the upper hinge/75% quantile, the lower bound is the lower hinge/25% quantile, the lower whisker represents lower hinge – 1.5*IQR and the upper whisker represents upper hinge + 1.5*IQR; the boundary of the violin plot is the range). (d) Recovery rate of DeepMosaic, M2S2MH and MosaicForecast at different depths from downsampling of BioData3. DeepMosaic showed a similar variant recovery rate compared with M2S2MH and MosaicForecast, even when considering the lower AF variants detected by DeepMosaic.
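The box-plot conventions described in panel (c) (median, 25% and 75% quantiles, whiskers at 1.5 × IQR beyond the hinges, violin bounded by the data range) can be sketched as follows. This is an illustration only: it uses the standard-library inclusive quartile method, which can differ slightly from the hinge definition used by a given plotting package, and the input values are made up.

```python
import statistics


def boxplot_summary(values):
    """Summary statistics matching the box-plot description in the caption."""
    # Quartiles; assumed here to stand in for the lower/upper hinges.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "min": min(values),  # lower boundary of the violin (range)
        "max": max(values),  # upper boundary of the violin (range)
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower_whisker": q1 - 1.5 * iqr,
        "upper_whisker": q3 + 1.5 * iqr,
    }


# Made-up values for illustration; not allelic fractions from the paper.
summary = boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9])
print(summary)
```

Note that many plotting libraries draw whiskers only as far as the most extreme data point inside these limits, so drawn whiskers may be shorter than the values computed here.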

Extended Data Fig. 9 Enrichment of genomic features for variants called by DeepMosaic and conventional methods.

(a) Variants called from different pipelines shared similar variant types and contributions. The groups are defined as in Extended Data Fig. 8a. The relative contribution of different types of MVs is stable between different variant groups. (b) Enrichment analysis of variants in different genomic features. Unlike the variants shared with other callers, DeepMosaic-specific (G1) variants present depletion in high nucleosome occupancy regions. A total of 10,000 permutations were carried out on randomly selected gnomAD variants; significant comparisons are shown in pink. Overall, DeepMosaic-specific variants (G1) do not show significantly different genomic features compared with permutation intervals.
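As a rough illustration of the permutation logic in panel (b), the sketch below draws random variant sets of the same size from a background pool and derives an empirical 95% interval for the fraction overlapping a genomic feature. It is not the authors' pipeline: the background pool, the 30% feature rate, the observed fraction and the function name are all hypothetical stand-ins.

```python
import random


def permutation_interval(background_flags, n_variants, n_permutations=10_000, seed=0):
    """Empirical 95% interval of the feature-overlap fraction under random sampling.

    background_flags: list of booleans, True if a background variant
    (for example a population variant) falls inside the genomic feature.
    """
    rng = random.Random(seed)
    fractions = []
    for _ in range(n_permutations):
        sample = rng.sample(background_flags, n_variants)
        fractions.append(sum(sample) / n_variants)
    fractions.sort()
    lower = fractions[int(0.025 * n_permutations)]
    upper = fractions[int(0.975 * n_permutations) - 1]
    return lower, upper


# Hypothetical background: 5,000 variants, 30% inside the feature.
background = [True] * 1500 + [False] * 3500
lower, upper = permutation_interval(background, n_variants=160)

observed_fraction = 0.10  # hypothetical observed overlap for a variant group
if observed_fraction < lower:
    print("depleted relative to permutations")
elif observed_fraction > upper:
    print("enriched relative to permutations")
else:
    print("within the permutation interval")
```

An observed fraction outside the interval would be flagged as enriched or depleted; one inside it is consistent with the background.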

Extended Data Fig. 10 Comparison of DeepMosaic and traditional mosaic variant calling strategies on a WES biological dataset (BioData5), and the computational resources required for WES (BioData6) and WGS (BioData4).

(a) Comparison of the mosaic variant calling strategy established in the previous publication (GATK HaplotypeCaller with ‘ploidy’ 50 and heuristic filters) with the DeepMosaic strategy. (b) Venn diagram of the experimentally validated results and the portions of variants from different study strategies. DeepMosaic demonstrated a 43.1% (25/58) validation rate, significantly outperforming the 17.6% (44/250) validation rate established before¹⁶. (c) DeepMosaic consumes on average 1403.8 (range 9.1–50168.9) seconds to run an exome and 22718.2 (range 6565.8–60800.0) seconds for a 300× genome, respectively, on a 12-core CPU node. (d) DeepMosaic consumes an average of 1.3 Gb (range 0.9 Gb–1.8 Gb) maximum memory for an exome and an average of 1.2 Gb (range 1.1 Gb–1.3 Gb) for a genome. Some exomes required more resources than others and formed a bimodal distribution, but the cause for this was not explored. Results were calculated from real data run at the San Diego Supercomputer Center. For data in (c) and (d), the upper and lower boundaries of the violin plot are the range.

Supplementary information

Supplementary Information

Extended Data Figs. 1–10, Tables 1–5 and Text.

Reporting Summary

Supplementary Tables

Supplementary Tables 1–5.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Yang, X., Xu, X., Breuss, M.W. et al. Control-independent mosaic single nucleotide variant detection with DeepMosaic. Nat Biotechnol 41, 870–877 (2023). https://doi.org/10.1038/s41587-022-01559-w

Received: 13 November 2020
Accepted: 10 October 2022
Published: 02 January 2023
Issue Date: June 2023
DOI: https://doi.org/10.1038/s41587-022-01559-w
