A comparative investigation of single nucleotide variant calling for a personal non-Caucasian sequencing sample

  • Research Article
  • Published in: Genes & Genomics

Abstract

Background

The falling cost and growing clinical application of whole genome sequencing (WGS) create a need for efficient (accurate and rapid) variant calling procedures for personal WGS data (n = 1). A number of variant calling pipelines have been introduced that use the human reference genome GRCh38 and a benchmark dataset called ‘NA12878’, both of which are ‘standard’ but of limited ethnic origin. Given the nature of variant calling algorithms and recent updates in sequencing protocols, however, it is necessary to revisit the efficiency of the current best pipelines for personal WGS data from diverse ethnicities.

Objective

We discuss the most efficient practices for variant calling of personal WGS reads, with particular emphasis on whether (1) an ethnic match or mismatch between the reference genome and the WGS data produces a distinct result and, more importantly, (2) there is an ethnicity-specific optimal workflow.

Methods

Here, we generated appropriate WGS data, DNA array data, and a sufficient number of Sanger-validated variants from a single Korean subject to perform such a comprehensive comparison. We applied these WGS reads and the ‘NA12878’ reads to 8 different variant calling pipelines with 2 different reference genomes (GRCh38 and KOREF, a Korean reference genome), to which the WGS reads from different ethnic origins were aligned.

Results

We evaluated the performance of the pipelines with the matched array genotype data and Sanger sequencing validation and demonstrated that, regardless of ethnic match or mismatch, (1) Novoalign-GATK4 showed the most efficient performance, with exceptional calls in the MHC region; and (2) the overall performance was better with GRCh38, while a significant difference in recall was observed. In addition, we found that removing the ‘markduplication’ step with PCR-free WGS data largely reduced computing cost while maintaining performance.
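As an illustration of how recall and precision against a validated genotype set can be computed, the following is a minimal sketch. It assumes the standard set-based definitions (the abstract does not state the exact formulas used), treats a variant as a (chromosome, position, ref, alt) tuple, and ignores genotype (zygosity) concordance. The function name `evaluate_calls` and the toy variants are hypothetical and are not taken from the article.

```python
from typing import Dict, Set, Tuple

# A variant is identified here by (chromosome, position, ref allele, alt allele).
# This representation is an assumption made for illustration only.
Variant = Tuple[str, int, str, str]


def evaluate_calls(called: Set[Variant], truth: Set[Variant]) -> Dict[str, float]:
    """Compare a pipeline's call set with a validated truth set."""
    tp = len(called & truth)   # called and validated
    fp = len(called - truth)   # called but not in the truth set
    fn = len(truth - called)   # validated but missed by the pipeline

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Toy data (hypothetical): five validated variants, four calls.
    truth = {
        ("chr1", 100, "A", "G"),
        ("chr1", 250, "C", "T"),
        ("chr2", 300, "G", "A"),
        ("chr6", 400, "T", "C"),
        ("chr6", 500, "A", "C"),
    }
    called = {
        ("chr1", 100, "A", "G"),
        ("chr1", 250, "C", "T"),
        ("chr2", 300, "G", "A"),
        ("chr3", 999, "G", "T"),  # not in the truth set
    }
    print(evaluate_calls(called, truth))
```

In this toy case the pipeline recovers three of the five validated variants and makes one unsupported call, so recall is lower than precision. A comparison across reference genomes would apply the same function to each pipeline's call set after lifting coordinates to a common reference, a step not shown here.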

Conclusion

For variant calling of personal PCR-free WGS data, regardless of ethnicity, we recommend the use of Novoalign + GATK4 with GRCh38 and without ‘markduplication’.

Data availability

The FASTQ file of the Korean subject and the list of true genotypes from Sanger validation are available (SRR22944222).

Acknowledgements

We would like to thank the gimLAB members for the insightful discussions and Editage (www.editage.co.kr) for English language editing. This research was supported by the Korea Brain Research Institute basic research program funded by the Ministry of Science and ICT (22-BR-03-05) and the Healthcare AI Convergence Research & Development Program through the National IT Industry Promotion Agency of Korea (NIPA) funded by the Ministry of Science and ICT (No. 1711120216) of the Republic of Korea. This research was also supported by U01-AG062602, funded by NIA NIH HHS of the United States.

Author information

Contributions

HSP performed all the analyses and wrote the manuscript. JSG designed the work and revised the manuscript.

Corresponding author

Correspondence to JungSoo Gim.

Ethics declarations

Conflict of Interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Park, H., Gim, J. A comparative investigation of single nucleotide variant calling for a personal non-Caucasian sequencing sample. Genes Genom 45, 1527–1536 (2023). https://doi.org/10.1007/s13258-023-01439-w

  • DOI: https://doi.org/10.1007/s13258-023-01439-w
