Scalable Framework for the Analysis of Population Structure Using the Next Generation Sequencing Data

  • Conference paper
Foundations of Intelligent Systems (ISMIS 2017)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10352)

Abstract

Genomic variant data obtained from next-generation sequencing can be used to study the population structure of the genotyped individuals. Typical approaches to ethnicity classification and clustering involve several time-consuming pre-processing steps, such as variant filtering, LD-pruning, and dimensionality reduction of the genotype matrix. We have developed a framework in the R programming language to analyze how various pre-processing methods and their parameters influence the final results of the classification and clustering algorithms. The results indicate how to fine-tune the pre-processing steps to maximize supervised and unsupervised classification performance. In addition, to enable efficient processing of large data sets, we have developed a second framework using Apache Spark. Tests performed on the 1000 Genomes data set confirmed the efficiency and scalability of the presented approach. Finally, the dockerized version of the implemented frameworks (freely available at https://github.com/ZSI-Bio/popgen) can be easily applied to any other variant data set, including data from large-scale sequencing projects or custom data sets from clinical laboratories.
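
The pre-processing chain named in the abstract (variant filtering, LD-pruning, dimensionality reduction) can be illustrated with a minimal Python sketch. This is not the authors' R or Apache Spark implementation: the function names, the thresholds (`min_maf`, `r2_threshold`, `window`), the greedy pruning strategy, and the synthetic two-population genotype matrix are all assumptions chosen for illustration.

```python
import numpy as np

def maf_filter(G, min_maf=0.05):
    """Return indices of variants (columns) with minor allele frequency >= min_maf.
    G holds genotypes coded 0/1/2 (alternate allele counts); min_maf is illustrative."""
    freq = G.mean(axis=0) / 2.0
    maf = np.minimum(freq, 1.0 - freq)
    return np.flatnonzero(maf >= min_maf)

def ld_prune(G, r2_threshold=0.5, window=50):
    """Greedy LD-pruning sketch: keep a variant only if its squared Pearson
    correlation with each of the last `window` kept variants is <= r2_threshold."""
    n = G.shape[0]
    centered = G - G.mean(axis=0)
    std = centered.std(axis=0)
    kept = []
    for j in range(G.shape[1]):
        if std[j] == 0:              # constant column carries no information
            continue
        zj = centered[:, j] / std[j]
        recent = kept[-window:]
        if recent:
            Z = centered[:, recent] / std[recent]
            r = Z.T @ zj / n         # Pearson correlation with each recent kept variant
            if np.any(r ** 2 > r2_threshold):
                continue
        kept.append(j)
    return np.array(kept, dtype=int)

def pca_scores(G, n_components=2):
    """Project individuals onto the top principal components via SVD."""
    X = G - G.mean(axis=0)
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :n_components] * S[:n_components]

def simulate(n_per_pop=60, n_variants=300, seed=0):
    """Hypothetical data: two populations with independent allele frequencies.
    Every variant is followed by an exact copy to mimic perfect LD."""
    rng = np.random.default_rng(seed)
    p_a = rng.uniform(0.1, 0.9, n_variants)
    p_b = rng.uniform(0.1, 0.9, n_variants)
    g_a = rng.binomial(2, p_a, size=(n_per_pop, n_variants))
    g_b = rng.binomial(2, p_b, size=(n_per_pop, n_variants))
    return np.repeat(np.vstack([g_a, g_b]), 2, axis=1)

G = simulate()
passed = maf_filter(G)                        # step 1: variant filtering
pruned = passed[ld_prune(G[:, passed])]       # step 2: LD-pruning
scores = pca_scores(G[:, pruned])             # step 3: dimensionality reduction
print(G.shape[1], passed.size, pruned.size, scores.shape)
```

The resulting `scores` matrix is the kind of low-dimensional input a classifier or clustering algorithm would consume. Changing `min_maf`, `r2_threshold`, or `n_components` and re-running the downstream step is one simple way to probe how pre-processing choices affect the final result, which is the question the abstract describes.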

Acknowledgments

This work has been supported by the Polish National Science Center grants: Opus 2014/13/B/NZ2/01248 and Preludium 2014/13/N/ST6/01843.

Author information

Corresponding author

Correspondence to Anastasiia Hryhorzhevska.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Hryhorzhevska, A., Wiewiórka, M., Okoniewski, M., Gambin, T. (2017). Scalable Framework for the Analysis of Population Structure Using the Next Generation Sequencing Data. In: Kryszkiewicz, M., Appice, A., Ślęzak, D., Rybinski, H., Skowron, A., Raś, Z. (eds) Foundations of Intelligent Systems. ISMIS 2017. Lecture Notes in Computer Science, vol 10352. Springer, Cham. https://doi.org/10.1007/978-3-319-60438-1_46

  • DOI: https://doi.org/10.1007/978-3-319-60438-1_46

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-60437-4

  • Online ISBN: 978-3-319-60438-1

  • eBook Packages: Computer Science, Computer Science (R0)
