Scalable Framework for the Analysis of Population Structure Using the Next Generation Sequencing Data

Hryhorzhevska, Anastasiia; Wiewiórka, Marek; Okoniewski, Michał; Gambin, Tomasz

doi:10.1007/978-3-319-60438-1_46

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10352)

Included in the following conference series:

International Symposium on Methodologies for Intelligent Systems

Abstract

Genomic variant data obtained from next-generation sequencing can be used to study the population structure of the genotyped individuals. Typical approaches to ethnicity classification and clustering involve several time-consuming pre-processing steps, such as variant filtering, LD-pruning, and dimensionality reduction of the genotype matrix. We have developed a framework in the R programming language to analyze how various pre-processing methods and their parameters influence the final results of classification and clustering algorithms. The results indicate how to fine-tune the pre-processing steps to maximize supervised and unsupervised classification performance. In addition, to enable efficient processing of large data sets, we have developed a second framework using Apache Spark. Tests performed on the 1000 Genomes data set confirmed the efficiency and scalability of the presented approach. Finally, the dockerized version of the implemented frameworks (freely available at https://github.com/ZSI-Bio/popgen) can be readily applied to any other variant data set, including data from large-scale sequencing projects or custom data sets from clinical laboratories.
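To make the pre-processing steps named in the abstract more concrete, the following is a minimal sketch of two of them, variant filtering by minor allele frequency (MAF) and greedy LD-pruning. It is not the authors' implementation, which is written in R and Apache Spark. The function names, the thresholds (`min_maf`, `max_r2`), and the window, which is counted in kept variants rather than base pairs, are all assumptions chosen for illustration. Genotypes are assumed to be coded as alternate-allele counts (0, 1 or 2) with no missing values.

```python
from statistics import mean


def minor_allele_frequency(dosages):
    """MAF of one variant from per-sample alternate-allele counts (0, 1 or 2)."""
    alt_freq = sum(dosages) / (2 * len(dosages))
    return min(alt_freq, 1 - alt_freq)


def r_squared(x, y):
    """Squared Pearson correlation between two dosage vectors (an LD proxy)."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant vector carries no correlation signal
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def filter_and_prune(variants, min_maf=0.05, max_r2=0.2, window=50):
    """Indices of variants kept after MAF filtering and greedy LD-pruning.

    variants: list of dosage vectors, one per variant, in genomic order.
    A variant is kept only if its r^2 with each of the last `window`
    kept variants does not exceed `max_r2` (illustrative thresholds).
    """
    kept = []
    for i, dosages in enumerate(variants):
        if minor_allele_frequency(dosages) < min_maf:
            continue  # variant filtering step
        if all(r_squared(dosages, variants[j]) <= max_r2 for j in kept[-window:]):
            kept.append(i)  # LD-pruning step: keep only weakly correlated variants
    return kept


# Toy data: 4 variants genotyped in 6 individuals (hypothetical values).
variants = [
    [0, 1, 2, 1, 0, 2],  # kept
    [0, 1, 2, 1, 0, 2],  # identical to the first variant, so pruned
    [0, 0, 0, 0, 0, 0],  # monomorphic, so removed by the MAF filter
    [1, 2, 0, 0, 1, 2],  # uncorrelated with the first variant, so kept
]
kept = filter_and_prune(variants)
```

The reduced genotype matrix, restricted to the `kept` indices, would then be passed to a dimensionality reduction step such as PCA before classification or clustering. Changing `min_maf`, `max_r2`, or `window` alters which variants survive, and the influence of such parameters on the final results is what the abstract says the framework analyzes.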

Acknowledgments

This work has been supported by the Polish National Science Center grants: Opus 2014/13/B/NZ2/01248 and Preludium 2014/13/N/ST6/01843.

Author information

Authors and Affiliations

Institute of Computer Science, Warsaw University of Technology, 00-665, Warsaw, Poland
Anastasiia Hryhorzhevska, Marek Wiewiórka & Tomasz Gambin
Scientific IT Services, ETH Zurich, 8092, Zurich, Switzerland
Michał Okoniewski

Corresponding author

Correspondence to Anastasiia Hryhorzhevska.

Editor information

Editors and Affiliations

Warsaw University of Technology, Warsaw, Poland
Marzena Kryszkiewicz
University of Bari Aldo Moro, Bari, Italy
Annalisa Appice
Institute of Informatics, University of Warsaw, Warsaw, Poland
Dominik Ślęzak
Faculty of Electronics & Information, Warsaw University of Technology, Warsaw, Poland
Henryk Rybinski
Institute of Mathematics, Warsaw University, Warsaw, Poland
Andrzej Skowron
Department of Computer Science, University of North Carolina at Charlotte, North Carolina, USA
Zbigniew W. Raś

Cite this paper

Hryhorzhevska, A., Wiewiórka, M., Okoniewski, M., Gambin, T. (2017). Scalable Framework for the Analysis of Population Structure Using the Next Generation Sequencing Data. In: Kryszkiewicz, M., Appice, A., Ślęzak, D., Rybinski, H., Skowron, A., Raś, Z. (eds) Foundations of Intelligent Systems. ISMIS 2017. Lecture Notes in Computer Science, vol 10352. Springer, Cham. https://doi.org/10.1007/978-3-319-60438-1_46

DOI: https://doi.org/10.1007/978-3-319-60438-1_46
Published: 14 June 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-60437-4
Online ISBN: 978-3-319-60438-1
eBook Packages: Computer Science, Computer Science (R0)