Abstract
Genomic variant data obtained from next-generation sequencing can be used to study the population structure of the genotyped individuals. Typical approaches to ethnicity classification and clustering consist of several time-consuming pre-processing steps, such as variant filtering, LD-pruning and dimensionality reduction of the genotype matrix. We have developed a framework in the R programming language to analyze how various pre-processing methods and their parameters influence the final results of the classification and clustering algorithms. The results indicate how to fine-tune the pre-processing steps in order to maximize supervised and unsupervised classification performance. In addition, to enable efficient processing of large data sets, we have developed a second framework using Apache Spark. Tests performed on the 1000 Genomes data set confirmed the efficiency and scalability of the presented approach. Finally, the dockerized version of the implemented frameworks (freely available at https://github.com/ZSI-Bio/popgen) can be easily applied to any other variant data set, including data from large-scale sequencing projects or custom data sets from clinical laboratories.
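To make the pre-processing steps named above more concrete, the following is a minimal sketch of variant filtering by minor allele frequency followed by greedy LD-pruning on a small genotype matrix. It is not the authors' R or Apache Spark implementation: the function names, the thresholds (`min_maf`, `max_r2`) and the window defined as a count of previously kept variants are all assumptions chosen for illustration, and plain Python stands in for the frameworks described in the paper.

```python
def minor_allele_frequency(genotypes):
    """MAF of one variant; genotypes are alternate-allele counts (0, 1 or 2)."""
    alt_freq = sum(genotypes) / (2 * len(genotypes))
    return min(alt_freq, 1 - alt_freq)


def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (an LD proxy)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a monomorphic variant carries no correlation signal
    return cov * cov / (var_x * var_y)


def filter_and_prune(matrix, min_maf=0.05, max_r2=0.2, window=50):
    """Return indices of variants kept after MAF filtering and LD-pruning.

    matrix: list of variants (rows), each a list of per-sample allele counts.
    All default thresholds are hypothetical, not values from the paper.
    """
    # Step 1: variant filtering, dropping rare and monomorphic variants.
    passing = [i for i, row in enumerate(matrix)
               if minor_allele_frequency(row) >= min_maf]

    # Step 2: greedy LD-pruning. A variant is kept only if it is weakly
    # correlated with each of the last `window` variants already kept.
    kept = []
    for i in passing:
        recent = kept[-window:]
        if all(r_squared(matrix[i], matrix[j]) <= max_r2 for j in recent):
            kept.append(i)
    return kept
```

The indices returned by `filter_and_prune` select the rows of the genotype matrix that would then be passed to dimensionality reduction and to the classification or clustering step. Changing `min_maf`, `max_r2` or `window` changes which variants survive, which is the kind of parameter sensitivity the abstract refers to.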
Acknowledgments
This work has been supported by the Polish National Science Center grants: Opus 2014/13/B/NZ2/01248 and Preludium 2014/13/N/ST6/01843.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Hryhorzhevska, A., Wiewiórka, M., Okoniewski, M., Gambin, T. (2017). Scalable Framework for the Analysis of Population Structure Using the Next Generation Sequencing Data. In: Kryszkiewicz, M., Appice, A., Ślęzak, D., Rybinski, H., Skowron, A., Raś, Z. (eds) Foundations of Intelligent Systems. ISMIS 2017. Lecture Notes in Computer Science, vol 10352. Springer, Cham. https://doi.org/10.1007/978-3-319-60438-1_46
DOI: https://doi.org/10.1007/978-3-319-60438-1_46
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-60437-4
Online ISBN: 978-3-319-60438-1
eBook Packages: Computer Science, Computer Science (R0)