Abstract
Traditional data processing methods that run on a single computer scale poorly and are inefficient at the ordered full outer join needed to merge a large number of individual Genome-Wide Association Studies (GWAS) datasets. Although the emergence of big data platforms such as Hadoop and Spark sheds light on this problem, the inefficiency of keeping data in totally sorted order and the workload imbalance problem limit their performance. In this study, we designed and compared three new methodologies, based on MapReduce, HBase, and Spark respectively, to merge hundreds of individual VCF files on their Single Nucleotide Polymorphism (SNP) locations into a single TPED file. Our methodologies overcome the limitations stated above and considerably improve performance, with good scalability in both input size and computing resources.
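To make the core operation concrete, the following is a minimal single-machine sketch of an ordered full outer join over per-individual records that are already sorted by SNP location. It is an illustration only, not the MapReduce, HBase, or Spark methods described in this paper. The record shape, the `MISSING` placeholder, and the function names `tag` and `ordered_full_outer_join` are assumptions introduced for this example.

```python
import heapq
from itertools import groupby
from typing import Iterable, Iterator, List, Tuple

# Hypothetical record shape: ((chromosome, position), genotype).
Location = Tuple[int, int]
Record = Tuple[Location, str]

# Assumed placeholder for an individual with no call at a location.
MISSING = "0 0"


def tag(sample_idx: int, records: Iterable[Record]) -> Iterator[Tuple[Location, int, str]]:
    """Attach the sample index so each record remembers its source column."""
    for location, genotype in records:
        yield location, sample_idx, genotype


def ordered_full_outer_join(
    samples: List[Iterable[Record]],
) -> Iterator[Tuple[Location, List[str]]]:
    """Merge per-sample streams, each sorted by location, into one sorted table.

    Every location seen in any sample yields one row; samples that lack
    the location are filled with MISSING (the "full outer" part).
    """
    streams = [tag(i, s) for i, s in enumerate(samples)]
    # heapq.merge performs a lazy k-way merge; tuples compare by location first.
    merged = heapq.merge(*streams)
    for location, group in groupby(merged, key=lambda r: r[0]):
        row = [MISSING] * len(samples)
        for _, sample_idx, genotype in group:
            row[sample_idx] = genotype
        yield location, row


if __name__ == "__main__":
    sample_a = [((1, 100), "A A"), ((1, 300), "G T")]
    sample_b = [((1, 200), "C C"), ((1, 300), "T T"), ((2, 50), "A G")]
    for location, genotypes in ordered_full_outer_join([sample_a, sample_b]):
        print(location, genotypes)
```

The sketch keeps output in total order by merging streams that are already sorted, so it never sorts the full dataset. On one machine this approach is bounded by a single process reading every input, which is the scalability limit that motivates the distributed designs.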
Acknowledgement
This work is supported in part by NSF ACI 1443054 and NSF IIS 1350885.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Sun, X., Wang, F., Qin, Z. (2017). High Performance Merging of Massive Data from Genome-Wide Association Studies. In: Begoli, E., Wang, F., Luo, G. (eds) Data Management and Analytics for Medicine and Healthcare. DMAH 2017. Lecture Notes in Computer Science, vol 10494. Springer, Cham. https://doi.org/10.1007/978-3-319-67186-4_4
DOI: https://doi.org/10.1007/978-3-319-67186-4_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-67185-7
Online ISBN: 978-3-319-67186-4
eBook Packages: Computer Science (R0)