survey

A Study on Garbage Collection Algorithms for Big Data Environments

Authors:
Rodrigo Bruno, INESC-ID/Instituto Superior Técnico, University of Lisbon, Portugal (ORCID: 0000-0003-1578-5149)
Paulo Ferreira, INESC-ID/Instituto Superior Técnico, University of Lisbon, Portugal


ACM Computing Surveys, Volume 51, Issue 1, Article No. 20, pp. 1–35. https://doi.org/10.1145/3156818

Published: 10 January 2018


Abstract

The need to process and store massive amounts of data (Big Data) is a reality. In areas such as scientific experiments, social network management, credit card fraud detection, targeted advertising, and financial analysis, massive amounts of information are generated and processed daily to extract valuable, summarized information. Because of their fast development cycle (i.e., lower development cost), which is mainly due to automatic memory management, and their rich community resources, managed object-oriented programming languages (e.g., Java) are the first choice for developing the Big Data platforms (e.g., Cassandra, Spark) on which such Big Data applications are executed.

However, automatic memory management comes at a cost. This cost is introduced by the garbage collector, which is responsible for collecting objects that are no longer being used. Although current (classic) garbage collection algorithms may be applicable to small-scale applications, these algorithms are not appropriate for large-scale Big Data environments, as they do not scale in terms of throughput and pause times.

In this work, current Big Data platforms and their memory profiles are studied to understand why classic algorithms (which are still the most commonly used) are not appropriate, and to analyze recently proposed memory management algorithms that target Big Data environments. The scalability of these recent algorithms is characterized, relative to classic algorithms, in terms of throughput (whether the algorithm improves the throughput of the application) and pause time (whether it reduces the latency of the application). The study concludes with a taxonomy of the described works and some open problems in Big Data memory management that could be addressed in future work.



Published in
ACM Computing Surveys Volume 51, Issue 1
January 2019
743 pages
ISSN:0360-0300
EISSN:1557-7341
DOI:10.1145/3177787
Editor:
Sartaj Sahni
Department of Computer and Information Science and Engineering / University of Florida / Gainesville, FL
Copyright © 2018 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 10 January 2018
- Accepted: 1 October 2017
- Revised: 1 July 2017
- Received: 1 November 2016
Author Tags
Big Data
Big Data environment
Garbage collection
Java
memory managed runtime
processing platforms
scalability
storage platform
Qualifiers
- survey
- Research
- Refereed