A Study on Garbage Collection Algorithms for Big Data Environments

Published: 10 January 2018

Abstract

The need to process and store massive amounts of data (Big Data) is a reality. In areas such as scientific experiments, social network management, credit card fraud detection, targeted advertising, and financial analysis, massive amounts of information are generated and processed daily to extract valuable, summarized information. Because they offer a fast (and therefore less expensive) development cycle, mainly thanks to automatic memory management, as well as rich community resources, managed object-oriented programming languages (e.g., Java) are the first choice for developing the Big Data platforms (e.g., Cassandra, Spark) on which such Big Data applications are executed.

However, automatic memory management comes at a cost. This cost is introduced by the garbage collector, which is responsible for collecting objects that are no longer being used. Although current (classic) garbage collection algorithms may be applicable to small-scale applications, these algorithms are not appropriate for large-scale Big Data environments, as they do not scale in terms of throughput and pause times.

In this work, current Big Data platforms and their memory profiles are studied to understand why classic algorithms (which are still the most commonly used) are not appropriate, and recently proposed memory management algorithms targeted at Big Data environments are analyzed. The scalability of these recent algorithms is characterized, relative to classic algorithms, in terms of throughput (whether the algorithm improves the throughput of the application) and pause time (whether it reduces the latency of the application). The study concludes by presenting a taxonomy of the described works and some open problems in Big Data memory management that could be addressed in future work.
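
To make the two evaluation axes concrete, the following minimal sketch computes them from a list of observed collector pauses. It assumes one common convention that the abstract itself does not spell out: throughput is taken as the fraction of wall-clock time spent outside garbage collection pauses, and pause time is summarized by the longest single pause. The names `GcSummary` and `summarize_gc` and all numbers are hypothetical and serve only as illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class GcSummary:
    throughput: float      # fraction of wall-clock time spent outside GC pauses
    max_pause_ms: float    # longest single pause observed
    total_pause_ms: float  # sum of all pauses


def summarize_gc(run_time_ms: float, pauses_ms: Sequence[float]) -> GcSummary:
    """Summarize a run from its wall-clock duration and its GC pauses."""
    if run_time_ms <= 0:
        raise ValueError("run_time_ms must be positive")
    total = sum(pauses_ms)
    if total > run_time_ms:
        raise ValueError("total pause time cannot exceed the run time")
    return GcSummary(
        throughput=(run_time_ms - total) / run_time_ms,
        max_pause_ms=max(pauses_ms, default=0.0),
        total_pause_ms=total,
    )


# Hypothetical measurements: two collectors observed over the same 10-second run.
classic = summarize_gc(10_000, [400, 350, 250])
candidate = summarize_gc(10_000, [40, 35, 30, 45, 50])

print(f"classic:   throughput={classic.throughput:.2%}, max pause={classic.max_pause_ms} ms")
print(f"candidate: throughput={candidate.throughput:.2%}, max pause={candidate.max_pause_ms} ms")
```

Under this convention, a collector scales better than a classic one when it raises the throughput figure, lowers the maximum pause, or both, as the heap and workload grow.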

Published in ACM Computing Surveys, Volume 51, Issue 1, January 2019, 743 pages
ISSN: 0360-0300
EISSN: 1557-7341
DOI: 10.1145/3177787
Editor: Sartaj Sahni

Copyright © 2018 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

Publication History

• Received: 1 November 2016
• Revised: 1 July 2017
• Accepted: 1 October 2017
• Published: 10 January 2018