Abstract
The need to process and store massive amounts of data (Big Data) is a reality. In areas such as scientific experiments, social network management, credit card fraud detection, targeted advertising, and financial analysis, massive amounts of information are generated and processed daily to extract valuable, summarized information. Because of their fast development cycle (i.e., lower development cost), which stems mainly from automatic memory management, and their rich community resources, managed object-oriented programming languages (e.g., Java) are the first choice for developing the Big Data platforms (e.g., Cassandra, Spark) on which such Big Data applications are executed.
However, automatic memory management comes at a cost. This cost is introduced by the garbage collector, which is responsible for collecting objects that are no longer being used. Although current (classic) garbage collection algorithms may be applicable to small-scale applications, these algorithms are not appropriate for large-scale Big Data environments, as they do not scale in terms of throughput and pause times.
In this work, current Big Data platforms and their memory profiles are studied to understand why classic algorithms (which are still the most commonly used) are not appropriate, and to analyze recently proposed and relevant memory management algorithms targeted at Big Data environments. The scalability of these recent algorithms is characterized, in comparison with classic algorithms, in terms of throughput (whether application throughput improves) and pause time (whether application latency is reduced). The study concludes with a taxonomy of the described works and some open problems in Big Data memory management that could be addressed in future work.
Index Terms
- A Study on Garbage Collection Algorithms for Big Data Environments