Abstract
Recent advancements in storage technologies have popularized the use of tiered storage systems in data-intensive compute clusters. The Hadoop Distributed File System (HDFS), for example, now supports storing data in memory, SSDs, and HDDs, while OctopusFS and hatS offer fine-grained storage tiering solutions. However, the task schedulers of big data platforms (such as Hadoop and Spark) assign tasks to available resources based only on data locality information, and completely ignore the fact that local data is now stored on a variety of storage media with different performance characteristics. This paper presents Trident, a principled task scheduling approach designed to make optimal task assignment decisions based on both locality and storage tier information. Trident formulates task scheduling as a minimum cost maximum matching problem in a bipartite graph and uses a standard solver to find the optimal solution. In addition, Trident uses two novel pruning algorithms to bound the size of the graph while still guaranteeing optimality. Trident is implemented in both Spark and Hadoop and evaluated extensively using a realistic workload derived from Facebook traces as well as an industry-validated benchmark, demonstrating significant benefits in terms of application performance and cluster efficiency.
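To make the matching formulation concrete, here is a minimal sketch of tier-aware task assignment posed as a minimum cost bipartite matching. It is an illustration under stated assumptions, not Trident's implementation: the tier costs, the remote-read cost, the names `build_cost_matrix` and `min_cost_matching`, and the example cluster are all hypothetical, and the solver is a textbook Hungarian algorithm. The abstract does not describe Trident's cost model, its solver, or its pruning algorithms, so none of those are reproduced here.

```python
# Hypothetical per-tier read costs (assumption: the abstract gives no cost model).
TIER_COST = {"MEMORY": 1, "SSD": 2, "HDD": 4}
REMOTE_COST = 10  # assumed cost when the task's data is not on the slot's node


def build_cost_matrix(task_replicas, slot_nodes):
    """cost[i][j] is the cost of running task i in slot j (hosted on slot_nodes[j])."""
    matrix = []
    for replicas in task_replicas:
        row = []
        for node in slot_nodes:
            tier = replicas.get(node)  # None means no local replica on this node
            row.append(TIER_COST[tier] if tier is not None else REMOTE_COST)
        matrix.append(row)
    return matrix


def min_cost_matching(cost):
    """Hungarian algorithm, O(n^2 * m). Returns the slot index chosen for each task.

    Assumes len(tasks) <= len(slots), so every task is matched (a maximum matching).
    """
    n, m = len(cost), len(cost[0])
    if n > m:
        raise ValueError("this sketch expects at least as many slots as tasks")
    inf = float("inf")
    u = [0] * (n + 1)      # row potentials
    v = [0] * (m + 1)      # column potentials
    p = [0] * (m + 1)      # p[j] = row (1-based) matched to column j, 0 if free
    way = [0] * (m + 1)    # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        # Flip the matching along the augmenting path back to the virtual column 0.
        while j0 != 0:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [-1] * n
    for j in range(1, m + 1):
        if p[j] != 0:
            assignment[p[j] - 1] = j - 1
    return assignment


# Hypothetical cluster: three tasks, one free slot on each of three nodes.
task_replicas = [
    {"n1": "MEMORY", "n2": "HDD"},  # task 0
    {"n1": "SSD"},                  # task 1
    {"n2": "HDD", "n3": "SSD"},     # task 2
]
slot_nodes = ["n1", "n2", "n3"]

cost = build_cost_matrix(task_replicas, slot_nodes)
assignment = min_cost_matching(cost)
total = sum(cost[i][j] for i, j in enumerate(assignment))
print(assignment, total)  # prints: [1, 0, 2] 8
```

In this toy instance, a greedy scheduler that hands task 0 its in-memory replica on n1 forces task 1 into a remote read, for a total cost of 13. The matching instead sends task 0 to its HDD replica on n2 so that task 1 can read from SSD on n1, for a total of 8. This is the kind of trade-off a scheduler that looks only at locality cannot see.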
Index Terms
- Trident: task scheduling over tiered storage systems in big data platforms