ABSTRACT
Cloud computing promises flexibility and high performance for users and high cost-efficiency for operators. Nevertheless, most cloud facilities operate at very low utilization, hurting both cost effectiveness and future scalability.
We present Quasar, a cluster management system that increases resource utilization while providing consistently high application performance. Quasar employs three techniques. First, it does not rely on resource reservations, which lead to underutilization as users do not necessarily understand workload dynamics and physical resource requirements of complex codebases. Instead, users express performance constraints for each workload, letting Quasar determine the right amount of resources to meet these constraints at any point. Second, Quasar uses classification techniques to quickly and accurately determine the impact of the amount of resources (scale-out and scale-up), type of resources, and interference on performance for each workload and dataset. Third, it uses the classification results to jointly perform resource allocation and assignment, quickly exploring the large space of options for an efficient way to pack workloads on available resources. Quasar monitors workload performance and adjusts resource allocation and assignment when needed. We evaluate Quasar over a wide range of workload scenarios, including combinations of distributed analytics frameworks and low-latency, stateful services, both on a local cluster and a cluster of dedicated EC2 servers. At steady state, Quasar improves resource utilization by 47% in the 200-server EC2 cluster, while meeting performance constraints for workloads of all types.
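The first technique, sizing an allocation from a performance constraint rather than from a user-supplied reservation, can be illustrated with a minimal sketch. This is not Quasar's classification or scheduling algorithm, which the abstract does not specify. The names `smallest_allocation` and `toy_estimate`, the exhaustive search, and the performance model are all hypothetical, chosen only to show the idea of picking the cheapest scale-out and scale-up combination whose estimated performance meets a target.

```python
from itertools import product
from typing import Callable, Optional, Tuple

# (servers, cores_per_server): scale-out and scale-up, respectively.
Allocation = Tuple[int, int]


def smallest_allocation(
    target: float,
    estimate: Callable[[int, int], float],
    max_servers: int,
    max_cores: int,
) -> Optional[Allocation]:
    """Return the allocation with the fewest total cores whose estimated
    performance meets `target`, or None if no candidate does.

    `estimate` stands in for whatever performance predictor is available;
    here it is an assumption supplied by the caller.
    """
    candidates = [
        (servers, cores)
        for servers, cores in product(
            range(1, max_servers + 1), range(1, max_cores + 1)
        )
        if estimate(servers, cores) >= target
    ]
    if not candidates:
        return None
    # Fewest total cores first; break ties by using fewer servers.
    return min(candidates, key=lambda a: (a[0] * a[1], a[0]))


def toy_estimate(servers: int, cores: int) -> float:
    """Hypothetical model: linear in cores, sublinear in server count."""
    return 100.0 * cores * servers ** 0.8


if __name__ == "__main__":
    # The user states only the constraint (1000 units of performance);
    # the allocation is derived from the estimate.
    print(smallest_allocation(1000.0, toy_estimate, max_servers=8, max_cores=8))
```

A real system would replace the exhaustive search with something that scales to large clusters and would also account for server type and interference, which this sketch ignores.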