research-article

Quasar: resource-efficient and QoS-aware cluster management

Authors:
Christina Delimitrou, Stanford University, Stanford, CA, USA
Christos Kozyrakis, Stanford University, Stanford, CA, USA

ASPLOS '14: Proceedings of the 19th international conference on Architectural support for programming languages and operating systems, February 2014, Pages 127–144, https://doi.org/10.1145/2541940.2541941

Published: 24 February 2014

ABSTRACT

Cloud computing promises flexibility and high performance for users and high cost-efficiency for operators. Nevertheless, most cloud facilities operate at very low utilization, hurting both cost effectiveness and future scalability.

We present Quasar, a cluster management system that increases resource utilization while providing consistently high application performance. Quasar employs three techniques. First, it does not rely on resource reservations, which lead to underutilization as users do not necessarily understand workload dynamics and physical resource requirements of complex codebases. Instead, users express performance constraints for each workload, letting Quasar determine the right amount of resources to meet these constraints at any point. Second, Quasar uses classification techniques to quickly and accurately determine the impact of the amount of resources (scale-out and scale-up), type of resources, and interference on performance for each workload and dataset. Third, it uses the classification results to jointly perform resource allocation and assignment, quickly exploring the large space of options for an efficient way to pack workloads on available resources. Quasar monitors workload performance and adjusts resource allocation and assignment when needed. We evaluate Quasar over a wide range of workload scenarios, including combinations of distributed analytics frameworks and low-latency, stateful services, both on a local cluster and a cluster of dedicated EC2 servers. At steady state, Quasar improves resource utilization by 47% in the 200-server EC2 cluster, while meeting performance constraints for workloads of all types.
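
The idea of stating a performance target instead of a resource reservation can be illustrated with a small sketch. This is not Quasar's algorithm, which the abstract does not spell out; it is a simplified greedy selection under stated assumptions. The names `Server`, `predicted_perf`, and `allocate` are hypothetical, the per-server performance estimates stand in for the output of the classification step, and performance is assumed to add up linearly across servers, which real workloads rarely do.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    name: str
    # Hypothetical estimate of the workload's performance on this server,
    # standing in for what a classification step might predict.
    predicted_perf: float


def allocate(servers, target_perf):
    """Greedily pick servers until summed predicted performance meets the target.

    Assumes performance adds linearly across servers (a simplification).
    Returns the chosen servers, or None if the target cannot be met.
    """
    chosen = []
    total = 0.0
    # Prefer the servers predicted to perform best, so fewer servers are used.
    for server in sorted(servers, key=lambda s: s.predicted_perf, reverse=True):
        if total >= target_perf:
            break
        chosen.append(server)
        total += server.predicted_perf
    return chosen if total >= target_perf else None


servers = [
    Server("a", 40.0),
    Server("b", 100.0),
    Server("c", 70.0),
    Server("d", 20.0),
]
picked = allocate(servers, target_perf=150.0)
picked_names = [s.name for s in picked]
```

In this toy setup the user supplies only `target_perf`; the choice of how many and which servers to use is left to the allocator, which is the division of responsibility the abstract describes.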

Published in
ASPLOS '14: Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
February 2014
780 pages
ISBN:9781450323055
DOI:10.1145/2541940
General Chairs: Rajeev Balasubramonian (University of Utah), Al Davis (University of Utah)
Program Chair: Sarita Adve (University of Illinois at Urbana-Champ)

ACM SIGARCH Computer Architecture News, Volume 42, Issue 1
ASPLOS '14
March 2014
729 pages
ISSN:0163-5964
DOI:10.1145/2654822
Editor: Doug DeGroot

ACM SIGPLAN Notices, Volume 49, Issue 4
ASPLOS '14
April 2014
729 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/2644865
Editors: Mark W. Bailey (Hamilton College, Clinton, NY), Rajeev Balasubramonian (University of Utah), Al Davis (University of Utah), Sarita Adve (University of Illinois at Urbana-Champ)

Copyright © 2014 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
cloud computing
cluster management
datacenters
quality of service
resource allocation and assignment
resource efficiency
Qualifiers
- research-article
Acceptance Rates
ASPLOS '14 Paper Acceptance Rate: 49 of 217 submissions, 23%. Overall Acceptance Rate: 535 of 2,713 submissions, 20%.