ASPLOS Conference Proceedings, DOI 10.1145/2541940.2541941

Quasar: resource-efficient and QoS-aware cluster management

Published: 24 February 2014

ABSTRACT

Cloud computing promises flexibility and high performance for users and high cost-efficiency for operators. Nevertheless, most cloud facilities operate at very low utilization, hurting both cost effectiveness and future scalability.

We present Quasar, a cluster management system that increases resource utilization while providing consistently high application performance. Quasar employs three techniques. First, it does not rely on resource reservations, which lead to underutilization as users do not necessarily understand workload dynamics and physical resource requirements of complex codebases. Instead, users express performance constraints for each workload, letting Quasar determine the right amount of resources to meet these constraints at any point. Second, Quasar uses classification techniques to quickly and accurately determine the impact of the amount of resources (scale-out and scale-up), type of resources, and interference on performance for each workload and dataset. Third, it uses the classification results to jointly perform resource allocation and assignment, quickly exploring the large space of options for an efficient way to pack workloads on available resources. Quasar monitors workload performance and adjusts resource allocation and assignment when needed. We evaluate Quasar over a wide range of workload scenarios, including combinations of distributed analytics frameworks and low-latency, stateful services, both on a local cluster and a cluster of dedicated EC2 servers. At steady state, Quasar improves resource utilization by 47% in the 200-server EC2 cluster, while meeting performance constraints for workloads of all types.
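
As a rough illustration of the first idea in the abstract (users state a performance constraint, and the manager picks the resources), the following toy sketch selects the smallest allocation whose predicted performance meets a target. It is not Quasar's algorithm: the names `Allocation`, `smallest_allocation`, and `toy_predict` are hypothetical, and the performance model is invented for the example. In the system described above, such estimates would come from the classification step rather than from a fixed formula.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Allocation:
    servers: int           # scale-out: number of servers
    cores_per_server: int  # scale-up: cores used on each server

    @property
    def total_cores(self) -> int:
        return self.servers * self.cores_per_server


def smallest_allocation(
    target: float,
    candidates: Iterable[Allocation],
    predict: Callable[[Allocation], float],
) -> Optional[Allocation]:
    """Return the cheapest candidate whose predicted performance meets target.

    Cost is total cores, with fewer servers preferred on ties.
    Returns None when no candidate satisfies the constraint.
    """
    feasible = [a for a in candidates if predict(a) >= target]
    if not feasible:
        return None
    return min(feasible, key=lambda a: (a.total_cores, a.servers))


def toy_predict(a: Allocation) -> float:
    # Hypothetical model (an assumption for this sketch): throughput grows
    # linearly with cores per server and sublinearly with server count.
    return 100.0 * a.cores_per_server * a.servers ** 0.8


# The user supplies only a performance target, not a resource reservation.
candidates = [Allocation(s, c) for s in (1, 2, 4, 8) for c in (2, 4, 8)]
choice = smallest_allocation(1500.0, candidates, toy_predict)
print(choice)
```

The exhaustive scan is only workable because the candidate list is tiny. The abstract describes Quasar as jointly performing allocation and assignment over a much larger space, which this sketch does not attempt.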
