QoS-Aware Scheduling in Heterogeneous Datacenters with Paragon

Authors:
Christina Delimitrou, Stanford University
Christos Kozyrakis, Stanford University

ACM Transactions on Computer Systems, Volume 31, Issue 4, Article No. 12, pp. 1–34. https://doi.org/10.1145/2556583

Published: 20 December 2013

Abstract

Large-scale datacenters (DCs) host tens of thousands of diverse applications each day. However, interference between colocated workloads and the difficulty of matching applications to one of the many hardware platforms available can degrade performance, violating the quality of service (QoS) guarantees that many cloud workloads require. While previous work has identified the impact of heterogeneity and interference, existing solutions are computationally intensive, cannot be applied online, and do not scale beyond a few applications.

We present Paragon, an online and scalable DC scheduler that is heterogeneity- and interference-aware. Paragon is derived from robust analytical methods, and instead of profiling each application in detail, it leverages information the system already has about applications it has previously seen. It uses collaborative filtering techniques to quickly and accurately classify an unknown incoming workload with respect to heterogeneity and interference in multiple shared resources. It does so by identifying similarities to previously scheduled applications. The classification allows Paragon to greedily schedule applications in a manner that minimizes interference and maximizes server utilization. After the initial application placement, Paragon monitors application behavior and adjusts the scheduling decisions at runtime to avoid performance degradation. Additionally, we design ARQ, a multiclass admission control protocol that constrains application waiting time. ARQ queues applications in separate classes based on the type of resources they need and avoids long queueing delays for easy-to-satisfy workloads in highly loaded scenarios. Paragon scales to tens of thousands of servers and applications with marginal scheduling overheads in terms of time or state.
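
To make the classify-then-place idea concrete, the following is a minimal sketch, not Paragon's implementation. The abstract does not specify the filtering technique, so the sketch assumes a simple nearest-neighbor form of collaborative filtering: an incoming workload is scored on only some platforms, and its missing scores are estimated from the most similar previously scheduled applications. The data in `KNOWN_APPS`, the function names `classify` and `place`, and the choice of cosine similarity are all hypothetical. Only platform heterogeneity is covered here; interference would need an analogous score per shared resource.

```python
from math import sqrt

# Hypothetical history: rows are previously scheduled applications, columns
# are server platforms, values are normalized performance (higher is better).
KNOWN_APPS = {
    "app_a": [1.0, 0.5, 0.2],
    "app_b": [0.2, 0.6, 1.0],
    "app_c": [0.9, 0.5, 0.3],
}


def cosine(u, v, cols):
    """Cosine similarity of u and v restricted to the column indices in cols."""
    dot = sum(u[c] * v[c] for c in cols)
    norm_u = sqrt(sum(u[c] ** 2 for c in cols))
    norm_v = sqrt(sum(v[c] ** 2 for c in cols))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify(partial, known=KNOWN_APPS, k=2):
    """Estimate missing platform scores (None) from the k most similar known apps."""
    seen = [c for c, score in enumerate(partial) if score is not None]
    # Rank known applications by similarity on the platforms already measured.
    ranked = sorted(
        ((cosine(partial, row, seen), name) for name, row in known.items()),
        reverse=True,
    )[:k]
    total = sum(sim for sim, _ in ranked)
    filled = []
    for c, score in enumerate(partial):
        if score is not None:
            filled.append(score)  # keep the measured value
        elif total > 0:
            # Similarity-weighted average of the neighbors' scores.
            filled.append(sum(sim * known[name][c] for sim, name in ranked) / total)
        else:
            filled.append(0.0)  # no usable neighbor: fall back to a neutral guess
    return filled


def place(scores, free_slots):
    """Greedy placement: the best-scoring platform that still has a free slot."""
    candidates = [c for c, free in enumerate(free_slots) if free > 0]
    if not candidates:
        return None  # nothing free; an admission-control queue would hold the app
    return max(candidates, key=lambda c: scores[c])


if __name__ == "__main__":
    # The new workload was measured on platforms 0 and 1 only.
    estimated = classify([0.95, 0.5, None])
    print(estimated, place(estimated, free_slots=[0, 1, 1]))
```

The sketch needs at least two measured platforms for the similarity to be informative, since cosine similarity over a single positive value is always 1.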

We evaluate Paragon with a wide range of workload scenarios, on both small and large-scale systems, including 1,000 servers on EC2. For a 2,500-workload scenario, Paragon enforces performance guarantees for 91% of applications, while significantly improving utilization. In comparison, heterogeneity-oblivious, interference-oblivious, and least-loaded schedulers only provide similar guarantees for 14%, 11%, and 3% of workloads. The differences are more striking in oversubscribed scenarios where resource efficiency is more critical.

Published in
ACM Transactions on Computer Systems, Volume 31, Issue 4
December 2013
90 pages
ISSN: 0734-2071
EISSN: 1557-7333
DOI: 10.1145/2542150
Editor: Todd C. Mowry
Copyright © 2013 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 20 December 2013
- Revised: 1 September 2013
- Accepted: 1 September 2013
- Received: 1 May 2013
Author Tags
Datacenter
QoS
cloud computing
heterogeneity
interference
resource-efficiency
scheduling
Qualifiers
- research-article
- Research
- Refereed