ABSTRACT
The power demands of current large-scale systems keep increasing, to the point that power has become a major strain on facilities and budgets. Researchers in academia, national laboratories, and industry are working to overcome this "power wall," striving to balance performance against power consumption. Some commodity processors support power capping, which opens new opportunities for applications to manage their power behavior directly at user level. However, while power capping guarantees that a system never exceeds a given power limit, it also introduces a new form of heterogeneity: natural manufacturing variability, previously hidden by varying the supplied power to achieve homogeneous performance, now results in heterogeneous performance, because the hardware enforces the power limit by running cores, potentially each one, at different CPU frequencies.
In this work we show how a parallel runtime system can effectively handle this new kind of performance heterogeneity by compensating for the uneven effects of power capping. On a NUMA node composed of several multi-core sockets, our system optimizes the energy and concurrency levels assigned to each socket to maximize performance. Because it operates transparently within the parallel runtime system, it requires no programmer intervention, such as changing the application source code or manually reconfiguring the parallel system. We compare our novel runtime analysis with an offline approach and demonstrate that it achieves equal performance at a fraction of the cost.
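The core idea, redistributing a fixed node-level power budget so that sockets slowed by manufacturing variability receive a larger share, can be sketched as follows. This is a minimal illustrative heuristic, not the paper's actual algorithm: the function name, the inverse-throughput weighting, and the `floor_w` parameter are assumptions made for the example.

```python
def rebalance_power(total_budget_w, throughputs, floor_w=10.0):
    """Split a node-level power budget (watts) across sockets.

    Illustrative heuristic only: each socket's share is inversely
    proportional to its measured throughput under a uniform cap, so
    slower sockets (e.g. weaker silicon) receive more power. A floor
    keeps every socket above a minimum operating cap.
    """
    weights = [1.0 / t for t in throughputs]
    total_weight = sum(weights)
    caps = [max(floor_w, total_budget_w * w / total_weight)
            for w in weights]
    # Renormalize so the floor does not push the sum over the budget.
    scale = total_budget_w / sum(caps)
    return [c * scale for c in caps]
```

In a real runtime, the measured throughputs would come from online profiling of each socket under an equal per-socket cap, and the resulting caps would be applied through a hardware power-capping interface such as RAPL, together with an adjustment of the number of worker threads per socket.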