
FLORIA: A Fast and Featherlight Approach for Predicting Cache Performance

Published: 21 June 2023 · DOI: 10.1145/3577193.3593740

ABSTRACT

The cache Miss Ratio Curve (MRC) serves a variety of purposes, such as cache partitioning, application profiling, and code tuning. In this work, we propose a new metric, called the cache miss distribution, which describes cache miss behavior over cache sets and is used to predict cache MRCs. Based on this metric, we present FLORIA, a software-based, online approach that approximates cache MRCs on commodity systems. By polluting a tunable number of cache lines in selected cache sets with a purpose-built microbenchmark, the cache miss distribution of the target workload is obtained via hardware performance counters with support for precise event-based sampling (PEBS). A model is then developed to predict the MRC of the target workload from its cache miss distribution.
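To make the mechanism concrete, the following is a minimal, illustrative sketch (not the authors' microbenchmark) of polluting a tunable number of cache lines in one selected cache set. It assumes 64-byte lines and a plain modulo set index, set = (address / 64) mod NSETS, and ignores complications a real implementation must handle, such as Intel LLC slice hashing and virtual-to-physical translation; NSETS, the target set, the line count, and the iteration count are placeholder values. Miss addresses sampled via PEBS could then be bucketed by the same set-index function to build the per-set cache miss distribution.

```c
/*
 * Illustrative sketch only (not FLORIA's actual microbenchmark), assuming:
 *   - 64-byte cache lines and NSETS cache sets,
 *   - a plain modulo index: set = (address / 64) % NSETS,
 *   - no LLC slice hashing or virtual-to-physical translation effects.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define LINE_SIZE 64
#define NSETS     2048                      /* placeholder: number of LLC sets */
#define STRIDE    (LINE_SIZE * NSETS)       /* distance between lines in the same set */

/* Repeatedly touch `k` distinct lines that all map to cache set `set`. */
static void pollute_set(volatile char *buf, size_t bufsize, int set, int k, long iters)
{
    size_t first = (size_t)set * LINE_SIZE; /* buf is STRIDE-aligned, so this offset hits `set` */
    for (long it = 0; it < iters; it++)
        for (int i = 0; i < k && first + (size_t)i * STRIDE < bufsize; i++)
            (void)buf[first + (size_t)i * STRIDE];  /* each load occupies one way of the set */
}

int main(void)
{
    size_t bufsize = 64UL << 20;                     /* 64 MiB, a multiple of STRIDE */
    volatile char *buf = aligned_alloc(STRIDE, bufsize);
    if (!buf) return 1;
    for (size_t i = 0; i < bufsize; i += LINE_SIZE)  /* fault pages in up front */
        buf[i] = 1;

    pollute_set(buf, bufsize, /*set=*/123, /*k=*/8, /*iters=*/1000000);
    puts("polluted set 123 with 8 lines");
    free((void *)buf);
    return 0;
}
```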

We evaluate FLORIA on systems running a single application as well as a wide range of workload mixes. Compared with state-of-the-art approaches for predicting MRCs online, FLORIA achieves the highest average accuracy, 97.29%, with negligible overhead. It also allows fast and accurate online MRC estimation within 5 ms, 20X faster than the state-of-the-art approaches. We also demonstrate that FLORIA can guide cache partitioning for multiprogrammed workloads, helping to improve overall system performance.
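As a hint of how a predicted MRC might drive partitioning, below is a small, hypothetical sketch (not taken from the paper) of a greedy, way-granularity allocator in the spirit of utility-based cache partitioning: each application's predicted MRC supplies the marginal miss reduction of one more way, and each way goes to whichever application benefits most. The MRC values, NAPPS, and NWAYS are made-up examples; a real deployment would map the resulting allocation onto hardware way masks.

```c
/*
 * Hypothetical sketch: greedy way allocation driven by predicted MRCs,
 * in the spirit of utility-based cache partitioning. All numbers are
 * made-up examples; mrc[i][w] = predicted miss ratio of app i with w ways.
 */
#include <stdio.h>

#define NAPPS 2
#define NWAYS 11   /* example: an 11-way last-level cache */

/* Hand out ways one at a time to the app whose absolute miss count drops most. */
static void partition(double mrc[NAPPS][NWAYS + 1], const double accesses[NAPPS],
                      int alloc[NAPPS])
{
    int given = 0;
    for (int i = 0; i < NAPPS; i++) { alloc[i] = 1; given++; }  /* one way each to start */
    while (given < NWAYS) {
        int best = 0;
        double best_gain = -1.0;
        for (int i = 0; i < NAPPS; i++) {
            double gain = (mrc[i][alloc[i]] - mrc[i][alloc[i] + 1]) * accesses[i];
            if (gain > best_gain) { best_gain = gain; best = i; }
        }
        alloc[best]++;   /* the next way goes to the biggest marginal winner */
        given++;
    }
}

int main(void)
{
    /* toy MRCs: app 0 benefits from extra ways, app 1 is streaming-like */
    double mrc[NAPPS][NWAYS + 1] = {
        {1.0, 0.60, 0.40, 0.25, 0.15, 0.10, 0.08, 0.07, 0.06, 0.06, 0.06, 0.06},
        {1.0, 0.95, 0.94, 0.93, 0.93, 0.93, 0.93, 0.93, 0.93, 0.93, 0.93, 0.93},
    };
    double accesses[NAPPS] = {1e6, 1e6};   /* accesses observed in the profiling window */
    int alloc[NAPPS];

    partition(mrc, accesses, alloc);
    for (int i = 0; i < NAPPS; i++)
        printf("app %d gets %d way(s)\n", i, alloc[i]);
    return 0;
}
```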


Published in

ICS '23: Proceedings of the 37th International Conference on Supercomputing
June 2023, 505 pages
ISBN: 9798400700569
DOI: 10.1145/3577193
Copyright © 2023 ACM


Publisher: Association for Computing Machinery, New York, NY, United States
