
FLORIA: A Fast and Featherlight Approach for Predicting Cache Performance

Published: 21 June 2023 · DOI: 10.1145/3577193.3593740

ABSTRACT

The cache Miss Ratio Curve (MRC) serves a variety of purposes, such as cache partitioning, application profiling, and code tuning. In this work, we propose a new metric, called the cache miss distribution, which describes cache miss behavior over cache sets and is used to predict cache MRCs. Based on this metric, we present FLORIA, a software-based, online approach that approximates cache MRCs on commodity systems. By polluting a tunable number of cache lines in selected cache sets with a purpose-built microbenchmark, the cache miss distribution of the target workload is obtained via hardware performance counters with support for precise event-based sampling (PEBS). A model is then developed to predict the MRC of the target workload from its cache miss distribution.
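To make the mechanism concrete, the following is a minimal, illustrative sketch (not the authors' microbenchmark) of polluting a tunable number of cache lines in one selected cache set. It assumes 64-byte lines and a plain modulo set index, set = (address / 64) mod NSETS, and ignores complications a real implementation must handle, such as Intel LLC slice hashing and virtual-to-physical translation; NSETS, the target set, the line count, and the iteration count are placeholder values. Miss addresses sampled via PEBS could then be bucketed by the same set-index function to build the per-set cache miss distribution.

```c
/*
 * Illustrative sketch only (not FLORIA's actual microbenchmark), assuming:
 *   - 64-byte cache lines and NSETS cache sets,
 *   - a plain modulo index: set = (address / 64) % NSETS,
 *   - no LLC slice hashing or virtual-to-physical translation effects.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define LINE_SIZE 64
#define NSETS     2048                      /* placeholder: number of LLC sets */
#define STRIDE    (LINE_SIZE * NSETS)       /* distance between lines in the same set */

/* Repeatedly touch `k` distinct lines that all map to cache set `set`. */
static void pollute_set(volatile char *buf, size_t bufsize, int set, int k, long iters)
{
    size_t first = (size_t)set * LINE_SIZE; /* buf is STRIDE-aligned, so this offset hits `set` */
    for (long it = 0; it < iters; it++)
        for (int i = 0; i < k && first + (size_t)i * STRIDE < bufsize; i++)
            (void)buf[first + (size_t)i * STRIDE];  /* each load occupies one way of the set */
}

int main(void)
{
    size_t bufsize = 64UL << 20;                     /* 64 MiB, a multiple of STRIDE */
    volatile char *buf = aligned_alloc(STRIDE, bufsize);
    if (!buf) return 1;
    for (size_t i = 0; i < bufsize; i += LINE_SIZE)  /* fault pages in up front */
        buf[i] = 1;

    pollute_set(buf, bufsize, /*set=*/123, /*k=*/8, /*iters=*/1000000);
    puts("polluted set 123 with 8 lines");
    free((void *)buf);
    return 0;
}
```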

We evaluate FLORIA on systems running a single application as well as a wide range of workload mixes. Compared with state-of-the-art approaches for predicting MRCs online, FLORIA achieves the highest average accuracy, 97.29%, with negligible overhead. It also allows fast and accurate online MRC estimation within 5 ms, 20X faster than the state-of-the-art approaches. We also demonstrate that FLORIA can guide cache partitioning for multiprogrammed workloads, helping to improve overall system performance.
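As a hint of how a predicted MRC might drive partitioning, below is a small, hypothetical sketch (not taken from the paper) of a greedy, way-granularity allocator in the spirit of utility-based cache partitioning: each application's predicted MRC supplies the marginal miss reduction of one more way, and each way goes to whichever application benefits most. The MRC values, NAPPS, and NWAYS are made-up examples; a real deployment would map the resulting allocation onto hardware way masks.

```c
/*
 * Hypothetical sketch: greedy way allocation driven by predicted MRCs,
 * in the spirit of utility-based cache partitioning. All numbers are
 * made-up examples; mrc[i][w] = predicted miss ratio of app i with w ways.
 */
#include <stdio.h>

#define NAPPS 2
#define NWAYS 11   /* example: an 11-way last-level cache */

/* Hand out ways one at a time to the app whose absolute miss count drops most. */
static void partition(double mrc[NAPPS][NWAYS + 1], const double accesses[NAPPS],
                      int alloc[NAPPS])
{
    int given = 0;
    for (int i = 0; i < NAPPS; i++) { alloc[i] = 1; given++; }  /* one way each to start */
    while (given < NWAYS) {
        int best = 0;
        double best_gain = -1.0;
        for (int i = 0; i < NAPPS; i++) {
            double gain = (mrc[i][alloc[i]] - mrc[i][alloc[i] + 1]) * accesses[i];
            if (gain > best_gain) { best_gain = gain; best = i; }
        }
        alloc[best]++;   /* the next way goes to the biggest marginal winner */
        given++;
    }
}

int main(void)
{
    /* toy MRCs: app 0 benefits from extra ways, app 1 is streaming-like */
    double mrc[NAPPS][NWAYS + 1] = {
        {1.0, 0.60, 0.40, 0.25, 0.15, 0.10, 0.08, 0.07, 0.06, 0.06, 0.06, 0.06},
        {1.0, 0.95, 0.94, 0.93, 0.93, 0.93, 0.93, 0.93, 0.93, 0.93, 0.93, 0.93},
    };
    double accesses[NAPPS] = {1e6, 1e6};   /* accesses observed in the profiling window */
    int alloc[NAPPS];

    partition(mrc, accesses, alloc);
    for (int i = 0; i < NAPPS; i++)
        printf("app %d gets %d way(s)\n", i, alloc[i]);
    return 0;
}
```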


Published in

ICS '23: Proceedings of the 37th International Conference on Supercomputing
June 2023, 505 pages
ISBN: 9798400700569
DOI: 10.1145/3577193
Copyright © 2023 ACM


Publisher: Association for Computing Machinery, New York, NY, United States
