ABSTRACT
In multicore systems, effective management of the shared last-level cache (LLC), such as hardware/software cache partitioning, has attracted significant research attention. A notable recent advance is Intel's introduction of Cache Allocation Technology (CAT) in its commodity processors. CAT implements way partitioning and exposes a software interface for controlling cache allocation. Unfortunately, CAT can only allocate cache at way granularity, which scales poorly as the number of threads or programs grows and their performance goals diversify. This paper proposes Dynamic Cache Allocation with Partial Sharing (DCAPS), a framework that dynamically monitors and predicts a multi-programmed workload's cache demand and reallocates the LLC to meet a performance target. Further, DCAPS exploits partial sharing of cache partitions among programs and thus achieves cache allocation at a practically finer granularity. DCAPS consists of three parts: (1) Online Practical Miss Rate Curve (OPMRC), a low-overhead software technique for predicting the online miss rate curves (MRCs) of the individual programs in a workload; (2) a prediction model that estimates the LLC occupancy of each program under any CAT allocation scheme; and (3) a simulated annealing algorithm that searches for a near-optimal CAT scheme for a given performance goal. Our experimental results show that DCAPS optimizes for a wide range of performance targets and scales to a large core count.
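The third component above, a simulated annealing search over CAT allocation schemes, can be illustrated with a minimal sketch. The miss rate curves and the objective below are hypothetical stand-ins (the paper's actual search uses OPMRC-predicted MRCs and its occupancy model); here each program receives a whole number of ways, the neighbor move shifts one way between two programs, and the objective is total predicted miss rate.

```python
import math
import random

NUM_WAYS = 11   # ways in the shared LLC (e.g., an 11-way Xeon LLC with CAT)
PROGRAMS = 4    # co-running programs

def mrc(prog, ways):
    # Hypothetical diminishing-returns miss-rate curve; DCAPS would use
    # the measured/predicted MRC of each program instead.
    return 1.0 / (1.0 + (prog + 1) * ways)

def cost(alloc):
    # Objective: total predicted miss rate across all programs.
    return sum(mrc(p, w) for p, w in enumerate(alloc))

def neighbor(alloc, rng):
    # Move one way from one program to another (total ways stays fixed,
    # and every program keeps at least one way).
    a = list(alloc)
    src = rng.choice([p for p in range(len(a)) if a[p] > 1])
    dst = rng.randrange(len(a))
    a[src] -= 1
    a[dst] += 1
    return a

def anneal(rng, steps=2000, t0=1.0, cooling=0.995):
    # Start from an even split, then anneal with geometric cooling.
    alloc = [NUM_WAYS // PROGRAMS] * PROGRAMS
    alloc[0] += NUM_WAYS - sum(alloc)  # give leftover ways to program 0
    cur, cur_cost = alloc, cost(alloc)
    best, best_cost = cur, cur_cost
    t = t0
    for _ in range(steps):
        cand = neighbor(cur, rng)
        c = cost(cand)
        # Accept improvements always; accept regressions with a
        # probability that shrinks as the temperature cools.
        if c < cur_cost or rng.random() < math.exp((cur_cost - c) / t):
            cur, cur_cost = cand, c
            if c < best_cost:
                best, best_cost = cand, c
        t *= cooling
    return best, best_cost

rng = random.Random(42)
best, best_cost = anneal(rng)
print(best, round(best_cost, 4))
```

The annealing step accepts worse allocations early (high temperature) to escape local minima, then converges greedily; the same skeleton applies when the cost function instead encodes fairness or throughput targets, which is how a single search can serve the range of performance goals the abstract mentions.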