research-article

SP-NUCA: a cost effective dynamic non-uniform cache architecture

Authors:
Javier Merino, Valentín Puente, Pablo Prieto, José Ángel Gregorio
Universidad de Cantabria, Spain

ACM SIGARCH Computer Architecture News, Volume 36, Issue 2, May 2008, pp. 64–71. https://doi.org/10.1145/1399972.1399973

Published: 1 May 2008

Abstract

This paper presents a simple but effective method to reduce on-chip access latency and improve core isolation in chip-multiprocessor (CMP) Non-Uniform Cache Architectures (NUCA). It introduces a feasible way to allocate cache blocks according to their access pattern. Each L2 bank is dynamically partitioned, at set level, into private and shared content. Simply by adjusting the replacement algorithm, private data can be placed closer to its owner processor. In contrast, shared data is always placed in the same position, independently of the accessing processor. This approach is capable of reducing on-chip latency without significantly sacrificing hit rate or increasing the implementation cost relative to a conventional static NUCA. Additionally, most of the unnecessary interference between cores in private accesses is removed.
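The placement idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the block size, core count, banks per core, the address-interleaving functions, and the use of plain LRU over both kinds of block are all assumptions chosen for illustration, and the names `shared_bank`, `private_bank`, and `CacheSet` are hypothetical. The sketch only shows how a shared block can map to a fixed bank regardless of the requester, how a private block can map to a bank near its owner, and how a set-level partition can float when the replacement policy decides which kind of block to evict.

```python
from collections import OrderedDict

# All constants below are illustrative assumptions, not values from the paper.
BLOCK_BYTES = 64      # assumed cache-line size
NUM_CORES = 8         # assumed number of cores
BANKS_PER_CORE = 4    # assumed number of L2 banks physically close to each core
NUM_BANKS = NUM_CORES * BANKS_PER_CORE


def shared_bank(addr: int) -> int:
    """Shared data: the bank depends only on the address (static-NUCA style)."""
    return (addr // BLOCK_BYTES) % NUM_BANKS


def private_bank(addr: int, owner: int) -> int:
    """Private data: interleave only across the banks closest to the owner."""
    return owner * BANKS_PER_CORE + (addr // BLOCK_BYTES) % BANKS_PER_CORE


class CacheSet:
    """One set of one bank, where private and shared blocks share the ways."""

    def __init__(self, ways: int):
        self.ways = ways
        self.blocks = OrderedDict()  # tag -> is_private, least recent first

    def access(self, tag: int, is_private: bool) -> bool:
        """Return True on a hit; on a miss, evict the LRU block of either kind."""
        if tag in self.blocks:
            self.blocks.move_to_end(tag)       # mark as most recently used
            return True
        if len(self.blocks) >= self.ways:
            self.blocks.popitem(last=False)    # drop the least recently used
        self.blocks[tag] = is_private
        return False

    def private_ways(self) -> int:
        """Number of ways currently holding private blocks."""
        return sum(self.blocks.values())
```

In this sketch the private/shared split of a set is never fixed in advance: it is whatever mix of blocks survives replacement. The paper reports analyzing several replacement policies, so plain LRU here is only one possible stand-in.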

To support the architectural decisions adopted and to provide a comparative study, a comprehensive evaluation framework is employed. The workbench is composed of a full-system simulator and a representative set of multithreaded and multiprogrammed workloads. With this infrastructure, different alternatives for the coherence protocol, replacement policy, and cache utilization are analyzed to find the optimal proposal. We conclude that the cost of a feasible implementation should be closer to that of a conventional static NUCA, and significantly lower than that of a dynamic NUCA.

Finally, a comparison with static and dynamic NUCA is presented. The simulation results suggest that, on average, the proposed mechanism could improve system performance over a static NUCA and an idealized dynamic NUCA by 16% and 6%, respectively.


Published in

ACM SIGARCH Computer Architecture News Volume 36, Issue 2
May 2008
77 pages
ISSN: 0163-5964
DOI: 10.1145/1399972

Copyright © 2008 Authors

Publisher
Association for Computing Machinery
New York, NY, United States