research-article

SP-NUCA: a cost effective dynamic non-uniform cache architecture

Published: 01 May 2008

Abstract

This paper presents a simple but effective method to reduce on-chip access latency and improve core isolation in CMP Non-Uniform Cache Architectures (NUCA). The paper introduces a feasible way to allocate cache blocks according to their access pattern. Each L2 bank is dynamically partitioned at the set level into private and shared content. Simply by adjusting the replacement algorithm, private data can be placed closer to its owner processor. Shared data, in contrast, is always placed in the same position regardless of the accessing processor. This approach can reduce on-chip latency without significantly sacrificing hit rate or increasing implementation cost relative to a conventional static NUCA. Additionally, most of the unnecessary interference between cores in private accesses is removed.
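
The placement rule described above can be illustrated with a toy model. The following is a minimal sketch and not the implementation evaluated in the paper: the bank count, set count, associativity, block size, the address-bit mapping, the plain LRU policy, and the caller-supplied `is_shared` flag are all assumptions made for illustration, and the names `ToyL2`, `Bank`, `shared_bank`, and `private_bank` are hypothetical. It shows only the two ideas stated in the abstract: private blocks go to the bank assumed to be nearest the owner core, while shared blocks go to a bank fixed by the address, and both kinds of block compete for the ways of the same set.

```python
from collections import OrderedDict

# All sizes below are assumed toy values, not taken from the paper.
NUM_BANKS = 8       # assumed: one L2 bank next to each core
SETS_PER_BANK = 4   # assumed
WAYS = 2            # assumed associativity
BLOCK_BITS = 6      # assumed 64-byte blocks


def shared_bank(addr: int) -> int:
    """Shared data: bank chosen from address bits only, so every core
    finds the block in the same place (assumed static-NUCA-style mapping)."""
    return (addr >> BLOCK_BITS) % NUM_BANKS


def private_bank(core: int) -> int:
    """Private data: the bank assumed to be closest to the owner core."""
    return core % NUM_BANKS


def set_index(addr: int) -> int:
    """Set inside a bank (assumed mapping: bits above the bank-select bits)."""
    return ((addr >> BLOCK_BITS) // NUM_BANKS) % SETS_PER_BANK


class Bank:
    """One L2 bank. Each set holds private and shared blocks side by side;
    the split between them is decided only by what replacement keeps."""

    def __init__(self) -> None:
        # One OrderedDict per set: block number -> is_private flag, in LRU order.
        self.sets = [OrderedDict() for _ in range(SETS_PER_BANK)]

    def access(self, addr: int, is_private: bool) -> bool:
        """Return True on a hit; on a miss, insert the block (evicting LRU)."""
        blocks = self.sets[set_index(addr)]
        block = addr >> BLOCK_BITS
        if block in blocks:
            blocks.move_to_end(block)      # mark as most recently used
            return True
        if len(blocks) >= WAYS:
            blocks.popitem(last=False)     # evict least recently used block
        blocks[block] = is_private
        return False


class ToyL2:
    """Banked L2 in which only the bank choice depends on private/shared status."""

    def __init__(self) -> None:
        self.banks = [Bank() for _ in range(NUM_BANKS)]

    def access(self, core: int, addr: int, is_shared: bool):
        """Return (bank_id, hit). `is_shared` is supplied by the caller here;
        how sharing is actually detected is not modelled in this sketch."""
        bank_id = shared_bank(addr) if is_shared else private_bank(core)
        hit = self.banks[bank_id].access(addr, is_private=not is_shared)
        return bank_id, hit
```

In this sketch, a private access by core 3 is always served by bank 3, whereas a shared block at a given address is looked up in the same bank whichever core requests it. A policy that treats private and shared blocks differently at eviction time would replace the plain LRU step in `Bank.access`.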

To support the architectural decisions adopted and provide a comparative study, a comprehensive evaluation framework is employed. The workbench comprises a full-system simulator and a representative set of multithreaded and multiprogrammed workloads. With this infrastructure, different alternatives for the coherence protocol, replacement policy, and cache utilization are analyzed to find the best proposal. We conclude that the cost of a feasible implementation should be close to that of a conventional static NUCA and significantly lower than that of a dynamic NUCA.

Finally, a comparison with static and dynamic NUCA is presented. The simulation results suggest that, on average, the proposed mechanism could improve system performance over a static NUCA and an idealized dynamic NUCA by 16% and 6%, respectively.



Published in

ACM SIGARCH Computer Architecture News, Volume 36, Issue 2
May 2008
77 pages
ISSN: 0163-5964
DOI: 10.1145/1399972

Copyright © 2008 Authors

Publisher: Association for Computing Machinery, New York, NY, United States
