Abstract
In this paper, we examine the area-performance design space of a processing core for a chip multiprocessor (CMP), considering both the architectural design space and the tradeoffs of the physical design on which the architecture relies. We first propose a methodology for jointly optimizing the micro-architecture and the physical circuit design of a microprocessor. In our approach, we use statistical and convex fitting methods to capture a large micro-architectural design space. We then characterize the area-delay tradeoffs of the underlying circuits through RTL synthesis. Finally, we relate the architecture to the circuits in an integrated model, which we use to optimize the processor. As a case study, we apply this methodology to explore the performance-area tradeoffs in a highly parallel accelerator architecture for visual computing applications. Based on early circuit tradeoff data, our results indicate that two separate designs are performance/area optimal for our set of benchmarks: a simpler single-issue, 2-way multithreaded core running at a high frequency, and a more aggressively tuned dual-issue, 4-way multithreaded design running at a lower frequency.
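The flow described in the abstract (fit a circuit area-delay tradeoff, combine it with an architectural performance estimate, then search for the best performance per area) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the power-law form `area = a * delay**b`, the function names, the candidate cores, and all numbers are hypothetical and are not taken from the paper, whose actual models are not given in this abstract.

```python
import math

def fit_power_law(delays, areas):
    """Least-squares fit of area = a * delay**b, done as a line fit in log-log space.

    The power-law (monomial) form is an assumed stand-in for a convex fit.
    """
    xs = [math.log(d) for d in delays]
    ys = [math.log(v) for v in areas]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                       # slope in log-log space
    a = math.exp(mean_y - b * mean_x)   # intercept mapped back to linear space
    return a, b

def perf_per_area(ipc, area_scale, period, a, b):
    """Performance (IPC / clock period) divided by modeled core area."""
    area = area_scale * a * period ** b
    return (ipc / period) / area

def best_design(candidates, periods, a, b):
    """Exhaustively search (core, clock period) pairs for the best perf/area.

    candidates: list of (name, ipc, area_scale) tuples, all hypothetical.
    """
    options = [(name, t, perf_per_area(ipc, scale, t, a, b))
               for name, ipc, scale in candidates
               for t in periods]
    name, period, _ = max(options, key=lambda o: o[2])
    return name, period

# Synthetic area-delay samples standing in for RTL synthesis results.
sample_delays = [1.0, 2.0, 4.0]
sample_areas = [2.0 * d ** -0.5 for d in sample_delays]
a, b = fit_power_law(sample_delays, sample_areas)

# Hypothetical cores: (name, IPC estimate, relative area multiplier).
cores = [("core_a", 1.0, 1.0), ("core_b", 1.5, 1.8)]
choice = best_design(cores, [1.0, 1.5, 2.0], a, b)
```

With a single monomial the metric is monotonic in the clock period, so the search lands on a boundary of the period grid; a richer fitted model would be needed to produce an interior optimum.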
Index Terms
- Area-efficiency in CMP core design: co-optimization of microarchitecture and physical design