Understanding throughput-oriented architectures

Authors:
Michael Garland, NVIDIA Research, Santa Clara, CA
David B. Kirk, NVIDIA Research, Santa Clara, CA

Communications of the ACM, Volume 53, Issue 11, November 2010, pp. 58–66. https://doi.org/10.1145/1839676.1839694

Published: 01 November 2010

Abstract

For workloads with abundant parallelism, GPUs deliver higher peak computational throughput than latency-oriented CPUs.

Published in

Communications of the ACM Volume 53, Issue 11
November 2010
112 pages
ISSN:0001-0782
EISSN:1557-7317
DOI:10.1145/1839676
Copyright © 2010 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States