Abstract
For workloads with abundant parallelism, GPUs deliver higher peak computational throughput than latency-oriented CPUs.
- Alverson, G., Alverson, R., Callahan, D., Koblenz, B., Porterfield, A., and Smith, B. Exploiting heterogeneous parallelism on a multithreaded multiprocessor. In Proceedings of the Sixth international Conference on Supercomputing (Washington, D.C., July 19--24). ACM Press, New York, 1992, 188--197. Google ScholarDigital Library
- Alverson, R., Callahan, D., Cummings, D., Koblenz, B., Porterfield, A., and Smith, B. The Tera computer system. In Proceedings of the Fourth international Conference on Supercomputing (Amsterdam, The Netherlands, June 11--15). ACM Press, New York, 1990, 1--6 Google ScholarDigital Library
- Bell, N. and Garland, M. Implementing sparse matrix-vector multiplication on throughput-oriented processors. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (Portland, OR, Nov. 14--20). ACM Press, New York, 2009, 1--11. Google ScholarDigital Library
- Birrell, A.D. An Introduction to Programming with Threads. Research Report 35. Digital Equipment Corp. Systems Research, Palo Alto, CA, 1989.Google Scholar
- Blank, T. The MasPar MP-1 architecture. In Proceedings of Compcon (San Francisco, CA, Feb. 26--Mar. 2). IEEE Press, 1990, 20--24.Google Scholar
- Borkar, S., Jouppi, N.P., and Stenstrom, P. Microprocessors in the era of terascale integration. In Proceedings of the Conference on Design, Automation and Test in Europe (Nice, France, Apr. 16--20). EDA Consortium, San Jose, CA, 2007, 237--242. Google ScholarDigital Library
- Bouknight, W.J., Denenberg, S.A., McIntyre, D.E., Randall, J.M., Sameh, A.H., and Slotnick, D.L. The Illiac IV system. Proceedings of the IEEE 60, 4 (Apr. 1972), 369--388.Google ScholarCross Ref
- Dally, W. Power efficient supercomputing. Presented at the Accelerator-based Computing and Manycore Workshop (Lawrence Berkeley National Laboratory, Berkeley, CA, Nov. 30--Dec. 2, 2009); http://www.lbl.gov/cs/html/Manycore_Workshop09/GPUMulticoreSLAC2009/dallyppt.pdfGoogle Scholar
- Dally, W.J., Labonte, F., Das, A., Hanrahan, P., Ahn, J., Gummaraju, J., Erez, M., Jayasena, N., Buck, I., Knight, T. J., and Kapasi, U.J. Merrimac: Supercomputing with streams. In Proceedings of the 2003 ACM/IEEE Conference on Supercomputing (Nov. 15--21). IEEE Computer Society, Washington, D.C., 2003. Google ScholarDigital Library
- Davis, J.D., Laudon, J., and Olukotun, K. Maximizing CMP throughput with mediocre cores. In Proceedings of the 14th international Conference on Parallel Architectures and Compilation Techniques (Sept. 17--21). IEEE Computer Society, Washington, D.C., 2005, 51--62. Google ScholarDigital Library
- Espasa, R., Valero, M., and Smith, J.E. Vector architectures: Past, present and future. In Proceedings of the 12th international Conference on Supercomputing (Melbourne, Australia). ACM Press, New York, 1998, 425--432. Google ScholarDigital Library
- Flynn, M.J. Very high-speed computing systems. Proceedings of the IEEE 54, 12 (Dec. 1966), 1901--1909.Google ScholarCross Ref
- Garland, M., Grand, S.L., Nickolls, J., Anderson, J., Hardwick, J., Morton, S., Phillips, E., Zhang, Y., and Volkov, V. Parallel computing experiences with CUDA. IEEE Micro 28, 4 (July 2008), 13--27. Google ScholarDigital Library
- Gavril, F. Merging with parallel processors. Commun. ACM 18, 10 (Oct. 1975), 588--591. Google ScholarDigital Library
- Grochowski, E., Ronen, R., Shen, J., and Wang, H. Best of both latency and throughput. In Proceedings of the IEEE international Conference on Computer Design (Oct. 11--13). IEEE Computer Society, Washington, D.C., 2004, 236--243. Google ScholarDigital Library
- Kapasi, U., Dally, W.J., Rixner, S., Owens, J.D., and Khailany, B. The Imagine stream processor. In Proceedings of the 2002 IEEE International Conference on Computer Design (Sept. 16--18). IEEE Computer Society, Washington, D.C., 2002, 282--288. Google ScholarDigital Library
- Khailany, B.K., Williams, T., Lin, J., Long, E.P., Rygh, M., Tovey, D.W., and Dally, W.J. A programmable 512 GOPS stream processor for signal, image, and video processing. IEEE Journal of Solid-State Circuits 43, 1 (Jan. 2008), 202--213.Google ScholarCross Ref
- Kongetira, P., Aingaran, K., and Olukotun, K. Niagara: A 32-way multithreaded Sparc processor. IEEE Micro 25, 2 (Mar./Apr. 2005), 21--29. Google ScholarDigital Library
- Kozyrakis, C. and Patterson, D. Vector vs. superscalar and VLIW architectures for embedded multimedia benchmarks. In Proceedings of the 35th Annual ACM/IEEE International Symposium on Microarchitecture (Istanbul, Turkey, Nov. 18--22). IEEE Computer Society Press, Los Alamitos, CA, 2002, 283--293. Google ScholarDigital Library
- Krashinsky, R., Batten, C., Hampton, M., Gerding, S., Pharris, B., Casper, J., and Asanovic, K. The vector-thread architecture. SIGARCH Computer Architecture News 32, 2 (Mar. 2004), 52--63. Google ScholarDigital Library
- Laudon, J., Gupta, A., and Horowitz, M. Interleaving: A multithreading technique targeting multiprocessors and workstations. In Proceedings of the Sixth International Conference on Architectural Support For Programming Languages and Operating Systems (San Jose, CA, Oct. 5--7). ACM Press, New York, 1994, 308--318. Google ScholarDigital Library
- Lindholm, E., Nickolls, J., Oberman, S., and Montrym, J. NVIDIA Tesla: A unified graphics and computing architecture. IEEE Micro 28, 2 (Mar./Apr. 2008), 39--55. Google ScholarDigital Library
- Nickolls, J., Buck, I., Garland, M., and Skadron, K. Scalable parallel programming with CUDA. Queue 6, 2 (Mar./Apr. 2008), 40--53. Google ScholarDigital Library
- NVIDIA. NVIDIA's Next-Generation CUDA Compute Architecture: Fermi, Oct. 2009; http://www.nvidia.com/fermiGoogle Scholar
- Russell, R.M. The Cray-1 computer system. Commun. ACM, 21, 1 (Jan. 1978), 63--72. Google ScholarDigital Library
- Sanders, J. and Kandrot, E. CUDA By Example: An Introduction to General-Purpose GPU Programming. Addison-Wesley, July 2010. Google ScholarDigital Library
- Satish, N., Harris, M., and Garland, M. Designing efficient sorting algorithms for manycore GPUs. In Proceedings of the 2009 IEEE international Symposium on Parallel & Distributed Processing (May 23--29). IEEE Computer Society, Washington, D.C., 2009, 1--10. Google ScholarDigital Library
- Smith, B.J. Architecture and applications of the HEP multiprocessor computer system. Proceedings of the International Society for Optical Engineering 298 (Aug. 1981), 241--248.Google ScholarCross Ref
- Tucker, L.W. and Robertson, G.G. Architecture and applications of the Connection Machine. Computer 21, 8 (Aug. 1988), 26--38. Google ScholarDigital Library
- Tullsen, D.M., Eggers, S.J., and Levy, H.M. Simultaneous multithreading: maximizing on-chip parallelism. In Proceedings of the 22nd Annual international Symposium on Computer Architecture (S. Margherita Ligure, Italy, June 22--24). ACM Press, New York, 1995, 392--403. Google ScholarDigital Library
- Ungerer, T., Robic, B., and Šilc, J. A survey of processors with explicit multithreading. ACM Computing Surveys 35, 1 (Mar. 2003), 29--63. Google ScholarDigital Library
Index Terms
- Understanding throughput-oriented architectures
Recommendations
Understanding Co-Running Behaviors on Integrated CPU/GPU Architectures
Architecture designers tend to integrate both CPUs and GPUs on the same chip to deliver energy-efficient designs. It is still an open problem to effectively leverage the advantages of both CPUs and GPUs on integrated architectures. In this work, we port ...
Vectorizing Unstructured Mesh Computations for Many-core Architectures
PMAM'14: Proceedings of Programming Models and Applications on Multicores and ManycoresAchieving optimal performance on the latest multi-core and many-core architectures depends more and more on making efficient use of the hardware's vector processing capabilities. While auto-vectorizing compilers do not require the use of vector ...
Vectorizing Unstructured Mesh Computations for Many-core Architectures
PMAM'14: Proceedings of Programming Models and Applications on Multicores and ManycoresAchieving optimal performance on the latest multi-core and many-core architectures depends more and more on making efficient use of the hardware's vector processing capabilities. While auto-vectorizing compilers do not require the use of vector ...
Comments