Region-based parallelization of irregular reductions on explicitly managed memory hierarchies

Kim, Seonggun; Han, Hwansoo; Choe, Kwang-Moo

doi:10.1007/s11227-009-0340-3

Published: 10 October 2009

Volume 56, pages 25–55 (2011)

The Journal of Supercomputing

Seonggun Kim¹,
Hwansoo Han² &
Kwang-Moo Choe¹

Abstract

Multicore architectures are evolving with the promise of extreme performance for classes of applications that require high computational performance and large memory bandwidth. Irregular reduction is one of the important computation patterns in many complex scientific applications, and it typically requires both. In this article, we propose region-based parallelization techniques for irregular reductions on multicore architectures with explicitly managed memory hierarchies. Managing a memory hierarchy in software requires considerable programming effort and tends to be error-prone. The difficulties are even greater for applications with irregular data access patterns. To relieve programmers of the burden of memory management, we develop abstractions, targeted specifically at irregular reductions, for structuring parallel tasks, mapping those tasks to processing units, and scheduling data transfers between levels of the memory hierarchy. Our framework employs iteration reordering based on regions of data along with dynamic scheduling of parallel tasks. We experimentally evaluate the effectiveness of our techniques for irregular reduction kernels on the Cell processor embedded in a Sony PlayStation3. Experimental results show speedups of 8 to 14 on the six available SPEs.
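To make the pattern in the abstract concrete, the following is a minimal Python sketch of an irregular reduction and of one possible region-based iteration reordering. It is an illustration under stated assumptions, not the authors' framework: the function names, the force formula `x[i] - x[j]`, the fixed `region_size`, and the choice to replicate an iteration in every region it writes to (an owner-computes scheme) are all hypothetical. The `local` list stands in for a buffer that would live in a software-managed local store.

```python
from collections import defaultdict


def sequential_reduction(n, edges, x):
    """Reference irregular reduction: y is updated through index pairs."""
    y = [0.0] * n
    for i, j in edges:
        f = x[i] - x[j]          # hypothetical per-iteration contribution
        y[i] += f
        y[j] -= f
    return y


def group_by_region(edges, region_size):
    """Reorder iterations by bucketing each edge under every region it writes to."""
    buckets = defaultdict(list)
    for i, j in edges:
        # A set avoids adding the edge twice when both endpoints share a region.
        for r in {i // region_size, j // region_size}:
            buckets[r].append((i, j))
    return buckets


def region_reduction(n, edges, x, region_size):
    """Process one region of y at a time; each task writes only to its own region."""
    y = [0.0] * n
    for r, work in sorted(group_by_region(edges, region_size).items()):
        lo = r * region_size
        hi = min(lo + region_size, n)
        local = [0.0] * (hi - lo)    # assumed stand-in for a local-store buffer
        for i, j in work:
            f = x[i] - x[j]
            if lo <= i < hi:
                local[i - lo] += f
            if lo <= j < hi:
                local[j - lo] -= f
        y[lo:hi] = local             # write the finished region back to main memory
    return y
```

Because each region task in this sketch writes to a disjoint slice of `y`, the tasks are independent and could in principle be handed to processing units by a dynamic scheduler. The cost of this particular variant is that an iteration touching two regions is computed twice.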

Author information

Authors and Affiliations

Department of Computer Science, KAIST, Daejeon, 305-701, Republic of Korea
Seonggun Kim & Kwang-Moo Choe
Department of Computer Engineering, Sungkyunkwan University, Suwon, 440-746, Republic of Korea
Hwansoo Han

Corresponding author

Correspondence to Hwansoo Han.

About this article

Cite this article

Kim, S., Han, H. & Choe, KM. Region-based parallelization of irregular reductions on explicitly managed memory hierarchies. J Supercomput 56, 25–55 (2011). https://doi.org/10.1007/s11227-009-0340-3

Issue Date: April 2011
DOI: https://doi.org/10.1007/s11227-009-0340-3
