Abstract
Multicore architectures promise very high performance for applications that demand heavy computation and large memory bandwidth. Irregular reduction is an important computation pattern in many complex scientific applications, and it typically places exactly these demands on the hardware. In this article, we propose region-based parallelization techniques for irregular reductions on multicore architectures with explicitly managed memory hierarchies. Managing a memory hierarchy in software requires considerable programming effort and tends to be error-prone, and the difficulty is greater still for applications with irregular data access patterns. To relieve programmers of the burden of memory management, we develop abstractions, targeted specifically at irregular reductions, for structuring parallel tasks, mapping those tasks to processing units, and scheduling data transfers between levels of the memory hierarchy. Our framework employs iteration reordering based on regions of data, along with dynamic scheduling of parallel tasks. We experimentally evaluate the effectiveness of our techniques on irregular reduction kernels running on the Cell processor embedded in a Sony PlayStation 3. Experimental results show speedups of 8 to 14 on the six available SPEs.
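As a rough illustration of the pattern described above, the following minimal sketch contrasts a plain irregular reduction (updates through an indirection array) with a version whose iterations are reordered by data region. It is not the article's framework: the function names, the fixed `region_size`, and the slice copies standing in for transfers to and from a local store are all assumptions made for this example, and the tasks run sequentially rather than being scheduled dynamically across processing units.

```python
from collections import defaultdict


def irregular_reduction(n_nodes, edges, weights):
    """Baseline: each iteration updates y through indirect indices."""
    y = [0.0] * n_nodes
    for (a, b), w in zip(edges, weights):
        y[a] += w
        y[b] -= w
    return y


def group_by_region(edges, region_size):
    """Bucket iteration indices by the pair of data regions they touch."""
    buckets = defaultdict(list)
    for i, (a, b) in enumerate(edges):
        ra, rb = a // region_size, b // region_size
        buckets[(min(ra, rb), max(ra, rb))].append(i)
    return buckets


def region_based_reduction(n_nodes, edges, weights, region_size):
    """Hypothetical region-ordered variant: one task per region pair."""
    y = [0.0] * n_nodes
    tasks = sorted(group_by_region(edges, region_size).items())
    for (r0, r1), iters in tasks:
        # Copy the (at most two) regions this task needs into local
        # buffers; this stands in for an explicit transfer into fast memory.
        local = {}
        for r in {r0, r1}:
            lo = r * region_size
            local[r] = y[lo:min(lo + region_size, n_nodes)]
        # All updates in this task hit only the local buffers.
        for i in iters:
            a, b = edges[i]
            local[a // region_size][a % region_size] += weights[i]
            local[b // region_size][b % region_size] -= weights[i]
        # Write the updated regions back to main memory.
        for r, buf in local.items():
            lo = r * region_size
            y[lo:lo + len(buf)] = buf
    return y


if __name__ == "__main__":
    edges = [(0, 5), (1, 2), (6, 7), (3, 4), (5, 0), (2, 7)]
    weights = [1, 2, 3, 4, 5, 6]
    print(irregular_reduction(8, edges, weights))
    print(region_based_reduction(8, edges, weights, region_size=3))
```

Because each task touches a bounded set of regions, its working set is known before it runs, which is what makes explicit data transfers plannable. In a genuinely parallel setting, tasks that share a region would conflict and would need to be serialized or otherwise coordinated; the sketch sidesteps this by running tasks one at a time. Reordering also changes the order of floating-point additions, so results may differ in the last bits for non-integer data.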
Cite this article
Kim, S., Han, H. & Choe, KM. Region-based parallelization of irregular reductions on explicitly managed memory hierarchies. J Supercomput 56, 25–55 (2011). https://doi.org/10.1007/s11227-009-0340-3
DOI: https://doi.org/10.1007/s11227-009-0340-3