Region-based parallelization of irregular reductions on explicitly managed memory hierarchies

The Journal of Supercomputing

Abstract

Multicore architectures are evolving with the promise of extreme performance for classes of applications that demand high computational performance and large memory bandwidth. Irregular reduction is an important computation pattern in many complex scientific applications, and it typically has both demands. In this article, we propose region-based parallelization techniques for irregular reductions on multicore architectures with explicitly managed memory hierarchies. Managing a memory hierarchy in software requires considerable programming effort and tends to be error-prone, and the difficulties are even greater for applications with irregular data access patterns. To relieve programmers of the burden of memory management, we develop abstractions, targeted specifically at irregular reductions, for structuring parallel tasks, mapping those tasks to processing units, and scheduling data transfers between levels of the memory hierarchy. Our framework employs iteration reordering based on regions of data, along with dynamic scheduling of parallel tasks. We experimentally evaluate the effectiveness of our techniques for irregular reduction kernels on the Cell processor embedded in a Sony PlayStation3. Experimental results show speedups of 8 to 14 on the six available SPEs.
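To make the pattern in the abstract concrete, the following is a minimal sketch of an irregular reduction and of one possible region-based reordering of its iterations. It is not the authors' framework: the function names, the edge-list input, the fixed `region_size`, and the choice to replicate iterations that cross a region boundary are all assumptions made for illustration. The `local` list only stands in for a buffer that would live in a software-managed local store.

```python
from collections import defaultdict

def irregular_reduction(n_nodes, edges, weights):
    """Baseline kernel: updates go through an indirection (the edge list)."""
    y = [0.0] * n_nodes
    for (i, j), w in zip(edges, weights):
        y[i] += w
        y[j] -= w
    return y

def group_by_region(edges, region_size):
    """Map each region id to the iterations that write into that region.

    Assumption: an iteration whose two targets fall in different regions
    is listed under both regions (it is replicated, not split).
    """
    regions = defaultdict(list)
    for k, (i, j) in enumerate(edges):
        regions[i // region_size].append(k)
        if j // region_size != i // region_size:
            regions[j // region_size].append(k)
    return regions

def region_based_reduction(n_nodes, edges, weights, region_size):
    """Process one region at a time; each task writes only its own region."""
    y = [0.0] * n_nodes
    for r, iters in sorted(group_by_region(edges, region_size).items()):
        lo = r * region_size
        hi = min(lo + region_size, n_nodes)
        local = [0.0] * (hi - lo)      # stand-in for a local-store buffer
        for k in iters:
            i, j = edges[k]
            w = weights[k]
            if lo <= i < hi:           # update only targets owned by this region
                local[i - lo] += w
            if lo <= j < hi:
                local[j - lo] -= w
        y[lo:hi] = local               # stand-in for writing the region back
    return y

edges = [(0, 5), (1, 2), (4, 7), (3, 6), (0, 1)]
weights = [1.0, 2.0, 3.0, 4.0, 5.0]
result = region_based_reduction(8, edges, weights, region_size=4)
```

Because each region task writes only to its own slice of the reduction array, the tasks are independent of one another, so in this sketch they could be handed to workers in any order, which is the property a dynamic scheduler would rely on.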

Author information

Corresponding author

Correspondence to Hwansoo Han.

About this article

Cite this article

Kim, S., Han, H. & Choe, KM. Region-based parallelization of irregular reductions on explicitly managed memory hierarchies. J Supercomput 56, 25–55 (2011). https://doi.org/10.1007/s11227-009-0340-3

  • DOI: https://doi.org/10.1007/s11227-009-0340-3
