Data Layout Transformation for Stencil Computations on Short-Vector SIMD Architectures

Henretty, Tom; Stock, Kevin; Pouchet, Louis-Noël; Franchetti, Franz; Ramanujam, J.; Sadayappan, P.

doi:10.1007/978-3-642-19861-8_13

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 6601)

Abstract

Stencil computations are at the core of applications in many domains, such as computational electromagnetics, image processing, and partial differential equation solvers used in a variety of scientific and engineering applications. Short-vector SIMD instruction sets such as SSE and VMX provide a promising and widely available avenue for enhancing performance on modern processors. However, a fundamental memory stream alignment issue limits the performance achieved by stencil computations on modern short-vector SIMD architectures. In this paper, we propose a novel data layout transformation that avoids the stream alignment conflict, along with a static analysis technique for determining where this transformation is applicable. Significant performance increases are demonstrated for a variety of stencil codes on three modern SIMD-capable processors.
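
The stream alignment conflict can be illustrated with a small sketch. In a conventional layout, a three-point stencil reads a[i-1], a[i], and a[i+1] for V consecutive values of i, so the three input streams start one element apart and at most one of them can be vector-aligned. The Python sketch below assumes a vector width of V = 4 and uses list slices of length V as stand-ins for aligned SIMD loads. The transposition-style layout and the function names (`to_lifted`, `from_lifted`, `stencil_reference`, `stencil_lifted`) are illustrative assumptions; the abstract does not spell out the transformation, so this should be read as one layout with the stated property rather than as the authors' exact method.

```python
V = 4  # assumed SIMD width in elements (e.g. four floats in a 128-bit register)


def to_lifted(a, v=V):
    """View `a` as a v x (n/v) row-major matrix and store its transpose.

    Element a[r*m + c] moves to position c*v + r, so a[i] and a[i+1]
    end up in the same lane of adjacent length-v vectors.
    """
    n = len(a)
    assert n % v == 0, "sketch assumes the length is a multiple of v"
    m = n // v
    out = [None] * n
    for r in range(v):
        for c in range(m):
            out[c * v + r] = a[r * m + c]
    return out


def from_lifted(t, v=V):
    """Inverse of to_lifted: restore the conventional layout."""
    n = len(t)
    m = n // v
    out = [None] * n
    for r in range(v):
        for c in range(m):
            out[r * m + c] = t[c * v + r]
    return out


def stencil_reference(a):
    """Scalar 3-point stencil; the two boundary elements are copied."""
    b = list(a)
    for i in range(1, len(a) - 1):
        b[i] = a[i - 1] + a[i] + a[i + 1]
    return b


def stencil_lifted(a, v=V):
    """Same stencil, computed on whole 'vectors' of the transposed layout."""
    n = len(a)
    m = n // v
    t = to_lifted(a, v)
    out_t = list(t)
    # Interior vectors: every operand is a whole slice starting at a
    # multiple of v, i.e. an aligned load in this model.
    for c in range(1, m - 1):
        left = t[(c - 1) * v:c * v]
        mid = t[c * v:(c + 1) * v]
        right = t[(c + 1) * v:(c + 2) * v]
        out_t[c * v:(c + 1) * v] = [l + x + r for l, x, r in zip(left, mid, right)]
    b = from_lifted(out_t, v)
    # First and last vectors have neighbours in a different lane, so this
    # sketch simply falls back to scalar code for them.
    for r in range(v):
        for c in (0, m - 1):
            i = r * m + c
            if 0 < i < n - 1:
                b[i] = a[i - 1] + a[i] + a[i + 1]
    return b
```

In this model, every operand of the interior loop is a slice that starts at a multiple of V, which is the property that removes the misaligned streams. The scalar fallback for the first and last vectors is a simplification made for the sketch; a real SIMD implementation would more plausibly handle those cases with lane shifts or shuffles.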

Author information

Authors and Affiliations

The Ohio State University, USA
Tom Henretty, Kevin Stock, Louis-Noël Pouchet & P. Sadayappan
Carnegie Mellon University, USA
Franz Franchetti
Louisiana State University, USA
J. Ramanujam

Editor information

Editors and Affiliations

Faculty of Informatics, Institute of Computer Languages, TU Vienna, Argentinierstr. 8 / E185.1, 1040, Vienna, Austria
Jens Knoop

About this paper

Cite this paper

Henretty, T., Stock, K., Pouchet, LN., Franchetti, F., Ramanujam, J., Sadayappan, P. (2011). Data Layout Transformation for Stencil Computations on Short-Vector SIMD Architectures. In: Knoop, J. (ed.) Compiler Construction. CC 2011. Lecture Notes in Computer Science, vol 6601. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19861-8_13

DOI: https://doi.org/10.1007/978-3-642-19861-8_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-19860-1
Online ISBN: 978-3-642-19861-8
eBook Packages: Computer Science, Computer Science (R0)
