Abstract
Stencil computations are at the core of applications in many domains such as computational electromagnetics, image processing, and partial differential equation solvers used in a variety of scientific and engineering applications. Short-vector SIMD instruction sets such as SSE and VMX provide a promising and widely available avenue for enhancing performance on modern processors. However, a fundamental memory stream alignment issue limits the performance that stencil computations achieve on modern short-vector SIMD architectures. In this paper, we propose a novel data layout transformation that avoids the stream alignment conflict, along with a static analysis technique for determining where this transformation is applicable. Significant performance increases are demonstrated for a variety of stencil codes on three modern SIMD-capable processors.
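The abstract does not spell out the transformation, so the following C sketch only illustrates the kind of conflict it refers to and one layout that sidesteps it; it should not be read as the authors' implementation. In a 1D 3-point stencil, the operands a[i-1], a[i], and a[i+1] form three streams offset by one element, so at most one of them can start on a vector boundary. The sketch assumes a vector length of V = 4 floats and an array length divisible by V, views the array as a V x m matrix, and stores its transpose, so that the three operands of every interior vector sit in the same lane of adjacent rows. The function names (stencil_ref, to_transposed, from_transposed, stencil_transposed) and the scalar lane loop standing in for a SIMD instruction are hypothetical choices made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define V 4 /* assumed vector length: 4 floats, as in a 128-bit register */

/* Reference 3-point stencil on the original layout. The streams a[i-1],
   a[i], a[i+1] are offset by one element, so they cannot all be aligned. */
void stencil_ref(const float *a, float *b, size_t n) {
    for (size_t i = 1; i + 1 < n; i++)
        b[i] = a[i - 1] + a[i] + a[i + 1];
}

/* View a as a V x m row-major matrix and store its transpose (m x V):
   t[j*V + k] = a[k*m + j]. */
void to_transposed(const float *a, float *t, size_t n) {
    assert(n % V == 0);
    size_t m = n / V;
    for (size_t k = 0; k < V; k++)
        for (size_t j = 0; j < m; j++)
            t[j * V + k] = a[k * m + j];
}

/* Inverse of to_transposed. */
void from_transposed(const float *t, float *a, size_t n) {
    assert(n % V == 0);
    size_t m = n / V;
    for (size_t k = 0; k < V; k++)
        for (size_t j = 0; j < m; j++)
            a[k * m + j] = t[j * V + k];
}

/* The same stencil computed directly on the transposed layout. */
void stencil_transposed(const float *t, float *u, size_t n) {
    assert(n % V == 0 && n / V >= 2);
    size_t m = n / V;

    /* Interior rows: the operands are the whole rows j-1, j, j+1. Each row
       starts at a multiple of V floats, so each would be a single aligned
       vector load. The k loop stands in for one V-wide SIMD operation. */
    for (size_t j = 1; j + 1 < m; j++)
        for (size_t k = 0; k < V; k++)
            u[j * V + k] = t[(j - 1) * V + k] + t[j * V + k] + t[(j + 1) * V + k];

    /* Boundary rows need a neighbour from the adjacent lane. */
    for (size_t k = 0; k < V; k++) {
        /* Row 0: left neighbour of a[k*m] is a[k*m - 1], row m-1, lane k-1. */
        if (k > 0)
            u[k] = t[(m - 1) * V + (k - 1)] + t[k] + t[V + k];
        /* Row m-1: right neighbour of a[k*m + m-1] is a[(k+1)*m], row 0, lane k+1. */
        if (k + 1 < V)
            u[(m - 1) * V + k] = t[(m - 2) * V + k] + t[(m - 1) * V + k] + t[k + 1];
    }
}
```

In this sketch the lane shifts are confined to the first and last rows, while the interior rows, which dominate for large arrays, use only same-lane operands. A caller would convert once with to_transposed, apply stencil_transposed for as many sweeps as needed, and convert back with from_transposed, so the cost of the layout change is amortized over the sweeps.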
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Henretty, T., Stock, K., Pouchet, LN., Franchetti, F., Ramanujam, J., Sadayappan, P. (2011). Data Layout Transformation for Stencil Computations on Short-Vector SIMD Architectures. In: Knoop, J. (eds) Compiler Construction. CC 2011. Lecture Notes in Computer Science, vol 6601. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19861-8_13
DOI: https://doi.org/10.1007/978-3-642-19861-8_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-19860-1
Online ISBN: 978-3-642-19861-8
eBook Packages: Computer Science, Computer Science (R0)