ABSTRACT
Many applications in media processing, control, graphics, and other domains require efficient small-scale linear algebra computations. However, most existing high-performance libraries for linear algebra, such as ATLAS or Intel MKL, are geared more towards large-scale problems (matrix sizes in the hundreds and larger) and towards specific interfaces (e.g., BLAS). In this paper we present LGen: a compiler for small-scale, basic linear algebra computations. The input to LGen is a fixed-size linear algebra expression; the output is a corresponding C function, optionally including intrinsics to efficiently use SIMD vector extensions. LGen generates code using two levels of mathematical domain-specific languages (DSLs). The DSLs are used to perform tiling, loop fusion, and vectorization at a high level of abstraction, before the final code is generated. In addition, search is used to select among alternative generated implementations. We show benchmarks of code generated by LGen against Intel MKL and IPP as well as against alternative generators, such as the C++ template-based Eigen and the BTO compiler. The achieved speedup is typically a factor of two to three.
- E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK Users' Guide. Society for Industrial and Applied Mathematics, Philadelphia, PA, third edition, 1999.
- G. Barthe, J. M. Crespo, S. Gulwani, C. Kunz, and M. Marron. From relational verification to SIMD loop synthesis. In Principles and Practice of Parallel Programming (PPoPP), pages 123--134, 2013.
- G. Belter, E. R. Jessup, T. Nelson, B. Norris, and J. G. Siek. Reliable generation of high-performance matrix algebra. Computing Research Repository (CoRR), abs/1205.1098, 2012.
- P. Bientinesi, J. A. Gunnels, M. E. Myers, E. S. Quintana-Ortí, and R. A. van de Geijn. The science of deriving dense linear algebra algorithms. ACM Transactions on Mathematical Software (TOMS), 31(1):1--26, 2005.
- J. Bilmes, K. Asanovic, C.-W. Chin, and J. Demmel. Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology. In International Conference on Supercomputing (ICS), pages 340--347, 1997.
- J. J. Dongarra, J. Du Croz, S. Hammarling, and I. S. Duff. A set of level 3 basic linear algebra subprograms. ACM Transactions on Mathematical Software (TOMS), 16(1):1--17, 1990.
- J. J. Dongarra, J. Du Croz, S. Hammarling, and R. J. Hanson. An extended set of FORTRAN basic linear algebra subprograms. ACM Transactions on Mathematical Software (TOMS), 14(1):1--17, 1988.
- D. Fabregat-Traver and P. Bientinesi. A domain-specific compiler for linear algebra operations. In High Performance Computing for Computational Science (VECPAR 2012), volume 7851 of Lecture Notes in Computer Science (LNCS), pages 346--361. Springer, 2013.
- F. Franchetti, F. Mesmay, D. Mcfarlin, and M. Püschel. Operator language: A program generation framework for fast kernels. In IFIP Working Conference on Domain-Specific Languages (DSL WC), volume 5658 of Lecture Notes in Computer Science (LNCS), pages 385--410. Springer, 2009.
- F. Franchetti and M. Püschel. Generating SIMD vectorized permutations. In International Conference on Compiler Construction (CC), volume 4959 of Lecture Notes in Computer Science (LNCS), pages 116--131. Springer, 2008.
- F. Franchetti, Y. Voronenko, and M. Püschel. Formal loop merging for signal transforms. In Programming Language Design and Implementation (PLDI), pages 315--326, 2005.
- M. Frigge, D. C. Hoaglin, and B. Iglewicz. Some implementations of the boxplot. The American Statistician, 43(1):50--54, 1989.
- M. Frigo. A fast Fourier transform compiler. In Programming Language Design and Implementation (PLDI), pages 169--180, 1999.
- M. Frigo and S. G. Johnson. The design and implementation of FFTW3. Proceedings of the IEEE, 93(2):216--231, 2005.
- K. Goto and R. A. van de Geijn. Anatomy of high-performance matrix multiplication. ACM Transactions on Mathematical Software (TOMS), 34(3):12:1--12:25, 2008.
- P. Gottschling and C. Steinhardt. Meta-tuning in MTL4. In International Conference on Numerical Analysis and Applied Mathematics (ICNAAM), volume 1281, pages 778--782, 2010.
- P. Gottschling, D. S. Wise, and A. Joshi. Generic support of algorithmic and structural recursion for scientific computing. International Journal of Parallel, Emergent and Distributed Systems (IJPEDS), 24(6):479--503, 2009.
- G. Guennebaud, B. Jacob, et al. Eigen v3. http://eigen.tuxfamily.org.
- J. A. Gunnels, F. G. Gustavson, G. Henry, and R. A. van de Geijn. FLAME: Formal linear algebra methods environment. ACM Transactions on Mathematical Software (TOMS), 27(4):422--455, 2001.
- J. Guo, G. Bikshandi, B. B. Fraguela, M. J. Garzaran, and D. Padua. Programming with tiles. In Principles and Practice of Parallel Programming (PPoPP), pages 111--122, 2008.
- A. Hartono, M. M. Baskaran, C. Bastoul, A. Cohen, S. Krishnamoorthy, B. Norris, J. Ramanujam, and P. Sadayappan. Parametric multi-level tiling of imperfectly nested loops. In International Conference on Supercomputing (ICS), pages 147--157, 2009.
- Intel. Intel integrated performance primitives (IPP). http://software.intel.com/en-us/intel-ipp.
- Intel. Intel math kernel library (MKL). http://software.intel.com/en-us/intel-mkl.
- M. Kong, R. Veras, K. Stock, F. Franchetti, L.-N. Pouchet, and P. Sadayappan. When polyhedral transformations meet SIMD code generation. In Programming Language Design and Implementation (PLDI), pages 127--138, 2013.
- D. Nuzman, S. Dyshel, E. Rohou, I. Rosen, K. Williams, D. Yuste, A. Cohen, and A. Zaks. Vapor SIMD: Auto-vectorize once, run everywhere. In International Symposium on Code Generation and Optimization (CGO), pages 151--160, 2011.
- D. Nuzman, I. Rosen, and A. Zaks. Auto-vectorization of interleaved data for SIMD. In Programming Language Design and Implementation (PLDI), pages 132--143, 2006.
- M. Püschel, F. Franchetti, and Y. Voronenko. Encyclopedia of Parallel Computing, chapter Spiral. Springer, 2011.
- M. Püschel, J. M. F. Moura, J. Johnson, D. Padua, M. Veloso, B. Singer, J. Xiong, F. Franchetti, A. Gacic, Y. Voronenko, K. Chen, R. W. Johnson, and N. Rizzolo. SPIRAL: Code generation for DSP transforms. Proceedings of the IEEE, 93(2):232--275, 2005.
- J. Shin, M. Hall, J. Chame, C. Chen, and P. Hovland. Autotuning and specialization: Speeding up matrix multiply for small matrices with compiler technology. In K. Naono, K. Teranishi, J. Cavazos, and R. Suda, editors, Software Automatic Tuning, pages 353--370. Springer New York, 2010.
- J. Siek, I. Karlin, and E. Jessup. Build to order linear algebra kernels. In International Parallel & Distributed Processing Symposium (IPDPS), pages 1--8, 2008.
- F. G. Van Zee and R. A. van de Geijn. BLIS: A framework for rapidly instantiating BLAS functionality. ACM Transactions on Mathematical Software (TOMS). To appear.
- Y. Voronenko, F. de Mesmay, and M. Püschel. Computer generation of general size linear transform libraries. In International Symposium on Code Generation and Optimization (CGO), pages 102--113, 2009.
- J. Walter, M. Koch, et al. uBLAS. www.boost.org/libs/numeric.
- R. C. Whaley and J. J. Dongarra. Automatically tuned linear algebra software. In Supercomputing (SC), pages 1--27, 1998.
- K. Yotov, X. Li, G. Ren, M. Garzaran, D. Padua, K. Pingali, and P. Stodghill. Is search really necessary to generate high-performance BLAS? Proceedings of the IEEE, 93(2):358--386, 2005.
A Basic Linear Algebra Compiler. In CGO '14: Proceedings of the Annual IEEE/ACM International Symposium on Code Generation and Optimization.