DOI: 10.1145/2581122.2544155

A Basic Linear Algebra Compiler

Published: 15 February 2014

ABSTRACT

Many applications in media processing, control, graphics, and other domains require efficient small-scale linear algebra computations. However, most existing high-performance libraries for linear algebra, such as ATLAS or Intel MKL, are geared more towards large-scale problems (matrix sizes in the hundreds and larger) and towards specific interfaces (e.g., BLAS). In this paper we present LGen: a compiler for small-scale, basic linear algebra computations. The input to LGen is a fixed-size linear algebra expression; the output is a corresponding C function, optionally including intrinsics to efficiently use SIMD vector extensions. LGen generates code using two levels of mathematical domain-specific languages (DSLs). The DSLs are used to perform tiling, loop fusion, and vectorization at a high level of abstraction, before the final code is generated. In addition, search is used to select among alternative generated implementations. We show benchmarks of code generated by LGen against Intel MKL and IPP as well as against alternative generators, such as the C++ template-based Eigen and the BTO compiler. The achieved speedup is typically about a factor of two to three.

References

  1. E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK Users' Guide. Society for Industrial and Applied Mathematics, Philadelphia, PA, third edition, 1999.
  2. G. Barthe, J. M. Crespo, S. Gulwani, C. Kunz, and M. Marron. From relational verification to SIMD loop synthesis. In Principles and Practice of Parallel Programming (PPoPP), pages 123--134, 2013.
  3. G. Belter, E. R. Jessup, T. Nelson, B. Norris, and J. G. Siek. Reliable generation of high-performance matrix algebra. Computing Research Repository (CoRR), abs/1205.1098, 2012.
  4. P. Bientinesi, J. A. Gunnels, M. E. Myers, E. S. Quintana-Ortí, and R. A. v. d. Geijn. The science of deriving dense linear algebra algorithms. ACM Transactions on Mathematical Software (TOMS), 31(1):1--26, 2005.
  5. J. Bilmes, K. Asanovic, C.-W. Chin, and J. Demmel. Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology. In International Conference on Supercomputing (ICS), pages 340--347, 1997.
  6. J. J. Dongarra, J. Du Croz, S. Hammarling, and I. S. Duff. A set of level 3 basic linear algebra subprograms. ACM Transactions on Mathematical Software (TOMS), 16(1):1--17, 1990.
  7. J. J. Dongarra, J. Du Croz, S. Hammarling, and R. J. Hanson. An extended set of FORTRAN basic linear algebra subprograms. ACM Transactions on Mathematical Software (TOMS), 14(1):1--17, 1988.
  8. D. Fabregat-Traver and P. Bientinesi. A domain-specific compiler for linear algebra operations. In High Performance Computing for Computational Science (VECPAR 2012), volume 7851 of Lecture Notes in Computer Science (LNCS), pages 346--361. Springer, 2013.
  9. F. Franchetti, F. de Mesmay, D. McFarlin, and M. Püschel. Operator language: A program generation framework for fast kernels. In IFIP Working Conference on Domain-Specific Languages (DSL WC), volume 5658 of Lecture Notes in Computer Science (LNCS), pages 385--410. Springer, 2009.
  10. F. Franchetti and M. Püschel. Generating SIMD vectorized permutations. In International Conference on Compiler Construction (CC), volume 4959 of Lecture Notes in Computer Science (LNCS), pages 116--131. Springer, 2008.
  11. F. Franchetti, Y. Voronenko, and M. Püschel. Formal loop merging for signal transforms. In Programming Language Design and Implementation (PLDI), pages 315--326, 2005.
  12. M. Frigge, D. C. Hoaglin, and B. Iglewicz. Some implementations of the boxplot. The American Statistician, 43(1):50--54, 1989.
  13. M. Frigo. A fast Fourier transform compiler. In Programming Language Design and Implementation (PLDI), pages 169--180, 1999.
  14. M. Frigo and S. G. Johnson. The design and implementation of FFTW3. Proceedings of the IEEE, 93(2):216--231, 2005.
  15. K. Goto and R. A. v. d. Geijn. Anatomy of high-performance matrix multiplication. ACM Transactions on Mathematical Software (TOMS), 34(3):12:1--12:25, 2008.
  16. P. Gottschling and C. Steinhardt. Meta-tuning in MTL4. In International Conference on Numerical Analysis and Applied Mathematics (ICNAAM), volume 1281, pages 778--782, 2010.
  17. P. Gottschling, D. S. Wise, and A. Joshi. Generic support of algorithmic and structural recursion for scientific computing. International Journal of Parallel, Emergent and Distributed Systems (IJPEDS), 24(6):479--503, 2009.
  18. G. Guennebaud, B. Jacob, et al. Eigen v3. http://eigen.tuxfamily.org.
  19. J. A. Gunnels, F. G. Gustavson, G. Henry, and R. A. van de Geijn. FLAME: Formal linear algebra methods environment. ACM Transactions on Mathematical Software (TOMS), 27(4):422--455, 2001.
  20. J. Guo, G. Bikshandi, B. B. Fraguela, M. J. Garzaran, and D. Padua. Programming with tiles. In Principles and Practice of Parallel Programming (PPoPP), pages 111--122, 2008.
  21. A. Hartono, M. M. Baskaran, C. Bastoul, A. Cohen, S. Krishnamoorthy, B. Norris, J. Ramanujam, and P. Sadayappan. Parametric multi-level tiling of imperfectly nested loops. In International Conference on Supercomputing (ICS), pages 147--157, 2009.
  22. Intel. Intel integrated performance primitives (IPP). http://software.intel.com/en-us/intel-ipp.
  23. Intel. Intel math kernel library (MKL). http://software.intel.com/en-us/intel-mkl.
  24. M. Kong, R. Veras, K. Stock, F. Franchetti, L.-N. Pouchet, and P. Sadayappan. When polyhedral transformations meet SIMD code generation. In Programming Language Design and Implementation (PLDI), pages 127--138, 2013.
  25. D. Nuzman, S. Dyshel, E. Rohou, I. Rosen, K. Williams, D. Yuste, A. Cohen, and A. Zaks. Vapor SIMD: Auto-vectorize once, run everywhere. In International Symposium on Code Generation and Optimization (CGO), pages 151--160, 2011.
  26. D. Nuzman, I. Rosen, and A. Zaks. Auto-vectorization of interleaved data for SIMD. In Programming Language Design and Implementation (PLDI), pages 132--143, 2006.
  27. M. Püschel, F. Franchetti, and Y. Voronenko. Encyclopedia of Parallel Computing, chapter Spiral. Springer, 2011.
  28. M. Püschel, J. M. F. Moura, J. Johnson, D. Padua, M. Veloso, B. Singer, J. Xiong, F. Franchetti, A. Gacic, Y. Voronenko, K. Chen, R. W. Johnson, and N. Rizzolo. SPIRAL: Code generation for DSP transforms. Proceedings of the IEEE, 93(2):232--275, 2005.
  29. J. Shin, M. Hall, J. Chame, C. Chen, and P. Hovland. Autotuning and specialization: Speeding up matrix multiply for small matrices with compiler technology. In K. Naono, K. Teranishi, J. Cavazos, and R. Suda, editors, Software Automatic Tuning, pages 353--370. Springer New York, 2010.
  30. J. Siek, I. Karlin, and E. Jessup. Build to order linear algebra kernels. In International Parallel & Distributed Processing Symposium (IPDPS), pages 1--8, 2008.
  31. F. G. Van Zee and R. A. van de Geijn. BLIS: A framework for rapidly instantiating BLAS functionality. ACM Transactions on Mathematical Software (TOMS). To appear.
  32. Y. Voronenko, F. de Mesmay, and M. Püschel. Computer generation of general size linear transform libraries. In International Symposium on Code Generation and Optimization (CGO), pages 102--113, 2009.
  33. J. Walter, M. Koch, et al. uBLAS. www.boost.org/libs/numeric.
  34. R. C. Whaley and J. J. Dongarra. Automatically tuned linear algebra software. In Supercomputing (SC), pages 1--27, 1998.
  35. K. Yotov, X. Li, G. Ren, M. Garzaran, D. Padua, K. Pingali, and P. Stodghill. Is search really necessary to generate high-performance BLAS? Proceedings of the IEEE, 93(2):358--386, 2005.


      Reviews

      Mike Minkoff

Computational linear algebra plays a fundamental role in computational science, owing to its use across a wide range of applications, notably computational fluid dynamics, control theory, graphics, and the solution of differential equations. While there are a number of production linear algebra libraries, such as ATLAS and the Intel Math Kernel Library (MKL), newer approaches develop compilers that use automated techniques to achieve even higher performance. This paper works in that area. It addresses small-scale linear algebra computations with LGen, which takes a fixed-size linear algebra expression as input and outputs C code, optionally with intrinsics for single instruction, multiple data (SIMD) vector extensions. LGen generates code using domain-specific languages (DSLs), performing tiling, loop fusion, and vectorization at a high level of abstraction before the final code is generated. The authors provide benchmarks comparing code generated by LGen against Intel MKL, Intel Integrated Performance Primitives (IPP), and other generators such as Eigen (a C++ template library for linear algebra) and the build-to-order (BTO) BLAS compiler. The generated code typically improves performance by a factor of two to three. The paper's seven sections begin with an introduction, including historical background, and an overview. This is followed by a section on scalar code generation that includes timing and loop optimization results, with figures to illustrate them. The next section, on vector code generation, provides performance results, and section 5 presents computational experiments. The final two sections address limitations and future work and present a conclusion. The paper is quite thorough and includes 35 references.
This is a highly interesting and well-done paper: it combines compiler methodology with high-performance numerical methods, and it reports computational results against current state-of-the-art production numerical libraries. It is an excellent paper to study in order to understand the topics it addresses. Online Computing Reviews Service

      • Published in

        CGO '14: Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization
        February 2014
        328 pages
ISBN: 9781450326704
DOI: 10.1145/2581122

        Copyright © 2014 ACM


Publisher: Association for Computing Machinery, New York, NY, United States



        Qualifiers

        • tutorial
        • Research
        • Refereed limited

        Acceptance Rates

CGO '14 paper acceptance rate: 29 of 100 submissions (29%). Overall acceptance rate: 312 of 1,061 submissions (29%).
