ABSTRACT
Many applications in media processing, control, graphics, and other domains require efficient small-scale linear algebra computations. However, most existing high-performance libraries for linear algebra, such as ATLAS or Intel MKL, are geared more towards large-scale problems (matrix sizes in the hundreds and larger) and towards specific interfaces (e.g., BLAS). In this paper we present LGen: a compiler for small-scale, basic linear algebra computations. The input to LGen is a fixed-size linear algebra expression; the output is a corresponding C function, optionally including intrinsics to efficiently use SIMD vector extensions. LGen generates code using two levels of mathematical domain-specific languages (DSLs). The DSLs are used to perform tiling, loop fusion, and vectorization at a high level of abstraction, before the final code is generated. In addition, search is used to select among alternative generated implementations. We show benchmarks of code generated by LGen against Intel MKL and IPP as well as against alternative generators, such as the C++ template-based Eigen and the BTO compiler. The achieved speedup is typically a factor of two to three.
- E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK Users' Guide. Society for Industrial and Applied Mathematics, Philadelphia, PA, third edition, 1999.
- G. Barthe, J. M. Crespo, S. Gulwani, C. Kunz, and M. Marron. From relational verification to SIMD loop synthesis. In Principles and Practice of Parallel Programming (PPoPP), pages 123--134, 2013.
- G. Belter, E. R. Jessup, T. Nelson, B. Norris, and J. G. Siek. Reliable generation of high-performance matrix algebra. Computing Research Repository (CoRR), abs/1205.1098, 2012.
- P. Bientinesi, J. A. Gunnels, M. E. Myers, E. S. Quintana-Ortí, and R. A. van de Geijn. The science of deriving dense linear algebra algorithms. ACM Transactions on Mathematical Software (TOMS), 31(1):1--26, 2005.
- J. Bilmes, K. Asanovic, C.-W. Chin, and J. Demmel. Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology. In International Conference on Supercomputing (ICS), pages 340--347, 1997.
- J. J. Dongarra, J. Du Croz, S. Hammarling, and I. S. Duff. A set of level 3 basic linear algebra subprograms. ACM Transactions on Mathematical Software (TOMS), 16(1):1--17, 1990.
- J. J. Dongarra, J. Du Croz, S. Hammarling, and R. J. Hanson. An extended set of FORTRAN basic linear algebra subprograms. ACM Transactions on Mathematical Software (TOMS), 14(1):1--17, 1988.
- D. Fabregat-Traver and P. Bientinesi. A domain-specific compiler for linear algebra operations. In High Performance Computing for Computational Science (VECPAR 2012), volume 7851 of Lecture Notes in Computer Science (LNCS), pages 346--361. Springer, 2013.
- F. Franchetti, F. Mesmay, D. Mcfarlin, and M. Püschel. Operator language: A program generation framework for fast kernels. In IFIP Working Conference on Domain-Specific Languages (DSL WC), volume 5658 of Lecture Notes in Computer Science (LNCS), pages 385--410. Springer, 2009.
- F. Franchetti and M. Püschel. Generating SIMD vectorized permutations. In International Conference on Compiler Construction (CC), volume 4959 of Lecture Notes in Computer Science (LNCS), pages 116--131. Springer, 2008.
- F. Franchetti, Y. Voronenko, and M. Püschel. Formal loop merging for signal transforms. In Programming Language Design and Implementation (PLDI), pages 315--326, 2005.
- M. Frigge, D. C. Hoaglin, and B. Iglewicz. Some implementations of the boxplot. The American Statistician, 43(1):50--54, 1989.
- M. Frigo. A fast Fourier transform compiler. In Programming Language Design and Implementation (PLDI), pages 169--180, 1999.
- M. Frigo and S. G. Johnson. The design and implementation of FFTW3. Proceedings of the IEEE, 93(2):216--231, 2005.
- K. Goto and R. A. van de Geijn. Anatomy of high-performance matrix multiplication. ACM Transactions on Mathematical Software (TOMS), 34(3):12:1--12:25, 2008.
- P. Gottschling and C. Steinhardt. Meta-tuning in MTL4. In International Conference on Numerical Analysis and Applied Mathematics (ICNAAM), volume 1281, pages 778--782, 2010.
- P. Gottschling, D. S. Wise, and A. Joshi. Generic support of algorithmic and structural recursion for scientific computing. International Journal of Parallel, Emergent and Distributed Systems (IJPEDS), 24(6):479--503, 2009.
- G. Guennebaud, B. Jacob, et al. Eigen v3. http://eigen.tuxfamily.org.
- J. A. Gunnels, F. G. Gustavson, G. Henry, and R. A. van de Geijn. FLAME: Formal linear algebra methods environment. ACM Transactions on Mathematical Software (TOMS), 27(4):422--455, 2001.
- J. Guo, G. Bikshandi, B. B. Fraguela, M. J. Garzaran, and D. Padua. Programming with tiles. In Principles and Practice of Parallel Programming (PPoPP), pages 111--122, 2008.
- A. Hartono, M. M. Baskaran, C. Bastoul, A. Cohen, S. Krishnamoorthy, B. Norris, J. Ramanujam, and P. Sadayappan. Parametric multi-level tiling of imperfectly nested loops. In International Conference on Supercomputing (ICS), pages 147--157, 2009.
- Intel. Intel integrated performance primitives (IPP). http://software.intel.com/en-us/intel-ipp.
- Intel. Intel math kernel library (MKL). http://software.intel.com/en-us/intel-mkl.
- M. Kong, R. Veras, K. Stock, F. Franchetti, L.-N. Pouchet, and P. Sadayappan. When polyhedral transformations meet SIMD code generation. In Programming Language Design and Implementation (PLDI), pages 127--138, 2013.
- D. Nuzman, S. Dyshel, E. Rohou, I. Rosen, K. Williams, D. Yuste, A. Cohen, and A. Zaks. Vapor SIMD: Auto-vectorize once, run everywhere. In International Symposium on Code Generation and Optimization (CGO), pages 151--160, 2011.
- D. Nuzman, I. Rosen, and A. Zaks. Auto-vectorization of interleaved data for SIMD. In Programming Language Design and Implementation (PLDI), pages 132--143, 2006.
- M. Püschel, F. Franchetti, and Y. Voronenko. Encyclopedia of Parallel Computing, chapter Spiral. Springer, 2011.
- M. Püschel, J. M. F. Moura, J. Johnson, D. Padua, M. Veloso, B. Singer, J. Xiong, F. Franchetti, A. Gacic, Y. Voronenko, K. Chen, R. W. Johnson, and N. Rizzolo. SPIRAL: Code generation for DSP transforms. Proceedings of the IEEE, 93(2):232--275, 2005.
- J. Shin, M. Hall, J. Chame, C. Chen, and P. Hovland. Autotuning and specialization: Speeding up matrix multiply for small matrices with compiler technology. In K. Naono, K. Teranishi, J. Cavazos, and R. Suda, editors, Software Automatic Tuning, pages 353--370. Springer New York, 2010.
- J. Siek, I. Karlin, and E. Jessup. Build to order linear algebra kernels. In International Parallel & Distributed Processing Symposium (IPDPS), pages 1--8, 2008.
- F. G. Van Zee and R. A. van de Geijn. BLIS: A framework for rapidly instantiating BLAS functionality. ACM Transactions on Mathematical Software (TOMS). To appear.
- Y. Voronenko, F. de Mesmay, and M. Püschel. Computer generation of general size linear transform libraries. In International Symposium on Code Generation and Optimization (CGO), pages 102--113, 2009.
- J. Walter, M. Koch, et al. uBLAS. www.boost.org/libs/numeric.
- R. C. Whaley and J. J. Dongarra. Automatically tuned linear algebra software. In Supercomputing (SC), pages 1--27, 1998.
- K. Yotov, X. Li, G. Ren, M. Garzaran, D. Padua, K. Pingali, and P. Stodghill. Is search really necessary to generate high-performance BLAS? Proceedings of the IEEE, 93(2):358--386, 2005.
A Basic Linear Algebra Compiler. In CGO '14: Proceedings of the Annual IEEE/ACM International Symposium on Code Generation and Optimization.