research-article

A Holistic Approach to Automatic Mixed-Precision Code Generation and Tuning for Affine Programs

Authors:
Jinchen Xu

Information Engineering University, Zhengzhou, China

Information Engineering University, Zhengzhou, China

https://orcid.org/0000-0002-6275-2617
View Profile

,
Guanghui Song

Li Auto Inc., Shanghai, China

Li Auto Inc., Shanghai, China

https://orcid.org/0009-0001-8089-1151
View Profile

,
Bei Zhou

Information Engineering University, Zhengzhou, China

Information Engineering University, Zhengzhou, China

https://orcid.org/0000-0003-1515-0602
View Profile

,
Fei Li

Information Engineering University, Zhengzhou, China

Information Engineering University, Zhengzhou, China

https://orcid.org/0009-0002-1706-3625
View Profile

,
Jiangwei Hao

Information Engineering University, Zhengzhou, China

Information Engineering University, Zhengzhou, China

https://orcid.org/0000-0001-6012-0870
View Profile

,
Jie Zhao

College of Computer Science and Electronic Engineering, Hunan University, Changsha, China

College of Computer Science and Electronic Engineering, Hunan University, Changsha, China

https://orcid.org/0000-0003-2303-9736
View Profile

PPoPP '24: Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel ProgrammingMarch 2024Pages 55–67https://doi.org/10.1145/3627535.3638484

Published:20 February 2024Publication History

PPoPP '24: Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming

Pages 55–67

ABSTRACT

Reducing floating-point (FP) precision is used to trade the quality degradation of a numerical program's output for performance, but this optimization coincides with type casting, whose overhead is undisclosed until a mixed-precision code version is generated. This uncertainty enforces the decoupled implementation of mixed-precision code generation and autotuning in prior work. In this paper, we present a holistic approach called PrecTuner that consolidates the mixed-precision code generator and the autotuner by defining one parameter. This parameter is first initialized by some automatically sampled values and used to generate several code variants, with various loop transformations also taken into account. The generated code variants are next profiled to solve a performance model formulated using the aforementioned parameter, possibly under a pre-defined quality degradation budget. The best-performing value of the defined parameter is finally predicted without evaluating all code variants. Experimental results of the PolyBench benchmarks on CPU demonstrate that PrecTuner outperforms LuIs by 3.28× while achieving smaller errors, and we also validate its effectiveness in optimizing a real-life large-scale application. In addition, PrecTuner also obtains a mean speedup of 1.81× and 1.52×-1.73× over Pluto on single- and multi-core CPU, respectively, and 1.71× over PPCG on GPU.

References

Aravind Acharya, Uday Bondhugula, and Albert Cohen. 2018. Polyhedral Auto-transformation with No Integer Linear Programming. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation (Philadelphia, PA, USA) (PLDI 2018). ACM, New York, NY, USA, 529--542. Google ScholarDigital Library
Jason Ansel, Cy Chan, Yee Lok Wong, Marek Olszewski, Qin Zhao, Alan Edelman, and Saman Amarasinghe. 2009. PetaBricks: A Language and Compiler for Algorithmic Choice. In Proceedings of the 30th ACM SIGPLAN Conference on Programming Language Design and Implementation (Dublin, Ireland) (PLDI'09). Association for Computing Machinery, New York, NY, USA, 38--49. Google ScholarDigital Library
IEEE Standards Association. 2019. IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2019 (Revision of IEEE 754-2008) (2019), 1--84. Google ScholarCross Ref
Cédric Bastoul. 2004. Code Generation in the Polyhedral Model Is Easier Than You Think. In Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques (PACT '04). IEEE Computer Society, Washington, DC, USA, 7--16. Google ScholarDigital Library
Atılım Günes Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. 2017. Automatic Differentiation in Machine Learning: A Survey. J. Mach. Learn. Res. 18, 1 (jan 2017), 5595--5637.Google Scholar
Uday Bondhugula, Vinayaka Bandishti, Albert Cohen, Guillain Potron, and Nicolas Vasilache. 2014. Tiling and Optimizing Time-Iterated Computations on Periodic Domains. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation (Edmonton, AB, Canada) (PACT '14). Association for Computing Machinery, New York, NY, USA, 39--50. Google ScholarDigital Library
Uday Bondhugula, Albert Hartono, J. Ramanujam, and P. Sadayappan. 2008. A Practical Automatic Polyhedral Parallelizer and Locality Optimizer. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation (Tucson, AZ, USA) (PLDI'08). ACM, New York, NY, USA, 101--113. Google ScholarDigital Library
Daniele Cattaneo, Michele Chiari, Nicola Fossati, Stefano Cherubin, and Giovanni Agosta. 2021. Architecture-Aware Precision Tuning with Multiple Number Representation Systems. In 2021 58th ACM/IEEE Design Automation Conference (DAC) (San Francisco, CA, USA). IEEE Press, 673--678. Google ScholarDigital Library
Stefano Cherubin and Giovanni Agosta. 2020. Tools for Reduced Precision Computation: A Survey. ACM Comput. Surv. 53, 2, Article 33 (apr 2020), 35 pages. Google ScholarDigital Library
Stefano Cherubin, Daniele Cattaneo, Michele Chiari, Antonio Di Bello, and Giovanni Agosta. 2020. TAFFO: Tuning Assistant for Floating to Fixed Point Optimization. IEEE Embedded Systems Letters 12, 1 (2020), 5--8. Google ScholarDigital Library
Eva Darulova and Viktor Kuncak. 2017. Towards a Compiler for Reals. ACM Trans. Program. Lang. Syst. 39, 2, Article 8 (mar 2017), 28 pages. Google ScholarDigital Library
Marc Daumas and Guillaume Melquiond. 2010. Certification of Bounds on Expressions Involving Rounded Operators. ACM Trans. Math. Softw. 37, 1, Article 2 (jan 2010), 20 pages. Google ScholarDigital Library
Paul Feautrier. 1991. Dataflow analysis of array and scalar references. International Journal of Parallel Programming 20, 1 (01 Feb 1991), 23--53. Google ScholarCross Ref
Paul Feautrier. 1992. Some efficient solutions to the affine scheduling problem. Part II. Multidimensional time. International journal of parallel programming 21, 6 (1992), 389--420.Google ScholarDigital Library
Paul Feautrier and Christian Lengauer. 2011. Polyhedron Model. Springer US, Boston, MA, 1581--1592. Google ScholarCross Ref
Eric Goubault and Sylvie Putot. 2006. Static Analysis of Numerical Algorithms. In Static Analysis, Kwangkeun Yi (Ed.). Springer Berlin Heidelberg, Berlin, Heidelberg, 18--34.Google Scholar
Martin Griebl, Paul Feautrier, and Christian Lengauer. 2000. Index Set Splitting. Int. J. Parallel Program. 28, 6 (Dec. 2000), 607--631. Google ScholarCross Ref
Tobias Grosser, Armin Groesslinger, and Christian Lengauer. 2012. Polly-Performing Polyhedral Optimizations on a Low-level Intermediate Representation. Parallel Processing Letters 22, 04 (2012), 1250010.Google ScholarCross Ref
Tobias Grosser, Sven Verdoolaege, and Albert Cohen. 2015. Polyhedral AST Generation Is More Than Scanning Polyhedra. ACM Trans. Program. Lang. Syst. 37, 4, Article 12 (July 2015), 50 pages. Google ScholarDigital Library
Hui Guo and Cindy Rubio-González. 2018. Exploiting Community Structure for Floating-Point Precision Tuning. In Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis (Amsterdam, Netherlands) (ISSTA 2018). Association for Computing Machinery, New York, NY, USA, 333--343. Google ScholarDigital Library
Ken Kennedy and John R. Allen. 2002. Optimizing Compilers for Modern Architectures: A Dependence-based Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.Google ScholarDigital Library
Pradeep V Kotipalli, Ranvijay Singh, Paul Wood, Ignacio Laguna, and Saurabh Bagchi. 2019. AMPT-GA: Automatic Mixed Precision Floating Point Tuning for GPU Applications (ICS '19). Association for Computing Machinery, New York, NY, USA, 160--170. Google ScholarDigital Library
Ki-Il Kum, Jiyang Kang, and Wonyong Sung. 2000. AUTOSCALER for C: an optimizing floating-point to integer C program converter for fixed-point digital signal processors. IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing 47, 9 (2000), 840--848. Google ScholarCross Ref
Ignacio Laguna, Paul C. Wood, Ranvijay Singh, and Saurabh Bagchi. 2019. GPUMixer: Performance-Driven Floating-Point Tuning for GPU Scientific Applications. In High Performance Computing, Michèle Weiland, Guido Juckeland, Carsten Trinitis, and Ponnuswamy Sadayappan (Eds.). Springer International Publishing, Cham, 227--246.Google Scholar
Michael O. Lam, Jeffrey K. Hollingsworth, Bronis R. de Supinski, and Matthew P. Legendre. 2013. Automatically Adapting Programs for Mixed-Precision Floating-Point Computation. In Proceedings of the 27th International ACM Conference on International Conference on Supercomputing (Eugene, Oregon, USA) (ICS '13). Association for Computing Machinery, New York, NY, USA, 369--378. Google ScholarDigital Library
Victor Magron, George Constantinides, and Alastair Donaldson. 2017. Certified Roundoff Error Bounds Using Semidefinite Programming. ACM Trans. Math. Softw. 43, 4, Article 34 (jan 2017), 31 pages. Google ScholarDigital Library
Daniel Menard, Daniel Chillet, François Charot, and Olivier Sentieys. 2002. Automatic Floating-Point to Fixed-Point Conversion for DSP Code Generation. In Proceedings of the 2002 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (Grenoble, France) (CASES '02). Association for Computing Machinery, New York, NY, USA, 270--276. Google ScholarDigital Library
Harshitha Menon, Michael O. Lam, Daniel Osei-Kuffuor, Markus Schordan, Scott Lloyd, Kathryn Mohror, and Jeffrey Hittinger. 2018. ADAPT: Algorithmic Differentiation Applied to Floating-Point Precision Tuning. In SC18: International Conference for High Performance Computing, Networking, Storage and Analysis. 614--626. Google ScholarDigital Library
Ricardo Nobre, Luís Reis, João Bispo, Tiago Carvalho, João M.P. Cardoso, Stefano Cherubin, and Giovanni Agosta. 2018. Aspect-Driven Mixed-Precision Tuning Targeting GPUs. In Proceedings of the 9th Workshop and 7th Workshop on Parallel Programming and RunTime Management Techniques for Manycore Architectures and Design Tools and Architectures for Multicore Embedded Computing Platforms (Manchester, United Kingdom) (PARMA-DITAM '18). Association for Computing Machinery, New York, NY, USA, 26--31. Google ScholarDigital Library
Arjun Pitchanathan, Christian Ulmann, Michel Weber, Torsten Hoefler, and Tobias Grosser. 2021. FPL: Fast Presburger Arithmetic through Transprecision. Proc. ACM Program. Lang. 5, OOPSLA, Article 162 (oct 2021), 26 pages. Google ScholarDigital Library
Louis-Noël Pouchet, Uday Bondhugula, Cédric Bastoul, Albert Cohen, J. Ramanujam, P. Sadayappan, and Nicolas Vasilache. 2011. Loop Transformations: Convexity, Pruning and Optimization. In Proceedings of the 38th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (Austin, Texas, USA) (POPL'11). Association for Computing Machinery, New York, NY, USA, 549--562. Google ScholarDigital Library
Louis-Noël Pouchet and Tomofumi Yuki. 2016. PolyBench/C 4.2. https://sourceforge.net/projects/polybenchGoogle Scholar
Cindy Rubio-González, Cuong Nguyen, Benjamin Mehne, Koushik Sen, James Demmel, William Kahan, Costin Iancu, Wim Lavrijsen, David H. Bailey, and David Hough. 2016. Floating-Point Precision Tuning Using Blame Analysis. In Proceedings of the 38th International Conference on Software Engineering (Austin, Texas) (ICSE '16). Association for Computing Machinery, New York, NY, USA, 1074--1085. Google ScholarDigital Library
Cindy Rubio-González, Cuong Nguyen, Hong Diep Nguyen, James Demmel, William Kahan, Koushik Sen, David H. Bailey, Costin Iancu, and David Hough. 2013. Precimonious: Tuning Assistant for FloatingPoint Precision. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (Denver, Colorado) (SC '13). Association for Computing Machinery, New York, NY, USA, Article 27, 12 pages. Google ScholarDigital Library
Alex Sanchez-Stern, Pavel Panchekha, Sorin Lerner, and Zachary Tatlock. 2018. Finding Root Causes of Floating Point Error. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation (Philadelphia, PA, USA) (PLDI 2018). Association for Computing Machinery, New York, NY, USA, 256--269. Google ScholarDigital Library
Alexey Solovyev, Marek S. Baranowski, Ian Briggs, Charles Jacobsen, Zvonimir Rakamarić, and Ganesh Gopalakrishnan. 2018. Rigorous Estimation of Floating-Point Round-Off Errors with Symbolic Taylor Expansions. ACM Trans. Program. Lang. Syst. 41, 1, Article 2 (dec 2018), 39 pages. Google ScholarDigital Library
Giuseppe Tagliavini, Stefan Mach, Davide Rossi, Andrea Marongiu, and Luca Benini. 2018. A transprecision floating-point platform for ultra-low power computing. In 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE). 1051--1056. Google ScholarCross Ref
Xiaohan Tao, Yu Zhu, Boyang Wang, Jinlong Xu, Jianmin Pang, and Jie Zhao. 2023. Automatically Generating High-Performance Matrix Multiplication Kernels on the Latest Sunway Processor. In Proceedings of the 51st International Conference on Parallel Processing (Bordeaux, France) (ICPP '22). Association for Computing Machinery, New York, NY, USA, Article 52, 12 pages. Google ScholarDigital Library
Seoul National University. 2019. SNU NPB 2019 Suite. http://snuvm.snu.ac.kr/software/snu-npb-2019/Google Scholar
Anand Venkat, Manu Shantharam, Mary Hall, and Michelle Mills Strout. 2018. Non-Affine Extensions to Polyhedral Code Generation. In Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization (Orlando, FL, USA) (CGO '14). Association for Computing Machinery, New York, NY, USA, 185--194. Google ScholarDigital Library
Sven Verdoolaege. 2010. Isl: An Integer Set Library for the Polyhedral Model. In Proceedings of the Third International Congress Conference on Mathematical Software (Kobe, Japan) (ICMS'10). Springer-Verlag, Berlin, Heidelberg, 299--302. Google ScholarCross Ref
Sven Verdoolaege, Juan Carlos Juega, Albert Cohen, José Ignacio Gömez, Christian Tenllado, and Francky Catthoor. 2013. Polyhedral Parallel Code Generation for CUDA. ACM Trans. Archit. Code Optim. 9, 4, Article 54 (Jan. 2013), 23 pages. Google ScholarDigital Library
Sven Verdoolaege and Tobias Grosser. 2012. Polyhedral extraction tool. In Second International Workshop on Polyhedral Compilation Techniques (IMPACT'12), Paris, France, Vol. 141.Google Scholar
Jie Zhao and Peng Di. 2020. Optimizing the Memory Hierarchy by Compositing Automatic Transformations on Computations and Data. In 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-53) (Vitural Event). IEEE, Piscataway, NJ, USA, 427--441. Google ScholarCross Ref
Jie Zhao, Michael Kruse, and Albert Cohen. 2018. A Polyhedral Compilation Framework for Loops with Dynamic Data-Dependent Bounds. In Proceedings of the 27th International Conference on Compiler Construction (Vienna, Austria) (CC 2018). Association for Computing Machinery, New York, NY, USA, 14--24. Google ScholarDigital Library
Jie Zhao, Bojie Li, Wang Nie, Zhen Geng, Renwei Zhang, Xiong Gao, Bin Cheng, Chen Wu, Yun Cheng, Zheng Li, Peng Di, Kun Zhang, and Xuefeng Jin. 2021. AKG: Automatic Kernel Generation for Neural Processing Units Using Polyhedral Transformations. In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation (PLDI'21). Association for Computing Machinery, New York, NY, USA, 1233--1248. Google ScholarDigital Library

Index Terms

A Holistic Approach to Automatic Mixed-Precision Code Generation and Tuning for Affine Programs
1. Mathematics of computing
  1. Mathematical analysis
    1. Numerical analysis
      1. Arbitrary-precision arithmetic
2. Software and its engineering
  1. Software notations and tools
    1. Compilers
      1. Translator writing systems and compiler generators

Recommendations

Automatic code generation and tuning for stencil kernels on modern shared memory architectures

In this paper, we present Patus, a code generation and auto-tuning framework for stencil computations targeted at multi- and manycore processors, such as multicore CPUs and graphics processing units. Patus, which stands for " P arallel A uto tu ned S ...
Read More
Compiler-based code generation and autotuning for geometric multigrid on GPU-accelerated supercomputers
Highlights
- Generate parallel CUDA code from sequential C input code using a compiler-based tool for key operators in Geometric Multigrid.
Abstract
GPUs, with their high bandwidths and computational capabilities are an increasingly popular target for scientific computing. Unfortunately, to date, harnessing the power of the GPU has required use of a GPU-specific programming model ...
Read More
Automatic C-to-CUDA code generation for affine programs
CC'10/ETAPS'10: Proceedings of the 19th joint European conference on Theory and Practice of Software, international conference on Compiler Construction

Graphics Processing Units (GPUs) offer tremendous computational power. CUDA (Compute Unified Device Architecture) provides a multi-threaded parallel programming model, facilitating high performance implementations of general-purpose computations. ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
PPoPP '24: Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming
March 2024
498 pages
ISBN:9798400704352
DOI:10.1145/3627535
Chair:
Michel Steuwer,
Program Chairs:
I-Ting Angelina Lee,
Milind Chabbi
Uber Technologies Inc.
Copyright © 2024 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 20 February 2024
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Badges
- Artifacts Available / v1.1
Author Tags
polyhedral model
performance
autotuning
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate230of1,014submissions,23%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 0
  Total Citations
  View Citations
- 244
  Total Downloads
- Downloads (Last 12 months)244
- Downloads (Last 6 weeks)89
Other Metrics
View Author Metrics
Cited By
This publication has not been cited yet

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

A Holistic Approach to Automatic Mixed-Precision Code Generation and Tuning for Affine Programs

PPoPP '24: Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming

ABSTRACT

References

Cited By

Index Terms

Recommendations

Automatic code generation and tuning for stencil kernels on modern shared memory architectures

Compiler-based code generation and autotuning for geometric multigrid on GPU-accelerated supercomputers

Automatic C-to-CUDA code generation for affine programs