Efficient Multiple-Precision and Mixed-Precision Floating-Point Fused Multiply-Accumulate Unit for HPC and AI Applications

  • Conference paper
  • First Online:
Algorithms and Architectures for Parallel Processing (ICA3PP 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13777)

Abstract

In this paper, a multiple-precision and mixed-precision floating-point fused multiply-accumulate (FMA) unit is proposed based on the practical requirements of high-performance computing (HPC) and artificial intelligence (AI) applications. In addition to the double-precision and single-precision formats used in HPC, the unit supports three low-precision formats dedicated to deep learning tasks: TensorFloat-32, BFloat16, and half-precision. The proposed FMA architecture can execute one double-precision operation, two parallel single-precision operations, or four half-precision operations per clock cycle. Mixed-precision FMA operations are also supported, in which the products of lower-precision multiplications are accumulated to a higher-precision addend: either one mixed-precision operation using single-precision multiplication with double-precision addition, or two parallel mixed-precision operations using low-precision (TensorFloat-32, BFloat16, or half-precision) multiplication with single-precision addition, can be performed every clock cycle. The presented design uses both segmentation and reuse to trade off performance (throughput and latency) against area and power. The proposed FMA unit is only 17.0% larger in area than a standard double-precision FMA implementation, yet supports multiple-precision and mixed-precision operations. Compared with the state-of-the-art multiple-precision FMA design, the proposed FMA supports more precision formats, such as TensorFloat-32 and BFloat16, with less hardware overhead.
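
To make the mixed-precision FMA semantics described in the abstract concrete, the following is a minimal behavioral sketch in C, not the authors' hardware architecture. It models the two accumulation patterns mentioned above: single-precision multiply with double-precision accumulate, and BFloat16 multiply with single-precision accumulate. The helper names (fma_sp_dp, fma_bf16_sp, bf16_to_fp32) are illustrative assumptions, and BFloat16 values are modeled simply as the upper 16 bits of an IEEE-754 binary32 encoding.

```c
/* Behavioral sketch of mixed-precision FMA semantics (not the paper's RTL). */
#include <stdint.h>
#include <string.h>
#include <stdio.h>

/* Single-precision multiply, double-precision accumulate:
 * the binary32 product is exact in binary64 (24+24 significand bits < 53),
 * so only the final addition rounds, matching fused semantics. */
static double fma_sp_dp(float a, float b, double c) {
    return (double)a * (double)b + c;
}

/* Hypothetical BFloat16 helper: a BF16 value is treated as the top 16 bits
 * of a binary32 encoding (1 sign, 8 exponent, 7 mantissa bits). */
static float bf16_to_fp32(uint16_t x) {
    uint32_t bits = (uint32_t)x << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* BFloat16 multiply, single-precision accumulate:
 * the BF16 product is exact in binary32 (8+8 significand bits < 24),
 * so again only the final addition rounds. */
static float fma_bf16_sp(uint16_t a, uint16_t b, float c) {
    return bf16_to_fp32(a) * bf16_to_fp32(b) + c;
}

int main(void) {
    /* 1.5 * 2.0 accumulated into a double-precision addend. */
    printf("%.17g\n", fma_sp_dp(1.5f, 2.0f, 1.0));
    /* BF16 encodings: 0x3FC0 = 1.5, 0x4000 = 2.0, accumulated into FP32. */
    printf("%g\n", fma_bf16_sp(0x3FC0, 0x4000, 1.0f));
    return 0;
}
```

Because each lower-precision product fits exactly in the wider accumulation format, both sketches perform a single rounding at the final addition, which is the behavior a fused multiply-accumulate path provides in hardware.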


Acknowledgements

This work is supported in part by NSFC (No. 61872374, 62090023, 62172430), NSFHN (No. 2022JJ10064, 2021JJ10052) and NKRDP (No. 2021YFB0300300).

Author information


Correspondence to Libo Huang or Qianming Yang.


Copyright information

© 2023 Springer Nature Switzerland AG

About this paper

Cite this paper

Tan, H., Yan, R., Yang, L., Huang, L., Xiao, L., Yang, Q. (2023). Efficient Multiple-Precision and Mixed-Precision Floating-Point Fused Multiply-Accumulate Unit for HPC and AI Applications. In: Meng, W., Lu, R., Min, G., Vaidya, J. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2022. Lecture Notes in Computer Science, vol 13777. Springer, Cham. https://doi.org/10.1007/978-3-031-22677-9_34

  • DOI: https://doi.org/10.1007/978-3-031-22677-9_34

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-22676-2

  • Online ISBN: 978-3-031-22677-9

  • eBook Packages: Computer Science, Computer Science (R0)
