Efficient Multiple-Precision and Mixed-Precision Floating-Point Fused Multiply-Accumulate Unit for HPC and AI Applications

  • Conference paper
  • First Online:
Algorithms and Architectures for Parallel Processing (ICA3PP 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13777)

Abstract

In this paper, a multiple-precision and mixed-precision floating-point fused multiply-accumulate (FMA) unit is proposed based on the practical requirements of high-performance computing (HPC) and artificial intelligence (AI) applications. In addition to the double-precision and single-precision formats used in HPC, the unit supports three low-precision formats dedicated to deep learning tasks: TensorFloat-32, BFloat16, and half-precision. The proposed FMA architecture can execute one double-precision operation, two parallel single-precision operations, or four half-precision operations per clock cycle. Mixed-precision FMA operations are also supported, in which the products of lower-precision multiplications are accumulated to a higher-precision addend: either one mixed-precision operation using single-precision multiplication with double-precision addition, or two parallel mixed-precision operations using low-precision (TensorFloat-32, BFloat16, or half-precision) multiplication with single-precision addition, can be performed every clock cycle. The presented design uses both segmentation and reuse to trade off performance (throughput and latency) against area and power. The proposed FMA unit is only 17.0% larger in area than a standard double-precision FMA implementation, yet supports multiple-precision and mixed-precision operations. Compared with the state-of-the-art multiple-precision FMA design, the proposed FMA supports more precision formats, such as TensorFloat-32 and BFloat16, with less hardware overhead.
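
To make the mixed-precision FMA semantics described in the abstract concrete, the following is a minimal behavioral sketch in C, not the authors' hardware architecture. It models the two accumulation patterns mentioned above: single-precision multiply with double-precision accumulate, and BFloat16 multiply with single-precision accumulate. The helper names (fma_sp_dp, fma_bf16_sp, bf16_to_fp32) are illustrative assumptions, and BFloat16 values are modeled simply as the upper 16 bits of an IEEE-754 binary32 encoding.

```c
/* Behavioral sketch of mixed-precision FMA semantics (not the paper's RTL). */
#include <stdint.h>
#include <string.h>
#include <stdio.h>

/* Single-precision multiply, double-precision accumulate:
 * the binary32 product is exact in binary64 (24+24 significand bits < 53),
 * so only the final addition rounds, matching fused semantics. */
static double fma_sp_dp(float a, float b, double c) {
    return (double)a * (double)b + c;
}

/* Hypothetical BFloat16 helper: a BF16 value is treated as the top 16 bits
 * of a binary32 encoding (1 sign, 8 exponent, 7 mantissa bits). */
static float bf16_to_fp32(uint16_t x) {
    uint32_t bits = (uint32_t)x << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* BFloat16 multiply, single-precision accumulate:
 * the BF16 product is exact in binary32 (8+8 significand bits < 24),
 * so again only the final addition rounds. */
static float fma_bf16_sp(uint16_t a, uint16_t b, float c) {
    return bf16_to_fp32(a) * bf16_to_fp32(b) + c;
}

int main(void) {
    /* 1.5 * 2.0 accumulated into a double-precision addend. */
    printf("%.17g\n", fma_sp_dp(1.5f, 2.0f, 1.0));
    /* BF16 encodings: 0x3FC0 = 1.5, 0x4000 = 2.0, accumulated into FP32. */
    printf("%g\n", fma_bf16_sp(0x3FC0, 0x4000, 1.0f));
    return 0;
}
```

Because each lower-precision product fits exactly in the wider accumulation format, both sketches perform a single rounding at the final addition, which is the behavior a fused multiply-accumulate path provides in hardware.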


Acknowledgements

This work is supported in part by NSFC (No. 61872374, 62090023, 62172430), NSFHN (No. 2022JJ10064, 2021JJ10052) and NKRDP (No. 2021YFB0300300).

Author information


Correspondence to Libo Huang or Qianming Yang.


Copyright information

© 2023 Springer Nature Switzerland AG

About this paper

Cite this paper

Tan, H., Yan, R., Yang, L., Huang, L., Xiao, L., Yang, Q. (2023). Efficient Multiple-Precision and Mixed-Precision Floating-Point Fused Multiply-Accumulate Unit for HPC and AI Applications. In: Meng, W., Lu, R., Min, G., Vaidya, J. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2022. Lecture Notes in Computer Science, vol 13777. Springer, Cham. https://doi.org/10.1007/978-3-031-22677-9_34

  • DOI: https://doi.org/10.1007/978-3-031-22677-9_34

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-22676-2

  • Online ISBN: 978-3-031-22677-9

  • eBook Packages: Computer Science, Computer Science (R0)
