Abstract
When implementing the gradient descent method in low precision, the use of stochastic rounding schemes helps to prevent stagnation of convergence caused by the vanishing gradient effect. Unbiased stochastic rounding yields zero bias by preserving small updates with probabilities proportional to their relative magnitudes. This study provides a theoretical explanation for the stagnation of the gradient descent method in low-precision computation. Additionally, we propose two new stochastic rounding schemes that trade the zero-bias property for a larger probability of preserving small gradients. Our methods yield a constant rounding bias that, on average, lies in a descent direction. For convex problems, we prove that the proposed rounding methods typically have a beneficial effect on the convergence rate of gradient descent. We validate our theoretical analysis by comparing the performance of various rounding schemes when optimizing a multinomial logistic regression model and when training a simple neural network with an 8-bit floating-point format.
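To illustrate the unbiased scheme the abstract refers to, the following is a minimal Python sketch (not from the paper, which uses MATLAB): a value is rounded to a uniform grid of spacing `step`, choosing the upper neighbor with probability equal to the fractional distance from the lower one, so the rounded result equals the exact value in expectation and small updates survive with nonzero probability.

```python
import math
import random

def stochastic_round(x, step):
    """Unbiased stochastic rounding of x to a multiple of `step`.

    The upper grid neighbor is chosen with probability proportional
    to x's distance from the lower neighbor, so E[result] == x and a
    small update is preserved with probability |x - lo| / step rather
    than being rounded to zero deterministically.
    """
    lo = math.floor(x / step) * step      # lower grid neighbor
    frac = (x - lo) / step                # in [0, 1)
    return lo + step if random.random() < frac else lo

# Averaging many rounded copies of 0.3 on a unit grid recovers ~0.3,
# whereas round-to-nearest would always give 0 and stall an update.
random.seed(0)
mean = sum(stochastic_round(0.3, 1.0) for _ in range(100_000)) / 100_000
```

Round-to-nearest would map every such small update to zero, which is the stagnation mechanism the paper analyzes; the biased schemes proposed in the paper further enlarge the probability of rounding away from zero.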
Notes
The MATLAB code is available upon request.
Acknowledgements
We thank the reviewers for their constructive comments and the editor for the handling of this paper. This research was funded by the EU ECSEL Joint Undertaking under Grant agreement No. 826452 (project Arrowhead Tools).
Additional information
Communicated by Olivier Fercoq.
Cite this article
Xia, L., Massei, S., Hochstenbach, M.E. et al. On Stochastic Roundoff Errors in Gradient Descent with Low-Precision Computation. J Optim Theory Appl 200, 634–668 (2024). https://doi.org/10.1007/s10957-023-02345-7
Keywords
- Gradient descent method
- Stochastic roundoff error analysis
- Low-precision computation
- Convergence analysis
- Logistic regression
- Neural networks