
Translating math formula images to LaTeX sequences using deep neural networks with sequence-level training

  • Original Paper
  • Published in: International Journal on Document Analysis and Recognition (IJDAR)

Abstract

In this paper, we propose a deep neural network model with an encoder–decoder architecture that translates images of math formulas into their LaTeX markup sequences. The encoder is a convolutional neural network that transforms images into a group of feature maps. To better capture the spatial relationships of math symbols, the feature maps are augmented with 2D positional encoding before being unfolded into a vector. The decoder is a stacked bidirectional long short-term memory model integrated with the soft attention mechanism, which works as a language model to translate the encoder output into a sequence of LaTeX tokens. The neural network is trained in two steps. The first step is token-level training using the maximum likelihood estimation as the objective function. At completion of the token-level training, the sequence-level training objective function is employed to optimize the overall model based on the policy gradient algorithm from reinforcement learning. Our design also overcomes the exposure bias problem by closing the feedback loop in the decoder during sequence-level training, i.e., feeding in the predicted token instead of the ground truth token at every time step. The model is trained and evaluated on the IM2LATEX-100K dataset and shows state-of-the-art performance on both sequence-based and image-based evaluation metrics.
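To make the sequence-level training step described in the abstract concrete, the following is a minimal sketch of one policy-gradient (REINFORCE) update, assuming a PyTorch-style autoregressive decoder. The object names and helpers here (`decoder.step`, `sequence_reward`, etc.) are illustrative assumptions, not the authors' code; the key point is that the decoder samples its own previous prediction at every time step, which closes the feedback loop and avoids exposure bias.

```python
# Hedged sketch of one sequence-level (REINFORCE) training step.
# `decoder`, `decoder.step`, `sequence_reward`, etc. are hypothetical
# placeholders standing in for the paper's actual components.
import torch


def sequence_level_step(decoder, encoder_feats, ref_tokens, optimizer):
    """One policy-gradient update on a single example."""
    decoder.train()
    log_probs, tokens = [], []
    inp = decoder.start_token()                    # <sos>
    state = decoder.init_state(encoder_feats)
    for _ in range(decoder.max_len):
        logits, state = decoder.step(inp, state, encoder_feats)
        dist = torch.distributions.Categorical(logits=logits)
        tok = dist.sample()                        # sample a token instead of
        log_probs.append(dist.log_prob(tok))       # feeding the ground truth
        tokens.append(tok)
        if tok.item() == decoder.eos_token():
            break
        inp = tok                                  # feed the prediction back in
    reward = sequence_reward(tokens, ref_tokens)   # e.g., a BLEU-style score
    # REINFORCE: minimizing -reward * log p(sampled sequence) increases the
    # probability of sequences that receive a high reward.
    loss = -reward * torch.stack(log_probs).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward
```

In practice this kind of update is usually applied only after token-level maximum-likelihood pre-training, and a baseline (e.g., the average reward) is typically subtracted from the reward to reduce the variance of the gradient estimate.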


Notes

  1. LaTeX (version 3.1415926–2.5–1.40.14).

  2. Different sizes of width–height buckets (in pixel): (320, 40), (360, 60), (360, 50), (200, 50), (280, 50), (240, 40), (360, 100), (500, 100), (320, 50), (280, 40), (200, 40), (400, 160), (600, 100), (400, 50), (160, 40), (800, 100), (240, 50), (120, 50), (360, 40), (500, 200).
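As an illustration of how the width–height buckets in note 2 might be used, the sketch below pads each rendered formula image up to the smallest bucket that fits it so that images in the same bucket can be batched together. This is a common practice for IM2LATEX-style pipelines and is an assumption here, not a description of the authors' exact preprocessing code.

```python
# Illustrative only: assign a rendered formula image to the smallest
# width-height bucket (from note 2) that can contain it.
BUCKETS = [(320, 40), (360, 60), (360, 50), (200, 50), (280, 50),
           (240, 40), (360, 100), (500, 100), (320, 50), (280, 40),
           (200, 40), (400, 160), (600, 100), (400, 50), (160, 40),
           (800, 100), (240, 50), (120, 50), (360, 40), (500, 200)]


def assign_bucket(width, height):
    """Return the smallest-area bucket that fits the image, or None."""
    fitting = [(w, h) for (w, h) in BUCKETS if w >= width and h >= height]
    return min(fitting, key=lambda wh: wh[0] * wh[1]) if fitting else None


# Example: a 190x38 formula image would be padded to the (200, 40) bucket.
print(assign_bucket(190, 38))  # -> (200, 40)
```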


Author information

Corresponding author

Correspondence to Jyh-Charn Liu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Wang, Z., Liu, JC. Translating math formula images to LaTeX sequences using deep neural networks with sequence-level training. IJDAR 24, 63–75 (2021). https://doi.org/10.1007/s10032-020-00360-2

