Learning Curves Prediction for a Transformers-Based Model

Francisco Cruz, Mauro Castelli

Abstract


One of the main challenges when training or fine-tuning a machine learning model is determining how many observations are needed to achieve satisfactory performance. While more training observations generally yield a better-performing model, collecting additional data can be time-consuming, expensive, or even impossible. Investigating the relationship between dataset size and model performance is therefore fundamental for estimating, with a certain likelihood, the minimum number of observations required for the training process to produce a satisfactory model. The learning curve represents this relationship between dataset size and model performance and is especially useful when choosing a model for a specific task or planning the annotation work for a dataset. The purpose of this paper is thus to find the functions that best fit the learning curves of a Transformers-based model (LayoutLM) fine-tuned to extract information from invoices. Two new datasets of invoices are made available for this task. Combined with a third dataset already available online, they are used to define 22 sub-datasets whose learning curves are plotted from cross-validation results. The functions are fitted using a non-linear least squares technique. The results show that both a bi-asymptotic and a Morgan-Mercer-Flodin function fit the learning curves extremely well. In addition, an empirical relation is presented that predicts the learning curve from a single parameter that can easily be obtained in the early stage of the annotation process.
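To make the fitting step concrete, the sketch below fits a Morgan-Mercer-Flodin (MMF) curve to learning-curve points with SciPy's curve_fit, which implements non-linear least squares. The MMF form used here, y(x) = (ab + c·x^d) / (b + x^d), is the standard one; the dataset sizes, F1 scores, and starting values are illustrative placeholders, not data from the paper.

    import numpy as np
    from scipy.optimize import curve_fit

    def mmf(x, a, b, c, d):
        # Morgan-Mercer-Flodin growth model: starts near a, saturates at c.
        return (a * b + c * x**d) / (b + x**d)

    # Illustrative (training-set size, mean cross-validation F1) pairs.
    sizes = np.array([25.0, 50.0, 100.0, 200.0, 400.0, 800.0])
    f1 = np.array([0.42, 0.55, 0.66, 0.74, 0.79, 0.82])

    # p0 gives the optimizer a plausible starting point; maxfev raises the
    # function-evaluation budget, which helps on this non-linear problem.
    params, _ = curve_fit(mmf, sizes, f1, p0=[0.3, 50.0, 0.9, 1.0], maxfev=10000)
    a, b, c, d = params

    print(f"fitted asymptote c = {c:.3f}")  # best score the model approaches
    print(f"predicted F1 at n = 2000: {mmf(2000.0, a, b, c, d):.3f}")

The fitted asymptote c estimates the best score that additional annotation could buy, which is precisely the kind of information a learning curve provides when planning annotation work.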


DOI: 10.28991/ESJ-2023-07-05-03

Full Text: PDF


Keywords


Dataset Size; Document Data Extraction; Fine-Tuning; Learning Curves; Transformers.

References


Peres, F., & Castelli, M. (2021). Combinatorial Optimization Problems and Metaheuristics: Review, Challenges, Design, and Development. Applied Sciences, 11(14), 6449. doi:10.3390/app11146449.

Yang, L., & Shami, A. (2020). On hyperparameter optimization of machine learning algorithms: Theory and practice. Neurocomputing, 415, 295–316. doi:10.1016/j.neucom.2020.07.061.

Kalayeh, H. M., & Landgrebe, D. A. (1983). Predicting the Required Number of Training Samples. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-5(6), 664–667. doi:10.1109/TPAMI.1983.4767459.

Hestness, J., Narang, S., Ardalani, N., Diamos, G., Jun, H., Kianinejad, H., Patwary, M. M. A., Yang, Y., & Zhou, Y. (2017). Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409. doi:10.48550/arxiv.1712.00409.

Beleites, C., Neugebauer, U., Bocklitz, T., Krafft, C., & Popp, J. (2013). Sample size planning for classification models. Analytica Chimica Acta, 760, 25–33. doi:10.1016/j.aca.2012.11.007.

Dobbin, K. K., & Simon, R. M. (2007). Sample size planning for developing classifiers using high-dimensional DNA microarray data. Biostatistics, 8(1), 101–117. doi:10.1093/biostatistics/kxj036.

Dobbin, K. K., Zhao, Y., & Simon, R. M. (2008). How large a training set is needed to develop a classifier for microarray data? Clinical Cancer Research, 14(1), 108–114. doi:10.1158/1078-0432.CCR-07-0443.

Kier, C., & Aach, T. (2006). Predicting the benefit of sample size extension in multiclass k-NN classification. 18th International Conference on Pattern Recognition (ICPR’06). doi:10.1109/icpr.2006.942.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30, 1-11, NeurIPS Proceedings, Long Beach, California, United States.

Viering, T., & Loog, M. (2023). The Shape of Learning Curves: A Review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(6), 7799–7819. doi:10.1109/tpami.2022.3220744.

Zhuang, F., Qi, Z., Duan, K., Xi, D., Zhu, Y., Zhu, H., Xiong, H., & He, Q. (2021). A Comprehensive Survey on Transfer Learning. Proceedings of the IEEE, 109(1), 43–76. doi:10.1109/jproc.2020.3004555.

Frey, L. J., & Fisher, D. H. (1999). Modeling decision tree performance with the power law. Seventh International Workshop on Artificial Intelligence and Statistics. PMLR, 3-6 January, 1999, Fort Lauderdale, United States.

Hess, K. R., & Wei, C. (2010). Learning Curves in Classification with Microarray Data. Seminars in Oncology, 37(1), 65–68. doi:10.1053/j.seminoncol.2009.12.002.

Brumen, B., Rozman, I., Heričko, M., Černezel, A., & Hölbl, M. (2014). Best-fit learning curve model for the C4.5 algorithm. Informatica (Netherlands), 25(3), 385–399. doi:10.15388/Informatica.2014.19.

Singh, S. (2005). Modeling performance of different classification methods: deviation from the power law. Project Report, Department of Computer Science, Vanderbilt University, Nashville, United States.

Figueroa, R. L., Zeng-Treitler, Q., Kandula, S., & Ngo, L. H. (2012). Predicting sample size required for classification performance. BMC Medical Informatics and Decision Making, 12(1), 1–10. doi:10.1186/1472-6947-12-8.

Last, M. (2007). Predicting and Optimizing Classifier Utility with the Power Law. Seventh IEEE International Conference on Data Mining Workshops (ICDMW 2007), Nebraska, United States. doi:10.1109/icdmw.2007.31.

Cortes, C., Jackel, L. D., Solla, S., Vapnik, V., & Denker, J. (1993). Learning curves: Asymptotic values and rate of convergence. Advances in Neural Information Processing Systems, 6, NeurIPS Proceedings, Denver, Colorado, United States.

Kolachina, P., Cancedda, N., Dymetman, M., & Venkatapathy, S. (2012). Prediction of learning curves in machine translation. Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, 8-14 July, 2012, Jeju Island, South Korea.

Leite, R., & Brazdil, P. (2004). Improving Progressive Sampling via Meta-learning on Learning Curves. Machine Learning: ECML 2004. ECML 2004. Lecture Notes in Computer Science, 3201, Springer, Berlin, Germany. doi:10.1007/978-3-540-30115-8_25.

Hoiem, D., Gupta, T., Li, Z., & Shlapentokh-Rothman, M. (2021). Learning curves for analysis of deep networks. International Conference on Machine Learning, 18-24 July, 2021, Virtual Event.

Mukherjee, S., Tamayo, P., Rogers, S., Rifkin, R., Engle, A., Campbell, C., Golub, T. R., & Mesirov, J. P. (2003). Estimating Dataset Size Requirements for Classifying DNA Microarray Data. Journal of Computational Biology, 10(2), 119–142. doi:10.1089/106652703321825928.

Sambasivan, N., Kapania, S., Highfill, H., Akrong, D., Paritosh, P., & Aroyo, L. M. (2021). “Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI. Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. doi:10.1145/3411764.3445518.

Castelli, M., Pinto, D. C., Shuqair, S., Montali, D., & Vanneschi, L. (2022). The Benefits of Automated Machine Learning in Hospitality. Emerging Science Journal, 6(6), 1237-1254. doi:10.28991/ESJ-2022-06-06-02.

ICDAR. (2019). Overview - ICDAR 2019 Robust Reading Challenge on Scanned Receipts OCR and Information Extraction. Robust Reading Competition. Available online: https://rrc.cvc.uab.es/?ch=13 (accessed in April 2023).

Cruz, F., & Castelli, M. (2022). Dataset of personal invoices and receipts including annotation of relevant fields. 16 October 2022, Version v1. doi:10.5281/ZENODO.7213544.

Xu, Y., Li, M., Cui, L., Huang, S., Wei, F., & Zhou, M. (2020). LayoutLM: Pre-training of Text and Layout for Document Image Understanding. Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. doi:10.1145/3394486.3403172.

Dai, A. M., & Le, Q. V. (2015). Semi-supervised sequence learning. Advances in Neural Information Processing Systems, NeurIPS Proceedings, 28, 1-9.

Nakayama, H. (2018). Chakki-works/seqeval: A Python framework for sequence labeling evaluation (named-entity recognition, POS tagging, etc.). GitHub, San Francisco, United States. Available online: https://github.com/chakki-works/seqeval (accessed in July 2023).

Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesser, W., Bright, J., van der Walt, S. J., Brett, M., Wilson, J., Millman, K. J., Mayorov, N., Nelson, A. R. J., Jones, E., Kern, R., Larson, E., … Vázquez-Baeza, Y. (2020). SciPy 1.0: fundamental algorithms for scientific computing in Python. Nature Methods, 17(3), 261–272. doi:10.1038/s41592-019-0686-2.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research, 12, 2825-2830.





Copyright (c) 2023 Francisco Cruz, Mauro Castelli