Extracting Descriptive Words from Untranscribed Handwritten Images

Prieto, Jose Ramón; Vidal, Enrique; Sánchez, Joan Andreu; Alonso, Carlos; Garrido, David

doi:10.1007/978-3-031-04881-4_43

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13256)

Included in the following conference series:

Iberian Conference on Pattern Recognition and Image Analysis

Abstract

Extracting descriptive text from manuscripts for inclusion in their metadata is an important task, generally performed in archives and libraries by experts with a wealth of knowledge of the manuscripts' contents. Unfortunately, many manuscript collections are so vast that it is not feasible to rely solely on experts to perform this task. To our knowledge, this is the first work aiming at automatic extraction of descriptive text from untranscribed text images. A natural first step for such a task would be to transcribe the handwritten images into text, but achieving sufficiently accurate transcripts is generally unfeasible for large sets of historical manuscripts. We propose new approaches to automatically extract descriptive words which do not rely on any explicit image transcripts. They are based on “probabilistic indexing”, a relatively novel technology which makes it possible to effectively represent the intrinsic word-level uncertainty generally exhibited by handwritten text images. We assess the performance of these approaches on samples of a large collection of complex manuscripts from the Spanish Archivo General de Indias. Since no standard metrics exist for the novel task considered in this work, we propose two new evaluation measures which aim at measuring the quality of the detected descriptive words in terms close to the practical usage of these words. Using these metrics, we report promising preliminary results.
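To make the idea of working without transcripts more concrete, the sketch below ranks candidate descriptive words directly from word-level probabilities. It is an illustration only and should not be read as the method proposed in this work: the toy index, the function names, and the TF-IDF-style score (expected count times log inverse expected document frequency) are all assumptions introduced here.

```python
import math
from collections import defaultdict


def expected_term_counts(spots):
    """Sum spot probabilities per word: the expected count of each word in one image."""
    counts = defaultdict(float)
    for word, prob in spots:
        counts[word] += prob
    return dict(counts)


def rank_descriptive_words(index, image_id, top_k=3):
    """Rank the words of one image by expected count times inverse document frequency.

    `index` is a hypothetical toy structure: image id -> list of
    (word, relevance probability) pairs. Real probabilistic indices
    also store the position of each spot, which is ignored here.
    """
    per_image = {img: expected_term_counts(spots) for img, spots in index.items()}
    n_images = len(per_image)
    # Expected document frequency: probability mass capped at 1 per image
    # (an assumption made for this sketch).
    doc_freq = defaultdict(float)
    for counts in per_image.values():
        for word, count in counts.items():
            doc_freq[word] += min(1.0, count)
    scores = {
        word: count * math.log(n_images / doc_freq[word])
        for word, count in per_image[image_id].items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Invented example data; the probabilities are not taken from any real collection.
toy_index = {
    "page_1": [("carta", 0.9), ("navio", 0.8), ("navio", 0.6), ("de", 0.95)],
    "page_2": [("carta", 0.7), ("de", 0.9), ("oro", 0.85)],
    "page_3": [("de", 0.99), ("carta", 0.4), ("nauio", 0.3)],
}

print(rank_descriptive_words(toy_index, "page_1"))
```

In this toy example a frequent function word such as "de" receives a low score because it is likely to appear in every image, while a word concentrated in one image ranks first. No transcript is ever produced; uncertain hypotheses simply contribute less probability mass.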

Notes

1. We will nevertheless use the time-honored term “document” when dealing with plain text rather than text images.
2. http://prhlt-carabela.prhlt.upv.es/carabela.
3. http://prhlt-carabela.prhlt.upv.es/PrIxDemos provides a list of public search interfaces for these collections.
4. http://pares.mcu.es/ParesBusquedas20/catalogo/search.

Acknowledgments

Work partially supported by the research grants: Ministerio de Ciencia, Innovación y Universidades “DocTIUM” (RTI2018-095645-B-C22), Generalitat Valenciana under project DeepPattern (PROMETEO/2019/121) and PID2020-116813RB-I00a funded by MCIN/AEI/10.13039/501100011033.

The first author’s work was partially supported by the Universitat Politècnica de València under grant FPI-I/SP20190010.

Author information

Authors and Affiliations

PRHLT Research Center, Universitat Politècnica de València, Valencia, Spain
Jose Ramón Prieto, Enrique Vidal & Joan Andreu Sánchez
tranSkriptorium AI, Valencia, Spain
Carlos Alonso
HUM313 Research Group, Universidad de Cádiz, Cádiz, Spain
David Garrido

Corresponding author

Correspondence to Jose Ramón Prieto.

Editor information

Editors and Affiliations

University of Aveiro, Aveiro, Portugal
Armando J. Pinho
University of Aveiro, Aveiro, Portugal
Petia Georgieva
University of Porto, Porto, Portugal
Luís F. Teixeira
Universitat Politècnica de València, Valencia, Spain
Joan Andreu Sánchez

About this paper

Cite this paper

Prieto, J.R., Vidal, E., Sánchez, J.A., Alonso, C., Garrido, D. (2022). Extracting Descriptive Words from Untranscribed Handwritten Images. In: Pinho, A.J., Georgieva, P., Teixeira, L.F., Sánchez, J.A. (eds) Pattern Recognition and Image Analysis. IbPRIA 2022. Lecture Notes in Computer Science, vol 13256. Springer, Cham. https://doi.org/10.1007/978-3-031-04881-4_43

DOI: https://doi.org/10.1007/978-3-031-04881-4_43
Published: 26 April 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-04880-7
Online ISBN: 978-3-031-04881-4
eBook Packages: Computer Science (R0)
