Extracting Descriptive Words from Untranscribed Handwritten Images

  • Conference paper
Pattern Recognition and Image Analysis (IbPRIA 2022)

Abstract

Extracting descriptive text from manuscripts for inclusion in their metadata is an important task, generally performed in archives and libraries by experts with deep knowledge of the manuscripts' contents. Unfortunately, many manuscript collections are so vast that relying solely on experts for this task is not feasible. To our knowledge, this is the first work aimed at automatically extracting descriptive text from untranscribed text images. An obvious first step would be to transcribe the handwritten images into text, but achieving sufficiently accurate transcripts is generally infeasible for large sets of historical manuscripts. We propose new approaches that automatically extract descriptive words without relying on any explicit image transcripts. They are based on “probabilistic indexing”, a relatively novel technology that effectively represents the intrinsic word-level uncertainty generally exhibited by handwritten text images. We assess the performance of these approaches on samples from a large collection of complex manuscripts held by the Spanish Archivo General de Indias. Since no standard metrics exist for this novel task, we propose two new evaluation measures that assess the quality of the detected descriptive words in terms close to how these words would be used in practice. Using these metrics, we report promising preliminary results.
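To make the idea of working from word-level uncertainty rather than transcripts more concrete, the following is a minimal sketch, not the method evaluated in the paper. It assumes a probabilistic index can be reduced to (word hypothesis, relevance probability) pairs per document, estimates expected word counts by summing those probabilities, and ranks words with a tf-idf-style score. The toy index, the function names and the scoring rule are all illustrative assumptions.

```python
import math
from collections import defaultdict

# Hypothetical probabilistic index: for each document, a list of
# (word hypothesis, relevance probability) spots. Toy values, not real data.
prob_index = {
    "doc_a": [("carta", 0.9), ("carta", 0.6), ("navio", 0.8), ("de", 0.95)],
    "doc_b": [("navio", 0.7), ("de", 0.9), ("de", 0.85)],
    "doc_c": [("oro", 0.75), ("de", 0.9), ("carta", 0.2)],
}


def expected_counts(spots):
    """Expected occurrences of each word: the sum of its spot probabilities."""
    counts = defaultdict(float)
    for word, prob in spots:
        counts[word] += prob
    return dict(counts)


def descriptive_words(index, doc_id, top_k=2):
    """Rank the words of one document by an assumed expected-count tf-idf score."""
    per_doc = {d: expected_counts(s) for d, s in index.items()}
    n_docs = len(per_doc)
    scores = {}
    for word, tf in per_doc[doc_id].items():
        # Soft document frequency: each document contributes at most 1.
        df = sum(min(1.0, counts.get(word, 0.0)) for counts in per_doc.values())
        scores[word] = tf * math.log(n_docs / df)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


print(descriptive_words(prob_index, "doc_a"))
```

In this toy example a frequent function word such as "de" scores near zero because it is likely to appear in every document, while words concentrated in one document rise to the top. No transcript is ever produced: low-confidence hypotheses simply contribute less to the counts.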

Notes

  1. We will nevertheless use the time-honored term “document” when dealing with plain text rather than text images.

  2. http://prhlt-carabela.prhlt.upv.es/carabela.

  3. http://prhlt-carabela.prhlt.upv.es/PrIxDemos provides a list of public search interfaces for these collections.

  4. http://pares.mcu.es/ParesBusquedas20/catalogo/search.

Acknowledgments

This work was partially supported by the following research grants: Ministerio de Ciencia Innovación y Universidades “DocTIUM” (RTI2018-095645-B-C22), Generalitat Valenciana under project DeepPattern (PROMETEO/2019/121), and PID2020-116813RB-I00a funded by MCIN/AEI/10.13039/501100011033.

The first author’s work was partially supported by the Universitat Politècnica de València under grant FPI-I/SP20190010.

Author information

Corresponding author

Correspondence to Jose Ramón Prieto.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Prieto, J.R., Vidal, E., Sánchez, J.A., Alonso, C., Garrido, D. (2022). Extracting Descriptive Words from Untranscribed Handwritten Images. In: Pinho, A.J., Georgieva, P., Teixeira, L.F., Sánchez, J.A. (eds) Pattern Recognition and Image Analysis. IbPRIA 2022. Lecture Notes in Computer Science, vol 13256. Springer, Cham. https://doi.org/10.1007/978-3-031-04881-4_43

  • DOI: https://doi.org/10.1007/978-3-031-04881-4_43

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-04880-7

  • Online ISBN: 978-3-031-04881-4

  • eBook Packages: Computer Science, Computer Science (R0)
