Investigating the Best Configuration of HMM Spanish PoS Tagger when Minimum Amount of Training Data Is Available

  • Conference paper
Natural Language Processing and Information Systems (NLDB 2005)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3513)

Abstract

Part-of-speech (PoS) tagging is an important processing step in many natural language systems (information extraction, question answering, etc.), and it has been tackled with a number of different approaches. In this paper we study the behaviour of a Hidden Markov Model (HMM) Spanish PoS tagger trained on a minimum amount of data. Our PoS tagger is based on an HMM in which the states are tag pairs that emit words, so it relies on transitional and lexical probabilities. This technique was suggested by Rabiner [11], and our implementation is influenced by Brants [2]. We have investigated the best configuration of the HMM using a small training corpus of about 50,000 words; the maximum precision obtained on an unknown Spanish text was 95.36%.
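To make the model described in the abstract concrete, the following is a minimal sketch of an HMM tagger whose states are tag pairs, decoded with the Viterbi algorithm. It is an illustration only, not the authors' implementation: the toy corpus, the tag names, the function names `train` and `viterbi`, and the add-one smoothing are all assumptions introduced here (the abstract does not say how probabilities are smoothed or how unknown words are handled).

```python
import math
from collections import defaultdict

# Hypothetical toy corpus of (word, tag) pairs; NOT data from the paper.
TRAIN = [
    [("el", "DET"), ("perro", "NOUN"), ("come", "VERB")],
    [("la", "DET"), ("niña", "NOUN"), ("come", "VERB"), ("pan", "NOUN")],
    [("el", "DET"), ("niño", "NOUN"), ("duerme", "VERB")],
]
START, STOP = "<s>", "</s>"


def train(sentences):
    """Collect the counts behind transitional and lexical probabilities."""
    tri = defaultdict(int)        # (t1, t2, t3): tag trigram counts
    ctx = defaultdict(int)        # (t1, t2): tag-pair counts used as context
    emit = defaultdict(int)       # (tag, word): lexical counts
    tag_count = defaultdict(int)  # tag: total occurrences
    vocab = set()
    for sent in sentences:
        tags = [START, START] + [t for _, t in sent] + [STOP]
        for i in range(2, len(tags)):
            tri[tuple(tags[i - 2:i + 1])] += 1
            ctx[tuple(tags[i - 2:i])] += 1
        for word, tag in sent:
            emit[(tag, word)] += 1
            tag_count[tag] += 1
            vocab.add(word)
    return tri, ctx, emit, tag_count, vocab


def viterbi(words, model):
    """Return the most probable tag sequence; each state is a tag pair."""
    tri, ctx, emit, tag_count, vocab = model
    tags = sorted(tag_count)
    n_next = len(tags) + 1   # possible next tags, including STOP
    n_words = len(vocab) + 1  # one extra slot reserved for unknown words

    # Add-one smoothing is an assumption made for this sketch.
    def log_trans(t1, t2, t3):
        return math.log((tri.get((t1, t2, t3), 0) + 1)
                        / (ctx.get((t1, t2), 0) + n_next))

    def log_emit(tag, word):
        return math.log((emit.get((tag, word), 0) + 1)
                        / (tag_count[tag] + n_words))

    # best[(t_prev, t_cur)] = (log probability, tags chosen so far)
    best = {(START, START): (0.0, [])}
    for word in words:
        nxt = {}
        for (t1, t2), (score, path) in best.items():
            for t3 in tags:
                s = score + log_trans(t1, t2, t3) + log_emit(t3, word)
                if (t2, t3) not in nxt or s > nxt[(t2, t3)][0]:
                    nxt[(t2, t3)] = (s, path + [t3])
        best = nxt
    # Account for the transition into the end-of-sentence marker.
    (_, _), (_, path) = max(
        best.items(),
        key=lambda kv: kv[1][0] + log_trans(kv[0][0], kv[0][1], STOP),
    )
    return path


model = train(TRAIN)
print(viterbi(["la", "niña", "duerme"], model))
```

Because each state is a tag pair, a transition from (t1, t2) to (t2, t3) carries the trigram probability of t3 given the two preceding tags, while the lexical probability depends only on the tag that emits the word. A real tagger trained on a small corpus would need a more careful smoothing scheme than the add-one estimate used here.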

This research has been partially funded by the Spanish Government under project PROFIT number FIT-340100-2004-14.

References

  1. Atserias, J., Carmona, J., Castellón, I., Cervell, S., Civit, M., Márquez, L., Martí, M.A., Padró, L., Placer, R., Rodríguez, H., Taulé, M., Turmo, J.: Morphosyntactic Analysis and Parsing of Unrestricted Spanish Text. In: First International Conference on Language Resources and Evaluation, LREC 1998, pp. 1267–1272 (1998)

  2. Brants, T.: TnT - a statistical part-of-speech tagger. In: Proceedings of the 6th Conference on Applied Natural Language Processing, ANLP, pp. 224–231 (2000)

  3. Brill, E.: Transformation-based error-driven learning of natural language: A case study in part of speech tagging. Computational Linguistics 21, 543–565

  4. Brill, E.: A corpus-based Approach to Language Learning (1993)

  5. Carreras, X., Chao, I., Padró, L., Padró, M.: Freeling: An open-source suite of language analyzers. In: Proceedings of the 4th International Conference on Language Resources and Evaluation, LREC 2004, pp. 1364–1371 (2004)

  6. Civit, M.: Criterios de etiquetación y desambiguación morfosintáctica de corpus en español. PhD thesis, Linguistics Department, Universitat de Barcelona (2003)

  7. Daelemans, W., Zavrel, J., Berck, P., Gillis, S.: A memory-based part-of-speech tagger generator. In: Proceedings of the 4th Workshop on Very Large Corpora, pp. 14–27 (1996)

  8. Figuerola, G., Zazo, F., Rodríguez, E., Alonso, J.: La Recuperación de Información en español y la normalización de términos. Revista Iberoamericana de Inteligencia Artificial VIII(22), 135–145 (2004)

  9. Mérialdo, B.: Tagging English text with a probabilistic model. Computational Linguistics 20(2), 155–171 (1994)

  10. Padró, M., Padró, L.: Developing Competitive HMM PoS Taggers Using Small Training Corpora. ESPAÑA for NATURAL LANGUAGE PROCESSING, EsTAL, 127–136 (2004)

  11. Rabiner, L.R.: A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77(2), 257–286 (1989)

  12. Ratnaparkhi, A.: A maximum entropy part-of-speech tagger. In: Proceedings of the 1st Conference on Empirical Methods in Natural Language Processing, EMNLP, pp. 16–19 (1996)

  13. Schmid, H.: TreeTagger - a language independent part-of-speech tagger. Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart (1995)

  14. Viterbi, A.J.: Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory 13(2), 260–269 (1967)

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ferrández, S., Peral, J. (2005). Investigating the Best Configuration of HMM Spanish PoS Tagger when Minimum Amount of Training Data Is Available. In: Montoyo, A., Muñoz, R., Métais, E. (eds) Natural Language Processing and Information Systems. NLDB 2005. Lecture Notes in Computer Science, vol 3513. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11428817_32

  • DOI: https://doi.org/10.1007/11428817_32

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-26031-8

  • Online ISBN: 978-3-540-32110-1

  • eBook Packages: Computer Science, Computer Science (R0)
