Investigating the Best Configuration of HMM Spanish PoS Tagger when Minimum Amount of Training Data Is Available

Ferrández, Sergio; Peral, Jesús

doi:10.1007/11428817_32

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3513)

Included in the following conference series:

International Conference on Application of Natural Language to Information Systems

Abstract

Part-of-speech (PoS) tagging is an important processing step in many natural language systems (information extraction, question answering, etc.), and it has been tackled with a number of different approaches. In this paper we study the behaviour of a Hidden Markov Model (HMM) Spanish PoS tagger trained on a minimal amount of data. The states of the HMM are tag pairs that emit words, and the model is based on transitional and lexical probabilities. This technique was suggested by Rabiner [11], and our implementation is influenced by Brants [2]. We have investigated the best configuration of the HMM using a small training corpus of about 50,000 words; the maximum precision obtained on previously unseen Spanish text was 95.36%.
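The model described in the abstract can be illustrated with a minimal sketch: an HMM whose states are tag pairs, scored with transitional probabilities P(t_i | t_{i-2}, t_{i-1}) and lexical probabilities P(w_i | t_i), and decoded with the Viterbi algorithm. This is not the authors' implementation. The add-one smoothing, the `<s>` padding tag, the three-tag toy tagset, and all function names are assumptions made only for illustration; the paper's actual smoothing and unknown-word handling are not given in this abstract.

```python
from collections import Counter
from math import log

START = "<s>"  # hypothetical padding tag for sentence starts (assumption)

def train(tagged_sentences):
    """Count tag trigrams, their bigram contexts, and (tag, word) emissions."""
    tri, bi, emit, tag_count = Counter(), Counter(), Counter(), Counter()
    vocab = set()
    for sent in tagged_sentences:
        tags = [START, START] + [t for _, t in sent]
        for i in range(2, len(tags)):
            tri[(tags[i - 2], tags[i - 1], tags[i])] += 1
            bi[(tags[i - 2], tags[i - 1])] += 1
        for w, t in sent:
            emit[(t, w)] += 1
            tag_count[t] += 1
            vocab.add(w)
    return {"tri": tri, "bi": bi, "emit": emit, "tag_count": tag_count,
            "tags": sorted(tag_count), "vocab_size": len(vocab)}

def log_trans(m, t2, t1, t):
    # Transitional probability P(t | t2, t1), add-one smoothed (assumption).
    return log((m["tri"][(t2, t1, t)] + 1) / (m["bi"][(t2, t1)] + len(m["tags"])))

def log_emit(m, t, w):
    # Lexical probability P(w | t), add-one smoothed; the extra +1 in the
    # denominator reserves probability mass for unknown words (assumption).
    return log((m["emit"][(t, w)] + 1) / (m["tag_count"][t] + m["vocab_size"] + 1))

def viterbi(m, words):
    """Return the most probable tag sequence; each state is a tag pair."""
    if not words:
        return []
    best = {(START, START): 0.0}  # log-probability of the best path per state
    back = []                     # one back-pointer table per word
    for w in words:
        new_best, pointers = {}, {}
        for (t2, t1), score in best.items():
            for t in m["tags"]:
                s = score + log_trans(m, t2, t1, t) + log_emit(m, t, w)
                if (t1, t) not in new_best or s > new_best[(t1, t)]:
                    new_best[(t1, t)] = s
                    pointers[(t1, t)] = (t2, t1)
        best = new_best
        back.append(pointers)
    state = max(best, key=best.get)
    tags = []
    for pointers in reversed(back):
        tags.append(state[1])     # second element is the current word's tag
        state = pointers[state]
    return tags[::-1]

# Toy corpus, invented for this example only.
corpus = [
    [("el", "DET"), ("perro", "NOUN"), ("ladra", "VERB")],
    [("la", "DET"), ("gata", "NOUN"), ("duerme", "VERB")],
    [("el", "DET"), ("gato", "NOUN"), ("come", "VERB")],
]
model = train(corpus)
print(viterbi(model, ["el", "gato", "duerme"]))  # ['DET', 'NOUN', 'VERB']
```

Using tag pairs as states lets each transition condition on the two preceding tags while the decoder remains a standard first-order Viterbi search over pairs. With a training corpus as small as the one studied here, the choice of smoothing matters a great deal, so the add-one estimate above should be read only as a placeholder.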

This research has been partially funded by the Spanish Government under project PROFIT number FIT-340100-2004-14.

References

1. Atserias, J., Carmona, J., Castellón, I., Cervell, S., Civit, M., Márquez, L., Martí, M.A., Padró, L., Placer, R., Rodríguez, H., Taulé, M., Turmo, J.: Morphosyntactic Analysis and Parsing of Unrestricted Spanish Text. In: First International Conference on Language Resources and Evaluation, LREC 1998, pp. 1267–1272 (1998)
2. Brants, T.: TnT: a statistical part-of-speech tagger. In: Proceedings of the 6th Conference on Applied Natural Language Processing, ANLP, pp. 224–231 (2000)
3. Brill, E.: Transformation-based error-driven learning of natural language: A case study in part of speech tagging. Computational Linguistics 21(4), 543–565 (1995)
4. Brill, E.: A Corpus-Based Approach to Language Learning (1993)
5. Carreras, X., Chao, I., Padró, L., Padró, M.: FreeLing: An open-source suite of language analyzers. In: Proceedings of the 4th International Conference on Language Resources and Evaluation, LREC 2004, pp. 1364–1371 (2004)
6. Civit, M.: Criterios de etiquetación y desambiguación morfosintáctica de corpus en español. PhD thesis, Linguistics Department, Universitat de Barcelona (2003)
7. Daelemans, W., Zavrel, J., Berck, P., Gillis, S.: A memory-based part-of-speech tagger generator. In: Proceedings of the 4th Workshop on Very Large Corpora, pp. 14–27 (1996)
8. Figuerola, G., Zazo, F., Rodríguez, E., Alonso, J.: La Recuperación de Información en español y la normalización de términos. Revista Iberoamericana de Inteligencia Artificial VIII(22), 135–145 (2004)
9. Mérialdo, B.: Tagging English text with a probabilistic model. Computational Linguistics 20(2), 155–171 (1994)
10. Padró, M., Padró, L.: Developing Competitive HMM PoS Taggers Using Small Training Corpora. España for Natural Language Processing, EsTAL, pp. 127–136 (2004)
11. Rabiner, L.R.: A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77(2), 257–286 (1989)
12. Ratnaparkhi, A.: A maximum entropy part-of-speech tagger. In: Proceedings of the 1st Conference on Empirical Methods in Natural Language Processing, EMNLP, pp. 16–19 (1996)
13. Schmid, H.: TreeTagger: a language independent part-of-speech tagger. Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart (1995)
14. Viterbi, A.J.: Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory 13(2), 260–269 (1967)

Author information

Authors and Affiliations

Grupo de Investigación en Procesamiento del Lenguaje y Sistemas de Información, Departamento de Lenguajes y Sistemas Informáticos, University of Alicante, Spain
Sergio Ferrández & Jesús Peral

Editor information

Editors and Affiliations

Department of Software and Computing Systems, University of Alicante, Spain
Andrés Montoyo
Grupo de investigación del Procesamiento del Lenguaje y Sistemas de Información, Departamento de Lenguajes y Sistemas Informáticos, Universidad de Alicante, Alicante, Spain
Rafael Muñoz
Lab. CEDRIC, CNAM, Paris, France
Elisabeth Métais

About this paper

Cite this paper

Ferrández, S., Peral, J. (2005). Investigating the Best Configuration of HMM Spanish PoS Tagger when Minimum Amount of Training Data Is Available. In: Montoyo, A., Muñoz, R., Métais, E. (eds) Natural Language Processing and Information Systems. NLDB 2005. Lecture Notes in Computer Science, vol 3513. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11428817_32

DOI: https://doi.org/10.1007/11428817_32
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-26031-8
Online ISBN: 978-3-540-32110-1
eBook Packages: Computer Science, Computer Science (R0)
