Bilingual Text Classification

  • Conference paper
Pattern Recognition and Image Analysis (IbPRIA 2007)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 4477)

Abstract

Bilingual documentation has become common in official institutions and private companies, which makes the categorization of bilingual text a useful tool. This paper proposes several approaches to this bilingual classification task. On the one hand, three finite-state transducer algorithms from the grammatical inference framework are presented. On the other hand, a naive combination of smoothed n-gram models is introduced. To evaluate the bilingual classifiers, two categorized bilingual corpora of different complexity were considered. Experiments on a limited-domain task show that all the models obtain similar results, whereas results on a more open-domain task indicate that the naive approach is superior.
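As an illustration of what a naive combination of smoothed n-gram models could look like, the following minimal sketch trains one bigram model per class and per language and scores a sentence pair (x, y) as log p(c) + log p(x | c) + log p(y | c). It assumes that the two languages are treated as independent given the class; the abstract does not state the exact formulation. The add-one smoothing is a simple stand-in for whatever smoothing the paper uses, and the class names, the toy Spanish-English pairs, and the labels are hypothetical.

```python
import math
from collections import Counter, defaultdict

START, END, UNK = "<s>", "</s>", "<unk>"


class BigramModel:
    """Bigram language model with add-one smoothing (a stand-in, see lead-in)."""

    def __init__(self, sentences, vocab):
        self.vocab = vocab          # shared per-language vocabulary
        self.bigrams = Counter()    # counts of (previous, current) pairs
        self.contexts = Counter()   # counts of each previous token
        for words in sentences:
            tokens = [START] + list(words) + [END]
            for prev, cur in zip(tokens, tokens[1:]):
                self.bigrams[(prev, cur)] += 1
                self.contexts[prev] += 1

    def log_prob(self, words):
        # Map out-of-vocabulary words to UNK before scoring.
        mapped = [w if w in self.vocab else UNK for w in words]
        tokens = [START] + mapped + [END]
        size = len(self.vocab)
        total = 0.0
        for prev, cur in zip(tokens, tokens[1:]):
            total += math.log((self.bigrams[(prev, cur)] + 1)
                              / (self.contexts[prev] + size))
        return total


class NaiveBilingualClassifier:
    """Hypothetical classifier: independent source and target models per class."""

    def __init__(self, training_data):
        # training_data: list of (source_words, target_words, label) triples.
        by_class = defaultdict(lambda: ([], []))
        src_vocab, tgt_vocab = {END, UNK}, {END, UNK}
        for src, tgt, label in training_data:
            by_class[label][0].append(src)
            by_class[label][1].append(tgt)
            src_vocab.update(src)
            tgt_vocab.update(tgt)
        total = len(training_data)
        self.models = {
            label: (math.log(len(srcs) / total),
                    BigramModel(srcs, src_vocab),
                    BigramModel(tgts, tgt_vocab))
            for label, (srcs, tgts) in by_class.items()
        }

    def score(self, src, tgt, label):
        # log p(c) + log p(x | c) + log p(y | c)
        log_prior, src_lm, tgt_lm = self.models[label]
        return log_prior + src_lm.log_prob(src) + tgt_lm.log_prob(tgt)

    def classify(self, src, tgt):
        return max(self.models, key=lambda label: self.score(src, tgt, label))


# Toy, made-up training pairs; not taken from the corpora used in the paper.
TOY_DATA = [
    ("una habitación doble por favor".split(), "a double room please".split(), "booking"),
    ("quiero reservar una habitación".split(), "i want to book a room".split(), "booking"),
    ("la cuenta por favor".split(), "the bill please".split(), "payment"),
    ("quiero pagar la cuenta".split(), "i want to pay the bill".split(), "payment"),
]

clf = NaiveBilingualClassifier(TOY_DATA)
print(clf.classify("reservar una habitación doble".split(), "book a double room".split()))
print(clf.classify("pagar la cuenta".split(), "pay the bill".split()))
```

Sharing one vocabulary per language across all class models keeps the smoothed probabilities of unseen events comparable between classes, so a class is not favored merely because its training set has fewer distinct words.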

Work supported by the EC (FEDER) and the Spanish MEC under grant TIN2006-15694-CO2-01 and the Consellería d’Empresa, Universitat i Ciència - Generalitat Valenciana under contract GV06/252 and the Ministerio de Educación y Ciencia.

Editor information

Joan Martí, José Miguel Benedí, Ana Maria Mendonça, Joan Serrat

Copyright information

© 2007 Springer Berlin Heidelberg

About this paper

Cite this paper

Civera, J., Cubel, E., Vidal, E. (2007). Bilingual Text Classification. In: Martí, J., Benedí, J.M., Mendonça, A.M., Serrat, J. (eds) Pattern Recognition and Image Analysis. IbPRIA 2007. Lecture Notes in Computer Science, vol 4477. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-72847-4_35

  • DOI: https://doi.org/10.1007/978-3-540-72847-4_35

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-72846-7

  • Online ISBN: 978-3-540-72847-4

  • eBook Packages: Computer Science, Computer Science (R0)
