DOI: 10.5555/1609067.1609164
research-article

Language ID in the context of harvesting language data off the web

Published: 30 March 2009

ABSTRACT

As the arm of NLP technologies extends beyond a small core of languages, techniques for working with instances of language data across hundreds to thousands of languages may require revisiting and recalibrating the tried-and-true methods that are used. One NLP technique that has been treated as "solved" is language identification (language ID) of written text. However, we argue that language ID is far from solved when one considers input spanning not dozens of languages but hundreds to thousands, a number that one approaches when harvesting language data found on the Web. We formulate language ID as a coreference resolution problem, apply it to a Web harvesting task for a specific linguistic data type, and achieve much higher accuracy than long-accepted language ID approaches.

Published in

EACL '09: Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics
March 2009
905 pages

Publisher

Association for Computational Linguistics, United States

Acceptance Rates

EACL '09 paper acceptance rate: 100 of 360 submissions, 28%
Overall acceptance rate: 100 of 360 submissions, 28%
