ABSTRACT
As the reach of NLP technologies extends beyond a small core of languages, working with language data across hundreds to thousands of languages may require revisiting and recalibrating tried-and-true methods. One NLP task that has been treated as "solved" is language identification (language ID) of written text. However, we argue that language ID is far from solved when one considers input spanning not dozens of languages but hundreds to thousands, a number that one approaches when harvesting language data found on the Web. We formulate language ID as a coreference resolution problem, apply it to a Web harvesting task for a specific linguistic data type, and achieve much higher accuracy than long-accepted language ID approaches.
Index Terms
- Language ID in the context of harvesting language data off the web