Language ID in the context of harvesting language data off the web

Authors:
Fei Xia, University of Washington, Seattle, WA
William D. Lewis, Microsoft Research, Redmond, WA
Hoifung Poon, University of Washington, Seattle, WA
EACL '09: Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, March 2009, pages 870–878

Published: 30 March 2009

ABSTRACT

As the arm of NLP technologies extends beyond a small core of languages, techniques for working with instances of language data across hundreds to thousands of languages may require revisiting and recalibrating the tried-and-true methods that are used. One of the NLP techniques that has been treated as "solved" is language identification (language ID) of written text. However, we argue that language ID is far from solved when one considers input spanning not dozens of languages, but rather hundreds to thousands, a number that one approaches when harvesting language data found on the Web. We formulate language ID as a coreference resolution problem, apply it to a Web harvesting task for a specific linguistic data type, and achieve a much higher accuracy than long-accepted language ID approaches.
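
For readers unfamiliar with conventional language ID, the following is a minimal sketch of one classical technique, character n-gram profile matching with an out-of-place rank distance. It is not the coreference formulation proposed in the paper, and the abstract does not say which baselines the authors compared against. The function names, the two toy training sentences, and the `top_k` value are assumptions made for illustration only.

```python
from collections import Counter


def ngram_profile(text, n=3, top_k=300):
    """Return the top_k most frequent character n-grams, most frequent first."""
    # Normalize case and whitespace, and pad so word boundaries form n-grams.
    padded = f" {' '.join(text.lower().split())} "
    counts = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from lang_profile get a fixed penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(doc_profile)
    )


def identify(text, profiles):
    """Pick the language whose profile is closest to the text's profile."""
    doc_profile = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc_profile, profiles[lang]))


# Toy training data (hypothetical); a real system needs a sample per language.
TRAINING = {
    "english": "the quick brown fox jumps over the lazy dog "
               "and then the dog sleeps in the sun",
    "german": "der schnelle braune fuchs springt über den faulen hund "
              "und dann schläft der hund in der sonne",
}
PROFILES = {lang: ngram_profile(sample) for lang, sample in TRAINING.items()}

if __name__ == "__main__":
    print(identify("the dog and the fox", PROFILES))
    print(identify("der hund und der fuchs", PROFILES))
```

Note that a classifier of this kind can only return a language for which it holds a training profile, which is plausibly one practical obstacle when the candidate set grows from dozens of languages to hundreds or thousands.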

Index Terms

Language ID in the context of harvesting language data off the web
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
  2. Machine learning
    1. Learning paradigms
2. Information systems
  1. Information retrieval
    1. Evaluation of retrieval results

Published in
EACL '09: Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics
March 2009
905 pages
General Chair: Alex Lascarides, University of Edinburgh (UK)
Program Chairs: Claire Gardent, CNRS/LORIA Nancy (France); Joakim Nivre, Uppsala University and Växjö University (Sweden)
Publisher: Association for Computational Linguistics, United States
Publication History: Published 30 March 2009
Qualifiers: research-article
Acceptance Rates: EACL '09 paper acceptance rate 100 of 360 submissions, 28%. Overall acceptance rate 100 of 360 submissions, 28%.