Language ID in the context of harvesting language data off the web

Authors:
Fei Xia;William D. Lewis;Hoifung Poon
Affiliations:
University of Washington, Seattle, WA;Microsoft Research, Redmond, WA;University of Washington, Seattle, WA
Venue:
EACL '09 Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics
Year:
2009

Citing 10
Cited 5

A machine learning approach to coreference resolution of noun phrases

Computational Linguistics - Special issue on computational anaphora resolution
An integrated, conditional model of information extraction and coreference with application to citation matching

UAI '04 Proceedings of the 20th conference on Uncertainty in artificial intelligence
Improving machine learning approaches to coreference resolution

ACL '02 Proceedings of the 40th Annual Meeting on Association for Computational Linguistics
Markov logic networks

Machine Learning
ODIN: A Model for Adapting and Enriching Legacy Infrastructure

E-SCIENCE '06 Proceedings of the Second IEEE International Conference on e-Science and Grid Computing
Prototype-driven grammar induction

ACL-44 Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics
Predicting Structured Data (Neural Information Processing)

Predicting Structured Data (Neural Information Processing)
Sound and efficient inference with probabilistic and deterministic dependencies

AAAI'06 Proceedings of the 21st national conference on Artificial intelligence - Volume 1
Joint unsupervised coreference resolution with Markov logic

EMNLP '08 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Joint inference in information extraction

AAAI'07 Proceedings of the 22nd national conference on Artificial intelligence - Volume 1

Parsing, projecting & prototypes: repurposing linguistic data on the web

EACL '09 Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics: Demonstrations Session
Applying NLP technologies to the collection and enrichment of language data on the Web to aid linguistic research

LaTeCH-SHELT&R '09 Proceedings of the EACL 2009 Workshop on Language Technology and Resources for Cultural Heritage, Social Sciences, Humanities, and Education
Language identification: the long and the short of the matter

HLT '10 Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics
Language identification for creating language-specific Twitter collections

LSM '12 Proceedings of the Second Workshop on Language in Social Media
Towards an ACL anthology corpus with logical document structure: an overview of the ACL 2012 contributed task

ACL '12 Proceedings of the ACL-2012 Special Workshop on Rediscovering 50 Years of Discoveries

Quantified Score

Hi-index	0.01

Visualization

Abstract

As the arm of NLP technologies extends beyond a small core of languages, techniques for working with instances of language data across hundreds to thousands of languages may require revisiting and recalibrating the tried and true methods that are used. Of the NLP techniques that has been treated as "solved" is language identification (language ID) of written text. However, we argue that language ID is far from solved when one considers input spanning not dozens of languages, but rather hundreds to thousands, a number that one approaches when harvesting language data found on the Web. We formulate language ID as a coreference resolution problem and apply it to a Web harvesting task for a specific linguistic data type and achieve a much higher accuracy than long accepted language ID approaches.