Language identification: the long and the short of the matter

Authors:
Timothy Baldwin;Marco Lui
Affiliations:
University of Melbourne, Australia;University of Melbourne, Australia
Venue:
HLT '10 Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics
Year:
2010

Abstract

Language identification is the task of identifying the language a given document is written in. This paper describes a detailed examination of which models perform best under different conditions, based on experiments across three separate datasets and a range of tokenisation strategies. We demonstrate that the task becomes increasingly difficult as we increase the number of languages, reduce the amount of training data and reduce the length of documents. We also show that it is possible to perform language identification without having to perform explicit character encoding detection.
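
As a rough illustration of the abstract's final point, the sketch below classifies raw byte strings by comparing byte n-gram counts, so no encoding is ever detected or decoded. It is not the authors' implementation: the function names, the toy sentences, the choice of bigrams and the cosine nearest-prototype rule are all assumptions made for illustration only.

```python
from collections import Counter
from math import sqrt


def byte_ngrams(data: bytes, n: int = 2) -> Counter:
    """Count overlapping byte n-grams in a raw (undecoded) byte string."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(samples: dict, n: int = 2) -> dict:
    """Build one aggregate n-gram profile per language label.

    `samples` maps a label to a list of byte strings. The documents for
    different labels may use different encodings; they are never decoded.
    """
    profiles = {}
    for label, docs in samples.items():
        profile = Counter()
        for doc in docs:
            profile.update(byte_ngrams(doc, n))
        profiles[label] = profile
    return profiles


def identify(data: bytes, profiles: dict, n: int = 2) -> str:
    """Return the label whose profile is most similar to the input bytes."""
    query = byte_ngrams(data, n)
    return max(profiles, key=lambda label: cosine(query, profiles[label]))


# Hypothetical toy data: one sentence per language, in different encodings.
TOY_SAMPLES = {
    "en": ["the quick brown fox jumps over the lazy dog".encode("utf-8")],
    "de": ["der schnelle braune Fuchs springt über den faulen Hund".encode("latin-1")],
}

if __name__ == "__main__":
    toy_profiles = train(TOY_SAMPLES)
    print(identify(b"the dog jumps", toy_profiles))
```

Because the features are byte sequences, the same language stored in two encodings simply yields different n-grams for its non-ASCII characters; a model trained on examples in each encoding can absorb that variation. With realistic amounts of training data, a stronger learner would replace the cosine rule used here.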