A comparison of language identification approaches on short, query-style texts

Authors:
Thomas Gottron; Nedim Lipka
Affiliations:
Institut für Informatik, Johannes Gutenberg-Universität Mainz, Mainz, Germany; Faculty of Media, Media Systems, Bauhaus University Weimar, Weimar, Germany
Venue:
ECIR'2010 Proceedings of the 32nd European conference on Advances in Information Retrieval
Year:
2010

Abstract

In a multi-language Information Retrieval setting, knowing the language of a user query is important for further processing. Hence, we compare the performance of some typical approaches for language detection on very short, query-style texts. The results show that an accuracy of more than 80% can already be achieved for single words; for slightly longer texts, we even observed accuracy values close to 100%.