Language identification of search engine queries

Authors:
Hakan Ceylan;Yookyung Kim
Affiliations:
University of North Texas, Denton, TX;Mission College Blvd., Santa Clara, CA
Venue:
ACL '09 Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2
Year:
2009

Abstract

We consider the language identification problem for search engine queries. First, we propose a method to automatically generate a data set: it uses click-through logs of the Yahoo! Search Engine to derive the language of a query indirectly from the language of the documents that users clicked. Next, we use this data set to train two decision tree classifiers: one that uses only linguistic features and is aimed at textual language identification, and one that additionally uses a non-linguistic feature and is geared towards identifying the language intended by the users of the search engine. Our results show that our method produces a highly reliable data set very efficiently, and that our decision tree classifier outperforms some of the best methods proposed for written language identification in the domain of search engine queries.
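
The data set generation step described in the abstract, labeling a query with the language of the documents users clicked for it, can be illustrated with a minimal sketch. This is not the authors' implementation. The log format (pairs of query and clicked-document language), the function name `label_queries`, and the thresholds `min_clicks` and `min_agreement` are all assumptions made for illustration, and the language of each clicked document is assumed to be known already.

```python
from collections import Counter, defaultdict


def label_queries(click_log, min_clicks=2, min_agreement=0.8):
    """Assign each query the dominant language of its clicked documents.

    click_log: iterable of (query, document_language) pairs.
    This format and both thresholds are hypothetical, not from the paper.
    """
    # Count, per query, how often each document language was clicked.
    langs_by_query = defaultdict(Counter)
    for query, doc_lang in click_log:
        langs_by_query[query][doc_lang] += 1

    labels = {}
    for query, counts in langs_by_query.items():
        total = sum(counts.values())
        lang, top = counts.most_common(1)[0]
        # Keep a label only when there is enough evidence and the
        # clicked documents mostly agree on a single language.
        if total >= min_clicks and top / total >= min_agreement:
            labels[query] = lang
    return labels


# Toy click-through log; the entries are invented for illustration.
click_log = [
    ("hava durumu", "tr"),
    ("hava durumu", "tr"),
    ("hava durumu", "tr"),
    ("paris hotel", "en"),
    ("paris hotel", "fr"),
    ("wetter", "de"),
]
labels = label_queries(click_log)
print(labels)
```

In this toy log only "hava durumu" receives a label: "paris hotel" is dropped because its clicks disagree on the language, and "wetter" is dropped because it has a single click. Filtering of this kind trades coverage for label reliability; the actual criteria used in the paper are not stated in the abstract.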