We consider the problem of language identification for search engine queries. First, we propose a method that automatically generates a data set by using click-through logs of the Yahoo! Search Engine to derive the language of a query indirectly from the language of the documents that users clicked. Next, we use this data set to train two decision tree classifiers: one that uses only linguistic features and is aimed at textual language identification, and one that additionally uses a non-linguistic feature and is geared towards identifying the language intended by the users of the search engine. Our results show that the method produces a highly reliable data set very efficiently, and that our decision tree classifier outperforms some of the best methods proposed for written language identification in the domain of search engine queries.
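The idea of deriving a query's language from the documents clicked for it can be illustrated with a small sketch. This is not the authors' procedure: the majority vote, the function name `label_queries`, the thresholds `min_clicks` and `min_agreement`, and the sample log entries are all assumptions made for illustration, and the language of each clicked document is taken as already known.

```python
from collections import Counter, defaultdict


def label_queries(clicks, min_clicks=3, min_agreement=0.8):
    """Assign a language to each query by majority vote over clicked documents.

    clicks: iterable of (query, clicked_document_language) pairs.
    min_clicks and min_agreement are hypothetical thresholds, not values
    taken from the paper.
    """
    votes = defaultdict(Counter)
    for query, doc_lang in clicks:
        votes[query][doc_lang] += 1

    labels = {}
    for query, counts in votes.items():
        total = sum(counts.values())
        lang, top = counts.most_common(1)[0]
        # Keep a query only when it has enough clicks and the clicked
        # documents mostly agree on a single language.
        if total >= min_clicks and top / total >= min_agreement:
            labels[query] = lang
    return labels


# Invented click-through entries, for illustration only.
sample_clicks = [
    ("wetter berlin", "de"), ("wetter berlin", "de"), ("wetter berlin", "de"),
    ("hotel paris", "fr"), ("hotel paris", "en"), ("hotel paris", "en"),
    ("pizza", "it"),
]

labels = label_queries(sample_clicks)
print(labels)
```

With these thresholds only "wetter berlin" receives a label: "hotel paris" is dropped because its clicks disagree, and "pizza" because it has too few clicks. Discarding ambiguous queries in this way trades coverage for label reliability, which is one plausible way to obtain a reliable automatically generated data set.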