Cross-lingual query classification: a preliminary study

Authors:
Xuerui Wang;Andrei Broder;Evgeniy Gabrilovich;Vanja Josifovski;Bo Pang
Affiliations:
University of Massachusetts, Amherst, MA, USA;Yahoo! Research, Santa Clara, CA, USA;Yahoo! Research, Santa Clara, CA, USA;Yahoo! Research, Santa Clara, CA, USA;Yahoo! Research, Santa Clara, CA, USA
Venue:
Proceedings of the 2nd ACM workshop on Improving non english web searching
Year:
2008


Abstract

The non-English Web is growing at breakneck speed, but available language processing tools are mostly English-based. Taxonomies are a case in point: while there are plenty of commercial and non-commercial taxonomies for the English Web, taxonomies for other languages are either unavailable or of very limited quality. Given that building taxonomies in all non-English languages is prohibitively expensive, it is natural to ask whether existing English taxonomies can be leveraged, possibly via machine translation, to enable information processing tasks in other languages. Preliminary results presented in this paper indicate that the answer is affirmative with respect to query classification, a task that is essential both for understanding user intent, and thus providing better search results, and for better targeting of search-based advertising, the economic underpinning of commercial Web search engines. We propose a robust method for classifying non-English queries against an English taxonomy and classifier using widely available, off-the-shelf machine translation systems. In particular, we show that by viewing the search results in the query's original language as independent sources of information, we can alleviate the impact of poor-quality or erroneous machine translations. Empirical results for Chinese queries are encouraging.
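
The idea of treating original-language search results as independent sources of information can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not specify how the evidence is combined, so the majority vote used here is an assumption, and the names `classify_query`, `toy_search`, `toy_translate`, and `toy_classify` are hypothetical stand-ins for a search engine, a machine translation system, and an English classifier.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical word-level lexicon standing in for a machine translation system.
TOY_LEXICON = {
    "苹果": "apple", "手机": "phone", "价格": "price", "电脑": "computer",
    "新款": "new", "水果": "fruit", "营养": "nutrition",
}


def toy_translate(text: str) -> str:
    """Translate whitespace-separated tokens word by word (toy stand-in for MT)."""
    return " ".join(TOY_LEXICON.get(tok, tok) for tok in text.split())


def toy_classify(english_text: str) -> str:
    """Keyword-based stand-in for a classifier trained on an English taxonomy."""
    tokens = set(english_text.split())
    if tokens & {"phone", "computer"}:
        return "Technology"
    if tokens & {"fruit", "nutrition"}:
        return "Food"
    return "Unknown"


def toy_search(query: str) -> List[str]:
    """Stand-in for a search engine returning snippets in the query's language."""
    index = {"苹果": ["苹果 手机 价格", "苹果 电脑 新款", "苹果 水果 营养"]}
    return index.get(query, [])


def classify_query(
    query: str,
    search: Callable[[str], List[str]],
    translate: Callable[[str], str],
    classify_english: Callable[[str], str],
    top_k: int = 10,
) -> str:
    """Classify a non-English query by voting over its translated search results.

    Each result is translated and classified on its own, so one poor
    translation affects a single vote rather than the whole decision.
    Majority voting is an assumption of this sketch.
    """
    votes = Counter(
        classify_english(translate(snippet)) for snippet in search(query)[:top_k]
    )
    if not votes:
        # No search results: fall back to translating the query itself.
        return classify_english(translate(query))
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    print(classify_query("苹果", toy_search, toy_translate, toy_classify))
```

In this toy setting, translating the ambiguous query directly gives "apple", which the stand-in classifier cannot place, whereas two of the three translated results point to the same category. A real system would replace the stubs with an actual search engine, translation service, and taxonomy classifier.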