Coupling feature selection and machine learning methods for navigational query identification

Authors:
Yumao Lu;Fuchun Peng;Xin Li;Nawaaz Ahmed
Affiliations:
Yahoo! Inc., Sunnyvale, California;Yahoo! Inc., Sunnyvale, California;Yahoo! Inc., Sunnyvale, California;Yahoo! Inc., Sunnyvale, California
Venue:
CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management
Year:
2006

Abstract

Identifying navigational queries in Web search is important yet hard, because Web queries are typically very short and carry little information. In this paper we study several machine learning methods, including the naive Bayes model, the maximum entropy model, the support vector machine (SVM), and the stochastic gradient boosting tree (SGBT), for navigational query identification in Web search. To boost the performance of these machine learning techniques, we exploit several feature selection methods and propose coupling feature selection with classification approaches to achieve the best performance. Unlike most prior work, which uses a small number of features, we study the problem of identifying navigational queries with thousands of available features, extracted from major commercial search engine results, Web search user click data, query logs, and the whole Web's relational content. A multi-level feature extraction system is constructed. Our results on real search data show that: 1) among all the features we tested, user click distribution features are the most important set of features for identifying navigational queries; 2) to achieve good performance, machine learning approaches have to be coupled with good feature selection methods, and we find that gradient boosting tree coupled with linear SVM feature selection is most effective; 3) with carefully coupled feature selection and classification approaches, navigational queries can be accurately identified with an 88.1% F1 score, which is a 33% error rate reduction compared to the best uncoupled system and a 40% error rate reduction compared to a well-tuned system without feature selection.
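
To make the error rate reductions in the third finding concrete, the short sketch below back-computes the baseline F1 scores they would imply. It assumes that "error rate" means 1 - F1, which the abstract does not state, so the resulting baseline values are illustrative estimates and not figures reported by the authors. The function name implied_baseline_f1 is hypothetical.

```python
def implied_baseline_f1(f1: float, relative_error_reduction: float) -> float:
    """Return the baseline F1 implied by a relative error reduction.

    Assumption (not stated in the abstract): error rate = 1 - F1.
    """
    if not 0.0 <= relative_error_reduction < 1.0:
        raise ValueError("relative_error_reduction must be in [0, 1)")
    error = 1.0 - f1
    # The improved error equals baseline_error * (1 - reduction).
    baseline_error = error / (1.0 - relative_error_reduction)
    return 1.0 - baseline_error


# Figures quoted in the abstract.
coupled_f1 = 0.881

# Implied baselines under the 1 - F1 assumption.
best_uncoupled = implied_baseline_f1(coupled_f1, 0.33)
no_selection = implied_baseline_f1(coupled_f1, 0.40)

print(f"best uncoupled system (implied F1): {best_uncoupled:.3f}")
print(f"no feature selection (implied F1):  {no_selection:.3f}")
```

Under that assumption, the best uncoupled system would sit near 82% F1 and the well-tuned system without feature selection near 80% F1.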