A novel traffic analysis for identifying search fields in the long tail of web sites

Authors:
George Forman;Evan Kirshenbaum;Shyamsundar Rajaram
Affiliations:
HP Labs, Palo Alto, CA, USA;HP Labs, Palo Alto, CA, USA;HP Labs, Palo Alto, CA, USA
Venue:
Proceedings of the 19th international conference on World wide web
Year:
2010

Abstract

Using a clickstream sample of 2 billion URLs from many thousands of volunteer Web users, we wish to analyze typical usage of keyword searches across the Web. To do this, we need to determine whether a given URL represents a keyword search and, if so, which field contains the query. Although it is easy to recognize 'q' as the query field in 'http://www.google.com/search?hl=en&q=music', we must do this automatically for the long tail of diverse websites. This problem is the focus of this paper. Since the names, types and number of fields differ across sites, the problem does not conform to traditional text classification or to multi-class problem formulations. It also exhibits highly non-uniform importance across websites, since traffic follows a Zipf distribution. We developed a solution based on manually identifying the query fields on the most popular sites, followed by an adaptation of machine learning for the rest. It involves an interesting case-instances structure: labeling each website case usually involves selecting at most one of the field instances as positive, based on seeing sample field values. This problem structure and soft constraint, which we believe has broader applicability, can be used to greatly reduce the manual labeling effort. We employed active learning and judicious GUI presentation to efficiently train a classifier with accuracy estimated at 96%, beating several baseline alternatives.
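
To make the case-instances structure concrete, the following is a minimal Python sketch, not the authors' implementation. It treats each website as a case and each URL field name as an instance, gathers sample values per field, and selects at most one field per site as the query field. The function names (`collect_field_values`, `toy_score`, `pick_query_field`), the 0.5 threshold, and the scoring rule are all hypothetical. In particular, `toy_score` is a crude stand-in for a trained classifier: it simply favors fields whose sampled values are varied and contain letters.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlsplit


def collect_field_values(urls):
    """Group sampled field values by website (case) and field name (instance)."""
    samples = defaultdict(lambda: defaultdict(list))
    for url in urls:
        parts = urlsplit(url)
        # parse_qsl decodes percent-escapes and turns '+' into a space.
        for name, value in parse_qsl(parts.query):
            samples[parts.netloc][name].append(value)
    return samples


def toy_score(values):
    """Hypothetical stand-in for a learned classifier score in [0, 1].

    Returns the number of distinct values containing a letter, divided by
    the number of sampled values. Constant fields (such as a language code)
    and purely numeric fields (such as an item id) score low.
    """
    if not values:
        return 0.0
    distinct_text = {v.lower() for v in values if any(ch.isalpha() for ch in v)}
    return len(distinct_text) / len(values)


def pick_query_field(fields, score=toy_score, threshold=0.5):
    """Select at most one field of a site as the query field.

    `fields` maps field name -> list of sampled values for one website.
    Returns the best-scoring field name, or None when no field reaches
    the (assumed) threshold, i.e. the site shows no keyword search.
    """
    if not fields:
        return None
    best = max(fields, key=lambda name: score(fields[name]))
    return best if score(fields[best]) >= threshold else None
```

Calling `collect_field_values` on a sample of URLs and then `pick_query_field` on each site's fields yields one candidate field name or `None` per site. Taking only the single best-scoring field is one simple way to express the "at most one positive per case" constraint. A real system would replace `toy_score` with a trained model, since this heuristic would also rank varied non-query fields such as session identifiers highly.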