Improving web page classification by label-propagation over click graphs

Authors:
Soo-Min Kim; Patrick Pantel; Lei Duan; Scott Gaffney
Affiliations:
Yahoo! Labs, Sunnyvale, CA, USA; Yahoo! Labs, Sunnyvale, CA, USA; Yahoo! Labs, Sunnyvale, CA, USA; Yahoo! Labs, Sunnyvale, CA, USA
Venue:
Proceedings of the 18th ACM conference on Information and knowledge management
Year:
2009

Abstract

In this paper, we present a semi-supervised learning method for web page classification that leverages click logs to augment training data by propagating class labels to similar unlabeled documents. Current state-of-the-art classifiers are supervised and require large amounts of manually labeled data. We hypothesize that unlabeled documents similar to our positively and negatively labeled documents tend to be clicked through by the same user queries. Our method builds on this hypothesis and augments the training set by modeling the similarity between documents in a click graph. We experiment with three different web page classifiers and show empirical evidence that our approach outperforms state-of-the-art methods and reduces the human effort needed to label training data.
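
The abstract does not spell out the propagation procedure, so the following is only a minimal sketch of one generic way to propagate labels over a query-document click graph; it is not the authors' algorithm. It assumes a bipartite graph built from hypothetical `(query, url, click_count)` triples, seed labels of +1.0 (positive) and -1.0 (negative), and click-weighted averaging with clamped seeds. The function name `propagate_labels`, the example URLs, and the iteration count are all illustrative assumptions.

```python
from collections import defaultdict


def propagate_labels(clicks, seeds, iterations=10):
    """Spread seed labels to unlabeled documents through shared queries.

    clicks: iterable of (query, url, click_count), click_count > 0 (assumed).
    seeds:  dict mapping url -> +1.0 (positive) or -1.0 (negative).
    Returns a dict mapping every url in the click graph to a score in [-1, 1].
    """
    # Build both directions of the bipartite query-document graph.
    query_to_docs = defaultdict(dict)
    doc_to_queries = defaultdict(dict)
    for query, url, count in clicks:
        query_to_docs[query][url] = query_to_docs[query].get(url, 0) + count
        doc_to_queries[url][query] = doc_to_queries[url].get(query, 0) + count

    # Unlabeled documents start at 0.0; seeds start at their label.
    doc_score = {url: float(seeds.get(url, 0.0)) for url in doc_to_queries}

    for _ in range(iterations):
        # A query's score is the click-weighted mean of its documents.
        query_score = {}
        for query, docs in query_to_docs.items():
            total = sum(docs.values())
            query_score[query] = sum(doc_score[d] * c for d, c in docs.items()) / total

        # A document's score is the click-weighted mean of its queries,
        # except for seeds, which stay clamped to their original label.
        updated = {}
        for url, queries in doc_to_queries.items():
            if url in seeds:
                updated[url] = float(seeds[url])
            else:
                total = sum(queries.values())
                updated[url] = sum(query_score[q] * c for q, c in queries.items()) / total
        doc_score = updated

    return doc_score


# Hypothetical click log: two queries, each clicking one seed and one unlabeled page.
clicks = [
    ("cheap flights", "a.example", 5),
    ("cheap flights", "b.example", 3),
    ("python tutorial", "c.example", 4),
    ("python tutorial", "d.example", 2),
]
seeds = {"a.example": 1.0, "c.example": -1.0}
scores = propagate_labels(clicks, seeds)
```

In this toy log, `b.example` shares a query with the positive seed and ends up with a positive score, while `d.example` shares a query with the negative seed and ends up negative. Unlabeled documents whose score magnitude exceeds some chosen threshold could then be added to the training set with the corresponding label; the threshold is a further assumption left to the reader.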