Leveraging Social Bookmarks from Partially Tagged Corpus for Improved Web Page Clustering

Authors:
Anusua Trivedi;Piyush Rai;Hal Daumé, III;Scott L. Duvall
Affiliations:
University of Utah, Salt Lake City;University of Utah, Salt Lake City;University of Maryland, College Park;VA SLC Health Care System and University of Utah, Salt Lake City
Venue:
ACM Transactions on Intelligent Systems and Technology (TIST)
Year:
2012

Abstract

Automatic clustering of Web pages helps a number of information retrieval tasks, such as improving user interfaces, collection clustering, and introducing diversity in search results. Typically, Web page clustering algorithms use only features extracted from the page-text. However, the advent of social-bookmarking Web sites, such as StumbleUpon.com and Delicious.com, has led to a huge amount of user-generated content, such as the social tag information associated with Web pages. In this article, we present a subspace-based feature extraction approach that leverages the social tag information to complement the page-contents of a Web page for extracting better features, with the goal of improved clustering performance. In our approach, we consider page-text and tags as two separate views of the data, and learn a shared subspace that maximizes the correlation between the two views. Any clustering algorithm can then be applied in this subspace. We then present an extension that makes our approach applicable even if the Web page corpus is only partially tagged, that is, when social tags are present for only a small number of Web pages rather than for all of them. We compare our subspace-based approach with a number of baselines that use tag information in various other ways, and show that the subspace-based approach leads to improved performance on the Web page clustering task. We also discuss possible future work, including an active learning extension that can help in choosing which Web pages to get tags for when social tags can be obtained for only a small number of Web pages.
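
The two-view idea in the abstract can be illustrated with a minimal sketch. It uses regularized linear canonical correlation analysis (CCA) as a stand-in for "a shared subspace that maximizes the correlation between the two views"; the abstract does not specify the exact formulation, so this is not necessarily the authors' method. The function name `cca_subspace`, the regularization value, and the synthetic "text" and "tag" feature matrices are all assumptions made for illustration.

```python
import numpy as np

def cca_subspace(X, Y, k, reg=1e-3):
    """Regularized linear CCA (illustrative stand-in, not the paper's exact method).

    X: (n, dx) text-view features, Y: (n, dy) tag-view features.
    Returns projections Wx, Wy and the top-k canonical correlations.
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Regularized within-view covariances and the cross-view covariance.
    Cxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / n

    def inv_sqrt(C):
        # Inverse matrix square root via eigendecomposition (C is symmetric PD).
        w, V = np.linalg.eigh(C)
        return V @ np.diag(w ** -0.5) @ V.T

    Kx, Ky = inv_sqrt(Cxx), inv_sqrt(Cyy)
    # Singular values of the whitened cross-covariance are the canonical correlations.
    U, s, Vt = np.linalg.svd(Kx @ Cxy @ Ky)
    return Kx @ U[:, :k], Ky @ Vt.T[:, :k], s[:k]

# Synthetic two-view data (hypothetical): both views depend on a shared
# 2-D latent variable that carries the cluster structure.
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=300)
centers = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
latent = centers[labels] + 0.3 * rng.standard_normal((300, 2))
X_text = latent @ rng.standard_normal((2, 6)) + 0.1 * rng.standard_normal((300, 6))
Y_tags = latent @ rng.standard_normal((2, 4)) + 0.1 * rng.standard_normal((300, 4))

Wx, Wy, corrs = cca_subspace(X_text, Y_tags, k=2)
# Page-text features projected into the shared subspace; any clustering
# algorithm (k-means, spectral clustering, ...) could be run on Z_text.
Z_text = (X_text - X_text.mean(axis=0)) @ Wx
```

Because the text-view projection `Wx` needs only page-text features once it is learned, a naive baseline for a partially tagged corpus would be to fit on the tagged pages and project every page with `Wx`. This is an assumption for illustration; the abstract does not describe how the authors' extension works.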