Query-Sets++: a scalable approach for modeling web sites

Authors:
Barbara Poblete;Myra Spiliopoulou;Marcelo Mendoza
Affiliations:
Department of Computer Science, University of Chile and Yahoo! Research Latin-America, Chile;Otto-von-Guericke-University Magdeburg, Germany;Universidad Técnica Federico Santa María, Chile
Venue:
SPIRE'11 Proceedings of the 18th international conference on String processing and information retrieval
Year:
2011

Abstract

We explore an effective approach for modeling and classifying Web sites in the World Wide Web. The aim of this work is to classify Web sites using features which are independent of size, structure and vocabulary. We establish Web site similarity based on search engine query hits, which convey document relevance and utility in direct relation to users' needs and interests. To achieve this, we use a generic Web site representation scheme over different feature spaces, built upon query traffic to the site's documents. For this task we extend, in a non-trivial way, our prior work using query-sets for single document representation. We discuss why this previous methodology is not scalable for a large set of heterogeneous Web sites. We show that our models achieve very compact Web site representations. Furthermore, our experiments on site classification show excellent performance and quality/dimensionality trade-off. In particular, we sustain a reduction in the feature space to 5% of the size of the bag-of-words representation, while achieving 99% precision in our classification experiments on DMOZ.
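
The general idea of representing a site through the queries that bring users to its documents can be illustrated with a small sketch. This is not the model from the paper: the abstract does not specify the feature spaces or the similarity measure, so the term-count features, the cosine similarity, the function names `site_query_vectors` and `cosine`, and the toy click log with its `.example` hosts are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from math import sqrt
from urllib.parse import urlparse


def site_query_vectors(click_log):
    """Aggregate, per site, the terms of queries that led to a click on it.

    click_log is an iterable of (query, clicked_url) pairs. The site is
    taken to be the URL's host, which is a simplifying assumption.
    """
    vectors = defaultdict(Counter)
    for query, url in click_log:
        site = urlparse(url).netloc.lower()
        vectors[site].update(query.lower().split())
    return dict(vectors)


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical click log: (query, clicked URL).
click_log = [
    ("cheap flights santiago", "http://travel.example/flights/scl"),
    ("hotel deals santiago", "http://travel.example/hotels"),
    ("cheap flights berlin", "http://fly.example/berlin"),
    ("python list sort", "http://docs.example/python/lists"),
]

vectors = site_query_vectors(click_log)
sim_travel_fly = cosine(vectors["travel.example"], vectors["fly.example"])
sim_travel_docs = cosine(vectors["travel.example"], vectors["docs.example"])
```

In this toy log the two travel sites share query terms and so receive a positive similarity, while the documentation site shares none with them. The features depend only on query traffic, not on how many pages a site has or which words appear on them, which is the property the abstract highlights.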