Topic-independent web high-quality page selection based on k-means clustering

Authors:
Canhui Wang;Yiqun Liu;Min Zhang;Shaoping Ma
Affiliations:
State Key Lab of Intelligent Technology & Systems, Tsinghua University, Beijing, P.R. China;State Key Lab of Intelligent Technology & Systems, Tsinghua University, Beijing, P.R. China;State Key Lab of Intelligent Technology & Systems, Tsinghua University, Beijing, P.R. China;State Key Lab of Intelligent Technology & Systems, Tsinghua University, Beijing, P.R. China
Venue:
AIRS'05 Proceedings of the Second Asia conference on Asia Information Retrieval Technology
Year:
2005

Abstract

One challenge for web search engines is assessing the quality of web pages independently of any particular user request. High-quality web pages give readers good entry points to more focused, relevant information on the web. This paper addresses topic-independent selection of high-quality web pages, with the aim of reducing redundancy and filtering out noise. Several non-content features and their effects on high-quality page selection are studied. K-means clustering on these features is then used to separate high-quality pages from ordinary ones. Experiments were conducted on the 19 GB TREC web data set (the .GOV collection). With the proposed approach, fewer than 50% of the web pages are selected as high quality, yet they cover about 90% of the key information in the whole collection. Retrieval over this high-quality page set improves by more than 40% compared with retrieval over the whole collection.
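
The clustering step described in the abstract can be illustrated with a minimal sketch. The abstract does not list the non-content features, so the three used here (in-link count, URL depth, document length), the toy page names and values, and the rule that the cluster with the higher mean in-link count is the "high-quality" one are all assumptions made for illustration, not the authors' method. The deterministic farthest-point initialization is likewise chosen only to make the example reproducible.

```python
import math

# Hypothetical non-content features per page (toy values, not from the paper):
# (in-link count, URL depth, document length in words)
PAGES = {
    "site-home":    (120, 0, 900),
    "agency-index": (85, 1, 1200),
    "topic-hub":    (60, 1, 700),
    "deep-form":    (2, 5, 150),
    "old-memo":     (1, 4, 300),
    "print-view":   (0, 6, 200),
}


def min_max_scale(rows):
    """Scale every feature column to [0, 1] so no feature dominates the distance."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def farthest_first(points, k):
    """Deterministic seeding: start at the first point, then repeatedly add
    the point farthest from all centroids chosen so far."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        nxt = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(nxt))
    return centroids


def kmeans(points, k=2, max_iters=100):
    """Plain Lloyd-style k-means with Euclidean distance."""
    centroids = farthest_first(points, k)
    labels = None
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments unchanged, so converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


names = list(PAGES)
features = min_max_scale([PAGES[n] for n in names])
labels, centroids = kmeans(features, k=2)

# Assumption: the cluster whose centroid has the higher scaled in-link count
# (feature index 0) is treated as the high-quality cluster.
best = max(range(len(centroids)), key=lambda j: centroids[j][0])
selected = [n for n, lab in zip(names, labels) if lab == best]
print(selected)
```

On real data the feature set, the scaling, and the rule for deciding which cluster counts as high quality would all need to be validated against relevance judgments, as the experiments in the paper do on the .GOV collection.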