Topic-independent web high-quality page selection based on k-means clustering

  • Authors:
  • Canhui Wang;Yiqun Liu;Min Zhang;Shaoping Ma

  • Affiliations:
  • State Key Lab of Intelligent Technology & Systems, Tsinghua University, Beijing, P.R. China (all four authors)

  • Venue:
  • AIRS'05 Proceedings of the Second Asia conference on Asia Information Retrieval Technology
  • Year:
  • 2005

Abstract

One challenge for web search engines is to assess the quality of web pages independently of any given user request. High-quality web pages give readers good entry points to more focused, relevant information on the web. This paper focuses on topic-independent selection of high-quality web pages in order to reduce redundancy and remove noise from web data. Several non-content features and their effects on high-quality page selection are studied. K-means clustering over these features is then performed to separate high-quality pages from ordinary ones. Experiments were conducted on the 19 GB (document size) TREC web data set (.GOV data). With the proposed approach, fewer than 50% of the web pages are selected as high-quality ones, yet they cover about 90% of the key information in the whole set. Information retrieval on this high-quality page set achieves more than 40% improvement compared with retrieval on the whole data collection.
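
The clustering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The two features (in-link count and URL depth), the page values, the log and min-max scaling, the choice of k = 2, and the rule that the cluster with the larger in-link centroid is the high-quality one are all assumptions made for illustration; the abstract does not specify any of them.

```python
import math

# Hypothetical pages: (id, in-link count, URL depth). All values are invented.
PAGES = [
    ("a", 120, 1),
    ("b", 95, 1),
    ("c", 2, 4),
    ("d", 0, 5),
    ("e", 150, 2),
    ("f", 1, 3),
]


def min_max_scale(column):
    """Rescale a list of numbers to [0, 1]; a constant column maps to 0."""
    lo, hi = min(column), max(column)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in column]


def build_features(pages):
    """Turn raw, non-content page attributes into scaled feature vectors."""
    # log1p damps the heavy tail of in-link counts (an assumed preprocessing step).
    in_links = min_max_scale([math.log1p(p[1]) for p in pages])
    depth = min_max_scale([p[2] for p in pages])
    return [list(v) for v in zip(in_links, depth)]


def kmeans(points, k, max_iters=100):
    """Lloyd's k-means, seeded deterministically with the first k points."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # no assignment changed, so we have converged
            break
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def select_high_quality(pages):
    """Cluster pages into two groups and return the ids of the assumed better one."""
    points = build_features(pages)
    labels, centroids = kmeans(points, k=2)
    # Assumption: the cluster whose centroid has more in-links is "high quality".
    best = max(range(2), key=lambda c: centroids[c][0])
    return [p[0] for p, lab in zip(pages, labels) if lab == best]


high_quality_ids = select_high_quality(PAGES)
```

Because the features never look at page text or at a query, the selection is topic-independent in the sense used above: it can be computed once, offline, for the whole collection.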