Web data cleansing for information retrieval using key resource page selection

  • Authors:
  • Yiqun Liu;Canhui Wang;Min Zhang;Shaoping Ma

  • Affiliations:
  • Tsinghua University, Beijing, China P.R.;Tsinghua University, Beijing, China P.R.;Tsinghua University, Beijing, China P.R.;Tsinghua University, Beijing, China P.R.

  • Venue:
  • WWW '05 Special interest tracks and posters of the 14th international conference on World Wide Web
  • Year:
  • 2005

Abstract

With the explosive growth in the number of pages on the WWW, covering more useful information with limited storage and computation resources is becoming increasingly important in web IR research. Using an analysis of non-content features of web pages, we propose a clustering-based method to select high-quality pages from the whole page set. Although the resulting page set contains only 44.3% of the whole collection, it is associated with more than 98% of the links and covers about 90% of the key information. Link properties and retrieval effects are also examined, and experimental results show that the key resource selection method is more suitable for data cleansing and that the resulting page set outperforms the whole collection, with smaller size and better retrieval performance.
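
To make the three reported ratios concrete (share of pages kept, share of links associated with the kept pages, share of key information covered), the following is a minimal sketch of how such figures could be computed for a selected subset. It is not the authors' method or evaluation code. The function name `coverage_report`, the toy data, and the rule that a link counts as covered when either endpoint is kept are all assumptions made for illustration.

```python
def coverage_report(all_pages, links, key_pages, selected):
    """Return size, link, and key-page coverage ratios for a selected subset."""
    selected = set(selected)
    # Share of the collection that is kept after cleansing.
    size_ratio = len(selected) / len(all_pages)
    # Assumption: a link is "covered" if its source or target page is kept.
    covered_links = sum(1 for src, dst in links if src in selected or dst in selected)
    link_ratio = covered_links / len(links)
    # Share of known key resource pages that survive the selection.
    key_ratio = len(selected & set(key_pages)) / len(key_pages)
    return {"size": size_ratio, "links": link_ratio, "key": key_ratio}


# Toy data, invented for illustration only (not from the paper).
pages = ["a", "b", "c", "d", "e"]
links = [("a", "b"), ("b", "c"), ("d", "a"), ("e", "d")]
key_pages = ["a", "b"]
report = coverage_report(pages, links, key_pages, selected=["a", "b"])
print(report)
```

In this toy collection, keeping two of five pages still touches three of the four links and retains both key pages, which mirrors the kind of trade-off the abstract describes.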