Improving web information indexing and retrieval based on center block duplication detection

  • Authors:
  • Tyrone Cadenhead;Jinlin Chen;Terry Cook

  • Affiliations:
  • Department of Computer Science, University of Dallas, Texas, USA.;Department of Computer Science, Queens College, City University of New York, Flushing, NY 11367, USA.;Department of Computer Science, Graduate Centre, City University of New York, New York 10016, USA

  • Venue:
  • International Journal of Innovative Computing and Applications
  • Year:
  • 2008

Abstract

Duplicated information on today's Web has a serious negative impact on Web search engines: it increases the size of the index and lowers the efficiency of Web information retrieval. One important fact is that, for various reasons, a large amount of Web content duplication occurs at the block level, in addition to the site and page levels. Moreover, in most Web searches the desired information is located in the center block of a relevant page. Based on these two observations, we propose an efficient block-level duplication detection algorithm based on resemblance transitivity, and we index center blocks instead of entire Web pages for Web information retrieval. Experiments show that these strategies can effectively reduce index size and index construction time without sacrificing the effectiveness of Web information retrieval.
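
The abstract does not spell out the algorithm, so the following is only an illustrative sketch of how block-level duplicate detection might exploit resemblance transitivity. It assumes that resemblance is measured as the Jaccard overlap of word shingles and that a block resembling a cluster's representative can be treated as resembling every other member of that cluster, which avoids comparing all pairs. The function names, the shingle size `k`, and the `threshold` value are hypothetical and are not taken from the paper.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles of a block (assumed representation)."""
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def resemblance(a, b):
    """Jaccard resemblance between two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_blocks(blocks, threshold=0.5, k=3):
    """Assign each block a cluster id, comparing only against representatives.

    Transitivity assumption: if a block resembles a cluster's representative,
    it is treated as a duplicate of every block in that cluster, so no further
    pairwise comparisons are made for it.
    """
    representatives = []  # shingle set of the first block seen in each cluster
    labels = []
    for text in blocks:
        s = shingles(text, k)
        for cluster_id, rep in enumerate(representatives):
            if resemblance(s, rep) >= threshold:
                labels.append(cluster_id)
                break
        else:
            # No representative is similar enough: start a new cluster.
            representatives.append(s)
            labels.append(len(representatives) - 1)
    return labels


blocks = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog today",
    "completely different text about web search engines",
]
labels = cluster_blocks(blocks)
```

Under these assumptions, only one block per cluster (for example, one copy of each distinct center block) would need to be indexed. Because resemblance is not strictly transitive, a real system would have to choose the threshold with care.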