Efficient search in large textual collections with redundancy

  • Authors:
  • Jiangong Zhang;Torsten Suel

  • Affiliations:
  • Polytechnic University, Brooklyn, NY;Polytechnic University, Brooklyn, NY

  • Venue:
  • Proceedings of the 16th international conference on World Wide Web
  • Year:
  • 2007

Quantified Score

Hi-index 0.00

Visualization

Abstract

Current web search engines focus on searching only themost recentsnapshot of the web. In some cases, however, it would be desirableto search over collections that include many different crawls andversions of each page. One important example of such a collectionis the Internet Archive, though there are many others. Sincethe data size of such an archive is multiple times that of a singlesnapshot, this presents us with significant performance challenges.Current engines use various techniques for index compression andoptimized query execution, but these techniques do not exploit thesignificant similarities between different versions of a page, or betweendifferent pages.In this paper, we propose a general framework for indexing andquery processing of archival collections and, more generally, anycollections with a sufficient amount of redundancy. Our approachresults in significant reductions in index size and query processingcosts on such collections, and it is orthogonal to and can be combinedwith the existing techniques. It also supports highly efficientupdates, both locally and over a network. Within this framework,we describe and evaluate different implementations that trade offindex size versus CPU cost and other factors, and discuss applicationsranging from archival web search to local search of web sites,email archives, or file systems. We present experimental resultsbased on search engine query log and a large collection consistingof multiple crawls.