Efficient search in large textual collections with redundancy

Authors:
Jiangong Zhang;Torsten Suel
Affiliations:
Polytechnic University, Brooklyn, NY;Polytechnic University, Brooklyn, NY
Venue:
Proceedings of the 16th international conference on World Wide Web
Year:
2007

Citing 32
Cited 14

Incremental updates of inverted lists for text document retrieval

SIGMOD '94 Proceedings of the 1994 ACM SIGMOD international conference on Management of data
Self-indexing inverted files for fast text retrieval

ACM Transactions on Information Systems (TOIS)
Efficient distributed backup with delta compression

Proceedings of the fifth workshop on I/O in parallel and distributed systems
Delta algorithms: an empirical analysis

ACM Transactions on Software Engineering and Methodology (TOSEM)
Syntactic clustering of the Web

Selected papers from the sixth international conference on World Wide Web
The anatomy of a large-scale hypertextual Web search engine

WWW7 Proceedings of the seventh international conference on World Wide Web 7
Managing gigabytes (2nd ed.): compressing and indexing documents and images

Managing gigabytes (2nd ed.): compressing and indexing documents and images
Efficient passage ranking for document databases

ACM Transactions on Information Systems (TOIS)
Finding replicated Web collections

SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
A protocol-independent technique for eliminating redundant network traffic

Proceedings of the conference on Applications, Technologies, Architectures, and Protocols for Computer Communication
Searching the Web

ACM Transactions on Internet Technology (TOIT)
A low-bandwidth network file system

SOSP '01 Proceedings of the eighteenth ACM symposium on Operating systems principles
Modern Information Retrieval

Modern Information Retrieval
Compression of inverted indexes For fast query evaluation

SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Venti: A New Approach to Archival Storage

FAST '02 Proceedings of the Conference on File and Storage Technologies
Fast Incremental Indexing for Full-Text Information Retrieval

VLDB '94 Proceedings of the 20th International Conference on Very Large Data Bases
Dynamic maintenance of web indexes using landmarks

WWW '03 Proceedings of the 12th international conference on World Wide Web
Value-based web caching

WWW '03 Proceedings of the 12th international conference on World Wide Web
Winnowing: local algorithms for document fingerprinting

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Index Compression through Document Reordering

DCC '02 Proceedings of the Data Compression Conference
Efficient approximate and dynamic matching of patterns using a labeling paradigm

FOCS '96 Proceedings of the 37th Annual Symposium on Foundations of Computer Science
Efficient query evaluation using a two-level retrieval process

CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management
Assigning identifiers to documents to enhance the clustering property of fulltext indexes

Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Efficient randomized pattern-matching algorithms

IBM Journal of Research and Development - Mathematics and computing
Index compression using fixed binary codewords

ADC '04 Proceedings of the 15th Australasian database conference - Volume 27
Pastiche: making backup cheap and easy

OSDI '02 Proceedings of the 5th symposium on Operating systems design and implementationCopyright restrictions prevent ACM from being able to make the PDFs for this conference available for downloading
Hierarchical substring caching for efficient content distribution to low-bandwidth clients

WWW '05 Proceedings of the 14th international conference on World Wide Web
Inverted files for text search engines

ACM Computing Surveys (CSUR)
Web Search Engines: Part 1

Computer
Redundancy elimination within large collections of files

ATEC '04 Proceedings of the annual conference on USENIX Annual Technical Conference
Efficient indexing of versioned document sequences

ECIR'07 Proceedings of the 29th European conference on IR research
Indexing shared content in information retrieval systems

EDBT'06 Proceedings of the 10th international conference on Advances in Database Technology

A time machine for text search

SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
FluxCapacitor: efficient time-travel text search

VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Genealogical trees on the web: a search engine user perspective

Proceedings of the 17th international conference on World Wide Web
Inverted index compression and query processing with optimized document ordering

Proceedings of the 18th international conference on World wide web
Optimizing complex extraction programs over evolving text data

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Compact full-text indexing of versioned document collections

Proceedings of the 18th ACM conference on Information and knowledge management
Durable top-k search in document archives

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Using the past to score the present: extending term weighting models through revision history analysis

CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
Improved index compression techniques for versioned document collections

CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
A word at a time: computing word relatedness using temporal semantic analysis

Proceedings of the 20th international conference on World wide web
Faster temporal range queries over versioned text

Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Full-text search in email archives using social evaluation, attached and linked resources

Proceedings of the 21st international conference companion on World Wide Web
Index maintenance for time-travel text search

SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Optimizing positional index structures for versioned document collections

SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval

Quantified Score

Hi-index	0.00

Visualization

Abstract

Current web search engines focus on searching only themost recentsnapshot of the web. In some cases, however, it would be desirableto search over collections that include many different crawls andversions of each page. One important example of such a collectionis the Internet Archive, though there are many others. Sincethe data size of such an archive is multiple times that of a singlesnapshot, this presents us with significant performance challenges.Current engines use various techniques for index compression andoptimized query execution, but these techniques do not exploit thesignificant similarities between different versions of a page, or betweendifferent pages.In this paper, we propose a general framework for indexing andquery processing of archival collections and, more generally, anycollections with a sufficient amount of redundancy. Our approachresults in significant reductions in index size and query processingcosts on such collections, and it is orthogonal to and can be combinedwith the existing techniques. It also supports highly efficientupdates, both locally and over a network. Within this framework,we describe and evaluate different implementations that trade offindex size versus CPU cost and other factors, and discuss applicationsranging from archival web search to local search of web sites,email archives, or file systems. We present experimental resultsbased on search engine query log and a large collection consistingof multiple crawls.