Current web search engines focus on searching only the most recent snapshot of the web. In some cases, however, it would be desirable to search over collections that include many different crawls and versions of each page. One important example of such a collection is the Internet Archive, though there are many others. Since the data size of such an archive is multiple times that of a single snapshot, this presents us with significant performance challenges. Current engines use various techniques for index compression and optimized query execution, but these techniques do not exploit the significant similarities between different versions of a page, or between different pages.

In this paper, we propose a general framework for indexing and query processing of archival collections and, more generally, any collections with a sufficient amount of redundancy. Our approach results in significant reductions in index size and query processing costs on such collections, and it is orthogonal to and can be combined with the existing techniques. It also supports highly efficient updates, both locally and over a network. Within this framework, we describe and evaluate different implementations that trade off index size versus CPU cost and other factors, and discuss applications ranging from archival web search to local search of web sites, email archives, or file systems. We present experimental results based on a search engine query log and a large collection consisting of multiple crawls.