Efficient Update of Indexes for Dynamically Changing Web Documents

Authors:
Lipyeow Lim;Min Wang;Sriram Padmanabhan;Jeffrey Scott Vitter;Ramesh Agarwal
Affiliations:
IBM T. J. Watson Research Ctr., Hawthorne, USA 10532;IBM T. J. Watson Research Ctr., Hawthorne, USA 10532;IBM Silicon Valley Lab., San Jose, USA 95141;Purdue University, West Lafayette, USA 47907;IBM Almaden Research Ctr., San Jose, USA 95120-6099
Venue:
World Wide Web
Year:
2007

Abstract

Recent work on incremental crawling has enabled the indexed document collection of a search engine to stay more closely synchronized with the changing World Wide Web. However, this synchronized collection is not immediately searchable, because the keyword index is rebuilt from scratch less frequently than the collection can be refreshed. An inverted index is usually used to index documents crawled from the web, and rebuilding the complete index at high frequency is expensive. Previous work on incremental inverted index updates has been restricted to adding and removing documents; updating the inverted index for previously indexed documents that have changed has not been addressed. In this paper, we propose an efficient method to update the inverted index for previously indexed documents whose contents have changed. Our method uses the idea of landmarks together with the diff algorithm to significantly reduce the number of postings in the inverted index that need to be updated. Our experiments verify that our landmark-diff method results in significant savings in the number of update operations on the inverted index.
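
The abstract does not spell out how landmarks and diff interact, so the following is only a rough sketch under stated assumptions, not the authors' implementation. It assumes a landmark is placed every `block` words of the old document, that each posting stores (word, landmark, offset) rather than an absolute position, and that a block untouched by the diff keeps its postings unchanged (only its landmark's position entry moves). The function names `absolute_updates` and `landmark_updates` are hypothetical, and Python's `difflib.SequenceMatcher` stands in for the diff algorithm.

```python
from difflib import SequenceMatcher


def absolute_updates(old, new):
    """Posting deletions plus insertions when postings store absolute positions."""
    old_postings = {(word, pos) for pos, word in enumerate(old)}
    new_postings = {(word, pos) for pos, word in enumerate(new)}
    # Any posting present in only one version must be deleted or inserted.
    return len(old_postings ^ new_postings)


def landmark_updates(old, new, block=4):
    """Posting deletions plus insertions when postings are landmark-relative.

    Assumption (not from the source): one landmark per `block` words of the
    old document; a block touched by any diff edit has all postings rewritten.
    """
    if not old:
        return len(new)
    dirty = set()
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # A pure insertion (i1 == i2) is charged to the block at that position.
        lo = min(i1, len(old) - 1)
        hi = max(lo, i2 - 1)
        dirty.update(range(lo // block, hi // block + 1))
    n_blocks = -(-len(old) // block)  # ceiling division
    clean_words = sum(
        min(block, len(old) - k * block)
        for k in range(n_blocks)
        if k not in dirty
    )
    # Words in clean blocks keep their postings; everything else is rewritten.
    return (len(old) - clean_words) + (len(new) - clean_words)


old_doc = [f"w{i}" for i in range(12)]
new_doc = ["x"] + old_doc  # one word inserted at the front

print(absolute_updates(old_doc, new_doc))           # 25
print(landmark_updates(old_doc, new_doc, block=4))  # 9
```

In this toy case a single inserted word shifts every absolute position, so all 12 old postings are deleted and 13 new ones inserted, whereas under the landmark assumption only the first block's postings are rewritten. The block size and the "rewrite the whole dirty block" rule are simplifications chosen for illustration.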