Dynamic maintenance of web indexes using landmarks

Authors:
Lipyeow Lim; Min Wang; Sriram Padmanabhan; Jeffrey Scott Vitter; Ramesh Agarwal
Affiliations:
Duke University, Durham, NC; IBM T. J. Watson Research Ctr., Hawthorne, NY; IBM T. J. Watson Research Ctr., Hawthorne, NY; Purdue University, West Lafayette, IN; IBM Almaden Research Ctr., San Jose, CA
Venue:
WWW '03 Proceedings of the 12th international conference on World Wide Web
Year:
2003

Abstract

Recent work on incremental crawling has enabled the indexed document collection of a search engine to stay more closely synchronized with the changing World Wide Web. However, this synchronized collection is not immediately searchable, because the keyword index is rebuilt from scratch less frequently than the collection can be refreshed. An inverted index is usually used to index documents crawled from the web, and a complete index rebuild at high frequency is expensive. Previous work on incremental inverted index updates has been restricted to adding and removing documents. Updating the inverted index for previously indexed documents that have changed has not been addressed.

In this paper, we propose an efficient method to update the inverted index for previously indexed documents whose contents have changed. Our method uses the idea of landmarks together with the diff algorithm to significantly reduce the number of postings in the inverted index that need to be updated. Our experiments verify that our landmark-diff method results in significant savings in the number of update operations on the inverted index.
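
The abstract does not spell out how landmarks are defined, so the following is only a rough sketch of the general idea, not the authors' implementation. It assumes that a landmark marks the start of a fixed-size block of words, that each posting stores a position as (landmark, offset) rather than as an absolute word position, and that any block touched by an edit is rebuilt wholesale. The function names, the `block` parameter, and the cost model are all hypothetical; Python's standard `difflib` stands in for the diff algorithm.

```python
from difflib import SequenceMatcher


def absolute_updates(old, new):
    """Count postings to delete or insert when positions are absolute.

    A posting is modeled as (word, absolute position). Any word whose
    position shifts yields one deletion and one insertion.
    """
    old_postings = {(w, i) for i, w in enumerate(old)}
    new_postings = {(w, i) for i, w in enumerate(new)}
    return len(old_postings ^ new_postings)


def landmark_updates(old, new, block=4):
    """Count postings to delete or insert under the assumed landmark model.

    Assumption: one landmark per block of `block` words in the old version.
    A block lying entirely inside an unchanged diff region keeps all of its
    postings, because only the landmark's own position has to be adjusted.
    Everything else is deleted and re-inserted (a deliberately coarse model).
    """
    opcodes = SequenceMatcher(a=old, b=new, autojunk=False).get_opcodes()
    unchanged = [(i1, i2) for tag, i1, i2, _, _ in opcodes if tag == "equal"]

    clean_words = 0
    for start in range(0, len(old), block):
        end = min(start + block, len(old))
        if any(i1 <= start and end <= i2 for i1, i2 in unchanged):
            clean_words += end - start

    deletions = len(old) - clean_words   # old postings in dirty blocks
    insertions = len(new) - clean_words  # new postings outside clean blocks
    return deletions + insertions


old_doc = "a b c d e f g h i j k l".split()
new_doc = ["x"] + old_doc  # one word inserted at the front

print(absolute_updates(old_doc, new_doc))   # 25: every posting shifts
print(landmark_updates(old_doc, new_doc))   # 1: only the new word is indexed
```

In this toy model a single insertion at the front of the document shifts every absolute position, whereas the landmark-relative postings of unchanged blocks stay as they are. The model is pessimistic for in-place substitutions, since it rebuilds the whole affected block, so the block size bounds the cost of each edit.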