Recent work on incremental crawling has enabled the indexed document collection of a search engine to stay more closely synchronized with the changing World Wide Web. However, this synchronized collection is not immediately searchable, because the keyword index is rebuilt from scratch less frequently than the collection can be refreshed. An inverted index is usually used to index documents crawled from the web, and rebuilding it completely at high frequency is expensive. Previous work on incremental inverted index updates has been restricted to adding and removing documents; updating the inverted index for previously indexed documents that have changed has not been addressed. In this paper, we propose an efficient method to update the inverted index for previously indexed documents whose contents have changed. Our method combines the idea of landmarks with the diff algorithm to significantly reduce the number of postings in the inverted index that need to be updated. Our experiments verify that the landmark-diff method yields significant savings in the number of update operations on the inverted index.
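
The abstract does not spell out how landmarks and diff interact, so the following is only a minimal sketch of one plausible reading, not the paper's implementation. It assumes that a posting stores a word position relative to a landmark placed at the start of each fixed-size block of the old document, that a separate landmark table (not modeled here) tracks where each landmark now sits, and that Python's `difflib.SequenceMatcher` stands in for the diff algorithm. The function name `count_posting_updates` and the `block_size` parameter are hypothetical.

```python
from difflib import SequenceMatcher


def count_posting_updates(old_words, new_words, block_size):
    """Count postings touched under two position schemes (illustrative model).

    Returns (absolute, landmark):
      absolute - postings store absolute word positions, so every word
                 that is inserted, deleted, or shifted needs an update.
      landmark - postings store (landmark, offset); a surviving word needs
                 an update only if its offset from its block's landmark
                 changed, i.e. an edit happened earlier in the same block.
    """
    matcher = SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    absolute = 0
    landmark = 0
    edits = []  # (old start index, net size change) per non-equal opcode

    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Deleted and inserted words cost one posting update each
            # in both schemes.
            changed = (i2 - i1) + (j2 - j1)
            absolute += changed
            landmark += changed
            edits.append((i1, (j2 - j1) - (i2 - i1)))
            continue

        shift = j1 - i1  # how far these unchanged words moved
        for i in range(i1, i2):
            if shift != 0:
                absolute += 1
            block_start = (i // block_size) * block_size
            # Assumption: the landmark itself moves by the net size change
            # of all edits that begin before its block.
            landmark_shift = sum(d for start, d in edits if start < block_start)
            if shift != landmark_shift:
                landmark += 1

    return absolute, landmark


if __name__ == "__main__":
    old = [f"w{i}" for i in range(20)]
    new = old[:2] + ["NEW"] + old[2:]  # insert one word near the front
    print(count_posting_updates(old, new, block_size=5))
```

In this model a single insertion near the start of a document shifts every later absolute position, whereas with landmarks only the words that follow the edit inside the same block have to be rewritten; the later blocks are covered by adjusting their landmark entries.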