Faster methods for random sampling
Communications of the ACM
An efficient I/O interface for optical disks
ACM Transactions on Database Systems (TODS)
Algorithms for approximate string matching
Information and Control
Optimization for dynamic inverted index maintenance
SIGIR '90 Proceedings of the 13th annual international ACM SIGIR conference on Research and development in information retrieval
Information retrieval: data structures and algorithms
Information retrieval: data structures and algorithms
Incremental updates of inverted lists for text document retrieval
SIGMOD '94 Proceedings of the 1994 ACM SIGMOD international conference on Management of data
The anatomy of a large-scale hypertextual Web search engine
WWW7 Proceedings of the seventh international conference on World Wide Web 7
Managing gigabytes (2nd ed.): compressing and indexing documents and images
Managing gigabytes (2nd ed.): compressing and indexing documents and images
A fast string searching algorithm
Communications of the ACM
Building a distributed full-text index for the Web
Proceedings of the 10th international conference on World Wide Web
On supporting containment queries in relational database management systems
SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
Modern Information Retrieval
Keeping Up with the Changing Web
Computer
The Evolution of the Web and Implications for an Incremental Crawler
VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
Indexing and Querying XML Data for Regular Path Expressions
Proceedings of the 27th International Conference on Very Large Data Bases
Fast Incremental Indexing for Full-Text Information Retrieval
VLDB '94 Proceedings of the 20th International Conference on Very Large Data Bases
What's new on the web?: the evolution of the web from a search engine perspective
Proceedings of the 13th international conference on World Wide Web
ICDE '05 Proceedings of the 21st International Conference on Data Engineering
Improving Web search efficiency via a locality based static pruning method
WWW '05 Proceedings of the 14th international conference on World Wide Web
Inverted files for text search engines
ACM Computing Surveys (CSUR)
Efficient search in large textual collections with redundancy
Proceedings of the 16th international conference on World Wide Web
Just in time indexing for up to the second search
Proceedings of the sixteenth ACM conference on Conference on information and knowledge management
Collection selection: ...now, with more documents!
Proceedings of the 3rd international conference on Scalable information systems
Optimizing complex extraction programs over evolving text data
Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
Searching web data: An entity retrieval and high-performance indexing model
Web Semantics: Science, Services and Agents on the World Wide Web
A node indexing scheme for web entity retrieval
ESWC'10 Proceedings of the 7th international conference on The Semantic Web: research and Applications - Volume Part II
Optimizing positional index structures for versioned document collections
SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Hi-index | 0.00 |
Recent work on incremental crawling has enabled the indexed document collection of a search engine to be more synchronized with the changing World Wide Web. However, this synchronized collection is not immediately searchable, because the keyword index is rebuilt from scratch less frequently than the collection can be refreshed. An inverted index is usually used to index documents crawled from the web. Complete index rebuild at high frequency is expensive. Previous work on incremental inverted index updates have been restricted to adding and removing documents. Updating the inverted index for previously indexed documents that have changed has not been addressed.In this paper, we propose an efficient method to update the inverted index for previously indexed documents whose contents have changed. Our method uses the idea of landmarks together with the diff algorithm to significantly reduce the number of postings in the inverted index that need to be updated. Our experiments verify that our landmark-diff method results in significant savings in the number of update operations on the inverted index.