Supporting sub-document updates and queries in an inverted index

Authors:
Vuk Ercegovac;Vanja Josifovski;Ning Li;Mauricio R. Mediano;Eugene J. Shekita
Affiliations:
IBM, San Jose, CA, USA;Yahoo! Inc., Sunnyvale, CA, USA;IBM, San Jose, CA, USA;Yahoo! Inc., Sunnyvale, CA, USA;IBM, San Jose, CA, USA
Venue:
Proceedings of the 17th ACM conference on Information and knowledge management
Year:
2008

Abstract

Inverted indexes have become the standard indexing method for supporting search queries in a variety of content-based applications. Examples of such applications include enterprise document management, e-mail, web search, and social networks. One shortcoming of current inverted index designs is that they support only document-level updates, forcing a full document to be reindexed even if just part of it changes. This paper describes a new inverted index design that enables applications to break a document into semantically meaningful sub-documents, or "sections". Each section of a document can be updated separately, but search queries still work seamlessly across sections. Our index design is motivated by applications in which the metadata associated with each document tends to be smaller and more frequently updated than the document's content, yet it is desirable to search the metadata and content with the same index structure. A novel self-optimizing query execution algorithm is described that efficiently joins the sections of a document in the inverted index. Experimental results on TREC and patent data are provided, showing that sections can dramatically improve overall system throughput on a mixed workload of updates and queries.
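
The idea of section-level updates can be illustrated with a minimal in-memory sketch. It is not the paper's index layout or its self-optimizing join algorithm: the class name `SectionedIndex`, its methods, and the section names "content" and "metadata" are all hypothetical, and the cross-section join is replaced by a naive set union and intersection. The sketch only shows that reindexing one section leaves the postings of the other sections untouched while a query still matches terms across sections.

```python
from collections import defaultdict


class SectionedIndex:
    """Toy inverted index with per-section postings (hypothetical, for illustration)."""

    def __init__(self):
        # term -> section name -> set of doc ids whose section contains the term
        self.postings = defaultdict(lambda: defaultdict(set))
        # (doc_id, section) -> terms currently indexed for that section
        self.section_terms = {}

    def update_section(self, doc_id, section, text):
        """Reindex a single section; other sections of the document are not touched."""
        key = (doc_id, section)
        # Remove the old postings for this section only.
        for term in self.section_terms.get(key, set()):
            self.postings[term][section].discard(doc_id)
        # Add postings for the new section text.
        terms = set(text.lower().split())
        for term in terms:
            self.postings[term][section].add(doc_id)
        self.section_terms[key] = terms

    def search(self, *terms):
        """Return sorted doc ids in which every term occurs in at least one section."""
        result = None
        for term in terms:
            by_section = self.postings.get(term.lower(), {})
            # Naive "join": union the term's postings over all sections.
            docs = set().union(*by_section.values())
            result = docs if result is None else result & docs
        return sorted(result or [])


index = SectionedIndex()
index.update_section(1, "content", "inverted index maintenance")
index.update_section(1, "metadata", "draft")
index.update_section(2, "content", "inverted files")
index.update_section(2, "metadata", "published")

before = index.search("inverted", "draft")

# Only the small metadata section of document 1 is reindexed.
index.update_section(1, "metadata", "published")
after_draft = index.search("inverted", "draft")
after_published = index.search("inverted", "published")
```

In this sketch the query for "inverted" and "draft" matches document 1 before the metadata update and nothing afterwards, while the content postings of document 1 are never rewritten. A real design would also have to handle compressed on-disk posting lists and an efficient join across sections, which is the part the paper's query execution algorithm addresses.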