Engineering basic algorithms of an in-memory text search engine

Authors:
Frederik Transier; Peter Sanders
Affiliations:
University of Karlsruhe, SAP NetWeaver EIM TREX, Germany; University of Karlsruhe, Germany
Venue:
ACM Transactions on Information Systems (TOIS)
Year:
2010

Abstract

Inverted index data structures are the key to fast text search engines. We first investigate one of the predominant operations on inverted indexes: intersecting two sorted lists of document IDs of different lengths. We explore the compression and performance of different inverted list data structures. In particular, we present Lookup, a new data structure that allows intersection in expected time linear in the length of the smaller list. Based on this result, we present the algorithmic core of a full-text database that supports fast Boolean queries, phrase queries, and document reporting using less space than the input text. The system uses a carefully choreographed combination of classical data compression techniques and inverted-index-based search data structures. Our experiments show that inverted indexes are preferable to purely suffix-array-based techniques for in-memory (English) text search engines. A similar system is now running in practice in each core of TREX, the distributed database engine of SAP.
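
The abstract does not spell out how Lookup is laid out, so the following is only a minimal sketch of one way to intersect a short list with a long one in time proportional to the short list. It assumes a bucketed layout: the long list is split into buckets by the high-order bits of each document ID, and a small table records where each bucket starts. The names `BucketedList`, `contains`, `intersect`, and the `bucket_size` parameter are hypothetical and chosen for illustration; no compression is applied here.

```python
from bisect import bisect_left


class BucketedList:
    """Sorted doc-ID list with a table of bucket start offsets (illustrative only)."""

    def __init__(self, doc_ids, universe, bucket_size=8):
        # doc_ids: sorted, distinct integers in [0, universe); universe >= 1.
        self.ids = list(doc_ids)
        self.universe = universe
        # Aim for about len(ids) / bucket_size buckets.
        target_buckets = max(1, len(self.ids) // bucket_size)
        # Pick the smallest shift so the bucket count does not exceed the target.
        self.shift = 0
        while ((universe - 1) >> self.shift) + 1 > target_buckets:
            self.shift += 1
        buckets = ((universe - 1) >> self.shift) + 1
        # starts[b] is the index in self.ids of the first ID in bucket b.
        self.starts = [0] * (buckets + 1)
        for d in self.ids:
            self.starts[(d >> self.shift) + 1] += 1
        for b in range(buckets):
            self.starts[b + 1] += self.starts[b]

    def contains(self, doc_id):
        if not 0 <= doc_id < self.universe:
            return False
        b = doc_id >> self.shift
        lo, hi = self.starts[b], self.starts[b + 1]
        # Search only inside the one bucket that could hold doc_id.
        i = bisect_left(self.ids, doc_id, lo, hi)
        return i < hi and self.ids[i] == doc_id


def intersect(short_ids, long_list):
    """Probe the long list once per element of the short list."""
    return [d for d in short_ids if long_list.contains(d)]


if __name__ == "__main__":
    long_list = BucketedList(range(0, 100, 3), universe=100)
    print(intersect([3, 17, 42, 99], long_list))
```

Each probe touches a single bucket, so if document IDs are spread roughly uniformly and buckets hold a constant number of IDs on average, the total work is expected to grow with the length of the shorter list rather than the longer one. A skewed ID distribution would weaken that bound in this sketch.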