Self-indexing inverted files for fast text retrieval

Authors: Alistair Moffat (Univ. of Melbourne, Victoria, Australia); Justin Zobel (RMIT, Victoria, Australia)
Venue: ACM Transactions on Information Systems (TOIS)
Year: 1996

Abstract

Query-processing costs on large text databases are dominated by the need to retrieve and scan the inverted list of each query term. Retrieval time for inverted lists can be greatly reduced by the use of compression, but this adds to the CPU time required. Here we show that the CPU component of query response time for conjunctive Boolean queries and for informal ranked queries can be similarly reduced, at little cost in terms of storage, by the inclusion of an internal index in each compressed inverted list. This method has been applied in a retrieval system for a collection of nearly two million short documents. Our experimental results show that the self-indexing strategy adds less than 20% to the size of the compressed inverted file, which itself occupies less than 10% of the indexed text, yet can reduce processing time for Boolean queries of 5-10 terms to under one fifth of the previous cost. Similarly, ranked queries of 40-50 terms can be evaluated in as little as 25% of the previous time, with little or no loss of retrieval effectiveness.
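The idea of an internal index inside a compressed inverted list can be illustrated with a small sketch. The code below is not the authors' implementation. It assumes variable-byte coding of document-number gaps (the abstract does not say which code is used), keeps the skip entries in a separate in-memory table rather than interleaving them with the compressed data, and uses hypothetical names throughout (`SkippedList`, `block_size`, `gaps_decoded`, `intersect`). It shows only the general principle: during a conjunctive query, a candidate document can be looked up in a long list by jumping to the right block and decoding only that block.

```python
from bisect import bisect_right


def vbyte_encode(n):
    """Encode a non-negative integer with 7 data bits per byte."""
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)
        n >>= 7
    out.append(n | 0x80)  # high bit marks the final byte
    return bytes(out)


def vbyte_decode(buf, pos):
    """Decode one integer starting at buf[pos]; return (value, next_pos)."""
    value, shift = 0, 0
    while True:
        b = buf[pos]
        pos += 1
        if b & 0x80:
            return value | ((b & 0x7F) << shift), pos
        value |= b << shift
        shift += 7


class SkippedList:
    """Gap-compressed inverted list with one skip entry per block (illustrative)."""

    def __init__(self, doc_ids, block_size=64):
        # doc_ids must be sorted and strictly increasing.
        self.data = bytearray()
        self.skip_docs = []     # first document number of each block
        self.skip_offsets = []  # byte offset where each block's gaps start
        self.gaps_decoded = 0   # instrumentation: counts decoding work
        prev = 0
        for i, d in enumerate(doc_ids):
            if i % block_size == 0:
                # The block's first document lives in the skip table.
                self.skip_docs.append(d)
                self.skip_offsets.append(len(self.data))
            else:
                self.data += vbyte_encode(d - prev)
            prev = d

    def contains(self, target):
        """Test membership by decoding at most one block."""
        b = bisect_right(self.skip_docs, target) - 1
        if b < 0:
            return False
        doc = self.skip_docs[b]
        pos = self.skip_offsets[b]
        end = (self.skip_offsets[b + 1]
               if b + 1 < len(self.skip_offsets) else len(self.data))
        while doc < target and pos < end:
            gap, pos = vbyte_decode(self.data, pos)
            self.gaps_decoded += 1
            doc += gap
        return doc == target


def intersect(candidates, long_list):
    """Conjunctive (AND) step: keep candidates that occur in long_list."""
    return [d for d in candidates if long_list.contains(d)]


if __name__ == "__main__":
    long_list = SkippedList(range(0, 10000, 3))
    print(intersect([9, 10, 3000, 5001, 9999], long_list))
    print("gaps decoded:", long_list.gaps_decoded)
```

In this sketch each lookup decodes at most `block_size - 1` gaps, so a short candidate set touches only a small part of a long list, at the price of one extra table entry per block. That trade-off between a modest storage overhead and reduced decoding work is the one the abstract describes.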