Efficient single-pass index construction for text databases

Authors:
Steffen Heinz; Justin Zobel
Affiliations:
School of Computer Science and Information Technology, RMIT University, GPO Box 2476V, Melbourne 3001, Australia (both authors)
Venue:
Journal of the American Society for Information Science and Technology
Year:
2003

Abstract

Efficient construction of inverted indexes is essential to providing search over large collections of text data. In this article, we review the principal approaches to inversion, analyze their theoretical cost, and present experimental results. We identify the drawbacks of existing inversion approaches and propose a single-pass inversion method that, in contrast to previous approaches, does not require the complete vocabulary of the indexed collection to be held in main memory, can operate within limited resources, and does not buy its speed at the cost of high temporary storage requirements. We show that the performance of the single-pass approach can be improved by constructing inverted files in segments, which reduces the cost of disk accesses during inversion of large volumes of data.
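
The general idea of single-pass inversion with bounded memory can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `build_runs` and `merge_runs`, the `max_postings` memory budget, whitespace tokenization, and the use of in-memory lists in place of temporary run files are all assumptions made for illustration. Each run carries its own small vocabulary, so the full vocabulary of the collection is never needed while the documents are being read.

```python
import heapq
from collections import defaultdict


def build_runs(docs, max_postings):
    """Read the documents once and yield sorted runs of (term, [doc_ids]).

    max_postings is a hypothetical memory budget: once that many postings
    have accumulated, the in-memory index is emitted as a run and both the
    postings and the per-run vocabulary are discarded.
    """
    postings = defaultdict(list)
    count = 0
    for doc_id, text in enumerate(docs):
        # Assumption: naive lowercase whitespace tokenization,
        # one posting per distinct term per document.
        for term in set(text.lower().split()):
            postings[term].append(doc_id)
            count += 1
        if count >= max_postings:
            # A real system would write this run to disk.
            yield sorted(postings.items())
            postings = defaultdict(list)
            count = 0
    if postings:
        yield sorted(postings.items())


def merge_runs(runs):
    """Merge term-sorted runs into one index (term -> ascending doc ids).

    The result is kept in a dict only for brevity; a real system would
    stream the merged lists to the final inverted file instead.
    """
    index = {}
    # heapq.merge keeps equal keys in the order of the input runs,
    # so document ids for a term stay in ascending order.
    for term, doc_ids in heapq.merge(*runs, key=lambda item: item[0]):
        index.setdefault(term, []).extend(doc_ids)
    return index


docs = ["the cat sat", "the dog sat", "a cat ran"]
index = merge_runs(build_runs(docs, max_postings=4))
```

With a budget of four postings, the three sample documents produce two runs, and the merge step combines the postings for terms such as `cat` that appear in both.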