Complete inverted files for efficient text retrieval and analysis

Authors:
A. Blumer; J. Blumer; D. Haussler; R. McConnell; A. Ehrenfeucht
Affiliations:
Univ. of Denver, Denver, CO; Univ. of Denver, Denver, CO; Univ. of Denver, Denver, CO; Univ. of Denver, Denver, CO; Univ. of Colorado at Boulder, Boulder
Venue:
Journal of the ACM (JACM)
Year:
1987

Abstract

Given a finite set of texts S = {w1, …, wk} over some fixed finite alphabet Σ, a complete inverted file for S is an abstract data type that provides the functions find(w), which returns the longest prefix of w that occurs (as a subword of a word) in S; freq(w), which returns the number of times w occurs in S; and locations(w), which returns the set of positions where w occurs in S. A data structure that implements a complete inverted file for S that occupies linear space and can be built in linear time, using the uniform-cost RAM model, is given. Using this data structure, the time for each of the above query functions is optimal. To accomplish this, techniques from the theory of finite automata and the work on suffix trees are used to build a deterministic finite automaton that recognizes the set of all subwords of the set S. This automaton is then annotated with additional information and compacted to facilitate the desired query functions. The result is a data structure that is smaller and more flexible than the suffix tree.
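
The three query functions can be illustrated with a brute-force reference model. This is only a sketch of the interface semantics, not the linear-space automaton the abstract describes: every query rescans the texts. The class name NaiveInvertedFile, the choice to report a position as a (text index, start offset) pair, the counting of overlapping occurrences, and the sample texts are all assumptions made for illustration. Queries are assumed to be non-empty.

```python
from typing import List, Set, Tuple


class NaiveInvertedFile:
    """Brute-force model of find/freq/locations (hypothetical, for illustration).

    It rescans every text on each query, so it has none of the linear-time
    or linear-space guarantees of the structure described in the abstract.
    """

    def __init__(self, texts: List[str]) -> None:
        self.texts = list(texts)

    def locations(self, w: str) -> Set[Tuple[int, int]]:
        """Return all (text index, start offset) pairs where w occurs.

        Overlapping occurrences are reported separately (an assumption).
        """
        found = set()
        for i, text in enumerate(self.texts):
            start = text.find(w)
            while start != -1:
                found.add((i, start))
                # Advance by one character so overlapping matches are found.
                start = text.find(w, start + 1)
        return found

    def freq(self, w: str) -> int:
        """Number of occurrences of w across all texts."""
        return len(self.locations(w))

    def find(self, w: str) -> str:
        """Longest prefix of w that occurs as a subword of some text."""
        for end in range(len(w), 0, -1):
            prefix = w[:end]
            if any(prefix in text for text in self.texts):
                return prefix
        return ""


index = NaiveInvertedFile(["banana", "bandana"])
print(index.find("anatomy"))           # ana
print(index.freq("ana"))               # 3
print(sorted(index.locations("ban")))  # [(0, 0), (1, 0)]
```

In this model freq(w) is by construction the size of locations(w), which matches the definitions above; a structure like the one the abstract describes would answer the same queries without scanning the texts.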