Building a complete inverted file for a set of text files in linear time

Authors:
A. Blumer; J. Blumer; A. Ehrenfeucht; D. Haussler; R. McConnell
Venue:
STOC '84 Proceedings of the sixteenth annual ACM symposium on Theory of computing
Year:
1984

Abstract

Given a finite set of texts S = {ω₁, ..., ωₖ} over some fixed finite alphabet Σ, a complete inverted file for S is an abstract data type that provides the functions find(ω), which returns the longest prefix of ω that occurs in S; freq(ω), which returns the number of times ω occurs in S; and locations(ω), which returns the set of positions at which ω occurs. We give a data structure to implement a complete inverted file for S that occupies linear space and can be built in linear time, using the uniform-cost RAM model. Using this data structure, the time for each of the above query functions is optimal. To accomplish this, we use techniques from the theory of finite automata to build a deterministic finite automaton that recognizes the set of all subwords of the set S. This automaton is then annotated with additional information and compacted to facilitate the desired query functions.
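
To make the three query functions concrete, here is a minimal Python sketch of a subword automaton. It is not the paper's construction. It assumes a single text rather than a set S, uses the widely known online suffix-automaton construction, and skips the compaction step. The class name `SubwordAutomaton`, its fields, and the choice of 0-based start offsets as "positions" are hypothetical. The `sorted` call is a simplification, since a counting sort would be needed to stay within linear time, and the empty word is treated as absent.

```python
class SubwordAutomaton:
    """Automaton accepting every subword of one text (hypothetical sketch)."""

    def __init__(self, text):
        self.trans = [{}]      # trans[s][ch] -> next state
        self.link = [-1]       # suffix link; -1 only for the root
        self.maxlen = [0]      # length of the longest subword reaching state s
        self.end = [-1]        # end index of one occurrence of that subword
        self.count = [0]       # occurrence count, completed after construction
        self.clone = [False]   # True for states created by splitting
        last = 0
        for i, ch in enumerate(text):
            cur = self._new(self.maxlen[last] + 1, {}, i, False)
            p = last
            while p != -1 and ch not in self.trans[p]:
                self.trans[p][ch] = cur
                p = self.link[p]
            if p == -1:
                self.link[cur] = 0
            else:
                q = self.trans[p][ch]
                if self.maxlen[q] == self.maxlen[p] + 1:
                    self.link[cur] = q
                else:
                    # Split q: the clone takes over the shorter subwords of q.
                    c = self._new(self.maxlen[p] + 1, dict(self.trans[q]),
                                  self.end[q], True)
                    self.link[c] = self.link[q]
                    while p != -1 and self.trans[p].get(ch) == q:
                        self.trans[p][ch] = c
                        p = self.link[p]
                    self.link[q] = self.link[cur] = c
            last = cur
        # Annotate: build the suffix-link tree and push counts toward the root,
        # visiting states from longest to shortest.
        self.kids = [[] for _ in self.link]
        order = sorted(range(1, len(self.link)), key=lambda s: -self.maxlen[s])
        for s in order:
            self.kids[self.link[s]].append(s)
            self.count[self.link[s]] += self.count[s]

    def _new(self, maxlen, trans, end, is_clone):
        self.trans.append(trans)
        self.link.append(-1)
        self.maxlen.append(maxlen)
        self.end.append(end)
        self.count.append(0 if is_clone else 1)
        self.clone.append(is_clone)
        return len(self.trans) - 1

    def _walk(self, word):
        """Follow transitions; return (state, number of characters matched)."""
        state, matched = 0, 0
        for ch in word:
            nxt = self.trans[state].get(ch)
            if nxt is None:
                break
            state, matched = nxt, matched + 1
        return state, matched

    def find(self, word):
        """Longest prefix of word that occurs in the text."""
        return word[:self._walk(word)[1]]

    def freq(self, word):
        """Number of occurrences of word in the text."""
        state, matched = self._walk(word)
        return self.count[state] if word and matched == len(word) else 0

    def locations(self, word):
        """Set of 0-based start offsets at which word occurs."""
        state, matched = self._walk(word)
        if not word or matched != len(word):
            return set()
        starts, stack = set(), [state]
        while stack:
            s = stack.pop()
            if not self.clone[s]:
                starts.add(self.end[s] - len(word) + 1)
            stack.extend(self.kids[s])
        return starts
```

For the text "abcbc", `find("bcd")` returns "bc", `freq("bc")` returns 2, and `locations("bc")` returns the offsets 1 and 3. In this sketch `find` and `freq` take time proportional to the length of the query word, while `locations` also walks the suffix-link subtree of the reached state.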