One-time complete indexing of text: theory and practice

Authors:
Raymond J. D'Amore; Clinton P. Mah
Affiliations:
PAR Technology Corporation, 7926 Jones Branch Drive Suite 170, McLean, Virginia; PAR Technology Corporation, 7926 Jones Branch Drive Suite 170, McLean, Virginia
Venue:
SIGIR '85 Proceedings of the 8th annual international ACM SIGIR conference on Research and development in information retrieval
Year:
1985

Abstract

Indexing according to occurrences of selected word fragments, called “n-grams”, offers a significant alternative to keyword indexing and full text scanning methods in the design of information systems based on documents. Finite sets of n-grams can be selected to allow effective fixed indexing of all words, numbers, and special terms in text. The characteristics of such indexing can be modeled statistically and validated over a wide range of text. The model provides a descriptive and predictive tool for controlling precision and recall in searching and for scaling estimates of relevance to an adaptive reference noise distribution for a target collection. Special techniques such as partial inversion of index terms, probabilistic ordering of index terms, and various types of data compression allow n-gram indexing to be competitive in performance with other approaches.
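
The basic idea of indexing by word fragments can be illustrated with a minimal sketch. It is not the authors' system: the fragment length of 3, the tokenization rule, and the function names (`ngrams`, `build_index`, `candidates`) are all assumptions made for illustration, and the statistical model, partial inversion, probabilistic ordering, and compression mentioned in the abstract are left out. The sketch breaks every word into overlapping character n-grams, records which documents contain each n-gram, and returns as search candidates the documents that contain every n-gram of the query term.

```python
import re
from collections import defaultdict


def ngrams(word, n=3):
    """Return the set of overlapping character n-grams of one word.

    Words no longer than n are kept whole (an assumption of this sketch).
    """
    word = word.lower()
    if len(word) <= n:
        return {word}
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def build_index(docs, n=3):
    """Map each n-gram to the set of document ids in which it occurs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Hypothetical tokenization: runs of ASCII letters and digits.
        for token in re.findall(r"[A-Za-z0-9]+", text):
            for gram in ngrams(token, n):
                index[gram].add(doc_id)
    return index


def candidates(index, term, n=3):
    """Return ids of documents that contain every n-gram of the term.

    The result is a superset of the documents containing the term itself,
    so it may include false matches that a later step has to filter out.
    """
    postings = [index.get(gram, set()) for gram in ngrams(term, n)]
    return set.intersection(*postings)


docs = {
    1: "Indexing text by n-grams",
    2: "Keyword retrieval systems",
    3: "Full text scanning",
    4: "Searching in context",
}
index = build_index(docs)
hits = candidates(index, "text")
```

In this toy collection, the candidates for "text" include document 4, because "context" contains both "tex" and "ext". Every document that really contains the term is retrieved, but some candidates are false matches. This is the kind of precision and recall behavior that the abstract says the statistical model is meant to describe and control.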