An information-theoretic approach to text searching in direct access systems

Authors:
Ian J. Barton;Susan E. Creasey;Michael F. Lynch;Michael J. Snell
Affiliations:
Univ. of Sheffield, Sheffield, U.K.;Univ. of Sheffield, Sheffield, U.K.;Univ. of Sheffield, Sheffield, U.K.;Univ. of Sheffield, Sheffield, U.K.
Venue:
Communications of the ACM
Year:
1974

Citing 2

On Harrison's substring testing technique
Communications of the ACM

Implementation of the substring test by hashing
Communications of the ACM

Cited 6

Access methods for text
ACM Computing Surveys (CSUR) - Annals of discrete mathematics, 24

A new character-based indexing method using frequency data for Japanese documents
SIGIR '95 Proceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval

One-time complete indexing of text: theory and practice
SIGIR '85 Proceedings of the 8th annual international ACM SIGIR conference on Research and development in information retrieval

Experiments with cited titles for automatic document indexing and similarity measure in a probabilistic context
SIGIR '85 Proceedings of the 8th annual international ACM SIGIR conference on Research and development in information retrieval

Recursive hashing functions for n-grams
ACM Transactions on Information Systems (TOIS)

Data Structures for an Integrated Data Base Management and Information Retrieval System
VLDB '82 Proceedings of the 8th International Conference on Very Large Data Bases

Abstract

Using direct access computer files of bibliographic information, an attempt is made to overcome one of the problems often associated with information retrieval, namely, the maintenance and use of large dictionaries, the greater part of which is used only infrequently. A novel method is presented, which maps the hyperbolic frequency distribution of text characteristics onto a rectangular distribution. This is more suited to implementation on storage devices. This method treats text as a string of characters rather than words bounded by spaces, and chooses subsets of strings such that their frequencies of occurrence are more even than those of word types. The members of this subset are then used as index keys for retrieval. The rectangular distribution of key frequencies results in a much simplified file organization and promises considerable cost advantages.
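
The idea of choosing character strings whose frequencies are more even than those of words can be illustrated with a small sketch. This is not the authors' procedure, which the abstract does not spell out; it is one plausible greedy scheme under stated assumptions. The function name `select_keys` and the parameters `threshold` and `max_len` are hypothetical: any string that occurs more than `threshold` times is replaced by its one-character extensions, up to `max_len` characters, so that very frequent strings are split into several rarer keys.

```python
from collections import Counter


def select_keys(text, threshold, max_len=4):
    """Pick variable-length character strings with roughly even frequencies.

    Assumed greedy scheme (illustrative only): start from single
    characters; whenever a string occurs more than `threshold` times and
    is shorter than `max_len`, drop it and consider its one-character
    extensions instead. Returns a dict mapping each chosen key to its count.
    """
    keys = {}
    frontier = Counter(text)  # counts of single characters
    n = 1
    while frontier:
        too_frequent = set()
        for s, count in frontier.items():
            if count > threshold and n < max_len:
                too_frequent.add(s)   # split this string further
            else:
                keys[s] = count       # accept as an index key
        if not too_frequent:
            break
        n += 1
        # Count only the n-grams that extend an over-frequent (n-1)-gram.
        frontier = Counter(
            text[i:i + n]
            for i in range(len(text) - n + 1)
            if text[i:i + n - 1] in too_frequent
        )
    return keys


if __name__ == "__main__":
    sample = "abababab cd"
    for key, count in sorted(select_keys(sample, threshold=3, max_len=3).items()):
        print(repr(key), count)
```

In this sketch, keys that reach `max_len` are accepted even if they still exceed the threshold, and an over-frequent string that ends the text loses that final occurrence when it is extended. A production scheme would need to handle both cases, and the chosen keys may overlap in the text.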