A search algorithm and data structure for an efficient information system

  • Author: Shou-chuan Yang
  • Affiliation: University of Wisconsin, Madison, Wisconsin
  • Venue: COLING '69 Proceedings of the 1969 conference on Computational linguistics
  • Year: 1969

Abstract

This paper describes a system for information storage, retrieval, and updating, with special attention to the search algorithm and data structure required for maximum program efficiency. Program efficiency is especially important when a natural language or a symbolic language is involved in the searching process.

The system is a basic framework for an efficient information system. It can be implemented for text processing and document retrieval, for numerical data retrieval, and for handling large files such as dictionaries, catalogs, and personnel records, as well as graphic information. Currently, eight commands are implemented and operational in batch mode on a CDC 3600: STORE, RETRIEVE, ADD, DELETE, REPLACE, PRINT, COMPRESS, and LIST. Further development will focus on the use of a teletype console, CRT terminal, and plotter in a time-sharing environment to produce immediate responses.

Maximum program efficiency is obtained through a unique search algorithm and data structure. Instead of examining the recall ratio and the precision ratio at a higher level, efficiency is measured in the most basic terms: the "average number of searches" required to look up an item. To identify an item, at least one search is necessary even if it is found on the first try. However, by using the hash address of a key or keyword, in conjunction with an indirect-chaining list-structured table and a large available-space list, the average number of searches required to retrieve an item is 1.25, regardless of the size of the file in question. This compares with 15.6 searches for the binary search technique in a 50,000-item file, and 5.8 searches for the letter-table method, regardless of file size.
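
The abstract does not give the details of the table layout, so the following Python sketch is only an illustration of the general idea, not the author's CDC 3600 implementation. It assumes a head table whose entries point into a shared node pool (one reading of "indirect chaining"), a free list standing in for the available space list, and a BLAKE2 digest as the hash address. The class name `ChainedTable`, its methods, and the helper `average_searches` are all hypothetical. A "search" is counted here as one key comparison. With about half as many items as buckets, the measured average should land near the 1.25 quoted above, since chained hashing needs roughly 1 + (load factor)/2 comparisons per successful lookup.

```python
import hashlib
import math


class ChainedTable:
    """Hash table whose bucket heads point into a shared node pool (assumed layout)."""

    def __init__(self, n_buckets):
        self.heads = [-1] * n_buckets  # bucket -> index of first node, -1 if empty
        self.keys = []                 # node pool: key stored in each node
        self.values = []               # node pool: value stored in each node
        self.next = []                 # node pool: index of the next node in the chain
        self.free = []                 # available-space list of reusable node slots

    def _bucket(self, key):
        # Deterministic hash address; the choice of BLAKE2 is an assumption.
        digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big") % len(self.heads)

    def _alloc(self, key, value, nxt):
        # Reuse a freed slot if one exists, otherwise grow the pool.
        if self.free:
            slot = self.free.pop()
            self.keys[slot], self.values[slot], self.next[slot] = key, value, nxt
        else:
            slot = len(self.keys)
            self.keys.append(key)
            self.values.append(value)
            self.next.append(nxt)
        return slot

    def store(self, key, value):
        b = self._bucket(key)
        node = self.heads[b]
        while node != -1:
            if self.keys[node] == key:
                self.values[node] = value  # key already present: overwrite
                return
            node = self.next[node]
        self.heads[b] = self._alloc(key, value, self.heads[b])  # link at chain head

    def retrieve(self, key):
        """Return (value, searches), where searches counts key comparisons."""
        node, searches = self.heads[self._bucket(key)], 0
        while node != -1:
            searches += 1
            if self.keys[node] == key:
                return self.values[node], searches
            node = self.next[node]
        raise KeyError(key)

    def delete(self, key):
        b = self._bucket(key)
        prev, node = -1, self.heads[b]
        while node != -1:
            if self.keys[node] == key:
                if prev == -1:
                    self.heads[b] = self.next[node]
                else:
                    self.next[prev] = self.next[node]
                self.keys[node] = None
                self.free.append(node)  # return the slot to the available-space list
                return True
            prev, node = node, self.next[node]
        return False


def average_searches(table, keys):
    """Mean number of key comparisons over successful lookups of `keys`."""
    return sum(table.retrieve(k)[1] for k in keys) / len(keys)


# Load factor 0.5: 10,000 items in 20,000 buckets.
table = ChainedTable(n_buckets=20_000)
keys = [f"item{i}" for i in range(10_000)]
for i, k in enumerate(keys):
    table.store(k, i)

print(round(average_searches(table, keys), 2))
print(round(math.log2(50_000), 1))  # base-2 logarithm of a 50,000-item file size
```

The second printed value is the base-2 logarithm of 50,000, about 15.6, which is consistent with the binary search figure quoted in the abstract. The chained table's average depends only on the ratio of items to buckets, which is one way to read the claim that the result holds regardless of file size, provided the table is kept sparsely loaded.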