Trigrams as index element in full text retrieval: observations and experimental results

  • Authors:
  • Elizabeth S. Adams; Arnold C. Meltzer

  • Affiliations:
  • Hood College, Department of Mathematics and Computer Science, Frederick, MD; George Washington University, Department of Electrical Engineering and Computer Science, Washington, DC

  • Venue:
  • CSC '93: Proceedings of the 1993 ACM Conference on Computer Science
  • Year:
  • 1993

Abstract

A trigram is a three-element sequence of characters. In this paper we demonstrate the effectiveness of a trigram-based index for morphologically based retrievals from a full-text document retrieval system. Retrieved documents are considered relevant if they contain exact matches for each of the query terms. Using this definition of relevance, we consistently achieve a recall rate of 100%. In the experiments described here, we used sets of 100 ANDed three-term queries, and the average precision per set varied from 47% to 87%. We propose a method for increasing the average precision to 100%. Using overlapping trigrams extracted from the Brown Corpus [KUCE67] and a character set of 45 elements, we found a horizontal asymptote near 11,000 for the number of entries in a trigram-based index. Finally, we show that a trigram-based system provides a reasonable alternative to a word-based one and is superior to it in retrievals of word fragments.
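
The following is a minimal sketch, not the authors' implementation, of why an index of overlapping trigrams can give 100% recall under the exact-match definition of relevance while precision falls short. The function names (`trigrams`, `build_index`, `candidates`, `exact_matches`), the three sample documents, and the lowercasing step are all assumptions made for illustration. The final substring check is one plausible way to remove false positives; the abstract does not say what method the paper proposes.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of overlapping three-character sequences in text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(docs):
    """Map each trigram to the set of ids of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in trigrams(text.lower()):
            index[gram].add(doc_id)
    return index


def candidates(index, terms, all_ids):
    """Documents that contain every trigram of every query term (ANDed)."""
    result = set(all_ids)
    for term in terms:
        for gram in trigrams(term.lower()):
            result &= index.get(gram, set())
    return result


def exact_matches(docs, ids, terms):
    """Assumed post-filter: keep candidates containing each term verbatim."""
    return {d for d in ids if all(t.lower() in docs[d].lower() for t in terms)}


# Hypothetical sample documents, not taken from the Brown Corpus.
docs = {
    1: "full text retrieval",
    2: "retrieve the evaluation",
    3: "word index",
}
index = build_index(docs)

# Document 2 holds every trigram of "retrieval" ("ret" ... "iev" from
# "retrieve", "eva" and "val" from "evaluation") without holding the word.
found = candidates(index, ["retrieval"], docs)
relevant = exact_matches(docs, found, ["retrieval"])
print(sorted(found), sorted(relevant))

# A word fragment is queried the same way as a whole word.
print(sorted(candidates(index, ["triev"], docs)))
```

Every document that contains a term necessarily contains all of that term's trigrams, so no relevant document is missed. The converse does not hold, which is where false positives such as document 2 come from. With a 45-element character set there are at most 45³ = 91,125 distinct trigrams, so the number of index entries is bounded regardless of corpus size.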