Enabling information retrieval on historical document collections: the role of matching procedures and special lexica

  • Authors:
  • Annette Gotscharek;Andreas Neumann;Ulrich Reffle;Christoph Ringlstetter;Klaus U. Schulz

  • Affiliations:
  • University of Munich;Bavarian State Library;University of Munich;University of Munich;University of Munich

  • Venue:
  • Proceedings of The Third Workshop on Analytics for Noisy Unstructured Text Data
  • Year:
  • 2009

Abstract

Because historical texts contain a large number of spelling variants, standard methods of Information Retrieval (IR) fail to produce satisfactory results on historical document collections. To improve the recall of search engines, modern words used in queries have to be associated with the corresponding historical variants found in the documents. In the literature, two ways to solve this problem have been suggested: (1) special matching procedures and (2) lexica for historical language. In the first part of the paper we show how the construction of matching procedures and lexica may benefit from each other, leading the way to a combination of both approaches. A tool is presented in which matching rules and a historical lexicon are built in an interleaved way based on corpus analysis. A crucial question considered in the second part of the paper is whether matching procedures alone suffice to lift IR on historical texts to a satisfactory level. Since historical language changes over the centuries, this question is not easy to answer. We present experiments in which the performance of matching procedures is studied on text collections from four centuries. After classifying missed vocabulary, we measure precision and recall of the matching procedure for each period. Our results indicate that for earlier periods historical lexica represent an important corrective to matching procedures in IR applications.
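
The interplay between the two approaches named in the abstract can be illustrated with a minimal sketch. It is not the tool or the rule set from the paper: the rewrite rules, the toy vocabulary, the lexicon entry, and all function names below are hypothetical and chosen only for illustration. Rule-based matching generates candidate historical spellings for a modern query word, an optional lexicon contributes variants the rules cannot derive, and precision and recall are computed against a hand-labelled gold set.

```python
# Hypothetical rewrite rules (modern substring -> historical substring).
# These are illustrative assumptions, not rules taken from the paper.
RULES = [("t", "th"), ("ei", "ey"), ("u", "v")]


def candidate_variants(word, rules, max_steps=2):
    """Return spellings reachable from `word` by at most `max_steps`
    single rule applications (one occurrence rewritten per step)."""
    seen = {word}
    frontier = {word}
    for _ in range(max_steps):
        next_frontier = set()
        for current in frontier:
            for modern, historical in rules:
                start = current.find(modern)
                while start != -1:
                    variant = (current[:start] + historical
                               + current[start + len(modern):])
                    if variant not in seen:
                        seen.add(variant)
                        next_frontier.add(variant)
                    start = current.find(modern, start + 1)
        frontier = next_frontier
    return seen - {word}


def expand_query(word, rules, vocabulary, lexicon=None):
    """Keep only candidates attested in the collection vocabulary; an
    optional lexicon (modern word -> set of variants) adds further matches."""
    matches = candidate_variants(word, rules) & vocabulary
    if lexicon:
        matches |= lexicon.get(word, set()) & vocabulary
    return matches


def precision_recall(predicted, gold):
    """Precision and recall of a predicted variant set against a gold set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy data (made up for this sketch).
vocabulary = {"theil", "theyl", "teyl", "tell", "deyl"}
gold = {"theil", "theyl", "deyl"}   # assumed true variants of "teil"
lexicon = {"teil": {"deyl"}}        # variant no rule above can produce

rules_only = expand_query("teil", RULES, vocabulary)
with_lexicon = expand_query("teil", RULES, vocabulary, lexicon)
print(precision_recall(rules_only, gold))
print(precision_recall(with_lexicon, gold))
```

In this toy setting the rules alone miss a variant that only the lexicon supplies, so adding the lexicon raises recall. This mirrors, in a very simplified way, the corrective role of lexica that the abstract describes for earlier periods.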