Towards information retrieval on historical document collections: the role of matching procedures and special lexica

  • Authors:
  • Annette Gotscharek;Ulrich Reffle;Christoph Ringlstetter;Klaus U. Schulz;Andreas Neumann

  • Affiliations:
  • CIS, Univ. of Munich, Schellingstr. 10, 80799, Munich, Germany;CIS, Univ. of Munich, Schellingstr. 10, 80799, Munich, Germany;CIS, Univ. of Munich, Schellingstr. 10, 80799, Munich, Germany;CIS, Univ. of Munich, Schellingstr. 10, 80799, Munich, Germany;Bavarian State Library, Ludwigstr. 16, 80799, Munich, Germany

  • Venue:
  • International Journal on Document Analysis and Recognition - Special issue on noisy text analytics
  • Year:
  • 2011

Abstract

Due to the large number of spelling variants found in historical texts, standard methods of Information Retrieval (IR) fail to produce satisfactory results on historical document collections. To improve the recall of search engines, modern words used in queries have to be associated with the corresponding historical variants found in the documents. In the literature, the use of (1) special matching procedures and (2) lexica for historical language has been suggested as two alternative ways to solve this problem. In the first part of the paper, we show how the construction of matching procedures and the construction of lexica may benefit from each other, leading the way to a combination of both approaches. A tool is presented in which matching rules and a historical lexicon are built in an interleaved way based on corpus analysis. In the second part of the paper, we ask whether matching procedures alone suffice to lift IR on historical texts to a satisfactory level. Since historical language changes over the centuries, the answer is not simple to obtain. We present experiments that study the performance of matching procedures on text collections from four centuries. After classifying missed vocabulary, we measure the precision and recall of the matching procedure for each period. The results indicate that for earlier periods, matching procedures alone do not lead to satisfactory results. We then describe experiments that estimate the gain in recall obtained from historical lexica of distinct sizes.
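To make the two approaches named in the abstract concrete, the following is a minimal sketch, not the authors' tool or rule set. It assumes that a matching procedure can be modelled as substring rewrite rules from modern to historical spelling, and that a historical lexicon can be modelled as a plain mapping from a modern word to its attested variants. The rules in `RULES`, the function names, and the sample vocabulary are all hypothetical and chosen only for illustration.

```python
# Hypothetical rewrite rules (modern substring -> historical substring).
# These are illustrative only and are not taken from the paper.
RULES = [("t", "th"), ("ei", "ey")]


def variants(word, rules=RULES, max_rules=2):
    """Return the word plus every spelling reachable by applying
    at most `max_rules` single rule applications."""
    seen = {word}
    frontier = {word}
    for _ in range(max_rules):
        next_frontier = set()
        for current in frontier:
            for modern, historical in rules:
                start = current.find(modern)
                while start != -1:
                    candidate = (
                        current[:start] + historical + current[start + len(modern):]
                    )
                    if candidate not in seen:
                        seen.add(candidate)
                        next_frontier.add(candidate)
                    start = current.find(modern, start + 1)
        frontier = next_frontier
    return seen


def expand_query(word, corpus_vocab, lexicon=None):
    """Map a modern query word to the spellings that occur in the corpus.
    Rule-generated candidates and (optional) lexicon entries are combined."""
    matched = variants(word) & corpus_vocab
    if lexicon:
        matched |= set(lexicon.get(word, ())) & corpus_vocab
    return matched


def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved set against a gold-standard set."""
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    vocab = {"teil", "theil", "seyn", "vnd"}   # hypothetical corpus vocabulary
    lexicon = {"und": ["vnd"]}                 # hypothetical lexicon entry
    print(expand_query("teil", vocab))
    print(expand_query("und", vocab))          # rules alone miss "vnd"
    print(expand_query("und", vocab, lexicon)) # the lexicon recovers it
```

In this toy setting the rules find "theil" for the query "teil" but cannot reach "vnd" from "und", so only the lexicon lookup recovers that variant. This mirrors, in a very simplified way, the question of how much recall a lexicon adds on top of matching rules.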