Enabling information retrieval on historical document collections: the role of matching procedures and special lexica

Authors:
Annette Gotscharek;Andreas Neumann;Ulrich Reffle;Christoph Ringlstetter;Klaus U. Schulz
Affiliations:
University of Munich;Bavarian State Library;University of Munich;University of Munich;University of Munich
Venue:
Proceedings of The Third Workshop on Analytics for Noisy Unstructured Text Data
Year:
2009

Abstract

Because historical texts contain a large number of spelling variants, standard methods of Information Retrieval (IR) fail to produce satisfactory results on historical document collections. To improve the recall of search engines, modern words used in queries have to be associated with the corresponding historical variants found in the documents. Two ways to solve this problem have been suggested in the literature: (1) special matching procedures and (2) lexica for historical language. In the first part of the paper we show how the construction of matching procedures and the construction of lexica may benefit from each other, leading the way to a combination of both approaches. We present a tool in which matching rules and a historical lexicon are built in an interleaved way, based on corpus analysis. A crucial question considered in the second part of the paper is whether matching procedures alone suffice to lift IR on historical texts to a satisfactory level. Since historical language changes over the centuries, this question is not simple to answer. We present experiments that study the performance of matching procedures on text collections from four centuries. After classifying the missed vocabulary, we measure the precision and recall of the matching procedure for each period. Our results indicate that, for earlier periods, historical lexica represent an important corrective to matching procedures in IR applications.
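As a rough illustration of the idea in the abstract, the following Python sketch expands a modern query word into candidate historical spellings using rewrite rules, optionally adds variants from a lexicon, and scores the result by precision and recall. It is not the authors' tool. The rewrite rules, the function names (`variants`, `expand_query`, `precision_recall`), and the sample words are all assumptions made up for this example.

```python
# Hypothetical rewrite rules (modern substring -> historical substring).
# These are illustrative assumptions, not rules taken from the paper.
RULES = [("t", "th"), ("ei", "ey"), ("u", "v")]


def variants(word, rules=RULES, max_edits=2):
    """Return the word plus all spellings reachable by at most
    max_edits single rule applications."""
    seen = {word}
    frontier = {word}
    for _ in range(max_edits):
        next_frontier = set()
        for current in frontier:
            for src, dst in rules:
                pos = current.find(src)
                while pos != -1:
                    # Rewrite one occurrence of src at position pos.
                    candidate = current[:pos] + dst + current[pos + len(src):]
                    if candidate not in seen:
                        next_frontier.add(candidate)
                    pos = current.find(src, pos + 1)
        seen |= next_frontier
        frontier = next_frontier
    return seen


def expand_query(word, corpus_vocab, lexicon=None):
    """Keep only candidates attested in the corpus vocabulary; a lexicon
    (modern word -> known historical variants) can add forms the rules miss."""
    matched = variants(word) & corpus_vocab
    if lexicon:
        matched |= set(lexicon.get(word, ())) & corpus_vocab
    return matched


def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved set against a gold-standard set."""
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall
```

With a toy vocabulary containing `und`, `vnd` and `vnnd`, the rules alone reach `vnd` from `und` but not `vnnd`, so recall stays below 1. A lexicon entry mapping `und` to `vnnd` closes that gap, which mirrors the abstract's point that lexica can act as a corrective to matching procedures.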