Due to the large number of spelling variants found in historical texts, standard methods of Information Retrieval (IR) fail to produce satisfactory results on historical document collections. To improve the recall of search engines, modern words used in queries have to be associated with the corresponding historical variants found in the documents. Two ways to solve this problem have been suggested in the literature: (1) special matching procedures and (2) lexica for historical language. In the first part of the paper we show how the construction of matching procedures and of lexica may benefit from each other, leading the way to a combination of both approaches. We present a tool in which matching rules and a historical lexicon are built in an interleaved way based on corpus analysis. A crucial question considered in the second part of the paper is whether matching procedures alone suffice to lift IR on historical texts to a satisfactory level. Since historical language changes over the centuries, the answer is not simple to obtain. We present experiments that study the performance of matching procedures on text collections from four centuries. After classifying the missed vocabulary, we measure the precision and recall of the matching procedure for each period. Our results indicate that for earlier periods historical lexica represent an important corrective to matching procedures in IR applications.
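To make the idea of a matching procedure and its evaluation concrete, the following is a minimal sketch and not the tool or rule set described in the paper. It assumes that a matching procedure can be approximated by substring rewrite rules that turn a modern query word into candidate historical spellings. The two rules, the toy vocabulary, the gold variant set, and the names `apply_once`, `variants`, and `precision_recall` are all hypothetical and chosen only for illustration.

```python
# Hypothetical rewrite rules (modern substring -> historical substring).
# These are illustrative assumptions, not rules taken from the paper.
RULES = [("t", "th"), ("ei", "ey")]


def apply_once(word, rules):
    """Apply every rule at every single position where it matches."""
    results = set()
    for modern, historic in rules:
        start = word.find(modern)
        while start != -1:
            results.add(word[:start] + historic + word[start + len(modern):])
            start = word.find(modern, start + 1)
    return results


def variants(word, rules=RULES, max_steps=2):
    """Collect the word and all strings reachable in at most max_steps rule applications."""
    seen = {word}
    frontier = {word}
    for _ in range(max_steps):
        frontier = {v for w in frontier for v in apply_once(w, rules)} - seen
        seen |= frontier
    return seen


def precision_recall(candidates, vocabulary, gold):
    """Score candidates that occur in the corpus vocabulary against gold variants."""
    retrieved = candidates & vocabulary
    hits = retrieved & gold
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(gold) if gold else 0.0
    return precision, recall


# Toy data (assumed): word forms attested in a historical corpus, and the
# forms that a human annotator would accept as variants of "teil".
vocabulary = {"theil", "theyl", "teyl", "deyl"}
gold = {"theil", "theyl", "deyl"}

candidates = variants("teil") - {"teil"}
precision, recall = precision_recall(candidates, vocabulary, gold)
print(sorted(candidates), precision, recall)
```

In this toy setting the rules reach two of the three gold variants but miss "deyl", which no rule can produce, and they also match "teyl", which the annotator did not accept. A lexicon entry linking "deyl" to "teil" would recover the missed form, which is the kind of corrective role the abstract attributes to historical lexica.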