A cross-language approach to historic document retrieval

Authors:
Marijn Koolen;Frans Adriaans;Jaap Kamps;Maarten de Rijke
Affiliations:
ISLA, University of Amsterdam, The Netherlands (all four authors)
Venue:
ECIR'06 Proceedings of the 28th European conference on Advances in Information Retrieval
Year:
2006

Abstract

Our cultural heritage, as preserved in libraries, archives and museums, is made up of documents written many centuries ago. Large-scale digitization initiatives make these documents available to non-expert users through digital libraries and vertical search engines. For a user, querying a historic document collection may be a disappointing experience: queries involving modern words may not be very effective for retrieving documents that contain many historic terms. We propose a cross-language approach to historic document retrieval, and investigate (1) the automatic construction of translation resources for historic languages, and (2) the retrieval of historic documents using cross-language information retrieval techniques. Our experimental evidence is based on a collection of 17th-century Dutch documents and a set of 25 known-item topics in modern Dutch. Our main findings are as follows. First, we are able to automatically construct rules for modernizing historic language by comparing historic and modern corpora with respect to (a) phonetic sequence similarity, (b) the relative frequency of consonant and vowel sequences, and (c) the relative frequency of character n-gram sequences. Second, modern queries are not very effective for retrieving historic documents, but the historic language tools lead to a substantial improvement in retrieval effectiveness. The improvements are above and beyond the improvement due to using a modern stemming algorithm (whose effectiveness actually goes up when the historic language is modernized).
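
To make criterion (c) more concrete, the following is a minimal sketch of comparing relative character n-gram frequencies between a historic and a modern word list, and then applying a rewrite rule to modernize historic spellings. It is not the authors' algorithm: the function names, the `min_ratio` threshold, the toy word lists and the single rewrite rule are all assumptions made for illustration.

```python
from collections import Counter


def ngram_rel_freq(words, n):
    """Relative frequency of each character n-gram across a word list."""
    counts = Counter(
        word[i:i + n] for word in words for i in range(len(word) - n + 1)
    )
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}


def typical_historic_ngrams(historic_words, modern_words, n=2, min_ratio=2.0):
    """N-grams at least `min_ratio` times more frequent in the historic list.

    N-grams that never occur in the modern list always qualify.
    The threshold is an illustrative assumption, not a value from the paper.
    """
    hist = ngram_rel_freq(historic_words, n)
    mod = ngram_rel_freq(modern_words, n)
    return sorted(
        gram for gram, freq in hist.items()
        if gram not in mod or freq / mod[gram] >= min_ratio
    )


def modernize(word, rules):
    """Apply (historic, modern) substring rewrites in the given order."""
    for old, new in rules:
        word = word.replace(old, new)
    return word


# Toy word lists, chosen for illustration only.
historic = ["mensch", "visch", "vleesch"]
modern = ["mens", "vis", "vlees"]

# N-grams that stand out in the historic list are candidates for rewriting.
candidates = typical_historic_ngrams(historic, modern)

# Hypothetical rule suggested by the candidates; selecting and scoring
# rules automatically is the part this sketch leaves out.
rules = [("sch", "s")]
modernized = [modernize(w, rules) for w in historic]
```

On this toy data the bigrams that stand out are those belonging to the historic "sch" spelling, and the single rule maps each historic form to its modern counterpart. A context-free substring rule like this one would also rewrite words where the sequence is still correct in modern spelling, so any realistic rule set would need positional or contextual constraints.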