ECIR'06 Proceedings of the 28th European conference on Advances in Information Retrieval
Our cultural heritage, as preserved in libraries, archives and museums, is made up of documents written many centuries ago. Large-scale digitization initiatives make these documents available to non-expert users through digital libraries and vertical search engines. For such a user, querying a historic document collection can be a disappointing experience: queries made up of modern words may not be very effective for retrieving documents that contain many historic terms. We propose a cross-language approach to historic document retrieval, and investigate (1) the automatic construction of translation resources for historic languages, and (2) the retrieval of historic documents using cross-language information retrieval techniques. Our experimental evidence is based on a collection of 17th-century Dutch documents and a set of 25 known-item topics in modern Dutch. Our main findings are as follows. First, we are able to automatically construct rules for modernizing historic language by comparing historic and modern corpora on (a) phonetic sequence similarity, (b) the relative frequency of consonant and vowel sequences, and (c) the relative frequency of character n-gram sequences. Second, modern queries are not very effective for retrieving historic documents, but the historic language tools lead to a substantial improvement in retrieval effectiveness. These improvements come on top of the improvement obtained from a modern stemming algorithm, whose own effectiveness increases once the historic language has been modernized.
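To make the third comparison concrete, the following is a minimal sketch of how character n-grams that are typical of a historic corpus might be found by comparing relative frequencies against a modern corpus. It is not the procedure used in the paper: the function names, the `#` boundary marker, the `min_ratio` threshold, the `floor` smoothing value, and the toy word lists are all assumptions chosen for illustration.

```python
from collections import Counter


def ngram_relative_freq(words, n):
    """Relative frequency of each character n-gram over a list of words."""
    counts = Counter()
    for word in words:
        padded = f"#{word.lower()}#"  # '#' marks word boundaries (an assumption)
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: c / total for gram, c in counts.items()}


def typically_historic_ngrams(historic_words, modern_words, n=2,
                              min_ratio=5.0, floor=1e-6):
    """N-grams whose relative frequency is much higher in the historic corpus.

    min_ratio and floor are hypothetical tuning parameters; floor avoids
    division by zero for n-grams that never occur in the modern corpus.
    """
    hist = ngram_relative_freq(historic_words, n)
    mod = ngram_relative_freq(modern_words, n)
    scored = [(gram, freq / max(mod.get(gram, 0.0), floor))
              for gram, freq in hist.items()]
    kept = [(gram, ratio) for gram, ratio in scored if ratio >= min_ratio]
    # Highest ratio first; ties broken alphabetically for stable output.
    return sorted(kept, key=lambda item: (-item[1], item[0]))


# Toy data, not taken from the paper's collection: older Dutch spellings
# ending in "-sch" next to their modern forms ending in "-s".
historic = ["mensch", "visch", "vleesch"]
modern = ["mens", "vis", "vlees"]
candidates = typically_historic_ngrams(historic, modern)
print([gram for gram, _ in candidates])
```

On this toy input the bigrams that stand out all belong to the word-final "sch" sequence, which is the kind of evidence from which a rewrite rule such as "sch" to "s" could be proposed. Turning such candidates into actual rewrite rules would still require aligning historic and modern sequences, which this sketch does not attempt.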