Optical character recognition (OCR) is the most commonly used technique for converting printed material into electronic form. With OCR, large repositories of machine-readable text can be created in a short time, and an information retrieval system can then be used to search the resulting collections. Many information retrieval systems use sophisticated term weighting functions to improve search effectiveness. These term weighting schemes can be highly sensitive to errors that the OCR process introduces into the input text. This study examines the effects of the well-known cosine normalization method in the presence of OCR errors and proposes a new, more robust normalization method. Experiments show that the new scheme is less sensitive to OCR errors and facilitates the use of more diverse basic weighting schemes. It also yields significant improvements in retrieval effectiveness over cosine normalization.
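To illustrate why cosine normalization may be sensitive to OCR errors, the following is a minimal sketch and not the study's experimental setup. It assumes tf-idf style weights, where a garbled token is rare in the collection and therefore receives a high weight. The terms, the weights, and the helper name `cosine_normalize` are all hypothetical. Cosine normalization divides every term weight by the Euclidean norm of the document vector, so a few heavily weighted garbled tokens inflate the norm and shrink the weights of the correctly recognized terms.

```python
import math


def cosine_normalize(weights):
    """Divide each term weight by the Euclidean norm of the whole vector."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return dict(weights)
    return {term: w / norm for term, w in weights.items()}


# Hypothetical tf-idf style weights for a cleanly recognized document.
clean = {"retrieval": 2.0, "normalization": 3.0, "document": 1.0}

# The same document with two hypothetical OCR misrecognitions added.
# Assumption: garbled tokens are rare in the collection, so their
# idf-based weights are high.
noisy = dict(clean)
noisy["norma1ization"] = 6.0
noisy["retrieva1"] = 6.0

clean_norm = cosine_normalize(clean)
noisy_norm = cosine_normalize(noisy)

# Compare the normalized weight of a correctly recognized term.
print("clean:", round(clean_norm["retrieval"], 4))
print("noisy:", round(noisy_norm["retrieval"], 4))
```

In this toy setting the normalized weight of "retrieval" drops once the garbled tokens enter the vector, even though the correctly recognized text is unchanged. A normalization factor that does not depend on the weights of individual terms would not be affected this way, which is the kind of robustness the abstract is concerned with.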