Length Normalization in Degraded Text Collections

Authors:
Amit Singhal;Gerard Salton;Chris Buckley
Venue:
Length Normalization in Degraded Text Collections
Year:
1995

Citing 0
Cited 9

Pivoted document length normalization
SIGIR '96 Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval

BUS: an effective indexing and retrieval scheme in structured documents
Proceedings of the third ACM conference on Digital libraries

Information retrieval and spelling correction: an inquiry into lexical disambiguation
Proceedings of the 2002 ACM symposium on Applied computing

Mixing and Merging for Spoken Document Retrieval
ECDL '98 Proceedings of the Second European Conference on Research and Advanced Technology for Digital Libraries

Effect of term distributions on centroid-based text categorization
Information Sciences—Informatics and Computer Science: An International Journal - Special issue: Informatics and computer science intelligent systems applications

Using contextual spelling correction to improve retrieval effectiveness in degraded text collections
COLING '02 Proceedings of the 19th international conference on Computational linguistics - Volume 1

Effects of Term Distributions on Binary Classification
IEICE - Transactions on Information and Systems

A study of information retrieval weighting schemes for sentiment analysis
ACL '10 Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

Class normalization in centroid-based text categorization
Information Sciences: an International Journal

Abstract

Optical character recognition (OCR) is the most commonly used technique for converting printed material into electronic form. Using OCR, large repositories of machine-readable text can be created in a short time, and an information retrieval system can then be used to search the resulting collections. Many information retrieval systems use sophisticated term weighting functions to improve search effectiveness. These term weighting schemes can be highly sensitive to errors that the OCR process introduces into the input text. This study examines the effects of the well-known cosine normalization method in the presence of OCR errors and proposes a new, more robust normalization method. Experiments show that the new scheme is less sensitive to OCR errors and facilitates the use of more diverse basic weighting schemes. It also yields significant improvements in retrieval effectiveness over cosine normalization.
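
To make the sensitivity described in the abstract concrete, the following is a minimal Python sketch of cosine normalization applied to tf-idf weights. It is an illustration only, not the paper's experimental setup and not the new normalization method the paper proposes. The idf values, the sample tokens, the `DEFAULT_IDF` constant, and the function name `cosine_normalized_weights` are all assumptions made for this example. The sketch assumes that a misrecognized token such as "retrieva1" is rare in the collection and therefore receives a high idf.

```python
import math
from collections import Counter

# Hypothetical idf values; real values depend on the collection.
IDF = {"retrieval": 2.0, "system": 3.0, "text": 1.5}
# Assumption: garbled OCR tokens are rare, so unseen terms get a high idf.
DEFAULT_IDF = 8.0


def cosine_normalized_weights(tokens, idf=IDF, default_idf=DEFAULT_IDF):
    """Return tf*idf weights divided by the Euclidean norm of the document vector."""
    tf = Counter(tokens)
    raw = {term: count * idf.get(term, default_idf) for term, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    if norm == 0.0:
        return {}
    return {term: w / norm for term, w in raw.items()}


# The same document as clean text and with one occurrence misread by OCR.
clean_doc = ["retrieval", "retrieval", "retrieval", "system", "text"]
degraded_doc = ["retrieval", "retrieval", "retrieva1", "system", "text"]

clean_w = cosine_normalized_weights(clean_doc)
degraded_w = cosine_normalized_weights(degraded_doc)

# "system" was recognized correctly in both versions, yet its normalized
# weight drops because the garbled token inflates the vector norm.
print("system (clean):   ", round(clean_w["system"], 4))
print("system (degraded):", round(degraded_w["system"], 4))
```

In this toy case, a single misread token lowers the weight of a term that OCR recognized correctly, because the rare garbled token dominates the norm. That is the kind of error sensitivity the abstract attributes to cosine normalization.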