Digitizing printed documents involves generating text with an OCR system for applications such as full-text retrieval and document organization. With present OCR technology, however, the generated text contains errors. Moreover, previous studies have shown that classification performance decreases as OCR accuracy decreases. The reason is the use of absolute word frequencies as the feature vector. Representing OCR texts by absolute word frequency has two limitations: the features depend on text length and on the word recognition rate. Both increase within-class variance and consequently lower classification performance. We describe feature transformation techniques that do not have these limitations and present improved experimental results for all classifiers used.
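The dependency of absolute word frequency on text length can be illustrated with a small sketch. The abstract does not say which feature transformations are used, so the relative-frequency normalization below is an assumption chosen for illustration, and the function names, vocabulary, and toy documents are hypothetical.

```python
from collections import Counter


def absolute_frequency(tokens, vocabulary):
    """Count how often each vocabulary word occurs in the token list."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def relative_frequency(tokens, vocabulary):
    """Divide each count by the total in-vocabulary count (illustrative choice)."""
    abs_vec = absolute_frequency(tokens, vocabulary)
    total = sum(abs_vec)
    if total == 0:
        # No vocabulary word was recognized; return an all-zero vector.
        return [0.0] * len(vocabulary)
    return [count / total for count in abs_vec]


# Hypothetical toy data: the same content at two different lengths.
vocabulary = ["ocr", "text", "error"]
short_doc = ["ocr", "text", "text", "error"]
long_doc = short_doc * 3

abs_short = absolute_frequency(short_doc, vocabulary)
abs_long = absolute_frequency(long_doc, vocabulary)
rel_short = relative_frequency(short_doc, vocabulary)
rel_long = relative_frequency(long_doc, vocabulary)

print(abs_short, abs_long)  # the absolute vectors differ with length
print(rel_short, rel_long)  # the relative vectors coincide
```

In this sketch the two documents belong to the same class by construction, yet their absolute-frequency vectors lie far apart, which is one way within-class variance can grow. After normalization the two vectors are identical.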