Optical character reader (OCR) systems can be used to digitize print documents, and the process produces OCR texts. These texts usually need to be indexed and organized to simplify access and retrieval, which can be done with automatic classification techniques. However, current OCR technology cannot recognize all characters with 100% accuracy. Furthermore, it is not known whether part-of-speech (POS) analysis contributes to a discriminative representation of OCR texts. Conventionally, the bag-of-words approach is used in OCR text classification. In this paper, we experimentally evaluate POS analysis on OCR texts to formulate an informative feature set. Empirical results indicate that a combination of suitably selected POS categories improved the classification performance of OCR texts.
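To make the contrast between a plain bag-of-words representation and a POS-selected feature set concrete, the following is a minimal sketch and not the paper's actual pipeline. It assumes a hypothetical toy lexicon (`TOY_LEXICON`) in place of a trained POS tagger, an illustrative choice of retained categories (nouns and verbs), and the simplifying assumption that tokens the lexicon does not know, such as OCR misrecognitions, are tagged `UNK` and dropped.

```python
from collections import Counter

# Hypothetical toy POS lexicon for illustration only;
# a real system would use a trained POS tagger.
TOY_LEXICON = {
    "the": "DET", "a": "DET",
    "bank": "NOUN", "loan": "NOUN", "rate": "NOUN",
    "approved": "VERB", "raised": "VERB",
    "quickly": "ADV", "new": "ADJ",
}


def tag(tokens):
    """Pair each token with a POS tag; unknown tokens get 'UNK'."""
    return [(tok, TOY_LEXICON.get(tok, "UNK")) for tok in tokens]


def bag_of_words(text):
    """Conventional representation: count every whitespace token."""
    return Counter(text.lower().split())


def pos_filtered_bag(text, keep=frozenset({"NOUN", "VERB"})):
    """Count only tokens whose POS tag is in the selected set."""
    tokens = text.lower().split()
    return Counter(tok for tok, pos in tag(tokens) if pos in keep)


# "l0an" simulates an OCR error (digit 0 read in place of the letter o).
doc = "The bank approved the l0an quickly"
bow = bag_of_words(doc)
pos_bag = pos_filtered_bag(doc)
print(bow)
print(pos_bag)
```

In this sketch the plain bag-of-words keeps the misrecognized token `l0an` and the function words, while the POS-filtered bag keeps only `bank` and `approved`. Which POS categories to retain is exactly the kind of choice the experiments evaluate; the set used here is arbitrary.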