This work presents categorization experiments performed over noisy texts. By noisy, we mean any text obtained through an error-prone extraction process from media other than digital text (e.g., transcriptions of speech recordings produced by a recognition system). We compare the performance of a categorization system over clean and noisy versions of the same documents, where the noisy versions have a Word Error Rate between roughly 10 and 50 percent. The noisy texts are obtained through handwriting recognition and simulated optical character recognition. The results show that the performance loss is acceptable for Recall values up to 60-70 percent, depending on the noise source. We also propose new measures of extraction process performance that better explain the categorization results.
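
For readers unfamiliar with the Word Error Rate mentioned above, the following is a minimal sketch of its conventional definition: the word-level edit distance (substitutions, deletions, insertions) between a clean reference and the extracted text, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the abstract does not specify how the authors computed the measure.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumption: words are separated by whitespace. Real evaluations
    usually normalize case and punctuation first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting i reference words to reach an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    clean = "the cat sat on the mat"
    noisy = "the cat sit on mat"  # one substitution, one deletion
    print(f"WER = {word_error_rate(clean, noisy):.3f}")
```

Because insertions are counted, this ratio can exceed 1 when the extracted text is much longer than the reference.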