Noisy Text Categorization

Authors:
Alessandro Vinciarelli
Affiliations:
-
Venue:
IEEE Transactions on Pattern Analysis and Machine Intelligence
Year:
2005

Citing 23
Cited 4

Term-weighting approaches in automatic text retrieval

Information Processing and Management: an International Journal
Lexical analysis and stoplists

Information retrieval
Stemming algorithms

Information retrieval
An evaluation of phrasal and clustered representations on a text categorization task

SIGIR '92 Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval
Automated learning of decision rules for text categorization

ACM Transactions on Information Systems (TOIS)
Using IR techniques for text classification in document analysis

SIGIR '94 Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval
Statistical methods for speech recognition

Statistical methods for speech recognition
The indexing and retrieval of document images: a survey

Computer Vision and Image Understanding - Special issue on document image understanding and retrieval
Categorizing paper documents: a generic system for domain and language independent text categorization

Computer Vision and Image Understanding - Special issue on document image understanding and retrieval
Statistical Pattern Recognition: A Review

IEEE Transactions on Pattern Analysis and Machine Intelligence
An introduction to support Vector Machines: and other kernel-based learning methods

An introduction to support Vector Machines: and other kernel-based learning methods
Machine learning in automated text categorization

ACM Computing Surveys (CSUR)
Modern Information Retrieval

Modern Information Retrieval
Imaged Document Text Retrieval Without OCR

IEEE Transactions on Pattern Analysis and Machine Intelligence
Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms

Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms
A Tutorial on Support Vector Machines for Pattern Recognition

Data Mining and Knowledge Discovery
Text Categorization with Suport Vector Machines: Learning with Many Relevant Features

ECML '98 Proceedings of the 10th European Conference on Machine Learning
A Comparative Study on Feature Selection in Text Categorization

ICML '97 Proceedings of the Fourteenth International Conference on Machine Learning
Retrieval methods for English-text with missrecognized OCR characters

ICDAR '97 Proceedings of the 4th International Conference on Document Analysis and Recognition
Handwritten Document Retrieval

IWFHR '02 Proceedings of the Eighth International Workshop on Frontiers in Handwriting Recognition (IWFHR'02)
Features for Word Spotting in Historical Manuscripts

ICDAR '03 Proceedings of the Seventh International Conference on Document Analysis and Recognition - Volume 1
Named entity extraction from noisy input: speech and OCR

ANLC '00 Proceedings of the sixth conference on Applied natural language processing
Offline Recognition of Unconstrained Handwritten Texts Using HMMs and Statistical Language Models

IEEE Transactions on Pattern Analysis and Machine Intelligence

Spam Filtering Based On The Analysis Of Text Information Embedded Into Images

The Journal of Machine Learning Research
An approach for adding noise-tolerance to restricted-domain information retrieval

NLDB'10 Proceedings of the Natural language processing and information systems, and 15th international conference on Applications of natural language to information systems
Combining Bayesian Text Classification and Shrinkage to Automate Healthcare Coding: A Data Quality Analysis

Journal of Data and Information Quality (JDIQ)
Improving Text Classification Accuracy by Training Label Cleaning

ACM Transactions on Information Systems (TOIS)

Quantified Score

Hi-index	0.14

Visualization

Abstract

This work presents categorization experiments performed over noisy texts. By noisy, we mean any text obtained through an extraction process (affected by errors) from media other than digital texts (e.g., transcriptions of speech recordings extracted with a recognition system). The performance of a categorization system over the clean and noisy (Word Error Rate between \sim 10 and \sim 50 percent) versions of the same documents is compared. The noisy texts are obtained through handwriting recognition and simulation of optical character recognition. The results show that the performance loss is acceptable for Recall values up to 60-70 percent depending on the noise sources. New measures of the extraction process performance, allowing a better explanation of the categorization results, are proposed.