How Much Noise Is Too Much: A Study in Automatic Text Classification

Authors:
Sumeet Agarwal;Shantanu Godbole;Diwakar Punjani;Shourya Roy
Venue:
ICDM '07 Proceedings of the 2007 Seventh IEEE International Conference on Data Mining
Year:
2007

Citing: 0
Cited by: 10

Text classification, business intelligence, and interactivity: automating C-Sat analysis for services industry
Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining

A survey of types of text noise and techniques to handle noisy text
Proceedings of The Third Workshop on Analytics for Noisy Unstructured Text Data

Evaluating models of latent document semantics in the presence of OCR errors
EMNLP '10 Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing

Classifying sentiment in microblogs: is brevity an advantage?
CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management

Study of text based mining
ACAI '11 Proceedings of the International Conference on Advances in Computing and Artificial Intelligence

Experiments with artificially generated noise for cleansing noisy text
Proceedings of the 2011 Joint Workshop on Multilingual OCR and Analytics for Noisy Unstructured Text Data

Combining Bayesian Text Classification and Shrinkage to Automate Healthcare Coding: A Data Quality Analysis
Journal of Data and Information Quality (JDIQ)

Discovering customer intent in real-time for streamlining service desk conversations
Proceedings of the 20th ACM international conference on Information and knowledge management

Improving Text Classification Accuracy by Training Label Cleaning
ACM Transactions on Information Systems (TOIS)

A novel variable precision (θ,σ)-fuzzy rough set model based on fuzzy granules
Fuzzy Sets and Systems


Abstract

Noise is a stark reality in real-life data. It has a particularly significant impact in text analytics, where data cleaning forms a very large part of the data processing cycle. Noisy unstructured text is common in informal settings such as online chat, SMS, email, newsgroups, and blogs, as well as in text automatically transcribed from speech and text automatically recognized from printed or handwritten material. Gigabytes of such data are generated every day on the Internet, in contact centers, and on mobile phones. Researchers have looked at various text mining issues for noisy text, such as pre-processing and cleaning, information extraction, rule learning, and classification. This paper focuses on the issues automatic text classifiers face when analyzing noisy documents from various sources. Its goal is to bring out and study the effect of different kinds of noise on automatic text classification. Does the nature of such text warrant moving beyond traditional text classification techniques? We present detailed experimental results with simulated noise on the Reuters-21578 and 20-newsgroups benchmark datasets, along with results on real-life noisy datasets from various CRM domains.