Noise is a stark reality in real-life data. Its impact is especially significant in text analytics, where data cleaning forms a very large part of the data processing cycle. Noisy unstructured text is common in informal settings such as online chat, SMS, email, newsgroups, and blogs, as well as in text automatically transcribed from speech and text automatically recognized from printed or handwritten material. Gigabytes of such data are generated every day on the Internet, in contact centers, and on mobile phones. Researchers have looked at various text mining issues for noisy text, including pre-processing and cleaning, information extraction, rule learning, and classification. This paper focuses on the issues faced by automatic text classifiers in analyzing noisy documents coming from various sources. Its goal is to bring out and study the effect of different kinds of noise on automatic text classification. Does the nature of such text warrant moving beyond traditional text classification techniques? We present detailed experimental results with simulated noise on the Reuters-21578 and 20-newsgroups benchmark datasets, along with results on real-life noisy datasets from various CRM domains.
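The abstract does not say how the simulated noise was generated. As an illustration only, the sketch below assumes a simple character-level model in which each letter is independently deleted, substituted, or followed by a spurious letter with probability `noise_rate`. The function name `add_char_noise` and its parameters are hypothetical and are not taken from the paper.

```python
import random
import string


def add_char_noise(text, noise_rate, seed=None):
    """Return a copy of `text` with random character-level noise.

    Assumed noise model (not the paper's): each alphabetic character is,
    with probability `noise_rate`, deleted, replaced by a random lowercase
    letter, or followed by an inserted random lowercase letter.
    Non-alphabetic characters are left untouched.
    """
    rng = random.Random(seed)  # local generator so results are reproducible
    out = []
    for ch in text:
        if ch.isalpha() and rng.random() < noise_rate:
            op = rng.choice(("delete", "substitute", "insert"))
            if op == "delete":
                continue  # drop the character entirely
            if op == "substitute":
                out.append(rng.choice(string.ascii_lowercase))
            else:  # insert: keep the character, then add a spurious one
                out.append(ch)
                out.append(rng.choice(string.ascii_lowercase))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    sample = "Noisy text is common in chat, SMS and email."
    for rate in (0.0, 0.1, 0.4):
        print(rate, add_char_noise(sample, rate, seed=0))
```

Corrupting a benchmark corpus at increasing values of `noise_rate` and retraining or re-evaluating a classifier at each level is one way to trace how accuracy degrades with noise, which is the kind of question the abstract raises.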