Mining in Large Noisy Domains

Authors:
Manoranjan Dash;Ayush Singhania
Affiliations:
Nanyang Technological University, Singapore;Nanyang Technological University, Singapore
Venue:
Journal of Data and Information Quality (JDIQ)
Year:
2009

Abstract

In this article we address the problem of mining large, noisy data efficiently. We propose an efficient sampling algorithm, Concise, as a solution. Concise is far superior to Simple Random Sampling (SRS) at selecting a representative sample, and its gain over SRS is greatest when the data is very large and noisy. The two are compared in terms of their impact on subsequent data mining tasks, specifically classification, clustering, and association rule mining. We also compared Concise with a few existing noise removal algorithms followed by SRS. Although the accuracy of the mining results is similar, Concise takes far less time than the existing algorithms because it has linear time complexity.
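
For readers unfamiliar with the SRS baseline named in the abstract, the following is a minimal Python sketch of simple random sampling without replacement. It does not implement Concise, which the abstract does not describe. The function name `simple_random_sample`, the toy records, and the 10% noise rate are all hypothetical, chosen only for illustration.

```python
import random


def simple_random_sample(records, sample_size, seed=None):
    """Draw sample_size records uniformly at random, without replacement."""
    if sample_size > len(records):
        raise ValueError("sample_size cannot exceed the number of records")
    rng = random.Random(seed)  # local generator so results are reproducible
    return rng.sample(records, sample_size)


# Hypothetical toy data: 1,000 records, each flagged noisy with probability 0.10.
data_rng = random.Random(0)
records = [{"id": i, "noisy": data_rng.random() < 0.10} for i in range(1000)]

sample = simple_random_sample(records, 100, seed=42)
noisy_in_sample = sum(r["noisy"] for r in sample)
print(f"{noisy_in_sample} of {len(sample)} sampled records are noisy")
```

Because SRS gives every record the same chance of selection, noisy records enter the sample in roughly the same proportion as they occur in the full data set.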