In this article we address the problem of mining large, noisy data efficiently. We propose an efficient sampling algorithm, Concise, as a solution for such data. Concise is far superior to Simple Random Sampling (SRS) in selecting a representative sample, and its gain over SRS is greatest when the data is very large and noisy. The comparison is in terms of the two methods' impact on subsequent data mining tasks, specifically classification, clustering, and association rule mining. We also compared Concise with a few existing noise removal algorithms followed by SRS. Although the accuracy of the mining results is similar, Concise takes far less time than the existing algorithms because it has linear time complexity.
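For orientation, the SRS baseline that Concise is compared against can be sketched as below. This is only an illustration of simple random sampling without replacement, done in a single linear-time pass with reservoir sampling; it is not the Concise algorithm, whose details are not given in this abstract. The function name `simple_random_sample` and its parameters are assumptions made for this example.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def simple_random_sample(
    records: Iterable[T], k: int, seed: Optional[int] = None
) -> List[T]:
    """Draw k records uniformly at random, without replacement, in one pass.

    Hypothetical helper for illustration: this is the SRS baseline
    (reservoir sampling), not Concise.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Keep the new record with probability k / (i + 1),
            # overwriting a uniformly chosen slot.
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = record
    return reservoir


# Usage: sample 10 records from a stream of 1000.
sample = simple_random_sample(range(1000), 10, seed=0)
```

Because every record is kept with equal probability regardless of its content, noisy records enter such a sample at the same rate as they occur in the data, which is the weakness of SRS on noisy data that the abstract points to.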