Efficient sampling: application to image data

Authors:
Surong Wang;Manoranjan Dash;Liang-Tien Chia
Affiliations:
Center for Multimedia and Network Technology, School of Computer Engineering, Nanyang Technological University, Singapore;Center for Multimedia and Network Technology, School of Computer Engineering, Nanyang Technological University, Singapore;Center for Multimedia and Network Technology, School of Computer Engineering, Nanyang Technological University, Singapore
Venue:
PAKDD'05 Proceedings of the 9th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining
Year:
2005

Citing 6
Cited 1

Random sampling with a reservoir

ACM Transactions on Mathematical Software (TOMS)
Mining frequent patterns without candidate generation

SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Fast Algorithms for Mining Association Rules in Large Databases

VLDB '94 Proceedings of the 20th International Conference on Very Large Data Bases
A new two-phase sampling based algorithm for discovering association rules

Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Efficient data reduction with EASE

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Support vector machines for histogram-based image classification

IEEE Transactions on Neural Networks

Efficient discovery of association rules and frequent itemsets through sampling with tight performance guarantees

ECML PKDD'12 Proceedings of the 2012 European conference on Machine Learning and Knowledge Discovery in Databases - Volume Part I

Quantified Score

Hi-index	0.00

Visualization

Abstract

Sampling is an important preprocessing algorithm that is used to mine large data efficiently. Although a simple random sample often works fine for reasonable sample size, accuracy falls sharply with reduced sample size. In kdd'03 we proposed ease that outputs a sample based on its ‘closeness' to the original sample. Reported results show that ease outperforms simple random sampling (srs). In this paper we propose easier that extends ease in two ways. 1) ease is a halving algorithm, i.e., to achieve the required sample ratio it starts from a suitable initial large sample and iteratively halves. easier, on the other hand, does away with the repeated halving by directly obtaining the required sample ratio in one iteration. 2) ease was shown to work on ibm quest dataset which is a categorical count dataset. easier, in addition, is shown to work on continuous data such as Color Structure Descriptor of images. Two mining tasks, classification and association rule mining, are used to validate the efficacy of easier samples vis-a-visease and srs samples.