Locality sensitive hashing for sampling-based algorithms in association rule mining

Authors:
Chyouhwa Chen; Shi-Jinn Horng; Chin-Pin Huang
Affiliations:
Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, 43, Keelung Road, Section , Taipei 10607, Taiwan, ROC (all three authors)
Venue:
Expert Systems with Applications: An International Journal
Year:
2011

Abstract

Association rule mining is one of the most important techniques for intelligent system design and has been widely applied in real applications. However, classical mining algorithms cannot process very large databases in a reasonable amount of time. Sampling, which mines only a subset of the database, is a viable alternative, but it cannot extract perfectly accurate rules. Previous work has tried to improve accuracy by removing "outliers" from the initial sample based on global statistical properties of the sample. In this paper, we take the view that the initial sample may actually consist of multiple, possibly overlapping, subsets or clusters. It is therefore more reasonable to cluster the initial sample first and then remove outliers from each resulting cluster, so that outliers are identified by the local properties of individual clusters. However, clustering very high-dimensional transactional data is a difficult problem in itself. We solve this problem by interpreting locality sensitive hashing as a means of data clustering. Previously proposed algorithms may then optionally be used to remove the outliers within the individual clusters. We propose several concrete algorithms based on this general strategy. Using an extensive set of synthetic and real datasets, we evaluate the proposed algorithms and find that they achieve better accuracy, better execution time, or both, compared with previously proposed algorithms.
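
The idea of treating hash buckets as (possibly overlapping) clusters of transactions can be illustrated with a minimal sketch. The abstract does not say which hash family or parameters the authors use, so everything concrete here is an assumption made for illustration: the MinHash family (suited to Jaccard similarity between item sets), the banding scheme, the integer item IDs, and the hypothetical names `minhash_signature`, `lsh_clusters`, `num_bands`, and `rows_per_band`. It is not the paper's algorithm.

```python
import random
from collections import defaultdict

PRIME = 2_147_483_647  # modulus for the illustrative hash functions (assumption)


def minhash_signature(transaction, hash_params):
    """Return one min-hash value per (a, b) pair, using h(x) = (a*x + b) mod PRIME.

    `transaction` is a non-empty set of integer item IDs.
    """
    return tuple(
        min((a * item + b) % PRIME for item in transaction)
        for a, b in hash_params
    )


def lsh_clusters(transactions, num_bands=4, rows_per_band=2, seed=0):
    """Group transaction indices into LSH buckets, read here as clusters.

    Each signature is split into `num_bands` bands; transactions that agree
    on every row of a band share that band's bucket. A transaction lands in
    one bucket per band, so the resulting clusters may overlap.
    """
    rng = random.Random(seed)
    hash_params = [
        (rng.randrange(1, PRIME), rng.randrange(0, PRIME))
        for _ in range(num_bands * rows_per_band)
    ]
    buckets = defaultdict(set)
    for tid, transaction in enumerate(transactions):
        signature = minhash_signature(transaction, hash_params)
        for band in range(num_bands):
            start = band * rows_per_band
            key = (band, signature[start:start + rows_per_band])
            buckets[key].add(tid)
    return buckets


if __name__ == "__main__":
    sample = [{1, 2, 3}, {1, 2, 3}, {1, 2, 4}, {7, 8, 9}]
    for key, members in lsh_clusters(sample).items():
        print(key[0], sorted(members))
```

In this sketch, similar transactions are likely (not guaranteed) to share at least one bucket, and identical transactions always do. An outlier-removal step of the kind the abstract mentions could then be run per bucket rather than on the whole sample.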