Toward Unsupervised Correlation Preserving Discretization

Authors:
Sameep Mehta;Srinivasan Parthasarathy;Hui Yang
Affiliations:
-;IEEE;-
Venue:
IEEE Transactions on Knowledge and Data Engineering
Year:
2005

Abstract

Discretization is a crucial preprocessing technique used for a variety of data warehousing and mining tasks. In this paper, we present a novel PCA-based unsupervised algorithm for the discretization of continuous attributes in multivariate data sets. The algorithm leverages the underlying correlation structure in the data set to obtain the discrete intervals and ensures that the inherent correlations are preserved. Previous efforts on this problem are largely supervised and consider only piecewise correlation among attributes. We consider the correlation among continuous attributes and, at the same time, also take into account the interactions between continuous and categorical attributes. Our approach also extends easily to data sets containing missing values. We demonstrate the efficacy of the approach on real data sets and as a preprocessing step for both classification and frequent itemset mining tasks. We show that the intervals are meaningful and can uncover hidden patterns in data. We also show that large compression factors can be obtained on the discretized data sets. The approach is task independent, i.e., the same discretized data set can be used for different data mining tasks. Thus, the data sets can be discretized, compressed, and stored once and can be used again and again.
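
The abstract does not spell out the algorithm's steps, so the following is only a rough illustration of the general idea of discretizing along a correlation-aware direction, not the authors' method. It is a minimal sketch under several assumptions: only the first principal component is used, it is found by power iteration, and the cut points are equal-width. The function names (`center`, `covariance`, `first_component`, `discretize_along_pc`) and the `n_bins` parameter are hypothetical.

```python
import math


def center(rows):
    """Subtract the per-attribute mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def first_component(cov, iters=200):
    """Approximate the leading eigenvector of cov by power iteration.

    Assumption: the start vector is not orthogonal to the leading
    eigenvector; a production implementation would use a linear
    algebra library instead.
    """
    d = len(cov)
    v = [1.0 / (j + 1) for j in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # degenerate case: keep the last usable vector
        v = [x / norm for x in w]
    return v


def discretize_along_pc(rows, n_bins=3):
    """Assign each row a bin index from equal-width cuts on its
    projection onto the first principal component."""
    centered = center(rows)
    pc = first_component(covariance(centered))
    scores = [sum(r[j] * pc[j] for j in range(len(pc))) for r in centered]
    lo, hi = min(scores), max(scores)
    width = (hi - lo) / n_bins
    if width == 0.0:
        return [0] * len(rows)  # all projections identical
    # Clamp so the maximum score falls in the last bin.
    return [min(int((s - lo) / width), n_bins - 1) for s in scores]


if __name__ == "__main__":
    # Two strongly correlated attributes forming two groups.
    data = [[1, 2], [2, 4], [3, 6], [10, 20], [11, 22], [12, 24]]
    print(discretize_along_pc(data, n_bins=2))
```

Because the bins are chosen on the projected axis rather than on each attribute separately, rows that lie close together along the dominant correlation direction receive the same label. Handling categorical attributes and missing values, both of which the abstract mentions, is outside the scope of this sketch.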