Discretization is a crucial preprocessing technique for a variety of data warehousing and data mining tasks. In this paper, we present a novel PCA-based unsupervised algorithm for discretizing continuous attributes in multivariate data sets. The algorithm uses the underlying correlation structure of the data set to derive the discrete intervals and ensures that the inherent correlations are preserved. Previous efforts on this problem are largely supervised and consider only piecewise correlation among attributes. We consider the correlation among continuous attributes and, at the same time, account for the interactions between continuous and categorical attributes. Our approach also extends easily to data sets containing missing values. We demonstrate the efficacy of the approach on real data sets and as a preprocessing step for both classification and frequent itemset mining tasks. We show that the resulting intervals are meaningful and can uncover hidden patterns in the data, and that large compression factors can be obtained on the discretized data sets. The approach is task independent: the same discretized data set can be used for different data mining tasks. Thus, a data set can be discretized, compressed, and stored once, then reused.
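The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea of correlation-aware discretization, not the paper's method. It assumes a simplified pipeline: standardize the continuous attributes, find the leading principal component by power iteration on the correlation matrix, project each record onto it, and cut the projected scores into equal-frequency intervals. The function name `pca_discretize`, its parameters, and the choice of equal-frequency cuts are all illustrative assumptions.

```python
import math
import random


def pca_discretize(rows, n_bins=4, iters=200, seed=0):
    """Hypothetical helper: bin records along their leading principal component.

    Returns (codes, direction): one interval index per record, and the unit
    vector of the leading principal component in standardized attribute space.
    """
    n, d = len(rows), len(rows[0])

    # Standardize each attribute so PCA operates on the correlation matrix.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) or 1.0
        for j in range(d)
    ]
    z = [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]

    # Correlation matrix of the standardized attributes.
    corr = [
        [sum(row[a] * row[b] for row in z) / n for b in range(d)]
        for a in range(d)
    ]

    # Power iteration from a random start (assumed not to be orthogonal
    # to the leading eigenvector).
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    for _ in range(iters):
        w = [sum(corr[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]

    # Project every record onto the leading component.
    scores = [sum(zi[j] * v[j] for j in range(d)) for zi in z]

    # Equal-frequency binning by rank; tied scores may land in adjacent bins.
    order = sorted(range(n), key=scores.__getitem__)
    codes = [0] * n
    for rank, idx in enumerate(order):
        codes[idx] = rank * n_bins // n
    return codes, v


# Toy data: two strongly correlated continuous attributes.
_rng = random.Random(42)
data = []
for _ in range(100):
    x = _rng.random()
    data.append([x, 2.0 * x + _rng.gauss(0.0, 0.1)])

codes, direction = pca_discretize(data, n_bins=4)
```

Because the two toy attributes are strongly correlated, a single set of intervals along the leading component describes both of them at once, which is the intuition behind discretizing in the space of principal components rather than one attribute at a time. A fuller treatment would keep several components, map the cut points back to the original attributes, and handle categorical attributes and missing values, none of which this sketch attempts.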