Learning correlations using the mixture-of-subsets model

Authors:
Manas Somaiya;Christopher Jermaine;Sanjay Ranka
Affiliations:
University of Florida, Gainesville, FL;University of Florida, Gainesville, FL;University of Florida, Gainesville, FL
Venue:
ACM Transactions on Knowledge Discovery from Data (TKDD)
Year:
2008

Citing 17
Cited 2

Information geometry of the EM and em algorithms for neural networks

Neural Networks
Automatic subspace clustering of high dimensional data for data mining applications

SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
Fast algorithms for projected clustering

SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Entropy-based subspace clustering for mining numerical data

KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining
Finding generalized projected clusters in high dimensional spaces

SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
A general probabilistic framework for clustering individuals and objects

Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining
Clustering through decision tree construction

Proceedings of the ninth international conference on Information and knowledge management
Probabilistic modeling of transaction data with applications to profiling, visualization, and prediction

Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining
A new cell-based clustering method for large, high-dimensional data in data mining applications

Proceedings of the 2002 ACM symposium on Applied computing
A Monte Carlo algorithm for fast projective clustering

Proceedings of the 2002 ACM SIGMOD international conference on Management of data
Fast Algorithms for Mining Association Rules in Large Databases

VLDB '94 Proceedings of the 20th International Conference on Very Large Data Bases
d-Clusters: Capturing Subspace Correlation in a Large Data Set

ICDE '02 Proceedings of the 18th International Conference on Data Engineering
Information Theoretic Clustering of Sparse Co-Occurrence Data

ICDM '03 Proceedings of the Third IEEE International Conference on Data Mining
Information-theoretic co-clustering

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
A generalized maximum entropy approach to bregman co-clustering and matrix approximation

Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
Consistent bipartite graph co-partitioning for star-structured high-order heterogeneous data co-clustering

Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
Unsupervised learning of parsimonious mixtures on large spaces with integrated feature and component selection

IEEE Transactions on Signal Processing

Mixture models for learning low-dimensional roles in high-dimensional data

Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining
A global local modeling of internet usage in large mobile societies

Proceedings of the 7th ACM workshop on Performance monitoring and measurement of heterogeneous wireless and wired networks

Quantified Score

Hi-index	0.00

Visualization

Abstract

Using a mixture of random variables to model data is a tried-and-tested method common in data mining, machine learning, and statistics. By using mixture modeling it is often possible to accurately model even complex, multimodal data via very simple components. However, the classical mixture model assumes that a data point is generated by a single component in the model. A lot of datasets can be modeled closer to the underlying reality if we drop this restriction. We propose a probabilistic framework, the mixture-of-subsets (MOS) model, by making two fundamental changes to the classical mixture model. First, we allow a data point to be generated by a set of components, rather than just a single component. Next, we limit the number of data attributes that each component can influence. We also propose an EM framework to learn the MOS model from a dataset, and experimentally evaluate it on real, high-dimensional datasets. Our results show that the MOS model learned from the data represents the underlying nature of the data accurately.