An information theoretic framework for data mining

Authors: Tijl De Bie
Affiliations: University of Bristol, Bristol, United Kingdom
Venue: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
Year: 2011

Abstract

We formalize the data mining process as a process of information exchange, defined by the following key components. The data miner's state of mind is modeled as a probability distribution, called the background distribution, which represents the uncertainty and misconceptions the data miner has about the data. This model initially incorporates any prior (possibly incorrect) beliefs a data miner has about the data. During the data mining process, properties of the data (to which we refer as patterns) are revealed to the data miner, either in batch, one by one, or even interactively. This acquisition of information in the data mining process is formalized by updates to the background distribution to account for the presence of the found patterns. The proposed framework can be motivated using concepts from information theory and game theory. Understanding it from this perspective, it is easy to see how it can be extended to more sophisticated settings, e.g. where patterns are probabilistic functions of the data (thus allowing one to account for noise and errors in the data mining process, and allowing one to study data mining techniques based on subsampling the data). The framework then models the data mining process using concepts from information geometry, and I-projections in particular. The framework can be used to help in designing new data mining algorithms that maximize the efficiency of the information exchange from the algorithm to the data miner.
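
As a rough illustration of the update step described above, the following sketch works through a toy case. It is not the paper's algorithm: it assumes a finite data space, a uniform background distribution, and a deterministic pattern that simply states "the data lies in this subset". Under those assumptions the I-projection of the background distribution onto the set of distributions consistent with the pattern reduces to ordinary conditioning, and the KL divergence between the updated and the original distribution equals the self-information of the pattern. The function names (`i_projection_onto_event`, `self_information`, `kl_divergence`) are hypothetical and chosen only for this example.

```python
import math
from itertools import product


def event_mass(background, event):
    """Probability that the data falls inside `event` under `background`."""
    return sum(p for x, p in background.items() if x in event)


def i_projection_onto_event(background, event):
    """Update the background distribution after a pattern is revealed.

    Assumption: the pattern is a hard constraint "data is in `event`".
    For such a constraint, the distribution closest to `background` in
    KL divergence is `background` conditioned on `event`.
    """
    mass = event_mass(background, event)
    if mass <= 0.0:
        raise ValueError("pattern has zero probability under the background")
    return {x: (p / mass if x in event else 0.0) for x, p in background.items()}


def self_information(background, event):
    """Information content of the pattern, in bits."""
    return -math.log2(event_mass(background, event))


def kl_divergence(q, p):
    """KL(q || p) in bits; terms with q(x) == 0 contribute nothing."""
    return sum(qx * math.log2(qx / p[x]) for x, qx in q.items() if qx > 0.0)


# Toy data space: every possible 3-bit dataset, with a uniform prior belief.
data_space = list(product([0, 1], repeat=3))
background = {x: 1.0 / len(data_space) for x in data_space}

# Hypothetical pattern: "the first bit of the data is 1".
pattern = {x for x in data_space if x[0] == 1}

updated = i_projection_onto_event(background, pattern)
info_bits = self_information(background, pattern)
shift_bits = kl_divergence(updated, background)
```

In this toy setting `info_bits` and `shift_bits` coincide, which is one way to read the claim that revealing a pattern transfers information to the data miner. Probabilistic patterns, or background distributions constrained by expectations rather than by hard subsets, would need a genuine constrained optimization instead of simple conditioning.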