Provenance for data mining

Authors:
Boris Glavic;Javed Siddique;Periklis Andritsos;Renée J. Miller
Affiliations:
IIT;University of Toronto;University of Toronto;University of Toronto
Venue:
TaPP'13 Proceedings of the 5th USENIX conference on Theory and Practice of Provenance
Year:
2013

Abstract

Data mining aims to extract useful information from large datasets. Most data mining approaches reduce the input data to produce a smaller output that summarizes the mining result. While the purpose of data mining (extracting information) necessitates this reduction in size, the loss of information it entails can be problematic. Specifically, the results of data mining may be more confusing than insightful if the user cannot tell which input data they are based on and how they were created. In this paper, we argue that the user needs access to the provenance of mining results. Provenance, while extensively studied by the database, workflow, and distributed systems communities, has not yet been considered for data mining. We analyze the differences between database, workflow, and data mining provenance, suggest new types of provenance, and identify new use cases for provenance in data mining. To illustrate our ideas, we discuss these concepts in more detail for two typical data mining algorithms: frequent itemset mining and multi-dimensional scaling.