Provenance as data mining: combining file system metadata with content analysis

Authors:
Vinay Deolalikar;Hernan Laffitte
Affiliations:
Storage and Information Management Platforms Lab, Hewlett Packard Labs, Palo Alto, CA;Storage and Information Management Platforms Lab, Hewlett Packard Labs, Palo Alto, CA
Venue:
TAPP'09 First workshop on Theory and practice of provenance
Year:
2009

Abstract

Provenance describes how an object came to be in its present state; that is, it describes the evolution of the object over time. Prior work on provenance has focused on databases and file systems, where the database or file system is augmented to capture additional information about the historical evolution of document collections and thus answer the provenance question. We address the question of provenance for unstructured information (i.e., document corpora from file systems) without any enhancements to the file system. In this setting, we model the provenance problem as a data mining problem. We show that data mining can provide provenance information for repositories of unstructured information, including chains of historical evolution. Our approach therefore requires no additions to the file system and can operate on legacy documents. Experimental results indicate strong performance for our approach.
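
To make the idea of combining file system metadata with content analysis concrete, the following is a minimal illustrative sketch and not the authors' algorithm, which the abstract does not specify. It assumes each document is available as a (name, modification time, text) tuple, uses plain term-frequency vectors with cosine similarity as the content signal, and treats the most similar earlier-modified document as a candidate ancestor. The function names and the 0.5 similarity threshold are hypothetical choices made for illustration.

```python
import math
import re
from collections import Counter


def tf_vector(text):
    """Term-frequency vector over lowercase alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def infer_parents(docs, threshold=0.5):
    """Guess a candidate ancestor for each document.

    docs: list of (name, mtime, text) tuples (an assumed input shape).
    Returns {name: parent_name or None}, where the parent is the most
    similar document with an earlier modification time, provided the
    similarity reaches the (hypothetical) threshold.
    """
    ordered = sorted(docs, key=lambda d: d[1])  # oldest first, by mtime
    vectors = {name: tf_vector(text) for name, _, text in ordered}
    parents = {}
    for i, (name, _, _) in enumerate(ordered):
        best, best_sim = None, threshold
        for earlier, _, _ in ordered[:i]:  # only older documents qualify
            sim = cosine(vectors[name], vectors[earlier])
            if sim >= best_sim:
                best, best_sim = earlier, sim
        parents[name] = best
    return parents
```

Following the parent links from a document back to a root yields one candidate chain of historical evolution. On a real file system, the modification time could be read with `os.stat(path).st_mtime`, so no changes to the file system itself are needed for this sketch.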