Evaluation of a Hybrid Approach for Efficient Provenance Storage

Authors:
Yulai Xie; Kiran-Kumar Muniswamy-Reddy; Dan Feng; Yan Li; Darrell D. E. Long
Affiliations:
Huazhong University of Science and Technology; Harvard University; Huazhong University of Science and Technology; University of California, Santa Cruz; University of California, Santa Cruz
Venue:
ACM Transactions on Storage (TOS)
Year:
2013

Abstract

Provenance is metadata that describes the history of objects. It enables new functionality in areas such as experimental documentation, debugging, search, and security, and a number of groups have therefore built systems to capture it. Most of these systems focus on provenance collection, and a few focus on building applications that use provenance, but all of them ignore an important aspect: efficient long-term storage of provenance. In this article, we first analyze the provenance collected from multiple workloads and characterize its properties with respect to long-term storage. We then propose a hybrid scheme that exploits both the graph structure of provenance data and its inherent duplication. Our evaluation indicates that this hybrid scheme, a combination of Web graph compression (adapted for provenance) and dictionary encoding, provides the best trade-off among compression ratio, compression time, and query performance when compared to other compression schemes.
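
The two ingredients named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: it only shows the general ideas of dictionary encoding (storing each repeated string once and referring to it by index) and of the gap encoding commonly used in Web graph compression (storing a sorted list of node IDs as differences between neighbours). All function names and the sample data are hypothetical.

```python
from typing import Dict, List, Tuple


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Replace each string with an index into a table of distinct strings."""
    table: List[str] = []          # distinct strings, in first-seen order
    index: Dict[str, int] = {}     # string -> position in table
    codes: List[int] = []
    for value in values:
        if value not in index:
            index[value] = len(table)
            table.append(value)
        codes.append(index[value])
    return table, codes


def dictionary_decode(table: List[str], codes: List[int]) -> List[str]:
    """Inverse of dictionary_encode."""
    return [table[code] for code in codes]


def gap_encode(ancestor_ids: List[int]) -> List[int]:
    """Store a set of node IDs as the first ID followed by successive gaps."""
    ordered = sorted(set(ancestor_ids))
    return [b - a for a, b in zip([0] + ordered, ordered)]


def gap_decode(gaps: List[int]) -> List[int]:
    """Inverse of gap_encode: a running sum restores the sorted IDs."""
    ids: List[int] = []
    total = 0
    for gap in gaps:
        total += gap
        ids.append(total)
    return ids
```

With hypothetical attribute values such as `["/bin/gcc", "/usr/lib/libc.so", "/bin/gcc"]`, `dictionary_encode` keeps each path once and emits small integer codes. For a node whose ancestors have IDs 3, 5, and 9, `gap_encode` stores 3, 2, 4. Small gaps are cheap to represent when paired with a variable-length integer code, which this sketch omits.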