Scientific workflows are often data intensive. The datasets obtained by enacting scientific workflows have several applications, e.g., they can be used to identify data correlations or to understand phenomena, and are therefore worth storing in repositories for future analyses. Our experience suggests that such datasets often contain duplicate records. Indeed, scientists tend to enact the same workflow multiple times using the same or overlapping input datasets, which gives rise to duplicates in workflow results. The presence of duplicates complicates the interpretation and analysis of workflow results. Moreover, it unnecessarily inflates the size of the datasets held in workflow results repositories. In this paper, we present an approach in which duplicate detection is guided by workflow provenance traces. The hypothesis that we explore and exploit is that the operations that compose a workflow are likely to produce the same (or overlapping) output dataset when given the same (or overlapping) input dataset. A preliminary analytical and empirical validation shows the effectiveness and applicability of the proposed method.
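To illustrate the idea, the hypothesis suggests using provenance to prune the duplicate-detection search space: two output records are worth comparing only if their provenance shows they were produced by the same operation from the same or overlapping input sets. The following is a minimal sketch of that pruning step, not the authors' implementation; the `provenance` trace, its record identifiers, and the `candidate_duplicates` helper are all hypothetical.

```python
from itertools import combinations

# Hypothetical provenance trace: output record id -> (operation name,
# set of input record ids from which the output was derived).
provenance = {
    "out1": ("align", frozenset({"in1", "in2"})),
    "out2": ("align", frozenset({"in1", "in2"})),   # same op, same inputs
    "out3": ("align", frozenset({"in2", "in3"})),   # same op, overlapping inputs
    "out4": ("filter", frozenset({"in1", "in2"})),  # different operation
}

def candidate_duplicates(trace):
    """Return pairs of output records worth comparing for duplication:
    those produced by the same operation from overlapping input sets,
    per the provenance hypothesis above."""
    pairs = []
    for (a, (op_a, in_a)), (b, (op_b, in_b)) in combinations(trace.items(), 2):
        if op_a == op_b and in_a & in_b:
            pairs.append((a, b))
    return pairs

print(candidate_duplicates(provenance))
# Only same-operation, overlapping-input pairs survive; out4 is never
# compared against the "align" outputs, shrinking the comparison space.
```

Only the surviving candidate pairs would then be handed to a (potentially expensive) record-matching step, which is where the scalability benefit of provenance guidance comes from.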