SemGen: towards a semantic data generator for benchmarking duplicate detectors

Authors:
Wolfgang Gottesheim;Stefan Mitsch;Werner Retschitzegger;Wieland Schwinger;Norbert Baumgartner
Affiliations:
Johannes Kepler University Linz, Linz, Austria;Johannes Kepler University Linz, Linz, Austria;Johannes Kepler University Linz, Linz, Austria;Johannes Kepler University Linz, Linz, Austria;team Communication Technology Mgt. Ltd., Vienna, Austria
Venue:
DASFAA'11 Proceedings of the 16th international conference on Database systems for advanced applications
Year:
2011

Abstract

Benchmarking the quality of duplicate detection methods requires comprehensive knowledge of the duplicate pairs in a test data set, as well as test data of sufficient size and variability. Extending real-world data sets with artificially created data is a promising way to meet these requirements. Current approaches to such synthetic data generation, however, work solely on a quantitative level: duplicate semantics are represented only implicitly, so the variability of the generated data cannot be configured sufficiently. In this paper we propose SemGen, a semantics-driven approach to synthetic data generation. SemGen first diversifies real-world objects on a qualitative level and then, in a second step, generates quantitative values. To demonstrate the applicability of SemGen, we show how to define duplicate semantics for the domain of road traffic management. A discussion of lessons learned concludes the paper.
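
The two-step idea described in the abstract (qualitative diversification first, quantitative values second) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the SemGen implementation: the abstract does not say how qualitative relations are modeled, so the sketch assumes a road object is a one-dimensional interval (for example, a traffic jam between two kilometre markers) and uses three Allen-style interval relations as the qualitative vocabulary. All names (`RELATIONS`, `pick_relation`, `instantiate`, `holds`, `generate_duplicates`) and the weights are hypothetical.

```python
import random

# Assumed qualitative vocabulary: how the original interval relates to its duplicate.
# A small, illustrative subset of Allen-style interval relations.
RELATIONS = ("equals", "overlaps", "during")


def pick_relation(rng, weights):
    """Step 1 (qualitative): choose a relation according to configurable weights."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


def instantiate(rng, original, relation):
    """Step 2 (quantitative): draw concrete values that satisfy the chosen relation."""
    start, end = original
    length = end - start  # assumed to be > 0
    if relation == "equals":
        return (start, end)
    if relation == "overlaps":
        # Original starts first; duplicate starts inside it and ends after it.
        return (start + rng.uniform(0.1, 0.9) * length,
                end + rng.uniform(0.1, 0.9) * length)
    if relation == "during":
        # Duplicate lies strictly inside the original.
        return (start + rng.uniform(0.1, 0.4) * length,
                end - rng.uniform(0.1, 0.4) * length)
    raise ValueError(f"unknown relation: {relation}")


def holds(original, duplicate, relation):
    """Check that a generated duplicate really satisfies its qualitative relation."""
    (s, e), (ds, de) = original, duplicate
    if relation == "equals":
        return ds == s and de == e
    if relation == "overlaps":
        return s < ds < e < de
    if relation == "during":
        return s < ds < de < e
    return False


def generate_duplicates(original, n, weights, seed=None):
    """Return n (relation, interval) pairs, so the ground truth is known by construction."""
    rng = random.Random(seed)
    result = []
    for _ in range(n):
        relation = pick_relation(rng, weights)
        result.append((relation, instantiate(rng, original, relation)))
    return result


if __name__ == "__main__":
    jam = (10.0, 14.0)  # hypothetical traffic jam from km 10 to km 14
    for relation, interval in generate_duplicates(
            jam, 5, {"equals": 1, "overlaps": 2, "during": 1}, seed=7):
        print(relation, interval, holds(jam, interval, relation))
```

Because each duplicate is stored together with the relation that produced it, the generated set carries explicit knowledge of its duplicate pairs, and the weights give a direct handle on variability. Both properties are the ones the abstract asks of benchmark data; the specific relations and value ranges here are placeholders.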