Comparable dependencies over heterogeneous data

  • Authors:
  • Shaoxu Song;Lei Chen;Philip S. Yu

  • Affiliations:
  • Key Laboratory for Information System Security, Ministry of Education/ TNList/ School of Software, Tsinghua University, Beijing, China;Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong;Department of Computer Science, University of Illinois at Chicago, Chicago, USA and Computer Science Department, King Abdulaziz University, Jeddah, Saudi Arabia

  • Venue:
  • The VLDB Journal — The International Journal on Very Large Data Bases
  • Year:
  • 2013

Abstract

To study data dependencies over heterogeneous data in dataspaces, we define a general form of dependency, namely comparable dependencies (CDs), which specify constraints on comparable attributes. CDs cover the semantics of a broad class of dependencies in databases, including functional dependencies (FDs), metric functional dependencies (MFDs), and matching dependencies (MDs). As we illustrate, comparable dependencies are useful in practical dataspace applications such as semantic query optimization. Because data in dataspaces are heterogeneous, the first question, known as the validation problem, is to determine whether a dependency (almost) holds in a data instance. Unfortunately, as we prove, the validation problem with a given error or confidence guarantee is hard in general. In fact, the confidence validation problem is also NP-hard to approximate to within any constant factor. Nevertheless, we develop several approaches for efficient approximate computation, including greedy and randomized approaches with an approximation bound in terms of the maximum number of violations that an object may introduce. Finally, an extensive experimental evaluation on real data verifies the superiority of our methods.
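The validation idea in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the threshold-based comparison, the attribute names `zip` and `price`, the sample rows, and the helper names `similar`, `violations`, and `greedy_removal` are all assumptions made for illustration. The sketch treats two objects as violating a dependency when they are similar on the left-hand attribute but not on the right-hand one, and then greedily removes the object involved in the most remaining violations to estimate how far the instance is from satisfying the dependency.

```python
from itertools import combinations


def similar(a, b, max_dist):
    """Hypothetical comparison operator: values match if within max_dist."""
    return abs(a - b) <= max_dist


def violations(rows, lhs, rhs, lhs_dist, rhs_dist):
    """Index pairs that are similar on lhs but not similar on rhs."""
    return [
        (i, j)
        for i, j in combinations(range(len(rows)), 2)
        if similar(rows[i][lhs], rows[j][lhs], lhs_dist)
        and not similar(rows[i][rhs], rows[j][rhs], rhs_dist)
    ]


def greedy_removal(rows, lhs, rhs, lhs_dist, rhs_dist):
    """Repeatedly drop the row that takes part in the most violations."""
    pairs = set(violations(rows, lhs, rhs, lhs_dist, rhs_dist))
    removed = []
    while pairs:
        counts = {}
        for i, j in pairs:
            counts[i] = counts.get(i, 0) + 1
            counts[j] = counts.get(j, 0) + 1
        # Ties are broken by the smallest row index for determinism.
        worst = max(sorted(counts), key=counts.get)
        removed.append(worst)
        pairs = {p for p in pairs if worst not in p}
    return removed


# Illustrative data only; not taken from the paper.
rows = [
    {"zip": 10, "price": 100},
    {"zip": 10, "price": 102},
    {"zip": 11, "price": 300},
    {"zip": 50, "price": 500},
]
removed = greedy_removal(rows, "zip", "price", lhs_dist=1, rhs_dist=5)
error_estimate = len(removed) / len(rows)  # fraction of rows dropped
```

The greedy choice gives only an upper estimate of the minimum number of objects to remove; the abstract states that the exact problem is hard in general, so an approximation of this kind trades optimality for speed.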