Discovering Conditional Functional Dependencies

Authors:
Wenfei Fan; Floris Geerts; Jianzhong Li; Ming Xiong
Affiliations:
University of Edinburgh, Edinburgh; University of Edinburgh, Edinburgh; Harbin Institute of Technology, Harbin; Bell Laboratories, Murray Hill
Venue:
IEEE Transactions on Knowledge and Data Engineering
Year:
2011

Abstract

This paper investigates the discovery of conditional functional dependencies (CFDs). CFDs are a recent extension of functional dependencies (FDs) that supports patterns of semantically related constants, and they can be used as rules for cleaning relational data. However, finding quality CFDs is an expensive process that involves intensive manual effort. To identify data cleaning rules effectively, we develop techniques for discovering CFDs from relations. The discovery problem, already hard for traditional FDs, is more difficult for CFDs, because mining the patterns in CFDs introduces new challenges. We provide three methods for CFD discovery. The first, referred to as CFDMiner, is based on techniques for mining closed item sets and is used to discover constant CFDs, namely, CFDs with constant patterns only. Constant CFDs are particularly important for object identification, which is essential to data cleaning and data integration. The other two algorithms are developed for discovering general CFDs. One, referred to as CTANE, is a levelwise algorithm that extends TANE, a well-known algorithm for mining FDs. The other, referred to as FastCFD, is based on the depth-first approach used in FastFD, a method for discovering FDs, and it leverages closed-item-set mining to reduce the search space. As verified by our experimental study, CFDMiner can be multiple orders of magnitude faster than CTANE and FastCFD for constant CFD discovery. CTANE works well when a given relation is large, but it does not scale well with the arity of the relation. FastCFD is far more efficient than CTANE when the arity of the relation is large; better still, thanks to an optimization based on closed-item-set mining, FastCFD also scales well with the size of the relation. Together, these algorithms provide a set of cleaning-rule discovery tools from which users can choose for different applications.
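
To make the distinction between constant and general CFDs concrete, the following is a minimal sketch of checking a relation against a single CFD. It is not CFDMiner, CTANE, or FastCFD, and it does no discovery. The relation `ROWS`, the attribute names `CC`, `AC`, and `city`, the wildcard symbol `"_"`, and the helper `violations` are all assumptions made up for illustration. A CFD is modeled as an embedded FD (`lhs` determines `rhs`) plus a pattern that assigns each attribute either a constant or the wildcard.

```python
from collections import defaultdict

WILDCARD = "_"  # assumed symbol for "any value" in a pattern

# Hypothetical relation: country code, area code, city.
ROWS = [
    {"CC": "44", "AC": "131", "city": "EDI"},
    {"CC": "44", "AC": "131", "city": "EDI"},
    {"CC": "44", "AC": "131", "city": "LDN"},
    {"CC": "01", "AC": "908", "city": "MH"},
    {"CC": "01", "AC": "908", "city": "NYC"},
]


def matches(value, pattern_value):
    """A value matches a pattern entry if the entry is the wildcard or equal."""
    return pattern_value == WILDCARD or value == pattern_value


def violations(rows, lhs, rhs, pattern):
    """Return indices of rows that violate the CFD (lhs -> rhs, pattern).

    Simplification assumed here: for a constant right-hand side, only rows
    whose rhs value differs from the constant are reported.
    """
    # Only rows matching the pattern on the left-hand side are constrained.
    matching = [
        i for i, row in enumerate(rows)
        if all(matches(row[a], pattern[a]) for a in lhs)
    ]

    # Constant right-hand side: each matching row is checked on its own.
    if pattern[rhs] != WILDCARD:
        return [i for i in matching if rows[i][rhs] != pattern[rhs]]

    # Wildcard right-hand side: rows agreeing on lhs must agree on rhs.
    rhs_values = defaultdict(set)
    for i in matching:
        key = tuple(rows[i][a] for a in lhs)
        rhs_values[key].add(rows[i][rhs])
    return [
        i for i in matching
        if len(rhs_values[tuple(rows[i][a] for a in lhs)]) > 1
    ]


# Constant CFD: all pattern entries are constants.
constant_cfd = {"CC": "44", "AC": "131", "city": "EDI"}
# General CFD: constants mixed with wildcards.
general_cfd = {"CC": "01", "AC": WILDCARD, "city": WILDCARD}
# Plain FD: the special case where every pattern entry is a wildcard.
plain_fd = {"CC": WILDCARD, "AC": WILDCARD, "city": WILDCARD}

print(violations(ROWS, ["CC", "AC"], "city", constant_cfd))  # [2]
print(violations(ROWS, ["CC", "AC"], "city", general_cfd))   # [3, 4]
print(violations(ROWS, ["CC", "AC"], "city", plain_fd))      # [0, 1, 2, 3, 4]
```

In this sketch a constant CFD can be checked one row at a time, whereas a pattern with a wildcard on the right-hand side requires comparing rows that agree on the left-hand side. Discovery algorithms such as those described above search for the patterns and attribute sets themselves, which this example takes as given.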