Rapid Identification of Column Heterogeneity

Authors:
Bing Tian Dai; Nick Koudas; Beng Chin Ooi; Divesh Srivastava; Suresh Venkatasubramanian
Affiliations:
National Univ. of Singapore, Singapore; University of Toronto; National Univ. of Singapore, Singapore; AT&T Labs-Research; AT&T Labs-Research
Venue:
ICDM '06 Proceedings of the Sixth International Conference on Data Mining
Year:
2006

Cited by:
Type-based categorization of relational attributes. Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology
Information theory for data management. Proceedings of the VLDB Endowment
Information theory for data management. Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Schema-as-you-go: on probabilistic tagging and querying of wide tables. Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Improving the quality of predictions using textual information in online user reviews. Information Systems

Abstract

Data quality is a serious concern in every data management application, and a variety of quality measures, e.g., accuracy, freshness and completeness, have been proposed to capture common sources of data quality degradation. We identify and focus attention on a novel measure, column heterogeneity, that seeks to quantify the data quality problems that can arise when merging data from different sources. We identify desiderata that a column heterogeneity measure should intuitively satisfy, and describe a technique that quantifies database column heterogeneity using a novel combination of cluster entropy and soft clustering. Finally, we present detailed experimental results, using diverse data sets of different types, to demonstrate that our approach provides a robust mechanism for identifying and quantifying database column heterogeneity.
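
The abstract does not spell out how cluster entropy and soft clustering are combined, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes the soft clustering has already been computed and is supplied as a membership matrix (one row per column value, one probability per cluster), and it assumes "cluster entropy" means the entropy of the resulting cluster marginal. The function name cluster_entropy and its parameters are hypothetical.

```python
import math

def cluster_entropy(soft_assignments, weights=None):
    """Entropy in bits of the cluster marginal p(t) = sum_x p(x) * p(t | x).

    soft_assignments: list of rows; row i holds p(t | x_i) for every cluster t.
    weights: optional p(x_i) for each value; uniform if omitted (an assumption).
    """
    n = len(soft_assignments)
    if weights is None:
        weights = [1.0 / n] * n
    k = len(soft_assignments[0])

    # Accumulate the probability mass that lands in each cluster.
    marginal = [0.0] * k
    for w, row in zip(weights, soft_assignments):
        for t, p in enumerate(row):
            marginal[t] += w * p

    # Shannon entropy, skipping empty clusters (0 * log 0 is taken as 0).
    return -sum(p * math.log2(p) for p in marginal if p > 0)

# Illustrative inputs (made up): every value in one cluster vs. an even split.
homogeneous = [[1.0, 0.0]] * 4
mixed = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
print(cluster_entropy(homogeneous), cluster_entropy(mixed))
```

Under these assumptions, a column whose values all fall into one cluster scores near zero, while a column split evenly across two clusters scores one bit, which matches the intuition that a higher value signals a more heterogeneous column.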