Establishing value mappings using statistical models and user feedback

Authors:
Jaewoo Kang;Tae Sik Han;Dongwon Lee;Prasenjit Mitra
Affiliations:
North Carolina State University, Raleigh, NC;North Carolina State University, Raleigh, NC;Pennsylvania State University, University Park, PA;Pennsylvania State University, University Park, PA
Venue:
Proceedings of the 14th ACM international conference on Information and knowledge management
Year:
2005

Citing 18
Cited 3

The merge/purge problem for large databases

SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
Integration of heterogeneous databases without common domains using queries based on textual similarity

SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
Adaptive detection of approximately duplicate database records and the database integration approach to information discovery

Adaptive detection of approximately duplicate database records and the database integration approach to information discovery
Efficient clustering of high-dimensional data sets with application to reference matching

Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining
Reconciling schemas of disparate data sources: a machine-learning approach

SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
Clio: a semi-automatic tool for schema mapping

SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
Using Schema Matching to Simplify Heterogeneous Data Translation

VLDB '98 Proceedings of the 24rd International Conference on Very Large Data Bases
Generic Schema Matching with Cupid

Proceedings of the 27th International Conference on Very Large Data Bases
Comparison of Schema Matching Evaluations

Revised Papers from the NODe 2002 Web and Database-Related Workshops on Web, Web-Services, and Database Systems
A survey of approaches to automatic schema matching

The VLDB Journal — The International Journal on Very Large Data Bases
On schema matching with opaque column names and data values

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Robust and efficient fuzzy match for online data cleaning

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Similarity Flooding: A Versatile Graph Matching Algorithm and Its Application to Schema Matching

ICDE '02 Proceedings of the 18th International Conference on Data Engineering
Adaptive duplicate detection using learnable string similarity measures

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
iMAP: discovering complex semantic matches between database schemas

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Information-theoretic tools for mining database structure from large data sets

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
A Mathematical Theory of Communication

A Mathematical Theory of Communication
Eliminating fuzzy duplicates in data warehouses

VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases

System support for exploration and expert feedback in resolving conflicts during integration of metadata

The VLDB Journal — The International Journal on Very Large Data Bases
Identifying value mappings for data integration: an unsupervised approach

WISE'05 Proceedings of the 6th international conference on Web Information Systems Engineering
Schema matching and embedded value mapping for databases with opaque column names and mixed continuous and discrete-valued data fields

ACM Transactions on Database Systems (TODS)

Quantified Score

Hi-index	0.00

Visualization

Abstract

In this paper, we present a "value mapping" algorithm that does not rely on syntactic similarity or semantic interpretation of the values. The algorithm first constructs a statistical model (e.g., co-occurrence frequency or entropy vector) that captures the unique characteristics of values and their co-occurrence. It then finds the matching values by computing the distances between the models while refining the models using user feedback through iterations. Our experimental results suggest that our approach successfully establishes value mappings even in the presence of opaque data values and thus can be a useful addition to the existing data integration techniques.