Approximate matching of textual domain attributes for information source integration

Authors:
Andreas Koeller; Vinay Keelara
Affiliations:
Montclair State University, Montclair, NJ; Montclair State University, Montclair, NJ
Venue:
Proceedings of the 2nd international workshop on Information quality in information systems
Year:
2005

Abstract

A key problem in the integration of information sources is the identification of related attributes or objects across independent sources. Inferring such meta-information from source data (rather than from a priori available meta-data, such as attribute names) is sometimes possible. For example, existing algorithms attempt to integrate information sources by finding patterns such as Inclusion Dependencies (INDs) across them. However, INDs are based on exact set inclusion and are thus very strict patterns that rarely hold across independent real-world databases. We propose two error-tolerant measures, termed Similarity Score and Distribution Score, that help identify related attributes across two independent databases, based on similarities in their data. These measures specifically address the problem of identifying semantic relationships between textual attributes of databases that have few or no equal values. We also present implementations of these measures and some experimental results.
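
To illustrate why exact set inclusion is so strict for textual attributes, here is a minimal Python sketch. It is not the paper's Similarity Score or Distribution Score, which the abstract does not define. The function names, the sample city values, and the use of `difflib.SequenceMatcher` as the string comparison are all assumptions made for illustration only.

```python
from difflib import SequenceMatcher


def holds_ind(a_values, b_values):
    """Exact inclusion dependency A <= B: every value of A also occurs in B."""
    return set(a_values) <= set(b_values)


def approx_containment(a_values, b_values, threshold=0.75):
    """Hypothetical error-tolerant variant (an assumption, not the paper's measure).

    Returns the fraction of distinct A values that have at least one B value
    whose case-insensitive SequenceMatcher ratio reaches the threshold.
    """
    a_set, b_set = set(a_values), set(b_values)
    if not a_set:
        return 1.0  # an empty attribute is trivially contained
    matched = sum(
        1
        for a in a_set
        if any(
            SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold
            for b in b_set
        )
    )
    return matched / len(a_set)


# Made-up sample data: the same cities, written differently in two sources.
cities_a = ["New York", "Montclair", "Newark"]
cities_b = ["New York City", "Montclair, NJ", "Newark", "Trenton"]

print(holds_ind(cities_a, cities_b))           # exact IND check
print(approx_containment(cities_a, cities_b))  # tolerant containment fraction
```

In this made-up data the exact check fails because only one value matches character for character, while the tolerant variant still reports that the two attributes largely overlap. The threshold is a tuning choice: a stricter value rejects more of the near matches.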