Element matching across data-oriented XML sources using a multi-strategy clustering model

Authors:
Charnyote Pluempitiwiriyawej;Joachim Hammer
Affiliations:
Department of Computer Science, Mahidol University, Rama VI Rd., Bangkok 10400, Thailand;Department of Computer and Information Science and Engineering, University of Florida, Box 116120, 301 CSE Building, Gainesville, FL
Venue:
Data & Knowledge Engineering
Year:
2004

Abstract

We describe a family of heuristics-based clustering strategies to support the merging of XML data from multiple sources. As part of this research, we have developed a comprehensive classification of the schematic and semantic conflicts that can occur when reconciling related XML data from multiple sources. Because element clustering is compute-intensive, especially when comparing large numbers of data elements that exhibit great representational diversity, performance is a critical yet so far neglected aspect of the merging process. We have developed five heuristics for clustering data in a multi-dimensional metric space. Equivalence of data elements within the individual clusters is determined using several distance functions that calculate the semantic distances among the elements.

The research described in this article is conducted within the context of the Integration Wizard (IWIZ) project at the University of Florida. IWIZ enables users to access and retrieve information from multiple XML-based sources through a consistent, integrated view. The results of our qualitative analysis of the clustering heuristics validate the feasibility of our approach and its superior performance compared with other similarity search techniques.
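
To make the idea of grouping elements by semantic distance concrete, the following is a minimal sketch and not one of the five heuristics from the article. It assumes each XML element is reduced to a (tag name, text value) pair, combines two illustrative distance functions with an arbitrary weight, and applies a simple greedy leader clustering. The function names, the weight, and the threshold are all hypothetical, and the combined distance is not guaranteed to satisfy the metric axioms.

```python
from difflib import SequenceMatcher


def name_distance(a: str, b: str) -> float:
    """Distance in [0, 1] between two tag names (0 means identical, case-insensitive)."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def value_distance(a: str, b: str) -> float:
    """Jaccard distance in [0, 1] between the word sets of two text values."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)


def element_distance(x, y, w_name=0.5):
    """Weighted mix of name and value distance; the weight is an assumption."""
    return w_name * name_distance(x[0], y[0]) + (1 - w_name) * value_distance(x[1], y[1])


def leader_cluster(elements, threshold=0.4):
    """Greedy leader clustering: an element joins the first cluster whose
    leader (first member) lies within the threshold, else starts a new cluster."""
    clusters = []
    for e in elements:
        for c in clusters:
            if element_distance(e, c[0]) <= threshold:
                c.append(e)
                break
        else:
            clusters.append([e])
    return clusters


# Hypothetical elements drawn from two sources with different conventions.
elements = [
    ("author", "Joachim Hammer"),
    ("Author", "joachim hammer"),
    ("title", "Element matching"),
]
clusters = leader_cluster(elements)
```

Comparing each element only against cluster leaders, rather than against every other element, is one simple way to reduce the number of distance computations, which is the kind of cost the abstract identifies as critical.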