In this paper, we present a data mining approach to the matching and integration of heterogeneous datasets. In particular, we propose solutions to two problems that arise when combining results from different scientific studies. The first problem, attribute matching, involves discovering correspondences among distinct numeric-typed summary features ("attributes") used to characterize datasets that were collected and analyzed in different research labs. The second problem, cluster matching, involves discovering matchings between patterns across datasets. We treat the two problems jointly as a single multi-objective optimization problem and describe a multi-objective simulated annealing algorithm for finding the optimal solution. We demonstrate the utility of this approach in a series of experiments on synthetic and realistic datasets designed to simulate heterogeneous data from different sources.
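The abstract does not give the details of the annealing procedure, so the following is only a minimal sketch of one generic way to pose matching as multi-objective simulated annealing, not the authors' algorithm. It assumes a one-to-one attribute mapping encoded as a permutation, a swap-based neighborhood, a geometric cooling schedule, and a Pareto archive of non-dominated solutions. The function names (`dominates`, `anneal_matching`, `attr_cost`, `cluster_cost`) and both cost matrices are hypothetical and chosen purely for illustration.

```python
import math
import random


def dominates(a, b):
    """True if cost vector a is no worse than b in every objective and better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def anneal_matching(n, objectives, steps=5000, t0=1.0, cooling=0.999, seed=0):
    """Search permutations of range(n); return {permutation: costs} of non-dominated ones.

    perm[i] = j means attribute i of dataset A is matched to attribute j of dataset B.
    Assumed design (not from the source): swap moves, geometric cooling, Pareto archive.
    """
    rng = random.Random(seed)
    current = list(range(n))
    rng.shuffle(current)
    cur_cost = tuple(f(current) for f in objectives)
    archive = {tuple(current): cur_cost}
    temp = t0
    for _ in range(steps):
        # Neighbor: swap the targets of two randomly chosen attributes.
        i, j = rng.sample(range(n), 2)
        cand = current[:]
        cand[i], cand[j] = cand[j], cand[i]
        cand_cost = tuple(f(cand) for f in objectives)
        # Total worsening summed over objectives; zero means no objective got worse.
        delta = sum(max(0.0, c - p) for c, p in zip(cand_cost, cur_cost))
        if delta == 0 or rng.random() < math.exp(-delta / temp):
            current, cur_cost = cand, cand_cost
            # Keep the archive mutually non-dominated.
            if not any(dominates(v, cand_cost) for v in archive.values()):
                archive = {k: v for k, v in archive.items()
                           if not dominates(cand_cost, v)}
                archive[tuple(cand)] = cand_cost
        temp *= cooling
    return archive


# Hypothetical toy costs: entry [i][j] is the mismatch when attribute i maps to j.
COST_ATTR = [[0.0, 0.9, 0.8, 0.7],
             [0.9, 0.0, 0.6, 0.8],
             [0.8, 0.6, 0.0, 0.9],
             [0.7, 0.8, 0.9, 0.0]]
COST_CLUSTER = [[0.2, 0.1, 0.9, 0.8],
                [0.1, 0.2, 0.8, 0.9],
                [0.9, 0.8, 0.1, 0.3],
                [0.8, 0.9, 0.3, 0.1]]


def attr_cost(perm):
    """Stand-in for an attribute-matching objective."""
    return sum(COST_ATTR[i][perm[i]] for i in range(len(perm)))


def cluster_cost(perm):
    """Stand-in for a cluster-matching objective under the same mapping."""
    return sum(COST_CLUSTER[i][perm[i]] for i in range(len(perm)))


front = anneal_matching(4, [attr_cost, cluster_cost], steps=2000, seed=1)
for perm, costs in sorted(front.items()):
    print(perm, costs)
```

Because the two objectives can disagree, the sketch returns a set of trade-off mappings rather than a single answer. Choosing one mapping from that set (for example by weighting the objectives) is a separate design decision that the abstract does not address.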