Ontology guided data linkage framework for discovering meaningful data facts
ADMA'11 Proceedings of the 7th international conference on Advanced Data Mining and Applications - Volume Part II
There has been a surge of interest in probabilistic techniques for discovering meaningful data facts across multiple datasets provided by different organizations. The key aim is to approximate the structure and content of the induced data in a concise synopsis from which meaningful data facts can be extracted. Performing sensible queries across unrelated datasets is a complex task that requires a complete understanding of each contributing database's schema in order to define the structure of its information. Alternative approaches based on enterprise data modeling tools have been proposed to let users query databases without detailed schema knowledge. Unfortunately, data modeling-based matching is a content-based technique and incurs significant query evaluation costs because of its attribute-level pairwise comparisons. We propose a multi-faceted classification technique for performing structural analysis on knowledge domain clusters, using a novel Ontology Guided Data Linkage (OGDL) framework. The framework supports self-organization of contributing databases through the discovery of structural dependencies, by exploiting ontological domain knowledge at multiple levels: tables, attributes and tuples. It thus automates the discovery of schema structures across unrelated databases, based on direct and weighted correlations between different ontological concepts, and uses an h-gram (hash gram) record matching technique for concept clustering and cluster mapping. We demonstrate the feasibility of our OGDL algorithms through accuracy, performance and scalability experiments on real-world datasets, and show that our system runs in polynomial time and performs well in practice.
To the best of our knowledge, this is the first attempt to solve data linkage problems using a multi-faceted cluster mapping strategy, and we believe that our approach is a significant advance towards accurate query answering and future real-time online semantic reasoning.
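
The abstract names an h-gram (hash gram) record matching technique but does not describe how it works. As an illustration only, the general idea of comparing values by hashed character n-grams can be sketched as follows. This is a minimal sketch under stated assumptions, not the OGDL algorithm: the function names (`hash_grams`, `jaccard`, `match_score`), the trigram size, the bucket count and the use of Jaccard similarity are all hypothetical choices made for this example.

```python
from __future__ import annotations

import hashlib


def hash_grams(value: str, n: int = 3, buckets: int = 1 << 16) -> set[int]:
    """Hash each character n-gram of a normalized value into a fixed bucket space.

    Assumption: lowercasing and whitespace collapsing stand in for whatever
    normalization a real record matcher would apply.
    """
    text = " ".join(value.lower().split())
    if len(text) < n:
        grams = [text]  # short values are treated as a single gram
    else:
        grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    # md5 is used only as a stable, non-cryptographic hash for bucketing.
    return {
        int.from_bytes(hashlib.md5(g.encode("utf-8")).digest()[:4], "big") % buckets
        for g in grams
    }


def jaccard(a: set[int], b: set[int]) -> float:
    """Jaccard similarity of two hashed-gram sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def match_score(left: str, right: str, n: int = 3) -> float:
    """Approximate similarity of two attribute values via their hashed n-grams."""
    return jaccard(hash_grams(left, n), hash_grams(right, n))


if __name__ == "__main__":
    # Hypothetical attribute values from two unrelated sources.
    print(match_score("John Smith", "Jon Smith"))
    print(match_score("John Smith", "Mary Jones"))
```

Comparing sets of integer hashes rather than raw strings keeps per-value storage bounded and makes the pairwise comparison a cheap set operation. Hash collisions can only raise a score, never lower it, which is why a sketch like this is suited to approximate rather than exact matching.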