Knowledge Accumulation and Resolution of Data Inconsistencies during the Integration of Microbial Information Sources

Authors:
Peter Dawyndt;Marc Vancanneyt;Hans De Meyer;Jean Swings
Venue:
IEEE Transactions on Knowledge and Data Engineering
Year:
2005

Abstract

The Internet has grown into an ever-expanding environment of heterogeneous, autonomous data sources that contain relevant but overlapping information on microorganisms. Microbiologists could therefore benefit considerably from intelligent software agents that help them navigate this information-rich environment, and from data mining tools that aid the discovery of new information. Such applications depend heavily on well-conditioned data samples that are correlated across multiple information sources, so accurate database merging operations are desirable. Information systems designed to join the related knowledge provided by different microbial data sources are hampered by the labeling mechanism used to reference microbial strains and cultures: the labels show syntactic variation in practical use, and synonymy and homonymy are also known to exist among them. The situation is further complicated because the label equivalence knowledge is itself recorded in fragments across several data sources, each of which may provide incomplete or incorrect information. This paper describes how extracting and integrating label equivalence information from several distributed data sources led to the construction of a so-called integrated strain database, which helps to resolve most of these problems. Because information retrieved from autonomous resources may be overlapping, incomplete, and incorrect, much effort went into completing missing information, discovering new associations between information objects, and developing and applying tools for error detection and correction. Through a thorough evaluation of the levels of incompleteness and incorrectness encountered in the incorporated data sources, we demonstrate the added value of the integrated strain database as a necessary service provider for the seamless integration of microbial information sources.
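
To make the idea of integrating label equivalence information concrete, the following is a minimal Python sketch, not the method used in the paper. It assumes that strain labels consist of a collection acronym followed by a number, that syntactic variants differ only in case, spacing, and punctuation, and that each source contributes pairs of labels it claims are equivalent. The names `normalize_label` and `LabelGroups`, the normalization rule, and all labels in the example are hypothetical and invented for illustration.

```python
import re
from collections import defaultdict


def normalize_label(label: str) -> str:
    """Map syntactic variants of a strain label to one canonical spelling.

    Assumed rule (hypothetical): acronym, optional separator, number.
    Labels that do not fit are only upper-cased and whitespace-collapsed.
    """
    match = re.match(
        r"^\s*([A-Za-z]+)[\s\-_.:]*([0-9]+[A-Za-z0-9]*)\s*$", label
    )
    if match is None:
        return " ".join(label.upper().split())
    acronym, number = match.groups()
    return f"{acronym.upper()} {number.upper()}"


class LabelGroups:
    """Union-find over normalized labels; merges pairs reported as equivalent."""

    def __init__(self):
        self._parent = {}

    def _find(self, label):
        # Register unseen labels as their own group, then walk to the root.
        self._parent.setdefault(label, label)
        root = label
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every visited label directly at the root.
        while self._parent[label] != root:
            self._parent[label], label = root, self._parent[label]
        return root

    def add_equivalence(self, a, b):
        """Record that one source claims labels a and b denote the same strain."""
        root_a = self._find(normalize_label(a))
        root_b = self._find(normalize_label(b))
        if root_a != root_b:
            self._parent[root_b] = root_a

    def groups(self):
        """Return the groups of equivalent labels, sorted for stable output."""
        clusters = defaultdict(set)
        for label in list(self._parent):
            clusters[self._find(label)].add(label)
        return sorted(sorted(cluster) for cluster in clusters.values())


# Invented labels; the equivalences below are not real catalogue data.
groups = LabelGroups()
groups.add_equivalence("LMG 1234", "ATCC 5678")  # hypothetical source A
groups.add_equivalence("atcc-5678", "DSM 42")    # hypothetical source B
groups.add_equivalence("NCIMB 9", "CIP 100")     # hypothetical source C
print(groups.groups())
```

In this sketch the two fragments from sources A and B end up in one group because `atcc-5678` normalizes to the same spelling as `ATCC 5678`. A plain transitive merge like this also propagates any incorrect equivalence a source reports and cannot tell homonyms apart, which is why the error detection and correction described in the abstract would still be needed on top of it.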