A knuckles-and-nodes approach to the integration of microbiological resource data
OTM'06 Proceedings of the 2006 international conference on On the Move to Meaningful Internet Systems: AWeSOMe, CAMS, COMINF, IS, KSinBIT, MIOS-CIAO, MONET - Volume Part I
The Internet has emerged as an ever-growing environment of multiple heterogeneous and autonomous data sources that contain relevant but overlapping information on microorganisms. Microbiologists could therefore benefit substantially from intelligent software agents that assist in navigating this information-rich environment, together with data mining tools that aid in the discovery of new information. These applications depend heavily on well-conditioned data samples that are correlated with multiple information sources; hence, accurate database merging operations are desirable. Information systems designed for joining the related knowledge provided by different microbial data sources are hampered by the labeling mechanism for referencing microbial strains and cultures, which suffers from syntactic variation in the practical usage of the labels; in addition, synonymy and homonymy are known to exist among the labels. The situation is further complicated by the observation that the label equivalence knowledge is itself recorded in fragments across several data sources, which can be suspected of providing information that is both incomplete and incorrect. This paper presents how the extraction and integration of label equivalence information from several distributed data sources has led to the construction of a so-called integrated strain database, which helps to resolve most of the above problems. Because information retrieved from autonomous resources might be overlapping, incomplete, and incorrect, much effort went into completing missing information, discovering new associations between information objects, and developing and applying tools for error detection and correction.
Through a thorough evaluation of the different levels of incompleteness and incorrectness encountered within the incorporated data sources, we demonstrate the added value of the integrated strain database as a necessary service provider for the seamless integration of microbial information sources.