Algorithms and software for collaborative discovery from autonomous, semantically heterogeneous, distributed information sources

Authors:
Doina Caragea;Jun Zhang;Jie Bao;Jyotishman Pathak;Vasant Honavar
Affiliations:
Artificial Intelligence Research Laboratory, Center for Computational Intelligence, Learning, and Discovery, Department of Computer Science, Iowa State University, Ames, IA (all authors)
Venue:
ALT'05 Proceedings of the 16th international conference on Algorithmic Learning Theory
Year:
2005

Abstract

Development of high-throughput data acquisition technologies, together with advances in computing and communications, has resulted in explosive growth in the number, size, and diversity of potentially useful information sources. This has created unprecedented opportunities for data-driven knowledge acquisition and decision-making in a number of emerging, increasingly data-rich application domains such as bioinformatics, environmental informatics, enterprise informatics, and social informatics (among others). However, the massive size, semantic heterogeneity, autonomy, and distributed nature of the data repositories present significant hurdles in acquiring useful knowledge from the available data. This paper introduces some of the algorithmic and statistical problems that arise in such a setting and describes algorithms for learning classifiers from distributed data that offer rigorous performance guarantees (relative to their centralized or batch counterparts). It also describes how this approach can be extended to work with autonomous, and hence inevitably semantically heterogeneous, data sources by making explicit the ontologies (attributes and relationships between attributes) associated with the data sources and reconciling the semantic differences among the data sources from a user’s point of view. This allows user- or context-dependent exploration of semantically heterogeneous data sources. The resulting algorithms have been implemented in INDUS, an open source software package for collaborative discovery from autonomous, semantically heterogeneous, distributed data sources.
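
The abstract does not spell out the algorithms, so the following is only an illustrative sketch of the general idea, not the paper's method and not the INDUS API. It assumes a classifier whose training needs nothing beyond class-conditional counts (as a naive Bayes learner does), and it assumes the user supplies a mapping from each source's vocabulary to the user's own terms. All data, names (`source_a`, `mapping_a`, `local_counts`, `combine`, `centralized_counts`), and vocabularies are hypothetical. Under those assumptions, counts gathered separately at each source and added together equal the counts obtained by pooling the translated data in one place.

```python
from collections import Counter

# Hypothetical per-source data: each row is (attribute_value, class_label).
# Each source describes the same attribute with its own vocabulary.
source_a = [("sunny", "yes"), ("rain", "no"), ("sunny", "yes")]
source_b = [("clear", "yes"), ("wet", "no"), ("wet", "yes")]

# Hypothetical user-supplied mappings from each source's terms to the user's terms.
mapping_a = {"sunny": "dry", "rain": "wet"}
mapping_b = {"clear": "dry", "wet": "wet"}


def local_counts(rows, mapping):
    """Answer a count query at one source, expressed in the user's vocabulary."""
    return Counter((mapping[value], label) for value, label in rows)


def combine(count_tables):
    """Add up per-source counts; only counts, not raw rows, are exchanged."""
    total = Counter()
    for table in count_tables:
        total.update(table)  # Counter.update adds counts rather than replacing them
    return total


def centralized_counts(sources_with_mappings):
    """Reference computation: pool all translated rows in one place, then count."""
    pooled = [(m[v], c) for rows, m in sources_with_mappings for v, c in rows]
    return Counter(pooled)


distributed = combine([local_counts(source_a, mapping_a),
                       local_counts(source_b, mapping_b)])
centralized = centralized_counts([(source_a, mapping_a), (source_b, mapping_b)])
```

Here `distributed` and `centralized` hold identical counts, which is the sense in which a distributed learner built on such counts can match its centralized counterpart. Changing the mappings changes the counts, which illustrates how the result can depend on the user's point of view.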