Knowledge discovery by probabilistic clustering of distributed databases

Authors:
Sally McClean; Bryan Scotney; Philip Morrow; Kieran Greer
Affiliations:
School of Computing and Information Engineering, University of Ulster, Cromore Road, Coleraine BT52 1SA, Northern Ireland (all authors)
Venue:
Data & Knowledge Engineering
Year:
2005

Abstract

Clustering of distributed databases facilitates knowledge discovery through learning of new concepts that characterise common features and differences between datasets. Hence, general patterns can be learned rather than restricting learning to specific databases from which rules may not be generalisable. We cluster databases that hold aggregate count data on categorical attributes that have been classified according to homogeneous or heterogeneous classification schemes. Clustering of datasets is carried out via the probability distributions that describe their respective aggregates. The homogeneous case is straightforward. For heterogeneous data we investigate a number of clustering strategies, of which the most efficient avoid the need to compute a dynamic shared ontology to homogenise the classification schemes prior to clustering.
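
As a rough illustration of the homogeneous case, the sketch below clusters databases by the probability distributions derived from their aggregate counts. It is not the authors' algorithm: the Jensen-Shannon divergence, the greedy agglomerative merging, the `threshold` parameter, and all function names are assumptions chosen here for illustration. Each database is assumed to be a non-empty list of counts over the same ordered categories.

```python
import math


def to_distribution(counts):
    """Normalise a vector of aggregate counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two distributions.

    Assumed distance measure for this sketch; the abstract does not name one.
    """
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def cluster_databases(count_tables, threshold):
    """Greedily merge the two closest clusters until none is within threshold.

    A cluster is represented by its member indices and its pooled counts,
    so a merged cluster is described by the distribution of the combined data.
    """
    clusters = [([i], list(c)) for i, c in enumerate(count_tables)]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = js_divergence(
                    to_distribution(clusters[i][1]),
                    to_distribution(clusters[j][1]),
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break  # remaining clusters are too dissimilar to merge
        members = clusters[i][0] + clusters[j][0]
        pooled = [a + b for a, b in zip(clusters[i][1], clusters[j][1])]
        rest = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters = rest + [(members, pooled)]
    return [sorted(members) for members, _ in clusters]


# Hypothetical example: four databases holding counts over three shared categories.
tables = [[80, 15, 5], [78, 17, 5], [10, 30, 60], [12, 28, 60]]
groups = cluster_databases(tables, threshold=0.05)
```

Pooling counts when two clusters merge means larger databases weigh more in the merged distribution; averaging the distributions instead would weight every database equally. The heterogeneous case, where the classification schemes differ, is not covered by this sketch.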