Merging distributed database summaries

Authors:
Mounir Bechchi;Guillaume Raschia;Noureddine Mouaddib
Affiliations:
LINA-INRIA, Nantes, France;LINA-INRIA, Nantes, France;LINA-INRIA, Nantes, France
Venue:
Proceedings of the sixteenth ACM conference on information and knowledge management
Year:
2007

Abstract

The SaintEtiQ database summarization system provides multi-resolution summaries of structured data stored in a centralized database. Summaries are computed online with a conceptual hierarchical clustering algorithm. However, most companies work in distributed legacy environments, so the current centralized version of SaintEtiQ is either not feasible (privacy preservation) or not desirable (resource limitations). To address this problem, we propose new algorithms that generate a single summary hierarchy from two distinct hierarchies without scanning the raw data. The Greedy Merging Algorithm (GMA) takes all leaves of both hierarchies and generates the optimal partitioning of the considered data set with regard to a cost function (compactness and separation). A hierarchical organization of summaries is then built by agglomerating or dividing clusters, such that the cost function may emphasize local or global patterns in the data. We thus obtain two different hierarchies, depending on the optimization performed. However, this approach breaks down due to its exponential time complexity. Two alternative approaches with constant time complexity w.r.t. the number of data items are proposed to tackle this problem. The first, called the Merge by Incorporation Algorithm (MIA), relies on the SaintEtiQ engine, whereas the second, named the Merge by Alignment Algorithm (MAA), rearranges summaries level by level in a top-down manner. We then compare these approaches using an original quality measure that quantifies how good the merged hierarchies are. Finally, an experimental study on real data sets shows that the merging processes (MIA and MAA) are efficient in terms of computational time.
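The idea of merging two hierarchies from their leaves alone, without rescanning the raw data, can be illustrated with a minimal sketch. This is not the authors' GMA, MIA, or MAA: the abstract does not give the cost function or the summary representation, so everything below is an assumption. Each leaf is modelled as a hypothetical (centroid, count) pair over a single numeric attribute, and the hypothetical function `merge_leaves` repeatedly fuses the two closest clusters, a simple polynomial-time stand-in for a compactness criterion.

```python
from itertools import combinations


def merge_leaves(leaves_a, leaves_b, k):
    """Greedily merge leaf summaries from two hierarchies into k clusters.

    Each leaf is an assumed (centroid, count) pair; raw records are never read.
    Illustrative only: the paper's actual cost function is not reproduced here.
    """
    clusters = list(leaves_a) + list(leaves_b)
    while len(clusters) > k:
        # Assumed merge criterion: fuse the pair with the closest centroids.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: abs(clusters[p[0]][0] - clusters[p[1]][0]),
        )
        (ci, ni), (cj, nj) = clusters[i], clusters[j]
        # The merged centroid is the count-weighted mean of the two centroids.
        merged = ((ci * ni + cj * nj) / (ni + nj), ni + nj)
        # Pop the larger index first so the smaller one stays valid.
        clusters.pop(j)
        clusters.pop(i)
        clusters.append(merged)
    return sorted(clusters)


site_a = [(1.0, 2), (10.0, 1)]   # leaves of the first summary hierarchy
site_b = [(2.0, 2), (11.0, 3)]   # leaves of the second summary hierarchy
print(merge_leaves(site_a, site_b, k=2))
```

Because only centroids and counts are exchanged, the two sites never have to share individual records, which is the property the abstract motivates with privacy and resource constraints.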