Merging distributed database summaries

  • Authors:
  • Mounir Bechchi;Guillaume Raschia;Noureddine Mouaddib

  • Affiliations:
  • LINA-INRIA, Nantes, France;LINA-INRIA, Nantes, France;LINA-INRIA, Nantes, France

  • Venue:
  • Proceedings of the sixteenth ACM conference on Conference on information and knowledge management
  • Year:
  • 2007

Quantified Score

Hi-index 0.00

Visualization

Abstract

The database summarization system coined SaintEtiQ provides multi-resolution summaries of structured data stored into acentralized database. Summaries are computed online with a conceptual hierarchical clustering algorithm. However, most companies work in distributed legacy environments and consequently the current centralized version of SaintEtiQ is either not feasible (privacy preserving) or not desirable (resource limitations). To address this problem, we propose new algorithms to generate a single summary hierarchy given two distinct hierarchies, without scanning the raw data. The Greedy Merging Algorithm (GMA) takes all leaves of both hierarchies and generates the optimal partitioning for the considered data set with regards to a cost function (compactness and separation). Then, a hierarchical organization of summaries is built by agglomerating or dividing clusters such that the cost function may emphasize local or global patterns in the data. Thus, we obtain two different hierarchies according to the performed optimisation. However, this approach breaks down due to its exponential time complexity. Two alternative approaches with constant time complexity w.r.t. the number of data items, are proposed to tackle this problem. The first one, called Merge by Incorporation Algorithm (MIA), relies on the SaintEtiQ engine whereas the second approach, named Merge by Alignment Algorithm (MAA), consists in rearranging summaries by levels in a top-down manner. Then, we compare those approaches using an original quality measure in order to quantify how good our merged hierarchies are. Finally, an experimental study, using real data sets, shows that merging processes (MIA and MAA) are efficient in terms of computational time.