Scalable clustering of categorical data and applications

Authors:
Renee J. Miller;Periklis Andritsos
Affiliations:
University of Toronto (Canada);University of Toronto (Canada)
Venue:
Scalable clustering of categorical data and applications
Year:
2004

Abstract

Clustering is widely used to explore and understand large collections of data. In this thesis, we introduce LIMBO, a scalable hierarchical categorical clustering algorithm based on the Information Bottleneck (IB) framework for quantifying the relevant information preserved when clustering. As a hierarchical algorithm, LIMBO can produce clusterings of different sizes in a single execution. We also define a distance measure for categorical tuples and values of a specific attribute. Within this framework, we define a heuristic for discovering candidate values for the number of meaningful clusters.
Next, we consider the problem of database design, which has been characterized as a process of arriving at a design that minimizes redundancy. Redundancy is measured with respect to a prescribed model for the data (a set of constraints). We consider the problem of doing database redesign when the prescribed model is unknown or incomplete. Specifically, we consider the problem of finding structural clues in a data instance, which may contain errors, missing values, and duplicate records. We propose a set of tools based on LIMBO for finding structural summaries that are useful in characterizing the information content of the data. We study the use of these summaries in ranking functional dependencies based on their data redundancy.
We also consider a different application of LIMBO, that of clustering software artifacts. The majority of previous algorithms for this problem utilize structural information in order to decompose large software systems. Other approaches using non-structural information, such as file names or ownership information, have also demonstrated merit. We present an approach that combines structural and non-structural information in an integrated fashion. We apply LIMBO to two large software systems, and the results indicate that this approach produces valid and useful clusterings.
Finally, we present a set of weighting schemes that specify objective assignments of importance to the values of a data set. We use well-established weighting schemes from information retrieval, web search, and data clustering to assess the importance of whole attributes and individual values.
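
To make the Information Bottleneck idea in the abstract more concrete, the sketch below computes the information lost when two clusters of categorical tuples are merged. It uses the standard agglomerative IB merge cost (the merged cluster mass times a weighted Jensen-Shannon divergence between the clusters' value distributions). It is an illustration under stated assumptions, not the thesis's implementation: the function names (value_distribution, kl, merge_cost) are hypothetical, every tuple is assumed to carry equal weight, and each tuple is assumed to spread its probability uniformly over its attribute values.

```python
import math
from collections import Counter


def value_distribution(tuples):
    """Empirical distribution over (attribute index, value) pairs.

    Assumption: all tuples are weighted equally and each tuple spreads
    its mass uniformly over its attribute values.
    """
    counts = Counter((i, v) for t in tuples for i, v in enumerate(t))
    total = sum(counts.values())
    return {key: c / total for key, c in counts.items()}


def kl(p, q):
    """Kullback-Leibler divergence D(p || q) in bits.

    q must be positive wherever p is positive.
    """
    return sum(pv * math.log2(pv / q[k]) for k, pv in p.items() if pv > 0)


def merge_cost(cluster_a, cluster_b, n_total):
    """Information lost by merging two clusters (standard agglomerative IB cost).

    cost = (p(a) + p(b)) * JS_w(dist_a, dist_b), where the Jensen-Shannon
    weights w are the relative masses of the two clusters.
    """
    p_a = len(cluster_a) / n_total
    p_b = len(cluster_b) / n_total
    w_a = p_a / (p_a + p_b)
    w_b = p_b / (p_a + p_b)

    dist_a = value_distribution(cluster_a)
    dist_b = value_distribution(cluster_b)

    # Distribution of the merged cluster: the weighted mixture of both.
    keys = set(dist_a) | set(dist_b)
    mix = {k: w_a * dist_a.get(k, 0.0) + w_b * dist_b.get(k, 0.0) for k in keys}

    js = w_a * kl(dist_a, mix) + w_b * kl(dist_b, mix)
    return (p_a + p_b) * js


if __name__ == "__main__":
    similar = merge_cost([("red", "small")], [("red", "large")], n_total=3)
    dissimilar = merge_cost([("red", "small")], [("blue", "large")], n_total=3)
    print(similar, dissimilar)
```

A hierarchical procedure in this style would repeatedly merge the pair of clusters with the smallest merge cost, so tuples that share attribute values are grouped first and a single run yields clusterings of every size.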