Statistical modeling of large distribution sets

  • Authors: Yasuko Matsubara; Yasushi Sakurai; Masatoshi Yoshikawa
  • Affiliations: Kyoto University; NTT Communication Science Labs; Kyoto University
  • Venue: Proceedings of the Fourth SIGMOD PhD Workshop on Innovative Database Research
  • Year: 2010

Abstract

This paper addresses a ubiquitous problem in data management: hierarchical model estimation for large distribution sets. The problem arises in many applications, including classification, top-k query processing, clustering, and outlier detection. Our aim is to continuously and incrementally estimate the model parameters of 'typical' distributions that describe the characteristics of a database. Our approach handles arbitrary types of data (e.g., categorical and numerical data) incrementally, quickly, and with little resource consumption. Beyond incremental algorithms for model fitting, we propose a modeling framework in which the learning approach recognizes hierarchical groups of distributions with similar characteristics and updates the model parameters of each group separately, without scanning all the distributions in the database. It can therefore return the parameters of typical distribution models at an arbitrary level of granularity, at any time. We also demonstrate the utility of our approach by applying it to two specific problems that arise in data management.
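
The abstract does not spell out the algorithms themselves, so the following is only a minimal sketch of the general idea of per-group incremental parameter updates, not the authors' method. It assumes numerical data, a single Gaussian per group, and a fixed two-level hierarchy (groups under one root); it swaps in Welford's online update and the standard pairwise merge formula for mean and variance. The names `RunningGaussian`, `groups`, `observe`, and `coarse_summary` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RunningGaussian:
    """Running count, mean, and sum of squared deviations (Welford's method)."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0

    def update(self, x: float) -> None:
        # Fold one new value into the summary in O(1), without rescanning old data.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other: "RunningGaussian") -> "RunningGaussian":
        # Combine two summaries into the summary of their union.
        n = self.n + other.n
        if n == 0:
            return RunningGaussian()
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return RunningGaussian(n, mean, m2)

    @property
    def variance(self) -> float:
        # Population variance; 0.0 for an empty summary.
        return self.m2 / self.n if self.n else 0.0


# Hypothetical mapping from group id to that group's summary (assumption:
# group membership is already known when a value arrives).
groups: dict[str, RunningGaussian] = {}


def observe(group_id: str, x: float) -> None:
    # Only the affected group's parameters are touched.
    groups.setdefault(group_id, RunningGaussian()).update(x)


def coarse_summary() -> RunningGaussian:
    # Coarser granularity: merge the group summaries instead of rescanning data.
    total = RunningGaussian()
    for summary in groups.values():
        total = total.merge(summary)
    return total


for gid, value in [("a", 1.0), ("a", 3.0), ("b", 10.0), ("b", 14.0)]:
    observe(gid, value)
```

In this sketch, reading `groups["a"]` gives a fine-grained answer and `coarse_summary()` gives a coarse one at any point in the stream. Discovering the hierarchical groups, which the abstract describes as part of the framework, is not modeled here.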