Cube-space data mining

  • Authors: Raghu Ramakrishnan; Bee-Chung Chen

  • Affiliations: The University of Wisconsin - Madison; The University of Wisconsin - Madison

  • Venue: Cube-space data mining
  • Year: 2008

Abstract

In many data-mining projects, analysts face a huge number of choices in applying data-mining techniques. These choices include different ways to build data-mining models on differently selected, differently segmented, differently aggregated, and differently transformed datasets. Analysts' efforts to explore this huge space of mining choices are a major bottleneck in practice. Cube-space data mining is an emerging data-mining paradigm that addresses how to allow analysts to explore this huge space of choices in a principled manner that automates routine steps and enables them to consider large parts of the space with minimal effort. The basic idea is to let the analyst use various kinds of dimensions to structure the space of choices, and then let the mining system build data-mining models repeatedly and systematically over regions of varying granularities in the analyst-specified space (which we call cube space). Even building a single data-mining model on a large dataset is computationally expensive. Repeated construction of a large number of models, an intrinsic characteristic of cube-space data mining, poses great computational challenges. This dissertation demonstrates that cube-space data mining is useful for exploring the huge space of choices in mining, and shows that it can be done in a computationally feasible way. Specifically, we define and characterize this new data-mining paradigm and demonstrate its utility through three novel applications of the paradigm, namely prediction cubes, bellwether analysis, and privacy skylines. To meet the computational challenges, for each of these applications, we provide efficient and scalable algorithms by exploiting the structure of the space. Through our techniques and experiments, we show that efficient cube-space data mining on large datasets (which may not fit in memory) is achievable.
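
The basic idea of building models over regions of varying granularities can be illustrated with a minimal Python sketch. It is not one of the dissertation's algorithms: it is the naive enumeration whose cost those algorithms are meant to avoid. The toy rows, the dimension names (state, year), the target (sales), and the helper names fit_mean_model and build_cube_models are all hypothetical, and a per-region mean stands in for a real data-mining model.

```python
from collections import defaultdict
from itertools import combinations
from statistics import mean

# Hypothetical toy dataset: two dimensions (state, year) and a numeric target.
ROWS = [
    {"state": "WI", "year": 2006, "sales": 10.0},
    {"state": "WI", "year": 2007, "sales": 14.0},
    {"state": "CA", "year": 2006, "sales": 30.0},
    {"state": "CA", "year": 2007, "sales": 34.0},
]
DIMENSIONS = ("state", "year")


def fit_mean_model(rows, target):
    """Stand-in 'model' (an assumption): the mean of the target in the region."""
    return mean(r[target] for r in rows)


def build_cube_models(rows, dimensions, target):
    """Naively fit one model per region at every granularity.

    A granularity is a subset of the dimensions; every dimension left out
    of the subset is rolled up and shown as "*" in the region key.
    """
    models = {}
    for k in range(len(dimensions) + 1):
        for group_dims in combinations(dimensions, k):
            regions = defaultdict(list)
            for row in rows:
                key = tuple(row[d] if d in group_dims else "*" for d in dimensions)
                regions[key].append(row)
            for key, region_rows in regions.items():
                models[key] = fit_mean_model(region_rows, target)
    return models


models = build_cube_models(ROWS, DIMENSIONS, "sales")
for region, model in sorted(models.items(), key=lambda kv: str(kv[0])):
    print(region, model)
```

Even in this toy case, four rows and two flat dimensions yield nine regions and therefore nine models. With more dimensions, dimension hierarchies, and a model that is costly to train, the number of model constructions grows quickly, which is the computational challenge the abstract describes.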