Statistical learning for high-dimensional data understanding

  • Authors:
  • King-Shy Goh;Edward Y. Chang

  • Venue:
  • Statistical learning for high-dimensional data understanding
  • Year:
  • 2004

Abstract

In many emerging applications, databases must handle massive quantities of data described by a wide variety of feature dimensions. As a result, traditional statistical methods, which work well with a small quantity of data described by simple features, are not always effective for these new applications. In this dissertation, we explore the challenges that a large volume of high-dimensional data poses for a visual-data retrieval system. To facilitate better data organization, we can assign labels to the data by mapping low-level features (such as color and texture) to high-level semantics (such as “animal” and “landscape”). We present an annotation system that first uses Support Vector Machines (SVMs) to predict class-label memberships, and then employs confidence factors to ascertain the correctness of the class prediction. The system makes dynamic adjustments to accommodate new semantics, to assist in the discovery of useful low-level features, and to improve class-prediction accuracy. To retrieve the data that a user wants, the retrieval system needs to learn the user's query concept. We propose a concept-dependent active learning scheme for this learning task. The scheme uses multimodal information (low-level features and semantic labels) to model concept complexity, and then makes intelligent adjustments to the sampling strategy to improve concept learnability. With improved learnability, retrieval precision can be increased. To support rapid searches through large volumes of data, we propose a clustering/classification approach to index the high-dimensional data. We first cluster similar data and store each cluster sequentially on disk. We then model the similarity search as a classification problem: similar objects are much more likely to be found in the clusters into which the query instance has been classified.
This approach supports not only relevance feedback, which alters the feature weighting, but also non-metric similarity measures. More specifically, our indexer uses the dynamic partial function (DPF), which quantifies perceptual similarity between data instances significantly better than Minkowski-like functions do. Through extensive empirical studies on several large, high-dimensional image datasets, we show that our proposed approaches perform data retrieval more effectively and efficiently than traditional methods.
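
The abstract names DPF without defining it. As a rough illustration, the sketch below assumes the commonly cited formulation, in which only the m smallest per-feature differences between two instances enter a Minkowski-style sum. The function names, the parameters m and r, and the toy vectors are illustrative assumptions, not taken from the dissertation.

```python
def minkowski(x, y, r=2.0):
    """Minkowski distance of order r over all feature dimensions."""
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1.0 / r)


def dpf(x, y, m, r=2.0):
    """Dynamic partial function (assumed formulation).

    Only the m smallest per-feature differences are aggregated, so the
    set of dimensions that contribute changes from pair to pair.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of features")
    if not 1 <= m <= len(x):
        raise ValueError("m must be between 1 and the number of features")
    diffs = sorted(abs(a - b) for a, b in zip(x, y))
    return sum(d ** r for d in diffs[:m]) ** (1.0 / r)


# Two toy feature vectors that agree closely except in one dimension.
x = [0.0, 0.0, 0.0, 0.0]
y = [1.0, 1.0, 1.0, 10.0]

full = minkowski(x, y)        # dominated by the single outlying feature
partial = dpf(x, y, m=3)      # ignores the most dissimilar feature

# Because the contributing dimensions vary per pair, the triangle
# inequality can fail, which is why the measure is non-metric.
a, b, c = [0.0, 0.0], [0.0, 1.0], [1.0, 1.0]
violates_triangle = dpf(a, c, m=1) > dpf(a, b, m=1) + dpf(b, c, m=1)
```

With m equal to the number of features, this formulation reduces to the ordinary Minkowski distance. Smaller values of m discount the most dissimilar features, and the last lines show how that can break the triangle inequality, so an index built on a measure like this cannot rely on metric-space pruning.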