Efficient sampling of training set in large and noisy multimedia data

  • Authors:
  • Surong Wang;Manoranjan Dash;Liang-Tien Chia;Min Xu

  • Affiliations:
  • Nanyang Technological University, Nanyang Avenue, Singapore;Nanyang Technological University, Nanyang Avenue, Singapore;Nanyang Technological University, Nanyang Avenue, Singapore;Nanyang Technological University, Nanyang Avenue, Singapore

  • Venue:
  • ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP)
  • Year:
  • 2007

Abstract

As the amount of multimedia data increases day by day thanks to less expensive storage devices and a growing number of information sources, machine learning algorithms are faced with large and noisy datasets. Fortunately, a well-chosen training sample influences the final results significantly. A simple random sample (SRS), however, may not yield satisfactory results, because its blind approach to selecting samples may fail to adequately represent a large and noisy dataset. The difficulty is particularly apparent for huge datasets where, due to memory constraints, only very small sample sizes can be used; this is typically the case for multimedia applications, where data size is usually very large. In this article we propose a new and efficient method for sampling large and noisy multimedia data. The proposed method is based on a simple distance measure that compares the histograms of the sample set and the whole set in order to estimate the representativeness of the sample. The proposed method handles noise in an elegant manner, which SRS and other methods are unable to do. We experiment on image and audio datasets. Comparison with SRS and other methods shows that the proposed method is vastly superior in terms of sample representativeness, particularly for small sample sizes, while remaining comparable in running time to SRS, the least expensive method in terms of time.
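To illustrate the general idea of histogram-based sample evaluation described in the abstract, the following is a minimal Python sketch. It is not the authors' algorithm: the L1 distance between normalized per-feature histograms, the bin count, and the best-of-k random search used here are illustrative assumptions standing in for the paper's actual distance measure and selection procedure.

```python
import numpy as np

def histogram_distance(full, sample, bins=32):
    """Sum of L1 distances between normalized per-feature histograms of
    the full dataset and a candidate sample (illustrative measure only)."""
    dist = 0.0
    for j in range(full.shape[1]):
        lo, hi = full[:, j].min(), full[:, j].max()
        h_full, _ = np.histogram(full[:, j], bins=bins, range=(lo, hi))
        h_samp, _ = np.histogram(sample[:, j], bins=bins, range=(lo, hi))
        # Normalize counts so histograms of different sizes are comparable.
        h_full = h_full / h_full.sum()
        h_samp = h_samp / max(h_samp.sum(), 1)
        dist += np.abs(h_full - h_samp).sum()
    return dist

def best_of_k_sample(data, sample_size, k=20, seed=0):
    """Draw k simple random samples and keep the one whose histogram is
    closest to the full dataset's (a stand-in for the paper's search)."""
    rng = np.random.default_rng(seed)
    best_idx, best_dist = None, np.inf
    for _ in range(k):
        idx = rng.choice(len(data), size=sample_size, replace=False)
        d = histogram_distance(data, data[idx])
        if d < best_dist:
            best_idx, best_dist = idx, d
    return best_idx, best_dist

# Usage: 10,000 noisy 2-D points, keep a representative sample of 100.
data = np.vstack([np.random.randn(9500, 2), 5 * np.random.randn(500, 2)])
idx, dist = best_of_k_sample(data, sample_size=100)
print(f"selected {len(idx)} points, histogram distance = {dist:.4f}")
```

A sample with a small histogram distance matches the distribution of the whole set more closely than an arbitrary random draw, which is the property the abstract refers to as representativeness, especially important when memory constraints force very small sample sizes.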