Efficient sampling of training set in large and noisy multimedia data

Authors:
Surong Wang;Manoranjan Dash;Liang-Tien Chia;Min Xu
Affiliations:
Nanyang Technological University, Nanyang Avenue, Singapore;Nanyang Technological University, Nanyang Avenue, Singapore;Nanyang Technological University, Nanyang Avenue, Singapore;Nanyang Technological University, Nanyang Avenue, Singapore
Venue:
ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP)
Year:
2007

Abstract

As the amount of multimedia data increases day by day, thanks to less expensive storage devices and a growing number of information sources, machine learning algorithms are faced with large and noisy datasets. Fortunately, the use of a good sample for training influences the final results significantly. However, a simple random sample (SRS) may not give satisfactory results, because its blind approach to selecting samples means it may not adequately represent a large and noisy dataset. The difficulty is particularly apparent for huge datasets where, due to memory constraints, only very small sample sizes are used. This is typically the case for multimedia applications, where data size is usually very large. In this article we propose a new and efficient method for sampling large and noisy multimedia data. The proposed method is based on a simple distance measure that compares the histograms of the sample set and the whole set in order to estimate the representativeness of the sample. The proposed method handles noise in an elegant manner, which SRS and other methods are not able to do. We experiment on image and audio datasets. Comparison with SRS and other methods shows that the proposed method is vastly superior in terms of sample representativeness, particularly for small sample sizes, although in running time it is comparable to SRS, the least expensive method.
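The idea of scoring a sample by comparing its histogram with that of the whole set can be illustrated with a minimal sketch. The abstract does not give the exact distance measure or the selection procedure, so everything below is an assumption made for illustration: the function names (`histogram`, `histogram_distance`, `best_of_k_sample`), the use of equal-width bins on a single numeric feature, the Euclidean distance between normalized histograms, and the simple "draw k random samples and keep the closest" strategy. It is not the authors' algorithm.

```python
import random


def histogram(values, bins, lo, hi):
    """Normalized histogram of `values` over `bins` equal-width bins on [lo, hi]."""
    width = (hi - lo) / bins if hi > lo else 1.0  # guard against a constant feature
    counts = [0] * bins
    for v in values:
        idx = int((v - lo) / width)
        idx = max(0, min(idx, bins - 1))  # clamp so that v == hi falls in the last bin
        counts[idx] += 1
    n = len(values)
    return [c / n for c in counts]


def histogram_distance(sample, population, bins=10):
    """Euclidean distance between the normalized histograms of sample and population.

    Assumption: a smaller distance is read as a more representative sample.
    """
    lo, hi = min(population), max(population)
    hs = histogram(sample, bins, lo, hi)
    hp = histogram(population, bins, lo, hi)
    return sum((a - b) ** 2 for a, b in zip(hs, hp)) ** 0.5


def best_of_k_sample(population, size, k=20, bins=10, seed=0):
    """Hypothetical selection strategy: draw k simple random samples and
    keep the one whose histogram is closest to the population histogram."""
    rng = random.Random(seed)
    best, best_d = None, float("inf")
    for _ in range(k):
        candidate = rng.sample(population, size)
        d = histogram_distance(candidate, population, bins)
        if d < best_d:
            best, best_d = candidate, d
    return best, best_d


if __name__ == "__main__":
    data_rng = random.Random(42)
    # Synthetic one-dimensional feature standing in for, e.g., an image or audio descriptor.
    population = [data_rng.gauss(0.0, 1.0) for _ in range(10_000)]
    sample, dist = best_of_k_sample(population, size=100)
    print(len(sample), dist)
```

In this sketch, a single simple random sample corresponds to `k=1`; raising `k` trades extra passes over the candidate samples for a histogram that tracks the full dataset more closely. Real multimedia features are multi-dimensional, so a practical version would need one histogram per feature or a joint binning scheme, which the abstract does not specify.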