Large-scale text categorization by batch mode active learning

Authors:
Steven C. H. Hoi;Rong Jin;Michael R. Lyu
Affiliations:
The Chinese University of Hong Kong;Michigan State University, East Lansing, MI;The Chinese University of Hong Kong
Venue:
Proceedings of the 15th international conference on World Wide Web
Year:
2006

Abstract

Large-scale text categorization is an important research topic in Web data mining. One of its challenges is reducing the human effort needed to label text documents for building reliable classification models. Many previous studies have applied active learning to automatic text categorization, aiming to select the most informative documents for manual labeling. Most of these studies select a single unlabeled document in each iteration, so the text categorization model has to be retrained after each label is solicited. In this paper, we present a novel active learning algorithm that selects a batch of text documents for manual labeling in each iteration. The key to batch mode active learning is reducing the redundancy among the selected examples so that each example provides unique information for updating the model. To this end, we use the Fisher information matrix as the measure of model uncertainty and choose the set of documents that effectively maximizes the Fisher information of the classification model. Extensive experiments on three datasets show that our algorithm is more effective than state-of-the-art active learning techniques for text categorization and can be a promising tool for large-scale text categorization of World Wide Web documents.
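
The idea of choosing a batch by its Fisher information, rather than by per-document uncertainty alone, can be illustrated with a small sketch. The abstract does not state the model, the exact objective, or the optimizer, so everything below is an assumption made for illustration: a binary logistic regression model with weights `w`, a greedy search, and a score of tr(I_s^{-1} I_p), where I_p is the Fisher information of the whole pool and I_s that of the selected batch. The function names `logistic_variance` and `greedy_batch` and the `ridge` parameter are hypothetical.

```python
import numpy as np

def logistic_variance(X, w):
    """Per-document p(1 - p) under a binary logistic model (assumed model)."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return p * (1.0 - p)

def greedy_batch(X_pool, w, batch_size, ridge=1e-3):
    """Greedily pick documents that minimize trace(I_sel^{-1} I_pool).

    Illustrative sketch only: the greedy search and the ridge term are
    assumptions, not taken from the abstract. Assumes batch_size <= len(X_pool).
    """
    n, d = X_pool.shape
    s = logistic_variance(X_pool, w)
    # Fisher information of the whole unlabeled pool (averaged over documents).
    I_pool = (X_pool * s[:, None]).T @ X_pool / n
    # Fisher information of the batch so far; the ridge keeps it invertible.
    I_sel = ridge * np.eye(d)
    selected = []
    for _ in range(batch_size):
        best, best_score = None, np.inf
        for i in range(n):
            if i in selected:
                continue
            cand = I_sel + s[i] * np.outer(X_pool[i], X_pool[i])
            # trace(cand^{-1} I_pool), computed without an explicit inverse.
            score = np.trace(np.linalg.solve(cand, I_pool))
            if score < best_score:
                best, best_score = i, score
        selected.append(best)
        I_sel = I_sel + s[best] * np.outer(X_pool[best], X_pool[best])
    return selected

# Toy pool: documents 0 and 1 are identical, document 2 is different.
X_pool = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
batch = greedy_batch(X_pool, w=np.zeros(2), batch_size=2)
print(batch)  # [0, 2]
```

In this toy pool all three documents are equally uncertain under `w = 0`, so a purely uncertainty-based selector has no reason to avoid the duplicate pair. The batch criterion skips the second copy because it adds no information in a new direction, which is the redundancy reduction the abstract describes.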