A refinement approach to handling model misfit in text categorization

Authors:
Haoran Wu;Tong Heng Phang;Bing Liu;Xiaoli Li
Affiliations:
National University of Singapore, Singapore;National University of Singapore, Singapore;National University of Singapore, Singapore;National University of Singapore, Singapore
Venue:
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2002

Abstract

Text categorization, or text classification, is the automatic assignment of text documents to pre-defined classes based on their contents. The problem has been studied in information retrieval, machine learning, and data mining, and many effective techniques have been proposed. Most of these techniques, however, rest on an underlying model or set of assumptions. When the data fits the model well, classification accuracy is high; when it does not, accuracy can be very low. In this paper, we propose a refinement approach to this problem of model misfit. We show that the classification technique itself (or its underlying model) does not need to be changed to make it more flexible. Instead, we use successive refinements of classification on the training data to correct the misfit. We apply the approach to two simple and efficient text classifiers, the Rocchio classifier and the naïve Bayesian classifier. Both are suitable for very large text collections because they allow the data to reside on disk and need only one scan of the data to build a classifier. Extensive experiments on two benchmark document corpora show that the proposed approach dramatically improves the text categorization accuracy of both classifiers. In particular, the refined model improves the prediction performance of the naïve Bayesian or Rocchio classifier by 45% on average.
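
The abstract does not spell out the refinement procedure, so the following Python sketch is only one possible reading of "successive refinements of classification on the training data", not the authors' algorithm. It assumes that the base classifier is first trained on all training documents, that the training documents are then grouped by the label the classifier predicts for them, and that any group still containing more than one true class gets its own classifier trained on that group alone. A plain cosine-similarity centroid classifier stands in for Rocchio here; the function names (`train_base`, `train_refined`, `predict_refined`), the `max_depth` cutoff, and the toy data are all hypothetical.

```python
import math

def centroid(vectors):
    """Average a list of sparse term-weight vectors (dicts)."""
    total = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(vectors) for term, weight in total.items()}

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def train_base(docs, labels):
    """Base model (stand-in for Rocchio): one centroid per class."""
    by_class = {}
    for doc, label in zip(docs, labels):
        by_class.setdefault(label, []).append(doc)
    return {label: centroid(vecs) for label, vecs in by_class.items()}

def predict_base(model, doc):
    """Assign the class whose centroid is most similar to the document."""
    return max(model, key=lambda label: cosine(model[label], doc))

def train_refined(docs, labels, max_depth=3):
    """Assumed refinement: retrain inside each predicted group that is still mixed."""
    model = train_base(docs, labels)
    node = {"model": model, "children": {}}
    if max_depth == 0:
        return node
    predicted = [predict_base(model, doc) for doc in docs]
    for group in set(predicted):
        idx = [i for i, p in enumerate(predicted) if p == group]
        group_labels = [labels[i] for i in idx]
        # Refine only groups that contain errors and are smaller than the parent set.
        if len(set(group_labels)) > 1 and len(idx) < len(docs):
            node["children"][group] = train_refined(
                [docs[i] for i in idx], group_labels, max_depth - 1
            )
    return node

def predict_refined(node, doc):
    """Follow the predicted label down the tree until no refinement remains."""
    label = predict_base(node["model"], doc)
    child = node["children"].get(label)
    return predict_refined(child, doc) if child else label

# Toy data (hypothetical): class "P" has two clusters, so a single centroid misfits it.
docs = [{"a": 5}, {"a": 5, "b": 1}, {"a": 1, "b": 5}, {"a": 3, "b": 3}, {"a": 2, "b": 4}]
labels = ["P", "P", "P", "N", "N"]
base_model = train_base(docs, labels)
refined_model = train_refined(docs, labels)
```

On this toy set the single-centroid model assigns the third document to class "N", while the refined model recovers all five training labels. That only illustrates the mechanism on training data and says nothing about the gains reported in the paper. The base learner is left untouched throughout, which matches the abstract's point that the underlying model does not need to change.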