Towards enhancing centroid classifier for text classification: A border-instance approach

Authors:
Deqing Wang;Junjie Wu;Hui Zhang;Ke Xu;Mengxiang Lin
Affiliations:
State Key Laboratory of Software Development Environment, School of Computer Science and Engineering, Beihang University, Beijing 100191, China;Beijing Key Laboratory of Emergency Support Simulation Technologies for City Operations, School of Economics and Management, Beihang University, Beijing 100191, China;State Key Laboratory of Software Development Environment, School of Computer Science and Engineering, Beihang University, Beijing 100191, China;State Key Laboratory of Software Development Environment, School of Computer Science and Engineering, Beihang University, Beijing 100191, China;State Key Laboratory of Software Development Environment, School of Computer Science and Engineering, Beihang University, Beijing 100191, China
Venue:
Neurocomputing
Year:
2013

Abstract

Text classification (or categorization, TC) is the task of assigning new unlabeled natural-language documents to predefined thematic categories. The centroid-based classifier (CC) has been widely used for TC because of its simplicity and efficiency. However, it has long been criticized for its relatively low classification accuracy compared with state-of-the-art classifiers such as support vector machines (SVMs). In this paper, we find that constructing the centroid vectors of CC from border instances only, rather than from all instances, can yield higher generalization accuracy. Along this line, we propose the Border-Instance-based Iteratively Adjusted Centroid Classifier (IACC_BI), which relies on border instances found by some routine, e.g., the 1-Nearest-and-1-Furthest-Neighbors strategy, to construct the centroid vectors for CC. IACC_BI then iteratively adjusts these initial centroid vectors according to the misclassified training instances. Our extensive experiments on 11 real-world text corpora demonstrate that IACC_BI greatly improves the performance of centroid-based classifiers and achieves classification accuracy competitive with the well-known SVMs, at significantly lower computational cost.
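
The pipeline described in the abstract (select border instances, build centroids from them, then adjust the centroids using misclassified training instances) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not define the 1-Nearest-and-1-Furthest-Neighbors rule or the adjustment formula, so both are assumptions here: `select_border` marks each instance's nearest other-class instance and furthest same-class instance, and `fit` pulls the true-class centroid toward a misclassified instance while pushing the wrongly predicted centroid away. All function names and the `n_iter` and `rate` parameters are hypothetical, and cosine similarity is assumed as the similarity measure.

```python
import math


def unit(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def select_border(X, y):
    """ASSUMED border rule, not taken from the paper: every instance marks its
    nearest other-class instance and its furthest same-class instance.
    X must hold unit-length vectors so that dot() is cosine similarity."""
    border = set()
    for i, xi in enumerate(X):
        other = [j for j in range(len(X)) if y[j] != y[i]]
        same = [j for j in range(len(X)) if y[j] == y[i] and j != i]
        if other:
            border.add(max(other, key=lambda j: dot(xi, X[j])))
        if same:
            border.add(min(same, key=lambda j: dot(xi, X[j])))
    return border


def classify(x, centroids):
    """Return the label whose centroid is most cosine-similar to x."""
    x = unit(x)
    return max(centroids, key=lambda c: dot(x, unit(centroids[c])))


def fit(X, y, n_iter=10, rate=0.1):
    """Build centroids from border instances, then adjust them iteratively."""
    X = [unit(x) for x in X]
    border = select_border(X, y)
    centroids = {}
    for c in set(y):
        members = [i for i in range(len(X)) if y[i] == c]
        # Fall back to all class members if the class has no border instance.
        chosen = [i for i in members if i in border] or members
        centroids[c] = [sum(col) for col in zip(*(X[i] for i in chosen))]
    for _ in range(n_iter):
        errors = 0
        for xi, yi in zip(X, y):
            pred = classify(xi, centroids)
            if pred != yi:
                errors += 1
                # ASSUMED update: pull the true centroid toward the instance,
                # push the wrongly predicted centroid away from it.
                centroids[yi] = [a + rate * b for a, b in zip(centroids[yi], xi)]
                centroids[pred] = [a - rate * b for a, b in zip(centroids[pred], xi)]
        if errors == 0:
            break
    return centroids
```

To try it, call `fit(X, y)` with a list of term-weight vectors `X` and a list of labels `y`, then pass a new document vector to `classify(x, centroids)`. The pairwise similarity search in `select_border` is quadratic in the number of training instances, so the sketch is only practical on small data.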