Effect of term distributions on centroid-based text categorization

Authors:
Verayuth Lertnattee;Thanaruk Theeramunkong
Affiliations:
Information Technology Program, Sirindhorn International Institute of Technology, Bangkadi Campus, 131 Moo 5 Tiwanont Road, Bangkadi, Muang, Pathumthani 12000, Thailand;Information Technology Program, Sirindhorn International Institute of Technology, Bangkadi Campus, 131 Moo 5 Tiwanont Road, Bangkadi, Muang, Pathumthani 12000, Thailand
Venue:
Information Sciences—Informatics and Computer Science: An International Journal - Special issue: Informatics and computer science intelligent systems applications
Year:
2004

Citing 24
Cited 12

Term-weighting approaches in automatic text retrieval

Information Processing and Management: an International Journal
Automatic text processing: the transformation, analysis, and retrieval of information by computer

Automatic text processing: the transformation, analysis, and retrieval of information by computer
Models for retrieval with probabilistic indexing

Information Processing and Management: an International Journal - Modeling data, information and knowledge
Automated learning of decision rules for text categorization

ACM Transactions on Information Systems (TOIS)
An example-based mapping method for text categorization and retrieval

ACM Transactions on Information Systems (TOIS)
Improving text retrieval for the routing problem using latent semantic indexing

SIGIR '94 Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval
Pivoted document length normalization

SIGIR '96 Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval
Feature selection, perceptron learning, and a usability case study for text categorization

Proceedings of the 20th annual international ACM SIGIR conference on Research and development in information retrieval
Automatic essay grading using text categorization techniques

Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
Boosting and Rocchio applied to text filtering

Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
Learning to extract symbolic knowledge from the World Wide Web

AAAI '98/IAAI '98 Proceedings of the fifteenth national/tenth conference on Artificial intelligence/Innovative applications of artificial intelligence
Learning to classify text from labeled and unlabeled documents

AAAI '98/IAAI '98 Proceedings of the fifteenth national/tenth conference on Artificial intelligence/Innovative applications of artificial intelligence
A re-examination of text categorization methods

Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
Text Classification from Labeled and Unlabeled Documents using EM

Machine Learning - Special issue on information retrieval
An Evaluation of Statistical Approaches to Text Categorization

Information Retrieval
Machine learning in automated text categorization

ACM Computing Surveys (CSUR)
Text Categorization with Suport Vector Machines: Learning with Many Relevant Features

ECML '98 Proceedings of the 10th European Conference on Machine Learning
A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization

ICML '97 Proceedings of the Fourteenth International Conference on Machine Learning
Improving Text Classification by Shrinkage in a Hierarchy of Classes

ICML '98 Proceedings of the Fifteenth International Conference on Machine Learning
A Linear Text Classification Algorithm Based on Category Relevance Factors

ICADL '02 Proceedings of the 5th International Conference on Asian Digital Libraries: Digital Libraries: People, Knowledge, and Technology
A Comparative Study on Statistical Machine Learning Algorithms and Thresholding Strategies for Automatic Text Categorization

PRICAI '02 Proceedings of the 7th Pacific Rim International Conference on Artificial Intelligence: Trends in Artificial Intelligence
Support Vector Machines Based on a Semantic Kernel for Text Categorization

IJCNN '00 Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN'00)-Volume 5 - Volume 5
Length Normalization in Degraded Text Collections

Length Normalization in Degraded Text Collections
Supervised term weighting for automated text categorization

Proceedings of the 2003 ACM symposium on Applied computing

Classifying web documents in a hierarchy of categories: a comprehensive study

Journal of Intelligent Information Systems
Semi-supervised single-label text categorization using centroid-based classifiers

Proceedings of the 2007 ACM symposium on Applied computing
Effects of Term Distributions on Binary Classification

IEICE - Transactions on Information and Systems
A class-feature-centroid classifier for text categorization

Proceedings of the 18th international conference on World wide web
Automatic text categorization based on content analysis with cognitive situation models

Information Sciences: an International Journal
Text classification for Thai medicinal web pages

PAKDD'07 Proceedings of the 11th Pacific-Asia conference on Advances in knowledge discovery and data mining
Experiments on kernel tree support vector machines for text categorization

PAKDD'07 Proceedings of the 11th Pacific-Asia conference on Advances in knowledge discovery and data mining
Parallel text categorization for multi-dimensional data

PDCAT'04 Proceedings of the 5th international conference on Parallel and Distributed Computing: applications and Technologies
Towards enhancing centroid classifier for text classification-A border-instance approach

Neurocomputing
A Survey of Approaches to Web Service Discovery in Service-Oriented Architectures

Journal of Database Management
Class-indexing-based term weighting for automatic text classification

Information Sciences: an International Journal
EasySOC: Making web service outsourcing easier

Information Sciences: an International Journal

Quantified Score

Hi-index	0.00

Visualization

Abstract

Most of traditional text categorization approaches utilize term frequency (tf) and inverse document frequency (idf) for representing importance of words and/or terms in classifying a text document. This paper describes an approach to apply term distributions, in addition to tf and idf, to improve performance of centroid-based text categorization. Three types of term distributions, called inter-class, intra-class and in-collection distributions, are introduced. These distributions are useful to increase classification accuracy by exploiting information of (1) term distribution among classes, (2) term distribution within a class and (3) term distribution in the whole collection of training data. In addition, this paper investigates how these term distributions contribute to weight each term in documents, e.g., a high term distribution of a word promotes or demotes importance or classification power of that word. To this end, several centroid-based classifiers are constructed with different term weightings. Using various data sets, their performances are investigated and compared to a standard centroid-based classifier (TDIDF) and a centroid-based classifier modified with information gain. Moreover, we also compare them to two well-known methods: k-NN and naïve Bayes. In addition to a unigram model of document representation, a bigram model is also explored. Finally, the effectiveness of term distributions to improve classification accuracy is explored with regard to the training set size and the number of classes.