Centroid-based categorization is one of the most popular algorithms in text classification. In this approach, normalization is an important factor in classifier performance when documents in the collection differ widely in size and/or the numbers of documents per class are unbalanced. In the past, most researchers have applied document normalization, e.g., document-length normalization, while some have considered a simple kind of class normalization, so-called class-length normalization, to address the class-imbalance problem. However, no in-depth study has clarified how these normalizations affect classification performance or whether other useful normalizations exist. The purpose of this paper is threefold: (1) to investigate the effectiveness of document- and class-length normalizations on several data sets, (2) to evaluate a number of commonly used normalization functions, and (3) to introduce a new type of class normalization, called term-length normalization, which exploits the term distribution among documents in the class. The experimental results show that a classifier using the weight-merge-normalize approach (class-length normalization) performs better than one using the weight-normalize-merge approach (document-length normalization) on data sets with unbalanced numbers of documents per class, and is quite competitive on those with balanced numbers of documents. Among normalization functions, normalization based on term weighting performs better than the others on average. Term-length normalization is useful for improving classification accuracy. The combination of term- and class-length normalizations outperforms pure class-length normalization, pure term-length normalization, and no normalization by gaps of 4.29%, 11.50%, and 30.09%, respectively.
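
The difference between the two merge orders named in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes raw term-frequency weights, Euclidean (L2) length as the normalization function, summation as the merge step, and a dot product against an L2-normalized query as the similarity. All function names and the toy data are hypothetical. Term-length normalization is not sketched because the abstract does not define it.

```python
import math

def l2_normalize(vec):
    """Scale a sparse term->weight vector to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)

def merge(vectors):
    """Sum sparse vectors term by term (assumed merge step)."""
    total = {}
    for vec in vectors:
        for term, w in vec.items():
            total[term] = total.get(term, 0.0) + w
    return total

def centroid_weight_normalize_merge(docs):
    """Document-length normalization: normalize each document, then merge."""
    return merge(l2_normalize(d) for d in docs)

def centroid_weight_merge_normalize(docs):
    """Class-length normalization: merge raw documents, then normalize the class vector."""
    return l2_normalize(merge(docs))

def dot(a, b):
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def classify(doc, centroids):
    """Return the class whose centroid has the highest dot product with the query."""
    query = l2_normalize(doc)
    return max(centroids, key=lambda c: dot(query, centroids[c]))

# Toy, deliberately unbalanced training set (hypothetical data).
training = {
    "sports": [
        {"goal": 2.0, "team": 1.0},
        {"team": 3.0, "match": 1.0},
        {"goal": 1.0, "match": 2.0},
    ],
    "politics": [{"vote": 2.0, "party": 1.0}],
}

summed = {c: centroid_weight_normalize_merge(d) for c, d in training.items()}
unit = {c: centroid_weight_merge_normalize(d) for c, d in training.items()}

query = {"vote": 1.0, "team": 1.0}
label_doc_norm = classify(query, summed)    # centroid length grows with class size
label_class_norm = classify(query, unit)    # every centroid has unit length
```

On this toy data the two variants disagree on the query: with per-document normalization the larger class accumulates a longer centroid and wins, while normalizing the merged class vector removes that size advantage. This only illustrates the mechanism and says nothing about the accuracy figures reported in the abstract.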