Text classification is an important tool for supporting decision making. Recently, term distributions have been shown to improve classification accuracy in multi-class classification when used in addition to term frequency and inverse document frequency. This paper investigates the performance of these term distributions on binary classification using a centroid-based approach. In such a one-against-the-rest setting, there are only two classes: the positive (focused) class and the negative class. To improve performance, a so-called hierarchical EM method is applied to cluster the negative class, which is usually much larger and more diverse than the positive one, into several homogeneous groups. Experimental results on two collections of web pages, Drug Information (DI) and WebKB, show the merits of term distributions and clustering for binary classification. The performance of the proposed method is also investigated on the Thai Herbal collection, whose texts are written in Thai.
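The idea of one positive centroid competing against several negative-group centroids can be illustrated with a minimal sketch. This is not the paper's implementation: plain spherical k-means stands in for the hierarchical EM clustering, raw term counts stand in for the TF-IDF and term-distribution weights, and the function names (`l2_normalize`, `kmeans_cosine`, `fit`, `predict`) and the toy data are hypothetical.

```python
import numpy as np


def l2_normalize(m):
    """Scale each row to unit length; all-zero rows are left as zeros."""
    m = np.asarray(m, dtype=float)
    norms = np.linalg.norm(m, axis=-1, keepdims=True)
    return m / np.where(norms == 0.0, 1.0, norms)


def kmeans_cosine(x, k, iters=20, seed=0):
    """Spherical k-means on unit-length rows.

    Assumption: a simple stand-in for the hierarchical EM step,
    used here only to split the negative class into k groups.
    """
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # Assign each document to its most similar centroid (cosine).
        labels = np.argmax(x @ centroids.T, axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members):  # keep the old centroid if a group is empty
                centroids[j] = members.mean(axis=0)
        centroids = l2_normalize(centroids)
    return centroids


def fit(pos, neg, k):
    """Return stacked centroids: row 0 is positive, rows 1..k are negative groups."""
    pos, neg = l2_normalize(pos), l2_normalize(neg)
    pos_centroid = l2_normalize(pos.mean(axis=0, keepdims=True))
    return np.vstack([pos_centroid, kmeans_cosine(neg, k)])


def predict(centroids, docs):
    """A document is positive when the positive centroid is its nearest centroid."""
    sims = l2_normalize(docs) @ centroids.T
    return np.argmax(sims, axis=1) == 0


# Toy term-count vectors over a 3-term vocabulary (illustrative only).
pos = [[3, 0, 0], [2, 1, 0], [4, 0, 1]]
neg = [[0, 3, 0], [0, 4, 1], [1, 5, 0],   # one kind of negative document
       [0, 0, 3], [0, 1, 4], [1, 0, 5]]   # a different kind of negative document

model = fit(pos, neg, k=2)
print(predict(model, [[5, 0, 0], [0, 5, 0], [0, 0, 5]]))
```

Representing the negative class by several centroids rather than one reflects the motivation stated above: a single centroid averaged over a large, diverse negative class may sit far from every actual negative document, whereas per-group centroids stay close to the documents they summarize.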