Automatic text categorization is the problem of automatically assigning text documents to predefined categories. To classify text documents, we must extract good features from them. In previous research, a text document is commonly represented by the term frequency and the inverse document frequency of each feature. Because some sentences in a document are more important than others, features from the more important sentences should be weighted more heavily than other features. In this paper, we measure the importance of sentences using text summarization techniques. A document is then represented as a vector of features whose weights depend on the importance of the sentences in which they occur. To verify our new method, we conducted experiments on newsgroup data sets in two languages: one written in English and the other in Korean. Four kinds of classifiers were used in our experiments: Naïve Bayes, Rocchio, k-NN, and SVM. We observed that our new method significantly improved the performance of all four classifiers on both data sets.
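The idea of weighting features by sentence importance can be illustrated with a minimal Python sketch. The abstract does not specify which summarization technique scores the sentences, so the `sentence_importance` function below is a hypothetical stand-in (overlap with the document title), and the names `tokenize` and `weighted_term_frequencies` are likewise assumptions made for illustration, not the authors' implementation.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def sentence_importance(sentence_tokens, title_tokens):
    """Hypothetical importance score (an assumption, not the paper's measure):
    1 plus the fraction of distinct title words that occur in the sentence."""
    if not title_tokens:
        return 1.0
    title = set(title_tokens)
    overlap = len(title & set(sentence_tokens))
    return 1.0 + overlap / len(title)


def weighted_term_frequencies(title, sentences):
    """Accumulate term counts, with each occurrence weighted by the
    importance score of the sentence it appears in."""
    title_tokens = tokenize(title)
    weights = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        score = sentence_importance(tokens, title_tokens)
        for tok in tokens:
            weights[tok] += score
    return dict(weights)


doc_title = "Stock market rally"
doc_sentences = [
    "The stock market staged a rally today.",
    "Weather was mild.",
]
vector = weighted_term_frequencies(doc_title, doc_sentences)
```

In this toy document, terms from the first sentence receive twice the weight of terms from the second, because the first sentence contains every title word. In a fuller pipeline these weighted counts would take the place of raw term frequencies before the inverse document frequency is applied and the vectors are passed to a classifier.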