Improving text categorization using the importance of sentences

Authors:
Youngjoong Ko;Jinwoo Park;Jungyun Seo
Affiliations:
Department of Computer Science, NLP lab., Sogang University, Sinsu-dong 1, Mapo-gu, Seoul 121-742, South Korea;Department of Computer Science, NLP lab., Sogang University, Sinsu-dong 1, Mapo-gu, Seoul 121-742, South Korea;Department of Computer Science, NLP lab., Sogang University, Sinsu-dong 1, Mapo-gu, Seoul 121-742, South Korea
Venue:
Information Processing and Management: an International Journal
Year:
2004

Abstract

Automatic text categorization is the problem of assigning text documents to pre-defined categories. To classify text documents, we must first extract useful features. In previous research, a text document is commonly represented by the term frequency and the inverse document frequency of each feature. Because the sentences in a document differ in importance, features from more important sentences should be weighted more heavily than other features. In this paper, we measure the importance of sentences using text summarization techniques. We then represent a document as a vector of features whose weights depend on the importance of the sentences in which they occur. To verify our new method, we conduct experiments on newsgroup data sets in two languages: one written in English and the other in Korean. Four kinds of classifiers are used in our experiments: Naive Bayes, Rocchio, k-NN, and SVM. We observe that our new method yields a significant improvement for all of these classifiers on both data sets.
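
The weighting idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how sentence importance is computed, so the sketch assumes, purely for illustration, that importance is the cosine similarity between a sentence and the document title. The function names (tokenize, cosine, weighted_term_frequencies) and the "1 + importance" scaling are likewise hypothetical choices.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def weighted_term_frequencies(title, sentences):
    """Sum term counts, scaling each sentence's counts by (1 + importance).

    ASSUMPTION: importance is similarity to the title. This stands in for
    whatever summarization-based measure the paper actually uses.
    """
    title_vec = Counter(tokenize(title))
    weights = Counter()
    for sentence in sentences:
        sent_vec = Counter(tokenize(sentence))
        importance = cosine(sent_vec, title_vec)
        for term, count in sent_vec.items():
            # Terms from title-like sentences contribute more than plain counts.
            weights[term] += count * (1.0 + importance)
    return dict(weights)
```

Under these assumptions, a term that appears in a sentence resembling the title receives a weight above its raw count, while a term from an unrelated sentence keeps its raw count. The resulting weights could replace plain term frequencies before any inverse document frequency scaling and classification.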