Improving text categorization using the importance of sentences

  • Authors:
  • Youngjoong Ko; Jinwoo Park; Jungyun Seo

  • Affiliations:
  • Department of Computer Science, NLP lab., Sogang University, Sinsu-dong 1, Mapo-gu, Seoul 121-742, South Korea;Department of Computer Science, NLP lab., Sogang University, Sinsu-dong 1, Mapo-gu, Seoul 121-742, South Korea;Department of Computer Science, NLP lab., Sogang University, Sinsu-dong 1, Mapo-gu, Seoul 121-742, South Korea

  • Venue:
  • Information Processing and Management: an International Journal
  • Year:
  • 2004

Abstract

Automatic text categorization is the problem of assigning text documents to pre-defined categories. To classify text documents, we must extract useful features. In previous research, a text document is commonly represented by the term frequency and the inverse document frequency of each feature. Since a document contains both important and unimportant sentences, features from the more important sentences should be weighted more heavily than other features. In this paper, we measure the importance of sentences using text summarization techniques. We then represent a document as a vector of features whose weights vary according to the importance of the sentences in which they occur. To verify our new method, we conduct experiments on newsgroup data sets in two languages: one written in English and the other in Korean. Four kinds of classifiers are used in our experiments: Naive Bayes, Rocchio, k-NN, and SVM. We observe that our new method yields a significant improvement for all four classifiers on both data sets.
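
The idea of weighting features by sentence importance can be illustrated with a minimal Python sketch. The abstract does not specify how sentence importance is computed, so the scoring rule here (overlap between a sentence and the document title) is only an assumption chosen for illustration, and the function names `sentence_importance` and `weighted_term_frequencies` are hypothetical, not taken from the paper.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase the text and keep alphanumeric runs as terms.
    return re.findall(r"[a-z0-9]+", text.lower())


def split_sentences(document):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]


def sentence_importance(sentence_tokens, title_tokens):
    # ASSUMPTION (not from the paper): importance is 1 plus the fraction of
    # distinct sentence terms that also appear in the title.
    if not sentence_tokens:
        return 1.0
    distinct = set(sentence_tokens)
    overlap = len(distinct & set(title_tokens))
    return 1.0 + overlap / len(distinct)


def weighted_term_frequencies(title, document):
    # Each occurrence of a term contributes the importance of its sentence
    # instead of a flat count of 1.
    title_tokens = tokenize(title)
    weights = Counter()
    for sentence in split_sentences(document):
        tokens = tokenize(sentence)
        importance = sentence_importance(tokens, title_tokens)
        for term, count in Counter(tokens).items():
            weights[term] += importance * count
    return dict(weights)


if __name__ == "__main__":
    vector = weighted_term_frequencies(
        "text categorization",
        "Text categorization assigns labels. The weather is nice.",
    )
    for term, weight in sorted(vector.items()):
        print(term, weight)
```

In this sketch, terms from a sentence that shares vocabulary with the title receive a larger weight than terms from an unrelated sentence. The resulting weighted frequencies could stand in for raw term frequencies before an inverse document frequency factor is applied and the vector is passed to a classifier.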